We are seeking a Data Engineer to build and scale the data infrastructure powering our Agentic AI products. You will be responsible for the "Ingestion-to-Insight" pipeline that allows autonomous agents to access, search, and reason over vast amounts of proprietary and public data.
Your role is critical: you will design the RAG (Retrieval-Augmented Generation) architectures and data pipelines that ensure our agents have the right context at the right time to make accurate decisions.
Key Responsibilities
- AI-Ready Data Pipelines: Design and implement scalable ETL/ELT pipelines that process both structured (SQL tables, logs) and unstructured (PDFs, emails, docs) data specifically for LLM consumption.
- Vector Database Management: Architect and optimize Vector Databases (e.g., Pinecone, Weaviate, Milvus, or Qdrant) to deliver fast, relevant similarity search for agentic retrieval.
- Chunking & Embedding Strategies: Collaborate with AI Engineers to optimize data chunking strategies and embedding models to improve the "recall" and "precision" of the agent's knowledge retrieval.
- Data Quality for AI: Develop automated "Data Cleaning" workflows to remove noise, PII (Personally Identifiable Information), and toxicity from training/context datasets.
- Metadata Engineering: Enrich raw data with advanced metadata tagging to help agents filter and prioritize information during multi-step reasoning tasks.
- Real-time Data Streaming: Build low-latency data streams (using Kafka or Flink) to provide agents with fresh data, enabling them to act on real-time market or operational changes.
- Evaluation Frameworks: Construct gold datasets and versioned data snapshots to help the team benchmark agent performance over time.
Required Skills & Qualifications
- Experience: 4+ years in Data Engineering, with at least 1 year focusing on data for LLMs or AI/ML applications.
- Python Mastery: Deep expertise in Python (Pandas, Pydantic, FastAPI) for data manipulation and API integration.
- Data Tooling: Strong experience with modern data stack tools (e.g., dbt, Airflow, Dagster, Snowflake, or Databricks).
- Vector Expertise: Hands-on experience with at least one major Vector Database, plus knowledge of similarity search fundamentals (e.g., HNSW indexing, cosine similarity).
- Search Knowledge: Familiarity with hybrid search techniques (combining semantic search with traditional keyword search like Elasticsearch/BM25).
- Cloud Infrastructure: Proficiency in managing data workloads on AWS, Azure, or Google Cloud Platform.
Preferred Qualifications
- Experience with LlamaIndex or LangChain for data ingestion.
- Knowledge of Graph Databases (e.g., Neo4j) to help agents understand complex relationships between data points.
- Familiarity with "Data-Centric AI" principles, which prioritize improving data quality over increasing model size.