AI Infrastructure Data Engineer
Remote role
Long Term Contract
References required
Job Description:
Build the data backbone that powers AI — pipelines, knowledge bases, ingestion, and retrieval
infrastructure.
Minneapolis (Hybrid) · Intermediate / Senior · 4–8 YOE · Data pipelines required
AI systems are only as good as the data feeding them. This role owns the infrastructure that gets data from
internal systems, document stores, APIs, and enterprise databases into vector indexes, knowledge bases, and
structured stores that AI agents can reliably query. You'll build ingestion pipelines with freshness management,
design chunking and embedding strategies, and ensure retrieval quality — the hidden layer that determines
whether agents give accurate answers or hallucinate. This is not a traditional data warehousing role; it is data
engineering specifically in service of AI systems.
WHAT YOU'LL BUILD
▸ Ingestion pipelines pulling from internal systems, APIs, document repositories, and enterprise databases into AI knowledge stores
▸ Vector indexing infrastructure: embedding model selection, chunking strategies, metadata enrichment, hybrid index design
▸ Freshness and change detection: incremental re-indexing, stale data detection, TTL management
▸ ETL / ELT pipelines for structured data feeding AI decision and retrieval layers
▸ High-throughput event-driven ingestion for real-time and batch processing at enterprise scale
▸ Data quality validation: schema checks, completeness scoring, anomaly detection before indexing
REQUIRED EXPERIENCE
▸ 4+ years building production data pipelines: orchestrated workflows, not one-off scripts
▸ Strong SQL: query optimization, indexing, execution plans, large result sets
▸ Experience with vector databases or search infrastructure (OpenSearch, Pinecone, pgvector, Azure AI Search)
▸ Python data processing at scale: Pandas, Polars, or equivalent
▸ Understands embedding models: how to evaluate retrieval quality, why chunking strategy matters
▸ Cloud data stack: AWS (Glue, S3, RDS) or Azure equivalent
▸ Can diagnose why a RAG system's retrieval is failing at the data layer
NICE TO HAVE
▸ Event streaming platforms — event-driven pipeline design, high-throughput ingestion patterns
▸ Legacy enterprise RDBMS experience (DB2, Oracle, or equivalent)
▸ Document intelligence — OCR pipelines, PDF/scanned document ingestion
▸ dbt, Airflow, or similar pipeline orchestration tooling
▸ Knowledge graph experience — Neo4j, Amazon Neptune, RDF/SPARQL, ontology design
▸ Experience building knowledge bases specifically for LLM consumption — not just generic warehousing
▸ Financial services data — understanding of regulated data handling, PII, audit trails
TECH STACK
Python · Pandas / Polars / PySpark
ETL / ELT Pipelines
Event Streaming Pipelines
Vector Databases (pgvector · Pinecone · Weaviate)
OpenSearch · Hybrid Search
Knowledge Graphs · Graph Databases
Neo4j · Amazon Neptune
RDF · SPARQL · Ontology Design
Embedding Models · Chunking Strategies
Document Intelligence · OCR Pipelines
dbt · Airflow · Pipeline Orchestration
Cloud Data Services (AWS / Azure)
Relational Databases · SQL Optimization
Data Quality · Schema Validation
Docker · Container Orchestration
Enterprise API Integration