Overview
On Site
$DOE
Contract - W2
Contract - Month Contract
Skills
Python
SQL
Spark
PySpark
Dask
Job Details
Job Title: Data Engineer Gen AI
Location: Edison, NJ
Domain: IT Services
Duration: Long Term Contract
W2 candidates only; no C2C.
Responsibilities:
- Design, build, and maintain scalable data pipelines to support Generative AI and LLM-based applications.
- Collect, clean, and preprocess structured and unstructured data for model training, fine-tuning, and retrieval-augmented generation (RAG).
- Implement robust data ingestion frameworks integrating APIs, streaming sources, and external repositories.
- Collaborate with AI/ML teams to deliver high-quality, domain-specific datasets optimized for transformer-based architectures.
- Architect and manage vector databases (e.g., Pinecone, FAISS, Weaviate) for efficient embedding storage and semantic search.
- Optimize data storage, retrieval, and transformation workflows across multi-cloud and hybrid environments.
- Automate data versioning, lineage tracking, and governance processes to ensure compliance and reproducibility.
- Build scalable ETL/ELT frameworks and orchestrate workflows using Airflow, Prefect, or Dagster.
- Contribute to prompt engineering and model evaluation pipelines through metadata enrichment and contextual data provisioning.
- Ensure data quality, privacy, and ethical use standards across all Generative AI applications.
Qualifications:
- 8+ years of professional experience in Data Engineering; 2+ years supporting AI/ML or Generative AI workflows.
- Proficiency in Python, SQL, and distributed data processing frameworks (Spark, PySpark, Dask).
- Strong experience with data pipeline orchestration tools (Airflow, Luigi, Dagster, or Prefect).
- Hands-on experience with cloud data ecosystems such as AWS (Glue, Redshift, S3), Azure (Data Factory), or Google Cloud Platform (BigQuery).
- Knowledge of vector databases and embedding models for RAG-based systems.
- Familiarity with LangChain, LLMOps, and data preparation for fine-tuning LLMs.
- Experience in containerization and orchestration (Docker, Kubernetes).
- Working knowledge of API integration, data governance, and data cataloging tools (e.g., DataHub, Amundsen).
- Exposure to Generative AI concepts such as embeddings, tokenization, and prompt optimization.
- Understanding of Responsible AI practices, data anonymization, and bias mitigation techniques.
Best Regards,
Tarun K