Lead Data Engineer + AI in Boston, MA

  • Boston, MA

Overview

  • On Site
  • Accepts corp-to-corp applications
  • Contract - W2
  • Contract - Independent

Skills

  • Python
  • PySpark
  • Databricks
  • MLflow
  • AI data pipelines

Job Details

Position: Lead Data Engineer + AI

Location: Boston, MA

Experience: 10 to 15 years

Note: Please submit only genuine candidates with LinkedIn profiles.

Local candidates will be preferred.

A minimum of 3 years of experience as a Lead is required.

About the role

We're looking for a Lead Data Engineer to build and scale our lakehouse and AI data pipelines on Databricks. You'll design robust ETL/ELT, enable feature engineering for ML/LLM use cases, and drive best practices for reliability, performance, and cost.

What you'll do
  • Design, build, and maintain batch/streaming pipelines in Python + PySpark on Databricks (Delta Lake, Auto Loader, Structured Streaming); a minimal ingestion sketch follows this list.
  • Implement data models (Bronze/Silver/Gold), optimize with partitioning, Z-ORDER, and indexing, and manage reliability (DLT/Jobs, monitoring, alerting).
  • Enable ML/AI: feature engineering, MLflow experiment tracking, model registries, and model/feature serving; support RAG pipelines (embeddings, vector stores).
  • Establish data quality checks (e.g., Great Expectations), lineage, and governance (Unity Catalog, RBAC).
  • Collaborate with Data Science/ML and Product to productionize models and AI workflows; champion CI/CD and IaC.
  • Troubleshoot performance and cost issues; mentor engineers and set coding standards.
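
To ground the first bullet, here is a minimal sketch of a Bronze-layer ingestion job using Auto Loader and Structured Streaming on Databricks; the paths, schema location, and table name (/mnt/raw/events, bronze.events) are illustrative assumptions, not project specifics.

    # Minimal sketch: Bronze-layer ingestion with Auto Loader + Structured Streaming.
    # Paths and table names below are hypothetical placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Auto Loader incrementally discovers new files landing in cloud storage.
    raw = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events_schema")
        .load("/mnt/raw/events")
    )

    # Keep Bronze close to the source; add only ingestion metadata here.
    bronze = raw.withColumn("_ingested_at", F.current_timestamp())

    (
        bronze.writeStream.format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/events_bronze")
        .trigger(availableNow=True)  # incremental, batch-style run via Jobs
        .toTable("bronze.events")
    )

    # Periodic layout optimization (run separately, per the second bullet):
    # spark.sql("OPTIMIZE bronze.events ZORDER BY (event_date)")

Silver/Gold jobs would then read bronze.events, apply cleansing and dimensional modeling, and rely on the same checkpointed, idempotent pattern.
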
Must-have qualifications
  • 10+ years in data engineering with a track record of production pipelines.
  • Expert in Python and PySpark (UDFs, window functions, Spark SQL, Catalyst basics); see the window-function sketch after this list.
  • Deep hands-on Databricks experience: Delta Lake, Jobs/Workflows, Structured Streaming, SQL Warehouses; practical tuning and cost optimization.
  • Strong SQL and data modeling (dimensional, medallion, CDC).
  • ML/AI enablement experience: MLflow, feature stores, model deployment/monitoring; familiarity with LLM workflows (embeddings, vectorization, prompt/response logging). An MLflow tracking sketch follows this list.
  • Cloud proficiency on AWS/Azure/Google Cloud Platform (object storage, IAM, networking).
  • CI/CD (GitHub/GitLab/Azure DevOps), testing (pytest), and observability (logs/metrics).
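
As a small illustration of the PySpark expectations, the sketch below uses a window function to keep the latest record per key, a common CDC/deduplication step; the column names and sample rows are invented for the example.

    # Minimal sketch: latest-record-per-key dedup with a window function.
    # Schema and values are invented for illustration.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "2024-01-01", 10.0), (1, "2024-02-01", 12.5), (2, "2024-01-15", 7.0)],
        ["customer_id", "updated_at", "balance"],
    )

    # Rank each customer's rows by recency, then keep only the newest one.
    w = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
    latest = df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
    latest.show()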
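
And a minimal sketch of the MLflow experiment tracking named above, using a scikit-learn model for brevity; the experiment path, parameters, and dataset are illustrative assumptions.

    # Minimal sketch: logging params, a metric, and a model with MLflow.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    mlflow.set_experiment("/Shared/churn-model")  # hypothetical experiment path

    X, y = make_classification(n_samples=1000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    with mlflow.start_run():
        params = {"n_estimators": 200, "max_depth": 8}
        mlflow.log_params(params)

        model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
        mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

        # The logged model can then be promoted through the Model Registry.
        mlflow.sklearn.log_model(model, "model")
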
Nice to have
  • Databricks Delta Live Tables, Unity Catalog automation, Model Serving.
  • Orchestration (Airflow/Databricks Workflows), messaging (Kafka/Kinesis/Event Hubs).
  • Data quality & lineage tools (Great Expectations, OpenLineage).
  • Vector DBs (FAISS, pgvector, Pinecone), RAG frameworks (LangChain/LlamaIndex); a FAISS lookup sketch follows this list.
  • IaC (Terraform), security/compliance (PII handling, data masking).
  • Experience interfacing with BI tools (Power BI, Tableau, Databricks SQL).
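
For the vector-DB item, a toy sketch of the nearest-neighbor lookup a RAG pipeline performs, using FAISS with random vectors standing in for real embeddings; the 384 dimension is only a typical sentence-embedding size assumed for illustration.

    # Minimal sketch: exact similarity search with FAISS.
    # Random vectors stand in for document/query embeddings.
    import faiss
    import numpy as np

    dim = 384  # assumed embedding size
    index = faiss.IndexFlatL2(dim)  # exact L2 search; fine for small corpora

    doc_vectors = np.random.rand(1000, dim).astype("float32")
    index.add(doc_vectors)

    # Embed the query the same way, then retrieve the 5 nearest documents.
    query = np.random.rand(1, dim).astype("float32")
    distances, doc_ids = index.search(query, 5)
    print(doc_ids[0], distances[0])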

About Exatech Inc