Overview
Remote
$DOE
Accepts corp to corp applications
Contract - W2
Contract - 12 Month(s)
Skills
Workflows
Amazon Web Services
PySpark
Data Science
Data Modelling
Data Pipelines
Continuous Integration
Data Quality
Microsoft Azure
Identity and Access management
Cloud Computing
Problem solving
Artificial Intelligence
Governance
Data Streaming
Databricks
Requirements Analysis
Reliability
Data Lakes
Extract Transform Load (ETL)
Large Language Models
Python (Programming Language)
Apache Hive
Machine Learning Operations
SQL Databases
Information Engineering
Data Logging
Role-Based Access Control
Stock Control
Networking Skills
Catalyst (Software)
Cost Optimisation
Feature Engineering
Indexer
Software Coding
Job Details
Role: Sr./Lead Data Engineer + AI
Location: Boston, MA - Remote
Experience Needed: 10 to 15 Years for Lead / 5 to 10 Years for Senior
A minimum of 3 years of experience in a Lead role is required.
About the role:
We're looking for a Senior Data Engineer to build and scale our lakehouse and AI data pipelines on Databricks. You'll design robust ETL/ELT, enable feature engineering for ML/LLM use cases, and drive best practices for reliability, performance, and cost.
What you'll do:
Design, build, and maintain batch/streaming pipelines in Python + PySpark on Databricks (Delta Lake, Autoloader, Structured Streaming).
Implement data models (Bronze/Silver/Gold), optimize with partitioning, Z-ORDER, and indexing, and manage reliability (DLT/Jobs, monitoring, alerting).
Enable ML/AI: feature engineering, MLflow experiment tracking, model registries, and model/feature serving; support RAG pipelines (embeddings, vector stores).
Establish data quality checks (e.g., Great Expectations), lineage, and governance (Unity Catalog, RBAC).
Collaborate with Data Science/ML and Product to productionize models and AI workflows; champion CI/CD and IaC.
Troubleshoot performance and cost issues; mentor engineers and set coding standards.
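Illustrative sketch (not part of the formal requirements): a minimal Auto Loader ingest into a Bronze Delta table with PySpark Structured Streaming, of the kind described in the bullets above. All paths and table names are hypothetical placeholders, and "spark" is assumed to be the ambient Databricks session.

from pyspark.sql import functions as F

# Incremental ingest of raw JSON files with Auto Loader (cloudFiles).
raw = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/demo/bronze/_schemas/events")
    .load("/Volumes/demo/landing/events/")
)

# Add basic lineage columns before landing the data in the Bronze layer.
bronze = (
    raw
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))
)

# Write to a managed Delta table; availableNow gives a batch-like incremental run.
(
    bronze.writeStream
    .option("checkpointLocation", "/Volumes/demo/bronze/_checkpoints/events")
    .trigger(availableNow=True)
    .toTable("demo.bronze.events")
)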
Must-have qualifications:
10+ years in data engineering with a track record of production pipelines.
Expert in Python and PySpark (UDFs, Window functions, Spark SQL, Catalyst basics).
Deep hands-on Databricks: Delta Lake, Jobs/Workflows, Structured Streaming, SQL Warehouses; practical tuning and cost optimization.
Strong SQL and data modeling (dimensional, medallion, CDC).
ML/AI enablement experience: MLflow, feature stores, model deployment/monitoring; familiarity with LLM workflows (embeddings, vectorization, prompt/response logging).
Cloud proficiency on AWS/Azure/Google Cloud Platform (object storage, IAM, networking).
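For the MLflow expectation above, a minimal tracking example (the experiment path, parameters, and metric values are hypothetical placeholders):

import mlflow

# Log parameters and evaluation metrics for one training run under a named experiment.
mlflow.set_experiment("/Shared/demo-churn-model")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", 0.87)
    mlflow.log_metric("precision_at_10", 0.42)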
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.