We're on a mission to improve the quality of human life. We're developing new tools and capabilities to amplify the human experience. To lead this transformative shift in mobility, we've built a world-class team in Energy & Materials, Human-Centered AI, Human Interactive Driving, Large Behavioral Models, and Robotics.
Within the Human Interactive Driving division, the Extreme Performance Intelligent Control department is working to develop scalable, human-like driving intelligence by learning from expert human drivers. This project focuses on creating a configurable, data-driven world model that serves as a foundation for intelligent, multi-agent reasoning in dynamic driving environments. By tightly integrating advances in perception, world modeling, and model-based reinforcement learning, we aim to overcome the limitations of more compartmentalized, rule-based approaches. The end goal is to enable robust, adaptable, and interpretable driving policies that generalize across tasks, sensor modalities, and public road scenarios, delivering transformative improvements for ADAS, autonomous systems, and simulation-driven software development.
As a Data Engineer, you will be a key enabler of this mission, owning the systems that collect, organize, clean, and deliver the volumes of sensor and simulation data that fuel our world models, perception systems, and reinforcement learning algorithms. You will collaborate closely with research scientists and machine learning engineers to ensure our pipelines are reliable, scalable, and performant, powering breakthroughs in intelligent driving across simulation and real-world deployments.
Responsibilities
- Design, implement, and maintain robust data pipelines for ingesting, cleaning, and transforming large-scale autonomous vehicle datasets (camera, LiDAR, radar, GPS, simulation logs).
- Develop scalable storage and retrieval systems using AWS services (S3, EC2, SageMaker, Athena, etc.).
- Ensure data quality and consistency through automated validation, deduplication, and schema enforcement.
- Collaborate with ML researchers and engineers to provide efficient access to training data, labels, and metadata.
- Optimize data preprocessing and batching pipelines to support large-scale training and evaluation workflows.
- Build tools to manage and audit dataset versions, experiment tracking, and feature reproducibility.
- Implement and maintain CI/CD workflows for data and pipeline updates, ensuring minimal downtime and reproducible outputs.
- Monitor data pipeline performance and respond to bottlenecks or outages proactively.
Qualifications
- B.S. or M.S. in Computer Science, Data Engineering, or a related field.
- 3+ years of experience building production-grade data infrastructure or ML data pipelines.
- Strong proficiency with Python and SQL, and experience with data workflow orchestration tools (e.g., Airflow, Prefect, Luigi).
- Deep experience with AWS services, especially S3 (data storage), EC2 (compute), and SageMaker (model training).
- Familiarity with distributed computing frameworks like Spark, Dask, or Ray.
- Understanding of best practices for dataset documentation, standardization, and reproducibility in research.
Bonus Qualifications
- Experience with autonomous vehicle datasets or robotics sensor data.
- Familiarity with ML training pipelines and model evaluation workflows.
- Prior experience collaborating with researchers or applied ML teams in high-throughput environments.