Job Details
We are seeking a Senior Data & ML Infrastructure Engineer to join our growing team focused on enabling scalable, reliable, and cost-efficient data and machine learning pipelines. This is an exciting opportunity to work at the intersection of data engineering, machine learning, and platform reliability.
In this role, you'll design and enhance Python frameworks that support key components such as cost tracking, data quality, lineage, governance, and MLOps. You'll work closely with cross-functional teams, including ML Engineers, Data Scientists, and Platform Engineers, to ensure seamless integration and operation of scalable batch and stream pipelines on Google Cloud Platform (GCP).
Key Responsibilities:
Technical Leadership
- Proven experience leading MLOps initiatives across the full ML lifecycle
- Deep understanding of CI/CD pipelines, model deployment, monitoring, and scaling
- Hands-on expertise with cloud platforms (AWS, Google Cloud Platform, Azure) and containerization (Docker, Kubernetes)
Project Ownership
- Ability to define and drive project plans from inception to delivery
- Skilled in managing execution timelines, resource allocation, and risk mitigation
- Comfortable working with cross-functional teams including data scientists, engineers, and product managers
Communication & Collaboration
- Strong communicator who can translate technical concepts to non-technical stakeholders
- Experience in stakeholder management and status reporting
- Capable of leading meetings, resolving conflicts, and aligning teams toward shared goals
Problem Solving in Ambiguity
- Thrives in undefined or evolving problem spaces
- Demonstrates creativity and initiative in shaping solutions from vague requirements
- Comfortable making decisions with incomplete information and iterating quickly
Strategic Thinking
- Ability to align MLOps practices with business goals
- Experience in evaluating and implementing tools and frameworks that improve team productivity and model reliability
Design and enhance Python libraries that support robust data and ML operations, including governance, lineage, and cost tracking.
Implement data processing optimizations to reduce cost and improve performance of large-scale ML pipelines.
Develop scalable feature and training data pipelines using BigQuery, Dataflow, and Cloud Composer on Google Cloud Platform.
Build and maintain monitoring, logging, and alerting systems to ensure data pipeline reliability and visibility.
Lead infrastructure rollouts with careful planning, phased deployment strategies, validation steps, and rollback plans.
Serve as the primary point of contact for cross-team coordination during updates, deployments, and incident handling.
Work closely with ML platform teams to ensure seamless integration of enhancements and changes.
Create detailed runbooks, documentation, and handoffs for operational support.
Requirements:
5+ years of experience in data engineering, ML infrastructure, or related roles.
Strong proficiency in Python and experience building reusable libraries/frameworks.
Hands-on experience with BigQuery, Dataflow, and Cloud Composer.
Solid understanding of data pipeline orchestration, MLOps, and cloud-native architectures.
Experience implementing monitoring and observability for pipelines and infrastructure.
Strong communication and coordination skills, especially in cross-functional environments.
Familiarity with ML workflows and how infrastructure supports model development and deployment.
Nice to Have:
Experience with CI/CD for data pipelines
Exposure to data quality tools, lineage tracking frameworks, or ML feature stores
Google Cloud certification (e.g., Professional Data Engineer, ML Engineer)
Skills:
Python, Data Engineering, Machine Learning, MLOps, BigQuery, Google Cloud Platform, Dataflow, Cloud Composer, Data Pipelines, Infrastructure as Code, Data Governance, Data Quality, Lineage, Monitoring, Logging, ML Infrastructure, CI/CD, Feature Pipelines, Cost Optimization, Data Orchestration, Airflow, ML Engineering, Runbooks, Observability