Job Description:
What we’re looking for
The Toyota Financial Services Enterprise Platforms team is looking for a Senior ML Platform Engineer to design, build, and operationalize an enterprise ML platform on Amazon SageMaker Unified Studio. You will migrate the organization from a fragmented ML toolchain to a unified, governed platform on AWS Landing Zone 2, covering the full ML lifecycle from data discovery through model deployment and monitoring.
What you’ll be doing
- Set up SageMaker Unified Studio platform — domain configuration, project provisioning, persona-based roles, and multi-environment (Dev, Prod-UAT, Prod) promotion workflows
- Build MLOps pipelines using SageMaker Pipelines — data extraction from Snowflake, preprocessing, training, evaluation, and model registration
- Manage SageMaker Model Registry — cross-account model promotion, versioning, immutability, and lineage tracking
- Configure MLflow experiment tracking — auto-logging of parameters, metrics, and artifacts
- Set up identity and access management — Okta SSO, SailPoint entitlements, persona-based execution roles, service roles for pipelines
- Build model serving — real-time SageMaker endpoints and batch prediction workflows
- Set up model monitoring — data drift, model drift, performance degradation detection
- Configure data catalog — searchable datasets, access-level visibility, access-request workflows, lineage
- Own platform operations — observability (CloudWatch, Datadog), logging, custom images, instance availability
Qualifications / What you bring (Must Haves)
- 10-15 years of software engineering experience focused on cloud infrastructure or ML platform operations
- 5+ years hands-on with AWS, including deep expertise in Amazon SageMaker (Studio, Pipelines, Model Registry, Endpoints, Feature Store)
- 3+ years building and operating production MLOps pipelines — training, versioning, deployment, monitoring, rollback
- Experience with SageMaker Unified Studio or Studio Classic — domain/project setup, blueprints, multi-tenant configuration
- Infrastructure-as-Code with Terraform, CDK, or CloudFormation
- IAM design for ML platforms — execution roles, service roles, cross-account access, Lake Formation, SSO/SAML
- MLflow or equivalent experiment tracking
- SageMaker Pipelines or similar workflow orchestration (Airflow, Step Functions)
- Model serving — real-time endpoints, batch transform, auto-scaling, endpoint monitoring
- Snowflake as a data source for ML pipelines
- Kubernetes (EKS) and container orchestration
- Networking and security — VPC, security groups, private endpoints, cross-account connectivity
Added bonus if you have (Preferred):
- SageMaker Unified Studio domain provisioning, custom blueprints, project standardization
- SageMaker Feature Store for online/offline feature management
- SageMaker Model Monitor — data quality checks, bias detection, drift detection
- AWS Certified Machine Learning - Specialty certification