Overview
On Site
$50+
Contract - W2
No Travel Required
Skills
Machine Learning Operations (ML Ops)
Machine Learning (ML)
Job Details
Key Responsibilities:
Incident Management & Support:
- Provide L2 support for MLOps production environments, ensuring uptime and reliability.
- Troubleshoot ML pipelines, data processing jobs, and API issues.
- Monitor logs, alerts, and performance metrics using Dataiku, Prometheus, Grafana, or AWS tools such as CloudWatch.
- Perform root cause analysis (RCA) and resolve incidents within SLAs.
- Escalate unresolved issues to L3 engineering teams when needed.
Dataiku Platform Management:
- Manage Dataiku DSS workflows, troubleshoot job failures, and optimize performance.
- Monitor and support Dataiku plugins, APIs, and automation scenarios.
- Collaborate with Data Scientists and Data Engineers to debug ML model deployments.
- Perform version control and CI/CD integration for Dataiku projects.
Deployment & Automation:
- Support CI/CD pipelines for ML model deployment (Bamboo, Bitbucket, etc.).
- Deploy ML models and data pipelines using Docker, Kubernetes, or Dataiku Flow.
- Automate monitoring and alerting for ML model drift, data quality, and performance.
Cloud & Infrastructure Support:
- Monitor AWS-based ML workloads (SageMaker, Lambda, ECS, S3, RDS).
- Manage storage and compute resources for ML workflows.
- Support database connections, data ingestion, and ETL pipelines (SQL, Spark, Kafka).
Security & Compliance:
- Ensure secure access control for ML models and data pipelines.
- Support audit, compliance, and governance for Dataiku and MLOps workflows.
- Respond to security incidents related to ML models and data access.