100% remote role, working EST hours
We are seeking a Senior MLOps Engineer to support large-scale production machine learning environments focused on text, image, and video processing workloads in AWS.
This is a highly operational and infrastructure-focused role. The ideal candidate has hands-on experience deploying, monitoring, scaling, and optimizing ML systems in production environments, particularly within the AWS SageMaker ecosystem.
This is NOT a data science or model research role. The focus is production reliability, deployment governance, infrastructure scalability, observability, and operational efficiency.
Responsibilities
ML Deployment & Operations
- Design, deploy, and support end-to-end production ML pipelines
- Manage ML model promotion across Dev, QA, and Production environments
- Implement deployment standards, rollback strategies, and recovery mechanisms
- Support containerized inference and orchestration patterns
AWS & Infrastructure Management
- Configure and manage AWS SageMaker pipelines, endpoints, and monitoring
- Optimize GPU and CPU infrastructure selection and scaling
- Benchmark infrastructure performance and tune autoscaling behavior
- Perform load testing and production infrastructure optimization
Monitoring & Reliability
- Implement monitoring, alerting, observability, and drift detection
- Track latency, throughput, error rates, and model/data drift
- Build A/B testing and controlled rollout frameworks
- Ensure governance, reproducibility, security, and cost efficiency
Large-Scale ML Workloads
- Support production ML systems across text, image, and video workloads
- Manage high-throughput infrastructure and large-scale data movement
- Prevent compute, networking, and storage bottlenecks
- Support systems processing hundreds of thousands of requests daily
Collaboration
- Partner closely with ML Engineers, Platform Engineering, DevOps, and Data teams
- Operationalize ML models into stable production systems
- Help drive scalability, reliability, and infrastructure best practices
Required Qualifications
- Strong hands-on experience operating production ML systems at scale
- Deep AWS SageMaker experience, including:
  - Pipelines
  - Endpoints
  - Monitoring
  - Multi-environment deployments
- Experience operationalizing PyTorch and TensorFlow models
- Experience with containerized ML deployment and orchestration
- Experience optimizing GPU/CPU infrastructure for ML workloads
- Strong monitoring and observability experience
- Experience implementing deployment governance and rollback strategies
Strongly Preferred
- Experience supporting:
  - Transformer-based NLP systems
  - Computer vision workloads
  - Ranking/reranking systems
- Familiarity with:
  - Approximate nearest neighbor (ANN) systems
  - HNSW indexing
  - Large-scale neural network operational workloads
- Experience supporting high-volume text, image, and video datasets
Engagement Requirements
- Candidates must be able to work directly on a W2 basis or through an approved independent consulting arrangement.
- NO 3RD PARTY FIRMS, LAYERED VENDORS, OR STAFFING PASSTHROUGHS.
- Third-party submissions will NOT be reviewed or responded to.
If interested, please send:
- Updated resume
- Current location
- Work authorization status
- Availability
- Hourly rate expectations