Overview
Hybrid
Depends on Experience
Contract - W2
Skills
Amazon Web Services
Machine Learning Operations (ML Ops)
DevSecOps
Production Support
Financial Services
Artificial Intelligence
Job Details
Job Responsibilities:
Player-Coach: We are seeking a technically engaged leader who enjoys rolling up their sleeves; purely supervisory profiles will not match the hands-on production support focus of this role.
About Our Team
The MLOps organization partners with data-science squads across the bank. We do not write models; we make them run safely, quickly, and continuously.
2025 focus:
- Velocity: shorten the path from approved model to production deployment.
- Trust & Security: embed DevSecOps controls so every model meets bank-grade risk standards.
Why This Role Exists
We need a hands-on leader who can own 24/7 production health, turn incidents into permanent improvements, and coach an 8-10 engineer team (mix of onsite and offshore) without losing touch with the code.
Key Responsibilities
- Incident Command & SRE: Lead P1/P2 bridges for ML/LLM and batch pipelines. Drive root-cause analysis, publish blameless postmortems, and ensure fixes are automated, not repeated.
- DevSecOps Automation: Patch CI/CD jobs, Helm charts, and Python utilities as part of incident follow-up. Embed vulnerability scans, rollback logic, and change-ticket integration.
- Reliability Governance: Define and track MTTR, change-failure rate, and repeat-incident rate. Report trends to leadership in clear, metrics-first language.
- People Leadership: Mentor engineers, set sprint priorities, and foster an SRE mindset in the offshore pod. Participate in hiring and onboarding.
- Partnerships: Work daily with Solution Engineering, Platform Enablement, and Architecture to harden AWS deployments, review HA/DR designs, and close security gaps.
Must-Have Qualifications
- Hands-on incident response in a machine-learning or data-platform environment (you have debugged Python code at 2 a.m.).
- Strong Python & Bash; comfortable editing pipelines, writing quick-fix scripts, and reviewing pull requests.
- AWS practitioner: IAM roles, ECR, EKS, S3 versioning, CloudWatch alarms.
- Hands-on with Docker and Kubernetes.
- Track record of converting Sev-1 incidents into durable controls (can share concrete examples).
- Experience leading or coaching blended onshore/offshore teams.
- Familiar with DevSecOps practices: static scans (Snyk/Trivy), container runtime controls (Aqua Enforcer), SBOM generation.
- Skilled in creating clear, actionable postmortems for management audiences.
Nice to Have
- Exposure to large language model operations (Bedrock, Vertex AI, or similar).
- Financial services or other regulated industry background.
- Terraform and Helm chart authoring.
How We Work
- On-Call: Manager engages on all P1/P2 events; ICs rotate night/weekend coverage.
- Culture: Every incident is an unplanned investment: each root cause must lead to hardened code, docs, or infrastructure.
- Collaboration: Teams for triage, Azure DevOps for work tracking, ServiceNow for change control.
Location: Hybrid model; three days a week in our Columbus office.