MLOps Production Support & DevSecOps Manager

Overview

Hybrid
Depends on Experience
Accepts corp to corp applications
Contract - W2
Contract - 3 Month(s)
10% Travel

Skills

MLOps
Production Support
DevSecOps
CloudWatch
EKS
Python
Bash
AWS practitioner
IAM roles
ECR
S3 versioning
CI/CD
Helm charts
AWS deployments
HA/DR designs
IAM
Docker
Kubernetes
Azure DevOps
ServiceNow

Job Details

Role: MLOps Production Support & DevSecOps Manager

Location: Columbus, OH

Duration: 3 Months C2H

MOI: Telephonic & Skype

Primary Skills: CloudWatch, EKS, Python, Bash, AWS practitioner, IAM roles, ECR, S3 versioning

This is a 3-month contract-to-hire role. Candidates need a minimum of 12 years of experience. The work model is onsite/hybrid.

Job Responsibilities:

We need a hands-on leader who can own 24/7 production health, turn incidents into permanent improvements, and coach an 8-10 engineer team (a mix of onsite and offshore) without losing touch with the code.
Key Responsibilities:

  • Incident Command & SRE: Lead P1/P2 bridges for ML/LLM and batch pipelines. Drive root cause analysis, publish blameless postmortems, and ensure fixes are automated, not repeated.
  • DevSecOps Automation: Patch CI/CD jobs, Helm charts, and Python utilities as part of incident follow-up. Embed vulnerability scans, rollback logic, and change ticket integration.
  • Reliability Governance: Define and track MTTR, change failure rate, and repeat incident rate. Report trends to leadership in clear, metrics-first language.
  • People Leadership: Mentor engineers, set sprint priorities, and foster an SRE mindset in the offshore pod. Participate in hiring and onboarding.
  • Partnerships: Work daily with Solution Engineering, Platform Enablement, and Architecture to harden AWS deployments, review HA/DR designs, and close security gaps.

Must-Have Qualifications:

  • Hands-on incident response in a machine learning or data platform environment (you have debugged Python code at 2 a.m.).
  • Strong Python & Bash; comfortable editing pipelines, writing quick-fix scripts, and reviewing pull requests.
  • AWS practitioner: IAM roles, ECR, EKS, S3 versioning, CloudWatch alarms.
  • Hands-on with Docker and Kubernetes.
  • Track record of converting Sev1 incidents into durable controls (can share concrete examples).
  • Experience leading or coaching blended onshore/offshore teams.
  • Familiar with DevSecOps practices: static scans (Snyk/Trivy), container runtime controls (Aqua Enforcer), and SBOM generation.
  • Skilled in creating clear, actionable postmortems for management audiences.

Nice to Have:

  • Exposure to large language model operations (Bedrock, Vertex AI, or similar).
  • Financial services or other regulated industry background.
  • Terraform and Helm chart authoring.

How We Work:

  • On-Call: The manager engages in all P1/P2 events; ICs rotate night/weekend coverage.
  • Culture: Every incident is an unplanned investment; root causes must be hardened in code, docs, or infrastructure.
  • Collaboration: Teams for triage, Azure DevOps for work tracking, ServiceNow for change control.
  • Location: Hybrid model; three days a week in our Columbus office.