Site Reliability Engineer (SRE), ML Platform

Overview

On Site
Depends on Experience
Contract - W2
Contract - 12 Month(s)

Skills

Site Reliability Engineer
SRE
Machine Learning
ML
GitHub
AWS
MLOps
Docker
vLLM
Kubernetes
Apache Solr

Job Details

Title: Site Reliability Engineer (SRE), ML Platform
Location: Austin, TX OR Sunnyvale, CA
Type: FTC

Responsibilities:

  • Run continuous deployment using GitHub Actions, Flux, and Kustomize
  • Design and implement cloud solutions and build MLOps pipelines on AWS
  • Containerize and deploy data science models using Docker, vLLM, and Kubernetes
  • Communicate with a team of data scientists, data engineers, and architects, and document the processes
  • Develop and deploy scalable tools and services for our clients to handle machine learning training and inference.
  • Knowledge of ML models and LLMs

Qualifications:

  • 6+ years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS.
  • Good understanding of Apache Solr.
  • Proficient with Linux administration.
  • Knowledge of ML models and LLMs.
  • Ability to understand tools used by data scientists, plus experience with software development and test automation
  • Ability to design and implement cloud solutions and to build MLOps pipelines on AWS
  • Experience working with cloud computing and database systems
  • Experience building custom integrations between cloud-based systems using APIs
  • Experience developing and maintaining ML systems built with open-source tools
  • Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, and with Docker and Kubernetes
  • Experience developing containers and working with Kubernetes in cloud computing environments
  • Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
  • Ability to translate business needs into technical requirements
  • Strong understanding of software testing, benchmarking, and continuous integration
  • Exposure to machine learning methodology and best practices
  • Good communication skills and ability to work in a team

Note: The focus of the role is 60% SRE and 40% MLOps.

Skill Area | Includes | Weight (%)
Platform Reliability & Containerization | Kubernetes, Docker, Microservices, Linux | 30
MLOps & AWS Cloud | Model deployment, versioning, monitoring, AWS (SageMaker, S3, Lambda, EKS) | 25
CI/CD & GitOps | GitHub Actions, Flux | 15
Monitoring & Observability | Splunk, Grafana, Prometheus, performance tracking | 15
Integration & Collaboration | Python scripting, API integrations, Apache Solr, LLM awareness, teamwork with data scientists & engineers | 15
