Overview
On Site
Depends on Experience
Full Time
Unable to Provide Sponsorship
Skills
Amazon Web Services
Automated Testing
Continuous Integration and Development
Cloud Computing
Data Science
Docker
GitHub
Kubernetes
Large Language Models (LLMs)
Linux Administration
Machine Learning Operations (ML Ops)
Orchestration
Software Testing
Lifecycle Management
Machine Learning (ML)
Job Details
Site Reliability Engineer (SRE) ML Platform
Austin, TX / Sunnyvale, CA
Job Description
Roles and Responsibilities
- Build and maintain continuous deployment pipelines using GitHub Actions, Flux, and Kustomize.
- Design and implement scalable cloud-based MLOps solutions on AWS.
- Containerize and deploy data science and machine learning models using Docker, vLLM, and Kubernetes.
- Collaborate effectively with data scientists, data engineers, and solution architects; document processes and system designs.
- Develop and deploy scalable tools and services for training and inference of machine learning models.
- Apply knowledge of machine learning models, including large language models (LLMs), in production environments.
Qualifications
- 6+ years of experience in MLOps or related roles, with strong expertise in Kubernetes, Python, MongoDB, and AWS.
- Proficiency in Linux system administration.
- Solid understanding of Apache Solr.
- Hands-on experience with containerization and orchestration using Docker and Kubernetes in cloud environments.
- Experience building and maintaining MLOps pipelines using frameworks like Kubeflow, MLflow, DataRobot, or Airflow.
- Familiarity with workflow orchestration tools such as Argo, Airflow, or Kubeflow Pipelines.
- Experience in developing custom cloud integrations using APIs.
- Knowledge of machine learning methodologies, best practices, and model lifecycle management.
- Proven ability to develop and maintain machine learning systems using open-source tools.
- Understanding of the tools and workflows used by data scientists, with experience in test automation and CI/CD practices.
- Strong software testing, benchmarking, and continuous integration skills.
- Ability to translate business requirements into scalable technical solutions.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.