Hello,
My name is Shubham Pal and I am a Staffing Specialist at Sapear Inc. I am reaching out to you about an exciting job opportunity with one of our clients.
Title: Site Reliability Engineer (SRE), ML Platform
Location: Austin, TX OR Sunnyvale, CA
Type: FTE/FTC
Responsibilities:
- Continuous Deployment using GitHub Actions, Flux, Kustomize
- Design and implement cloud solutions and build MLOps pipelines on AWS
- Containerize and deploy data science models using Docker, vLLM, and Kubernetes
- Communicate with a team of data scientists, data engineers, and architects, and document processes
- Develop and deploy scalable tools and services for our clients to handle machine learning training and inference.
- Knowledge of ML models and LLMs
Qualifications:
- 6+ years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS.
- Good understanding of Apache Solr.
- Proficient with Linux administration.
- Knowledge of ML models and LLMs.
- Ability to understand tools used by data scientists, plus experience with software development and test automation
- Ability to design and implement cloud solutions and build MLOps pipelines on AWS
- Experience working with cloud computing and database systems
- Experience building custom integrations between cloud-based systems using APIs
- Experience developing and maintaining ML systems built with open-source tools
- Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, as well as Docker and Kubernetes
- Experience developing containers and working with Kubernetes in cloud computing environments
- Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
- Ability to translate business needs to technical requirements
- Strong understanding of software testing, benchmarking, and continuous integration
- Exposure to machine learning methodology and best practices
- Good communication skills and ability to work in a team
Note: The focus of the role is 60% SRE and 40% MLOps.
Skill Area | Includes | Weight (%) |
Platform Reliability & Containerization | Kubernetes, Docker, Microservices, Linux | 30% |
MLOps & AWS Cloud | Model deployment, versioning, monitoring, AWS (SageMaker, S3, Lambda, EKS) | 25% |
CI/CD & GitOps | GitHub Actions, Flux | 15% |
Monitoring & Observability | Splunk, Grafana, Prometheus, performance tracking | 15% |
Integration & Collaboration | Python scripting, API integrations, Apache Solr, LLM awareness, teamwork with data scientists & engineers | 15% |
Shubham Pal
Lead Business Development Manager
Sapear Inc.
Email :
Cell : +1