Site Reliability Engineer (SRE), ML Platform

Overview

On Site
$40 - $50
Accepts corp to corp applications
Contract - Independent
Contract - W2
Contract - 12 Month(s)
100% Travel
Able to Provide Sponsorship

Skills

Continuous Delivery
Automated Testing
Benchmarking
Cloud Computing
Collaboration
Communication
API
Amazon S3
Amazon SageMaker
Amazon Web Services
Linux Administration
Database
Docker
FOCUS
GitHub
Grafana
Kubernetes
Apache Solr
Continuous Integration
Continuous Integration and Development
Data Science
MongoDB
Open Source
Orchestration
Python
Scripting
Software Development
Linux
Machine Learning (ML)
Machine Learning Operations (ML Ops)
Microservices
Software Testing
Splunk
Teamwork
Training
Workflow

Job Details

Note: This position is open only for C2C candidates.

Responsibilities:
Continuous deployment using GitHub Actions, Flux, and Kustomize.
Design and implement cloud solutions and build MLOps pipelines on AWS.
Containerize and deploy data science models using Docker, vLLM, and Kubernetes.
Communicate with a team of data scientists, data engineers, and architects, and document the processes.
Develop and deploy scalable tools and services for our clients to handle machine learning training and inference.
Knowledge of ML models and LLMs.
Qualifications:
6+ years of experience in MLOps, with strong knowledge of Kubernetes, Python, MongoDB, and AWS.
Good understanding of Apache Solr.
Proficient with Linux administration.
Knowledge of ML models and LLMs.
Ability to understand tools used by data scientists, plus experience with software development and test automation.
Ability to design and implement cloud solutions and build MLOps pipelines on AWS.
Experience working with cloud computing and database systems.
Experience building custom integrations between cloud-based systems using APIs.
Experience developing and maintaining ML systems built with open-source tools.
Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, as well as Docker and Kubernetes.
Experience developing containers and working with Kubernetes in cloud computing environments.
Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.).
Ability to translate business needs into technical requirements.
Strong understanding of software testing, benchmarking, and continuous integration.
Exposure to machine learning methodology and best practices.
Good communication skills and ability to work in a team.

Note: The focus of this role is 60% SRE and 40% MLOps.

Skill Area | Includes | Weight (%)
Platform Reliability & Containerization | Kubernetes, Docker, Microservices, Linux | 30%
MLOps & AWS Cloud | Model deployment, versioning, monitoring, AWS (SageMaker, S3, Lambda, EKS) | 25%
CI/CD & GitOps | GitHub Actions, Flux | 15%
Monitoring & Observability | Splunk, Grafana, Prometheus, performance tracking | 15%
Integration & Collaboration | Python scripting, API integrations, Apache Solr, LLM awareness, teamwork with data scientists & engineers | 15%


About Padmas Technology LLC