Overview
On Site
130k - 160k
Full Time
Skills
Machine Learning Operations (ML Ops)
e-commerce
IaaS
Machine Learning (ML)
Health insurance
Software development
Transformation
Management
Amazon Web Services
Kubernetes
Training
Grafana
Software deployment
Cloud computing
Servers
Storage
Middleware
Network
Design
Automation
Orchestration
Collaboration
SAP BASIS
Job Details
Job Description
This Fortune 500 company in the Chicago area is a top 10 North American e-commerce player focused on industrial supplies. Around 2018 it underwent a digital transformation under a new CTO, which enabled growth during the pandemic. Its competitive and challenging environment has helped it retain tech talent.
This SRE/Cloud Infrastructure Engineer role involves managing AWS-hosted Kubernetes platforms engineered for machine learning workloads like training, experimentation, and serving. Responsibilities include ensuring a robust and scalable infrastructure for advanced ML workloads, implementing and managing monitoring tools (Grafana, Loki, Prometheus, Thanos), and maintaining continuous deployment using GitOps practices with ArgoCD and Flux.
The engineer will build, test, configure, tune, and support the Kubernetes infrastructure in the cloud, encompassing servers, storage, middleware, network, and client technologies. They will design and implement automation solutions across multiple platforms, recommending improvements for automated tools and identifying opportunities for increased orchestration adoption. The individual will work in a large, complex 24/7 e-commerce environment, gaining experience with various on-premises and cloud-based applications, as part of the Machine Learning Operations team supporting the ML platform.
Required Skills & Experience
Applicants must be currently authorized to work in the US on a full-time basis now and in the future.
- 5+ years of professional experience
- In-depth Kubernetes experience
- ArgoCD
- Monitoring tools like Grafana or Prometheus
- Experience supporting ML platforms
- At least 2 years supporting GitOps
- Flux
- 70% Hands On
- 30% Team Collaboration
Benefits
- Bonus eligible
- Medical Insurance
- Dental Benefits
- Vision Benefits
- Paid Time Off (PTO)
- 401(k)