Cloud Infrastructure & ML Operations Engineer (Ray, AWS, Google Cloud Platform)

Overview

Remote
Depends on Experience
Accepts corp to corp applications
Contract - W2

Skills

Machine Learning (ML)
Machine Learning Operations (ML Ops)
Ray
Amazon Web Services
GCP
Google Cloud Platform
Python
DevOps

Job Details

We are looking for a Cloud Infrastructure & ML Operations Engineer with expertise in Ray, AWS, Google Cloud Platform, Python, and ML Ops, and with 10+ years of experience in IT.

  • Operate, monitor, and troubleshoot production & non-production cloud environments (AWS, Google Cloud Platform).
  • Automate deployment, orchestration, and routine operational processes.
  • Manage and optimize Ray clusters for ML tuning and inference endpoints.
  • Perform capacity planning, scale testing, and disaster recovery exercises.
  • Collaborate with Engineering, QA, and Program Management teams.
  • Design and develop RESTful/RPC APIs and services using Golang or Python.
  • Implement and maintain SLO/SLI metrics and error budget reports.

Main Skills Required:

  • Ray Framework (deep understanding & operational experience).
  • Cloud Platforms: AWS & Google Cloud Platform.
  • Programming: Golang or Python.
  • ML Ops Skills: Troubleshooting ML inference endpoints, performance tuning.
  • DevOps & Automation: Deployment pipelines, orchestration tools.
  • Monitoring & Reliability: SLO/SLI, capacity planning, disaster recovery.
  • Collaboration Skills: Cross-team communication and problem-solving.