Job Details
Location: Austin, TX / Sunnyvale, CA (hybrid, 3 days onsite)
Key Responsibilities
Provide Level-1 support for distributed ML workloads running on Ray and related frameworks.
Monitor, troubleshoot, and resolve issues in MLOps pipelines and distributed systems.
Assist in performance tuning of ML models and infrastructure for optimized execution.
Support Flink workloads and ensure smooth integration with data/ML pipelines.
Write and maintain automation scripts in Python or shell to streamline operational workflows.
Tune ML workloads to improve training efficiency and inference performance.
Work closely with L2/L3 Support, DevOps, and Data Science teams to escalate and resolve complex issues.
Document troubleshooting steps and standard procedures, and create runbooks for repetitive support tasks.
Required Skills & Qualifications
Hands-on experience with Ray for distributed ML workloads.
Knowledge of MLOps workflows and pipeline orchestration.
Understanding of Flink (or similar distributed data frameworks).
Proficiency in Python or Shell scripting for automation.
Familiarity with performance tuning and ML tuning techniques.
Strong troubleshooting and problem-solving skills in production support environments.
Good communication skills and the ability to work collaboratively across teams.