Do something big and innovative! Stretch your creative muscles and work on big issues. Since 1989, we have developed technology environments, applications, and tools by providing experienced teams to implement, enhance, and maintain our clients' essential systems and applications. Come join the Scalence team!
Job Title: ML Serving Operations Analyst
Duration: 12+ months
Location: 100% Remote - Pacific work hours (Must be local to the Bay Area)
Pay rate: up to $55/hr. W2 with benefits
Job Summary:
The Resource Management team is responsible for end-to-end resource planning and provisioning on our client's infrastructure, including Budgeting, Compute, Storage, Accelerators & Network, and Data Center infrastructure resources, to support Engineering (Eng) and Site Reliability Engineering (SRE) service-related requests. The team handles tactical execution tasks that cannot yet be automated in order to improve service response times and reduce risk to the client's infrastructure. Additionally, the team supports data-driven decision-making and leverages machine learning (ML) techniques to enhance forecasting, automation, and operational efficiency.
The ideal candidate will have an engineering degree (for example, in computer science), experience running terminal commands, and a strong understanding of SQL, machine learning fundamentals, and computer hardware terminology.
Requirements:
- Respond to Pool Minding Alerts to proactively keep production service pools healthy and reduce reliability risk, leveraging ML-based alerting and anomaly detection where applicable.
- Manage Resource Requests from SRE/Eng to the FTE team for all Infrastructure services, incorporating predictive insights from ML models where available.
- Manage Supply Planning Operations, including ordering weekly resources (Machine Orders), writing the weekly health reports, monitoring in-progress orders, and escalating in case of SLO slippage for critical growth dependencies, with support from ML-based forecasting models.
- Establish migration execution plans to move services between locations to mitigate data center constraints, using data analysis and ML-driven capacity planning insights.
- Execute replacement plans for large-scale infrastructure projects, such as cluster turndowns, cluster migrations due to limited data center space, and service rebalancing due to resource constraints, potentially guided by ML-based optimization models.
- Assist in Special Projects (e.g. building data pipelines for automated reporting & metrics management, and supporting ML model data pipelines).
- Update vendor playbooks as processes change, subject to FTE review and approval, including documentation of ML-enabled workflows where applicable.
Other requirements:
- Required to attend weekly meetings with the client stakeholders and any additional meetings that the client deems necessary.
- Required to provide written reports such as: Weekly Supply/Demand fulfillment status report; Weekly Flexpool low inventory alert report; Weekly Operation ticket queue report on aging tickets and reasons; and Operational project status report, incorporating insights derived from data analysis and ML models where relevant.
- Respond to resource ticket requests.
- Manage resource pool alerts and machine orders, including ML-assisted alert prioritization.
- Support pool migrations.
- Perform data analysis to measure operational performance, including applying machine learning techniques for trend analysis, forecasting, and anomaly detection.