Site Reliability Engineer (SRE) - SageMaker and AI/ML

  • Johnston, RI
  • Posted 10 days ago | Updated 10 days ago

Overview

On Site
$120,000 - $140,000
Full Time

Skills

SRE
AI/ML
MLOps

Job Details

Objectives of this role:

  • Run the production environment by monitoring availability and taking a holistic view of system health.
  • Build software and systems to manage platform infrastructure and applications.
  • Improve reliability, quality, and time-to-market of our suite of software solutions.
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
  • Provide primary operational support and engineering for multiple large-scale distributed software applications.

Responsibilities:

  • Gather and analyse metrics from operating systems as well as applications to assist in performance tuning and fault finding.
  • Partner with the development team, data scientists, and MLOps architects/engineers to improve services through rigorous testing and release procedures.
  • Participate in system design consulting, platform management, production troubleshooting, and capacity planning.
  • Create and manage sustainable systems and services through automation and uplifts.
  • Balance feature development speed and reliability with well-defined service-level objectives.

Required skills and qualifications:

  • Ability to program (structured and OOP) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript.
  • Experience working with AWS services such as Amazon S3, Amazon SageMaker, and Amazon Bedrock.
  • Excellent working knowledge of cloud-native infrastructure, such as AWS Lambda and OpenShift.
  • Good understanding of API management and the ability to troubleshoot API-related issues.
  • Automation mindset for managing cloud infrastructure using AWS CloudFormation or Terraform.
  • Impeccable creative and communication skills.
  • Ability to solve problems in a fast-paced, high-stakes environment.
  • Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.