Overview
On Site
$120,000 - $140,000
Full Time
Skills
SRE
AI/ML
MLOps
Job Details
Objectives of this role:
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Build software and systems to manage platform infrastructure and applications.
- Improve reliability, quality, and time-to-market of our suite of software solutions.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
- Provide primary operational support and engineering for multiple large-scale distributed software applications.
Responsibilities:
- Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
- Partner with the development team, data scientists, and MLOps architects and engineers to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, troubleshooting of production issues, and capacity planning.
- Create/manage sustainable systems and services through automation and uplifts.
- Balance feature development speed and reliability with well-defined service-level objectives.
Required skills and qualifications:
- Ability to program (structured and OOP) using one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript.
- Experience working with AWS services such as Amazon S3, Amazon SageMaker, and Amazon Bedrock.
- Excellent working knowledge of cloud-native infrastructure, such as AWS Lambda and OpenShift.
- Good understanding of API management, with the ability to troubleshoot API-related issues.
- Automation mindset for managing cloud infrastructure using AWS CloudFormation or Terraform.
- Impeccable creative and communication skills.
- Ability to solve problems in a fast-paced, high-stakes environment.
- Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.