Cloud Site Reliability Engineer - Azure/AWS (34084)

Overview

Remote
On Site
Contract - W2

Skills

Recovery
Management
Real-time
Scripting
Cloud Computing
Auditing
Access Control
Operational Excellence
Innovation
Reliability Engineering
DevOps
Systems Engineering
Linux
IaaS
Amazon Web Services
Microsoft Azure
Kubernetes
Orchestration
Ansible
Incident Management
Workflow
Continuous Integration
Continuous Delivery
Lifecycle Management
Communication
Collaboration
Value Engineering
Apache Kafka

Job Details

Cloud Site Reliability Engineer - AWS & Azure

Responsibilities

  • Oversee the design and improvement of infrastructure using SRE best practices, including IaC, recovery automation, and systems that detect and resolve issues independently.
  • Manage and fine-tune critical services across both cloud and on-prem environments: Kubernetes clusters, CI/CD pipelines, artifact registries, and custom workloads.
  • Enhance observability through intelligent logging, metrics, tracing, and alerting, ensuring systems are transparent and actionable in real time.
  • Champion automation by eliminating repetitive tasks, from deployment workflows to security audits, through scripting and tooling.
  • Elevate the developer experience for 80+ engineers and researchers by streamlining secure, reliable workflows across hybrid and cloud-native platforms.
  • Take ownership of IAM governance across platforms like Azure AD and AWS IAM. Implement lifecycle automation, auditing, and access controls.
  • Foster a culture of operational excellence with strong practices around security, incident management, and resilience engineering.
  • Act as a trusted partner to developers and researchers, enabling their speed and innovation without compromising stability.


Experience

  • Experience in Site Reliability Engineering, DevOps, or Systems Engineering within fast-paced, technically demanding environments.
  • Strong background in Linux systems and cloud infrastructure, with hands-on experience in AWS (primary) and Azure environments.
  • Solid command of Kubernetes and container orchestration in production environments.
  • Expertise in Infrastructure as Code tools such as Ansible; building reproducible, scalable infrastructure is second nature to you.
  • Deep experience in observability and incident response: you know how to set up effective monitoring, handle incidents, and lead blameless post-mortems.
  • A security-first mindset, especially when it comes to protecting distributed systems and developer workflows.
  • Proven ability to support and optimize CI/CD pipelines, container image builds, and artifact lifecycle management.
  • Strong communication and collaboration skills. You build trust across teams and advocate for thoughtful, scalable solutions.
  • Bonus if you've worked with event-driven architectures using technologies like Kafka.

About Myticas LLC