Cloud Infrastructure Site Reliability Engineer (SRE)

Overview

On Site
Depends on Experience
Full Time

Skills

SRE
GCP
DevOps
Python
Bash
Terraform
CloudFormation
Ansible
Docker
Kubernetes

Job Details

Job Title: Cloud Infrastructure Site Reliability Engineer (SRE)

Location: Alpharetta, GA or Berkeley Heights, NJ (Fully Onsite)

Full-time / W2 - with Infinite Computer Solutions

Position Summary:

We are looking for a skilled Cloud Infrastructure Site Reliability Engineer (SRE) with strong expertise in Azure and AWS to help ensure the availability, performance, and scalability of our cloud environments. In this role, you will apply SRE principles to automate operations, drive incident response, and enhance system reliability. Experience with Google Cloud Platform is a plus.

Required skillsets/Qualifications:

  • 7+ years of hands-on experience in SRE, DevOps, or Cloud Infrastructure roles.
  • 3+ years of experience in software development with proficiency in at least one programming language (e.g., Python or Bash).
  • Strong proficiency in Azure and AWS cloud services, including networking, compute, storage, IAM, and monitoring.
  • Hands-on experience with infrastructure automation using Terraform, CloudFormation, or Ansible.
  • Experience with Linux systems, containerization (e.g., Docker, Kubernetes), and scripting (Python, Bash, etc.).
  • Familiarity with CI/CD tools and infrastructure automation.
  • Strong troubleshooting, communication, and collaboration skills.

Good to Have:

  • Experience working in Google Cloud Platform (GCP) environments.
  • Familiarity with regulated enterprise environments (e.g., finance, healthcare).
  • Relevant certifications (AWS, Azure, Google Cloud Platform, Kubernetes, SRE Foundation).

Key Responsibilities:

  • Design, implement, and manage scalable and secure cloud infrastructure primarily on AWS or Azure.
  • Automate infrastructure provisioning and configuration using tools like Terraform, CloudFormation, or Ansible.
  • Set up and maintain observability stacks using tools like CloudWatch or Azure Monitor.
  • Monitor service health, resolve incidents, and lead root cause analyses and post-mortems.
  • Partner with DevOps, engineering, and security teams to improve system design and operational readiness.
  • Define and maintain SLOs, SLIs, and SLAs to meet reliability targets.
  • Contribute to improving deployment pipelines, reducing manual toil, and increasing automation.
  • Document procedures, configurations, and runbooks for operational readiness.