Cloud Infrastructure Site Reliability Engineer (SRE)

Overview

On Site

DOE

Contract - W2

Skills

Continuous Improvement

Operational Efficiency

IaaS

Incident Management

Terraform

Ansible

Collaboration

Software Engineering

Computer Science

Software Development

Python

Java

C++

Amazon Web Services

Google Cloud

Google Cloud Platform

Microsoft Azure

Network Security

Storage

Linux

Computer Networking

Cloud Computing

Continuous Integration

Continuous Delivery

Automated Testing

Provisioning

Management

Root Cause Analysis

Service Level

Reliability Engineering

Financial Services

DevOps

Job Details

Job Summary:

We are seeking a Cloud Infrastructure Site Reliability Engineer (SRE) with expertise across multiple public cloud platforms. This role is responsible for ensuring the reliability, availability, and performance of cloud services by applying SRE principles. The ideal candidate will drive automation, incident response, and continuous improvement across production environments while collaborating with cross-functional teams to enhance cloud reliability and operational efficiency.

Key Responsibilities:

Design, build, and maintain highly available, scalable, and secure cloud infrastructure on platforms such as AWS, Google Cloud Platform, or Azure.

Develop and implement automation for provisioning, monitoring, scaling, and incident response using Infrastructure-as-Code tools (e.g., Terraform, CloudFormation, Ansible).

Monitor system reliability, capacity, and performance; proactively detect and resolve issues.

Respond to production incidents, participate in on-call rotations, and lead post-incident reviews.

Collaborate with software engineering and security teams to ensure production readiness of new services and features.

Build and maintain tools for deployment, monitoring, and operations; automate manual processes to reduce operational toil.

Document operational processes and system architectures for knowledge sharing and repeatability.

Continuously evaluate and implement new technologies to improve system reliability, security, and efficiency.

Required Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.

3+ years of experience in software development with proficiency in at least one programming language (e.g., Python, Go, Java, C++).

Experience administering cloud platforms (AWS, Google Cloud Platform, Azure), including networking, security, containerization, storage, and serverless technologies.

Strong understanding of Linux systems, networking fundamentals, and distributed systems.

Experience with observability tools for monitoring, alerting, and logging in cloud environments.

Familiarity with CI/CD tools for automated testing, deployments, and provisioning.

Ability to manage and respond to incidents, perform root cause analysis, and conduct post-mortem reviews.

Understanding of Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs) for system reliability.

Preferred Qualifications:

Experience working in enterprise-scale financial services or other regulated industries.

Certifications: Certified Engineer, DevOps, SRE, or CSREF (preferred).

Education: Bachelor's Degree

Certification: Certified Engineer, DevOps, Site Reliability Engineer, CSREF
