Cloud Site Reliability Engineer

Hybrid in New York, NY, US • Posted 4 days ago • Updated 2 days ago

Contract Independent

On-site

Job Details

Summary

Hybrid role: Must be within a one-hour drive of one of the following cities: New York, NY; San Francisco, CA; Philadelphia, PA; Boston, MA; Richmond, VA; St. Louis, MO; Minneapolis, MN; Dallas, TX; Cleveland, OH; Charlotte, NC; Kansas City, KS; Atlanta, GA

As a Senior Cloud Engineer in the Cloud SRE team, you will be responsible for designing and developing cloud solutions and engineering reliability tools for the Cloud Foundation Services (CFS) platform in the Infrastructure, Platforms, & Operations organization. You will apply software engineering practices to build scalable, reusable solutions and utilities that enhance platform reliability across the Federal Reserve System.

Responsibilities

Design, develop, and maintain reliability solutions and SRE utilities to reduce toil, improve cloud platform reliability, and industrialize SRE practices across the system

Build and optimize Infrastructure as Code (IaC) using Terraform to manage AWS resources related to SRE solutions, incorporating cost-efficient design principles

Develop CI/CD pipelines and automated testing to ensure code quality, reliability, and rapid delivery of the solutions

Define SRE standards, best practices, and guidelines for adoption across teams; establish SRE metrics such as SLIs and SLOs

Apply software engineering best practices, including version control, code reviews, test-driven development, and documentation, to all development work

Participate in incident management and on-call rotation, providing technical support for SRE tools, troubleshooting production issues, and collaborating with teams to reduce incident recurrence through proactive detection and pattern analysis

Stay current with emerging AWS services, SRE methodologies, and cloud-native development technologies, and drive adoption of innovative solutions

Collaborate within Agile and Scaled Agile frameworks with cross-functional teams to deliver integrated cloud automation solutions

Produce clear, blameless postmortems with actionable items and documented failure scenarios

Qualifications

Seven years of experience in software development, with focus on reliability and platform engineering

Five years of Python development experience, with a proven record of building enterprise-grade, highly available tools, APIs, and utilities

A minimum of three years of hands-on experience developing solutions in AWS environments, with a deep understanding of core services (EC2, VPC, S3, Lambda, IAM, CloudFormation, EventBridge, Step Functions, etc.) and resource cost optimization

Three years of experience applying SRE principles, including observability, toil automation, SLIs/SLOs, and reliability engineering

Expert-level proficiency with Infrastructure as Code (IaC) using Terraform, including module development and state management

Strong experience with CI/CD pipelines, automated testing frameworks, and DevOps practices

Experience with observability tools and practices, including Grafana, AWS CloudWatch, and AWS Canary

Experience defining, implementing, and managing SLOs/SLIs and error budgets; familiarity with conducting RCAs and producing postmortem documentation

Working experience in Agile and Scaled Agile environments, and familiarity with ITSM processes (incident, change, and problem management), resilience testing, and chaos engineering practices

Experience with GoLang or additional programming languages is a plus

Requirements

Must-Have Skills:

1) 7 years of experience in software development, with focus on reliability and platform engineering

2) 5 years of Python development experience, with a proven record of building enterprise-grade, highly available tools, APIs, and utilities

3) At least 3 years of hands-on experience developing solutions in AWS environments, with a deep understanding of core services (EC2, VPC, S3, Lambda, IAM, CloudFormation, EventBridge, Step Functions, etc.) and resource cost optimization

4) 3 years of experience applying SRE principles, including observability, toil automation, SLIs/SLOs, and reliability engineering

5) Expert-level proficiency with Infrastructure as Code (IaC) using Terraform, including module development and state management

6) Strong experience with CI/CD pipelines, automated testing frameworks, and DevOps practices

7) Experience with observability tools and practices, including Grafana, AWS CloudWatch, and AWS Canary

8) Experience defining, implementing, and managing SLOs/SLIs and error budgets; familiarity with conducting RCAs and producing postmortem documentation

9) Working experience in Agile and Scaled Agile environments, and familiarity with ITSM processes (incident, change, and problem management), resilience testing, and chaos engineering practices

Preferred Skills (Nice to Have):

1) Experience with GoLang or additional programming languages

Dice Id: 91172131
Position Id: 326000001679031