Senior DevOps and SRE Engineer

Overview

On Site

Depends on Experience

Contract - W2

Contract - Independent

Contract - 12 Month(s)

Skills

Senior DevOps and SRE Engineer

DevOps

SRE Engineer

SRE

CI/CD

SLIs/SLOs/Error Budgets

GitHub Actions

AWS CodePipeline

Jenkins

Infrastructure

IaC

Terraform

CloudFormation

AWS CDK

Job Details

Role: Senior DevOps and SRE Engineer

Location: Washington, DC

Duration: Long term (Onsite)

Job Description:

Randstad is seeking a highly experienced and technically proficient Senior DevOps and Site Reliability Engineer (SRE) to join our client in the DC Metro area. This critical, senior-level role is responsible for driving the reliability, performance, security, and scalability of high-availability production environments on AWS. The ideal candidate is a hands-on technical leader who blends deep expertise in software development, infrastructure-as-code, and observability to automate operational toil, lead capacity planning, and serve as a primary on-call responder for critical incidents. This role demands a strong focus on applying SRE principles (SLIs/SLOs/Error Budgets), mentoring team members, and proactively influencing cross-functional teams to achieve world-class operational excellence.

Responsibilities:

Deployment & Automation Engineering:

Implement, maintain, and optimize robust CI/CD pipelines utilizing tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
Automate infrastructure provisioning and configuration management using Infrastructure-as-Code (IaC) tools like Terraform, CloudFormation, or AWS CDK.
Design and develop automation scripts and self-service tools to significantly enhance development and operational efficiency.
Use multiple programming languages (Python, Go, Java) to develop automation and troubleshoot applications.

Site Reliability & Observability:

Serve as a production on-call responder, leading incident management and coordinating the response to critical service outages and disaster recovery failover activities.
Facilitate detailed post-mortem meetings and drive systemic improvement patterns across teams.
Define, monitor, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
Expertly leverage observability tools (Dynatrace, AppDynamics, ELK Stack; Dynatrace strongly preferred) for proactive monitoring and troubleshooting.
Utilize distributed tracing and context propagation to identify performance bottlenecks and root causes of failures.
Design and implement custom dashboards and anomaly detectors to generate actionable insights.
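
For candidates less familiar with the SLI/SLO/Error Budget terminology above, the arithmetic can be sketched as follows. This is a minimal illustration only, assuming a request-based availability SLI; the function name, field names, and sample numbers are hypothetical and do not describe the client's actual tooling or targets.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize a request-based availability SLI against an SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Share of the budget already spent (1.0 means fully exhausted).
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(error_budget_report(0.999, 1_000_000, 250))
```

With these sample numbers the SLO tolerates roughly 1,000 failed requests, so 250 failures consume about a quarter of the budget.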

Capacity, Performance & Cost Management:

Develop sophisticated capacity models and forecasting systems to ensure service scalability.
Lead cost optimization initiatives, identifying and implementing efficiency gains across cloud services.
Design and execute comprehensive Resiliency and Performance testing frameworks.
Configure and maintain dynamic auto-scaling policies and thresholds for optimal resource utilization.

Security & Governance:

Lead security incident investigations and execute swift remediation plans.
Design and implement automated compliance validation and security automation frameworks.
Drive the implementation of zero-trust architecture patterns within the cloud environment.
Proficiently apply ITIL framework principles, preferably leveraging ITSM tools such as ServiceNow.

Qualifications:

Education & Experience:

Bachelor's degree in Computer Science, Engineering, or a related technical field.
5 to 8 years of progressive experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering.
3+ years of experience maintaining and optimizing high-availability production environments.
Proven track record of leading complex technical initiatives from conception to completion.

Technical Expertise:

Expert-level knowledge of at least one major cloud platform, with AWS strongly preferred.
Deep expertise in cloud architecture, networking, and core services.
High proficiency in IaC tools such as Terraform, CloudFormation, or AWS CDK.
Expert-level experience with observability and APM tools, with a strong preference for Dynatrace.
Proficiency in modern programming languages like Python, Go, or Java.
Knowledge of relational, cloud-native, and NoSQL database technologies.

Professional & Leadership Skills:

Strong leadership and mentoring capabilities, with the ability to elevate the technical skills of the team.
Exceptional ability to influence without direct authority across engineering and product teams.
Excellent technical writing and documentation skills (e.g., RCA development, Knowledge articles).
Ability to maintain flexible availability for on-call duties and to work outside of standard business hours as required for incident response.

