Job Details
Job Title: Principal Site Reliability Engineer
Location: Washington, DC (Onsite Only)
Talent must reside in Washington, DC at time of submission
Position Type: Contract
Job Summary
We are seeking a Principal Site Reliability Engineer for a key
Randstad client based in Washington, DC. This senior-level position
plays a pivotal role in ensuring the reliability, scalability,
security, and performance of the organization's critical systems and
services. The ideal candidate will have deep technical knowledge in
SRE practices, infrastructure automation, CI/CD security, and
observability, along with strong leadership and mentoring
capabilities.
Responsibilities
Reliability & Operations
Define and manage Service Level Objectives (SLOs) and Service Level
Indicators (SLIs)
Own the error budget process
Lead incident response, root cause analysis, and postmortem documentation
Infrastructure Automation
Design and maintain cloud environments using Infrastructure as Code
(IaC) tools such as Terraform, Ansible, and CloudFormation
CI/CD Optimization & Security
Architect secure, high-performing CI/CD pipelines (e.g., GitHub
Actions, Jenkins)
Implement deployment strategies like canary, blue/green, and automated rollback
Observability & Telemetry
Develop observability solutions with metrics, logs, and traces using
tools like Prometheus, Grafana, Datadog, or ELK
Configure dashboards, alerts, and synthetic monitoring
Security & Compliance
Integrate security scanning tools (SAST, DAST, SBOM) into pipelines
Enforce security policies as code and ensure regulatory compliance
Cost & Capacity Management
Monitor cloud usage trends and optimize infrastructure for cost efficiency
Forecast resource requirements to maintain availability and performance
Internal Platform Enablement
Build reusable tools, platforms, and self-service frameworks
Improve developer workflows and consistency across teams
Mentorship & Technical Leadership
Serve as a technical mentor and thought leader
Establish and promote best practices in site reliability, operational
excellence, and secure system delivery
Required Qualifications
Education
Bachelor's degree in Computer Science, Engineering, or a related technical field
Experience
Minimum of 5 years in Site Reliability Engineering, DevOps, or Platform
Engineering
At least 3 years managing high-availability cloud-native production
environments
Technical Skills
Cloud: Deep experience with AWS, Azure, or Google Cloud Platform (focus on Compute, IAM,
Networking, Monitoring)
IaC: Proficiency in Terraform, CloudFormation, Ansible
CI/CD: Hands-on experience with GitHub Actions, Jenkins, and modern
deployment strategies
Containers: Expertise with Docker, Kubernetes
Observability: Tools such as Prometheus, Grafana, ELK, Datadog, or CloudWatch
Programming & Scripting
Strong scripting skills in Python, Go, or Bash
Knowledge & Practices
Solid understanding of SRE principles (SLOs, incident management,
chaos engineering)
Experience building internal tools and documentation that promote best practices
Additional Information
Onsite Requirement: This is a non-remote role. Candidates must be
local to Washington, DC at the time of submission.
Work Authorization: [Insert if any restrictions apply e.g., USC/H1B, etc.]
Clearance Requirement: [Insert if applicable]