Job Details
Location: Washington, D.C. area (onsite; local candidates only)
Duration: 12-month contract
About the Role
We are seeking a Principal Site Reliability Engineer (SRE) to lead the operational excellence, resilience, and security of our client's core systems. This role combines deep technical expertise in infrastructure automation, CI/CD architecture, and cloud security with strong Site Reliability Engineering principles. You'll define SLOs, manage incident response, optimize cloud costs, and mentor teams to deliver secure, scalable, and highly available systems.
Key Responsibilities
Reliability Engineering & Operations
- Define and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Lead incident response, root cause analysis, and postmortem reviews to drive continuous improvement.
- Implement and manage error budgets to balance reliability and innovation.
Infrastructure Automation
- Design and manage secure, scalable, and automated environments using Terraform, Ansible, or CloudFormation.
- Champion Infrastructure-as-Code (IaC) best practices for consistency and repeatability.
CI/CD Optimization & Security
- Architect and enhance CI/CD pipelines (GitHub Actions, Jenkins) with advanced deployment methods such as canary releases, blue/green deployments, and automated rollback.
- Integrate security gates (SAST, DAST, SBOM, secrets scanning) into the build and deployment lifecycle.
Observability & Telemetry
- Build and maintain observability frameworks, including dashboards, alerts, metrics, and tracing pipelines.
- Use tools like Prometheus, Grafana, ELK, Datadog, and CloudWatch to ensure full visibility and proactive monitoring.
Cost & Capacity Management
- Implement cost monitoring and right-sizing strategies to optimize cloud resources.
- Plan capacity and availability in alignment with business goals.
Platform Enablement & Mentorship
- Develop internal tools, playbooks, and self-service platforms to enhance developer efficiency.
- Mentor cross-functional teams on SRE best practices, operational readiness, and secure delivery.
Qualifications
Education & Experience
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5+ years in SRE, DevOps, or Platform Engineering, including technical leadership roles.
- 3+ years managing production-grade cloud environments with advanced security and observability practices.
Technical Skills
- Expertise in AWS, Azure, or Google Cloud Platform, with strong knowledge of compute, networking, IAM, and monitoring.
- Proficient with Terraform, CloudFormation, Kubernetes, and Docker.
- Strong Linux administration and scripting (Bash, Python, or Go).
- Hands-on experience with CI/CD, GitOps, and observability stacks.
Core Competencies
- Deep understanding of SRE principles, including SLOs, SLAs, incident management, chaos engineering, and capacity planning.
- Strong communicator and collaborator with a passion for building reliable, secure, and efficient systems.
- Proven ability to create and share operational tooling, documentation, and best practices across teams.
Thanks & Regards
Bhargav Kalyandurg
ASPIRE IT SOLUTIONS INC.