Job Details
Position: Senior DevOps and Site Reliability Engineer (SRE)
Location: Washington, D.C. (Onsite)
Duration: 12-month contract
Overview
We are seeking a highly experienced and technically proficient Senior DevOps and Site Reliability Engineer (SRE) to join a leading organization in the DC Metro area. This senior-level role is responsible for ensuring the reliability, performance, security, and scalability of high-availability production environments on AWS. The ideal candidate will be a hands-on technical leader who combines deep expertise in software development, infrastructure-as-code, and observability to automate operations, lead capacity planning, and serve as a key on-call responder for critical incidents.
This position emphasizes SRE principles (SLIs/SLOs/Error Budgets), team mentorship, and collaboration across engineering and product teams to drive operational excellence.
Key Responsibilities
Deployment & Automation
- Build, maintain, and optimize CI/CD pipelines using tools like GitHub Actions, AWS CodePipeline, or Jenkins.
- Automate infrastructure provisioning and configuration using Terraform, CloudFormation, or AWS CDK.
- Develop automation scripts and self-service tools to improve operational and development efficiency.
- Utilize programming languages such as Python, Go, or Java for automation and troubleshooting.
Site Reliability & Observability
- Serve as an on-call responder for production systems, leading incident response and recovery efforts.
- Conduct post-incident reviews and drive root cause analysis and systemic improvements.
- Define and monitor SLIs, SLOs, and Error Budgets to ensure service reliability.
- Leverage observability tools like Dynatrace, AppDynamics, or ELK Stack for proactive monitoring.
- Use distributed tracing to identify performance bottlenecks and system inefficiencies.
- Develop custom dashboards and alerts to generate actionable insights.
Capacity, Performance & Cost Optimization
- Develop capacity models and forecasting systems to ensure scalability.
- Lead cloud cost optimization initiatives across environments.
- Design and execute resiliency and performance testing frameworks.
- Configure and maintain auto-scaling policies for efficient resource utilization.
Security & Compliance
- Lead security incident investigations and implement remediation measures.
- Automate compliance validation and integrate security automation into DevOps pipelines.
- Contribute to zero-trust architecture implementation within cloud environments.
- Apply ITIL principles using ITSM tools (e.g., ServiceNow).
Qualifications
Education & Experience
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5-8 years of experience in DevOps, SRE, or Platform Engineering.
- 3+ years of experience managing and optimizing production systems in the cloud.
- Proven ability to lead complex technical initiatives end-to-end.
Technical Expertise
- Deep understanding of AWS (preferred), including architecture, networking, and core services.
- Advanced proficiency with Terraform, CloudFormation, or AWS CDK.
- Strong knowledge of observability tools, especially Dynatrace.
- Proficient in Python, Go, or Java for automation and scripting.
- Experience with relational, NoSQL, and cloud-native databases.
Professional Skills
- Strong leadership and mentoring abilities.
- Excellent communication and collaboration skills with cross-functional teams.
- Strong documentation and incident reporting skills (RCA, knowledge base, etc.).
- Flexible availability for on-call responsibilities and after-hours incident management.
Thanks & Regards
Bhargav Kalyadurg
ASPIRE IT SOLUTIONS INC.