Job Details
Title: Lead Site Reliability Engineer
Job Type: Contract
Location: Concord, North Carolina
We are seeking a Lead Site Reliability Engineer (SRE) with deep expertise in AWS networking, infrastructure automation, and production system reliability. This role demands a strong grasp of observability, operational excellence, and the ability to drive the adoption of DevOps/SRE best practices across engineering teams. You will be instrumental in shaping SLIs/SLOs, defining our DevOps maturity roadmap, and building robust, scalable infrastructure using Terraform, Lambda, Step Functions, and more.
You'll lead a team of SREs and collaborate closely with DevOps, Security, and Application teams to ensure reliable delivery and availability of services.
Key Responsibilities:
- Lead and mentor a team of SREs in developing scalable infrastructure and operational processes.
- Design and implement SLIs, SLOs, and error budgets across critical services and evangelize them across product teams.
- Architect and manage AWS networking environments including VPCs, Transit Gateways, Route 53, VPNs, NACLs, and Security Groups.
- Manage and monitor Palo Alto and FortiGate firewalls, and integrate them with cloud environments for hybrid network visibility.
- Define and evolve DevOps maturity models, guiding teams toward higher automation and reliability.
- Build and manage observability dashboards using Grafana, CloudWatch, and Datadog to track application and infrastructure health.
- Implement and maintain Infrastructure as Code (IaC) using Terraform to automate cloud deployments across environments.
- Develop and maintain serverless applications using AWS Lambda and Step Functions to support platform automation and operations.
- Collaborate with developers to define GitLab CI/CD pipelines and streamline the build, test, and deployment lifecycle.
- Champion incident response, blameless postmortems, and continuous improvement initiatives.
- Write scripts in Python or Bash to automate tasks and integrate systems.
Required Qualifications:
- 7+ years in SRE, DevOps, or Systems Engineering roles with increasing responsibility.
- Proven experience managing AWS production environments with a focus on networking.
- In-depth knowledge of Palo Alto and/or FortiGate firewall management and troubleshooting.
- Expertise in monitoring and observability tools, including Grafana and Datadog.
- Hands-on experience with Terraform in managing cloud infrastructure at scale.
- Experience building and deploying serverless architectures using Lambda and Step Functions.
- Demonstrated understanding of SLI/SLO design, error budgets, and reliability metrics.
- Strong understanding of CI/CD principles and tools like GitLab CI/CD.
- Proficiency in scripting using Python or Bash.
Preferred Qualifications:
- AWS certifications (e.g., Solutions Architect, Advanced Networking, DevOps Engineer).
- Familiarity with DevOps/SRE maturity models and experience implementing organizational transformation.
- Experience with compliance frameworks (SOC 2, ISO 27001, etc.) as they pertain to infrastructure reliability.
- Familiarity with container orchestration is a plus.
Soft Skills:
- Strong leadership and mentoring capabilities.
- Ability to translate complex technical problems into actionable initiatives.
- Excellent communication and cross-functional collaboration skills.
- Bias for automation and continuous improvement.