Position: Site Reliability Engineering (SRE) Lead
Location: Fort Mill, SC (hybrid, 2-3 days onsite per week)
Only senior candidates with lead experience will be considered.
Summary
A senior technical leader responsible for owning reliability strategy, leading an SRE team, and ensuring the operational health, scalability, and availability of services. Combines hands-on engineering, automation, and people leadership to drive reliability across the organization.
Core responsibilities
Strategy & process
Define SRE strategy, process frameworks, standards, and best practices.
Establish SLIs, SLOs, and error budget policies; embed reliability into the SDLC.
Promote a culture of service ownership and maintain strong cross-team feedback loops.
Reliability & capacity
Oversee monitoring and maintenance to meet SLAs and uptime targets.
Drive capacity planning and forecasting to ensure performance at scale.
Use data and metrics to prioritize reliability investments and tradeoffs.
Automation & tooling
Lead automation efforts to eliminate operational toil and streamline runbooks.
Oversee Infrastructure as Code practices (for example Terraform, CloudFormation) and configuration management.
Improve CI/CD pipelines to enable safer, faster releases.
Incident & change management
Lead incident response and communications during outages.
Conduct blameless postmortems and ensure corrective actions are executed.
Govern change control to ensure safe, tested production deployments.
Collaboration & communication
Partner with engineering, architecture, and product teams to bake reliability into designs and roadmaps.
Translate technical issues and tradeoffs for technical and nontechnical stakeholders.
Team leadership
Hire, mentor, and develop SRE engineers; set team goals and a roadmap.
Lead calmly and effectively under pressure during critical incidents and drive customer-focused decisions.
Qualifications & skills
Technical
Proven SRE/DevOps/infrastructure experience (typically 6+ years) with leadership experience (about 2-3 years).
Strong cloud experience (AWS preferred), containerization (Docker), and orchestration (Kubernetes).
Expertise with IaC and automation tools (Terraform, CloudFormation, Ansible, or similar).
Proficient in scripting and programming for automation (Python, Bash, or similar).
Deep experience with monitoring and observability tooling (Prometheus, Grafana, ELK Stack, Splunk, Datadog, etc.).
Leadership & soft skills
Strong people leadership and coaching skills with proven stakeholder communication.
Excellent problem-solving, analytical thinking, and adaptability.
Strategic mindset balancing engineering excellence with business priorities.
Deliverables
A measurable reliability roadmap aligned to business goals.
Reduced operational toil through automation and improved runbooks.
Clear SLIs and SLOs, with established error budget governance.
A high-performing SRE team with documented processes for incident and change management.