Job Details
Role Name: Site Reliability Engineer - Lead
Location: Cincinnati, OH (Remote) - W2 only
Role Description:
- As a Site Reliability Engineer - Lead, you will drive the reliability, scalability, and performance of mission-critical systems and services while leading a team of SREs.
- This role combines deep technical expertise with leadership, mentoring, and strategic planning.
- You will set standards for operational excellence, guide incident response, and foster a culture of automation and continuous improvement.
- Collaboration with engineering, operations, and product teams is essential to align reliability initiatives with business objectives and ensure seamless service delivery.
Required Skills:
- Proven experience in site reliability, DevOps, or systems engineering, with prior leadership or team lead responsibilities
- Strong programming/scripting skills (e.g., Python, Go, Bash, or similar)
- Deep expertise in Linux/Unix system administration and networking
- Experience architecting and operating cloud platforms (AWS, Azure, Google Cloud Platform)
- Proficiency with infrastructure-as-code and automation tools (e.g., Terraform, Ansible, CloudFormation)
- Advanced knowledge of monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK, Datadog)
- Demonstrated incident management and root cause analysis skills
- Experience designing and implementing CI/CD pipelines
- Strong understanding of containerization and orchestration (Docker, Kubernetes)
- Ability to define and enforce reliability, scalability, and security best practices
- Excellent communication, stakeholder management, and collaboration skills
- Experience mentoring, coaching, and developing SRE or engineering teams
- Strong hands-on knowledge of defining business process dashboards in APM tools such as Dynatrace, including SLA, SLO, and SLI definition, design, and implementation as part of observability
- Experience with devices such as scanners, POS devices, and peripheral devices (including devices with on-device memory)
- Experience with hardcoded device protocols and software, including the ability to decode and run them and help integrate them with other modules
- Experience with edge computing, Google Distributed Cloud, and hybrid cloud environments
- Experience leading SRE teams in high-growth or regulated environments
- Advanced database administration and optimization skills (both SQL, e.g., MySQL, and NoSQL, e.g., MongoDB)
Key Responsibilities:
- Team Leadership & Development:
  - Bring technical expertise and hands-on experience, with the ability to lead the development team
  - Mentor team members and guide them on the right approach to SRE-related work
  - Foster a culture of operational excellence, automation, and continuous learning
  - Conduct regular team meetings, 1:1s, and performance reviews
- Reliability Strategy & Architecture:
  - Define and implement reliability, scalability, and performance strategies for critical systems
  - Set standards for monitoring, alerting, and incident response
  - Guide architectural decisions to ensure robust, resilient infrastructure
- Incident & Problem Management:
  - Oversee incident response, root cause analysis, and post-mortem processes
  - Coordinate with cross-functional teams to resolve complex issues and prevent recurrence
  - Drive improvements based on incident learnings
- Process Improvement & Automation:
  - Identify and eliminate manual operational tasks through automation
  - Optimize CI/CD pipelines and deployment processes
  - Continuously enhance system reliability and efficiency
- Stakeholder Collaboration:
  - Partner with engineering, operations, and product teams to align reliability goals with business objectives
  - Communicate reliability metrics, risks, and progress to leadership and stakeholders
- Security & Compliance:
  - Ensure infrastructure and processes adhere to security best practices and compliance requirements
  - Apply experience with chaos and resilience engineering