Skills
Job Description
* Incident Management:
- Delivering Incident Command for high-severity incidents
- Running blameless postmortem reviews for high-severity incidents
- Assisting in developing automated incident detection and response improvements
* Operational Excellence:
- Delivering data analysis (Incident Management, Change Management, Service Availability, etc.)
- Creating regular reports and insights, and automating their production to reduce manual toil
- Conducting Production Readiness Reviews for new services
- Reviewing upcoming production change requests
* Incident Management - Incident Command for high-severity incidents
* Incident Management - Communications & Updates for high-severity incidents
* Operational Excellence - Reporting and analytics (Incident Management, Change Management, Service Availability, etc.)
- 7+ years of experience in a web-centric Linux production environment, in a NOC or DevOps role within a continuous-release environment
- Experience in running critical incidents from a technical leadership position
- Computer Engineering experience focused on Infrastructure, Platform, and Application (Cloud, Containerization, Container Orchestration, Network, Application Reliability, Database Architecture), plus an understanding of the full stack and the SDLC (Software Development Life Cycle)
- Experience running and monitoring applications at scale, using metrics and tracing tools like Prometheus, Influx, Grafana, New Relic, Datadog, Stackdriver, Zipkin, etc.
- Professional experience with Python, Go, or similar programming languages
- Experience developing production-quality tooling
- Familiarity with SRE methodologies; passionate about solving operational challenges by using automation and software
- Strong written and verbal communication skills, with the ability to communicate effectively across teams and at all levels of the organization
- Other relevant languages: Scala, TypeScript, JS, Java, C++
- The team also develops automation and AI capabilities to minimize toil across the engineering organization
- Lead essential incidents in our environment with a focus on troubleshooting and fast restoration of our essential services
- Provide insights into trends in issues affecting reliability, and partner on cross-functional projects to deliver scalable solutions
- Review high-risk platform changes to minimize impact to the site
- Work within a large distributed system based on Kubernetes and Google Cloud services
- Maintain an automation-centric vision and incorporate SRE methodologies to increase reliability and decrease toil
- Participate in technical design and architecture decisions and contribute to technical troubleshooting in various parts of the system