Senior Site Reliability Engineer *** Direct end client ***

Depends on Experience

Contract: W2, Independent, Corp-To-Corp, 6 Month(s)

    Skills

    Automation, Change management, Change request management, Cloud, Computer engineering, DevOps, Google Cloud, IT management, Incident management, Infrastructure

    Job Description

    * Incident Management:
    - Delivering Incident Command for high-severity incidents
    - Running blameless postmortem reviews for high-severity incidents
    - Assisting in developing automated incident detection and response improvements
    * Operational Excellence:
    - Delivering data analysis (Incident Management, Change Management, Service Availability, etc.)
    - Creating regular reports and insights, and automating their production to reduce manual toil
    - Conducting Production Readiness Reviews for new services
    - Reviewing upcoming production change requests
    * Incident Management - Incident Command for high-severity incidents
    * Incident Management - Communications & Updates for high-severity incidents
    * Operational Excellence - Reporting and analytics (Incident Management, Change Management, Service Availability, etc.)

    - 7+ years of experience in a web-centric Linux production environment, in a NOC or DevOps role within a continuous release environment

    -  Experience in running critical incidents from a technical leadership position

    -  Experience with Computer Engineering with a focus on Infrastructure, Platform, and Application (Cloud, Containerization, Container orchestration, Network, Application Reliability, Database Architecture) and an understanding of full stack and the SDLC (Software Development Life Cycle)

    - Experience running and monitoring applications at scale, using metrics and tracing tools like Prometheus, Influx, Grafana, New Relic, Datadog, Stackdriver, Zipkin, etc.

    -  Professional experience with Python, Go, or similar programming languages

    - Experience developing production quality tooling

    -  Familiarity with SRE methodologies; passionate about solving operational challenges by using automation and software

    - Ability to communicate effectively, both vertically and horizontally within the organization, with demonstrated written and verbal communication skills

    - Scala, TypeScript, JS, Java, C++

    -  The team also develops automation and AI capabilities to ensure minimum toil across the engineering organization

    - Lead essential incidents in our environment with a focus on troubleshooting and fast restoration of our essential services

    - Provide insights into trends in issues affecting reliability, and partner on cross-functional projects to provide scalable solutions

    - Review high-risk platform changes to minimize impact to the site

    -  Work within a large distributed system based on Kubernetes and Google Cloud services

    - Maintain an automation-centric vision and incorporate SRE methodologies to increase reliability and decrease toil

    -  Participate in technical design and architecture decisions and contribute to technical troubleshooting in various parts of the system