12-month contract; may extend
Onsite
In addition to technical skills, candidates must have strong communication skills.
At this time we are not planning to use a Glider test (this could change), but a technical screen will be conducted during the interview.
Has a deep and broad understanding of technology and develops creative solutions for the resiliency, reliability, and security of applications.
Has a strong ability to anticipate product failure and automate solutions for failure modes and recovery.
Leads as an Incident Commander to coordinate failure recovery for large or complex systems.
Leads the resolution of complex, division-wide problems in resiliency, reliability, and security by identifying and organizing the necessary resources. Acts as a technical advisor.
Leads the development of Service Level Objectives across multiple parts of the system.
Evaluates and implements enhancement design solutions to improve cost, quality, performance, and security of software applications.
Evaluates and implements enhancement design solutions for gathering metrics on the cost, quality, performance, and security of software applications.
Develops and maintains knowledge of industry and technical innovations in the technical discipline.
Executes necessary support and playbook documentation as directed or as needed.
Collaborates with other site reliability engineers and product team members to ensure that features meet business needs.
Leads problem postmortems to understand issues, learn from them, and share any changes in automation, recovery, and documentation with other site reliability engineers.
The environment uses AWS Cloud Services, Kubernetes, Datadog, and Terraform. Coding languages and frameworks may include Java, Java (Scala), JavaScript, .NET, Go, Python, and others.