Site Reliability - Resiliency (Chaos) Engineering Architect

Site Reliability, Chaos, Resiliency, Gremlin
Contract W2, Contract Independent
Depends on Experience
Work from home available

Job Description

Role – Resiliency Test Engineering Architect

Alternate Titles – SRE Architect, Chaos Engineering Architect

Location – Remote (anywhere in the US)


Essential Functions:

  • Lead core activities in setting up a Resiliency Testing COE at the enterprise level.
  • Develop the roadmap, policies, procedures, framework, reference architectures, and resiliency services (Test Scorecard, Failure Mode Analysis, Test Scenarios, etc.) for resiliency testing and chaos engineering.
  • Develop Site Reliability Engineering practices.
  • Work closely with all stakeholder groups (App Dev teams, IT Infra, etc.) to ensure end-to-end application resiliency while upholding ETE policy, procedures, and standards.
  • Improve and set the direction for the resiliency test automation framework, and publish reusable artifacts to the Developer Marketplace.
  • Capture technical requirements, assess capabilities, and map them to organizational resiliency principles to determine the resiliency characteristics of applications.
  • Contribute to strategy discussions and decisions on overall application design and the best approach for implementing cloud and on-premises solutions.
  • Focus on continuous improvement practices as needed to meet system resiliency imperatives.
  • Define high availability and resilience standards and guidelines for adopting technologies from AWS and other service providers.


Qualifications:


  • Minimum of 12 years of total experience
  • Minimum of 5 years of Site Reliability Engineering experience
  • Minimum of 2 years of experience as a Chaos Engineering Architect
  • Expertise in industry patterns, chaos engineering methodologies, and techniques across the disaster recovery subject areas
  • Specialist in highly available architecture and solution implementation
  • Experience in Enterprise IT Infrastructure and Solution Architecture
  • Chaos engineering / resiliency testing experience for distributed applications using Gremlin or similar tools
  • Experience designing and implementing CI/CD tooling (Git, Maven / Gradle, Jenkins, Bamboo, etc.)
  • Demonstrated experience in web services, REST APIs, and JavaScript
  • Hands-on work experience with at least one public cloud: AWS, Google Cloud Platform, or Azure
  • Proven knowledge of containerization and container orchestration solutions (Docker and Kubernetes)
  • Hands-on work experience with configuration management tools (Chef / Puppet / Ansible / SaltStack / Terraform)
  • Financial domain experience is an added advantage


Tools & Technologies

Gremlin, Dynatrace / AppDynamics, HP LoadRunner, HP BSM, Kubernetes, Elastic ECE, Jaeger, Splunk, SolarWinds (for VMs), Prometheus Enterprise Edition (for RHOS), Hygieia (DevOps), and Grafana (monitoring)

Dice Id : 10124918
Position Id : 7047419
Originally Posted : 6 months ago
