Resiliency & Chaos Test Engineer

Plano, TX, US • Posted 20 hours ago • Updated 1 hour ago
Contract W2
12 Months
On-site
Depends on Experience

Job Details

Skills

  • Journey Resiliency Testing
  • Microservices Resilience Assessment
  • Cloud Platform Resiliency
  • AWS cloud
  • Site Reliability Engineering
  • Resiliency Engineering
  • Chaos Engineering
  • Infrastructure Engineering
  • Production Operations
  • Chaos Engineer
  • Resiliency

Summary

Role: Resiliency & Chaos Test Engineer
Location: Plano, TX
Experience: 8+ Years
Visa: Any (W2 only)
Interview: Face-to-face (F2F)
Position Overview
We are seeking a Senior Resiliency & Chaos Engineer to join the Cloud Operations and Service Engineering organization. This role is responsible for validating the resiliency of critical enterprise applications and services through end-to-end chaos testing, failure injection, disaster recovery validation, and multi-region resilience engineering.
The successful candidate will lead resiliency testing efforts across complex distributed systems consisting of 25-30+ microservices and enterprise applications supporting critical customer journeys. The role requires deep expertise in application, infrastructure, network, and database resiliency with a strong focus on identifying failure scenarios, documenting mitigation strategies, and ensuring business continuity.
Key Responsibilities
  • Design and execute enterprise resiliency and chaos engineering strategies across distributed systems.
  • Perform failure injection testing across the application, infrastructure, network, and database layers.
  • Analyze end-to-end customer journeys and identify resiliency risks across business-critical workflows.
  • Create resiliency test scenarios for upstream and downstream dependencies.
  • Develop chaos engineering test plans covering service outages, network failures, latency injection, database failures, region outages, and infrastructure failures.
  • Validate disaster recovery and failover capabilities across multi-region deployments.
  • Test primary-to-secondary and secondary-to-primary failover mechanisms.
  • Conduct latency triage and performance impact analysis.
  • Work closely with business application teams to understand business workflows and critical operational dependencies.
  • Create resiliency documentation, implementation standards, runbooks, and testing frameworks.
  • Evaluate recovery objectives and service-level expectations across enterprise systems.
  • Support resiliency initiatives within cloud operations, service engineering, and infrastructure organizations.
  • Drive continuous improvement of resilience engineering practices across the organization.
  • Present findings and recommendations to engineering leadership and stakeholders.
Required Qualifications
  • 8+ years of experience in Site Reliability Engineering (SRE), Resiliency Engineering, Chaos Engineering, Infrastructure Engineering, or Production Operations.
  • Strong expertise in distributed systems architecture and microservices environments.
  • Hands-on experience with chaos engineering and resiliency testing frameworks.
  • Experience conducting failure injection testing and disaster recovery validation.
  • Strong understanding of multi-region architectures, high-availability systems, fault tolerance patterns, and business continuity planning.
  • Experience with AWS cloud environments.
  • Strong knowledge of networking concepts: latency, routing, load balancing, and DNS failover.
  • Experience supporting enterprise-scale Java and Python application environments.
  • Understanding of database resiliency concepts, including replication, backup, and recovery.
  • Experience documenting resiliency strategies and implementation plans.
  • Excellent analytical and troubleshooting skills.
Key Focus Areas
  • End-to-End User Journey Resiliency Testing
  • Chaos Engineering & Failure Injection
  • Multi-Region Disaster Recovery Validation
  • Microservices Resilience Assessment
  • Network & Latency Testing
  • Database Failover Testing
  • Infrastructure Resilience Engineering
  • Business Continuity & Recovery Validation
  • Service Reliability Engineering
  • Cloud Platform Resiliency
 
  • Dice Id: 10330068
  • Position Id: 9009215
Contact the job poster
Prashant Mishra

Recruiter @ Maintec Technologies Inc