Job Details
Key Responsibilities
Chaos Testing and Experimentation: Design and execute chaos engineering experiments to identify weaknesses in systems and improve resilience.
System Analysis: Analyze system behavior under stress conditions and develop strategies to mitigate potential failures.
Performance Monitoring: Continuously monitor system performance, identify vulnerabilities, and implement improvements to enhance resilience.
Collaboration and Training: Work with cross-functional teams to understand system requirements and provide training on chaos engineering principles and best practices.
Documentation: Develop and maintain comprehensive documentation on chaos experiments, findings, and mitigation strategies.
Experience
5+ years of experience in software engineering or site reliability engineering, with a focus on chaos engineering and resilience testing.
Technical Skills
Proficiency in chaos engineering tools (e.g., Chaos Monkey, Gremlin) and scripting languages (e.g., Python, Bash).
Experience with cloud platforms (e.g., AWS, Azure) and container orchestration (e.g., Kubernetes).
Knowledge of distributed systems and microservices architecture.
Familiarity with monitoring and observability tools (e.g., Prometheus, Grafana).
Experience with incident response and root cause analysis.
Desired Skills
Strong analytical and problem-solving abilities.
Excellent communication and collaboration skills for working with both technical and non-technical stakeholders.
Ability to mentor and train junior engineers on chaos engineering practices.
Commitment to continuous learning and to staying current with the latest chaos engineering techniques and tools.
Familiarity with CI/CD and infrastructure automation tools (e.g., Jenkins, Ansible, Terraform).