Role :: Resiliency and Recovery Engineer
Location :: Charlotte, NC
Type :: Full-time
Job Description
The Resiliency & Recovery Engineer (Contractor) is a senior, hands-on engineering role focused on improving production resiliency and recovery outcomes across critical services and payment rails. This role is responsible for driving measurable improvements such as faster recovery (reduced time to restore service), stronger and actionable alert coverage, increased automation to reduce manual toil, and safer releases with repeatable rollback/cutback readiness. The engineer will partner closely with application teams, DevOps, Infrastructure, Database teams, and operational stakeholders to identify resiliency gaps, prioritize remediation, and implement durable solutions that improve stability and reduce customer impact.
• Work across all MMC payment rails to develop faster, more repeatable resiliency and recovery processes that benefit every platform, ensuring these enhancements are adopted broadly across the organization rather than siloed on any single platform.
• Identify resiliency gaps based on incident patterns and recurring failures; turn findings into prioritized remediation work.
• Build/strengthen monitoring, alerting, and dashboards that are actually used by engineers and leadership.
• Create runbooks and automate recovery actions to reduce manual toil and human error during incidents.
• Improve release safety and rollback/fallback readiness (clear, repeatable cutback procedures).
• Support SQL reliability efforts (SQL Server 2022 focus) in partnership with DB/infrastructure teams.
• Own the backlog, prioritization, design reviews, and cross-team coordination (Ops/Product/Tech).
• Run the weekly standup and prepare the bi-weekly executive readout.
• Integrate resilience testing into CI/CD pipelines and DevOps workflows to catch issues early and ensure robust, automated releases.
• Conduct chaos engineering experiments (failure injections, game days) to proactively uncover system weaknesses and validate recovery processes under real-world failure scenarios.
• Document and share resiliency best practices; mentor and train engineering teams to foster a culture of reliability and continuous improvement across the organization.
• Ensure a seamless handoff of all newly created resiliency and recovery practices (once mature and repeatable) to the MMC Engineering team by thoroughly documenting the improvements and conducting knowledge transfer, so that the permanent team can sustain and build upon these enhancements after the contract period.
Must-Have Qualifications:
• Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
• Strong background in production resiliency and recovery (recovery execution, runbooks/playbooks, RCA mindset).
• Experience with incident pattern analysis, MTTR baselines (P2 Major/Minor), and recurring-failure taxonomy (by rail/service).
• Senior-level observability expertise: dashboards, monitors, and alerts (Datadog preferred; similar tools considered).
• Splunk, Datadog, SQL, JQL (Jira Query Language), and GitLab.
• Experience with CI/CD metrics and with generating executive reports on code quality, changes, and test automation from GitLab.
• Understanding of story quality, metrics, and monitoring; able to gather data that showcases deficiencies.
• Senior CI/CD experience: pipeline design/operation, release safety patterns, and rollback readiness.
• Experience using metrics and monitoring data to identify and communicate deficiencies.
• Automation skills: Python and/or PowerShell (or equivalent) for building repeatable recovery workflows and operational tooling.
• Kubernetes/container platform production troubleshooting (deployments, pods, config drift, safe restarts, and “why did this change break prod” investigations).
• Experience with identity/credentials/certificate & secret-rotation resilience (preventing outages during password rotations, certificate upgrades, and secret propagation; implementing guardrails and monitoring for these events).
• Batch/scheduler/job-execution reliability (detecting/preventing silent job failures, validating multi-DC scenarios, and building controls to ensure scheduled processing does not impact customers).
• Distributed integration failure-handling (timeouts, retries, backpressure, idempotency, duplicate prevention, and reconciliation—especially across vendor/downstream dependencies).
Nice-to-Have (Differentiators):
• Experience with SRE-style reliability practices (SLO/SLI thinking, error budgets, operational metrics).
• Experience with failover / DC flip / active-active or active-passive recovery concepts and scenario-based runbooks.
• Cloud Engineering (Azure, AWS)
• DevOps tools expertise (Jenkins, Terraform, SonarQube, Helm charts).
• Network & traffic-management incident triage (load balancers/firewalls/VLAN changes, DC traffic flips, and rapid isolation of “app vs infra vs network” to stabilize service).