Resiliency and Recovery Engineer

Charlotte, NC, US • Posted 16 hours ago • Updated 16 hours ago

Full Time

No Travel Required

On-site

Depends on Experience

Stanley David and Associates

Job Details

Skills

JIRA
SQL
Splunk

Summary

Role :: Resiliency and Recovery Engineer

Location :: Charlotte, NC

Type :: Full-time

Job Description
The Resiliency & Recovery Engineer (Contractor) is a senior, hands-on engineering role focused on improving production resiliency and recovery outcomes across critical services and payment rails. This role is responsible for driving measurable improvements such as faster recovery (reduced time to restore service), stronger and actionable alert coverage, increased automation to reduce manual toil, and safer releases with repeatable rollback/cutback readiness. The engineer will partner closely with application teams, DevOps, Infrastructure, Database teams, and operational stakeholders to identify resiliency gaps, prioritize remediation, and implement durable solutions that improve stability and reduce customer impact.

• Work across all MMC payment rails to develop faster, more repeatable resiliency and recovery processes that benefit every platform, ensuring these enhancements are adopted broadly across the organization rather than siloed on any single platform.
• Identify resiliency gaps based on incident patterns and recurring failures; turn findings into prioritized remediation work.
• Build/strengthen monitoring, alerting, and dashboards that are actually used by engineers and leadership.
• Create runbooks and automate recovery actions to reduce manual toil and human error during incidents.
• Improve release safety and rollback/fallback readiness (clear, repeatable cutback procedures).
• Support SQL reliability efforts (SQL Server 2022 focus) in partnership with DB/infrastructure teams.
• Own the backlog, prioritization, design reviews, and cross-team coordination (Ops/Product/Tech).
• Run the weekly standup and prepare the bi-weekly executive readout.
• Integrate resilience testing into CI/CD pipelines and DevOps workflows to catch issues early and ensure robust, automated releases.
• Conduct chaos engineering experiments (failure injections, game days) to proactively uncover system weaknesses and validate recovery processes under real-world failure scenarios.
• Document and share resiliency best practices; mentor and train engineering teams to foster a culture of reliability and continuous improvement across the organization.
• Ensure a seamless handoff of all newly created resiliency and recovery practices (once mature and repeatable) to the MMC Engineering team by thoroughly documenting the improvements and conducting knowledge transfer, so that the permanent team can sustain and build upon these enhancements after the contract period.

Must-Have Qualifications:
• Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
• Strong background in production resiliency and recovery (recovery execution, runbooks/playbooks, RCA mindset).
• Experience with incident pattern analysis, MTTR baselines (P2 Major/Minor), and recurring-failure taxonomy (by rail/service).
• Senior-level observability expertise: dashboards, monitors, and alerts (Datadog preferred; similar tools considered).
• Hands-on experience with Splunk, Datadog, SQL, JQL (Jira Query Language), and GitLab.
• Experience with CI/CD metrics and with generating executive reports on code quality, changes, and test automation from GitLab.
• Understanding of story quality, metrics, and monitoring; ability to gather data that showcases deficiencies.
• Senior CI/CD experience: pipeline design/operation, release safety patterns, and rollback readiness.
• Experience using metrics and monitoring data to identify and communicate deficiencies.
• Automation skills: Python and/or PowerShell (or equivalent) for building repeatable recovery workflows and operational tooling.
• Kubernetes/container platform production troubleshooting (deployments, pods, config drift, safe restarts, and “why did this change break prod” investigations).
• Experience with identity/credentials/certificate & secret-rotation resilience (preventing outages during password rotations, certificate upgrades, and secret propagation; implementing guardrails and monitoring for these events).
• Batch/scheduler/job-execution reliability (detecting/preventing silent job failures, validating multi-DC scenarios, and building controls to ensure scheduled processing does not impact customers).
• Distributed integration failure-handling (timeouts, retries, backpressure, idempotency, duplicate prevention, and reconciliation—especially across vendor/downstream dependencies).

Nice-to-have (differentiators)
• Experience with SRE-style reliability practices (SLO/SLI thinking, error budgets, operational metrics).
• Experience with failover / DC flip / active-active or active-passive recovery concepts and scenario-based runbooks.
• Cloud Engineering (Azure, AWS)
• DevOps tools expertise (Jenkins, Terraform, SonarQube, Helm charts)
• Network & traffic-management incident triage (load balancers/firewalls/VLAN changes, DC traffic flips, and rapid isolation of “app vs infra vs network” to stabilize service)

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 91097129
Position Id: 8933557

Company Info

About Stanley David and Associates

We strive to add value and work as a true partner with our clients.

Stanley David And Associates is a recruitment specialist in the area of IT and Engineering and we stay firmly in our area of expertise, doing what we love.

We know the players and the companies and invest a lot of time getting to know candidates and clients in equal measure. This ensures a swift, cost-effective and perfect placement, whether it's permanent or interim.

In addition, we have a reputation for having the best understanding of the market landscape and for sourcing great candidates.

-We have a global footprint with offices in 3 countries: USA, UK and India.

-SDNA Global has built up an incredible reputation within strategic IT hiring.

-We work with Tier 1 and Tier 2 IT outsourcing companies for leadership hiring needs in the UK, Europe, USA and Indian geos.

-Each SDNA member has over 5 years of experience in Talent Acquisition

-We have successfully closed roles in the UK, USA, Germany, Sweden, Dubai, France, Netherlands, Switzerland, Austria, Hungary, Spain, Italy, Norway, Denmark, Nigeria and South Africa

-Telecom, Media and Hi-tech

-Health care and Life Sciences

-Energy and Utilities

-CPG, Retail and Transport

-Banking and Financial Services
