Resiliency and Recovery Engineer - Tech Lead

Charlotte, NC, US • Posted 1 day ago • Updated 1 day ago
Contract Independent
Contract W2
On-site
Depends on Experience

Job Details

Skills

  • Resiliency
  • Recovery
  • SQL
  • MTTR
  • Jira
  • SLO
  • SLI

Summary

Job Title: Resiliency and Recovery Engineer - Tech Lead

Location: Charlotte, North Carolina (On-Site)

Job Description:

The Resiliency & Recovery Engineer (Contractor) is a senior, hands-on engineering role focused on improving production resiliency and recovery outcomes across critical services and payment rails. This role is responsible for driving measurable improvements such as faster recovery (reduced time to restore service), stronger and actionable alert coverage, increased automation to reduce manual toil, and safer releases with repeatable rollback/cutback readiness.

Responsibilities:

  • Work across all payment rails to develop faster, repeatable resiliency and recovery processes adopted broadly across the organization.
  • Identify resiliency gaps based on incident patterns and recurring failures; turn findings into prioritized remediation work.
  • Build or strengthen monitoring, alerting, and dashboards that are actually used by engineers and leadership.
  • Create runbooks and automate recovery actions to reduce manual toil and human error during incidents.
  • Improve release safety and rollback/fallback readiness with clear, repeatable cutback procedures.
  • Support SQL reliability efforts (SQL Server 2022 focus) in partnership with DB and infrastructure teams.
  • Own backlog, prioritization, design reviews, and cross-team coordination (Ops/Product/Tech). Run weekly stand-ups and prepare bi-weekly executive readouts.
  • Integrate resilience testing into CI/CD pipelines and DevOps workflows to catch issues early and ensure robust, automated releases.
  • Conduct chaos engineering experiments (failure injections, game days) to proactively uncover system weaknesses and validate recovery processes under real-world failure scenarios.
  • Document and share resiliency best practices; mentor and train engineering teams to foster a culture of reliability and continuous improvement.
  • Ensure seamless handoff of newly created resiliency and recovery practices to the permanent Engineering team through thorough documentation and knowledge transfer.

Must-Have Qualifications:

  • Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
  • Strong background in production resiliency and recovery, runbooks/playbooks, and root-cause analysis.
  • Experience with incident pattern analysis and MTTR baselining.
  • Senior-level observability expertise with dashboards, monitors, and alerts (Datadog preferred; similar tools considered).
  • Experience with Splunk, Datadog, SQL, JQL (Jira Query Language), and GitLab.
  • Deep CI/CD experience: pipeline design and operation, release safety patterns, rollback readiness, and generating code-quality and testing metrics.
  • Automation skills using Python and/or PowerShell for building repeatable recovery workflows and operational tooling.
  • Kubernetes/container platform troubleshooting (deployments, pods, config drift, safe restarts, production incident investigation).
  • Experience with identity, credential, certificate, and secret-rotation resilience.
  • Experience ensuring reliable batch/scheduler job execution and handling distributed integration failures (timeouts, retries, idempotency, duplicate prevention, reconciliation).

Nice-to-Have Qualifications:

  • SRE-style reliability practices (SLO/SLI, error budgets, operational metrics).
  • Failover / data-center flip / active-active or active-passive recovery concepts.
  • Cloud engineering with Azure or AWS.
  • DevOps tooling such as Jenkins, Terraform, SonarQube, and Helm Charts.
  • Network and traffic-management incident triage (load balancers, firewalls, VLAN changes, rapid isolation of app vs. infra vs. network issues).

Dice Id: 90891773 • Position Id: 8938055