Disaster Recovery Engineer

Overview

On Site
USD 70.00 - 72.00 per hour
Full Time

Skills

Computer Networking
Orchestration
Sockets
Routing
IBM WebSphere
Scalability
Database
UI
RPO
Switches
Inventory
Microsoft Exchange
Data Warehouse
Replication
Nexus
Workflow
Dashboard
Auditing
Regulatory Compliance
Reporting
Quality Assurance
Testing
Microsoft Windows
Recovery
Continuous Improvement
Java
Software Architecture
TCP/IP
Socket Programming
Red Hat Linux
Virtual Machines
Design Patterns
Apache Kafka
IBM DB2
Oracle
Mainframe
Data Integration
Splunk
ServiceNow
CA Workload Automation AE
Leadership
Analytical Skill
Documentation
Communication
Finance
Authorization
IBM WebSphere Application Server
PCI DSS
Encryption
Management
Reliability Engineering
Computer Science
Linux
Python
Kubernetes
System Integration Testing
Ansible
Puppet
Progress Chef
Failover
DNS
Dragon NaturallySpeaking
Load Balancing
Disaster Recovery
High Availability
Cloud Computing
Amazon Web Services
Microsoft Azure
Google Cloud Platform
Google Cloud
Jenkins
GitHub
Payments
Network
Taxes
Life Insurance
Partnership
Collaboration
Business Transformation
Law

Job Details

Description

***Sit out of Chicago, IL

We are seeking a Lead Engineer to lead the resiliency, disaster recovery (DR), and operational continuity efforts for our mission-critical Domestic Transaction Switching Application and Diners Club International (DCI) Switch. This role requires deep technical expertise in Java, Linux, networking, and distributed systems, combined with strategic program delivery skills to coordinate multiple infrastructure, development, and operations teams. The ideal candidate has hands-on experience managing Active/Active multi-data-center architectures, low-level TCP/IP integrations, and DR orchestration within regulated financial environments.

Core Responsibilities

Application & Infrastructure Oversight
Oversee the Domestic Transaction Switching Application, a Java-based platform running on VMs hosted in Nutanix clusters with Red Hat Linux.
Manage all low-level TCP/IP socket communications, including connectors, listeners, and transaction routing logic.
Coordinate with teams supporting the Diners Club International Switch, a WebSphere application, ensuring interoperability and fault-tolerant communication between domestic and international payment networks.
Ensure high availability, scalability, and compliance of multi-data-center deployments through Active/Active/Active architecture review and validation.

Disaster Recovery (DR) Strategy & Analysis
Own the end-to-end DR planning, testing, and documentation as outlined in Milestone 5.1 of the detailed DR plan.
Evaluate the impact of DR events across configuration data sources, including 30+ read-only configuration files (IIN ranges, currency codes, merchant category codes, etc.) loaded into in-memory caches.
Assess external dependencies such as DB2 Global Database, mainframe negative files and account-level processing files, and the Oracle UI used by operations to manage client connections and routes.
Perform criticality analysis to classify configuration dependencies (blockers, critical, non-critical) and design mitigation strategies for stale or unavailable data sources.
Define recovery point objectives (RPO) and recovery time objectives (RTO) for all dependent systems.

Active/Active Architecture Validation
Review and strengthen the Active/Active/Active data-center strategy for the Hydra Switching Application.
Identify and document exceptions, such as low-volume participants operating in Active/Passive mode, and assess potential transaction impact during site failover; inventory and track remediation plans.
Analyze inter-data-center dependencies, including the dynamic key exchange (DKE) process requiring three-way acknowledgment for encryption key rotation.
Document functional areas that degrade or fail during partial data-center outages and propose operational mitigations.

Transaction Extracts & Event Processing
Oversee downstream batch transaction extracts distributed to Data Warehouse, Settlement Systems, WorldPay, and regional datastores (e.g., India).
Verify Kafka Enterprise Event Bus integrity during DR events, ensuring Active/Active message replication and recovery consistency (trust but verify).
Analyze downstream dependencies to validate continuity for all transaction, settlement, and compliance feeds.

Control Plane & Platform Dependencies
Assess DR implications for control-plane components (Jenkins, GitHub, Nexus, Vault, Protegrity, Okta, etc.), which operate in Active/Passive configurations.
Coordinate with enterprise platform teams to balance scope and minimize global outage risk during DR testing.
Contribute to the Enterprise DR Playbook to define which components are within or excluded from DR scope.

Monitoring, Runbooks & Evidence Capture
Maintain comprehensive monitoring coverage using Splunk (functional transaction view) and DataDog (infrastructure health).
Develop runbooks and implementation plan templates integrating ServiceNow, Jenkins, and Autosys workflows for deployment, validation, and rollback.
Standardize evidence capture processes using Splunk dashboards, system logs, and console screenshots for audit and compliance reporting.

Non-Production Test Environments & Simulation
Design and coordinate a production-like QA/Dev environment for full DR simulation testing across all dependent components.
Execute controlled DR test events, emulating change windows and data-center failovers: place impacted data center into down state, freeze configuration and batch jobs, redirect traffic and validate health on remaining sites, bring passive site online and validate configuration/job recovery.
Document lessons learned and integrate continuous improvement into DR planning.

Required Skills & Experience
Strong background in Java application architecture and TCP/IP socket programming.
Expertise with Linux (Red Hat), VM environments, and Nutanix infrastructure.
Knowledge of multi-data-center Active/Active design patterns and high-availability systems.
Familiarity with Kafka, DB2, Oracle, and mainframe data integration.
Hands-on experience with Splunk, DataDog, Jenkins, ServiceNow, and Autosys.
Proven ability to lead technical DR exercises, coordinate multi-team execution, and present results to leadership.
Excellent analytical, documentation, and stakeholder-communication skills.

Preferred Qualifications
Experience in financial transaction processing, payment network systems, or card authorization platforms.
Familiarity with WebSphere Application Server, PCI DSS, and encryption key management (DKE) processes.
Experience developing or managing Active/Passive control-plane components in enterprise environments.
Knowledge of site reliability engineering (SRE) principles and observability best practices.
Bachelor's or Master's degree in Computer Science, Engineering, or related technical discipline.
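
For candidates who want a concrete picture of the controlled DR test event described above (down state, freeze, redirect and validate, bring the passive site online), the following is a minimal illustrative sketch only. It is not the team's actual runbook or tooling: the names `Step`, `DrTestRun`, `build_failover_test`, and the placeholder `always_ok` check are all hypothetical, and a real step would call out to the orchestration and monitoring systems rather than return a constant.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One runbook step; `action` returns True when its validation passes."""
    name: str
    action: Callable[[], bool]


@dataclass
class DrTestRun:
    """Runs steps in order and records a PASS/FAIL evidence line per step."""
    steps: List[Step]
    evidence: List[str] = field(default_factory=list)

    def execute(self) -> bool:
        for number, step in enumerate(self.steps, start=1):
            ok = step.action()
            self.evidence.append(f"{number}. {step.name}: {'PASS' if ok else 'FAIL'}")
            if not ok:
                # Stop at the first failed step so operators can roll back.
                return False
        return True


def always_ok() -> bool:
    # Hypothetical placeholder for a real health or state check.
    return True


def build_failover_test(site: str) -> DrTestRun:
    """Assemble the four-step sequence from the posting for one data center."""
    return DrTestRun(steps=[
        Step(f"Place {site} into down state", always_ok),
        Step("Freeze configuration and batch jobs", always_ok),
        Step("Redirect traffic and validate health on remaining sites", always_ok),
        Step("Bring passive site online and validate configuration/job recovery", always_ok),
    ])


if __name__ == "__main__":
    run = build_failover_test("dc-a")  # "dc-a" is an invented site name
    run.execute()
    for line in run.evidence:
        print(line)
```

Recording one evidence line per step mirrors the evidence-capture responsibility listed earlier; halting on the first failure is a design assumption made for this sketch.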

Skills

Linux, cloud, Python, AWS, Kubernetes, disaster recovery, automation, engineering

Top Skills Details

Linux, cloud, Python, AWS, Kubernetes

Additional Skills & Qualifications

***Sit hybrid in Chicago

High Level

Regions / Failovers
o Ansible / Puppet / Chef
    Copy bare metal so it plays better rather than using Cloud
o AWS OpenShift
    Automate it all
o DNS
    Non-DNS failover strategy
    DNS is okay for Blue/Green deployment
    Should be able to explain why & how
o Load-balancing
o Monitoring
    Use current AWS issues as scenario

High availability
o Disaster recovery is a property of high availability

Cloud
o AWS / Azure / Google Cloud Platform

Pipelines
o Jenkins, GitHub

This role sits at the core of our global payments infrastructure, ensuring that billions of transactions continue to process securely and reliably, even during adverse events. The successful candidate will help shape the resiliency architecture, automation, and DR strategy that safeguard customer trust and institutional stability across our network.

Experience Level

Expert Level

Pay and Benefits
The pay range for this position is $70.00 - $72.00/hr.
Eligibility requirements apply to some benefits and may depend on your job
classification and length of employment. Benefits are subject to change and may be
subject to specific elections, plan, or program terms. If eligible, the benefits
available for this temporary role may include the following:
Medical, dental & vision
Critical Illness, Accident, and Hospital
401(k) Retirement Plan - Pre-tax and Roth post-tax contributions available
Life Insurance (Voluntary Life & AD&D for the employee and dependents)
Short and long-term disability
Health Spending Account (HSA)
Transportation benefits
Employee Assistance Program
Time Off/Leave (PTO, Vacation or Sick Leave)
Workplace Type
This is a hybrid position in Riverwoods, IL.
Application Deadline
This position is anticipated to close on Oct 27, 2025.
About TEKsystems:
We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.

The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.

About TEKsystems and TEKsystems Global Services

We're a leading provider of business and technology services. We accelerate business transformation for our customers. Our expertise in strategy, design, execution and operations unlocks business value through a range of solutions. We're a team of 80,000 strong, working with over 6,000 customers, including 80% of the Fortune 500 across North America, Europe and Asia, who partner with us for our scale, full-stack capabilities and speed. We're strategic thinkers, hands-on collaborators, helping customers capitalize on change and master the momentum of technology. We're building tomorrow by delivering business outcomes and making positive impacts in our global communities. TEKsystems and TEKsystems Global Services are Allegis Group companies. Learn more at TEKsystems.com.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
