Job Title: Site Reliability Engineering (SRE) Lead
Location: Phoenix, AZ
Duration: Long-Term Contract
We are seeking an experienced Site Reliability Engineering (SRE) Lead to design, build, and evolve highly available, scalable, and secure payment platforms. The role requires strong expertise across AWS cloud, enterprise middleware (IBM WebSphere, IBM MQ), modern application stacks, observability, and DevOps, along with a deep understanding of payments domain systems.
You will define SRE strategy and reliability architecture and drive operational excellence while collaborating closely with application, infrastructure, security, and business teams.
Key Responsibilities
Reliability & Architecture
- Design and architect highly resilient, fault-tolerant payment systems that meet high-throughput, low-latency SLAs.
- Define SRE principles, including SLOs, SLIs, error budgets, and reliability KPIs for mission-critical payment services.
- Lead architecture decisions for cloud-native, hybrid, and legacy systems, including IBM WebSphere-based platforms.
- Drive active-active, disaster recovery (DR), and high availability (HA) strategies across AWS and on-prem integrations.
Cloud & Platform Engineering
- Architect and operate workloads on AWS (EC2, EKS/ECS, RDS, S3, IAM, VPC, CloudWatch).
- Optimize infrastructure for scalability, availability, security, and cost efficiency.
- Guide containerization and orchestration strategies where applicable.
Application & Middleware Expertise
- Partner with development teams on microservices built with Java and Spring Boot.
- Support the performance and reliability of front-end platforms built with React and Angular.
- Architect and operate messaging platforms using Kafka and IBM MQ.
- Manage enterprise middleware including IBM WebSphere Application Server.
DevOps & Automation
- Build and maintain CI/CD pipelines using Jenkins.
- Automate infrastructure provisioning, deployments, monitoring, and recovery processes.
- Promote Infrastructure as Code (IaC) and immutable infrastructure best practices.
- Champion DevOps and SRE culture across engineering teams.
Observability & Operations
- Design and standardize monitoring, logging, and alerting using:
  - Splunk
  - AWS CloudWatch
  - Datadog
- Implement proactive monitoring and advanced alerting for payment flows.
- Lead incident response, root cause analysis (RCA), and post-incident reviews.
- Drive down MTTR and the number of recurring incidents.
Database & Data Layer
- Architect and support PostgreSQL and Oracle databases with a focus on:
  - High availability
  - Performance tuning
  - Backup, restore, and disaster recovery
Payments Domain Leadership
- Provide reliability leadership for payment processing systems (authorization, capture, settlement, reconciliation).
- Ensure compliance with PCI DSS and other security and regulatory standards relevant to payments.
- Understand dependencies across gateways, processors, fraud, and downstream systems.
Leadership & Collaboration
- Act as technical lead/architect for SRE initiatives.
- Mentor SREs and engineers; guide best practices and standards.
- Work closely with product, architecture, security, and operations teams.
- Influence executive stakeholders on reliability, risk, and scalability decisions.