This is not a traditional support role. The ideal candidate will partner closely with Product, Software Engineering, SRE, and Platform teams to diagnose complex production challenges, improve system reliability, and transform operational insights into permanent platform improvements.
You will play a critical role in ensuring the reliability, scalability, observability, and operational excellence of systems that support embedded finance, payments, and banking services.
Key Responsibilities
Production Engineering & Incident Management
- Lead complex production triage and incident response across APIs, payment workflows, distributed systems, infrastructure, and data platforms.
- Diagnose and resolve critical production issues affecting applications, services, integrations, and transaction processing systems.
- Perform deep root cause analysis across application code, cloud infrastructure, databases, and third-party dependencies.
- Drive incident resolution activities while maintaining composure and technical leadership during high-severity events.
Reliability & Platform Improvement
- Partner with engineering teams to convert production issues into long-term platform improvements.
- Improve platform reliability through automation, engineering enhancements, architectural improvements, and system hardening.
- Design and implement monitoring, alerting, and observability strategies to improve system visibility and operational awareness.
- Enhance system resiliency, fault tolerance, and recovery capabilities.
Systems & Technology Leadership
- Work across a modern technology stack including:
  - Ruby on Rails
  - Java
  - AWS Cloud Services
  - APIs & Microservices
  - Distributed Systems
  - SQL Databases
- Build and improve operational tooling, automation frameworks, diagnostic workflows, and runbooks.
- Participate in architecture and design discussions to ensure systems are built with reliability and observability from the start.
- Mentor engineers and promote best practices across Engineering, SRE, and Operations teams.
Required Skills & Qualifications
Technical Expertise
- 8+ years of professional experience in:
  - Software Engineering
  - Production Engineering
  - Site Reliability Engineering (SRE)
  - Platform Engineering
  - Distributed Systems Engineering
Must-Have Skills
- Strong experience debugging production issues end-to-end:
  - Application Code
  - Infrastructure
  - Databases
  - Third-Party Dependencies
  - Cloud Services
- Hands-on expertise with:
  - AWS
  - Ruby on Rails and/or Java
  - REST APIs
  - Microservices Architecture
  - Distributed Systems Troubleshooting
  - SQL and Data-Level Investigations
Observability & Reliability
- Experience with enterprise monitoring and observability tools such as:
  - Splunk
  - Datadog
  - New Relic
  - Similar monitoring platforms
Core Engineering Knowledge
- Deep understanding of:
  - Production system behavior
  - Fault isolation techniques
  - Incident management
  - Performance tuning
  - Reliability engineering
  - Resiliency patterns
  - High-availability architectures
Professional Skills
- Excellent communication and stakeholder management abilities.
- Ability to effectively communicate technical concepts to both technical and non-technical audiences.
- Proven ability to operate calmly and effectively during incidents, escalations, and mission-critical outages.
Preferred Qualifications
- Experience within:
  - Payments
  - FinTech
  - Banking
  - Financial Services
  - Other regulated industries
- Experience supporting:
  - High-volume transaction systems
  - Real-time payments platforms
  - Enterprise-scale, customer-facing applications
- Bachelor's degree, or equivalent practical experience, in:
  - Computer Science
  - Engineering
  - Information Technology