Role Overview
The SRE / Principal Engineer is the highest technical escalation tier within the NOC. You will be engaged exclusively for architecture-level and code-level issues that cannot be resolved within SLA. You act as the bridge to Engineering for product defects, own complex RCAs and post-incident reviews, and drive platform reliability improvements across all tenants in the shared pool.
Key Responsibilities
Provide architecture-level and code-level diagnosis and remediation for critical incidents.
Serve as the primary liaison to Engineering for confirmed product defects.
Own and deliver complex Root Cause Analysis (RCA) and post-incident review documents.
Drive post-incident improvement actions, including permanent code or configuration fixes.
Review and approve changes to the platform architecture arising from incident learnings.
Set technical standards for runbooks, diagnostic tooling, and monitoring instrumentation.
Advise the Engineering Manager on capacity, resilience, and observability improvements.
Provide cross-tenant knowledge and identify systemic risks that affect multiple clients.
Engage on Sev1 bridge calls as the technical authority when an incident requires escalation.
Required Skills & Qualifications
8+ years of experience in senior platform engineering, SRE, or technical operations roles.
Deep expertise in distributed systems architecture and microservices-based platforms.
Proficiency in at least one JVM language (Java or Kotlin), plus Python or Go.
Expert-level debugging: heap dumps, thread dumps, memory profiling, distributed tracing.
Strong understanding of mortgage industry workflows and lending platform architecture.
Experience contributing to or reviewing code in production environments.
Demonstrated ability to produce board-ready RCA and post-incident reports.
Familiarity with security and compliance considerations in financial services (for example, SOC 2 and PCI DSS).
Preferred Skills
Prior hands-on experience with MACER or similar enterprise mortgage orchestration platforms.
Track record of working directly with product engineering teams to resolve systemic defects.
Experience leading SRE or reliability engineering functions.
AWS or Azure Solutions Architect certification, or equivalent credentials.
Key Performance Indicators (KPIs)
Quality and depth of RCA documents, as assessed by the Engineering Manager and CSM.
Time-to-permanent-fix for platform defects bridged to Engineering.
Reduction in repeat Sev1/Sev2 incidents attributable to systemic improvements.
Number of cross-tenant learnings shared and operationalized per quarter.
Stakeholder satisfaction score from CSM post-incident reviews.