Role Overview
The Shared SME / Principal Engineer is the highest technical escalation tier within the NOC. You will be engaged exclusively for architecture-level and code-level issues that cannot be resolved within SLA. You act as the bridge to Engineering for product defects, own complex RCAs and post-incident reviews, and drive platform reliability improvements across all tenants in the shared pool.
Key Responsibilities
• Provide architecture-level and code-level diagnosis and remediation for critical incidents.
• Serve as the primary liaison to Engineering for confirmed product defects.
• Own and deliver complex Root Cause Analysis (RCA) and post-incident review documents.
• Drive post-incident improvement actions, including permanent code or configuration fixes.
• Review and approve changes to the platform architecture arising from incident learnings.
• Set technical standards for runbooks, diagnostic tooling, and monitoring instrumentation.
• Advise the Engineering Manager on capacity, resilience, and observability improvements.
• Provide cross-tenant knowledge: identify systemic risks that affect multiple clients.
• Join Sev1 bridge calls as the technical authority when escalation is required.
Required Skills & Qualifications
• 8+ years of experience in senior platform engineering, SRE, or technical operations roles.
• Deep expertise in distributed systems architecture and microservices-based platforms.
• Proficiency in at least one JVM language (Java or Kotlin) and in either Python or Go.
• Expert-level debugging: heap dumps, thread dumps, memory profiling, distributed tracing.
• Strong understanding of mortgage industry workflows and lending platform architecture.
• Experience contributing to or reviewing code in production environments.
• Demonstrated ability to produce board-ready RCA and post-incident reports.
• Familiarity with security and compliance considerations in financial services (e.g., SOC 2, PCI DSS).
Preferred Skills
• Prior hands-on experience with MACER or similar enterprise mortgage orchestration platforms.
• Track record of working directly with product engineering teams to resolve systemic defects.
• Experience leading SRE or reliability engineering functions.
• AWS or Azure Solutions Architect certification, or equivalent credentials.
Key Performance Indicators (KPIs)
• Quality and depth of RCA documents, as assessed by the Engineering Manager and CSM.
• Time-to-permanent-fix for platform defects bridged to Engineering.
• Reduction in repeat Sev1/Sev2 incidents attributable to systemic improvements.
• Number of cross-tenant learnings shared and operationalized per quarter.
• Stakeholder satisfaction score from CSM post-incident reviews.