Job Overview
The Production Support & SRE Manager is responsible for end-to-end ownership of production operations for our SaaS applications, including leading L1/L2 application support, defining and enforcing Incident and Problem Management processes, and driving Site Reliability Engineering (SRE) practices. This is a hands-on position requiring strong technical, operational, and leadership skills, as well as excellent interpersonal and communication abilities. The role involves close collaboration with Development, QA, Infrastructure, and Database teams to ensure stability, reliability, and high availability of our systems.
Job Responsibilities
Own the full Incident Management lifecycle for production issues, from detection through resolution and post-incident review.
Lead and coordinate incident bridge calls with customer users and internal teams for high-priority issues.
Ensure incidents are logged correctly, prioritized appropriately, and resolved within defined SLAs.
Maintain clear, timely communication with internal stakeholders and clients during outages and major incidents.
Drive Problem Management by identifying recurring issues, patterns, and systemic weaknesses.
Gather technical inputs from Development, QA, Infrastructure, DBAs, and SRE teams to produce accurate, detailed RCA documents.
Prepare and present structured RCA documents and incident reports, including impact, timeline, root cause, and corrective actions.
Define and maintain SLIs/SLOs for critical services (availability, latency, error rates, throughput).
Champion observability across systems: logging, metrics, tracing, dashboards, and alerts.
Improve and standardize monitoring and alerting for our Angular, C#, and SQL Server-based applications.
Identify and implement automation opportunities (runbooks, self-healing, deployment checks, validation scripts) to reduce manual toil.
Participate in capacity planning, performance tuning, and resilience testing as needed.
Lead and mentor L1 and L2 support engineers and SRE-focused team members.
Establish clear expectations around ticket hygiene, communication, and ownership within the team.
Run regular operational reviews covering backlog, aged incidents, recurring issues, SLAs, and reliability metrics.
Work closely with development managers and product owners to prioritize stability and reliability improvements alongside feature work.
Define, document, and continuously improve Incident and Problem Management processes aligned with ITIL and SRE best practices.
Ensure all incidents, problems, and changes are properly documented in the ticketing system.
Create and maintain operational dashboards and reports for leadership and key stakeholders.
Ensure the team builds and uses knowledge base articles and runbooks to speed up L1/L2 resolution.
Qualifications
5+ years of experience in Production Support, Application Support, Site Reliability Engineering (SRE), or Operations for web-based / SaaS applications.
3+ years in a leadership role (Manager / Lead) handling production support and/or SRE responsibilities.
Strong experience with Incident Management, including leading P1/P0 calls and coordinating multi-team responses.
Proven experience with Problem Management and Root Cause Analysis for complex, multi-team issues.
Hands-on experience working with web application environments, preferably Angular, C#/.NET, and SQL Server.
Experience with monitoring, logging, and alerting tools, and with building or working with observability dashboards.
Ability to read and interpret application logs, metrics, and distributed traces.
Ability to understand and analyze SQL queries and database performance symptoms (e.g., blocking, deadlocks, slow queries).
Excellent verbal and written communication skills, including the ability to explain technical issues to non-technical stakeholders.
Strong analytical, critical thinking, and problem-solving skills.
Other Requirements & Qualifications
Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
Relevant experience in Site Reliability Engineering and/or Production/Application Support management.
Experience supporting applications for Health Plans or Insurance organizations is preferred.
Exposure to regulated environments such as healthcare (HIPAA/HITECH, HITRUST, NIST-based controls) is preferred.
ITIL and/or SRE certifications are a plus.