Role Overview

We are seeking a Lead SRE & AI Ops Engineer with 12+ years of total experience, including at least 5 years in a leadership capacity, to oversee the reliability and performance of our AI-powered medical insurance call center platform. This role represents a strategic shift from traditional Big Data to AIOps, leveraging AI to process large volumes of telemetry data for proactive system support.

You will lead the production support strategy for a complex ecosystem involving Google Contact Center AI (CCAI), Generative AI, and several Cloud Run microservices. As the "middle man" between Client stakeholders, DevOps, and Development teams, you will be responsible for maintaining the stability of a real-time speech-to-text and AI-driven "Advocate Assist" application.

Key Responsibilities

1. High-Volume Incident Leadership

Incident Commander: Act as the lead orchestrator for high-volume P1 and P2 incidents, managing the full lifecycle from detection to resolution.
Crisis Management: Direct cross-functional teams during large-scale outages, ensuring clear communication with stakeholders and driving technical teams toward rapid service restoration.
RCA & Problem Records: Own the Root Cause Analysis (RCA) process, creating and driving Problem Records (PR) to closure to ensure permanent remediation of recurring issues.

2. AI Ops & GenAI Ecosystem Support

CCAI & Telephony Flow: Manage the reliability of the end-to-end flow from Media Hub through Google Telephony, Speech-to-Text conversion, and Dialogflow CX.
GenAI Pipeline Maintenance: Optimize and troubleshoot LLM-powered services built on Vertex AI, Gemini, and LangChain, ensuring low-latency answers for call center advocates.
AIOps Implementation: Shift from reactive monitoring to AI-driven operations, using machine learning to correlate signals across large datasets and predict failures before they impact users.

3. Monitoring, Observability & Traceability

Multi-Stack Visibility: Oversee a sophisticated monitoring suite including Datadog, Dynatrace, Splunk (for logging), and Google Cloud Platform Observability.
Traceability Engineering: Implement and maintain end-to-end tracing to pinpoint latency and failure points across asynchronous Pub/Sub messages and several Cloud Run microservices.
Proactive Health Checks: Use BigQuery and Splunk logs to establish performance baselines and automate anomaly detection.

4. Integration & CI/CD Leadership

Operational Liaison: Serve as the technical point of contact for the client, bridging the gap between business needs and technical DevOps/Developer execution.
Automated Lifecycles: Oversee CI/CD pipelines via GitHub Actions, ensuring that releases for AI prompts, knowledge bases, and Python-based FastAPI services are stable and safe.

Required Technical Experience

Experience Level: 10+ years in IT/Operations with 5+ years in SRE leadership or Incident Management.
Google Cloud Platform Infrastructure: Deep expertise in Cloud Run, Pub/Sub, GCS, BigQuery, and Redis.
AI/ML Stack: Strong knowledge of Google CCAI, Vertex AI, Gemini, and LangChain.
Backend & API: Proficient in Python (FastAPI) and RESTful API troubleshooting.
Observability Tools: Expert-level knowledge of Splunk, Datadog, and Dynatrace.