Lead SRE & AI Ops Engineer

East Hartford, CT, US • Posted 2 hours ago • Updated 2 hours ago
Contract Corp To Corp
Contract Independent
75% Travel Required
On-site
60+
Fitment

Job Details

Skills

  • API
  • Artificial Intelligence
  • Big Data
  • Bridging
  • Call Center
  • Cloud Computing
  • Communication
  • Continuous Delivery
  • Continuous Integration
  • Crisis Management
  • Customer Experience
  • DevOps
  • Dynatrace
  • GCS
  • Generative Artificial Intelligence (AI)
  • GitHub
  • Good Clinical Practice
  • Google Cloud Platform
  • Health Insurance
  • IT Operations
  • Incident Management
  • LangChain
  • Leadership
  • Machine Learning (ML)
  • Management
  • Media
  • Microservices
  • Production Support
  • Public Relations
  • Python
  • RESTful
  • Real-time
  • Redis
  • Root Cause Analysis
  • SAFE
  • Splunk
  • Technical Support
  • Telephony
  • Vertex

Summary

Role Overview

We are seeking a Lead SRE & AI Ops Engineer with 12+ years of total experience, including at least 5 years in a leadership capacity, to oversee the reliability and performance of our AI-powered medical insurance call center platform. This role represents a strategic shift from traditional Big Data to AIOps, leveraging AI to process large volumes of telemetry data for proactive system support.

You will lead the production support strategy for a complex ecosystem involving Google Contact Center AI (CCAI), Generative AI, and several Cloud Run microservices. As the "middle man" between Client stakeholders, DevOps, and Development teams, you will be responsible for maintaining the stability of a real-time speech-to-text and AI-driven "Advocate Assist" application.

Key Responsibilities

1. High-Volume Incident Leadership

  • Incident Commander: Act as the lead orchestrator for high-volume P1 and P2 incidents, managing the full lifecycle from detection to resolution.
  • Crisis Management: Direct cross-functional teams during large-scale outages, ensuring clear communication with stakeholders and driving technical teams toward rapid service restoration.
  • RCA & Problem Records: Own the Root Cause Analysis (RCA) process, creating and driving Problem Records (PR) to closure to ensure permanent remediation of recurring issues.

2. AI Ops & GenAI Ecosystem Support

  • CCAI & Telephony Flow: Manage the reliability of the end-to-end flow from Media Hub through Google Telephony, Speech-to-Text conversion, and Dialogflow CX.
  • GenAI Pipeline Maintenance: Optimize and troubleshoot LLM-powered services built on Vertex AI, Gemini, and LangChain, ensuring low-latency answers for call center advocates.
  • AIOps Implementation: Shift from reactive monitoring to AI-driven operations, using machine learning to correlate signals across large datasets and predict failures before they impact users.

3. Monitoring, Observability & Traceability

  • Multi-Stack Visibility: Oversee a sophisticated monitoring suite including Datadog, Dynatrace, Splunk (for logging), and Google Cloud Platform Observability.
  • Traceability Engineering: Implement and maintain end-to-end tracing to pinpoint latency and failure points across asynchronous Pub/Sub messages and several Cloud Run microservices.
  • Proactive Health Checks: Use BigQuery and Splunk logs to establish performance baselines and automate anomaly detection.

4. Integration & CI/CD Leadership

  • Operational Liaison: Serve as the technical point of contact for the client, bridging the gap between business needs and technical DevOps/Developer execution.
  • Automated Lifecycles: Oversee CI/CD pipelines via GitHub Actions, ensuring that releases for AI prompts, knowledge bases, and Python-based FastAPI services are stable and safe.

Required Technical Experience

  • Experience Level: 10+ years in IT/Operations with 5+ years in SRE leadership or Incident Management.
  • Google Cloud Platform Infrastructure: Deep expertise in Cloud Run, Pub/Sub, GCS, BigQuery, and Redis.
  • AI/ML Stack: Strong knowledge of Google CCAI, Vertex AI, Gemini, and LangChain.
  • Backend & API: Proficient in Python (FastAPI) and RESTful API troubleshooting.
  • Observability Tools: Expert-level knowledge of Splunk, Datadog, and Dynatrace.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 91099306
  • Position Id: 8957396