Principal Site Reliability Engineer

Charlotte, NC, US • Posted 3 days ago • Updated 3 days ago
Contract Corp To Corp
Contract W2
Contract Independent
12 Months
No Travel Required
On-site
Depends on Experience
Job Details

Skills

  • 8+ years of site reliability, platform, or production operations engineering experience
  • 3+ years of principal or senior-level ownership of SRE programs or enterprise observability platforms
  • Deep expertise in Prometheus, Grafana, Alertmanager, and the broader Prometheus ecosystem (PromQL, recording rules, alerting rule design, etc.)
  • Demonstrated experience designing and operating SLO frameworks
  • Experience designing and maintaining machine-readable runbook libraries
  • Strong experience owning incident response processes
  • Experience designing and implementing toil elimination automation for complex distributed systems
  • Ability to set reliability standards and influence engineering architecture decisions
  • Excellent written and verbal communication skills for documentation, incident reports, and cross-team collaboration

Summary

Title : Principal Site Reliability Engineer
Location : Charlotte, NC
Contract to Hire on W2
 
Job Description :
This is a principal-level, deeply hands-on SRE role and the most senior individual contributor responsible for the operational reliability of our client's AI systems, agents, platform services, and infrastructure in production. This role owns the SLO framework that defines reliability, the machine-readable runbook library that operationalizes incident response, the Prometheus observability configuration, and the incident response process. The Principal SRE writes SLOs, builds alerting rules, authors machine-readable runbooks, configures Prometheus, and owns the reliability engineering practices for AI systems in an enterprise production environment.
 
Responsibilities:
Own and continuously evolve the SLO framework, defining and maintaining Service Level Objectives, Service Level Indicators, and error budget policies for all AI services
Own and maintain the machine-readable runbook library, authoring and operating structured runbooks that enable automated incident response workflows
Own the Prometheus observability configuration, including scrape configurations, recording rules, alerting rules, Alertmanager routing, and Grafana dashboard architecture
Own and continuously improve the incident response process, defining the lifecycle, severity classification, escalation paths, and post-incident review process
Design and enforce reliability engineering standards across the engineering organization
Build and maintain toil elimination automation to reduce operational burden
Partner with Principal Platform and AI Platform Engineers to instrument Azure infrastructure and design AI-specific reliability patterns
Lead post-incident reviews, facilitating blameless retrospectives and tracking remediation commitments
Mentor platform and application engineers on SRE principles, reliability design, and observability best practices
Required Skills :
8+ years of site reliability, platform, or production operations engineering experience
3+ years of principal or senior-level ownership of SRE programs or enterprise observability platforms
Deep expertise in Prometheus, Grafana, Alertmanager, and the broader Prometheus ecosystem (PromQL, recording rules, alerting rule design, etc.)
Demonstrated experience designing and operating SLO frameworks
Experience designing and maintaining machine-readable runbook libraries
Strong experience owning incident response processes
Experience designing and implementing toil elimination automation for complex distributed systems
Ability to set reliability standards and influence engineering architecture decisions
Excellent written and verbal communication skills for documentation, incident reports, and cross-team collaboration
Desired Skills:
Experience defining and operating SLOs for AI systems, LLM-powered applications, or agentic workflows
Experience with Azure Monitor, Azure Managed Grafana, or Container Insights
Experience with PagerDuty, OpsGenie, or comparable incident management platforms
Experience in financial services, cybersecurity, or other regulated enterprise environments
Experience with chaos engineering practices (Chaos Mesh, Azure Chaos Studio)
Familiarity with OpenTelemetry for distributed tracing and metric collection
Experience with DORA metrics tracking and reliability scorecards
Google SRE certification, Prometheus Certified Associate (PCA), or comparable reliability and observability certifications
  • Dice Id: 10120236
  • Position Id: 824-8935-