Principal Site Reliability Engineer

Charlotte, NC, US • Posted 3 days ago • Updated 3 days ago

Contract Corp To Corp

Contract W2

Contract Independent

12 Months

No Travel Required

On-site

Depends on Experience

Job Details

Summary

Title: Principal Site Reliability Engineer
Location: Charlotte, NC
Contract to Hire on W2

Job Description:
This is a principal-level, deeply hands-on SRE role. As the most senior individual contributor, the Principal SRE is responsible for the operational reliability of our client's AI systems, agents, platform services, and infrastructure in production. The role owns the SLO framework that defines reliability, the machine-readable runbook library that operationalizes incident response, the Prometheus observability configuration, and the incident response process. Day to day, the Principal SRE writes SLOs, builds alerting rules, authors machine-readable runbooks, configures Prometheus, and drives the reliability engineering practices for AI systems in an enterprise production environment.

Responsibilities:
Own and continuously evolve the SLO framework, defining and maintaining Service Level Objectives, Service Level Indicators, and error budget policies for all AI services
Own and maintain the machine-readable runbook library, authoring and operating structured runbooks that enable automated incident response workflows
Own the Prometheus observability configuration, including scrape configurations, recording rules, alerting rules, Alertmanager routing, and Grafana dashboard architecture
Own and continuously improve the incident response process, defining the lifecycle, severity classification, escalation paths, and post-incident review process
Design and enforce reliability engineering standards across the engineering organization
Build and maintain toil elimination automation to reduce operational burden
Partner with Principal Platform and AI Platform Engineers to instrument Azure infrastructure and design AI-specific reliability patterns
Lead post-incident reviews, facilitating blameless retrospectives and tracking remediation commitments
Mentor platform and application engineers on SRE principles, reliability design, and observability best practices

Required Skills:
8+ years of site reliability, platform, or production operations engineering experience
3+ years of principal or senior-level ownership of SRE programs or enterprise observability platforms
Deep expertise in Prometheus, Grafana, Alertmanager, and the broader Prometheus ecosystem (PromQL, recording rules, alerting rule design, etc.)
Demonstrated experience designing and operating SLO frameworks
Experience designing and maintaining machine-readable runbook libraries
Strong experience owning incident response processes
Experience designing and implementing toil elimination automation for complex distributed systems
Ability to set reliability standards and influence engineering architecture decisions
Excellent written and verbal communication skills for documentation, incident reports, and cross-team collaboration

Desired Skills:
Experience defining and operating SLOs for AI systems, LLM-powered applications, or agentic workflows
Experience with Azure Monitor, Azure Managed Grafana, or Container Insights
Experience with PagerDuty, OpsGenie, or comparable incident management platforms
Experience in financial services, cybersecurity, or other regulated enterprise environments
Experience with chaos engineering practices (Chaos Mesh, Azure Chaos Studio)
Familiarity with OpenTelemetry for distributed tracing and metric collection
Experience with DORA metrics tracking and reliability scorecards
Google SRE certification, Prometheus Certified Associate (PCA), or comparable reliability and observability certifications

Dice Id: 10120236
Position Id: 824-8935-

More jobs at Pantar Solutions, Inc. in Charlotte, NC