Alert Management & Observability Standards Lead

Hybrid in Fairfield, CA, US • Posted 16 hours ago • Updated 16 hours ago

Contract W2

13 Months

No Travel Required

Hybrid

$45 - $50/hr

Job Details

Skills

alert management

Summary

Alert Management & Observability Standards Lead

Fairfield CA

Job Description

Job Title: Alert Management & Observability Standards Lead

Role Summary

The Alert Management & Observability Standards Lead is responsible for rationalizing and governing all system alerts to ensure they align with department priorities, operational coverage models, and service reliability goals. This role defines alerting standards, reviews and approves alerts before they are routed to the 24x7 Eyes-on-Glass Operations team, and establishes a scalable approach to cataloging alert response instructions (runbooks/playbooks) so responders can take consistent, high-quality actions.

This position operates at the intersection of the IT Operations Command Center (OCC), engineering/application teams, platform/monitoring tool owners, and service owners, ensuring alerts are actionable, prioritized, and paired with clear response guidance.

Key Responsibilities

1) Alert Rationalization & Prioritization (Core)

Establish and maintain a department-wide alert rationalization framework that evaluates alerts for:
Business/service criticality and operational priority
Actionability (clear operator action available)
Signal-to-noise (duplicate/low-value alerts removed or suppressed)
Ownership and escalation paths

Perform regular alert reviews (new + existing) to ensure alert quality, correct routing, and alignment with operational coverage.

Lead continuous improvement efforts to reduce alert fatigue while preserving detection of true incidents and high-impact degradation.

2) Standards, Policies, and Guardrails

Define and enforce alerting standards including:
Severity definitions and thresholds
Required metadata (service, CI, owner, runbook link, escalation)
Naming conventions and tagging taxonomy
Routing rules and “when to page vs. when to ticket”

Create a standardized Alert Design Checklist and approval workflow (e.g., “Definition of Done” for alert onboarding).

Partner with tool/platform owners to ensure standards are embedded in monitoring tooling (templates, required fields, automated validation).

3) Routing Decisions to 24x7 Eyes-on-Glass

Act as gatekeeper (or lead the governance process) for determining which alerts should:
Go to 24x7 Eyes-on-Glass for immediate triage
Route to on-call engineering directly
Create tickets for business-hours handling
Be suppressed, aggregated, or converted to dashboards/health indicators

Ensure routing aligns with:
Operational responsibilities and skills of the Eyes-on-Glass team
Department priorities (e.g., safety, reliability, customer impact)
Service ownership and support models

4) Runbook / Response Instruction Cataloging (Knowledge System)

Establish a consistent approach to cataloging response instructions for every actionable alert, including:
“What does this alert mean?” (symptoms + impact)
“What to check first” (triage steps)
“What actions to take” (standard remediation)
“When to escalate and to whom” (clear escalation triggers)
Links to dashboards, logs, SOPs, and known issues

Own the runbook template and ensure runbooks are versioned, maintained, and reviewed on a defined cadence.

Partner with service owners to ensure runbooks stay current as systems change.

5) Reporting & Operational Outcomes

Define and publish KPIs that demonstrate alerting health and operational performance, such as:
Alert volume trends by service and severity
Percentage of alerts with runbooks and valid ownership
Alert “actionability rate” and noise reduction
Mean time to acknowledge / triage effectiveness (as applicable)

Facilitate governance forums (weekly/monthly) with service owners and engineering leads to review alert quality and backlog.

6) Cross-Functional Enablement

Coach service teams on best practices: SLIs/SLOs, alert thresholds, dependency monitoring, and incident correlation.

Drive adoption of observability patterns (golden signals, health indicators, multi-signal alerting).

Support major incident learning by feeding post-incident insights back into:

Note

Alert Management & Observability Standards Lead (req ending in 4328)

Goal: Rationalize and reduce alert noise for the 24x7 NOC; establish monitoring standards and thresholds across compute, network, and application layers
Key tools in use: Comarch OSS, Spectrum OI, NetBrain, NetMRI, Dynatrace, SCOM — Splunk was recently removed
Work split: ~85-90% hands-on technical, 10-15% governance
Ideal profile:
Empathy toward 24x7 NOC and emergency response environments
Ability to translate technical alert data into business impact language
Comfortable working with pushback — Joe will provide executive backing
Work arrangement: Hybrid — 1 to 2 days/week on-site; local candidates preferred near Fairfield, CA (Sacramento area also acceptable)
Reporting: Direct report to Joe, cross-functional across all teams
Schedule: No 24x7 shift requirements for this role
Equipment: Supplier provides laptop; candidate logs in via VDI (Joe will try to request a PG&E laptop when possible)

Dice Id: 10383289
Position Id: 8978892

Contact the job poster

Vincent Kumar

Talent Acquisition @ Avenue Code, LLC
