Role Title: Observability Engineer
Employment Type: Contract
Duration: 6 Months (Potential Extensions)
Location: Cleveland, OH area – Hybrid (4 days onsite / 1 day remote)
About the Role
We are seeking an experienced Observability Engineer to support and expand a centralized enterprise observability platform. The initiative aims to build a true “single pane of glass” monitoring environment on modern telemetry and monitoring technologies, including Prometheus, Grafana, and Loki.
The current environment captures approximately 50% of server telemetry and is now evolving to include cross-domain observability across infrastructure, applications, databases, storage, and business transaction data. Long-term goals include enabling AI/ML-driven anomaly detection and intelligent root-cause analysis.
This is an opportunity to play a key role in building an enterprise-wide operational intelligence platform.
Responsibilities
- Expand telemetry ingestion across infrastructure, databases, storage platforms, applications, and network environments
- Assist with onboarding remaining systems and extending monitoring beyond traditional OS metrics
- Build and enhance Grafana dashboards that correlate infrastructure health with application performance and business transaction metrics
- Develop and maintain synthetic monitoring scripts using Playwright or similar tools to simulate critical user journeys
- Configure and optimize alerting workflows using Alertmanager and Loki
- Improve signal-to-noise ratio and reduce alert fatigue through better event management practices
- Establish and maintain telemetry labeling standards and data quality practices
- Support troubleshooting, root-cause analysis, and operational documentation efforts
- Partner with engineering and infrastructure teams to drive observability best practices across the enterprise
Required Qualifications
- Hands-on experience with:
  - Prometheus
  - Grafana
  - Loki
  - Alertmanager
- Strong experience writing PromQL queries and building Grafana dashboards
- Experience designing or supporting enterprise observability and monitoring platforms
- Ability to collect and normalize telemetry across:
  - Servers
  - Databases
  - Storage environments
  - Networks
  - Applications
- Experience with synthetic monitoring tools such as Playwright or Selenium
- Strong Linux command-line experience
- Experience editing and managing YAML and JSON configuration files
- Knowledge of alert routing, escalation workflows, and reducing alert fatigue
- Understanding of telemetry standards, labeling strategy, and data hygiene practices
- Strong troubleshooting and analytical skills
Preferred Qualifications
- Oracle and SQL database experience
- Experience with SNMP, network flow data, or infrastructure performance monitoring
- Exposure to AI/ML-based observability or anomaly detection initiatives
This role offers the opportunity to help shape the future of enterprise monitoring and observability while working on high-impact initiatives supporting large-scale infrastructure and application environments.