DevOps Engineer (SRE/Grafana/Google Cloud Platform)

Overview

Hybrid

Depends on Experience

Contract - W2

Skills

API

Grafana

Google Cloud

Google Cloud Platform

DevOps

Data Collection

Docker

SNOW

ServiceNow integration

Git

Kubernetes

Linux

ServiceNow

SRE

Site Reliability Engineer

Cloud

OpenTelemetry

JVM

APIGEE

GKE

Job Details

Position: Site Reliability Engineer (SRE) Cloud Platform

Role of position: Lead SRE implementation specifically for frontend portal monitoring, reliability, and performance on Google Cloud Platform or Microsoft Azure.

Location: Hartford, CT.

Remote/Onsite: Hybrid, 2-3 days onsite

Method/Duration: 90-day contract-to-hire (CTH)

Interview Process: 1-2 virtual interviews. Panel with the manager and two other SREs.

Submittal Requirements: Feenyx assessment, 1 reference completed with CVS template

Start Date: ASAP. The hiring manager needs this role filled by the end of October.

Years of experience: 5+ years in SRE/DevOps with proven JVM, APIGEE, Google Cloud Platform observability, Grafana stack, GKE, OpenTelemetry, and UI instrumentation implementation experience

Below are the top skills they're targeting.

Grafana - Dashboard creation, panel configuration, business metrics visualization
PromQL - Query writing, metric aggregation, SLO calculations, alerting conditions
Google Cloud Platform Metrics Explorer - Monitoring setup, alerting policies, escalation procedures
Loki - Log management, structured logging, log correlation, troubleshooting
Tempo - Distributed tracing, OpenTelemetry, performance bottleneck identification
Automation & Alerts with SNOW - ServiceNow integration, automated incident creation, workflow automation.

Requirements:

Technical: Python, Linux, Prometheus, Grafana, Kubernetes, Docker, Loki, Tempo
JVM Metrics: Java application monitoring, JVM performance tuning, heap analysis, garbage collection optimization for portal applications
Logging & Tracing: Splunk, distributed tracing, log aggregation standards, correlation IDs across portal systems
API Management: APIGEE experience, API monitoring, rate limiting, security, performance tracking for portal APIs
Infrastructure: CI/CD pipelines; AI tools such as GitHub Copilot and Cursor
Observability Tools & Query Languages: PromQL and InfluxQL for querying metrics (Grafana)
Strong experience with Kubernetes (GKE), including namespace management, RBAC, and deploying/maintaining SRE tools via code (Python, Bash, YAML, Helm).
Google Cloud Platform-Specific Observability Skills
UI Instrumentation & Frontend Monitoring

Job Description:

Responsibilities:

Design and implement comprehensive SRE monitoring for the web portal on Google Cloud Platform
Set up JVM metrics collection and performance monitoring for Java applications using Google Cloud Platform Monitoring
Implement logging and tracing standards across all portal components using Cloud Logging and Cloud Trace
Configure APIGEE monitoring and API performance tracking for portal services
Implement distributed tracing with W3C Trace Context headers and OpenTelemetry
Create drill-down dashboards with correlation between metrics, logs, and traces using Google Cloud Platform tools
Integrate Google Cloud Platform Monitoring, Logging, and Trace with the existing Prometheus/Grafana stack
Configure GMP (Google Managed Prometheus) for enhanced metrics collection
Implement zero-code UI instrumentation for frontend monitoring and traceability
Create RED (Rate, Errors, Duration) dashboards for Performance and Production environments
Build service health dashboards with drill-down capabilities and error message analysis
Develop and maintain SRE automation/scripts within GKE namespaces (SRE and others) for monitoring, deployment, and troubleshooting.

Additional Critical Skills:

Distributed Tracing Standards: W3C Trace Context headers implementation
Structured Logging: JSON format with specific fields (including trace_id and log.level)
Performance Baseline Establishment: Ability to collect and analyze 2-4 weeks historical data for performance baselines
Dashboard Implementation: Drill-down capabilities, service mapping from trace data, correlation between metrics/logs/traces
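
For candidates who want a concrete picture of the tracing and logging items above, the following is a minimal Python sketch, not the team's actual implementation. It parses a W3C Trace Context `traceparent` header (version 00 layout only), builds a child header, and emits a JSON log line that carries `trace_id` and `log.level`. The helper names (`parse_traceparent`, `new_traceparent`, `JsonFormatter`) and the `timestamp` and `message` fields are assumptions made for illustration; the posting does not specify them.

```python
import json
import logging
import re
import secrets

# traceparent layout for version 00: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def new_traceparent(trace_id, sampled=True):
    """Build a header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object (hypothetical field set)."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),  # assumed field name
            "log.level": record.levelname,
            "message": record.getMessage(),  # assumed field name
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("portal")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    parsed = parse_traceparent(incoming)
    if parsed is not None:
        trace_id, _parent_id, sampled = parsed
        # "extra" attaches trace_id to the record so the formatter can emit it.
        logger.info("portal request received", extra={"trace_id": trace_id})
        outgoing_header = new_traceparent(trace_id, sampled)
```

Carrying the same `trace_id` in every log line and in the outgoing header is what makes it possible to pivot between logs and traces in a dashboard. In practice an OpenTelemetry SDK would usually handle propagation, and this sketch only shows the underlying data.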

Google Cloud Platform-Specific Observability Skills (CRITICAL):

Google Cloud Monitoring: GMP (Google Managed Prometheus), Cloud Monitoring dashboards, alerting policies
Google Cloud Logging: Centralized logging, log-based metrics, log exports
OpenTelemetry (OTEL): Instrumentation, collectors, data collection from Google Cloud Platform services
