DevOps Engineer (SRE/Grafana/Google Cloud Platform)

Overview

Hybrid
Depends on Experience
Contract - W2

Skills

API
Grafana
Good Clinical Practice
Google Cloud
Google Cloud Platform
DevOps
Data Collection
Docker
SNOW
ServiceNow integration
Git
Kubernetes
Linux
ServiceNow
SRE
SiteReliabilityEngineer
Cloud
OpenTelemetry
Open Telemetry
JVM
APIGEE
GKE
UI

Job Details

Position: Site Reliability Engineer (SRE) Cloud Platform
Role of position: Lead SRE implementation specifically for frontend portal monitoring, reliability, and performance on Google Cloud Platform or Microsoft Azure.
Location: Hartford, CT.
Remote/Onsite: Hybrid, 2-3 days onsite
Method/Duration: 90-day CTH (contract-to-hire)
Interview Process: 1-2 virtual interviews. Panel with the manager and 2 other SREs.
Submittal Requirements: Feenyx assessment, 1 reference completed with CVS template
Start Date: ASAP. The hiring manager needs this role filled by the end of October.
Years of experience: 5+ years in SRE/DevOps with proven JVM, APIGEE, Google Cloud Platform observability, Grafana stack, GKE, OpenTelemetry, and UI instrumentation implementation experience
Below are the top skills they're targeting.
  • Grafana - Dashboard creation, panel configuration, business metrics visualization
  • PromQL - Query writing, metric aggregation, SLO calculations, alerting conditions
  • Google Cloud Platform Metrics Explorer - Monitoring setup, alerting policies, escalation procedures
  • Loki - Log management, structured logging, log correlation, troubleshooting
  • Tempo - Distributed tracing, Open Telemetry, performance bottleneck identification
  • Automation & Alerts with SNOW - ServiceNow integration, automated incident creation, workflow automation.
Requirements:
  • Technical: Python, Linux, Prometheus, Grafana, Kubernetes, Docker, Loki, Tempo
  • JVM Metrics: Java application monitoring, JVM performance tuning, heap analysis, garbage collection optimization for portal applications
  • Logging & Tracing: Splunk, distributed tracing, log aggregation standards, correlation IDs across portal systems
  • API Management: APIGEE experience, API monitoring, rate limiting, security, performance tracking for portal APIs
  • Infrastructure: CI/CD pipelines, AI tools such as GitHub Copilot, Cursor, etc.
  • Observability Tools & Query Languages: PromQL, InfluxQL for querying metrics (Grafana)
  • Strong experience with Kubernetes (GKE), including namespace management, RBAC, and deploying/maintaining SRE tools via code (Python, Bash, YAML, Helm).
  • Google Cloud Platform-Specific Observability Skills
  • UI Instrumentation & Frontend Monitoring
Job Description:
Responsibilities:
  • Design and implement comprehensive SRE monitoring for web portal on Google Cloud Platform
  • Set up JVM metrics collection and performance monitoring for Java applications using Google Cloud Platform Monitoring
  • Implement logging and tracing standards across all portal components using Cloud Logging and Cloud Trace
  • Configure APIGEE monitoring and API performance tracking for portal services
  • Implement distributed tracing with W3C Trace Context headers and OpenTelemetry
  • Create drill-down dashboards with correlation between metrics, logs, and traces using Google Cloud Platform tools
  • Integrate Google Cloud Platform Monitoring, Logging, and Trace with existing Prometheus/Grafana stack
  • Configure GMP (Google Managed Prometheus) for enhanced metrics collection
  • Implement zero-code UI instrumentation for frontend monitoring and traceability
  • Create RED (Rate, Errors, Duration) dashboards for Performance and Production environments
  • Build service health dashboards with drill-down capabilities and error message analysis
  • Develop and maintain SRE automation/scripts within GKE namespaces (SRE and others) for monitoring, deployment, and troubleshooting.
Additional Critical Skills:
  • Distributed Tracing Standards: W3C Trace Context headers implementation
  • Structured Logging: JSON format with specific fields (including trace_id and log.level)
  • Performance Baseline Establishment: Ability to collect and analyze 2-4 weeks historical data for performance baselines
  • Dashboard Implementation: Drill-down capabilities, service mapping from trace data, correlation between metrics/logs/traces
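For candidates gauging what the tracing and structured-logging items above involve in practice, here is a minimal Python sketch using only the standard library. It builds and validates a W3C Trace Context `traceparent` header and emits a JSON log line carrying `trace_id` and `log.level`. The function names and the extra log fields (`timestamp`, `message`, `span_id`) are illustrative assumptions, not the team's actual schema, and a real deployment would more likely rely on the OpenTelemetry SDK than on hand-rolled parsing.

```python
import json
import re
import secrets
from datetime import datetime, timezone

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags,
# all lowercase hex. This sketch only accepts the four-field form.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Create a fresh version-00 traceparent value (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def log_line(level, message, traceparent):
    """Build one structured JSON log line correlated to the incoming trace."""
    ctx = parse_traceparent(traceparent) or {}
    record = {
        # Fields other than trace_id and log.level are assumptions.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "log.level": level,
        "message": message,
        "trace_id": ctx.get("trace_id"),
        "span_id": ctx.get("span_id"),
    }
    return json.dumps(record)


if __name__ == "__main__":
    header = new_traceparent()
    print(header)
    print(log_line("info", "portal request handled", header))
```

Because every log line carries the same `trace_id` as the request's trace, a log query in Loki or Cloud Logging can be joined to the matching trace in Tempo or Cloud Trace, which is the correlation the dashboard items refer to.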
Google Cloud Platform-Specific Observability Skills (CRITICAL):
  • Google Cloud Monitoring: GMP (Google Managed Prometheus), Cloud Monitoring dashboards, alerting policies
  • Google Cloud Logging: Centralized logging, log-based metrics, log exports
  • OpenTelemetry (OTEL): Instrumentation, collectors, data collection from Google Cloud Platform services