Overview
Hybrid
Depends on Experience
Contract - W2
Skills
Site Reliability Engineer
SRE
Cloud Platform
API
Artificial Intelligence
DevOps
JVM
APIGEE
GCP
Grafana stack
GKE
Google Cloud
Java
JSON
Grafana
UI
Google Cloud Platform
Git
Docker
Linux
Python
Splunk
Continuous Delivery
Continuous Integration
Cloud Computing
Job Details
Job Description:
Responsibilities:
- Design and implement comprehensive SRE monitoring for the web portal on Google Cloud Platform
- Set up JVM metrics collection and performance monitoring for Java applications using Google Cloud Monitoring
- Implement logging and tracing standards across all portal components using Cloud Logging and Cloud Trace
- Configure APIGEE monitoring and API performance tracking for portal services
- Implement distributed tracing with W3C Trace Context headers and OpenTelemetry
- Create drill-down dashboards with correlation between metrics, logs, and traces using Google Cloud Platform tools
- Integrate Google Cloud Monitoring, Logging, and Trace with the existing Prometheus/Grafana stack
- Configure GMP (Google Managed Prometheus) for enhanced metrics collection
- Implement zero-code UI instrumentation for frontend monitoring and traceability
- Create RED (Rate, Errors, Duration) dashboards for Performance and Production environments
- Build service health dashboards with drill-down capabilities and error message analysis
- Develop and maintain SRE automation/scripts within GKE namespaces (SRE and others) for monitoring, deployment, and troubleshooting.
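As an illustration of the RED dashboards mentioned above, the three numbers behind such a panel can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the `Request` record, the `red_summary` helper, the "status >= 500 counts as an error" rule, and the nearest-rank p95 are all hypothetical choices for illustration, not part of the role's actual tooling.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    status: int          # HTTP status code
    duration_ms: float   # server-side latency in milliseconds


def red_summary(requests, window_seconds):
    """Return rate (req/s), error ratio, and p95 duration for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_ms": None}

    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    durations = sorted(r.duration_ms for r in requests)
    p95_index = max(0, math.ceil(0.95 * total) - 1)

    return {
        "rate": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[p95_index],
    }


if __name__ == "__main__":
    sample = [Request(200, 10.0 * i) for i in range(1, 9)]
    sample += [Request(500, 90.0), Request(503, 100.0)]
    print(red_summary(sample, window_seconds=5))
```

In a Prometheus/Grafana setup the same quantities would normally come from counters and histograms queried with PromQL rather than from raw request records.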
Skills Needed:
- Technical: Python, Linux, Prometheus, Grafana, Kubernetes, Docker, Loki, Tempo
- JVM Metrics: Java application monitoring, JVM performance tuning, heap analysis, garbage collection optimization for portal applications
- Logging & Tracing: Splunk, distributed tracing, log aggregation standards, correlation IDs across portal systems
- API Management: APIGEE experience, API monitoring, rate limiting, security, performance tracking for portal APIs
- Infrastructure: CI/CD pipelines; AI tools such as GitHub Copilot, Cursor, etc.
- Observability Tools & Query Languages: PromQL and InfluxQL for querying metrics in Grafana
- Strong experience with Kubernetes (GKE), including namespace management, RBAC, and deploying/maintaining SRE tools via code (Python, Bash, YAML, Helm).
Additional Critical Skills:
- Distributed Tracing Standards: W3C Trace Context headers implementation
- Structured Logging: JSON format with specific fields (such as trace_id and log.level)
- Performance Baseline Establishment: Ability to collect and analyze 2-4 weeks of historical data for performance baselines
- Dashboard Implementation: Drill-down capabilities, service mapping from trace data, correlation between metrics/logs/traces
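To make the tracing and structured-logging items above concrete, here is a minimal Python sketch of reading a W3C Trace Context `traceparent` header and emitting a JSON log line that carries the same trace ID. The helper names `parse_traceparent` and `log_line` and the `message` field are assumptions for illustration; only `trace_id` and `log.level` come from the list above. The parser covers the version-00 header layout only.

```python
import json
import re

# traceparent layout (version 00): version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return trace context fields from a traceparent header, or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def log_line(level, message, context=None):
    """Build one structured JSON log line; 'message' is a hypothetical field name."""
    record = {"log.level": level, "message": message}
    if context is not None:
        record["trace_id"] = context["trace_id"]
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(header)
    print(log_line("INFO", "portal request handled", ctx))
```

Because every log line carries the trace ID taken from the incoming header, a dashboard can pivot from a log entry to the matching trace, which is the kind of correlation the drill-down item describes.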
Google Cloud Platform-Specific Observability Skills (CRITICAL):
- Google Cloud Monitoring: GMP (Google Managed Prometheus), Cloud Monitoring dashboards, alerting policies
- Google Cloud Logging: Centralized logging, log-based metrics, log exports
- OpenTelemetry (OTEL): Instrumentation, collectors, data collection from Google Cloud Platform services
UI Instrumentation & Frontend Monitoring (CRITICAL):
- UI Span Management: Naming conventions for UI-initiated spans, W3C Trace Context headers for frontend
- Frontend Observability: User session tracking, component-level monitoring, UI performance metrics
- Cross-Platform Tracing: End-to-end traceability from UI to backend services