SRE Engineer

Overview

On Site

Accepts corp to corp applications

Contract - W2

Contract - Independent

Contract - 6+ Month(s)

Skills

Grafana OSS/Enterprise

SLOs/SLIs

Dashboards, alerts, and reporting mechanisms for infrastructure, applications, and services

Job Details

Contact Details:

1. Sandeep Bisane
Email:
Cell:

Job Title: SRE Engineer

Location: Basking Ridge, NJ (Onsite)

Duration: 6+ Months

Years of Experience: 12+ Yrs.

Required Hours/Week: 40 hours/week

Job Summary:

We are seeking a seasoned SRE Engineer with strong expertise in designing and implementing observability and monitoring solutions across multi-cloud (AWS, Azure, Google Cloud Platform) and on-premises environments.
The ideal candidate will have deep hands-on experience with Grafana, Prometheus, Loki, Tempo, and integrations with various telemetry sources.
You will be responsible for end-to-end observability strategy, architectural governance, implementation, and evangelizing best practices across teams.

Key Responsibilities:

Architect and implement scalable observability solutions across hybrid/multi-cloud and on-premises environments using Grafana OSS/Enterprise.
Define monitoring strategies, SLOs/SLIs, dashboards, alerts, and reporting mechanisms for infrastructure, applications, and services.
Integrate Grafana with Prometheus, Loki, Tempo, InfluxDB, Elasticsearch, cloud-native tools (e.g., AWS CloudWatch, Azure Monitor, Google Cloud Operations Suite), and on-prem systems.
Lead the design and implementation of custom plugins, data sources, and dashboards for cross-platform observability.
Build and standardize templates, alerting rules, and RBAC models within Grafana Enterprise.
Collaborate with DevOps, SRE, Cloud, and App teams to define observability needs and onboard them into the platform.
Define and implement monitoring as code (MaC) practices using Terraform/Ansible for observability infrastructure.
Govern and optimize telemetry collection (logs, metrics, traces) for performance, cost, and usability.
Lead capacity planning, HA/DR design, performance tuning, and upgrades for the Grafana stack.
Provide thought leadership on OpenTelemetry, distributed tracing, log aggregation, and AIOps capabilities.
Conduct training, documentation, and internal community engagement around observability tools.

Required Skills & Experience:

5+ years of hands-on experience with Grafana, including dashboard design, plugin development, and user management.
Strong expertise with Prometheus, Loki, Tempo, Alertmanager, and OpenTelemetry.
Proven experience designing multi-cloud (AWS, Azure, Google Cloud Platform) observability frameworks.
Experience integrating with on-premises systems (e.g., vSphere, bare-metal monitoring, SNMP, legacy tools).
Hands-on experience with Terraform, Helm, Ansible, and GitOps practices for monitoring infrastructure.
Strong scripting and automation skills (Python, Bash, etc.).
In-depth knowledge of monitoring standards and telemetry formats (Prometheus metrics, OTLP, JSON logs).
Proficient in SRE principles (SLOs, SLIs, error budgets, alerting strategy).
Experience with RBAC, LDAP/SAML integration, and Grafana Enterprise features.
Strong troubleshooting skills in distributed systems and observability pipelines.
Excellent communication, stakeholder management, and leadership skills.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.