AWS Observability or Grafana Architect

Warren, NJ, US • Posted 20 hours ago • Updated 38 minutes ago
Full Time
On-site

Job Details

Skills

  • grafana

Summary

Hello,

My name is Sreeja and I represent TestingXperts Inc. TestingXperts is a Specialist QA & Software Testing Company, and an Independent Software Testing division of Damco Group, which is a leading IT Solutions and Services company working with Fortune Enterprises globally. Inheriting the virtues of job quality and optimal user satisfaction from Damco Group, TestingXperts aims at promoting the ethics of connected innovation, thereby seeding the integral values in our employees and achieving unmatched contentment in our clients. To learn more about TestingXperts Inc., please visit our website.

If you are interested in the opportunity listed below, please forward your updated resume along with your current contact information, or recommend someone who may be interested in this position.

AWS Observability or Grafana Architect

Location: Warren, NJ (Onsite)

We are seeking a highly skilled AWS Observability Architect with deep, hands-on expertise in designing and implementing enterprise-grade observability platforms on AWS - with Grafana as the primary observability tool and OpenTelemetry as the instrumentation standard. This is a technical specialist role requiring genuine implementation experience, not platform familiarity.

The ideal candidate has personally architected and delivered large-scale observability solutions for production AWS environments - building telemetry pipelines, designing dashboards that operations teams actually use, and creating alerting frameworks that reduce MTTR rather than add noise. You understand the full observability stack: from application instrumentation with OpenTelemetry SDKs through to Grafana dashboards consumed by SREs, on-call engineers, and engineering leadership.

This role sits at the intersection of cloud infrastructure, software engineering discipline, and operational excellence - requiring someone who can design an enterprise observability architecture in the morning, write a Grafana dashboard query in the afternoon, and advise a development team on OpenTelemetry instrumentation strategy the next day.

Key Responsibilities

Observability Architecture & Strategy

  • Define and own the enterprise observability architecture for AWS environments - establishing the target-state design across the four pillars of observability: metrics, logs, traces, and events.
  • Design end-to-end telemetry pipelines - from instrumentation at the application and infrastructure layer through collection, processing, storage, and visualisation - with Grafana as the enterprise observability platform.
  • Develop observability standards and reference architectures - defining how AWS workloads across compute (EC2, EKS, ECS, Lambda), storage, networking, and managed services should be instrumented, collected, and visualised consistently across the organisation.
  • Establish signal-to-noise discipline across the observability platform - designing alerting frameworks that surface actionable signals, eliminate false positives, and ensure on-call engineers are alerted only when human intervention is genuinely required.
  • Define observability maturity roadmaps for client environments - assessing current-state coverage, identifying gaps, and building a phased improvement plan from reactive monitoring to proactive, AIOps-ready observability.
  • Drive FinOps for observability - optimising telemetry data volumes, retention policies, and Grafana Enterprise licensing costs to ensure the observability platform itself does not become a significant cost centre.

Grafana Enterprise Implementation

  • Architect, deploy, and operate Grafana Enterprise or Grafana SaaS as the primary observability platform - including high-availability Grafana deployment on AWS (EKS-based or managed via Grafana Cloud), data source federation, RBAC configuration, and enterprise plugin management.
  • Design and implement Grafana data source integrations across the AWS observability ecosystem:
    • Amazon CloudWatch - metrics, logs, and alarms as a core AWS data source
    • Grafana Mimir - for scalable, long-term Prometheus-compatible metrics storage
    • Grafana Loki - for cost-efficient, label-based log aggregation at scale
    • Grafana Tempo - for distributed tracing storage and trace-to-log-to-metric correlation
    • Amazon Managed Service for Prometheus (AMP) - for AWS-native Prometheus metrics
    • Amazon OpenSearch Service - for log analytics and full-text search use cases
    • Elasticsearch / OpenSearch - for existing log infrastructure integration
  • Build and maintain a Grafana dashboard library - covering infrastructure health, application performance, SLO/SLA tracking, capacity planning, cost visibility, incident response, and executive reporting - using reusable, variable-driven, and consistently styled templates.
  • Implement Grafana alerting at enterprise scale - including alert routing, notification policies, silence management, and integration with PagerDuty, OpsGenie, ServiceNow, and Slack for multi-channel incident notification.
  • Configure Grafana RBAC and team structures - designing role hierarchies, folder permissions, and data source access controls that enable self-service dashboarding for development teams while protecting sensitive operational data.
  • Deploy and manage Grafana OnCall for on-call scheduling and alert routing, or integrate Grafana alerting with existing incident management platforms.
  • Implement Grafana SLO (Service Level Objectives) - defining, tracking, and reporting error budgets across production services, enabling data-driven reliability decisions.
  • Manage Grafana as code - using Grafana's provisioning capabilities (YAML/JSON), Terraform provider, and Grizzly/Grafonnet for dashboard version control, environment promotion, and GitOps-based dashboard management.

OpenTelemetry Implementation

  • Define and lead the organisation's OpenTelemetry (OTel) instrumentation strategy - establishing standards for automatic and manual instrumentation across application stacks running on AWS.
  • Design and deploy the OpenTelemetry Collector as the central telemetry processing layer - including:
    • Collector deployment patterns: agent (DaemonSet on EKS), gateway (centralised), and sidecar configurations
    • Receiver configuration - OTLP, Prometheus, Jaeger, Zipkin, AWS X-Ray, CloudWatch, Fluent Bit
    • Processor pipeline design - batch processing, memory limiting, attribute enrichment, tail-based sampling, and resource detection processors
    • Exporter configuration - routing telemetry to Grafana Mimir (metrics), Grafana Loki (logs), Grafana Tempo (traces), AMP, and CloudWatch
  • Instrument AWS workloads with OpenTelemetry SDKs across languages (Java, Python, Node.js, Go) - including auto-instrumentation for containerised EKS workloads, Lambda instrumentation using OTel Lambda layers, and ECS task definition instrumentation.
  • Implement distributed tracing using OpenTelemetry - establishing trace propagation standards across microservices, configuring context propagation (W3C TraceContext, B3), and ensuring end-to-end trace visibility from frontend to backend to database.
  • Design OTel-based log correlation - enriching logs with trace IDs and span IDs to enable trace-to-log navigation in Grafana, supporting faster RCA during incidents.
  • Implement OTel-based metric instrumentation - defining custom business and application metrics alongside system metrics, following OTel semantic conventions for consistent metric naming and attribute tagging across services.
  • Define sampling strategies for distributed traces - including head-based sampling for development environments and tail-based sampling (via OTel Collector) for production environments, balancing observability coverage with storage cost.
  • Manage OTel Collector as infrastructure - including horizontal scaling, resource limits, high-availability deployment, collector health monitoring, and pipeline performance optimisation.

AWS Observability Services Integration

  • Design the integration architecture between AWS-native observability services and Grafana - positioning Grafana as the unified observability plane while leveraging AWS-native services as data sources:
    • Amazon CloudWatch - metrics, logs, alarms, dashboards, Contributor Insights, and Synthetics
    • Amazon Managed Grafana (AMG) - evaluating and advising on AMG vs self-managed Grafana deployment decisions
    • Amazon Managed Service for Prometheus (AMP) - remote write from OTel Collector and Prometheus agents, recording rules, and Alertmanager integration
    • AWS X-Ray - ingesting X-Ray traces into Grafana Tempo or directly via Grafana X-Ray data source
    • AWS CloudTrail - audit log integration for security and compliance observability
    • VPC Flow Logs - network observability integration for security monitoring and traffic analysis
  • Implement infrastructure-level observability for core AWS services - EC2 (CloudWatch agent, Node Exporter via OTel), EKS (kube-state-metrics, cAdvisor, OTel DaemonSet), RDS (Enhanced Monitoring, Performance Insights), Lambda (OTel Lambda layer, custom metrics), and API Gateway (access logs, CloudWatch metrics).
  • Design business and synthetic monitoring - implementing Grafana Synthetic Monitoring or CloudWatch Synthetics for endpoint availability, API health, and user journey monitoring with Grafana alerting integration.

Delivery & Enablement

  • Lead observability implementation projects end-to-end - from requirements gathering and architecture design through deployment, dashboard development, alert tuning, and team enablement.
  • Conduct observability maturity assessments for client environments - evaluating current monitoring coverage, tool sprawl, alert quality, and SLO definition maturity, and producing prioritised remediation roadmaps.
  • Develop and deliver observability enablement workshops for engineering and operations teams - covering OTel instrumentation, Grafana dashboard development, alert design, and on-call best practices.
  • Produce observability architecture documentation - reference architectures, runbooks, onboarding guides, and dashboard documentation that enable teams to self-serve and maintain the platform.
  • Advise on observability tool consolidation - helping organisations rationalise fragmented monitoring estates (Datadog, New Relic, Splunk, Nagios, Zabbix) toward a unified Grafana + OTel platform, including migration planning and cost impact analysis.

Experience

  • 10+ years of overall experience in cloud infrastructure, platform engineering, or DevOps.
  • 5+ years of hands-on AWS experience in production environments - not advisory or oversight roles.
  • 3+ years of hands-on Grafana Enterprise or SaaS implementation experience - designing, deploying, and operating Grafana at enterprise scale, including the LGTM stack (Loki, Grafana, Tempo, Mimir).
  • Proven experience implementing OpenTelemetry in production environments - including OTel Collector deployment, SDK-based instrumentation, and distributed tracing implementation.
  • Demonstrated experience building production-grade observability pipelines - from instrumentation through collection, processing, storage, and visualisation.
  • Hands-on experience with PromQL for metrics querying and alerting - including complex queries, recording rules, and alert expression design.
  • Experience with LogQL (Grafana Loki) for log querying and log-based alerting.
  • Hands-on experience deploying observability infrastructure on Kubernetes (EKS) - including Prometheus Operator, OTel DaemonSets, Grafana deployment, and persistent storage configuration.
  • Experience with Grafana as code - provisioning dashboards, data sources, and alert rules via YAML, Terraform, or Grafonnet.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10383634
  • Position Id: 2026-35143/24140