Job Description
We are seeking a highly skilled and experienced Senior Data Analyst with deep expertise in data analysis, metric modeling, PromQL, SQL, and Grafana dashboard engineering. The ideal candidate will be exceptional at transforming raw operational and observability data into meaningful insights, building intuitive and interactive dashboards, and enabling data-driven decision-making across engineering and leadership teams.
This role requires strong analytical thinking, hands-on technical skills, and a profound understanding of time-series data, monitoring ecosystems (Prometheus, Loki), and database systems.
Primary Responsibilities
Data Analysis & Insights
Analyze large volumes of system, application, and database performance metrics from Prometheus, databases, and observability pipelines.
Perform deep-dive statistical analysis, trend identification, anomaly detection, and correlation analysis across multiple data sources.
Work closely with engineering teams to translate raw metrics into actionable insights and operational recommendations.
Prometheus & PromQL Expertise
Write complex PromQL expressions leveraging aggregations, label filtering, rate functions, histogram buckets, offsets, joins (vector matching), and subqueries, and capture them in recording rules where appropriate.
Design and maintain Prometheus recording/alerting rules aligned with SLOs, SLIs, and threshold logic.
Optimize PromQL queries for performance, cardinality reduction, and efficient metric scraping.
Grafana Dashboard Engineering
Build highly interactive Grafana dashboards with drill-downs, templating variables, transformations, panel repeaters, and custom visualizations.
Create reusable dashboard frameworks aligned with organizational standards (e.g., cross-database observability blueprint, unified metric naming).
Work with Grafana data sources such as PostgreSQL, Prometheus, Loki, AWS CloudWatch, and JSON API endpoints.
Implement advanced Grafana features including:
o Annotations linked to events
o Alerting dashboards using Grafana Alerting
o Multi-region / multi-cluster comparison visualizations
o Heatmaps, stat panels, table transformations, and color-coded KPIs
SQL & Database Expertise
Write advanced SQL queries (CTEs, window functions, pivots, JSONB extraction, time-bucket analysis).
Support ad-hoc data extraction for engineers, product teams, SRE, and leadership.
Monitoring & Observability Analytics
Interpret system behavior using metrics from Prometheus, logs from Loki, and events from pipelines.
Analyze CPU, memory, I/O, lock waits, query performance, connection health, replication lag, and other resource utilization metrics.
Work with database teams to understand data patterns across MongoDB, Oracle, Postgres, MySQL, SQL Server, and cloud database platforms.
KPI & Reporting Frameworks
Define SLIs and build SLO dashboards, executive scorecards, cost and usage dashboards, incident lifecycle analytics, and MTTR/MTBF reports.
Build leadership-focused dashboards with narrative storytelling and trend summaries.
Secondary Responsibilities
Build data quality checks, audit patterns, and anomaly detection using both SQL and PromQL.
Collaborate with microservices and API teams to shape the data models surfaced to Prometheus and Grafana.
Participate in capacity planning, forecasting, and cost optimization initiatives.
Contribute to automation pipelines.
Qualifications
Expert-level proficiency in Prometheus and PromQL, including advanced functions and optimization.
Strong SQL skills (PostgreSQL preferred), including deep experience with JSONB, time-series analysis, and analytical queries.
Hands-on experience building enterprise-grade Grafana dashboards with templating, variables, transformations, repeaters, and custom UI logic.
Experience with operational metrics from databases and infrastructure (CPU, memory, storage, network, replication, queries, transactions, locks).
Ability to interpret complex system behaviors and correlate multi-dimensional observability data.
Familiarity with databases such as Oracle, MongoDB, PostgreSQL, MySQL, SQL Server, and cloud database platforms.
Understanding of logs, traces, and events; experience with Fluentd, Loki, or OpenTelemetry is a plus.
Strong analytical, statistical, and problem-solving skills.
Experience working with microservices, cloud-native systems, or SRE/DevOps teams is a strong advantage.
Excellent communication skills with the ability to explain technical insights to engineering and leadership audiences.