Job Title: Senior Observability Engineer (ESS Platform SME)
Location: McLean, VA (onsite); in-person interview required
Job Type: C2C or W2
Role Overview:
We are seeking a highly experienced Senior Observability Engineer with deep expertise in ESS (Elastic Stack) to lead and accelerate the development of enterprise-grade observability capabilities across mission-critical applications.
This role requires a hands-on SME who can design, build, and scale observability dashboards, APM, tracing, and monitoring solutions exclusively within ESS. The candidate will play a key role in transforming current monitoring into a proactive, intelligent, and scalable observability ecosystem.
This is a high-impact, fast-paced engagement (target < 6 months) requiring ownership, technical depth, and execution excellence.
Key Responsibilities:
ESS Observability Architecture & Implementation
- Design and implement end-to-end observability solutions using ESS (Elastic Stack).
- Build a centralized observability layer covering all MF applications.
- Ensure block-level aggregation with drill-down to:
  - Application-level metrics
  - APM traces
  - Logs and events
  - Service dependencies
Dashboard Engineering (Critical Priority)
- Develop and scale a large backlog of ESS dashboards, including but not limited to:
  - Cluster Health (OCP/K8s)
  - API & APM Dashboards
  - Service Health & Dependency Monitoring
  - Pod Status / Restart / Scaling Metrics
  - HTTP Status Analytics (200/400/500 trends)
  - Transaction Processing Metrics
  - Infra Metrics (CPU, Memory, Disk, Network)
  - Synthetic Monitoring & Availability
- Build intuitive drill-down dashboards from MF → Block → Service → Application level.
APM, Tracing & Monitoring Expansion
- Expand ESS-based:
  - Application Performance Monitoring (APM)
  - Distributed tracing
  - Real User Monitoring (RUM)
  - Synthetic monitoring
- Enable end-to-end traceability across microservices.
Proactive Observability & Alerting
- Design and implement smart alerting rules:
  - Move from reactive to proactive detection
  - Reduce noise and improve signal quality
- Define SLOs, SLIs, and error budgets
- Enhance anomaly detection and trend analysis
Collaboration & Leadership
- Work closely with:
  - EOT Observability Team
  - Internal CDLs
  - Application teams
- Act as ESS Observability SME
- Provide guidance, standards, and best practices
Required Skills & Experience:
- Strong hands-on experience with ESS (Elastic Stack):
  - Elasticsearch
  - Logstash
  - Kibana
  - Beats / Elastic Agent
  - Elastic APM
- Proven experience building enterprise-scale observability dashboards in ESS
- Deep understanding of:
  - Microservices architecture
  - Kubernetes / OpenShift (OCP)
- Experience with:
  - APM, distributed tracing, logging, metrics correlation
- Ability to design multi-layer observability (infra → platform → app)
Strongly Preferred:
- Experience with:
  - Synthetic monitoring tools integrated with ESS
  - Real User Monitoring (RUM)
  - Service maps and dependency graphs
- Knowledge of:
  - CI/CD observability integration
  - Alerting frameworks within Elastic
- Scripting: Python / Shell / Groovy (nice to have)
Soft Skills:
- Strong ownership mindset
- Ability to work under aggressive timelines
- Excellent problem-solving skills
- Clear communication with technical and non-technical teams
Success Criteria (First 3-6 Months):
- Deliver enterprise-grade ESS observability dashboards
- Achieve full MF application visibility
- Implement end-to-end APM + tracing coverage
- Establish proactive alerting framework
Additional Notes:
- Candidate must be an ESS expert; experience with alternative tools alone will not be sufficient.
- This is a high-priority, business-critical role with immediate impact expectations.