Elastic Site Reliability Engineer

Overview

On Site
Full Time

Skills

CISA
Normalization
Visualization
COTS
Software Configuration
Critical Thinking
Cyber Security
Regulatory Compliance
Service Level
Storage
Scripting
Computer Cluster Management
Onboarding
Forecasting
Scalability
Gap Analysis
Optimization
Collaboration
DevOps
Documentation
Communication
Problem Solving
Conflict Resolution
Attention To Detail
Technical Writing
Work Ethic
SLA
Splunk
MongoDB
Replication
Query Optimization
Management
JavaScript
Disaster Recovery
Machine Learning (ML)
KPI
Kibana
Dashboard
Real-time
Programming Languages
Bash
Python
Elasticsearch
EOD
Software Development
SAFe
Agile
Continuous Delivery
Jenkins
GitLab
Continuous Integration
Configuration Management
Ansible
Progress Chef
Puppet
Cloud Computing
Migration
SIEM
Pipeline Management
Amazon Web Services

Job Details

Job Description

ECS is seeking an Elastic Site Reliability Engineer to work in our Fairfax, VA office.

ECS is seeking talented professionals to join our successful and growing team in building the next-generation Continuous Diagnostics and Mitigation (CDM) Cyber data solution. The CDM Program is the Cybersecurity and Infrastructure Security Agency's (CISA) dynamic approach to strengthening the cybersecurity of Federal networks and systems through better awareness and visibility into their security posture and cyber threats. ECS is responsible for designing, building, deploying, operating, and maintaining a complete 'Data Services' solution that includes the collection, normalization, visualization, and sharing of cyber data from more than 100 Federal agencies. The CDM Data Services product is an integrated suite of multiple Commercial Off-the-Shelf (COTS) products, software configuration packages, and custom code that work together as an integrated solution tailored to meet Department of Homeland Security (DHS) requirements.

We are seeking professionals who thrive in a dynamic, fast-paced, and highly collaborative environment where problem-solving, critical thinking, and a holistic approach to serving the mission are key. Our program operates within the Scaled Agile Framework (SAFe). An aptitude for and enthusiasm about continuous learning, improvement, and cyber security are a must!

ECS is currently seeking a skilled Elastic Site Reliability Engineer (SRE) to support the Department of Homeland Security (DHS) Continuous Diagnostics and Mitigation (CDM) SIEM as a Service (SIEMaaS) project. The CDM SIEMaaS project provides SIEM platform and integration services to participating agencies so that they can focus their security efforts on operationalizing their SIEM. The Elastic SRE will focus on maintaining and optimizing Elastic deployments in Elastic Cloud Hosted (ECH) and will ensure effective monitoring of cluster health, availability, performance, and cost.

The ideal Elastic SRE candidate must be able to work independently and proactively to find solutions, as well as within a dynamic team structure to achieve program objectives. This person will primarily:
  • Monitor and maintain the health, uptime, and availability of Elastic deployments in Elastic Cloud Hosted (ECH) using an Elastic logging/observability cluster, ensuring compliance with service-level agreements (SLAs) and service-level objectives (SLOs).
  • Analyze and optimize cluster performance (e.g., indexing, search latency, resource utilization) to meet business and tenant requirements.
  • Implement cost optimization strategies (e.g., right-sizing nodes, optimizing storage tiers) to reduce operational costs while maintaining performance and reliability.
  • Support Elastic SIEM Engineers in troubleshooting service degradation that impacts SLAs or SLOs.
  • Develop and maintain automation scripts and tools (e.g., via ECH APIs, Python) for cluster management and tenant onboarding to reduce manual effort.
  • Forecast resource needs and plan cluster scaling within ECH to support growth in data volume and query load, ensuring scalability and resilience.
  • Conduct gap analyses for prospective tenants' Elastic environments to assess health, stability, adherence to Elastic best practices, and optimization opportunities, providing actionable recommendations.
  • Collaborate with development, DevOps, and SIEM Engineers to align Elastic configurations with application needs and business objectives.
  • Create and maintain comprehensive documentation for cluster configurations and monitoring processes.


Required Skills

  • Excellent written and verbal communication skills, effective interpersonal skills, strong organizational skills, problem-solving ability, attention to detail, technical documentation skills, and a strong, proactive, self-motivated work ethic.
  • Must consistently seek to improve quality and efficiency.
  • Must be flexible and able to thrive in an evolving environment.
  • Must apply a proactive mindset to detecting potential SLA and SLO breaches.
  • Minimum of 3 years' experience deploying, managing, and monitoring Elastic-based tooling (e.g., Elastic Stack, Logstash, Beats, Elastic Agent).
  • Minimum of 3 years' experience managing Elastic or similar technologies (e.g., Splunk, MongoDB), including sharding, replication, and query optimization.
  • Minimum of 3 years' experience managing and/or monitoring highly available, fault-tolerant platforms.
  • Experience implementing synthetic monitors using Playwright and JavaScript.
  • Experience applying disaster recovery best practices and conducting tabletop exercises.
  • Ability to implement FinOps practices that optimize cost according to industry best practices, and an understanding of cost drivers.
  • Experience leveraging Machine Learning for monitoring key performance indicators (KPIs) that may impact SLAs and SLOs.
  • Experience creating Kibana visualizations, dashboards and alerts for real-time and historical insights.
  • Experience using REST APIs from scripting and programming languages such as Bash and Python.
  • Knowledge of Elastic Common Schema or similar schemas (e.g., OTel, CIM, CEF, OCSF).
  • Proficiency with Elasticsearch cross-cluster search (CCS).


Desired Skills

  • Bachelor's degree.
  • Active DHS Suitability/Entry on Duty (EOD) is a plus.
  • Experience supporting ELK deployments and implementations.
  • Experience and proficiency working within the Software Development Life Cycle (SDLC), and working knowledge of methodologies/frameworks such as SAFe Agile.
  • Experience with CI/CD tools (Jenkins, GitLab CI) and/or automated configuration management tools and playbooks (Ansible, Chef, Puppet, SaltStack). Highly desirable: experience with Elastic and Elastic Cloud Hosted REST APIs.
  • Experience migrating to an Elastic SIEM from other SIEM platforms.
  • Proven track record of optimizing distributed systems for performance and cost, including ingest pipeline management.
  • Holds a valid Elastic-issued certification. Highly desirable: Elastic Certified Engineer or Elastic Certified Observability Engineer.
  • Holds a valid AWS certification or has equivalent AWS experience. Highly desirable: AWS Certified Solutions Architect or equivalent.

ECS is an equal opportunity employer and does not discriminate or allow discrimination on the basis of any characteristic protected by law. All qualified applicants will receive consideration for employment without regard to disability, status as a protected veteran, or any other status protected by applicable federal, state, or local law.

ECS is a leading mid-sized provider of technology services to the United States Federal Government. We are focused on people, values and purpose. Every day, our 3800+ employees focus on providing their technical talent to support the Federal Agencies and Departments of the US Government to serve, protect and defend the American People.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.