Site Reliability Engineer (SRE) / Operations Engineer

ARLINGTON, VA, US • Posted 1 hour ago • Updated 1 hour ago
Full Time
On-site
USD $145,000.00 - 180,000.00 per year

Job Details

Skills

  • Bridging
  • Software Engineering
  • IT Operations
  • Root Cause Analysis
  • Recovery
  • Operational Efficiency
  • Release Management
  • Collaboration
  • Scalability
  • Standard Operating Procedure
  • System Documentation
  • Management
  • Capacity Management
  • Performance Tuning
  • Regulatory Compliance
  • Continuous Improvement
  • Operational Risk
  • Computer Science
  • Information Technology
  • Information Systems
  • Cloud Computing
  • Splunk
  • Dynatrace
  • Amazon Web Services
  • Configuration Management
  • Satellite
  • Red Hat Linux
  • Continuous Integration
  • Continuous Delivery
  • DevOps
  • GitLab
  • Computer Networking
  • Orchestration
  • Kubernetes
  • Linux
  • Unix Administration
  • Scripting
  • Python
  • Bash
  • Reliability Engineering
  • Service Level
  • Budget
  • Incident Management
  • Agile
  • DevSecOps
  • Communication
  • SAP BASIS
  • Law
  • FOCUS

Summary

Job Description

ECS is seeking a Site Reliability Engineer (SRE) / Operations Engineer to work in our Arlington, VA office / remote.

The SRE / Operations Engineer is responsible for ensuring the reliability, availability, performance, and operational efficiency of enterprise applications and their supporting infrastructure. The role bridges software engineering and IT operations by applying engineering practices, automation, and monitoring to keep systems stable and resolve operational issues quickly. The SRE / Operations Engineer works closely with development, security, and platform teams to support system deployments, manage incidents, improve observability, and implement resilient architectures that support continuous delivery and mission-critical operations.

Responsibilities
  • Maintain the reliability, availability, and performance of production systems and cloud-based services.
  • Monitor system health using observability tools (metrics, logs, and tracing) and respond to alerts and incidents.
  • Participate in incident response, troubleshooting, and root cause analysis to restore service and prevent recurrence.
  • Implement automation and infrastructure-as-code to improve operational efficiency and reduce manual intervention.
  • Support deployment pipelines and release management processes to enable reliable and repeatable software delivery.
  • Collaborate with development teams to improve application resiliency, scalability, and operational readiness.
  • Develop and maintain operational runbooks, standard operating procedures, and system documentation.
  • Manage system capacity planning, performance tuning, and scaling strategies.
  • Ensure systems comply with security, compliance, and organizational operational standards.
  • Contribute to continuous improvement initiatives by identifying opportunities to reduce operational risk and technical debt.

Salary Range: $145,000 - $180,000

General Description of Benefits

Required Skills

  • U.S. Citizenship
  • Ability to obtain at minimum a Public Trust suitability designation.
  • Bachelor's degree in Computer Science, Engineering, Information Technology, Information Systems, or a related field
  • Minimum of seven (7) years of related experience


Desired Skills

  • Experience supporting production systems in cloud or hybrid environments (e.g., AWS).
  • Proficiency with monitoring and observability tools (e.g., Splunk, Dynatrace, AWS Red Hat Console).
  • Experience with infrastructure automation and configuration management tools (e.g., Red Hat Satellite Server, Red Hat OpenShift 4).
  • Familiarity with CI/CD pipelines and DevOps practices using tools such as GitLab.
  • Strong troubleshooting skills across application, infrastructure, and networking layers.
  • Experience with containerization and orchestration technologies (e.g., Kubernetes).
  • Knowledge of Linux/Unix system administration and scripting (e.g., Python, Bash, or similar).
  • Understanding of reliability engineering principles such as service level objectives (SLOs), error budgets, and incident management.
  • Ability to work collaboratively in cross-functional teams supporting Agile or DevSecOps environments.
  • Strong written and verbal communication skills to document processes and coordinate during operational events.

ECS is an equal opportunity employer and does not discriminate or allow discrimination on the basis of any characteristic protected by law. All qualified applicants will receive consideration for employment without regard to disability, status as a protected veteran, or any other status protected by applicable federal, state, or local jurisdiction law.

ECS is a leading mid-sized provider of technology services to the United States Federal Government. We are focused on people, values and purpose. Every day, our 3200+ employees focus on providing their technical talent to support the Federal Agencies and Departments of the US Government to serve, protect and defend the American People.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10112MAN
  • Position Id: 3472