Job Description
ECS is seeking a Site Reliability Engineer (SRE) / Operations Engineer to work in our Arlington, VA office or remotely.
The Site Reliability Engineer (SRE) / Operations Engineer is responsible for ensuring the reliability, availability, performance, and operational efficiency of enterprise applications and their supporting infrastructure. This role bridges software engineering and IT operations by applying engineering practices, automation, and monitoring to maintain stable systems and rapidly resolve operational issues. The SRE/Ops Engineer works closely with development, security, and platform teams to support system deployments, manage incidents, improve observability, and implement resilient architectures that support continuous delivery and mission-critical operations.
Responsibilities
- Maintain the reliability, availability, and performance of production systems and cloud-based services.
- Monitor system health using observability tools (metrics, logs, and tracing) and respond to alerts and incidents.
- Participate in incident response, troubleshooting, and root cause analysis to restore service and prevent recurrence.
- Implement automation and infrastructure-as-code to improve operational efficiency and reduce manual intervention.
- Support deployment pipelines and release management processes to enable reliable and repeatable software delivery.
- Collaborate with development teams to improve application resiliency, scalability, and operational readiness.
- Develop and maintain operational runbooks, standard operating procedures, and system documentation.
- Manage system capacity planning, performance tuning, and scaling strategies.
- Ensure systems comply with security, compliance, and organizational operational standards.
- Contribute to continuous improvement initiatives by identifying opportunities to reduce operational risk and technical debt.
Salary Range: $145,000 - $180,000
Required Skills
- U.S. Citizenship
- Ability to obtain at minimum a Public Trust suitability designation.
- Bachelor's degree in Computer Science, Engineering, Information Technology, Information Systems, or a related field
- Minimum of seven (7) years of related experience
Desired Skills
- Experience supporting production systems in cloud or hybrid environments (e.g., AWS).
- Proficiency with monitoring and observability tools (e.g., Splunk, Dynatrace, AWS Red Hat Console).
- Experience with infrastructure automation and configuration management tools (e.g., Red Hat Satellite Server, Red Hat OpenShift 4).
- Familiarity with CI/CD pipelines and DevOps practices using tools such as GitLab.
- Strong troubleshooting skills across application, infrastructure, and networking layers.
- Experience with containerization and orchestration technologies (e.g., Kubernetes).
- Knowledge of Linux/Unix system administration and scripting (e.g., Python, Bash, or similar).
- Understanding of reliability engineering principles such as service level objectives (SLOs), error budgets, and incident management.
- Ability to work collaboratively in cross-functional teams supporting Agile or DevSecOps environments.
- Strong written and verbal communication skills to document processes and coordinate during operational events.
ECS is an equal opportunity employer and does not discriminate or allow discrimination on the basis of any characteristic protected by law. All qualified applicants will receive consideration for employment without regard to disability, status as a protected veteran, or any other status protected by applicable federal, state, or local jurisdiction law.
ECS is a leading mid-sized provider of technology services to the United States Federal Government. We are focused on people, values and purpose. Every day, our 3200+ employees focus on providing their technical talent to support the Federal Agencies and Departments of the US Government to serve, protect and defend the American People.