Site Reliability Engineer

Overview

On Site

Full Time

Skills

Cross-functional Team

Instrumentation

Budget

User Experience

Analytics

FOCUS

Standard Operating Procedure

Collaboration

Scalability

Capacity Management

Performance Tuning

SAFE

Provisioning

Regulatory Compliance

Access Control

Management

Computer Networking

Storage

Database

Caching

Disaster Recovery

RPO

Failover

Recovery

Backup

Forecasting

DevOps

Linux

Unix Administration

Scripting Language

Python

Bash

Cloud Computing

Amazon Web Services

Microsoft Azure

Google Cloud

Google Cloud Platform

Docker

Orchestration

Kubernetes

Grafana

Incident Management

Continuous Delivery

Jenkins

GitHub

GitLab

Continuous Integration

Scripting

Conflict Resolution

Problem Solving

Debugging

Communication

Information Technology

Computer Science

IT Service Management

Service Management

ITIL

DoD

Aerospace

Job Details

This posting is for a contract assignment and is not an offer of full-time employment with Boeing. Candidates selected for roles will be employed as contract workers through a Boeing-approved third party for the duration of the specified project.

Experienced DevOps/Site Reliability Engineer

The ideal candidate will have a strong foundation in DevOps and practical experience owning and operating platform services and their underlying infrastructure to ensure the reliability, scalability, and performance of our systems. You will work closely with a cross-functional team to implement automated monitoring, incident response, capacity planning, and runbooks. You will help evolve our reliability practices, instrumentation, and error budgets while gaining hands-on experience with our production systems. This role suits those who enjoy building scalable platforms, automating end-to-end processes, and improving the overall user experience. As part of the team, you will tackle a broad range of complex tasks using modern tools and methodologies to advance our digital and analytics solutions.

Position Responsibilities

Maintain and improve the reliability, availability, and performance of production services, with a focus on reducing incident frequency and recovery/restoration time.

Design, implement, and operate monitoring, alerting, logging, and tracing solutions to provide end-to-end visibility of systems and dependencies.

Respond to and resolve production incidents, participate in post-incident reviews, and help implement corrective actions.

Build and maintain runbooks, standard operating procedures, and automation to reduce manual toil and improve operational consistency.

Collaborate with software engineers to optimize code for reliability, scalability, and resilience, and assist with capacity planning and performance tuning.

Implement and manage CI/CD pipelines and deployment strategies, including blue/green and canary release patterns, to ensure safe and rapid software delivery.

Manage infrastructure and assist with provisioning, scaling, and maintaining cloud resources.

Enforce security and compliance best practices in the production environment, including access controls, secrets management, and secure logging.

Participate in the on-call rotation and communicate clearly with stakeholders about status and risks.

Contribute to reliability-related projects, tooling, and initiatives that improve platform health and developer experience.

Regularly assess and improve the reliability and resilience of core infrastructure components (networking, storage, compute, databases, caching layers), with emphasis on redundancy, fault tolerance, and scalable failover strategies.

Participate in defining disaster recovery objectives (RPO, RTO), implement recovery capabilities (backup/restore, cross-region failover, site failover), and conduct regular exercises to validate recovery procedures.

Ensure robust backup/restore procedures, perform regular backup validation, and protect critical data across regions and environments.

Forecast growth, model failure domains, and ensure capacity buffers and scalable architectures to withstand regional outages or component failures.

Basic Qualifications (Required Skills/Experience)

Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent practical experience).

5-7 years of experience in DevOps or a related field.

Strong Linux/Unix administration skills and proficiency in at least one scripting language (e.g., Python, Bash).

Experience with cloud platforms, containers, and orchestration (AWS/Azure/Google Cloud Platform, Docker/Kubernetes).

Familiarity with containerization (Docker) and container orchestration (Kubernetes).

Experience with monitoring and observability tools (Prometheus, Grafana, ELK/EFK, OpenTelemetry).

Solid understanding of incident management processes, on-call practices, and post-mortem analysis.

Knowledge of CI/CD concepts and tooling (e.g., Jenkins, GitHub Actions, GitLab CI) and automation scripting.

Strong problem-solving, debugging, and communication skills; ability to work in a collaborative, cross-functional environment.

Preferred Qualifications (Desired Skills/Experience)

Bachelor's degree in Information Technology, Computer Science or a related field, or equivalent practical experience.

ITIL/ITSM or similar service management certification (ITIL Foundation or equivalent) is a plus.

Knowledge of DoD or government security requirements or other regulated environments is a plus.

1+ years of experience in the aerospace industry.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
