Overview
On Site
Full Time
Skills
Cross-functional Team
Instrumentation
Budget
User Experience
Analytics
FOCUS
Standard Operating Procedure
Collaboration
Scalability
Capacity Management
Performance Tuning
SAFE
Provisioning
Regulatory Compliance
Access Control
Management
Computer Networking
Storage
Database
Caching
Disaster Recovery
RPO
Failover
Recovery
Backup
Forecasting
DevOps
Linux
Unix Administration
Scripting Language
Python
Bash
Cloud Computing
Amazon Web Services
Microsoft Azure
Google Cloud
Google Cloud Platform
Docker
Orchestration
Kubernetes
Grafana
Incident Management
Continuous Delivery
Jenkins
GitHub
GitLab
Continuous Integration
Scripting
Conflict Resolution
Problem Solving
Debugging
Communication
Information Technology
Computer Science
IT Service Management
Service Management
ITIL
DoD
Aerospace
Job Details
This posting is for a contract assignment and is not a full-time employment offer with Boeing. Candidates selected for roles will be employed as contract workers through a Boeing-approved third party for the duration of the specified project.
Experienced DevOps/Site Reliability Engineer
The ideal candidate will possess a strong foundation in DevOps and practical experience in owning and operating platform services and underlying infrastructure to help ensure the reliability, scalability, and performance of our systems. You will work closely with a cross-functional team to implement automated monitoring, incident response, capacity planning, and runbooks. You will contribute to the evolution of our reliability practices, instrumentation, and error budgets, while gaining hands-on experience with our production systems. This role suits those who enjoy building scalable platforms, automating end-to-end processes, and improving the overall user experience. As part of the team, you will tackle a broad range of complex tasks using modern tools and methodologies, contributing to the evolution of our digital and analytics solutions.
Position Responsibilities
Maintain and improve the reliability, availability, and performance of production services, with a focus on reducing incident frequency and recovery/restoration time.
Design, implement, and operate monitoring, alerting, logging, and tracing solutions to provide end-to-end visibility of systems and dependencies.
Respond to and resolve production incidents, participate in post-incident reviews, and help implement corrective actions.
Build and maintain runbooks, standard operating procedures, and automation to reduce manual toil and improve operational consistency.
Collaborate with software engineers to optimize code for reliability, scalability, and resilience, and assist with capacity planning and performance tuning.
Implement and manage CI/CD pipelines, deployment strategies, and blue/green and canary release patterns to ensure safe and rapid software delivery.
Manage infrastructure and assist with provisioning, scaling, and maintaining cloud resources.
Enforce security and compliance best practices in the production environment, including access controls, secrets management, and secure logging.
Participate in the on-call rotation and communicate clearly with stakeholders about status and risks.
Contribute to reliability-related projects, tooling, and initiatives that improve platform health and developer experience.
Infrastructure reliability and resilience: regularly assess and improve the reliability of core infrastructure components (networking, storage, compute, databases, caching layers) with emphasis on redundancy, fault tolerance, and scalable failover strategies.
Participate in defining disaster recovery objectives (RPO, RTO), implement recovery capabilities (backup/restore, cross-region failover, site failover), and conduct regular exercises to validate recovery procedures.
Ensure robust backup/restore procedures, perform regular backup validation, and protect critical data across regions and environments.
Forecast growth, model failure domains, and ensure capacity buffers and scalable architectures to withstand regional outages or component failures.
Basic Qualifications (Required Skills/Experience)
Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent practical experience).
5-7 years of experience in DevOps or a related field.
Strong Linux/Unix administration skills and proficiency in at least one scripting language (e.g., Python, Bash).
Experience with cloud platforms (AWS, Azure, or Google Cloud Platform).
Experience with containerization (Docker) and container orchestration (Kubernetes).
Experience with monitoring and observability tools (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
Solid understanding of incident management processes, on-call practices, and post-mortem analysis.
Knowledge of CI/CD concepts and tooling (e.g., Jenkins, GitHub Actions, GitLab CI) and automation scripting.
Strong problem-solving, debugging, and communication skills; ability to work in a collaborative, cross-functional environment.
Preferred Qualifications (Desired Skills/Experience)
Bachelor's degree in Information Technology, Computer Science or a related field, or equivalent practical experience.
ITIL/ITSM or similar service management certification (ITIL Foundation or equivalent) is a plus.
Knowledge of DoD or government security requirements or other regulated environments is a plus.
1+ years of experience in the aerospace industry.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.