Senior DevOps Engineer

  • Boston, MA
  • Posted 5 days ago | Updated 7 hours ago

Overview

Remote
On Site
Full Time

Skills

Pivotal
DXP
Cloud Computing
High Availability
DevOps
IaaS
Microsoft Azure
Google Cloud Platform
Google Cloud
Provisioning
Terraform
Amazon Web Services
Collaboration
Continuous Integration
Continuous Delivery
GitLab
Capacity Management
Scripting
Backup
Disaster Recovery
Data Integrity
FedRAMP
Real-time
Root Cause Analysis
Database
Incident Management
Performance Tuning
Auditing
Reliability Engineering
Knowledge Management
Leadership
Lifecycle Management
Management
Application Lifecycle Management
Knowledge Base
Regulatory Compliance
Service Level
Communication
Documentation
Knowledge Transfer
Training
Supervision
Virtual Team

Job Details

Job Title: Senior DevOps Engineer
Location: 100% Remote
Job Description:
Summary:


The Senior DevOps Engineer, a key member of the EIT DevOps Team, is responsible for the staging and production infrastructure of Digital Services within the federal sector. This role is pivotal in managing and optimizing staging and production deployment environments across Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure.

Core responsibilities include provisioning and maintaining secure, scalable, and robust cloud infrastructure for the InSight DXP Platform. The Senior DevOps Engineer will apply extensive knowledge of cloud services and DevOps best practices to ensure application efficiency, high availability, and performance.

Additionally, this role involves creating and maintaining FedRAMP controls and documentation compliance. The Senior DevOps Engineer will execute automation pipelines, upgrade infrastructure, troubleshoot complex issues, and contribute to the ongoing enhancement of deployment processes. Close collaboration with development, operations, and other EIT teams is crucial for delivering seamless and reliable solutions.

Core Responsibilities:
  • Cloud Infrastructure Management: Deploy, manage, and maintain cloud infrastructure across AWS, Azure, and/or Google Cloud Platform, ensuring compliance for government workloads.
  • Infrastructure Automation: Automate infrastructure provisioning using Infrastructure as Code (IaC) tools like Terraform, OpenTofu, or AWS CloudFormation.
  • Deployment Pipeline Streamlining: Collaborate with development teams to streamline CI/CD pipelines using tools such as GitLab and OpenTofu for efficient infrastructure and application delivery.
  • Performance Optimization: Monitor system performance, participate in capacity planning, and optimize application and infrastructure performance by tuning configurations and identifying bottlenecks.
  • Automation Development: Develop scripts and tools to automate routine operations, including patching, scaling, and monitoring.
  • Self-Healing Systems: Design and implement self-healing systems that proactively detect and resolve faults.
  • Data Integrity & Availability: Manage backup and disaster recovery strategies to ensure data integrity and availability across environments.
  • Security & Compliance: Perform regular security audits and vulnerability patching, adhering to government compliance requirements (e.g., FedRAMP, NIST).

Incident Management & Observability:
  • Real-time Incident Resolution: Respond to and resolve infrastructure incidents and outages in real time, minimizing disruption.
  • Root Cause Analysis (RCA): Conduct RCA for production issues and implement long-term corrective actions.
  • On-Call Participation: Participate in an on-call rotation, escalating and coordinating responses to high-severity issues.
  • Incident Documentation: Document incidents, responses, and postmortems to capture lessons learned.
  • Complex Problem Diagnosis: Diagnose complex infrastructure and application problems, including database performance issues, latency, and service connectivity challenges.
  • Comprehensive Logging & Telemetry: Ensure comprehensive logging and telemetry to support incident response, performance tuning, and auditing.
  • Observability Improvements: Drive observability improvements by collaborating with Engineering and Platform teams to enhance system reliability and traceability.

Application & Knowledge Management:
  • Application Incident Leadership: Lead resolution efforts for application-level incidents, ensuring coordinated response across teams.
  • Application Lifecycle Management: Oversee application lifecycle management, including version upgrades, security patches, and regional rollouts.
  • Knowledge Base Contribution: Contribute to a shared knowledge base, documenting recurring issues and resolution steps.
  • Scaling Strategies: Support scaling strategies to meet regional demand, ensuring infrastructure resilience and compliance with service-level objectives (SLOs).
  • Strong written and verbal communication skills, with the ability to clearly document procedures, incidents, and solutions.
  • Effective at producing support documentation and conducting knowledge transfer or training sessions.
  • Demonstrated ability to work independently with minimal supervision in a fast-paced, collaborative, and globally distributed team.
  • A motivated, proactive mindset with a commitment to delivering high-quality, secure, and reliable systems.