Site Reliability Engineer

Sterling, VA, US • Posted 60+ days ago • Updated 7 hours ago

Full Time

On-site

Job Details

Skills

Spectrum
System Integration
Research
Surveillance
Software Modernization
Security Clearance
IT Strategy
Log Analysis
Performance Tuning
Patch Management
Testing
Systems Analysis
IT Operations
Interfaces
ROOT
Optimization
Computer Hardware
Provisioning
Leadership
Network
Business Process
Service Operations
Workflow
Incident Management
LSA
Content Management
Change Management
Training
OM
IT Service Management
Reporting
Issue Resolution
Tier 2
Python
Java
JavaScript
Unix Administration
Linux
Unix
Operating Systems
Command-line Interface
System Administration
Computer Networking
Network Protocols
Database Administration
Database
Configuration Management
Ansible
Puppet
Progress Chef
Orchestration
Analytical Skill
Conflict Resolution
Problem Solving
Communication
Docker
Kubernetes
DevOps
Service Level
Management
Data Analysis
Visualization
Reliability Engineering
Amazon Web Services
Google Cloud Platform
Google Cloud
Cloud Computing
Microsoft Azure
Collaboration
Teamwork
Innovation
Cyber Security

Summary

Nightwing provides technically advanced full-spectrum cyber, data operations, systems integration and intelligence mission support services to meet our customers' most demanding challenges. Our capabilities include cyber space operations, cyber defense and resiliency, vulnerability research, ubiquitous technical surveillance, data intelligence, lifecycle mission enablement, and software modernization. Nightwing brings disruptive technologies, agility, and competitive offerings to customers in the intelligence community, defense, civil, and commercial markets.

Job Title: Site Reliability Engineer
Location: Sterling, VA
Clearance: TS/SCI Poly

**This position is CONTINGENT upon contract award**

The Site Reliability Engineer (SRE) works closely with contract leadership, Platform teams, and the Sponsor to refine the operational and technical strategy for automating key portions of IT operations and enabling the Product team (Platform) to bring new software or new features to production as quickly as possible. The SRE executes and analyzes manual IT operations/admin tasks (log analysis, performance tuning, patch management, testing, and incident response) and converts them to automated tasks. The SRE works with the Platform, Network, and Data Operations teams to assist in deployment planning and system onboarding, and assists with monitoring, system analysis, and IT operations support. Daily tasks include, but are not limited to:

Work with the Sponsor, Mission partners, and technical personnel to deliver a robust, scalable operations architecture that meets the customer's goals for the enterprise.
Analyze, define, and document requirements for data, workflow, logical processes, hardware and operating system environments, network connectivity, other system interfaces, internal and external checks and controls, and outputs.
Monitor and track metrics, logs and traces across all services in the system/network and provide context for identifying root causes in the event of an incident, performance degradation, or availability issue.
Perform network/cloud optimization and resilience planning.
Develop capabilities to automate hardware/software provisioning, monitoring, patching, and troubleshooting.
Collaborate with and assist the Platform team and leadership on network and security health, including intrusions or inappropriate activities.
Optimize business processes, workflows, and service operations by building efficient on-call processes and streamlining alerting workflows.
Leverage operational data to automate systems administration, operations, and incident response processes, improving enterprise reliability and managing IT environment complexity.
Work with the LSA, Lab Manager, and CM to compose technical documents, including Design, Deployment, System specifications and Host Nation baselines, updates, user's manuals, training materials, installation guides, proposals, and reports.
Work with the OM to implement ITSM best practices for ICA/Service discrepancy and reporting, issue resolution, and operations support, including Tier 2/3 escalation.

Required Skills:

Programming: Proficiency in at least one programming language (e.g., Python, Go, Java, or JavaScript) is essential for automating tasks and developing tools.
Linux/Unix Systems Administration: Strong knowledge of Linux/Unix operating systems, including command-line tools and system administration tasks.
Networking: Understanding of network protocols, infrastructure, and troubleshooting techniques.
Database Management: Familiarity with database technologies and principles.
Automation: Experience with automation tools and techniques, such as configuration management (e.g., Ansible, Puppet, Chef) and orchestration (e.g., Kubernetes).
Monitoring and Logging: Experience with monitoring tools and logging systems.
Problem-Solving: Strong analytical and problem-solving skills to diagnose and resolve system issues.
Communication: Ability to communicate technical information clearly and concisely to both technical and non-technical audiences.
Collaboration: Ability to work effectively with cross-functional teams, including software developers and operations personnel.

Desired Skills:

Cloud Technologies: Experience with cloud platforms (e.g., AWS, Google Cloud, Azure).
Containerization: Knowledge of containerization technologies (e.g., Docker, Kubernetes).
DevOps Principles: Understanding of DevOps principles and practices.
Service Level Objectives (SLOs) and Service Level Agreements (SLAs): Experience with defining, tracking, and managing SLOs and SLAs.
Data Analysis: Experience with data analysis and visualization tools.

Desired Certs:

Global Skill Development Council (GSDC) Site Reliability Engineering (SRE) Foundation Certification (CSREF).
AWS Certified SysOps Administrator - Associate.
Google Cloud Certified Professional Cloud Architect.
Microsoft Certified: Azure Solutions Architect Expert.

At Nightwing, we value collaboration and teamwork. You'll have the opportunity to work alongside talented individuals who are passionate about what they do. Together, we'll leverage our collective expertise to drive innovation, solve complex problems, and deliver exceptional results for our clients.

Thank you for considering joining us as we embark on this new journey and shape the future of cybersecurity and intelligence together as part of the Nightwing team.

Nightwing is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, age, or any other federally protected class.


Dice Id: 91159926
Position Id: 2152f0802fcd940047a1859aaca7c15b
Posted 30+ days ago

