Site Reliability Engineer

Hybrid in Reston, VA, US • Posted 11 days ago • Updated 11 days ago

Full Time

No Travel Required

$160,000 - $180,000/yr

Job Details

Skills

Kubernetes
OpenShift
Azure
Platform Engineering

Summary

Site Reliability Engineer (SRE) / Platform Engineer

Location: Reston, VA (Hybrid — 2 days onsite / 3 days remote)
Employment Type: Full-time

About the Organization

Join a mission-driven, national financial services organization at the heart of the U.S. housing finance ecosystem. This is a mid-sized, highly regulated enterprise operating at market scale—supporting platforms and analytics that enable trillions of dollars in annual economic activity. You’ll work in a modern tech environment with strong engineering partners, clear business impact, and a mandate for reliability, security, and continuous improvement.

The Role

Our client is hiring a hands-on SRE / Platform Engineer to operate, tune, and scale its OpenShift/Kubernetes platforms while bridging on-prem to Azure to power its analytics ecosystem. You’ll own reliability, automation, and observability across a hybrid estate, partnering closely with developers, data engineers, infrastructure operations, and security to deliver secure, performant platform services using modern DevSecOps practices.

Why This Role Stands Out

Hybrid impact: Operate critical OpenShift clusters and manage Azure services used by data and analytics teams.
Hybrid architecture: Help design and support the bridge from on-prem to cloud—migration, integration, and steady-state operations.
Real-world scale: Reliability work that directly supports high-volume financial market operations and enterprise analytics.
Automation-first: Lean into Terraform, Ansible, and GitOps to make reliability repeatable.

What You’ll Do in the First 180 Days

Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies).
Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks.
Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible).
Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams.
Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails.
Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams.
Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement.
Advance the hybrid service model—migrations, integrations, reliability/latency tuning, cost and performance optimization.

Day-to-Day Responsibilities

Operate and optimize OpenShift/Kubernetes clusters, ingress (e.g., Nginx), and container networking/service mesh.
Manage Azure services (compute, VNet, storage, data services) supporting analytics workloads.
Build and maintain automated infrastructure with Terraform, Ansible, and GitOps workflows.
Implement and evolve observability (Datadog, Prometheus, Grafana): metrics, traces, logs, alerting, SLOs, runbooks.
Design, harden, and support delivery pipelines with ArgoCD/Jenkins/GitHub Actions.
Provide platform tooling and enablement for application developers, data engineers, and operations teams.
Ensure security and access management (HashiCorp Vault, secrets management, least privilege).
Lead incident response, coordinate cross-functional resolution, and drive corrective actions and platform improvements.
Script or develop tools in Bash, Python, or Go to eliminate toil and improve developer experience.

Tech You’ll Work With

Kubernetes / OpenShift
Azure (compute, networking, storage, and data services)
Automation & IaC: Terraform, Ansible, GitOps
Observability: Datadog, Prometheus, Grafana
Networking & Ingress: Nginx, service meshes, container networking
Messaging: Kafka, AMQ
Secrets & Access: HashiCorp Vault
CI/CD: ArgoCD, Jenkins, GitHub Actions
Scripting/Coding: Bash, Python, Go

Must-Have Qualifications

5+ years of hands-on experience operating and managing Kubernetes and OpenShift clusters.
Strong experience with Microsoft Azure (compute, networking, storage, and data services).
Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps).
Proficiency with observability tooling (Datadog, Prometheus, Grafana).
Scripting/coding ability in Bash, Python, or Go.

Preferred / Stand-Out Skills

Experience bridging on-prem and cloud in a hybrid service model (migration, integration, optimization).
Expertise with Kafka/AMQ, HashiCorp Vault, and ArgoCD/Jenkins/GitHub Actions.
Background leading incident response and postmortems with strong RCA and continuous improvement practices.

Work Model & Team

Hybrid: 2 days onsite in Reston, VA; 3 days remote.
You’ll be part of the IT organization, collaborating daily with developers, data engineers, infrastructure operations, and security.

How to Succeed in This Role

You’re a hands-on engineer who thrives in regulated, high-impact environments.
You favor automation over repetition, and observability over guesswork.
You collaborate openly, communicate clearly, and leave systems better than you found them.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 91122755
Position Id: 8919029
