Overview
On Site
Accepts corp to corp applications
Contract - W2
Contract - Independent
Contract - to 07/10/2026
Skills
Kubernetes
SRE
Prod Support
Datadog
Vault
Job Details
Sr Site Reliability Engineer (SRE), Kubernetes, Datadog, Vault, Prod Support 12+ Mths Cont Alpharetta, GA
JPC- # 3488
Location: Alpharetta, GA (Hybrid, 4 days a week onsite - Locals Needed)
Duration: 12+ month contract
Description:
Sr Site Reliability Engineer, Kubernetes, Datadog, HashiCorp Vault, Infra security, Production Support 12+ Mths Cont Alpharetta, GA
About the Role
We're seeking an experienced Senior Site Reliability Engineer to join our team and play a critical role in ensuring the reliability, scalability, and performance of our cloud infrastructure. You'll be a technical leader who combines deep operational expertise with strong automation skills to build and maintain highly available systems. As a Kubernetes expert, you'll drive our container orchestration strategy and serve as a technical authority for our platform teams.
Key Responsibilities
Infrastructure & Automation
Design, deploy, and manage cloud infrastructure across AWS and Azure using Terraform and infrastructure-as-code (IaC) principles
Architect, deploy, and maintain production-grade Kubernetes clusters with a focus on reliability, security, and performance
Serve as the subject matter expert on Kubernetes, providing guidance and best practices to engineering teams
Build and maintain automated provisioning pipelines to ensure consistent, repeatable deployments
Implement and maintain HashiCorp Vault on AWS for secrets management and security, including Vault integration with Kubernetes
Design and implement automated High Availability and Disaster Recovery (HA/DR) capabilities through CI/CD pipelines
Optimize cloud resources and Kubernetes workloads for performance, cost efficiency, and reliability
Observability & Monitoring
Architect and implement comprehensive observability solutions using Datadog for cloud-native applications and Kubernetes infrastructure
Build monitoring, logging, and alerting frameworks for containerized workloads that provide actionable insights into system health
Implement Kubernetes-native monitoring patterns and troubleshoot complex container orchestration issues
Integrate Datadog with PagerDuty and other incident management platforms
Define and track SLIs, SLOs, and error budgets to drive reliability improvements
Create custom dashboards and monitors to track infrastructure, application, and Kubernetes cluster performance
CI/CD & Pipeline Management
Design, build, and maintain robust CI/CD pipelines that enable rapid, safe deployments to Kubernetes
Implement GitOps workflows and automated deployment strategies for containerized applications
Implement automated testing, security scanning, and quality gates within pipelines
Drive solutions through test, QA, and production environments with appropriate controls and safeguards
Automate deployment strategies including blue-green, canary, and rolling deployments in Kubernetes
Security & Vulnerability Management
Identify, assess, and remediate security vulnerabilities in infrastructure, applications, and Kubernetes clusters
Implement Kubernetes security best practices including RBAC, pod security policies/standards, and network policies
Collaborate with security teams to implement and maintain security best practices
Manage and maintain HashiCorp Vault infrastructure for secure secrets management
Ensure compliance with security policies and industry standards across all environments
Incident Management & Response
Participate in 24/7 on-call rotation to respond to critical production incidents
Serve as Incident Commander, coordinating cross-functional response teams during major outages
Lead post-incident reviews and drive thorough root cause analysis across engineering teams
Troubleshoot complex Kubernetes and distributed systems issues under pressure
Develop and refine incident response procedures and runbooks
Collaboration & Leadership
Partner with engineering teams to improve system reliability and performance
Mentor junior SREs and promote SRE best practices across the organization
Lead Kubernetes adoption efforts and educate teams on container orchestration best practices
Drive initiatives to reduce toil through automation and process improvement
Contribute to architectural decisions with a reliability and operability lens
Required Qualifications
5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Expert-level knowledge of Kubernetes, including architecture, operations, and troubleshooting in production environments
Proven track record as a go-to Kubernetes resource and technical authority
Deep understanding of container technologies (Docker, containerd) and orchestration patterns
Strong hands-on experience with AWS and Azure cloud platforms
Proficiency in Terraform for infrastructure automation and management
Expert-level knowledge of Datadog for monitoring, logging, and observability
Experience with HashiCorp Vault, including deployment and management on AWS and Kubernetes integration
Deep understanding of CI/CD pipelines, including design, implementation, and optimization for containerized workloads
Proven ability to implement automated HA/DR solutions through CI/CD workflows
Strong programming skills in Python for automation, tooling, and analysis
Proven experience building observability solutions for distributed cloud applications
Experience configuring monitoring and alerting systems and integrating with paging platforms like PagerDuty
Demonstrated experience identifying and remediating security vulnerabilities
Experience driving deployments through multiple environments (test/QA/production) with proper gates and controls
Demonstrated experience participating in on-call rotations and responding to production incidents
Experience serving as Incident Commander or leading incident response efforts
Track record of conducting root cause analysis and driving systemic improvements
Strong understanding of networking, security, and cloud architecture principles
Excellent communication skills with ability to work across multiple teams and explain complex Kubernetes concepts
Preferred Qualifications
Experience with Google Cloud Platform (GCP) and GKE
Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
Experience with service mesh technologies (Istio, Linkerd, Consul)
Knowledge of Helm, Kustomize, and other Kubernetes tooling
Experience with GitOps tools (ArgoCD, Flux)
Familiarity with additional CI/CD tools (Jenkins, GitLab CI, GitHub Actions, CircleCI)
Experience with configuration management tools (Ansible, Chef, Puppet)
Background in software engineering or systems programming
Understanding of chaos engineering and reliability testing methodologies
Experience with cost optimization strategies in cloud and Kubernetes environments
Security certifications (AWS Security Specialty, CISSP, CKS, etc.)
Experience with compliance frameworks (SOC 2, ISO 27001, etc.)
Contributions to open-source Kubernetes projects or active participation in the Kubernetes community
What We Offer
Competitive salary and equity compensation
Comprehensive health, dental, and vision insurance
Flexible work arrangements
Professional development opportunities and certification support
Collaborative and inclusive team culture
Our Commitment
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
To Apply:
Please submit your resume and a brief cover letter explaining your Kubernetes expertise and experience with cloud reliability engineering.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.