Overview
On Site
USD 61.00 - 80.00 per hour
Contract - W2
Skills
Financial Services
Recruiting
SAP BASIS
Scalability
IaaS
FOCUS
Provisioning
Disaster Recovery
Budget
Dashboard
SAFE
Automated Testing
RBAC
Network
Mentorship
Educate
Process Improvement
Reliability Engineering
DevOps
Docker
Orchestration
Microsoft Azure
Terraform
Management
High Availability
Workflow
Python
Quality Assurance
Incident Management
Root Cause Analysis
Network Security
Cloud Architecture
Communication
Collaboration
Google Cloud Platform
Google Cloud
Continuous Delivery
Jenkins
GitLab
Continuous Integration
GitHub
CircleCI
Configuration Management
Ansible
Progress Chef
Puppet
Software Engineering
CHAOS
Test Methods
Optimization
Cloud Computing
Amazon Web Services
CISSP
Regulatory Compliance
System On A Chip
ISO/IEC 27001:2005
Open Source
Kubernetes
Job Details
Our client, a leading financial services company, is hiring a Senior Site Reliability Engineer on a long-term contract basis.
Job ID 83689
Work Location:
Alpharetta, GA
Summary:
We're seeking an experienced Senior Site Reliability Engineer to join our team and play a critical role in ensuring the reliability, scalability, and performance of our cloud infrastructure. You'll be a technical leader who combines deep operational expertise with strong automation skills to build and maintain highly available systems. As a Kubernetes expert, you'll drive our container orchestration strategy and serve as a technical authority for our platform teams.
Responsibilities:
- Design, deploy, and manage cloud infrastructure across AWS and Azure using Terraform and infrastructure-as-code principles
- Architect, deploy, and maintain production-grade Kubernetes clusters with a focus on reliability, security, and performance
- Serve as the subject matter expert on Kubernetes, providing guidance and best practices to engineering teams
- Build and maintain automated provisioning pipelines to ensure consistent, repeatable deployments
- Implement and maintain HashiCorp Vault on AWS for secrets management and security, including Vault integration with Kubernetes
- Design and implement automated High Availability and Disaster Recovery (HA/DR) capabilities through CI/CD pipelines
- Optimize cloud resources and Kubernetes workloads for performance, cost efficiency, and reliability
- Architect and implement comprehensive observability solutions using Datadog for cloud-native applications and Kubernetes infrastructure
- Build monitoring, logging, and alerting frameworks for containerized workloads that provide actionable insights into system health
- Implement Kubernetes-native monitoring patterns and troubleshoot complex container orchestration issues
- Integrate Datadog with PagerDuty and other incident management platforms
- Define and track SLIs, SLOs, and error budgets to drive reliability improvements
- Create custom dashboards and monitors to track infrastructure, application, and Kubernetes cluster performance
- Design, build, and maintain robust CI/CD pipelines that enable rapid, safe deployments to Kubernetes
- Implement GitOps workflows and automated deployment strategies for containerized applications
- Implement automated testing, security scanning, and quality gates within pipelines
- Drive solutions through test, QA, and production environments with appropriate controls and safeguards
- Automate deployment strategies including blue-green, canary, and rolling deployments in Kubernetes
- Identify, assess, and remediate security vulnerabilities in infrastructure, applications, and Kubernetes clusters
- Implement Kubernetes security best practices including RBAC, pod security policies/standards, and network policies
- Collaborate with security teams to implement and maintain security best practices
- Manage and maintain HashiCorp Vault infrastructure for secure secrets management
- Ensure compliance with security policies and industry standards across all environments
- Participate in 24/7 on-call rotation to respond to critical production incidents
- Serve as Incident Commander, coordinating cross-functional response teams during major outages
- Lead post-incident reviews and drive thorough root cause analysis across engineering teams
- Troubleshoot complex Kubernetes and distributed systems issues under pressure
- Develop and refine incident response procedures and runbooks
- Partner with engineering teams to improve system reliability and performance
- Mentor junior SREs and promote SRE best practices across the organization
- Lead Kubernetes adoption efforts and educate teams on container orchestration best practices
- Drive initiatives to reduce toil through automation and process improvement
- Contribute to architectural decisions with a reliability and operability lens
Required Skills:
- 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
- Expert-level knowledge of Kubernetes, including architecture, operations, and troubleshooting in production environments
- Proven track record as a go-to Kubernetes resource and technical authority
- Deep understanding of container technologies (Docker, containerd) and orchestration patterns
- AWS and Azure cloud platforms
- Terraform for infrastructure automation and management
- Datadog for monitoring, logging, and observability
- HashiCorp Vault, including deployment and management on AWS and Kubernetes integration
- Deep understanding of CI/CD pipelines, including design, implementation, and optimization for containerized workloads
- Proven ability to implement automated HA/DR solutions through CI/CD workflows
- Strong programming skills in Python for automation, tooling, and analysis
- Proven experience building observability solutions for distributed cloud applications
- Experience configuring monitoring and alerting systems and integrating with paging platforms like PagerDuty
- Demonstrated experience identifying and remediating security vulnerabilities
- Experience driving deployments through multiple environments (test/QA/production) with proper gates and controls
- Demonstrated experience participating in on-call rotations and responding to production incidents
- Experience serving as Incident Commander or leading incident response efforts
- Track record of conducting root cause analysis and driving systemic improvements
- Strong understanding of networking, security, and cloud architecture principles
- Excellent communication skills with ability to work across multiple teams and explain complex Kubernetes concepts
Preferred Skills:
- Experience with Google Cloud Platform (GCP) and GKE
- Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
- Experience with service mesh technologies (Istio, Linkerd, Consul)
- Knowledge of Helm, Kustomize, and other Kubernetes tooling
- Experience with GitOps tools (ArgoCD, Flux)
- Familiarity with additional CI/CD tools (Jenkins, GitLab CI, GitHub Actions, CircleCI)
- Experience with configuration management tools (Ansible, Chef, Puppet)
- Background in software engineering or systems programming
- Understanding of chaos engineering and reliability testing methodologies
- Experience with cost optimization strategies in cloud and Kubernetes environments
- Security certifications (AWS Security Specialty, CISSP, CKS, etc.)
- Experience with compliance frameworks (SOC 2, ISO 27001, etc.)
- Contributions to open-source Kubernetes projects or active participation in the Kubernetes community
Pay: $61-$80 per hour.