SRE Architect

Overview

On Site

Part Time

Accepts corp to corp applications

Contract - Independent

Contract - W2

Contract - 12 Month(s)

Skills

SRE Architect

Job Details

Job Title: SRE Architect

Location: Atlanta, GA

Job Type: Contract

On-Site

Job Description:

Key Responsibilities:

Architecture & Reliability Design

Define and implement the SRE architecture, reliability framework, and operational strategy.
Design scalable, fault-tolerant systems for high availability and disaster recovery.
Establish SLOs, SLIs, and SLAs across services and ensure compliance.
Architect systems for observability: logging, tracing, metrics, and alerting.

Automation & Engineering

Drive automation for infrastructure, deployment, and monitoring using tools like:
- Terraform, Ansible, Helm, Kubernetes Operators
- CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, ArgoCD)
Automate manual processes to improve efficiency and reduce MTTR.
Develop self-healing mechanisms and automated remediation workflows.

Cloud & Infrastructure

Lead cloud architecture design on AWS, Azure, or Google Cloud Platform.
Architect and optimize Kubernetes clusters and containerized applications.
Implement and manage scaling strategies, load balancing, and failover designs.
Oversee network reliability, security, and configuration management.

Monitoring, Performance & Incident Management

Implement observability tools like:
- Prometheus, Grafana, ELK, Datadog, New Relic, Splunk
Lead incident management processes by identifying root causes and improving system resilience.
Conduct performance testing, capacity planning, and SLA compliance reporting.

Collaboration & Leadership

Partner with software engineering, DevOps, security, and product teams.
Mentor the SRE team and promote reliability engineering best practices.
Establish playbooks, runbooks, and operational documentation.

Required Skills & Qualifications:

8-12+ years of experience in SRE, DevOps, or infrastructure engineering.
Strong hands-on experience with:
- Kubernetes & Docker
- Cloud platforms (AWS/Azure/Google Cloud Platform)
- IaC tools (Terraform/CloudFormation)
- CI/CD systems
Deep understanding of:
- Distributed systems
- Reliability and performance engineering
- Observability tools
- Incident & problem management
Experience in scripting/programming (Python, Go, Shell, etc.).
Strong troubleshooting, analytical, and architectural skills.
