Job Details
Location: Remote (US)
Job Summary: We are seeking an experienced Cloud Resiliency Architect to lead the design and implementation of highly available, secure, and automated cloud infrastructure with a focus on resiliency and operational excellence. You will leverage expertise in SRE principles, Terraform, observability, and security automation to build resilient systems that ensure business continuity and compliance.
This role requires a deep understanding of cloud security controls, including IAM, KMS, and vaulting, and of messaging platforms such as Kafka, combined with hands-on experience driving infrastructure automation and monitoring.
Key Responsibilities:
- Architect and implement resilient cloud infrastructure using Terraform and automation best practices to support high availability and disaster recovery goals.
- Design and integrate security automation controls across IAM, KMS, and secret management tools like Vault to enforce least privilege and compliance.
- Lead initiatives around resiliency operations, including failure detection, automated recovery, and incident response.
- Develop and maintain observability frameworks leveraging tools and metrics to proactively monitor infrastructure and applications for availability and performance.
- Collaborate with database and messaging platform teams to ensure fault tolerance and secure configurations for databases and Kafka clusters.
- Drive the adoption of Site Reliability Engineering (SRE) practices and principles to improve operational reliability and automation.
- Build and maintain CI/CD pipelines to automate infrastructure provisioning, security policy enforcement, and monitoring deployments.
- Mentor engineering teams on resiliency patterns, security automation, and infrastructure best practices.
- Stay current with cloud security, resilience trends, and emerging technologies to continuously enhance the cloud environment.
Qualifications:
- 7+ years of experience in cloud infrastructure engineering, site reliability, or resiliency operations.
- Strong hands-on expertise with Terraform for cloud infrastructure automation.
- Deep knowledge of cloud security concepts, including IAM, KMS, and secret management solutions like HashiCorp Vault.
- Experience with observability tools (e.g., Prometheus, Grafana, ELK, CloudWatch) and implementing monitoring/alerting frameworks.
- Familiarity with messaging platforms such as Kafka and resilient database architectures.
- Proficiency in scripting or programming languages (e.g., Python, Go, Bash) for automation and tooling.
- Strong understanding of Site Reliability Engineering (SRE) principles and practices.
- Excellent problem-solving skills and ability to work collaboratively across cross-functional teams.
- Cloud certifications such as AWS Certified Security Specialty, AWS Certified DevOps Engineer, or equivalent.
- Experience with Kubernetes and container orchestration resilience strategies.
- Knowledge of regulatory compliance frameworks and their impact on cloud security and operations.
- Experience designing multi-region disaster recovery and failover solutions.