Position Overview:
We are seeking a highly experienced and hands-on Vice President of DevOps to lead our infrastructure operations and DevOps strategy. This is a leadership role that requires a deep technical background in AWS and Kubernetes, coupled with the ability to manage and mentor a small team of DevOps engineers. The VP of DevOps will be responsible for the reliability, scalability, and security of our entire cloud infrastructure, ensuring robust disaster recovery, backup, and environment management practices are in place.
Key Responsibilities:
Team Leadership & Management:
● Lead, mentor, and manage a small team of DevOps Engineers, fostering a culture of automation, collaboration, and operational excellence.
● Drive technical strategy, set team goals, and manage performance and professional development for team members.
● Serve as the primary point of contact for all infrastructure and DevOps-related initiatives, and as the escalation point for related issues.
AWS Infrastructure Management:
● Provide hands-on, end-to-end management of our AWS cloud environment.
● Architect, implement, and maintain scalable, secure, and cost-effective infrastructure solutions using AWS services such as EC2, VPC, S3, RDS, and Lambda, provisioned through Infrastructure as Code tools such as CloudFormation and Terraform.
● Oversee cloud governance, including cost management, security compliance, and resource optimization.
Kubernetes Infrastructure & Operations:
● Maintain our Kubernetes clusters (EKS or self-hosted), and troubleshoot and remediate issues within them.
● Manage cluster upgrades, patching, and performance tuning.
● Implement and manage CI/CD pipelines for containerized applications, ensuring smooth deployments and rollbacks.
Disaster Recovery & Environment Management:
● Design, implement, and regularly test comprehensive disaster recovery (DR) and backup strategies.
● Own the processes for environment creation (development, staging, production) and secure teardown, leveraging Infrastructure as Code (IaC).
● Ensure high availability and business continuity for all critical services.
Required Skills & Qualifications:
● 10+ years of proven experience in DevOps, Site Reliability Engineering (SRE), or a related infrastructure field, including at least 3 years in a leadership role managing a technical team.
● Expert-level, hands-on experience with AWS services, including networking, compute, storage, and database management. Professional-level AWS certification is highly preferred.
● Deep, hands-on expertise in administering and troubleshooting Kubernetes environments in production (EKS, GKE, or AKS).
● Strong proficiency in Infrastructure as Code tools such as Terraform or CloudFormation.
● Extensive experience with CI/CD tools (e.g., Jenkins, GitHub Actions, Argo CD).
● Solid scripting skills in languages such as Bash, Python, or Go.
● Demonstrated experience in designing and implementing disaster recovery, backup, and restoration procedures.
● Strong understanding of security best practices in cloud and containerized environments.
● Excellent problem-solving, communication, and leadership skills, with the ability to operate effectively in a fast-paced, hands-on environment.