Job Description
We are seeking a Senior Kubernetes Administrator with deep, hands-on experience in administering and operating Azure Kubernetes Service (AKS) clusters in production environments. The ideal candidate will be responsible for designing, deploying, securing, scaling, and troubleshooting AKS clusters, leveraging Azure-native services and Infrastructure as Code (Terraform). This role requires strong operational ownership, an automation mindset, and close collaboration with DevOps, SRE, and application teams.
Required Skills & Qualifications
Mandatory
- Strong hands-on experience administering Kubernetes in production.
- Deep expertise in Azure Kubernetes Service (AKS).
- Solid experience with Microsoft Azure (compute, networking, security).
- Proven experience with Terraform for infrastructure provisioning.
- Strong troubleshooting skills across Kubernetes, networking, and cloud infrastructure.
- Experience managing Linux-based systems.
Key Responsibilities
Kubernetes & AKS Administration
- Administer, operate, and maintain production-grade AKS clusters (single and multi-region).
- Perform cluster lifecycle management: provisioning, upgrades, patching, scaling, and decommissioning.
- Configure and manage node pools, autoscaling (HPA/Cluster Autoscaler), taints, tolerations, and resource quotas.
- Implement and manage Kubernetes networking (Azure CNI, ingress controllers, services, DNS).
- Troubleshoot cluster-level and workload-level issues (pods, nodes, networking, storage, performance).
Azure Platform Responsibilities
- Design and manage AKS integrations with Azure services such as:
- Azure Monitor / Log Analytics
- Azure Load Balancer / Application Gateway / NGINX Ingress
- Azure Key Vault
- Azure Storage (Disk, File, Blob)
- Azure Container Registry (ACR)
- Ensure high availability, resilience, and disaster recovery for Kubernetes workloads.
- Manage Azure IAM, RBAC, and managed identities for secure access.
Infrastructure as Code & Automation
- Use Terraform to provision and manage:
- AKS clusters
- Networking (VNETs, subnets)
- Supporting Azure resources
- Maintain reusable, modular Terraform code following best practices.
- Automate cluster operations and maintenance activities wherever possible.
Security & Compliance
- Implement Kubernetes security best practices:
- RBAC, Network Policies, Pod Security Standards
- Secrets management (Key Vault integration)
- Ensure secure container runtime and image practices.
- Support compliance, audit, and governance requirements.
Monitoring, Logging & Reliability
- Set up and manage monitoring, logging, and alerting for AKS and workloads.
- Perform capacity planning and cost optimization.
- Participate in incident response, root cause analysis, and preventive actions.
Collaboration & Leadership
- Work closely with DevOps, SRE, and application teams to support deployments.
- Provide guidance and mentoring to junior engineers.
- Contribute to operational documentation, runbooks, and best practices.
Preferred / Nice to Have
- Experience with CI/CD pipelines (Azure DevOps, GitHub Actions, GitLab).
- Knowledge of Helm, Kustomize, or similar Kubernetes packaging tools.
- Exposure to SRE practices and reliability engineering.
- Kubernetes certifications (CKA/CKAD) or Azure certifications (AZ-104, AZ-305).
- Experience with multi-cluster or hybrid Kubernetes environments.