*Certifications*
Required: AZ-104, CKA
Preferred: AZ-305, AZ-400, ITIL v4 Foundation
Design and operate the enterprise observability stack: Azure Monitor, Log Analytics, and Managed Grafana.
Develop self-healing capabilities and platform automation using Logic Apps and Python.
Job Responsibilities*
Define and enforce enterprise-wide SLOs, SLIs, and Error Budgets for Tier-0 and Tier-1 services.
Lead architectural reviews and implement chaos engineering practices using Azure Chaos Studio.
Serve as Incident Commander for P1/P2 incidents and drive blameless post-mortem culture.
*Department/Project Description*
7+ years of experience in SRE, platform engineering, or cloud infrastructure engineering in large-scale enterprise environments.
Deep, hands-on expertise with Microsoft Azure (minimum 4 years) including Landing Zones and Enterprise-Scale architecture.
Expert-level proficiency with AKS: cluster lifecycle management, RBAC, network policies, pod security standards, and Workload Identity.
Strong Infrastructure-as-Code skills using Terraform (required) and/or Bicep.
Proficiency in Python (preferred), Go, or PowerShell for automation and scripting.
Strong knowledge of enterprise networking: Hub-and-Spoke, Virtual WAN, ExpressRoute, and Azure Firewall.