Description:
Short Description:
The Kubernetes Engineer is a specialist responsible for the design, deployment, and operation of all GKE clusters in the H100 platform. This includes the KCC management cluster (Config Connector, Config Sync, Policy Controller), the GitHub Actions runner clusters (ARC, External Secrets Operator), and any future tenant workload clusters. You will ensure that cluster security hardening, node pool optimization, workload scheduling, and GitOps delivery operate at enterprise-grade reliability.
Key Responsibilities :-
- Design, deploy, and operate GKE clusters: private clusters with Shielded Nodes, Workload Identity, CMEK encryption, custom node SAs
- Manage Config Connector (KCC) on the management cluster: namespace isolation, ConfigConnectorContext per project, IAM SA bindings
- Operate Config Sync across multiple clusters: RootSync/RepoSync configuration, Git/OCI/Helm sources, drift detection and remediation
- Deploy and manage ARC (Actions Runner Controller) on runner clusters: scale-set configuration, autoscaling, ephemeral runner lifecycle
- Implement and manage External Secrets Operator (ESO): ClusterSecretStore for Google Cloud Secret Manager, ExternalSecret resources for GitHub App credentials
- Configure and enforce Policy Controller / Gatekeeper: constraint templates (no-public-IP, require-CMEK, require-labels, require-private-networking, restrict-owner-role)
- Manage node pools: machine types, taints/tolerations, autoscaling, surge upgrades, maintenance windows
- Build and maintain runner pod specs for Linux, Windows, Android, and iOS runners with appropriate resource limits and security contexts
- Implement Kubernetes RBAC: ClusterRoles, RoleBindings, service account management aligned with IAM tiering (ADR-016)
- Monitor cluster health: node readiness, pod scheduling, Config Sync sync status, KCC resource reconciliation
- Manage GKE Fleet membership, Binary Authorization policies, and GKE security bulletin monitoring
- Troubleshoot Kubernetes issues: pod failures, scheduling problems, network policies, DNS resolution, storage classes
- Perform cluster upgrades (control plane and node pools) with zero-downtime strategies
Required Skills & Qualifications :-
- 7+ years in Kubernetes engineering, with 3+ years on GKE specifically
- Deep GKE expertise: private clusters, Workload Identity, Config Sync, Policy Controller, Fleet management
- Strong understanding of Kubernetes internals: API server, etcd, scheduler, kubelet, kube-proxy, CNI
- Experience with Config Connector (KCC) or similar Kubernetes-native Google Cloud Platform resource management
- Hands-on ARC (Actions Runner Controller) deployment and management on Kubernetes
- External Secrets Operator (ESO) configuration and troubleshooting
- Helm chart management: values customization, chart versioning, Helm-sourced Config Sync
- Kubernetes networking: Calico/Cilium network policies, Services, Ingress, DNS (CoreDNS)
- Kubernetes security: Pod Security Admission, RBAC, SecurityContexts, secrets management
- Kustomize expertise for manifest composition and overlay management
- Monitoring: Prometheus, Grafana, Cloud Monitoring for GKE, PodMonitoring resources
- Strong kubectl skills and Kubernetes troubleshooting methodology
Preferred / Nice to Have Skills :-
- Experience with Binary Authorization and container image signing
- Familiarity with GKE Autopilot vs Standard mode trade-offs
- Experience with Windows and non-Linux node pools on GKE
- Kubernetes operator development (Go, controller-runtime)
- CKA or CKAD certification
Technology Stack :-
- Kubernetes: GKE (Standard), Config Sync, Policy Controller, ARC, ESO, KCC, Calico, Gatekeeper
- Cloud: Google Cloud Platform (GKE, IAM, KMS, Secret Manager, Artifact Registry, Fleet, Binary Auth)
- IaC: Kustomize, Helm, KCC YAML manifests, Terraform (GKE module)
- CI/CD: GitHub Actions, ARC runner scale sets, Config Sync (Git, OCI, Helm sources)
- Monitoring: Cloud Monitoring, Prometheus, Grafana, PodMonitoring
- Networking: Calico Network Policies, GKE Dataplane V2, Services, PSC
- OS: Linux (container base images), Windows (node pools), Android/iOS (runner images)