Job Title: Senior Kubernetes Platform Engineer
Level: Senior / Lead (IC5 equivalent)
Employment Type: W2 Contract
Location: Remote / Hybrid US
Experience: 7–12 years (5+ years hands-on Kubernetes)
Industry Domain: Enterprise AI / Cloud Infrastructure
ABOUT THE ROLE
We are seeking a highly skilled Senior Kubernetes Platform Engineer to design, build, and operate mission-critical Kubernetes-based infrastructure that powers Machine Learning (ML) training, inference, and GenAI workloads at enterprise scale. This role demands deep Kubernetes expertise: not just cluster administration, but an advanced understanding of scheduling, networking, storage, security, and multi-tenancy. You will be the subject-matter expert (SME) driving platform decisions that directly impact AI product velocity. You will partner closely with ML engineers, GenAI researchers, and application teams to architect GPU-optimized Kubernetes clusters, define infrastructure-as-code (IaC) patterns using Terraform, and embed platform reliability practices across the organization.
KEY RESPONSIBILITIES
Kubernetes Platform Engineering (Core Focus)
· Design, deploy, and manage production-grade multi-cluster Kubernetes environments on AWS EKS, Google Cloud GKE, or Azure AKS.
· Architect advanced Kubernetes constructs: custom schedulers, admission webhooks, CRDs, Operators, and API aggregation layers.
· Configure and tune Kubernetes for GPU workloads: NVIDIA device plugins, time-slicing, MIG partitioning, and CUDA-aware scheduling.
· Implement Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), KEDA, and Cluster Autoscaler for ML training burst workloads.
· Manage Kubernetes RBAC, OPA/Gatekeeper policies, Pod Security Standards (PSS), and namespace-level multi-tenancy controls.
· Operate service mesh infrastructure (Istio / Linkerd) for mTLS, traffic management, circuit breaking, and observability across ML microservices.
· Configure and maintain CNI plugins (Calico, Cilium), ingress controllers (NGINX, Traefik, AWS ALB), and network policies.
· Lead Kubernetes cluster lifecycle operations: upgrades, drain/cordon procedures, etcd backup/restore, and disaster recovery runbooks.
· Build and maintain Helm charts and Kustomize overlays for packaging ML platform components, feature stores, and model serving systems.
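The autoscaling work above typically centers on small manifests like the following: a minimal HPA sketch for a model-serving Deployment, with hypothetical names (`model-server`) and illustrative thresholds, not a prescribed configuration.

```yaml
# Scales a (hypothetical) model-server Deployment between 2 and 10 replicas
# based on average CPU utilization. Real ML platforms often layer custom or
# external metrics (e.g. queue depth via KEDA) on top of this baseline.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```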
ML / GenAI Infrastructure
· Design Kubernetes-native ML pipelines using Kubeflow Pipelines, Argo Workflows, or Ray on Kubernetes.
· Architect scalable model serving infrastructure using KServe (KFServing), Triton Inference Server, TorchServe, or vLLM deployed on Kubernetes.
· Build and optimize Ray clusters on Kubernetes for distributed GenAI fine-tuning, reinforcement learning, and large-scale batch inference.
· Design persistent volume strategies for ML artifacts — models, datasets, feature vectors — using NFS, Lustre, or cloud-native storage (EFS, GCS FUSE).
· Implement JupyterHub / Kubeflow Notebooks on Kubernetes as GPU-backed interactive environments for ML researchers.
· Manage vector database deployments (Pinecone, Weaviate, Qdrant) on Kubernetes for RAG-based GenAI applications.
· Support LLM inference optimization: KV cache tuning, batching strategies, and model sharding on multi-GPU Kubernetes pods.
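As an illustration of the serving stack above, a minimal KServe `InferenceService` sketch requesting one GPU; the service name, model format, storage URI, and resource figures are all placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-inference          # hypothetical service name
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch          # placeholder; could be triton, huggingface, etc.
      storageUri: s3://models/llm/v1   # placeholder artifact location
      resources:
        limits:
          nvidia.com/gpu: "1"  # schedules the predictor pod onto a GPU node
```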
Infrastructure as Code (Terraform)
· Author and maintain production-grade Terraform modules for Kubernetes cluster provisioning (EKS, GKE, AKS) and associated cloud resources.
· Implement Terraform workspaces and remote state management (S3 + DynamoDB on AWS, GCS on Google Cloud) for multi-environment IaC.
· Enforce Terraform best practices: module versioning, drift detection, policy-as-code via Sentinel or OPA, and automated plan review in CI/CD.
· Manage Kubernetes-specific Terraform providers (hashicorp/kubernetes, hashicorp/helm) for declarative cluster configuration.
· Integrate Terraform with Atlantis or Terraform Cloud for GitOps-driven infrastructure deployments with plan/apply audit trails.
Observability, Security & Reliability
· Deploy and manage full-stack observability: Prometheus + Thanos (metrics), Loki (logs), Tempo / Jaeger (traces), and Grafana dashboards for Kubernetes and ML workloads.
· Implement Kubernetes audit logging, Falco runtime threat detection, and SIEM integration for security compliance.
· Design and enforce Pod Disruption Budgets (PDBs), topology spread constraints, and priority classes for SLA-critical inference workloads.
· Conduct capacity planning for GPU node pools: forecasting, right-sizing, spot/preemptible strategies, and cost attribution per team/model.
· Define and track SLOs/SLIs for platform reliability; lead incident reviews (post-mortems) on Kubernetes and ML infrastructure issues.
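The disruption-budget work above reduces to small policy objects; a sketch of a PDB protecting a hypothetical `model-server` inference Deployment (the name and label selector are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: model-server-pdb       # hypothetical name
spec:
  minAvailable: 2              # keep >= 2 serving pods through voluntary disruptions
  selector:
    matchLabels:
      app: model-server        # assumes serving pods carry this label
```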
REQUIRED SKILLS & TECHNOLOGY STACK
| Category | Technologies |
| --- | --- |
| Kubernetes (Expert) | Helm / Kustomize, AWS EKS / GKE / AKS, Istio / Linkerd, Argo Workflows, Kubeflow, KServe / Triton, Ray on K8s |
| Observability & Networking | Prometheus / Grafana, Cilium / Calico, OPA / Gatekeeper |
| GPU & Ops | NVIDIA GPU Operator, Python / Go, GitOps / ArgoCD, Docker / containerd |
| Security & More | Vault / Secrets Mgmt, Linux / eBPF |
REQUIRED QUALIFICATIONS
· 7+ years in cloud infrastructure/platform engineering; 5+ years of hands-on Kubernetes in production environments.
· Deep expertise in Kubernetes internals: control plane components (kube-apiserver, etcd, scheduler, controller-manager), kubelet, CRI, CNI, CSI.
· Demonstrated experience running GPU-backed Kubernetes workloads for ML training or GenAI inference at scale.
· Strong Terraform proficiency: module development, state management, CI/CD integration, and multi-cloud usage.
· Solid understanding of container networking (overlay networks, eBPF, network policies) and storage (CSI drivers, PV/PVC lifecycle).
· Hands-on experience with at least one ML orchestration framework: Kubeflow, Argo Workflows, or Ray.
· Proficiency in at least one programming language: Python (preferred), Go, or Bash for automation and tooling.
· Experience with GitOps tools: ArgoCD, Flux, or equivalent for continuous delivery of Kubernetes resources.
· Familiarity with container image security: Trivy, Grype, image signing (Cosign/Notary), and image admission policies.
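For the GitOps requirement, continuous delivery is usually expressed as declarative Application objects; a minimal ArgoCD sketch, with a hypothetical repository, path, and namespace:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform            # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-platform-manifests  # placeholder repo
    targetRevision: main
    path: overlays/prod        # placeholder Kustomize overlay path
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-platform
  syncPolicy:
    automated:
      prune: true              # delete resources removed from Git
      selfHeal: true           # revert out-of-band cluster drift
```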
CERTIFICATIONS & PREFERRED QUALIFICATIONS
· CKA (Certified Kubernetes Administrator): required.
· CKS (Certified Kubernetes Security Specialist): strongly preferred.
· CKAD (Certified Kubernetes Application Developer): advantageous.
· AWS Certified DevOps Engineer - Professional or Google Cloud Professional Cloud DevOps Engineer.
· HashiCorp Terraform Associate or Professional certification.
· Experience with Crossplane for Kubernetes-native cloud resource provisioning.
· Knowledge of eBPF-based observability tools (Hubble, Pixie, Tetragon).
· Familiarity with Kubernetes Federation (KubeFed) or multi-cluster management (Liqo, Admiral).
· Experience contributing to CNCF projects or Kubernetes upstream.
WHAT YOU WILL DELIVER (FIRST 90 DAYS)
· Day 30: Audit existing Kubernetes clusters; produce a gap analysis on security posture, GPU utilization, and observability coverage.
· Day 60: Deliver Terraform-managed EKS/GKE cluster configuration with OPA policies, Prometheus stack, and GPU node pool auto-scaling.
· Day 90: Deploy production-ready KServe or Triton-based model serving cluster; establish SLO dashboards and runbooks for ML platform operations.
COMPETENCY PROFILE
· Platform Mindset: You think in systems, not just tickets. You proactively identify toil and automate it away.
· Ownership: You treat production Kubernetes clusters as your own; alerting, runbooks, and reliability are personal commitments.
· Collaboration: You work directly with ML researchers, enabling them through platform abstraction, not blocking them with process.
· Influence without authority: You can drive architectural decisions through data, documentation, and trusted expertise.
· Security-first: You embed security into every Kubernetes design decision, from RBAC and network policies to image security and secrets management.