Senior Kubernetes Platform Engineer
ML / GenAI Infrastructure | Terraform | Cloud-Native
In-Person Interview Is Non-Negotiable
Location: Charlotte, NC - On-Site/Hybrid
Employment Type: Contract-to-Hire
Experience: 7–12 Years (5+ years hands-on Kubernetes)
Industry: Enterprise AI / Cloud Infrastructure
⸻
About the Role
We are looking for a Senior Kubernetes Platform Engineer to design, build, and operate mission-critical Kubernetes infrastructure that powers large-scale Machine Learning (ML) and Generative AI (GenAI) workloads.
This is not a standard Kubernetes admin role — you will act as a subject matter expert, driving architecture decisions across scheduling, networking, security, storage, and multi-tenancy. You will work closely with ML engineers, researchers, and application teams to build scalable, GPU-optimized platforms that accelerate AI innovation.
⸻
Key Responsibilities
Kubernetes Platform Engineering
• Design, deploy, and manage multi-cluster Kubernetes environments (EKS, GKE, AKS)
• Build advanced Kubernetes components including CRDs, Operators, admission webhooks, and custom schedulers
• Optimize Kubernetes for GPU workloads (NVIDIA device plugins, MIG, time-slicing)
• Implement autoscaling solutions (HPA, VPA, KEDA, Cluster Autoscaler)
• Enforce security using RBAC, OPA/Gatekeeper, and Pod Security Standards
• Manage a service mesh (Istio or Linkerd) for secure and observable microservices
• Configure networking (Cilium, Calico), ingress controllers, and network policies
• Lead cluster lifecycle management (upgrades, backups, disaster recovery)
• Package platform components using Helm and Kustomize
⸻
ML / GenAI Infrastructure
• Design ML pipelines using Kubeflow, Argo Workflows, or Ray
• Build scalable model serving platforms (KServe, Triton, TorchServe, vLLM)
• Optimize distributed compute using Ray on Kubernetes
• Design storage solutions for ML datasets and artifacts (EFS, GCS, NFS, etc.)
• Enable GPU-backed environments (JupyterHub, Kubeflow Notebooks)
• Deploy and manage vector databases for RAG applications
• Optimize LLM inference (batching, caching, multi-GPU scaling)
⸻
Infrastructure as Code (Terraform)
• Develop and maintain reusable Terraform modules for cloud infrastructure
• Implement remote state management and multi-environment workflows
• Enforce best practices: versioning, drift detection, policy-as-code
• Integrate Terraform into CI/CD pipelines and GitOps workflows
• Use tools like Atlantis or Terraform Cloud for automated deployments
⸻
Observability, Security & Reliability
• Build an observability stack (Prometheus, Grafana, Loki, Jaeger/Tempo)
• Implement audit logging and runtime security (Falco, SIEM integration)
• Define SLOs/SLIs and maintain platform reliability
• Perform GPU capacity planning and cost optimization
• Lead incident response and post-mortem analysis
⸻
Required Skills & Technologies
• Kubernetes (Expert level)
• Terraform (Advanced)
• Helm / Kustomize
• AWS / Google Cloud Platform / Azure (EKS, GKE, AKS)
• Istio / Linkerd
• Argo Workflows / Kubeflow / Ray
• KServe / Triton
• Prometheus / Grafana
• Cilium / Calico
• OPA / Gatekeeper
• NVIDIA GPU Operator
• Docker / containerd
• GitOps tools (ArgoCD / Flux)
• Python / Go / Bash
• Linux systems and networking
⸻
Required Qualifications
• 7+ years in cloud/platform engineering
• 5+ years hands-on Kubernetes in production
• Deep understanding of Kubernetes internals (control plane, CNI, CSI, etc.)
• Experience running GPU-based ML/AI workloads at scale
• Strong Terraform expertise (modules, CI/CD, multi-cloud)
• Experience with ML orchestration tools (Kubeflow, Argo, or Ray)
• Proficiency in at least one programming or scripting language (Python, Go, or Bash)
• Experience with GitOps and secure container practices
• CKA (Certified Kubernetes Administrator) certification
⸻
Preferred Qualifications
• CKS (Certified Kubernetes Security Specialist) certification
• CKAD (Certified Kubernetes Application Developer) certification
• Cloud DevOps certifications (AWS / Google Cloud Platform)
• Terraform certification
• Experience with Crossplane or multi-cluster management
• Familiarity with eBPF tools (Hubble, Pixie)
• Contributions to CNCF projects or the open-source Kubernetes ecosystem
⸻
What You’ll Deliver (First 90 Days)
• Day 30: Audit existing Kubernetes clusters and deliver a gap analysis
• Day 60: Implement Terraform-managed clusters with security and observability
• Day 90: Deploy a production-ready model-serving platform with SLO dashboards
⸻
Who You Are
• A systems thinker with a strong platform mindset
• Proactive and automation-driven
• Comfortable working cross-functionally with ML and engineering teams
• An influential communicator who can drive architecture decisions
• Security-focused and reliability-driven
⸻
Why Join Us
This role is ideal for engineers passionate about Kubernetes and AI infrastructure who want to build the backbone of next-generation enterprise AI platforms.