Job Title: Senior Kubernetes Platform Engineer
Level: Senior / Lead (IC5 equivalent)
Employment Type: W2 Contract
Location: Remote / Hybrid US
Experience: 7–12 years (5+ years hands-on Kubernetes)
Industry Domain: Enterprise AI / Cloud Infrastructure
ABOUT THE ROLE
We are seeking a highly skilled Senior Kubernetes Platform Engineer to design, build, and operate mission-critical Kubernetes-based infrastructure that powers Machine Learning (ML) training, inference, and GenAI workloads at enterprise scale. This role demands deep Kubernetes expertise: not just cluster administration, but an advanced understanding of scheduling, networking, storage, security, and multi-tenancy. You will be the subject-matter expert (SME) driving platform decisions that directly impact AI product velocity. You will partner closely with ML engineers, GenAI researchers, and application teams to architect GPU-optimized Kubernetes clusters, define infrastructure-as-code (IaC) patterns using Terraform, and embed platform reliability practices across the organization.
KEY RESPONSIBILITIES
Kubernetes Platform Engineering (Core Focus)
· Design, deploy, and manage production-grade multi-cluster Kubernetes environments on AWS EKS, Google Cloud GKE, or Azure AKS.
· Architect advanced Kubernetes constructs: custom schedulers, admission webhooks, CRDs, Operators, and API aggregation layers.
· Configure and tune Kubernetes for GPU workloads: NVIDIA device plugins, time-slicing, MIG partitioning, and CUDA-aware scheduling.
· Implement Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), KEDA, and Cluster Autoscaler for ML training burst workloads.
· Manage Kubernetes RBAC, OPA/Gatekeeper policies, Pod Security Standards (PSS), and namespace-level multi-tenancy controls.
· Operate service mesh infrastructure (Istio / Linkerd) for mTLS, traffic management, circuit breaking, and observability across ML microservices.
· Configure and maintain CNI plugins (Calico, Cilium), ingress controllers (NGINX, Traefik, AWS ALB), and network policies.
· Lead Kubernetes cluster lifecycle operations: upgrades, drain/cordon procedures, etcd backup/restore, and disaster recovery runbooks.
· Build and maintain Helm charts and Kustomize overlays for packaging ML platform components, feature stores, and model serving systems.
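The autoscaling work above typically centers on small manifests like the following: a minimal HPA sketch for a model-serving Deployment, with hypothetical names (`model-server`) and illustrative thresholds, not a prescribed configuration.

```yaml
# Scales a (hypothetical) model-server Deployment between 2 and 10 replicas
# based on average CPU utilization. Real ML platforms often layer custom or
# external metrics (e.g. queue depth via KEDA) on top of this baseline.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```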
ML / GenAI Infrastructure
· Design Kubernetes-native ML pipelines using Kubeflow Pipelines, Argo Workflows, or Ray on Kubernetes.
· Architect scalable model serving infrastructure using KServe (KFServing), Triton Inference Server, TorchServe, or vLLM deployed on Kubernetes.
· Build and optimize Ray clusters on Kubernetes for distributed GenAI fine-tuning, reinforcement learning, and large-scale batch inference.
· Design persistent volume strategies for ML artifacts — models, datasets, feature vectors — using NFS, Lustre, or cloud-native storage (EFS, GCS FUSE).
· Implement JupyterHub / Kubeflow Notebooks on Kubernetes as GPU-backed interactive environments for ML researchers.
· Manage vector database deployments (Pinecone, Weaviate, Qdrant) on Kubernetes for RAG-based GenAI applications.
· Support LLM inference optimization: KV cache tuning, batching strategies, and model sharding on multi-GPU Kubernetes pods.
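As an illustration of the serving stack above, a minimal KServe `InferenceService` sketch requesting one GPU; the service name, model format, storage URI, and resource figures are all placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-inference          # hypothetical service name
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch          # placeholder; could be triton, huggingface, etc.
      storageUri: s3://models/llm/v1   # placeholder artifact location
      resources:
        limits:
          nvidia.com/gpu: "1"  # schedules the predictor pod onto a GPU node
```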
Infrastructure as Code (Terraform)
· Author and maintain production-grade Terraform modules for Kubernetes cluster provisioning (EKS, GKE, AKS) and associated cloud resources.
· Implement Terraform workspaces and remote state management (S3 + DynamoDB on AWS, GCS on Google Cloud) for multi-environment IaC.
· Enforce Terraform best practices: module versioning, drift detection, policy-as-code via Sentinel or OPA, and automated plan review in CI/CD.
· Manage Kubernetes-specific Terraform providers (hashicorp/kubernetes, hashicorp/helm) for declarative cluster configuration.
· Integrate Terraform with Atlantis or Terraform Cloud for GitOps-driven infrastructure deployments with plan/apply audit trails.
Observability, Security & Reliability
· Deploy and manage full-stack observability: Prometheus + Thanos (metrics), Loki (logs), Tempo / Jaeger (traces), and Grafana dashboards for Kubernetes and ML workloads.
· Implement Kubernetes audit logging, Falco runtime threat detection, and SIEM integration for security compliance.
· Design and enforce Pod Disruption Budgets (PDBs), topology spread constraints, and priority classes for SLA-critical inference workloads.
· Conduct capacity planning for GPU node pools: forecasting, right-sizing, spot/preemptible strategies, and cost attribution per team/model.
· Define and track SLOs/SLIs for platform reliability; lead incident reviews (post-mortems) on Kubernetes and ML infrastructure issues.
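The disruption-budget work above reduces to small policy objects; a sketch of a PDB protecting a hypothetical `model-server` inference Deployment (the name and label selector are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: model-server-pdb       # hypothetical name
spec:
  minAvailable: 2              # keep >= 2 serving pods through voluntary disruptions
  selector:
    matchLabels:
      app: model-server        # assumes serving pods carry this label
```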
REQUIRED SKILLS & TECHNOLOGY STACK
| Category | Technologies |
| --- | --- |
| Kubernetes (Expert) | Helm / Kustomize, AWS EKS / GKE / AKS, Istio / Linkerd, Argo Workflows, Kubeflow, KServe / Triton, Ray on K8s |
| Observability & Networking | Prometheus / Grafana, Cilium / Calico, OPA / Gatekeeper |
| GPU & Ops | NVIDIA GPU Operator, Python / Go, GitOps / ArgoCD, Docker / containerd |
| Security & More | Vault / Secrets Mgmt, Linux / eBPF |
REQUIRED QUALIFICATIONS
· 7+ years in cloud infrastructure/platform engineering; 5+ years of hands-on Kubernetes in production environments.
· Deep expertise in Kubernetes internals: control plane components (kube-apiserver, etcd, scheduler, controller-manager), kubelet, CRI, CNI, CSI.
· Demonstrated experience running GPU-backed Kubernetes workloads for ML training or GenAI inference at scale.
· Strong Terraform proficiency: module development, state management, CI/CD integration, and multi-cloud usage.
· Solid understanding of container networking (overlay networks, eBPF, network policies) and storage (CSI drivers, PV/PVC lifecycle).
· Hands-on experience with at least one ML orchestration framework: Kubeflow, Argo Workflows, or Ray.
· Proficiency in at least one programming language: Python (preferred), Go, or Bash for automation and tooling.
· Experience with GitOps tools: ArgoCD, Flux, or equivalent for continuous delivery of Kubernetes resources.
· Familiarity with container image security: Trivy, Grype, image signing (Cosign/Notary), and image admission policies.
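For the GitOps requirement, continuous delivery is usually expressed as declarative Application objects; a minimal ArgoCD sketch, with a hypothetical repository, path, and namespace:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform            # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-platform-manifests  # placeholder repo
    targetRevision: main
    path: overlays/prod        # placeholder Kustomize overlay path
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-platform
  syncPolicy:
    automated:
      prune: true              # delete resources removed from Git
      selfHeal: true           # revert out-of-band cluster drift
```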
CERTIFICATIONS & PREFERRED QUALIFICATIONS
· CKA (Certified Kubernetes Administrator): required.
· CKS (Certified Kubernetes Security Specialist): strongly preferred.
· CKAD (Certified Kubernetes Application Developer): advantageous.
· AWS Certified DevOps Engineer - Professional or Google Cloud Professional Cloud DevOps Engineer.
· HashiCorp Terraform Associate or Professional certification.
· Experience with Crossplane for Kubernetes-native cloud resource provisioning.
· Knowledge of eBPF-based observability tools (Hubble, Pixie, Tetragon).
· Familiarity with Kubernetes Federation (KubeFed) or multi-cluster management (Liqo, Admiral).
· Experience contributing to CNCF projects or Kubernetes upstream.
WHAT YOU WILL DELIVER (FIRST 90 DAYS)
· Day 30: Audit existing Kubernetes clusters; produce a gap analysis on security posture, GPU utilization, and observability coverage.
· Day 60: Deliver Terraform-managed EKS/GKE cluster configuration with OPA policies, Prometheus stack, and GPU node pool auto-scaling.
· Day 90: Deploy production-ready KServe or Triton-based model serving cluster; establish SLO dashboards and runbooks for ML platform operations.
COMPETENCY PROFILE
· Platform Mindset: You think in systems, not just tickets. You proactively identify toil and automate it away.
· Ownership: You treat production Kubernetes clusters as your own; alerting, runbooks, and reliability are personal commitments.
· Collaboration: You work directly with ML researchers, enabling them through platform abstraction, not blocking them with process.
· Influence without authority: You can drive architectural decisions through data, documentation, and trusted expertise.
· Security-first: You embed security into every Kubernetes design decision, from RBAC and network policies to image security and secrets management.