Senior Kubernetes Platform Engineer (Multi-Cloud AI Infrastructure)

  • Mountain View, CA
  • Posted 20 hours ago | Updated 19 hours ago

Overview

On Site
Depends on Experience
Full Time

Skills

Chaos Engineering
Kubernetes
Google Cloud Platform
ISO/IEC 27001:2005
Machine Learning (ML)
Artificial Intelligence
Ansible
Amazon Web Services
Collaboration
Linux
Jenkins
Python

Job Details

 

Candidates must be local to the Bay Area and open to onsite work and in-person interviews.

Job Description

We're looking for a hands-on Kubernetes expert to help build and scale a multi-cloud AI platform. This is an exciting opportunity to work on high-performance Kubernetes infrastructure for cutting-edge machine learning workloads.

Key Responsibilities
  • Design, build, and scale Kubernetes-based infrastructure to support the client's multi-cloud AI platform, ensuring high availability, resilience, and performance.
  • Architect and optimize large-scale Kubernetes clusters, improving scheduling, networking (CNI), and workload orchestration for production environments.
  • Develop and extend Kubernetes controllers and operators to automate cluster management, lifecycle operations, and scaling strategies.
  • Enhance observability, diagnostics, and monitoring by building tools for real-time cluster health tracking, alerting, and performance tuning.
  • Lead efforts to automate fleet management, optimizing node pools, autoscaling, and multi-cluster deployments across AWS, Google Cloud Platform, and Azure.
  • Define and implement Kubernetes security policies, RBAC models, and best practices to ensure compliance and platform integrity.
  • Collaborate with ML engineers and platform teams to optimize Kubernetes for machine learning workloads, ensuring seamless resource allocation for AI/ML models.
  • Drive commit-to-production automation, cloud connectivity, and deployment orchestration, ensuring seamless application rollouts, zero-downtime upgrades, and global infrastructure reliability.
Required Skills and Experience
  • Kubernetes Mastery: 5-7+ years of experience managing large-scale Kubernetes clusters (EKS, GKE, AKS, or open-source Kubernetes) in production. Deep expertise in Kubernetes internals, including controllers, operators, scheduling, networking (CNI), and security policies.
  • Cloud-Native Infrastructure: 5-7+ years of experience building cloud-native Kubernetes-based infrastructure across AWS, Azure, and Google Cloud Platform.
  • Platform Engineering: 5-7+ years of experience building Kubernetes service meshes (Istio/Envoy, Traefik), network policies (Calico/Tigera), and distributed ingress/egress control.
  • Fleet Management & Scaling: Proven experience in optimizing, scaling, and maintaining Kubernetes clusters across multi-cloud environments, ensuring high availability and performance.
  • Software Development: 5-7+ years of experience writing production-grade controllers and operators in Python, Go, or Rust to extend Kubernetes functionality.
  • Infrastructure-as-Code & Automation: Hands-on experience with Terraform, CloudFormation, Ansible, and Bash and Make scripting to automate Kubernetes cluster provisioning and management.
  • Distributed Systems & SaaS: Expertise in building and operating large-scale distributed systems for cloud-native B2B SaaS applications running on Kubernetes.
  • Cloud Application Deployment: Deep expertise in building container orchestration, workload scheduling, and runtime optimizations using Kubernetes, Argo, or Flux.
  • Education: BS/MS in Computer Science or a related field (PhD preferred)
Nice to Have
  • Proficiency with cloud platforms such as AWS, Google Cloud Platform, or Azure.
  • Familiarity with chaos engineering tools and practices for testing system resilience.
  • Strong understanding of security best practices and compliance frameworks (GDPR, SOC 2, ISO 27001), including vulnerability assessments, GRC, and risk management.
  • Contributions to open-source projects, particularly in the Kubernetes or cloud-native ecosystem.
  • Expertise in Docker, Kubernetes, Jenkins, Flux, Argo, and Terraform in a Linux environment.
  • Hands-on experience with monitoring and observability tools such as Prometheus and Grafana.
  • Ability to develop customer-facing web frontends or public APIs/SDKs for platform services.
 

About TekReliance