Kubernetes Administrator

  • Huntsville, AL
  • Posted 1 day ago | Updated 1 day ago

Overview

Hybrid
Depends on Experience
Accepts corp to corp applications
Contract - W2
Contract - 12 Month(s)

Skills

Kubernetes
Rancher
RKE2
Linux
Docker
GPU Management
NVIDIA GPU Operator
Prometheus
Grafana
Kyverno
TLS Certificates
RBAC
Velero
Ceph Storage
FluentBit
Splunk
OpenTelemetry
Jaeger
ClearML
vLLM
Neo4J
Weaviate
Chroma
Milvus
ITAR
EAR
DFARS
NIST 800-171
Namespace Management
Resource Quotas
Disaster Recovery
Incident Management
MLOps
Monitoring Dashboards
Security Audits
Container Runtime
CNI Plugins
Audit Logs
Certificate Rotation

Job Details

Job Title: Kubernetes Administrator

Location: Huntsville, AL (Mostly Remote with Occasional Onsite Travel)

Duration / Term: Long-Term Contract

Job Description

We are seeking a highly skilled Kubernetes Administrator to manage and maintain a complex RKE2 cluster environment supporting AI workloads and GPU-based inferencing. This role requires hands-on experience with Rancher, Kubernetes, Linux systems, GPU management, and security tools. The position is primarily remote but may require occasional travel to the Huntsville site for hardware maintenance.

Key Responsibilities

Cluster Maintenance & Upgrades

  • Perform Kubernetes version upgrades, RKE2 patches, and OS updates
  • Maintain container runtimes, GPU drivers, and the NVIDIA GPU Operator
  • Update the monitoring stack (Prometheus, Grafana) and CNI plugins, and apply security patches

High Availability & Disaster Recovery

  • Test and validate etcd backups, Velero restores, and DR procedures
  • Document and execute node replacement and cluster recovery workflows

Monitoring & Alerting

  • Implement and tune Prometheus alert rules
  • Create runbooks, manage on-call rotations, and build Grafana dashboards

Performance & Resource Optimization

  • Monitor and optimize cluster performance, GPU utilization, and Ceph storage
  • Tune database performance and assess resource trends

Security & Compliance

  • Conduct security audits, manage RBAC, rotate TLS certificates, and scan container images
  • Maintain Kyverno policies, review audit logs, and enforce network policies

User & Access Management

  • Manage access for 30+ users, including onboarding and offboarding team members
  • Maintain namespace separation and provide user training and documentation

Capacity Planning

  • Plan for resource scaling, manage quotas, and monitor storage capacity

MLOps Infrastructure

  • Maintain ClearML Enterprise, optimize GPU allocation, and support ML model deployment pipelines

Logging & Observability

  • Configure and maintain FluentBit, Splunk, OpenTelemetry, and Jaeger

Multi-tenancy & Compliance

  • Ensure compliance with ITAR, EAR, DFARS, and NIST 800-171
  • Validate data access controls and namespace isolation

Incident Management

  • Conduct root cause analysis, document incidents, and monitor SLA performance

Optional: Database Infrastructure

  • Support Neo4J, Weaviate, Chroma, and Milvus
  • Manage backups, performance tuning, and HA configurations

Qualifications & Experience

  • Strong hands-on experience with Rancher, RKE2, and Kubernetes administration
  • Proficiency in Linux OS management, GPU handling, and container orchestration
  • Experience with Docker, Prometheus, Grafana, and security tools like Kyverno
  • Familiarity with LLM deployment, inferencing engines (e.g., vLLM), and MLOps platforms
  • Knowledge of software-defined storage/networking, TLS management, and compliance frameworks
  • Excellent documentation, troubleshooting, and stakeholder communication skills
  • Ability to travel occasionally for onsite hardware support

VDart Group, a global leader in technology, product, and talent management, empowers businesses with comprehensive solutions through our four distinct, industry-leading business units. With a diverse team of over 4,000 professionals across 13 countries, we deliver strong results across various industries, including Fortune 500 companies.

Committed to "People, Purpose, Planet," we prioritize social responsibility and sustainability, as evidenced by our EcoVadis Bronze Medal Certification and participation in the UN Global Compact.

Our dedication to delivering strong results has earned us recognition as a trusted advisor for businesses seeking to drive innovation and growth, including many Fortune 500 companies.

Join our network! Partner with VDart Group to leverage our global network, industry expertise, and proven track record with a diverse clientele.

