Job Details
Job Title: Kubernetes Administrator
Location: Huntsville, AL (Mostly Remote with Occasional Onsite Travel)
Duration / Term: Long-Term Contract
Job Description
We are seeking a highly skilled Kubernetes Administrator to manage and maintain a complex RKE2 cluster environment supporting AI workloads and GPU-based inferencing. This role involves hands-on work with Rancher, Kubernetes, Linux systems, GPU management, and security tools. The position is primarily remote but may require occasional travel to the Huntsville site for hardware maintenance.
Key Responsibilities
Cluster Maintenance & Upgrades
- Perform Kubernetes version upgrades, RKE2 patches, and OS updates
- Maintain container runtimes, GPU drivers, and NVIDIA GPU Operator
- Update the monitoring stack (Prometheus, Grafana) and CNI plugins, and apply security patches
High Availability & Disaster Recovery
- Test and validate etcd backups, Velero restores, and DR procedures
- Document and execute node replacement and cluster recovery workflows
Monitoring & Alerting
- Implement and tune Prometheus alert rules
- Create runbooks, manage on-call rotations, and build Grafana dashboards
Performance & Resource Optimization
- Monitor and optimize cluster performance, GPU utilization, and Ceph storage
- Tune database performance and assess resource trends
Security & Compliance
- Conduct security audits, manage RBAC, rotate TLS certificates, and scan container images
- Maintain Kyverno policies, review audit logs, and enforce network policies
User & Access Management
- Manage access for 30+ users and onboard/offboard team members
- Maintain namespace separation and provide user training and documentation
Capacity Planning
- Plan for resource scaling, manage quotas, and monitor storage capacity
MLOps Infrastructure
- Maintain ClearML Enterprise, optimize GPU allocation, and support ML model deployment pipelines
Logging & Observability
- Configure and maintain FluentBit, Splunk, OpenTelemetry, and Jaeger
Multi-tenancy & Compliance
- Ensure compliance with ITAR, EAR, DFARS, and NIST 800-171
- Validate data access controls and namespace isolation
Incident Management
- Conduct root cause analysis, document incidents, and monitor SLA performance
Optional: Database Infrastructure
- Support Neo4J, Weaviate, Chroma, and Milvus
- Manage backups, performance tuning, and HA configurations
Qualifications & Experience
- Strong hands-on experience with Rancher, RKE2, and Kubernetes administration
- Proficiency in Linux OS management, GPU handling, and container orchestration
- Experience with Docker, Prometheus, Grafana, and security tools like Kyverno
- Familiarity with LLM deployment, inferencing engines (e.g., vLLM), and MLOps platforms
- Knowledge of software-defined storage/networking, TLS management, and compliance frameworks
- Excellent documentation, troubleshooting, and stakeholder communication skills
- Ability to travel occasionally for onsite hardware support
Key Skills
Kubernetes, Rancher, RKE2, Linux, Docker, GPU Management, NVIDIA GPU Operator, Prometheus, Grafana, Kyverno, TLS Certificates, RBAC, Velero, Ceph Storage, FluentBit, Splunk, OpenTelemetry, Jaeger, ClearML, vLLM, Neo4J, Weaviate, Chroma, Milvus, ITAR, EAR, DFARS, NIST 800-171, Namespace Management, Resource Quotas, Disaster Recovery, Incident Management, MLOps, Monitoring Dashboards, Security Audits, Container Runtime, CNI Plugins, Audit Logs, Certificate Rotation
VDart Group, a global leader in technology, product, and talent management, empowers businesses with comprehensive solutions through our four distinct, industry-leading business units. With a diverse team of over 4,000 professionals across 13 countries, we deliver strong results for clients across various industries, including Fortune 500 companies.
Committed to "People, Purpose, Planet," we prioritize social responsibility and sustainability, as evidenced by our EcoVadis Bronze Medal Certification and participation in the UN Global Compact.
Our dedication to delivering strong results has earned us recognition as a trusted advisor for businesses seeking to drive innovation and growth, including many Fortune 500 companies.
Join our network! Partner with VDart Group to leverage our global network, industry expertise, and proven track record with a diverse clientele.