Linux Operating Systems, Kubernetes Admin/Engineer

Overview

Remote

On Site

Depends on Experience

Accepts corp to corp applications

Contract - W2

Contract - 12 Month(s)

Skills

Machine Learning Operations (ML Ops)

Auditing

Backup

Backup Administration

Capacity Management

Computer Hardware

Onboarding

Offshoring

Neo4j

NIST SP 800 Series

ITAR

Grafana

Dashboard

Database

Development Management

Docker

Documentation

GPU

Kubernetes

Large Language Models (LLMs)

Linux

Management

Operating Systems

Performance Monitoring

Recovery

Resource Allocation

Security Policy

Status Reports

Storage

Test Execution

Testing

Job Details

Location: Huntsville, AL. This position is mostly remote (for maintaining the software), but it may require occasional travel to the Huntsville location to handle hardware issues or changes.

Contract

Looking for candidates with experience in Rancher, Kubernetes, NVIDIA GPUs, and Kubernetes security/policy tools.
Good knowledge of and hands-on experience with Linux operating systems and their management

Good knowledge of GPU management on Linux and in containers

Good knowledge of Docker and Kubernetes platforms such as Rancher (RKE and Rancher Management Server); this is a key skill and a key advantage

Good knowledge and understanding of software-defined storage and networking

Good knowledge of monitoring tools such as Prometheus and Grafana

Knowledge of Large Language Model deployment and monitoring

Knowledge of inference engines such as vLLM

Scope of Work

RKE2 Cluster Maintenance and Administration Tasks

Regular System Updates and Upgrades

Kubernetes version upgrades (planning, testing, execution, rollback if needed)
RKE2 specific updates and patches
Host OS patching and upgrades on all 10 nodes
Container runtime (containerd) updates
GPU driver and NVIDIA GPU Operator updates
Monitoring software stack updates (Prometheus, Grafana, etc.)
CNI plugin updates
Security patches and CVE remediations

High Availability and Disaster Recovery

Regular testing of etcd backups and restoration procedures
Velero backup validation and test restores
Documentation and execution of DR procedures
Node replacement procedures when hardware fails
Cluster recovery exercises and documentation

Monitoring and Alerting

Implementing Prometheus rules for our brainstormed alert list
Alert tuning and reduction of false positives
Creating and maintaining runbooks for each alert type
24/7 alert response (if required)
On-call rotation management
Creation of monitoring dashboards for different stakeholders

Performance Management

Regular cluster performance assessments
Resource utilization optimization
GPU utilization monitoring and optimization
Ceph storage performance monitoring and tuning
Database performance monitoring and tuning

Security Management

Regular security audits
Network policy reviews and updates
RBAC configuration maintenance
Certificate rotation and management
Security scanning of container images
Kyverno policy maintenance and updates
Audit log reviews
TLS certificate management and rotation

User Support and Access Management

Managing access for approximately 30 users (20 US-based, 10 offshore)
Namespace management for proper separation of US and global teams
User onboarding and offboarding procedures
Training and documentation for users
Troubleshooting application deployment issues

Capacity Planning and Resource Management

Regular assessment of resource utilization trends
Planning for additional capacity needs
Resource quota management and adjustments
Namespace resource limit management
Storage capacity planning and management

Documentation and Knowledge Transfer

Maintaining up-to-date cluster documentation
Creating and updating runbooks for common procedures
Knowledge transfer sessions with internal team members
Regular status reporting and metrics

MLOps Infrastructure Management

ClearML Enterprise (or alternative MLOps tool) maintenance and updates
ML model deployment pipeline maintenance
GPU resource allocation optimization
ML experiment tracking infrastructure maintenance

Logging and Observability

FluentBit configuration maintenance
Splunk integration maintenance
OpenTelemetry and Jaeger maintenance
Log retention policy implementation

Multi-tenancy and Compliance

Ensuring continued separation between US and global teams
Regular audits of ITAR, EAR, DFARS, and NIST SP 800-171 compliance
Enforcement of data access controls
Periodic validation of namespace isolation

Incident Management

Root cause analysis for production incidents
Post-incident reviews and improvement plans
Incident documentation and knowledge sharing
SLA monitoring and reporting

Database Infrastructure Management (optional; may fall to app developers)

Management of Neo4j, Weaviate, Chroma, and Milvus instances
Database backup procedures
Database performance tuning
High-availability configuration for critical databases


About K-Tek Resourcing LLC