Linux Operating Systems and Kubernetes Admin/Engineer

Overview

Remote
On Site
Depends on Experience
Accepts corp to corp applications
Contract - W2
Contract - 12 Month(s)

Skills

Machine Learning Operations (ML Ops)
Auditing
Backup
Backup Administration
Capacity Management
Computer Hardware
Onboarding
Offshoring
Neo4j
NIST SP 800 Series
ITAR
Grafana
Dashboard
Database
Development Management
Docker
Documentation
GPU
Kubernetes
Large Language Models (LLMs)
Linux
Management
Operating Systems
Performance Monitoring
Recovery
Resource Allocation
Security Policy
Status Reports
Storage
Test Execution
Testing

Job Details

Location: Huntsville, AL. This position is mostly remote (for maintaining the software), but it may require occasional travel to the Huntsville location to maintain the hardware when hardware issues or changes arise.

Contract

JD

Looking for candidates with experience in Rancher, Kubernetes, NVIDIA GPUs, and Kubernetes security/policy tools.
Good knowledge of and hands-on experience with Linux operating systems and their management

Good knowledge of GPU management on Linux and in containers

Good knowledge of Docker and Kubernetes platforms such as Rancher, including RKE and Rancher Management Server (key skill and key advantage)

Good knowledge and understanding of software-defined storage and networking

Good knowledge of monitoring tools such as Prometheus and Grafana

Knowledge of Large Language Model (LLM) deployment and monitoring

Knowledge of inferencing engines such as vLLM

Scope of Work

RKE2 Cluster Maintenance and Administration Tasks

Regular System Updates and Upgrades

  • Kubernetes version upgrades (planning, testing, execution, rollback if needed)
  • RKE2 specific updates and patches
  • Host OS patching and upgrades on all 10 nodes
  • Container runtime (containerd) updates
  • GPU driver and NVIDIA GPU Operator updates
  • Monitoring software stack updates (Prometheus, Grafana, etc.)
  • CNI plugin updates
  • Security patches and CVE remediations

High Availability and Disaster Recovery

  • Regular testing of etcd backups and restoration procedures
  • Velero backup validation and test restores
  • Documentation and execution of DR procedures
  • Node replacement procedures when hardware fails
  • Cluster recovery exercises and documentation

Monitoring and Alerting

  • Implementing Prometheus rules for our brainstormed alert list
  • Alert tuning and reduction of false positives
  • Creating and maintaining runbooks for each alert type
  • 24/7 alert response (if required)
  • On-call rotation management
  • Creation of monitoring dashboards for different stakeholders

Performance Management

  • Regular cluster performance assessments
  • Resource utilization optimization
  • GPU utilization monitoring and optimization
  • Ceph storage performance monitoring and tuning
  • Database performance monitoring and tuning

Security Management

  • Regular security audits
  • Network policy reviews and updates
  • RBAC configuration maintenance
  • Certificate rotation and management
  • Security scanning of container images
  • Kyverno policy maintenance and updates
  • Audit log reviews
  • TLS certificate management and rotation

User Support and Access Management

  • Managing access for approximately 30 users (20 US-based, 10 offshore)
  • Namespace management for proper separation of US and global teams
  • User onboarding and offboarding procedures
  • Training and documentation for users
  • Troubleshooting application deployment issues

Capacity Planning and Resource Management

  • Regular assessment of resource utilization trends
  • Planning for additional capacity needs
  • Resource quota management and adjustments
  • Namespace resource limit management
  • Storage capacity planning and management

Documentation and Knowledge Transfer

  • Maintaining up-to-date cluster documentation
  • Creating and updating runbooks for common procedures
  • Knowledge transfer sessions with internal team members
  • Regular status reporting and metrics

MLOps Infrastructure Management

  • ClearML Enterprise (or alternative MLOps tool) maintenance and updates
  • ML model deployment pipeline maintenance
  • GPU resource allocation optimization
  • ML experiment tracking infrastructure maintenance

Logging and Observability

  • FluentBit configuration maintenance
  • Splunk integration maintenance
  • OpenTelemetry and Jaeger maintenance
  • Log retention policy implementation

Multi-tenancy and Compliance

  • Ensuring continued separation between US and global teams
  • Regular audits of ITAR, EAR, DFARS, and NIST SP 800-171 compliance
  • Enforcement of data access controls
  • Periodic validation of namespace isolation

Incident Management

  • Root cause analysis for production incidents
  • Post-incident reviews and improvement plans
  • Incident documentation and knowledge sharing
  • SLA monitoring and reporting

Database Infrastructure Management (optional; may fall to app developers)

  • Management of Neo4j, Weaviate, Chroma, and Milvus instances
  • Database backup procedures
  • Database performance tuning
  • High-availability configuration for critical databases

About K-Tek Resourcing LLC