Job Details
Location: Huntsville, AL. This position is mostly remote (for maintaining the software), but it may require occasional travel to the Huntsville location to maintain the hardware when hardware issues or changes arise.
Contract
Job Description
We are looking for candidates with experience in Rancher, Kubernetes, NVIDIA GPUs, and Kubernetes security/policy tools.
Good knowledge of, and hands-on experience with, Linux operating systems and their management
Good knowledge of GPU management on Linux and in containers
Good knowledge of Docker and of Kubernetes platforms such as Rancher: RKE and Rancher Management Server (key skill and a key advantage)
Good knowledge and understanding of software-defined storage and networking
Good knowledge of monitoring tools such as Prometheus and Grafana
Knowledge of Large Language Model (LLM) deployment and monitoring
Knowledge of inference engines such as vLLM
Scope of Work
RKE2 Cluster Maintenance and Administration Tasks
Regular System Updates and Upgrades
- Kubernetes version upgrades (planning, testing, execution, rollback if needed)
- RKE2 specific updates and patches
- Host OS patching and upgrades on all 10 nodes
- Container runtime (containerd) updates
- GPU driver and NVIDIA GPU Operator updates
- Monitoring software stack updates (Prometheus, Grafana, etc.)
- CNI plugin updates
- Security patches and CVE remediations
High Availability and Disaster Recovery
- Regular testing of etcd backups and restoration procedures
- Velero backup validation and test restores
- Documentation and execution of DR procedures
- Node replacement procedures when hardware fails
- Cluster recovery exercises and documentation
Monitoring and Alerting
- Implementing Prometheus alerting rules for the alert list already brainstormed by the team
- Alert tuning and reduction of false positives
- Creating and maintaining runbooks for each alert type
- 24/7 alert response (if required)
- On-call rotation management
- Creation of monitoring dashboards for different stakeholders
Performance Management
- Regular cluster performance assessments
- Resource utilization optimization
- GPU utilization monitoring and optimization
- Ceph storage performance monitoring and tuning
- Database performance monitoring and tuning
Security Management
- Regular security audits
- Network policy reviews and updates
- RBAC configuration maintenance
- Certificate rotation and management
- Security scanning of container images
- Kyverno policy maintenance and updates
- Audit log reviews
- TLS certificate management and rotation
User Support and Access Management
- Managing access for approximately 30 users (20 US-based, 10 offshore)
- Namespace management for proper separation of US and global teams
- User onboarding and offboarding procedures
- Training and documentation for users
- Troubleshooting application deployment issues
Capacity Planning and Resource Management
- Regular assessment of resource utilization trends
- Planning for additional capacity needs
- Resource quota management and adjustments
- Namespace resource limit management
- Storage capacity planning and management
Documentation and Knowledge Transfer
- Maintaining up-to-date cluster documentation
- Creating and updating runbooks for common procedures
- Knowledge transfer sessions with internal team members
- Regular status reporting and metrics
MLOps Infrastructure Management
- ClearML Enterprise (or alternative MLOps tool) maintenance and updates
- ML model deployment pipeline maintenance
- GPU resource allocation optimization
- ML experiment tracking infrastructure maintenance
Logging and Observability
- Fluent Bit configuration maintenance
- Splunk integration maintenance
- OpenTelemetry and Jaeger maintenance
- Log retention policy implementation
Multi-tenancy and Compliance
- Ensuring continued separation between US and global teams
- Regular audits of ITAR, EAR, DFARS, NIST 800-171 compliance
- Enforcement of data access controls
- Periodic validation of namespace isolation
Incident Management
- Root cause analysis for production incidents
- Post-incident reviews and improvement plans
- Incident documentation and knowledge sharing
- SLA monitoring and reporting
Database Infrastructure Management (optional; may fall to the application developers)
- Management of Neo4j, Weaviate, Chroma, and Milvus instances
- Database backup procedures
- Database performance tuning
- High-availability configuration for critical databases