Job Details
Job Title: Red Hat Engineer (Red Hat Linux, Ansible, Red Hat Cluster)
Location: Huntsville, AL (Hybrid)
Duration / Term: Long-Term Contract
Job Description
We are seeking a highly skilled, hands-on Linux Infrastructure Engineer with deep expertise in Red Hat Linux, Ansible automation, and Red Hat Cluster technologies. The ideal candidate will have strong experience in Kubernetes (RKE2), Rancher, GPU management, and container orchestration, along with a solid understanding of software-defined storage, networking, and observability tools. This role involves maintaining and optimizing a high-performance, secure, and compliant MLOps infrastructure that supports advanced workloads, including Large Language Models (LLMs) and inference engines.
Experience & Qualifications
- Proven experience with Red Hat Linux administration, Ansible automation, and Red Hat Cluster configuration
- Hands-on expertise in Rancher RKE2, Kubernetes, Docker, and GPU management on Linux and container platforms
- Strong understanding of Kubernetes security tooling, including Kyverno, RBAC, and network policies
- Experience with Prometheus, Grafana, Fluent Bit, Splunk, OpenTelemetry, and Jaeger for monitoring and observability
- Familiarity with ClearML Enterprise, ML model deployment pipelines, and GPU resource optimization
- Knowledge of inference engines such as vLLM, and of graph and vector databases such as Neo4j, Weaviate, Chroma, and Milvus
- Understanding of ITAR, EAR, DFARS, and NIST 800-171 compliance
- Strong documentation, troubleshooting, and user support capabilities
- Bachelor’s degree in Computer Science or a related field preferred
Scope of Work
- Maintain and upgrade RKE2 clusters, including Kubernetes versions, container runtimes, and GPU drivers
- Implement and validate disaster recovery procedures, Velero backups, and etcd restorations
- Develop and tune Prometheus alert rules, dashboards, and runbooks
- Conduct performance assessments and optimize GPU and Ceph storage utilization
- Perform security audits, manage TLS certificates, and scan container images
- Manage user access, namespace isolation, and resource quotas for multi-tenant environments
- Maintain MLOps infrastructure, including ML pipelines and experiment tracking
- Ensure logging and observability integrations are operational and compliant
- Support incident management, root cause analysis, and SLA reporting
- Document procedures and conduct knowledge transfer sessions
Key Skills
Red Hat Linux, Ansible, Red Hat Cluster, Rancher, RKE2, Kubernetes, Docker, GPU Management, Prometheus, Grafana, Fluent Bit, Splunk, OpenTelemetry, Jaeger, Kyverno, RBAC, TLS, ClearML, MLOps, vLLM, Neo4j, Weaviate, Chroma, Milvus, ITAR, EAR, DFARS, NIST 800-171, Software-Defined Storage, Network Policies, Certificate Management, Disaster Recovery, Observability Tools
VDart Group, a global leader in technology, product, and talent management, empowers businesses with comprehensive solutions through our four distinct, industry-leading business units. With a diverse team of over 4,000 professionals across 13 countries, we deliver strong results across various industries, including Fortune 500 companies.
Committed to "People, Purpose, Planet," we prioritize social responsibility and sustainability, as evidenced by our EcoVadis Bronze Medal Certification and participation in the UN Global Compact.
Our dedication to delivering strong results has earned us recognition as a trusted advisor for businesses seeking to drive innovation and growth, including many Fortune 500 companies.
Join our network! Partner with VDart Group to leverage our global network, industry expertise, and proven track record with a diverse clientele.