On-Premise LLM Inference & GPU Systems Engineer

Charlotte, North Carolina, US • Posted 15 hours ago • Updated 14 hours ago
Contract W2
On-site
DOE
Fitment

Job Details

Skills

  • Operational Efficiency
  • Scheduling
  • Language Models
  • Workflow
  • Service Delivery
  • Scalability
  • Technical Writing
  • Caching
  • Orchestration
  • Open Source
  • Large Language Models (LLMs)
  • Management
  • Onboarding
  • Version Control
  • GPU
  • Resource Management
  • Performance Tuning
  • Cloud Computing
  • Performance Analysis
  • Problem Solving
  • Conflict Resolution
  • Communication
  • Collaboration
  • Generative Artificial Intelligence (AI)
  • Analytics
  • Regulatory Compliance
  • Kubernetes
  • Optimization
  • Artificial Intelligence

Summary

We are seeking an On-Premise LLM Inference & GPU Systems Engineer to build, optimize, and support a large-scale enterprise Generative AI infrastructure environment. This role is focused exclusively on Large Language Model (LLM) inference operations within a private on-premises ecosystem utilizing NVIDIA H200 GPU clusters and OpenShift AI. The ideal candidate will possess deep expertise in GPU runtime optimization, inference serving platforms, Kubernetes-based orchestration, and production-scale deployment of open-source LLMs. This position will be responsible for maximizing inference performance, operational efficiency, and platform reliability across enterprise AI workloads.

Key Responsibilities

  • Design, deploy, and maintain large-scale on-premises LLM inference infrastructure supporting enterprise Generative AI workloads.
  • Optimize runtime performance of token generation pipelines, including prefill/decode optimization and KV cache management.
  • Deploy, configure, and manage inference serving platforms such as vLLM and TensorRT-LLM.
  • Optimize GPU utilization, throughput, batching strategies, latency, and resource efficiency across production inference environments.
  • Manage workload scheduling and orchestration using Kubernetes-based GPU orchestration platforms and RunAI.
  • Oversee the complete lifecycle of open-source language models, including onboarding, deployment, version management, monitoring, and retirement.
  • Manage and support enterprise Hugging Face model deployment workflows and operational processes.
  • Operate, maintain, and optimize the OpenShift AI ecosystem supporting Generative AI applications and services.
  • Monitor platform performance, identify bottlenecks, and implement optimization strategies to improve inference efficiency and scalability.
  • Collaborate with AI, platform engineering, infrastructure, and operations teams to ensure reliable service delivery.
  • Implement operational best practices related to platform availability, monitoring, security, and governance.
  • Develop automation, deployment processes, and operational procedures to support platform scalability and maintainability.
  • Troubleshoot and resolve infrastructure, inference, performance, and deployment issues across the AI ecosystem.
  • Create and maintain technical documentation, operational runbooks, and platform standards.

Required Qualifications

  • 5+ years of experience as an LLM Systems Engineer, AI Infrastructure Engineer, AI Platform Engineer, or related role.
  • 5+ years of hands-on experience supporting NVIDIA GPU environments and runtime optimization techniques.
  • Experience optimizing token generation pipelines, including KV cache management and prefill/decode optimization strategies.
  • Strong experience deploying and managing inference frameworks such as vLLM and TensorRT-LLM.
  • 3+ years of experience with OpenShift AI and containerized AI platform operations.
  • 3+ years of experience with GPU orchestration technologies, including RunAI and Kubernetes-based environments.
  • Experience deploying, managing, and supporting open-source Large Language Models in production environments.
  • Proven experience managing the Hugging Face model lifecycle, including onboarding, deployment, version management, and retirement.
  • Strong understanding of AI inference architectures, GPU resource management, workload optimization, and performance tuning.
  • Experience working with containerization technologies, Kubernetes, and cloud-native application platforms.
  • Strong troubleshooting, performance analysis, and problem-solving skills.
  • Excellent communication and collaboration skills with the ability to work across infrastructure, platform, and AI engineering teams.

Preferred Qualifications

  • Experience supporting enterprise-scale Generative AI platforms and private AI infrastructure environments.
  • Experience optimizing large-scale LLM inference workloads in highly regulated or secure environments.
  • Knowledge of AI observability, monitoring, logging, and performance analytics tools.
  • Experience implementing infrastructure automation and operational tooling for AI platforms.
  • Familiarity with enterprise governance, security, and compliance practices for AI workloads.
  • Experience supporting multi-cluster Kubernetes or OpenShift environments.
  • Knowledge of emerging trends and best practices in LLM inference optimization and AI platform engineering.

Education: Bachelor's Degree

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: compun
  • Position Id: BHADC5822012