On-Premise LLM Inference & GPU Systems Engineer

Charlotte, North Carolina, US • Posted 15 hours ago • Updated 14 hours ago

Contract W2

On-site

DOE

Job Details

Skills

Operational Efficiency
Scheduling
Language Models
Workflow
Service Delivery
Scalability
Technical Writing
Caching
Orchestration
Open Source
Large Language Models (LLMs)
Management
Onboarding
Version Control
GPU
Resource Management
Performance Tuning
Cloud Computing
Performance Analysis
Problem Solving
Conflict Resolution
Communication
Collaboration
Generative Artificial Intelligence (AI)
Analytics
Regulatory Compliance
Kubernetes
Optimization
Artificial Intelligence

Summary

Job Summary

We are seeking an On-Premise LLM Inference & GPU Systems Engineer to build, optimize, and support a large-scale enterprise Generative AI infrastructure environment. This role is focused exclusively on Large Language Model (LLM) inference operations within a private on-premises ecosystem utilizing NVIDIA H200 GPU clusters and OpenShift AI. The ideal candidate will possess deep expertise in GPU runtime optimization, inference serving platforms, Kubernetes-based orchestration, and production-scale deployment of open-source LLMs. This position will be responsible for maximizing inference performance, operational efficiency, and platform reliability across enterprise AI workloads.

Key Responsibilities

Design, deploy, and maintain large-scale on-premises LLM inference infrastructure supporting enterprise Generative AI workloads.
Optimize runtime performance of token generation pipelines, including prefill/decode optimization and KV cache management.
Deploy, configure, and manage inference serving platforms such as vLLM and TensorRT-LLM.
Optimize GPU utilization, throughput, batching strategies, latency, and resource efficiency across production inference environments.
Manage workload scheduling and orchestration using Kubernetes-based GPU orchestration platforms and RunAI.
Oversee the complete lifecycle of open-source language models, including onboarding, deployment, version management, monitoring, and retirement.
Manage and support enterprise Hugging Face model deployment workflows and operational processes.
Operate, maintain, and optimize the OpenShift AI ecosystem supporting Generative AI applications and services.
Monitor platform performance, identify bottlenecks, and implement optimization strategies to improve inference efficiency and scalability.
Collaborate with AI, platform engineering, infrastructure, and operations teams to ensure reliable service delivery.
Implement operational best practices related to platform availability, monitoring, security, and governance.
Develop automation, deployment processes, and operational procedures to support platform scalability and maintainability.
Troubleshoot and resolve infrastructure, inference, performance, and deployment issues across the AI ecosystem.
Create and maintain technical documentation, operational runbooks, and platform standards.

Required Qualifications

5+ years of experience as an LLM Systems Engineer, AI Infrastructure Engineer, AI Platform Engineer, or related role.
5+ years of hands-on experience supporting NVIDIA GPU environments and runtime optimization techniques.
Experience optimizing token generation pipelines, including KV cache management and prefill/decode optimization strategies.
Strong experience deploying and managing inference frameworks such as vLLM and TensorRT-LLM.
3+ years of experience with OpenShift AI and containerized AI platform operations.
3+ years of experience with GPU orchestration technologies, including RunAI and Kubernetes-based environments.
Experience deploying, managing, and supporting open-source Large Language Models in production environments.
Proven experience managing the Hugging Face model lifecycle, including onboarding, deployment, version management, and retirement.
Strong understanding of AI inference architectures, GPU resource management, workload optimization, and performance tuning.
Experience working with containerization technologies, Kubernetes, and cloud-native application platforms.
Strong troubleshooting, performance analysis, and problem-solving skills.
Excellent communication and collaboration skills with the ability to work across infrastructure, platform, and AI engineering teams.

Preferred Qualifications

Experience supporting enterprise-scale Generative AI platforms and private AI infrastructure environments.
Experience optimizing large-scale LLM inference workloads in highly regulated or secure environments.
Knowledge of AI observability, monitoring, logging, and performance analytics tools.
Experience implementing infrastructure automation and operational tooling for AI platforms.
Familiarity with enterprise governance, security, and compliance practices for AI workloads.
Experience supporting multi-cluster Kubernetes or OpenShift environments.
Knowledge of emerging trends and best practices in LLM inference optimization and AI platform engineering.

Education

Bachelor's Degree

Dice Id: compun
Position Id: BHADC5822012