On-prem Platform Engineer

Charlotte, NC, US • Posted 16 hours ago • Updated 14 hours ago

Contract Corp To Corp

Contract Independent

Contract W2

6 Months

No Travel Required

On-site

$65 - $75/hr

Job Details

Summary

Role: On-prem Platform Engineer

Location: Charlotte, NC

Key Skills:

Must-Have Skills (Mandatory Keywords)

LLM Inference & Optimization

vLLM, TensorRT-LLM, Triton Inference Server, SGLang
Inference optimization techniques:

Continuous batching
Speculative decoding
KV cache / Prefix caching

Model optimization:

FP8, AWQ, GPTQ

Distributed & GPU Systems

Tensor parallelism and large model scaling
CUDA, NCCL, GPU architecture
GPU partitioning & optimization (MIG)

Kubernetes & ML Serving

Kubernetes-based ML serving platforms
KServe, OpenShift AI
Helm charts, Operators, platform automation

GPU Orchestration

Run:AI or similar GPU scheduling/orchestration platforms
Multi-tenant GPU workload management

Platform Engineering

Experience building internal AI/ML platforms (on-prem or hybrid)
Strong automation and system design mindset

Observability & Performance

Prometheus, Grafana
ML observability (model latency, throughput, drift, resource utilization)
Performance benchmarking and tuning

Good to Have / Preferred Skills

Experience with LLMOps / GenAI pipelines
Exposure to hybrid cloud (on-prem + Google Cloud Platform/Azure integration)
Familiarity with Inferentia / alternative accelerators
Knowledge of service mesh / networking in GPU clusters

Responsibilities:

· Build, configure, and operate on‑prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.

· Design and optimize high‑performance inference stacks using vLLM, TensorRT‑LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).

· Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.

· Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.

· Drive inference optimization and benchmarking, leveraging FP8, AWQ, GPTQ, and performance tools such as GuideLLM and Locust.

· Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for GenAI services.

· Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.

Dice Id: 91139925
Position Id: 8985840

Contact the job poster

Rajesh Duvvuri

Recruiter @ TekGlobal
