On-prem Platform Engineer

Charlotte, NC, US • Posted 16 hours ago • Updated 14 hours ago
Contract Corp To Corp
Contract Independent
Contract W2
6 Months
No Travel Required
On-site
$65 - $75/hr

Job Details

Skills

  • LLM

Summary

Role: On-prem Platform Engineer

Location: Charlotte, NC

Key Skills:

Must-Have Skills (Mandatory Keywords)

LLM Inference & Optimization

  • vLLM, TensorRT-LLM, Triton Inference Server, SGLang
  • Inference optimization techniques:
    • Continuous batching
    • Speculative decoding
    • KV cache / Prefix caching
  • Model optimization:
    • FP8, AWQ, GPTQ

Distributed & GPU Systems

  • Tensor parallelism and large model scaling
  • CUDA, NCCL, GPU architecture
  • GPU partitioning & optimization (MIG)

Kubernetes & ML Serving

  • Kubernetes-based ML serving platforms
  • KServe, OpenShift AI
  • Helm charts, Operators, platform automation

GPU Orchestration

  • Run:AI or similar GPU scheduling/orchestration platforms
  • Multi-tenant GPU workload management

Platform Engineering

  • Experience building internal AI/ML platforms (on-prem or hybrid)
  • Strong automation and system design mindset

Observability & Performance

  • Prometheus, Grafana
  • ML observability (model latency, throughput, drift, resource utilization)
  • Performance benchmarking and tuning

Good to Have / Preferred Skills

  • Experience with LLMOps / GenAI pipelines
  • Exposure to hybrid cloud (on-prem + Google Cloud Platform/Azure integration)
  • Familiarity with Inferentia / alternative accelerators
  • Knowledge of service mesh / networking in GPU clusters

Responsibilities

  • Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
  • Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, and SGLang, applying advanced techniques such as continuous batching, speculative decoding, and KV caching.
  • Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
  • Deploy and operate Kubernetes-based ML serving with KServe, Helm charts, and Operators for scalable, reliable model serving.
  • Drive inference optimization and benchmarking, leveraging FP8, AWQ, and GPTQ along with performance tools such as GuideLLM and Locust.
  • Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for GenAI services.
  • Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.

  • Dice Id: 91139925
  • Position Id: 8985840
Contact the job poster
Rajesh Duvvuri

Recruiter @ TekGlobal