Role: On-prem Platform Engineer
Location: Charlotte, NC
Key Skills:
Must-Have Skills (Mandatory Keywords)
LLM Inference & Optimization
- vLLM, TensorRT-LLM, Triton Inference Server, SGLang
- Inference optimization techniques:
  - Continuous batching
  - Speculative decoding
  - KV caching / prefix caching
- Model optimization: quantization (FP8, AWQ, GPTQ)
Distributed & GPU Systems
- Tensor parallelism and large model scaling
- CUDA, NCCL, GPU architecture
- GPU partitioning & optimization (MIG)
Kubernetes & ML Serving
- Kubernetes-based ML serving platforms
- KServe, OpenShift AI
- Helm charts, Operators, platform automation
GPU Orchestration
- Run:AI or similar GPU scheduling/orchestration platforms
- Multi-tenant GPU workload management
Platform Engineering
- Experience building internal AI/ML platforms (on-prem or hybrid)
- Strong automation and system design mindset
Observability & Performance
- Prometheus, Grafana
- ML observability (model latency, throughput, drift, resource utilization)
- Performance benchmarking and tuning
Good to Have / Preferred Skills
- Experience with LLMOps / GenAI pipelines
- Exposure to hybrid cloud (on-prem + Google Cloud Platform/Azure integration)
- Familiarity with Inferentia / alternative accelerators
- Knowledge of service mesh / networking in GPU clusters
Key Responsibilities:
· Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
· Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).
· Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
· Deploy and operate Kubernetes-based model serving with KServe, packaged and automated through Helm charts and Operators, for scalable, reliable model serving.
· Drive inference optimization and benchmarking, using FP8, AWQ, and GPTQ quantization and load-testing tools such as GuideLLM and Locust.
· Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for GenAI services.
· Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.