
Senior LLMOps / AI Platform Engineer (Onsite)

Atlanta, GA, US • Posted 6 hours ago • Updated 6 hours ago

Contract Independent

Contract W2

Contract Corp To Corp

12 Months

Travel Required

On-site

Depends on Experience

Job Details

Skills

Azure OpenAI
Kubernetes
ArgoCD
Jenkins
Azure AI Search
Langfuse
LLMOps
RAG
CI/CD
Observability
Python

Summary

Job Title: Senior LLMOps / AI Platform Engineer

Location: Atlanta, GA

Visa: USC (U.S. citizen)

Key Requirement: Azure OpenAI, Kubernetes, ArgoCD, Jenkins, Azure AI Search, Langfuse, LLMOps, RAG, CI/CD, Observability, Python

Must be willing to work onsite in Atlanta, GA

Job Description:

This role owns the operational foundation for George and other generative AI applications. The engineer is expected to keep the AI platform reliable, observable, cost-aware, secure, and release-ready across development, QA, and production environments.

Role Area: Expected Ownership

Azure AI / Azure OpenAI: Model deployments, quotas, TPM/RPM planning, rate-limit troubleshooting, deployment configuration, model upgrade support, fallback planning, and cost/performance monitoring.
Azure AI Search / RAG Infrastructure: Index, indexer, skillset, embedding, semantic search, hybrid search, retrieval quality support, and performance troubleshooting for George knowledge sources.
CI/CD and Release Engineering: Jenkins pipelines, build promotion, environment readiness, automated gates, release checklists, rollback planning, and deployment validation.
Kubernetes / Argo CD: Argo CD sync health, manifest drift, Kubernetes pod health, environment promotion, scaling, config maps, secrets, and rollback operations.
Langfuse / Observability: Trace ingestion, dashboards, prompt/version visibility, datasets support, experiment troubleshooting, latency, token usage, cost, and failure analysis.
Production Support: Incident triage, root-cause analysis, operational runbooks, cross-team issue resolution, and executive-ready status reporting.
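
As an illustration of the TPM/RPM planning listed above, a rough capacity estimate might look like the sketch below. The function names and the traffic figures are hypothetical, and the formula is only a planning approximation: Azure's actual rate-limit accounting may count tokens differently.

```python
def estimated_tpm(requests_per_minute, avg_prompt_tokens, avg_completion_tokens):
    """Approximate tokens-per-minute demand from request rate and token sizes."""
    return requests_per_minute * (avg_prompt_tokens + avg_completion_tokens)


def quota_headroom(quota_tpm, requests_per_minute, avg_prompt_tokens, avg_completion_tokens):
    """Fraction of the TPM quota left unused; a negative value suggests throttling risk."""
    demand = estimated_tpm(requests_per_minute, avg_prompt_tokens, avg_completion_tokens)
    return (quota_tpm - demand) / quota_tpm


# Hypothetical workload: 60 requests/minute, 1,500 prompt and 500 completion tokens each.
demand = estimated_tpm(60, 1500, 500)
headroom = quota_headroom(240_000, 60, 1500, 500)
print(f"estimated demand: {demand} TPM, headroom: {headroom:.0%}")
```

A check like this could feed an alert when headroom drops below an agreed threshold, so quota increases are requested before users see rate-limit errors.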

George-Specific Responsibilities

Azure resource configuration: Maintain Azure OpenAI deployments, model capacity, deployment names, environment configuration, quotas, private networking needs, and service-level access. Microsoft documents Azure OpenAI quota as scoped by region, subscription, model, and deployment type, which makes active quota planning part of production operations.
Model operations: Coordinate model upgrades, fallback behavior, model routing, deployment comparisons, cost control, latency monitoring, and model-specific release risk.
RAG platform support: Support Azure AI Search indexes, indexers, skillsets, chunking strategy, vectorization, semantic ranking, hybrid retrieval, document ingestion failures, and search performance. Azure AI Search positions RAG as a pattern for grounding LLM responses in proprietary content.
Langfuse ownership: Own Langfuse uptime, trace ingestion, dashboards, prompt/version visibility, dataset support, experiment logging, scoring ingestion, and troubleshooting when runs do not persist.
Jenkins pipeline ownership: Maintain build, test, package, promotion, release, and rollback workflows. Jenkins should be used as the automation backbone for CI/CD activities where applicable.
Argo CD / Kubernetes ownership: Resolve sync errors, drift, bad manifests, image deployment issues, pod failures, config map/secrets issues, autoscaling, and rollback scenarios. Argo CD continuously compares desired Git state to live Kubernetes state and surfaces OutOfSync conditions.
Operational observability: Own dashboards and alerting for p50/p95 response latency, LLM latency, search latency, tool-call latency, token usage, error rate, trace ingestion health, cost trends, and failed user journeys.
Incident management: Create runbooks and RCA templates for failed LLM calls, quota pressure, failed traces, broken eval ingestion, indexer failures, retrieval misses, bad releases, and gateway or service errors.
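
For the p50/p95 latency alerting mentioned under operational observability, a minimal sketch of the calculation is shown below. It assumes latency samples are already available as a list of milliseconds; the function names and the budget value are hypothetical, and a production setup would more likely compute percentiles in the monitoring backend.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return (p50, p95) for a list of request latencies in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def breaches_budget(latencies_ms, p95_budget_ms):
    """True when the observed p95 latency exceeds the agreed budget."""
    _, p95 = latency_percentiles(latencies_ms)
    return p95 > p95_budget_ms


samples = list(range(1, 101))  # hypothetical latencies: 1 ms .. 100 ms
p50, p95 = latency_percentiles(samples)
print(p50, p95, breaches_budget(samples, p95_budget_ms=90))
```

The same function can be applied separately to LLM, search, and tool-call latency so each stage gets its own budget.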

Responsibilities for Broader Agentic Applications

Capability: What This Role Should Cover
Agent runtime operations: Support LangGraph, LangChain, OpenAI Agents SDK, Microsoft Agent Framework, Semantic Kernel, MCP servers, and other future orchestration patterns.
Tool-call observability: Track tool selection, parameters, latency, retries, failures, confidence, source system errors, and downstream business impact.
Prompt/model release operations: Version prompts, compare prompt changes, coordinate release promotion, roll back bad prompt versions, and document model behavior changes.
Evaluation gates in CI/CD: Partner with QA to make eval results, latency, cost, safety, and tool-call quality part of release approval.
Security and governance: Support identity, secrets, private endpoints, least-privilege access, environment isolation, and auditability for AI systems.

Candidate Requirements

Must-Have Requirement: Reason It Matters
Azure cloud experience: George depends on Azure AI/OpenAI, Azure AI Search, networking, monitoring, and environment configuration.
Kubernetes experience: Needed for pod health, scaling, deployment troubleshooting, secrets/configuration, and service reliability.
Jenkins or comparable CI/CD experience: Needed to own build, release, validation, rollback, and promotion workflows.
Argo CD / GitOps experience: Needed to manage environment drift, declarative deployments, sync issues, and auditable release promotion.
Observability and incident response: Needed to debug logs, traces, metrics, dashboards, alerts, production incidents, and performance regressions.
LLM application understanding: Must understand prompts, tokens, embeddings, RAG, retrieval, context windows, hallucinations, model latency, and rate limits.
Python/scripting ability: Needed for automation, eval support, Langfuse integrations, operational scripts, and troubleshooting utilities.
Cross-functional communication: Must translate technical AI platform risks into business-readable status, release risk, and recommended action.

Strongly Preferred: Why It Helps

Langfuse: Directly relevant to George tracing, prompt versions, datasets, eval runs, and production behavior analysis.
Azure AI Foundry / Azure OpenAI: Useful for model deployment, evaluation, monitoring, tracing, and quota/capacity operations.
Azure AI Search: Required to deeply support George RAG retrieval, indexes, skillsets, and search quality diagnostics.
Dynatrace / App Insights / OpenTelemetry: Helps bridge LLM telemetry with normal distributed system telemetry.
LangGraph / MCP / agent frameworks: Important as George evolves from a RAG chatbot into an agentic workflow platform.
IaC (Bicep, Terraform, Helm, Kustomize): Makes AI infrastructure repeatable, reviewable, and environment-consistent.

Expected Deliverables

Deliverable: Description

LLMOps runbook: Troubleshooting guide for model failures, latency spikes, quota limits, Langfuse ingestion issues, indexer failures, Argo CD sync failures, and rollback.
Environment readiness checklist: Dev/QA/Prod checklist for Azure resources, model deployments, secrets, pods, indexes, Langfuse, monitoring, and routing.
Release readiness gate: Pre-release sign-off proving deployment health, eval completion, traces captured, acceptable latency/cost, rollback readiness, and no critical incidents.
Observability dashboard set: Dashboards for LLM latency, total latency, token usage, cost, search latency, tool-call success, error rates, trace health, and feedback trends.
Incident RCA template: Standard format for capturing root cause, customer impact, detection gap, mitigation, permanent fix, and prevention plan.
Model/prompt deployment process: Documented approach for promoting, comparing, monitoring, and rolling back model, prompt, and routing changes.
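
The release readiness gate described above could be expressed as an automated check along the lines of this sketch. The `ReleaseEvidence` fields, the `gate_failures` helper, and the thresholds are all hypothetical; in practice the inputs would come from the pipeline, the tracing system, and the incident tracker.

```python
from dataclasses import dataclass


@dataclass
class ReleaseEvidence:
    """Hypothetical evidence collected before a release sign-off."""
    deployment_healthy: bool
    evals_complete: bool
    traces_captured: bool
    p95_latency_ms: float
    cost_per_request_usd: float
    rollback_tested: bool
    open_critical_incidents: int


def gate_failures(evidence, max_p95_ms, max_cost_usd):
    """Return the names of the gate criteria that are not met (empty list means pass)."""
    checks = {
        "deployment health": evidence.deployment_healthy,
        "eval completion": evidence.evals_complete,
        "traces captured": evidence.traces_captured,
        "acceptable latency": evidence.p95_latency_ms <= max_p95_ms,
        "acceptable cost": evidence.cost_per_request_usd <= max_cost_usd,
        "rollback readiness": evidence.rollback_tested,
        "no critical incidents": evidence.open_critical_incidents == 0,
    }
    return [name for name, passed in checks.items() if not passed]


candidate = ReleaseEvidence(True, False, True, 3000.0, 0.02, True, 1)
print(gate_failures(candidate, max_p95_ms=2500.0, max_cost_usd=0.05))
```

Returning the list of failed criteria, instead of a single pass/fail flag, keeps the sign-off readable for non-technical reviewers.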

30 / 60 / 90 Day Expectations

Timeframe: Expected Outcomes
First 30 days: Understand George architecture, Azure resources, model deployments, Jenkins, Argo CD, Langfuse, environments, release process, and current operational gaps. Produce an initial LLMOps assessment.
First 60 days: Improve dashboards, runbooks, release checklists, Langfuse health checks, Azure quota monitoring, and Argo CD/Jenkins troubleshooting documentation.
First 90 days: Implement or formalize AI release gates, improve observability coverage, reduce production troubleshooting time, and establish a repeatable deployment/rollback process for model, prompt, index, and backend changes.

“Cleo Consulting is an equal opportunity employer (Minorities/Women/Veterans/Disabled)”

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 91081631
Position Id: 7362-22658-