Senior LLMOps / AI Platform Engineer (Onsite)

Atlanta, GA, US • Posted 6 hours ago • Updated 6 hours ago
Contract Independent
Contract W2
Contract Corp To Corp
12 Months
Travel Required
On-site
Depends on Experience

Job Details

Skills

  • Azure OpenAI
  • Kubernetes
  • ArgoCD
  • Jenkins
  • Azure AI Search
  • Langfuse
  • LLMOps
  • RAG
  • CI/CD
  • Observability
  • Python

Summary

Job Title: Senior LLMOps / AI Platform Engineer
Location: Atlanta, GA
Visa: USC
Key Requirements: Azure OpenAI, Kubernetes, ArgoCD, Jenkins, Azure AI Search, Langfuse, LLMOps, RAG, CI/CD, Observability, Python
Must be willing to work onsite in Atlanta, GA
 
Job Description:
  • This role owns the operational foundation for George and other generative AI applications. The person should make the AI platform reliable, observable, cost-aware, secure, and release-ready across development, QA, and production environments.

Role Area: Expected Ownership

  • Azure AI / Azure OpenAI: Model deployments, quotas, TPM/RPM planning, rate-limit troubleshooting, deployment configuration, model upgrade support, fallback planning, and cost/performance monitoring.
  • Azure AI Search / RAG Infrastructure: Index, indexer, skillset, embedding, semantic search, hybrid search, retrieval quality support, and performance troubleshooting for George knowledge sources.
  • CI/CD and Release Engineering: Jenkins pipelines, build promotion, environment readiness, automated gates, release checklists, rollback planning, and deployment validation.
  • Kubernetes / Argo CD: Argo CD sync health, manifest drift, Kubernetes pod health, environment promotion, scaling, config maps, secrets, and rollback operations.
  • Langfuse / Observability: Trace ingestion, dashboards, prompt/version visibility, datasets support, experiment troubleshooting, latency, token usage, cost, and failure analysis.
  • Production Support: Incident triage, root-cause analysis, operational runbooks, cross-team issue resolution, and executive-ready status reporting.
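
As a rough illustration of the TPM/RPM planning listed above, the sketch below estimates what fraction of a deployment's tokens-per-minute and requests-per-minute limits a projected workload would use. The class, function names, limits, and the 80% threshold are all hypothetical; real limits come from the quota assigned to the Azure OpenAI deployment.

```python
from dataclasses import dataclass


@dataclass
class DeploymentQuota:
    """Hypothetical per-deployment limits (not read from any Azure API)."""
    tpm_limit: int  # tokens per minute
    rpm_limit: int  # requests per minute


def quota_utilization(quota, requests_per_minute, avg_tokens_per_request):
    """Return (tpm_fraction, rpm_fraction) of the quota the workload would use."""
    projected_tpm = requests_per_minute * avg_tokens_per_request
    return projected_tpm / quota.tpm_limit, requests_per_minute / quota.rpm_limit


def needs_attention(quota, requests_per_minute, avg_tokens_per_request, threshold=0.8):
    """True when either limit is projected to be at or above the threshold."""
    tpm_frac, rpm_frac = quota_utilization(
        quota, requests_per_minute, avg_tokens_per_request
    )
    return max(tpm_frac, rpm_frac) >= threshold
```

A check like this could feed a dashboard or alert so that quota pressure is visible before requests start being rate-limited.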

George-Specific Responsibilities

  • Azure resource configuration: Maintain Azure OpenAI deployments, model capacity, deployment names, environment configuration, quotas, private networking needs, and service-level access. Microsoft documents Azure OpenAI quota as scoped by region, subscription, model, and deployment type, which makes active quota planning part of production operations.
  • Model operations: Coordinate model upgrades, fallback behavior, model routing, deployment comparisons, cost control, latency monitoring, and model-specific release risk.
  • RAG platform support: Support Azure AI Search indexes, indexers, skillsets, chunking strategy, vectorization, semantic ranking, hybrid retrieval, document ingestion failures, and search performance. Azure AI Search positions RAG as a pattern for grounding LLM responses in proprietary content.
  • Langfuse ownership: Own Langfuse uptime, trace ingestion, dashboards, prompt/version visibility, dataset support, experiment logging, scoring ingestion, and troubleshooting when runs do not persist.
  • Jenkins pipeline ownership: Maintain build, test, package, promotion, release, and rollback workflows. Jenkins should be used as the automation backbone for CI/CD activities where applicable.
  • Argo CD / Kubernetes ownership: Resolve sync errors, drift, bad manifests, image deployment issues, pod failures, config map/secrets issues, autoscaling, and rollback scenarios. Argo CD continuously compares desired Git state to live Kubernetes state and surfaces OutOfSync conditions.
  • Operational observability: Own dashboards and alerting for p50/p95 response latency, LLM latency, search latency, tool-call latency, token usage, error rate, trace ingestion health, cost trends, and failed user journeys.
  • Incident management: Create runbooks and RCA templates for failed LLM calls, quota pressure, failed traces, broken eval ingestion, indexer failures, retrieval misses, bad releases, and gateway or service errors.
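
To make the p50/p95 latency metrics mentioned above concrete, here is a minimal sketch that computes them from raw latency samples using the nearest-rank method. The function names are hypothetical, and a production dashboard would normally get these values from the observability backend instead of computing them by hand.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize response latencies (milliseconds) as p50 and p95."""
    return {"p50": percentile(samples_ms, 50), "p95": percentile(samples_ms, 95)}
```

The same helper could be applied separately to LLM latency, search latency, and tool-call latency so each stage gets its own p50/p95 pair.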

Responsibilities for Broader Agentic Applications

Capability: What This Role Should Cover

  • Agent runtime operations: Support LangGraph, LangChain, OpenAI Agents SDK, Microsoft Agent Framework, Semantic Kernel, MCP servers, and other future orchestration patterns.
  • Tool-call observability: Track tool selection, parameters, latency, retries, failures, confidence, source system errors, and downstream business impact.
  • Prompt/model release operations: Version prompts, compare prompt changes, coordinate release promotion, roll back bad prompt versions, and document model behavior changes.
  • Evaluation gates in CI/CD: Partner with QA to make eval results, latency, cost, safety, and tool-call quality part of release approval.
  • Security and governance: Support identity, secrets, private endpoints, least-privilege access, environment isolation, and auditability for AI systems.
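
One way to picture the tool-call observability item above is a small record per tool call plus a per-tool rollup. This is only a sketch: the `ToolCall` fields and `summarize_tool_calls` are hypothetical names, and real data would come from the tracing system in use (for example, Langfuse traces).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one agent tool invocation."""
    tool: str
    latency_ms: float
    retries: int
    succeeded: bool


def summarize_tool_calls(calls):
    """Aggregate call count, failure rate, total retries, and mean latency per tool."""
    grouped = defaultdict(list)
    for call in calls:
        grouped[call.tool].append(call)

    summary = {}
    for tool, items in grouped.items():
        failures = sum(1 for c in items if not c.succeeded)
        summary[tool] = {
            "calls": len(items),
            "failure_rate": failures / len(items),
            "retries": sum(c.retries for c in items),
            "mean_latency_ms": sum(c.latency_ms for c in items) / len(items),
        }
    return summary
```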

Candidate Requirements

Must-Have Requirement: Reason It Matters

  • Azure cloud experience: George depends on Azure AI/OpenAI, Azure AI Search, networking, monitoring, and environment configuration.
  • Kubernetes experience: Needed for pod health, scaling, deployment troubleshooting, secrets/configuration, and service reliability.
  • Jenkins or comparable CI/CD experience: Needed to own build, release, validation, rollback, and promotion workflows.
  • Argo CD / GitOps experience: Needed to manage environment drift, declarative deployments, sync issues, and auditable release promotion.
  • Observability and incident response: Needed to debug logs, traces, metrics, dashboards, alerts, production incidents, and performance regressions.
  • LLM application understanding: Must understand prompts, tokens, embeddings, RAG, retrieval, context windows, hallucinations, model latency, and rate limits.
  • Python/scripting ability: Needed for automation, eval support, Langfuse integrations, operational scripts, and troubleshooting utilities.
  • Cross-functional communication: Must translate technical AI platform risks into business-readable status, release risk, and recommended action.

Strongly Preferred: Why It Helps

  • Langfuse: Directly relevant to George tracing, prompt versions, datasets, eval runs, and production behavior analysis.
  • Azure AI Foundry / Azure OpenAI: Useful for model deployment, evaluation, monitoring, tracing, and quota/capacity operations.
  • Azure AI Search: Required to deeply support George RAG retrieval, indexes, skillsets, and search quality diagnostics.
  • Dynatrace / App Insights / OpenTelemetry: Helps bridge LLM telemetry with normal distributed system telemetry.
  • LangGraph / MCP / agent frameworks: Important as George evolves from a RAG chatbot into an agentic workflow platform.
  • IaC (Bicep, Terraform, Helm, Kustomize): Makes AI infrastructure repeatable, reviewable, and environment-consistent.

Expected Deliverables

Deliverable: Description

  • LLMOps runbook: Troubleshooting guide for model failures, latency spikes, quota limits, Langfuse ingestion issues, indexer failures, Argo CD sync failures, and rollback.
  • Environment readiness checklist: Dev/QA/Prod checklist for Azure resources, model deployments, secrets, pods, indexes, Langfuse, monitoring, and routing.
  • Release readiness gate: Pre-release sign-off proving deployment health, eval completion, traces captured, acceptable latency/cost, rollback readiness, and no critical incidents.
  • Observability dashboard set: Dashboards for LLM latency, total latency, token usage, cost, search latency, tool-call success, error rates, trace health, and feedback trends.
  • Incident RCA template: Standard format for capturing root cause, customer impact, detection gap, mitigation, permanent fix, and prevention plan.
  • Model/prompt deployment process: Documented approach for promoting, comparing, monitoring, and rolling back model, prompt, and routing changes.
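
The release readiness gate described above could be expressed as a simple automated check. The sketch below is an assumption about how such a gate might look, not a description of the team's actual process: the `ReleaseCandidate` fields mirror the criteria in the deliverable, and the latency and cost thresholds are placeholder values.

```python
from dataclasses import dataclass


@dataclass
class ReleaseCandidate:
    """Hypothetical snapshot of the signals a release gate would inspect."""
    deployment_healthy: bool
    evals_completed: bool
    traces_captured: bool
    p95_latency_ms: float
    cost_per_request_usd: float
    rollback_ready: bool
    open_critical_incidents: int


def gate_failures(rc, max_p95_latency_ms=3000.0, max_cost_per_request_usd=0.05):
    """Return the reasons a release should be blocked; an empty list means pass."""
    checks = [
        (rc.deployment_healthy, "deployment is not healthy"),
        (rc.evals_completed, "eval run has not completed"),
        (rc.traces_captured, "no traces captured for the candidate build"),
        (rc.p95_latency_ms <= max_p95_latency_ms, "p95 latency above threshold"),
        (rc.cost_per_request_usd <= max_cost_per_request_usd,
         "cost per request above threshold"),
        (rc.rollback_ready, "rollback path not verified"),
        (rc.open_critical_incidents == 0, "critical incidents still open"),
    ]
    return [reason for passed, reason in checks if not passed]
```

A CI/CD pipeline step could call `gate_failures` and fail the build whenever the returned list is non-empty, which keeps the sign-off criteria reviewable in one place.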

30 / 60 / 90 Day Expectations

Timeframe: Expected Outcomes

  • First 30 days: Understand George architecture, Azure resources, model deployments, Jenkins, Argo CD, Langfuse, environments, release process, and current operational gaps. Produce an initial LLMOps assessment.
  • First 60 days: Improve dashboards, runbooks, release checklists, Langfuse health checks, Azure quota monitoring, and Argo CD/Jenkins troubleshooting documentation.
  • First 90 days: Implement or formalize AI release gates, improve observability coverage, reduce production troubleshooting time, and establish a repeatable deployment/rollback process for model, prompt, index, and backend changes.

“Cleo Consulting is an equal opportunity employer (Minorities/Women/Veterans/Disabled)”

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 91081631
  • Position Id: 7362-22658-