Overview
On Site
Depends on Experience
Full Time
No Travel Required
Skills
Artificial Intelligence
Continuous Integration
Generative Artificial Intelligence (AI)
Machine Learning Operations (ML Ops)
Machine Learning (ML)
Job Details
Looking for: Lead, AI/ML Platform Engineer
Job Type: Full time
Location: San Francisco, CA OR Denver, CO OR Springfield Gardens, NY OR Atlanta, GA OR Dallas, TX
We're looking for a hands-on Lead AI/ML Platform Engineer to architect and evolve the GenAI platform that drives business value across our global operations. You'll lead the execution of strategy and technical direction of our AI platform, mentoring engineers, shaping standards, and driving adoption across decentralized teams, helping to scale reusable frameworks and support production-grade AI deployments.
Key responsibilities include:
- Design and own AI/ML infrastructure for scalable, secure, cloud-native platforms (Dataiku, AWS, OpenAI, Pinecone).
- Lead GenAI platform development including prompt workflows, agent context, memory, and retrieval architectures.
- Design and own custom Model Context Protocol (MCP) server architecture.
- Build scalable EKS-based backends to support MCP services and real-time AI API endpoints.
- Define and enforce CI/CD, MLOps, and IaC standards across all AI projects.
- Architect agent evaluation tooling, cost tracking, and feedback loops.
- Establish agent scalability and governance: develop template-driven scaling patterns and define platform standards and lifecycle governance for reusable agents.
- Design internal self-service agent toolkits with permission controls.
- Enable and mentor development teams in AI/ML techniques.
- Work effectively with offshore teams to coordinate and integrate AI/ML developments.
- Communicate effectively, translating complex technical details into understandable concepts for non-technical stakeholders.
Requirements
- Bachelor's degree in Computer Science, Engineering, or related field; Master's a plus.
- 8+ years building and operating production ML/AI platforms, including 3+ years in a technical lead capacity.
- Deep expertise with AWS (EC2, S3, Lambda, EKS) and infrastructure as code with Terraform or CloudFormation.
- Hands-on Kubernetes experience and strong grasp of containerized microservice architectures.
- Advanced software engineering skills in Python with experience building high-throughput APIs and real-time serving systems.
- Strong SQL skills.
- Proven track record implementing CI/CD, MLOps frameworks, and observability tooling.
- Practical experience with LLM tooling and vector databases (OpenAI, LangChain, Pinecone) and designing agent context, memory, and retrieval.
- Demonstrated ability to design secure, compliant systems and manage cost efficiency.
- Skilled communicator and mentor, able to lead distributed teams and influence senior stakeholders.
- Ability to multitask, prioritize effectively, and thrive in a fast-paced, dynamic environment.
Preferred:
- Experience with Dataiku and Snowflake strongly preferred.
- Experience designing multi-agent systems and agent lifecycle standards.
- Exposure to agent evaluation systems and ROI tracking.