Staff Machine Learning Platform Engineer, AI Evaluation

Washington, WA, US • Posted 20 hours ago • Updated 7 hours ago
Full Time
On-site

Job Details

Skills

  • Architectural Design
  • High Availability
  • Software Engineering
  • Technical Direction
  • Python
  • Pandas
  • Partnership
  • Research
  • Testing
  • API
  • Continuous Integration
  • Continuous Delivery
  • Docker
  • Kubernetes
  • Accountability
  • Computer Science
  • Artificial Intelligence
  • LangSmith
  • Orchestration
  • Concurrent Computing
  • Machine Learning (ML)
  • Evaluation
  • Workflow
  • Generative Artificial Intelligence (AI)
  • Management
  • Economics
  • Reasoning
  • Startups
  • Roadmaps

Summary

Join Apple Services Engineering to build the next generation of AI evaluation systems. We are seeking a Staff Machine Learning Platform Engineer to lead the architectural design and development of the high-availability services and internal tools that power self-service evaluation at scale. You will partner with researchers to operationalize their innovations, transforming complex workflows into intuitive, developer-first platforms. We are looking for builders who thrive in the ambiguity of new initiatives and are passionate about creating scalable infrastructure.

You will join the engineering team responsible for democratizing AI evaluation across the organization. Your focus will be the developer experience: architecting and implementing the APIs, SDKs, and platform services that turn complex evaluation metrics into simple, self-service calls. You will work hand in hand with researchers to operationalize sophisticated measurement techniques, ensuring they scale reliably within our high-availability infrastructure. In this role, you will set the engineering standards for a new organization, upholding the code quality, automation, and testing rigor required to support the rapid evolution of Generative AI and agentic systems.

Minimum Qualifications

  • 8+ years of hands-on software engineering experience, with a track record of owning the technical direction of a platform or infrastructure domain.
  • Strong proficiency in the Python ecosystem (e.g., FastAPI, Pydantic, Pandas); you write production-grade code and lead architectural discussions on day one (see the sketch after this list).
  • Customer obsession and product thinking: you have owned the technical roadmap for an internal platform, presented it to senior stakeholders, and shipped against it. You independently translate vague requirements from other teams into concrete engineering specifications and platform roadmaps.
  • Demonstrated experience leading technical partnerships with data scientists or researchers: you have taken research code, shipped it as a production service, and built the abstractions, testing frameworks, and deployment pipelines that made each handoff faster than the last.
  • Strong expertise in API design and platform infrastructure: you have designed and owned APIs and SDKs that other developers rely on, with a focus on versioning, backward compatibility, and developer experience at scale.
  • Operational excellence background: you have architected and owned CI/CD pipelines, containerization (Docker/Kubernetes), and monitoring (Datadog/Prometheus) for production services, and have been accountable for their reliability.
  • Bachelor's in Computer Science or a related field; Master's preferred.
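
As a concrete illustration of the "simple, self-service calls" described above, here is a minimal sketch of an evaluation endpoint on the FastAPI/Pydantic stack named in these qualifications. The route, request fields, and placeholder metric are illustrative assumptions, not this team's actual API.

    # Hypothetical sketch of a self-service evaluation endpoint.
    # All names (eval-service, /v1/evaluate, exact_match) are assumptions.
    from fastapi import FastAPI
    from pydantic import BaseModel, Field

    app = FastAPI(title="eval-service")

    class EvalRequest(BaseModel):
        candidate: str = Field(..., description="Model response to score")
        reference: str = Field(..., description="Gold/reference answer")
        metric: str = Field("exact_match", description="Name of the metric to apply")

    class EvalResponse(BaseModel):
        metric: str
        score: float

    @app.post("/v1/evaluate", response_model=EvalResponse)
    def evaluate(req: EvalRequest) -> EvalResponse:
        # Placeholder metric: exact match. A real platform would dispatch to a
        # registry of researcher-contributed metrics behind this stable contract.
        score = 1.0 if req.candidate.strip() == req.reference.strip() else 0.0
        return EvalResponse(metric=req.metric, score=score)

Versioning the route (/v1/...) and pinning the request/response schemas in Pydantic models is one way to preserve the backward compatibility the role calls for while metric implementations evolve behind the contract.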

Preferred Qualifications

  • Deep familiarity with AI evaluation frameworks: you have built, extended, or contributed to modern evaluation tools such as DeepEval, Ragas, TruLens, or LangSmith, and you understand how to implement and scale model-based evaluation workflows across a large organization.
  • Evaluation service deployment: you will own the deployment, scaling, and operational health of evaluation services in production, including high-throughput evaluation job orchestration (queueing, prioritization, concurrency, auto-scaling) and defining SLAs for evaluation pipeline latency and availability (a sketch of these concerns follows this list).
  • Observability and reliability: experience instrumenting production ML evaluation pipelines, including tracking evaluation job throughput, queue depth, judge-model latency SLAs, scoring drift over time, and the failure modes specific to non-deterministic LLM-based evaluation workflows.
  • Deep understanding of Generative AI and agents: you understand the engineering challenges of relying on LLMs and agents as software components, specifically managing token economics, handling rate limits, and evaluating non-deterministic, multi-step reasoning. You have built production systems that depend on these components and have solved these problems at scale.
  • Builder experience: you have thrived in startup-like environments, navigating high ambiguity to deliver complex technical roadmaps from scratch.
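
To ground the orchestration and reliability concerns above (concurrency caps, rate limits, non-deterministic judge calls), here is a hedged sketch of a judge-model evaluation worker in plain asyncio. call_judge_model is a hypothetical stand-in for a real model endpoint, and the concurrency limit and backoff constants are arbitrary assumptions.

    # Hypothetical sketch: bounded-concurrency LLM-judge evaluation with retry.
    import asyncio
    import random

    async def call_judge_model(sample: dict) -> float:
        # Stand-in for a real judge-model API call returning a score in [0, 1].
        await asyncio.sleep(0.01)           # simulated network latency
        if random.random() < 0.1:           # simulated occasional rate-limit error
            raise RuntimeError("rate limited")
        return random.random()              # simulated non-deterministic verdict

    async def evaluate_one(sample: dict, sem: asyncio.Semaphore, max_retries: int = 4) -> float:
        async with sem:                     # cap in-flight judge calls
            for attempt in range(max_retries):
                try:
                    return await call_judge_model(sample)
                except RuntimeError:
                    # Exponential backoff with jitter on rate limits.
                    await asyncio.sleep((2 ** attempt) * 0.1 + random.random() * 0.05)
            return float("nan")             # surface the failure; don't block the batch

    async def evaluate_batch(samples: list[dict], concurrency: int = 8) -> list[float]:
        sem = asyncio.Semaphore(concurrency)
        return await asyncio.gather(*(evaluate_one(s, sem) for s in samples))

    if __name__ == "__main__":
        scores = asyncio.run(evaluate_batch([{"id": i} for i in range(100)]))
        ok = [s for s in scores if s == s]  # drop NaN failures
        print(f"mean score over {len(ok)}/{len(scores)} completed: {sum(ok) / len(ok):.3f}")

The semaphore bounds in-flight judge traffic, the backoff absorbs rate-limit spikes, and NaN scores surface failures to downstream throughput and drift monitoring instead of silently dropping samples.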
  • Dice Id: 90733111
  • Position Id: ecc7ce77780fff73b3525e96c8f2d23e
