ML Safety Engineer

San Francisco, CA, US • Posted 30+ days ago • Updated 5 hours ago

Full Time

On-site

Job Details

Skills

Music
Video
Continuous Integration and Development
Media
SAFE
Computer Science
Research
Design Of Experiments
Benchmarking
Software Engineering
Workflow
Analytical Skill
Communication
Python
Pandas
NumPy
Jupyter
PyTorch
Unstructured Data
Data Science
Linguistics
HCI
Psychology
Publications
Artificial Intelligence
Machine Learning (ML)
Automated Testing
Evaluation
Swift
Human Factors And Ergonomics
Science

Summary

Apple Services Engineering (ASE) powers many AI features across App Store, Music, Video and more. We build deeply personal products with the goal of representing users around the globe authentically. We work continuously to avoid perpetuating systemic biases and maintain safe and trustworthy experiences across our AI tools and models.

Our team, part of Apple Services Engineering, is looking for an ML Research Engineer to lead the design and continuous development of automated safety benchmarking methodologies. In this role, you will investigate how media-related agents behave, develop rigorous evaluation frameworks and techniques, and establish scientific standards for assessing the risks they pose and their safety performance. This role supports the development of scalable evaluation techniques that ensure our engineers have the right tools to assess candidate models and product features for responsible and safe performance.

The capabilities you build will allow for the generation of benchmark datasets and evaluation methodologies for model and application outputs, at scale, to enable engineering teams to translate safety insights into actionable engineering and product improvements. This role blends deep technical expertise with strong analytical judgment to develop tools and capabilities for assessing and improving the behavior of advanced AI/ML models. You will work cross-functionally with Engineering and Project Managers, Product, and Governance teams to develop a suite of technologies to ensure that AI experiences are reliable, safe, and aligned with human expectations.

The successful candidate will take a proactive approach to working independently and collaboratively on a wide range of projects. In this role, you will work alongside a small but impactful team, collaborating with ML and data scientists, software developers, project managers, and other teams at Apple to understand requirements and translate them into scalable, reliable, and efficient evaluation frameworks.

Advanced degree (MS or PhD) in Computer Science, Software Engineering, or equivalent research/work experience
1+ years of work experience either as a postdoc or in the industry
Strong research background in empirical evaluation, experimental design, or benchmarking
Strong proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)
Deep familiarity with software engineering workflows and developer tools
Experience working with or evaluating AI/ML models, preferably LLMs or program synthesis systems
Strong analytical and communication skills, including the ability to write clear reports

Technical Skills:
Proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)
Experience working with large datasets, annotation tools, and model evaluation pipelines
Familiarity with evaluations specific to responsible AI and safety, hallucination detection, and/or model alignment concerns
Ability to design taxonomies, categorization schemes, and structured labeling frameworks
Analytical Strength: Ability to interpret unstructured data (text, transcripts, user sessions) and derive meaningful insights
Communication: Strong ability to stitch together qualitative and quantitative insights into actionable guidance; strong ability to communicate complex architectures and systems to a variety of stakeholders
Education in Data Science, Linguistics, Cognitive Science, HCI, Psychology, Social Science, or a related field

Publications in AI/ML evaluation or related fields
Experience with automated testing frameworks
Experience constructing human-in-the-loop or multi-turn evaluation setups
Intermediate or advanced proficiency in Swift
Familiarity with RAG systems, reinforcement learning, agentic architectures, and model fine-tuning
Expertise in designing annotation guidelines and validation instruments and techniques
Background in human factors, social science, and/or safety assessment methodologies

Dice Id: 90733111
Position Id: 4114c0403291ebc67c72e21822eca212