Search Jobs | Dice.com

Job Details

Skills

Privacy
iOS Development
OS X
Artificial Intelligence
Shipping
Human-computer Interaction
Microsoft Exchange
Modeling
Data Science
Computer Science
Linguistics
Statistics
Science
Natural Language Processing
Large Language Models (LLMs)
Python
Data Processing
Prototyping
Pandas
Jupyter
Data Visualization
Ontologies
Machine Learning (ML)
Training
Management
Performance Metrics
SQL
Apache Spark
Prompt Engineering
Evaluation
Communication
Collaboration

Summary

Join the team redefining what a deeply personal and integrated assistant can be.

As part of the Siri organization, you will help shape one of the world's most widely used AI assistants, powered by our next generation of Apple Intelligence, with capabilities like personal context understanding and on-screen awareness, built with privacy from the ground up. Your work will have direct, meaningful impact for users across iOS, iPadOS, macOS, watchOS, and visionOS.

This is a rare opportunity to build at the intersection of cutting-edge AI and human-centered design, shipping technology that is centered around users and their needs.

Description

Play a part in the ongoing revolution in human-computer interaction. Siri is evolving - and the way we evaluate it has to evolve with it. Join the Evaluation Integrity team to help build the trusted quality signal behind every Siri release.

Within the Siri evaluation organization, the Human Evaluation sub-team is responsible for answering the question: can we trust our evals? We do that by designing human-in-the-loop (HITL) annotation tasks that scrutinize every moving part of an agentic evaluation - the simulated user agent, the conversation it has with Siri, and the automated evaluators that grade the exchange. This role sits at the intersection of data science, human annotation engineering, and evaluation methodology, and is instrumental in turning human judgment into a rigorous, reproducible signal that directly informs pre-ship model and product decisions.

As an Annotation Data Scientist on the Evaluation Integrity team, you will design and run HITL annotation projects that evaluate the quality and authenticity of agentic user personae, the validity of agent-to-agent conversations, and the reliability of LLM-as-judge and rule-based evaluators against Siri's product specifications. You will own annotation initiatives end-to-end, from rubric design and tooling, through annotator calibration, to data science analysis that turns annotator judgments into actionable signal for modeling, planning, and product teams.

Minimum Qualifications

Bachelor's or Master's degree in a quantitative or related field such as Data Science, Computer Science, Linguistics, Statistics, or Cognitive Science, or equivalent job-related experience.

5+ years of hands-on experience working with human-annotated datasets or human-in-the-loop evaluation methodologies for machine learning, natural language processing, or large language model systems.

5+ years of experience using Python for data processing, analysis, and prototyping, including experience with libraries such as pandas, Jupyter, and at least one data visualization library.

Experience designing, implementing, and communicating annotation schemas, rubrics, or ontologies for machine learning training or evaluation data.

Experience managing multiple concurrent dataset curation efforts, including scoping work, iterating on guidelines, coordinating with in-house or vendor annotators, and monitoring annotator performance metrics such as accuracy, throughput, and inter-annotator agreement.

Experience specifying or designing custom annotation tooling in collaboration with software engineers.

Preferred Qualifications

Experience evaluating LLM-powered or agentic systems, including familiarity with LLM-as-judge methodologies, rubric-based grading, or trajectory and tool-call evaluation.

Familiarity with statistical methods that address accuracy and variability in human annotation data, such as inter-annotator agreement, Cohen's or Fleiss' kappa, Krippendorff's alpha, or bootstrapping.

Data-querying experience with SQL, Spark, or similar, and comfort working with large, complex, real-world datasets.

Experience building pre-ship evaluation pipelines for conversational or assistant products.

Experience with prompt engineering, or with designing simulated user personae for agent evaluation.

Experience running annotation programs across multiple locales or at large scale.

Excellent written and verbal communication skills, with the ability to explain technical topics clearly to data scientists, engineers, annotators, and cross-functional partners.

Proven ability to collaborate effectively across functions and drive projects of varying sizes and scopes - knowing when to dive deep and when to delegate.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 90733111
Position Id: e7f7289fa927f66c585f0a55475a0858
Posted 30+ days ago

Annotation Data Scientist, Evaluation Integrity (Siri)
