Machine Learning Engineer, ML/GenAI Evaluation

Austin, TX, US • Posted 3 days ago • Updated 1 day ago

Full Time

On-site

Job Details

Skills

Generative Artificial Intelligence (AI)
Accountability
Payments
Science
Legal
Applied Mathematics
Research
Dimensional Modeling
Testing
Distribution
Test Suites
Optical Character Recognition
Finance
Data Extraction
Python
Fluency
Communication
Computer Science
Data Science
Statistics
Artificial Intelligence
Machine Learning (ML)
Evaluation
Auditing
Privacy
Financial Services

Summary

Would you like to contribute to Machine Learning and Generative AI technologies? Are you passionate about measuring what matters and ensuring AI systems work reliably for everyone? Do you believe that rigorous evaluation, including holding models accountable to fairness standards, is what separates great ML from good ML? We truly believe it is!

We are defining what exceptional looks like for machine learning across Wallet, Payments, and Commerce. As a Machine Learning Engineer specializing in Evaluation, you will establish the evaluation criteria, metrics frameworks, and quality standards that determine when models are ready to reach hundreds of millions of users. Your judgment shapes model quality and gives the team the confidence to ship.

You'll work at the intersection of rigorous ML science and high-impact product decisions, collaborating closely with ML Engineering, Product, Privacy, and Legal teams. This unique opportunity puts you at the center of model quality: designing adversarial test strategies, surfacing failure modes before they reach users, and owning the sign-off process that ensures Apple's financial features meet the highest bar for accuracy, robustness, and reliability.

The ideal candidate is a rigorous, curious ML practitioner who believes that how you measure a model is just as important as how you train it. You think critically about what metrics actually capture, know how models break in the real world, and hold quality standards that others find uncomfortably high, including on dimensions such as fairness.

You will own the full evaluation lifecycle for ML models across Wallet features: designing test frameworks, adversarial corpora, and benchmarks that reflect the diversity of Apple's global user base, then making the final quality call before any model ships. Your findings directly shape model development priorities and product decisions at scale.

Minimum Qualifications

M.S. in Machine Learning, Computer Science, Statistics, Applied Mathematics, or a related technical field strongly preferred; a Bachelor's degree with 7+ years of hands-on experience in ML evaluation, model quality, or applied research will be considered

5+ years of hands-on ML experience, with deep expertise in model evaluation, offline metrics design, and behavioral testing

Strong track record designing evaluation frameworks for production ML systems, covering not just accuracy/F1 but also precision-recall tradeoffs, calibration, fairness, and task-specific quality dimensions

Creative mindset with the ability to translate standard ML evaluation metrics (F1, AUC, etc.) into utility and user-trust measures

Experience testing for distribution shift, out-of-distribution generalization, and temporal drift in real-world deployed models

Proven ability to construct adversarial test suites, aggressor scenarios, and edge-case corpora that surface model failure modes before they reach users

Experience with structured and semi-structured document understanding, OCR pipelines, or financial data extraction is a strong plus

Strong programming skills in Python; fluency with evaluation tooling, data pipelines, and experiment tracking (e.g., MLflow, W&B, or equivalent)

Excellent communication skills, including the ability to translate metric results into product-quality narratives for engineering and executive audiences

Experience owning model quality sign-off in a cross-functional launch process

Preferred Qualifications

PhD in Computer Science, Data Science, Statistics, AI/ML, or a related field.

Experience with Bayesian or causal graph-based approaches to data generation.

Experience with causal approaches to fairness evaluation, such as counterfactual fairness, causal Shapley values, or structural causal model-based bias auditing.

Experience evaluating models under privacy constraints or in on-device inference settings is a plus.

Familiarity with confidence calibration techniques and uncertainty quantification is a plus.

Background in financial services, fintech, or consumer payment products.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 90733111
Position Id: a911bdeb43d139cf8c21b21c2cbdaece

More jobs at Apple, Inc. in Austin, TX