GenAI Evaluation Engineer

  • Posted 6 hours ago | Updated 6 hours ago

Overview

Remote
Depends on Experience
Contract - W2
No Travel Required

Skills

  • ML evaluation, QA engineering, or analytics
  • Python, pandas, and NumPy
  • NLTK, Hugging Face evaluate, and sacrebleu
  • Cohen's Kappa
  • Power BI and Grafana
  • DVC and Git-LFS
  • HIPAA

Job Details

Job Title: GenAI Evaluation Engineer
Location: Hybrid/Remote

Role Overview

Your mission is to translate stakeholder needs from Regulatory, QA, and Data Science into rigorous, automated evaluation pipelines that ensure model quality, safety, and regulatory compliance.

Key Responsibilities

  • Requirements & Test Plan Design:
    • Run workshops with stakeholders to codify acceptance criteria for applications like CMC report analysis and patient-safety monitoring.
    • Draft detailed test cases (both positive and negative), define clear pass/fail thresholds, and maintain traceability matrices.
  • Automated Evaluation Pipelines:
    • Implement classic NLP metrics (BLEU/ROUGE), semantic similarity measures, and custom hallucination detectors in Python (see the scoring sketch after this list).
    • Orchestrate evaluation pipelines in Azure Data Factory or Airflow, integrating with tools like Prodigy or LightTag for human-in-the-loop annotation.
    • Develop qualitative coding schemas, author annotation guidelines, and ensure high inter-annotator agreement (Cohen's kappa ≥ 0.8; see the agreement sketch after this list).
  • Data Versioning & Drift Monitoring:
    • Manage and version evaluation datasets using DVC/Git-LFS while tracking dataset lineage.
    • Automate data and model drift detection using KS-tests and embedding-based alerts, publishing weekly reports on findings (see the drift-check sketch after this list).
  • Reporting & Governance:
    • Create interactive Power BI or Grafana dashboards to report on SLA compliance (e.g., accuracy > 98%, hallucination rate < 2%), trends, and anomalies.
    • Set up automated regression suites that block deployments if key metrics degrade beyond a set threshold (e.g., 5%); see the regression-gate sketch after this list.
    • Maintain detailed audit logs of evaluation runs and sign-offs to comply with GxP/GMP standards.
    • Lead internal training sessions on evaluation best practices and mentor junior evaluators.
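
The sketches below illustrate, in Python, the kinds of checks named in the responsibilities above. They are hedged examples under stated assumptions, not prescribed implementations; all variable names and sample data are hypothetical.

Scoring model outputs with BLEU/ROUGE, using the sacrebleu and Hugging Face evaluate libraries from the skills list (the rouge_score package is an additional assumed dependency):

    # Corpus-level BLEU and ROUGE for model outputs vs. reference texts.
    # Assumes: pip install sacrebleu evaluate rouge_score -- all sample data is hypothetical.
    import sacrebleu
    import evaluate

    predictions = ["The batch record shows no critical deviations."]   # model outputs
    references = ["The batch record reports no critical deviations."]  # gold references

    bleu = sacrebleu.corpus_bleu(predictions, [references])  # references passed as one reference stream
    print(f"BLEU: {bleu.score:.1f}")

    rouge = evaluate.load("rouge")  # Hugging Face evaluate metric wrapper
    print(rouge.compute(predictions=predictions, references=references))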
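
Checking inter-annotator agreement against the kappa >= 0.8 target, here with scikit-learn's cohen_kappa_score (scikit-learn is an assumed tool choice, not named in this posting):

    # Agreement between two annotators labeling the same items; labels are hypothetical.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["safe", "unsafe", "safe", "safe", "unsafe"]
    annotator_b = ["safe", "unsafe", "safe", "unsafe", "unsafe"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    status = "met" if kappa >= 0.8 else "not met"
    print(f"Cohen's kappa: {kappa:.2f} (target >= 0.8: {status})")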
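
Flagging distribution drift with a two-sample Kolmogorov-Smirnov test, here via scipy.stats (scipy is an assumed choice; both samples are synthetic placeholders):

    # Compare a stored baseline sample of a feature or score against a fresh production sample.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    reference = rng.normal(loc=0.0, scale=1.0, size=1_000)  # baseline sample
    current = rng.normal(loc=0.3, scale=1.0, size=1_000)    # simulated shifted sample

    result = ks_2samp(reference, current)
    drifted = result.pvalue < 0.01  # the alert threshold is a team policy choice
    print(f"KS statistic={result.statistic:.3f}, p={result.pvalue:.4f}, drift flagged: {drifted}")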
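
A regression gate that fails the run when a key metric degrades beyond the agreed threshold, so the CI/CD system can block the deployment (metric names, baseline values, and the 5% threshold are placeholders):

    # Compare current evaluation metrics to a stored baseline and exit non-zero on regression.
    import sys

    baseline = {"accuracy": 0.985, "hallucination_rate": 0.015}  # hypothetical stored baseline
    current = {"accuracy": 0.930, "hallucination_rate": 0.018}   # hypothetical current run
    max_relative_degradation = 0.05  # e.g., 5%

    failures = []
    for metric, base in baseline.items():
        # Accuracy should not drop; hallucination rate should not rise.
        worse = current[metric] < base if metric == "accuracy" else current[metric] > base
        if worse and abs(current[metric] - base) / base > max_relative_degradation:
            failures.append(metric)

    if failures:
        print(f"Blocking deployment; regressed metrics: {failures}")
        sys.exit(1)
    print("All metrics within threshold; deployment allowed.")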

Required Qualifications:

  • BS/MS in Computer Science, Statistics, Engineering, or a related field.
  • 2-5 years of experience in ML evaluation, QA engineering, or analytics, ideally in regulated domains.
  • Proficiency in Python, pandas, and NumPy.
  • Hands-on experience with evaluation libraries like NLTK, Hugging Face evaluate, and sacrebleu.
  • Strong statistical rigor, with a deep understanding of metrics like Cohen's Kappa.
  • Experience with BI/dashboarding tools (Power BI, Grafana) and data versioning tools (DVC, Git-LFS).

Preferred Qualifications (Nice-to-Haves):

  • Experience with MLflow for tracking experiments and metrics.
  • Background in qualitative research methods like open/axial coding, especially in safety-critical settings.
  • Expertise in regulatory compliance and audits for standards like HIPAA or GxP (21 CFR Part 11).