GenAI Evaluation Engineer

  • Posted 6 hours ago | Updated 6 hours ago

Overview

Remote
Depends on Experience
Contract - W2
No Travel Required

Skills

  • ML evaluation, QA engineering, or analytics
  • Python, pandas, and NumPy
  • NLTK, Hugging Face evaluate, and sacrebleu
  • Cohen's Kappa
  • Power BI and Grafana
  • DVC and Git-LFS
  • HIPAA

Job Details

Job Title: GenAI Evaluation Engineer
Location: Hybrid/Remote

Role Overview

Your mission is to translate stakeholder needs from Regulatory, QA, and Data Science into rigorous, automated evaluation pipelines that ensure model quality, safety, and regulatory compliance.

Key Responsibilities

  • Requirements & Test Plan Design:
    • Run workshops with stakeholders to codify acceptance criteria for applications like CMC report analysis and patient-safety monitoring.
    • Draft detailed test cases (both positive and negative), define clear pass/fail thresholds, and maintain traceability matrices.
  • Automated Evaluation Pipelines:
    • Implement classic NLP metrics (BLEU/ROUGE), semantic similarity measures, and custom hallucination detectors in Python (see the scoring sketch after this list).
    • Orchestrate evaluation pipelines in Azure Data Factory or Airflow, integrating with tools like Prodigy or LightTag for human-in-the-loop annotation.
    • Develop qualitative coding schemas, author annotation guidelines, and ensure high inter-annotator agreement (Cohen's kappa ≥ 0.8; see the agreement sketch after this list).
  • Data Versioning & Drift Monitoring:
    • Manage and version evaluation datasets using DVC/Git-LFS while tracking dataset lineage.
    • Automate data and model drift detection using KS-tests and embedding-based alerts, publishing weekly reports on findings (see the drift-check sketch after this list).
  • Reporting & Governance:
    • Create interactive Power BI or Grafana dashboards to report on SLA compliance (e.g., accuracy > 98%, hallucination rate < 2%), trends, and anomalies.
    • Set up automated regression suites that block deployments if key metrics degrade beyond a set threshold (e.g., 5%); see the regression-gate sketch after this list.
    • Maintain detailed audit logs of evaluation runs and sign-offs to comply with GxP/GMP standards.
    • Lead internal training sessions on evaluation best practices and mentor junior evaluators.
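
The sketches below illustrate, in Python, the kinds of checks named in the responsibilities above. They are hedged examples under stated assumptions, not prescribed implementations; all variable names and sample data are hypothetical.

Scoring model outputs with BLEU/ROUGE, using the sacrebleu and Hugging Face evaluate libraries from the skills list (the rouge_score package is an additional assumed dependency):

    # Corpus-level BLEU and ROUGE for model outputs vs. reference texts.
    # Assumes: pip install sacrebleu evaluate rouge_score -- all sample data is hypothetical.
    import sacrebleu
    import evaluate

    predictions = ["The batch record shows no critical deviations."]   # model outputs
    references = ["The batch record reports no critical deviations."]  # gold references

    bleu = sacrebleu.corpus_bleu(predictions, [references])  # references passed as one reference stream
    print(f"BLEU: {bleu.score:.1f}")

    rouge = evaluate.load("rouge")  # Hugging Face evaluate metric wrapper
    print(rouge.compute(predictions=predictions, references=references))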
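
Checking inter-annotator agreement against the kappa >= 0.8 target, here with scikit-learn's cohen_kappa_score (scikit-learn is an assumed tool choice, not named in this posting):

    # Agreement between two annotators labeling the same items; labels are hypothetical.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["safe", "unsafe", "safe", "safe", "unsafe"]
    annotator_b = ["safe", "unsafe", "safe", "unsafe", "unsafe"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    status = "met" if kappa >= 0.8 else "not met"
    print(f"Cohen's kappa: {kappa:.2f} (target >= 0.8: {status})")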
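
Flagging distribution drift with a two-sample Kolmogorov-Smirnov test, here via scipy.stats (scipy is an assumed choice; both samples are synthetic placeholders):

    # Compare a stored baseline sample of a feature or score against a fresh production sample.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    reference = rng.normal(loc=0.0, scale=1.0, size=1_000)  # baseline sample
    current = rng.normal(loc=0.3, scale=1.0, size=1_000)    # simulated shifted sample

    result = ks_2samp(reference, current)
    drifted = result.pvalue < 0.01  # the alert threshold is a team policy choice
    print(f"KS statistic={result.statistic:.3f}, p={result.pvalue:.4f}, drift flagged: {drifted}")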
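
A regression gate that fails the run when a key metric degrades beyond the agreed threshold, so the CI/CD system can block the deployment (metric names, baseline values, and the 5% threshold are placeholders):

    # Compare current evaluation metrics to a stored baseline and exit non-zero on regression.
    import sys

    baseline = {"accuracy": 0.985, "hallucination_rate": 0.015}  # hypothetical stored baseline
    current = {"accuracy": 0.930, "hallucination_rate": 0.018}   # hypothetical current run
    max_relative_degradation = 0.05  # e.g., 5%

    failures = []
    for metric, base in baseline.items():
        # Accuracy should not drop; hallucination rate should not rise.
        worse = current[metric] < base if metric == "accuracy" else current[metric] > base
        if worse and abs(current[metric] - base) / base > max_relative_degradation:
            failures.append(metric)

    if failures:
        print(f"Blocking deployment; regressed metrics: {failures}")
        sys.exit(1)
    print("All metrics within threshold; deployment allowed.")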

Required Qualifications:

  • BS/MS in Computer Science, Statistics, Engineering, or a related field.
  • 2-5 years of experience in ML evaluation, QA engineering, or analytics, ideally in regulated domains.
  • Proficiency in Python, pandas, and NumPy.
  • Hands-on experience with evaluation libraries like NLTK, Hugging Face evaluate, and sacrebleu.
  • Strong statistical rigor, with a deep understanding of metrics like Cohen's Kappa.
  • Experience with BI/dashboarding tools (Power BI, Grafana) and data versioning tools (DVC, Git-LFS).

Preferred Qualifications (Nice-to-Haves):

  • Experience with MLflow for tracking experiments and metrics.
  • Background in qualitative research methods like open/axial coding, especially in safety-critical settings.
  • Expertise in regulatory compliance and audits for standards like HIPAA or GxP (21 CFR Part 11).