Job Details
We are seeking a Senior Software Development Engineer in Test (SDET) with a strong background in test automation, backend systems testing, and AI/LLM validation.
This is a hands-on, highly influential role responsible for:
Testing LLM-powered applications used across the enterprise
Building LLM-driven testing and evaluation workflows
Defining organization-wide standards for GenAI quality, reliability, and release readiness
Key Responsibilities
LLM Testing & Evaluation
Design and implement test strategies for LLM-powered systems, including:
Prompt and response validation
Regression testing across model, prompt, and data changes
Evaluation of accuracy, consistency, hallucinations, bias, and safety
Build and maintain LLM-based evaluation frameworks using tools such as DeepEval, MLflow, LangChain, and Langflow (an illustrative check is sketched after this list)
Develop synthetic and real-world test datasets in collaboration with the Data Engineer
Define quality thresholds, scoring mechanisms, benchmarks, and pass/fail criteria for GenAI systems
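
For illustration, the following is a minimal sketch of the kind of pytest-style evaluation check this role would own. It assumes DeepEval's LLMTestCase / assert_test API with its default LLM-as-judge metrics (which require a judge model to be configured, by default via OPENAI_API_KEY); generate_answer, the sample data, and the threshold values are hypothetical placeholders rather than part of this team's actual stack.

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase


def generate_answer(question: str) -> str:
    """Placeholder for the real application entry point under test."""
    return "Customers may return items within 30 days for a full refund."


@pytest.mark.parametrize(
    "question,context",
    [("What is the refund window?", ["Items may be returned within 30 days of purchase."])],
)
def test_answer_quality(question, context):
    answer = generate_answer(question)

    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        retrieval_context=context,  # documents returned by the retrieval step
        context=context,            # reference context used by the hallucination check
    )

    # Thresholds encode the agreed pass/fail criteria for this behavior.
    metrics = [AnswerRelevancyMetric(threshold=0.7), HallucinationMetric(threshold=0.3)]
    assert_test(test_case, metrics)

In practice, a check like this would run on every model, prompt, or dataset change, which is what the regression-testing responsibility above refers to.
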
Test Automation & Framework Development
Build and maintain automated test frameworks for:
LLM APIs and services
Agentic workflows and RAG pipelines
Data ingestion and inference pipelines
Integrate LLM testing and evaluation into CI/CD pipelines, enforcing quality gates prior to production release (an illustrative gate is sketched after this list)
Partner with engineering teams to improve testability, reliability, and observability of AI systems
Perform root-cause analysis for failures related to model behavior, data quality, or orchestration logic
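
As an illustration of the CI/CD quality gate mentioned above, the sketch below fails a pipeline stage when the evaluation suite's aggregate pass rate drops below an agreed release-readiness bar. The results-file format, the 95% threshold, and the script itself are assumptions for illustration only.

"""Illustrative CI quality gate: block the release stage if the evaluation
suite's pass rate falls below the agreed release-readiness threshold."""

import json
import sys
from pathlib import Path

PASS_RATE_THRESHOLD = 0.95  # assumed release-readiness bar; set per team agreement


def main(results_path: str = "eval_results.json") -> int:
    # Expected format (assumption): [{"case": "...", "passed": true}, ...]
    results = json.loads(Path(results_path).read_text())
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / len(results) if results else 0.0

    print(f"Evaluation pass rate: {pass_rate:.1%} ({passed}/{len(results)})")
    if pass_rate < PASS_RATE_THRESHOLD:
        print(f"Quality gate FAILED: below {PASS_RATE_THRESHOLD:.0%} threshold")
        return 1  # non-zero exit code blocks the CI/CD stage
    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
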
Observability & Monitoring
Instrument LLM applications using Datadog LLM Observability (sketched after this list) to track:
Latency, token usage, errors, and cost
Quality regressions, drift, and performance anomalies
Build dashboards and alerting focused on LLM quality and reliability
Use production telemetry to continuously refine test coverage and evaluation strategies
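
The following is a minimal instrumentation sketch, assuming Datadog's LLM Observability SDK in ddtrace (LLMObs.enable, the span decorators, and LLMObs.annotate). The ml_app name, the answer_question workflow, and call_model are hypothetical; API keys and site are normally supplied through DD_* environment variables rather than in code.

from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import workflow

# Enable LLM Observability for this service; in CI/production the API key
# and site typically come from DD_API_KEY / DD_SITE environment variables.
LLMObs.enable(ml_app="enterprise-assistant")


def call_model(question: str) -> str:
    """Placeholder for the real model/provider call in the application under test."""
    return "stubbed answer"


@workflow  # records a workflow span: latency, errors, and any nested LLM/tool spans
def answer_question(question: str) -> str:
    answer = call_model(question)
    # Attach inputs and outputs to the active span so quality regressions and
    # drift can later be analyzed against production traffic.
    LLMObs.annotate(input_data=question, output_data=answer)
    return answer

Dashboards and monitors for latency, token usage, cost, and quality regressions are then built on top of these spans, per the responsibilities above.
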
Shared Services & Collaboration
Act as a consultative partner to product, platform, and data teams adopting LLM technologies
Provide guidance on:
Generative AI test strategies
Prompt engineering and workflow validation
Release readiness and AI risk assessment
Contribute to organization-wide standards and best practices for testing, explaining, and monitoring AI systems
Participate in architecture and design reviews from a quality-first perspective
Engineering Excellence
Advocate for automation-first testing, infrastructure as code, and continuous monitoring
Drive adoption of Agile, DevOps, and CI/CD best practices within AI quality engineering
Conduct code reviews and promote secure, maintainable, and scalable test frameworks
Continuously improve internal tooling and frameworks within the QA Center of Excellence
Required Skills & Experience
Strong Python development skills
Experience testing backend systems, APIs, microservices, or distributed platforms
Proven experience building and maintaining automation frameworks
Ability to work effectively with ambiguous, non-deterministic systems
AI / LLM Experience
Hands-on experience testing or validating ML- or LLM-based systems
Familiarity with LLM orchestration and evaluation tools, including:
LangChain, Langflow
DeepEval, MLflow
Strong understanding of challenges unique to testing generative AI systems
Nice to Have
Experience with Datadog, especially LLM Observability
Exposure to Hugging Face, PyTorch, or TensorFlow (usage-level)
Experience testing RAG pipelines, vector databases, or data-driven platforms
Background working in platform teams, shared services, or QA Centers of Excellence
Experience collaborating closely with Data Engineering or ML Platform teams