Job Details
We are seeking a Senior Software Development Engineer in Test (SDET) with a strong background in test automation, backend systems testing, and AI/LLM validation.
This is a hands-on, highly influential role responsible for:
Testing LLM-powered applications used across the enterprise
Building LLM-driven testing and evaluation workflows
Defining organization-wide standards for GenAI quality, reliability, and release readiness
Key Responsibilities
LLM Testing & Evaluation
Design and implement test strategies for LLM-powered systems, including:
Prompt and response validation
Regression testing across model, prompt, and data changes
Evaluation of accuracy, consistency, hallucinations, bias, and safety
Build and maintain LLM-based evaluation frameworks using tools such as DeepEval, MLflow, LangChain, and Langflow (an illustrative check is sketched after this list)
Develop synthetic and real-world test datasets in collaboration with the Data Engineer
Define quality thresholds, scoring mechanisms, benchmarks, and pass/fail criteria for GenAI systems
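
For illustration, the following is a minimal sketch of the kind of pytest-style evaluation check this role would own. It assumes DeepEval's LLMTestCase / assert_test API with its default LLM-as-judge metrics (which require a judge model to be configured, by default via OPENAI_API_KEY); generate_answer, the sample data, and the threshold values are hypothetical placeholders rather than part of this team's actual stack.

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase


def generate_answer(question: str) -> str:
    """Placeholder for the real application entry point under test."""
    return "Customers may return items within 30 days for a full refund."


@pytest.mark.parametrize(
    "question,context",
    [("What is the refund window?", ["Items may be returned within 30 days of purchase."])],
)
def test_answer_quality(question, context):
    answer = generate_answer(question)

    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        retrieval_context=context,  # documents returned by the retrieval step
        context=context,            # reference context used by the hallucination check
    )

    # Thresholds encode the agreed pass/fail criteria for this behavior.
    metrics = [AnswerRelevancyMetric(threshold=0.7), HallucinationMetric(threshold=0.3)]
    assert_test(test_case, metrics)

In practice, a check like this would run on every model, prompt, or dataset change, which is what the regression-testing responsibility above refers to.
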
Test Automation & Framework Development
Build and maintain automated test frameworks for:
LLM APIs and services
Agentic workflows and RAG pipelines
Data ingestion and inference pipelines
Integrate LLM testing and evaluation into CI/CD pipelines, enforcing quality gates prior to production release (an illustrative gate is sketched after this list)
Partner with engineering teams to improve testability, reliability, and observability of AI systems
Perform root-cause analysis for failures related to model behavior, data quality, or orchestration logic
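
As an illustration of the CI/CD quality gate mentioned above, the sketch below fails a pipeline stage when the evaluation suite's aggregate pass rate drops below an agreed release-readiness bar. The results-file format, the 95% threshold, and the script itself are assumptions for illustration only.

"""Illustrative CI quality gate: block the release stage if the evaluation
suite's pass rate falls below the agreed release-readiness threshold."""

import json
import sys
from pathlib import Path

PASS_RATE_THRESHOLD = 0.95  # assumed release-readiness bar; set per team agreement


def main(results_path: str = "eval_results.json") -> int:
    # Expected format (assumption): [{"case": "...", "passed": true}, ...]
    results = json.loads(Path(results_path).read_text())
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / len(results) if results else 0.0

    print(f"Evaluation pass rate: {pass_rate:.1%} ({passed}/{len(results)})")
    if pass_rate < PASS_RATE_THRESHOLD:
        print(f"Quality gate FAILED: below {PASS_RATE_THRESHOLD:.0%} threshold")
        return 1  # non-zero exit code blocks the CI/CD stage
    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
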
Observability & Monitoring
Instrument LLM applications using Datadog LLM Observability (sketched after this list) to track:
Latency, token usage, errors, and cost
Quality regressions, drift, and performance anomalies
Build dashboards and alerting focused on LLM quality and reliability
Use production telemetry to continuously refine test coverage and evaluation strategies
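
The following is a minimal instrumentation sketch, assuming Datadog's LLM Observability SDK in ddtrace (LLMObs.enable, the span decorators, and LLMObs.annotate). The ml_app name, the answer_question workflow, and call_model are hypothetical; API keys and site are normally supplied through DD_* environment variables rather than in code.

from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import workflow

# Enable LLM Observability for this service; in CI/production the API key
# and site typically come from DD_API_KEY / DD_SITE environment variables.
LLMObs.enable(ml_app="enterprise-assistant")


def call_model(question: str) -> str:
    """Placeholder for the real model/provider call in the application under test."""
    return "stubbed answer"


@workflow  # records a workflow span: latency, errors, and any nested LLM/tool spans
def answer_question(question: str) -> str:
    answer = call_model(question)
    # Attach inputs and outputs to the active span so quality regressions and
    # drift can later be analyzed against production traffic.
    LLMObs.annotate(input_data=question, output_data=answer)
    return answer

Dashboards and monitors for latency, token usage, cost, and quality regressions are then built on top of these spans, per the responsibilities above.
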
Shared Services & Collaboration
Act as a consultative partner to product, platform, and data teams adopting LLM technologies
Provide guidance on:
Generative AI test strategies
Prompt engineering and workflow validation
Release readiness and AI risk assessment
Contribute to organization-wide standards and best practices for testing, explaining, and monitoring AI systems
Participate in architecture and design reviews from a quality-first perspective
Engineering Excellence
Advocate for automation-first testing, infrastructure as code, and continuous monitoring
Drive adoption of Agile, DevOps, and CI/CD best practices within AI quality engineering
Conduct code reviews and promote secure, maintainable, and scalable test frameworks
Continuously improve internal tooling and frameworks within the QA Center of Excellence
Required Skills & Experience
Strong Python development skills
Experience testing backend systems, APIs, microservices, or distributed platforms
Proven experience building and maintaining automation frameworks
Ability to work effectively with ambiguous, non-deterministic systems
AI / LLM Experience
Hands-on experience testing or validating ML- or LLM-based systems
Familiarity with LLM orchestration and evaluation tools, including:
LangChain, Langflow
DeepEval, MLflow
Strong understanding of challenges unique to testing generative AI systems
Nice to Have
Experience with Datadog, especially LLM Observability
Exposure to Hugging Face, PyTorch, or TensorFlow (usage-level)
Experience testing RAG pipelines, vector databases, or data-driven platforms
Background working in platform teams, shared services, or QA Centers of Excellence
Experience collaborating closely with Data Engineering or ML Platform teams