Data Engineer – Data Quality & Validation
Location: Dallas, TX (Hybrid – 3 Days Onsite)
Job Type: Long-Term Contract
Employment Type: W2 Only
Interview Process: In-Person Client Interview (Mandatory)
Position Overview
We are seeking an experienced Data Engineer – Data Quality & Validation to support enterprise-scale data platforms and pipelines by ensuring the accuracy, completeness, reliability, and performance of data assets across the organization. This role will focus on validating both batch and real-time data processing solutions built on Databricks, Apache Spark, Kafka, AWS, SQL, and Python.
The ideal candidate will have a strong background in data engineering, ETL/ELT validation, data quality assurance, automation, and testing of distributed data systems. The candidate will work closely with data engineers, architects, business stakeholders, and platform teams to establish robust validation frameworks and maintain high data quality standards.
Key Responsibilities
Data Quality & Validation
- Validate data pipelines to ensure accuracy, completeness, consistency, and timeliness of data.
- Perform source-to-target reconciliation across multiple systems and platforms.
- Develop and execute SQL-based data validation checks and business rule validations.
- Ensure data lineage, traceability, and auditability throughout the data lifecycle.
- Identify, investigate, and resolve data quality issues and anomalies.
- Define and monitor data quality metrics, KPIs, SLAs, and SLOs.
ETL / ELT Pipeline Validation
- Validate data ingestion, transformation, aggregation, and consumption layers.
- Test batch and real-time streaming data pipelines.
- Verify business transformation logic using SQL, PySpark, and Python.
- Validate historical data loads, backfills, and reprocessing activities.
- Conduct end-to-end testing of data movement across enterprise systems.
- Ensure data consistency across upstream and downstream platforms.
Databricks & Apache Spark Testing
- Validate data processing workflows running on Databricks.
- Test Spark-based workloads developed using PySpark and Spark SQL.
- Verify large-scale data transformations, aggregations, and calculations.
- Support testing and validation of distributed processing environments.
- Analyze Spark execution behavior and data processing outcomes.
Kafka & Streaming Data Validation
- Validate Kafka-based streaming architectures and data pipelines.
- Test producer and consumer workflows across distributed systems.
- Verify message ordering, delivery guarantees, and data integrity.
- Validate schema evolution, retention policies, partitions, and offset management.
- Test serialization formats including Avro, JSON, and Protobuf.
- Simulate and validate duplicate records, late-arriving events, and failure scenarios.
- Ensure resiliency and reliability of event-driven processing pipelines.
Automation & Test Framework Development
- Design and develop Python-based automation frameworks for data validation.
- Build reusable testing utilities and validation components.
- Create synthetic datasets and test scenarios to support validation efforts.
- Integrate automated testing into CI/CD pipelines.
- Develop automated monitoring and alerting solutions for data quality issues.
- Improve testing efficiency through automation and reusable frameworks.
Performance, Reliability & Observability
- Validate throughput, scalability, latency, concurrency, and overall system performance.
- Test retry mechanisms, recovery processes, and idempotent workflows.
- Conduct regression, failover, resilience, and performance testing.
- Validate monitoring, logging, metrics, and observability solutions.
- Support incident investigations, root cause analysis, and remediation efforts.
- Ensure compliance with operational and data governance standards.
Required Qualifications
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
- 7+ years of experience in Data Engineering, Data Quality Engineering, QA Engineering, SDET, or related disciplines.
- 4+ years of hands-on experience with enterprise data platforms and large-scale data pipelines.
- 3+ years of hands-on experience with Databricks and Apache Spark.
- Strong SQL expertise for data validation, reconciliation, profiling, and analysis.
- Strong Python programming skills for automation and data validation frameworks.
- Experience testing ETL/ELT pipelines in both batch and streaming environments.
- Hands-on experience with Kafka or similar event-streaming platforms.
- Experience working with AWS data services, including:
  - Amazon S3
  - AWS Glue
  - AWS Lambda
  - Amazon EMR
  - Amazon Redshift
  - Amazon Athena
- Experience working with distributed data processing systems and cloud-based data platforms.
- Strong analytical, troubleshooting, and problem-solving abilities.
- Excellent verbal and written communication skills.
- Ability to collaborate effectively with cross-functional teams.
Preferred Qualifications
- Experience with data quality and observability tools such as:
  - Great Expectations
  - Monte Carlo
  - Similar data quality platforms
- Knowledge of schema registries, metadata management, and data contracts.
- Experience integrating automated testing into CI/CD pipelines using:
  - GitHub Actions
  - Jenkins
  - Similar DevOps platforms
- Experience supporting modern cloud-native data engineering ecosystems.
- Understanding of Data Lakehouse architectures and distributed computing frameworks.
- Familiarity with data governance, lineage, and compliance best practices.
- Experience with Agile/Scrum delivery methodologies.