Job Title: Data Engineer - Cloud Data Integration & Transformation
Location: Remote
Duration: 6+ Months
Description:
We are seeking a hands-on Data Engineer to develop and maintain scalable data pipelines and transformation routines within a modern Azure + Databricks environment. The role focuses on the ingestion, cleansing, standardization, matching, merging, and enrichment of complex legacy datasets into a governed data lakehouse architecture. The ideal candidate brings deep experience with Spark (PySpark), Delta Lake, Azure Data Factory, and data wrangling techniques, and is comfortable working in a structured, code-managed, team-based delivery environment.
Key Responsibilities
Pipeline Development & Maintenance
- Build and maintain reusable data pipelines using Databricks, PySpark, and SQL.
- Implement full and incremental loads from sources including VSAM, Db2 (LUW and z/OS), SQL Server, and flat files.
- Use Delta Lake on ADLS Gen2 to support ACID transactions, scalable upserts/merges, and time travel.
- Leverage Azure Data Factory for orchestration and triggering of Delta Live Tables and Databricks Jobs as part of nightly pipeline execution.
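By way of illustration, the scalable upserts/merges above are typically expressed as a Delta Lake MERGE. The sketch below composes such a statement in Python; all table and column names (`silver.customers`, `customer_id`, etc.) are hypothetical, and in Databricks the resulting string would be passed to `spark.sql(...)`:

```python
def build_delta_merge_sql(target, source, keys, cols):
    """Compose a Delta Lake MERGE statement for an incremental upsert.

    `target`/`source` are table names, `keys` the join columns, and
    `cols` the non-key columns to update; all names are illustrative.
    """
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    insert_cols = ", ".join(keys + cols)
    insert_vals = ", ".join(f"s.{c}" for c in keys + cols)
    return (
        f"MERGE INTO {target} t USING {source} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

# Example: upsert staged rows into a silver-layer table (names hypothetical)
sql = build_delta_merge_sql(
    target="silver.customers",
    source="staging.customers_delta",
    keys=["customer_id"],
    cols=["name", "email"],
)
```

An equivalent programmatic route is the `DeltaTable.merge` API; the SQL form is shown here only because it is the shortest self-contained sketch.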
Data Cleansing & Transformation
- Apply cleansing logic for deduplication, parsing, standardization, and enrichment based on business rule definitions.
- Use the Spark-Cobol library to parse EBCDIC/COBOL-formatted VSAM files into structured DataFrames.
- Maintain 'bronze → silver → gold' structured layers and ensure quality during data transformations.
- Support classification and mapping logic in collaboration with analysts and architects.
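In practice the copybook-driven parsing above is handled by the Spark-Cobol library, but the character conversion at the heart of it can be illustrated with Python's built-in EBCDIC codec. Code page 037 is an assumption here (mainframe shops vary), and real VSAM extracts also carry packed-decimal fields that a text codec alone cannot decode:

```python
# A raw EBCDIC byte run as it might appear in a VSAM extract
# (code page 037 assumed for illustration).
raw = b"\xc8\x85\x93\x93\x96"

# Python ships EBCDIC codecs, so plain-text fields decode directly.
text = raw.decode("cp037")
print(text)  # -> Hello
```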
Observability, Testing & Validation
- Integrate robust logging and exception handling to enable observability and pipeline traceability.
- Monitor job performance and cost with Azure Monitor and Log Analytics.
- Support validation and testing using frameworks like Great Expectations or dbt tests to enforce expectations on nulls, ranges, and referential integrity.
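The null and range expectations mentioned above can be sketched without either framework. The function below is a minimal stand-in, not the Great Expectations or dbt API; all column names and bounds are illustrative:

```python
def validate_rows(rows, required, ranges):
    """Collect data-quality failures: nulls in required columns and
    out-of-range numeric values.

    `rows` is a list of dicts, `required` a list of column names, and
    `ranges` maps column name -> (lo, hi) inclusive bounds.
    """
    failures = []
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                failures.append((i, col, "null"))
        for col, (lo, hi) in ranges.items():
            val = row.get(col)
            if val is not None and not (lo <= val <= hi):
                failures.append((i, col, "out_of_range"))
    return failures

rows = [
    {"id": 1, "age": 34},
    {"id": None, "age": 210},  # violates both expectations
]
issues = validate_rows(rows, required=["id"], ranges={"age": (0, 130)})
```

A real deployment would register equivalent expectations in Great Expectations or as dbt tests so failures surface in the orchestration layer rather than in ad-hoc code.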
Security, DevOps & Deployment
- Store and manage credentials securely using Azure Key Vault during pipeline execution.
- Maintain pipeline code using Azure DevOps Repos and participate in peer reviews and promotion workflows via Azure DevOps Pipelines.
- Deploy notebooks, configurations, and transformations using CI/CD best practices in repeatable environments.
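As a rough sketch, the promotion workflow described above might be wired up in Azure DevOps Pipelines along the following lines. Paths, stage names, and the variable group are placeholders, and real deployments often use Databricks Asset Bundles instead of the legacy CLI:

```yaml
# azure-pipelines.yml - illustrative only
trigger:
  branches:
    include: [main]

variables:
  - group: databricks-secrets   # hypothetical variable group backed by Key Vault

stages:
  - stage: Deploy
    jobs:
      - job: push_notebooks
        pool:
          vmImage: ubuntu-latest
        steps:
          - script: |
              pip install databricks-cli
              databricks workspace import_dir notebooks /Shared/pipelines --overwrite
            env:
              DATABRICKS_HOST: $(databricksHost)
              DATABRICKS_TOKEN: $(databricksToken)
            displayName: Deploy notebooks to workspace
```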
Collaboration & Profiling
- Collaborate with architects to ensure alignment with data platform standards and governance models.
- Work with analysts and SMEs to profile data, refine cleansing logic, and conduct variance analysis using Databricks Notebooks and Databricks SQL Warehouse.
- Support metric publication and lineage registration using Microsoft Purview and Unity Catalog, and contribute to profiling datasets for Power BI consumption.
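Profiling of the kind described is usually a notebook query; the helper below shows the same idea in plain Python for brevity (in Databricks the equivalent summary would come from DataFrame aggregations, and all names here are illustrative):

```python
def profile_column(rows, col):
    """Summarize one column of a list-of-dicts dataset: row count,
    null count, and count of distinct non-null values."""
    values = [r.get(col) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }

# Hypothetical sample records
sample = [{"state": "WA"}, {"state": "OR"}, {"state": None}, {"state": "WA"}]
stats = profile_column(sample, "state")
```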
Required Skills & Experience
5+ years of experience in data engineering or ETL development roles.
Proficiency in:
- Databricks, PySpark, SQL
- Delta Lake and Azure Data Lake Storage Gen2
- Azure Data Factory for orchestration and event-driven workflows
Experience with:
- Cleansing, deduplication, parsing, and merging of high-volume datasets
- Parsing EBCDIC/COBOL-formatted VSAM files using the Spark-Cobol library
- Connecting to Db2 databases using JDBC drivers for ingestion
Familiarity with:
- Git, Azure DevOps Repos & Pipelines
- Great Expectations or dbt for validation
- Azure Monitor + Log Analytics for job tracking and alerting
- Azure Key Vault for secrets and credentials
- Microsoft Purview and Unity Catalog for metadata and lineage registration