Full Stack Databricks Developer
Onsite Locations: Dallas, TX and Middletown, NJ
Duration: Long term
Skills & Qualifications
1. Technical Core (Databricks & Spark)
Expert PySpark/Scala: Deep understanding of Spark internals, broadcast joins, and RDD/DataFrame partitioning.
Delta Lake Mastery: Proficiency in Delta features like Z-Ordering, Liquid Clustering, Change Data Feed (CDF), and Time Travel.
Streaming Patterns: Hands-on experience with watermarking, checkpointing, and handling late-arriving data in Structured Streaming.
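For reference, a minimal PySpark sketch of the watermarking pattern above; the Kafka broker, topic, and target table names are hypothetical placeholders:

```python
# Minimal sketch (hypothetical broker, topic, and table names): windowed counts
# over a Kafka stream with a 10-minute watermark to bound late-arriving data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")                       # placeholder topic
    .load()
)

counts = (
    events
    .withWatermark("timestamp", "10 minutes")            # tolerate events up to 10 min late
    .groupBy(F.window("timestamp", "5 minutes"))
    .count()
)

(counts.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/event_counts")  # enables exactly-once recovery
    .outputMode("append")
    .toTable("bronze.event_counts"))                     # placeholder Delta table
```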
2. Data Modeling & Languages
SQL: Expert-level SQL for complex transformations and window functions.
JSON/Semi-Structured Data: Mastery of parsing and generating complex nested JSON objects within Spark (e.g., struct, array, to_json, from_json); see the sketch at the end of this subsection.
Medallion Design: Proven experience moving data across Bronze, Silver, and Gold layers with clear "Data Contracts."
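For illustration, a short PySpark sketch of nested-JSON handling with from_json, explode, struct, and to_json; the order/line-item schema is invented for this example:

```python
# Hypothetical order/line-item schema, invented to illustrate nested-JSON handling.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType()),
        StructField("price", DoubleType()),
    ]))),
])

raw = spark.createDataFrame(
    [('{"order_id": "A1", "items": [{"sku": "X", "price": 9.5}]}',)],
    ["json_str"],
)

# Parse the JSON string into typed columns, then flatten the item array.
flat = (
    raw.withColumn("payload", F.from_json("json_str", schema))
       .select("payload.order_id", F.explode("payload.items").alias("item"))
       .select("order_id", "item.sku", "item.price")
)

# Re-nest and serialize back to JSON for a consumption-ready Gold column.
gold = flat.groupBy("order_id").agg(
    F.to_json(F.collect_list(F.struct("sku", "price"))).alias("items_json")
)
```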
3. Full Stack & DevOps
CI/CD: Experience automating data pipeline deployments (Git-based workflows).
Observability: Ability to set up monitoring and alerts using Databricks SQL Alerts or Grafana to track pipeline lag.
4. Soft Skills
Architectural Thinking: Ability to decide between "Continuous" and "AvailableNow" streaming triggers based on cost versus latency requirements (see the sketch at the end of this subsection).
Client Focus: Understanding how an API client (e.g., a React app or a microservice) will consume the Gold layer JSON.
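To make that cost-versus-latency trade-off concrete, a hedged PySpark sketch; table names and checkpoint paths are placeholders. availableNow drains the current backlog and then stops (cheaper, batch-like), while an always-on processingTime trigger, which is how a "continuous" pipeline typically runs on Databricks, keeps latency low at higher cluster cost:

```python
# Same stream, two trigger strategies; pick one. Table names and checkpoint
# paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.table("silver.orders")   # hypothetical Silver source

# Cost-optimized: drain the current backlog, then stop (suits scheduled jobs).
(stream.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/gold_orders_triggered")
    .trigger(availableNow=True)
    .toTable("gold.orders"))

# Latency-optimized: always-on micro-batches every 30 seconds on a running cluster.
(stream.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/gold_orders_always_on")
    .trigger(processingTime="30 seconds")
    .toTable("gold.orders"))
```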
Job Title: Data Engineer (Streaming & Full Stack Databricks)
Role Summary
We are seeking a high-performing Data Engineer to design and implement a real-time data platform using the Medallion Architecture.
You will be responsible for end-to-end development of data pipelines: ingesting real-time source data into the Bronze layer, transforming it into a relational Silver layer, and delivering high-concurrency, consumption-ready JSON Gold tables.
You will act as a "Full Stack" data professional, handling everything from infrastructure automation (DataOps) to complex nested data modeling.
Key Responsibilities
Real-Time Ingestion: Build scalable ingestion pipelines using Auto Loader and Spark Structured Streaming to capture data from Kafka, Event Hubs, or CDC sources into raw Bronze Delta tables (see the sketch at the end of this list).
Relational Transformation: Develop ELT logic to cleanse, deduplicate, and normalize data into a relational format. Ensure ACID compliance and "exactly-once" processing semantics.
JSON API Optimization: Design and build the layer specifically for client consumption. This involves flattening/nesting data into optimized JSON structures within Delta tables to support low-latency API queries.
Advanced Orchestration: Implement and manage complex workflows using Delta Live Tables (DLT) or standard Structured Streaming jobs, orchestrated with Databricks Workflows, to ensure data freshness and lineage.
Governance & Security: Use Unity Catalog to enforce fine-grained access control (row/column level) and maintain a searchable data catalog for consuming clients.
DataOps & Automation: Own the deployment lifecycle using Databricks Asset Bundles (DABs) and CI/CD pipelines (GitHub Actions/Azure DevOps) to ensure reproducible environments.
Performance Tuning: Optimize streaming triggers, watermarking, and stateful processing to minimize latency and manage cloud costs effectively.
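For reference, a minimal Auto Loader sketch of the Bronze ingestion pattern described under Real-Time Ingestion; all paths, schema locations, and table names are placeholders:

```python
# Hypothetical Auto Loader ingestion into a raw Bronze Delta table; paths,
# schema location, and table name are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bronze_stream = (
    spark.readStream
    .format("cloudFiles")                                  # Databricks Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/raw/_schemas/orders")
    .load("/mnt/raw/orders/")
    .withColumn("_ingested_at", F.current_timestamp())     # ingestion audit column
)

(bronze_stream.writeStream
    .option("checkpointLocation", "/mnt/raw/_checkpoints/orders")
    .outputMode("append")
    .toTable("bronze.orders_raw"))
```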