Data Engineer (Databricks)

Overview

Remote
Full Time

Skills

Data Integration
Data Analysis
Decision-making
ELT
Database
Caching
Data Architecture
Modeling
Data Marts
Warehouse
Apache Parquet
Unity Catalog
Performance Tuning
Optimization
Spark UI
Root Cause Analysis
Data Quality
Regulatory Compliance
Auditing
Job Scheduling
Cloud Storage
Python
PySpark
Scala
Data Processing
Writing
R
Java
Data Modeling
Star Schema
Relational Databases
Data Storage
MySQL
PostgreSQL
NoSQL
Message Queues
Data Warehouse
Snowflake Schema
Amazon Redshift
IaaS
Google Cloud Platform
Google Cloud
Amazon Web Services
Amazon S3
Storage
Microsoft Azure
Computer Networking
Virtual Private Cloud
Orchestration
Apache NiFi
Scheduling
Workflow
Big Data
Apache Kafka
Streaming
Real-time
Machine Learning (ML)
Management
SQL
Analytics
Business Intelligence
Tableau
Microsoft Power BI
Software Engineering
Apache Spark
Soft Skills
Problem Solving
Conflict Resolution
Debugging
Analytical Skill
Communication
Reporting
Extract, Transform, Load (ETL)
Collaboration
Mentorship
Teaching
SQL Tuning
Data Engineering
Cloud Computing
Artificial Intelligence
Adaptability
Health Care
Finance
Databricks
Open Source
Data Governance
Dashboard
Dependability

Job Details

Role: Data Engineer (Databricks)
Location: Remote
Role Overview: As a Data Engineer specializing in Databricks and big data pipelines, you will help our clients build and optimize the backbone of their data analytics and AI initiatives. You will become the trusted data advisor within the client's technology team - designing data architectures, implementing data ingestion and transformation workflows, and ensuring the client can derive timely, accurate insights from their data. You will likely work with a modern tech stack centered on Apache Spark (via Databricks), cloud data lakes/warehouses, and various data integration tools. In this role, you straddle the worlds of software engineering and data analysis: not only must pipelines be reliable and efficient, but they should also deliver data that is valuable for business decision-making. Although the role is remote, you work as an embedded member of the client's team, collaborating closely with business analysts, data scientists, and IT personnel to align data infrastructure with the client's goals.
Key Responsibilities:
  • Data Pipeline Development: Design and implement scalable ETL/ELT pipelines on the Databricks platform. This includes ingesting data from diverse sources (databases, APIs, streaming platforms), transforming and aggregating it using Spark (PySpark or Scala), and storing the results in appropriate data stores (Delta Lake, data warehouses like Snowflake or Redshift, etc.). You will write data processing jobs that handle large volumes of data efficiently, optimizing for performance (e.g., using partitioning, caching, and proper join strategies in Spark). These pipelines might be scheduled via Databricks Jobs or integrated into the client's workflow orchestrator (like Airflow or Azure Data Factory). A minimal PySpark sketch of such a pipeline appears after this list.
  • Data Architecture & Modeling: On-site with the client, you will often be involved in architecting the data platform. This means working out the best way to structure data lakes, data warehouses, and data marts to serve different needs (analytics, reporting, machine learning). You'll create data models - for example, designing fact and dimension tables for a warehouse or deciding how to organize a bronze-silver-gold layer structure in a Delta Lake. You ensure that data is stored with appropriate formats (parquet, Delta) and partitioning to balance query performance with storage costs. As a Databricks expert, you might also guide the setup of Unity Catalog or other governance tools to manage data assets and permissions.
  • Collaboration with Data Consumers: You interface with the people who need the data - data scientists building models, analysts writing reports, or sometimes application teams that need processed data. Working closely with them means you can gather direct feedback: Are the datasets you produce meeting their needs? Is there additional data or granularity required? If a data scientist is trying to train an ML model, you might assist by providing a feature engineering pipeline or optimizing a particularly heavy query for them. You act as a liaison between raw data and actionable insight, ensuring the pipelines deliver high-quality data. You might even pair with analysts to write complex SQL queries or with data scientists to productionize a prototype model into a data pipeline.
  • Performance Tuning & Troubleshooting: A critical part of your job is making sure data processing is fast and reliable. You continuously monitor pipeline performance and troubleshoot issues. For instance, if a Spark job is running too slow or failing, you investigate - maybe there's data skew, maybe more memory is needed, or a join needs optimization. You're skilled in reading Spark UI or logs to pinpoint bottlenecks. If a pipeline fails at 2 AM, you might not be working those hours regularly, but you will perform root cause analysis the next day and implement fixes or more robust error handling. Onsite presence can be useful if, say, an urgent data issue arises during the day - stakeholders can directly alert you and you can quickly dive in to resolve it, preventing delays in downstream processes.
  • Data Governance and Quality: You implement measures to ensure data quality and security. This can involve building validation checks into pipelines (so that bad data is flagged or quarantined), creating audit logs of data lineage, and handling PII data appropriately (masking or encrypting sensitive fields using tools like AWS KMS or Azure Key Vault integration with Databricks). Additionally, you'll enforce best practices such as schema management (e.g., using Delta Lake's schema enforcement features) to prevent data corruption. If the client has compliance requirements, you help design the data platform to meet them (for example, GDPR-related data deletion processes). You might also coordinate with the client's data governance or security teams to pass audits or reviews, explaining how the pipelines and data stores comply with policies. A short validation-and-quarantine sketch appears after this list.
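
Illustrative sketch (pipeline development and layering): the following is a minimal, hypothetical PySpark example of a bronze-to-silver step of the kind described above. The paths, table names, and columns (raw order data, order_id, order_ts, amount) are placeholders, not details of any actual engagement.

    from pyspark.sql import SparkSession, functions as F

    # On Databricks, `spark` already exists; getOrCreate() is a no-op there.
    spark = SparkSession.builder.getOrCreate()

    # Bronze: ingest raw JSON files as-is (path is a placeholder).
    bronze = spark.read.format("json").load("s3://example-bucket/raw/orders/")

    # Silver: deduplicate, derive a partition column, and drop invalid rows.
    silver = (bronze
              .dropDuplicates(["order_id"])
              .withColumn("order_date", F.to_date("order_ts"))
              .filter(F.col("amount").isNotNull()))

    # Persist as a partitioned Delta table to balance query speed and storage cost.
    (silver.write.format("delta")
           .mode("overwrite")
           .partitionBy("order_date")
           .saveAsTable("silver.orders"))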
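
Illustrative sketch (performance tuning): one common fix for a slow or skewed join is to broadcast the small side instead of shuffling both tables. A hedged example using the same hypothetical tables:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    orders = spark.table("silver.orders")        # large fact table (placeholder)
    customers = spark.table("silver.customers")  # small dimension table (placeholder)

    # Broadcasting the small side avoids a full shuffle, a frequent cause of
    # slow joins that shows up as long shuffle stages in the Spark UI.
    joined = orders.join(broadcast(customers), on="customer_id", how="left")

    # When chasing skew, checking how rows spread across partitions is a first step.
    print("partitions:", joined.rdd.getNumPartitions())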
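
Illustrative sketch (data quality): a simple validation-and-quarantine pattern; the rules and table names are chosen purely for illustration.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("silver.orders")  # placeholder input table

    # Basic quality rules; real pipelines would load these from config or a rules table.
    is_valid = F.col("order_id").isNotNull() & (F.col("amount") >= 0)

    valid = df.filter(is_valid)
    quarantined = df.filter(~is_valid).withColumn("rejected_at", F.current_timestamp())

    # Good records flow downstream; bad records are kept for auditing and reprocessing.
    valid.write.format("delta").mode("append").saveAsTable("gold.orders")
    quarantined.write.format("delta").mode("append").saveAsTable("quarantine.orders")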
Technical Skills & Experience:
  • Apache Spark & Databricks: Deep proficiency in Apache Spark is mandatory, as it's the core of Databricks. You should be fluent in writing Spark jobs (in PySpark or Scala) and understand Spark's execution model (shuffles, partitions, memory usage) to optimize jobs. Experience with Delta Lake is important, as it provides ACID transactions on data lakes - you know how to implement merges (upserts), use time-travel, and handle schema evolution with Delta. Familiarity with the Databricks workspace - notebooks, clusters, job scheduling, and integration with cloud storage (S3, ADLS) - is expected. If you have Databricks certifications or have used advanced features like Delta Live Tables or MLflow, that's a strong plus.
  • Programming & SQL: Strong programming skills in Python (for PySpark) and/or Scala. You can build modular code for data processing, and possibly create utility libraries to be reused across pipelines. SQL proficiency is also crucial, since a lot of data transformations and analyses involve writing complex SQL queries. You know how to optimize SQL (using proper filtering, joins, window functions, etc.) to ensure it runs efficiently on large datasets. Knowledge of other languages (like R or Java) is less important, but understanding how to use SQL within Spark and via BI tools is key.
  • Data Modeling & Storage: Experience designing data models for analytics - e.g., star schema design for warehouses, or designing efficient file schemas for lake. You are familiar with concepts of normalized vs denormalized data, partitioning strategies, and indexing (if using tools like AWS Athena or relational databases in the mix). Knowledge of other data storage technologies is valuable: e.g., relational DBs (MySQL, Postgres), NoSQL stores, message queues (Kafka) for streaming ingestion, and possibly data warehouse platforms (Snowflake, BigQuery, Redshift) as they often interplay with Databricks solutions.
  • Cloud Platforms & Ecosystem: Since Databricks runs on cloud infrastructure (AWS, Azure, Google Cloud Platform), you should be knowledgeable about the respective cloud's data services. For instance, AWS: S3 for storage, Glue Catalog, IAM roles for access; Azure: ADLS, Azure Data Factory, etc. Understanding networking (VPC, peering), security (IAM/service principals, key management) as it pertains to data pipelines is important for implementation. Familiarity with workflow orchestration tools (Airflow, NiFi, Data Factory) and scheduling is expected as you often integrate your pipelines into a larger data workflow.
  • Big Data & ML Ecosystem: You keep abreast of the broader data engineering ecosystem. Experience with stream processing (Spark Structured Streaming, Kafka streams) could be needed if the client has real-time data needs. Knowledge of ML pipelines (if working with data scientists) is useful - e.g., using Databricks MLflow to manage model lifecycle or integrating with AI frameworks. Additionally, tools like SQL analytics/BI (Tableau, PowerBI) - while not your main tool, understanding how they consume data helps you deliver data in the right format for them.
  • Experience: Generally 5+ years in data engineering or software engineering with significant data focus. Should have designed and implemented data pipelines in a production environment. Prior experience with Databricks or Apache Spark for at least 2-3 years is expected due to the specialization. Consulting experience or large enterprise project experience is beneficial, demonstrating that you can work on complex, multi-stakeholder data projects and adapt to client environments.
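
Illustrative sketch (Delta Lake merge/upsert): the kind of merge referenced above, using the Delta Lake Python API; the table names are hypothetical.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    updates = spark.table("staging.orders_updates")      # placeholder staging table
    target = DeltaTable.forName(spark, "silver.orders")  # placeholder Delta table

    # Upsert: update rows that match on the key, insert the rest (ACID on the lake).
    (target.alias("t")
           .merge(updates.alias("s"), "t.order_id = s.order_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())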
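
Illustrative sketch (SQL optimization): a window-function query run through Spark SQL, filtering early so the window operates on as little data as possible; the schema and date are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Keep only the latest order per customer; the WHERE clause limits the data
    # (and can prune partitions) before the window function runs.
    latest = spark.sql("""
        SELECT * FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY customer_id
                                      ORDER BY order_ts DESC) AS rn
            FROM silver.orders
            WHERE order_date >= DATE'2024-01-01'
        ) AS ranked
        WHERE rn = 1
    """)
    latest.show(5)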
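
Illustrative sketch (streaming): reading a Kafka topic with Structured Streaming and appending to a Delta table; the broker address, topic, checkpoint path, and table name are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Source: a Kafka topic.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "orders")
              .load()
              .selectExpr("CAST(value AS STRING) AS payload"))

    # Sink: append to a Delta table; the checkpoint enables exactly-once delivery.
    query = (events.writeStream.format("delta")
             .option("checkpointLocation", "/tmp/checkpoints/order_events")
             .outputMode("append")
             .toTable("bronze.order_events"))

    query.awaitTermination()  # keep the stream running outside a notebook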
Soft Skills & Competencies:
  • Analytical Mindset & Problem-Solving: You love data and it shows. You approach problems logically - when a pipeline produces incorrect data, you systematically debug where the anomaly arose. You can dive into data to spot patterns or issues (e.g., noticing that one data source is delayed or that a particular record always causes errors). This analytical strength also means you can think about edge cases (like what happens if data for a day is late or a schema changes unexpectedly) and design defensively.
  • Communication & Storytelling with Data: As a data engineer on-site, you not only crunch data but also communicate about data. You can explain to a non-technical stakeholder why a report might be delayed ("the data pipeline encountered X issue and we're fixing it"), or advise them on how to better formulate a data request. You can also translate business questions into data requirements and vice versa - effectively acting as a bridge between the business perspective and technical implementation. When needed, you can present findings or status clearly, perhaps using simple charts or summaries, not just technical jargon.
  • Collaboration & Mentorship: You'll often work alongside the client's own data/IT team. You collaborate rather than work in a silo - perhaps coding together with a client's junior data engineer to upskill them, or coordinating with their DBAs on how to optimize a query. If the client team lacks some skills, you tactfully fill the gaps and possibly mentor them (e.g., teaching a SQL optimization trick or how to use a Databricks notebook efficiently). Your goal is to leave the client stronger in data engineering than you found them. Within our own company, you also liaise with other PSG colleagues (maybe a cloud engineer or AI engineer on the same project) to ensure solutions are aligned.
  • Adaptability & Learning: Every client's data landscape is different - different source systems, different quality of data, different requirements. You need to adapt quickly. One week you might need to quickly learn the basics of an unfamiliar data source or a domain (like healthcare data standards, or financial transaction formats) to do your job well. You embrace continuous learning - whether it's a new feature in Databricks, a new open-source tool, or a new approach to data governance. If something new is needed to solve a problem, you show initiative in picking it up.
  • Reliability & Ownership: Data pipelines are often mission-critical - a lot can depend on that morning dashboard being up-to-date. You take that responsibility seriously. You design with reliability in mind (e.g., alerting on failures, retry mechanisms) so issues are rare. If a problem does occur, you don't point fingers; you roll up your sleeves to fix it and communicate proactively. Being on-site, you might be seen as the owner of the data platform by the client team - and you embrace that role by being dependable. Stakeholders learn that they can trust you to deliver accurate data on time, and to be transparent about any risks or delays. This trust is fundamental, especially when making data-driven decisions depends on your work.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

About Spark Tek Inc