Responsibilities
- Design, develop, and maintain ETL/ELT pipelines using PySpark on Databricks (a minimal sketch follows this list)
- Build and optimize batch and streaming data pipelines
- Implement Delta Lake solutions (Delta tables, time travel, ACID transactions)
- Collaborate with data scientists, analysts, and architects to deliver analytics-ready datasets
- Optimize Spark jobs for performance, scalability, and cost
- Integrate data from multiple sources (RDBMS, APIs, files, cloud storage)
- Implement data quality checks, validation, and monitoring
- Manage Databricks notebooks, jobs, clusters, and workflows
- Follow data governance, security, and compliance standards
- Participate in code reviews and contribute to best practices
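A minimal sketch of the kind of batch pipeline the first three bullets describe: read raw files, apply basic cleaning and data-quality steps, and write an analytics-ready Delta table. The source path, column names, and target table are hypothetical placeholders, not part of this role's actual environment.

```python
# Minimal sketch: batch ETL from raw CSV on S3 into a Delta table on Databricks.
# All paths, columns, and table names below are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks notebooks

raw = (
    spark.read
    .option("header", "true")
    .csv("s3://example-bucket/raw/orders/")          # hypothetical source path
)

cleaned = (
    raw
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("order_date", F.to_date("order_ts"))
    .withColumn("amount", F.col("amount").cast("double"))
    .dropDuplicates(["order_id"])                    # basic data-quality step
    .filter(F.col("amount").isNotNull())             # drop rows failing validation
)

(
    cleaned.write
    .format("delta")                                 # Delta Lake provides ACID writes and time travel
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("analytics.orders_cleaned")         # hypothetical target table
)
```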
Qualifications
Technical Skills
- PySpark: Hands-on experience with DataFrames, RDDs, joins, transformations, and actions
- Databricks: Job optimization, cluster configuration, repartitioning, and shuffle mechanics
- AWS: S3 buckets, IAM, CloudWatch, and integration with Databricks
- SQL: Strong query-writing skills for analytics and ETL
- Performance tuning: Partitioning, caching, broadcast joins, and skew handling (see the sketch after this list)
- Familiarity with Delta Lake, the Medallion Architecture, Spark Streaming, Spark ML, and CI/CD pipelines
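A minimal sketch of two of the tuning techniques listed above, broadcast joins and explicit repartitioning with caching. The table names, partition count, and columns are hypothetical; appropriate values depend on data volume and cluster size.

```python
# Minimal sketch: broadcast join plus explicit repartitioning and caching.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.table("analytics.orders_cleaned")    # large fact table (hypothetical)
countries = spark.table("reference.countries")      # small dimension table (hypothetical)

# Broadcast join: ship the small table to every executor and avoid shuffling the large one.
enriched = orders.join(broadcast(countries), on="country_code", how="left")

# Set an explicit partition count on the grouping key, and cache because the
# result is reused by more than one action below.
by_country = enriched.repartition(200, "country_code").cache()

daily = (
    by_country.groupBy("country_code", "order_date")
    .agg(F.sum("amount").alias("revenue"))
)
daily.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_revenue")

print(by_country.count())   # second action reusing the cached DataFrame
```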
Data & Platform Knowledge
- ETL/ELT design patterns
- Handling large-scale structured and semi-structured data (see the sketch after this list)
- Understanding of data warehousing concepts
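A minimal sketch of the semi-structured-data point above: flattening nested JSON events into tabular columns before loading a Delta table. The source path, schema, field names, and target table are hypothetical.

```python
# Minimal sketch: flatten nested JSON into an analytics-ready Delta table.
# Paths, fields, and table names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.json("s3://example-bucket/raw/events/")   # hypothetical nested JSON source

flat = events.select(
    F.col("event_id"),
    F.col("user.id").alias("user_id"),               # nested struct field
    F.col("user.country").alias("country_code"),
    F.to_date("event_ts").alias("event_date"),
    F.explode_outer("items").alias("item"),          # one output row per array element
).select(
    "event_id", "user_id", "country_code", "event_date",
    F.col("item.sku").alias("sku"),
    F.col("item.price").cast("double").alias("price"),
)

(
    flat.write
    .format("delta")
    .mode("append")
    .partitionBy("event_date")
    .saveAsTable("silver.events_flat")               # hypothetical silver-layer table
)
```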