Job Details
Responsibilities:
Develop, optimize, and maintain ETL/ELT pipelines using PySpark and SQL.
Work with structured and unstructured data to build scalable data solutions.
Write efficient and scalable PySpark scripts for data transformation and processing (a minimal sketch of such a script appears after this list).
Optimize SQL queries, stored procedures, and indexing strategies to enhance performance.
Design and implement data models, schemas, and partitioning strategies for large-scale datasets.
Collaborate with Data Scientists, Analysts, and other Engineers to integrate data workflows.
Ensure data quality, validation, and consistency in data pipelines.
Implement error handling, logging, and monitoring for data pipelines.
Work with cloud platforms (AWS, Azure, or Google Cloud Platform) for data processing and storage.
Optimize data pipelines for cost efficiency and performance.
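
To illustrate the kind of pipeline work described above, here is a minimal PySpark sketch of a transformation job with basic validation, logging, and error handling. The paths, columns, and output table ("s3://example-bucket/...", "orders", "daily_order_totals") are illustrative placeholders, not details of an actual system.

    # Minimal PySpark ETL sketch; all names and paths are hypothetical examples.
    import logging

    from pyspark.sql import SparkSession, functions as F

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("orders_pipeline")

    spark = SparkSession.builder.appName("orders_pipeline").getOrCreate()

    try:
        # Extract: read raw order events (schema inferred here for brevity).
        raw = spark.read.json("s3://example-bucket/raw/orders/")

        # Transform: basic validation plus a daily aggregate.
        valid = (
            raw.filter(F.col("order_id").isNotNull())
               .filter(F.col("amount") > 0)
               .withColumn("order_date", F.to_date("order_ts"))
        )
        daily_totals = valid.groupBy("order_date").agg(
            F.count("order_id").alias("order_count"),
            F.sum("amount").alias("total_amount"),
        )

        # Load: write partitioned Parquet for downstream consumers.
        (daily_totals.write
            .mode("overwrite")
            .partitionBy("order_date")
            .parquet("s3://example-bucket/curated/daily_order_totals/"))

        log.info("Pipeline finished: %d daily rows written", daily_totals.count())
    except Exception:
        log.exception("Pipeline failed")
        raise
    finally:
        spark.stop()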
Technical Skills Required:
Strong experience in Python for data engineering tasks.
Proficiency in PySpark for large-scale data processing.
Deep understanding of SQL (joins, window functions, CTEs, query optimization); a short Spark SQL sketch illustrating CTEs and window functions appears after this list.
Experience in ETL/ELT development using Spark and SQL.
Experience with cloud data services (AWS Glue, Databricks, Azure Synapse, Google BigQuery).
Familiarity with orchestration tools such as Apache Airflow and Apache Oozie; a minimal Airflow DAG sketch also appears after this list.
Experience with data warehousing (Snowflake, Redshift, BigQuery).
Understanding of performance tuning in PySpark and SQL.
Familiarity with version control (Git) and CI/CD pipelines.
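
As referenced in the SQL skills item above, here is a small Spark SQL sketch combining a CTE with a window function. The "orders" view and its columns are hypothetical examples, not part of any specific schema.

    # CTE + window function example run through Spark SQL; data is made up inline.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql_example").getOrCreate()

    orders = spark.createDataFrame(
        [(1, "alice", "2024-01-01", 120.0),
         (2, "alice", "2024-01-03", 80.0),
         (3, "bob",   "2024-01-02", 200.0)],
        ["order_id", "customer", "order_date", "amount"],
    )
    orders.createOrReplaceTempView("orders")

    # The CTE computes per-customer totals; the window function ranks each
    # customer's orders by amount.
    ranked = spark.sql("""
        WITH customer_totals AS (
            SELECT customer, SUM(amount) AS total_spend
            FROM orders
            GROUP BY customer
        )
        SELECT o.order_id,
               o.customer,
               o.amount,
               t.total_spend,
               RANK() OVER (PARTITION BY o.customer ORDER BY o.amount DESC) AS amount_rank
        FROM orders o
        JOIN customer_totals t ON o.customer = t.customer
    """)
    ranked.show()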
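
As referenced in the orchestration item above, here is a minimal Airflow 2.x DAG sketch that schedules a daily spark-submit run of a pipeline script. The DAG id, schedule, and script path are assumptions made purely for illustration.

    # Minimal Airflow DAG sketch; dag_id, schedule, and paths are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_orders_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Single task that submits the (hypothetical) PySpark job shown earlier.
        run_etl = BashOperator(
            task_id="run_orders_etl",
            bash_command="spark-submit /opt/jobs/orders_pipeline.py",
        )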