Job Description:
Sound understanding of large-scale Data Warehouse and Data Lake concepts and ETL/ELT tooling: Ab Initio, Apache Spark, PySpark, SQL, Oracle, Hadoop.
Advanced dimensional modeling, data vault, and schema design for large-scale Data Warehouses and Data Lakes.
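As an illustration of the dimensional patterns above, a minimal PySpark sketch that resolves natural keys to surrogate keys from dimension tables while loading a fact table. All table, path, and column names (stg_payroll, dim_employee, dim_date, dw.fact_payroll) are assumptions for the example, not a prescribed model.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fact_load_sketch").enableHiveSupport().getOrCreate()

    # Hypothetical staging and dimension tables; names are illustrative only.
    stg = spark.table("stg_payroll")
    dim_emp = spark.table("dim_employee").select("employee_sk", "employee_id")
    dim_date = spark.table("dim_date").select("date_sk", "calendar_date")

    # Resolve natural keys to surrogate keys; keep only additive measures in the fact.
    fact = (stg
            .join(dim_emp, "employee_id", "left")
            .join(dim_date, stg.pay_date == dim_date.calendar_date, "left")
            .select("employee_sk", "date_sk", "gross_pay", "net_pay"))

    fact.write.mode("append").saveAsTable("dw.fact_payroll")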
Deep expertise in ETL/ELT engineering using Ab Initio (graphs, plans, PDL, metadata-driven design) and migration of those patterns to Spark.
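Ab Initio components commonly map onto Spark operations (Reformat to select/withColumn, Join to join, Rollup to groupBy/agg). A hedged sketch of one such translated pattern; the file paths and column names are placeholders, not a reference design.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("abinitio_rollup_migration").getOrCreate()

    txns = spark.read.parquet("/data/landing/transactions")   # illustrative paths
    accounts = spark.read.parquet("/data/landing/accounts")

    # Roughly equivalent to a Reformat -> Join -> Rollup graph in Ab Initio.
    rolled_up = (txns
                 .withColumn("txn_month", F.date_trunc("month", "txn_date"))   # Reformat
                 .join(accounts, "account_id")                                  # Join
                 .groupBy("account_id", "txn_month")                            # Rollup
                 .agg(F.sum("amount").alias("monthly_amount"),
                      F.count("*").alias("txn_count")))

    rolled_up.write.mode("overwrite").parquet("/data/curated/monthly_rollup")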
Hands-on PySpark/Spark proficiency for batch, streaming, joins, windowing, partitioning, and performance tuning on large datasets.
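A minimal sketch of the windowing and partition-tuning skills called out above, using an assumed employee-salary dataset; the partition count and paths are illustrative only.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window_sketch").getOrCreate()

    df = spark.read.parquet("/data/hr/salaries")   # illustrative path

    # Rank salary records per employee and keep only the latest one.
    w = Window.partitionBy("employee_id").orderBy(F.col("effective_date").desc())
    latest = (df
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))

    # Repartition on the key to control file counts and downstream join behavior.
    (latest
     .repartition(200, "employee_id")
     .write.mode("overwrite")
     .partitionBy("pay_grade")
     .parquet("/data/hr/salaries_latest"))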
Strong command of Hadoop ecosystem components: HDFS, Hive, YARN, Oozie/Airflow, Ranger, Atlas, and security/governance frameworks.
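A short PySpark-on-Hive sketch for the ecosystem items above, writing a partitioned Hive-managed table; in practice the same job would be scheduled from Oozie or Airflow and governed via Ranger/Atlas. Database and path names are assumptions.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive_partitioned_write")
             .config("hive.exec.dynamic.partition.mode", "nonstrict")  # often needed for dynamic-partition inserts
             .enableHiveSupport()
             .getOrCreate())

    events = spark.read.parquet("/data/landing/events")   # illustrative path

    # Write into a Hive-managed table partitioned by load_date.
    (events
     .write.mode("overwrite")
     .partitionBy("load_date")
     .saveAsTable("curated_db.events"))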
Oracle SQL mastery including performance tuning, partitioning, materialized views, and implementing and troubleshooting Virtual Private Database (VPD) policies.
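VPD policies themselves are defined inside Oracle (DBMS_RLS); a common companion skill is pulling Oracle data into Spark efficiently. A hedged sketch of a parallel JDBC read, where the connection string, table, bounds, and credentials handling are all placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracle_jdbc_read").getOrCreate()

    # Parallel read from an Oracle table, split on a numeric partition column.
    payroll = (spark.read.format("jdbc")
               .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")  # placeholder
               .option("dbtable", "HR.PAYROLL")
               .option("user", "etl_user")        # credentials normally come from a vault
               .option("password", "******")
               .option("partitionColumn", "PAYROLL_ID")
               .option("lowerBound", "1")
               .option("upperBound", "10000000")
               .option("numPartitions", "16")
               .option("fetchsize", "10000")
               .load())

    payroll.write.mode("overwrite").parquet("/data/landing/payroll")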
Data ingestion architecture using CDC, Kafka, file-based ingestion, and incremental load frameworks for high-volume HR and financial data.
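A minimal Structured Streaming sketch of the Kafka/CDC ingestion pattern above. The topic, brokers, payload schema, and checkpoint location are assumptions, and the job requires the spark-sql-kafka connector on the classpath.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("cdc_kafka_ingest").getOrCreate()

    # Assumed shape of the CDC payload; real schemas come from the CDC tool.
    payload = StructType([
        StructField("op", StringType()),            # I/U/D operation flag
        StructField("employee_id", StringType()),
        StructField("salary", DoubleType()),
        StructField("change_ts", TimestampType()),
    ])

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder
           .option("subscribe", "hr.payroll.cdc")
           .option("startingOffsets", "latest")
           .load())

    changes = (raw
               .select(F.from_json(F.col("value").cast("string"), payload).alias("c"))
               .select("c.*"))

    query = (changes.writeStream
             .format("parquet")
             .option("path", "/data/raw/payroll_cdc")
             .option("checkpointLocation", "/chk/payroll_cdc")
             .trigger(processingTime="1 minute")
             .start())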
Data quality engineering: reconciliation frameworks, validation rules, audit controls, lineage, and automated regression testing.
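A simple reconciliation sketch for the data quality item above, comparing row counts and a control total between a source extract and the loaded target. Table names, the reconciled measure, and the tolerance are assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("recon_check").enableHiveSupport().getOrCreate()

    source = spark.read.parquet("/data/landing/payroll")     # illustrative source extract
    target = spark.table("curated_db.fact_payroll")          # illustrative target table

    src = source.agg(F.count("*").alias("rows"), F.sum("gross_pay").alias("total")).first()
    tgt = target.agg(F.count("*").alias("rows"), F.sum("gross_pay").alias("total")).first()

    # Fail the pipeline loudly when counts or control totals drift.
    assert src["rows"] == tgt["rows"], f"Row count mismatch: {src['rows']} vs {tgt['rows']}"
    assert abs(src["total"] - tgt["total"]) < 0.01, "Control total mismatch"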
Cloud and lakehouse engineering on Databricks: Delta Lake, Unity Catalog, cluster optimization, job orchestration, and CI/CD.
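A hedged Databricks/Delta Lake sketch for the lakehouse item above, upserting staged changes into a Delta table with the delta-spark Python API. The catalog, table, and key columns are assumed for illustration.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta_merge").getOrCreate()

    updates = spark.read.parquet("/data/raw/payroll_cdc")        # illustrative staging data
    target = DeltaTable.forName(spark, "curated.fact_payroll")   # assumed catalog table

    # Standard upsert: update matched keys, insert new ones.
    (target.alias("t")
     .merge(updates.alias("s"), "t.employee_id = s.employee_id AND t.pay_date = s.pay_date")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())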
Metadata-driven pipeline design, reusable transformation frameworks, and parameterized job orchestration patterns.
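A small sketch of the metadata-driven pattern above: table-level parameters live in configuration (here an inline dict for illustration; in practice JSON/YAML or a metadata table) and one generic function runs every feed. All feed names, paths, and targets are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("metadata_driven_ingest").enableHiveSupport().getOrCreate()

    # Illustrative metadata; normally sourced from a config store, not hard-coded.
    FEEDS = [
        {"name": "employees", "source": "/data/landing/employees", "target": "raw_db.employees", "partition": "load_date"},
        {"name": "payroll",   "source": "/data/landing/payroll",   "target": "raw_db.payroll",   "partition": "pay_date"},
    ]

    def run_feed(feed: dict) -> None:
        """Generic load step parameterized entirely by metadata."""
        df = spark.read.parquet(feed["source"])
        (df.write.mode("overwrite")
           .partitionBy(feed["partition"])
           .saveAsTable(feed["target"]))

    for feed in FEEDS:
        run_feed(feed)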
Performance engineering across platforms: skew mitigation, partition strategy, broadcast vs shuffle decisions, and storage format optimization (Parquet/ORC/Delta).
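A performance-engineering sketch for the final item: an explicit broadcast hint for the small dimension, plus key salting for a skewed join. The salt factor, paths, and skewed key are assumptions, and on Spark 3.x adaptive query execution may make manual salting unnecessary.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("skew_and_broadcast").getOrCreate()

    facts = spark.read.parquet("/data/curated/fact_payroll")     # large table, skewed on dept_id
    depts = spark.read.parquet("/data/curated/dim_department")   # small dimension

    # Small side: a broadcast join avoids shuffling the large fact table.
    enriched = facts.join(F.broadcast(depts), "dept_id")

    # Skewed key: salt the hot key across N buckets and explode the small side to match.
    N = 16
    salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))
    salted_depts = depts.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))
    balanced = salted_facts.join(salted_depts, ["dept_id", "salt"])

    balanced.write.mode("overwrite").parquet("/data/curated/payroll_enriched")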