Job Description:
Sound understanding of large-scale Data Warehouse/Data Lake concepts and ETL/ELT – Ab Initio, Apache Spark, PySpark, SQL, Oracle, Hadoop
Advanced dimensional modeling, data vault, and schema design for large‑scale Data Warehouses and Data Lakes.
Deep expertise in ETL/ELT engineering using Ab Initio (graphs, plans, PDL, metadata‑driven design) and migration of those patterns to Spark.
Hands‑on PySpark/Spark proficiency for batch, streaming, joins, windowing, partitioning, and performance tuning on large datasets (a windowing sketch follows this list).
Strong command of Hadoop ecosystem components: HDFS, Hive, YARN, Oozie/Airflow, Ranger, Atlas, and security/governance frameworks.
Oracle SQL mastery including performance tuning, partitioning, materialized views, and implementing and interpreting Virtual Private Database (VPD) policies.
Data ingestion architecture using CDC, Kafka, file‑based ingestion, and incremental load frameworks for high‑volume HR and financial data (a Kafka ingestion sketch follows below).
Data quality engineering: reconciliation frameworks, validation rules, audit controls, lineage, and automated regression testing (a reconciliation sketch is included below).
Cloud and lakehouse engineering on Databricks: Delta Lake, Unity Catalog, cluster optimization, job orchestration, and CI/CD (a Delta Lake merge sketch follows below).
Metadata‑driven pipeline design, reusable transformation frameworks, and parameterized job orchestration patterns (a metadata‑driven sketch is included below).
Performance engineering across platforms: skew mitigation, partition strategy, broadcast vs shuffle decisions, and storage format optimization (Parquet/ORC/Delta); an illustrative join‑tuning sketch appears below.
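
A minimal PySpark sketch of the windowing and repartitioning work referenced above; the column names (emp_id, event_ts), the paths, and the partition count are hypothetical placeholders, not details of this role's actual datasets.

    # Latest record per employee via a ranked window, then repartition for the write.
    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.appName("window-example").getOrCreate()

    events = spark.read.parquet("/data/hr/events")  # hypothetical source path

    # Rank rows per employee by event timestamp, newest first.
    w = Window.partitionBy("emp_id").orderBy(F.col("event_ts").desc())
    latest = (
        events.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )

    # Repartition on the key to control shuffle width and output file layout.
    latest.repartition(200, "emp_id").write.mode("overwrite").parquet("/data/hr/latest")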
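
A sketch of Structured Streaming ingestion from Kafka into a landing zone, per the ingestion item above; the broker address, topic name, and paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "hr.employee.cdc")             # hypothetical topic
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers key/value as binary; cast before landing.
    events = raw.select(
        F.col("key").cast("string").alias("key"),
        F.col("value").cast("string").alias("value"),
        "topic", "partition", "offset", "timestamp",
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "/data/landing/employee_cdc")
        .option("checkpointLocation", "/chk/employee_cdc")
        .trigger(processingTime="1 minute")
        .start()
    )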
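
A sketch of a basic source-vs-target reconciliation check for the data quality item above; the paths, the gross_amount column, and the exact-match comparison are hypothetical simplifications (production checks typically add tolerances and per-key hashing).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("reconciliation").getOrCreate()

    source = spark.read.parquet("/data/raw/payroll")      # hypothetical
    target = spark.read.parquet("/data/curated/payroll")  # hypothetical

    def profile(df):
        # Row count plus a sum-based checksum on an amount column.
        return df.agg(
            F.count(F.lit(1)).alias("row_count"),
            F.sum("gross_amount").alias("amount_checksum"),
        ).first()

    src, tgt = profile(source), profile(target)
    assert src.row_count == tgt.row_count, "row count mismatch"
    assert src.amount_checksum == tgt.amount_checksum, "checksum mismatch"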
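
A sketch of a Delta Lake upsert (MERGE) for the Databricks/lakehouse item above; the table paths and the emp_id match key are hypothetical, and the delta-spark package is assumed when running outside Databricks.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-upsert").getOrCreate()

    updates = spark.read.parquet("/data/staging/employee_updates")  # hypothetical
    target = DeltaTable.forPath(spark, "/data/delta/employee")      # hypothetical

    # Upsert: update existing keys, insert new ones, in a single ACID operation.
    (
        target.alias("t")
        .merge(updates.alias("s"), "t.emp_id = s.emp_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )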
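
A sketch of a metadata‑driven, parameterized transform for the pipeline design item above; the job_config dict stands in for an external metadata store, and every path, expression, and column mapping in it is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("metadata-driven").getOrCreate()

    # In practice this would come from a metadata table or JSON file, not be hard-coded.
    job_config = {
        "source_path": "/data/raw/payroll",
        "target_path": "/data/curated/payroll",
        "filter_expr": "status = 'ACTIVE'",
        "columns": {"employee_id": "emp_id", "gross_pay": "gross_amount"},
    }

    # Read, filter, and rename columns entirely from configuration.
    df = spark.read.parquet(job_config["source_path"]).filter(job_config["filter_expr"])
    for src_col, tgt_col in job_config["columns"].items():
        df = df.withColumnRenamed(src_col, tgt_col)
    df.write.mode("overwrite").parquet(job_config["target_path"])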
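
A sketch of the broadcast-vs-shuffle decision and skew handling for the performance engineering item above; the table paths, the dept_id join key, and the assumption that the dimension table is small enough to broadcast are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("join-tuning").getOrCreate()

    # Adaptive Query Execution can split skewed shuffle partitions at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    facts = spark.read.parquet("/data/facts")  # large, potentially skewed side
    dims = spark.read.parquet("/data/dims")    # small reference/dimension data

    # Broadcast hint ships the small side to every executor and avoids
    # shuffling the large fact table on the join key.
    joined = facts.join(F.broadcast(dims), on="dept_id", how="left")

    joined.write.mode("overwrite").parquet("/data/joined")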