Required Skills & Experience
Programming: Python/PySpark; Scala is a plus
Big Data: Hadoop (HDFS, YARN), Hive, Spark (optimization, tuning)
Orchestration: Apache Airflow
Databases/ETL: MongoDB (indexing, sharding, tuning); SQL Server & SSIS (development, migration); strong SQL & stored procedures
Data Lake: HDFS, Hive, Parquet/ORC, partitioning, compaction
APIs: REST-based ingestion
Reverse engineering & lineage tools
CI/CD & DevOps: Git, Jenkins, Docker, IaC
Monitoring: logging, metrics, lineage
Key Responsibilities
Reverse Engineering & Data Mapping
- Reverse engineer ETL pipelines (SSIS, Spark, stored procedures) to document data flows, logic, and transformations.
- Perform detailed source-to-target mappings with field-level transformations and business rules.
- Build data dictionaries, lineage, and mapping artifacts.
- Collaborate with SMEs to uncover undocumented logic.
- Identify data model gaps and recommend remediation.
ETL Pipeline Remediation
- Design and refactor pipelines aligned to new source APIs and data contracts.
- Re-engineer ETL for 1:1 functional parity during migrations.
- Implement schema evolution, transformations, and mapping changes (batch & streaming).
- Eliminate redundancy and optimize legacy logic.
- Build modular, reusable pipelines using Spark/PySpark/Scala.
- Modernize SSIS and integrate with orchestration frameworks.
- Orchestrate workflows in Airflow (DAGs, dependencies, SLAs).
- Implement logging, error handling, alerting, and metadata capture.
Data Storage Optimization
- Simplify schemas; remove redundant/obsolete data across Hive and MongoDB.
- Optimize partitioning, clustering, and file formats (Parquet, ORC, Avro).
- Redesign MongoDB indexing, sharding, and collections.
- Tune HDFS, Hive, MongoDB, and SQL Server for performance and cost.
- Implement lifecycle management, archival, and retention.
Functional Skills
- Experience in ETL migration/remediation projects
- Strong reverse-engineering skills for legacy ETL (SSIS, Spark, scripts)
- Expertise in source-to-target mapping (STM), transformation specs, and lineage artifacts
- Data modeling (dimensional, normalized, denormalized)
- Schema evolution and zero-downtime migrations
- Performance tuning across compute and storage layers
- Strong debugging and problem-solving for distributed systems
Preferred Qualifications
- AI/ML-assisted ETL remediation or code conversion
- Experience with Wiz or Palo Alto Prisma (APIs, data models, risk metrics)
- Prior Prisma-to-Wiz (or similar CSPM/CNAPP) migration experience
- Knowledge of CSPM/CNAPP domains (vulnerabilities, identities, exposures)
- Experience in regulated, compliance-heavy environments