Project: Identity & Access Management (IAM) Data Modernization
Migration of an on-premises SQL data warehouse to a modern enterprise Data Lake platform, enabling analytics and GenAI use cases. The platform leverages PySpark-based processing, CI/CD pipelines, and containerized deployments on OpenShift (OCP), with Google Cloud Platform as the preferred cloud platform, to deliver scalable, secure, and high-performance data solutions.
About Program/Project
The IAM Data Modernization program focuses on transforming legacy data platforms into a scalable and cloud-compatible architecture.
Key Highlights:
- Integration Scope: 30+ source systems with multiple downstream integrations
- Capabilities: Metrics, reporting, advanced analytics, and GenAI use cases (NL querying, summarisation, cross-domain insights)
- Benefits:
  - Scalable and resilient data platform
  - High-performance semantic and analytics layer
  - Single source of truth for enterprise-wide reporting and analytics
Role Summary
We are looking for a Data Architect with strong expertise in OpenShift (OCP), PySpark, and CI/CD pipelines to design and govern scalable data platforms.
The role requires defining end-to-end data architecture, containerised deployment patterns, orchestration strategies (Airflow/Autosys), and platform standards, along with hands-on involvement in implementation.
Key Responsibilities
Data Architecture & Platform Design
- Define enterprise data architecture for IAM data lake and analytics platform
- Design scalable, modular, and containerised data pipeline architectures on OCP
- Establish data models, schema governance, and data lifecycle strategies
- Define best practices for data partitioning, performance optimisation, and cost efficiency
OpenShift (OCP) & Platform Engineering
- Architect and govern containerised data workloads on OpenShift (OCP)
- Define standards for deployment, scaling, and workload isolation
- Collaborate with DevOps teams for platform engineering and infrastructure alignment
Big Data & Processing (PySpark Focus)
- Define architecture for PySpark-based batch and near real-time processing pipelines
- Provide guidance on distributed processing design, optimisation, and performance tuning
- Establish reusable frameworks for ETL/ELT processing
Data Ingestion & Orchestration
- Architect data ingestion frameworks (batch, streaming, CDC)
- Define orchestration strategies using Airflow / Autosys
- Implement standards for retries, backfills, dependency management, and error handling
DevOps / CI-CD
- Define and oversee CI/CD strategy for data and platform deployments
- Enable automation of build, test, and deployment processes
- Ensure integration of CI/CD pipelines with OCP-based environments
Cloud & Data Platforms (Preferred)
- Provide architecture guidance for Google Cloud Platform-based data platforms (preferred, not mandatory)
- Define integration patterns for hybrid cloud-native and on-premises environments
- Guide teams on cloud migration strategies and modern data platform adoption
Data Governance, Quality & Observability
- Define frameworks for:
  - Data quality, validation, and lineage
  - Metadata management and cataloguing
- Establish monitoring, logging, alerting, and SLOs for platform reliability
- Ensure compliance with data security and audit requirements
Stakeholder Collaboration
- Work closely with client architects, IAM teams, and business stakeholders
- Translate business requirements into scalable technical architecture
- Provide architectural guidance and mentorship to engineering teams
Required Skills
Core Skills (Must Have)
- Strong experience in:
  - OpenShift (OCP) / Kubernetes-based platforms
  - PySpark / Spark ecosystem
  - CI/CD implementation for data platforms
  - Airflow / Autosys orchestration tools
- Solid understanding of:
  - Data lake architectures (layered models)
  - ETL/ELT design patterns
  - Distributed data processing concepts
Data Engineering & Storage
- Expertise in:
  - Data formats: Parquet, ORC, Avro
  - Partitioning and performance tuning
  - Large-scale data modelling for analytics
Cloud (Preferred, Not Mandatory)
- Experience with Google Cloud Platform (GCP) is preferred
- Exposure to services such as BigQuery, Dataproc, Dataflow, and GCS is a plus
Observability & Reliability
- Experience defining:
  - Monitoring, logging, and alerting frameworks
  - Dashboards, SLOs, and operational runbooks
Good to Have
- Experience with IAM domain / cybersecurity data
- Understanding of data security and access control frameworks
- Exposure to GenAI-enabled data platforms
- Experience in Agile delivery and team leadership
Qualifications
- Experience:
  - 10-14+ years in Data Architecture / Data Engineering
  - Strong experience in OCP, PySpark, CI/CD, and orchestration frameworks
  - Prior experience in data modernisation / migration programs
- Education:
  - Bachelor's/Master's in Computer Science, Information Systems, or equivalent
- Certifications (Preferred):
  - OpenShift / Kubernetes certifications
  - Google Cloud Platform certifications (preferred, not mandatory)