Role: L3 – Cloudera Public Cloud Platform Engineer
Work location: Remote
Type: Contract
- 12+ years of experience in Big Data Platform Engineering / Cloud Platform Operations / Infrastructure roles
- 6+ years of hands-on experience with Cloudera ecosystem (CDH/CDP/Cloudera Public Cloud)
- Demonstrated ability to quickly learn and adapt to new technologies and evolving platform capabilities, beyond the currently defined CDP stack
- Strong expertise in:
- End-to-end CDP platform operations (CDE, CDW, CDF, CDL, CAI)
- Advanced troubleshooting across multi-cluster, multi-environment deployments
- Kubernetes-based runtime environments (troubleshooting and diagnostics)
- Observability frameworks, including SLIs/SLOs, alerting, and performance tuning
- Proven experience in:
- Leading P1/P2 incident response, triage, and resolution
- Managing platform upgrades, patching, and lifecycle events
- Supporting large-scale environments (TB/PB scale, high concurrency workloads)
- Strong understanding of:
- Cloud infrastructure (IAM, VPC, networking, storage)
- Security and governance (Ranger, Kerberos, TLS/SSL, SDX)
- Expected to:
- Lead complex troubleshooting and drive root cause resolution across platform layers
- Mentor and guide L2 engineers
- Coordinate with Cloudera support and infrastructure teams for critical issues
- Hands-on experience in developing and troubleshooting NiFi (CDF) data flows, including:
- Flow design and configuration
- Processor-level debugging and performance tuning
- Handling backpressure, throughput optimization, and failure recovery
Required Skills
- Strong experience with Cloudera CDP Public Cloud
- Expertise in:
- Cloud platforms (AWS/Azure/Google Cloud Platform)
- Kubernetes concepts (troubleshooting-focused)
- Hands-on with:
- CDE, CDW, CDF (NiFi), CAI
- Knowledge of:
- IAM, networking, observability tools
- Platforms operating at multi-terabyte to petabyte scale with high concurrency workloads
- Hands-on experience with:
- Kafka (or similar streaming platforms) including monitoring, troubleshooting, and performance tuning
- Experience with Cloudera CDP CLI (Command Line Interface) for:
- Platform operations and administration
- Job execution and service management (CDE/CDW/CDL)
- Automation of routine operational tasks
- Strong working knowledge of:
- Cloud IAM (AWS IAM / Azure AD) including roles, policies, and cross-service access
- User and group mapping across CDP, cloud IAM, and Ranger policies
- Troubleshooting access issues across storage (S3/ADLS), CDP services, and data access layers
Preferred Skills
- Experience with:
- Modernization of legacy data platforms/applications to Cloudera CDP Public Cloud
- Migration and onboarding of workloads to CDE, CDW, and CAI environments
- Supporting hybrid or multi-environment transitions (on-prem → cloud)
- Familiarity with:
- Cloud platforms (AWS, Azure, Google Cloud Platform) including storage, IAM, and networking concepts
- Kubernetes-based runtime environments (troubleshooting-focused)
- Strong scripting and automation skills (Python, Shell, Terraform) for platform operations
What You’ll Work On
- Enterprise-scale Cloudera CDP platform supporting data engineering, analytics, and AI workloads across multiple applications
- Modernization of legacy platforms and applications into cloud-native CDP services
- Operational support and scaling of:
- Data services (CDE, CDW, CDF, CDL)
- AI/ML platforms (CAI, inference, workbenches)
- Platform performance optimization, observability, and reliability engineering for mission-critical workloads
Why This Role Matters
- Ensures availability, stability, and performance of the CDP platform supporting all data and AI workloads
- Enables successful modernization of legacy applications into scalable, cloud-native services
- Maintains high availability, observability, and operational excellence across enterprise platforms
- Acts as the backbone for data engineering, analytics, and AI initiatives
- This role focuses on platform reliability and infrastructure operations and does not include data-layer ownership (e.g., Iceberg table management or data validation).
Job Summary
- We are seeking a highly skilled Cloudera Public Cloud Platform Engineer to operate and manage the end-to-end CDP platform ecosystem, including data services, NiFi, Kafka, AI/ML platforms, and enterprise observability.
- This role is responsible for ensuring availability, scalability, security, and performance of all platform services supporting data, analytics, and AI workloads across environments.
- The ideal candidate brings strong expertise in CDP on-prem, public cloud services, cloud infrastructure, Kubernetes-based runtime environments, and platform observability, supporting high-concurrency, mission-critical workloads at multi-terabyte to petabyte scale.
- This role is critical to ensuring uninterrupted operation of data, analytics, and AI platforms; any degradation directly impacts downstream business reporting, data pipelines, and model execution.
Key Responsibilities
CDP Platform & Multi-Service Operations
- Own end-to-end operational responsibility for Cloudera Public Cloud services across Dev / Stage / UAT / Prod:
- CDE, CDW, COD, CDL, CDF (NiFi), CDV, CAI, Kafka
- Ensure multi-cluster stability, workload isolation, and SLA adherence
- Support onboarding and operations of multiple applications across environments
- Manage and support multi-environment, multi-cluster deployments with strict isolation, governance, and release coordination across Dev/UAT/Prod
AI/ML Platform Operations
- Operate and support Cloudera AI (CAI) environments:
- AI Workbenches, AI Studios
- Model training and development environments
- AI inference endpoints and model serving
- Troubleshoot:
- Resource contention (CPU/GPU)
- Model deployment/runtime failures
CDP Runtime & Kubernetes-Aware Operations
- Operate CDP services running on Cloudera-managed Kubernetes infrastructure
- Apply strong understanding of containerized workloads and Kubernetes concepts for troubleshooting
- Diagnose and resolve:
- Pod failures, restarts, and resource contention
- Spark job failures in containerized environments (CDE)
- Service-to-service communication issues
- Analyze logs and metrics to identify runtime failures and performance issues
- Collaborate with Cloudera support for managed service-level issues
Data Integration & Platform Services
- Operate and support:
- CDF (NiFi) for ingestion pipelines
- CDV (Data Visualization) for reporting workloads
- Octopai for data lineage and catalog integration
- Ensure reliability and performance of data pipelines and integrations
- Monitor and troubleshoot Kafka environments:
- Topic configurations, partitions, and replication
- Consumer lag and throughput issues
- Broker connectivity and performance bottlenecks
Security, Governance & SDX Administration
- Implement and manage:
- Kerberos, TLS/SSL, Ranger policies
- Administer SDX for:
- Centralized security
- Metadata and policy enforcement
- Support Atlas and Octopai integration
- Manage and troubleshoot user access and identity mapping across layers, including:
- Cloud IAM roles and permissions
- CDP users/groups and identity providers
- Ranger policies for fine-grained data access
- Resolve access-related issues impacting:
- Data access (S3/ADLS)
- Query execution (CDW/CDE)
- Application and service-level permissions
Cloud Infrastructure & Networking
- Troubleshoot:
- S3 / ADLS storage issues
- IAM roles and permissions
- VPC, subnets, routing, security groups
- Bastion host access and connectivity
- Ensure secure and reliable connectivity across services
- Understand and troubleshoot S3-based data lake patterns, including:
- Bucket structure, prefix design, and access patterns
- Performance issues related to small files, request rates, and throughput limits
- Encryption (SSE-S3, SSE-KMS) and access policies
- Manage and troubleshoot cross-account IAM roles and access patterns for CDP environments
- Ensure secure access between:
- CDP environments and cloud resources
- Multiple AWS accounts (dev/prod separation)
Disaster Recovery & Resiliency
- Support and validate disaster recovery and failover strategies across CDP environments
- Ensure backup, recovery, and environment resiliency for critical workloads
- Participate in DR drills and recovery validation
Observability, Monitoring & Alerting (Critical)
- Implement and manage end-to-end observability:
- Metrics, logs, and alerting
- Use:
- Cloudera Observability, Cloudera Manager, Prometheus, Grafana
- Monitor:
- Cluster health
- Workload performance
- AI inference endpoints
- Enable proactive issue detection and prevention
- Define and implement SLIs/SLOs and alerting thresholds to ensure platform reliability and performance
- Support high-severity (P1/P2) incident response, triage, and resolution within defined SLAs
Operational Support & On-Call
- Participate in on-call rotation to support 24/7 platform operations
- Respond to production incidents, alerts, and service disruptions within defined SLAs
- Handle P1/P2 incidents, including triage, troubleshooting, and resolution
- Perform root cause analysis (RCA) and implement preventive measures
Upgrades, Patching & Platform Lifecycle
- Execute:
- CDP upgrades and version management
- Security patches and hotfixes
- Perform:
- Rolling upgrades
- Validation and rollback strategies
Performance Optimization & Cost Efficiency
- Optimize:
- Platform-level performance (Spark, Hive, Impala workloads)
- Cluster utilization and workload distribution
- Drive:
- Autoscaling strategies
- Cost optimization (FinOps practices)
Automation & Operational Excellence
- Utilize and support existing automation frameworks for:
- Platform provisioning
- Monitoring and alerting
- Routine operational tasks
- Work with infrastructure teams that manage Infrastructure-as-Code (Terraform) for environment setup and changes
- Leverage scripting (Python / Shell) for:
- Operational support
- Task automation
- Troubleshooting and diagnostics
- Maintain and follow runbooks, SOPs, and operational procedures to ensure consistent platform operations