Mandatory Skills:
6-8 years of experience in Data SRE / Production Support roles.
Strong knowledge of:
Spark job execution & tuning
Hadoop ecosystem (HDFS, YARN)
Kubernetes basics
Serverless Spark environments
Hands-on experience with monitoring, troubleshooting, alerting, and incident response.
Comfort with shell scripting / Python for automation (nice to have, not mandatory; the role is not coding-heavy).
Job description:
Cloud Data SRE (Spark / Data Platform), 6-8 Years Experience.
Role Overview
We are looking for an experienced Cloud Data SRE with 6-8 years of relevant experience to support, manage, and optimize Spark-based data workloads in production. This role is not development-focused; instead, it emphasizes production support, troubleshooting, system reliability, platform migration, and operational excellence across Spark and data ecosystem components.
Key Responsibilities
Production Support & Incident Management
Provide on-call support for production alerts and critical issues.
Perform log analysis, debug application failures, and drive quick resolution.
Handle incident management, root-cause analysis, and permanent remediation.
Conduct alert retrospectives, reduce noise, and fine-tune alert thresholds.
Monitoring & Operational Excellence
Monitor Spark jobs, data pipelines, and underlying infrastructure across Hadoop/Kubernetes/serverless platforms.
Manage server health, Hadoop cluster nodes, and disk utilization.
Configure resource parameters and optimize Spark job performance.
Support developers by helping diagnose and resolve job issues.
Data & Platform Management
Manage data access, quotas, file permissions, and HDFS/Kubernetes resources.
Handle data management operations, including data copy, disaster recovery (DR), retention planning, and utilization checks.
Tooling & Automation
Build/maintain tools for automation, reporting, dashboarding, and incident analysis.
Improve operational efficiency through scripts, utilities, and internal platforms.
Migration Projects
Migrate projects from:
Legacy schedulers to Data Platform
Hadoop HDFS to ACOS
YARN / Kubernetes to Serverless Spark