Mandatory Skills | 6–8 years of experience in Data SRE / Production Support roles.
Strong knowledge of Spark job execution and tuning, the Hadoop ecosystem (HDFS, YARN), Kubernetes basics, and serverless Spark environments.
Hands-on experience with monitoring, troubleshooting, alerting, and incident response. Comfort with shell scripting / Python for automation (nice to have; the role is not coding-heavy).
JD | Cloud Data SRE (Spark / Data Platform) – 6–8 Years Experience
Role Overview
We are looking for an experienced Cloud Data SRE with 6–8 years of relevant experience to support, manage, and optimize Spark-based data workloads in production. This role is not development-focused; instead, it emphasizes production support, troubleshooting, system reliability, platform migration, and operational excellence across Spark and data ecosystem components.
Key Responsibilities
Production Support & Incident Management
Provide on‑call support for production alerts and critical issues. Perform log analysis, debug application failures, and drive quick resolution. Handle incident management, root-cause analysis, and permanent remediation. Conduct alert retrospectives, reduce noise, and fine-tune alert thresholds.
Monitoring & Operational Excellence
Monitor Spark jobs, data pipelines, and underlying infrastructure across Hadoop/Kubernetes/serverless platforms. Manage server health, Hadoop cluster nodes, and disk utilization. Configure resource parameters and optimize Spark job performance. Support developers by helping diagnose and resolve job issues.
Data & Platform Management
Manage data access, quotas, file permissions, and HDFS/Kubernetes resources. Handle data management operations including data copy, disaster recovery (DR), retention planning, and utilization checks.
Tooling & Automation
Build/maintain tools for automation, reporting, dashboarding, and incident analysis. Improve operational efficiency through scripts, utilities, and internal platforms.
Migration Projects
Migrate projects along the following paths: legacy schedulers → Data Platform; Hadoop HDFS → ACOS; YARN / Kubernetes → Serverless Spark.
Support data and compute migration initiatives end-to-end.
Required Experience
6–8 years of experience in Data SRE / Production Support roles.
Strong knowledge of Spark job execution and tuning, the Hadoop ecosystem (HDFS, YARN), Kubernetes basics, and serverless Spark environments.
Hands-on experience with monitoring, troubleshooting, alerting, and incident response. Comfort with shell scripting / Python for automation (nice to have; the role is not coding-heavy).