Cloud Data SRE

Remote • Posted 2 hours ago • Updated 2 hours ago
Contract (W2) • No Travel Required • Compensation: Depends on Experience

Job Details

Skills

  • SRE
  • production support
  • Hadoop
  • Spark
  • Kubernetes
  • Data platform

Summary

Job Title/Role

Cloud Data SRE

Location

Remote

Mandatory Skills

  • 6–8 years of experience in Data SRE / Production Support roles.
  • Strong knowledge of:
      • Spark job execution & tuning
      • Hadoop ecosystem (HDFS, YARN)
      • Kubernetes basics
      • Serverless Spark environments
  • Hands-on experience with monitoring, troubleshooting, alerting, and incident response.
  • Comfort with shell scripting / Python for automation (nice to have; the role is not coding-heavy).

Job Description

Cloud Data SRE (Spark / Data Platform) – 6–8 Years Experience
Role Overview
We are looking for an experienced Cloud Data SRE with 6–8 years of relevant experience to support, manage, and optimize Spark-based data workloads in production. This role is not development-focused; instead, it emphasizes production support, troubleshooting, system reliability, platform migration, and operational excellence across Spark and data ecosystem components.

Key Responsibilities

Production Support & Incident Management

  • Provide on-call support for production alerts and critical issues.
  • Perform log analysis, debug application failures, and drive quick resolution.
  • Handle incident management, root-cause analysis, and permanent remediation.
  • Conduct alert retrospectives, reduce noise, and fine-tune alert thresholds.
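As an illustrative sketch only (not part of the posting), routine log analysis like the triage described above is often scripted in Python; the log-line format and the error keywords below are assumptions for the example:

```python
import re
from collections import Counter

# Assumed log-line format: "<timestamp> <LEVEL> <component>: <message>"
LOG_LINE = re.compile(r"^\S+ (?P<level>ERROR|WARN|INFO) (?P<component>\S+): (?P<message>.*)$")

def summarize_errors(lines):
    """Count ERROR lines per component to spot where failures cluster."""
    errors = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("level") == "ERROR":
            errors[match.group("component")] += 1
    return errors

sample = [
    "2024-01-01T00:00:01 INFO driver: job started",
    "2024-01-01T00:00:02 ERROR executor-3: OutOfMemoryError",
    "2024-01-01T00:00:03 ERROR executor-3: task failed",
    "2024-01-01T00:00:04 ERROR shuffle: fetch failed",
]
print(summarize_errors(sample).most_common())  # [('executor-3', 2), ('shuffle', 1)]
```

In practice the counts would feed an incident summary or a dashboard rather than stdout.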

Monitoring & Operational Excellence

  • Monitor Spark jobs, data pipelines, and underlying infrastructure across Hadoop/Kubernetes/serverless platforms.
  • Manage server health, Hadoop cluster nodes, and disk utilization.
  • Configure resource parameters and optimize Spark job performance.
  • Support developers by helping diagnose and resolve job issues.
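A hedged sketch of the kind of resource-parameter arithmetic this work involves: on YARN, Spark reserves a per-executor memory overhead that defaults to max(384 MiB, 10% of `spark.executor.memory`), so container sizing has to account for both. The helper below is an illustration, not a tool from the posting:

```python
def yarn_container_mem_mb(executor_mem_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Total memory YARN must allocate per executor:
    spark.executor.memory plus spark.executor.memoryOverhead,
    which defaults to max(384 MiB, 10% of executor memory)."""
    overhead = max(min_overhead_mb, int(executor_mem_mb * overhead_factor))
    return executor_mem_mb + overhead

# 8 GiB executors: overhead = max(384, 819) = 819 MiB
print(yarn_container_mem_mb(8192))  # 9011
# 2 GiB executors: the 384 MiB overhead floor kicks in
print(yarn_container_mem_mb(2048))  # 2432
```

Sizing executors against this total helps avoid containers being killed for exceeding their YARN allocation.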

Data & Platform Management

  • Manage data access, quotas, file permissions, and HDFS/Kube resources.
  • Handle data management operations including data copy, DR, retention planning, and utilization checks.
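As one hedged example of a quota/utilization check, the output of `hdfs dfs -count -q <path>` can be parsed into a usage fraction; the column order assumed below follows the Hadoop fs shell documentation (QUOTA, REM_QUOTA, SPACE_QUOTA, REM_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME), and the sample line is fabricated:

```python
def parse_hdfs_count_q(line):
    """Parse one line of `hdfs dfs -count -q <path>` output into a dict.
    'none'/'inf' in the quota columns mean no quota is set."""
    keys = ["quota", "rem_quota", "space_quota", "rem_space_quota",
            "dir_count", "file_count", "content_size", "path"]
    return dict(zip(keys, line.split()))

def quota_used_fraction(entry):
    """Fraction of the space quota consumed, or None if no quota is set."""
    if entry["space_quota"] in ("none", "inf"):
        return None
    quota = int(entry["space_quota"])
    remaining = int(entry["rem_space_quota"])
    return (quota - remaining) / quota

# Fabricated sample: 10 GiB space quota with 2 GiB remaining -> 80% used
sample = "1000 900 10737418240 2147483648 5 95 8589934592 /data/project"
entry = parse_hdfs_count_q(sample)
print(round(quota_used_fraction(entry), 2))  # 0.8
```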

Tooling & Automation

  • Build/maintain tools for automation, reporting, dashboarding, and incident analysis.
  • Improve operational efficiency through scripts, utilities, and internal platforms.
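A minimal sketch of the reporting utilities this covers, tying back to the alert-noise reduction mentioned under incident management; the alert names and threshold are invented for illustration:

```python
from collections import Counter

def noisy_alerts(alert_log, threshold=3):
    """Rank alert names by firing count and flag those at or above a
    noise threshold as candidates for re-tuning or suppression."""
    counts = Counter(name for name, _ in alert_log)
    return [(name, n) for name, n in counts.most_common() if n >= threshold]

# (alert_name, timestamp) pairs, e.g. exported from the alerting system
log = [
    ("disk_pct_high", "t1"), ("disk_pct_high", "t2"), ("disk_pct_high", "t3"),
    ("disk_pct_high", "t4"), ("spark_job_failed", "t5"), ("node_down", "t6"),
]
print(noisy_alerts(log))  # [('disk_pct_high', 4)]
```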

Migration Projects

  • Migrate projects from:
      • Legacy schedulers → Data Platform
      • Hadoop HDFS → ACOS
      • YARN / Kubernetes → Serverless Spark
  • Support data and compute migration initiatives end-to-end.


  • Dice Id: 10118140
  • Position Id: 8936756
