Site Reliability Engineer (Application/Platform)

Remote • Posted 54 minutes ago • Updated 54 minutes ago

Contract W2

Contract Independent

No Travel Required

Remote

Depends on Experience

Job Details

Skills

System Administration
Reliability Engineering
Apache Kafka
Apache Flink
Apache Spark
Apache Hadoop
Kubernetes
IaC
Infrastructure as code
Cloud Computing
Amazon Web Services
Software Engineering
Scripting
Root Cause Analysis
Data Processing
Capacity Management
Access Control

Summary

The Role

We are seeking a high-caliber Site Reliability Engineer (SRE) with a focus on Application and Platform stability. In this role, you will be the guardian of our global application ecosystem, ensuring 24x7 reliability and peak performance. You will bridge the gap between software engineering and systems operations, specifically within heavy Big Data and Streaming environments.

Whether you prefer the collaboration of our Bloomfield office, the comfort of your home office, or a mix of both, we offer total flexibility to fit your lifestyle.

Key Responsibilities

Operational Excellence: Maintain 24x7 system reliability, incident response, and operational readiness for mission-critical global applications.
Incident Leadership: Lead troubleshooting efforts during high-pressure outages; perform deep-dive Root Cause Analysis (RCA) and automate preventive measures.
Reliability Engineering: Define and monitor SLIs/SLOs/SLAs (availability, latency, throughput, and resource utilization).
Big Data & Streaming Support: Manage and optimize distributed data frameworks, ensuring the health of Spark, Flink, and Kafka pipelines.
Infrastructure as Code: Support deployments across AWS Cloud and Kubernetes (EKS) environments.
Cluster Governance: Implement Kubernetes resource quotas, access controls (RBAC), and namespace management to ensure multi-tenant stability.

Technical Qualifications

Core SRE Skills: Proven expertise in monitoring, performance tuning, and capacity planning.
Distributed Systems: Strong hands-on experience with Spark, Flink, and Kafka.
Hadoop Ecosystem: Proficiency in Hadoop Cluster Administration and Operations.
Cloud & Containers: Deep understanding of AWS and Kubernetes (K8s) orchestration.
Automation Mindset: Experience replacing manual "toil" with automated scripts and tools.

Why Join Us?

True Flexibility: We trust our engineers. Choose the work mode that makes you most productive.
Scale: Work on massive distributed systems and global-scale data processing.
Culture: A collaborative environment where Root Cause Analysis is blameless and innovation is encouraged.

Note: We are currently only accepting applications from U.S. citizens (USC) or Green Card (GC) holders.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10430747
Position Id: 8962514