Senior Site Reliability Engineer/Contractor, Kubernetes & Middleware Platforms

Overview

On Site

Depends on Experience

Accepts corp to corp applications

Contract - Independent

Contract - W2

Contract - 12 Month(s)

Skills

Kubernetes clusters

Amazon EKS

Red Hat OpenShift

Apache Kafka

Redis Enterprise Clusters

3scale API Gateway platform

IaC pipelines

Prometheus

Grafana

ELK Stack

Splunk

Python

Shell Scripting

EKS and/or OpenShift administration certification

Node.js

Java

Job Details

Senior Site Reliability Engineer/Contractor, Kubernetes & Middleware Platforms

Location: Jersey City, NJ (100% Onsite)

Role Overview:

As a Senior Site Reliability Engineer, you'll bring software engineering practices to operations: building the reliability framework, defining Service Level Objectives (SLOs), and automating toil away. You'll own the health and performance of container platforms (EKS and OpenShift), middleware platforms (Kafka, Redis), and the CI/CD and observability pipelines that power modern, distributed applications.

Key Responsibilities:

Platform Operations:

Administer and optimize Kubernetes clusters on Amazon EKS and Red Hat OpenShift.
Manage platform lifecycle, upgrades, scaling, and security controls.

Middleware Management:

Operate and tune event platforms such as Apache Kafka.
Administer in-memory data stores such as Redis Enterprise clusters.
Administer and maintain the Red Hat 3scale API Gateway platform.

Automation:

Fine-tune Infrastructure-as-Code (IaC) pipelines and platform components.
Automate manual operations through IaC and configuration management tools/platforms.

Observability & Instrumentation:

Design and implement monitoring dashboards and alerts with Prometheus, Grafana, the ELK Stack, and Splunk.
Instrument distributed Java, Node.js, and Python applications, embedding tracing, metrics, and logs at the code level to meet SLOs.

Reliability Engineering:

Define SLIs/SLOs and manage error budgets, using data-driven insights to balance reliability and feature velocity.
Lead on-call rotations and incident response, and conduct blameless root cause analysis to drive continuous improvement.

Performance & Capacity:

Forecast and right-size resource usage across clusters and middleware.
Profile and tune application performance (CPU, memory, threading) in production.

Required Skills & Qualifications:

12+ years of overall industry experience.
6+ years in SRE, DevOps, Platform, or Production Engineering roles.
EKS and/or OpenShift administration certification (CKA, AWS Certified Kubernetes Administrator, Red Hat Certified OpenShift Administrator, or equivalent).
Hands-on experience with Kubernetes internals, networking, Helm charts, and Operators.
Middleware expertise: deploying, scaling, and securing Kafka and Redis clusters.
Strong IaC toolchain experience: Helm, ArgoCD, Terraform, Ansible, or equivalent tools/platforms.
Observability mastery: Prometheus, Grafana, ELK/Splunk, or equivalent tools/platforms.
Experience enforcing container security and policy governance using tools like OPA/Gatekeeper and Kyverno, and scanners such as Trivy, Clair, and Snyk, integrated with CI/CD and admission controls for automated compliance.
Experience implementing Kubernetes network segmentation using NetworkPolicy and/or Calico, securing east-west traffic and minimizing blast radius to protect service reliability.
Programming/scripting proficiency in Python, shell scripting, Groovy, or a similar automation language.
Demonstrable experience instrumenting distributed applications (Java, Node.js, Python) with metrics, logs, and tracing libraries.
Proven track record of running large-scale production systems with minimal downtime.
Strong analytical, debugging, communication, and collaboration skills.

Nice-to-Have:

Service mesh experience (Istio, Linkerd).
Chaos engineering foundations (Chaos Monkey, LitmusChaos).
Familiarity with security/compliance in regulated environments.
Experience with an API gateway platform (e.g., Red Hat 3scale API Gateway).

What Makes This Role Unique:

You'll be the architect of reliability guardrails, building automation and pipelines that free developers and engineers from manual ops.
You'll define and enforce SLO-driven releases, leveraging error budgets to strike the right balance between innovation and uptime.
You'll own end-to-end instrumentation: from container runtime metrics through Kafka-backed event flows to application-level traces in code.

Consulting Details:

Duration: Contract (TBD)
Location: Jersey City, NJ (Onsite)

