Senior Site Reliability Engineer/Contractor - Kubernetes & Middleware Platforms

  • Jersey City, NJ
  • Posted 4 hours ago | Updated 4 hours ago

Overview

On Site
Depends on Experience
Accepts corp to corp applications
Contract - Independent
Contract - W2
Contract - 12 Month(s)

Skills

Kubernetes clusters
Amazon EKS
Red Hat OpenShift
Apache Kafka
Redis Enterprise Clusters
3scale API Gateway platform
IaC pipelines
Prometheus
Grafana
ELK Stack
Splunk
Python
Shell Scripting
EKS and/or OpenShift administration certification
Node.js
Java

Job Details

Senior Site Reliability Engineer/Contractor - Kubernetes & Middleware Platforms

Location: Jersey City, NJ (100% Onsite)

Role Overview:

As a Senior Site Reliability Engineer, you'll bring software engineering practices to operations: building the reliability framework, defining Service Level Objectives (SLOs), and automating away toil. You'll own the health and performance of the container platforms (EKS and OpenShift), the middleware platforms (Kafka, Redis), and the CI/CD and observability pipelines that power modern, distributed applications.

Key Responsibilities:

  • Platform Operations:
    • Administer and optimize Kubernetes clusters on Amazon EKS and Red Hat OpenShift.
    • Manage platform lifecycle, upgrades, scaling, and security controls.
  • Middleware Management:
    • Operate and tune event-streaming platforms such as Apache Kafka.
    • Administer in-memory data stores such as Redis Enterprise clusters.
    • Administer and maintain the 3scale API Gateway platform.
  • Automation:
    • Fine-tune Infrastructure-as-Code (IaC) pipelines and platform components.
    • Automate manual operations through IaC and configuration management tools/platforms.
  • Observability & Instrumentation:
    • Design and implement monitoring dashboards and alerts with Prometheus, Grafana, the ELK Stack, and Splunk.
    • Instrument distributed Java, Node.js, and Python applications: embed tracing, metrics, and logs at the code level to meet SLOs.
  • Reliability Engineering:
    • Define SLIs/SLOs and manage error budgets, using data-driven insights to balance reliability and feature velocity.
    • Lead on-call rotations and incident response, and conduct blameless root cause analysis to drive continuous improvement.
  • Performance & Capacity:
    • Forecast and right-size resource usage across clusters and middleware.
    • Profile and tune application performance (CPU, memory, threading) in production.

Required Skills & Qualifications:

  • 12+ years of overall industry experience.
  • 6+ years in SRE, DevOps, Platform, or Production Engineering roles.
  • EKS and/or OpenShift administration certification (CKA, AWS Certified Kubernetes Administrator, Red Hat Certified OpenShift Administrator, or equivalent).
  • Hands-on with Kubernetes internals, networking, Helm charts, and Operators.
  • Middleware expertise: Deploying, scaling, and securing Kafka and Redis clusters.
  • Strong IaC toolchain experience: Helm, Argo CD, Terraform, Ansible, or equivalent tools/platforms.
  • Observability mastery: Prometheus, Grafana, ELK/Splunk or equivalent tools/platforms.
  • Experience enforcing container security and policy governance with tools such as OPA/Gatekeeper and Kyverno and scanners such as Trivy, Clair, and Snyk, integrated with CI/CD and admission controls for automated compliance.
  • Experience implementing Kubernetes network segmentation with NetworkPolicy and/or Calico to secure east-west traffic and limit blast radius, protecting service reliability.
  • Programming/scripting proficiency in Python, shell, Groovy, or a similar automation language.
  • Demonstrable experience instrumenting distributed applications (Java, Node.js, Python) with metrics, logs, and tracing libraries.
  • Proven track record of running large-scale production systems with minimal downtime.
  • Strong analytical, debugging, communication, and collaboration skills.

Nice-to-Have:

  • Service mesh experience (Istio, Linkerd).
  • Chaos engineering foundations (Chaos Monkey, LitmusChaos).
  • Familiarity with security/compliance in regulated environments.
  • Experience with any API gateway platform (e.g., Red Hat 3scale API Gateway).

What Makes This Role Unique:

  • You'll be the architect of reliability guardrails, building automation and pipelines that free developers and engineers from manual ops.
  • You'll define and enforce SLO-driven releases, leveraging error budgets to strike the right balance between innovation and uptime.
  • You'll own end-to-end instrumentation: from container runtime metrics through Kafka-backed event flows to application-level traces in code.

Consulting Details:

  • Duration: Contract (TBD)
  • Location: Jersey City, NJ (Onsite)
