SRE (Site Reliability Engineer)

Judge Group, Inc.
Full Time

Job Description

Location: Malvern, PA
Description: Our client is currently seeking a Lead SRE (Site Reliability Engineer). Please send your resume to

As a Cloud Compute SRE Lead, you'll proactively seek out pain points and opportunities for wide-reaching improvement by analyzing enterprise-wide telemetry data. You'll also support reliability-centric tasks for cross-cutting concerns and applications spanning more than one sub-division. You will be expected to be a hands-on developer who implements the identified opportunities to enhance resiliency and harden the platform. You will partner with other shared services teams (performance, chaos, security and fraud, and various ops teams) to bring a holistic approach to hardening the platform's security and resiliency posture.

Qualifications & Requirements:
  • Undergraduate degree in a related field or the equivalent combination of training and experience
  • Adept user of telemetry tools, including CloudWatch, Splunk, and Honeycomb
  • Ability to read and understand application code written in NodeJS, Java, and Python
  • Ability to write and update application code confidently in at least one of the following languages: NodeJS, Java, Python
  • Deep familiarity with SiteMinder, MFA, and OIDC protocols and their implementation (Kong, Envoy, OPA, etc.)
  • Strong conceptual thinking to quickly understand new and complex architectures, and ongoing incidents
  • Experience debugging production incidents using a combination of logs, metrics, and traces
  • Familiarity with executing performance and chaos tests and analyzing results
  • Experience working within the constraints of regulated workloads, including security restrictions
  • Experience building cloud-native applications/platforms
  • Ability to create, interpret, and update technical architecture diagrams


Responsibilities:

  • Proactively seek out operational anomalies using Honeycomb, Splunk, CloudWatch, and other telemetry tools
  • Execute chaos experiments and other resilience tests for spinal services and applications with cross-cutting impacts or high criticality
  • Define SLIs and aligned SLOs for platform services. Implement automation via synthetic monitors and formulas to capture platform availability.
  • Build/Deploy: Identify efficiencies to reduce build/deploy times and failures for application workloads
  • Assess which build/deploy metrics to capture for further refinement and reporting
  • Update application code based on findings to improve resilience and assist in automating workloads to be stood up in a multi-region/out-of-region environment
  • Improve the platform's security posture by easing integration with modernized authorization/authentication protocols (OIDC, Auth0, Kong, Envoy) and identifying any potential vulnerabilities
  • Help product and platform teams and their SRE representatives diagnose complex technical problems, including performance issues and intermittent errors
  • Join and participate in high-severity major incident calls (SEV1s, some cross-cutting SEV2s) to assist with triage and recovery, and participate in post-incident reviews for these incidents
  • Review critical and complex architectures, including facilitation of FMEA exercises
  • Maintain product-level runbooks for incident response that document the step-by-step process to recover from failures of specific components within a system


Contact:

This job and many more are available through The Judge Group. Find us on the web at www.judge.com


Company Information

The Judge Group, celebrating its 50th anniversary, is a leading professional services firm specializing in talent, technology, and learning solutions. We consult, staff, train, and solve. Through our work we make people and organizations better. Our services are successfully delivered through a network of more than 30 offices in the United States, Canada, and India. The Judge Group serves more than 50 of the Fortune 100 and is responsible for over 9,000 professionals on assignment annually across a wide range of industries.

Dice Id : cxjudgpa
Position Id : 873287
Originally Posted : 2 months ago
