Site Reliability Engineer

• Posted 2 days ago • Updated 1 hour ago
Contract W2
On-site
USD55 - USD60/hr

Job Details

Skills

  • Site Reliability Engineer

Summary

location: Malvern, Pennsylvania

job type: Contract

salary: $55 - 60 per hour

work hours: 8am to 5pm

education: Bachelors



responsibilities:


Key Responsibilities


Observability & Monitoring


  • Own and maintain a single-pane-of-glass dashboard for application, platform, dependency, and client journey health.
  • Improve SLOs, SLIs, alerts, dashboards, and monitoring standards.
  • Ensure proactive detection of client-impacting issues using logs, metrics, traces, and synthetic monitoring.
Reliability & Incident Management


  • Improve MTTD, MTTR, and overall service reliability.
  • Maintain incident response playbooks and alerting standards.
  • Facilitate blameless postmortems, root cause analysis, and track corrective actions through closure.
  • Analyze trends and recurring failure patterns to prevent repeat incidents.
Resilience Engineering


  • Lead FMEA assessments for critical applications and journeys.
  • Identify single points of failure and partner with teams on remediation plans.
  • Conduct Game Days, chaos testing, failover testing, and recovery exercises.
  • Validate multi-region, multi-AZ, and disaster recovery capabilities.
Safe Change & Operational Excellence


  • Define reliability standards and operational guardrails.
  • Review production readiness of high-risk changes.
  • Drive adoption of safe deployment practices such as canary releases, feature flags, and automated rollback mechanisms.
Community of Practice & Reliability Leadership


  • Build and lead the Cash & Money Movement SRE Community of Practice.
  • Drive engagement, knowledge sharing, and reliability culture across the organization.
  • Identify and mentor application-level SRE champions/POCs.
  • Facilitate weekly reliability forums, office hours, and operational reviews.
  • Educate teams on SRE best practices, observability, incident management, resilience testing, and safe change principles.
  • Partner closely with Danlin Hibay's SRE and operational excellence organizations to stay aligned with enterprise standards, emerging tools, lessons learned, and engineering best practices.
  • Act as the liaison between Cash & Money Movement and enterprise SRE communities to bring recommendations, standards, and innovations back to product teams.




qualifications:

Key Deliverables


  • Unified Cash & Money Movement Reliability Dashboard
  • Journey Health Dashboard (Add Bank, Transfers, Wires, ACH, Direct Deposit, Cash Plus, etc.)
  • SLO/SLI Framework and Alert Standards
  • FMEA Library and Resiliency Test Plans
  • Incident Playbooks and Postmortem Reviews
  • Reliability Community of Practice
  • Reliability Maturity Assessments and Executive Reporting


Success Measures


  • Reduced Sev 1/2/3 incidents
  • Reduced MTTD and MTTR
  • 100% of critical applications with SLOs, dashboards, and actionable alerts
  • Completion of FMEA and resiliency testing for critical journeys
  • Timely closure of postmortem action items
  • Improved reliability, availability, and client experience across Cash & Money Movement
  • Active and engaged reliability community across Cash & Money Movement


Operating Model


This is a hub-and-spoke SRE model: SRE defines what "good" looks like and drives continuous improvement, while engineering teams remain accountable for execution and results.


SRE owns


  • Reliability standards and best practices
  • Observability and dashboards
  • Assessments, FMEA, and resilience testing
  • Incident reviews and postmortems
  • Community of Practice
  • Education, coaching, and governance


Product Teams own


  • Reliability backlog execution
  • Remediation and implementation
  • Operational outcomes
  • Service health and reliability improvements




Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.

At Randstad Digital, we welcome people of all abilities and want to ensure that our hiring and interview process meets the needs of all applicants. If you require a reasonable accommodation to make your application or interview experience a great one, please contact

Pay offered to a successful candidate will be based on several factors including the candidate's education, work experience, work location, specific job duties, certifications, etc. In addition, Randstad Digital offers a comprehensive benefits package, including: medical, prescription, dental, vision, AD&D, and life insurance offerings, short-term disability, and a 401K plan (all benefits are based on eligibility).

This posting is open for thirty (30) days.


Any consideration of a background check would be an individualized assessment based on the applicant or employee's specific record and the duties and requirements of the specific job.



Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: cxsapwma1
  • Position Id: 1338419