job summary:
Key Responsibilities
Observability & Monitoring
Own and maintain a single-pane-of-glass dashboard for application, platform, dependency, and client journey health.
Improve SLOs, SLIs, alerts, dashboards, and monitoring standards.
Ensure proactive detection of client-impacting issues using logs, metrics, traces, and synthetic monitoring.
Reliability & Incident Management
Improve MTTD, MTTR, and overall service reliability.
Maintain incident response playbooks and alerting standards.
Facilitate blameless postmortems and root cause analysis, and track corrective actions through closure.
Analyze trends and recurring failure patterns to prevent repeat incidents.
Resilience Engineering
Lead FMEA assessments for critical applications and journeys.
Identify single points of failure and partner with teams on remediation plans.
Conduct Game Days, chaos testing, failover testing, and recovery exercises.
Validate multi-region, multi-AZ, and disaster recovery capabilities.
Safe Change & Operational Excellence
Define reliability standards and operational guardrails.
Review production readiness of high-risk changes.
Drive adoption of safe deployment practices such as canary releases, feature flags, and automated rollback mechanisms.
Community of Practice & Reliability Leadership
Build and lead the Cash & Money Movement SRE Community of Practice.
Drive engagement, knowledge sharing, and reliability culture across the organization.
Identify and mentor application-level SRE champions/POCs.
Facilitate weekly reliability forums, office hours, and operational reviews.
Educate teams on SRE best practices, observability, incident management, resilience testing, and safe change principles.
Partner closely with Danlin Hibay's SRE and operational excellence organizations to stay aligned with enterprise standards, emerging tools, lessons learned, and engineering best practices.
Act as the liaison between Cash & Money Movement and enterprise SRE communities to bring recommendations, standards, and innovations back to product teams.
location: Malvern, Pennsylvania
job type: Contract
salary: $55 - $60 per hour
work hours: 8am to 5pm
education: Bachelor's
responsibilities:
Key Responsibilities
Observability & Monitoring
- Own and maintain a single-pane-of-glass dashboard for application, platform, dependency, and client journey health.
- Improve SLOs, SLIs, alerts, dashboards, and monitoring standards.
- Ensure proactive detection of client-impacting issues using logs, metrics, traces, and synthetic monitoring.
Reliability & Incident Management
- Improve MTTD, MTTR, and overall service reliability.
- Maintain incident response playbooks and alerting standards.
- Facilitate blameless postmortems and root cause analysis, and track corrective actions through closure.
- Analyze trends and recurring failure patterns to prevent repeat incidents.
Resilience Engineering
- Lead FMEA assessments for critical applications and journeys.
- Identify single points of failure and partner with teams on remediation plans.
- Conduct Game Days, chaos testing, failover testing, and recovery exercises.
- Validate multi-region, multi-AZ, and disaster recovery capabilities.
Safe Change & Operational Excellence
- Define reliability standards and operational guardrails.
- Review production readiness of high-risk changes.
- Drive adoption of safe deployment practices such as canary releases, feature flags, and automated rollback mechanisms.
Community of Practice & Reliability Leadership
- Build and lead the Cash & Money Movement SRE Community of Practice.
- Drive engagement, knowledge sharing, and reliability culture across the organization.
- Identify and mentor application-level SRE champions/POCs.
- Facilitate weekly reliability forums, office hours, and operational reviews.
- Educate teams on SRE best practices, observability, incident management, resilience testing, and safe change principles.
- Partner closely with Danlin Hibay's SRE and operational excellence organizations to stay aligned with enterprise standards, emerging tools, lessons learned, and engineering best practices.
- Act as the liaison between Cash & Money Movement and enterprise SRE communities to bring recommendations, standards, and innovations back to product teams.
qualifications:
Key Deliverables
Unified Cash & Money Movement Reliability Dashboard
Journey Health Dashboard (Add Bank, Transfers, Wires, ACH, Direct Deposit, Cash Plus, etc.)
SLO/SLI Framework and Alert Standards
FMEA Library and Resiliency Test Plans
Incident Playbooks and Postmortem Reviews
Reliability Community of Practice
Reliability Maturity Assessments and Executive Reporting
Success Measures
Reduced Sev 1/2/3 incidents
Reduced MTTD and MTTR
100% critical applications with SLOs, dashboards, and actionable alerts
Completion of FMEA and resiliency testing for critical journeys
Timely closure of postmortem action items
Improved reliability, availability, and client experience across Cash & Money Movement
Active and engaged reliability community across Cash & Money Movement
Operating Model
This is a hub-and-spoke SRE model: SRE defines what "good" looks like and drives continuous improvement, while engineering teams remain accountable for execution and results.
SRE owns
Reliability standards and best practices
Observability and dashboards
Assessments, FMEA, and resilience testing
Incident reviews and postmortems
Community of Practice
Education, coaching, and governance
Product Teams own
Reliability backlog execution
Remediation and implementation
Operational outcomes
Service health and reliability improvements
Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.
At Randstad Digital, we welcome people of all abilities and want to ensure that our hiring and interview process meets the needs of all applicants. If you require a reasonable accommodation to make your application or interview experience a great one, please contact
Pay offered to a successful candidate will be based on several factors including the candidate's education, work experience, work location, specific job duties, certifications, etc. In addition, Randstad Digital offers a comprehensive benefits package, including: medical, prescription, dental, vision, AD&D, and life insurance offerings, short-term disability, and a 401K plan (all benefits are based on eligibility).
This posting is open for thirty (30) days.
Any consideration of a background check would be an individualized assessment based on the applicant or employee's specific record and the duties and requirements of the specific job.