Principal Technical Architect, Site Reliability Engineering

Overview

Remote
On Site
USD200,000 - USD240,000
Full Time

Skills

Principal Technical Architect
Site Reliability Engineering

Job Details

job summary:

You will be responsible for building a purposeful, proactive, and sustainable approach to reliability on a foundation of SRE principles. You will partner with multiple support teams, architects, developers, and other stakeholders to develop common tools and guidance and drive adoption of key reliability engineering practices in support of large-scale and mission-critical services. Through your deep SRE knowledge and history of implementation, you will have open, candid conversations with senior leaders and engineers and play a pivotal role in establishing a foundational SRE practice.




location: Chicago, Illinois

job type: Permanent

salary: $200,000 - $240,000 per year

work hours: 8am to 5pm

education: Bachelor's



responsibilities:

You will partner with multiple support teams, architects, developers, and other stakeholders to develop common tools and guidance and drive adoption of key reliability engineering practices in support of large-scale and mission-critical services. Through your deep SRE knowledge and history of implementation, you will have open, candid conversations with senior leaders and engineers and play a pivotal role in establishing a foundational SRE practice.




qualifications:

Required Qualifications



  • Minimum 5 years in an SRE role, with at least 3 years in an architect or technical leadership position.
  • At least 3 years of experience designing and implementing highly scalable and fault-tolerant systems.
  • In-depth knowledge of resilience patterns (e.g., circuit breakers, timeouts, retries) and how to design and implement them.
  • In-depth knowledge of CI/CD processes and tools to ensure software is delivered safely using known deployment strategies (e.g., blue/green, canary deployments, feature toggles).
  • Authored technical postmortems (at least weekly) with root cause analyses and documented action items that resulted in measurable resiliency improvements.
  • Contributed to the SLO strategy for at least 5 teams, ensuring alignment with business and client objectives.
  • Three or more years of hands-on experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk), with a proven track record of setting up dashboards and alerts.
  • Developed at least 5 scripts or tools that reduced repetitive operational toil.
  • Led or participated in at least three cross-functional SRE-focused initiatives that included key stakeholders from both technical and business units.
  • Participated in resilience or chaos engineering exercises at least yearly, with documentation showing a reduction in unplanned downtime.
  • Presented findings or led training sessions at least twice annually to share SRE practices, enhancing team performance or adoption rates for reliability engineering methods.
  • Managed or mentored at least 2 junior engineers or teams in SRE best practices, with improvements in incident resolution speed and reliability metrics.
  • Authored and maintained comprehensive SRE documentation for at least 3 critical systems or workflows, including incident response guides, runbooks, operational playbooks, SLO implementation, and observability.




skills: Preferred Qualifications



  • Evangelize SRE mindset and practices across the Technology Solutions organization.
  • Partner with support, development, and business stakeholders to develop, measure, and leverage service level objectives.
  • Design and develop solutions to eliminate toil and manual effort from day-to-day support responsibilities.
  • Identify and implement improvements to logging, metrics, and tracing telemetry and triaging capabilities across a diverse technology stack.
  • Lead complex triage and postmortem activities for critical issues and drive prioritization/resolution of remediation items.
  • Perform chaos engineering experiments to improve application resilience to known and unknown failures.
  • Document reliability guidance and best practices, and advocate for and drive their adoption.
  • Foster a culture of learning through coaching, mentoring, and knowledge sharing around reliability practices, processes, and tools.
  • Develop tools, frameworks, and instrumentation to validate and increase release success for applications.




Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.

At Randstad Digital, we welcome people of all abilities and want to ensure that our hiring and interview process meets the needs of all applicants. If you require a reasonable accommodation to make your application or interview experience a great one, please contact

Pay offered to a successful candidate will be based on several factors including the candidate's education, work experience, work location, specific job duties, certifications, etc. In addition, Randstad Digital offers a comprehensive benefits package, including: medical, prescription, dental, vision, AD&D, and life insurance offerings, short-term disability, and a 401K plan (all benefits are based on eligibility).

This posting is open for thirty (30) days.


It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.



Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.