Lead Site Reliability Engineering (SRE) - Randstad Digital

Overview

On Site

USD70 - USD78

Contract - W2

Skills

Lead Site Reliability Engineering (SRE)

Job Details

job summary:

We are seeking a senior-level technical candidate to support production systems, with a strong emphasis on Site Reliability Engineering (SRE) principles, incident and problem management, change/release processes, and observability maturity. The role requires deep collaboration with development and business teams to understand application functionality and drive operational excellence.

You will be responsible for transforming and maturing global support services, promoting adoption of core toolsets, and strengthening partnerships across technology and business. The environment includes large-scale, Tier 1 applications (e.g., IIS/.NET/SQL), multi-tier web hosting, clustering, and load balancing. You will ensure compliance with corporate policies on security, documentation, audit, and change control.

This role also includes leading a team of onshore and offshore engineers to maintain system availability, performance, and reliability through automation, monitoring, and continuous improvement.

In this role, you will:

Lead complex, high-impact initiatives including systems consultation and SRE strategy implementation.

Drive observability improvements by identifying gaps in monitoring, logging, and tracing across platforms.

Collaborate with engineering teams to define SLIs, SLOs, and error budgets.

Automate operational tasks and incident response workflows using modern programming languages (e.g., Python, Go, Bash).

Design and implement scalable, resilient systems using infrastructure-as-code and CI/CD pipelines.

Conduct root cause analyses and postmortems to improve system reliability.

Consult on technical changes and enhancements with a focus on performance, scalability, and fault tolerance.

Partner with architects and engineers to align with enterprise strategies and ensure secure, maintainable solutions.

Lead and mentor a distributed team, fostering a culture of continuous learning and operational excellence.

Required Qualifications:

5+ years of Systems Engineering or Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, or education.

5+ years of experience in SRE, platform engineering, or production support roles.

Experience with observability tools such as Prometheus, Grafana, AppDynamics (AppD), or Splunk.

2 years of experience programming in one or more languages such as Python, Java, or Go.

1+ years of experience with Cloud technologies

Desired Qualifications:

Strong understanding of distributed systems, cloud platforms (OpenShift, Azure, Google Cloud Platform), and container orchestration (Kubernetes).

Familiarity with CI/CD workflows, version control systems, and infrastructure-as-code tools (e.g., Terraform, Ansible).

Experience with ThousandEyes and BigPanda

Proven ability to identify and remediate gaps in system observability and performance.

Excellent problem-solving skills and ability to lead cross-functional teams.

location: Charlotte, North Carolina

job type: Contract

salary: $70 - $78 per hour

work hours: 8am to 5pm

education: Associate degree

