Overview
On Site
USD 126,300.00 - 181,550.00 per year
Full Time
Skills
Insurance
Management
Capacity Management
Performance Analysis
Customer Experience
Scalability
Service Delivery
Operational Efficiency
Software Development
Systems Design
Reliability Engineering
Workflow
Automated Testing
Process Optimization
ROOT
Continuous Improvement
DevOps
Production Engineering
Programming Languages
Python
Java
C++
Linux
Scripting
Performance Tuning
Debugging
Cloud Computing
Kubernetes
Orchestration
GitHub
Continuous Integration
Continuous Delivery
Instrumentation
Dashboard
Splunk
Grafana
Terraform
Ansible
Cloud Architecture
Nginx
Database
Incident Management
Communication
Collaboration
Mentorship
CHAOS
Load Testing
Modeling
Google Cloud Platform
Google Cloud
CUSP
Workforce Management
HR Management
Artificial Intelligence
Law
Job Details
Company Overview
With 80,000 customers across 150 countries, UKG is the largest U.S.-based private software company in the world. And we're only getting started. Ready to bring your bold ideas and collaborative mindset to an organization that still has so much more to build and achieve? Read on.
At UKG, you get more than just a job. You get to work with purpose. Our team of U Krewers is on a mission to inspire every organization to become a great place to work through our award-winning HR technology built for all.
Here, we know that you're more than your work. That's why our benefits help you thrive personally and professionally, from wellness programs and tuition reimbursement to U Choose - a customizable expense reimbursement program that can be used for more than 200 needs that best suit you and your family, from student loan repayment to childcare to pet insurance. Our inclusive culture, active and engaged employee resource groups, and caring leaders value every voice and support you in doing the best work of your career. If you're passionate about our purpose - people - then we can't wait to support whatever gives you purpose. We're united by purpose, inspired by you.
About the Team
Lead Site Reliability Engineers at UKG are critical team members who have a breadth of knowledge encompassing all aspects of service delivery. They develop software solutions to enhance, harden, and support our service delivery processes. This can include building and managing CI/CD deployment pipelines, automated testing, capacity planning, performance analysis, monitoring, alerting, chaos engineering, and auto-remediation.
Lead Site Reliability Engineers must be passionate about learning and evolving with current technology trends. They strive to innovate and are relentless in pursuing a flawless customer experience. They have an "automate everything" mindset, helping us bring value to our customers by deploying services with incredible speed, consistency, and availability.
About the Role
As a Lead Site Reliability Engineer (SRE) at UKG, you will be a key driver in ensuring the reliability, scalability, and performance of our critical services. You will lead the design, development, and implementation of automation and tooling that support seamless service delivery, improve operational efficiency, and enhance incident response. Your work will directly impact our ability to deliver high-quality, resilient services to millions of users.
You will collaborate closely with engineering, product, and infrastructure teams to embed reliability best practices throughout the software development lifecycle, from system design to production deployment and ongoing support. You will also mentor junior engineers and serve as a technical leader and advocate for site reliability within the organization.
Key Responsibilities
- Define and implement SRE best practices, standards, and automation frameworks for system reliability, monitoring, and incident response.
- Build and maintain GitHub Actions (GHA) workflows, automated testing suites, monitoring systems, alerting mechanisms, and self-healing infrastructure.
- Define, implement, and measure SLIs and SLOs to guide reliability-focused engineering decisions.
- Drive operational improvements by reducing manual toil through automation and process optimization.
- Lead incident response efforts to minimize customer impact and reduce MTTx, including leading post-incident reviews to identify root causes and implement long-term solutions.
- Collaborate cross-functionally with software engineers, product owners, and infrastructure teams to ensure reliability goals are integrated into development and release processes.
- Mentor and coach team members on SRE principles, tools, and techniques, fostering a culture of reliability and continuous improvement.
Basic Qualifications
- Minimum 5 years of engineering experience, including 5+ years in Site Reliability, DevOps, or Production Engineering roles.
- Experience in one or more programming languages (e.g., Python, Go, Java, or C++) with the ability to write production-grade software.
- Linux systems expertise, including scripting, performance tuning, and debugging.
- Experience operating large-scale distributed systems in public cloud environments, preferably Google Cloud Platform.
- Knowledge of Kubernetes and container orchestration patterns in production environments.
- Experience with GitHub Actions and modern CI/CD practices.
- Experience with SLI/SLO design, service health instrumentation, and production telemetry.
- Proven ability to build dashboards and alerts using Splunk and Grafana.
- Strong understanding of observability systems, including metrics pipelines, distributed tracing, log aggregation, alerting strategies, and incident triage.
- Familiarity with infrastructure-as-code tools (e.g., Terraform, Ansible).
- Broad grounding in at least two of the following: cloud architecture, Nginx, security, or database technologies.
- Strong troubleshooting skills for complex system issues, with proven experience leading incident response efforts.
- Excellent communication and collaboration skills, with experience mentoring and guiding engineers.
Preferred Qualifications
- Experience implementing chaos engineering, load testing, and resilience modeling.
- Google Cloud Professional Cloud Architect certification is a plus.
- Understanding of OpenTelemetry (metrics, tracing, logs) and its integration into observability pipelines.
Where we're going
UKG is on the cusp of something truly special. Worldwide, we already hold the #1 market share position for workforce management and the #2 position for human capital management. Tens of millions of frontline workers start and end their days with our software, with billions of shifts managed annually through UKG solutions today. Yet it's our AI-powered product portfolio designed to support customers of all sizes, industries, and geographies that will propel us into an even brighter tomorrow!
Equal Opportunity Employer
UKG is proud to be an equal opportunity employer and is committed to maintaining a diverse and inclusive work environment. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, disability, marital status, familial status, sexual orientation, pregnancy, genetic information, gender identity, gender expression, national origin, ancestry, citizenship status, veteran status, and any other legally protected status under federal, state, or local anti-discrimination laws.
View The EEO Know Your Rights poster
UKG participates in E-Verify. View the E-Verify posters here.
It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.
The pay range for this position is $126,300 to $181,550; however, base pay offered may vary depending on skills, experience, job-related knowledge, and location. This position is also eligible for a short-term incentive and a long-term incentive as part of total compensation. Information about UKG's comprehensive benefits can be reviewed on our careers site at
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.