Site Reliability Engineer

• Posted 1 day ago • Updated 7 hours ago

Full Time

USD $100,000.00 - $140,000.00 per year

Job Details

Skills

Scalability
Optimization
Reliability Engineering
Consumer Goods
IT Management
Decision-making
Systems Design
Root Cause Analysis
Documentation
IaaS
Database
Microsoft SQL Server
PostgreSQL
MongoDB
Java
Budget
Capacity Management
Performance Tuning
Design Automation
Workflow
Computer Science
Information Systems
DevOps
Software Development
Linux Administration
Docker
Orchestration
Leadership
Mentorship
Incident Management
Communication
Collaboration
Agile
ITIL
SaaS
Authentication
Identity Management
Terraform
Cloud Computing
Amazon Web Services
Microsoft Azure
Google Cloud
Google Cloud Platform
Kubernetes
New Relic
Grafana
Linux
Red Hat Certified Engineer
LPIC
iCIMS
SAP BASIS
Market Analysis
Law
Insurance
Workday

Summary

Job Overview

We are seeking an experienced Site Reliability Engineer (SRE) to drive technical excellence within our global Site Reliability Engineering organization. This role is essential to maintaining and improving the reliability, scalability, and performance of our multi-cloud SaaS platform, which serves thousands of customers worldwide. The successful candidate will lead a team of SREs while providing hands-on technical expertise in incident response, system optimization, and reliability engineering practices across our complex technology stack. Off-hours support is required as needed.

About Us

When you join iCIMS, you join the team helping global companies transform business and the world through the power of talent. Our customers do amazing things: design rocket ships, create vaccines, deliver consumer goods globally, overnight, with a smile. As the Talent Cloud company, we empower these organizations to attract, engage, hire, and advance the right talent. We're passionate about helping companies build a diverse, winning workforce and about building our home team. We're dedicated to fostering an inclusive, purpose-driven, and innovative work environment where everyone belongs.

Responsibilities

Technical Leadership

Provide technical leadership as part of a team of 15+ SREs across one or more geographic regions (US, Ireland, or India)

Provide technical mentorship and career development for team members

Drive technical decision-making for complex reliability and performance challenges

Conduct architecture reviews and provide input on system design for reliability

Lead post-incident reviews and drive implementation of preventive measures

Incident Management & Response

Participate in enterprise-wide incident management, ensuring rapid prevention, detection, response, and resolution

Develop and maintain runbooks and emergency response procedures

Lead root cause analysis and ensure comprehensive documentation

Participate in 24/7 on-call rotation and escalation procedures across global teams

Interface with Engineering teams and Incident Manager during critical incident resolution

Platform Reliability & Performance

Monitor and optimize multi-cloud infrastructure (AWS primary, Azure, Google Cloud Platform)

Ensure reliability of core services: AWS resources, Auth0/Okta authentication, databases (SQL Server, PostgreSQL, MongoDB), and legacy Java applications

Implement and maintain SLIs, SLOs, and error budgets for assigned services

Drive capacity planning and performance optimization initiatives

Automation & Tooling

Design automation solutions to reduce manual operational overhead

Develop monitoring strategies using New Relic, Grafana, and Sumo Logic

Create infrastructure-as-code for reliable deployments

Build self-healing systems and automated remediation workflows

Qualifications

Bachelor's degree in Computer Science, Engineering, Information Systems, or a related technical field
Equivalent combination of education and experience will be considered

Technical Experience

5+ years in SRE, DevOps, Software Development, or Infrastructure Engineering roles
Deep hands-on experience with multi-cloud environments (AWS required, Azure preferred)
Strong Linux system administration and troubleshooting
Experience with containerization (Docker) and orchestration (Kubernetes, ECS)
Proficiency with monitoring tools (New Relic, Grafana, Prometheus)

Leadership & Communication

Proven track record leading and mentoring technical teams
Experience as an incident response participant during critical incidents
Strong communication skills with engineering teams and stakeholders
Cross-functional collaboration in agile environments

SRE & Operations

Demonstrated success implementing SRE principles in large-scale production environments
Experience with ITIL frameworks and tools
Background in establishing and maintaining SLAs for enterprise SaaS products

Preferred

Authentication and identity management systems knowledge
Infrastructure-as-code tools (Terraform, CloudFormation)
Cloud certifications (AWS, Azure, or Google Cloud)
Kubernetes certifications
New Relic/Grafana monitoring certifications
Linux certifications (RHCE, LPIC-2)

EEO Statement

iCIMS is a place where everyone belongs. We celebrate diversity and are committed to creating an inclusive environment for all employees. Our approach helps us to build a winning team that represents a variety of backgrounds, perspectives, and abilities. So, regardless of how your diversity expresses itself, you can find a home here at iCIMS.

We are proud to be an equal opportunity and affirmative action employer. We prohibit discrimination and harassment of any kind based on race, color, religion, national origin, sex (including pregnancy), sexual orientation, gender identity, gender expression, age, veteran status, genetic information, disability, or other applicable legally protected characteristics. If you would like to request an accommodation due to a disability, please contact us at

Compensation and Benefits

We accept applications for this position on an ongoing basis until the position is filled. Applications will be reviewed as they are received, and qualified candidates may be contacted throughout the posting period.

The anticipated base pay range for this position is $100,000.00-$140,000.00 annually. Final compensation will be based on factors such as relevant experience, skills, education, internal equity, and market data. This range aligns with our commitment to equitable and transparent compensation practices, as required by applicable law.

Competitive health and wellness benefits include medical, dental, vision, 401(k), dependent care, short-term and long-term disability, life and AD&D insurance, bonding and parental leave, mindfulness resources, an open vacation policy, sick days, paid holidays, quiet hours each workday, and tuition reimbursement. Benefits and eligibility may vary by location, role, and tenure. Learn more here:

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10526121
Position Id: 6411