Site Reliability Engineer, Enterprise Technology Services

Sunnyvale, CA, US • Posted 30+ days ago • Updated 9 hours ago

Full Time

On-site

Job Details

Skills

IDMS
Build Tools
Software Development
Pivotal
Identity Management
High Availability
Authorization
Provisioning
Lifecycle Management
Recovery
Fraud
Replication
Data Centers
Capacity Management
Disaster Recovery
Failover
Incident Management
Auditing
Debugging
Management
Acceptance Testing
Open Source
Data Processing
Scripting Language
Bash
Ansible
Problem Solving
Conflict Resolution
Splunk
Grafana
Budget
SLA
Release Engineering
DevOps
Version Control
Git
Continuous Integration
Continuous Delivery
Java
Python
Database
NoSQL
OLAP
Apache Kafka
RabbitMQ
Problem Management
Root Cause Analysis
Reliability Engineering
Cryptography
Authentication
OAuth
SAML
SSO
Regulatory Compliance
Innovation
Collaboration
Machine Learning (ML)
Generative Artificial Intelligence (AI)
Operational Efficiency
Cyber Security
Computer Science

Summary

At Apple, groundbreaking ideas quickly transform into extraordinary products and services that delight millions worldwide. If you're passionate about engineering and operating robust, large-scale systems, imagine the impact you could make.

The Identity Management Services (IdMS) SRE team is seeking a Site Reliability Engineer (SRE) to design, build tools for, and support our critical platform services. We're looking for someone with strong software development skills, deep systems expertise, and a solid understanding of SRE principles, ready to ensure operational precision at Apple's immense scale. Your work will be pivotal in powering services across Apple, partnering with engineering teams to deliver seamless experiences.

Description

This role involves managing one of the largest Identity Management Platform services for a vast customer base across various devices and services. Key responsibilities include overseeing critical services such as device provisioning, authentication, token management, and security. A primary objective is ensuring the high availability and reliability of the system to facilitate critical authentication and authorization transactions, user provisioning, purchases, subscriptions, and account lifecycle management (creation, management, and recovery).

This also entails maintaining platform security by blocking and rate-limiting fraud traffic at the perimeter, and ensuring high data consistency and replication across multiple data centers through custom mechanisms. The role covers managing infrastructure, capacity planning, disaster recovery, and auto-failover mechanisms. It also involves monitoring infrastructure and application services, driving incident management for internal and external stakeholders, and defining system and functional observability.

Furthermore, this position helps teams overcome system bottlenecks and architectural challenges for efficiency improvements, ensures systems are compliant with industry standards and pass critical audits, and drives automation solutions for large-scale platform service needs. Advanced responsibilities include alert engineering, anomaly detection with Machine Learning tools, and adapting to Generative AI enhancements. Investigating device-related issues by debugging relevant logs is also part of the role, alongside managing the full system lifecycle, including configuration and code deployment in user acceptance test and production environments.

Minimum Qualifications

5+ years of experience in Site Reliability Engineering and Java, with a strong focus on building, scaling, and operating large-scale distributed platform services.

BS degree in computer science or equivalent field with 7+ years of experience or MS degree in computer science or equivalent field with 5+ years of experience.

Strong technical grasp and experience working on Open Source technologies designed for large-scale data processing.

Experience designing, analyzing, and troubleshooting distributed systems.

Proficiency in at least one programming or scripting language (Python, Java, Go, Bash, Ansible, or similar).

Experience designing observability stacks (Prometheus, Grafana, Datadog, OpenTelemetry, ELK, etc.).

Excellent troubleshooting and problem-solving skills.

Preferred Qualifications

Observability & SRE Principles: Experience with monitoring and logging tools (e.g., Prometheus, Splunk, Grafana, OpenTelemetry) and a strong understanding of SRE principles, including observability, error budgeting, and service reliability metrics (SLA, SLO, SLI).

CI/CD & Automation: Proficiency with CI/CD, Release Engineering, DevOps practices, and source control (Git). Experience designing and implementing CI/CD pipelines and Infrastructure as Code (Helm, CRD).

Programming & Data Systems: Strong programming skills in languages like Java, Python, Go, etc. Experience with various databases (Relational, NoSQL, OLAP) and event-driven architectures (Kafka, RabbitMQ).

Reliability & Operations: Experience with on-call, including incident/problem management (PIR, RCA) and a strong sense of ownership for system reliability.

Security & Compliance: Understanding of security standards, policies, cryptography, and authentication (OAuth, SAML, SSO). Knowledge of Governance and Compliance.

Innovation & Collaboration: Passion for designing reliable systems, advocating for automation, and a desire to collaborate effectively. Experience leveraging ML/GenAI for operational efficiency is a plus.

Certification: Cybersecurity certification will be an added advantage.

Education: Bachelor's or Master's degree in Computer Science or equivalent practical experience.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 90733111
Position Id: 75aeff85d87a76aec5ec26f42ced6701
