Site Reliability Engineer, Enterprise Technology Services

Sunnyvale, CA, US • Posted 4 days ago • Updated 1 day ago
Full Time
On-site

Job Details

Skills

  • IDMS
  • Build Tools
  • Software Development
  • Pivotal
  • Identity Management
  • High Availability
  • Authorization
  • Provisioning
  • Lifecycle Management
  • Recovery
  • Fraud
  • Replication
  • Data Centers
  • Capacity Management
  • Disaster Recovery
  • Failover
  • Incident Management
  • PASS
  • Auditing
  • Debugging
  • Management
  • Acceptance Testing
  • FOCUS
  • Open Source
  • Data Processing
  • Scripting Language
  • Bash
  • Ansible
  • Stacks Blockchain
  • Conflict Resolution
  • Problem Solving
  • Splunk
  • Grafana
  • Budget
  • SLA
  • Release Engineering
  • DevOps
  • Version Control
  • Git
  • Continuous Integration
  • Continuous Delivery
  • Java
  • Python
  • Database
  • NoSQL
  • OLAP
  • Apache Kafka
  • RabbitMQ
  • Problem Management
  • Root Cause Analysis
  • Reliability Engineering
  • Cryptography
  • Authentication
  • OAuth
  • SAML
  • SSO
  • Regulatory Compliance
  • Collaboration
  • Machine Learning (ML)
  • Generative Artificial Intelligence (AI)
  • Operational Efficiency
  • Cyber Security
  • Computer Science

Summary

At Apple, groundbreaking ideas quickly transform into extraordinary products and services that delight millions worldwide. If you're passionate about engineering and operating robust, large-scale systems, imagine the impact you could make.

The Identity Management Services (IdMS) SRE team is seeking a Site Reliability Engineer (SRE) to design, build tools for, and support our critical platform services. We're looking for someone with strong software development skills, deep systems expertise, and a solid understanding of SRE principles, ready to ensure operational precision at Apple's immense scale. Your work will be pivotal in powering services across Apple, partnering with engineering teams to deliver seamless experiences.

This role involves managing one of the largest Identity Management Platform services for a vast customer base across various devices and services. Key responsibilities include overseeing critical services such as device provisioning, authentication, token management, and security. A primary objective is ensuring the high availability and reliability of the system to facilitate critical authentication and authorization transactions, user provisioning, purchases, subscriptions, and account lifecycle management (creation, management, and recovery). This also entails maintaining platform security by blocking and rate-limiting fraud traffic at the perimeter, and ensuring high data consistency and replication across multiple data centers through custom mechanisms. The role covers managing infrastructure, capacity planning, disaster recovery, and auto-failover mechanisms. It also involves monitoring infrastructure and application services, driving incident management for internal and external stakeholders, and defining system and functional observability. Furthermore, this position helps teams overcome system bottlenecks and architectural challenges for efficiency improvements, ensures systems are compliant with industry standards and pass critical audits, and drives automation solutions for large-scale platform service needs. Advanced responsibilities include alert engineering, anomaly detection with Machine Learning tools, and adapting to Generative AI enhancements. Investigating device-related issues by debugging relevant logs is also part of the role, alongside managing the full system lifecycle, including configuration and code deployment in user acceptance test and production environments.

  • 5+ years of experience in Site Reliability Engineering with a strong focus on building, scaling, and operating large-scale distributed platform services, as well as experience with Java.
  • BS degree in computer science or equivalent field with 7+ years of experience, or MS degree in computer science or equivalent field with 5+ years of experience.
  • Strong technical grasp of, and experience working with, Open Source technologies designed for large-scale data processing.
  • Experience designing, analyzing, and troubleshooting distributed systems.
  • Proficiency in at least one programming or scripting language (Python, Java, Go, Bash, Ansible, or similar).
  • Experience designing observability stacks (Prometheus, Grafana, Datadog, OpenTelemetry, ELK, etc.).
  • Excellent troubleshooting and problem-solving skills.

  • Observability & SRE Principles: Experience with monitoring and logging tools (e.g., Prometheus, Splunk, Grafana, OpenTelemetry) and a strong understanding of SRE principles, including observability, error budgeting, and service reliability metrics (SLA, SLO, SLI).
  • CI/CD & Automation: Proficiency with CI/CD, Release Engineering, DevOps practices, and source control (Git). Experience designing and implementing CI/CD pipelines and Infrastructure as Code (Helm, CRD).
  • Programming & Data Systems: Strong programming skills in languages like Java, Python, Go, etc. Experience with various databases (Relational, NoSQL, OLAP) and event-driven architectures (Kafka, RabbitMQ).
  • Reliability & Operations: Experience with on-call, including incident/problem management (PIR, RCA) and a strong sense of ownership for system reliability.
  • Security & Compliance: Understanding of security standards, policies, cryptography, and authentication (OAuth, SAML, SSO). Knowledge of Governance and Compliance.
  • Innovation & Collaboration: Passion for designing reliable systems, advocating for automation, and a desire to collaborate effectively. Experience leveraging ML/GenAI for operational efficiency is a plus.
  • Certification: Cybersecurity certification will be an added advantage.
  • Education: Bachelor's or Master's degree in Computer Science or equivalent practical experience.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 90733111
  • Position Id: 75aeff85d87a76aec5ec26f42ced6701
