Site Reliability Engineer, AiDP Production Engineering

Austin, TX, US • Posted 4 days ago • Updated 1 day ago
Full Time
On-site

Job Details

Skills

  • Real-time
  • Analytical Skill
  • Sales Operations
  • Finance
  • Marketing
  • Open Source
  • Artificial Intelligence
  • Production Engineering
  • Architectural Design
  • Data Processing
  • Data Analysis
  • Management
  • Business Operations
  • Cloud Computing
  • Extract, Transform, Load (ETL)
  • Apache Spark
  • Apache Flink
  • Messaging
  • Apache Kafka
  • IaaS
  • Amazon Web Services
  • Google Cloud Platform
  • Google Cloud
  • Kubernetes
  • Database
  • Snow Flake Schema
  • Apache Cassandra
  • SAP HANA
  • Python
  • Java
  • Computer Science
  • Systems Design
  • Data Structure
  • Incident Management
  • Grafana
  • Performance Analysis
  • FOCUS
  • Conflict Resolution
  • Problem Solving
  • Critical Thinking
  • Communication
  • Root Cause Analysis
  • Reliability Engineering
  • Generative Artificial Intelligence (AI)
  • Data Visualization
  • Tableau
  • Business Objects

Summary

The Production Engineering team within the AI and Data Platform (AiDP) organization manages a wide array of real-time, near real-time, and batch analytical solutions. These platforms are integral to core business functions across Apple, including sales, operations, finance, AppleCare, marketing, and services, and are instrumental in driving critical, data-driven decisions. To build these solutions, we leverage a combination of proprietary and leading open-source technologies such as Kafka, Spark, Iceberg, and Airflow. A key part of our mission is to enable AI-centric automations that enhance the overall efficiency and intelligence of the platform. We are looking for passionate engineers who thrive on solving complex infrastructure challenges at scale, both on-premises and in the cloud. If you are dedicated to optimizing scalable, maintainable, and user-friendly systems, you will find compelling opportunities to make a significant impact at AiDP.

The Site Reliability Engineer (SRE) role within AiDP Production Engineering is a dynamic position that blends strategic architectural design with hands-on technical execution. As an SRE, you will be responsible for configuring, tuning, and ensuring the resilience of complex, multi-tiered systems to achieve optimal application performance, stability, and availability. Our team manages critical data pipelines and applications across both bare-metal and cloud computing platforms, delivering essential data processing for all of Apple's key business functions. We operate at an immense scale, handling exabytes of data, petabytes of memory, and tens of thousands of jobs to enable predictable and performant data analytics that power features and inform decisions across the company. If you are passionate about designing, building, and running data infrastructure that has a direct and significant impact on Apple's global business operations, this is the ideal opportunity for you.

  • 4+ years of experience in cloud-native services, including ETL frameworks such as Apache Spark and Flink.
  • 4+ years of experience in messaging systems (Kafka) and cloud infrastructure and services (AWS, Google Cloud Platform, Kubernetes).
  • 4+ years of experience in modern and distributed databases such as Snowflake, Cassandra, SingleStore, and SAP HANA.
  • 4+ years of programming experience in Python or Java.
  • BS/MS in computer science or equivalent experience.

  • Solid understanding of system design, data structures, and incident management best practices.
  • Ability to understand complex architectures and comfort working with multiple teams.
  • Experience with observability tools (e.g., Prometheus, Grafana, CloudWatch).
  • Ability to conduct performance analysis and troubleshoot large-scale distributed systems.
  • Highly proactive, with a keen focus on improving the uptime and availability of our mission-critical services.
  • Strong expertise in troubleshooting complex production issues.
  • Excellent problem-solving, critical thinking, and communication skills.
  • Proven ability to resolve incidents, perform root cause analysis, and drive system reliability improvements.
  • Experience using GenAI or automation tools for issue detection, alerting, or remediation.
  • Experience with data visualization tools such as Tableau, Business Objects, and ThoughtSpot.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 90733111
  • Position Id: 4156f553df1f571fcb62c6ecad3b924e