Site Reliability Engineering - Program Manager

  • Sunnyvale, CA
  • Posted 25 days ago | Updated 25 days ago

Overview

On Site
$40 - $50
Contract - W2
Contract - 24 Month(s)

Skills

Hadoop
agile
sprint

Job Details

Job Description:

As an SRE Engineering Program Manager (EPM), you will be responsible for driving the strategic direction and execution of our Site Reliability Engineering initiatives. You will collaborate closely with cross-functional teams to ensure the reliability, scalability, and performance of our systems, with a focus on delivering exceptional service to our customers. This role requires a combination of technical expertise, project management skills, and strong leadership abilities.

Key Responsibilities:

  • Lead and manage projects related to improving site reliability, scalability, and performance, from initiation to completion.
  • Collaborate with engineering, operations, product management, and other stakeholders to define project goals, priorities, and timelines.
  • Identify potential risks to site reliability and develop strategies to mitigate them, including incident response planning and capacity management.
  • Define and track key performance indicators (KPIs) related to site reliability, and generate regular reports to communicate performance metrics to stakeholders.
  • Coordinate incident response activities during system outages or service disruptions, including postmortem analysis and remediation efforts.
  • Manage relationships with external vendors and service providers, including negotiating contracts and ensuring compliance with service level agreements (SLAs).
  • Foster a culture of continuous improvement within the SRE team and across the organization, encouraging experimentation, innovation, and knowledge sharing.
  • Support the professional growth and development of SRE team members through mentorship, training, and performance feedback.

Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or related field; advanced degree preferred.
  • years of experience in Site Reliability Engineering or related field, with a proven track record of driving improvements in system reliability, scalability, and performance.
  • Strong project management skills, with experience leading cross-functional teams and managing complex projects from initiation to completion.
  • Deep understanding of cloud computing technologies, distributed systems, and software development methodologies.
  • Excellent communication skills, with the ability to effectively collaborate with technical and non-technical stakeholders at all levels of the organization.
  • Strong analytical and problem-solving abilities, with a focus on data-driven decision-making and continuous improvement.
  • Experience with incident management, postmortem analysis, and implementing best practices for reliability and resilience.
  • Certifications such as Certified Kubernetes Administrator (CKA) or Google Cloud Professional Cloud Architect (PCA) are a plus.

The SRE team maintains a Hadoop cluster and provides 24x7 support.

The team is spread across multiple locations and has sub-teams for tooling and automation and for cluster upgrades. The EPM for this team will support and coordinate upgrades, plan downtime and releases, manage sprints on the Agile board, and manage communication. Very good written and verbal communication skills are required, along with a deep understanding of Hadoop infrastructure, cloud infrastructure, Hadoop components, and software and infrastructure upgrades.

Experience with Apple tools and technologies is a plus. The EPM will negotiate the sequencing and timing of delivery with technology leadership, product, and engineering teams, and will develop and implement communication strategies for advising our organization of critical updates, their timing, their impacts, and their success or failure. The role is expected to work from the office in close coordination with the Apple technical team based in the Bay Area (PST time zone).