Job Details
This is a remote position.
Our client is one of the fastest-growing tech companies in history! Supporting this rapid growth requires top talent, which could be you! We are looking for a full-time Program Manager to join their platform engineering team. In this role, you'll work collaboratively with engineering teams in a fast-paced environment to manage and execute the migration of services and capacity from on-premises infrastructure to the cloud. You'll ensure services are prepared for failovers and new features are ready for deployment, and you'll plan and execute failover drills to ensure service resilience and readiness. You will also focus on the operational health of the organization, including incident tracking, metrics, code coverage, service SLAs, on-call health, running incident reviews, tracking action items, and other tasks related to cloud and infrastructure reliability. This is a full-time, US-remote position working Pacific Time hours. If you are in the SF Bay Area, you can optionally work from the San Francisco office part of the time.
Responsibilities
Flexibility, Proactivity, and Ad Hoc Work
Objective: Stay flexible and proactive to accommodate fast-moving, shifting program and project priorities.
Expectations:
- Adapt to changing priorities and reallocate resources as necessary to meet evolving project demands.
- Remain available to provide additional services on an ad hoc basis as needed, ensuring timely and effective response to urgent or unplanned requirements.
- Be proactive and understand the overall project milestones in order to work through issues in a timely manner and anticipate next steps to expedite progress.
Service & Capacity Migration Management
Objective: Manage and execute the migration of services and capacity from on-premises infrastructure to the cloud.
Responsibilities:
- Creation of detailed week-over-week project plans, encompassing the various teams and workstreams necessary to execute an on-time migration.
- Implementation of tracking mechanisms for progress via JIRA task management, including JIRA dashboards.
- Organization and running of status meetings to facilitate cross-functional team interaction and progress updates.
- Proactive collection, tracking, and closure of risks/blockers to ensure successful on-time program execution.
- Utilization and analysis of data to inform decisions and track metrics, key results, health, incidents, and MTTR.
- Communication of status through various forums such as email, Slack, JIRA, Confluence, and dashboard tools (Google Looker Studio, Tableau, Grafana, etc.), tailored to the audience for clear, concise, and consistent progress reporting.
- Provision of additional ad hoc services as needed and as time allows.
Expectations:
- Coordinate with engineering teams to facilitate the migration of millions of cores across stateful and stateless services.
- Monitor migration progress, identify potential issues, and implement solutions to mitigate risks.
- Ensure all migrated services meet performance, reliability, and security standards.
- Provide regular status updates to stakeholders, highlighting key milestones, risks, and mitigation strategies.
Failover Drill Execution & Issue Follow-up
Objective: Plan and execute failover drills to ensure service resilience and readiness.
Expectations:
- Organize and lead regional or zonal failover drills, including scenarios for all-active failures, tier-based failovers, and disaster recovery.
- Collaborate with engineering teams to define and document failover procedures and protocols.
- Identify and track issues arising from failover drills, working with relevant teams to resolve them promptly.
- Evaluate the effectiveness of failover drills and recommend improvements to processes and procedures.
- Maintain comprehensive records of all failover activities and outcomes for auditing and continuous improvement purposes.
- Provide additional ad hoc services as needed and as time allows.
Drill Preparedness & Service Feature Readiness Burn-down Tracking
Objective: Ensure services are prepared for failovers and new features are ready for deployment.
Expectations:
- Track the replacement of custom service logic (e.g., retries, rate limits, context propagation) with centrally managed libraries.
- Coordinate with hundreds of teams managing thousands of services to ensure readiness and compliance with new feature requirements.
- Develop and maintain burn-down charts to visualize progress towards service feature readiness.
- Identify and address blockers preventing teams from achieving readiness milestones.
- Communicate readiness status to stakeholders, providing clear and actionable insights.
- Provide additional ad hoc services as needed and as time allows.
Reliability & Incident Management
Objective: Focus on the operational health of the organization, including incident tracking, metrics, code coverage, service SLAs, on-call health, running incident reviews, tracking action items, and other tasks related to cloud and infrastructure reliability.
Expectations:
- Track and manage incidents, ensuring timely resolution and follow-up on action items.
- Develop and monitor metrics to assess operational health, such as service SLAs, code coverage, and on-call health.
- Conduct regular incident reviews, identifying root causes and implementing corrective actions.
- Collaborate with engineering teams to ensure reliability best practices are followed.
- Communicate reliability status and incident reports to stakeholders, providing actionable insights and recommendations.
- Provide additional ad hoc services as needed and as time allows.
Requirements
Skills and Qualifications
- Strong technical aptitude with experience in platform, production, or site reliability engineering.
- Proven ability to manage complex cross-team and cross-functional initiatives.
- Excellent written and verbal communication skills.
- Experience managing cloud service migrations or reliability initiatives.
- Ability to run status meetings, communicate, and track issues/risks/blockers using tools like JIRA.
- Strong ability to utilize and analyze data effectively, with experience in metrics, key results, health, incidents, MTTR, etc.
- Experience with dashboard tools such as Google Looker Studio, Tableau, and Grafana, and the ability to build useful dashboards for progress tracking and visibility.
- Proficiency in productivity tools such as Slack, Google Sheets/Docs, etc.
- Ability to determine useful metrics and KPIs to measure program performance.
Deliverables
- Detailed migration plans and timelines.
- Comprehensive documentation of procedures and outcomes.
- Issue tracking reports and resolution logs.
- Burn-down charts and readiness tracking reports.
- Incident tracking reports and resolution logs.
- Metrics and KPI dashboards.
- Regular status updates and communication with stakeholders.