We are seeking a highly skilled Technical Project Manager with strong expertise in Site Reliability Engineering (SRE), Automation, Cloud Operations, and AIOps. The ideal candidate will combine strong technical depth with outstanding project leadership, ensuring high availability, reliability, and automation-driven efficiency across large-scale distributed systems.
Key Responsibilities:
Program & Project Leadership
Lead end-to-end delivery of SRE and Operations modernization projects across cloud, network, and platform environments.
Manage cross-functional engineering teams, vendors, and partners to deliver high-quality solutions on schedule.
Drive operational transformation through automation, observability, and AI-driven insights.
Develop detailed project plans, milestones, risk logs and communication plans for technical initiatives.
SRE & Operations Management
Oversee reliability engineering initiatives: incident management, problem management, capacity planning and performance optimization.
Ensure SLO/SLI/SLA compliance across end-to-end infrastructure and customer-facing platforms.
Implement best-in-class practices for monitoring, alerting, and service resilience.
Automation & AIOps
Lead automation programs using scripting, orchestration and Infrastructure-as-Code (IaC) techniques.
Champion AIOps solutions for predictive analytics, smart alerting, anomaly detection and automated remediation.
Partner with engineering teams to build self-healing capabilities and reduce MTTR.
Stakeholder Management
Serve as the primary interface between engineering, operations, product and leadership teams.
Present program updates, operational metrics, and business impact to senior executives.
Ensure stakeholder alignment on priorities, roadmaps, and technical dependencies.
Compliance, Governance & Telecom Standards
Oversee governance, security, and compliance in line with industry standards and regulatory requirements.
Drive continuous improvement through retrospectives, root-cause analysis and process enhancements.
Required Skills & Qualifications
12+ years of experience in technical project/program management, with at least 4 to 6 years in SRE/DevOps/Operations.
Strong understanding of cloud platforms (AWS, Azure, Google Cloud Platform) and containerized environments (Kubernetes, Docker).
Hands-on familiarity with automation tools:
Terraform/Ansible/Jenkins
Python/Go/Bash scripting
CI/CD pipelines
Deep knowledge of:
Observability stacks (New Relic, Grafana, ELK, Splunk, Catchpoint)
Incident & change management systems (ServiceNow, Jira)
Proven experience deploying or managing AIOps platforms
Strong analytical and problem-solving skills, with the ability to lead the response to high-impact incidents.
Excellent communication, leadership, and vendor management capabilities.