The Sr. Staff Software Engineer - Platform & Reliability will be part of the new Product Engineering team tasked with designing and building the next generation of Agentic AI-powered products for. Acting as the Technical Lead and Primary Architect, you will be a hands-on leader responsible for the team’s overall delivery of the runtime environment and automation for AI services and agents. You will lead a small squad by decomposing complex platform requirements (such as AI-specific CI/CD, agent observability, and automated scaling) into actionable tasks while remaining deeply embedded in the codebase.
Key Responsibilities
● Technical Lead & Execution: Lead the technical delivery of the Agentic Platform by translating high-level infrastructure roadmaps into actionable development tasks. You will own task breakdown for your squad, ensuring high-quality output through technical mentorship and rigorous architectural oversight.
● Automated Agent Delivery - CI/CD: Architect and implement high-velocity CI/CD pipelines specifically designed for the lifecycle of AI Agents and services, including automated model evaluation and blue-green deployments for agentic workflows on Google Cloud Platform.
● Cloud Infrastructure Engineering: Lead the design and implementation of our cloud-native infrastructure on Google Cloud Platform using Terraform and Kubernetes (GKE). You will own the core runtime environment where autonomous agents and transactional microservices coexist.
● Agentic Observability & SRE: Apply SRE principles to build a specialized monitoring and alerting stack for AI agents. You will implement tracing for agent "reasoning loops" and ensure the reliability of the underlying Vector and Graph data stores.
● AI-Native SDLC Leadership: Actively use coding agents to plan, generate, and refactor platform code and Infrastructure as Code (IaC), maintaining high velocity while ensuring code quality.
● Scale & Performance: Monitor and optimize the performance and cost-effectiveness of AI workloads, ensuring our platform can handle high-frequency agent calls and multi-modal data processing.
● Security & Governance: Own the implementation of secure runtime boundaries, ensuring that both human users and AI agents operate within strict, audited permission sets.
Qualifications
● Experience: 10+ years of Software or Platform Engineering experience, with a background as a hands-on engineer who has successfully led technical squads.
● Technical Stack: Expert-level command of Google Cloud Platform (GKE, Vertex AI), Terraform, Kubernetes, and Python.
● Product AI Platform: Proven track record of designing and shipping production platforms for AI/LLM workloads, including specialized CI/CD and observability for agentic architectures.
● Reliability Mindset: Strong command of SRE principles, including experience with SLOs, error budgets, and troubleshooting complex distributed systems.
● Cloud Infrastructure: Experience working with cloud platforms (Google Cloud Platform, AWS) and deploying secure, scalable containerized services.
● Coding Agents: Demonstrated proficiency in using coding agents to accelerate the SDLC and to plan and implement complex engineering tasks.