Staff Software Engineer - Platform & Reliability

Remote • Posted 5 hours ago • Updated 5 hours ago

Full Time

No Travel Required

$140,000 - $160,000/yr

Job Details

Skills

Cloud Computing
Google Cloud Platform
Kubernetes
Python
Continuous Delivery
Continuous Integration
Artificial Intelligence
Leadership
Mentorship
Microservices
Product Engineering
Vertex
Terraform
Workflow
GCP
Amazon Web Services
CI/CD pipelines
CI/CD

Summary

The Sr. Staff Software Engineer - Platform & Reliability will be part of the new Product Engineering team tasked with designing and building the next generation of Agentic AI-powered products for. Acting as the Technical Lead and Primary Architect, you will be a hands-on leader responsible for the team’s overall delivery of the runtime environment and automation for AI services and agents. You will lead a small squad by decomposing complex platform requirements (such as AI-specific CI/CD, agent observability, and automated scaling) into actionable tasks while remaining deeply embedded in the codebase.

Key Responsibilities

● Technical Lead & Execution: Lead the technical delivery of the Agentic Platform by translating high-level infrastructure roadmaps into actionable development tasks. You will own task breakdown for your squad, ensuring high-quality output through technical mentorship and rigorous architectural oversight.

● Automated Agent Delivery - CI/CD: Architect and implement high-velocity CI/CD pipelines specifically designed for the lifecycle of AI Agents and services, including automated model evaluation and blue-green deployments for agentic workflows on Google Cloud Platform.

● Cloud Infrastructure Engineering: Lead the design and implementation of our cloud-native infrastructure on Google Cloud Platform using Terraform and Kubernetes (GKE). You will own the core runtime environment where autonomous agents and transactional microservices coexist.

● Agentic Observability & SRE: Apply SRE principles to build a specialized monitoring and alerting stack for AI agents. You will implement tracing for agent "reasoning loops" and ensure the reliability of the underlying Vector and Graph data stores.

● AI-Native SDLC Leadership: Actively use coding agents to plan, generate, and refactor platform code and Infrastructure as Code (IaC), maintaining high velocity while ensuring code quality.

● Scale & Performance: Monitor and optimize the performance and cost-effectiveness of AI workloads, ensuring our platform can handle high-frequency agent calls and multi-modal data processing.

● Security & Governance: Own the implementation of secure runtime boundaries, ensuring that both human users and AI agents operate within strict, audited permission sets.

Experience: 10+ years of Software or Platform Engineering experience, with a background as a hands-on engineer who has successfully led technical squads.

Technical Stack: Expert-level command of Google Cloud Platform (GKE, Vertex AI), Terraform, Kubernetes, and Python.

Product AI Platform: Proven track record of designing and shipping production platforms for AI/LLM workloads, including specialized CI/CD and observability for agentic architectures.

Reliability Mindset: Strong command of SRE principles, including experience with SLOs, error budgets, and troubleshooting complex distributed systems.

Cloud Infrastructure: Experience working with cloud platforms (Google Cloud Platform, AWS) and deploying secure, scalable containerized services.

Coding Agents: Demonstrated proficiency in using coding agents to accelerate the SDLC and to plan and implement complex engineering tasks.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10428795
Position Id: 8946428