Sr DevOps Engineer

Santa Clara, CA, US • Posted 1 day ago • Updated 11 minutes ago

Full Time

On-site

USD $45.00 - 50.00 per hour

Job Details

Skills

FOCUS
Development Testing
Production Support
Software Packaging
Cloud Computing
ProVision
CPU
Storage
Healthcare Information Technology
Inspection
Testing
Acceptance Testing
RACI
Embedded Systems
Agile
Incident Management
Computer Science
Information Systems
Lifecycle Management
Amazon Web Services
Managed Services
PostgreSQL
Terraform
Grafana
Linux
Computer Networking
Scripting
Python
Bash
Recovery
Workflow
Network
Orchestration
SAFE
Promotions
Database
Operational Risk
Continuous Integration
Continuous Delivery
Release Management
Remote Desktop Services
Amazon RDS
MongoDB
Redis
Caching
Debugging
Management
eXist
Writing
Kubernetes
DevOps
Software Engineering
Collaboration

Summary

Title: Sr DevOps Engineer
Location: Santa Clara, CA

Job Summary

We are seeking a highly capable Senior DevOps Engineer / Platform Engineer to build, operationalize, and scale the infrastructure and deployment foundation for a strategic site-builder / network automation platform. This role will focus on creating reliable CI/CD pipelines, production-grade Kubernetes deployment patterns, managed database services, observability, environment reproducibility, secrets management, and Infrastructure as Code across development, testing, staging, and production environments.

This engineer will play a critical role in moving the platform from an early-stage, partially manual operating model into a repeatable, supportable, and production-ready DevOps model. The environment includes Kubernetes-hosted services, AWS managed services, workflow orchestration with Temporal, integration with Nautobot, Argo-based promotion flows, and the supporting tooling required for debugging, snapshotting, local development, and production support.

This is a hands-on engineering role for someone who can design the right platform patterns, implement them directly, and establish a durable operating model between development and DevOps teams.

Key Responsibilities

Platform Deployment & CI/CD
Design, implement, and maintain CI/CD pipelines for testing, staging, and production environments.
Build and maintain deployment workflows that support safe and seamless promotion across environments.
Improve and maintain Argo-based deployment workflows to enable controlled release progression from test to staging to production.
Establish baseline deployment mechanisms for the site-builder application and related services.
Standardize Kubernetes application packaging and deployment patterns, with a strong preference for Helm-based lifecycle management for complex services and third-party components.
Migrate existing deployments to Helm charts where appropriate.

Kubernetes & Runtime Platform Engineering
Support the deployment and ongoing operation of services running in Kubernetes.
Improve runtime reliability, resiliency, and troubleshooting for distributed services operating inside shared Kubernetes clusters.
Investigate and harden service-to-service connectivity patterns, especially for workflow components such as workers connecting to the Temporal engine.
Partner with development teams to define production-grade runtime requirements, resource sizing, restart policies, and platform support boundaries.

Infrastructure as Code & Cloud Services
Design and implement fully declarative Infrastructure as Code for managed cloud services, especially in AWS.
Provision and maintain managed data services such as RDS/PostgreSQL and MongoDB-compatible document databases across all environments.
Eliminate manual infrastructure setup where possible and replace it with reproducible, version-controlled deployment patterns.
Prepare the platform for future scale across multiple environments and regions through repeatable IaC and GitOps-aligned practices.

Data Services, Snapshots & Developer Enablement
Set up and maintain RDS, MongoDB, Redis/cache services, and related dependencies for all environments.
Build tooling and operational processes for:
production and staging database snapshots,
restoring snapshots into development environments,
enabling local debugging and development from realistic data states.
Support creation of local and development environments, including Minikube-based environment-as-code approaches that mirror production behavior as closely as practical.
Improve platform reproducibility so engineers can quickly stand up close-to-production development environments.

Workflow Orchestration & Temporal Support
Lead the setup, deployment, and operational support of Temporal for workflow orchestration.
Support production operations for Temporal, including troubleshooting performance issues, restarts, scaling concerns, and resource shortages.
Establish maintainable deployment patterns for Temporal using supported packaging and lifecycle management approaches.
Partner with engineering teams to ensure workflow platform reliability and upgradeability over time.

Observability, Reliability & Incident Readiness
Design and maintain observability across testing, staging, and production using tools such as Prometheus and Grafana.
Define and implement monitoring for:
service and cluster utilization,
CPU, memory, storage,
IOPS / throughput metrics,
database connections and session counts,
cache hit / miss / coverage metrics,
RDS and MongoDB utilization,
service health and alerting.
Build and maintain logging, tracing, and correlation capabilities, separated appropriately by environment.
Create tools to support deep debugging and operational inspection, including raw database reads, cleanup of unused volumes, and emergency cache invalidation.

Security, Access & Secrets Management
Maintain secrets management processes across environments.
Build tooling for short-lived internal token generation and long-lived secret rotation.
Support secure access from deployed services to active production devices and southbound systems.
Help establish credential management patterns for southbound integrations and device-facing access.
Partner with related teams to define safe operational limits and controls for service integrations.

External Integrations & Platform Support
Support integration patterns with Nautobot and help define safe client-side behaviors such as rate limiting, retry/backoff, and service protection mechanisms.
Partner with application teams to understand and mitigate integration issues such as rate limiting or request rejection.
Support staging and testing by enabling virtual device environments where needed.
Contribute to end-to-end acceptance testing and production readiness activities.

Operating Model & Cross-Functional Execution
Help define an effective operating model between Development and DevOps, whether via RACI, embedded Agile delivery, or a hybrid support model.
Support deployment readiness, incident management, environment ownership boundaries, and lifecycle responsibilities.
Work closely with software engineering, infrastructure, application owners, and partner teams to drive production readiness and sustainable operations.

Required Qualifications
Bachelor's degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.
7+ years of experience in DevOps, Platform Engineering, SRE, or Infrastructure Engineering roles.
Strong hands-on experience with Kubernetes in production environments.
Strong experience building and maintaining CI/CD pipelines for multi-environment software delivery.
Strong experience with ArgoCD, GitOps workflows, or equivalent deployment tooling.
Strong experience with Helm and Kubernetes package/deployment lifecycle management.
Experience with AWS managed services, especially RDS/PostgreSQL, document databases, and related infrastructure.
Strong experience with Infrastructure as Code, such as Terraform and/or similar declarative tooling.
Experience with Prometheus, Grafana, and modern observability practices.
Experience with Redis/cache services, secrets management, and operational debugging.
Strong Linux, networking, and distributed systems troubleshooting skills.
Strong scripting and automation skills in one or more languages such as Python, Bash, or Go.
Proven ability to work cross-functionally and operate effectively in environments where ownership boundaries are still evolving.

Preferred Qualifications
Experience with Temporal deployment and production operations.
Experience supporting developer platforms with local environment reproducibility using Minikube, kind, or similar tools.
Experience with MongoDB / DocumentDB operations and restore workflows.
Experience integrating with Nautobot, NetBox, or similar infrastructure source-of-truth platforms.
Experience operating in shared-cluster environments with multi-team tenancy and constrained access models.
Experience designing platform patterns for internal products that must scale across regions or multiple deployment footprints.
Familiarity with network automation or infrastructure orchestration platforms is a plus.

What Success Looks Like
CI/CD pipelines are reliable, repeatable, and support safe promotion across all environments.
Kubernetes deployments are standardized, maintainable, and production ready.
Managed infrastructure is defined as code rather than through manual setup.
Temporal, databases, cache layers, and observability tooling are stable and supportable.
Development teams can reproduce realistic environments locally for faster debugging and delivery.
Secrets, access patterns, and operational tooling are mature enough to support production-scale operations.
The DevOps operating model is clearly defined and enables faster deployments with less operational risk.

Scope Notes

In scope
CI/CD and deployment foundations
Kubernetes packaging and release management
RDS, MongoDB, Redis/cache services
Temporal platform setup and operational support
Observability, alerting, and debugging tooling
Secrets management and access enablement
Infrastructure as Code and environment reproducibility
DevOps / Development operational model definition

Candidate Profile

The ideal candidate is a builder-operator: someone who can establish engineering discipline where manual patterns currently exist, create durable automation for platform operations, and raise the overall maturity of the product's deployment and runtime ecosystem. This person should be equally comfortable discussing deployment architecture, writing IaC and Helm code, troubleshooting Kubernetes runtime issues, and defining how DevOps and software engineering teams work together over the full product lifecycle.

Dice Id: 90933573
Position Id: 42ed474b09168f46c558fa17ed8cea05