Sr DevOps Engineer

Santa Clara, CA, US • Posted 1 day ago • Updated 11 minutes ago
Full Time
On-site
USD 45.00 - 50.00 per hour

Job Details

Skills

  • FOCUS
  • Development Testing
  • Production Support
  • Software Packaging
  • Cloud Computing
  • ProVision
  • CPU
  • Storage
  • Healthcare Information Technology
  • Inspection
  • Testing
  • Acceptance Testing
  • RACI
  • Embedded Systems
  • Agile
  • Incident Management
  • Computer Science
  • Information Systems
  • Lifecycle Management
  • Amazon Web Services
  • Managed Services
  • PostgreSQL
  • Terraform
  • Grafana
  • Linux
  • Computer Networking
  • Scripting
  • Python
  • Bash
  • Recovery
  • Workflow
  • Network
  • Orchestration
  • SAFE
  • Promotions
  • Database
  • Operational Risk
  • Continuous Integration
  • Continuous Delivery
  • Release Management
  • Remote Desktop Services
  • Amazon RDS
  • MongoDB
  • Redis
  • Caching
  • Debugging
  • Management
  • eXist
  • Writing
  • Kubernetes
  • DevOps
  • Software Engineering
  • Collaboration

Summary

Title: Sr DevOps Engineer
Location: Santa Clara, CA



Job Summary

We are seeking a highly capable Senior DevOps Engineer / Platform Engineer to build, operationalize, and scale the infrastructure and deployment foundation for a strategic site-builder / network automation platform. This role will focus on creating reliable CI/CD pipelines, production-grade Kubernetes deployment patterns, managed database services, observability, environment reproducibility, secrets management, and Infrastructure as Code across development, testing, staging, and production environments.

This engineer will play a critical role in moving the platform from an early-stage, partially manual operating model into a repeatable, supportable, and production-ready DevOps model. The environment includes Kubernetes-hosted services, AWS managed services, workflow orchestration with Temporal, integration with Nautobot, Argo-based promotion flows, and the supporting tooling required for debugging, snapshotting, local development, and production support.

This is a hands-on engineering role for someone who can design the right platform patterns, implement them directly, and establish a durable operating model between development and DevOps teams.

Key Responsibilities

Platform Deployment & CI/CD
Design, implement, and maintain CI/CD pipelines for testing, staging, and production environments.
Build and maintain deployment workflows that support safe and seamless promotion across environments.
Improve and maintain Argo-based deployment workflows to enable controlled release progression from test to staging to production.
Establish baseline deployment mechanisms for the site-builder application and related services.
Standardize Kubernetes application packaging and deployment patterns, with a strong preference toward Helm-based lifecycle management for complex services and third-party components.
Migrate existing deployments to Helm charts where appropriate.

Kubernetes & Runtime Platform Engineering
Support the deployment and ongoing operation of services running in Kubernetes.
Improve runtime reliability, resiliency, and troubleshooting for distributed services operating inside shared Kubernetes clusters.
Investigate and harden service-to-service connectivity patterns, especially for workflow components such as workers connecting to the Temporal engine.
Partner with development teams to define production-grade runtime requirements, resource sizing, restart policies, and platform support boundaries.

Infrastructure as Code & Cloud Services
Design and implement fully declarative Infrastructure as Code for managed cloud services, especially in AWS.
Provision and maintain managed data services such as RDS/PostgreSQL and MongoDB-compatible document databases across all environments.
Eliminate manual infrastructure setup where possible and replace it with reproducible, version-controlled deployment patterns.
Prepare the platform for future scale across multiple environments and regions through repeatable IaC and GitOps-aligned practices.

Data Services, Snapshots & Developer Enablement
Set up and maintain RDS, MongoDB, Redis/cache services, and related dependencies for all environments.
Build tooling and operational processes for:
  • production and staging database snapshots,
  • restoring snapshots into development environments,
  • enabling local debugging and development from realistic data states.
Support creation of local and development environments, including Minikube-based environment-as-code approaches that mirror production behavior as closely as practical.
Improve platform reproducibility so engineers can quickly stand up close-to-production development environments.

Workflow Orchestration & Temporal Support
Lead the setup, deployment, and operational support of Temporal for workflow orchestration.
Support production operations for Temporal, including troubleshooting performance issues, restarts, scaling concerns, and resource shortages.
Establish maintainable deployment patterns for Temporal using supported packaging and lifecycle management approaches.
Partner with engineering teams to ensure workflow platform reliability and upgradeability over time.

Observability, Reliability & Incident Readiness
Design and maintain observability across testing, staging, and production using tools such as Prometheus and Grafana.
Define and implement monitoring for:
  • service and cluster utilization,
  • CPU, memory, storage,
  • IOPS / throughput metrics,
  • database connections and session counts,
  • cache hit / miss / coverage metrics,
  • RDS and MongoDB utilization,
  • service health and alerting.
Build and maintain logging, tracing, and correlation capabilities, separated appropriately by environment.
Create tools to support deep debugging and operational inspection, including raw database reads, cleanup of unused volumes, and emergency cache invalidation.

Security, Access & Secrets Management
Maintain secrets management processes across environments.
Build tooling for short-lived internal token generation and long-lived secret rotation.
Support secure access from deployed services to active production devices and southbound systems.
Help establish credential management patterns for southbound integrations and device-facing access.
Partner with related teams to define safe operational limits and controls for service integrations.

External Integrations & Platform Support
Support integration patterns with Nautobot and help define safe client-side behaviors such as rate limiting, retry/backoff, and service protection mechanisms.
Partner with application teams to understand and mitigate integration issues such as rate limiting or request rejection.
Support staging and testing by enabling virtual device environments where needed.
Contribute to end-to-end acceptance testing and production readiness activities.

Operating Model & Cross-Functional Execution
Help define an effective operating model between Development and DevOps, whether via RACI, embedded Agile delivery, or a hybrid support model.
Support deployment readiness, incident management, environment ownership boundaries, and lifecycle responsibilities.
Work closely with software engineering, infrastructure, application owners, and partner teams to drive production readiness and sustainable operations.

Required Qualifications
Bachelor's degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.
7+ years of experience in DevOps, Platform Engineering, SRE, or Infrastructure Engineering roles.
Strong hands-on experience with Kubernetes in production environments.
Strong experience building and maintaining CI/CD pipelines for multi-environment software delivery.
Strong experience with Argo CD, GitOps workflows, or equivalent deployment tooling.
Strong experience with Helm and Kubernetes package/deployment lifecycle management.
Experience with AWS managed services, especially RDS/PostgreSQL, document databases, and related infrastructure.
Strong experience with Infrastructure as Code, such as Terraform and/or similar declarative tooling.
Experience with Prometheus, Grafana, and modern observability practices.
Experience with Redis/cache services, secrets management, and operational debugging.
Strong Linux, networking, and distributed systems troubleshooting skills.
Strong scripting and automation skills in one or more languages such as Python, Bash, or Go.
Proven ability to work cross-functionally and operate effectively in environments where ownership boundaries are still evolving.

Preferred Qualifications
Experience with Temporal deployment and production operations.
Experience supporting developer platforms with local environment reproducibility using Minikube, kind, or similar tools.
Experience with MongoDB / DocumentDB operations and restore workflows.
Experience integrating with Nautobot, NetBox, or similar infrastructure source-of-truth platforms.
Experience operating in shared-cluster environments with multi-team tenancy and constrained access models.
Experience designing platform patterns for internal products that must scale across regions or multiple deployment footprints.
Familiarity with network automation or infrastructure orchestration platforms is a plus.

What Success Looks Like
CI/CD pipelines are reliable, repeatable, and support safe promotion across all environments.
Kubernetes deployments are standardized, maintainable, and production ready.
Managed infrastructure is defined as code rather than through manual setup.
Temporal, databases, cache layers, and observability tooling are stable and supportable.
Development teams can reproduce realistic environments locally for faster debugging and delivery.
Secrets, access patterns, and operational tooling are mature enough to support production-scale operations.
The DevOps operating model is clearly defined and enables faster deployments with less operational risk.

Scope Notes

In scope
CI/CD and deployment foundations
Kubernetes packaging and release management
RDS, MongoDB, Redis/cache services
Temporal platform setup and operational support
Observability, alerting, and debugging tooling
Secrets management and access enablement
Infrastructure as Code and environment reproducibility
DevOps / Development operational model definition

Candidate Profile

The ideal candidate is a builder-operator: someone who can establish engineering discipline where manual patterns currently exist, create durable automation for platform operations, and raise the overall maturity of the product's deployment and runtime ecosystem. This person should be equally comfortable discussing deployment architecture, writing IaC and Helm code, troubleshooting Kubernetes runtime issues, and defining how DevOps and software engineering teams work together over the full product lifecycle.
  • Dice Id: 90933573
  • Position Id: 42ed474b09168f46c558fa17ed8cea05