REMOTE - Director Platform Engineering and Reliability

Remote • Posted 6 hours ago • Updated 6 hours ago
Full Time
Occasional Travel Required
Remote
Depends on Experience

Job Details

Skills

  • site reliability
  • sre
  • governance
  • reliability
  • platform
  • devops
  • iac
  • rca
  • slo
  • kpi
  • dora
  • soc2
  • aws
  • docker

Summary

100% REMOTE - Director Platform Engineering and Reliability - Direct Hire Full Time

Reliability Governance, Incident Management & Root Cause Accountability

Establish a reliability operating model that makes risk visible, decisions repeatable and improvements durable. You will own the Root Cause Analysis (RCA) process for all production incidents, ensuring timely, thorough and blameless reviews that identify systemic contributors rather than surface-level causes, and driving corrective actions to completion.

  • Define and operationalize SLIs, SLOs and error budgets for critical services, and ensure they influence prioritization and release decisions.
  • Normalize incident response routines (roles, severity definitions, escalation, communications) that reinforce trust during high-pressure events.
  • Drive durable remediation (code, architecture, automation, process) and verify outcomes to reduce recurrence over time.

Observability & Signal Quality

Evolve observability so the platform produces actionable signal with minimal noise. The goal is earlier detection and clearer diagnosis, so intervention happens before customers experience impact.

  • Strengthen logging, metrics and tracing standards to improve troubleshooting speed and confidence.
  • Improve alert quality and reduce fatigue by tuning thresholds, routing, and ownership.
  • Use observability improvements to measurably reduce MTTD (mean time to detection) and improve MTTR (mean time to recovery).

Engineering Effectiveness & Delivery Foundations

Improve delivery confidence and predictability by instrumenting effectiveness metrics and strengthening the delivery pipeline. We want teams shipping more frequently with lower risk.

  • Instrument and operationalize DORA metrics (deployment frequency, change failure rate, lead time for changes, MTTR) and use the data to target bottlenecks.
  • Evolve CI/CD patterns, rollout safeguards and rollback strategies to increase deployment frequency while lowering change failure rate.
  • Raise engineering confidence through stronger automation discipline (including test automation and release guardrails) as the system matures.

Platform Engineering & Cloud Architecture

Advance platform foundations so product teams can build safely and consistently with less cognitive load. This includes cloud architecture governance, Infrastructure as Code, and containerization/orchestration practices appropriate to system scale.

  • Advance Infrastructure as Code standards (Terraform) and AWS architecture patterns that support scale, performance and cost visibility (with potential future Azure expansion).
  • Strengthen containerization and orchestration practices using technologies such as Docker and Kubernetes where appropriate.
  • Establish paved-road platform patterns that make the secure, reliable path the easiest path for product teams.

Our core stack includes AWS (future expansion into Azure), MS SQL Server on AWS RDS, Terraform, GitHub, Jira, Confluence, and development primarily in Visual Studio / Visual Studio Code environments.

The Profile We’re Looking For

You bring strong technical depth and pragmatic leadership. You are comfortable being hands-on early, and you can scale standards, systems and teams over time.

  • 8–12+ years in software engineering, DevOps, SRE or platform roles within a B2B SaaS environment; demonstrated success improving reliability and delivery predictability in growth-stage companies.
  • Deep familiarity with AWS-based production systems, Infrastructure as Code, and containerized environments (Docker, Kubernetes).
  • Experience implementing SLOs/error budgets, operating incident management and RCA processes, and driving systemic reliability improvements.
  • Experience using operational and delivery metrics (DORA and related KPIs) to guide prioritization and measurable improvement.
  • Experience supporting SOC 2, PCI DSS or comparable compliance initiatives is valued.

  • Dice Id: CONTEMP
  • Position Id: 8902102