100% REMOTE - Director, Platform Engineering and Reliability - Direct Hire, Full Time
Reliability Governance, Incident Management & Root Cause Accountability
Establish a reliability operating model that makes risk visible, decisions repeatable and improvements durable. You will own the Root Cause Analysis (RCA) process for all production incidents. That means ensuring timely, thorough and blameless reviews that identify systemic contributors rather than surface-level causes, and driving the resulting corrective actions to completion.
· Define and operationalize SLIs, SLOs and error budgets for critical services, and ensure they influence prioritization and release decisions.
· Standardize incident response routines (roles, severity definitions, escalation, communications) that reinforce trust during high-pressure events.
· Drive durable remediation (code, architecture, automation, process) and verify outcomes to reduce recurrence over time.
Observability & Signal Quality
Evolve observability so the platform produces actionable signal with minimal noise. The goal is earlier detection and clearer diagnosis, so intervention happens before customers experience impact.
· Strengthen logging, metrics and tracing standards to improve troubleshooting speed and confidence.
· Improve alert quality and reduce fatigue by tuning thresholds, routing, and ownership.
· Use observability improvements to measurably reduce both MTTD (mean time to detection) and MTTR (mean time to recovery).
Engineering Effectiveness & Delivery Foundations
Improve delivery confidence and predictability by instrumenting effectiveness metrics and strengthening the delivery pipeline. We want teams shipping more frequently with lower risk.
· Instrument and operationalize DORA metrics (deployment frequency, lead time for changes, change failure rate, MTTR) and use the data to target bottlenecks.
· Evolve CI/CD patterns, rollout safeguards and rollback strategies to increase deployment frequency while lowering change failure rate.
· Raise engineering confidence through stronger automation discipline (including test automation and release guardrails) as the system matures.
Platform Engineering & Cloud Architecture
Advance platform foundations so product teams can build safely and consistently with less cognitive load. This includes cloud architecture governance, Infrastructure as Code, and containerization/orchestration practices appropriate to system scale.
· Advance Infrastructure as Code standards (Terraform) and AWS architecture patterns that support scale, performance and cost visibility (with potential future Azure expansion).
· Strengthen containerization and orchestration practices using technologies such as Docker and Kubernetes where appropriate.
· Establish paved-road platform patterns that make the secure, reliable path the easiest path for product teams.
Our core stack includes AWS (future expansion into Azure), MS SQL Server on AWS RDS, Terraform, GitHub, Jira, Confluence, and development primarily in Visual Studio / Visual Studio Code environments.
The Profile We’re Looking For
You bring strong technical depth and pragmatic leadership. You are comfortable being hands-on early, and you can scale standards, systems and teams over time.
· 8–12+ years in software engineering, DevOps, SRE or platform roles within a B2B SaaS environment; demonstrated success improving reliability and delivery predictability in growth-stage companies.
· Deep familiarity with AWS-based production systems, Infrastructure as Code, and containerized environments (Docker, Kubernetes).
· Experience implementing SLOs/error budgets, operating incident management and RCA processes, and driving systemic reliability improvements.
· Experience using operational and delivery metrics (DORA and related KPIs) to guide prioritization and measurable improvement.
· Experience supporting SOC 2, PCI DSS or comparable compliance initiatives is valued.