Job Description:
Keep our AWS platforms and customer-facing apps available, observable, recoverable, secure, and cost‑sensible. Make the runbook path the easiest path, so on-call personnel stay calm and releases feel routine, in a good way.
Scope of the role
AWS operations: EC2, EKS, RDS, ALB/CloudFront, IAM/OIDC, VPC/TGW/SGs, patching, and hygiene.
Application support: release readiness, runbooks, post-deploy smoke checks, performance baselines, and clean rollback paths.
Visibility: dashboards, logs, metrics, traces, synthetics, error budgets, and alert health.
Backup & DR: policies, schedules, retention, cross-region copies, restore testing, and DR runbooks (RPO/RTO owned and measured).
Incident leadership: run Sev‑1/2 bridges, keep comms clear, and land post‑mortems with actions that actually close.
Cost hygiene: tagging, right-sizing, Savings Plan/Reserved Instance (SP/RI) coverage, lifecycle cleanups (EBS/EIP/AMIs).
Team enablement: guardrails, golden runbooks, and small automations that remove toil.
Day‑to‑day (what this looks like)
Triage overnight alerts and hot issues, set priorities, and make sure owners are clear.
Keep dashboards honest; fix flapping or missing alerts before they wake people up.
Check backups and recent restore points; open tickets for any gaps and track to done.
Unblock releases; verify smoke checks; keep environments tidy and predictable.
Lead or delegate break/fix; leave no lingering “mystery” incidents.
Write down what we learned in the runbook so the next person can fix it faster.
Weekly rhythm
Ops review: incidents, alerts, deploys, costs, capacity, and backup status in one short readout.
Observability tune‑up: delete noise, add the missing signal, and test a synthetic from the edge.
Backup/DR: run a small restore test and record RPO/RTO evidence.
Patch and change review: what shipped, what rolled back, why.
Monthly outcomes
Share availability/SLOs, MTTR, change failure rate, observability coverage, backup compliance, and costs in plain English.
Close the top recurring issues (noisy alerts, flaky deploys).
Refresh the most‑used runbooks; validate DR for one critical workload (tabletop or live restore).
Core responsibilities
Own production readiness and stability for assigned AWS accounts and apps.
Lead incidents and land post‑mortems; make the fixes stick.
Keep monitoring/logging/tracing standards real; enforce SLOs and error budgets.
Own backup strategy end-to-end, including monthly restore tests and DR docs.
Keep access least‑privileged and auditable; rotate secrets and certs on time.
Drive cost posture and mentor the team; make on-call humane.
What “good” looks like
Visibility: one clear dashboard per service, clean alert routing, low false positives.
Backups: 100% of backup jobs green (or successfully retried), documented RPO/RTO, and monthly restore tests that pass.
Reliability: MTTR trending down; most issues solved by the first responder with a runbook.
Change: predictable releases with smoke and rollback; fewer failed changes month over month.
Cost: spend flat or down relative to growth; tagging coverage at or above 95%.
Minimum Experience Required
8–10+ years in cloud/app operations with strong AWS hands-on experience.
Comfortable leading incidents, shaping dashboards and alerts, and automating the boring bits (Terraform, Ansible, Python).
Experience running backups/DR in AWS and proving it with real restore tests.
Cloud networking experience (for example, VPC, Transit Gateway, and security groups).
Preferred Experience
AWS Certified Solutions Architect certification
Any professional networking certifications
ITIL Certification