Job Role: AWS Cloud OPS SRE
Location: New York, NY
Job Description:
Must Have Technical/Functional Skills
The AWS Cloud Operations / Site Reliability Engineer (SRE) is responsible for delivering secure, reliable, and scalable cloud infrastructure. This role covers Infrastructure as a Service, AWS platform release activities, AMI lifecycle management, patching, infrastructure design documentation, Terraform scripting, and maintaining visibility into the application layer and how it functions in production environments. Experience with Harness for DevOps pipelines is a strong plus.
Required Qualifications
- 10+ years of experience in SRE, Cloud Ops, or DevOps, with extensive AWS experience.
- Strong hands-on experience with:
o AWS compute (EC2, ASG, EKS/ECS, Lambda)
o Networking (VPC, Route 53, SG/NACL, ALB/NLB)
o Storage (S3, EBS, EFS)
o Databases (RDS, Aurora, DynamoDB)
- Expertise in AMI pipeline management, image building, and OS-level hardening.
- Solid experience with Terraform or CloudFormation for IaC.
- Demonstrated ability to troubleshoot AWS and application stack issues end-to-end.
AWS Platform Operations & Releases
- Own and execute AWS platform release management across environments, including validation, regression checks, and readiness reviews.
- Operate and evolve AWS core services: VPC, IAM, KMS, Route 53, networking baselines, proxy layers, and organizational guardrails.
Infrastructure as a Service (IaaS) using Terraform
- Build, manage, and scale cloud infrastructure using Terraform as the primary IaC tooling.
- Create reusable Terraform modules covering networking, compute, storage, EKS, and security.
- Ensure IaC follows best practices: versioned, immutable, peer-reviewed, and automated through CI/CD.
Amazon EKS (Kubernetes) Operations
- Deploy, manage, and maintain production-grade Amazon EKS clusters, node groups, and cluster add-ons.
- Implement Kubernetes platform standards for security, networking, namespaces, RBAC, and secrets management.
- Work closely with application teams to ensure workloads run reliably and securely within EKS.
- Optimize cluster scaling, workload scheduling, resource limits, and performance tuning.
AMI Lifecycle & Image Management
- Manage the complete AMI lifecycle: creation, CIS hardening, vulnerability scanning, tagging, publishing, and deprecation.
- Build automated AMI pipelines using image builders, Packer (if applicable), and validation workflows.
- Maintain golden images for EC2 fleets, containers, and hybrid workloads.
VIT (Vulnerability / Integration / Integrity Testing) & Patch Management
- Lead the VIT process, including vulnerability assessments, remediation workflows, compliance tracking, and closure.
- Own OS-level and image patching using AWS Systems Manager (SSM) Patch Manager and automated maintenance windows.
- Generate patch baselines, dashboards, and compliance reports, and ensure measurable SLA adherence.
Observability & Application Layer Insights
- Build and maintain an observability stack with CloudWatch, X-Ray, OpenTelemetry, and log analytics.
- Establish deep visibility into application behavior, dependencies, performance, and error patterns.
- Create golden-signal dashboards covering latency, traffic, errors, and saturation for both infrastructure and applications.
CI/CD & DevOps Automation
- Implement and maintain CI/CD pipelines for infrastructure and application deployments.
- Experience with Harness is an added advantage, particularly its workflows, verification steps, and deployment strategies (canary, blue/green).
- Integrate Terraform, AMI pipelines, EKS updates, and patch automation into CI/CD systems.
Reliability Engineering & Incident Response
- Participate in the on-call rotation; lead incident triage and root cause analysis.
- Build automation and runbooks to reduce operational toil.
- Drive architectural improvements to increase availability, resiliency, and performance.
Documentation & Architecture
- Produce high-quality Infrastructure Design Documents (IDDs), runbooks, DR procedures, release notes, and architectural diagrams.
- Conduct operational readiness reviews, capacity planning, and cost-optimization assessments.