Job Details
Design, develop, and troubleshoot large-scale, distributed, event-driven cloud systems to ensure high availability and performance.
Coordinate and implement infrastructure and software improvements to meet resiliency and scalability goals.
Maintain and enhance infrastructure-as-code and monitoring-as-code to ensure repeatability, traceability, and transparency in automation.
Support on-call rotations, resolve operational issues, and drive long-term fixes to reduce alert fatigue.
Collaborate with development teams to design enterprise-grade solutions and uphold healthy DevSecOps practices, including agile methodologies and CI/CD.
Participate in chaos testing and ongoing learning across the AWS ecosystem to proactively strengthen system reliability.
Experience in SRE, DevOps, or Software Engineering roles supporting enterprise applications.
Strong problem-solving, triage, and root-cause analysis skills with a systems engineering mindset.
Deep expertise in the AWS ecosystem, with hands-on experience across core services, primarily ECS, RDS, EKS, IAM, CloudWatch, and networking configuration.