JD:
We are looking for a skilled Site Reliability Engineer (SRE) with solid experience in AWS cloud infrastructure to join our growing engineering team. As an SRE, you will ensure our services are reliable, scalable, and performant, while driving operational excellence and infrastructure automation.
________________________________________
Key Responsibilities:
- Design, implement, and maintain scalable and secure AWS infrastructure.
- Build and maintain infrastructure as code using Terraform or CloudFormation.
- Manage Kubernetes (EKS) clusters and containerized workloads.
- Develop monitoring and alerting solutions using tools like Prometheus, Grafana, and CloudWatch.
- Support CI/CD pipelines using tools such as Jenkins, GitHub Actions, or CodePipeline.
- Participate in incident response, troubleshooting, and root cause analysis.
- Automate operational tasks through scripting (Python, Bash, etc.).
- Ensure high availability, reliability, and performance of production systems.
- Collaborate with development teams to improve system design and architecture.
________________________________________
Required Skills:
- 3-6 years of SRE/DevOps experience in a production environment.
- Strong hands-on experience with AWS services (EC2, S3, IAM, VPC, RDS, Lambda, etc.).
- Proficient in Terraform or CloudFormation.
- Experience with Docker and Kubernetes (EKS preferred).
- Familiarity with Linux system administration and shell scripting.
- Strong understanding of monitoring, logging, and alerting frameworks.
- Good knowledge of networking concepts (DNS, TCP/IP, load balancing).
- Strong troubleshooting and incident management skills.