A media & entertainment company in New York City is seeking a Cloud Reliability Engineer for a hybrid role focused on improving the availability, performance, scalability, and operational resilience of large-scale AWS cloud platforms.
About the Opportunity:
- Schedule: Monday to Friday
- Hours: Standard business hours
- Setting: Hybrid (3 days onsite, 2 days remote)
Responsibilities:
- Design, implement, and maintain reliability practices for cloud infrastructure and platform services
- Develop and maintain monitoring, logging, alerting, and observability solutions across AWS environments
- Investigate incidents, troubleshoot issues, and perform root cause analysis and post-incident reviews
- Support the reliability and operational health of large-scale AWS environments, including multi-account structures
- Partner with cloud, networking, security, and application teams to improve resiliency, automation, and operational consistency
Qualifications:
- 3 to 7 years of experience in Site Reliability Engineering, Cloud Engineering, DevOps, Infrastructure Engineering, or a related role
- Bachelor's degree in Computer Science, Engineering, Information Systems, or equivalent practical experience
- Strong hands-on experience with AWS cloud services in enterprise-scale environments
- Experience with monitoring and observability platforms, incident management, troubleshooting, and root cause analysis
- Experience with AWS Organizations, AWS Control Tower, and IAM Identity Center
- Experience with Infrastructure as Code tools such as Terraform and CloudFormation
- Experience with scripting and automation using Python, PowerShell, Bash, or similar languages
- Strong understanding of AWS networking, resiliency, and cloud architecture concepts
- Experience with logging, metrics, tracing, and alerting technologies
- Strong troubleshooting, communication, and collaboration skills
Desired Skills:
- Experience building reusable Terraform or CloudFormation modules
- Experience standardizing cloud deployments through templates and self-service frameworks
- Experience with CI/CD platforms and deployment automation
- Experience with Datadog
- Experience improving cloud operations through automation at scale