Engineering Manager (SRE, DevOps, AWS)

Overview

Remote
$170,000 - $190,000
Full Time

Skills

Management
Software Engineering
Amazon Web Services
Cloud Computing
DevOps
Leadership
Reliability Engineering
Operational Excellence
Python
Java
Scalability
Terraform
Kubernetes
Generative Artificial Intelligence (AI)

Job Details

As an Engineering Manager for our Cloud & Reliability team, you will lead the engineering group responsible for our production and pre-production runtime environments. Your team is the counterpart to our platform delivery function and is responsible for the stability, scalability, and security of the cloud platform where our applications run. This is a critical leadership role for a software engineer who excels at building highly available, scalable systems and is skilled at balancing the development of a modern cloud-native stack with the operational ownership of a legacy environment slated for modernization. It is also an opportunity not only to modernize our core infrastructure but also to help rebuild it with a forward-looking, AI-first approach.

Although this is a remote position, we prefer candidates located on the East Coast.

What You'll Do

  • Lead the Cloud & Reliability Team: Manage and grow a team of software engineers focused on the core runtime environment. You will support their technical and professional growth through mentoring and career guidance, while fostering a culture of operational excellence and proactive problem-solving.
  • Engineer Cloud Infrastructure: Lead the design, implementation, and governance of our cloud infrastructure using Infrastructure as Code principles. Your domain will include our container orchestration platform (Kubernetes), cloud networking, database infrastructure, and the management of our cloud budget.
  • Pioneer GenAI in the Platform: Champion the practical application of Generative AI to solve real-world platform challenges. You will identify and implement opportunities to enhance system reliability, automate complex operational tasks, and accelerate developer workflows.
  • Drive System Reliability: Apply Site Reliability Engineering (SRE) principles to improve the performance, availability, and security of our production systems. You will own the monitoring and observability stack, lead incident response, and drive a culture of blameless post-mortems.
  • Champion Best Practices: Promote and enforce best practices for documentation and coding standards, ensuring the team's work is accessible, understandable, and trusted by the rest of the engineering organization.
  • Manage the Legacy Stack: Assume direct ownership of the operational health, security, and maintenance of our legacy stack. You will ensure its stability while actively partnering with product teams on the migration and decommissioning strategy.

What You'll Bring

  • Software Engineering Leadership: Proven success in managing a team of software or infrastructure engineers. You have a background as a hands-on backend engineer and have successfully transitioned into leadership.
  • An AI-Curious Mindset: While deep AI/ML experience isn't required, you have genuine curiosity about, and hands-on familiarity with, modern GenAI tools and concepts. You are the kind of person who is already exploring how AI can augment and accelerate software engineering.
  • Cloud-Native Expertise: Deep, hands-on knowledge of a major cloud provider (AWS or Google Cloud Platform), extensive experience building and running production systems on Kubernetes using Terraform, and expert-level proficiency with Linux.
  • Polyglot Programming Skills: Strong programming skills in Python and a deep understanding of the Java ecosystem, including experience with both modern Spring frameworks and legacy enterprise systems.
  • A Reliability Mindset: A strong command of SRE principles, including experience with SLOs, error budgets, and building robust monitoring and alerting systems. You are skilled at troubleshooting complex distributed systems under pressure.

Success in Your First Year Looks Like

  • You are leading an engaged and effective team that has taken full ownership of our production runtime environment.
  • You have established a clear technical roadmap for the cloud infrastructure that balances feature delivery with reliability improvements.
  • You have successfully delivered at least one proof-of-concept that leverages GenAI to solve a platform engineering problem.
  • You have measurably improved the stability of the legacy environment while making clear, documented progress on the modernization and sunsetting plan.
  • You have built a strong, collaborative partnership with the Platform Delivery team and other engineering leaders.