Technical Operations Manager Production/Web/SaaS

Overview

Remote
Depends on Experience
Full Time
No Travel Required

Skills

API Management
Amazon DynamoDB
Amazon Kinesis
Amazon RDS
Business Continuity Planning
Google Cloud Platform
DevOps
Database
Disaster Recovery
Apache Tomcat
Amazon Web Services
Authentication
Manufacturing
Microsoft Azure
MongoDB
Legacy Systems
High Availability
ITIL
IaaS
Cloud Computing
Amazon S3
Business Intelligence
Product Engineering
Team Leadership
Technical Direction
Value Engineering
SQL

Job Details

Technical Operations Manager Web/SaaS

Job Overview
We are seeking an exceptional Senior Manager of Site Reliability Engineering (SRE) to lead our global SRE organization and drive operational excellence across our multi-cloud SaaS platform. This role is critical to our mission of delivering reliable, scalable, and performant solutions to thousands of customers worldwide. The successful candidate will be an integral part of a fast-growing manufacturing automation software company.

Success Metrics:

  • Customer Impact: Reduced MTTR and improved customer satisfaction scores
  • Reliability: Achievement of 99.9%+ uptime SLAs across all products and regions
  • Team Growth: Successful scaling of the global SRE organization with low attrition
  • Proactive Prevention: Reduction in incident frequency through automated detection and prevention
  • Cross-functional Collaboration: Improved partnership metrics with Product, Engineering, and Customer Success teams

About Us

Our company is a leading provider of innovative manufacturing quality management software (QMS) and supplier quality management software (SQS) that transforms how the world's most demanding industries operate. For over a decade, we've empowered aerospace giants, automotive manufacturers, medical device companies, and energy sector leaders to eliminate quality incidents, reduce costs, and consistently hit delivery targets, all while maintaining the highest quality standards and compliance.

Responsibilities

Leadership & Strategy

  • Lead and scale a global SRE organization spanning multiple time zones
  • Develop and execute SRE strategy aligned with business objectives and customer success metrics
  • Drive cultural transformation toward reliability-first engineering practices across the organization
  • Partner closely with Customer Success to ensure a customer-centric approach to all SRE initiatives
  • Establish and maintain SLAs, SLOs, and error budgets that balance reliability with feature velocity

Incident Management & Response

  • Lead enterprise-wide incident management, ensuring rapid detection, response, and resolution
  • Serve as the executive point of contact during critical incidents
  • Drive comprehensive root cause analysis (RCA) processes with actionable prevention strategies
  • Establish and maintain 24/7 on-call rotation and escalation procedures across global teams
  • Develop and execute disaster recovery and business continuity plans

Technical Leadership

  • Provide technical direction for complex, multi-cloud infrastructure spanning AWS, Azure, and Google Cloud Platform
  • Oversee reliability engineering for our entire product portfolio
  • Lead application performance monitoring initiatives
  • Drive modernization efforts and ensure optimal performance across geographically distributed data centers
  • Drive best practices in tuning SQL and NoSQL data platforms

Platform Reliability

  • Ensure high availability and performance of services including AWS (ECS, ECR, RDS, Aurora, SQS, SNS, Kinesis, S3, DynamoDB, OpenSearch); authentication (Auth0/Okta CIC); integration platforms (Workato); BI (Looker); API management (Apigee); and legacy systems (Tomcat, MongoDB)
  • Manage reliability for thousands of customers in North America and EU

Operational Excellence

  • Establish an observability standardization strategy (Sumo Logic, New Relic, and Grafana)
  • Drive automation initiatives to reduce manual operational overhead
  • Implement chaos engineering and reliability testing practices
  • Lead capacity planning and performance optimization efforts
  • Establish a metrics-driven culture with a focus on customer impact measurements

Qualifications

Leadership Experience

  • 15+ years in SRE, DevOps, or Infrastructure Engineering roles with 5+ years in senior positions
  • Proven track record of scaling global engineering teams across multiple time zones
  • Experience leading teams through high-stakes incident response and customer escalations
  • Growth mindset suited to a smaller, fast-scaling company
  • Strong organizational skills with the ability to influence cross-functional stakeholders

Technical Expertise

  • Deep expertise in multi-cloud environments (AWS primary, Azure secondary, Google Cloud Platform preferred)
  • Extensive experience with containerization, orchestration, and modern deployment practices
  • Strong background in database technologies
  • Proficiency with observability tools (New Relic, Grafana, Sumo Logic, or similar)
  • Experience with large-scale Java applications and legacy system modernization

SRE & Operations

  • Demonstrated success implementing SRE principles in large-scale production environments
  • Experience with ITIL and with incident management frameworks and tools
  • Background in establishing and maintaining SLAs for enterprise SaaS products

Preferred

  • Background with authentication systems (Auth0, Okta, SAML, OAuth)
  • Experience with API management platforms and integration architectures
  • Previous exposure to CDN optimization and global content delivery
  • Relevant certifications in AWS, Azure, or SRE practices
