10471 - Manager, SRE

$112,830 - $173,756

Full Time

  • Work from home


DevOps, Telematics, infrastructure operations, virtualization, Reliability Engineering, Linux, software engineering, Oracle, Java, information systems

Job Description

The Site Reliability Engineering (SRE) Manager will work with the development and operations teams to ensure that connected car systems work as expected and that the underlying infrastructure and network run smoothly. This role is responsible for the day-to-day operations of the DevOps team and combines project management, team management, and engineering duties. The DevOps team members are subject-matter experts within the Telematics domain and provide insight and engineering advice to development and product teams, with the goal of creating a highly reliable and scalable software system that runs with minimal failure.

Essential Functions:

  • Act as primary point-of-contact (PoC) on all connected car infrastructure operations and projects
  • Work collaboratively with software engineering to define infrastructure and deployment requirements; be a sounding board and provide recommendations to the engineering team on infrastructure design and deployment.
  • Set and pursue the goal of creating a highly reliable and scalable software system that runs with minimal failure
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
  • Be the driving force behind our automation and observability initiatives. Build tools and automation that eliminate repetitive tasks and prevent incident occurrence.
  • Build and maintain operational tools for deployment, monitoring, and analysis of connected car infrastructure and systems
  • Perform infrastructure cost analysis and optimization
  • Provide project management, sprint planning, and road-mapping support to the DevOps team
  • Design, develop, install, and maintain software solutions.
  • Work with engineering teams to refine deployment and release processes.
  • Collaborate with the engineering team on projects as the expert on reliability, performance, and efficiency.
  • Manage on-call rotations across connected car applications, using a follow-the-sun model.
  • Participate in 24x7 operational support and on-call rotation shifts.
  • Ensure that all system design and procedures are documented and up to date.
  • Monitor and stress test systems to collect metrics for tuning and capacity planning.
  • Work to automate detection and resolution of recurring issues.
  • Ensure safety, predictability, repeatability, and auditability of all build and deploy processes.
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning
  • Create sustainable systems and services through automation and uplifts
  • Balance feature development speed and reliability with well-defined service level objectives


Job requirements:

  • Bachelor’s or Master’s degree, or equivalent, in computing, information systems, or a related field.
  • 2+ years of experience as a manager or PM, or in a technical leadership capacity, preferably in the automobile industry within the Telematics domain.
  • Programming experience with one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript
  • Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale.
  • Experience with distributed systems in a production operations environment
  • Expertise in analyzing complex application, database, network, and OS issues across a distributed, large-scale, customer-facing system
  • Strong communication skills and ability to work effectively across multiple business and technical teams
  • Demonstrated ability to deliver results on time with high quality
  • Extensive experience leading customer-facing systems in a high-uptime 24/7 environment
  • Depth and breadth of experience with server-side Java development, Oracle, and distributed databases
  • A well-developed understanding of the theory and principles of operation of the internet and packet data protocols.
  • Exposure to Cloud, SaaS, and virtualization concepts and performance concerns.
  • Working knowledge of operating system design, processes, and threading model.
  • Knowledge of defining and monitoring system quality measures, including SLO and SLA.
  • Experience building tooling to improve system reliability, automate remediation of issues, or improve scalability.
  • Experience with different flavors of Linux, e.g., Red Hat, Ubuntu, CentOS.
  • Hands-on experience collecting performance data, analyzing, troubleshooting, and tuning.
  • Experience operating applications with high concurrency, scalability, or availability requirements.
  • Experience leading high-performing engineering teams.
  • Experience with containers and container orchestration tools (Docker, Kubernetes)
  • Experience with MySQL, Elasticsearch, Couchbase, MongoDB, and Redis


Nice to have:

  • Experience with open-source stream-processing frameworks/systems, e.g., Kafka, Spark.
  • Experience with distributed storage technologies such as NFS, HDFS, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)

Salary Range - $112,830 to $173,756