Job Description
10471 – Manager, SRE
Purpose:
The Site Reliability Engineering (SRE) Manager works with the development and operations teams to ensure that connected car systems are working as expected and that the underlying infrastructure and network are running smoothly. This role is responsible for the day-to-day operations of the DevOps team and combines project management, team management, and engineering duties. The DevOps team members are subject-matter experts within the Telematics domain and provide insight and engineering advice to development and product teams, with the goal of creating a highly reliable and scalable software system that runs with minimal failure.
Essential Functions:
- Act as primary point-of-contact (PoC) on all connected car infrastructure operations and projects
- Work collaboratively with software engineering to define infrastructure and deployment requirements; be a sounding board and provide recommendations to the engineering team on infrastructure design and deployment.
- Set and pursue the goal of a highly reliable and scalable software system that runs with minimal failure
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Be the driving force behind our automation and observability initiatives. Build tools and automation that eliminate repetitive tasks and prevent incident occurrence.
- Build and maintain operational tools for deployment, monitoring, and analysis of connected car infrastructure and systems
- Perform infrastructure cost analysis and optimization
- Provide project management, sprint planning, and road-mapping support to the DevOps team
- Design, develop, install, and maintain software solutions.
- Work with engineering teams to refine deployment and release processes.
- Collaborate with the engineering team on projects as the expert on reliability, performance, and efficiency.
- Manage on-call rotations across connected car applications, using a follow-the-sun model.
- Participate in 24x7 operational support and on-call rotation shifts.
- Ensure that all system designs and procedures are documented and kept up to date.
- Monitor and stress test systems to collect metrics for tuning and capacity planning.
- Work to automate detection and resolution of recurring issues.
- Ensure safety, predictability, repeatability, and auditability of all build and deploy processes.
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed and reliability with well-defined service level objectives
Job requirements:
- Bachelor’s or Master’s degree, or equivalent, in computer science, information systems, or a related field.
- 2+ years of experience as a manager or PM, or in a technical leadership capacity, preferably in the automobile industry within the Telematics domain.
- Programming experience with one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript
- Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale.
- Experience with distributed systems in a production operations environment
- Expertise in analyzing complex application, database, network, and OS issues across a large-scale, distributed, customer-facing system
- Strong communication skills and ability to work effectively across multiple business and technical teams
- Demonstrated ability to deliver results on time with high quality
- Extensive experience leading customer-facing systems in a high-uptime 24/7 environment
- A depth and breadth of experience with server-side Java development, Oracle and distributed databases
- A well-developed understanding of the theory and principles of operation of the internet and packet data protocols.
- Exposure to Cloud, SaaS, and virtualization concepts and performance concerns.
- Working knowledge of operating system design, processes, and threading model.
- Knowledge of defining and monitoring system quality measures, including SLOs and SLAs.
- Experience building tooling to improve system reliability, automate remediation of issues, or improve scalability.
- Experience with different flavors of Linux, e.g., RedHat, Ubuntu, CentOS, etc.
- Hands-on experience collecting performance data, analyzing, troubleshooting, and tuning.
- Experience operating applications with high concurrency, scalability, or availability requirements.
- Experience leading high-performing engineering teams.
- Experience with containers and container orchestration tools (Docker, Kubernetes)
- Experience with MySQL, Elasticsearch, Couchbase, MongoDB, and Redis
Nice to have:
- Experience with open-source stream-processing frameworks/systems, e.g., Kafka, Spark, etc.
- Experience with distributed storage technologies such as NFS, HDFS, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
Salary Range - $112,830 to $173,756