The Site Reliability Engineer (SRE) will work with local API development squads, platform teams, product owners, scrum masters, and architects. The SRE ensures that both our internally critical and our externally visible systems have reliability and uptime appropriate to users' needs while keeping an ever-watchful eye on capacity and performance. The SRE also works to reduce the time spent on operational work and tickets so that more time goes to improving site performance, availability, and capacity.
Site Reliability Engineer Responsibilities:
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
- Support APIs before they go live through system design review, developing software platforms and frameworks, capacity planning and performance reviews.
- Maintain APIs once they are live by measuring and monitoring availability, latency and overall system health.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Troubleshoot and mitigate the thorniest problems in our most mission-critical systems. Advise the team during postmortems on how to avoid repeat incidents.
Site Reliability Engineer Requirements:
- BS degree in Computer Science or related technical field, or equivalent practical experience.
- Experience in one or more of C, C++, Java, Python, or Go, or scripting experience in Shell and Perl.
- Experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols.
- Networking: experience with network theory (e.g. TCP/IP, UDP, ICMP), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Experience with OpenStack and open-source software.
- Expertise in designing, analyzing and troubleshooting large-scale distributed systems.
- In-depth knowledge of operating systems (processes, threads, concurrency issues, locks, mutexes, semaphores, monitors and how they work).
- Familiarity with algorithms, data structures and complexity analysis.
- Systematic problem solving approach, coupled with a strong sense of ownership and drive.