As a System Reliability Analyst, your responsibilities will include, but not be limited to:
- Working closely with engineering/development teams to design, build, optimize, and maintain systems.
- Troubleshooting issues across the entire technology stack: hardware, software, application, and network.
- Aggressively targeting toil and operational risk, and deploying solutions to reduce these.
- Broadening infrastructure and application observability.
- Proactively identifying and addressing active or potential risks to system reliability.
- Advocating for reliability priorities in application design reviews and operational readiness exercises for new and existing services.
Qualifications:
What skills and experience do I need?
You should apply if you have at least a Bachelor's degree in Computer Science or other technical discipline(s), plus hands-on experience with any combination of the following:
- 3-5+ years of practical experience in production systems support or application development.
- Hands-on experience managing systems in a large-scale distributed Unix/Linux environment is essential.
- Automation-related experience is required, using scripting languages such as Python, Bash, Perl, and/or Ruby. Higher-level compiled languages such as C++, C#, Java, Scala, and Go are a big plus.
- Deep knowledge of and hands-on experience applying the principles of System/Site Reliability Engineering (SRE).
- Practical experience designing and instrumenting SLO/SLI dashboards is particularly valuable.
- Hands-on experience with enterprise tools such as AppDynamics, Grafana, Splunk, and Dynatrace.
- Experience with Puppet, Ansible, Chef, GitHub, or other automation, configuration, or release management tools.
- Awareness of, and ability to reason through, modern software and systems architectures, including load balancing, databases, queueing, caching, distributed systems failure modes, microservices, cloud, etc.
- Working ability to interact with message transport platforms and protocols (MQ, CPS, XML, FIX) and distributed database technologies (DB2, Sybase, MongoDB, Greenplum, Postgres, KDB).
- AutoSys scheduling and batch processing concepts.
- Deep understanding of infrastructure and operating system concepts such as processes, memory allocation, and networking, with an understanding of how applications are affected by the above, and ability to debug and troubleshoot accordingly.