- Site Reliability
*Open to sponsorship; must have at least two years left*
*Hybrid: 2 days in office, 3 days remote*
A prestigious company is in search of an Associate Principal, Site Reliability Engineer. This engineer will focus on Splunk and other monitoring tools and on heavy Terraform use, and should have solid experience with Puppet, Chef, scripting in Python, CI/CD, and cloud.
- Collaborate with development, operations, and infrastructure teams to ensure availability of services and to work through implementation issues
- Develop automation to respond to incidents and to prevent problem recurrence
- Create and enhance runbooks to respond to service outages or degradations
- Define and track operational metrics for production performance, reliability, scalability, and availability
- Architect, develop and maintain shared services and tools to improve reliability and reduce toil across the organization
- Bachelor’s or Master’s degree in Computer Science, Information Systems, or another related field, or equivalent work experience
- 5-8 years of experience in Site Reliability Engineering / DevOps
- Experience managing infrastructure in public cloud environments like AWS (preferred), Azure or Google Cloud Platform
- Experience providing visibility using monitoring and alerting tools like Splunk, SignalFx, AppDynamics, Datadog, Stackdriver, Sysdig, Prometheus or Grafana
- Programming/scripting experience in languages like Java, Bash, Python or Go
- Experience with distributed messaging systems like Kafka, RabbitMQ, or ActiveMQ
- Experience with container orchestration systems like Kubernetes, Mesos, Docker Swarm or Rancher
- Experience using Continuous Integration and Continuous Delivery (CI/CD) tools like Jenkins, Travis, Harness, Spinnaker, AppVeyor, CodeBuild or CodePipeline