Site Reliability Engineer

$140,000 - $160,000

Full Time

  • No Travel Required


Python, Java, C, C++, Go or Ruby, AWS

Job Description




The client is looking for a Senior Principal/Architect Site Reliability Engineer to help build a brand-new SRE team from the ground up.

  • Collaboratively build a roadmap for a newly formed SRE team within the client's organization
  • Ensure focus on systems uptime and security with swift incident response for availability issues or potential breaches
  • Lead the design, specification, and estimation of SRE projects in support of company initiatives, with feedback from key stakeholders
  • Participate in the development lifecycle (design, delivery, management, learning)
  • Go beyond development: review designs, create platforms and frameworks, and plan capacity
  • Improve frameworks and services with root cause analysis, blameless postmortems, and follow through to make sure the same incident never happens twice
  • Work cross-functionally to understand the full stack and recommend areas for improvement
  • Maintain and improve monitoring services, metrics and reporting for quick issue detection and actionable alerting
  • Participate in a shared on-call rotation for high-severity issues
  • After incidents, drive discovery and implement automated self-healing solutions

The Skills You’ll Bring:

  • A passion to develop and mentor team members
  • A drive to increase automation and improve monitoring
  • Strong interpersonal, written, and oral communication skills
  • Comfortable operating in a rapidly evolving environment, adapting quickly to new information, and re-prioritizing as needed
  • Ability to quickly learn new technologies, frameworks, and architectures as well as facilitate technical conversations with external stakeholders and your team
  • 10+ years relevant work experience in a production environment
  • Significant experience as an SRE or Software Engineer
  • Experience writing code with one or more programming languages: Python, Java, C, C++, Go or Ruby
  • 3+ years’ experience working in the cloud, preferably in AWS
  • Experience with large-scale distributed systems
  • Experience with a metrics monitoring platform (e.g., SignalFx, Datadog)

It is helpful, but not required, to have:

  • Experience moving new services to the cloud and knowledge of related best practices
  • Experience with infrastructure configuration and automation processes and tools: Terraform, Puppet, Ansible
  • Experience with security in the cloud: intrusion detection, penetration testing, and vulnerability scanning
  • Experience with log aggregation solutions: Splunk, Sumo Logic, ELK
  • Experience with Atlassian Jira Service Desk, PagerDuty, BigPanda
  • Experience with Change Management processes and functions
  • Experience with various data technologies including relational and non-relational databases and message queues
  • Good working knowledge of build automation and the continuous integration/delivery ecosystem: Git, Gerrit, Maven/Gradle, Jenkins, Docker, Nexus, Artifactory, Selenium