Remote Job - Site Reliability Engineer (SRE) - 10 years experience

Prometheus, Grafana, SRE, Site Reliability
Contract W2, Contract Independent, Contract Corp-To-Corp
Depends on Experience

Job Description

Job Role: Site Reliability Engineer

Location: (Remote)

Duration: 6+ month contract (could be extended)



**** is reimagining the Internet as a public network that hosts secure software and services. The Internet Computer is a new technology stack that will be unhackable and fast, will scale to billions of users around the world, and will support a new kind of autonomous software that promises to reverse Big Tech’s monopolization of the internet. *** was founded in 2016 by Dominic Williams and is backed by top-tier institutions including Polychain Capital and Andreessen Horowitz.

The SRE team at *** is charged with creating tools, processes, and frameworks that ensure the stability of the Internet Computer, which is distributed and scalable. As a member of the team you will work with engineering, infrastructure, and security teams to bake reliability and operability into the product from the start by participating in design and code reviews and by identifying risks, problems, and mitigations. This is not a team that exists to be on-call; this is a team that elects to be on-call because it helps do the job better.

Responsibilities:



  • Implement tools that ensure high availability of *** product
  • Gain deep knowledge of *** complex applications
  • Identify opportunities to automate or improve processes and then implement the automation
  • Coordinate incident response across multiple teams -- clearly understanding and communicating what is going on, next steps, who is responsible for what, and so on
  • Implement observability tools to ensure visibility into service stability and performance
  • Be on-call for production services
  • Operate, troubleshoot, and deploy software on Unix systems
  • Think about problems in a systemic, methodical way, especially when troubleshooting


Required Skills:

  • Expertise in observability and monitoring of applications, services, and networks, using tools such as Prometheus, Grafana, and ELK logging
  • Unix/Linux experience, including application installation, configuration, and maintenance
  • Significant experience with site reliability, developer productivity, DevOps, or server infrastructure engineering (including on-call incident response)
  • Understanding of Internet networking protocols: TCP/IP, TLS, DNS, HTTP/S, SMTP
  • Experience troubleshooting issues across the entire stack (hardware, software, network, etc.)
  • Experience writing automation scripts and utilities in a scripting language such as Python, Perl, Shell, or PHP
  • Experience with incident and problem management
  • Strong communication and interpersonal skills
Dice Id : 10166369
Position Id : Nick-df
Originally Posted : 4 weeks ago