Title: Site Reliability Engineer (SRE), Crypto Services
Location: Remote
Interview: Video
Job Summary
We are seeking a Site Reliability Engineer (SRE) to join the Crypto Services SRE team, which builds and operates highly reliable systems supporting both internal Client services and customer-facing platforms used by millions of users. In this role, you will help keep systems reliable, scalable, performant, and secure, while contributing to automation, monitoring, incident response, and infrastructure optimization. Your work will directly improve the experience of Client users worldwide.
Key Skills & Qualifications
Strong knowledge of Linux/Unix fundamentals and networking concepts.
Hands-on experience with scripting or programming languages such as:
Bash, Zsh, Perl, Python.
C/C++, Go, or Java.
Experience with Configuration Management / Infrastructure as Code (IaC) tools:
Ansible, Puppet.
Terraform/Terragrunt.
AWS CloudFormation.
Basic understanding of containerization and orchestration technologies:
Docker or Podman.
Kubernetes or Apache Mesos.
Awareness of security principles, including encryption, key management, and key exchange protocols.
Understanding of SRE fundamentals, such as monitoring, alerting, automation, error budgets, and fault analysis.
Strong communication and collaboration skills with the ability to work effectively across cross-functional teams.
Responsibilities
Assist in the implementation, maintenance, and support of monitoring, observability, alerting, and logging systems to ensure high availability and reliability.
Contribute to the design and development of automation and tooling, including:
Writing Ansible playbooks.
Developing tools to monitor API endpoints and system health.
Monitor key performance metrics and proactively identify opportunities for system optimization and efficiency improvements.
Collaborate with engineering and operations teams to troubleshoot incidents, perform root cause analysis, and implement preventative solutions.
Help document workflows, operational procedures, and runbooks, keeping them accurate so they can be relied on during incidents.