Senior Site Reliability Engineer

Overview

On Site
340k - 440k
Full Time

Skills

Data Analysis
Global Operations
Big Data
Optimization
ISP
Tier 3
FOCUS
Computer Science
Electrical Engineering
Computer Engineering
TCP/IP
Border Gateway Protocol
DNS
TLS
HTTP
Caching
Proxies
Python
Management
Unix
Linux
Computer Networking
Storage
Data Processing
Apache Hive
Apache Spark
SQL
Statistics
Orchestration
Docker
Kubernetes

Job Details

One of our clients in the entertainment platform space is looking for a Level 5 Reliability Engineer with a deep background in *nix systems, networking, data analysis, and operating large-scale platforms to help build, scale, automate, and maintain our globally distributed infrastructure.

Key Responsibilities
  • Lead efforts to enhance system resiliency, observability, monitoring, and automation, ensuring global operations remain scalable and reliable.
  • Collect, evaluate, and interpret significant volumes of server and application performance data using the Netflix Big Data platform to uncover optimization opportunities and identify trends or anomalies needing deeper analysis.
  • Support ISP partners with technical guidance for integrating our Open Connect Appliances.
  • Act as a Tier 3 escalation point and take part in the on-call rotation to address platform incidents.

Required Qualifications
  • At least 5 years of experience in site reliability or operational engineering roles supporting large-scale, high-performance systems with a focus on uptime and efficiency.
  • Bachelor's degree in Computer Science, Electrical or Computer Engineering, or equivalent experience (preferred).
  • In-depth understanding of networking and protocols including TCP/IP, BGP, DNS, TLS, and HTTP/S; familiarity with CDN and HTTP caching/proxy technologies.
  • Proficient in developing and maintaining automation using languages like Python.
  • Advanced expertise in managing and troubleshooting Unix/Linux environments at scale, covering networking, storage, and OS fundamentals.
  • Hands-on experience with distributed data processing tools such as Hive, Presto/Trino, or Spark SQL.
  • Strong applied statistics knowledge with the ability to write code that detects anomalous system behavior.
  • Some familiarity with containerization and orchestration technologies like Docker and Kubernetes.
  • Effective communicator and collaborator, comfortable working with internal teams and external partners alike.

About Motion Recruitment Partners, LLC