Overview
On Site
$200k - $275k per year
Full Time
Skills
FOCUS
Systems Engineering
Management
NoSQL
Failover
Collaboration
Computer Networking
Reliability Engineering
Rust
SQL
Recovery
Debugging
Database
Stacks Blockchain
Grafana
GPU
Scheduling
High Performance Computing
HPC
Service Level
CHAOS
Incident Management
Workflow
Linux
Orchestration
Kubernetes
NOMAD
Job Details
The Principal Systems Generalist - Distributed Systems & Databases is a senior technical position responsible for solving foundational challenges across distributed systems, production databases, orchestration, and observability. The focus of this role is to ensure the platform is reliable, scalable, and fault-tolerant under real-world workloads.
This role requires deep expertise in distributed systems engineering, production-grade database operations, and systems programming in Rust. The ideal candidate will be comfortable working across the stack, from low-level systems internals to orchestration and observability tooling, and will act as a cross-functional problem solver across infrastructure teams.
Key Responsibilities
- Design and implement backend systems in Rust to manage and allocate GPUs across compute clusters.
- Scale and optimize high-performance SQL and NoSQL databases for low-latency, high-throughput workloads.
- Build and maintain resilient recovery and failover mechanisms to ensure availability during outages and cluster-level failures.
- Develop observability tooling, including monitoring, logging, and tracing, for diagnosing issues in distributed environments.
- Collaborate with infrastructure, platform, and networking teams to design and operate a unified and reliable systems platform.
- Lead or contribute to architectural reviews, incident response, and reliability engineering efforts.
Required Qualifications
- Significant experience building and operating distributed systems at production scale.
- Proficiency in Rust for systems and backend programming.
- Strong understanding of SQL.
- Demonstrated ability to build tooling and automation for system health, monitoring, and recovery.
- Comfortable navigating and debugging complex interactions between the kernel, databases, and orchestration layers.
Preferred Qualifications
- In-depth understanding of distributed consensus protocols (e.g., Raft, Paxos) and database internals.
- Experience with observability stacks (e.g., Prometheus, Grafana, OpenTelemetry).
- Exposure to GPU scheduling or high-performance computing (HPC) environments.
- Familiarity with service-level objectives (SLOs), chaos engineering, and incident management workflows.
- Experience with Linux systems internals and container orchestration frameworks such as Kubernetes or Nomad.