Job Details
Our client is seeking a Software Engineer to build a robust framework that schedules and manages jobs across on-prem and cloud compute environments. You'll design orchestration logic that loads LLMs onto compute nodes, retrieves queries from storage, processes inference, and ensures graceful shutdown and recovery. Great role for an engineer with 2-3 years of experience who is excited to work on scalable distributed systems.
Responsibilities
Build a framework to manage jobs across on-prem and cloud compute.
Implement job orchestration to allocate compute nodes, load LLMs, process queries, and deliver results.
Design fault-tolerant execution with restart/recovery mechanisms.
Ensure clean shutdown of child nodes and processes.
Work with AWS/Google Cloud Platform for compute, storage, and workflow integrations.
Manage containers and scheduling in Kubernetes.
Write clean, testable code with unit tests.
Collaborate with engineering teams on architecture and reviews.
Use Git for branching, PRs, reviews, and merges.
Requirements
2-3 years of software engineering experience.
Proficiency in Python.
Experience with LLM inference libraries (vLLM, transformers, or Nemotron).
Experience with Kubernetes and distributed container orchestration.
Experience building robust distributed applications with graceful recovery.
Experience with AWS or Google Cloud Platform.
Experience writing unit tests.
Strong collaboration and communication skills.
PyTorch
API design experience
Type: Contract - Part Time (20 hrs/week)
Duration: 9 months with extension
Location: Remote (U.S.)
Salary Range: $41/hr - $56/hr DOE
No 3rd party agencies or C2C