Senior Software Engineer – AI Research Clusters/Remote

Remote • Posted 1 hour ago • Updated 1 hour ago
Contract Types: W2, Corp To Corp, Independent
Travel: No Travel Required
Compensation: Depends on Experience

Job Details

Skills

  • Senior Software Engineer – AI Research Clusters

Summary

Job Title: Senior Software Engineer – AI Research Clusters

Location: Remote

Employment Type: Full-time


Role Overview

We are seeking a Senior Software Engineer to design, build, and optimize large-scale AI research clusters. This role focuses on distributed systems, high-performance computing, and infrastructure that supports AI/ML workloads such as model training and experimentation.


Key Responsibilities

  • Design and manage scalable AI research infrastructure and compute clusters.
  • Build and optimize distributed systems for large-scale model training and data processing.
  • Develop tools and frameworks to support researchers and ML engineers.
  • Work closely with AI researchers to understand workload requirements and improve system efficiency.
  • Optimize GPU/CPU utilization, storage, and networking performance.
  • Implement scheduling, resource allocation, and workload orchestration systems.
  • Ensure system reliability and fault tolerance, backed by comprehensive monitoring.
  • Automate infrastructure provisioning using Infrastructure as Code (IaC).
  • Troubleshoot performance bottlenecks and system failures.

Required Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • Strong programming skills in Python, Go, C++, or similar.
  • Experience with distributed systems and parallel computing.
  • Hands-on experience with containerization and orchestration tools (Docker, Kubernetes).
  • Familiarity with cloud platforms (AWS, Azure, or Google Cloud Platform) or on-prem HPC clusters.
  • Understanding of networking, storage systems, and system performance tuning.

Preferred Skills

  • Experience with ML frameworks (TensorFlow, PyTorch).
  • Familiarity with GPU computing (CUDA, NCCL).
  • Knowledge of cluster schedulers (Slurm, Kubernetes schedulers).
  • Experience with big data tools (Spark, Ray).
  • Exposure to MLOps and experiment tracking tools.

Key Competencies

  • Strong problem-solving and systems thinking
  • Collaboration with research and engineering teams
  • Performance optimization mindset
  • Ownership and accountability
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10513292
  • Position Id: 72430-12895-