Overview
On Site
USD 197,400.00 per year
Full Time
Skills
Embedded Systems
Innovation
FOCUS
High Performance Computing
Use Cases
Infrastructure Architecture
Data Storage
Scheduling
MPI
Computer Networking
Network
InfiniBand
GPU
Job Scheduling
IBM GPFS
Scratch
Storage
Data Management
Orchestration
Management
Continuous Integration
Continuous Delivery
Cloud Computing
Collaboration
Research
Documentation
Computer Hardware
Remote Direct Memory Access
GPU Computing
Linux Administration
File Systems
Scripting
Python
Bash
HPC
Docker
Kubernetes
Machine Learning (ML)
PyTorch
TensorFlow
JAX
Training
Artificial Intelligence
Optimization
Machine Learning Operations (ML Ops)
Computer Science
Military
Law
Recruiting
Job Details
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
Principal Solutions Engineer, Infrastructure (SLURM & AI Focus)
THE ROLE:
The AMD Datacenter GPU team is seeking an experienced Solutions Engineer specializing in high-performance computing infrastructure for AI workloads. This role focuses on designing, deploying, and optimizing GPU-accelerated computing environments for AI use cases, using SLURM as the primary workload manager.
THE PERSON:
The ideal candidate will have deep expertise in multi-tenant schedulers for large-scale AI clusters, RDMA networking, collective communications, container orchestration, and storage solutions optimized for AI/ML workloads.
KEY RESPONSIBILITIES:
AI Infrastructure Design
- Build and design large GPU-accelerated clusters for AI/ML workloads
- Develop reference architectures for SLURM-based HPC environments
- Integrate SLURM with Kubernetes for hybrid workload management
- Design storage systems to support high-speed AI training pipelines
SLURM Optimization & Management
- Configure and optimize SLURM for efficient AI/ML scheduling and resource use
- Use advanced SLURM features such as GPU-aware scheduling, MPI integration, container runtime support, and fair-share policies
- Develop SLURM plugins and customizations for AI workloads
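To illustrate the kind of GPU-aware scheduling this work involves, below is a minimal sketch of a SLURM batch script for a multi-node training job; the job name, partition, node counts, and training script are hypothetical and would vary by cluster.

```shell
#!/bin/bash
# Hypothetical multi-node GPU training job; partition name, resource
# counts, and train.py are illustrative only.
#SBATCH --job-name=llm-train
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8          # GPU-aware scheduling via GRES
#SBATCH --cpus-per-task=12
#SBATCH --time=24:00:00

# Launch one task per GPU across all nodes using SLURM's PMIx/MPI integration
srun --mpi=pmix python train.py
```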
Networking & Interconnect
- Design RDMA network setups (InfiniBand, RoCE) for fast data transfer
- Optimize collective communications for distributed training (e.g., all-reduce)
- Configure GPUDirect RDMA and topology-aware job scheduling
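One concrete mechanism for topology-aware scheduling is SLURM's topology/tree plugin, which maps the job scheduler onto the fabric layout; a sketch of a topology.conf for a small two-leaf fabric (switch and node names are hypothetical) might look like:

```
# Hypothetical topology.conf for SLURM's topology/tree plugin.
# Requires TopologyPlugin=topology/tree in slurm.conf; names are illustrative.
SwitchName=leaf1 Nodes=gpu[001-016]
SwitchName=leaf2 Nodes=gpu[017-032]
SwitchName=spine Switches=leaf[1-2]
```

With this in place, SLURM prefers to pack a job's nodes under the fewest switches, which reduces cross-spine traffic during collective operations.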
Storage Solutions
- Architect parallel file systems such as Lustre, GPFS, and BeeGFS for AI data needs
- Implement high-performance scratch storage and tiered data management
- Optimize I/O patterns and manage data lifecycle for training datasets
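As an example of the I/O tuning involved, Lustre striping can be matched to the access pattern of training data; the commands below are a sketch, and the paths, stripe sizes, and stripe counts are illustrative and depend on the cluster's OST count.

```shell
# Hypothetical Lustre layout tuning; paths and values are illustrative.

# Stripe large sequential training shards across all OSTs, 4 MiB stripe size
lfs setstripe -c -1 -S 4M /scratch/datasets/train-shards

# Keep small files (checkpoint metadata, logs) on one OST to reduce overhead
lfs setstripe -c 1 /scratch/run-logs

# Verify the resulting layout
lfs getstripe /scratch/datasets/train-shards
```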
Container Orchestration & Integration
- Collaborate on Kubernetes operators for SLURM integration
- Develop strategies for seamless containerized AI workload management
- Build CI/CD pipelines and enable hybrid cloud deployments
Collaboration & Support
- Work with research teams and customers to meet AI computing needs
- Provide technical guidance and training
- Create documentation and best practices
- Partner with vendors on hardware and software selection
PREFERRED EXPERIENCE:
Technical Skills
- Extensive SLURM experience in production HPC environments
- Expert knowledge of RDMA technologies and collective communications
- Hands-on GPU computing and Linux system administration skills
- Experience with parallel file systems and scripting (Python, Bash, Go)
Container & Orchestration
- Production Kubernetes experience in HPC settings
- Familiarity with Kubernetes SLURM plugin and container runtimes (Singularity, Docker)
- Experience with Helm and Kubernetes operators
AI/ML Infrastructure
- Understanding of AI frameworks (PyTorch, TensorFlow, JAX) and distributed training
- Knowledge of AI workload optimization and MLOps practices
EDUCATION:
- Bachelor's degree in Computer Science, Engineering, or related field
- Advanced degree preferred
#LI-EV1
#LI-HYBRID
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.