Senior Solutions Engineer, AI Infrastructure

Remote • Posted 30+ days ago • Updated 4 hours ago

Full Time

Remote

Job Details

Skills

Cloud Computing
PB
POC
IT Strategy
Evaluation
Machine Learning (ML)
Analytics
Linux
SAN
Roadmaps
Linux Kernel
NAND
Load Balancing
HPC
Scratch
Machine Learning Operations (ML Ops)
Data Storage
IaaS
Conflict Resolution
Problem Solving
Debugging
Customer Facing
Software Design
GPU
Training
Artificial Intelligence
Kubernetes
Apache Spark
Orchestration
Scheduling
Ceph
Weka
IBM GPFS
Storage
Distributed File System
InfiniBand
Remote Direct Memory Access
Ethernet
Computer Networking
Management
CUDA

Summary

Description

We're looking for a deeply technical Solutions Architect to help customers design, evaluate, and deploy infrastructure for large-scale AI, HPC, analytics, and data-intensive workloads.

This is a customer-facing technical role for someone who has lived inside production infrastructure. You may have been a platform engineer, infrastructure engineer, SRE, MLOps engineer, AI infrastructure engineer, storage engineer, cloud engineer, or HPC systems engineer. What matters most is that you have built, operated, or architected real systems, and can bring that credibility into customer conversations.

Our customers are building infrastructure at serious scale: GPU clusters, high-performance storage systems, Kubernetes platforms, distributed training environments, inference platforms, data pipelines, lakehouses, and large enterprise systems. You'll help them reason about architectures involving 10,000+ GPUs, 100PB+ of storage, high-performance networking, distributed filesystems, orchestration layers, and demanding production workloads.

You'll own technical discovery, architecture design, PoC planning, competitive positioning, and customer technical strategy. You'll work from the first whiteboard session through evaluation, deployment planning, and production success. You'll also partner closely with product and engineering teams to bring field feedback into the roadmap.

We're looking for someone who can go deep technically, communicate clearly, operate without a rigid playbook, and translate complex infrastructure into customer outcomes.

Responsibilities

Lead technical discovery with customers across infrastructure, platform, ML, data, and executive stakeholders.
Design architectures for large-scale AI, HPC, analytics, and enterprise data workloads.
Help customers evaluate infrastructure involving GPUs, storage, networking, orchestration, and data movement.
Translate complex technical requirements into clear solution designs, reference architectures, and deployment guidance.
Debug customer issues across Linux, storage, networking, Kubernetes, schedulers, GPUs, and application workloads.
Build technical assets, runbooks, and field guidance for repeatable customer engagements.
Partner with product and engineering to communicate customer requirements, gaps, and roadmap opportunities.
Help customers move from architecture design to production deployment.

Requirements

8 to 12+ years of technical experience, with significant hands-on infrastructure experience.
Experience building, operating, or architecting production platform infrastructure.
Strong understanding of Linux kernel implementation details, distributed systems (including Paxos and Raft), storage implementation details such as NAND or write amplification, store-and-forward networking, load-balancing designs, and production operations.
Experience with one or more of: GPU infrastructure, large-scale HPC systems, Kubernetes platforms built from scratch, MLOps, storage systems, cloud infrastructure, data platforms, or large-scale enterprise infrastructure.
Ability to communicate credibly with engineers, architects, technical executives, and business stakeholders.
Strong discovery, problem-solving, and systems debugging skills.
Comfort operating in ambiguous, fast-moving environments.
Interest in customer-facing technical work, solution design, and business outcomes.

Preferred Experience

Experience with large-scale GPU clusters, distributed training, inference infrastructure, or AI platforms.
Experience with petabyte-scale storage or high-performance data systems.
Experience with Kubernetes, Slurm, Ray, Spark, or other orchestration or scheduling systems.
Domain expertise in one or more of: Lustre, Ceph, Weka, BeeGFS, GPFS, VAST, object storage, or distributed filesystems.
Experience with large-scale InfiniBand, RoCE, RDMA, high-performance Ethernet, or NVIDIA/Mellanox networking.
Direct experience with CUDA, NCCL, DCGM, GPUDirect, checkpointing, dataset staging, or model-serving infrastructure.
Experience across multiple industries or customer environments.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 91137528
Position Id: 343d655500dda292606f2613edb8c0a2

Remote jobs at VAST Data