Job Details
Job Title: Senior Software Production Engineer, Infrastructure Software for AI
We have recently established a new US center in Silicon Valley, focused on infrastructure
software for AI and AI foundations for mobile networks. Our goals are to challenge the norms
and create products that use our state-of-the-art infrastructure (such as Nvidia GB200, MGX and
DGX Grace & Hopper platforms) and cloud-native software. These products are geared towards
centralized AI data centers as well as distributed AI Radio Access Network (AI RAN) data
centers. We are looking for experienced practitioners who are inspired to bring innovation and
build transformative products.
Minimum Qualifications:
Bachelor's degree in Computer Science, Electrical Engineering, or related field.
7+ years of experience in software or hardware engineering, including platforms and distributed systems.
2 years in lead roles, leading high-impact projects and teams.
Experience working with systems and systems software, cloud, and Kubernetes.
Deep experience with production testing and automation of Kubernetes deployments.
Preferred Qualifications:
Master's or PhD in a relevant field.
Expertise in building scalable test and automation infrastructure to productionize
workloads.
Experience with GPU platforms (Nvidia DGX, H100, GB200) and high-performance
computing environments.
Experience triaging customer bugs, prioritizing, and resolving issues in production.
Familiarity with AI developer frameworks, tools, and automation systems.
Role: Be a key member of the infrastructure team responsible for building foundational software
on top of GPU systems supporting AI workloads (training, fine-tuning and serving). Own and
develop the test-automation infrastructure for Kubernetes and GPU systems. Drive process
innovation in end-to-end systems software testing.
As the Senior Software Production Engineer responsible for the entire test-automation infrastructure,
you will work with Staff Engineers, product management, and program management to drive execution
towards commercialization.
Responsibilities:
Develop and build test-automation infrastructure for Kubernetes on large-scale GPU
clusters.
Build detailed test plans for different milestones and operationalize them in
test-automation infrastructure.
Build and own automation of end-to-end system, scale, and stress testing.
Work with software leads and the Technical Program Manager to qualify releases for
milestones.
Attract and help build downstream production engineering talent.
Model and foster a culture of humility and innovation in product delivery.