Principal AI Infrastructure Abstraction Engineer

Overview

On Site

Full Time

Skills

Innovation

Startups

Interfaces

Cloud Computing

Computer Hardware

QoS

Management

Onboarding

Access Control

Scalability

Python

C++

Kubernetes

Docker

Orchestration

Storage

Computer Networking

Artificial Intelligence

Machine Learning (ML)

PyTorch

TensorFlow

GPU

Scheduling

MPS

MIG

Resource Management

FOCUS

Training And Development

Collaboration

Internet

Sustainability

Training

Recruiting

Insurance

Life Insurance

Cisco

Sales

Job Details

This position requires a hybrid working schedule in the San Jose or Milpitas office.

Meet the Team

We are an innovation team on a mission to transform how enterprises harness AI. Operating with the agility of a startup and the focus of an incubator, we're building a tight-knit group of AI and infrastructure experts driven by bold ideas and a shared goal: to rethink systems from the ground up and deliver breakthrough solutions that redefine what's possible - faster, leaner, and smarter.
We thrive in a fast-paced, experimentation-rich environment where new technologies aren't just welcome - they're expected. Here, you'll work side-by-side with seasoned engineers, architects, and thinkers to craft the kind of iconic products that can reshape industries and unlock entirely new models of operation for the enterprise.
If you're energized by the challenge of solving hard problems, love working at the edge of what's possible, and want to help shape the future of AI infrastructure - we'd love to meet you.

Your Impact

As an AI Infrastructure Abstraction Engineer, you will help shape the next generation of AI compute platforms by designing systems that abstract away hardware complexity and expose logical, scalable, and secure interfaces for AI workloads. Your work will enable multi-tenancy, resource isolation, and dynamic scheduling of GPUs and accelerators at scale - making infrastructure programmable, elastic, and developer-friendly.

You will bridge the gap between raw compute resources and AI/ML frameworks, allowing infrastructure teams and model developers to consume shared GPU resources with the performance and reliability of bare metal, but with the flexibility of cloud-native systems. Your contributions will empower internal and external users to run AI workloads securely, efficiently, and predictably - regardless of the underlying hardware topology.

This role is critical to enabling AI infrastructure that is multi-tenant by design, scalable in practice, and abstracted for portability across diverse platforms.

KEY RESPONSIBILITIES

Design and implement infrastructure abstractions that cleanly separate logical compute units (vGPUs, GPU pods, AI queues) from physical hardware (nodes, devices, interconnects).
Develop runtime services, APIs, and control planes to expose GPU and accelerator resources to users and frameworks with multi-tenant isolation and QoS guarantees.
Architect systems for secure GPU sharing, including time-slicing, memory partitioning, and namespace isolation across tenants or jobs.
Collaborate with platform, orchestration, and scheduling teams to map logical resources to physical devices based on utilization, priority, and topology.
Define and enforce resource usage policies, including fair sharing, quota management, and oversubscription strategies.
Integrate with model training and serving frameworks (e.g., PyTorch, TensorFlow, Triton) to ensure smooth and predictable resource consumption.
Build observability and telemetry pipelines to trace logical-to-physical mappings, usage patterns, and performance anomalies.
Partner with infrastructure security teams to ensure secure onboarding, access control, and workload isolation in shared environments.
Support internal developers in adopting abstraction APIs, ensuring high performance while abstracting away low-level details.
Contribute to the evolution of internal compute platform architecture, with a focus on abstraction, modularity, and scalability.

Minimum Qualifications:

Bachelors + 15 years of related experience, or Masters + 12 years of related experience, or PhD + 8 years of related experience
Experience building scalable, production-grade infrastructure components or control planes using Go, Python, and C++,
Experience with Kubernetes, Docker or Kubevirt for virtualization, containerization, and orchestration frameworks
Experience designing or implementing logical resource abstractions for compute, storage, or networking with a focus in multi-tenant environments.
Experience integrating with AI/ML platforms or pipelines (e.g., PyTorch, TensorFlow, Triton Inference Server, MLFlow).

Preferred Qualifications:

Experience with GPU sharing, scheduling, or isolation techniques (e.g., MPS, MIG, time-slicing, device plugin frameworks, or vGPU technologies).
Solid grasp of resource management concepts including quotas, fairness, prioritization, and elasticity.

#WeAreCisco

#WeAreCisco where every individual brings their unique skills and perspectives together to pursue our purpose of powering an inclusive future for all.

Our passion is connection-we celebrate our employees' diverse set of backgrounds and focus on unlocking potential. Cisconians often experience one company, many careers where learning and development are encouraged and supported at every stage. Our technology, tools, and culture pioneered hybrid work trends, allowing all to not only give their best, but be their best.

We understand our outstanding opportunity to bring communities together and at the heart of that is our people. One-third of Cisconians collaborate in our 30 employee resource organizations, called Inclusive Communities, to connect, foster belonging, learn to be informed allies, and make a difference. Dedicated paid time off to volunteer-80 hours each year-allows us to give back to causes we are passionate about, and nearly 86% do!

Our purpose, driven by our people, is what makes us the worldwide leader in technology that powers the internet. Helping our customers reimagine their applications, secure their enterprise, transform their infrastructure, and meet their sustainability goals is what we do best. We ensure that every step we take is a step towards a more inclusive future for all. Take your next step and be you, with us!

Message to applicants applying to work in the U.S. and/or Canada:

When available, the salary range posted for this position reflects the projected hiring range for new hire, full-time salaries in U.S. and/or Canada locations, not including equity or benefits. For non-sales roles the hiring ranges reflect base salary only; employees are also eligible to receive annual bonuses. Hiring ranges for sales positions include base and incentive compensation target. Individual pay is determined by the candidate's hiring location and additional factors, including but not limited to skillset, experience, and relevant education, certifications, or training. Applicants may not be eligible for the full salary range based on their U.S. or Canada hiring location. The recruiter can share more details about compensation for the role in your location during the hiring process.

U.S. employees have to quality medical, dental and vision insurance, a 401(k) plan with a Cisco matching contribution, short and long-term disability coverage, basic life insurance and numerous wellbeing offerings.

Employees receive up to twelve paid holidays per calendar year, which includes one floating holiday (for non-exempt employees), plus a day off for their birthday. Non-Exempt new hires accrue up to 16 days of vacation time off each year, at a rate of 4.92 hours per pay period. Exempt new hires participate in Cisco's flexible Vacation Time Off policy, which does not place a defined limit on how much vacation time eligible employees may use, but is subject to availability and some business limitations. All new hires are eligible for Sick Time Off subject to Cisco's Sick Time Off Policy and will have eighty (80) hours of sick time off provided on their hire date and on January 1st of each year thereafter. Up to 80 hours of unused sick time will be carried forward from one calendar year to the next such that the maximum number of sick time hours an employee may have available is 160 hours. Employees in Illinois have a unique time off program designed specifically with local requirements in mind. All employees also have access to paid time away to deal with critical or emergency issues. We offer additional paid time to volunteer and give back to the community.

Employees on sales plans earn performance-based incentive pay on top of their base salary, which is split between quota and non-quota components. For quota-based incentive pay, Cisco typically pays as follows:

.75% of incentive target for each 1% of revenue attainment up to 50% of quota;

1.5% of incentive target for each 1% of attainment between 50% and 75%;

1% of incentive target for each 1% of attainment between 75% and 100%; and once performance exceeds 100% attainment, incentive rates are at or above 1% for each 1% of attainment with no cap on incentive compensation.

For non-quota-based sales performance elements such as strategic sales objectives, Cisco may pay up to 125% of target. Cisco sales plans do not have a minimum threshold of performance for sales incentive compensation to be paid.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Job Details

Share