AI Operations & Infrastructure Engineer

Fort Meade, MD, US • Posted 4 days ago • Updated 4 hours ago

Full Time

On-site

Job Details

Skills

Scheduling
Machine Learning (ML)
Network Protocols
Data Storage
Data Centers
Technical Support
Collaboration
BMC
Total Productive Maintenance
TPM
Servers
Information Governance
Operating Systems
Benchmarking
Cabling
Firmware
Network
Computer Hardware
Reporting
Data Processing
Stacks Blockchain
Orchestration
Docker
Kubernetes
GPU
MIG
Management
Storage
Artificial Intelligence
Computer Networking
InfiniBand
Ethernet
Security Clearance
Continuous Integration

Summary

Title: AI Operations & Infrastructure Engineer

Location: Fort Meade, MD

Clearance: TS/SCI with a CI Polygraph

Job Details:

Manage and maintain AI computing platforms, including GPUs and other specialized hardware
Install and configure GPU drivers and software
Oversee the AI software stack and tools
Implement and manage containerization technologies like Docker and Kubernetes
Configure and optimize networking infrastructure for AI workloads, including InfiniBand and Ethernet
Manage storage solutions for AI data, considering performance and capacity requirements
Deploy and manage data processing units (DPUs) to accelerate data center workloads
Monitor and manage AI cluster health and resource utilization
Implement workload management and scheduling tools like Slurm and Kubernetes
Ensure efficient power and cooling for AI infrastructure to maintain optimal operating conditions
Configure high-performance networking solutions for AI and machine learning workloads
Optimize network performance to ensure maximum throughput and minimal latency for AI computations
Implement and fine-tune network protocols to enhance data transfer speeds and efficiency
Integrate NVIDIA networking products with existing AI infrastructure, including servers, GPUs, and storage systems
Deploy networking solutions in data centers to ensure seamless connectivity between AI components
Diagnose and resolve networking issues impacting AI workloads to maintain optimal system performance
Provide technical support and guidance to teams managing AI infrastructure
Collaborate with data scientists, researchers, and IT professionals to understand networking requirements and challenges
Lead deployment and validation of servers and systems for AI-enabled platforms
Configure and manage network topologies, BMC, OOB, TPM, power, and cooling
Install, upgrade, and validate GPU-based servers, BlueField DPUs, cables, and transceivers
Perform firmware upgrades, hardware validation, and storage setup
Configure and administer physical and logical resources, including MIG partitioning and BlueField platforms
Install and configure operating systems, cluster software, drivers, containers (Docker), and NCLI
Manage and orchestrate clusters using NVIDIA Base Command Manager, Slurm, Pyxis, Enroot, and Run:ai
Perform stress, benchmarking, and burn-in tests using HPL, NCCL, NVIDIA NeMo, and ClusterKit
Verify cabling, firmware/software versions, and network signal quality
Troubleshoot and resolve hardware, software, storage, and performance faults
Replace faulty components and optimize systems for AMD/Intel platforms
Monitor, document, and report on cluster health, resource usage, and job performance
Ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure, including user access and workload management

Requirements:

Qualified candidates must hold an active NVIDIA Professional Certification in AI Networking, AI Infrastructure, or AI Operations
Prior direct, hands-on professional experience administering NVIDIA GPU and data processing unit (DPU) technologies, AI software stacks, and data center environments for high-performance AI workloads
Comprehensive expertise in deploying and maintaining AI compute platforms, including proficiency in containerization and workload orchestration using Docker, Kubernetes, Slurm, NVIDIA Base Command Manager, and Run:ai
Must be capable of configuring physical and logical resources, including Multi-Instance GPU (MIG) partitioning and BlueField platforms, while overseeing critical facility elements such as power, cooling, and storage solutions
Demonstrated advanced skills in AI networking, specifically configuring and optimizing high-performance InfiniBand and Ethernet fabrics to ensure maximum throughput and minimal latency
Current active TS/SCI clearance with a CI Polygraph

Equal Opportunity Employer/Veterans/Disabled

Dice Id: 90789821
Position Id: 60d4cd2ec352d15732c45dffc32f90ab

More jobs at Invictus International Consulting in Fort Meade, MD