MUST HAVE SKILLS:
- Configuration Management
- GPGPU/GPU
- Hardware Troubleshooting
- Infrastructure & Operations
- Infrastructure Automation and Orchestration
- Linux Administration
Description:
- Join our team to operate and support cutting-edge GPU infrastructure powering AI and high-performance computing workloads for a leading global hyperscale cloud provider. In this hands-on role, you'll manage the full lifecycle of NVIDIA GPU platforms, from bring-up to break/fix, while ensuring optimal performance for advanced AI applications.
- At EPAM, you'll work on cutting-edge technologies, solve complex challenges, and shape the future of digital innovation. With access to continuous learning, mentorship, and global projects, your expertise will drive meaningful change.
Responsibilities:
- Operate and maintain production GPU and bare-metal compute platforms with hands-on hardware management
- Perform physical infrastructure tasks including rack/stack, cabling, power validation, and system bring-up
- Diagnose hardware faults, replace failed components, and coordinate vendor support for complex issues
- Install and configure Linux operating systems with GPU-specific drivers and software stacks
- Execute platform validation using diagnostic tools to ensure GPU health, stability, and performance
- Provision bare-metal systems through automated workflows and troubleshoot configuration issues
- Apply firmware, BIOS, and platform configuration changes following standardized change processes
Requirements:
- 5+ years of professional experience supporting production server infrastructure in data center environments
- Strong Linux administration skills with the ability to independently troubleshoot system-level issues
- Hands-on experience with physical server hardware including diagnostics and component replacement
- Familiarity with GPU platforms, preferably NVIDIA, and associated drivers and software stacks
- Experience working in structured, change-controlled production environments
- Knowledge of infrastructure monitoring tools and alert response procedures
- Excellent communication skills and the ability to collaborate across operations and engineering teams
Location: On-site position in the Greater Seattle/Redmond area requiring regular hands-on access to hardware in lab or data center environments.