Cloud Infra Control Plane Service Engineering Architect

Overview

Remote
Up to $85
Contract - W2
Contract - 12 Month(s)
No Travel Required

Skills

NVIDIA
SuperPod

Job Details

Job Title: Cloud Infra Control Plane Service Engineering Architect

Location: Remote; candidates in the Bay Area or Seattle will be prioritized.

Duration: 6+ months

The project centers on implementing an NVIDIA SuperPOD. Candidates with that experience will have a major advantage.

Key Responsibilities:

Infrastructure Management:

  • Manage and monitor compute clusters, ensuring high availability and performance.
  • Implement and maintain automation scripts for infrastructure provisioning and management.

Design and Implementation:

  • Design, implement, and maintain compute services for both GPU and non-GPU environments.
  • Develop and optimize algorithms for high-performance computing tasks, especially in the AI/ML training and inference domain.

Performance Optimization:

  • Analyze and optimize the performance of compute workloads.
  • Implement best practices for resource utilization and efficiency.

Collaboration:

  • Work closely with data scientists, researchers, and other engineering teams to understand and meet their compute requirements.
  • Collaborate with hardware vendors to evaluate and integrate new technologies.

Security and Compliance:

  • Ensure that compute services comply with security policies and industry standards.
  • Implement and maintain security measures to protect data and compute resources.

Troubleshooting and Support:

  • Provide support for compute-related issues, including debugging and resolving hardware and software problems.
  • Develop and maintain documentation for troubleshooting procedures and best practices.

Continuous Improvement:

  • Stay updated with the latest advancements in compute technologies and integrate them into the infrastructure.
  • Continuously improve the reliability, scalability, and performance of compute services.

Qualifications:

Education:

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • NVIDIA and AI certification.

Experience:

  • Years of experience managing on-premises GPU or non-GPU systems.
  • Proven experience in managing and optimizing GPU and non-GPU compute environments.
  • Skills in building and operating AI infrastructure.
  • Experience with high-performance computing (HPC) and parallel processing, including bare-metal and large-scale virtual environments.
  • Experience implementing virtualization architectures with Kubernetes distributions such as OpenShift or Rancher, and cloud technologies in bare-metal environments.
  • Proficiency in hardware technologies such as SR-IOV, DPU, and GPU, with proven experience in implementing these technologies in virtualized and containerized environments.

Technical Skills:

  • Proficiency in programming languages such as Python, C++, or similar.
  • Experience with infrastructure as code (IaC) tools like Terraform, Ansible, or similar.
  • Familiarity with containerization and orchestration tools like Docker and Kubernetes.
  • Familiarity with the technologies underlying Kubernetes, including CRI, CSI, CNI, Operators, the GPU device plugin, and RDMA/InfiniBand integration.
  • Knowledge of cloud platforms (AWS, Azure, Google Cloud Platform) and their compute services.

Soft Skills:

  • Strong problem-solving skills and attention to detail.
  • Excellent communication and collaboration skills.
  • Ability to work in a fast-paced, dynamic environment.

Siva Kumar. CH



About Xoriant Corporation