Job Title: Software Engineer III
Location: Redmond, WA
Duration: 1+ Year Contract
i. Manage and evolve the high-performance compute infrastructure used by research scientists for deep learning model training
ii. Help research scientists troubleshoot issues related to model training (infrastructure, performance, etc.). The candidate therefore needs a basic understanding of machine learning and, more specifically, deep learning (gradient descent, stochastic gradient descent, online learning); of which hardware, software, and software frameworks are used to speed up model training and inference, and why (everything already listed in the requirements, plus TensorFlow, PyTorch, CUDA, GPUs, NVIDIA DGX); and of the software used to cluster that hardware (Docker, Docker Swarm, Slurm, Kubernetes).
- Basic understanding of machine learning and, more specifically, of the computational techniques widely used in deep learning (gradient descent, stochastic gradient descent)
- Understanding of the hardware and software frameworks used to define and train deep learning models (TensorFlow and/or PyTorch, CUDA)
- Experienced C/C++, Python, Ruby software developer
- Expert level knowledge of Linux-based systems and cluster management
- High-speed network performance profiling and optimization
- Advanced understanding of Linux containers
- Advanced knowledge of cluster resource managers such as Slurm, Kubernetes, and Docker Swarm
- Previous experience with MPI and InfiniBand is very welcome
- Bachelor's degree in Computer Science, Mathematics, or a related field, or 5 years of relevant experience
- SOFTWARE ENGINEER
- HIGH PERFORMANCE COMPUTING