Systems / ML Engineer

Overview

Remote
$85 - $100
Contract - W2
Contract - 12 Month(s)

Skills

PyTorch
Python
Machine Learning (ML)
FSDP
DDP
Distributed Data Parallel
Fully Sharded Data Parallel
Model Architecture
GPU
Distributed Systems
LLM
Large Language Models
AI
Artificial Intelligence
ML Pipelines
Machine Learning Pipelines
ML Model Training
Model Training
Distributed Training
Open Source
Deep Learning
GitHub

Job Details

Job Description:

  • Responsibilities include developing deep learning libraries that support large-scale distributed training, open-sourcing high-quality code and reproducible results for the community, and bringing the latest research to Client products that connect billions of users.
  • The chosen candidate will work with a diverse, highly interdisciplinary team of scientists, engineers, and cross-functional partners, and will have access to cutting-edge technology, resources, and research facilities.

Responsibilities:

  • Design, implement, and improve highly scalable machine learning systems and tools that enable research.
  • Apply knowledge of relevant research domains, along with expert coding skills, to platform and framework development projects.
  • Write clean and robust machine learning code.

Minimum Qualifications:

  • Degree in Computer Science, Computer Engineering, or relevant technical field.
  • 3+ years of experience with deep learning.
  • Experience developing machine learning algorithms or machine learning infrastructure in Python or C/C++.

Preferred Qualifications:

  • Demonstrated software engineering experience via work experience, coding competitions, or widely used contributions in open-source repositories (e.g., GitHub).
  • Experience in open-source development.

Must-Have Hard Skills:

  • PyTorch
  • Machine Learning
  • Python

Nice-to-have Skills:

  • Distributed training for ML models.
  • Building open-source libraries for machine learning.
  • Experience with machine learning research, including publishing papers.
  • Experience with large-scale model training in PyTorch (essential).

Interview (1 hour): Mostly technical, focused on experience with distributed training: how DDP and FSDP work, the different parallelism techniques for scaling models, their tradeoffs and when to use each, back-of-the-envelope calculations of memory and throughput requirements, and so on.
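
As an illustration of the kind of back-of-the-envelope memory calculation mentioned above, here is a minimal sketch in Python. It is not part of the interview material. It assumes mixed-precision training with Adam at roughly 16 bytes per parameter (2 for half-precision weights, 2 for half-precision gradients, 12 for fp32 master weights plus the two Adam moments), and it ignores activations, communication buffers, and fragmentation. The function name, the 7-billion-parameter model, and the 8-GPU setup are hypothetical.

```python
def training_state_gib(num_params, num_gpus=1, shard=False, bytes_per_param=16):
    """Rough per-GPU memory (GiB) for weights, gradients, and optimizer state.

    Assumption: ~16 bytes/param for mixed-precision Adam. Activations,
    buffers, and fragmentation are deliberately ignored.
    """
    total_bytes = num_params * bytes_per_param
    if shard:
        # FSDP-style full sharding: each GPU holds about 1/num_gpus of the state.
        total_bytes /= num_gpus
    # DDP-style replication (shard=False): every GPU holds the full state.
    return total_bytes / 2**30


# Hypothetical example: a 7-billion-parameter model on 8 GPUs.
params = 7e9
replicated = training_state_gib(params, num_gpus=8, shard=False)
sharded = training_state_gib(params, num_gpus=8, shard=True)
print(f"replicated (DDP-style): {replicated:.1f} GiB per GPU")
print(f"fully sharded (FSDP-style): {sharded:.1f} GiB per GPU")
```

Under these assumptions, the replicated state alone exceeds the memory of a single 80 GB accelerator, while the fully sharded state does not. That gap is the usual starting point for a DDP versus FSDP tradeoff discussion.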

Years of Experience: 5-10 Years

Degrees/Certifications Required: Computer Science / Engineering.
