Job Details
Experience designing large-scale distributed platforms and/or systems in cloud environments such as AWS or Azure.
Experience architecting cloud systems for security, availability, performance, scalability, and cost.
Experience delivering very large models through the MLOps life cycle, from exploration to serving.
Experience building GPU clusters in the public cloud with tightly coupled storage and networking.
Experience with the complete stack for distributed training of large models, including Client compilers, distributed training frameworks, and Client development frameworks such as PyTorch, TensorFlow, and Lightning.
Experience with one or more areas of the AI technology stack, including prompt engineering, guardrails, vector databases/knowledge bases, and LLM hosting and fine-tuning.
Authored research publications in top peer-reviewed conferences, or made industry-recognized contributions, in the areas of neural networks, distributed training, and SysML.
9+ years of professional experience.
7+ years of experience programming in Python or R.
5+ years of experience with Natural Language Processing (NLP) and Large Language Models (LLMs).
3+ years of experience building and maintaining scalable API solutions.