Direct Client Requirement: Sr. Data Scientist with VLM

Overview

Work Site: On Site
Compensation: Depends on Experience
Employment Type: Full Time

Skills

Data Scientist
VLM
Vision-Language Models
Python
ML
PyTorch
VSS

Job Details

Role: Sr. Data Scientist with VLM
Duration: Full-Time + Benefits
Location: San Ramon, CA or Milwaukee, WI (Onsite)

What is in it for you?

If you're a Senior Data Scientist with a strong background in Vision-Language Models (VLMs), this is a chance to lead the charge in building smart, scalable multimodal AI solutions. We're looking for someone who's worked hands-on with cutting-edge frameworks like VILA, Isaac, and VSS, and who knows how to take models from concept to production in real-world settings. If you've got experience in healthcare, especially with medical devices, that's a big plus. You'll be diving into the latest VLM techniques and deploying them on cloud platforms like AWS, helping shape the future of AI in a meaningful, impactful way.

Key Responsibilities

VLM Development, Pose Estimation & Deployment:

  • Design, train, and deploy efficient Vision-Language Models (e.g., VILA, Isaac Sim) for multimodal applications including image captioning, visual search, document understanding, pose understanding, and pose comparison.
  • Develop and manage Digital Twin frameworks using AWS IoT TwinMaker, SiteWise, and Greengrass to simulate and optimize real-world systems.
  • Develop Digital Avatars using AWS services integrated with 3D rendering engines, animation pipelines, and real-time data feeds.
  • Explore cost-effective methods such as knowledge distillation, modal-adaptive pruning, and LoRA fine-tuning to optimize training and inference.
  • Implement scalable pipelines for training and testing VLMs on cloud platforms (AWS services such as SageMaker, Bedrock, Rekognition, Comprehend, and Textract).

NVIDIA Platforms:

  • Develop a blend of technical expertise, tool proficiency, and domain-specific knowledge across the NVIDIA platforms below:
  • NIM (NVIDIA Inference Microservices): Containerized VLM deployment.
  • NeMo Framework: Training and scaling VLMs across thousands of GPUs.
  • Supported Models: LLaVA, LLaMA 3.2, Nemotron Nano VL, Qwen2-VL, Gemma 3.
  • DeepStream SDK: Integrates pose models such as TRTPose and OpenPose; supports real-time video analytics and multi-stream processing.

Multimodal AI Solutions:

  • Develop solutions that integrate vision and language capabilities for applications like image-text matching, visual question answering (VQA), and document data extraction.
  • Leverage interleaved image-text datasets and advanced techniques (e.g., cross-attention layers) to enhance model performance.

Image Processing and Computer Vision:

  • Develop solutions that integrate vision-based deep learning models for applications such as live video streaming integration and processing, object detection, image segmentation, pose estimation, object tracking, image classification, and defect detection on medical X-ray images.
  • Knowledge of real-time video analytics, multi-camera tracking, and object detection.
  • Train and test deep learning models on customized data.

Healthcare Domain Expertise (Nice to Have):

While it's not a must, experience in the healthcare space, especially with medical imaging, motion detection, or patient monitoring, can be a big advantage.

  • You'll be applying Vision-Language Models to use cases like analyzing scans, detecting positioning and movement, and making precise measurements.
  • If you're familiar with healthcare standards and know how to handle sensitive data responsibly, that's a definite plus.

Efficiency Optimization:

  • Evaluate trade-offs between model size, performance, and cost using techniques like elastic visual encoders or lightweight architectures.
  • Benchmark different VLMs (e.g., GPT-4V, Claude 3.5, Nova Lite) for accuracy, speed, and cost-effectiveness on specific tasks.
  • Benchmark inference performance on GPU vs. CPU.

Collaboration & Leadership:

  • Collaborate with cross-functional teams including engineers and domain experts to define project requirements.
  • Mentor junior team members and provide technical leadership on complex projects.

Qualifications

  • Education: Master's or Ph.D. in Computer Science, Data Science, Machine Learning, or a related field.

Experience:

  • 10+ years of experience in Machine Learning or Data Science roles with a focus on Vision-Language Models.
  • Proven expertise in deploying production-grade multimodal AI solutions.
  • Experience with self-driving cars and self-navigating robots.

Technical Skills:

  • Proficiency in Python and ML frameworks (e.g., PyTorch, TensorFlow).
  • Hands-on experience with VLM frameworks such as VILA, Isaac Sim, or VSS.
  • Familiarity with cloud platforms like AWS SageMaker or Azure ML Studio for scalable AI deployment.
  • Image processing: OpenCV, PIL, scikit-image
  • Frameworks: PyTorch, TensorFlow, Keras
  • GPU acceleration: CUDA, cuDNN
  • 3D vision: point clouds, depth estimation, LiDAR

Domain Knowledge (A Valuable Bonus):

It's helpful if you've got a solid grasp of medical datasets, especially imaging data, and an understanding of healthcare regulations.

Knowing how to navigate the complexities of clinical data and compliance can really elevate your impact in this role.

Preferred Technologies

  • Vision-Language Models: VILA, Isaac Sim, EfficientVLM
  • Cloud Platforms: AWS SageMaker, Bedrock
  • Optimization Techniques: LoRA fine-tuning, modal-adaptive pruning
  • Multimodal Techniques: Cross-attention layers, interleaved image-text datasets
  • MLOps Tools: Docker, MLflow

Best Regards,
Mohd Suhaib
Cardinal Integrated Technologies
Ph:
Email:
