Overview
Location: Los Angeles (Remote / On Site)
Compensation: $225k - $275k
Employment Type: Full Time
Job Description
This is a full-time opportunity based in Los Angeles (hybrid or remote optional for the right candidate) with a stealth-mode startup building foundational AI technology for the connected home. The team is rethinking what it means for hardware, software, and machine learning to work together at the OS level, delivering intelligent, real-time responsiveness through multimodal input (voice, video, and context). The tech stack is cutting-edge, and this role centers on applied machine learning at the intersection of speech, audio, and video understanding.
Join a world-class team of engineers, scientists, and designers reimagining the interface between people and their environments. In this role, you'll help build a system that doesn't just respond; it anticipates. This is an incredible opportunity for an Audio ML or Intent Recognition expert to shape a product that integrates AI into daily life in a meaningful, consumer-facing way. The role offers ownership, autonomy, and the chance to work with advanced ML architectures, foundational model development, and real-time sequence processing.
Required Skills & Experience
4+ years of experience in applied ML focused on audio, speech, or intent classification
Strong foundation in training sequence models (RNNs, Transformers, BERT, Whisper, CLIP, Wav2Vec, etc.)
Experience working with speech/audio ML, embedded systems, or multimodal inputs
Proven track record of building and training models from scratch, not just relying on APIs
Deep familiarity with tools like PyTorch, TensorFlow, Hugging Face Transformers, torchaudio, librosa, OpenCV, etc.
Desired Skills & Experience
Experience with signal processing, voice activity detection, audio embeddings, and speaker identification
Comfort with model evaluation, real-time inference, labeling, augmentation, and tuning
Background with visual sequence modeling, action recognition, or video context detection
Prior experience in environments like OpenAI, DeepMind, Amazon Alexa, Dolby, Sonos, Roku, AssemblyAI, etc., is a strong plus
Familiarity with ONNX, FFmpeg, and NVIDIA Triton is also welcome
What You Will Be Doing
Tech Breakdown
Audio ML and signal processing
Intent recognition and sequence classification
Multimodal fusion (audio + video input modeling)
System optimization and integration
Daily Responsibilities
80%: Hands-on model development, training, and tuning
20%: Cross-functional team collaboration, experimentation, and architecture discussions
The Offer
Equity eligible
You will receive the following benefits:
Medical, Dental, and Vision Insurance
Paid Vacation & Holidays
Generous Equity Package
Applicants must be currently authorized to work in the US on a full-time basis now and in the future.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.