AI Engineer - Audio/Speech

Remote • Posted 1 day ago • Updated 19 hours ago
Full Time
Remote
$180,000 - $210,000/yr

Job Details

Skills

  • Audio
  • Speech
  • Multimodal learning
  • Large Audio Language Models
  • Whisper
  • DeepSpeed
  • Wav2Vec 2.0
  • HuBERT
  • EnCodec
  • SoundStream

Summary

Key Responsibilities:

A Ph.D. is a must-have for this role.

  • Design, develop, and deploy Large Audio Language Models (LALMs) capable of native audio understanding, reasoning, and generation.
  • Build Large Audio Reasoning Models that perform complex chain-of-thought reasoning over speech and audio inputs, including medical, technical, and conversational domains.
  • Contribute to Speech-to-Speech (S2S) system development, including speech understanding, dialogue management, and speech synthesis components.
  • Research and implement alignment mechanisms between speech encoders and LLM backbones using lightweight adapters, LoRA, and efficient fine-tuning strategies.
  • Design efficient speech tokenization and temporal compression techniques suitable for long-form audio reasoning and multi-turn spoken dialogue.
  • Build comprehensive evaluation frameworks for audio reasoning capabilities, including benchmarks for speech QA, audio understanding, and reasoning accuracy.
  • Optimize inference pipelines for low-latency, streaming applications in speech systems.
  • Collaborate with cross-functional teams to transfer research innovations into production systems and customer-facing applications.
  • Contribute to technical documentation, research write-ups, and publications at top-tier venues (NeurIPS, ICML, ACL, Interspeech).
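The adapter-based alignment mentioned above (connecting a speech encoder to an LLM backbone with lightweight LoRA-style modules) can be sketched in a few lines of PyTorch. This is an illustrative toy, not this team's architecture: the module names, dimensions (768-dim encoder frames, 2048-dim LLM embeddings), and rank are hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA-style) update."""
    def __init__(self, in_dim, out_dim, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)   # backbone projection stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, out_dim, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)  # zero-init: starts as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class SpeechToLLMAdapter(nn.Module):
    """Projects speech-encoder frames into the LLM's embedding space."""
    def __init__(self, speech_dim=768, llm_dim=2048, rank=8):
        super().__init__()
        self.proj = LoRALinear(speech_dim, llm_dim, rank=rank)

    def forward(self, speech_feats):   # (batch, frames, speech_dim)
        return self.proj(speech_feats)  # (batch, frames, llm_dim)

adapter = SpeechToLLMAdapter()
frames = torch.randn(2, 50, 768)  # 2 utterances, 50 encoder frames each
tokens = adapter(frames)
print(tokens.shape)  # torch.Size([2, 50, 2048])
```

Only the two small LoRA matrices train; everything else is frozen, which is what makes this kind of alignment cheap relative to full fine-tuning.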

Minimum Qualifications

  • Master's degree (required) or Ph.D. (preferred) in Computer Science, Electrical Engineering, or a related field with a focus on speech, audio ML, or multimodal learning.
  • 2+ years of industry or applied research experience in speech/audio AI, Large Language Models, or multimodal systems.
  • Demonstrated applied research contributions through publications, patents, or shipped products in speech/audio AI or LLMs.
  • Strong proficiency in Python and PyTorch, with hands-on experience in GPU-accelerated training for large-scale models.
  • Solid understanding of speech and audio signal processing, acoustic modeling, and audio representations.
  • Working knowledge of modern LLM architectures (Transformers, SSMs) and training paradigms including instruction tuning and alignment methods.
  • Familiarity with modality alignment techniques: adapter-based integration, cross-modal attention, or audio-text fusion methods.
  • Strong experimentation habits: clean code, systematic ablations, reproducibility, and clear technical communication.

Preferred Qualifications

  • Publication record at top-tier venues (NeurIPS, ICML, ICLR, ACL, Interspeech, ICASSP) in audio language models, speech reasoning, or multimodal learning.
  • Hands-on experience building or fine-tuning Large Audio Language Models (e.g., Qwen-Audio, SALMONN, LTU, Gemini Audio).
  • Experience with speech representation pretraining (HuBERT, Wav2Vec 2.0, Whisper, WavLM) and discrete speech tokenization.
  • Familiarity with Speech-to-Speech components: neural audio codecs (EnCodec, SoundStream), vocoders, or speech synthesis systems.
  • Experience with audio reasoning benchmarks (AIR-Bench, MMAU, AudioBench) or building evaluation harnesses for audio QA.
  • Hands-on experience with distributed training (FSDP, DeepSpeed) and inference optimization (ONNX, TensorRT, quantization).
  • Familiarity with speech frameworks such as ESPnet, SpeechBrain, NVIDIA NeMo, or Fairseq.
  • Experience with multilingual speech systems, code-switching, or domain adaptation for specialized applications (medical, legal, technical).
  • Background in evaluating safety, bias, hallucination, or adversarial robustness in audio language models.
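The evaluation-harness work referenced above (benchmarks such as AIR-Bench or AudioBench, and audio QA scoring) often bottoms out in simple normalized exact-match metrics. A minimal, stdlib-only sketch — field names and normalization rules here are illustrative assumptions, not any benchmark's official scorer:

```python
import re
from dataclasses import dataclass

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for matching."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

@dataclass
class QAExample:
    question: str
    reference: str   # gold answer
    prediction: str  # model answer

def exact_match_accuracy(examples):
    """Fraction of predictions that exactly match the reference after normalization."""
    if not examples:
        return 0.0
    hits = sum(normalize(e.prediction) == normalize(e.reference) for e in examples)
    return hits / len(examples)

examples = [
    QAExample("What instrument is playing?", "a violin", "A violin."),
    QAExample("How many speakers are there?", "two", "three"),
]
print(exact_match_accuracy(examples))  # 0.5
```

Real harnesses layer tolerant matching (synonyms, numeric equivalence) and per-category breakdowns on top of a core loop like this.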

Technical Environment

  • Core: PyTorch, CUDA, torchaudio/librosa, Hugging Face Transformers
  • LLM Stack: Large language model backbones, lightweight adapters (LoRA, Q-Former), instruction tuning pipelines
  • Audio Models: Neural audio codecs, speech encoders, vocoders, discrete speech tokenizers
  • Infrastructure: Modern GPU clusters, experiment tracking (Weights & Biases), distributed training frameworks
  • Deployment: FastAPI/gRPC for services, ONNX/TensorRT for optimized inference
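The discrete speech tokenizers and neural audio codecs listed above (EnCodec, SoundStream) are built on residual vector quantization: each stage quantizes the residual left by the previous stage, so every frame becomes a short tuple of codebook indices. A toy sketch of the encoding step — codebook sizes and dimensions are arbitrary, and real codecs learn the codebooks rather than sampling them randomly:

```python
import torch

def rvq_encode(x, codebooks):
    """Residual vector quantization: one discrete code per frame per stage."""
    residual = x
    codes = []
    quantized = torch.zeros_like(x)
    for cb in codebooks:                        # cb: (codebook_size, dim)
        dists = torch.cdist(residual, cb)       # distance to every codeword
        idx = dists.argmin(dim=-1)              # nearest codeword per frame
        chosen = cb[idx]
        codes.append(idx)
        quantized = quantized + chosen          # running reconstruction
        residual = residual - chosen            # next stage sees what's left
    return torch.stack(codes, dim=-1), quantized

torch.manual_seed(0)
codebooks = [torch.randn(16, 8) for _ in range(2)]  # 2 stages, 16 codes, dim 8
frames = torch.randn(5, 8)                          # 5 frames of 8-dim features
codes, recon = rvq_encode(frames, codebooks)
print(codes.shape)  # torch.Size([5, 2]) — two codes per frame
```

Stacking more stages shrinks the residual, which is how these codecs trade token count against reconstruction fidelity.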

What We Offer

  • Competitive compensation package with comprehensive benefits
  • Opportunity to work on cutting-edge Large Audio Language Models and audio reasoning research with real-world impact
  • Collaboration with experienced applied scientists and engineers in speech and multimodal AI
  • Support for publications at top-tier conferences and professional development
  • Access to state-of-the-art GPU infrastructure for training large-scale audio models
  • Dice Id: 91132139
  • Position Id: 8959084
