What You’ll Do:
· Design, build, and scale ML-powered inference systems that process large volumes of text, image, and video data to power news-based intelligence products.
· Productionize and optimize state-of-the-art models and inference pipelines. These models include, but are not limited to:
o DistilBERT for Named Entity Recognition (NER) over hundreds of thousands of search queries per day
o TransNetV2 for video shot boundary detection at scale, for both archival and real-time video
o SBERT for embedding generation from textual descriptions
o External multimodal APIs for image/video captioning
· Support hybrid search architectures by defining embedding/re-ranking interfaces, evaluation metrics, and inference performance requirements; partner with search/platform engineers on index configuration, sharding, and cluster tuning.
· Design and implement scalable data processing pipelines across hybrid CPU/GPU environments to handle millions of media assets.
· Partner with MLOps and platform engineering to deploy and operate ML systems reliably, contributing to:
o Distributed inference architectures
o Cloud-based execution (e.g., AWS EC2, Batch, Lambda, SageMaker)
o Efficient resource utilization across workloads
· Optimize inference latency and throughput across distributed workloads using cloud-based resources (e.g., AWS EC2, Batch, Lambda, SageMaker).
· Build resilient asynchronous processing systems for large-scale workloads, ensuring:
o Reliability (retries, fault tolerance)
o Efficiency (caching, deduplication)
o Observability (metrics, logging, traceability)
· Work closely with data scientists and product teams to iterate on models, improve performance, and deliver measurable impact in production.
Requirements:
· 8+ years of experience building production ML inference systems.
· Demonstrated ownership of deep-learning inference optimization in production (quantization, distillation, compilation, kernel/profile-level performance work) for transformer NLP and/or CV models.
· Experience with both TensorFlow (SavedModel, tf.data, XLA, TFLite) and PyTorch (TorchScript, ONNX, FastAPI/TorchServe).
· Hands-on experience optimizing inference pipelines on AWS infrastructure, ideally across different types of media assets.
· Experience with video frameworks/tools (e.g., FFmpeg) and with large-scale frame-level inference.
· Demonstrated experience monitoring and debugging model latency, memory, and pipeline throughput.
· Experience with hybrid search architectures (BM25 + vector search + cross-encoder reranking).
· Familiarity with OpenAI APIs or other foundation model providers.
· Familiarity with open-source LLMs from the Hugging Face ecosystem.
· Experience with data pipeline and workflow orchestration tools (e.g., Airflow).
Who This Role Is Not For:
Candidates whose primary background is MLOps platform work (DAG orchestration, Terraform, Kubernetes administration, generic CI/CD pipelines) will not be a fit. We need a senior-level engineer who can profile a transformer, rewrite its serving path for a 2–3x latency reduction, tune an HNSW index, and tell us which SageMaker instance type will hit our p95 target at the lowest cost.