Job Description
We are seeking an experienced consultant to lead the design and deployment of a secure, on-premises Large Language Model (LLM) solution with integrated vector-database and Retrieval-Augmented Generation (RAG) capabilities. The ideal candidate brings deep hands-on expertise across the full stack, from model deployment and inference optimization to enterprise security and knowledge transfer.
Core Experience
The consultant must have demonstrated experience deploying open-source LLMs such as Meta Llama 3 and Mistral/Mixtral within on-premises or private infrastructure. Strong Python proficiency is essential, particularly for LLM inference pipelines, prompt engineering, and system integration. The role also requires expertise in CPU-based inference strategies, model quantization, and performance tuning to ensure efficient operation in resource-constrained environments.
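To illustrate the quantization expertise this role calls for, here is a minimal, purely illustrative sketch of symmetric 8-bit weight quantization in plain Python. All function names are hypothetical; a real CPU deployment would instead rely on established quantized formats and runtimes (for example, GGUF models served with llama.cpp), which apply far more sophisticated schemes.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats into [-127, 127] with a single scale.
    Illustrative sketch only -- production runtimes use per-block scales and
    mixed-precision schemes."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.64, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# each restored value approximates the original within one quantization step
```

The memory saving is the point: each weight drops from 4 bytes (float32) to 1 byte plus a shared scale, at the cost of bounded rounding error.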
Vector Databases & RAG
Candidates must have practical, production-level experience with open-source vector databases such as Qdrant, Chroma, Milvus, or pgvector. A strong track record of designing and implementing end-to-end RAG pipelines is required, along with expertise in embedding generation and management and in metadata filtering to support accurate, efficient semantic retrieval.
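The retrieval step of such a RAG pipeline can be sketched as a toy in-memory search: rank stored chunks by cosine similarity to the query embedding, after pre-filtering on metadata. This is a hypothetical pure-Python sketch for illustration; a production system would use one of the vector databases named above and a learned embedding model rather than hand-written vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, docs, top_k=2, metadata_filter=None):
    """Return the top_k chunks most similar to query_vec,
    optionally restricted to chunks whose metadata matches the filter."""
    candidates = [d for d in docs
                  if metadata_filter is None
                  or all(d["meta"].get(k) == v for k, v in metadata_filter.items())]
    return sorted(candidates,
                  key=lambda d: cosine(query_vec, d["vec"]),
                  reverse=True)[:top_k]

docs = [
    {"id": 1, "vec": [0.9, 0.1], "meta": {"dept": "hr"}},
    {"id": 2, "vec": [0.1, 0.9], "meta": {"dept": "eng"}},
    {"id": 3, "vec": [0.8, 0.2], "meta": {"dept": "eng"}},
]
hits = retrieve([1.0, 0.0], docs, top_k=1, metadata_filter={"dept": "eng"})
# only "eng" chunks are scored; chunk 3 is closest to the query vector
```

In a real pipeline the retrieved chunks would then be inserted into the LLM prompt as grounding context; the metadata pre-filter is what keeps retrieval both accurate (scoped to relevant documents) and efficient (fewer vectors to score).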