Offline

Self-Hosted Small Models: From OpenAI Lock-In to Open Models

Track: Machine Learning, NLP and CV
Type: Talk (long session)
Level: Intermediate
Duration: 45 minutes

Abstract

OpenAI's embedding API is the default for most Python RAG applications, but it creates real problems at scale: data privacy (every document leaves your network), cost (per-token pricing compounds fast), no customization (cannot fine-tune or swap models), and no quality validation. Most teams have never measured whether text-embedding-3-large actually works well for their domain.

Open embedding models like BGE-M3, GTE-Qwen2, and E5 now match or beat OpenAI on MTEB benchmarks. The problem is no longer model quality; it is infrastructure: running these models with proper batching, memory management, and quality assurance.

This talk walks through a practical migration. We start from a typical LangChain app calling OpenAI, switch to self-hosted open models using SIE (an open-source inference server with an OpenAI-compatible endpoint), validate that retrieval quality is maintained using MTEB, and then go beyond what OpenAI offers by adding sparse embeddings for keyword recall and cross-encoder reranking for precision.
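Alongside a full MTEB evaluation, a lightweight sanity check during such a migration is to compare what the old and new embedding models actually retrieve for the same queries. The sketch below (a minimal illustration, not part of the talk's materials; the function name and document IDs are hypothetical) measures how much of the old model's top-k results the new model reproduces:

```python
def retrieval_overlap(old_topk, new_topk):
    """Fraction of the old model's top-k documents that the new
    model also retrieves for the same query."""
    old, new = set(old_topk), set(new_topk)
    return len(old & new) / len(old)

# Hypothetical top-5 document IDs for one query, before and after
# the migration (illustrative values, not real data):
openai_top5 = ["d3", "d7", "d1", "d9", "d4"]
selfhosted_top5 = ["d3", "d1", "d7", "d2", "d9"]

print(retrieval_overlap(openai_top5, selfhosted_top5))  # 0.8
```

Averaged over a representative query set, a high overlap suggests the swap preserved retrieval behavior; a low overlap flags queries worth inspecting before trusting benchmark numbers alone.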

The talk is aimed at Python developers who currently use OpenAI's embedding models and want to understand their options. No prior experience with model serving is required.