Rescuing the RAG: Hybrid Search, Contextual Memory, and Inline Assets
- Track: Machine Learning: Research & Applications
- Type: Talk
- Level: Intermediate
- Duration: 30 minutes
Abstract
Retrieval-Augmented Generation (RAG) has become the default architecture for building AI assistants, but moving a RAG application from a tutorial notebook to production exposes a series of gaps between tutorial and reality. Standard vector search often fails when users make typos or use domain-specific jargon. Furthermore, technical documentation is rarely just text: it relies heavily on diagrams and icons, which standard text-only RAG pipelines strip away, leaving users with incomplete answers.
In this talk, I will walk through the evolution of a production documentation assistant built with Python, Streamlit, and PostgreSQL (pgvector). We will move beyond the basic pipeline to solve three specific production failures.
First, we will address the "typo problem" by implementing Hybrid Search. I will demonstrate how to combine dense semantic vector search with PostgreSQL’s full-text search (tsvector) to handle messy user input (e.g., mapping "usuage" to "usage" while preserving exact technical terms like "undercuts").
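As a preview, here is a minimal sketch of the blended query, assuming a `doc_chunks` table with `content text`, `embedding vector` (pgvector), and a precomputed `tsv tsvector` column; the table name, column names, and the 0.7/0.3 weights are illustrative stand-ins, not the production values:

```python
# Minimal hybrid-retrieval sketch: blend pgvector cosine similarity with
# Postgres full-text rank in a single ORDER BY. Table/column names and the
# 0.7/0.3 weights are illustrative assumptions.
import psycopg

HYBRID_SQL = """
SELECT id, content,
       0.7 * (1 - (embedding <=> %(qvec)s::vector))                    -- dense similarity
     + 0.3 * ts_rank(tsv, websearch_to_tsquery('english', %(qtext)s))  -- lexical rank
       AS score
FROM doc_chunks
ORDER BY score DESC
LIMIT %(k)s;
"""

def hybrid_search(conn: psycopg.Connection, query_text: str,
                  query_vec: list[float], k: int = 5):
    """Return the top-k chunks by a weighted dense + lexical score."""
    # pgvector accepts a bracketed literal cast to ::vector
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, {"qvec": vec_literal, "qtext": query_text, "k": k})
        return cur.fetchall()

# Usage (connection string and embed() are assumptions):
# with psycopg.connect("dbname=docs") as conn:
#     hits = hybrid_search(conn, "usuage limits", embed("usuage limits"))
```

Because the lexical term rewards exact token matches while the dense term tolerates misspellings, the blend keeps "undercuts" literal without breaking on "usuage."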
Second, we will tackle Contextual Memory. I will show you how to manage session state effectively to handle follow-up questions (e.g., "tell me more about that") without losing the conversational thread or confusing the retrieval layer.
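One common shape of that solution, sketched here under the assumption of a simple Streamlit chat loop, is to condense each follow-up into a standalone query before it reaches the retriever; the condensation prompt and the `call_llm` helper are stand-ins, not the talk's exact code:

```python
# Session-state sketch: keep the transcript in st.session_state and rewrite
# follow-ups into self-contained questions before retrieval.
# `call_llm` is an assumed completion helper, not shown here.
import streamlit as st

if "history" not in st.session_state:
    st.session_state.history = []  # [{"role": ..., "content": ...}, ...]

def condense_question(history: list[dict], follow_up: str) -> str:
    """Rewrite a follow-up like 'tell me more about that' as a standalone query."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history[-6:])
    prompt = ("Rewrite the final user message as a self-contained question, "
              f"using this conversation for context:\n{transcript}\nuser: {follow_up}")
    return call_llm(prompt)  # assumed LLM helper

user_msg = st.chat_input("Ask about the docs")
if user_msg:
    # The retriever sees the condensed query, never a bare "that"
    standalone = condense_question(st.session_state.history, user_msg)
    st.session_state.history.append({"role": "user", "content": user_msg})
    # ...retrieve with `standalone`, generate the answer, append the assistant turn...
```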
Finally, we will solve the "visual gap." I will demonstrate a Pythonic approach to parsing, indexing, and re-injecting inline assets (images and icons) directly into the LLM's streaming response. You will see how to ensure that when the text says "Click the green checkmark," the user actually sees the green checkmark icon inline.
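A rough sketch of the re-injection step, assuming the LLM is prompted to cite indexed images as `{{asset:slug}}` tokens; the token syntax, the `ASSETS` map, and the file paths are hypothetical illustrations rather than the talk's exact implementation:

```python
# Streaming sketch: buffer LLM output and swap {{asset:slug}} tokens for the
# real image as the response renders. ASSETS maps index slugs to file paths;
# both the token format and the paths are illustrative assumptions.
import re
import streamlit as st

ASSETS = {"green-checkmark": "static/icons/check_green.png"}  # slug -> path
TOKEN = re.compile(r"\{\{asset:([\w-]+)\}\}")

def render_stream(chunks):
    """Stream text, flushing an inline image whenever a complete token appears."""
    buffer = ""
    for chunk in chunks:  # chunks: iterator of text pieces from the LLM
        buffer += chunk
        while (m := TOKEN.search(buffer)):
            st.markdown(buffer[: m.start()])  # text before the token
            path = ASSETS.get(m.group(1))
            if path:
                st.image(path, width=24)      # the inline icon itself
            buffer = buffer[m.end():]
    if buffer:
        st.markdown(buffer)  # trailing text after the last token
```

Buffering until a token is complete matters here: because the stream arrives in arbitrary slices, a token can be split across chunks, so the renderer only flushes text once it has either a full match or the end of the stream.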