Semantic Search and RAG - A Powerful Combination by Seth Carney

Semantic search refers to the ability to understand the meaning and intent behind a user's query, rather than simply looking for matching keywords. It aims to provide more relevant and accurate search results by considering the semantics (meaning) of the query and the content being searched.

Retrieval-augmented generation (RAG) refers to a type of language model that generates responses to queries or prompts by retrieving relevant information from a knowledge base and using it as context when generating output. It combines elements of both retrieval-based and generative models. This allows the model to produce more accurate and contextually relevant responses compared to traditional generative models.

These natural language processing (NLP) techniques are very useful when combined. Semantic search specializes in determining which relevant documents or passages should be fed into a given RAG model based on a user’s query. Working in unison, these techniques allow users to rapidly search through diverse knowledge bases and receive informative responses. Semantic search with RAG integration can provide tremendous productivity benefits at a relatively low cost.

How does Semantic Search work?

There are two main approaches to semantic search, and both can improve RAG outputs. The first, known as dense retrieval, relies on vector similarity between dense embeddings in a high-dimensional vector space. It contrasts with the sparse vector representations used in traditional keyword-based retrieval. With dense retrieval, a user’s query is embedded and compared against the embedded knowledge source, and the system returns the passages whose embeddings are most similar to the query’s. The second approach is known as reranking. A reranker systematically assigns relevance scores to candidate matches from the knowledge base, and these scores are used to reorder the results before they are displayed to the user, optimizing result relevance.
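The two-stage flow above can be sketched in a few lines of Python. The embeddings, documents, and the keyword-boost reranker below are all illustrative stand-ins: real systems produce vectors with an embedding model and rerank with a cross-encoder, not hand-picked numbers.

```python
import math

# Toy corpus with hand-assigned 4-dimensional embeddings. In practice these
# vectors come from an embedding model; the numbers here are illustrative only.
DOCS = {
    "doc1": ("Resetting your password", [0.9, 0.1, 0.0, 0.2]),
    "doc2": ("Quarterly revenue report", [0.1, 0.8, 0.3, 0.0]),
    "doc3": ("Two-factor authentication setup", [0.7, 0.0, 0.1, 0.4]),
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def dense_retrieve(query_vec, k=2):
    """Stage 1 (dense retrieval): top-k documents by vector similarity."""
    scored = [(cosine(query_vec, vec), doc_id, text)
              for doc_id, (text, vec) in DOCS.items()]
    return sorted(scored, reverse=True)[:k]

def rerank(candidates, boost_terms):
    """Stage 2 (reranking): re-score candidates and reorder. A production
    reranker would use a cross-encoder model, not a keyword boost."""
    def score(item):
        sim, doc_id, text = item
        return sim + sum(0.1 for t in boost_terms if t in text.lower())
    return sorted(candidates, key=score, reverse=True)

# A query about account security, hypothetically embedded near doc1 and doc3.
query_vec = [0.8, 0.05, 0.05, 0.3]
ranked = rerank(dense_retrieve(query_vec), boost_terms=["authentication"])
```

Note that dense retrieval alone ranks doc1 first; the reranking stage promotes doc3 because it better matches the query's intent.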

How does RAG work?

Retrieval: The system begins by retrieving relevant information from a large dataset or knowledge base in response to a given prompt. This retrieval step is often performed using semantic search; however, keyword matching can also be used. First, the prompt is tokenized using the model’s tokenizer. To achieve compatibility, the document collection or knowledge library undergoes a transformation process in which each document is converted into a numerical format called an embedding. Embedding refers to the process through which text is assigned a numerical vector representation within a high-dimensional vector space, capturing semantic information about the text’s meaning and context. The prompt is embedded the same way, and its vector is compared against the document embeddings to find the closest matches.
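To make the text-to-vector idea concrete, here is a minimal bag-of-words sketch: each dimension counts one vocabulary word. Real embeddings are dense, learned vectors from a neural model, and the vocabulary below is a tiny illustrative sample, not a real one.

```python
# A minimal sketch of "embedding" text as a vector: one dimension per word in
# a fixed vocabulary. Learned dense embeddings replace this in real systems.
VOCAB = ["password", "reset", "revenue", "report", "login"]

def embed(text):
    """Map text to a count vector over the fixed vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in VOCAB]

doc_vec = embed("Reset your password after a failed login")
query_vec = embed("how do I reset my password")
```

Even this crude representation lets the query and document be compared numerically: both vectors are nonzero in the "password" and "reset" dimensions.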

Augmentation: Once the relevant information is retrieved, it is augmented or integrated into the generation process. This additional context is then fed into the transformer alongside the user’s tokenized prompt. 
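The augmentation step can be sketched as simple prompt assembly: retrieved passages are stitched in ahead of the user's question. The template wording here is an assumption; RAG frameworks vary in how they format context.

```python
# A sketch of the augmentation step: retrieved passages become context that is
# prepended to the user's question before the combined prompt reaches the model.
def augment_prompt(question, passages):
    """Build a prompt containing numbered retrieved passages plus the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = augment_prompt(
    "How do I reset my password?",
    ["Passwords can be reset from the account settings page.",
     "A reset link is emailed after identity verification."],
)
```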

Generation: With the retrieved information integrated, the system generates a response or text based on the prompt and the augmented information. This generation step is typically performed by a neural network-based language model, such as GPT (Generative Pre-trained Transformer). By combining retrieval and generation techniques, RAG models can produce responses that are not only fluent and contextually appropriate but also grounded in factual information retrieved from external knowledge sources. This allows RAG models to excel in tasks such as question answering, conversational agents, and content generation where accurate and informative responses are desired.
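Putting the three steps together, an end-to-end pipeline looks like the sketch below. The `generate()` stub stands in for a real language model call, and the word-overlap retriever is a deliberately simple keyword-matching variant; every name here is illustrative rather than a specific library's interface.

```python
# An end-to-end sketch of retrieve -> augment -> generate.
def retrieve(query, corpus, k=1):
    """Toy lexical retrieval: rank passages by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(prompt):
    """Placeholder for a transformer model; a real system calls an LLM here."""
    return "Based on the context: " + prompt

def rag_answer(query, corpus):
    passages = retrieve(query, corpus)                      # retrieval
    prompt = f"Context: {' '.join(passages)}\nQuestion: {query}"  # augmentation
    return generate(prompt)                                 # generation

corpus = ["The cat sat on the mat.",
          "Passwords are reset from the settings page."]
answer = rag_answer("how are passwords reset", corpus)
```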

Semantic Search Enhancements

As this technology continues to develop, helpful optimizations that aid accuracy and consistency have emerged in parallel. Firstly, a very commonly sought feature is to enable the transformer model to display direct citations from the knowledge database, showing the embedding matches it found and used to generate its response. This allows the user to see exactly which “relevant” passages were added as context.
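One lightweight way to surface citations is to carry passage identifiers through retrieval and append them to the answer. The ID scheme and formatting below are illustrative assumptions, not a standard.

```python
# A sketch of response citations: the final answer lists the IDs of the
# retrieved passages so users can audit what context the model was given.
def answer_with_citations(answer_text, source_ids):
    """Append bracketed source IDs to a generated answer."""
    cites = ", ".join(f"[{s}]" for s in source_ids)
    return f"{answer_text} (sources: {cites})"

out = answer_with_citations("Reset links expire after 24 hours.",
                            ["kb-101", "kb-204"])
```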

Furthermore, preprocessing user queries before they are used in a semantic search can be very helpful for obtaining useful results. Oftentimes, preprocessing libraries will expand contractions, correct spelling, and remove words that do not carry significant meaning (stop words) in order to improve clarity and conciseness while still conveying the original request. An alternative to preprocessing prompts is to provide users with a prompt library. Having custom, well-thought-out prompts easily accessible can help ensure that the system receives thorough instructions so that responses remain as accurate as possible.
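A minimal preprocessing pass might look like the sketch below. The contraction and stop-word lists are tiny illustrative samples; production pipelines use full lexicons and often add spelling correction.

```python
import re

# A sketch of query preprocessing: expand contractions, normalize case, and
# drop stop words before the query is embedded for semantic search.
CONTRACTIONS = {"what's": "what is", "don't": "do not", "can't": "cannot"}
STOP_WORDS = {"the", "a", "an", "is", "to", "do", "what", "i", "my", "how"}

def preprocess(query):
    """Return the content-bearing terms of a query."""
    text = query.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t not in STOP_WORDS]

terms = preprocess("What's the fastest way to reset my password?")
```

The surviving terms are the ones that actually steer the similarity search, which tends to sharpen retrieval results.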
