Retrieval Augmented Generation (RAG)

Jed Lee
7 min read · Jan 22, 2024

The next (or current) thing for LLMs.

Photo by Philip Strong on Unsplash

Content of this Article

  1. What is RAG?
  2. What are the problems that RAG helps address?
  3. How exactly does RAG work? (Main Bulk of this article)
  4. What about Fine-Tuning?
  5. Step-by-Step Application of RAG
  6. Conclusion

What is RAG?

Retrieval Augmented Generation (RAG) is a fairly recent, groundbreaking approach in the field of natural language processing (NLP). First introduced by researchers from Facebook AI Research (FAIR) in the 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, it combines the strengths of two major components: a Neural Retrieval Mechanism and a Sequence-to-Sequence model.

In simpler terms, RAG combines an information retrieval component with a text generator model.

This combination allows RAG to extend the capabilities of LLMs like ChatGPT and Google Bard by supplementing them with additional knowledge and up-to-date data, making LLMs adept at answering questions and providing explanations in areas where facts evolve over time.

What are the problems that RAG helps address?

Language models, particularly large language models (LLMs) like GPT-3, Google Bard, and Claude, have shown remarkable capabilities in generating human-like text.

Image from Forbes

However, these models face several challenges:

  • Limited to Training Data: LLMs are constrained by the scope and recency of their training data. They often struggle with questions that require up-to-date information or knowledge not covered in their training.
  • Inaccuracies/Extrapolation in Responses: Traditional LLMs can extrapolate when facts are not available, confidently generating factually incorrect yet plausible-sounding responses whenever there is a gap in their knowledge. This failure mode is known as hallucination.
  • Contextual Limitations: LLMs have limitations in handling contexts needing external knowledge or cross-referencing multiple sources. LLMs can often falter with obscure topics that require niche knowledge, as these are less represented in their training corpus.

If you ask an LLM to write about a recent trend or event that occurred after its training cut-off, it will have no idea what you are talking about, and the responses will be mixed at best and problematic at worst.

How exactly does RAG work?

RAG takes a user input (e.g., “Does my insurance policy pay for drug X?”) and retrieves a set of relevant/supporting documents from a given source (e.g., your personal policy documents). The documents are then concatenated as context with the original input prompt and fed to the text generator, which produces the final output.
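As a rough sketch of that flow, where `retrieve` and `generate` are hypothetical placeholders standing in for a real retriever (e.g., a vector store lookup) and a real LLM call:

```python
# Minimal, illustrative sketch of the RAG flow described above.
# `retrieve` and `generate` are hypothetical placeholders, not part of any specific library.

def answer_with_rag(question: str, retrieve, generate, top_k: int = 3) -> str:
    # 1. Retrieve the top-k supporting documents for the question.
    documents = retrieve(question, top_k=top_k)

    # 2. Concatenate the retrieved documents as context ahead of the original prompt.
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

    # 3. Feed the combined prompt to the text generator to produce the final output.
    return generate(prompt)
```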

Lewis et al. (2020) proposed a general-purpose fine-tuning recipe for RAG. A pre-trained seq2seq model serves as the parametric memory, and a dense vector index of Wikipedia serves as the non-parametric memory (accessed using a pre-trained neural retriever). Below is an overview of how the approach works:

Image from the paper: Lewis et al. (2020)

To break RAG’s workflow down using an example…

In simpler terms, RAG essentially works in two phases:

Phase 1: Indexing (Chunking and Embedding)

  1. Document Collection: Aggregate all relevant documents intended for the LLM’s reference.
  2. Chunking and Embedding: Break these documents into manageable chunks. Each chunk is then passed through an embedding model to create embeddings, which are essentially numerical representations capturing the meaning of each chunk.
  3. Embedding Storage: Store these embeddings in a vector database. This database allows for efficient retrieval of document chunks based on their similarity in the embedding space.
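Here is a minimal sketch of this indexing phase, assuming the sentence-transformers library as the embedding model and a plain NumPy matrix standing in for the vector database (the document texts and chunk sizes are made up for illustration):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding model

# 1. Document collection: the texts you want the LLM to be able to reference.
documents = [
    "...full text of policy document 1...",
    "...full text of policy document 2...",
]

# 2. Chunking: break each document into manageable pieces
#    (here, fixed-size character windows with a small overlap).
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

chunks = [piece for doc in documents for piece in chunk(doc)]

# 2b. Embedding: turn each chunk into a vector capturing its meaning.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)

# 3. Embedding storage: a real system would use a vector database
#    (FAISS, Chroma, etc.); a NumPy matrix is enough to show the idea.
index = np.asarray(chunk_embeddings)  # shape: (num_chunks, embedding_dim)
```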

Phase 2: Querying

  1. Query Embedding: When a user poses a query (e.g., “Does my insurance policy cover drug X?”), the query is transformed into an embedding (let’s call it QUERY_EMBEDDING) using the same embedding model as the document chunks.
  2. Embedding Matching: The vector database then searches for chunk embeddings that are most similar to the QUERY_EMBEDDING. This process identifies the document chunks most relevant to the user’s query.
  3. Response Generation: The LLM uses the retrieved chunks to inform its response, ensuring that the answer is not only based on its pre-trained knowledge but also augmented with specific information from the retrieved documents.

Screenshot from Jerry Liu’s talk on LlamaIndex (2023)
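Continuing the indexing sketch above, here is a minimal sketch of the querying phase (the `llm` callable at the end is a hypothetical placeholder for whatever chat/completion API you use):

```python
def retrieve_chunks(query: str, top_k: int = 3) -> list[str]:
    # 1. Query embedding: encode the query with the same embedder used for the chunks.
    query_embedding = embedder.encode([query], normalize_embeddings=True)[0]

    # 2. Embedding matching: with normalized vectors, cosine similarity is just a
    #    dot product; keep the top-k most similar chunks.
    scores = index @ query_embedding
    top_ids = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_ids]

def answer(query: str, llm) -> str:
    # 3. Response generation: ground the LLM's answer in the retrieved chunks.
    context = "\n\n".join(retrieve_chunks(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)

# e.g., answer("Does my insurance policy cover drug X?", llm=my_chat_model)
```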

You might start to think: what if we could squeeze more chunks into the context? Intuitively, the model would have access to more information, which should lead to better, more informed responses, right?

Unfortunately, that is not always the case.

How much context a model can use and how efficiently the model uses it are two different questions. There is a limit to the amount of context you can provide to your model. In parallel with the effort to increase model context length is the effort to make the context more efficient, often referred to as “prompt engineering” or “prompt construction”. For example, a paper that has made the rounds recently shows that models are more adept at processing information at the beginning and end of the input context than in the middle — Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023).
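One practical consequence for prompt construction: rather than dumping retrieved chunks in rank order, you can interleave them so the highest-ranked chunks land at the beginning and end of the context, where the model attends best. A minimal sketch of that reordering, assuming the input list is already sorted from most to least relevant:

```python
def reorder_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end of the context,
    pushing the least relevant ones toward the middle, where models tend to
    pay the least attention (per Liu et al., 2023)."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example: ["A", "B", "C", "D", "E"] -> ["A", "C", "E", "D", "B"]
# (ranks 1 and 2 sit at the two ends; rank 5 ends up in the middle)
```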

To explain RAG in more technical terms…

RAG combines an information retrieval component (Neural Retrieval Mechanism) with a text generator (Sequence-to-Sequence) model.

Neural Retrieval Mechanism: At its core, RAG employs a retrieval mechanism that operates on a large dataset or corpus of documents. When a query is received, this mechanism uses neural network techniques to identify and fetch relevant documents or information snippets from the dataset. This retrieval is based on the semantic similarity between the query and the content in the dataset, ensuring that the most relevant information is selected.

Sequence-to-Sequence Generation Model: The second component of RAG is a powerful sequence-to-sequence (seq2seq) model, like those found in large language models (LLMs). This model takes the input query and the retrieved documents as inputs. It then synthesizes this information to generate a coherent and contextually relevant response. The seq2seq model in RAG is trained on vast amounts of text, enabling it to understand and generate natural language effectively.

Joint Training and Latent Space Matching in RAG: RAG’s effectiveness stems from the joint training of its retrieval and generation components and its latent space matching approach. This integrated training allows the model to effectively utilize retrieved information when generating responses, learning from query-response pairs to refine its accuracy and relevance. Meanwhile, its retrieval mechanism operates in a latent space, matching queries and documents based on learned representations rather than direct keywords. This lets it understand and retrieve information with more nuance, ensuring relevance even when the exact query terms do not appear in the documents.
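Concretely, the retriever in Lewis et al. (2020) scores a document z against a query x by the inner product of their dense encodings (produced by DPR-style query and document encoders) and retrieves via maximum inner product search. A minimal sketch of that scoring rule, with the encoders abstracted away as pre-computed vectors:

```python
import numpy as np

def retrieval_distribution(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Score each document z against the query x by the inner product of their
    dense encodings, score(x, z) = d(z) . q(x), then softmax the scores to get
    the retrieval distribution p(z | x) that the generator conditions on."""
    logits = doc_matrix @ query_vec   # the MIPS (maximum inner product search) objective
    logits -= logits.max()            # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```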

What about Fine-Tuning?

Fine-tuning and RAG represent distinct yet complementary approaches in optimizing Large Language Models (LLMs).

  • Fine-Tuning: This process involves additional training of an LLM on a specific dataset to enhance performance on particular tasks or domains. For instance, Google’s Codey, fine-tuned on diverse coding examples, excels at coding tasks but may underperform in general chat compared to models like Duet or PaLM 2. The downside of fine-tuning is its lack of flexibility: once a model is fine-tuned for a specific task, its performance on unrelated tasks may decline, and it cannot ‘forget’ or selectively remove parts of its training. Another issue is that while fine-tuning gives the model additional general knowledge, the fine-tuned model will not necessarily give you an exact answer (i.e., a fact) to a specific question.

An example explanation on the official OpenAI forum by @juan_olano:

I fine-tuned a 70K-word book. My initial expectation was to have the desired QA, and at that point I didn’t know any better. But this fine-tuning showed me the limits of this approach. It just learned the style and stayed more or less within the corpus, but hallucinated a lot.

Then I split the book into sentences and worked my way through embeddings, and now I have a very decent QA system for the book, but for narrow questions. It is not as good for questions that need the context of the entire book.

  • RAG’s Flexibility: RAG, on the other hand, dynamically augments LLMs with current, external information from knowledge bases. This approach helps overcome the static knowledge limitations of LLMs, allowing them to generate more informed and contextually relevant responses. The trade-offs, however, include increased computational demands, potential latency, and the context-length and context-utilization limits discussed above.
  • Synergy of Fine-Tuning and RAG: Fine-tuning and RAG can be used together to create applications that are both task-specific and contextually informed. A prime example is GitHub Copilot, which leverages fine-tuning for coding proficiency and uses contextual knowledge from the user’s coding environment.
  • Advantage in ‘Forgetting’: A unique advantage of RAG, especially in comparison to fine-tuning, is its ability to ‘forget’ or update its knowledge base. Unlike fine-tuning, where training data becomes an irreversible part of the model, RAG’s vector stores can be modified — erroneous or outdated information can be removed or updated.
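As a concrete illustration, ‘forgetting’ in a typical vector store is just a delete or upsert on the affected chunks. A minimal sketch assuming the chromadb client (the collection name, IDs, and texts are made up for illustration; any vector store with delete/update operations works similarly):

```python
import chromadb  # assumed vector store client

client = chromadb.Client()
collection = client.get_or_create_collection("policy_chunks")

# Index two chunks (IDs and texts are illustrative only).
collection.add(
    ids=["policy-2023-ch1", "policy-2023-ch2"],
    documents=["Drug X is not covered.", "Annual limit is $50,000."],
)

# "Forget" an outdated chunk: remove it from the store entirely.
collection.delete(ids=["policy-2023-ch1"])

# Or replace it with the corrected, current version.
collection.upsert(ids=["policy-2023-ch1"], documents=["Drug X is covered from 2024."])
```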

Step-by-Step Application of RAG

Feel free to explore my Notebook, and I hope it helps you build your own RAG pipeline!

Conclusion

By augmenting Large Language Models with a retrieval mechanism, RAG unlocks unprecedented possibilities in AI’s application across diverse fields, promising more accurate, contextually relevant, and up-to-date responses.

I am very excited and thrilled to witness these remarkable advancements, which are set to redefine the boundaries of what machines can understand and accomplish.

Thanks so much for reading my article!!! Feel free to drop me any comments, suggestions, and follow me on LinkedIn!


Jed Lee

Passionate about AI & NLP. Based in Singapore. Currently a Data Scientist at PatSnap.