Application of RAG using Llama 3, ChromaDB, Qdrant, & Re-Ranking

Jed Lee
7 min read · May 16, 2024

Here’s a step-by-step on how you can build your own RAG!

Image from NVIDIA

Content of this Article

  1. Why RAG?
  2. RAG Demo on Kaggle
  3. Quantization
  4. Embedding Model
  5. Keywords Generation
  6. Vector Databases
  7. Re-Ranking
  8. Conclusion

Why RAG?

Retrieval Augmented Generation (RAG), introduced by Facebook AI Research in 2020, is an innovative approach in NLP that addresses key limitations of traditional large language models (LLMs) like GPT-3, Google Bard, and Claude.

By integrating a neural retrieval mechanism with a sequence-to-sequence model, RAG enhances the ability of LLMs by supplementing their training data with external knowledge.

RAG essentially helps mitigate issues such as reliance on outdated information, factual inaccuracies due to extrapolation, and difficulties with context-heavy or niche topics.

Here are some RAG user stories:

  1. As a Real Estate Agent, Paula needs to stay informed about the frequently changing government regulations on housing and cooling measures to advise her clients effectively. Using RAG, she can access the latest government policies and market data, ensuring that her recommendations are based on the most current information.
  2. As a Healthcare Professional, Dr. Lobs needs access to the most recent patient records and treatment guidelines to provide the best care. With RAG, she can securely retrieve and integrate the latest medical records and updated treatment protocols.
  3. As a Vintage Car Restorer, Jackie often needs specific technical manuals and rare part specifications that are not widely available. Using RAG, he can retrieve detailed restoration guides and historical documents.

Feel free to take a quick look at my previous article, which elaborates on RAG in more detail, here :)

RAG Demo on Kaggle

In this Demo, I will be:

  • Employing a Quantized Llama 3, a generative model released by Meta in April 2024, to showcase enhanced text generation with retrieval-based augmentation.
  • Utilizing an embedding model to transform our raw text data into high-dimensional vectors, enabling efficient storage and retrieval while preserving semantic relationships.
  • Employing a keyword generation model to extract informative phrases/keywords from the data, supplementing our metadata.
  • Leveraging Vector Databases like ChromaDB and Qdrant for storing data in vectorized form.
  • Implementing a re-ranking technique to refine the relevance of retrieved documents so as to prioritize more pertinent information during generation.

I will be using the following dataset:

This includes the most recent State of the Union Address, delivered in 2024, which is not part of Llama 3’s training data.

Here is the link to my Kaggle Notebook:

For the rest of the article, I will be briefly touching on some key concepts in my demo, such as Quantization, Embedding Model, Keywords Generation, Vector Databases, and Re-Ranking.

Quantization (Quantized Llama 3)

  • Quantization is a powerful compression technique that converts the weights and activations within an LLM to lower-precision data types, which allows us to reduce its memory footprint and deploy the LLM on devices with limited computational capabilities.
  • In our context, we are employing a quantized version of Llama 3, a state-of-the-art generative model released by Meta in April 2024.

Quantization allows us to represent model parameters using lower-precision formats, such as 8-bit or 4-bit integers, instead of the standard 32-bit floating-point format. Reducing the number of bits means the resulting model requires less memory, consumes less energy (in theory), and can perform operations like matrix multiplication much faster with integer arithmetic, making it feasible to deploy RAG systems on resource-constrained devices or handle larger datasets.

  • In our context, I utilized BitsAndBytesConfig from transformers to load my Llama 3 parameters in either 8-bit or 4-bit precision.
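
To give a sense of what this looks like in code, here is a minimal sketch of loading Llama 3 in 4-bit via BitsAndBytesConfig. The checkpoint name and the specific quantization settings below are illustrative assumptions and may differ from what is used in the notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config (these settings are illustrative assumptions)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" places the quantized weights on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```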

You can read up more about Quantization here:

Embedding Model

  • Embedding models are fundamental to RAG as they enable us to transform our raw text data into high-dimensional vectors (embeddings) that capture semantic relationships between words and documents.

Given the text “What can I do with RAG?”, an embedding of the sentence could be represented in a vector space as a list of, say, 288 numbers (for example, [0.74, 0.22, …, 0.03]).

Since this list captures the meaning of the text, we can do a lot of things with it, like calculating the distance to other embeddings (e.g., an embedding of “RAG can enhance the ability of LLMs.”) to determine how well the meanings of two sentences match.

  • Once a piece of information (a sentence, a document, an image) is embedded, the creativity starts; several interesting industrial applications use embeddings. For example, Google Search uses embeddings to match text to text and text to images; Snapchat uses them to “serve the right ad to the right user at the right time”; and Meta (Facebook) uses them for their social search.
  • In our context, the embedding model maps your question (input query) and documents to a common vector space, allowing the identification of the most relevant documents to retrieve and incorporate into the generation process through similarity metrics like cosine similarity.
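
As a rough illustration of this mapping, here is a minimal sketch using sentence-transformers and cosine similarity; the embedding model named below is an assumption, not necessarily the one used in the notebook.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; the notebook may use a different one
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "What can I do with RAG?"
document = "RAG can enhance the ability of LLMs."

# Each text becomes a fixed-length vector that captures its meaning
query_emb = embedder.encode(query, convert_to_tensor=True)
doc_emb = embedder.encode(document, convert_to_tensor=True)

# Cosine similarity close to 1 means the two meanings match closely
score = util.cos_sim(query_emb, doc_emb)
print(f"Cosine similarity: {score.item():.3f}")
```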

You can read up more about Embeddings here:

Keyword Generation

  • Keyword generation models automatically identify salient and representative terms from documents, serving as concise summaries of main topics or concepts.

In our context, generated keywords supplement our documents’ metadata, providing additional context for retrieval and ranking purposes, thus improving retrieval accuracy and content understanding.

  • I used a vlT5 model for keyword extraction from short texts. It was trained on scientific articles and builds on the encoder-decoder Transformer architecture introduced by Google’s T5 model to generate keywords.
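
A minimal sketch of generating keywords with a vlT5-style seq2seq model is shown below; the exact checkpoint name and task prefix are assumptions, so check the model card before relying on them.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed Hugging Face checkpoint for the vlT5 keyword model
model_name = "Voicelab/vlt5-base-keywords"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = (
    "Retrieval Augmented Generation combines a neural retriever with a "
    "sequence-to-sequence generator to ground answers in external documents."
)

# "Keywords: " is an assumed task prefix; see the model card for the exact one
inputs = tokenizer("Keywords: " + text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=30, no_repeat_ngram_size=3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```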

Vector Databases

  • Vector databases, such as ChromaDB and Qdrant, are specialized data storage systems optimized for efficiently storing, managing, and searching high-dimensional vector data, including embeddings generated by embedding models in RAG. There are many others; feel free to explore them here.
  • These databases enable fast similarity search operations, real-time retrieval of the most relevant documents based on their vector representations, and scalable solutions for querying large datasets.
  • Vector databases like ChromaDB and Qdrant employ indexing techniques like HNSW (Hierarchical Navigable Small World) to accelerate the retrieval process. HNSW is an efficient approximate nearest neighbor search algorithm that builds a hierarchical graph structure to enable fast similarity searches in high-dimensional spaces. By leveraging HNSW, vector databases can handle millions or billions of document embeddings effectively, significantly reducing search latency and improving the scalability of RAG systems.
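
To make this concrete, here is a minimal sketch of creating a Qdrant collection with explicit HNSW index parameters and running a similarity search against it; the collection name, vector size, toy vectors, and HNSW values are illustrative assumptions rather than the notebook’s actual settings.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, PointStruct, VectorParams

# In-memory Qdrant instance (a sketch; the notebook may connect to a server instead)
client = QdrantClient(":memory:")

# Collection whose vectors are indexed with HNSW; m and ef_construct are assumed values
client.create_collection(
    collection_name="sotu_2024",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=100),
)

# Insert a toy point and search for its nearest neighbours
client.upsert(
    collection_name="sotu_2024",
    points=[PointStruct(id=1, vector=[0.1] * 384, payload={"text": "toy chunk"})],
)
hits = client.search(collection_name="sotu_2024", query_vector=[0.1] * 384, limit=3)
print(hits)
```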

Here is a succinct explanation of HNSW:

  • In our context, I applied both ChromaDB and Qdrant to store my data. Check out my code on Kaggle :)
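
On the ChromaDB side, storing chunks with keyword metadata and querying them might look like the sketch below; the collection name, toy vectors, and metadata are assumptions for illustration only.

```python
import chromadb

# In-memory Chroma client (a sketch; the notebook may use persistent storage)
client = chromadb.Client()
collection = client.create_collection(name="sotu_2024_chunks")  # assumed name

# Store text chunks with pre-computed embeddings and keyword metadata
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=["First document chunk ...", "Second document chunk ..."],
    embeddings=[[0.10, 0.20, 0.30], [0.25, 0.05, 0.40]],  # toy vectors
    metadatas=[{"keywords": "economy, jobs"}, {"keywords": "healthcare"}],
)

# Retrieve the chunks most similar to a query embedding
results = collection.query(query_embeddings=[[0.12, 0.18, 0.28]], n_results=2)
print(results["documents"])
```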

Re-Ranking

  • Re-ranking is a technique used in RAG to refine the relevance scores of retrieved documents and prioritize the most pertinent information for generation.

Unlike the standard RAG approach, which directly passes the top-k documents with the highest cosine similarity scores to the language model, re-ranking retrieves a larger initial set of documents and then applies additional ranking techniques to refine the relevance scores. These techniques consider factors such as semantic similarity, keyword matches, document structure, and domain-specific knowledge.

  • By implementing re-ranking, RAG systems can improve the quality of the final set of documents passed to the language model (LLM) for generating responses. This allows the system to retrieve a larger set of potentially relevant documents using a higher top-k value, and then select only the most relevant ones from that set. This helps strike a balance between retrieval recall (retrieving as many relevant documents as possible) and LLM recall (providing the LLM with the most relevant information to generate accurate responses).

You can read up more about Re-Rankers here:

  • In our context, I used an open-source re-ranking model from Hugging Face called “BAAI/bge-reranker-v2-m3”. There are many such models on Hugging Face. You can even check out their leaderboard here: https://huggingface.co/spaces/mteb/leaderboard
  • In my notebook, I showcased the documents that were retrieved before and after re-ranking, together with their relevance scores. I observed a stark improvement in the documents retrieved. The scores allowed me to rank and select the top-n document chunks that I pass to my prompt.
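
As a rough illustration, here is a minimal sketch of re-scoring retrieved chunks with a cross-encoder re-ranker. Loading “BAAI/bge-reranker-v2-m3” through sentence-transformers’ CrossEncoder is one possible approach (the notebook may load it differently, e.g. via FlagEmbedding), and the query and candidate documents are made up for illustration.

```python
from sentence_transformers import CrossEncoder

# One way to load the re-ranker; the notebook may use FlagEmbedding instead
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "What did the 2024 State of the Union Address say about the economy?"
retrieved_docs = [  # made-up candidate set returned by the vector database
    "The President discussed job growth, wages, and inflation ...",
    "A passage about an unrelated foreign policy topic ...",
    "Remarks on manufacturing investment and the broader economy ...",
]

# Score each (query, document) pair, then keep only the top-n chunks for the prompt
scores = reranker.predict([(query, doc) for doc in retrieved_docs])
ranked = sorted(zip(scores, retrieved_docs), key=lambda pair: pair[0], reverse=True)
top_n = [doc for _, doc in ranked[:2]]
print(top_n)
```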

Conclusion

The emergence of RAG seems to promise the world. However, after developing a plain and bare-bones RAG pipeline, many of us are left wondering why it does not work as well as we had expected.

By leveraging these various techniques, we can unlock the full potential of RAG, pushing the boundaries of text generation to deliver higher quality, contextually relevant responses, and effectively manage vast document collections.


Jed Lee

Passionate about AI & NLP. Based in Singapore. Currently a Data Scientist at PatSnap.