
Vector Embeddings in RAG Applications

A guide on using vector embeddings to enhance Retrieval-Augmented Generation (RAG) systems.
This article explores one of the crucial factors that make Retrieval-Augmented Generation (RAG) systems work: vector embeddings. These are technical yet transformational tools that enable RAG systems to process language in subtle ways, akin to some aspects of human understanding.
Vector embeddings provide a way to convert textual information into numerical data. This is instrumental, as it allows the system to quickly search for and retrieve the most relevant information when generating responses. Such a capability is crucial for improving the accuracy and contextual relevance of an LLM's output.
This article details the process of integrating vector embeddings into RAG systems to enhance their performance. We will discuss the mechanics behind this integration and its implications for AI applications.

What are vector embeddings?

Vector embeddings are a powerful technique used in machine learning and artificial intelligence to transform raw data into a numerical format that models can easily process.
This transformation involves representing data as vectors in a high-dimensional space where similar items are placed closer together, enabling efficient computations and similarity comparisons. Vector embeddings come in many shapes and forms, such as:

Text embeddings

Text embeddings convert text into vectors, where each vector represents a word or a sentence. Techniques like Word2Vec, GloVe, and BERT are commonly used for this purpose.
These models learn representations for words based on their contexts in large text corpora. For instance, words that appear in similar contexts, like "king" and "queen," will have similar vector representations. More on this in the following section.
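To make this concrete, here is a minimal sketch that loads a small set of pre-trained GloVe vectors through gensim's downloader (assuming gensim is installed and you are fine with downloading the model on first use) and checks which words sit closest to "king":
import gensim.downloader as api

# Load 50-dimensional pre-trained GloVe vectors (downloaded on first use)
glove = api.load("glove-wiki-gigaword-50")

# Nearest neighbours of "king", and a direct similarity score between two words
print(glove.most_similar("king", topn=3))
print(glove.similarity("king", "queen"))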

Image embeddings

Images are another common data type. Similar to text embeddings, image embeddings convert images into vectors of numbers. These embeddings require a different kind of embedder; in this case, Convolutional Neural Networks (CNNs) are typically used to generate them.
An image is passed through various layers of the CNN, which extract features at different levels of abstraction. The output is a dense vector that encapsulates the visual features of the image, useful for tasks like image recognition, classification, and retrieval.
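As a rough sketch of this idea (assuming a recent PyTorch and torchvision install, and a hypothetical local image path), a pre-trained ResNet can be turned into an image embedder by replacing its classification head with an identity layer:
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained ResNet-18 with its classifier removed, leaving a 512-dim feature vector
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")  # hypothetical image file
with torch.no_grad():
    embedding = model(preprocess(image).unsqueeze(0))  # shape: (1, 512)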

Audio embeddings

Lastly, audio data can also be converted into vector embeddings. Techniques like Mel Frequency Cepstral Coefficients (MFCCs) or deep learning models such as WaveNet and DeepSpeech are used to extract features from raw audio.
These features are then used to create a vector that represents the audio, which can be utilized for applications like speech recognition, music analysis, and audio classification.
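For instance, a minimal sketch using librosa (assuming it is installed, and with a hypothetical audio file) could average MFCC frames over time to get a single fixed-size vector per clip:
import librosa

# Load a clip and compute 13 MFCCs per frame, then average over time
y, sr = librosa.load("clip.wav")                      # hypothetical audio file
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
audio_embedding = mfccs.mean(axis=1)                  # fixed-size vector, shape: (13,)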
Image By Author
In this article, we will focus mainly on text embeddings, as this is the most common embedding format and the one that works best with large language models.

How do text embeddings work?

Text embeddings represent words in a numerical format that captures semantic relationships and characteristics shared among them. Here’s how the process works, in the context of an example involving the words "King," "Queen," "Boy," and "Girl".
When thinking about vectorization, imagine a multidimensional space where each dimension can represent a feature like age, gender, royalty, etc. In such a space:
  • "King" might be represented as a point at coordinates (male, high royalty).
  • "Queen" as (female, high royalty).
  • "Boy" as (male, low royalty).
  • "Girl" as (female, low royalty).
In the example below, we consider two features among others: gender and royalty.
Image By Author
  • Gender Feature: In the embedding space, "King" and "Boy" would be positioned closer to each other on the gender dimension, reflecting their common male attribute. Similarly, "Queen" and "Girl" would cluster together on this dimension due to their female attribute.
  • Royalty Feature: "King" and "Queen" would be close to each other on a royalty dimension, reflecting their high-status roles in royalty. In contrast, "Boy" and "Girl" would likely be positioned farther from "King" and "Queen" on this dimension due to their general status as non-royal.
These embeddings can then be used in various applications like:
  • Similarity Measurement: Calculating distances between words to find how similar they are (e.g., "King" is closer to "Queen" than "King" is to "Boy").
  • Analogies: Solving problems like "King is to Queen as Boy is to what?" The model would find "Girl" as the closest word vector.
  • Machine Translation and NLP Tasks: Using these vectors to understand text in tasks like sentiment analysis, classification, and more.
The effectiveness of embeddings in capturing such nuanced relationships makes them extremely valuable in natural language processing, allowing machines to process text in a more human-like and intuitive manner. Models like Word2Vec, GloVe, and more recently, BERT and GPT, use these principles to create powerful embeddings.
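To make the similarity and analogy ideas above concrete, here is a small sketch using hand-picked two-dimensional vectors (one axis for gender, one for royalty); the numbers are purely illustrative stand-ins for learned embeddings:
import numpy as np

# Toy 2-D "embeddings": axis 0 encodes gender (positive = male), axis 1 encodes royalty
words = {
    "king":  np.array([ 0.5, 1.0]),
    "queen": np.array([-0.5, 1.0]),
    "boy":   np.array([ 0.5, 0.0]),
    "girl":  np.array([-0.5, 0.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity: in this toy space, "king" sits closer to "queen" than to "boy"
print(cosine(words["king"], words["queen"]))  # ~0.60
print(cosine(words["king"], words["boy"]))    # ~0.45

# Analogy: "King is to Queen as Boy is to ...?" via vector arithmetic
target = words["boy"] - words["king"] + words["queen"]
closest = min(words, key=lambda w: float(np.linalg.norm(words[w] - target)))
print(closest)  # "girl"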

Vector embeddings in RAG systems

A Retrieval-Augmented Generation (RAG) system is an advanced NLP architecture that enhances the capabilities of language models by combining them with retrieval-based information. This system significantly improves the accuracy and relevance of the responses generated by the language model, especially in complex domains where specific knowledge is required.

Concept of Embedding Data into a Vector Database for Easy Retrieval

Before a RAG system can retrieve relevant data, the information must be structured in a way that makes it easily accessible and comparable. This involves:
  1. Data Vectorization: All the data—whether it's text, images, or any other form—needs to be transformed into vectors. This means each item in the database is converted into a high-dimensional numerical vector that represents various features of the data.
  2. Vector Database Creation and Storage: These vectors are then stored in a vector database, often using systems like FAISS (Facebook AI Similarity Search). These systems allow for efficient retrieval of vectors that are close to a given query vector in the high-dimensional space, which corresponds to retrieving data that is similar or relevant to a query.
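As a minimal sketch of this storage-and-lookup idea, here FAISS is used directly, with random vectors standing in for real document embeddings:
import numpy as np
import faiss

dim = 128
doc_vectors = np.random.rand(1000, dim).astype("float32")  # 1,000 stand-in "documents"

index = faiss.IndexFlatL2(dim)   # exact nearest-neighbor search with L2 distance
index.add(doc_vectors)           # store the vectors in the index

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 5)   # 5 most similar documents
print(ids)   # row indices of the retrieved documents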

RAG system explained in three steps

Image By Author

1) Searching for relevant data

  • Query Vectorization: When a query is received, it is first converted into a vector using the same model or method used for creating the database vectors.
  • Retrieval: The query vector is then used to search the vector database for similar vectors. The corresponding data for these vectors (which could be text passages, images, etc.) are deemed to be the most relevant to the query.

2) Sending relevant data to the LLM

  • Data Preparation: The retrieved data is often preprocessed or formatted in a way that can be easily used by the language model. This might involve summarizing the information or converting it into a structured format that the model can understand.
  • Integration: The processed, relevant data is then fed into a large language model (LLM) as additional context or as part of the input. This step is crucial because it provides the LLM with specific information that directly relates to the user's query.

3) The LLM answering using the sent data

  • Response Generation: With the context provided by the retrieved data, the LLM generates a response. The model leverages both the general knowledge it has learned during pre-training and the specific information provided by the retrieved data.
  • Refinement and Output: The generated response is sometimes refined or adjusted to ensure coherency and relevance. The final output is then presented as the answer to the initial query.

Benefits of RAG Systems

The integration of retrieval systems with generative models in a RAG setup allows for responses that are not only contextually accurate but also deeply informed by specific data, making them particularly useful in specialized fields such as medical, legal, or technical domains where precise information is crucial.
This multi-step approach leverages the strengths of both retrieval systems and generative AI, bridging the gap between vast data resources and sophisticated linguistic capabilities, resulting in outputs that are both rich in content and highly relevant to the queries posed.

Using W&B Weave

Weave is a powerful tool from Weights & Biases designed to simplify the monitoring of machine learning models during production. It provides an easy-to-use interface for tracking various metrics, visualizing data, and examining model performance in real time. By integrating Weave into our RAG system, we enhance our ability to monitor our model’s performance during production.
To use Weave, start by initializing it with weave.init('your_project_name'). Next, add the @weave.op() decorator to any function you wish to track. This decorator automatically logs all inputs and outputs of the function, capturing detailed information about its operation. Users can then examine this data within the Weave interface for easy inspection of function calls. We will use Weave for this project to track inputs and outputs to our model!
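In its simplest form, the pattern looks like the following sketch (the project and function names are arbitrary placeholders, and it assumes you are logged in to Weights & Biases):
import weave

weave.init('my_project')   # start logging to a Weave project

@weave.op()                # inputs and outputs of this function are logged automatically
def add_numbers(a: int, b: int) -> int:
    return a + b

add_numbers(2, 3)          # this call now shows up in the Weave UI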

Practical Application: Building a RAG System for Financial Forecasting

In the code below, we will build our own RAG system. It includes an OpenAI LLM, the London housing dataset as our data source, and FAISS as the vector database used to store our vectorized data.



Step 1: Installing the necessary libraries

!pip install langchain langchain-community tiktoken faiss-cpu transformers pandas torch openai

Step 2: Importing the necessary libraries

These libraries include LLMChain, which allows us to build our answer chain; FAISS, where we will store our vectorized data; and OpenAIEmbeddings as our embedder of choice.
import pandas as pd
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.schema import Document
import os
import weave

Step 3: Integrating W&B Weave into our code

We will start by initializing a new Weave project.
weave.init('RAG_System')

Step 4: Loading and processing our dataset

Load the dataset from a local CSV file. You can use any dataset of your choice, but make sure to properly process it before use.
df = pd.read_csv("/kaggle/input/housing-in-london/housing_in_london_monthly_variables.csv")
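For example, an optional cleaning pass (an illustrative assumption here, not part of the original pipeline) might drop rows that are missing the fields we format into text below:
# Optional: drop rows missing the fields used in the text column below
df = df.dropna(subset=["date", "area", "average_price", "houses_sold"]).reset_index(drop=True)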
We will sample 10% of our dataset.
df = df.sample(frac=0.1, random_state=42)
Here we will be combining our fields into a text field for easier processing.
df['text'] = df.apply(lambda row: f"Date: {row['date']}, Area: {row['area']}, "
                      f"Average Price: {row['average_price']}, "
                      f"Code: {row['code']}, Houses Sold: {row['houses_sold']}, "
                      f"Number of Crimes: {row['no_of_crimes']}, "
                      f"Borough Flag: {row['borough_flag']}", axis=1)
Prepare the texts for the knowledge base.
texts = df['text'].tolist()

Step 5: Creating and storing our embeddings

Create the embedding model for our texts. Make sure your OpenAI API key is available in the api_key variable; the vector store below will use this model to embed the texts.
api_key = os.environ["OPENAI_API_KEY"]  # or assign your key directly
embeddings = OpenAIEmbeddings(api_key=api_key)
Create a FAISS vector store and store our embeddings in it. I also added logic to cache the vector store on disk so we avoid generating new embeddings on every run:
# Check if the vector store already exists
vector_store_path = "faiss_index"

if os.path.exists(vector_store_path):
    # Load the existing vector store
    vector_store = FAISS.load_local(vector_store_path, embeddings, allow_dangerous_deserialization=True)
else:
    # Create and save the vector store (FAISS embeds the texts with our embedding model)
    vector_store = FAISS.from_texts(texts, embeddings)
    vector_store.save_local(vector_store_path)

Step 6: Retrieving our data

Define the retriever function. Here we set k to 20, which retrieves 20 data points from the vector database for each query.
def retrieve(query, k=20):
    return vector_store.similarity_search(query, k=k)
Create a prompt template for generating responses.
prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Given the following data from the London housing dataset:
{context}

Please use the provided data to answer the question accurately. Calculate any necessary averages or totals directly from the data provided.

Question: {question}"""
)

Step 7: Testing and evaluating our model

Initialize the OpenAI LLM, reusing the api_key we set earlier.
llm = OpenAI(api_key=api_key)
Here we will be creating an LLM chain.
llm_chain = LLMChain(prompt=prompt_template, llm=llm)
We will define a custom function to generate a response using retrieval and the LLM chain. Here we leverage W&B Weave to automatically log the inputs and outputs of our generate_response function, which lets us easily monitor how our model is performing in production!
@weave.op()
def generate_response(question):
    # Retrieve relevant documents
    retrieved_docs = retrieve(question)
    context = "\n".join([doc.page_content for doc in retrieved_docs])

    # Print the retrieved documents for debugging
    print(f"Retrieved documents for question '{question}':")
    for i, doc in enumerate(retrieved_docs):
        print(f"Document {i+1}:\n{doc.page_content}\n")

    # Generate a response using the LLM chain
    response = llm_chain.run(context=context, question=question)
    return {"response": response, "context": context}
Here are sample questions that we will use to test our model.
questions = [
    "Using the data retrieved, what was the average price of houses in Westminster in 2019?",
    "Using the data retrieved, how many crimes were reported in Hackney in 2018?",
    "Using the data retrieved, what is the average price trend in London over the years?",
]
Get answers to the example questions.
for question in questions:
    answer = generate_response(question)
    print(f"Q: {question}\nA: {answer}\n")

Weave Logging

Below are the results of our model. Since we added Weave to our generate_response function, we automatically log our question, context, and response.
Inside Weave, you will see multiple cells for each call of the function, as shown below:

We can click into any of the cells, and examine more details about the inputs and outputs to the function:


Overall, Weave is a very easy tool to use for logging data in a RAG system, as it simply requires initializing Weave and then adding the @weave.op() decorator to the functions you want to track.
This logged data is useful in many ways: you may want to train new models on previous production data, or perhaps you would just like a quick and easy way to monitor how your models are performing. As seen above, the context is the information retrieved by our embedding search, and the response is generated by our LLM using that retrieved context!

Conclusion

By converting various forms of data into high-dimensional vectors, embeddings allow RAG systems to efficiently retrieve and utilize relevant information, significantly improving the accuracy and contextual relevance of generated responses.
In this article, we explored the mechanics of text embeddings, understanding how they capture semantic relationships and facilitate nuanced language processing. Further, we demonstrated a practical application of building a RAG system, showcasing the steps involved in integrating vector embeddings and leveraging their capabilities.
Hopefully it's clear that vector embeddings are pivotal in advancing the performance of RAG systems, bridging the gap between vast data resources and sophisticated linguistic capabilities to deliver highly informed and relevant outputs.
Iterate on AI agents and models faster. Try Weights & Biases today.