
Llama 3.2-Vision for multi-modal RAG in financial services

Understanding SEC filings with the help of foundation models

Introduction

With the recent Llama 3.2 release from Meta, the Llama family of models both got smaller and gained multi-modal capabilities. In this report, we'll explore a simple recipe for building a multi-modal RAG (Retrieval-Augmented Generation) system for question-answering.
We'll use Tesla's financial filings gathered from the SEC-EDGAR database of the US Securities and Exchange Commission, together with the Llama 3.2 Vision model and Weave, a lightweight toolkit for tracking and evaluating LLM applications.
The code associated with this report can be found at github.com/wandb/rag.
💡 You can see some samples of the images found in Tesla's SEC filings below, and more in this dataset entry. These images often contain information that is important for analysts or lawyers.


Installation (see the companion repo at github.com/wandb/rag)

Extracting information from financial reports and images

First, we need to create a corpus that acts as the source of truth for our RAG system. We do this by fetching the 10-Q and DEF 14A filings from the EDGAR database. Besides collecting the text from all the filing reports, we also fetch all the images associated with the filings. To make it easy to ingest information from both the text data and the image files, we generate a comprehensive description for each image and extract all of its text and tabular information using meta-llama/Llama-3.2-90B-Vision-Instruct.
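The exact prompt and inference setup used in the companion repo may differ, but a minimal sketch of this annotation step, using the standard Hugging Face transformers interface for Llama 3.2 Vision, could look like the following (the prompt text and generation settings are illustrative assumptions):

```python
# A minimal sketch of describing a single filing image with Llama 3.2 Vision
# via Hugging Face transformers. The prompt and generation settings are
# illustrative; the companion repo may implement this step differently.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

MODEL_ID = "meta-llama/Llama-3.2-90B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)


def describe_image(image_path: str) -> str:
    """Generate a detailed description of the image, including any text or tables."""
    image = Image.open(image_path).convert("RGB")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {
                    "type": "text",
                    "text": (
                        "Describe this image from an SEC filing in detail. "
                        "Transcribe any text and tabular data it contains."
                    ),
                },
            ],
        }
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens and decode only the newly generated description.
    return processor.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
```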

Building the text corpus for RAG


Why do we need to annotate the images?


The importance of annotating images into text using a multi-modal LLM


Code to fetch filings
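The full implementation lives in the companion repo. As one possible sketch of this step (assuming the sec-edgar-downloader package, which the repo does not necessarily use), fetching Tesla's 10-Q and DEF 14A filings could look like this:

```python
# One possible way to pull Tesla's 10-Q and DEF 14A filings from EDGAR using
# the sec-edgar-downloader package. Treat this as an illustrative sketch; the
# companion repo may rely on a different EDGAR client.
from sec_edgar_downloader import Downloader

# EDGAR requires a declared company name and contact email for its User-Agent policy.
dl = Downloader("MyCompany", "my.email@example.com", download_folder="./filings")

for form_type in ["10-Q", "DEF 14A"]:
    # download_details=True also saves the primary filing document, which is
    # convenient for extracting the text and the referenced images later.
    dl.get(form_type, "TSLA", download_details=True)
```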

Creating a vector index for the financial filings and images

Chunking our corpus

Next, we are going to split the text from the corpus dataset into smaller chunks. These chunks will be stored in a Weave Dataset along with the following metadata for each chunk:
  • The index of the filing report from which the chunk originates
  • The date of the filing report
  • The accession number of the filing report
  • The number of images associated with the filing report
  • The number of tokens parsed to create the chunk
This metadata will be useful when implementing our retrieval and response-generation strategy.
We are going to use the semantic chunking strategy described in chapter 3 of the free course RAG++: From POC to Production. Simply put, we split the text into sentences, which we then group into chunks using semantic similarity. This ensures that the chunks are "semantically meaningful", i.e., each chunk has a clear and coherent meaning based on its content. This strategy has the following advantages:
  • When documents are divided into meaningful chunks, the retriever can focus on retrieving more specific and relevant sections, leading to a better match between the user's query and the retrieved information.
  • When the retrieved chunks are semantically coherent, the generated responses are more likely to be coherent and relevant as well.
  • When chunks are created based on fixed token or word limits, important context can be truncated. Semantic chunking avoids this issue by ensuring that chunks maintain full meaning and context.

Code for chunking
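A minimal sketch of this chunking step, assuming LlamaIndex's SemanticSplitterNodeParser and a filings list carrying the metadata fields listed above (the actual implementation in the repo and the RAG++ course may differ):

```python
# A sketch of semantic chunking with LlamaIndex's SemanticSplitterNodeParser.
# `filings` is an assumed list of dicts holding the text and metadata gathered earlier.
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

splitter = SemanticSplitterNodeParser(
    buffer_size=1,                       # number of sentences grouped when comparing embeddings
    breakpoint_percentile_threshold=95,  # split where adjacent-sentence similarity drops sharply
    embed_model=embed_model,
)

documents = [
    Document(
        text=filing["text"],
        metadata={
            "filing_index": idx,
            "filing_date": filing["date"],
            "accession_number": filing["accession_number"],
            "num_images": len(filing["images"]),
        },
    )
    for idx, filing in enumerate(filings)
]

# Each resulting node is one semantically coherent chunk carrying the filing metadata.
nodes = splitter.get_nodes_from_documents(documents)
```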

Building a vector index using bge-small-en-v1.5

Next, we build the vector index consisting of the vector embeddings of the chunks, encoded using BAAI/bge-small-en-v1.5. We persist the index locally and also save it to W&B using W&B Artifacts.
💡 Note that the information we chunk and index consists only of the text from the filing reports, not the descriptions generated from the images associated with them. In the next section, we describe the retrieval and response-generation strategy in detail, including how the information extracted from the images is used.

Code for indexing
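A minimal sketch of this step, reusing the nodes from the chunking stage; the project and artifact names below are placeholders:

```python
# Embed the chunks with BAAI/bge-small-en-v1.5, build a LlamaIndex vector index,
# persist it locally, and log the persisted directory as a W&B Artifact.
import wandb
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# `nodes` are the semantic chunks produced in the previous step.
index = VectorStoreIndex(nodes, embed_model=embed_model)
index.storage_context.persist(persist_dir="./filings_index")

run = wandb.init(project="financial-rag", job_type="build-index")
artifact = wandb.Artifact("filings-vector-index", type="vector_index")
artifact.add_dir("./filings_index")
run.log_artifact(artifact)
run.finish()
```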

Generating a response for the query

A baseline retrieval and generation strategy

To generate a relevant answer to the user's query, we can load back the vector index and simply retrieve the top_k chunks from our corpus. These chunks are then added to the prompt used to generate the response with Llama 3.2.
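A rough sketch of this baseline, assuming the persisted index from the previous step and an OpenAI-compatible endpoint serving Llama 3.2 (for example via vLLM); the companion repo's serving setup may differ:

```python
# Baseline retrieval and generation: load the persisted index, retrieve the
# top_k chunks, and stuff them into a prompt for Llama 3.2. The endpoint URL
# and model name below are illustrative assumptions.
import weave
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from openai import OpenAI

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
storage_context = StorageContext.from_defaults(persist_dir="./filings_index")
index = load_index_from_storage(storage_context, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=5)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


@weave.op()
def answer_query(query: str) -> str:
    nodes = retriever.retrieve(query)
    context = "\n\n".join(node.get_content() for node in nodes)
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-90B-Vision-Instruct",
        messages=[
            {"role": "system", "content": "Answer using only the provided filing excerpts."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content


weave.init("financial-rag")
print(answer_query("What was Tesla's total revenue in the most recent quarter?"))
```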

Augmenting retrieved chunks using image descriptions

We have already discussed how the images associated with the filing reports often contain important information that might be relevant to the query. Hence, we can further improve the response generated by Llama 3.2 by augmenting the prompt with the image descriptions associated with the filing reports from which the retrieved chunks originate.
💡 Note that, for simplicity of demonstration, we append all the image descriptions associated with the filing reports from which the chunks originate. In practice, this is highly inefficient, because there might be duplicate image descriptions across multiple reports (such as company logos) or even image descriptions that are not relevant to the user's query, making the retrieved context sparse in terms of relevant information. Ideally, we should index the image descriptions separately and filter out only the most relevant ones for augmenting the query.
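A simplified sketch of this augmentation step, assuming an image_descriptions mapping from accession number to the descriptions generated earlier:

```python
# Append every image description belonging to the filings that the retrieved
# chunks came from. `image_descriptions` is an assumed dict mapping an
# accession number to the list of descriptions generated for that filing.
def build_augmented_context(retrieved_nodes, image_descriptions) -> str:
    context_parts = [node.get_content() for node in retrieved_nodes]
    seen = set()
    for node in retrieved_nodes:
        accession = node.metadata["accession_number"]
        if accession not in seen:
            seen.add(accession)
            context_parts.extend(image_descriptions.get(accession, []))
    return "\n\n".join(context_parts)
```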

Further augmenting our response with the most relevant image

To generate a relevant answer to the user's query, we implement the following strategy:
  • We first load back our vector index and retrieve the top_k chunks.
  • We check the number of images associated with the filing report from which the chunk originates.
  • If there are one or more images associated with the filing report, we retrieve the image corresponding to the most relevant image description associated with the filing report.
  • The most relevant image (along with the text information extracted from it) is passed to the Llama 3.2-Vision model for generating the response along with the user query and the respective top_k retrieved chunks from the filing reports.
💡 For retrieving the most relevant image description, we follow the same indexing and retrieval strategy as for the text of the filing reports, but rather than maintaining a dedicated vector index, we simply index the image descriptions on the fly after receiving the query. This is feasible because the number of images associated with a single filing report is insignificant compared to the total number of reports.


A Weave trace showing the retrieval and generation process



Code for generating a response to the query
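A rough sketch of this strategy with illustrative helpers: a filings_by_accession mapping back to the fetched filings (with their images and generated descriptions) and a generate_fn wrapping the Llama 3.2 Vision call, both of which are assumptions rather than the repo's actual interfaces:

```python
# Retrieve the top_k text chunks, index the originating filing's image
# descriptions on the fly to pick the single most relevant image, then pass
# that image plus the text context to Llama 3.2 Vision.
from PIL import Image
from llama_index.core import Document, VectorStoreIndex


def most_relevant_image(query, filing, embed_model):
    """Index this filing's image descriptions on the fly and return the best match."""
    docs = [
        Document(text=img["description"], metadata={"path": img["path"]})
        for img in filing["images"]
    ]
    if not docs:
        return None
    tmp_index = VectorStoreIndex.from_documents(docs, embed_model=embed_model)
    best = tmp_index.as_retriever(similarity_top_k=1).retrieve(query)[0]
    return Image.open(best.metadata["path"]), best.get_content()


def answer_with_image(query, retriever, filings_by_accession, embed_model, generate_fn):
    nodes = retriever.retrieve(query)
    context = "\n\n".join(node.get_content() for node in nodes)
    # Look at the filing behind the most relevant chunk and pick its best-matching image.
    filing = filings_by_accession[nodes[0].metadata["accession_number"]]
    image, description = most_relevant_image(query, filing, embed_model) or (None, "")
    prompt = (
        f"Context from filings:\n{context}\n\n"
        f"Most relevant image description:\n{description}\n\n"
        f"Question: {query}"
    )
    # `generate_fn` wraps a Llama 3.2 Vision call that accepts an optional image,
    # similar in spirit to the describe_image sketch shown earlier.
    return generate_fn(prompt, image=image)
```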

Conclusion

  • In this report, we implement a simple recipe for building a multi-modal RAG (Retrieval-Augmented Generation) system for question-answering on Tesla's financial filings using the Llama 3.2 Vision model and Weave.
  • We discuss the process of constructing the corpus for our RAG pipeline by collecting the filing reports from the SEC-EDGAR Database and extracting additional information from the images associated with the filings.
  • We discuss the semantic chunking strategy and the specific metadata employed for indexing the filing reports.
  • Finally, we discuss the retrieval strategy used to fetch the most relevant chunks and image, and how these are used to generate a response to the user's query.
To learn more about RAG and multi-modal LLM workflows, try our free RAG++ course.

