
Building a RAG-Based Digital Restaurant Menu with LlamaIndex and W&B Weave

Powered by RAG, we will transform the traditional restaurant PDF menu into an AI-powered interactive menu!
Digital convenience is a requirement for nearly every business. One place we haven't seen much change, however, is when we go out for dinner. While restaurants embraced digital menus after COVID vaccines became readily available, most of these were either static PDFs or experiences that mimicked online ordering when you were, in fact, sitting in the restaurant.
Physical menus are fine. They do the job. But if we're looking to actually improve upon them, current approaches miss the mark. They offer limited interactivity and flexibility. Searching through most menus is a matter of either reading the whole thing or hunting with "Command + F" (which isn't great on most mobile phones either).
It's here that we have a nice window into a real improvement: semantic search. Users could interact with a menu in a more nuanced way, not needing to know precisely what they wanted but instead making natural language queries like we would talking to an experienced waiter.
With the help of LlamaIndex and W&B Weave, that's what we'll be building today.



Key Features of the App

To build this application, we will incorporate several key functionalities:
Semantic Search Capability: We want users to search for menu items based on context and meaning, rather than just exact word matches.
User-Friendly Interface: A simple and responsive design that makes menu navigation a seamless experience. The app will have two screens: the regular PDF menu, plus a 'chat screen' where users can ask questions about the contents of the menu.
App Data Analysis: The capability to analyze user interactions and preferences, offering restaurants valuable insight into what their customers are actually looking for.

Technical Overview

Transforming these static menus into intelligent, searchable interfaces requires a multi-step approach, which includes the following steps:
Data Standardization with GPT-4: The process begins with converting the diverse, unstructured text of traditional menus into a structured, standardized JSON format.
Creating the Vector Index: The standardized data is next converted into vectors. This vector index represents the semantic content of the menu items, facilitating efficient and accurate semantic searches.
Querying the Vector Index: With the index in place, users can perform natural language queries. The system retrieves the most relevant items based on the semantic meaning of these queries, rather than mere keyword matching.
Filtering Returned Results with GPT-3.5: Finally, we'll use GPT-3.5 to filter the retrieved results. While a vector index can efficiently retrieve menu items based on semantic content, it may still return items that aren't entirely relevant to the user's specific query. GPT-3.5's language understanding lets us scrutinize these results further, so the final output aligns closely with the user's intent and the context of the query. The process is more akin to "retrieval-augmented deletion" than traditional retrieval-augmented generation: GPT-3.5 isn't creating new content here, it acts as a smart filter, sifting through the retrieved data and removing anything irrelevant. The result (shown in the sketch below) is a pipeline that still follows the spirit of RAG, augmenting the model's response with information retrieved from our menu data.
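Putting these steps together, the end-to-end flow looks roughly like the sketch below (using the function names that are defined later in the article):

def answer_menu_query(user_query):
    candidates = query_index(user_query)                       # steps 2-3: semantic retrieval from the vector index
    filtered = generate_response_gpt(user_query, candidates)   # step 4: GPT-3.5 prunes irrelevant hits
    return describe_items(filtered)                            # format the remaining items for the user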

Why Do We Need a Vector Index?

Before going too far, I think it’s important to explain why we are using a vector index in the first place.
You may be wondering: why not just loop through the menu, pass the user's query and each page to the model, and simply ask which results are relevant? A brute-force approach of passing the entire menu content through GPT-3.5 for every query would indeed be possible, but it would be highly inefficient for several reasons.
Firstly, it could be cost-prohibitive due to GPT-3.5's pricing model, which is based on the number of tokens received and generated by the model, and processing extensive menus would require a large number of tokens. Secondly, it would significantly slow down response times, diminishing the user experience with delayed results. Instead, employing vector search to initially narrow down results ensures that only the most relevant items are considered by our LLM, maintaining both cost efficiency and high relevance in returned search results.
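To make the cost argument concrete, here's a rough back-of-envelope sketch. The token counts and the per-token price below are placeholders for illustration, not actual OpenAI pricing:

# Hypothetical numbers purely for illustration -- check current pricing and your menu's real size
FULL_MENU_TOKENS = 8_000      # sending the whole menu with every query (brute force)
RETRIEVED_TOKENS = 500        # sending only the top-k retrieved items (vector search first)
PRICE_PER_1K_INPUT_TOKENS = 0.001  # placeholder dollar rate per 1,000 input tokens

def input_cost(tokens, price_per_1k=PRICE_PER_1K_INPUT_TOKENS):
    """Estimated input-token cost of a single LLM call."""
    return tokens / 1000 * price_per_1k

print(f"Brute force: ${input_cost(FULL_MENU_TOKENS):.4f} per query")
print(f"With retrieval: ${input_cost(RETRIEVED_TOKENS):.4f} per query")
# Roughly a 16x difference per query under these assumptions, and the shorter
# prompt also means noticeably lower latency for the user.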

Data Standardization with GPT-4: Laying the Foundation

Before we can create a vector index, we will need to put our menu data in a format which can be accessed easily. In order to accomplish this, we will need to take our menu (which is currently in PDF form) and convert it into JSON format.
GPT-4 is a great tool for a job like this. Using a specially crafted prompt, we can pass the document to the model and have it convert the menu into JSON format.
Since the focus of this article is mainly centered on the AI search functionality, I'll omit most of the script used to convert the menu into JSON format. The script simply loops over each page of the menu and passes the contents to GPT-4, which converts them to JSON. Note that this script might not work perfectly for every PDF menu, and you may need to either adjust the prompt or even manually enter the contents into a JSON file.
Here's a snippet from the script:
import PyPDF2

# pdf_path, split_text, and the OpenAI client are defined elsewhere in the script
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    for page_num in range(len(reader.pages)):
        page_text = reader.pages[page_num].extract_text()
        text_chunks = split_text(page_text)
        for chunk in text_chunks:

            prompt_text = ("I'm going to give you some text that I want you to convert into a JSON object (list), "
                "each item with a title and description. For example, if the text describes various dishes at a restaurant, "
                "the JSON output might look like: [{'title': 'Chicken Special', 'description': 'A delicious chicken dish seasoned with herbs and spices.', 'keywords': 'chicken'}, "
                "{'title': 'Seafood Platter', 'description': 'An assortment of fresh seafood, including shrimp, scallops, and lobster.', 'keywords': 'seafood, shrimp, platter'}] "
                "NOTE: THE ITEM TITLES WILL NOT BE A CATEGORY OF ITEMS, RATHER THEY ARE SPECIFIC DISHES/ENTREES/APPETIZERS/DRINKS ETC. -> THEY ARE ONLY SPECIFIC ITEMS, NOT CATEGORIES "
                "NOTE: Keywords should be 2-4 categories/common search terms describing the item, for example (but not limited to) -> dessert, side, chicken, sandwich, cocktail, drink, burger, salad, etc. -> Try to use multiple categories/descriptors "
                "THIS DATA IS CHUNKED, SO IF THE DATA AT THE BEGINNING OR END (FOR TITLE OR DESCRIPTION) SEEMS INCOMPLETE COMPARED TO THE REST OF THE ENTRIES, JUST SKIP IT "
                "ONLY RESPOND WITH THE JSON OBJECT AND NOTHING ELSE. CONVERT THE FULL DATA. DO NOT TRUNCATE OR STOP EARLY. HERE IS THE DATA I WANT YOU TO CONVERT: " + chunk)

            response = client.chat.completions.create(
                model="gpt-4-1106-preview",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt_text}
                ]
            )

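The split_text helper referenced above isn't shown in the snippet. A minimal sketch of what it could look like, assuming roughly 4,000-character chunks with a fixed character overlap (both numbers are assumptions for illustration), is:

def split_text(text, chunk_size=4000, overlap=500):
    # Split page text into overlapping character chunks so that a menu item
    # cut off at one chunk boundary appears intact in the next chunk.
    # (Hypothetical sketch; the real script may chunk differently.)
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks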
It took a bit of prompt engineering to get this just right, but after a few tries I was able to build a prompt that parses the menu reliably. Note that we only pass about 4,000 characters of the menu to the model at a time, along with a prompt explaining what we want it to do. We also use an overlap between the sections passed to the model, which prevents items that straddle a chunk boundary from being skipped. The model parses this data and converts it into JSON. Here is a sample of the resulting file:
{
    "title": "Roasted Half Chicken",
    "description": "Slow-roasted with housemade herb butter. Served with choice of two sides.",
    "keywords": "chicken, roasted, herb butter",
    "page": 1
},
{
    "title": "Parmesan Crusted Chicken Pasta",
    "description": "Crispy hand breaded parmesan chicken breast with melted mozzarella and marinara sauce over linguine.",
    "keywords": "chicken, pasta, parmesan",
    "page": 1
},
Note that I prompted GPT-4 to add some 'keywords,' which I've found to be an effective way to improve our vector search later on, at only a marginal extra GPT-4 cost, since we only have to run this script a single time.

Creating the Vector Index

Once the menu data is in JSON format, the next step is to create a vector index. This involves converting each item on the menu into a vector, using a language model. The vectors represent the semantic content of the menu items, allowing for efficient and accurate searching based on the meaning of the user's query, rather than just keyword matching.
We'll use LlamaIndex to convert each item into a Document object, which will then be converted into an embedding that can easily be compared against user queries. Here is the code for creating the index:
from flask import Flask, render_template, request, jsonify
import json
import os
from llama_index import VectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage, get_response_synthesizer
from llama_index.schema import Document
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.postprocessor import SimilarityPostprocessor

def create_document_from_json_item(json_item):
    ti = json_item['title']
    des = json_item['description']
    keys = json_item['keywords']

    # if des:  # depending on the menu/embedding model, appending the description could be helpful
    #     ti += des

    if keys:
        ti += "keywords:" + keys
    document = Document(text=ti, metadata=json_item)
    return document

def generate_embeddings_for_document(document, model_name="BAAI/bge-small-en-v1.5"):
    embed_model = HuggingFaceEmbedding(model_name=model_name)
    embeddings = embed_model.get_text_embedding(document.text)
    return embeddings

file_path = "./gpt4_menu_data_v2.json"
index = None

if not os.path.exists("./index"):
    with open(file_path, 'r', encoding='utf-8') as file:
        json_data = json.load(file)

    documents = []
    for item in json_data:
        document = create_document_from_json_item(item)
        document_embeddings = generate_embeddings_for_document(document)
        document.embedding = document_embeddings
        documents.append(document)

    service_context = ServiceContext.from_defaults(llm=None, embed_model='local')
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    index.storage_context.persist(persist_dir="./index")
else:
    storage_context = StorageContext.from_defaults(persist_dir="./index")
    service_context = ServiceContext.from_defaults(llm=None, embed_model='local')
    index = load_index_from_storage(storage_context, service_context=service_context)

We first define a few functions that convert the items in the JSON file into Document objects, which are then turned into embeddings using the BAAI/bge-small-en-v1.5 model (available on Hugging Face). Finally, we persist the vector index to disk, which avoids rebuilding the index every time we run the app.

How Does Vector Indexing Work?

Vector indexing works by embedding words or phrases into a high-dimensional space where semantically similar items are positioned close to each other. This allows the search algorithm to understand the context and nuances of search queries, providing more relevant and accurate results.
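To make this concrete, here's a minimal sketch using the same BAAI/bge-small-en-v1.5 model we use for the index (the menu items and query below are made up for the example):

from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Hypothetical items and query, purely for illustration
items = ["Roasted Half Chicken", "Seafood Platter", "Chocolate Lava Cake"]
query = "something sweet for dessert"

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

query_vec = embed_model.get_text_embedding(query)
for item in items:
    item_vec = embed_model.get_text_embedding(item)
    # The dessert should score noticeably higher than the chicken or the seafood
    print(item, round(cosine_similarity(query_vec, item_vec), 3))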

Querying the Vector Index

Now that our vector index is created, we are ready to write the code that will query the index given a user query.
retriever = VectorIndexRetriever(index=index, similarity_top_k=10)
service_context = ServiceContext.from_defaults(llm=None, embed_model='local')
response_synthesizer = get_response_synthesizer(service_context=service_context)
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.6)]
)

def parse_response_to_json(response_str):
    items = response_str.split("title: ")[1:]  # Split the response and ignore the first empty chunk
    json_list = []

    for item in items:
        lines = item.strip().split('\n')
        item_json = {
            "title": lines[0].strip(),
            "description": lines[1].replace("description: ", "").strip(),
            "keywords": lines[2].replace("keywords: ", "").strip(),
            "page": int(lines[3].replace("page: ", "").strip())
        }
        json_list.append(item_json)

    return json_list

def query_index(query):
    response = query_engine.query(query)
    # Parse the raw response into a JSON list of menu items
    return parse_response_to_json(str(response))
First, we create a VectorIndexRetriever to fetch the top 10 most semantically similar items from the index. The query_engine is then set up with a response synthesizer and a similarity filter with a cutoff of 0.6, ensuring only closely relevant results are returned.
With the vector index in place, users can query the menu using natural language. The system interprets these queries, retrieves the most relevant items from the index, and presents them in an easily digestible format.
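For example, calling the query helper directly might look something like this (the query string and the printed output are illustrative):

results = query_index("appetizers with chicken")
# Each result is a dict parsed from the retrieved nodes, e.g.
# {"title": "...", "description": "...", "keywords": "chicken, appetizer", "page": 2}
for item in results:
    print(item["title"], "- page", item["page"])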

Enhancing Search Relevance with GPT-3.5

To further refine the search results, we employ GPT-3.5. This cheaper model is perfectly capable of assisting in filtering the results returned by our vector index to ensure that only the most relevant items are presented to the user.
Plenty of open-source LLMs would likely suffice for this filtering stage as well. However, I have a lot of trust in the GPT-3.5 models (especially without an abundance of fine-tuning data for our specific task), so we will use GPT-3.5 today. In a production setting, starting with a model like GPT-3.5 and collecting data to align a smaller open-source model is also a great option: it ensures high reliability at launch while offering price reductions down the road.

Logging App Usage With W&B Weave

Logging data in any AI-driven application, like a searchable restaurant menu, is valuable for several reasons. Primarily, it allows for the collection of invaluable insights about how users interact with the application. By analyzing this data, businesses can identify popular features, common queries, and user preferences.
Weights & Biases Weave is a great option for logging our data! With its intuitive dashboard, developers can monitor real-time interactions, assess the performance of the AI models in responding to queries, and detect any anomalies or areas for optimization. This level of detailed analysis is invaluable for iterative development, allowing for continuous improvement based on actual user data.
Here is the rest of our Flask app, which queries, filters, and returns our results:
# (Flask, json, os, and the LlamaIndex imports were shown in the earlier block)
import re
import wandb
import openai
from weave.monitoring import StreamTable  # StreamTable comes from Weave's monitoring module

# Log in to W&B
wandb.login()

# Define constants for the StreamTable
WB_ENTITY = ""  # Set your W&B entity name here, or leave it empty to use the currently logged-in entity
WB_PROJECT = "ai_menu"
STREAM_TABLE_NAME = "usage_data"

# Define a StreamTable
st = StreamTable(f"{WB_ENTITY}/{WB_PROJECT}/{STREAM_TABLE_NAME}")

app = Flask(__name__)

client = openai.OpenAI(api_key='sk-YOUR_API_KEY')

@app.route('/chat')
def index():
    return render_template('chat.html')

@app.route('/')
def menu():
    return render_template('index.html')

def describe_items(json_list):
    description_str = "Some possible items you might be interested in include the following:<br><br>"
    for item in json_list:
        description_str += f"<strong>{item['title']}</strong> - {item['description']}<br><br>"
    return description_str

def generate_response_gpt(query, original_res):
    # Build the filtering prompt for GPT-3.5
    prompt = f"This is a user at a restaurant searching for items to order. Given these initial results {original_res} for the following user query '{query}', return the JSON object for the items that make sense to include as a response (e.g., remove only items that are not at all relevant to the query='{query}') -- keep in mind that they may all be relevant and it's perfectly fine to not remove any items. YOU MUST RETURN THE RESULT IN JSON FORM"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    if response.choices:
        reply = response.choices[0].message.content
        # The model is expected to wrap its JSON in a ```json ... ``` block
        filtered_res_json_str = re.search(r"```json(.+?)```", reply, re.DOTALL)

        print(filtered_res_json_str)
        if filtered_res_json_str:
            filtered_res_json = json.loads(filtered_res_json_str.group(1))
            if not len(filtered_res_json):
                return original_res
        else:
            filtered_res_json = original_res
        return filtered_res_json
    else:
        return original_res


@app.route('/search', methods=['POST'])
def search():
    query = request.form.get('query')
    original_res = query_index(query)
    filtered_res_json = generate_response_gpt(query, original_res)
    st.log({"query": query, "results": describe_items(filtered_res_json)})
    return jsonify({'res': describe_items(filtered_res_json)})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
The code begins with integrating Weights & Biases using wandb.login(). This step involves defining constants such as WB_ENTITY, WB_PROJECT, and initializing a StreamTable in W&B, named "usage_data". This table is designed to log and track user interactions within the Flask application.
The Flask app defines two routes, '/chat' and '/', which correspond to the two pages of the interface: the regular PDF menu viewer and the chat screen. The OpenAI client is created with openai.OpenAI(api_key='sk-YOUR_API_KEY'), which allows the application to call the OpenAI API.
When a user enters a query, the query_index(query) function is triggered. This function is responsible for conducting an initial search within a predefined index. It extracts relevant initial data based on the user's input.
The generate_response_gpt(query, original_res) function takes the initial results from query_index(query) and applies GPT-3.5 to enhance and refine these results. It involves generating prompts and processing GPT-3.5's responses to ensure the final output is highly relevant and tailored to the user's query.
Lastly, the Flask route /search is the entry point that passes user queries through the functions above. It calls query_index(query) to gather relevant items, then generate_response_gpt(query, original_res) to filter the retrieved results, and finally logs the output to W&B's StreamTable using st.log(). This process not only provides users with refined answers but also captures valuable data for analytics and further improvement of the system.
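Once the app is running locally, the /search endpoint can be exercised directly; for instance, a quick test script might look like this (the query is just an illustration):

import requests

# Hypothetical local test of the /search endpoint
resp = requests.post("http://localhost:5000/search", data={"query": "appetizers with chicken"})
print(resp.json()["res"])  # HTML snippet listing the filtered menu items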

Why Filter the Results?

While a vector index can efficiently retrieve menu items based on semantic content, it may still return results that are not entirely relevant to the user's specific query. GPT-3.5's advanced language understanding capabilities can be used to further scrutinize these results, ensuring that they align closely with the user's intent and context of the query. This refinement process enhances the precision and relevance of the search results.

The App: Bringing It All Together

My web development experience is somewhat limited, so this app will be more of a proof of concept.
Initially, I envisioned an interface where items returned by the user's query would be highlighted on the menu. However, implementing that is a bit of a tall task for my web development skillset (working with PDFs on the web is tricky, especially on mobile, which is our primary deployment target), so I'll build an interface closer to the ChatGPT interface, where the user can switch between a chat screen and the PDF menu.
The front end is very much in its "MVP" form, but everything is functional. The code for this front end is available in the GitHub repo. Still, let's look at some screenshots to see what we're working with.
First, our menu:

Next, what it looks like when a user clicks the blue button above:


And our results:

Once users start using the app, we can log into Weave and view our table to examine what users are searching for and how well our system is performing. Here is the table for our app. We can see that our AI-powered search is intelligent enough to handle specific queries, outperforming simple keyword searches: when searching for just "chicken," lots of different items show up, but when we specify that we are only looking for "appetizers with chicken," our search returns only appetizer items.

[Weave StreamTable panel: logged rows with query, results, and timestamp columns]
In summary, by leveraging advanced AI models like GPT-4 and GPT-3.5, along with RAG, and integrating them with a web app, we have transformed the static, unyielding PDF menu into an interactive experience. This not only elevates the customer experience but also paves the way for more intelligent and responsive digital solutions in the hospitality industry. If you have any questions, feel free to ask in the comments below! Here is the repo for the project.
