Building an LLM App with TinyLLaVA
Building an LLM-powered app with TinyLLaVA, LangChain, W&B, Transformers, and Gradio!
Created on January 20|Last edited on February 1
Introduction
The purpose of this article is to build an LLM-powered app with TinyLLaVA, LangChain, W&B, Transformers, and Gradio! To follow along, you can check out this notebook and this repository. For the wandb project, click here.
I took inspiration from the W&B course by Darek Kłeczek, showcasing how you can use W&B with LangChain, Transformers, and Gradio to build an LLM-powered app. Huge thanks to Darek Kłeczek, Bharat Ramanathan, Thomas Capelle, and all the guest speakers for making this course possible!
We'll begin by covering Visual Instruction Tuning LLaVA (Large Language and Vision Assistant). Then, we'll cover its successor, Improved Baselines with Visual Instruction Tuning (LLaVA-1.5). Both of these papers can be conveniently found on the paper's website. Lastly, I'll cover TinyLlama, the backbone fine-tuned with the LLaVA-1.5 dataset to produce TinyLLaVA!
If you want to read a little more about using LLaVA for fine-tuning, check out this report:
Note: feel free to skip the next three sections (LLaVA, LLaVA-1.5, and TinyLlama). They break down the papers and give us a deeper understanding of what we're working with.
Table of Contents
- Introduction
- Table of Contents
- 🗻What is LLaVA?
  - GPT-assisted Visual Instruction Data Generation
    - Naive Method
    - LLaVA Method
  - Visual Instruction Tuning
    - Stage 1: Pretraining for Feature Alignment
    - Stage 2: Fine-tuning End-to-End
  - Experiments
- 🌋What is LLaVA-1.5?
- 🤏🌋What is TinyLLaVA?
- 🚧Let's Dive into the Code!
  - 🔩 Setup
  - 🧬Chroma: Vector Database with OpenAI
  - 🧪 Pipeline
  - 🚚 Building the App
- 🤔 Discussion & Future Work
- 👋 Conclusion
- References
🗻What is LLaVA?
As mentioned in the intro, LLaVA is short for Large Language and Vision Assistant. In more detail, LLaVA brings:
- A dataset construction pipeline to convert image-text pairs into appropriate image-text instruction-tuning data
- A finetuning dataset, LLaVA-Instruct-150K, for image-text instruction-following, and a filtered-down version of CC3M for pretraining, called LLaVA-CC3M-Pretrain-595K
- 2 LLaVA models, both descendants of Vicuna-13B (lmsys/FastChat also provides Vicuna models): one finetuned on ScienceQA and the other on the image-text instruction-following data curated in this paper (LLaVA-Instruct-150K)
- 2 small evaluation benchmarks, LLaVA-Bench (COCO) and LLaVA-Bench (In-the-Wild)
GPT-assisted Visual Instruction Data Generation
Let's look at a couple methods for data generation. This will be important in a minute.
Naive Method
For an image $X_v$ and its caption $X_c$, create a set of questions $X_q$ (via GPT-4, presumably) that instruct the assistant to describe the image content. Then, structure the input prompt as:
"Human: $X_q$ $X_v$ <STOP> Assistant: $X_c$ <STOP>"
Given some image and caption, this method is essentially prompting GPT-4 to generate questions about the image that the caption can answer. They are structured like the above to be a turn-based conversation. The authors argue this method lacks diversity and in-depth reasoning, something they can mitigate with their dataset construction method.
LLaVA Method
The authors used the Microsoft Common Objects in Context (MS COCO) dataset as a starting point. For every image, they use the captions and bounding box information (no image included) in the prompt sent to GPT-4. Specifically, they prompted GPT-4 to generate three types of instruction-following results for each image in MS COCO.
They developed seed examples for each of the 3 categories below for in-context learning:
- Conversation: Multi-turn question-answer pairs about the visual contents of the image
- Detailed description: From a list of questions/instructions asking about the image, a random question is selected, and sent to GPT-4 as a prompt (single-turn)
- Complex reasoning: Single-turn question-answer pairs reasoning about the image
The end result is a dataset of 158K unique language-image instruction-following instances, including 58K in conversations, 23K in detailed descriptions, and 77K in complex reasoning (the HF dataset says it's LLaVA-Instruct-150K, which is probably just convenient rounding).
Here's a table of instructions for the detailed description category and sample input/outputs from their dataset construction process:


Visual Instruction Tuning
Now to the actual model and training! The authors produced two identical LLaVA models, one for ScienceQA (science question-answering) and another for instruction-following (chatbot).

ScienceQA example.

As for the actual network architecture, they used Vicuna-13B as the underlying LLM (at the time of publication, this was the best instruction-following model among public LLM checkpoints).
Their vision encoder was the CLIP visual encoder ViT-L/14. This joint system takes as input an image $X_v$ and a language question/instruction $X_q$. The image is encoded with CLIP first to output $Z_v = g(X_v)$, then a projection (linear) layer $W$ maps it from CLIP's visual embedding space to the word embedding space, producing $H_v = W \cdot Z_v$.
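To make that concrete, here's a minimal sketch (my own, not the authors' code) of the projection step in PyTorch. The dimensions are illustrative picks for CLIP ViT-L/14 features and a Vicuna-13B-sized embedding space.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: CLIP ViT-L/14 patch features are 1024-d, and a
# Vicuna-13B-sized LLM uses 5120-d token embeddings.
clip_hidden_dim = 1024
llm_hidden_dim = 5120

# The trainable projection W: CLIP's visual embedding space -> word embedding space.
projection = nn.Linear(clip_hidden_dim, llm_hidden_dim)

# Z_v: CLIP features for one image (e.g., 256 patch tokens).
Z_v = torch.randn(1, 256, clip_hidden_dim)

# H_v: "visual tokens" that can be concatenated with the text token embeddings.
H_v = projection(Z_v)
print(H_v.shape)  # torch.Size([1, 256, 5120])
```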
Stage 1: Pretraining for Feature Alignment
This stage trains the projection layer and keeps the LLM and vision encoder frozen. They use the naive method to pretrain their projection layer on a 595K filtered-down CC3M dataset.
Stage 2: Fine-tuning End-to-End
Here, we keep the vision encoder frozen and fine-tune the projection layer and LLM. The authors fine-tuned and tested with 2 different datasets: ScienceQA and LLaVA-Instruct-150K.
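A rough sketch of how the two stages differ, using hypothetical `vision_encoder`, `llm`, and `projection` placeholders (not the real LLaVA components): the only change between stages is which parameter groups receive gradients.

```python
import torch.nn as nn

# Placeholder modules standing in for the real CLIP encoder, LLM, and projection layer.
vision_encoder = nn.Identity()
llm = nn.Identity()
projection = nn.Linear(1024, 5120)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: pretraining for feature alignment -- only the projection layer is trained.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projection, True)

# Stage 2: end-to-end fine-tuning -- the LLM is unfrozen, the vision encoder stays frozen.
set_trainable(llm, True)
```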
Here's one turn of a data instance from their LLaVA-Instruct-150K:

$$X_{instruct}^t = \begin{cases} \text{Randomly choose } [X_q^1, X_v] \text{ or } [X_v, X_q^1], & \text{the first turn } t = 1 \\ X_q^t, & \text{the remaining turns } t > 1 \end{cases}$$

For an image $X_v$, we have $T$ turns of conversation data $(X_q^1, X_a^1, \dots, X_q^T, X_a^T)$ (multi-turn for conversation, single-turn for detailed description and complex reasoning). Thus, $X_{instruct}^t$ is a single turn (turn $t$) for a single data instance. I'm assuming they randomly swap the order of the question and image to ensure order diversity.
The formula for a forward pass looks like:

$$p(X_a \mid X_v, X_{instruct}) = \prod_{i=1}^{L} p_\theta\left(x_i \mid X_v, X_{instruct,<i}, X_{a,<i}\right)$$

It's the probability of the ground-truth answer $X_a$ given the image $X_v$ and the entire conversation/detailed description/complex reasoning string $X_{instruct}$. This probability is computed as a product of per-token conditional probabilities.
- $L$ is the length of the sequence
- $\theta$ is the model weights for the LLM
- $x_i$ is the $i$-th token
- $X_v$ is the image
- $X_{instruct,<i}$ is all the conversation/detailed description/complex reasoning (instruction) tokens before the $i$-th token
- $X_{a,<i}$ are all the answer tokens before the $i$-th token
How do we summarize this in one sentence?
This formulation is all to say that the predicted probability of LLaVA producing the ground-truth answer is conditioned on the provided question (about visual content or complex reasoning) and the provided image, and that this probability is computed autoregressively as a product of a sequence of predicted token probabilities.
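If the product-of-probabilities form is hard to picture, here's a toy sketch with random logits standing in for a causal LM. It just sums per-token log-probabilities over the answer positions (the log of the product above).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 100

# Pretend these are logits from a causal LM over the full sequence
# [visual tokens + instruction tokens + answer tokens].
seq_len, answer_len = 12, 4
logits = torch.randn(seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (seq_len,))

# The logits at position i-1 predict token i (hence the off-by-one shift).
log_probs = F.log_softmax(logits, dim=-1)
answer_positions = range(seq_len - answer_len, seq_len)
answer_log_prob = sum(log_probs[i - 1, tokens[i]] for i in answer_positions)

# log p(X_a | X_v, X_instruct) = sum of per-token log-probabilities.
print(answer_log_prob.item())
```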
Experiments
Here's a TL;DR overview of their training runs (from the paper itself):
We train all models with 8× A100s, following Vicuna’s hyperparameters [9]. We pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128, and fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32.
They tested it against GPT-4, BLIP-2, and OpenFlamingo on a pretty iconic example!

Their quantitative evaluation used triplets of an image, its ground-truth textual description (the visual information), and a question. The candidate model (LLaVA) predicts the answer based on the question and image, and this predicted answer is fed into GPT-4 (the judge) along with the ground-truth textual description. GPT-4, as the judge, gives LLaVA's answer a score between 1 and 10.
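For intuition, a hedged sketch of that judging step might look like the following. The prompt wording and model name here are my own placeholders, not the paper's exact setup; it assumes `openai>=1.0` and an `OPENAI_API_KEY` in the environment.

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY in the environment

client = OpenAI()

def judge_answer(question: str, ground_truth_description: str, candidate_answer: str) -> str:
    """Asks GPT-4 to score a candidate answer from 1 to 10 (illustrative prompt, not the paper's)."""
    judge_prompt = (
        "You are a judge for visual question answering. Given a question about an image, "
        "a ground-truth textual description of that image, and a candidate answer, "
        "rate the answer from 1 to 10 and briefly explain why.\n\n"
        f"Question: {question}\n"
        f"Ground-truth description: {ground_truth_description}\n"
        f"Candidate answer: {candidate_answer}\n"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return response.choices[0].message.content
```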
In their experiments, they create 2 small evaluation benchmarks: LLaVA-Bench (COCO) and LLaVA-Bench (In-the-Wild).
LLaVA-Bench (COCO)
Details:
- Random set of 30 images from COCO-Val-2014 where each image has 3 types of questions (conversation, detailed description, complex reasoning), totaling 90 questions.
- Tests model alignment behavior/capabilities with consistent visual inputs; essentially evaluates a candidate model along 3 axes: conversation, detailed description, complex reasoning
- Showed that using data from all 3 axes improved scores the most, implying the 3 categories of questions do yield greater improvement over the naive method

LLaVA-Bench (In-the-Wild)
- Diverse set of 24 images with 60 questions in total aimed at testing vision-language models on more complex generalizability tasks in indoor and outdoor settings (hence, in-the-wild)
- Has questions from each of the 3 categories mentioned before
- Showed that LLaVA achieves strong performance in in-the-wild scenarios
- Discovered interesting limitation where LLaVA seems to fail in grasping complex semantics in an image

See a challenging example and limitation below:

Their next set of experiments were on ScienceQA, a 21k multimodal dataset (12726 train, 4241 validation, 4241 test samples) spanning 3 subjects and 26 topics.
The ScienceQA-fine-tuned LLaVA was evaluated against GPT-3.5 with/without Chain-of-Thought (CoT), LLaMA-Adapter, and multimodal CoT (the then-current SOTA). LLaVA alone achieves 90.92%, close to the 91.68% SOTA. The authors combined LLaVA with GPT-4 via 2 methods. The first, complement, only uses LLaVA when GPT-4 fails to provide an answer. The second, judge: if GPT-4 and LLaVA produce different answers, GPT-4 is re-prompted with the question and both answers to produce a final answer.

They also ran ablation studies testing for:
- Visual Features (last layer of CLIP vs penultimate layer): found that the last layer performs 0.96% lower than the penultimate layer; hypothesize the last layer might be producing global features rather than more local ones
- Chain-of-Thought (order of answer and reasoning process; reason → answer vs answer → reason): they tested whether the order in which the LLM within LLaVA generates its answer matters; found that the order change does not improve overall performance but reasoning-first followed by the answer (much like zero-shot CoT) can improve convergence
- Pre-training (opt-in/opt-out for pretraining): they trained a LLaVA model directly on ScienceQA, skipping pretraining → 85.81%, a 5.11% drop in performance (this makes sense because the projection layer is untrained!)
- Model size (13B original vs new 7B): the 13B model achieved 90.92% and the 7B achieved 89.84% → larger model = better performance
🌋What is LLaVA-1.5?

The above table shows the incremental additions they made to LLaVA to further boost its performance. They show that LLaVA-1.5 achieves the best performance with much less data while being a simple framework and using less compute.
LLaVA struggled to balance short- and long-form visual question-answering. The authors hypothesize it may be due to ambiguous prompts; in other words, the model doesn't know what format the output should take. To solve this, they append this short string to the end of the prompt: "Answer the question using a single word or phrase."
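In code, that response-format fix is just a suffix appended when building the prompt. A small illustrative example (the question is made up):

```python
question = "How many dogs are in the image?"
format_hint = "Answer the question using a single word or phrase."

# LLaVA-1.5-style short-answer prompt: the hint disambiguates the expected output format.
prompt = f"USER: <image>\n{question} {format_hint}\nASSISTANT:"
print(prompt)
```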

They changed their single linear layer to a 2-layer Multi-layer Perceptron (MLP), essentially 2 linear layers, leading to an improvement in performance.
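A minimal sketch of that swap (illustrative dimensions, and treat this as an approximation rather than their exact code): the single linear projector becomes a small MLP with a non-linearity in between.

```python
import torch.nn as nn

clip_hidden_dim, llm_hidden_dim = 1024, 5120  # illustrative dimensions

# LLaVA: a single linear projection.
linear_projector = nn.Linear(clip_hidden_dim, llm_hidden_dim)

# LLaVA-1.5: a 2-layer MLP projector with a non-linearity in between.
mlp_projector = nn.Sequential(
    nn.Linear(clip_hidden_dim, llm_hidden_dim),
    nn.GELU(),
    nn.Linear(llm_hidden_dim, llm_hidden_dim),
)
```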

They included additional academic-task-related datasets for VQA and OCR, such as outside-knowledge VQA (OKVQA), Augmented OKVQA (A-OKVQA), OCRVQA, and TextCaps. They find that just by adding a subset of InstructBLIP's training data, LLaVA already surpasses it on all 3 tasks in Table 1.

They also scale up the resolution of the image and use the GQA dataset. They include data from ShareGPT and scale up their LLM from 7B to 13B (not sure why they didn't start with 13B like in the previous LLaVA paper).
In summary, all of these additions and their effects on performance can be seen in Table 1.
It is encouraging that LLaVA-1.5 achieves the best performance with the simplest architecture, academic compute and public datasets, and yields a fully-reproducible and affordable baseline for future research.
Below are their results across 12 benchmarks.

They note a few properties and limitations:
- LLaVA-1.5 is somewhat multilingual thanks to ShareGPT data
- increasing the image resolution (now 336px) doubles training time (~6 hours of pretraining and a whopping ~20 hours of visual instruction tuning) on 8× A100s
- LLaVA-1.5 uses full image patches, prolonging training; a sample-efficient visual resampler could help
- LLaVA-1.5 can't process multiple images at once; it wasn't trained to do so
- limited proficiency in other domains
🤏🌋What is TinyLLaVA?
TinyLLaVA is built on the training scheme and data from LLaVA-1.5 and the model from TinyLlama. Let's briefly cover TinyLlama and TinyLLaVA.
TinyLlama is a popular project focused on training a 1.1B parameter Llama model on 3 trillion tokens! The tokenizer and model architecture are based on Llama 2. Below is a convenient table from their README.md summarizing training logistics.

TinyLlama boasts portability, speed, and efficiency.
TinyLLaVA is a small 1.1B parameter version of the LLaVA models. Note that the backbone of TinyLLaVA is TinyLlama, not Vicuna or the other LLMs used in the original papers. TinyLLaVA is trained the same way LLaVA-1.5 is trained.
🚧Let's Dive into the Code!
Let's step through the code:
🔩 Setup
We need:
- wandb
- transformers (latest version to use with TinyLLaVA)
- gradio for building the actual app
- chromadb for storing conversations
- openai and tiktoken for the embedding function used in our Chroma vector database
- langchain as the LLM framework wrapper around Chroma (convenience)
- accelerate and bitsandbytes for quantized loading
```python
!pip install wandb -qqq
!pip install git+https://github.com/huggingface/transformers -qqq
!pip install --upgrade gradio -qqq
!pip install chromadb -qqq
!pip install openai -qqq
!pip install tiktoken -qqq
!pip install langchain -qqq
!pip install accelerate -qqq
!pip install bitsandbytes -qqq
```
Up next are our imports.
```python
import os
import requests
import numpy as np
import torch
import datetime

# For loading in the tiny-LLaVA-v1-hf model in a transformers pipeline.
import transformers
from transformers import pipeline
from transformers import BitsAndBytesConfig

# For converting input images to PIL images.
from PIL import Image

# For creating the gradio app.
import gradio as gr

# For creating a simple prompt (open to extension) to our model.
from langchain.prompts import PromptTemplate

# Our vector database of choice: Chroma!
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.openai import OpenAIEmbeddings
import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

# For loading in our OpenAI API key.
from google.colab import userdata

# For logging.
import wandb
from wandb.sdk.data_types.trace_tree import Trace

wandb.login()

# Required for us to load in our pipeline for TinyLLaVA.
assert transformers.__version__ >= "4.35.3"
```
🧬Chroma: Vector Database with OpenAI
First off, before we even see any code: what is a vector database? Or a vector store? Pinecone's page (they're a vector database provider) has a great explanation:
A vector database is a type of database that indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations, metadata filtering, and horizontal scaling.
They are exactly like a normal database except everything is done with vectors. Retrieval is done with some form of similarity searching. Vector store and vector database are often interchangeable terms and you can think of a vector store as a more light-weight version of a comprehensive vector database!
My next natural question: why are they useful?
The challenge of working with vector data is that traditional scalar-based databases can’t keep up with the complexity and scale of such data, making it difficult to extract insights and perform real-time analysis. That’s where vector databases come into play – they are intentionally designed to handle this type of data and offer the performance, scalability, and flexibility you need to make the most out of your data.
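To ground the idea of retrieval by similarity, here's a tiny, self-contained cosine-similarity example in NumPy. This is the concept only, not how Chroma is implemented internally.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these are embeddings of stored documents and of a new query.
stored = {
    "doc_about_giraffes": np.array([0.9, 0.1, 0.0]),
    "doc_about_pigs": np.array([0.1, 0.9, 0.0]),
}
query = np.array([0.85, 0.2, 0.1])

# Retrieval = rank the stored vectors by similarity to the query and take the top hit.
ranked = sorted(stored.items(), key=lambda kv: cosine_similarity(query, kv[1]), reverse=True)
print(ranked[0][0])  # doc_about_giraffes
```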

```python
# Use OpenAI's embeddings for our Chroma collection.
embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=userdata.get("OPENAI_API_KEY"),
)
collection = Chroma("conversation_memory", embeddings)
```
Let's explain the code a little bit.
OpenAIEmbeddings is one of many embedding providers LangChain has. You can specify any of their embedding models. I chose text-embedding-ada-002. We use LangChain as a convenient interface to Chroma instead of using Chroma directly (though you can do this if you want). We define our Chroma vector database/collection by specifying a name for our database and then the embedding function. For a full list of operations that LangChain's Chroma supports, click here.
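As a quick usage sketch with the `collection` defined above (the texts are made-up examples), adding and retrieving documents looks like this:

```python
# Add a couple of texts to the collection (they are embedded automatically).
collection.add_texts(texts=[
    "Image Description: a giraffe standing in a grassy field.",
    "Image Description: a pig resting in the mud.",
])

# Retrieve the 2 most similar stored texts for a new query.
docs = collection.similarity_search(query="Tell me about the giraffe.", k=2)
for doc in docs:
    print(doc.page_content)
```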
🧪 Pipeline
Next is instantiating our model inference pipeline. Huge thanks to Baichuan Zhou for releasing Tiny-LLaVA-v1-hf!
```python
# Ref: https://huggingface.co/bczhou/tiny-llava-v1-hf
model_id = "bczhou/tiny-llava-v1-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

pipe = pipeline(
    "image-to-text",
    model=model_id,
    device_map="auto",
    use_fast=True,
    model_kwargs={"quantization_config": bnb_config}
)
```
Let's walk through our code. On the model page, there are 2 ways to instantiate an inference pipeline with this model. I chose the pipeline approach.
We define a BitsAndBytesConfig for loading in 4-bits with double quantization in nf4 datatype. Our compute dtype will be torch.bfloat16. For more information, check out this great quantization page on Hugging Face! For more information on BitsAndBytesConfig, click here.
And setting up our inference pipeline is super simple. We define a Transformers pipeline with the task "image-to-text" and pass in the model_id we defined earlier. We set device_map="auto" to automatically use the GPU if available. use_fast=True specifies a fast tokenizer. Our model_kwargs are the values passed into the .from_pretrained(...) call that happens internally when instantiating the pipeline; that means our bnb_config is passed into .from_pretrained(...). The quantization drastically improves inference time and memory usage. I originally (mistakenly) ran the app on CPU, resulting in a staggering 80 seconds per pipeline forward pass. Running on a Colab T4 GPU improved the wall-clock time 4x, to 20 seconds per pipeline forward pass. The additional quantization brings the 20 seconds/generation down to 5 seconds.
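If you'd like to sanity-check those timings on your own hardware, a simple wall-clock measurement around one pipeline call works. This sketch assumes the `pipe` defined above and uses a blank test image so it stays self-contained.

```python
import time
from PIL import Image

# Any test image works; a blank RGB image keeps this sketch self-contained.
test_image = Image.new("RGB", (336, 336), color="white")
prompt = "USER: <image>\nWhat's in this image?\nASSISTANT: "

start = time.perf_counter()
outputs = pipe(test_image, prompt=prompt, generate_kwargs={"max_new_tokens": 50})
elapsed = time.perf_counter() - start
print(f"One forward pass took {elapsed:.1f}s")
```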
🚚 Building the App
Let's build the app!
This first snippet of code is used to get 2 images: one for the user and one for the chatbot. These will be used later in the gradio app as profile pictures. I provided 2 default profile pictures, one of a giraffe and one of a pig!
```python
try:
    assert user_avatar_image_path
except:
    img_data = requests.get("https://imgur.com/QehpHeV.png").content
    with open('user_avatar.png', 'wb') as handler:
        handler.write(img_data)
    user_avatar_image_path = "user_avatar.png"

try:
    assert chatbot_avatar_image_path
except:
    img_data = requests.get("https://imgur.com/ki4hPhZ.png").content
    with open('chatbot_avatar.png', 'wb') as handler:
        handler.write(img_data)
    chatbot_avatar_image_path = "chatbot_avatar.png"
```
I also provide a short snippet of code to pull a sample image for a test run (an iconic example from LLaVA).
```python
# Let's get a sample image to use. You can download it and pass it into the app!
# The prompt is: What's unusual about this image?
img_data = requests.get("https://imgur.com/Ca6gjuf.png").content
with open('sample_image.png', 'wb') as handler:
    handler.write(img_data)
```

What's unusual about this image?
Next, let's define max_new_tokens (max number of new tokens generated by the model) and a folder for storing user inputted images.
```python
max_new_tokens = 200

# Path for storing images.
IMG_ROOT_PATH = "data/"
os.makedirs(IMG_ROOT_PATH, exist_ok=True)
```
Since we are using gr.ChatInterface, which requires a function with a signature message: str, history: list, let's define that.
```python
def generate_output(message: str, history: list, img: np.ndarray) -> str:
    """Generates an output given a message and image."""
```
Since our model is multi-modal, we will eventually specify an additional input in gr.ChatInterface. So we need to pass that additional input as an extra parameter in our function generate_output. We use LangChain to create our prompts.
status = "success"# Get detailed description of the image for Chroma.query = "Please provide a detailed description of the image."prompt = PromptTemplate.from_template("USER: <image>\n" +"{query}" +"\n" +"ASSISTANT: ")start_time_ms = datetime.datetime.now().timestamp() * 1000try:outputs = pipe(Image.fromarray(img), prompt=prompt.format(query=query), generate_kwargs={"max_new_tokens": max_new_tokens})img_desc = outputs[0]["generated_text"].split("ASSISTANT:")[-1]status_message = (None,)except Exception as e:status = "error"status_message = str(e)img_desc = ""end_time_ms = round(datetime.datetime.now().timestamp() * 1000)
Here's our first snippet of code. We have a status variable which changes to "error" if there's an error. Our model inference pipeline will run twice for every input, once to describe the image and once to answer the user's query.
Our first run will have the specified query and the prompt. This is passed through our pipeline with the max_new_tokens as part of the generate_kwargs parameter. The status_message is just a detailed description of the status. We also keep track of the start and end time of the generation for logging. I encourage you to dive into the code and play around with each bit of it to get a strong sense of how it works!
```python
    # Create a span in wandb.
    root_span = Trace(
        name="img_desc_span",
        kind="llm",  # kind can be "llm", "chain", "agent" or "tool"
        status_code=status,
        status_message=status_message,
        metadata={
            "max_new_tokens": max_new_tokens,
            "model_name": model_id,
        },
        start_time_ms=start_time_ms,
        end_time_ms=end_time_ms,
        inputs={"system_prompt": prompt.format(query=query), "query": query},
        outputs={"response": img_desc},
    )

    # Log the span to wandb.
    root_span.log(name="img_desc_trace")
```
Next, we will create our W&B Trace. This will contain all the metadata we want to include for our chatbot from input to output. All the useful data from the first run through the pipeline earlier is logged here:
- Trace Name
- Kind
- Status & Status Message
- Metadata
- Start/End times
- Input/Outputs
There is a lot more customizability, but these are the simplest options. For more information, click here.
Now, we will run the user's query through the pipeline. Our logging procedure will be identical to our first run.
The only new component is that we are incorporating our Chroma vector database! Our collection is queried with the user's message and the most relevant documents in the collection (top 2) are retrieved. These serve as context for the prompt.
```python
    # Visual Question-Answering!
    prompt = PromptTemplate.from_template(
        "Context: {context}\n\n"
        "USER: <image>\n" +
        "{message}" +
        "\n" +
        "ASSISTANT: "
    )
    context = collection.similarity_search(query=message, k=2)
    context = "\n".join([doc.page_content for doc in context])

    # Forward pass through the model with given prompt template.
    start_time_ms = datetime.datetime.now().timestamp() * 1000
    try:
        outputs = pipe(
            Image.fromarray(img),
            prompt=prompt.format(context=context, message=message),
            generate_kwargs={"max_new_tokens": max_new_tokens}
        )
        response = outputs[0]["generated_text"].split("ASSISTANT:")[-1]
        status_message = (None,)
    except Exception as e:
        status = "error"
        status_message = str(e)
        response = ""
    end_time_ms = round(datetime.datetime.now().timestamp() * 1000)

    # Create a span in wandb.
    root_span = Trace(
        name="response_span",
        kind="llm",  # kind can be "llm", "chain", "agent" or "tool"
        status_code=status,
        status_message=status_message,
        metadata={
            "max_new_tokens": max_new_tokens,
            "model_name": model_id,
        },
        start_time_ms=start_time_ms,
        end_time_ms=end_time_ms,
        inputs={
            "system_prompt": prompt.format(context=context, message=message),
            "query": message
        },
        outputs={"response": response},
    )

    # Log the span to wandb.
    root_span.log(name="response_trace")
```
The second-to-last component of this function is updating our vector database with the model's generation. I add the image description, user message, and the model's response as a single string to the database.
```python
    # Add (img_desc, message, response) 3-tuple to Chroma collection.
    text = f"Image Description: {img_desc}\nUSER: {message}\nASSISTANT: {response}\n"
    collection.add_texts(texts=[text])
```
The last component of this function is the return statement, which is what's shown on your gradio app. I show the image description followed by the model's response.
```python
    # Return model output.
    return img_desc + "\n\n" + response
```
Finally, let's initialize our W&B project and build the gradio app itself!
```python
wandb.init(project="building_llm_app")

# Define the ChatInterface, customize, and launch!
gr.ChatInterface(
    generate_output,
    chatbot=gr.Chatbot(
        label="Chat with me!",
        show_label=True,
        container=False,
        scale=5,
        height=300,
        show_share_button=True,
        show_copy_button=True,
        avatar_images=(user_avatar_image_path, chatbot_avatar_image_path),
        likeable=False,
        layout="bubble",
        bubble_full_width=False
    ),
    textbox=gr.Textbox(
        lines=1,
        max_lines=5,
        placeholder="Message ...",
        container=False,
        scale=7,
        info="Input your textual response in the text field and your image below!"
    ),
    additional_inputs="image",
    additional_inputs_accordion=gr.Accordion(
        open=True,
    ),
    title="Language-Image Question Answering with bczhou/TinyLLaVA-v1-hf!",
    description="""This simple gradio app internally uses a Large Language-Vision Model (LLVM) and the Chroma vector database for memory.
Note: this minimal app requires both an image and a text-based query before the chatbot system can respond.""",
    theme="soft",
    submit_btn="Submit ▶",
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
).launch(debug=True, share=True)
```
Notice our generate_output function is the first argument to creating our gr.ChatInterface. We also define a gr.Chatbot with some design parameters. There are useful gradio features like "share" and "copy". We also use the avatar images we defined earlier! The textbox gr.Textbox will be a single line with a maximum of five lines. We add an additional input, "image". More information on additional inputs can be found in gr.ChatInterface. Lastly, we have a title, description, theme, and a couple button configurations. The last thing is to call .launch()!
Voilà, you just built an LLM-powered, local app with a small multi-modal LLM from Transformers, LangChain for prompt templates and the Chroma vector database, W&B for logging, and gradio for app building! 🥳

🤔 Discussion & Future Work
The app we just built is riddled with limitations. This is where you come in: to improve the app!
- improve the additional_inputs image field (maybe move it to a column or make the text field a multimodal input field)
- implement output streaming (see the sketch after this list)
- try implementing your own gradio theme!
- add the ability for an image to be passed in with no text (and vice versa)
- try using some form of LangChain-supported Hugging Face pipeline in generate_output (click here)
- try building a more complex PromptTemplate (maybe use ChatPromptTemplate)
- find a way to directly add visual memory to the context
- implement some more advanced form of retrieval-augmented generation (RAG); maybe a multi-modal vector database?
- try using a different multi-modal LLM like TinyGPT-V!
- deploy the app!
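For the output-streaming idea above, one hedged starting point is to turn generate_output into a generator that yields partial responses (gr.ChatInterface accepts generator functions). The sketch below fakes streaming by chunking a placeholder string; a real version would stream tokens from the model.

```python
def generate_output_streaming(message: str, history: list, img: np.ndarray):
    """Toy streaming variant: yields the response a few words at a time."""
    # In a real version you'd stream tokens from the model (e.g., with a
    # transformers TextIteratorStreamer); here we fake it with a placeholder string.
    response = "This is a placeholder response from the model."
    partial = ""
    for word in response.split():
        partial += word + " "
        yield partial.strip()
```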
👋 Conclusion
If you've made it this far, thank you for reading all the way through! This article is quite dense, especially if you walked through the research papers with me. By now, you've built an LLM-powered app with a plethora of technologies! For further exploration, check out the other courses offered by W&B. Thanks for reading! 👋😎
References
Developer Tools
OpenAIEmbeddings: https://api.python.langchain.com/en/stable/embeddings/langchain_community.embeddings.openai.OpenAIEmbeddings
LangChain Chroma: https://api.python.langchain.com/en/stable/vectorstores/langchain_community.vectorstores.chroma.Chroma
W&B Course
W&B Course Author Socials
LLaVA
LLaVA-1.5
TinyLLaVA & Related Materials