
Building a real-time answer engine with Llama 3.1 405B and W&B Weave

Infusing Llama 3.1 405B with internet search capabilities!
Created on August 2 | Last edited on August 7
The ability to search for and retrieve information from the vast expanse of the internet is a distinguishing feature that separates human intelligence from basic language models. While humans can dynamically search for the most up-to-date information, most language models are limited to the data they were trained on, which can quickly become outdated. To bridge this gap, we can enhance language models with the capability to perform real-time web searches and extract relevant information.
In this project, we'll build an answer engine that combines the advanced capabilities of the Llama 3.1 405B model with web scraping. By integrating web search, screenshot capture, and optical character recognition (OCR), we'll enable our model to provide up-to-date, accurate answers, vastly enhancing its utility and performance. This approach leverages the broad reach of search engines to access diverse information, overcoming the limitations of specialized tools and APIs, and ensuring our model can deliver timely and relevant responses.



Google Search: The ultimate tool to enhance Llama 3.1 405B

APIs often have limitations that restrict their use to specific datasets or types of information. For example, APIs provided by social media platforms or specialized databases can only access and retrieve data from their respective sources. These tools are valuable but inherently limited in their scope, unable to provide a broad spectrum of information across different domains.
In contrast, search engines like Google act as universal tools that can access a vast array of information from across the internet. Search engines are not confined to specific datasets or platforms; they index and retrieve information from millions of websites, covering an extensive range of topics and sources. This universality makes search engines exceptionally powerful for gathering up-to-date and comprehensive information.
In this project, we'll leverage Google Search in our answer engine to retrieve information that is not available in the Llama 3.1 405B model weights.

Environment setup

First, ensure you have Python installed and set up a virtual environment. If you haven't done this, there's a quick tutorial here to get you going.
Once that's done, install the required libraries using pip, then download the Chromium browser that Playwright uses for taking screenshots:
pip install pytesseract Pillow googlesearch-python playwright nest-asyncio requests flask weave
playwright install chromium
Next, install Tesseract OCR. For your specific operating system, download the installer from the official Tesseract docs.
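If pytesseract can't locate the Tesseract binary after installation (common on Windows), you can point it at the executable explicitly. Here's a minimal sketch, assuming the default Windows install path (adjust for your system):
import pytesseract

# Only needed if tesseract is not on your PATH; the path below is the
# default Windows install location and is just an example.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Quick sanity check: prints the installed Tesseract version
print(pytesseract.get_tesseract_version())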

Setting up Google Cloud AI Platform

To use the Google Cloud AI Platform, you will need to create a Google Cloud project. Go to the Google Cloud Console and create a new project. Once your project is created, take note of the project ID, as it will be required for authentication and API requests.
You will also need to set up authentication. Install the gcloud CLI if you haven't already, then authenticate with your Google account using the following command:
gcloud auth login
Next, set your project ID in the gcloud configuration:
gcloud config set project YOUR_PROJECT_ID
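You can verify that authentication works by printing an access token; the same command is used programmatically later in this tutorial:
gcloud auth print-access-token
If this prints a long token string, your gcloud setup is ready.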

Using the Llama 3.1 405B API Service

To try the Llama 3.1 API service from the command line interface (CLI), open Cloud Shell or a local terminal window with the gcloud CLI installed. Then configure the necessary variables by modifying the following Python script, replacing the PROJECT_ID placeholder with the ID of your Google Cloud project:
import requests
import json
import subprocess

# Set environment variables
ENDPOINT = "us-central1-aiplatform.googleapis.com"
REGION = "us-central1"
PROJECT_ID = "your google cloud project id"

# Get the access token using gcloud
access_token = subprocess.check_output("gcloud auth print-access-token", shell=True).decode('utf-8').strip()

# Set the headers for the request
headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

# Set the data payload for the request
data = {
    "model": "meta/llama3-405b-instruct-maas",
    "stream": True,
    "messages": [
        {
            "role": "user",
            "content": "Summer travel plan to Paris"
        }
    ]
}

# Define the endpoint URL
url = f"https://{ENDPOINT}/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/chat/completions"

# Make the request and stream the response
response = requests.post(url, headers=headers, json=data, stream=True)

print("Response from the AI Platform:")

# Process the response in real-time
if response.status_code == 200:
    for line in response.iter_lines():
        if line:
            try:
                data = line.decode('utf-8').strip()
                if data.startswith("data: "):
                    data_json = json.loads(data[6:])
                    if "choices" in data_json and len(data_json["choices"]) > 0:
                        delta_content = data_json["choices"][0]["delta"].get("content", "")
                        print(delta_content, end='', flush=True)
            except json.JSONDecodeError:
                continue
else:
    print(f"Error: {response.status_code}")
    print(response.text)
Note on pricing and availability
Currently (as of August 6, 2024), the Llama 3.1 405B API service is offered at no cost during its public preview phase. However, this may change in the future, and the service will likely be priced on a dollar-per-1M-tokens basis at general availability (GA). We recommend regularly checking the Llama 3.1 documentation for the latest information on pricing and usage terms to ensure compliance and cost management.
By following these steps, you can set up and use the Llama 3.1 405B API on the Google Cloud AI Platform, taking advantage of its powerful capabilities for real-time information retrieval and natural language processing.

Building our answer engine with Google and Llama 3.1 405B

To enable our answer engine to retrieve real-time information from the web, we will implement a Search class. This class will handle web searches, capture screenshots of web pages, and manage the concurrent processing of multiple URLs. By doing so, we will extract the necessary information to provide up-to-date answers.
Web scraping can be challenging due to the varying structures and formatting of web pages. Traditional scraping methods, which rely on parsing HTML and extracting data, often encounter issues with inconsistent page layouts and dynamic content. To overcome these challenges, we use image-based scraping.
By converting web pages into images and applying optical character recognition (OCR), we leverage the strengths of language models in interpreting text that isn't perfectly ordered or structured. There's even research suggesting that LLMs are remarkably good at recovering meaning from heavily scrambled text, which makes this approach particularly effective. Here is the Search class that will allow us to search the web:

class Search:
    @staticmethod
    def get_search_results(query, num_results=5):
        return [url for url in search(query, num_results=num_results)]

    @staticmethod
    async def download_screenshot(url, delay, index):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            context = await browser.new_context()
            page = await context.new_page()
            file_name = f'{videos_folder}/Screenshot_{index}.png'
            try:
                await asyncio.wait_for(page.goto(url), timeout=5)
                await page.set_viewport_size({"width": 1920, "height": 1080})
                await page.wait_for_timeout(delay * 1000)
                await page.screenshot(path=file_name, full_page=True)
                print(f"Screenshot saved as {file_name}!")
            except (PlaywrightTimeoutError, asyncio.TimeoutError):
                print(f"Timeout occurred while loading {url}")
                file_name = None
            except Exception as e:
                print(f"Unexpected error occurred: {e}")
                file_name = None
            finally:
                await browser.close()
            return file_name

    @staticmethod
    def process_urls(urls, delay):
        if os.path.exists(videos_folder):
            for file in os.listdir(videos_folder):
                file_path = os.path.join(videos_folder, file)
                if os.path.isfile(file_path) or os.path.islink(file_path):
                    os.unlink(file_path)
                elif os.path.isdir(file_path):
                    os.rmdir(file_path)

        async def _process_urls():
            tasks = [Search.download_screenshot(url, delay, index) for index, url in enumerate(urls)]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            return results

        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        results = loop.run_until_complete(_process_urls())
        return results

    @staticmethod
    def perform_ocr(image_path):
        if image_path is None:
            return None
        img = Image.open(image_path)
        tesseract_text = pytesseract.image_to_string(img)
        print(f"Tesseract OCR text for {image_path}:")
        print(tesseract_text)
        return tesseract_text

    @staticmethod
    def ocr_results_from_screenshots(screenshots):
        ocr_results = []
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(Search.perform_ocr, screenshot) for screenshot in screenshots]
            for future in concurrent.futures.as_completed(futures):
                try:
                    result = future.result()
                    ocr_results.append(result)
                except Exception as e:
                    print(f"An error occurred during OCR processing: {e}")
        return ocr_results

    @staticmethod
    def get_context_from_ocr_results():
        screenshots = [os.path.join(videos_folder, f) for f in os.listdir(videos_folder) if os.path.isfile(os.path.join(videos_folder, f))]

        if not screenshots:
            print("No valid screenshots to process.")
            return None

        # Perform OCR on downloaded screenshots and prepare the context
        ocr_results = Search.ocr_results_from_screenshots(screenshots)
        ocr_results = [val[:1000] for val in ocr_results if isinstance(val, str)]
        context = " ".join(ocr_results)[:3000]
        return context

    @staticmethod
    def decide_search(query):
        # Instantiate the model to decide if a web search is needed
        model = Model(endpoint=API_ENDPOINT, region=REGION, project_id=PROJECT_ID)
        res = model.query_model_for_search_decision(query)
        return res
The Search class includes several key methods:
The get_search_results method uses the Google Search library to retrieve URLs based on a given query, allowing you to specify the number of search results to return.
The download_screenshot method is asynchronous and takes a URL, a delay, and an index (for creating a unique file name) as parameters. Using Playwright, it opens the URL in a headless browser, waits for the specified delay, and captures a full-page screenshot. The screenshots are saved in the download folder with filenames that include the index.
The process_urls method manages the processing of multiple URLs concurrently. It clears the download folder before starting the process and uses asyncio to run the download_screenshot method for each URL in parallel. The results of the screenshots are returned as a list. To ensure reliability, the download_screenshot method includes error handling for common exceptions such as PlaywrightTimeoutError and asyncio.TimeoutError. If an error occurs, the method prints an error message and returns None.
The perform_ocr method uses Tesseract to extract text from an image. The ocr_results_from_screenshots method performs OCR on multiple screenshots concurrently using a thread pool executor, which increases the speed of the OCR process. The get_context_from_ocr_results method combines the text extracted from all screenshots into a single context, which is used by the language model to generate accurate answers.
Collectively, these methods provide a robust solution for web scraping and capturing web page screenshots, which is essential for the subsequent steps in our answer engine pipeline. By leveraging image-based scraping and chunking for OCR, we ensure that our engine can reliably extract information from a wide variety of web pages, regardless of their structure or content dynamics. This approach takes full advantage of LLMs' ability to make sense of complex and unordered text, enhancing the accuracy and effectiveness of our answer engine.
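Tying these pieces together, here's a minimal usage sketch of how the methods chain for a single query (the Flask app in the following sections drives the same flow). It assumes the Search class and its configuration from logic.py are already in scope:
# Minimal usage sketch (assumes the Search class and configuration above are defined)
query = "current weather in Houston"

urls = Search.get_search_results(query, num_results=5)   # 1. collect candidate URLs
Search.process_urls(urls, delay=1)                       # 2. screenshot each page
context = Search.get_context_from_ocr_results()          # 3. OCR the screenshots and build the context string

print(context if context else "No context extracted")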

Integrating Llama 3.1 405B

To integrate the advanced capabilities of the Llama 3.1 405B model, we will develop a Model class. This class will manage authentication, send queries, and handle responses from the Google Cloud AI Platform.
Here is our model class:
class Model:
    def __init__(self, endpoint, region, project_id):
        self.endpoint = endpoint
        self.region = region
        self.project_id = project_id

    def get_access_token(self):
        return subprocess.check_output("gcloud auth print-access-token", shell=True).decode('utf-8').strip()

    @weave.op()
    def query_model_non_stream(self, query, context):
        if context != "":
            q = "Answer the question {}. You can use this as help: {}".format(query, context)
        else:
            q = query

        access_token = self.get_access_token()
        headers = {
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json"
        }
        data = {
            "model": "meta/llama3-405b-instruct-maas",
            "stream": False,
            "messages": [
                {
                    "role": "user",
                    "content": q
                }
            ]
        }
        url = f"https://{self.endpoint}/v1beta1/projects/{self.project_id}/locations/{self.region}/endpoints/openapi/chat/completions"
        response = requests.post(url, headers=headers, json=data)

        if response.status_code == 200:
            data = response.json()
            if "choices" in data and len(data["choices"]) > 0:
                res = data["choices"][0]["message"]["content"]
                return res
        else:
            print(f"Error: {response.status_code}")
            print(response.text)
        return ""

    @weave.op()
    def query_model_for_search_decision(self, query):
        access_token = self.get_access_token()
        headers = {
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json"
        }
        data = {
            "model": "meta/llama3-405b-instruct-maas",
            "stream": False,
            "messages": [
                {
                    "role": "user",
                    "content": f"Do we need a web search to answer the question: {query}? Usually, questions asking about time-related details or new information that might not be in your initial training data will require a web search. Information that could be subject to change is also good to double-check with a search. Respond with 'yes' or 'no'."
                }
            ]
        }
        url = f"https://{self.endpoint}/v1beta1/projects/{self.project_id}/locations/{self.region}/endpoints/openapi/chat/completions"
        response = requests.post(url, headers=headers, json=data)

        if response.status_code == 200:
            data = response.json()
            if "choices" in data and len(data["choices"]) > 0:
                decision = data["choices"][0]["message"]["content"].strip().lower()
                return 'yes' in decision
        else:
            print(f"Error: {response.status_code}")
            print(response.text)
        return False


The Model class begins with an __init__ method, which initializes the necessary parameters, including the endpoint, region, and project ID. The get_access_token method uses the Google Cloud SDK to generate an access token, enabling authentication for API requests.
The query_model_non_stream method handles non-streaming responses. It sends a POST request to the AI Platform endpoint and processes the JSON response to extract the model's output.
The query_model_for_search_decision method determines if a web search is necessary for a given query. It sends a formatted query to the Llama model, asking if a web search is needed, and processes the response to make a decision. This helps optimize the answer engine by deciding when additional information from the web is required.
Note: To further improve the system, you could also ask the model to re-word the original query so it's more suitable for a Google search. Due to API rate limiting, I omitted this from the code, but a sketch of what such a method could look like follows below.
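The following is a hedged sketch only, not code from the repo: query_model_rewrite_for_search is a hypothetical method you could add to the Model class, reusing the same request pattern as the other methods to turn the user's question into a keyword-style search query.
    @weave.op()
    def query_model_rewrite_for_search(self, query):
        # Hypothetical helper (not part of the repo): asks the model to turn the
        # user's question into a short, keyword-style Google search query.
        access_token = self.get_access_token()
        headers = {
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json"
        }
        data = {
            "model": "meta/llama3-405b-instruct-maas",
            "stream": False,
            "messages": [
                {
                    "role": "user",
                    "content": f"Rewrite the following question as a short Google search query. Reply with the query only: {query}"
                }
            ]
        }
        url = f"https://{self.endpoint}/v1beta1/projects/{self.project_id}/locations/{self.region}/endpoints/openapi/chat/completions"
        response = requests.post(url, headers=headers, json=data)
        if response.status_code == 200:
            choices = response.json().get("choices", [])
            if choices:
                return choices[0]["message"]["content"].strip()
        # Fall back to the original question if the rewrite fails
        return query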

Tracking the LLM inputs and outputs

We use Weave to track the inputs and outputs of these methods. By annotating the methods with @weave.op(), we enable Weave to log the data passing through these functions. This includes the queries sent to the model, the context provided, the decisions made about web searches, and the final responses generated by the model.
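In isolation, the pattern is just a project init plus a decorator. Here's a minimal, self-contained sketch with a toy function (not part of the repo) to show what gets logged:
import weave

weave.init("answer_engine")  # traces are logged to this Weave project

@weave.op()
def answer(question: str) -> str:
    # Every call to this function is now traced in Weave: inputs, output, and latency.
    return f"You asked: {question}"

answer("What is the capital of France?")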
If you'd like to see the full logic.py file, which includes necessary imports and a few additional pieces of logic for running the app, I'll share it here:
import os
import pytesseract
from PIL import Image
from googlesearch import search
import asyncio
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
from concurrent.futures import ThreadPoolExecutor
import threading
import nest_asyncio
import requests
import json
import subprocess
import concurrent
import weave

# Project configuration
PROJECT_ID = "your google cloud poject id"
API_ENDPOINT = "us-central1-aiplatform.googleapis.com"
REGION = "us-central1"

weave.init("answer_engine")

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

# Set default download folder for screenshots
videos_folder = r"./download"

# Clear the download folder
if os.path.exists(videos_folder):
    for file in os.listdir(videos_folder):
        file_path = os.path.join(videos_folder, file)
        if os.path.isfile(file_path) or os.path.islink(file_path):
            os.unlink(file_path)
else:
    os.makedirs(videos_folder)

# Global stop event
stop_flag = threading.Event()


class Search:
    @staticmethod
    def get_search_results(query, num_results=5):
        return [url for url in search(query, num_results=num_results)]

    @staticmethod
    async def download_screenshot(url, delay, index):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            context = await browser.new_context()
            page = await context.new_page()
            file_name = f'{videos_folder}/Screenshot_{index}.png'
            try:
                await asyncio.wait_for(page.goto(url), timeout=5)
                await page.set_viewport_size({"width": 1920, "height": 1080})
                await page.wait_for_timeout(delay * 1000)
                await page.screenshot(path=file_name, full_page=True)
                print(f"Screenshot saved as {file_name}!")
            except (PlaywrightTimeoutError, asyncio.TimeoutError):
                print(f"Timeout occurred while loading {url}")
                file_name = None
            except Exception as e:
                print(f"Unexpected error occurred: {e}")
                file_name = None
            finally:
                await browser.close()
            return file_name

    @staticmethod
    def process_urls(urls, delay):
        if os.path.exists(videos_folder):
            for file in os.listdir(videos_folder):
                file_path = os.path.join(videos_folder, file)
                if os.path.isfile(file_path) or os.path.islink(file_path):
                    os.unlink(file_path)
                elif os.path.isdir(file_path):
                    os.rmdir(file_path)

        async def _process_urls():
            tasks = [Search.download_screenshot(url, delay, index) for index, url in enumerate(urls)]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            return results

        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        results = loop.run_until_complete(_process_urls())
        return results

    @staticmethod
    def perform_ocr(image_path):
        if image_path is None:
            return None
        img = Image.open(image_path)
        tesseract_text = pytesseract.image_to_string(img)
        print(f"Tesseract OCR text for {image_path}:")
        print(tesseract_text)
        return tesseract_text

    @staticmethod
    def ocr_results_from_screenshots(screenshots):
        ocr_results = []
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(Search.perform_ocr, screenshot) for screenshot in screenshots]
            for future in concurrent.futures.as_completed(futures):
                try:
                    result = future.result()
                    ocr_results.append(result)
                except Exception as e:
                    print(f"An error occurred during OCR processing: {e}")
        return ocr_results

    @staticmethod
    def get_context_from_ocr_results():
        screenshots = [os.path.join(videos_folder, f) for f in os.listdir(videos_folder) if os.path.isfile(os.path.join(videos_folder, f))]

        if not screenshots:
            print("No valid screenshots to process.")
            return None

        # Perform OCR on downloaded screenshots and prepare the context
        ocr_results = Search.ocr_results_from_screenshots(screenshots)
        ocr_results = [val[:1000] for val in ocr_results if isinstance(val, str)]
        context = " ".join(ocr_results)[:3000]
        return context

    @staticmethod
    def decide_search(query):
        # Instantiate the model to decide if a web search is needed
        model = Model(endpoint=API_ENDPOINT, region=REGION, project_id=PROJECT_ID)
        res = model.query_model_for_search_decision(query)
        return res


class Model:
    def __init__(self, endpoint, region, project_id):
        self.endpoint = endpoint
        self.region = region
        self.project_id = project_id

    def get_access_token(self):
        return subprocess.check_output("gcloud auth print-access-token", shell=True).decode('utf-8').strip()

    @weave.op()
    def query_model_non_stream(self, query, context):
        if context != "":
            q = "Answer the question {}. You can use this as help: {}".format(query, context)
        else:
            q = query

        access_token = self.get_access_token()
        headers = {
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json"
        }
        data = {
            "model": "meta/llama3-405b-instruct-maas",
            "stream": False,
            "messages": [
                {
                    "role": "user",
                    "content": q
                }
            ]
        }
        url = f"https://{self.endpoint}/v1beta1/projects/{self.project_id}/locations/{self.region}/endpoints/openapi/chat/completions"
        response = requests.post(url, headers=headers, json=data)

        if response.status_code == 200:
            data = response.json()
            if "choices" in data and len(data["choices"]) > 0:
                res = data["choices"][0]["message"]["content"]
                return res
        else:
            print(f"Error: {response.status_code}")
            print(response.text)
        return ""

    @weave.op()
    def query_model_for_search_decision(self, query):
        access_token = self.get_access_token()
        headers = {
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json"
        }
        data = {
            "model": "meta/llama3-405b-instruct-maas",
            "stream": False,
            "messages": [
                {
                    "role": "user",
                    "content": f"Do we need a web search to answer the question: {query}? Usually, questions asking about time-related details or new information that might not be in your initial training data will require a web search. Information that could be subject to change is also good to double-check with a search. Respond with 'yes' or 'no'."
                }
            ]
        }
        url = f"https://{self.endpoint}/v1beta1/projects/{self.project_id}/locations/{self.region}/endpoints/openapi/chat/completions"
        response = requests.post(url, headers=headers, json=data)

        if response.status_code == 200:
            data = response.json()
            if "choices" in data and len(data["choices"]) > 0:
                decision = data["choices"][0]["message"]["content"].strip().lower()
                return 'yes' in decision
        else:
            print(f"Error: {response.status_code}")
            print(response.text)
        return False

Developing the web interface

To provide a user-friendly way to interact with our answer engine, we developed a Flask application. Flask is a lightweight web framework that allows us to create web applications quickly and efficiently.
Our Flask app serves as the interface between the user and the backend processes of our answer engine. The application consists of a few key components: routes to handle requests, logic to process queries, and endpoints to return results. While we won't delve into the details of the front end for this tutorial, it's important to understand the basic structure and functionality of our Flask app.
Here is the app.py file, which defines the routes and functions that run the logic behind the app:

import os
import threading
import nest_asyncio
import asyncio
from flask import Flask, request, render_template, jsonify
from logic import Search, Model

# Project configuration
PROJECT_ID = "your google cloud poject id"
API_ENDPOINT = "us-central1-aiplatform.googleapis.com"
REGION = "us-central1"

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

# Set default download folder for screenshots
videos_folder = r"./download"

# Clear the download folder
if os.path.exists(videos_folder):
    for file in os.listdir(videos_folder):
        file_path = os.path.join(videos_folder, file)
        if os.path.isfile(file_path) or os.path.islink(file_path):
            os.unlink(file_path)
else:
    os.makedirs(videos_folder)

# Global stop event
stop_flag = threading.Event()

# Global variable for response storage
response_storage = ""

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')


@app.route('/search', methods=['POST'])
def search():
    global response_storage
    query = request.form.get('query')
    delay = 1

    # Clear the stop flag before running the function
    stop_flag.clear()

    asyncio.run(run_search_and_ocr(query, delay))
    return jsonify({'status': 'Search started'})

async def run_search_and_ocr(query, delay):
    global response_storage
    context = ""
    if Search.decide_search(query):
        urls = Search.get_search_results(query, num_results=20)
        process_thread = threading.Thread(target=Search.process_urls, args=(urls, delay))
        process_thread.start()
        await asyncio.sleep(15)
        stop_flag.set()
        if process_thread.is_alive():
            process_thread.join(timeout=0)

        # Fall back to an empty context if OCR produced nothing
        context = Search.get_context_from_ocr_results() or ""

    model = Model(endpoint=API_ENDPOINT, region=REGION, project_id=PROJECT_ID)
    response = model.query_model_non_stream(query, context)  # Non-streaming call that returns the full answer
    response_storage = response


@app.route('/results', methods=['GET'])
def get_results():
    global response_storage
    return jsonify({'results': response_storage.splitlines()})


if __name__ == "__main__":
    app.run(debug=True)

First, we initialize the Flask app and configure the necessary settings. We also ensure that the default download folder for screenshots is set up and cleared before starting the application. This ensures that our environment is prepared for processing new queries.
We define the main route (`/`) to render the front-end interface. This interface allows users to input their queries and initiate the search process. The `search` route, accessed via a POST request, handles the user's query. It receives the query, clears any previous stop flags, and runs the search and OCR process asynchronously. This is achieved by calling the run_search_and_ocr function, which orchestrates the flow from searching to OCR and context preparation.
The run_search_and_ocr function checks if a web search is necessary using the Search.decide_search method. If a search is required, it retrieves URLs using the get_search_results method and processes these URLs to capture screenshots. After capturing the screenshots, it performs OCR on the images to extract text and prepare the context. Finally, it queries the Llama 3.1 405B model using the query_model_non_stream method and stores the response.
The results route, accessed via a GET request, returns the processed results to the user. This allows users to view the answers generated by the Llama 3.1 405B model based on their queries.
By developing this Flask application, we provide a seamless and interactive way for users to leverage the capabilities of our answer engine. The Flask app manages the flow of data, handles asynchronous tasks, and ensures that users receive accurate and up-to-date answers. While the front end is essential for user interaction, the focus of this tutorial remains on the backend processes that power the answer engine.
If you'd like to try out the app for yourself, feel free to clone the repo here. To run the code, add your PROJECT_ID in the logic.py and app.py files, then run python app.py. Note that you will need to install all the necessary pip packages to run the app. Once it's running, the app will print the URL for accessing the website to the console, as shown below:

Now, simply navigate to http://127.0.0.1:5000/ and you will see the app.
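You can also exercise the two endpoints directly from Python. Here's a small sketch, assuming the app is running locally on the default port (note that in this implementation the /search request blocks until the answer is ready):
import requests

BASE = "http://127.0.0.1:5000"

# Kick off a query; this POST returns once run_search_and_ocr has finished.
requests.post(f"{BASE}/search", data={"query": "What is the weather in Houston?"})

# Retrieve the stored answer from the /results route.
results = requests.get(f"{BASE}/results").json()["results"]
print("\n".join(results))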

Using Weave to track model responses and search decisions

Weave is a powerful tool that allows us to track and monitor the performance of our models, including the responses generated and the decisions made regarding whether a web search is necessary. By integrating Weave into our answer engine, we gain valuable insights into how our models are functioning and can make informed improvements.
We use Weave to log the responses generated by the Llama 3.1 405B model. This includes the final answers provided to user queries as well as intermediate outputs, such as the results of OCR processing. By tracking these responses, we can analyze the quality and accuracy of the answers, identify any patterns or issues, and make adjustments to improve performance.
In one case, the query "what is the weather in Houston?" resulted in a detailed response generated by the model. This response includes weather information, forecast details, and additional recommendations for the user. By logging this response in Weave, we can review it later to ensure it meets our quality standards and make any necessary adjustments.

The app was able to find a webpage showing the weather forecast for Houston, as shown below:

The text from this website (and a few other websites that the Search class retrieved) was extracted and passed to the app. Inside Weave, we can analyze the exact inputs to our model both for the "search decision" step (where the model decides whether to search) and for the answering step (where the model answers the question, with or without the search results).
Here are some of our logs from Weave:

As shown above, our model found it appropriate to search the web for our query. Next, context from the search results was gathered and passed to the model. Here's the next log inside Weave:


Context is everything

Enhancing language models with the ability to perform real-time web searches marks a significant advancement in bridging the gap between static data training and dynamic, up-to-date information retrieval. By integrating the advanced capabilities of the Llama 3.1 405B model with sophisticated web scraping techniques, we have constructed an answer engine that is not only accurate but also highly responsive to current data.
The implementation of web scraping showcases the power of combining traditional methods with modern innovations such as Optical Character Recognition (OCR). By converting web pages into images and leveraging the strengths of language models in interpreting complex and unordered text, we ensure that our answer engine can effectively extract and process information from a diverse array of web pages.
The real game-changer in our system is the integration of Weave. This powerful tool enables us to track and monitor every aspect of our model's performance. By logging the inputs and outputs of our processes, Weave provides invaluable insights into the decision-making processes of our model. This allows us to analyze the quality and accuracy of the responses generated, ensuring that our answer engine consistently meets high standards of performance.
Weave also plays a crucial role in understanding how often the model decides a web search is necessary and the effectiveness of these decisions. This detailed tracking aids in debugging, performance optimization, and making informed improvements. It brings a level of transparency and accountability to our system that is essential for maintaining trust and reliability.
By leveraging these advanced tools and methodologies, we have created a powerful and dynamic solution for real-time information retrieval. This integration ensures that our answer engine remains at the forefront of technology, providing accurate and timely information to users in an ever-changing world.
Here is the repo for the project.