Building the world's fastest chatbot with the Cerebras Cloud API and W&B Weave
A guide to getting started using the Cerebras Cloud API with W&B Weave.
Created on September 4 | Last edited on September 4
Cerebras has launched Cerebras Inference, the fastest AI inference solution in the world, delivering 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B—up to 20 times faster than NVIDIA GPU-based clouds. Powered by the third-generation Wafer-Scale Engine (WSE-3), Cerebras Inference offers unprecedented speed and cost efficiency, operating at just 10 cents per million tokens for Llama 3.1 8B and 60 cents per million tokens for Llama 3.1 70B.
In this comprehensive guide, we'll explore the groundbreaking features of the CS-3 system, provide a detailed tutorial for setting up Cerebras Inference, and demonstrate how to create one of the fastest chatbots using the Llama 3.1 70B models integrated with W&B Weave.

Table of contents
What sets Cerebras Cloud apart?
Sparsity acceleration
Distributed memory architecture
Weight streaming
Scalable fabric
The importance of latency in generative AI applications
The Cerebras Cloud API: Using Weave for performance monitoring
Building a faster chatbot
Using W&B Weave
Managing context
Core logic
The Future of Inference
What sets Cerebras Cloud apart?
The CS-3 system, now available via the Cerebras Cloud, enables developers to harness wafer-scale computing for AI inference through a simple API. Unlike competing solutions that often trade accuracy for speed, the CS-3 maintains state-of-the-art precision, processing up to 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B models, all while staying within the 16-bit domain throughout the inference process.
So you may be wondering: what's different about these new chips from Cerebras, and how can they be so much faster than traditional GPUs? Let's look at how this is done.
Sparsity acceleration
The Cerebras architecture excels at accelerating unstructured sparsity, a phenomenon in neural networks where many of the parameters (weights) are zero or nearly zero, meaning they have little to no impact on the network's output. Cerebras efficiently skips over these insignificant computations, focusing only on the meaningful weights. This approach contrasts with traditional GPUs, which process all the weights, including zeros, leading to inefficiencies and longer processing times.
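To make this concrete, here's a toy NumPy sketch (purely illustrative, not Cerebras' actual kernels) showing how a sparsity-aware kernel only has to touch the nonzero weights, while a dense kernel pays for every entry, zeros included:

import numpy as np

# Toy illustration: with unstructured sparsity, only the nonzero weights
# contribute to the output, so a sparse-aware kernel can skip the zeros entirely.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024))
W[rng.random(W.shape) < 0.7] = 0.0           # ~70% of the weights are zero
x = rng.normal(size=1024)

dense_macs = W.size                           # a dense kernel touches every weight
nz_rows, nz_cols = np.nonzero(W)
sparse_macs = nz_rows.size                    # a sparse kernel touches only the nonzeros

# Accumulate only the nonzero products into the output vector
y = np.zeros(W.shape[0])
np.add.at(y, nz_rows, W[nz_rows, nz_cols] * x[nz_cols])

assert np.allclose(y, W @ x)                  # same result, far fewer multiply-accumulates
print(f"multiply-accumulates: dense={dense_macs:,} vs sparse={sparse_macs:,}")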
Distributed memory architecture
The chip employs a distributed static random-access memory (SRAM) architecture, providing each core with local SRAM. This setup eliminates the bottlenecks associated with traditional shared memory, leading to higher memory bandwidth and reduced data movement latency. In other words, instead of all the cores competing for access to a single memory pool (like in a GPU), each core in the Cerebras chip has its own dedicated memory. This drastically reduces waiting times and allows for faster data processing, which is a significant advantage over the shared memory approach used in GPUs.
Weight streaming
Cerebras leverages a weight streaming technique allowing for efficient training and inference of large models. Weights are stored in an external memory unit called MemoryX and streamed to the system as needed, enabling large model processing without the limitations of on-chip memory capacity. This means that instead of storing all the model data directly on the chip, which could be restrictive, Cerebras fetches only the necessary data when required. This is different from GPUs, where on-chip memory can quickly become a bottleneck, limiting the size and complexity of models that can be processed.
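Here's a rough conceptual sketch of the idea (the store and layer sizes are made up, and this is not the real MemoryX interface): weights live in an external store and are streamed to the compute one layer at a time, so only a single layer's weights ever need to be resident during the forward pass.

import numpy as np

# Conceptual sketch only: an external store plays the role of MemoryX, and
# weights are streamed in one layer at a time instead of living on-chip.
rng = np.random.default_rng(0)
external_weight_store = [rng.normal(size=(512, 512)) * 0.05 for _ in range(8)]

def stream_weights(store):
    # Yield one layer's weights at a time, as if fetched over the interconnect.
    for layer_weights in store:
        yield layer_weights

def forward(x, store):
    activation = x
    for W in stream_weights(store):           # only one layer resident at a time
        activation = np.maximum(W @ activation, 0.0)   # simple ReLU layer
    return activation

out = forward(rng.normal(size=512), external_weight_store)
print(out.shape)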
Scalable fabric
Cerebras integrates a high-bandwidth, low-latency fabric connecting cores using a 2-D mesh topology. This allows the chip to function as a cohesive unit, with seamless scalability beyond a single chip through the SwarmX interconnect fabric. Essentially, this fabric acts like a super-fast network connecting all the cores together, allowing them to work in perfect harmony. Traditional GPUs use different methods to connect their cores, which can introduce delays and reduce efficiency, especially as the number of cores increases. Cerebras' approach ensures that even as the system scales up, it remains fast and efficient.
The importance of latency in generative AI applications
Latency is a critical factor in AI applications, particularly in scenarios where real-time processing and quick response times are important. In the context of AI-driven chatbots, recommendation systems, or any interactive application, reducing latency directly impacts the user experience by providing faster, more accurate responses, and allowing for more advanced logic when chaining multiple LLM calls together.
As we push the boundaries of speed with solutions like Cerebras, the reduced latency opens up new capabilities for AI applications.
For instance, faster model inference can lead to the development of more complex and interactive applications that would previously have been impractical due to slower processing times. This could enable AI to handle more sophisticated tasks, such as real-time language translation, advanced predictive analytics, or dynamic content generation, all with minimal delay.
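To see why per-call latency compounds, here's a back-of-the-envelope sketch with hypothetical step counts and throughput figures (only the 1,800 tokens-per-second number comes from Cerebras' Llama 3.1 8B claim; the other numbers are illustrative):

# Hypothetical numbers purely for illustration: when several LLM calls are
# chained sequentially (plan -> draft -> critique -> revise), the user waits
# for the sum of all calls, so per-call latency compounds.
steps = ["plan", "draft", "critique", "revise"]
tokens_per_step = 400

def total_wait_seconds(tokens_per_second: float) -> float:
    return len(steps) * tokens_per_step / tokens_per_second

for name, tps in [("illustrative GPU cloud", 90), ("Cerebras-class inference", 1800)]:
    print(f"{name}: ~{total_wait_seconds(tps):.1f}s for a {len(steps)}-step chain")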
Moreover, with low latency, there may be less need for traditional techniques like retrieval-augmented generation, which is often used to manage the trade-offs between model complexity and speed. RAG typically retrieves relevant information from external databases to complement the AI model's responses, compensating for the slower processing times of more complex models. However, with the speed offered by Cerebras, it becomes feasible to run larger models with extremely large context windows, potentially eliminating the need for such techniques.
The Cerebras Cloud API: Using Weave for performance monitoring
Before starting this tutorial, make sure you have an environment running Python >= 3.9 with the following pip packages:
pip install -U cerebras_cloud_sdk weave flask transformers
To effectively monitor and evaluate the performance of the Cerebras Cloud when running Llama 3.1 models, we'll use W&B Weave, our tool for logging and tracking metrics during model inference (among other things). By integrating Weave into your Cerebras Cloud workflow, you can easily monitor key performance indicators like tokens per second directly from your console, as well as inputs and outputs of the model.
Normally, to log to Weave, you add the @weave.op decorator to any function whose inputs and outputs you want to log. However, Weave is natively integrated with the Cerebras API, which means you simply need to call weave.init("your_project_name") and all calls to the API will be logged to Weave. Here's a link to the Weave Docs if you would like to learn more about Weave!
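Decorating your own helper functions is still useful when you want their inputs and outputs traced alongside the automatically logged API calls. Here's a minimal sketch (the project and function names are just examples):

import weave

weave.init("cerebras_llama31_performance")

# The Cerebras integration logs API calls automatically, but you can still
# trace your own helpers by decorating them with weave.op.
@weave.op()
def build_prompt(question: str) -> str:
    return f"Answer concisely: {question}"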
Here’s a code snippet which will allow you to run inference with the Cerebras Cloud API:
from cerebras.cloud.sdk import Cerebras
import weave

# Initialize Weave for logging
weave.init("cerebras_llama31_performance")

# Hardcoded API key for Cerebras Inference
CEREBRAS_API_KEY = "your api key"

# Initialize the Cerebras client with the API key
client = Cerebras(api_key=CEREBRAS_API_KEY)

# No Weave Op needed due to native integration
def run_inference(prompt: str):
    # Perform inference
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="llama3.1-8b",
    )

    # Extract the content of the response
    response_content = chat_completion.choices[0].message.content
    return {
        "response": response_content,
    }

# Example usage
prompt = "Why is fast inference important?"

# Parse the response and get results for the Cerebras Llama 3.1 8B model
results = run_inference(prompt)

# Example of using the returned results
print(f"Results: {results}")
In the code snippet provided, Weave is initialized with the project name cerebras_llama31_performance, allowing us to organize and store logs and metrics under a specific project.
The run_inference function is the core component of the code. It handles the entire inference workflow, beginning with executing the model inference using a user-provided prompt. The response from the Cerebras client is then parsed to extract the content and return the result. Since Weave is natively integrated with the Cerebras API, and Weave is initialized with a project name, all calls to the API are automatically logged inside Weave!
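If you also want a rough tokens-per-second number on the client side, you can time the call yourself. This sketch reuses the client from the snippet above and assumes the response exposes an OpenAI-style usage.completion_tokens field; wall-clock time includes network overhead, so it will understate the on-system rate:

import time

# Rough client-side throughput estimate (assumes an OpenAI-style usage field).
start = time.perf_counter()
chat_completion = client.chat.completions.create(
    messages=[{"role": "user", "content": "Why is fast inference important?"}],
    model="llama3.1-8b",
)
elapsed = time.perf_counter() - start

completion_tokens = chat_completion.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"(~{completion_tokens / elapsed:.0f} tokens/s including network latency)")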
After running the inference script, you will see the following logs inside the Weave trace view in Weights & Biases:

Building a faster chatbot
In this section, we're focusing on creating a high-performance chatbot using the Cerebras Cloud platform. The goal is to build a chatbot that not only responds quickly but also maintains the conversational context effectively, while also being simple in its implementation (so you can easily add the features you want).
The entire application is contained within a single script, making it easy to run, deploy, and experiment with the API. Here's a link to the code if you would like to view it on GitHub.
The script contains inline HTML, and even though this is a bit "hacky", it serves as a good illustration of the strength of the Cerebras Cloud API and W&B Weave.
import os
from flask import Flask, request, render_template_string
import threading
import sys
from transformers import AutoTokenizer
from cerebras.cloud.sdk import Cerebras
import weave

# Initialize Weave for logging
weave.init("cerebras_llama31_performance")

# Flask app initialization
app = Flask(__name__)

# Load API key from file if it exists
API_KEY_FILE = "cerebras_api_key.txt"
CEREBRAS_API_KEY = None

def load_api_key():
    global CEREBRAS_API_KEY
    if os.path.exists(API_KEY_FILE):
        with open(API_KEY_FILE, "r") as file:
            CEREBRAS_API_KEY = file.read().strip()
        if CEREBRAS_API_KEY:
            initialize_cerebras_client()

def save_api_key(api_key):
    global CEREBRAS_API_KEY
    CEREBRAS_API_KEY = api_key
    with open(API_KEY_FILE, "w") as file:
        file.write(api_key)
    initialize_cerebras_client()

def initialize_cerebras_client():
    global client
    client = Cerebras(api_key=CEREBRAS_API_KEY)

# Load the API key at startup
load_api_key()

# Check for --test and --port in the command line arguments
is_test = '--test' in sys.argv
port_index = sys.argv.index('--port') if '--port' in sys.argv else None
port = int(sys.argv[port_index + 1]) if port_index else 5001

# Initialize tokenizer from Hugging Face model
tokenizer = AutoTokenizer.from_pretrained("akjindal53244/Llama-3.1-Storm-8B")

# Determine if token-based or character-based context length is used
use_token_context = '--token_contextlen' in sys.argv
char_context_len = int(sys.argv[sys.argv.index('--char_contextlen') + 1]) if '--char_contextlen' in sys.argv else 8000
token_context_len = int(sys.argv[sys.argv.index('--token_contextlen') + 1]) if '--token_contextlen' in sys.argv else 8000

# Cache to hold previous chat messages and responses
chat_cache = []

def manage_cache(prompt):
    global chat_cache
    if use_token_context:
        # Tokenize the full conversation
        all_tokens = tokenizer.encode("".join(chat_cache + [prompt]), add_special_tokens=False)
        # Keep only the last token_context_len tokens
        if len(all_tokens) > token_context_len:
            all_tokens = all_tokens[-token_context_len:]
        # Decode back to text for processing
        chat_cache = [tokenizer.decode(all_tokens, skip_special_tokens=True)]
        print(len(chat_cache))
    else:
        # Character-based context management
        chat_cache.append(prompt)
        if len("".join(chat_cache)) > char_context_len:
            # Trim chat cache to fit within the character context length
            while len("".join(chat_cache)) > char_context_len:
                chat_cache.pop(0)

@app.route('/chat')
def chat():
    return render_template_string(html_content)

@app.route('/clear_chat', methods=['POST'])
def clear_chat():
    global chat_cache
    chat_cache.clear()
    return 'Cleared'

@app.route('/save_api_key', methods=['POST'])
def save_api_key_route():
    data = request.json
    api_key = data.get("api_key")
    if api_key:
        save_api_key(api_key)
    return "API Key Saved"

def perform_inference(prompt):
    if not CEREBRAS_API_KEY:
        return "API Key not set. Please enter your Cerebras API Key."
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="llama3.1-70b",
    )
    response_content = chat_completion.choices[0].message.content
    return response_content

@app.route('/send_message', methods=['POST'])
def send_message():
    global chat_cache
    data = request.json
    new_prompt = f"User: {data['prompt']}\n"

    # Manage context based on token or character length
    manage_cache(new_prompt)
    full_prompt = ''.join(chat_cache)

    try:
        response_content = perform_inference(full_prompt)
        if "```" in response_content:
            # Wrap fenced code blocks from the model in <pre><code> tags
            parts = response_content.split("```")
            for i in range(1, len(parts), 2):
                code_content = parts[i].strip().split('\n', 1)
                if len(code_content) > 1 and code_content[0].strip() in [
                    "python", "cpp", "javascript", "java", "html", "css", "bash",
                    "csharp", "go", "ruby", "php", "swift", "r", "typescript", "kotlin", "dart"
                ]:
                    parts[i] = "<pre><code>" + code_content[1].strip() + "</code></pre>"
                else:
                    parts[i] = "<pre><code>" + parts[i].strip() + "</code></pre>"
            response_content = "".join(parts)
        else:
            response_content = f'<div class="bot-message">{response_content}</div>'
        api_response = f"Bot: {response_content}\n"
        chat_cache.append(api_response)
        return api_response
    except Exception as e:
        print(f"Exception caught: {e}")
        return str(e)

html_content = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Chat with API</title>
<style>
body, html { height: 100%; margin: 0; font-family: Arial, sans-serif; display: flex; flex-direction: column; background-color: #f8f9fa; }
#chat-container { flex-grow: 1; overflow-y: auto; padding: 20px; box-sizing: border-box; margin-bottom: 70px; /* Leave space for the input field */ }
#input-container { position: fixed; bottom: 0; width: 100%; background-color: #f8f9fa; padding: 10px 0; box-shadow: 0 -2px 5px rgba(0, 0, 0, 0.1); }
#userInput { width: calc(100% - 40px); margin: 0 20px; padding: 10px; border: 1px solid #ccc; border-radius: 5px; box-sizing: border-box; font-size: 16px; height: 50px; /* Fixed height */ background-color: #fff; resize: none; }
#apiKeyInput { width: 200px; margin-right: 10px; padding: 10px; border: 1px solid #ccc; border-radius: 5px; box-sizing: border-box; font-size: 16px; height: 40px; background-color: #fff; }
#loader { position: fixed; bottom: 70px; width: 100%; text-align: center; display: none; }
#chat { padding: 10px; word-wrap: break-word; }
pre { background-color: #f4f4f4; border: 1px solid #ccc; padding: 10px; border-radius: 5px; overflow-x: auto; }
code { font-family: Consolas, 'Courier New', Courier, monospace; color: #d63384; }
.bot-message { border: 1px solid #007BFF; background-color: #E9F7FF; border-radius: 5px; padding: 10px; margin: 10px 0; }
.user-message { border: 1px solid #333; background-color: #f0f0f0; border-radius: 5px; padding: 10px; margin: 10px 0; text-align: right; }
.control-panel { display: flex; align-items: center; padding: 0 20px; }
button { padding: 10px 20px; background-color: #007BFF; color: #fff; border: none; border-radius: 5px; cursor: pointer; }
button:hover { background-color: #0056b3; }
p { text-align: center; margin: 0; color: #333; }
</style>
</head>
<body>
<div id="chat-container"><div id="chat"></div></div>
<div id="input-container">
  <div class="control-panel">
    <input type="password" id="apiKeyInput" placeholder="Enter API Key" onkeydown="handleApiKeyInput(event)">
    <button onclick="clearChat()">Clear Chat</button>
  </div>
  <textarea id="userInput" placeholder="Hit shift+enter to send..." onkeydown="handleKeyDown(event)"></textarea>
</div>
<div id="loader">Loading...</div>
<p>To send your message, hit Shift+Enter.</p>
<script>
function handleKeyDown(event) {
  if (event.key === 'Enter' && event.shiftKey) {
    event.preventDefault();
    sendMessage();
  }
}
function handleApiKeyInput(event) {
  if (event.key === 'Enter') {
    event.preventDefault();
    saveApiKey();
  }
}
function saveApiKey() {
  const apiKeyInput = document.getElementById('apiKeyInput');
  const apiKey = apiKeyInput.value.trim();
  if (apiKey !== '') {
    fetch('/save_api_key', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ api_key: apiKey }),
    })
      .then(response => response.text())
      .then(data => {
        console.log('API Key saved:', data);
        apiKeyInput.value = ''; // Clear the input field
      })
      .catch(error => {
        console.error('Error saving API key:', error);
      });
  }
}
function sendMessage() {
  const inputField = document.getElementById('userInput');
  const message = inputField.value.trim();
  if (message === '') return;
  inputField.value = '';
  const chatContainer = document.getElementById('chat');
  chatContainer.innerHTML += `<div class="user-message">User: ${message}</div>`;
  const loader = document.getElementById('loader');
  loader.style.display = 'block';
  fetch('/send_message', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: message }),
  })
    .then(response => response.text())
    .then(data => {
      loader.style.display = 'none';
      chatContainer.innerHTML += `<div class="bot-message">${data}</div>`;
      autoScrollToBottom();
    })
    .catch(error => {
      loader.style.display = 'none';
      chatContainer.innerHTML += `<div class="bot-message">Error: ${error}</div>`;
      autoScrollToBottom();
    });
}
function clearChat() {
  const chatContainer = document.getElementById('chat');
  chatContainer.innerHTML = '';
  fetch('/clear_chat', { method: 'POST' });
}
function autoScrollToBottom() {
  const chatContainer = document.getElementById('chat-container');
  chatContainer.scrollTop = chatContainer.scrollHeight;
}
</script>
</body>
</html>
"""

if __name__ == "__main__":
    # Run Flask in a background thread on the requested port
    t = threading.Thread(target=app.run, kwargs={'host': '0.0.0.0', 'port': port})
    t.start()
    print(f"Visit http://127.0.0.1:{port}/chat to start chatting.")
To run the app, save the file as chat.py and use one of the following commands, depending on your preferred context-tracking method:
For character-based context (faster):
python chat.py --char_contextlen 8000
For token-based context (maximizes the 8k token window):
python chat.py --token_contextlen 8000
After running either command, you can visit the chatbot at http://127.0.0.1:5001/chat (or whichever port you specified with --port).
The script begins by importing the necessary libraries, including Flask for handling web requests, threading for running the Flask app, and Weave for logging and monitoring the chatbot's performance. Weave is initialized with a project name to organize and store logs, which is particularly useful for tracking performance metrics during chatbot interactions.
Using W&B Weave
With Weave's native integration with the Cerebras API, all inputs, outputs, and key metrics such as latency are automatically logged, simplifying the monitoring of your model's performance in production. This automatic logging provides real-time insights into your chatbot's behavior without the need for additional code changes.
By using Weave in conjunction with Cerebras Cloud, you gain a comprehensive view of your AI model's performance, enabling you to make informed decisions about choosing the right models for your application. Being able to easily view responses from your models lets you step into your users' shoes and see how the model is actually performing, while automatically logged metrics like latency give you hard numbers to monitor for your app.
Managing context
The script manages conversational context by storing previous chat messages and responses in a cache. By default, it uses character-based tracking, which is slightly faster and is set using the --char_contextlen argument. For those looking to fully utilize the 8k token context window of the Cerebras API for Llama 3.1, the --token_contextlen argument enables token-based tracking using a Hugging Face tokenizer. While character counting offers speed, the tokenizer allows you to efficiently maximize the context window, ensuring more comprehensive input handling during interactions.
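As a quick illustration of why the tokenizer matters, character counts only approximate token counts. This small check uses the same Hugging Face tokenizer the script loads (the sample sentence is arbitrary):

from transformers import AutoTokenizer

# Same tokenizer the chatbot script loads; characters only approximate tokens,
# which is why the --token_contextlen mode packs the 8k window more precisely.
tokenizer = AutoTokenizer.from_pretrained("akjindal53244/Llama-3.1-Storm-8B")

text = "Fast inference makes multi-step chatbots feel instantaneous."
n_chars = len(text)
n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
print(f"{n_chars} characters -> {n_tokens} tokens")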
Core logic
The core of the chatbot is the perform_inference function, which takes a prompt from the user, sends it to the Cerebras client for processing, and returns the response. The response is then processed to format any code snippets, ensuring that the chatbot can handle technical queries involving code examples. This response is also logged using Weave, allowing for real-time monitoring of the chatbot’s performance, including metrics like response time and the number of tokens processed.
Here's a screenshot of me using the chatbot to instantaneously create a new game:

And here’s a screenshot of me playing the game on my local system!

The Future of Inference
Cerebras is redefining the landscape of LLM inference with its unparalleled speed and efficiency. As the fastest inference solution currently available, Cerebras positions itself ahead of the competition by enabling developers to leverage models like Llama 3.1 with unmatched performance. This allows for complex, sequential reasoning tasks to be performed in real-time, making chatbots, virtual assistants, and other interactive applications faster and more responsive than ever.
However, the competitive landscape is rapidly evolving. Cerebras' advantage hinges on its ability to consistently deliver superior hardware and integrate new models as they emerge from leading developers like Meta. Competitors like Groq and NVIDIA are not sitting idle, and the challenge for Cerebras will be maintaining its speed and efficiency edge as new models and technologies arrive.
As the competition intensifies, Cerebras must continue to innovate to retain its leadership in high-speed AI inference. I hope you enjoyed this tutorial, and feel free to drop a comment if you have any questions!