Build smarter RAG systems with Redis + W&B Inference
In this article, we will build a retrieval-augmented generation system powered by Redis for fast vector search, caching, and context-aware AI responses.
Redis has long been known as the backbone of fast, in-memory data storage, powering caching, session management, and real-time analytics at scale. But today, Redis has stepped into a new role: a high-performance platform for building AI-native applications. With support for vector search, semantic caching, and conversational memory, Redis is no longer just about speed; it’s about intelligence.
In this article, we’ll explore how Redis can serve as the foundation for retrieval-augmented generation (RAG) systems. You’ll see how the Redis Vector Library (RedisVL) makes it simple to store and query embeddings, how W&B Weave adds observability into retrieval and reasoning, and how advanced models like Qwen3’s Thinking variant gain real accuracy and context-awareness when backed by Redis. The result is a pipeline that combines blazing-fast search, intelligent retrieval, and transparent monitoring: a practical path to smarter GenAI applications.

Table of contents
Redis Vector Library (RedisVL) features
Redis as a vector database for similarity search
Tutorial: Building a RAG system with Redis and W&B
Additional Redis tools for building RAG systems
Tools for building chatbots
Conclusion
Redis Vector Library (RedisVL) features
RedisVL simplifies how developers utilize Redis as a vector database, particularly in LLM-intensive applications. At its core, it provides clean abstractions for creating vector indexes, storing embeddings, and running similarity queries without needing to manage Redis commands directly. This makes Redis feel less like a low-level store and more like a purpose-built vector layer that can plug into a retrieval-augmented pipeline.
Developers can define schemas for embeddings, store metadata alongside vectors, and query with a hybrid search that combines semantic similarity with traditional filters. These capabilities transform Redis into a strong alternative to standalone vector databases, particularly for teams that already utilize it as part of their infrastructure.
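As a rough sketch of what a hybrid query looks like in code (assuming an existing RedisVL SearchIndex named index whose schema includes a "category" tag field, and a query embedding produced by your own model):

from redisvl.query import VectorQuery
from redisvl.query.filter import Tag

# Hypothetical hybrid query: nearest-neighbor search restricted to one document category.
# "category" is an illustrative tag field; it must exist in your index schema.
hybrid_query = VectorQuery(
    vector=query_embedding,              # assumed: a list[float] from your embedding model
    vector_field_name="embedding",
    return_fields=["title", "chunk", "vector_distance"],
    num_results=5,
    filter_expression=(Tag("category") == "api-reference"),
)
results = index.query(hybrid_query)      # assumed: a SearchIndex you created beforehand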
In addition to these fundamentals, RedisVL also introduces features that enhance its usefulness for advanced LLM applications. Semantic caching helps avoid redundant queries, routing allows requests to be steered toward the right models or tools, and vectorized memory management supports longer, context-rich interactions. These extras are not the main reason to adopt RedisVL, but they add a layer of intelligence once the basic retrieval loop is in place.
Redis as a vector database for similarity search
Retrieval-augmented generation combines a language model with a fast search system that can retrieve fresh, relevant information whenever someone asks a question. When a user submits a question, the system turns that question into a vector embedding. Redis is used as the vector database, storing these embeddings for your documents, code samples, or any other materials you want the model to use as reference. The system then searches Redis to find the content with embeddings most similar to the question. Redis is designed for fast search across huge datasets, so it can quickly pull back the most relevant docs, passages, or examples.
Those results are passed to the language model. Now, instead of just guessing based on its training, the model can use this real and up-to-date context, which is pulled from your own data, to generate a more accurate and useful response. This allows it to reference company documentation, internal notes, or any information that was not included in the model’s original training.
By using Redis for retrieval, your RAG system stays current as your data changes. Redis gives you useful tools for the job: you can add filters to limit results by source or recency, rerank retrieved results to surface better answers, and use caching to make repeat searches faster. All of this ensures that the language model always has the right context, leading to more relevant answers and fewer hallucinations.
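For instance, limiting results by recency and source is just another filter expression attached to the vector search. The sketch below assumes your schema also stores a numeric updated_at field (a Unix timestamp) and a source tag field; both are illustrative and not part of the tutorial schema used later in this article:

import time

from redisvl.query import VectorQuery
from redisvl.query.filter import Num, Tag

# Hypothetical filters: only chunks from official docs updated in the last 30 days.
thirty_days_ago = int(time.time()) - 30 * 24 * 3600
recent_docs = (Num("updated_at") > thirty_days_ago) & (Tag("source") == "docs")

query = VectorQuery(
    vector=question_embedding,           # assumed: embedded with the same model used at ingestion
    vector_field_name="embedding",
    return_fields=["title", "chunk", "vector_distance"],
    num_results=10,
    filter_expression=recent_docs,
)
recent_hits = index.query(query)         # assumed: the same SearchIndex as above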
Tutorial: Building a RAG system with Redis and W&B
In this section, you’ll set up a retrieval-augmented generation pipeline using Redis and Qwen3. The main idea is to make your language model smarter by allowing it to retrieve real information from your documents or notes at answer time, rather than relying solely on what it learned during training. You’ll load and process your documents, store them in Redis as searchable embeddings, and wire everything together so your model can pull the best matches as context for every answer.
To start, I will install Weave, RedisVL, and Redis Stack (I’m on Mac, but for other systems, feel free to check out the docs here):
pip install redisvl weave
brew tap redis-stack/redis-stack
brew install redis-stack
brew services start redis-stack
I’ll start by sharing the full code for the RAG system, and then dive deeper into the details. Here’s the code:
import os
import re
import uuid
import time
import subprocess
import requests
import numpy as np
from pathlib import Path
from typing import List, Dict

import wandb
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery
from redis import Redis
from redisvl.utils.rerank import HFCrossEncoderReranker

cross_encoder_reranker = HFCrossEncoderReranker("BAAI/bge-reranker-base")

import weave; weave.init('redis_rag')

# =========================
# Config (no CLI)
# =========================
os.environ.setdefault("REDIS_URL", "redis://localhost:6379")
os.environ.setdefault("WANDB_PROJECT", "redisrag")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ["WANDB_API_KEY"] = "your_wandb_api_key"

# set these via env if you can
WANDB_API_KEY = os.getenv("WANDB_API_KEY")
if not WANDB_API_KEY:
    raise SystemExit("Set WANDB_API_KEY in your environment")

# behavior toggles (set 1/0 in env)
FORCE_RECREATE = True
INCLUDE_ISSUES = False

# speed knobs
FAST = False
MAX_PAGES = 1 if FAST else 5
MAX_FILES_PER_REPO = 100 if FAST else 1000
MAX_CHARS = 800 if FAST else 1200
OVERLAP = 100 if FAST else 150
EMBED_BATCH = 64

MODEL_NAME = "Qwen/Qwen3-235B-A22B-Thinking-2507"

# sources
REPOS = {
    "weave": "wandb/weave",
}
ISSUE_REPOS = {
    "weave": "wandb/weave",
}
ISSUES_ENDPOINT = "/issues?state=all&per_page=100"

WANDB_PROJECT = os.getenv("WANDB_PROJECT")
REDIS_URL = os.getenv("REDIS_URL")

# =========================
# Init
# =========================
client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=WANDB_API_KEY,
    default_headers={"OpenAI-Project": "wandb_fc/quickstart_playground"}
)

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
EMBED_DIMS = embedder.get_sentence_embedding_dimension()

schema = {
    "index": {"name": "docs_idx", "prefix": "docs"},
    "fields": [
        {"name": "doc_id", "type": "tag"},
        {"name": "title", "type": "text"},
        {"name": "chunk", "type": "text"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": EMBED_DIMS,
                "distance_metric": "cosine",
                "algorithm": "hnsw",
                "datatype": "float32",
            },
        },
    ],
}

index = SearchIndex.from_dict(schema, redis_url=REDIS_URL, validate_on_load=True)
r = Redis.from_url(REDIS_URL, decode_responses=False)

# =========================
# Helpers
# =========================
def clean_markdown(text: str) -> str:
    if not text:
        return ""
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)
    text = re.sub(r'\[([^\]]+)\]\((.*?)\)', r'\1', text)
    return text.strip()


def download_docs(repo_name: str, repo_url: str, path: str) -> None:
    p = Path(path)
    if p.exists():
        print(f"{repo_name} docs already present: {path}")
        return
    print(f"Cloning {repo_name} docs...")
    cmd = ["git", "clone", "--depth", "1", f"https://github.com/{repo_url}.git", path]
    try:
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        print(f"OK: {repo_name} docs cloned to {path}")
    except subprocess.CalledProcessError:
        print(f"Failed to clone {repo_url} into {path}")


def load_texts(path: str = "docs") -> List[Dict]:
    p = Path(path)
    files = [p] if p.is_file() else [f for ext in ("*.md", "*.txt") for f in p.rglob(ext)]
    files = files[:MAX_FILES_PER_REPO]
    docs = []
    for fp in files:
        try:
            txt = fp.read_text(encoding="utf-8", errors="ignore")
            docs.append({"title": fp.stem, "content": clean_markdown(txt)})
        except Exception:
            continue
    return docs


def chunks_of(text: str, max_chars: int = MAX_CHARS, overlap: int = OVERLAP) -> List[str]:
    words, out, cur, cur_len = text.split(), [], [], 0
    for w in words:
        cur.append(w)
        cur_len += len(w) + 1
        if cur_len >= max_chars:
            out.append(" ".join(cur))
            cur = cur[-overlap:]
            cur_len = sum(len(x) + 1 for x in cur)
    if cur:
        out.append(" ".join(cur))
    return out


def fetch_issues(repo_url: str, max_pages: int = MAX_PAGES) -> List[Dict]:
    if not INCLUDE_ISSUES:
        return []
    all_issues, page = [], 1
    while page <= max_pages:
        url = f"https://api.github.com/repos/{repo_url}{ISSUES_ENDPOINT}&page={page}"
        try:
            resp = requests.get(url, headers={"Accept": "application/vnd.github.v3+json"}, timeout=30)
            if resp.status_code != 200:
                break
            issues = resp.json()
            if not issues:
                break
            for issue in issues:
                all_issues.append({
                    "title": issue.get("title", "untitled"),
                    "content": issue.get("body", "") or "",
                })
            page += 1
            if not FAST:
                time.sleep(1)
        except requests.RequestException:
            break
    return all_issues


def ingest_data(docs: List[Dict]) -> int:
    uniq = []
    seen = set()
    for d in docs:
        title = d.get("title", "untitled")
        for ch in chunks_of(d.get("content", "")):
            if not ch.strip():
                continue
            key = (title, ch[:200])
            if key in seen:
                continue
            seen.add(key)
            uniq.append((title, ch))
    if not uniq:
        return 0
    titles, chunks = zip(*uniq)
    vecs = embedder.encode(
        list(chunks),
        normalize_embeddings=True,
        batch_size=EMBED_BATCH,
        show_progress_bar=True
    )
    rows = []
    for title, ch, vec in zip(titles, chunks, vecs):
        rows.append({
            "doc_id": str(uuid.uuid4()),
            "title": title,
            "chunk": ch,
            "embedding": np.asarray(vec, dtype=np.float32).tobytes(),
        })
    if rows:
        index.load(rows)
    return len(rows)


def ensure_index() -> None:
    if FORCE_RECREATE:
        print("Forcing index recreation")
        index.create(overwrite=True)
        return
    if index.exists():
        print("Index exists, not recreating")
        return
    index.create(overwrite=False)
    print("Index created")


def index_doc_count() -> int:
    try:
        info = index.info()
        n = info.get("num_docs") or info.get("numDocs") or 0
        return int(n)
    except Exception:
        total = 0
        for _ in r.scan_iter(match="docs:*", count=10000):
            total += 1
            if total >= 1_000_000:
                break
        return total


@weave.op
def rerank_hits(query: str, hits: List[Dict], limit: int | None = None):
    if not hits:
        print("No hits to rerank")
        return [], []
    docs = []
    for i, h in enumerate(hits):
        content = (h.get("chunk") or h.get("title") or "").strip()
        if content:
            docs.append({"content": content, "orig_idx": i})
    if not docs:
        print("All hits are empty after filtering (no content)")
        return [], []
    try:
        results, scores = cross_encoder_reranker.rank(query=query, docs=docs, limit=limit)
        # results is a list of dicts with at least "content" and "orig_idx"
        ranked = [hits[d["orig_idx"]] for d in results]
        return ranked, scores
    except Exception as e:
        print("Rerank ERROR:", e)
        return [], []


@weave.op
def retrieve(query: str, k: int = 5):
    qvec = embedder.encode(query, normalize_embeddings=True)
    vq = VectorQuery(
        vector=qvec.tolist(),
        vector_field_name="embedding",
        return_fields=["doc_id", "title", "chunk", "vector_distance"],
        num_results=k,
    )
    return index.query(vq)


@weave.op
def answer(query: str, k: int = 5, max_tokens: int = 800):
    raw_hits = retrieve(query, k=k * 4)  # widen recall a bit
    if not raw_hits:
        return "No results in the index yet."
    hits, _scores = rerank_hits(query, raw_hits, limit=k)
    context = "\n\n".join(f"[{h['title']}] {h['chunk']}" for h in hits)
    messages = [
        {"role": "system", "content": "Answer with cited snippets in brackets like [title]. If unknown, say you don't know."},
        {"role": "user", "content": f"Question:\n{query}\n\nContext:\n{context}"},
    ]
    resp = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        temperature=0.2,
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content


# =========================
# Main
# =========================
if __name__ == "__main__":
    ensure_index()
    before = index_doc_count()
    print(f"Index doc count before: {before}")

    total_chunks = 0
    if FORCE_RECREATE or before == 0:
        for repo_name, repo_url in REPOS.items():
            print(f"Processing {repo_name}")
            doc_path = f"{repo_name}_docs"
            if not Path(doc_path).exists():
                download_docs(repo_name, repo_url, doc_path)
            docs_data = load_texts(doc_path)
            issues_data = fetch_issues(ISSUE_REPOS[repo_name]) if INCLUDE_ISSUES else []
            print(f"{repo_name}: docs={len(docs_data)} issues={len(issues_data)}")
            added = ingest_data(docs_data + issues_data)
            total_chunks += added
            print(f"{repo_name}: ingested_chunks={added}")
        print(f"Total ingested chunks this run: {total_chunks}")
    else:
        print("Index already populated, skipping ingestion")

    after = index_doc_count()
    print(f"Index doc count after: {after}")

    q = "How do I log a trace with Weave - for any arbitrary code?"
    print("\nQ:", q)
    print("\nA:", answer(q))
After running the script, you now have a full retrieval-augmented generation stack running on Redis, with document ingestion, embedding, fast search, and answer generation all connected.
The script includes a few toggles that control how it runs. Setting FAST to True trims the workload by limiting pages, files, and chunk sizes so you can do a quick test instead of a full ingestion. FORCE_RECREATE controls whether Redis wipes and rebuilds the index on every run or reuses the existing data. Finally, INCLUDE_ISSUES determines whether GitHub issues are ingested alongside the documentation, giving you the option to expand the knowledge base beyond repository files.
Everything starts with environment setup: the script loads config values for Redis, Weights & Biases, and the W&B Inference endpoint (accessed through the OpenAI client), then connects to Redis with Redis.from_url, sets up the vector index using SearchIndex.from_dict, and initializes both the embedding model (SentenceTransformer) and the language model client (OpenAI). Weave is also initialized with weave.init, letting you track metrics and experiment results throughout.
For data ingestion, you fetch documentation from GitHub using the download_docs function, then scan for .md and .txt files in the downloaded repo. Each file’s content is read, then cleaned with clean_markdown to strip links, images, and Markdown syntax, producing plain text. The cleaned text is broken into overlapping chunks using chunks_of, so you can search for specific concepts instead of entire files.
def clean_markdown(text: str) -> str:
    if not text:
        return ""
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)
    text = re.sub(r'\[([^\]]+)\]\((.*?)\)', r'\1', text)
    return text.strip()


def download_docs(repo_name: str, repo_url: str, path: str) -> None:
    p = Path(path)
    if p.exists():
        print(f"{repo_name} docs already present: {path}")
        return
    print(f"Cloning {repo_name} docs...")
    cmd = ["git", "clone", "--depth", "1", f"https://github.com/{repo_url}.git", path]
    try:
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        print(f"OK: {repo_name} docs cloned to {path}")
    except subprocess.CalledProcessError:
        print(f"Failed to clone {repo_url} into {path}")


def load_texts(path: str = "docs") -> List[Dict]:
    p = Path(path)
    files = [p] if p.is_file() else [f for ext in ("*.md", "*.txt") for f in p.rglob(ext)]
    files = files[:MAX_FILES_PER_REPO]
    docs = []
    for fp in files:
        try:
            txt = fp.read_text(encoding="utf-8", errors="ignore")
            docs.append({"title": fp.stem, "content": clean_markdown(txt)})
        except Exception:
            continue
    return docs


def chunks_of(text: str, max_chars: int = MAX_CHARS, overlap: int = OVERLAP) -> List[str]:
    words, out, cur, cur_len = text.split(), [], [], 0
    for w in words:
        cur.append(w)
        cur_len += len(w) + 1
        if cur_len >= max_chars:
            out.append(" ".join(cur))
            cur = cur[-overlap:]
            cur_len = sum(len(x) + 1 for x in cur)
    if cur:
        out.append(" ".join(cur))
    return out
Each chunk is embedded using the sentence transformer’s encode method, producing a vector embedding that captures the text’s meaning. The script then builds a row for each chunk, including its title, the chunk text, and its embedding, and loads all the rows into Redis using index.load. The Redis Query Engine and the Redis Vector Library create the underlying index and schema, allowing you to perform vector similarity searches across the whole knowledge base.
def ingest_data(docs: List[Dict]) -> int:
    uniq = []
    seen = set()
    for d in docs:
        title = d.get("title", "untitled")
        for ch in chunks_of(d.get("content", "")):
            if not ch.strip():
                continue
            key = (title, ch[:200])
            if key in seen:
                continue
            seen.add(key)
            uniq.append((title, ch))
    if not uniq:
        return 0
    titles, chunks = zip(*uniq)
    vecs = embedder.encode(
        list(chunks),
        normalize_embeddings=True,
        batch_size=EMBED_BATCH,
        show_progress_bar=True
    )
    rows = []
    for title, ch, vec in zip(titles, chunks, vecs):
        rows.append({
            "doc_id": str(uuid.uuid4()),
            "title": title,
            "chunk": ch,
            "embedding": np.asarray(vec, dtype=np.float32).tobytes(),
        })
    if rows:
        index.load(rows)
    return len(rows)
At this point, Redis is populated with vectors representing your documentation. When you ask a question, the system embeds the query using the same embedding model, then runs a vector search on Redis via VectorQuery and index.query. This efficiently finds the top matching chunks from your database.
@weave.op
def retrieve(query: str, k: int = 5):
    qvec = embedder.encode(query, normalize_embeddings=True)
    vq = VectorQuery(
        vector=qvec.tolist(),
        vector_field_name="embedding",
        return_fields=["doc_id", "title", "chunk", "vector_distance"],
        num_results=k,
    )
    return index.query(vq)


@weave.op
def answer(query: str, k: int = 5, max_tokens: int = 800):
    raw_hits = retrieve(query, k=k * 4)  # widen recall a bit
    if not raw_hits:
        return "No results in the index yet."
    hits, _scores = rerank_hits(query, raw_hits, limit=k)
    context = "\n\n".join(f"[{h['title']}] {h['chunk']}" for h in hits)
    messages = [
        {"role": "system", "content": "Answer with cited snippets in brackets like [title]. If unknown, say you don't know."},
        {"role": "user", "content": f"Question:\n{query}\n\nContext:\n{context}"},
    ]
    resp = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        temperature=0.2,
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content
To boost answer quality, the script re-ranks the top hits using a cross-encoder reranker (HFCrossEncoderReranker). This step sorts the results by true semantic relevance to your query, not just vector distance. The top-ranked chunks are joined into a context string.
@weave.op
def rerank_hits(query: str, hits: List[Dict], limit: int | None = None):
    if not hits:
        print("No hits to rerank")
        return [], []
    docs = []
    for i, h in enumerate(hits):
        content = (h.get("chunk") or h.get("title") or "").strip()
        if content:
            docs.append({"content": content, "orig_idx": i})
    if not docs:
        print("All hits are empty after filtering (no content)")
        return [], []
    try:
        results, scores = cross_encoder_reranker.rank(query=query, docs=docs, limit=limit)
        # results is a list of dicts with at least "content" and "orig_idx"
        ranked = [hits[d["orig_idx"]] for d in results]
        return ranked, scores
    except Exception as e:
        print("Rerank ERROR:", e)
        return [], []
The answer is then generated by calling the client.chat.completions.create method with your question and the selected context. Qwen3 uses both its reasoning capabilities and the retrieved information to answer, and the output references the original document titles for transparency. Weights & Biases logs every stage, including how many docs were processed and how retrieval and generation performed, giving you visibility as you tune the pipeline.
This setup gives you a real RAG system: you can easily update your data or change the embedding model, re-run ingestion, and Redis will rebuild the vector index. You can also plug in new reranking strategies or swap language models without re-architecting anything. Embedding, indexing, searching, reranking, and generation each have their own functions, so you can adapt as your requirements change while keeping everything monitored with W&B Weave.
In this code, I also leveraged W&B Inference. With W&B Inference, I can access a catalog of open-source language models, run tests in the Weave Playground, and compare outputs from different models using the same interface. There’s no need to manage separate providers or deal with extra API keys. When I want to swap models, I just update the model name (or, for a different provider, the API endpoint and key) in my client code.
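Concretely, the client setup from the script above is all it takes, and swapping models is a one-line change to the model name passed with each request. The alternate model name below is a placeholder, not a specific entry in the W&B Inference catalog:

import os
from openai import OpenAI

# Same W&B Inference endpoint used in the RAG script above.
client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    default_headers={"OpenAI-Project": "wandb_fc/quickstart_playground"},
)

MODEL_NAME = "Qwen/Qwen3-235B-A22B-Thinking-2507"  # model used throughout this article
# MODEL_NAME = "some-org/another-open-model"       # hypothetical placeholder: swap in any catalog model

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Summarize what RedisVL does in one sentence."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)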

All model calls are automatically tracked in Weave, giving instant insight into usage, performance, and cost. This makes it easy to monitor how each model is performing as the app evolves, and allows me to quickly experiment with new models or switch between them without incurring extra overhead. The process of trying, comparing, and deploying models is unified and visible, so the entire pipeline stays easy to optimize and maintain.
After running the system, you can go to Weave and monitor how each part of your RAG pipeline is working in real time. By adding the @weave.op decorator to your retrieve and answer functions, every call to these methods is automatically logged. This means you get detailed metrics on how retrieval and answer generation perform, including stats like which queries are being asked, how often each function is called, and what results are returned. You can track retrieval accuracy, latency, and even see which chunks were selected as context for each answer.
This level of observability makes it straightforward to debug, optimize, and iterate on your pipeline as your data or requirements evolve. Here’s a screenshot inside Weave after running our pipeline:



Additional Redis tools for building RAG systems
Beyond simple vector search, Redis brings more intelligence to retrieval pipelines through features like embedding caching and semantic routing.
Embedding caching stores vector representations of texts that have already been processed. When a new text or query comes in, the cache is checked first. If the embedding is already there, it’s retrieved instantly, saving time and compute. If not, the system generates the embedding and saves it for later use. This reduces redundant work and speeds up retrieval, especially for frequently repeated texts or queries.
Semantic routing adds another layer of control. Routes are defined for topics such as technology or sports, each with example references and similarity thresholds. When a query arrives, Redis calculates its vector and compares it to the predefined routes. The query is then matched to the most relevant topic or handler based on semantic similarity, making the retrieval process more flexible and adaptive.
The typical flow starts by setting up a vectorizer and initializing the embedding cache. Topic routes are defined, each specifying references and thresholds. The semantic router is then initialized, connecting everything together. When a query is processed, the system checks for its embedding in the cache, generates it if missing, and then sends the query through the semantic router to determine its best match.
Here’s a very basic script demonstrating how this works in practice:
import os

from redisvl.extensions.router import Route, SemanticRouter
from redisvl.extensions.cache.embeddings import EmbeddingsCache
from redisvl.utils.vectorize import HFTextVectorizer

# Setup
os.environ["TOKENIZERS_PARALLELISM"] = "false"
redis_url = "redis://localhost:6379"

# Vectorizer (same embedding model used for the RAG index earlier in this article)
vectorizer = HFTextVectorizer(model="BAAI/bge-small-en-v1.5")

# 1. Initialize cache
cache = EmbeddingsCache(
    name="demo_cache",
    redis_url=redis_url,
    ttl=3600  # cache for 1 hour
)

# 2. Define routes
tech = Route(
    name="technology",
    references=["latest AI news", "new gadgets", "trending in tech"],
    metadata={"category": "tech"},
    distance_threshold=0.7
)
sports = Route(
    name="sports",
    references=["last night game", "sports events", "basketball and football"],
    metadata={"category": "sports"},
    distance_threshold=0.72
)

# 3. Initialize Semantic Router
router = SemanticRouter(
    name="demo-router",
    vectorizer=vectorizer,
    routes=[tech, sports],
    redis_url=redis_url,
    overwrite=True
)

# 4. Query example with caching
query = "Tell me about the latest in AI"

# Try cache first
if result := cache.get(text=query, model_name=vectorizer.model):
    print("Cache hit:", result["text"])
    embedding = result["embedding"]
else:
    print("Cache miss: generating embedding...")
    embedding = vectorizer.embed(query)
    cache.set(text=query, model_name=vectorizer.model, embedding=embedding)

# 5. Route the query
route_match = router(query)
print("Best route:", route_match)
This code shows how Redis extends basic retrieval to include caching and semantic routing. After initializing a vectorizer and connecting to Redis, an embedding cache is set up to store and reuse text embeddings. Two routes are defined, each one capturing a topic area with example references and a distance threshold for matching.
When a query arrives, the system first checks if an embedding already exists in the cache. A cache hit returns the vector instantly. If not, a new embedding is created and stored for future queries. This approach keeps response times fast, even as the volume of queries grows.
The semantic router then compares the embedding against each route. Using semantic similarity, the router automatically determines to which topic the query belongs. This makes it possible to direct questions to specialized subsystems, tools, or indexes behind the scenes, all while handling embeddings and routing at high speed in Redis. These tools provide flexible, low-latency building blocks for routing, caching, and managing knowledge in modern RAG systems.
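As a sketch of that dispatch step, building on the router from the script above (tech_index and sports_index are hypothetical RedisVL indexes you would create separately, one per topic):

# Hypothetical dispatch: send each query to a topic-specific index based on the matched route.
route_match = router("Tell me about the latest in AI")

handlers = {
    "technology": tech_index,   # hypothetical SearchIndex holding tech documents
    "sports": sports_index,     # hypothetical SearchIndex holding sports documents
}

if route_match and route_match.name in handlers:
    target_index = handlers[route_match.name]
    print(f"Routing query to the '{route_match.name}' index")
else:
    print("No route matched; falling back to a default index")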
Tools for building chatbots
Chatbots that actually remember conversations and avoid repeating themselves are finally possible with the right tools. RedisVL adds two key building blocks that make this possible: a semantic cache for instant lookup of previous answers, and a semantic message history that lets chatbots pull in the most relevant parts from past interactions. With these, a chatbot can answer quickly, reuse knowledge, and keep track of what has already been discussed, all running on a single Redis backend.
Here’s an example of how RedisVL’s SemanticCache and SemanticMessageHistory work together to give a chatbot real memory and instant recall.
import os
import sys
import asyncio

from redis import Redis
from redisvl.utils.vectorize import HFTextVectorizer
from redisvl.extensions.cache.llm import SemanticCache
from redisvl.extensions.message_history import SemanticMessageHistory
from openai import OpenAI

# ---- Config
FORCE_CLEAR = True
USE_CACHE = True
USE_HISTORY = True
TOP_K_HISTORY = 3

WANDB_API_KEY = os.getenv("WANDB_API_KEY", "your_wandb_api_key")
if not WANDB_API_KEY:
    print("Set WANDB_API_KEY in your environment.")
    sys.exit(1)

REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
MODEL_NAME = "Qwen/Qwen3-235B-A22B-Thinking-2507"
HF_EMBED_MODEL = "BAAI/bge-small-en-v1.5"

# ---- Redis and Vectorizer setup
redis = Redis.from_url(REDIS_URL, decode_responses=False)
vectorizer = HFTextVectorizer(model=HF_EMBED_MODEL)

# ---- Semantic Cache
semantic_cache = SemanticCache(
    name="llmcache",
    distance_threshold=0.1,
    vectorizer=vectorizer,
    redis_client=redis,
)

# ---- Semantic History
semantic_history = SemanticMessageHistory(
    name="chat_history",
    vectorizer=vectorizer,
    redis_client=redis,
    distance_threshold=0.35,
)

if FORCE_CLEAR:
    semantic_cache.clear()
    semantic_history.clear()

# ---- LLM client
client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=WANDB_API_KEY,
    default_headers={"OpenAI-Project": "wandb_fc/quickstart_playground"},
)


def get_llm_response(prompt):
    # 1. try cache
    if USE_CACHE:
        cache_hits = semantic_cache.check(prompt)
        if cache_hits and cache_hits[0].get("response"):
            print("[CACHE HIT] ✓")
            return cache_hits[0]["response"]

    # 2. pull context from semantic history
    context_msgs = []
    if USE_HISTORY:
        context_msgs = semantic_history.get_relevant(prompt, top_k=TOP_K_HISTORY)

    sys_message = "You are a helpful assistant. Cite any provided context."
    messages = [{"role": "system", "content": sys_message}]
    user_message = prompt
    if context_msgs:
        history_text = "\n".join(f"{m['role']}: {m['content']}" for m in context_msgs)
        user_message += f"\n\nContext:\n{history_text}"
    messages.append({"role": "user", "content": user_message})

    # 3. query LLM
    result = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        temperature=0.2,
        max_tokens=800,
    )
    response = result.choices[0].message.content

    # 4. update history and cache
    if USE_HISTORY:
        semantic_history.store(prompt, response)
    if USE_CACHE:
        semantic_cache.store(prompt, response)
    return response


async def chat_cli():
    print("RedisVL Chatbot with configurable cache/history via vars")
    print("Type 'exit' or Ctrl+C to quit.\n")
    while True:
        prompt = input("You: ").strip()
        if not prompt or prompt.lower() == "exit":
            break
        answer = get_llm_response(prompt)
        print(f"Bot: {answer}\n")


if __name__ == "__main__":
    try:
        asyncio.run(chat_cli())
    except (KeyboardInterrupt, EOFError):
        print("\nGoodbye.")
SemanticCache works by storing question and answer pairs as embeddings in Redis. When a prompt comes in, it first checks the cache to see if a similar question was already asked. If so, it retrieves the previous answer immediately, saving you from having to call the model again and reducing response time. This isn’t just basic string matching; it uses vector similarity to spot questions that are phrased differently but mean the same thing.
SemanticMessageHistory enables chatbots to recall and utilize past conversation turns. Every message and its reply are embedded and saved in Redis. When a new question arrives, SemanticMessageHistory looks for previous messages that are semantically related, pulls them out, and adds them as extra context before the model answers. The result is a chatbot that can reference previous topics, clarify context, and respond more naturally across multiple turns.
For every prompt, the script checks for a cache hit first. If there’s no match, it pulls the most relevant conversation snippets from history and uses those to help answer the new question. After the response is generated, both the cache and the message history are updated, so future conversations get smarter as more data is seen.
With this setup, you get a chatbot or assistant that answers faster, repeats itself less, and actually remembers what’s been said before. Everything runs on top of Redis, and you can monitor and reset the cache and history as needed, making it practical for production and quick iteration. This is what lets RedisVL serve as the backbone for more intelligent, more responsive conversational AI.
Conclusion
Redis started out as the go-to tool for low-latency key-value storage and lightning-fast caching. Now, with RedisVL, it’s evolved into a platform that can power modern retrieval-augmented AI systems, letting teams search, store, and manage dense vector data with the same speed and reliability that made Redis popular in the first place.
On top of the fundamental tools for building a RAG pipeline, RedisVL’s semantic cache and message history unlock smarter, faster conversational agents that avoid repeating themselves and actually remember what’s been said. Features like semantic routing and hybrid search let you route queries by topic or context, giving you fine-grained control over knowledge and retrieval. Everything remains easy to monitor and update thanks to tight integration with W&B Weave.
What’s most compelling is that all this runs on a system already familiar to thousands of engineering teams. You get the performance and reliability of Redis, the flexibility to swap models or data sources, and the tools needed to build systems that stay current and grounded in real information. Redis, as a vector database, is not just a technical upgrade; it’s a practical way to push retrieval-augmented AI further without reinventing your entire stack.