
Muvera tutorial: Experiment tracking and observability with W&B Models

Scale search with MUVERA: Google’s FDE approach cuts latency 18x while W&B observability ensures reliable, trackable experiments.
Google’s MUVERA (Multi-Vector Retrieval via Fixed Dimensional Encodings) is a sophisticated algorithm built to enhance both the precision and efficiency of search results. Instead of relying on a single vector, it interprets the deeper meaning and context of complex queries using multiple vectors. The system encodes both query and document data into a unified, easily searchable format, enabling it to pinpoint the most relevant portions of content and deliver more accurate, contextually appropriate answers.
Imagine trying to represent the meaning of an entire document or complex query with just one vector. It’s like summarizing a movie in a single sentence; you’ll capture the gist but miss a lot of nuance. That’s the limitation of traditional single-vector retrieval: it compresses all semantics into one vector, making search fast with Maximum Inner Product Search (MIPS), but often at the cost of precision.
Multi-vector models, such as ColBERT, solve this by representing each token with its own vector and scoring queries against documents using Chamfer similarity (MaxSim). This allows different parts of a query to match different parts of a document, capturing much richer relationships. The drawback is that comparing sets of vectors is computationally expensive, requiring more storage and incurring higher latency.
Systems like PLAID reduce this cost with complex multi-stage pipelines, but at the expense of simplicity and tunability. MUVERA takes a different path: it compresses a set of token vectors into a single Fixed Dimensional Encoding (FDE) that closely approximates multi-vector similarity — preserving semantic richness while enabling fast, standard single-vector search.
This is significant because it combines the accuracy of multi-vector retrieval with the scalability and speed of existing vector search infrastructure. In the sections that follow, we’ll break down how MUVERA works, why FDEs are effective, and walk through a real-world example of using MUVERA for retrieval.

What is MUVERA?

MUVERA, or Multi-Vector Retrieval via Fixed Dimensional Encodings, is a retrieval algorithm that bridges the efficiency gap between single-vector and multi-vector search systems. At its core, MUVERA solves a deceptively simple problem: how do you take a set of vectors (like the token embeddings from a ColBERT model) and compress them into a single vector while preserving the essential similarity relationships?
The key insight is that most multi-vector retrieval systems struggle not because multi-vector representations are inherently slow, but because we lack efficient ways to search over them. Traditional approaches, such as PLAID, essentially perform multiple single-vector searches (one for each query token) and then attempt to piece together the results through complex filtering and pruning stages. This works, but it's like having to make dozens of phone calls to gather information that could be obtained in one conversation.
Muvera's two-step retrieval process, compared to PLAID's multi-stage retrieval process.
MUVERA flips this approach. Instead of searching over individual tokens, it creates what's called a Fixed Dimensional Encoding (FDE) for each document and query. Think of an FDE as a carefully, asymmetrically constructed fingerprint that captures the essential characteristics of all the token embeddings in a compressed form. The crucial property is that when you compute the dot product between a query FDE and a document FDE, you get a score that closely approximates what you would have gotten by computing the full Chamfer similarity between all the original token embeddings.
This transformation is powerful because it reduces multi-vector search to exactly the same problem as single-vector search. You can use any existing MIPS algorithm, whether it's a simple brute-force scan or sophisticated approximate nearest neighbor methods, such as graph-based search. Your infrastructure doesn't need to change; your indexing strategies remain the same, and you can leverage decades of optimization work that have gone into making single-vector search fast.
MUVERA also comes with theoretical guarantees that the FDE approximation will be accurate within a specified error bound. This means you can tune the system to get exactly the accuracy-speed trade-off you need, with mathematical confidence that you won't miss truly relevant documents in your initial retrieval step.
The algorithm operates in two straightforward stages:
  • First, use the FDE approximation to quickly retrieve a set of candidate documents through standard MIPS;
  • Second, re-rank these candidates using the exact Chamfer similarity to ensure perfect accuracy in the final results.
This is dramatically simpler than the four-stage pipeline used by systems like PLAID, with fewer parameters to tune and fewer potential points of failure.
What makes MUVERA particularly appealing is its data-oblivious nature. The FDE transformation utilizes random projections and is independent of the specific characteristics of your dataset. This means the same transformation works across different domains, handles distribution shifts gracefully, and supports streaming applications where new documents are constantly being added to your index.
Now, let's understand how all of this works in detail in the next section.

How does MUVERA transform multi-vector retrieval?

MUVERA's transformation of multi-vector retrieval is built around a technique called Fixed Dimensional Encodings (FDEs). To truly understand this, let's walk through the entire process step by step, from the fundamental challenges to the elegant solution.
Before digging into MUVERA's solution, let's understand exactly what makes multi-vector retrieval so computationally expensive. In traditional single-vector systems, comparing a query to a document is straightforward: compute the dot product between the query and the document. But with multi-vector models like ColBERT, you're comparing sets of vectors.
The Chamfer similarity computation looks like this:
\[ \mathrm{CHAMFER}(Q,P) = \sum_{q \in Q} \max_{p \in P} \langle q, p \rangle \]

This means for each query token q, you need to:
  • Compare it against every document token p
  • Find the maximum similarity
  • Sum all these maximums
If your query has 32 tokens (typical for ColBERT) and your document has 80 tokens, that results in 2,560 dot products per document. For a corpus of 8 million documents, you're looking at over 20 billion operations per query. Even with optimizations, this becomes prohibitively expensive for real-time applications.
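To make the cost concrete, here is a minimal NumPy sketch of the Chamfer (MaxSim) score for one query-document pair; the shapes and random data are illustrative, not taken from the benchmark:
import numpy as np

def chamfer_similarity(Q: np.ndarray, P: np.ndarray) -> float:
    # CHAMFER(Q, P): for each query token, take the best-matching document
    # token's dot product, then sum those maxima over all query tokens.
    sims = Q @ P.T                      # (num_query_tokens, num_doc_tokens) dot products
    return float(sims.max(axis=1).sum())

Q = np.random.randn(32, 128)            # 32 query tokens, 128-dim embeddings
P = np.random.randn(80, 128)            # 80 document tokens
print(chamfer_similarity(Q, P))         # one score already costs 32 x 80 = 2,560 dot products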

MUVERA's core insight: Structured compression

MUVERA's breakthrough comes from recognizing that we don't need to preserve every detail of the token-level interactions. We just need to preserve the similarity relationships that matter for retrieval quality. The solution is to compress each set of token embeddings into a single vector, while maintaining the essential matching patterns.
Think of it like creating a smart summary. Instead of reading every word of two books to compare them, you create structured chapter summaries that capture the key themes, then compare those summaries. If done well, you'll be able to identify which books are similar without having to read every word.

Partitioning the embedding space with SimHash

Before we can aggregate embeddings, we need to decide which embeddings should be grouped together. This is where SimHash comes in. It's a clever way to divide up the high-dimensional embedding space into regions such that similar embeddings are likely to end up in the same region.



The basic intuition

Imagine you have a 2D space (like a piece of paper) and you want to divide it into regions. You could draw a few lines across the paper, and each intersection of lines creates a different region. SimHash performs a similar operation, but in a high-dimensional space using hyperplanes instead of lines.

How SimHash works step-by-step

Let's walk through this with a concrete example:
Step 1: Create Random Hyperplanes
  • We generate several random vectors, let's say 3 of them: g₁, g₂, g₃
  • Each vector represents a hyperplane passing through the origin
  • These are just random directions in our embedding space
Step 2: Test Each Embedding Against Each Hyperplane
For any embedding vector (like the embedding for the word "doctor"), we check which side of each hyperplane it falls on:
  • Take the dot product of "doctor" with g₁. If it's positive, we get bit "1", if negative, we get bit "0"
  • Do the same with g₂ and g₃
  • This gives us a 3-bit code, like "101"
Step 3: The Binary Code Becomes a Region ID
  • "000" = Region 0
  • "001" = Region 1
  • "010" = Region 2
  • "011" = Region 3
  • "100" = Region 4
  • "101" = Region 5 ← Our "doctor" embedding
  • "110" = Region 6
  • "111" = Region 7
So "doctor" gets assigned to Region 5.
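Here is a minimal sketch of that assignment step in NumPy; the hyperplanes and the "doctor" vector below are random placeholders, and 3 hyperplanes are assumed, as in the example:
import numpy as np

rng = np.random.default_rng(42)
d, k = 128, 3                                   # embedding dimension, number of hyperplanes
hyperplanes = rng.standard_normal((k, d))       # g1, g2, g3: random directions

def simhash_region(embedding: np.ndarray) -> int:
    # Which side of each hyperplane does the embedding fall on?
    bits = (hyperplanes @ embedding > 0).astype(int)    # e.g. [1, 0, 1]
    return int("".join(map(str, bits)), 2)              # binary code "101" -> region 5

doctor = rng.standard_normal(d)                 # stand-in for the "doctor" embedding
print(simhash_region(doctor))                   # a region id between 0 and 7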

Why this works: The locality property

The magic of SimHash is that similar embeddings tend to produce similar binary codes. Here's why: if "doctor" and "physician" are semantically similar, their embedding vectors point in similar directions. Similar vectors are likely to fall on the same side of most hyperplanes. Therefore, they'll get similar (or identical) binary codes, and similar binary codes mean they end up in the same region or nearby regions.

Why use random hyperplanes?

You might wonder: why not learn optimal hyperplanes that perfectly separate similar from dissimilar embeddings? There are several reasons MUVERA uses random hyperplanes:
  • Data-oblivious: Random hyperplanes work for any dataset without training
  • Robust: Performance doesn't change when your data distribution shifts
  • Simple: No complex optimization required
  • Theoretically grounded: Random projections have well-understood mathematical properties

Controlling the number of regions

The number of hyperplanes determines how many regions you get (for k hyperplanes, you get 2^k regions):
  • 1 hyperplane → 2 regions (not very useful)
  • 3 hyperplanes → 8 regions
  • 5 hyperplanes → 32 regions
  • 6 hyperplanes → 64 regions
More regions mean finer partitions (better precision) but also a higher chance that similar embeddings might end up in different regions by bad luck. MUVERA typically uses 3-6 hyperplanes, giving 8-64 regions, which provides a good balance.

What happens next

Once every embedding has been assigned to a region via SimHash, we can proceed to the aggregation step. All query embeddings in region 11 get summed together, all document embeddings in region 11 get averaged together, and so on. The beauty is that semantically related embeddings (like "doctor" and "physician") are likely to be grouped together, so when we compute the dot product between query and document FDEs, we're essentially matching related concepts against each other, which is exactly what we want to approximate the original Chamfer similarity.

Asymmetric aggregation - the heart of FDEs

Now comes the crucial part: how do we aggregate the embeddings within each region? MUVERA employs different strategies for queries versus documents, and this asymmetry is what enables the entire system to function.
For Query FDEs - Use Summation: Within each region, all query token embeddings are summed together. If region 13 contains embeddings for "machine" and "learning", the query FDE block for region 13 becomes the sum: embedding("machine") + embedding("learning").
Illustration of the construction of query FDEs. Each token (shown as a word in this example) is mapped to a high-dimensional vector (2-D in the example for simplicity). The high-dimensional space is randomly partitioned by hyperplane cuts. Each piece of space is assigned a block of coordinates in the output FDE, which is set to the sum of the coordinates of the query vectors that land in that piece.
For Document FDEs - Use Averaging: Within each region, document token embeddings are averaged. If region 13 contains embeddings for "AI", "artificial", and "intelligence", the document FDE block for region 13 becomes: [embedding("AI") + embedding("artificial") + embedding("intelligence")] / 3.
Why this asymmetry? It mirrors how Chamfer similarity actually works:
  • Each query token should contribute its full "weight" to the final similarity score (hence summation)
  • Document tokens provide the "best available match" for query tokens, and averaging prevents any single document token from dominating just because it appears with many other tokens in the same region
Illustration of the construction of document FDEs. The construction is the same as the query construction, except that the vectors falling in a given piece of the partitioned space are averaged together instead of summed, which accurately captures the asymmetric nature of the Chamfer similarity.
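The sketch below shows just this asymmetric aggregation (sum for queries, average for documents), assuming each token has already been assigned a region id; inner projections, repetitions, and empty-cluster filling are left out here and covered below:
import numpy as np

def aggregate_fde(tokens: np.ndarray, regions: np.ndarray,
                  num_regions: int, is_query: bool) -> np.ndarray:
    # tokens: (num_tokens, d) embeddings; regions: (num_tokens,) region ids
    d = tokens.shape[1]
    blocks = np.zeros((num_regions, d))
    for r in range(num_regions):
        members = tokens[regions == r]
        if len(members) == 0:
            continue                             # empty regions stay zero in this sketch
        blocks[r] = members.sum(axis=0) if is_query else members.mean(axis=0)
    return blocks.reshape(-1)                    # concatenated FDE of size num_regions * d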

A detailed example: Healthcare query

Let's trace through a complete example to see how this works:
Query: "machine learning healthcare applications"
Document: "AI algorithms for medical diagnosis and patient care"
After SimHash partitioning (let's say with 4 hyperplanes, giving us 16 regions):
  • "machine" → region 3
  • "learning" → region 3
  • "healthcare" → region 7
  • "applications" → region 12
  • "AI" → region 3
  • "algorithms" → region 3
  • "medical" → region 7
  • "diagnosis" → region 7
  • "patient" → region 9
  • "care" → region 9
Query FDE construction:
  • Region 3: sum of "machine" + "learning" embeddings
  • Region 7: "healthcare" embedding
  • Region 12: "applications" embedding
  • All other regions: zero vectors
Document FDE construction:
  • Region 3: average of "AI" + "algorithms" embeddings
  • Region 7: average of "medical" + "diagnosis" embeddings
  • Region 9: average of "patient" + "care" embeddings
  • All other regions: zero vectors
When we compute the dot product between these FDEs:
  • Region 3 contributes: (machine+learning) · (AI+algorithms)/2, capturing how well "machine learning" matches "AI algorithms"
  • Region 7 contributes: healthcare · (medical+diagnosis)/2, capturing how well "healthcare" matches "medical diagnosis"
  • Region 9 contributes: 0 (no query tokens in this region)
The final FDE dot product approximates what we would have gotten by computing the full Chamfer similarity between all token pairs.

Handling edge cases - Empty clusters

One practical issue arises: what if a query token falls into a region where no document tokens exist? This would create artificial zeros in the similarity computation, potentially missing good matches.
MUVERA solves this with a "fill empty clusters" strategy for documents. If a document has no tokens in a particular region, that region gets filled with the document token that's closest to that region (measured by Hamming distance between the binary signatures). This ensures every query token has something to match against.
Importantly, this is only done for document FDEs, not query FDEs. Filling empty query regions would cause query tokens to contribute multiple times to the similarity score, which would break the approximation.
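A hedged sketch of that filling rule is shown below, assuming each document token's SimHash code is available as an integer; the actual muvera-py implementation may organize this differently:
import numpy as np

def fill_empty_doc_blocks(blocks, doc_tokens, token_codes, num_bits):
    # blocks: (num_regions, d) document FDE blocks, zero where a region is empty
    num_regions = 2 ** num_bits
    for r in range(num_regions):
        if not np.any(token_codes == r):
            # Hamming distance between region id r and each token's binary code
            dists = [bin(r ^ int(c)).count("1") for c in token_codes]
            blocks[r] = doc_tokens[int(np.argmin(dists))]   # nearest token fills the gap
    return blocks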

Controlling dimension and variance

The basic method we've described creates FDEs with a dimension of B × d (number of regions × original embedding dimension). For a typical setup with 32 regions and 128-dimensional embeddings, this results in a 4,096-dimensional vector. For more complex setups, this can grow even larger. To manage this, MUVERA provides two primary "tuning knobs" to control the trade-off between the FDE's size, search speed, and accuracy.

1. Inner projections (shrinking the building blocks)

The first knob controls the size of each region's vector within the FDE. Instead of using the full, original token embeddings (dimension d) for aggregation, we can first reduce their dimensionality. This is done by applying a random linear projection to each token embedding before it's assigned to a region's sum or average.
This projection maps the embedding from its original d dimensions to a much smaller d_proj (projection dimension). Now, when we aggregate these smaller vectors, our "building blocks" for the FDE are already more compact. This means each of the B regions in our FDE will be represented by a d_proj-dimensional vector instead of a d-dimensional one.

2. Multiple repetitions (improving robustness)

The second knob is used to enhance the quality and reliability of the approximation. Any single random partitioning might get unlucky and separate two very similar tokens. To protect against this, we don't just run the process once. Instead, we repeat the entire partition-and-aggregate process R_reps times, each with a fresh, independent set of random hyperplanes.
This gives us R_reps different FDEs. We then concatenate all of them together to produce our final, more robust FDE. If one repetition happens to group tokens poorly, the other repetitions will likely compensate.

Putting it all together

By using these two knobs, the final dimension of our FDE is no longer B × d, but is instead determined by this formula:
Final FDE Dimension = B (regions) × d_proj (projected dimension) × R_reps (repetitions)
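As a quick sanity check, here is the arithmetic with illustrative values; 6 SimHash projections and 20 repetitions match the tutorial configuration later on, while the projected dimension of 8 is an assumption chosen to land on the same 10,240-dimensional FDE:
num_simhash_projections = 6        # B = 2**6 = 64 regions
d_proj = 8                         # assumed projected dimension (illustrative)
r_reps = 20                        # repetitions

fde_dim = (2 ** num_simhash_projections) * d_proj * r_reps
print(fde_dim)                     # 64 * 8 * 20 = 10,240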

The two-stage retrieval pipeline

Once we have FDEs for all documents and queries, retrieval becomes remarkably simple:

Stage 1 - FDE-based candidate retrieval:

  • Compute the query's FDE
  • Use any MIPS algorithm to find the most similar document FDEs
  • Retrieve the top-K candidates (typically K=100 or K=1000)

Stage 2 - Exact reranking:

  • For each candidate document, compute the exact Chamfer similarity using the original token embeddings
  • Re-rank the candidates by exact similarity
  • Return the final ranked list
The first stage utilizes a fast single-vector search to significantly reduce the search space. The second stage ensures perfect accuracy by using exact similarity computation, but only on a small set of candidates.
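In code, the whole pipeline reduces to a few lines. The sketch below uses brute-force NumPy for the MIPS step and reuses the chamfer_similarity helper sketched earlier, so names and shapes are illustrative rather than taken from the benchmark script:
import numpy as np

def muvera_retrieve(query_tokens, query_fde, doc_fdes, doc_token_sets,
                    k_candidates=100, k_final=10):
    # Stage 1: single-vector MIPS over document FDEs (brute force here)
    scores = doc_fdes @ query_fde
    candidates = np.argsort(-scores)[:k_candidates]

    # Stage 2: exact Chamfer similarity, computed only for the candidates
    exact = [(chamfer_similarity(query_tokens, doc_token_sets[i]), i) for i in candidates]
    exact.sort(reverse=True)
    return [doc_idx for _, doc_idx in exact[:k_final]]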

Understanding the pipeline comparison

Now let's examine how this compares to existing approaches using the pipeline diagram:

MUVERA's Elegant Simplicity: The left side shows MUVERA's streamlined approach:
  • Stage 1: Convert query embeddings to query FDE, then use standard MIPS (like DiskANN) to find the top K_s most similar document FDEs
  • Stage 2: Re-rank these K_s candidates using exact Chamfer similarity to get the final top-K documents
PLAID's Complex Multi-Stage Process: The right side shows PLAID's four-stage pipeline:
  • Stage 1: For each query token, retrieve similar document tokens using centroids and clustering
  • Stage 2: Score and prune candidates using centroid interactions
  • Stage 3: Further scoring and pruning with additional centroid processing
  • Stage 4: Final ranking with document decompression and exact MaxSim computation
The complexity difference is striking. MUVERA has two clean stages with one major tuning parameter (K_s, the number of candidates to re-rank). PLAID has four stages, each with its own parameters for clustering, pruning thresholds, and scoring strategies.
This simplicity translates to practical benefits:
  • Easier deployment: Fewer components to configure and monitor
  • More robust performance: Less sensitive to parameter tuning
  • Better scalability: Can leverage any existing MIPS infrastructure
  • Clearer debugging: When something goes wrong, there are fewer places to look

Theoretical foundation: Why this works

The authors prove that with appropriate parameter choices, the FDE dot product provides an ε-additive approximation to the normalized Chamfer similarity:
\[ \big| \langle F_q(Q), F_{\text{doc}}(P) \rangle - \mathrm{CHAMFER}(Q,P) \big| \le \epsilon \]

This means you can tune the FDE dimension to achieve any desired approximation accuracy. Need higher precision? Use more regions, more repetitions, or higher-dimensional inner projections. The theory tells you exactly how these parameters trade off against approximation quality.
Moreover, this guarantee extends to the full nearest neighbor search problem: if you retrieve the top-K documents by FDE similarity, you're guaranteed (with high probability) to find the true top Chamfer-similarity documents among them, up to the approximation error ε.

The data-oblivious advantage

One final crucial property: MUVERA's transformation is data-oblivious. The random hyperplanes used for partitioning don't depend on your specific dataset; they're just randomly sampled from a Gaussian distribution. This provides several important benefits:
  • Domain robustness: The same FDE construction works across different domains without retraining
  • Distribution shift resilience: Performance doesn't degrade when your data distribution changes
  • Streaming support: New documents can be encoded and indexed without affecting existing FDEs
  • Simplicity: No model training, no hyperparameter optimization on your data
This is fundamentally different from learned approaches that might optimize partitions for a specific dataset. While learned approaches might achieve slightly better performance on the training distribution, they can be brittle when conditions change.

TLDR

MUVERA's transformation of multi-vector retrieval is elegant in its simplicity once you understand the components:
  • Partition the embedding space randomly but consistently
  • Aggregate tokens within partitions asymmetrically (sum for queries, average for documents)
  • Repeat the process multiple times to reduce variance
  • Search using the resulting single vectors with standard MIPS algorithms
  • Re-rank a small set of candidates with exact similarity
Now that we've covered the theory behind how MUVERA works, let's put it into practice. In the following tutorial section, we'll walk through setting up a complete benchmark to compare MUVERA against PLAID and see these performance gains for yourself.

Tutorial: Getting started with MUVERA

In this tutorial, we’ll set up and run MUVERA using muvera-py, a clean, community Python port of Google’s original C++ implementation. For the baseline, we’ll use ragatouille, a Python toolkit for building and searching ColBERT-style multi-vector indexes, which also includes a solid PLAID implementation.
We’ll compare three configurations:
  1. PLAID baseline (current standard for efficient multi-vector retrieval)
  2. Pure FDE search (no reranking)
  3. Full Muvera pipeline (FDE + reranking)
The goal is to keep it minimal so you can follow along quickly.
Note: This tutorial shows only the code snippets relevant to each section. You can run the complete benchmark and reproduce the results using the script in the GitHub repo.

Environment setup

First, create an isolated environment with uv and install the required dependencies.
# Clone the Muvera Python implementation
git clone https://github.com/sigridjineth/muvera-py
cd muvera-py

# Create and sync environment
uv sync

# Install dependencies
uv pip install faiss-cpu
# uv pip install neural_cherche # optional, for extra experiments
uv pip install datasets==3.6.0
uv pip install ragatouille
uv pip install transformers==4.49
Once set up, we’ll be ready to run retrieval experiments and compare Muvera’s speed and accuracy against PLAID.

Tracking experiments with Weights & Biases

Let’s ensure our Muvera experiments are tracked from the outset. We’ll log in, spin up a run, and store our key parameters so everything is reproducible and neatly visualized in the dashboard.
Here’s a minimal snippet adapted from our benchmark script:
import wandb

# Log in to your W&B account
wandb.login()

# Initialize a new run
wandb.init(
    project="muvera-benchmark",
    name="muvera_with_reranking",
    config={
        "embedding_model": "colbert-ir/colbertv2.0",
        "fde_dimension": 10240,
        "num_queries": 50,
        "rerank_k": 100,  # Number of candidates to re-rank
        "max_document_length": 256,
        "batch_size": 32,
    },
)
This starts a W&B run that will capture all our metrics, tables, and artifacts for the Muvera vs PLAID showdown.

Loading the SciDocs dataset

Let’s load a realistic evaluation dataset, because toy examples can’t tell us much about the real-world speedups Muvera can deliver. We’ll be using SciDocs, a benchmark from the BEIR suite designed to evaluate scientific document retrieval systems.
SciDocs is built from scientific papers in the Semantic Scholar corpus and simulates the kind of retrieval problem you’d face in a real research search engine. It comes with three main pieces:
  1. Corpus – Around 25,700 scientific papers, each with:
    • _id: a unique document identifier
    • title: the paper’s title
    • text: usually the abstract or a concise description of the paper’s content
  2. Queries – Short natural-language queries (often paper titles) representing the “search” side of the benchmark. Each query has its own unique query-id.
  3. Qrels – Short for “query relevance judgments.” This is the glue that connects queries to relevant documents in the corpus. Each row has:
    • query-id – matches one in the queries set
    • corpus-id – matches one in the corpus
    • score – a relevance score, typically 1 for relevant and 0 for not relevant in SciDocs
The structure allows you to start from a query, look up all its relevant corpus IDs in the qrels, and then fetch those documents from the corpus. In our experiments, these qrels serve as our ground truth: after Muvera or PLAID returns a ranked list of document IDs for a query, we compare them against the qrels to compute metrics such as recall@k.
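For reference, a plausible implementation of the calculate_recall helper that our benchmark script calls later might look like this; treat it as a sketch, since the exact version lives in the repo:
def calculate_recall(retrieved_ids, relevant_ids, k_values=(10, 100)):
    # Fraction of the ground-truth (qrels) documents found in the top-k results
    recalls = {}
    for k in k_values:
        top_k = set(retrieved_ids[:k])
        recalls[f"recall@{k}"] = (
            len(top_k & relevant_ids) / len(relevant_ids) if relevant_ids else 0.0
        )
    return recalls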
Below we can see some samples of the data logged using W&B Tables:



One important detail: SciDocs is fairly dense in relevant pairs; however, because the corpus is large, the retrieval problem remains challenging. The average document, once encoded with ColBERT, produces ~165 token embeddings, making the dataset ideal for stress-testing multi-vector retrieval methods.

Designing the baseline: PLAID with Ragatouille

To determine if MUVERA truly represents a step forward, we need to compare it against a strong and relevant baseline. The natural choice is PLAID, the highly optimized retrieval engine for ColBERTv2, which represents the state-of-the-art in efficient multi-vector retrieval.
Instead of building PLAID's complex four-stage pipeline from scratch, we use the excellent ragatouille library. It provides a clean, user-friendly interface for creating and searching PLAID-style indexes, making it the perfect tool for a robust benchmark.
The process consists of two main steps: indexing the corpus and then searching it.

Indexing the corpus

First, we need to encode our entire SciDocs corpus into a ColBERT index. ragatouille handles all the heavy lifting behind a simple .index() method. This includes chunking documents, generating token-level embeddings for every document, and organizing them into the structure PLAID uses for fast retrieval.
Here’s the snippet from our benchmark script that handles this. To avoid re-computing this expensive step every time, we first check if an index directory already exists and create one only if it's missing.
index_name = f"scidocs_index_{len(corpus)}"
index_path = os.path.join(".ragatouille/colbert/indexes/", index_name)

if not os.path.exists(index_path):
    logger.info(f"Index not found at '{index_path}'. Creating a new one...")
    rag_model = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    doc_ids_as_str = [str(doc_id) for doc_id in corpus.keys()]
    rag_model.index(
        collection=list(corpus.values()),
        document_ids=doc_ids_as_str,
        index_name=index_name,
        max_document_length=args.max_document_length,
        bsize=args.batch_size,
    )
else:
    logger.info(f"Loading model from existing index at '{index_path}'")
    rag_model = RAGPretrainedModel.from_index(index_path)

Searching and evaluation

With the index built, searching is just as straightforward. For each of our test queries, we call the .search() method and ask for the top 100 results (k=100). We wrap this call to time precisely how long the retrieval takes and then compute the recall to measure its accuracy.
latencies = []
all_metrics = {"recall@10": [], "recall@100": []}

for qid in tqdm(query_ids, desc="Searching with Ragatouille"):
    start_search_time = time.time()
    retrieved_docs = rag_model.search(
        query=queries[qid], k=100, index_name=index_name
    )
    latencies.append((time.time() - start_search_time) * 1000)

    retrieved_ids = [doc["document_id"] for doc in retrieved_docs]
    recall_metrics = calculate_recall(
        retrieved_ids, ground_truth.get(qid, set()), k_values=[10, 100]
    )
    # ... metric aggregation ...
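One convenient pattern is to log the aggregated baseline numbers to the active W&B run before moving on; a minimal sketch is shown below, where the metric key names are our own choice rather than anything mandated by the script:
import numpy as np
import wandb

# Log aggregated baseline metrics to the run started earlier with wandb.init()
wandb.log({
    "plaid/avg_latency_ms": float(np.mean(latencies)),
    "plaid/recall@10": float(np.mean(all_metrics["recall@10"])),
    "plaid/recall@100": float(np.mean(all_metrics["recall@100"])),
})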
This setup ensures we're making a fair comparison. We're not pitting MUVERA against a simplified or weak baseline; we're benchmarking it against a powerful and widely-used implementation of the current leading method for multi-vector search. Now, let's examine how MUVERA's FDE-based approach compares.

Setting up the MUVERA pipeline

Now we get to the fun part: building the two-stage MUVERA pipeline.

Getting the raw token embeddings

Before we can create FDEs, we need the original multi-vector embeddings. For a fair fight, we use the exact same colbert-ir/colbertv2.0 model to generate token-level embeddings for every document and query in our dataset. This ensures that any performance difference we see is due to the retrieval strategy (MUVERA vs. PLAID), not the underlying embeddings.
Our script handles this by loading a fresh model instance and encoding the corpus and queries in-memory.
# Using a fresh model instance for MUVERA's encoding pass
muvera_rag_model = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Encode documents to get their token embeddings
muvera_rag_model.encode(
    documents=list(corpus.values()),
    document_metadatas=[{"doc_id": doc_id} for doc_id in corpus.keys()],
)
in_memory_embeddings = muvera_rag_model.model.in_memory_embed_docs.cpu().numpy()

# Encode queries to get their token embeddings
for qid in tqdm(query_ids, desc="Generating Query Embeddings"):
    q_emb_batch_tensor = muvera_rag_model.model.inference_ckpt.queryFromText([queries[qid]])
    if q_emb_batch_tensor is not None:
        query_embeddings_map[qid] = q_emb_batch_tensor[0].cpu().numpy()

Creating Fixed Dimensional Encodings (FDEs)

Using our muvera-py implementation, we take the raw token embeddings for each document and query and convert them into single, high-dimensional FDE vectors.
We configure the FDE generation with parameters that strike a balance between speed and accuracy, as explored in the original paper. For this benchmark, we use 6 SimHash projections and 20 repetitions, resulting in a final FDE of 10,240 dimensions. Crucially, we employ an asymmetric strategy: we sum embeddings for queries but average them for documents, and we enable fill_empty_partitions for documents to prevent missing potential matches.
# Configure the FDE generation process
fde_config = FixedDimensionalEncodingConfig(
    dimension=128,  # Original embedding dimension
    num_repetitions=20,  # Reduces variance
    num_simhash_projections=6,  # Creates 2^6 = 64 regions
    final_projection_dimension=fde_dimension,  # Target dimension: 10240
    seed=42,
)

# Use asymmetric configurations for documents and queries
doc_fde_config = replace(fde_config, fill_empty_partitions=True)
query_fde_config = replace(fde_config, fill_empty_partitions=False)

# Generate the FDEs
logger.info("Generating Document FDEs (Batch)...")
doc_fdes = generate_document_fde_batch(doc_embeddings_list, doc_fde_config)

logger.info("Generating Query FDEs...")
query_fdes = np.array(
    [
        generate_query_fde(q_emb, query_fde_config)
        for q_emb in tqdm(query_embeddings_list, desc="Query FDEs")
    ]
)

Lightning-fast candidate search with FAISS

With our document and query sets successfully compressed into single vectors, we can now use a standard, highly optimized vector search library. We use FAISS (Facebook AI Similarity Search), a popular choice for MIPS.
We build a simple IndexFlatIP index, which performs an exact inner product search. We then add our document FDEs to this index and search it with our query FDEs to retrieve the top 100 candidates (rerank_k=100). This first pass is incredibly fast because it's just a single vector search.
# Normalize vectors for inner product search
faiss.normalize_L2(doc_fdes)
faiss.normalize_L2(query_fdes)

logger.info(f"Performing fast FDE candidate search (k={rerank_k})...")
index = faiss.IndexFlatIP(doc_fdes.shape[1])
index.add(doc_fdes)

# Search all queries in a single batch operation
start_faiss_time = time.time()
_, candidate_indices_matrix = index.search(query_fdes, rerank_k)
faiss_search_time = (time.time() - start_faiss_time) * 1000
This step gives us a list of promising candidates for each query. We'll evaluate this "FDE-only" method as one of our comparison points.

Reranking for final accuracy

The final stage of the MUVERA pipeline combines the speed of the FDE search with the precision of the original multi-vector model. We take the 100 candidates retrieved by FAISS and re-score them using the exact, but slower, ColBERT MaxSim (Chamfer) similarity function.
This ensures our final ranking is highly accurate, as we're only spending the extra computation time on a small, highly relevant subset of the corpus.
logger.info(f"Re-ranking top {rerank_k} candidates...")
# ... loop through queries and their candidates ...

rerank_scores = []
for doc_idx in candidate_indices_matrix[i]:
    doc_id = doc_ids_list[doc_idx]
    doc_emb = doc_embeddings_map.get(doc_id)
    # Calculate the exact score using original embeddings
    score = calculate_maxsim(query_emb, doc_emb)
    rerank_scores.append((score, doc_id))

# Sort candidates by their new, exact scores
rerank_scores.sort(key=lambda x: x[0], reverse=True)
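The snippet above calls a calculate_maxsim helper. A minimal version consistent with the MaxSim/Chamfer definition could look like this; the repo's implementation may differ in details such as batching or normalization:
import numpy as np

def calculate_maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # Exact ColBERT MaxSim / Chamfer score: best document-token match per query token, summed
    if doc_emb is None:
        return float("-inf")            # candidates without stored embeddings rank last
    sims = query_emb @ doc_emb.T        # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())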
This two-stage process is the essence of MUVERA's efficiency. By smartly reducing the search space first, we get the best of both worlds: the recall of a complex multi-vector model at a fraction of the latency. Now, let's look at the results.

The results: Simplicity crushes complexity

After running our configurations, the results are in. The Weights & Biases Table below captures the showdown between the complex SOTA pipeline and MUVERA's elegant two-stage approach. The columns to watch are Avg Latency (how fast it is) and Speedup (how much faster it is than the baseline).


Let's break down what this table reveals:
  • Baseline - Ragatouille (PLAID): Our baseline uses ragatouille to implement PLAID, the current state-of-the-art (SOTA) engine for making multi-vector retrieval efficient. PLAID achieves this through an elaborate and complex four-stage pipeline involving multiple steps of candidate generation, filtering, and pruning. While powerful, this multi-step process can be difficult to manage and highly sensitive to parameter tuning. Our results show it sets a high bar for accuracy (Recall@100 of 0.98), but this complexity comes at a performance cost: 92.12 ms per query.
  • MUVERA (FDE-only): This is the raw power of MUVERA's first stage. Instead of a complex pipeline, we simply convert the multi-vector sets into FDEs and run a single vector search. The result is a staggering drop in latency to just 0.4461 ms, delivering a 206.5x speedup. This highlights the incredible efficiency of replacing a multi-stage process with a single, smart transformation.
  • MUVERA (FDE + Rerank): This is the full MUVERA system. By adding a quick reranking step to the FDE search, the total latency is a mere 5.048 ms. This is still a massive 18.2x speedup over the SOTA PLAID baseline. MUVERA achieves this by completely sidestepping the need for a complex, managed pipeline.
Now, the critical question is: what do we give up for this massive gain in speed and simplicity?
Looking at the Recall@100 column, the answer is: almost nothing. The final MUVERA pipeline's recall is nearly identical to that of the far more complex PLAID system. This is the core takeaway: MUVERA matches the accuracy of the state-of-the-art multi-stage method while being an order of magnitude faster and significantly simpler to implement and manage.
The Weights & Biases integration provides the visibility and reproducibility needed for production machine learning systems, enabling team collaboration and systematic optimization of your retrieval pipeline.
Remember that MUVERA's true value emerges at scale. While our tutorial dataset demonstrates the methodology, the speed advantages become even more pronounced with larger document corpora and higher query volumes.

Conclusion

MUVERA offers a practical solution to a persistent challenge in information retrieval: the trade-off between the detailed semantic matching of multi-vector models and the speed required for real-world applications. Its core method, Fixed Dimensional Encodings (FDEs), effectively approximates the expensive multi-vector similarity score, enabling a much faster initial search using standard single-vector techniques. As we saw in our hands-on benchmark, this approach is highly effective. We measured an 18.2x speedup over a well-optimized PLAID baseline while observing only a minimal drop in recall. This demonstrates that MUVERA provides a viable alternative to complex, multi-stage retrieval pipelines, offering a strong balance between performance and accuracy.
The efficiency gained from this method has clear practical implications. As demonstrated in our tutorial, the significant reduction in latency makes it more feasible for developers to use powerful multi-vector models in systems where query speed is critical, such as live search or recommendation services. Beyond the performance gains, a key advantage is MUVERA's relative simplicity. The two-stage process, a fast candidate search followed by a targeted reranking, can be easier to implement and maintain than the more intricate systems it aims to replace. Overall, MUVERA presents a compelling and useful technique for anyone building large-scale semantic retrieval systems today.