RAG vs. prompt stuffing: Do we still need vector retrieval?
An exploration of whether vector retrieval still makes sense now that LLMs can handle massive context windows.
Retrieval-augmented generation (RAG) has become the default recipe for tying large language model (LLM) outputs to a living, external corpus. By fusing vector-based document retrieval with an LLM’s generative power, practitioners have built chatbots, assistants, and automation that can answer with authority while citing ground truth. But what happens now that modern LLMs can handle context windows stretching into the hundreds of thousands or even a million tokens?
The million-token question is: do we still need the standard RAG vector database? Or, given a big enough context window, could we simply “stuff” our corpus directly into the prompt and let the LLM act not just as a generator, but as its own retriever as well? When does this strategy hold up, and where does it break down? What are the tradeoffs, not just in accuracy, but in operational simplicity and (crucially) cost? Since inference costs scale with context size, and models have hard limits (roughly one million tokens per request for current cloud offerings), these are real and pressing considerations for anyone building real-world systems.
In this article, you’ll see exactly how classic, “vanilla” RAG pipelines stack up against in-context learning at the bleeding edge of context length: where the boundaries lie, when one strategy starts to prevail, and which practical and economic factors should influence your architecture choices in 2025 and beyond.

Table of contents
- What is RAG?
- What is in-context learning RAG (a.k.a. prompt stuffing)?
- How it works
- 1. Knowledge base preparation
- 2. Building the vector store for RAG
- 3. Generating the evaluation test set
- 4. Comparing RAG (vanilla) and in-context RAG
- Conclusion
What is RAG?
Retrieval-augmented generation is the current standard for grounding generative models in external data. Its core loop can be summarized in three steps:
- Embedding: An embedding model transforms your knowledge base into a dense vector space, where semantic relationships are represented by geometric closeness.
- Retrieval: At runtime, the user's query is embedded in the same way, and the retriever finds the top K most similar document chunks, surfacing the evidence most likely to answer the question.
- Generation: The language model receives a prompt constructed from just those retrieved passages and crafts an answer grounded only in the data it was given.
What makes RAG powerful is this tight grounding and efficiency. Any answer is, ideally, only as good as the evidence supplied by the retriever, which means you get transparency, updatability, and flexibility. Citations become straightforward, new knowledge is available instantly (no retraining required), and your document corpus can grow to millions or billions of passages without swamping the LLM’s prompt window. Vector lookup is also fast and cheap: similarity metrics like cosine similarity cost almost nothing to compute.
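To make the loop concrete, here is a minimal sketch of those three steps. It is not the pipeline used later in this article (which relies on Gemini embeddings and a Chroma vector store); embed_fn and llm_fn are placeholders for whatever embedding model and LLM you plug in.
import numpy as np

def cosine_sim(a, b):
    # Geometric closeness in embedding space stands in for semantic similarity.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(query, chunks, chunk_vecs, embed_fn, llm_fn, k=5):
    # embed_fn and llm_fn are placeholders, not a specific provider's API.
    q_vec = embed_fn(query)                              # 1. embed the query like the corpus
    scores = [cosine_sim(q_vec, v) for v in chunk_vecs]  # 2. score every pre-embedded chunk
    top_k = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    context = "\n---\n".join(chunks[i] for i in top_k)   # 3. keep only the top-K evidence
    prompt = f"Answer using ONLY these sources:\n{context}\n\nQuestion: {query}"
    return llm_fn(prompt)                                # 4. generate from that evidence alone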
But RAG comes with genuine operational costs: maintaining the vector store, updating embeddings as documents change, and keeping retrieval performance high as your data expands. For many teams, the complexity is justified by the scaling benefits, but it’s no longer the only game in town.
What is in-context learning RAG (a.k.a. prompt stuffing)?
If today’s language models can handle hundreds of thousands or even a million tokens in the prompt, could we bypass retrieval altogether? Why not put as much of our knowledge base as possible directly into the prompt, and rely on the model’s own attention and reasoning to “find” and synthesize the answer?
How it works:
- Document batching: Select as many documents as will fit within your model’s maximum context window, which is currently capped at about 1 million tokens per LLM invocation for leading cloud models (or split the database into chunks that fit within the context window, and call the model several times in parallel).
- Prompt construction: Concatenate these documents into a single, structured prompt, followed by the user’s query. Instructions might specify: “You may only use information from the above documents to answer the question. Cite your sources as you go.”
- Single LLM call (or multiple in parallel): Make one extended inference call, letting the model function as both the retrieval and synthesis engine.
Why do this? Simplicity! No vector database, no dual-phase architecture, nothing to maintain but your prompt pipeline. For small to moderately sized corpora, or cases where batch-response latency and architecture complexity are major concerns, this might be attractive.
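As a rough sketch (with placeholder llm_fn and count_tokens functions standing in for your LLM call and tokenizer, not any particular provider's API), the whole "pipeline" collapses to packing chunks under the token limit and making one call:
def stuffed_prompt_answer(query, chunks, llm_fn, count_tokens, max_context_tokens=1_000_000):
    # Pack as many chunks as fit under the model's context limit.
    packed, budget = [], max_context_tokens
    for chunk in chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            break  # in practice you would shard the rest and call the model in parallel
        packed.append(chunk)
        budget -= cost
    prompt = (
        "\n---\n".join(packed)
        + "\n\nYou may only use information from the above documents to answer the question. "
        + "Cite your sources as you go.\n\nQuestion: " + query
    )
    return llm_fn(prompt)  # one extended call: the model is both retriever and generator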
But this approach has hard constraints.
- First, cost: LLM providers often charge per-token, and a single call with nearly 1M tokens can be orders of magnitude more expensive than RAG’s lean prompts.
- Second, performance: As context grows, relevant passages may be “buried,” and LLMs may struggle to maintain attention or recall over very long, heterogeneous content.
- Third, scalability: If your dataset grows larger than what fits in a context window, you have to sample, filter, or start sharding your queries, reintroducing retrieval-style design work in another form.
When using Gemini 2.5 Flash (gemini-2.5-flash-preview-05-20) on Vertex AI, input tokens are billed at $0.15 per million tokens. With context caching (enabled by default), you automatically get a 75% discount on cached input tokens, so once the database inputs have been cached the effective price drops to $0.0375 per million input tokens; the query input tokens (billed at full price) and the output tokens are charged on top of that for every request.
In contrast, embedding costs in RAG are even lower. Leading embedding models run as low as $0.0001 per thousand tokens (i.e., $0.10 per million tokens). Since a user query is typically just a few dozen tokens, embedding it costs less than a hundredth of a cent per question. Embedding your knowledge base is charged at the same negligible rate, and it only needs to be redone when documents change, so it behaves like a one-time cost per document rather than an ongoing per-query cost.
Operational impact:
- Even with the 75% discount, large context LLM calls (i.e., prompt stuffing with substantial knowledge) can still multiply your per-query costs compared to RAG, where only small chunks are retrieved and embedded on the fly.
- While prompt stuffing with Gemini 2.5 Flash is now significantly less expensive due to this discounted rate, RAG remains substantially more cost-effective, especially as your query volume and knowledge base expand.
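A quick back-of-the-envelope comparison using the rates quoted above illustrates the gap. The corpus size, query length, and chunk counts below are assumptions for illustration, not measurements from this article's experiment, and output tokens are ignored:
# Rates from the pricing discussed above (USD per token).
CACHED_INPUT_RATE = 0.15 * 0.25 / 1_000_000   # $0.0375 per million cached input tokens
INPUT_RATE        = 0.15 / 1_000_000          # full-price input tokens
EMBED_RATE        = 0.10 / 1_000_000          # query embedding

# Assumed workload: a 500k-token stuffed corpus, ~200-token questions,
# and a RAG prompt built from 20 retrieved chunks of ~500 tokens each.
corpus_tokens, query_tokens = 500_000, 200
rag_prompt_tokens = 20 * 500 + query_tokens

stuffing_cost = corpus_tokens * CACHED_INPUT_RATE + query_tokens * INPUT_RATE
rag_cost = query_tokens * EMBED_RATE + rag_prompt_tokens * INPUT_RATE

print(f"prompt stuffing: ~${stuffing_cost:.4f} per query")  # roughly $0.02
print(f"vanilla RAG:     ~${rag_cost:.4f} per query")       # an order of magnitude cheaper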
Overall, in-context prompt stuffing is really only a legitimate option if you have a generous budget and a reasonably small database (under roughly 10 million tokens, which could theoretically be handled by multiple LLM calls running in parallel over different sections of the database).
Note: Self-hosting your own large language model (in the cloud or on-premise) could potentially reduce inference costs, which may change the math and make large-context approaches more practical in some scenarios. If you control the infrastructure, you might be able to take greater advantage of prompt stuffing without the heavy per-token fees imposed by cloud LLM providers.
The next section will walk through concrete code for each method and present side-by-side results, so you can decide, based on your own needs, data volume, and budget, what’s best for your next knowledge-driven LLM app.
1. Knowledge base preparation
The starting point for the experiment is to construct a standardized set of knowledge chunks, which will be used identically for both the RAG and in-context prompt stuffing strategies.
This process begins by loading a sizable text corpus, in this case, a subset of English Wikisource, using the HuggingFace Datasets library. Each raw document is converted into a compatible Langchain Document object, making it easy to slice, embed, or query later. However, to avoid running into context window or token budget problems, the documents must be split into chunks. The Langchain RecursiveCharacterTextSplitter is applied to break these documents into smaller, semantically usable segments that retain meaningful context. These chunks will be more adaptable for both retrieval and prompt stuffing scenarios.
For each chunk, we keep track of the total number of tokens using Gemini's own token counter, which gives an accurate estimate of API costs and keeps us within cloud LLM input limits. As the chunks are produced, the code maintains a running tally of the Gemini token counts, stopping as soon as the total reaches the preset maximum limit of 500k tokens (which can be set to the full million-token context window if needed).
This curation ensures the resulting data is always safely within allowable context sizes for in-context experiments and fairly distributed for both RAG and prompt stuffing.
The final output is a JSON file, where each entry contains metadata, the text content, and dual token counts per chunk. This file will serve as the knowledge base for both RAG pipelines and in-context learning comparisons that will follow.
To start, install the following packages:
pip install ragas==0.2.15 langchain==0.3.25 google-genai==1.11.0 langchain-community langchain-openai datasets transformers chromadb plotly wandb pandas tqdm
Note that you will need to enable Vertex AI in the Google Cloud Console and install the gcloud CLI on your local system.
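If you haven't already authenticated locally, application-default credentials are the usual route (the project ID here is a placeholder):
gcloud auth application-default login
gcloud config set project your-project
With the environment ready, we'll write a script to create our initial dataset: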
import os
import json
from tqdm import tqdm
from datasets import load_dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument
from transformers import GPT2TokenizerFast

# Vertex AI configuration must be set before the Gemini client is created.
os.environ["GOOGLE_CLOUD_PROJECT"] = "your-project"
os.environ["GOOGLE_CLOUD_LOCATION"] = "us-central1"
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "True"

from google import genai

client = genai.Client()
MAX_GEMINI_TOKENS = 500_000  # 1/2 million

def gemini_count_tokens(text):
    result = client.models.count_tokens(
        model="gemini-2.5-flash-preview-05-20",
        contents=text
    )
    if isinstance(result, dict):
        return result["total_tokens"]
    return result.total_tokens

# 1. Load dataset
dataset = load_dataset("wikimedia/wikisource", "20231201.en")
train_dataset = dataset['train'].select(range(50000))

# 2. Convert to Langchain Document objects
langchain_docs = [
    LangchainDocument(
        page_content=doc["text"],
        metadata={"id": doc["id"], "url": doc["url"], "title": doc["title"]}
    )
    for doc in tqdm(train_dataset)
]

# 3. Initialize text splitter and tokenizer
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# 4. Process chunks, capping the global Gemini token total
docs_processed = []
total_gemini_tokens = 0  # running total across all documents
done = False
for doc in tqdm(langchain_docs, desc="Docs"):
    if done:
        break
    chunks = text_splitter.split_documents([doc])
    for i, chunk in enumerate(chunks):
        chunk.metadata["custom_id"] = f"{chunk.metadata['id']}_chunk_{i}"
        gpt2_tokens = len(tokenizer.encode(chunk.page_content))
        chunk.metadata["gpt2_token_count"] = gpt2_tokens
        gemini_tokens = gemini_count_tokens(
            "ID: " + str(chunk.metadata['id']) + "\n" + str(chunk.metadata['title']) + "\n" + chunk.page_content
        )
        chunk.metadata["gemini_token_count"] = gemini_tokens
        if total_gemini_tokens + gemini_tokens > MAX_GEMINI_TOKENS:
            print("Global Gemini token limit reached, stopping.")
            done = True
            break  # stop processing further chunks (and docs)
        docs_processed.append(chunk)
        total_gemini_tokens += gemini_tokens

# 5. Convert to JSON format
def documents_to_json(docs):
    json_list = []
    for doc in docs:
        json_item = {
            "custom_id": doc.metadata["custom_id"],
            "id": doc.metadata["id"],
            "url": doc.metadata["url"],
            "title": doc.metadata["title"],
            "gpt2_token_count": doc.metadata["gpt2_token_count"],
            "gemini_token_count": doc.metadata["gemini_token_count"],
            "text": doc.page_content
        }
        json_list.append(json_item)
    return json_list

json_data = documents_to_json(docs_processed)

# 6. Save to disk
output_file_path = "./chunked_wikisource_data.json"
with open(output_file_path, 'w', encoding='utf-8') as file:
    json.dump(json_data, file, ensure_ascii=False, indent=4)
print(f"Chunked data has been saved to {output_file_path}")
By the end of this process, you should have a well-structured dataset of text chunks, each annotated with all the information necessary for experimentation. The full dataset is already limited in size to fit the desired LLM context window.
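Each entry in chunked_wikisource_data.json mirrors the fields built by documents_to_json above; the values shown here are purely illustrative placeholders:
{
    "custom_id": "12345_chunk_0",
    "id": "12345",
    "url": "https://en.wikisource.org/wiki/...",
    "title": "An example Wikisource page",
    "gpt2_token_count": 412,
    "gemini_token_count": 398,
    "text": "The first ~2,000 characters of the chunked page content..."
}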
This lays a foundation for fair experimentation: whether using a retrieval-based RAG pipeline or just stuffing as much as possible into the prompt, both methods will operate on the same document set. The saved JSON file streamlines all downstream steps, allowing you to reload this exact subset and reproduce results without redoing the costly pre-processing. This removes confounding variables and ensures that any differences in performance between strategies are due to the retrieval mechanism, rather than arbitrary data selection or inconsistent token boundaries.
2. Building the vector store for RAG
Now that the knowledge chunks have been created and token-limited, you can set up the RAG infrastructure. This section explains how to transform text chunks into semantic vector representations suitable for high-quality similarity search. The pipeline uses Google Gemini’s embedding model, wrapped in a custom Python class that embeds both documents and user queries in line with Gemini’s interface expectations.
The pre-existing JSON is loaded, and its entries are converted to Langchain Document objects. These will be embedded and stored in a Chroma vector database, a high-performance, disk-backed vector store that supports efficient similarity search at query time. Creating this vector database upfront is key to achieving fast, high-quality retrieval later, which is what powers the “retrieval” part of RAG. The code concludes by testing the pipeline with a sample query to ensure the index returns plausible results. You can verify both the embedding function and the data round-trip by examining the titles, token counts, and short excerpts.
import os

# Vertex AI configuration must be set before the Gemini client is created.
os.environ["GOOGLE_CLOUD_PROJECT"] = "your_project"
os.environ["GOOGLE_CLOUD_LOCATION"] = "us-central1"
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "True"

import json
from typing import List
from langchain.docstore.document import Document as LangchainDocument
from langchain_community.vectorstores import Chroma
from google import genai
from google.genai.types import EmbedContentConfig

# ------------- Custom Embedding wrapper for Gemini -------------
class GoogleGeminiEmbeddings:
    def __init__(self, model="text-embedding-004"):
        self.model = model
        self.client = genai.Client()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        embeddings = []
        total = len(texts)
        print(f"Embedding {total} documents...", flush=True)
        for i, text in enumerate(texts):
            if i % 10 == 0 or i == total - 1:
                print(f"> Embedding doc {i+1}/{total}", flush=True)
            response = self.client.models.embed_content(
                model=self.model,
                contents=text,
                config=EmbedContentConfig(
                    task_type="RETRIEVAL_DOCUMENT",
                    output_dimensionality=768,
                ),
            )
            emb = response.embeddings[0].values
            embeddings.append(emb)
        print("All embeddings complete.")
        return embeddings

    def embed_query(self, text: str) -> List[float]:
        print(f"Embedding query: {text[:60]}{'...' if len(text) > 60 else ''}")
        response = self.client.models.embed_content(
            model=self.model,
            contents=text,
            config=EmbedContentConfig(
                task_type="RETRIEVAL_QUERY",
                output_dimensionality=768,
            ),
        )
        print("Query embedding complete.")
        return response.embeddings[0].values

# ----------- Load the chunked data -----------
DATA_PATH = "./chunked_wikisource_data.json"
print(f"Loading chunked data from {DATA_PATH}...")
with open(DATA_PATH, "r", encoding="utf-8") as f:
    data_json = json.load(f)

docs = []
for idx, item in enumerate(data_json):
    if idx < 2:
        print(f"Sample doc {idx+1}: '{item['title'][:60]}...' tokens: {item['gpt2_token_count']}")
    meta = {
        "custom_id": item["custom_id"],
        "id": item["id"],
        "url": item["url"],
        "title": item["title"],
        "gpt2_token_count": item["gpt2_token_count"],  # kept so retrieval results can report token counts
        "gemini_token_count": item["gemini_token_count"],
    }
    docs.append(LangchainDocument(page_content=item["text"], metadata=meta))
print(f"Loaded {len(docs)} documents from {DATA_PATH}")

# ------------ Create Chroma vector db with Gemini Embeddings ----------
persist_dir = "./chroma_wikisource_vector_db_gemini"
os.makedirs(persist_dir, exist_ok=True)

print("Building vector database using Gemini Embeddings...")
embeddings = GoogleGeminiEmbeddings(model="text-embedding-004")
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory=persist_dir,
)
print(f"Vector DB created and saved at {persist_dir}")

# (Optional) Test a simple retrieval
query = "What is the Magna Carta?"
print(f"\nPerforming retrieval for test query: '{query}'")
retriever = vectorstore.as_retriever()
results = retriever.invoke(query)

print(f"\nExample search results for query: '{query}'\n")
for idx, r in enumerate(results):
    print(f"Result {idx+1}:")
    print("Title:", r.metadata.get('title'))
    print("Tokens:", r.metadata.get('gpt2_token_count'))
    print("Excerpt:", r.page_content[:200], "...\n")
Once this stage is complete, the knowledge base is now fully indexed for semantic retrieval. The Chroma vector store, seeded by Gemini embeddings, can efficiently match user queries to passages likely to contain the answer. All downstream RAG workflows use this vector database in the retrieval step, making the process robust, high-performance, and production-ready.
This setup not only accelerates future experimentation but also enables a transparent comparison with the stuffed-prompt (in-context) method, as both work off the same document and chunk selection. The successful retrieval example at the end ensures your entire pipeline is ready for RAG-style question answering and subsequent quantitative evaluation.
3. Generating the evaluation test set
To reliably evaluate retrieval-augmented generation or in-context learning approaches, it’s essential to assemble a test set that genuinely reflects the capabilities and boundaries of your current knowledge base. This code automates that process using RAGAS, LangChain, and OpenAI’s language models.
Here’s the code that generates the test set:
import os
import json
import pandas as pd

# --- CONFIG ---
JSON_PATH = "./chunked_wikisource_data.json"
OUT_CSV = "./generated_testset.csv"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
assert OPENAI_API_KEY, "Set your OPENAI_API_KEY variable!"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
TESTSET_SIZE = 30

# --- RAGAS + LANGCHAIN IMPORTS ---
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset.graph import KnowledgeGraph, Node, NodeType
from ragas.testset.transforms import default_transforms, apply_transforms
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.docstore.document import Document as LCDocument

# --- 1. LOAD DOCUMENTS ---
with open(JSON_PATH, "r", encoding="utf-8") as f:
    raw_chunks = json.load(f)

docs = [
    LCDocument(
        page_content=chunk["text"],
        metadata={
            "custom_id": chunk.get("custom_id"),
            "id": chunk.get("id"),
            "url": chunk.get("url"),
            "title": chunk.get("title"),
            "gpt2_token_count": chunk.get("gpt2_token_count"),
        },
    )
    for chunk in raw_chunks
]
print(f"Loaded {len(docs)} document chunks.")

# --- 2. CREATE KNOWLEDGE GRAPH & ADD DOCS ---
kg = KnowledgeGraph()
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={
                "page_content": doc.page_content,
                "document_metadata": doc.metadata
            }
        )
    )
print(f"KnowledgeGraph: {len(kg.nodes)} nodes, {len(kg.relationships)} relationships (before enrichment)")

# --- 3. ENRICH KNOWLEDGEGRAPH with TRANSFORMS ---
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
generator_emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

transforms = default_transforms(
    documents=docs,
    llm=generator_llm,
    embedding_model=generator_emb
)
apply_transforms(kg, transforms)
print(f"Enriched KnowledgeGraph: {len(kg.nodes)} nodes, {len(kg.relationships)} relationships (after enrichment)")

# Optionally, save & reload the knowledge graph
kg.save("knowledge_graph.json")
kg = KnowledgeGraph.load("knowledge_graph.json")

# --- 4. GENERATE TESTSET using the enriched KG! ---
synthesizers = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 1.0),
]

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_emb,
    knowledge_graph=kg,
)

testset = generator.generate(
    testset_size=TESTSET_SIZE,
    query_distribution=synthesizers
)

df = testset.to_pandas()
df.to_csv(OUT_CSV, index=False)
print(f"\nDONE. Saved testset to {OUT_CSV} ({len(df)} rows)")

# ---- 5. SHOW SAMPLES ----
print(df[["user_input", "reference"]].head(5).to_markdown(index=False))
The workflow starts by loading your previously chunked source documents, each one captured as a LangChain Document with its metadata. These documents are assembled into a knowledge graph structure, where each chunk is represented as a node. Next, the code uses the RAGAS default_transforms with a GPT-4o-mini-backed LangChain LLM and OpenAI embeddings to automatically enrich the knowledge graph. This step creates semantic links and contextual structure, helping the generator understand relationships across the data. Running RAGAS is computationally expensive, so I kept the generated test set small (controlled by TESTSET_SIZE in the script above). In a production setting, I recommend generating a larger test set.
Once the knowledge graph is created, the RAGAS test set generator kicks in. Using the SingleHopSpecificQuerySynthesizer, questions are programmatically synthesized. This means that for each test case, the LLM is prompted, based on actual knowledge graph content, to create a focused, factoid-style question whose answer is anchored in your source data. The target size of the test set and the type of question (specific queries in this case) are both configurable.
The resulting questions, including both the generated user queries and their references to the original knowledge source, are collected and saved as a CSV file for easy inspection and reproducibility. To ensure quality and relevance, a sample of the generated questions and references is printed for manual review, allowing you to verify that the content aligns with the domain and test requirements before proceeding to the full system evaluation.
At the end of this pipeline, you are left with a purpose-built, LLM-curated test set perfectly suited for downstream evaluation, whether for RAG, prompt-based, or hybrid approaches. As each question is guaranteed to have a ground truth answer present in your knowledge base, this dataset serves as a reliable benchmark for both automated and LLM-based evaluation methods. The entire setup maximizes experimental integrity and ensures that each system is tested fairly, against the same, reproducible dataset.
4. Comparing RAG (vanilla) and in-context RAG
Once the knowledge chunks, vector index, and test question set are prepared, this script runs a direct benchmark between two answer-generation approaches: vanilla RAG and in-context prompt stuffing. In the vanilla RAG setup, each question is used to retrieve the most relevant knowledge chunks from the Chroma vector store, and these passages are provided as explicit sources to Gemini, which is then prompted to answer strictly using this material. The in-context approach skips the vector store entirely: the whole chunked corpus is placed into a single prompt and Gemini itself is asked to pick out the document IDs most relevant to the question, after which it answers using only those chunks. In other words, the LLM acts as its own retriever over the stuffed context, with no embedding-based relevance filtering.
Both methods process the complete evaluation set, and for every test question, the generated answers and the precise context visible to the language model are saved. All this information is written to a CSV file, ensuring transparent and reproducible results.
After collecting results for both systems, automated evaluation is performed with the RAGAS library. RAGAS utilizes large language models to systematically score each answer according to standardized metrics, including context recall, context precision, faithfulness, answer relevance, and context entity recall.
These metrics comprehensively measure how well answers are grounded in the supplied context, whether they faithfully reflect source information, and how relevant and complete the responses are. By recording the specific context alongside each answer and applying RAGAS’s rigorous evaluation framework, you get a detailed, side-by-side comparison of both architectures, making it easy to quantify and analyze the impact of retrieval versus context stuffing on answer quality.
import os
import json
import pandas as pd
import ast
from tqdm import tqdm
import warnings
from typing import List, Dict, Any

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import EvaluationDataset
from ragas.metrics import (
    LLMContextRecall, Faithfulness, FactualCorrectness,
    LLMContextPrecisionWithoutReference, NoiseSensitivity,
    ResponseRelevancy, ContextEntityRecall
)
from ragas.run_config import RunConfig
import plotly.graph_objects as go
import time
from langchain.docstore.document import Document as LangchainDocument
from langchain_community.vectorstores import Chroma
from google import genai
from google.genai.types import GenerateContentConfig, EmbedContentConfig
from langchain_core.language_models.llms import LLM
import wandb

# SETUP
os.environ["GOOGLE_CLOUD_PROJECT"] = "your_project"
os.environ["GOOGLE_CLOUD_LOCATION"] = "us-central1"
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "True"


class GoogleGeminiLLM(LLM):
    model_name: str = "gemini-2.5-flash-preview-05-20"
    temperature: float = 0.0
    api_version: str = "v1"
    api_key: Any = None
    max_tokens: Any = None

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        object.__setattr__(self, "_client", genai.Client(api_key=self.api_key, http_options={"api_version": self.api_version}))

    def _call(
        self,
        prompt: str,
        stop: Any = None,
        run_manager: Any = None,
        **kwargs: Any,
    ) -> str:
        config = GenerateContentConfig(
            temperature=self.temperature,
            max_output_tokens=self.max_tokens if self.max_tokens else None
        )
        response = self._client.models.generate_content(
            model=self.model_name,
            contents=prompt,
            config=config
        )
        return response.text if hasattr(response, 'text') else str(response)

    @property
    def _llm_type(self) -> str:
        return "google-gemini"

    @property
    def _identifying_params(self) -> Dict[str, Any]:
        return {"model_name": self.model_name, "temperature": self.temperature}


# ------------- Custom Embedding wrapper for Gemini -------------
class GoogleGeminiEmbeddings:
    def __init__(self, model="text-embedding-004"):
        self.model = model
        self.client = genai.Client()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        embeddings = []
        total = len(texts)
        print(f"Embedding {total} documents...", flush=True)
        for i, text in enumerate(texts):
            if i % 10 == 0 or i == total - 1:
                print(f"> Embedding doc {i+1}/{total}", flush=True)
            response = self.client.models.embed_content(
                model=self.model,
                contents=text,
                config=EmbedContentConfig(
                    task_type="RETRIEVAL_DOCUMENT",
                    output_dimensionality=768,
                ),
            )
            emb = response.embeddings[0].values
            embeddings.append(emb)
        print("All embeddings complete.")
        return embeddings

    def embed_query(self, text: str) -> List[float]:
        print(f"Embedding query: {text[:60]}{'...' if len(text) > 60 else ''}")
        response = self.client.models.embed_content(
            model=self.model,
            contents=text,
            config=EmbedContentConfig(
                task_type="RETRIEVAL_QUERY",
                output_dimensionality=768,
            ),
        )
        print("Query embedding complete.")
        return response.embeddings[0].values


llm = GoogleGeminiLLM(
    model_name="gemini-2.5-flash-preview-05-20",
    temperature=0.0,
    # api_version="v1"
)

# DATA LOAD
testset_path = "generated_testset.csv"
data_path = "./chunked_wikisource_data.json"

df = pd.read_csv(testset_path)  # .head(3)
with open(data_path, "r", encoding="utf-8") as f:
    docs_all_json = json.load(f)

def safe_parse(val):
    try:
        return ast.literal_eval(val) if isinstance(val, str) else val
    except:
        return []

df["reference_contexts"] = df["reference_contexts"].apply(safe_parse)

docs_all = [
    LangchainDocument(
        page_content=docj["text"],
        metadata={
            "custom_id": docj.get("custom_id"),
            "id": docj.get("id"),
            "url": docj.get("url"),
            "title": docj.get("title"),
            "gpt2_token_count": docj.get("gpt2_token_count"),
        })
    for docj in docs_all_json
]
docs_by_id = {doc["custom_id"]: doc for doc in docs_all_json}

# (A) Vanilla RAG (vector DB)
persist_dir = "./chroma_wikisource_vector_db_gemini"
if not os.path.exists(persist_dir):
    raise ValueError(f"Chroma vector DB '{persist_dir}' not found!")

embeddings = GoogleGeminiEmbeddings(model="text-embedding-004")
vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings  # Already built with Gemini embeddings
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

def format_context(docs):
    context = ""
    for i, doc in enumerate(docs):
        context += f"Source [{i+1}] (Title: {doc.metadata.get('title', 'n/a')} | ID: {doc.metadata.get('custom_id')})\n"
        context += doc.page_content.strip() + "\n---\n"
    return context

def rag_vanilla(query: str) -> (str, List[str]):
    docs = retriever.invoke(query)
    context = format_context(docs)
    prompt = (
        f"Given the following documents and a user question, write a detailed, accurate answer using ONLY the information in the sources. "
        f"Do not use outside knowledge or hallucinate details. Cite your supporting source chunk(s) as [1], [2], etc. as appropriate.\n\n"
        f"=== Sources ===\n{context}"
        f"=== User Question ===\n{query}\n\n"
        f"=== Answer ==="
    )
    answer = llm.invoke(prompt)
    return answer, [doc.page_content for doc in docs]

# (B) In-Context All-Docs Prompt
def make_doc_summary(doc):
    snippet = " ".join(doc["text"].split())
    return f'{doc["custom_id"]} | {doc["title"]}: {snippet}'

# Full, unrestricted list of document IDs (the entire chunked corpus)
ALL_DOC_IDS = [doc["custom_id"] for doc in docs_all_json]

def incontext_retrieval_and_response(query, num_to_retrieve=20):
    doc_summaries = [make_doc_summary(docs_by_id[doc_id]) for doc_id in ALL_DOC_IDS]
    retrieval_prompt = (
        f"You are a helpful retrieval engine. Given a user question and a list of document summaries (each with a unique ID and title), "
        f"return a comma-separated list of exactly {num_to_retrieve} document IDs most relevant for answering the question. "
        f"Only use IDs from the list provided - no made-up IDs!\n\n"
        f"Question: {query}\n"
        f"DOCUMENT SUMMARIES:\n" +
        "\n".join(doc_summaries) +
        "\n\nYour response: (just the comma-separated list of IDs)\n"
    )
    ids_raw = llm.invoke(retrieval_prompt)
    ids = []
    for piece in ids_raw.replace("\n", ",").split(","):
        x = piece.strip()
        if x in docs_by_id:
            ids.append(x)
        if len(ids) == num_to_retrieve:
            break
    if not ids:
        print("DOC ID PARSING FAILED!!!!!!!")
        ids = ALL_DOC_IDS[:num_to_retrieve]
    context_chunks = [docs_by_id[i] for i in ids]
    context_text = ""
    for i, doc in enumerate(context_chunks):
        context_text += f"[Doc {i+1} | ID: {doc['custom_id']}] {doc['title']}:\n{doc['text']}\n---\n"
    answer_prompt = (
        f"You are a helpful question-answering assistant. Using ONLY the information in the provided documents, answer the user's question as specifically and factually as possible."
        f" If the answer is not contained, say so. Cite sources as [1], [2], etc. as appropriate.\n\n"
        f"=== DOCUMENTS ===\n{context_text}\n"
        f"=== QUESTION ===\n{query}\n\n"
        f"=== ANSWER ==="
    )
    answer = llm.invoke(answer_prompt)
    return answer, [doc["text"] for doc in context_chunks]

# Run both modes over testset
def run_and_save(mode_func, mode_name, **kwargs):
    sys_answers = []
    sys_contexts = []
    print(f"\nRunning {mode_name} over testset...")
    for i, row in tqdm(df.iterrows(), total=len(df)):
        query = row['user_input']
        try:
            ans, contexts = mode_func(query, **kwargs)
        except Exception as e:
            print(f"Error at idx {i}: {e}")
            ans, contexts = "", []
        sys_answers.append(ans)
        sys_contexts.append(contexts)
    col_prefix = "incontext" if mode_name.lower().startswith("incontext") else "vanilla"
    df[f"{col_prefix}_response"] = sys_answers
    df[f"{col_prefix}_contexts"] = sys_contexts
    out = f"results_{col_prefix}_gemini.csv"
    df.to_csv(out, index=False)
    print(f"Saved {mode_name} system answers to {out}")
    return out, f"{col_prefix}_response", f"{col_prefix}_contexts"

# Run Vanilla RAG
out_vanilla_filename, resp_col_vanilla, ctx_col_vanilla = run_and_save(rag_vanilla, "Vanilla")
# Run In-Context RAG, FULL DOCS
out_incontext_filename, resp_col_incontext, ctx_col_incontext = run_and_save(incontext_retrieval_and_response, "InContext")

def eval_with_ragas(
    response_col,
    contexts_col,
    mode_label,
    project="rag-eval",
    run=None,
):
    if df is None:
        raise ValueError("Input dataframe 'df' must be provided.")

    eval_records = []
    missing_keys = set()

    # Safely build the evaluation records
    for i, row in df.iterrows():
        record = {}
        try:
            record["user_input"] = row["user_input"]
            record["reference"] = row["reference"]
            record["response"] = row[response_col]
            record["retrieved_contexts"] = row[contexts_col]
        except KeyError as e:
            print(f"Row {i}: missing expected column ({e}). Skipping row.")
            missing_keys.add(str(e))
            continue
        eval_records.append(record)

    if not eval_records:
        raise ValueError("No valid records to evaluate. Check your dataframe columns.")

    try:
        eval_dataset = EvaluationDataset.from_list(eval_records)
    except Exception as err:
        print(f"Failed to create EvaluationDataset: {err}")
        return

    try:
        eval_llm = LangchainLLMWrapper(GoogleGeminiLLM(
            model_name="gemini-2.5-flash-preview-05-20",
            temperature=0.0,
            api_version="v1"
        ))
    except Exception as err:
        print(f"Failed to initialize LLM: {err}")
        return

    run_config = RunConfig(max_workers=1, timeout=180)
    print(f"\nEvaluating {mode_label} with RAGAS/Gemini judge (can take a while)...")
    metrics = [
        LLMContextRecall(), Faithfulness(),
        LLMContextPrecisionWithoutReference(),
        ResponseRelevancy(), ContextEntityRecall()
    ]
    try:
        results = evaluate(
            dataset=eval_dataset,
            metrics=metrics,
            llm=eval_llm,
            run_config=run_config,
        )
        results_df = results.to_pandas()
    except Exception as err:
        print(f"Evaluation failed: {err}")
        return

    eval_csv_out = f"evaluation_results_{mode_label.lower().replace('-','_')}_gemini.csv"
    try:
        results_df.to_csv(eval_csv_out, index=False)
        print(f"Saved RAGAS evaluation to {eval_csv_out}")
    except Exception as err:
        print(f"Failed to save results CSV: {err}")

    # Compute radar metrics (with KeyError protection)
    metric_map = {
        "Context Recall": "context_recall",
        "Faithfulness": "faithfulness",
        "Context Precision": "llm_context_precision_without_reference",
        "Answer Relevancy": "answer_relevancy",
        "Context Entity Recall": "context_entity_recall",
    }
    radar_metrics = {}
    for display, col in metric_map.items():
        try:
            radar_metrics[display] = results_df[col].mean()
        except KeyError:
            print(f"Warning: Metric '{col}' missing in results. Setting to 0.")
            radar_metrics[display] = 0.0
    for k, v in radar_metrics.items():
        print(f"{mode_label} | {k}: {v:.3f}")

    # Radar plot with error handling
    fig = go.Figure()
    metrics_list = list(radar_metrics.keys())
    vals = list(radar_metrics.values())
    if metrics_list and vals:
        # close the radar chart loop
        metrics_list.append(metrics_list[0])
        vals.append(vals[0])
        fig.add_trace(go.Scatterpolar(r=vals, theta=metrics_list, fill='toself', name=mode_label))
        fig.update_layout(
            polar=dict(radialaxis=dict(visible=True, range=[0, 1], tickvals=[0, 0.5, 1])),
            title=dict(text=f"RAGAS Results ({mode_label})", x=0.5),
            showlegend=True
        )
        try:
            path = f"./radar_results_{mode_label.lower()}_{int(time.time())}.html"
            fig.write_html(path, auto_play=False)
            print(f"Radar plot for {mode_label} saved: {path}")
        except Exception as err:
            print(f"Failed to save radar plot: {err}")
    else:
        print("Radar metrics are empty: unable to create plot.")

    # --- WANDB Logging ---
    try:
        if run is None:
            wandb_run = wandb.init(project=project, name=mode_label)
        else:
            wandb_run = run
        wandb_run.log({f"{mode_label}/" + k: v for k, v in radar_metrics.items()})
        if 'fig' in locals():
            wandb_run.log({f"{mode_label}_radar_chart": wandb.Plotly(fig)})
        if run is None:
            wandb_run.finish()
    except Exception as err:
        print(f"WANDB logging failed: {err}")

    if missing_keys:
        print(f"Completed with missing columns: {missing_keys}")

# -------- Example Usage (run just once at start) --------
wandb_run = wandb.init(project="rag-eval", name="rag_eval_run")

# InContext RAG
eval_with_ragas(resp_col_incontext, ctx_col_incontext, "InContext RAG", run=wandb_run)
# Vanilla RAG
eval_with_ragas(resp_col_vanilla, ctx_col_vanilla, "Vanilla RAG", run=wandb_run)
wandb_run.finish()

print("\nEvaluation completed for BOTH Vanilla and In-Context modes and logged to wandb.")
After the evaluation is complete, you will have clear, quantifiable results comparing vanilla RAG and in-context prompt stuffing. The performance of each approach is summarized using radar plots, providing an intuitive visual overview of metrics such as context recall, context precision, faithfulness, answer relevancy, and context entity recall for both methods. These radar charts make it easy to spot strengths and weaknesses at a glance, and are automatically saved as HTML files for convenient sharing or inclusion in reports.
Additionally, all raw metric scores and aggregates are logged to Weights & Biases, ensuring your experimental results are systematically tracked and easy to revisit or compare over time. This seamless integration with W&B supports further ablation studies, detailed error analyses, and the creation of informative presentations that illustrate the nuanced tradeoffs between retrieval-based and prompt-stuffed systems across large-scale benchmarks. Every step is fully reproducible, transparent, and ready for deeper exploration of what really drives answer quality.
Here are the results of my evaluation.
I evaluated both in-context RAG and vanilla RAG using a range of RAGAS metrics designed to capture how well each approach finds and uses information:
- Context recall reflects how successfully each system gathers all the relevant information needed to answer a user’s question. High recall means the model isn’t missing any important facts. Both in-context RAG and vanilla RAG excelled here, each achieving a score of 0.9, indicating that nearly all the necessary information was included in the retrieved context.
- Context precision examines the quality of the retrieved context, specifically how much of it is actually helpful and directly related to the question, rather than irrelevant or distracting. Here, vanilla RAG stood out with a score of 0.827, compared to in-context RAG's 0.664. In practice, this means that vanilla RAG was much better at filtering out unnecessary or off-topic details, surfacing only the most directly relevant information. One important caveat is that I asked both systems to retrieve 20 of the top documents ranked by relevance, so reducing the K value here could potentially result in higher scores for both models, which might narrow the gap.
- Faithfulness measures whether the answers generated by the system adhere strictly to what is actually present in the retrieved context, without inventing new details or "hallucinating." Both systems delivered solid results, but in-context RAG had the edge at 0.984 compared to vanilla’s 0.935. This suggests that in-context RAG’s answers were slightly more grounded in verifiable evidence from the retrieved content, likely because it uses the same model for both retrieving and generating, which helps it stay consistent.
- Response relevancy assesses how directly and helpfully the answer addresses the original question, regardless of how the information was retrieved. Both systems performed nearly identically, 0.876 for In-Context RAG and 0.877 for vanilla RAG, so users were about equally likely to see on-topic, useful answers from either system.
- Context entity recall homes in on whether key entities, like names, places, or dates, from the source material show up in the retrieved content. In-context RAG narrowly led here (0.558 vs. 0.530), meaning it was slightly more likely to pull in all the important names and factual details needed to construct a precise answer.
In summary, both systems were highly effective at finding the necessary information and generating relevant, supported answers. Vanilla RAG was stronger at being selective and minimizing noise, while in-context RAG produced answers that hewed even more closely to the retrieved facts and did a slightly better job of including key entities.
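For reference, here are the RAGAS scores from this run in one place:
Metric                  | Vanilla RAG | In-context RAG
Context recall          | 0.90        | 0.90
Context precision       | 0.827       | 0.664
Faithfulness            | 0.935       | 0.984
Response relevancy      | 0.877       | 0.876
Context entity recall   | 0.530       | 0.558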
Conclusion
The evolving landscape of large language models is reshaping the foundational assumptions of knowledge-based question answering. As context windows expand to unprecedented lengths, the very necessity of classic retrieval pipelines comes into question. Yet, this investigation reveals that while in-context “prompt stuffing” can match, or even marginally surpass, retrieval-based approaches on some metrics, it does so by trading away precision and efficiency. Simply scaling context is not a panacea; it exposes new operational and cost complexities and introduces noise that even powerful models cannot always filter out.
Ultimately, the choice between vanilla RAG and in-context learning is no longer simply technical; it’s deeply architectural and economic. The “best” architecture is not one-size-fits-all, but a question of what you value: the maximal precision and cost-efficiency of retrieval, or the directness and simplicity of giant-context prompts. As LLM capabilities and pricing continue to shift, the lasting lesson is one of adaptability. Builders must rigorously examine their use cases, data scale, and update requirements, recognizing that optimal performance is achieved not by relying on any single method, but by adopting an engineering mindset that weighs tradeoffs and iterates as models and requirements evolve.
The era of retrieval isn’t over, but neither is it unchallenged. Rather, future systems may well combine adaptive, hybrid architectures that blend retrieval, filtering, and raw context at just the right scale for their problem. As we press further into an age of ever-larger models and ever-growing corpora, clarity about our needs and transparency about our costs will be the decisive edge.