
How to evaluate the "true" context length of your LLM using RULER

In this article we will explore why long context matters, survey the main benchmarks that aim to test it, and then use RULER as our framework for experiments.
RULER (Realistic and Universal Language Model Evaluation with Long-Contexts) is a synthetic benchmark designed to measure the true long-context capabilities of large language models (LLMs). It was created by researchers at NVIDIA to address the limitations of simpler evaluations, such as the "Needle-in-a-Haystack" (NIAH) test, which primarily assesses basic retrieval and does not accurately gauge more complex reasoning over long text.
The ability to handle long context has become one of the most visible frontiers in language model development. Context windows have grown from a few thousand tokens to claims of hundreds of thousands or even a million. On paper, this appears to be a breakthrough: whole books can be processed in a single prompt, legal documents can be read in full, and codebases can be navigated without losing track.
In practice, though? Raw capacity does not always match usable performance. Many models falter well before their advertised maximum, and the quality of answers often drops as the input grows. This makes evaluation especially important. It is not enough to know how many tokens a model can technically accept. We need to see how well it actually works when context grows, and whether it can still retrieve, trace, and reason effectively.
In this article, we will explore why long context matters, survey the main benchmarks that aim to test it, and then use RULER as our framework for experiments. We will then walk through a tutorial for setting up evaluations, examine the results, and discuss what they reveal about the real limitations of today’s long context models. Finally, we'll reflect on where evaluation still falls short and what should come next.

Why long context matters (and why capacity ≠ performance)

Bigger context windows open the door to tasks that were once impossible. A model that can process an entire legal brief or financial filing in one shot reduces the need to build complex retrieval systems that often work sub-optimally. The same applies to analyzing long research papers, following extended multi-turn conversations where memory matters, or navigating a codebase with many interconnected files. In theory, the more context a model can see at once, the more seamless and accurate these workflows should become.
But a model might accept 128k tokens or more, yet show sharp drops in accuracy well before hitting that limit.
Unfortunately, as the input grows, retrieval becomes noisy, reasoning chains collapse, and outputs drift toward copying text verbatim or falling back on stored parametric knowledge instead of the actual input. When distractors are added, even strong models often lose track of what matters. Studies comparing long context performance show that retrieval alone is not enough to measure real capability, since a model that aces a simple needle-in-a-haystack test can still fail at aggregation, tracing, or question answering once the context grows and complexity rises.
This mismatch between claimed capacity and actual performance is the reason careful evaluation is necessary. Long context should be more than a number on a spec sheet: what matters is whether the model can keep using the relevant information as the input expands, and whether it can still reason effectively when noise and complexity increase.

The power of memory

Humans can recall information actively. We know when something we have read before is relevant, and we can bring it back into focus at the right time. Memory often feels less like retrieval and more like a skill: we know when to bring important details to the surface. Language models do not have that kind of memory. Knowing when to look something up, or when to surface a fact they saw earlier, does not come naturally to them. That is not to say these skills cannot be taught; it is just that teaching them is extremely challenging and, in my opinion, has not yet been achieved.
Retrieval augmented generation can help by letting a model fetch text from an external source when a query is clearly defined. If you ask a sharp, well-phrased question, RAG works fairly well. The challenging part with RAG is deciding when to search, how to phrase that search, and how to combine the results with the current conversation. That is where performance often breaks down. In practice, models using RAG in this way either search too little, search too much, or focus on the wrong signals.
This is why larger context windows are appealing. When the relevant material is already in the prompt, the model does not have to decide whether to go looking for it. Because everything it needs sits inside the context window, the model can draw on what it learned during training (via backpropagation) about incorporating in-context information to produce a correct answer. Bigger windows do not solve every problem, but they cut down the need for complex retrieval strategies and move closer to the kind of fluid recall that comes naturally to human memory.

Existing long context benchmarks

There are several evaluations for testing how models handle long inputs, but most take a fairly narrow view. Needle-in-a-haystack hides a fact inside a large distractor passage. Passkey retrieval buries a small token somewhere in the prompt. These tasks are simple and make it easy to chart accuracy as length grows, but they mainly show whether a model can recall a specific item rather than how well it can use extended context for reasoning.
Other benchmark suites aim for more coverage:
  • LongBench collects tasks across domains and languages to test summarization, classification, and question answering over longer documents.
  • ZeroSCROLLS reformulates existing NLP benchmarks into long-form versions.
  • L-Eval brings together twenty datasets and suggests using model judges to better align with human ratings.
  • InfiniteBench combines synthetic and realistic tasks, stretching inputs to over 100k tokens.
Even with these advances, most of the landscape still leans either toward retrieval-style challenges or realistic but noisy datasets.
What they rarely capture is whether a model can trace variables across multiple steps, aggregate scattered details, or answer questions reliably when multiple distractors are present. Those are the stress points that reveal the gap between theoretical context size and what models can actually do at scale.
It is unreasonable to expect a model to reason over long documents if it cannot even handle the simplest retrieval probes. A system that stumbles on finding a single key hidden in a clean context will almost certainly collapse when asked to trace references, combine evidence, or solve questions that require multiple steps of reasoning. Retrieval tests alone are not sufficient, but they are a baseline. They show us where a model’s effective context window really begins to shrink, and they highlight how fragile performance can be once noise or distractors are introduced.

What is RULER?

Released by NVIDIA researchers and published at COLM 2024, RULER expands long context evaluation into a broader set of categories. It keeps retrieval at the core but introduces variations such as multiple needles, multiple queries, and multiple values, each designed to stress how well a model can separate true signals from distractions. On top of that, it adds three other types of tasks:
  • multi-hop tracing, where the model has to follow chains of variables across long spans;
  • aggregation, where it needs to count or extract frequent items rather than repeat a single fact;
  • and question answering, where golden passages are hidden among large volumes of irrelevant text.
What makes RULER stand out is its synthetic design. Because the inputs are generated rather than drawn from noisy real data, the benchmark can flexibly scale the sequence length, the number of distractors, and the complexity of the task without relying on a model’s background knowledge. That means performance is tied directly to how well the model uses the context, not to how much it memorized during training.
RULER spans 13 representative tasks across four categories: retrieval, multi-hop tracing, aggregation, and question answering.
Retrieval extends the classic needle-in-a-haystack test by hiding key–value pairs inside long distractor passages. Variants include multiple needles, multiple values for the same key, and simultaneous multiple queries. These tasks assess whether a model can still locate and retrieve the correct items as the context length and noise increase.
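As a concrete illustration, here is a toy sketch of a key-value retrieval probe in this spirit. It is not RULER's actual generator; the filler text, needle phrasing, and the make_kv_needle_example helper are placeholders chosen for illustration.
import random
import uuid

def make_kv_needle_example(num_distractors: int = 200) -> dict:
    """Toy needle-in-a-haystack probe: hide one key-value pair among filler sentences."""
    filler = "The grass is green. The sky is blue. The sun is yellow."
    needle_key = str(uuid.uuid4())
    needle_value = str(random.randint(100000, 999999))
    sentences = [filler] * num_distractors
    # Bury the needle at a random depth in the haystack
    needle = f"The special magic number for {needle_key} is: {needle_value}."
    sentences.insert(random.randint(0, num_distractors), needle)
    question = f"What is the special magic number for {needle_key}?"
    return {"input": " ".join(sentences) + "\n" + question, "outputs": [needle_value]}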
Multi-hop tracing, also known as variable tracking, checks whether a model can follow chains of references across the input. A variable is assigned a value, then repeatedly re-bound through intermediate names (X1 = 123, X2 = X1, X3 = X2). The model must recover all variables tied to the original value. This stresses the ability to trace entities through long sequences rather than just spotting them.
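A rough sketch of what such a probe looks like (again illustrative, not RULER's exact generator; the variable names and filler sentence are arbitrary):
import random

def make_variable_tracking_example(chain_length: int = 5, filler_lines: int = 200) -> dict:
    """Toy variable-tracking probe: X1 = value, X2 = X1, ..., then ask which names hold the value."""
    value = random.randint(10000, 99999)
    names = [f"X{i}" for i in range(1, chain_length + 1)]
    assignments = [f"{names[0]} = {value}"] + [
        f"{names[i]} = {names[i - 1]}" for i in range(1, chain_length)
    ]
    lines = ["The quick brown fox jumps over the lazy dog."] * filler_lines
    # Scatter the assignment chain through the filler while preserving its order
    positions = sorted(random.sample(range(len(lines)), len(assignments)))
    for offset, (pos, assignment) in enumerate(zip(positions, assignments)):
        lines.insert(pos + offset, assignment)
    question = f"Which variables hold the value {value}?"
    return {"input": "\n".join(lines) + "\n" + question, "outputs": names}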
Aggregation uses synthetic word lists to test summarization-style skills. In common words extraction, the goal is to identify a fixed set of words that appear throughout the sequence. In frequent words extraction, the model must return the most frequent items sampled from a skewed distribution. Both force the model to combine evidence spread across the context rather than picking out a single passage.
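The same idea in sketch form for frequent words extraction. RULER's own generator samples real English words (hence the wonderwords dependency installed in the tutorial below) from a skewed distribution; this toy version uses placeholder tokens and Zipf-like weights just to show the shape of the task.
import random
from collections import Counter

def make_frequent_words_example(vocab_size: int = 50, num_words: int = 2000, top_k: int = 3) -> dict:
    """Toy frequent-words probe: sample tokens from a skewed distribution, ask for the top-k."""
    vocab = [f"word{i}" for i in range(vocab_size)]
    # Zipf-like weights so a handful of words dominate the sequence
    weights = [1.0 / (rank + 1) ** 2 for rank in range(vocab_size)]
    sampled = random.choices(vocab, weights=weights, k=num_words)
    gold = [w for w, _ in Counter(sampled).most_common(top_k)]
    question = f"What are the {top_k} most frequently appearing words in the text above?"
    return {"input": " ".join(sampled) + "\n" + question, "outputs": gold}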
Question answering embeds short-context QA datasets inside much longer distractor text. The golden paragraph with the answer is mixed in with irrelevant passages, and the model has to find it and respond correctly. This simulates real scenarios where useful information is hidden in large volumes of noise.
Together, these four task categories broaden evaluation beyond retrieval alone, exposing whether a model can sustain reasoning, tracing, and aggregation when sequence length grows.

Tutorial: Evaluating GPT-5 and GPT-OSS using the RULER Eval

We will start by generating small synthetic datasets for four task families: common words extraction, frequent words extraction, variable tracking, and long context QA. The RULER repo provides a simple downloader for the QA material and a prep script that builds JSONL files for each task using your tokenizer settings. I ran the downloader, then called the prep script once per task to write a compact validation set under a datasets folder, using cl100k_base and an OpenAI tokenizer type so token counts align with the models we will test.
I first target a 4k sequence length to establish a baseline of each model’s raw task performance in a short context. This gives a clean read on capability without long-range stress. Once that baseline is in place, I generate a second set at 128k and rerun the exact same evaluation. Comparing the two runs isolates the effect of context length, showing how quickly accuracy degrades and which task types fail first.
pip install wonderwords
cd RULER/
bash data/synthetic/json/download_qa_dataset.sh
cd scripts/data
for task in cwe fwe qa_2 vt; do
    python prepare.py \
        --save_dir ./datasets \
        --benchmark synthetic \
        --task $task \
        --tokenizer_path cl100k_base \
        --tokenizer_type openai \
        --max_seq_length 4096 \
        --model_template_type base \
        --num_samples 25
done
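A command along these lines regenerates the data at 128k (131072 tokens here), writing to a separate folder such as datasets_lg so it matches the LOCAL_DATASET_DIR used in the evaluation script; adjust the paths and sample count to your own setup.
for task in cwe fwe qa_2 vt; do
    python prepare.py \
        --save_dir ./datasets_lg \
        --benchmark synthetic \
        --task $task \
        --tokenizer_path cl100k_base \
        --tokenizer_type openai \
        --max_seq_length 131072 \
        --model_template_type base \
        --num_samples 25
done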
Once the datasets are generated, the script runs the evaluation loop. It loads a balanced mix of samples from each RULER task, formats them into prompts, and sends them to the models under test. In this tutorial, we will run two open-source checkpoints served through W&B Inference (openai/gpt-oss-20b and openai/gpt-oss-120b) alongside two smaller OpenAI-hosted models (gpt-5-mini and gpt-5-nano). For each model, predictions are logged to Weave together with the ground truth answers, and a judge model (gpt-5) automatically marks outputs as correct or incorrect, with the scoring rules adjusted depending on the task.
Here’s the code for the evaluation, which I will run for both of the datasets generated previously:
import openai
import weave
from weave import EvaluationLogger
import json
import os
from pathlib import Path
from collections import Counter, defaultdict
from typing import List, Dict, Any
import time

# ---------------- config ----------------
PROJECT = "ruler_eval"
weave.init(PROJECT)

# W&B Inference client for OSS models and judge
wandb_client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key="your_wandb_api_key",
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"
    }
)

# OpenAI client for gpt-5-mini, gpt-5-nano, and the judge via the Responses API
oai_client = openai.OpenAI()

MODELS = [
    "openai/gpt-oss-20b",
    "openai/gpt-oss-120b",
    "openai/gpt-5-mini",
    "openai/gpt-5-nano",
]

JUDGE_MODEL = "gpt-5"

# Local dataset configuration - maps to your generated RULER tasks
LOCAL_DATASET_DIR = "/Users/brettyoung/Desktop/dev25/tutorials/long_cntxt/RULER/scripts/data/datasets_lg"
LOCAL_TASKS = {
    "aggregation": "cwe",  # Common Words Extraction
    "retrieval": "fwe",    # Frequent Words Extraction
    "qa": "qa_2",          # HotpotQA
    "tracing": "vt",       # Variable Tracking
}

# -------------- dataset -----------------
def load_local_jsonl(filepath: str) -> List[Dict[str, Any]]:
    """Load a JSONL file and return list of examples"""
    examples = []
    if not os.path.exists(filepath):
        print(f"Warning: File not found: {filepath}")
        return examples
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                try:
                    examples.append(json.loads(line))
                except json.JSONDecodeError as e:
                    print(f"Error parsing line in {filepath}: {e}")
                    continue
    return examples

def load_balanced_local(n: int = 100, seed: int = 123):
    """Load balanced samples from local RULER datasets"""
    cats = list(LOCAL_TASKS.items())
    k = len(cats)
    base, rem = divmod(n, k)

    all_examples = []
    category_counts = {}
    for i, (category, task_name) in enumerate(cats):
        take = base + (1 if i < rem else 0)
        if take == 0:
            continue
        filepath = Path(LOCAL_DATASET_DIR) / task_name / "validation.jsonl"
        print(f"Loading {filepath} (take {take})...")
        examples = load_local_jsonl(str(filepath))
        print(f" {task_name} size: {len(examples)}")
        if len(examples) < take:
            print(f"Warning: Not enough examples in {task_name} (need {take}, have {len(examples)})")
            take = len(examples)
        # Add category label and select examples
        selected = examples[:take]
        for ex in selected:
            ex["category"] = category
        all_examples.extend(selected)
        category_counts[category] = take

    if not all_examples:
        raise RuntimeError("No examples loaded. Check dataset paths.")

    print("Category counts:", category_counts)
    return all_examples

def build_prompt(ex: Dict[str, Any]) -> str:
    """Extract prompt from RULER format"""
    # RULER format has 'input' field containing the full prompt
    if "input" in ex:
        prompt = ex["input"]
        # Add answer prefix if it exists
        if "answer_prefix" in ex:
            prompt += ex["answer_prefix"]
        return prompt
    # Fallback for other formats
    msgs = ex.get("messages")
    if isinstance(msgs, list) and msgs and isinstance(msgs[0], dict):
        return "\n".join(m.get("content", "") for m in msgs if m.get("content"))
    return str(ex)

def extract_targets(ex: Dict[str, Any]) -> List[str]:
    """Extract target answers from RULER format"""
    # RULER format uses 'outputs' field
    outputs = ex.get("outputs", [])
    if isinstance(outputs, list):
        return [str(x) for x in outputs if x is not None]
    elif outputs is not None:
        return [str(outputs)]
    # Fallback to other common fields
    for field in ["expected_answer", "answer", "target"]:
        v = ex.get(field)
        if v is not None:
            if isinstance(v, list):
                return [str(x) for x in v if x is not None]
            return [str(v)]
    return []

# -------------- inference ----------------
def chat_once(model: str, prompt: str) -> str | None:
    try:
        if model == "openai/gpt-5-mini":
            resp = oai_client.responses.create(
                model="gpt-5-mini",
                reasoning={"effort": "low"},
                input=[{"role": "user", "content": prompt}],
            )
            return resp.output_text

        if model == "openai/gpt-5-nano":
            resp = oai_client.responses.create(
                model="gpt-5-nano",
                reasoning={"effort": "low"},
                input=[{"role": "user", "content": prompt}],
            )
            return resp.output_text

        if model in ["openai/gpt-oss-20b", "openai/gpt-oss-120b"]:
            resp = wandb_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,
                max_tokens=512,
            )
            return resp.choices[0].message.content
    except Exception as e:
        print(f"Error in chat_once for {model}: {e}")
        return None

def judge(pred: str | None, golds: List[str], ctx: str, problem_excerpt: str) -> float:
    if pred is None or not golds:
        return 0.0
    gold_text = str(golds)
    jp = (
        "You are an evaluator.\n"
        "Use the problem excerpt to infer whether order matters for this task. "
        "If the instruction implies a specific ordering or exact string match, be strict. "
        "If it asks for entities, spans, counts, or unordered sets, allow order-free matching. "
        "Minor formatting and whitespace should be ignored.\n\n"
        f"Problem excerpt (first 300 chars): {problem_excerpt}\n\n"
        f"Gold answers (one may be correct): {gold_text}\n"
        f"Model prediction: {pred}\n\n"
        "Return a single word: CORRECT if the prediction semantically matches any gold answer given the instruction, otherwise INCORRECT. Ignore minor formatting differences, such as added *'s or a more verbose answer."
    )
    try:
        # Use GPT-5 via the OpenAI Responses API for judging
        resp = oai_client.responses.create(
            model=JUDGE_MODEL,
            reasoning={"effort": "low"},
            input=[{"role": "user", "content": jp}],
        )
        out = resp.output_text if resp and resp.output_text else ""
        return 1.0 if out and out.strip().upper().startswith("CORRECT") else 0.0
    except Exception as e:
        print(f"Error in judge: {e}")
        return 0.0

# -------------- eval ---------------
def run_eval_for_model(model_name: str, examples):
    safe_model_name = model_name.replace("/", "_").replace("-", "_")
    print("Using safe model name for logger:", safe_model_name)

    ev = EvaluationLogger(model=safe_model_name, dataset="RULER_Local_4K")

    per_cat_j = defaultdict(list)

    for i, ex in enumerate(examples):
        time.sleep(5)  # Rate limit safety
        print(f"\n--- Example {i} ({ex['category']}) ---")
        print("Raw example keys:", list(ex.keys()))

        prompt = build_prompt(ex)
        print("Prompt preview:", prompt[:200], "..." if len(prompt) > 200 else "")

        golds = extract_targets(ex)
        print("Gold answers:", golds)

        pred = chat_once(model_name, prompt)
        print("Model prediction:", pred)

        pred_logger = ev.log_prediction(
            inputs={"category": ex["category"], "prompt": prompt, "gold": golds},
            output={"prediction": pred}
        )

        problem_excerpt = prompt[:300]
        jscore = judge(pred, golds, prompt, problem_excerpt)
        print("Judge score:", jscore)

        pred_logger.log_score("judge", jscore)
        pred_logger.finish()

        per_cat_j[ex["category"]].append(jscore)

        if i % 5 == 0:
            print(f"Progress checkpoint -> {i}: Judge {jscore}")

    overall_j = sum(sum(v) for v in per_cat_j.values()) / max(1, len(examples))
    per_cat_j_avg = {k: sum(v)/len(v) for k, v in per_cat_j.items()}

    print(f"\nSummary for {model_name}")
    print("Overall Judge:", overall_j)
    print("Per-category Judge:", per_cat_j_avg)

    ev.log_summary({
        "overall_judge_correct": overall_j,
        "per_category_judge_correct": per_cat_j_avg,
        "model": model_name,
    })

def main():
    print("Loading local RULER datasets...")
    examples = load_balanced_local(n=100, seed=123)
    print("Loaded dataset with", len(examples), "examples")

    for m in MODELS:
        print("\n==============================")
        print("Evaluating model:", m)
        print("==============================")
        run_eval_for_model(m, examples)

    print("Done")


if __name__ == "__main__":
    main()

Open-source checkpoints are served through W&B Inference with a dedicated OpenAI client pointed at the W&B endpoint.
The code initializes Weave with a project name, then creates a wandb_client using base_url set to the W&B Inference API, an API key, and a project header. That client is used only for the OSS models openai/gpt-oss-20b and openai/gpt-oss-120b via chat.completions.create with temperature 0. OpenAI-hosted models gpt-5-mini and gpt-5-nano are run through a separate oai_client using the official OpenAI Responses API. The same client also runs the judge model gpt-5.
The dataset loader pulls in a balanced set of examples across retrieval, tracing, aggregation, and QA, so no single category dominates the evaluation. Each example is converted into a plain prompt string and sent to the model. Predictions are captured and logged, along with the ground truth targets, so everything is tracked in one place.
The judge model then steps in to automatically decide if an output is correct. It compares the prediction against the gold answers, being strict when exact matches are required and more flexible when the task allows order-free sets or minor formatting differences. This removes the need for manual checking and makes it possible to scale experiments across hundreds of samples.
Weave is the glue that makes this evaluation framework practical and usable. Every prediction, gold answer, and judge score gets logged through its EvaluationLogger, which acts as a central tracker. The logger is initialized once per model, then each prediction is streamed in as the loop runs. For every example, you call log_prediction with the inputs and outputs, attach scores from the judge with log_score, and then finalize with finish. Once all predictions are processed, log_summary aggregates the results and uploads them to the Weave dashboard.
Inside the Weave UI, the evaluation appears as a structured run. You can drill into individual examples to see the model’s raw output, compare it against the gold targets, and check how the judge scored it. At the same time, you get automatic aggregation, including per-task accuracy, overall averages, and visualizations that show how different models stack up. You can also compare multiple evaluation runs side by side to see, for example, how performance shifts between a 4k baseline and a 128k context test.

Results

The evaluations at 4k and 128k sequence lengths show a consistent pattern: performance degrades as input length grows. At 4k, the OpenAI closed-source models (gpt-5-mini and gpt-5-nano) score clearly higher than the open-source baselines, but at 128k, both exhibit sharp drops.
GPT-5 Mini at 4k and 128k sequence length (pink is 4k, blue is 128k)
GPT-5 Nano at 4k and 128k sequence length (pink is 128k, blue is 4k)
Gpt-5-mini records 0.87 overall judge accuracy at 4k and falls to 0.59 at 128k. Aggregation is the category hit hardest, dropping from 0.96 at 4k to 0.0 at 128k. Retrieval falls from 0.92 to 0.84, QA from 0.88 to 0.84, and tracing from 0.72 to 0.68.
Gpt-5-nano shows the same trajectory, slipping from 0.71 overall at 4k to 0.48 at 128k. Aggregation again collapses, moving from 0.88 at 4k to 0.0 at 128k. Retrieval is flat at 0.48 in both settings, QA dips slightly from 0.84 to 0.76, and tracing edges down from 0.68 to 0.64.
The open-source checkpoints begin weaker and remain weak. Gpt-oss-120b sits at 0.58 at 4k but drops to 0.49 at 128k. Gpt-oss-20b is lowest of all, with 0.39 overall at 4k and 0.36 at 128k.
Blue is 4k, pink is 128k

The relative differences are clear. OpenAI closed-source models deliver significantly higher accuracy in the short context regime, but the gap narrows as context expands because the closed models lose more absolute ground. The open-source models do not collapse as dramatically, but that is largely because their baselines are already low. Overall, the results underscore how current models, regardless of headline context size, struggle to maintain accuracy once inputs approach the upper end of their advertised ranges.

Comparing models with RULER


At the full 128k context length, all models exhibit reduced performance compared to their 4k baselines; however, the relative ordering remains unchanged. Gpt-5-mini remains the strongest, reaching 0.59 overall. Among the open-source checkpoints, gpt-oss-120b lands at 0.49 overall. Gpt-oss-20b trails furthest behind at 0.36 overall, with QA at 0.40, tracing at 0.64, and retrieval collapsing entirely to zero.
These results confirm that larger context windows erode accuracy across the board. OpenAI models maintain higher QA and tracing scores, but aggregation proves fragile for every system, and the gap between commercial and open-source narrows at long lengths only because all models degrade under the same pressure.

Conclusion

It's now viscerally clear: bigger windows help reduce reliance on retrieval systems, but they do not yet guarantee reliable reasoning across extended context. Performance losses at scale reveal how brittle long-context capabilities remain, particularly in aggregation and multi-step reasoning. Evaluation frameworks like RULER expose these weaknesses by stressing models in controlled settings, showing where they actually begin to fail.
The path forward is not just about increasing raw token limits but about improving how models use the information placed in those windows. True long-context ability will mean sustaining reasoning quality, resisting distractors, and handling aggregation as effectively at 128k as at 4k. Until then, long context remains more of a promise than a fully realized capability.


