
DeepSeek-V3.1 Benchmark Scores

DeepSeek has released V3.1, an update that shows steady progress across coding, retrieval, and reasoning evaluations. The model delivers higher accuracy than both V3-0324 and R1-0528 on most benchmarks, while also producing answers more efficiently. Instead of dramatic leaps in one area, the gains appear as consistent improvements across multiple real-world tasks, from software engineering challenges and terminal commands to web-based search and advanced reasoning exams.

SWE-bench and terminal tasks

On coding challenges, DeepSeek-V3.1 posts a significant leap over both V3-0324 and R1-0528. On SWE-bench Verified, which tracks whether a model's GitHub fixes pass unit tests, V3.1 reaches a 66 percent success rate compared with roughly 45 percent for the earlier models. On the multilingual version, V3.1 solves 54.5 percent of issues, nearly doubling the roughly 30 percent scored by V3-0324 and R1-0528. Terminal-Bench, which measures command-line task execution in a live Linux environment, shows a similar gap: 31 percent for V3.1 versus 13 percent for V3-0324 and 6 percent for R1-0528. These results suggest that V3.1 is far more dependable when applying code changes in practical environments.
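For context on how a unit-test-gated score like SWE-bench Verified is read, here is a minimal sketch of computing a resolve rate from per-issue pass/fail outcomes. The data structure, field names, and sample records are illustrative assumptions, not the benchmark's actual harness output.

```python
# Minimal sketch: computing a SWE-bench-style resolve rate.
# The records below are illustrative placeholders, not real harness output.
from dataclasses import dataclass

@dataclass
class IssueResult:
    instance_id: str    # GitHub issue the model attempted to patch
    tests_passed: bool  # did the model's patch make the repo's unit tests pass?

def resolve_rate(results: list[IssueResult]) -> float:
    """Fraction of issues whose generated patch passes the reference unit tests."""
    if not results:
        return 0.0
    return sum(r.tests_passed for r in results) / len(results)

# Hypothetical outcomes for three issues.
sample = [
    IssueResult("astropy__astropy-12907", True),
    IssueResult("django__django-11099", True),
    IssueResult("sympy__sympy-20590", False),
]
print(f"resolve rate: {resolve_rate(sample):.1%}")  # -> 66.7%
```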


Browsing, search, and QA

Evaluation on retrieval and browsing tasks also favors V3.1. On BrowseComp, where models must navigate and extract answers from real web pages, V3.1 answers 30 percent of questions correctly compared with 9 percent for R1. On the Chinese-language version, accuracy rises to 49 percent versus 36 percent. On Humanity's Last Exam (HLE), a broad, expert-level knowledge test, V3.1 edges ahead at 30 percent compared with 25 percent. On xbench-DeepSearch, which demands cross-source synthesis, V3.1 reaches 71 percent versus 55 percent. Other benchmarks, such as Frames for multi-hop structured reasoning, SimpleQA for factual queries, and Seal-0, which measures accuracy under tough retrieval conditions, also show consistent improvements. Collectively, these results underline that V3.1 is more effective than R1 on retrieval-heavy tasks.


Reasoning efficiency

The gains are not only in accuracy but also in efficiency. On AIME 2025, a competition-level math exam, V3.1-Think matches or slightly surpasses R1 at 88.4 percent accuracy versus 87.5 percent, while using about 30 percent fewer tokens. GPQA Diamond, a graduate-level science exam, shows near parity at 80.1 percent versus 81.0 percent, but V3.1 again gets there with roughly half the tokens. On LiveCodeBench, which tests reasoning over code, V3.1 is both more accurate, at 74.8 percent versus R1's 73.3 percent, and more concise. These outcomes suggest that V3.1 can deliver detailed reasoning without unnecessary verbosity.
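To make the efficiency claim concrete, the sketch below compares accuracy alongside average output tokens. The token counts are hypothetical placeholders chosen only to mirror the "about 30 percent fewer tokens" relationship described above; they are not figures from the release.

```python
# Illustrative accuracy-vs-token comparison for AIME 2025.
# Token counts are hypothetical placeholders, not DeepSeek's published numbers.

def relative_token_savings(tokens_new: float, tokens_old: float) -> float:
    """Fraction of output tokens saved by the newer model on the same benchmark."""
    return 1.0 - tokens_new / tokens_old

r1 = {"accuracy": 87.5, "avg_tokens": 20_000}   # hypothetical token budget
v31 = {"accuracy": 88.4, "avg_tokens": 14_000}  # hypothetical token budget

savings = relative_token_savings(v31["avg_tokens"], r1["avg_tokens"])
print(f"accuracy delta: {v31['accuracy'] - r1['accuracy']:+.1f} points")
print(f"token savings:  {savings:.0%}")  # -> 30%
```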


Overall

Relative to V3-0324, V3.1 marks a generational step forward. Against R1, it consistently leads in coding, retrieval, and reasoning benchmarks, while also using fewer tokens. GPQA Diamond remains the one area where R1 keeps pace, but it does so at nearly double the token cost. V3.1 thus positions itself as both more accurate and more efficient, strengthening its role as a dependable reasoning and agent-ready model.
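As a quick recap, the snippet below tabulates the rounded scores quoted in this post and prints the point gains. It is only a convenience for eyeballing the gaps; the values come from the text above, not from an official results table.

```python
# Rounded scores as quoted in this post (percent): V3.1 vs. the stronger prior model.
scores = {
    "SWE-bench Verified":     (66.0, 45.0),
    "SWE-bench Multilingual": (54.5, 30.0),
    "Terminal-Bench":         (31.0, 13.0),
    "BrowseComp":             (30.0, 9.0),
    "BrowseComp (Chinese)":   (49.0, 36.0),
    "HLE":                    (30.0, 25.0),
    "xbench-DeepSearch":      (71.0, 55.0),
    "AIME 2025":              (88.4, 87.5),
    "GPQA Diamond":           (80.1, 81.0),
    "LiveCodeBench":          (74.8, 73.3),
}

for name, (v31, prior) in scores.items():
    print(f"{name:<24} V3.1 {v31:5.1f}  prior {prior:5.1f}  delta {v31 - prior:+5.1f}")
```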

