
QwQ-32B: R1 Performance with 20x Fewer Parameters?

QwQ-32B shows how powerful reinforcement learning can be for reasoning.
Reinforcement learning (RL) has emerged as a powerful method for improving language model performance beyond traditional pretraining and post-training techniques. Recent advances, such as DeepSeek-R1, have demonstrated that RL can significantly enhance a model's reasoning capabilities by integrating structured data and multi-stage training. Building on this foundation, the Qwen team has introduced QwQ-32B, a 32-billion-parameter model that achieves performance comparable to much larger models like DeepSeek-R1.

The Significance of Reinforcement Learning in Large Language Models

QwQ-32B demonstrates how RL can enhance the intelligence of language models, allowing them to reason more effectively and adapt dynamically to new challenges. Despite having fewer parameters than DeepSeek-R1 (which has 671 billion parameters, with 37 billion activated), QwQ-32B achieves comparable results due to its optimized training strategy. The model integrates agent-related capabilities, enabling it to utilize tools and adapt its reasoning processes based on environmental feedback.
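
The post doesn't describe QwQ-32B's tool-calling interface, but the agent pattern it alludes to (reason, call a tool, observe the result, reason again) can be sketched roughly as follows. The action dictionary format and all names here are hypothetical illustrations, not Qwen's actual interface:

# Hypothetical sketch of a reason-act-observe agent loop; QwQ-32B's real
# tool-calling protocol is not described in the post.
from typing import Callable

def agent_loop(
    generate: Callable[[list[dict]], dict],  # wraps the model; returns an action dict
    tools: dict[str, Callable[..., str]],    # tool name -> tool function
    task: str,
    max_steps: int = 8,
) -> str | None:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = generate(history)  # the model decides: call a tool or give a final answer
        if action["type"] == "answer":
            return action["content"]
        # Execute the requested tool and feed the observation back as environmental feedback
        observation = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": observation})
    return None  # no answer within the step budget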

Performance Benchmarks and Comparative Evaluation

QwQ-32B was evaluated on a range of benchmarks covering mathematical reasoning, coding proficiency, and general problem-solving. The results indicate that the model performs on par with or better than several leading models, including DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-Distill-Llama-70B, and o1-mini, which reinforces the effectiveness of RL in refining a model's reasoning capabilities.


Scaling Reinforcement Learning for Math and Coding Tasks

The training process for QwQ-32B began with a cold-start checkpoint (a model trained on structured long Chain-of-Thought examples), followed by a reinforcement learning scaling approach that uses outcome-based rewards. The initial RL stage focused on math and coding tasks, employing an accuracy verifier for math problems and a code execution server to evaluate generated code against predefined test cases. This approach ensured that training episodes continuously improved the model’s performance in these domains.
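
The verifiers themselves aren't published, but the outcome-based rewards described above can be sketched along these lines: an exact-match accuracy check for math and a test-case pass rate for code. This is a minimal illustration that assumes a \boxed{} answer format and (stdin, expected stdout) test cases; the function names and formats are assumptions, not Qwen's implementation:

# Illustrative outcome-based reward sketches; not the actual Qwen verifiers.
import re
import subprocess
import tempfile

def math_accuracy_reward(model_output: str, ground_truth: str) -> float:
    """1.0 if the last \\boxed{...} answer matches the ground truth, else 0.0."""
    answers = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == ground_truth.strip() else 0.0

def code_execution_reward(generated_code: str,
                          test_cases: list[tuple[str, str]]) -> float:
    """Fraction of (stdin, expected stdout) test cases the generated program passes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    passed = 0
    for stdin_data, expected in test_cases:
        try:
            result = subprocess.run(
                ["python", path], input=stdin_data,
                capture_output=True, text=True, timeout=5,
            )
            if result.stdout.strip() == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # hung programs earn no credit for this case
    return passed / len(test_cases)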

Enhancing General Capabilities Through RL

After refining its performance in math and coding, a second RL stage was introduced to enhance the model’s general capabilities. This stage incorporated rewards from a general reward model along with rule-based verifiers. Even with relatively few additional training steps, this phase improved the model’s abilities in instruction following, alignment with human preferences, and agent-like reasoning. Importantly, these improvements did not come at the expense of its math and coding performance, demonstrating the scalability of RL across different cognitive tasks.
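
The exact reward formulation for this stage isn't given either, but one minimal way to combine a general reward model's score with rule-based verifiers is a weighted blend. Everything below, including the example rules, is an illustrative assumption rather than the published method:

# Illustrative only: the post does not specify how the general reward model
# and rule-based verifiers are actually combined.
from typing import Callable

def rule_score(prompt: str, response: str,
               checks: list[Callable[[str, str], bool]]) -> float:
    """Fraction of rule-based verifiers (format, length, constraints) that pass."""
    return sum(check(prompt, response) for check in checks) / len(checks)

def combined_reward(rm_score: float, prompt: str, response: str,
                    checks: list[Callable[[str, str], bool]],
                    alpha: float = 0.5) -> float:
    """Blend a general reward model score in [0, 1] with rule-based checks."""
    return alpha * rm_score + (1 - alpha) * rule_score(prompt, response, checks)

# Example rule-based verifiers (hypothetical):
def no_empty_response(prompt: str, response: str) -> bool:
    return response.strip() != ""

def within_length_limit(prompt: str, response: str) -> bool:
    return len(response.split()) <= 1024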

Using QwQ-32B for Practical Applications

QwQ-32B is openly available on Hugging Face and ModelScope under the Apache 2.0 license, and it can also be tried directly in Qwen Chat. The model can be integrated using Hugging Face Transformers or through the Alibaba Cloud DashScope API (both approaches are shown in the code below), allowing users to answer complex reasoning queries, solve mathematical problems, and generate code.

Future Directions for RL and Language Models

The development of QwQ-32B marks an important milestone in scaling RL for reasoning. Its success highlights the untapped potential of pretrained language models when combined with reinforcement learning and structured training. Moving forward, the Qwen team aims to integrate RL with agents to enable long-horizon reasoning and more advanced decision-making. With continued advances in model architecture and computational scaling, RL-driven models like QwQ-32B could bring the field closer to artificial general intelligence (AGI).

Using the model

Here's some code to run the model locally with Hugging Face Transformers (with optional W&B Weave tracing):
from transformers import AutoModelForCausalLM, AutoTokenizer
import weave

# Optional: initialize W&B Weave for logging/tracing
weave.init("qwq32B")

model_name = "Qwen/QwQ-32B"

# Load the model and tokenizer; device_map="auto" places layers on available GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# QwQ produces long chains of thought, so allow a generous generation budget
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Drop the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
You can also call QwQ-32B through the Alibaba Cloud DashScope API, which exposes an OpenAI-compatible endpoint and streams the model's reasoning separately from its final answer:

from openai import OpenAI
import os

# Initialize an OpenAI-compatible client pointed at DashScope
client = OpenAI(
    # If the environment variable is not configured, replace with your API key: api_key="sk-xxx"
    # How to get an API key: https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""
content = ""

is_answering = False

completion = client.chat.completions.create(
    model="qwq-32b",
    messages=[
        {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}
    ],
    stream=True,
    # Uncomment the following lines to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

print("\n" + "=" * 20 + "reasoning content" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print token usage instead
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Stream the chain-of-thought reasoning as it arrives
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Print a separator once the final answer begins
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "content" + "=" * 20 + "\n")
                is_answering = True
            # Print the final answer content
            print(delta.content, end='', flush=True)
            content += delta.content
