
How to fine-tune and evaluate Qwen3 with Unsloth

This article provides a comprehensive guide to fine-tuning, evaluating, and deploying the Qwen3 language model, emphasizing its flexibility, performance, and unique reasoning-toggle feature.
Large language models are rapidly evolving, and Qwen3 has emerged as one of the most capable open-source contenders, excelling in reasoning, code generation, and multilingual tasks. Beyond raw performance, what sets Qwen3 apart is its transparency and flexibility: it not only delivers strong results out-of-the-box, but also supports explicit reasoning modes and efficient fine-tuning.
In this comprehensive hands-on guide, you’ll learn how to leverage Qwen3 for your own projects - from understanding its unique “thinking” feature, to customizing it with parameter-efficient fine-tuning, and rigorously evaluating your results with best-in-class open tools. Whether you’re building research prototypes or reliable, production-grade systems, this tutorial on fine-tuning Qwen3 with Unsloth covers the practical details, tips, and workflows to help you move from experimentation to deployment with confidence.
Ready to get hands-on? Jump straight to the Fine-tuning Qwen3 section below.


Continue reading if you'd like a little background on the Qwen3 model and what it excels at.
Evaluating our Qwen3 model after fine-tuning.





When & why to use Qwen3

Choosing the right model often hinges on your specific requirements. Here are the key scenarios where Qwen3 shines:
  • Next-gen applications on diverse devices: From smartphones and smart glasses to autonomous vehicles and robotics, Qwen3’s range of model sizes lets you deploy AI where you need it most - whether that’s edge-constrained hardware or cloud servers.
  • Transparent chain-of-thought reasoning: When you need to debug a model’s logic or teach complex concepts, Qwen3’s thinking mode exposes its step-by-step deliberations, making it ideal for educational tools, math proofs, and advanced code generation tasks.
  • Global and multilingual products: Qwen3 supports 119 languages and dialects, ensuring high-fidelity translation and instruction-following for both major and low-resource languages in international applications.
  • Handling very long contexts: For multi-document summarization, legal-tech pipelines, or large codebases, larger Qwen3 models and the Mixture-of-Experts (MoE) variants can process up to 128K tokens in a single pass, far beyond most standard LLMs.
  • Cost-efficient high-performance inference: The mixture-of-experts architecture routes each request through only a subset of experts, dramatically lowering compute costs compared to monolithic models of similar capacity.
  • Low-latency general-purpose chat: For production chatbots or real-time assistants where speed matters more than detailed reasoning, you can disable thinking mode and get fast, concise replies without losing core language understanding.
  • Seamless agent and tool integration: Qwen3’s built-in Model Context Protocol (MCP) and function-calling support make it straightforward to build agentic workflows that interact with external APIs, databases, or retrieval systems.
  • Fully open-source and customizable: Lastly, all Qwen3 weights are released under Apache 2.0, with ready-to-use checkpoints on Hugging Face.
What really powers many of those scenarios—from debuggable reasoning for education to low-latency chatbots and cost-efficient edge deployments—is Qwen3’s ability to switch between “thinking” and “non-thinking” modes.
Let's dig in.

Understanding Qwen3 and its dynamic thinking feature

A unique highlight of Qwen3 is its dual “thinking” and “non-thinking” modes, which add a new layer of transparency and control to model outputs.
Qwen3’s “thinking” mode lets you peek inside the model’s chain of thought. Before giving a final answer, it emits a <think>...</think> block that walks through its intermediate steps—be they logical deductions, calculations, or other reasoning—so you can see exactly how it arrived at its conclusion. You can toggle this behavior in two ways:
  • API flag
    • enable_thinking=True (default): includes a populated <think>...</think> section when the model deems reasoning necessary.
    • enable_thinking=False: omits the reasoning content (you’ll get empty <think></think> tags followed immediately by the answer).
  • In-prompt command
    • Appending /no_think to your user message suppresses the reasoning even if enable_thinking is set to True.
This clear separation between internal deliberation and final output makes debugging, teaching, and auditing much more straightforward—especially on tasks like math problems, logic puzzles, or complex code generation.
Here's some code that demonstrates this:
import torch
from unsloth import FastLanguageModel
import weave; weave.init('think_test')


# --- Model Setup Variables ---
BASE_MODEL_NAME = "unsloth/Qwen3-8B"
max_seq_length = 2048
dtype = None
load_in_4bit = False

# --- Load Model and Tokenizer ---
BASE_MODEL, TOKENIZER = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL_NAME,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
BASE_MODEL.eval().to("cuda")
FastLanguageModel.for_inference(BASE_MODEL)

# --- Prompt Preparation Functions ---
def make_prompt(instruction):
    return [{"role": "user", "content": instruction}]

def apply_chat_template(prompt, tokenizer, enable_thinking=True):
    messages = make_prompt(prompt)
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
    )

@weave.op
def generate_response(prompt, enable_thinking=True):
    prompt_text = apply_chat_template(prompt, TOKENIZER, enable_thinking)
    inputs = TOKENIZER([prompt_text], return_tensors="pt").to("cuda")
    with torch.no_grad():
        gen_output = BASE_MODEL.generate(
            **inputs,
            max_new_tokens=128,
            use_cache=False,
            temperature=0.7,
            top_p=0.8,
            top_k=20,
            min_p=0.0,
        )
    output_text = TOKENIZER.decode(gen_output[0], skip_special_tokens=True)
    return output_text

# --- Test Prompts ---
math_question = "What is 256 multiplied by 17?"
math_question_no_think = "/no_think\nWhat is 256 multiplied by 17?"

print("=== enable_thinking=True (default) ===")
output1 = generate_response(math_question, enable_thinking=True)
print(output1.strip())
print()

print("=== enable_thinking=False ===")
output2 = generate_response(math_question, enable_thinking=False)
print(output2.strip())
print()

print("=== enable_thinking=True + /no_think in prompt ===")
output3 = generate_response(math_question_no_think, enable_thinking=True)
print(output3.strip())

From this code, it's clear that Qwen3’s chat formatting consistently includes <think>...</think> markers as part of its design. Whether the block contains reasoning or is left empty is controlled by enable_thinking and /no_think. If you want outputs with no <think> tags at all, you need to apply a simple string replacement or strip routine as post-processing, like the sketch below.
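For example, here is a minimal post-processing sketch. It uses a plain regular expression, and strip_think_tags is our own illustrative helper rather than a Qwen3 or Unsloth API:
import re

def strip_think_tags(text: str) -> str:
    # Remove any <think>...</think> block (populated or empty) and tidy whitespace.
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return cleaned.strip()

# Usage: strip_think_tags(output2) returns only the final answer text.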
Here's what it looks like inside Weave when you run this script. The first trace below uses enable_thinking=True with /no_think in the prompt:

This one uses enable_thinking=False:

And this one uses enable_thinking=True:


Fine-tuning Qwen3

Now let’s fine-tune Qwen3 on a custom dataset using Unsloth, which streamlines every step, from spinning up the model to injecting adapters and logging metrics.
Unsloth’s FastLanguageModel API lets you load Qwen3 in just a couple of lines (with optional 4-bit quantization under the hood), while its built-in PEFT hooks enable LoRA adapter injection without boilerplate. It also integrates seamlessly with W&B Models and Weave for real-time experiment tracking and visualization, and handles optimizations like gradient checkpointing and memory-efficient kernels so you can focus on your data and hyperparameters.
Fine-tuning is the process of taking a large, pre-trained language model and continuing its training on your own examples, so it can better follow your specific instructions or generate content in your desired style. By leveraging LoRA (Low-Rank Adaptation), we only train a small set of adapter parameters instead of updating the entire model, which slashes both training time and GPU memory use.
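To get a feel for how small that adapter footprint is, you can count trainable versus total parameters after the adapters are injected. This is a quick sketch in plain PyTorch; count_trainable_params is our own illustrative helper, not an Unsloth API:
def count_trainable_params(model):
    # Compare the LoRA adapter parameters (requires_grad=True) with the frozen base weights.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

# Usage: call count_trainable_params(model) after the get_peft_model call in the script below.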
To set up fine-tuning, we first prepare our dataset in a consistent schema (typically “instruction,” “input,” and “output” fields). A simple preprocessing function then merges each instruction and input into a single user prompt, pairs it with the desired response, and formats both exactly as Qwen3 expects for conversational data. This alignment ensures the model learns the precise mapping from your custom prompts to your targets.
In this process, we also take advantage of tools like Weights & Biases to monitor and visualize our training as it happens, tracking things like loss and learning rate across epochs. This helps us spot problems quickly and compare different training runs.
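One practical detail, assuming you rely on the Trainer's built-in W&B integration (report_to="wandb", as in the script below): you can choose the W&B project and run name through environment variables before training starts. The names used here are placeholders:
import os

# Placeholder project/run names; both environment variables are read by the W&B integration.
os.environ["WANDB_PROJECT"] = "qwen3-unsloth-finetune"
os.environ["WANDB_NAME"] = "qwen3-8b-lora-alpaca"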
By combining these strategies—parameter-efficient adaptation, proper data formatting, and good experiment tracking—fine-tuning Qwen3 becomes accessible and efficient, whether you’re training on a few hundred examples or scaling up to much larger, domain-specific datasets.
import random
import numpy as np
import torch

SEED = 3407

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# ====== REST OF SCRIPT ======
from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

max_seq_length = 2048
dtype = None
load_in_4bit = False
MODEL_NAME = "unsloth/Qwen3-8B"
SAVE_DIR = "lora_model"

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=SEED,
    use_rslora=False,
    loftq_config=None,
)

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        if input_text.strip():
            user_message = f"{instruction}\n\n{input_text}"
        else:
            user_message = instruction
        messages = [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": output},
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
            enable_thinking=False,
        )
        texts.append(text)
    return {"text": texts}

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
half_len = len(dataset) // 2
dataset = dataset.select(range(half_len))
dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=2)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=SEED,  # Make sure to set this!
        output_dir="outputs",
        report_to="wandb",
    ),
)
trainer.train()

FastLanguageModel.for_inference(model)

user_query = "Continue the Fibonacci sequence.\n\n1, 1, 2, 3, 5, 8"
messages = [
    {"role": "user", "content": user_query},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    use_cache=False,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
)

print("\n=========== Output from in-memory model (just trained):")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

del model
del tokenizer
torch.cuda.empty_cache()

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=SAVE_DIR,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)

prompt2 = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs2 = tokenizer([prompt2], return_tensors="pt").to("cuda")

outputs2 = model.generate(
    **inputs2,
    max_new_tokens=2048,
    use_cache=False,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
)

print("\n=========== Output from reloaded model (after save/load):")
print(tokenizer.decode(outputs2[0], skip_special_tokens=True))


After running the code, we'll see our results logged to W&B:



Evaluating with W&B Weave

After fine-tuning, it’s important to directly compare your custom LoRA Qwen3 model to the original base model to understand their differences in real-world responses. For this, we use W&B Weave, a tool designed specifically for side-by-side model evaluation and interactive analysis.
With Weave, we can run both the base and LoRA models on the same set of held-out prompts. For each example, Weave captures not just the generated output but also valuable details like response time and the exact prompt used. Its comparison view then displays the answers from each model next to one another, making it easy to quickly spot where your fine-tuned model’s answers have become clearer, more relevant, or better aligned with your expectations.
Here’s the code for our evaluation:
import random
import numpy as np
import torch

SEED = 3407
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

import unsloth
from datasets import load_dataset
import weave
import asyncio
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = False
BASE_MODEL_NAME = "unsloth/Qwen3-8B"
LORA_MODEL_DIR = "lora_model"
N = 30

weave.init("q3")

# === GLOBAL: LOAD MODELS ONLY ONCE ===
BASE_MODEL, TOKENIZER = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL_NAME,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
LORA_MODEL, _ = FastLanguageModel.from_pretrained(
    model_name=LORA_MODEL_DIR,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
BASE_MODEL.eval().to("cuda")
LORA_MODEL.eval().to("cuda")
FastLanguageModel.for_inference(BASE_MODEL)
FastLanguageModel.for_inference(LORA_MODEL)

def make_prompt(instruction, input_text):
    if input_text.strip():
        user_message = f"{instruction}\n\n{input_text}"
    else:
        user_message = instruction
    return [{"role": "user", "content": user_message}]

def apply_chat_template_loss(sample, tokenizer):
    messages = make_prompt(sample["instruction"], sample["input"])
    messages.append({"role": "assistant", "content": sample["output"]})
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
        enable_thinking=False,
    )

def apply_chat_template_generation(sample, tokenizer):
    messages = make_prompt(sample["instruction"], sample["input"])
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

def output_only_loss(tokenizer, model, sample, device="cuda"):
    # 1. Prepare full prompt+output for loss
    prompt_plus_output = apply_chat_template_loss(sample, tokenizer)
    # 2. Prepare prompt only (for prefix length)
    prompt_only = make_prompt(sample["instruction"], sample["input"])
    prompt_only_str = tokenizer.apply_chat_template(
        prompt_only,
        tokenize=False,
        add_generation_prompt=False,
        enable_thinking=False,
    )
    # 3. Tokenize both
    tok_full = tokenizer(
        prompt_plus_output,
        return_tensors="pt",
        truncation=True,
        max_length=max_seq_length,
        padding="max_length",  # For safe shape ops
    )
    tok_prompt = tokenizer(
        prompt_only_str,
        return_tensors="pt",
        truncation=True,
        max_length=max_seq_length,
    )
    input_ids = tok_full["input_ids"].to(device)
    labels = input_ids.clone()

    # 4. Loss ONLY on output tokens
    prompt_len = tok_prompt["input_ids"].shape[-1]  # prompt token count (may be == 2048!)
    # Mask prompt tokens in labels
    labels[:, :prompt_len] = -100
    # Mask pad tokens if present
    if tokenizer.pad_token_id is not None:
        labels[input_ids == tokenizer.pad_token_id] = -100

    with torch.no_grad():
        loss = model(input_ids=input_ids, labels=labels).loss.item()
    return loss

def safe_generate(model, tokenizer, prompt, device="cuda"):
    # Tokenize prompt and ensure we never overflow model max length
    prompt_tok = tokenizer(
        [prompt],
        return_tensors="pt",
        truncation=True,
        max_length=max_seq_length,
    ).to(device)
    prompt_len = prompt_tok["input_ids"].shape[1]
    # Prevent overflow: generate at least 1 token, never beyond max_seq_length
    max_gen = max(1, max_seq_length - prompt_len)
    with torch.no_grad():
        output = model.generate(
            **prompt_tok,
            max_new_tokens=max_gen,
            use_cache=False,
            temperature=0.7,
            top_p=0.8,
            top_k=20,
            min_p=0.0,
        )
    out_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return out_text

class QwenBaseModel(weave.Model):
    @weave.op()
    async def predict(self, instruction, input, output):
        sample = {
            "instruction": instruction,
            "input": input,
            "output": output,
        }
        # LOSS on output tokens only
        loss = output_only_loss(TOKENIZER, BASE_MODEL, sample)
        # GENERATION safely
        prompt_gen = apply_chat_template_generation(sample, TOKENIZER)
        output_text = safe_generate(BASE_MODEL, TOKENIZER, prompt_gen)
        return {"loss": loss, "output": output_text}

class QwenLoraModel(weave.Model):
    @weave.op()
    async def predict(self, instruction, input, output):
        sample = {
            "instruction": instruction,
            "input": input,
            "output": output,
        }
        # LOSS on output tokens only
        loss = output_only_loss(TOKENIZER, LORA_MODEL, sample)
        # GENERATION safely
        prompt_gen = apply_chat_template_generation(sample, TOKENIZER)
        output_text = safe_generate(LORA_MODEL, TOKENIZER, prompt_gen)
        return {"loss": loss, "output": output_text}

@weave.op()
def loss_only_scorer(output):
    return {"loss": output["loss"]}

# ====== Load LAST 10% of train and pick 30 samples ======
full_ds = load_dataset("yahma/alpaca-cleaned", split="train")
length = len(full_ds)
start = int(length * 0.9)
end = length
ds_last10 = full_ds.select(range(start, end))
samples = [
    dict(
        instruction=row["instruction"],
        input=row["input"],
        output=row["output"],
    )
    for row in ds_last10.select(range(N))
]

async def main():
    models = {
        "Qwen3-8B-base": QwenBaseModel(),
        "Qwen3-8B-LoRA": QwenLoraModel(),
    }
    scorers = [loss_only_scorer]

    for model_name, model in models.items():
        print(f"\n=== Evaluating {model_name} ===")
        evaluation = weave.Evaluation(
            dataset=samples,
            scorers=scorers,
            name=f"{model_name} LossEval",
        )
        results = await evaluation.evaluate(model)

if __name__ == "__main__":
    asyncio.run(main())

One of the major strengths of Weave is the ability to visually inspect responses and dive into particular samples where the models disagree or where the difference is especially stark. Rather than sorting through raw model outputs manually, Weave organizes and presents the information so you can easily trace patterns and make informed judgments about which model better serves your needs.
Moreover, the evaluation process in Weave is streamlined and highly reproducible. You can share interactive reports or dashboards with collaborators, bookmark specific prompt-result pairs for review, and track improvements across different model versions or fine-tuning runs. In the Weave evaluations dashboard, we can see that our fine-tuned model had a much lower loss on our evaluation set.


Testing Qwen3 14B on AIME 2024

I was also curious how well this model can solve math questions with thinking mode enabled, so I benchmarked the 14B model on the AIME 2024 dataset to compare it against DeepSeek R1’s 14B distilled model. Here’s the code for evaluating Qwen3 14B on AIME 2024:
import os
import asyncio
import json
from datasets import load_dataset
from openai import OpenAI
from litellm import completion
import weave
weave.init("aime_evaluation")

# ==== CONFIGURATION ====
OPENROUTER_API_KEY = "your open router api key" # Set your API key
os.environ["OPENAI_API_KEY"] = "your openai api key" # for litellm


client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)

extra_headers = {
    "HTTP-Referer": "https://your-site.com",  # Change for leaderboard credit
    "X-Title": "Qwen3-14B-AIME-Eval",
}

system_message = "Solve the following problem. put your final answer within \\boxed{}: "

# ==== Qwen3-14B via OpenRouter ====
async def qwen3_14b_openrouter_inference(prompt):
    resp = client.chat.completions.create(
        extra_headers=extra_headers,
        model="qwen/qwen3-14b",
        messages=[{"role": "user", "content": f"{system_message} {prompt}"}],
    )
    return resp.choices[0].message.content.strip()

class Qwen3_14B_OpenRouter_Model(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await qwen3_14b_openrouter_inference(text)

# ==== GPT-4o scorer via litellm ====
def run_inference_openai(prompt, model_id="gpt-4o-2024-08-06"):
    try:
        response = completion(
            model=model_id,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
        )
        if response and hasattr(response, "choices") and len(response.choices) > 0:
            content = response.choices[0].message.content
            return content
        else:
            print("No content found in response")
            return None
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None


@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    # Check minimum response length (non-whitespace chars)
    if len("".join(model_output.split())) < 3:
        return {
            "correctness": False,
            "reasoning": "Model output too short (less than 3 non-whitespace chars).",
        }

    query = (
        "YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER.\n"
        "I WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER -> \n\n"
        f"Model's Answer (last 100 chars): {str(model_output)[-100:]}\n"
        f"Correct Answer: {label}\n\n"
        "Your task:\n"
        "1. State the model's predicted answer (answer only).\n"
        "2. State the ground truth (answer only).\n"
        "3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, "
        "followed with a JSON object containing the correctness encapsulated within the following delimiters:\n"
        "```json\n"
        "{ \"correctness\": true/false }\n"
        "```"
    )
    response = run_inference_openai(query, "gpt-4o-2024-08-06")
    if response is None:
        return {"correctness": False, "reasoning": "Inference failed."}
    try:
        json_start = response.index("```json") + 7
        json_end = response.index("```", json_start)
        correctness = json.loads(response[json_start:json_end].strip()).get("correctness", False)
    except (ValueError, IndexError):
        correctness = False
    return {"correctness": correctness, "reasoning": response}


# ==== LOAD DATASET (AIME 2024 Problems) ====
def load_ds():
    print("Loading AIME 2024 dataset from HuggingFace 🤗 ...")
    dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]
    return [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]

# ==== EVALUATION LOOP ====
async def run_evaluations():
    dataset = load_ds()
    print("Initializing models...")
    models = {
        "qwen3-14b-openrouter": Qwen3_14B_OpenRouter_Model(),
    }

    dataset_prepared = dataset
    print("Running evaluations...")
    scorers = [gpt4o_scorer]

    for model_name, model in models.items():
        print(f"\n=== EVALUATING {model_name.upper()} ===")
        evaluation = weave.Evaluation(
            dataset=dataset_prepared,
            scorers=scorers,
            name=f"{model_name} Evaluation",
        )
        results = await evaluation.evaluate(model)
        print(f"Results for {model_name}: {results}")

        # Print accuracy if possible
        if hasattr(results, "scores") and "gpt4o_scorer" in results.scores:
            correct = sum(1 for score in results.scores["gpt4o_scorer"] if score["correctness"])
            accuracy = correct / len(dataset_prepared) if dataset_prepared else 0
            print(f"{model_name} accuracy: {accuracy:.2%} ({correct}/{len(dataset_prepared)})")

if __name__ == "__main__":
    asyncio.run(run_evaluations())


Here I chose to use OpenRouter to avoid memory issues that can occur with long thinking traces. Another nice feature of Weave Evaluations is that we can easily compare new evaluations to ones we ran previously. For example, I can compare this evaluation to the DeepSeek R1-14B evaluation I ran several months ago, without needing to re-run it.
Here are the results for my evaluation:


Here, Qwen3 14B scored 66.7% correctness, far ahead of R1-14B, which only managed 20%! This model is seriously impressive for its size, and it's incredible to see how fast these reasoning models are improving.

Catching bugs with Weave

Thanks to Weave’s visualizations, I caught a bug where blank model outputs were being interpreted as correct answers. The scorer was letting through empty or near-empty responses, and the judge LLM was sometimes accepting them. Once I added a check for at least three non-whitespace characters, these false positives disappeared. Subtle issues like this are much easier to spot with Weave.


Using the Weave EvaluationLogger

Weave also offers an alternative way to run evaluations through the EvaluationLogger. Instead of sticking to a rigid evaluation format, you can manually loop through your data, call your model, and log predictions and scores. This style makes it easy to fit Weave into evaluation code you've already written.
For example, here's a snippet that shows how you might evaluate a model on the AIME 2024 dataset. It isn't meant to be fully functional; it's intended to show the core methods of the EvaluationLogger:
from weave.flow.eval_imperative import EvaluationLogger
import weave; weave.init("aime_evaluation")


eval_logger = EvaluationLogger(model="qwen3_14b_openrouter", dataset="AIME_2024")

for row in dataset:
    output = model.predict(row["text"])
    pred_logger = eval_logger.log_prediction(inputs={"text": row["text"]}, output=output)
    score = gpt4o_judge(row["label"], output)
    pred_logger.log_score("correctness", score)
    pred_logger.finish()

eval_logger.log_summary() # call after loop exits
In this snippet, we manually loop through each row of the dataset, call the model to generate predictions, and log both the predictions and their corresponding scores using the EvaluationLogger. This approach gives you flexibility to structure the evaluation however you like, while still taking advantage of Weave’s dashboards, comparison tools, and visualizations. Here’s the full code for our evaluation using the EvaluationLogger:
import os
import asyncio
import json
import httpx
from litellm import completion
from datasets import load_dataset
import weave

weave.init("aime_evaluation")

OPENROUTER_API_KEY = "your openrouter api key"
os.environ["OPENAI_API_KEY"] = "your openai api key"

system_message = "Solve the following problem. put your final answer within \\boxed{}: "

@weave.op
async def qwen3_14b_openrouter_inference(prompt):
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://your-site.com",
        "X-Title": "Qwen3_14B_AIME_Eval",
    }

    body = {
        "model": "qwen/qwen3-14b",
        "messages": [{"role": "user", "content": f"{system_message} {prompt}"}],
    }

    try:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                "https://openrouter.ai/api/v1/chat/completions",
                headers=headers,
                json=body,
                timeout=1200,
            )
            resp.raise_for_status()
            data = resp.json()
            return data["choices"][0]["message"]["content"].strip()
    except Exception as e:
        print(f"[OpenRouter ERROR] {e}")
        return ""

def run_inference_openai(prompt, model_id="gpt-4o-2024-08-06"):
    try:
        response = completion(
            model=model_id,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
        )
        if response and hasattr(response, "choices") and len(response.choices) > 0:
            return response.choices[0].message.content
        return None
    except Exception as e:
        print(f"[GPT-4o scorer error] {e}")
        return None



@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    if len("".join(model_output.split())) < 3:
        return {"correctness": False, "reasoning": "Model output too short."}

    query = (
        "YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER.\n"
        f"Model's Answer (last 100 chars): {str(model_output)[-100:]}\n"
        f"Correct Answer: {label}\n\n"
        "Your task:\n"
        "1. State the model's predicted answer (answer only).\n"
        "2. State the ground truth (answer only).\n"
        "3. Determine if the model's final answer is correct.\n"
        "```json\n{ \"correctness\": true/false }\n```"
    )
    response = run_inference_openai(query, "gpt-4o-2024-08-06")
    if response is None:
        return {"correctness": False, "reasoning": "Inference failed."}
    try:
        json_start = response.index("```json") + 7
        json_end = response.index("```", json_start)
        correctness = json.loads(response[json_start:json_end].strip()).get("correctness", False)
    except (ValueError, IndexError):
        correctness = False
    return {"correctness": correctness, "reasoning": response}




async def run_evaluations():
    print("Loading AIME 2024 dataset from Hugging Face...")
    hf_dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]
    dataset = [{"text": row["Problem"], "label": row["Answer"]} for row in hf_dataset][:2]

    eval_logger = weave.flow.eval_imperative.EvaluationLogger(
        model="qwen3_14b_openrouter", dataset="AIME_2024_HF"
    )

    correct = 0
    total = 0

    print(f"Evaluating {len(dataset)} examples...")
    for row in dataset:
        prompt = row["text"]
        label = row["label"]
        model_output = await qwen3_14b_openrouter_inference(prompt)

        pred_logger = eval_logger.log_prediction(
            inputs={"text": prompt},
            output=model_output,
        )

        score_result = await gpt4o_scorer(label, model_output)
        pred_logger.log_score("correctness", score_result["correctness"])
        pred_logger.finish()

        if score_result["correctness"]:
            correct += 1
        total += 1

    accuracy = correct / total if total > 0 else 0
    eval_logger.log_summary()
    print(f"Accuracy: {correct}/{total} = {accuracy:.2%}")
    print("Evaluation logging complete. View results in the Weave UI.")

if __name__ == "__main__":
    asyncio.run(run_evaluations())

Conclusion

Qwen3 stands out as one of today's leading open large language models, offering powerful performance across reasoning, coding, and multilingual tasks. Its unique feature, which allows users to easily toggle its explicit reasoning on or off, provides exceptional flexibility, catering to different needs in experimentation and deployment. Combined with efficient fine-tuning techniques such as LoRA and intuitive evaluation tools like W&B Weave, Qwen3 can be effectively adapted and deployed in diverse, real-world scenarios, making it an attractive choice for practitioners seeking control and adaptability in their models.




