How to fine-tune and evaluate Qwen3 with Unsloth
This article provides a comprehensive guide to fine-tuning, evaluating, and deploying the Qwen3 language model, emphasizing its flexibility, performance, and unique reasoning-toggle feature.
Created on May 2|Last edited on May 5
Large language models are rapidly evolving, and Qwen3 has emerged as one of the most capable open-source contenders, excelling in reasoning, code generation, and multilingual tasks. Beyond raw performance, what sets Qwen3 apart is its transparency and flexibility: it not only delivers strong results out-of-the-box, but also supports explicit reasoning modes and efficient fine-tuning.
In this comprehensive hands-on guide, you’ll learn how to leverage Qwen3 for your own projects - from understanding its unique “thinking” feature, to customizing it with parameter-efficient fine-tuning, and rigorously evaluating your results with best-in-class open tools. Whether you’re building research prototypes or reliable, production-grade systems, this tutorial on fine-tuning Qwen3 with Unsloth covers the practical details, tips, and workflows to help you move from experimentation to deployment with confidence.
Ready to get hands-on? Jump straight to the Fine-tuning Qwen3 section below
Continue reading if you'd like a little background on the Qwen3 model and what it excels at.

Evaluating our Qwen3 model after fine-tuning.
Table of contents
- When & why to use Qwen3
- Understanding Qwen3 and its dynamic thinking feature
- Fine-tuning Qwen3
- Evaluating with W&B Weave
- Testing Qwen3 14B on AIME 2024
- Catching bugs with Weave
- Using the Weave EvaluationLogger
- Conclusion
When & why to use Qwen3
Choosing the right model often hinges on your specific requirements. Here are the key scenarios where Qwen3 shines:
- Next-gen applications on diverse devices: From smartphones and smart glasses to autonomous vehicles and robotics, Qwen3’s range of model sizes lets you deploy AI where you need it most - whether that’s edge-constrained hardware or cloud servers.
- Transparent chain-of-thought reasoning: When you need to debug a model’s logic or teach complex concepts, Qwen3’s thinking mode exposes its step-by-step deliberations, making it ideal for educational tools, math proofs, and advanced code generation tasks.
- Global and multilingual products: Qwen3 supports 119 languages and dialects, ensuring high-fidelity translation and instruction-following for both major and low-resource languages in international applications.
- Handling very long contexts: For multi-document summarization, legal-tech pipelines, or large codebases, larger Qwen3 models and the Mixture-of-Experts (MoE) variants can process up to 128K tokens in a single pass, far beyond most standard LLMs.
- Cost-efficient high-performance inference: The mixture-of-experts architecture routes each request through only a subset of experts, dramatically lowering compute costs compared to monolithic models of similar capacity.
- Low-latency general-purpose chat: For production chatbots or real-time assistants where speed matters more than detailed reasoning, you can disable thinking mode and get fast, concise replies without losing core language understanding.
- Seamless agent and tool integration: Qwen3’s built-in Model Context Protocol (MCP) and function-calling support make it straightforward to build agentic workflows that interact with external APIs, databases, or retrieval systems.
- Fully open-source and customizable: Lastly, all Qwen3 weights are released under Apache 2.0, with ready-to-use checkpoints on Hugging Face.
What really powers many of those scenarios—from debuggable reasoning for education to low-latency chatbots and cost-efficient edge deployments—is Qwen3’s ability to switch between “thinking” and “non-thinking” modes.
Let's dig in.
Understanding Qwen3 and its dynamic thinking feature
A unique highlight of Qwen3 is its dual “thinking” and “non-thinking” modes, which add a new layer of transparency and control to model outputs.
Qwen3’s “thinking” mode lets you peek inside the model’s chain of thought. Before giving a final answer, it emits a <think>...</think> block that walks through its intermediate steps—be they logical deductions, calculations, or other reasoning—so you can see exactly how it arrived at its conclusion. You can toggle this behavior in two ways:
- API flag
  - enable_thinking=True (default): includes a populated <think>...</think> section when the model deems reasoning necessary.
  - enable_thinking=False: omits the reasoning content (you’ll get empty <think></think> tags followed immediately by the answer).
- In-prompt command
  - Appending /no_think to your user message suppresses the reasoning even if enable_thinking is set to True.
This clear separation between internal deliberation and final output makes debugging, teaching, and auditing much more straightforward—especially on tasks like math problems, logic puzzles, or complex code generation.
Here's some code that demonstrates this:
import torch
from unsloth import FastLanguageModel
import weave; weave.init('think_test')

# --- Model Setup Variables ---
BASE_MODEL_NAME = "unsloth/Qwen3-8B"
max_seq_length = 2048
dtype = None
load_in_4bit = False

# --- Load Model and Tokenizer ---
BASE_MODEL, TOKENIZER = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL_NAME,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
BASE_MODEL.eval().to("cuda")
FastLanguageModel.for_inference(BASE_MODEL)

# --- Prompt Preparation Functions ---
def make_prompt(instruction):
    return [{"role": "user", "content": instruction}]

def apply_chat_template(prompt, tokenizer, enable_thinking=True):
    messages = make_prompt(prompt)
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
    )

@weave.op
def generate_response(prompt, enable_thinking=True):
    prompt_text = apply_chat_template(prompt, TOKENIZER, enable_thinking)
    inputs = TOKENIZER([prompt_text], return_tensors="pt").to("cuda")
    with torch.no_grad():
        gen_output = BASE_MODEL.generate(
            **inputs,
            max_new_tokens=128,
            use_cache=False,
            temperature=0.7,
            top_p=0.8,
            top_k=20,
            min_p=0.0,
        )
    output_text = TOKENIZER.decode(gen_output[0], skip_special_tokens=True)
    return output_text

# --- Test Prompts ---
math_question = "What is 256 multiplied by 17?"
math_question_no_think = "/no_think\nWhat is 256 multiplied by 17?"

print("=== enable_thinking=True (default) ===")
output1 = generate_response(math_question, enable_thinking=True)
print(output1.strip())
print()

print("=== enable_thinking=False ===")
output2 = generate_response(math_question, enable_thinking=False)
print(output2.strip())
print()

print("=== enable_thinking=True + /no_think in prompt ===")
output3 = generate_response(math_question_no_think, enable_thinking=True)
print(output3.strip())
From this code, it's clear that Qwen3’s chat formatting consistently includes <think>...</think> markers by design. Whether the block contains reasoning or is left empty is controlled by enable_thinking and /no_think. If you want outputs with no <think> tags at all, you'll need a simple string replacement or strip routine as post-processing (a small helper is sketched after the screenshots below). Here's what it looks like inside Weave when you run this script:
Note: the trace below shows enable_thinking=True with /no_think in the prompt

This shows using enable_thinking=False

This one shows using enable_thinking=True

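If you want outputs with no <think> tags at all, a small post-processing helper is enough. Here's a minimal sketch; strip_think is an illustrative helper and not part of the script above:

import re

def strip_think(text: str) -> str:
    # Drop any <think>...</think> block (including the empty ones emitted
    # when thinking is disabled) and trim leftover whitespace.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

# e.g. strip_think(output2) returns just the final answer text.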
Fine-tuning Qwen3
Now let’s fine-tune Qwen3 on a custom dataset using Unsloth, which streamlines every step, from spinning up the model to injecting adapters and logging metrics.
Unsloth’s FastLanguageModel API lets you load Qwen3 in just a couple of lines (with optional 4-bit quantization under the hood), while its built-in PEFT hooks enable LoRA adapter injection without boilerplate. It also integrates seamlessly with W&B Models and Weave for real-time experiment tracking and visualization, and handles optimizations like gradient checkpointing and memory-efficient kernels so you can focus on your data and hyperparameters.
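For instance, loading a 4-bit quantized Qwen3-8B really does take only a couple of lines. This is a minimal sketch; the fine-tuning script later in this guide keeps load_in_4bit=False for full precision:

from unsloth import FastLanguageModel

# Minimal sketch: load Qwen3-8B in 4-bit to cut GPU memory during experimentation.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048,
    load_in_4bit=True,  # flip to False for full-precision weights, as in the script below
)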
Fine-tuning is the process of taking a large, pre-trained language model and continuing its training on your own examples, so it can better follow your specific instructions or generate content in your desired style. By leveraging LoRA (Low-Rank Adaptation), we only train a small set of adapter parameters instead of updating the entire model, which slashes both training time and GPU memory use.
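To see why this slashes memory, here is a rough back-of-the-envelope sketch. The shapes are illustrative assumptions (rank 16 and a single 4096x4096 projection matrix), not measurements taken from Qwen3 itself:

# LoRA replaces a full update of W (d_out x d_in) with two small matrices:
# B (d_out x r) and A (r x d_in), so only r * (d_in + d_out) parameters train.
d_out, d_in, r = 4096, 4096, 16          # assumed layer shape; rank matches the r=16 used below
full_params = d_out * d_in               # ~16.8M parameters with full fine-tuning
lora_params = r * (d_in + d_out)         # ~131K parameters with LoRA adapters
print(f"LoRA trains {lora_params / full_params:.2%} of this matrix's parameters")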
To set up fine-tuning, we first prepare our dataset in a consistent schema (typically “instruction,” “input,” and “output” fields). A simple preprocessing function then merges each instruction and input into a single user prompt, pairs it with the desired response, and formats both exactly as Qwen3 expects for conversational data. This alignment ensures the model learns the precise mapping from your custom prompts to your targets.
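Concretely, a single record passes through that preprocessing roughly like this. The record values are made up for illustration, and the merging logic mirrors the formatting_prompts_func defined in the full script below:

record = {
    "instruction": "Summarize the following text.",
    "input": "Qwen3 ships with a thinking mode that can be toggled per request.",
    "output": "Qwen3 includes an optional, per-request thinking mode.",
}

# Merge instruction and input into one user turn; the target output becomes the assistant turn.
if record["input"].strip():
    user_message = f'{record["instruction"]}\n\n{record["input"]}'
else:
    user_message = record["instruction"]

messages = [
    {"role": "user", "content": user_message},
    {"role": "assistant", "content": record["output"]},
]
# tokenizer.apply_chat_template(messages, tokenize=False, enable_thinking=False)
# then produces the single training string the trainer consumes.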
In this process, we also take advantage of tools like Weights & Biases to monitor and visualize our training as it happens, tracking things like loss and learning rate across epochs. This helps us spot problems quickly and compare different training runs.
By combining these strategies—parameter-efficient adaptation, proper data formatting, and good experiment tracking—fine-tuning Qwen3 becomes accessible and efficient, whether you’re training on a few hundred examples or scaling up to much larger, domain-specific datasets.
import random
import numpy as np
import torch

SEED = 3407
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# ====== REST OF SCRIPT ======
from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

max_seq_length = 2048
dtype = None
load_in_4bit = False
MODEL_NAME = "unsloth/Qwen3-8B"
SAVE_DIR = "lora_model"

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=SEED,
    use_rslora=False,
    loftq_config=None,
)

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        if input_text.strip():
            user_message = f"{instruction}\n\n{input_text}"
        else:
            user_message = instruction
        messages = [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": output},
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
            enable_thinking=False,
        )
        texts.append(text)
    return {"text": texts}

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
half_len = len(dataset) // 2
dataset = dataset.select(range(half_len))
dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=2)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=SEED,  # Make sure to set this!
        output_dir="outputs",
        report_to="wandb",
    ),
)

trainer.train()

FastLanguageModel.for_inference(model)

user_query = "Continue the Fibonacci sequence.\n\n1, 1, 2, 3, 5, 8"
messages = [
    {"role": "user", "content": user_query},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    use_cache=False,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
)
print("\n=========== Output from in-memory model (just trained):")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

del model
del tokenizer
torch.cuda.empty_cache()

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=SAVE_DIR,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)

prompt2 = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs2 = tokenizer([prompt2], return_tensors="pt").to("cuda")
outputs2 = model.generate(
    **inputs2,
    max_new_tokens=2048,
    use_cache=False,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
)
print("\n=========== Output from reloaded model (after save/load):")
print(tokenizer.decode(outputs2[0], skip_special_tokens=True))
After running the code, we'll see our results logged to W&B:
Evaluating with W&B Weave
After fine-tuning, it’s important to directly compare your custom LoRA Qwen3 model to the original base model to understand their differences in real-world responses. For this, we use W&B Weave, a tool designed specifically for side-by-side model evaluation and interactive analysis.
With Weave, we can run both the base and LoRA models on the same set of held-out prompts. For each example, Weave captures not just the generated output but also valuable details like response time and the exact prompt used. Its comparison view then displays the answers from each model next to one another, making it easy to quickly spot where your fine-tuned model’s answers have become clearer, more relevant, or better aligned with your expectations.
Here’s the code for our evaluation:
import random
import numpy as np
import torch

SEED = 3407
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

import unsloth
from datasets import load_dataset
import weave
import asyncio
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = False
BASE_MODEL_NAME = "unsloth/Qwen3-8B"
LORA_MODEL_DIR = "lora_model"
N = 30

weave.init("q3")

# === GLOBAL: LOAD MODELS ONLY ONCE ===
BASE_MODEL, TOKENIZER = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL_NAME,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
LORA_MODEL, _ = FastLanguageModel.from_pretrained(
    model_name=LORA_MODEL_DIR,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
BASE_MODEL.eval().to("cuda")
LORA_MODEL.eval().to("cuda")
FastLanguageModel.for_inference(BASE_MODEL)
FastLanguageModel.for_inference(LORA_MODEL)

def make_prompt(instruction, input_text):
    if input_text.strip():
        user_message = f"{instruction}\n\n{input_text}"
    else:
        user_message = instruction
    return [{"role": "user", "content": user_message}]

def apply_chat_template_loss(sample, tokenizer):
    messages = make_prompt(sample["instruction"], sample["input"])
    messages.append({"role": "assistant", "content": sample["output"]})
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
        enable_thinking=False,
    )

def apply_chat_template_generation(sample, tokenizer):
    messages = make_prompt(sample["instruction"], sample["input"])
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

def output_only_loss(tokenizer, model, sample, device="cuda"):
    # 1. Prepare full prompt+output for loss
    prompt_plus_output = apply_chat_template_loss(sample, tokenizer)
    # 2. Prepare prompt only (for prefix length)
    prompt_only = make_prompt(sample["instruction"], sample["input"])
    prompt_only_str = tokenizer.apply_chat_template(
        prompt_only,
        tokenize=False,
        add_generation_prompt=False,
        enable_thinking=False,
    )
    # 3. Tokenize both
    tok_full = tokenizer(
        prompt_plus_output,
        return_tensors="pt",
        truncation=True,
        max_length=max_seq_length,
        padding="max_length",  # For safe shape ops
    )
    tok_prompt = tokenizer(
        prompt_only_str,
        return_tensors="pt",
        truncation=True,
        max_length=max_seq_length,
    )
    input_ids = tok_full["input_ids"].to(device)
    labels = input_ids.clone()
    # 4. Loss ONLY on output tokens
    prompt_len = tok_prompt["input_ids"].shape[-1]  # prompt tokens count (may be == 2048!)
    # Mask prompt tokens in labels
    labels[:, :prompt_len] = -100
    # Mask pad tokens if there
    if tokenizer.pad_token_id is not None:
        labels[input_ids == tokenizer.pad_token_id] = -100
    with torch.no_grad():
        loss = model(input_ids=input_ids, labels=labels).loss.item()
    return loss

def safe_generate(model, tokenizer, prompt, device="cuda"):
    # Tokenize prompt and ensure we never overflow model max length
    prompt_tok = tokenizer(
        [prompt],
        return_tensors="pt",
        truncation=True,
        max_length=max_seq_length,
    ).to(device)
    prompt_len = prompt_tok["input_ids"].shape[1]
    # Prevent overflow: at least generate 1, never beyond 2048
    max_gen = max(1, max_seq_length - prompt_len)
    with torch.no_grad():
        output = model.generate(
            **prompt_tok,
            max_new_tokens=max_gen,
            use_cache=False,
            temperature=0.7,
            top_p=0.8,
            top_k=20,
            min_p=0.0,
        )
    out_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return out_text

class QwenBaseModel(weave.Model):
    @weave.op()
    async def predict(self, instruction, input, output):
        sample = {
            "instruction": instruction,
            "input": input,
            "output": output,
        }
        # LOSS on output tokens only
        loss = output_only_loss(TOKENIZER, BASE_MODEL, sample)
        # GENERATION safely
        prompt_gen = apply_chat_template_generation(sample, TOKENIZER)
        output_text = safe_generate(BASE_MODEL, TOKENIZER, prompt_gen)
        return {"loss": loss, "output": output_text}

class QwenLoraModel(weave.Model):
    @weave.op()
    async def predict(self, instruction, input, output):
        sample = {
            "instruction": instruction,
            "input": input,
            "output": output,
        }
        # LOSS on output tokens only
        loss = output_only_loss(TOKENIZER, LORA_MODEL, sample)
        # GENERATION safely
        prompt_gen = apply_chat_template_generation(sample, TOKENIZER)
        output_text = safe_generate(LORA_MODEL, TOKENIZER, prompt_gen)
        return {"loss": loss, "output": output_text}

@weave.op()
def loss_only_scorer(output):
    return {"loss": output["loss"]}

# ====== Load LAST 10% of train and pick 30 samples ======
full_ds = load_dataset("yahma/alpaca-cleaned", split="train")
length = len(full_ds)
start = int(length * 0.9)
end = length
ds_last10 = full_ds.select(range(start, end))
samples = [
    dict(
        instruction=row["instruction"],
        input=row["input"],
        output=row["output"],
    )
    for row in ds_last10.select(range(N))
]

async def main():
    models = {
        "Qwen3-8B-base": QwenBaseModel(),
        "Qwen3-8B-LoRA": QwenLoraModel(),
    }
    scorers = [loss_only_scorer]
    for model_name, model in models.items():
        print(f"\n=== Evaluating {model_name} ===")
        evaluation = weave.Evaluation(
            dataset=samples,
            scorers=scorers,
            name=f"{model_name} LossEval",
        )
        results = await evaluation.evaluate(model)

if __name__ == "__main__":
    asyncio.run(main())
One of the major strengths of Weave is the ability to visually inspect responses and dive into particular samples where the models disagree or where the difference is especially stark. Rather than sorting through raw model outputs manually, Weave organizes and presents the information so you can easily trace patterns and make informed judgments about which model better serves your needs.
Moreover, the evaluation process in Weave is streamlined and highly reproducible. You can share interactive reports or dashboards with collaborators, bookmark specific prompt-result pairs for review, and track improvements across different model versions or fine-tuning runs. In the Weave evaluations dashboard, we can see that our fine-tuned model had a much lower loss on our evaluation set.

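Since the scorer logs raw cross-entropy loss over the output tokens, one quick way to interpret the gap is to convert the mean loss into perplexity. The numbers below are placeholders for illustration, not the values from this run:

import math

base_loss, lora_loss = 2.1, 1.4          # hypothetical mean output-token losses
print(f"base model perplexity: {math.exp(base_loss):.1f}")   # ~8.2
print(f"LoRA model perplexity: {math.exp(lora_loss):.1f}")   # ~4.1
# Lower perplexity means the fine-tuned model assigns higher probability
# to the reference responses in the held-out Alpaca split.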
Testing Qwen3 14B on AIME 2024
I was also curious how well this model can solve math questions with thinking mode enabled, so I benchmarked the 14B model on the AIME 2024 dataset and compared it to DeepSeek R1's distilled 14B model. Here's the code for evaluating Qwen3 14B on AIME 2024:
import os
import asyncio
import json
from datasets import load_dataset
from openai import OpenAI
from litellm import completion
import weave

weave.init("aime_evaluation")

# ==== CONFIGURATION ====
OPENROUTER_API_KEY = "your open router api key"  # Set your API key
os.environ["OPENAI_API_KEY"] = "your openai api key"  # for litellm

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)
extra_headers = {
    "HTTP-Referer": "https://your-site.com",  # Change for leaderboard credit
    "X-Title": "Qwen3-14B-AIME-Eval",
}
system_message = "Solve the following problem. put your final answer within \\boxed{}: "

# ==== Qwen3-14B via OpenRouter ====
async def qwen3_14b_openrouter_inference(prompt):
    resp = client.chat.completions.create(
        extra_headers=extra_headers,
        model="qwen/qwen3-14b",
        messages=[{"role": "user", "content": f"{system_message} {prompt}"}],
    )
    return resp.choices[0].message.content.strip()

class Qwen3_14B_OpenRouter_Model(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await qwen3_14b_openrouter_inference(text)

# ==== GPT-4o scorer via litellm ====
def run_inference_openai(prompt, model_id="gpt-4o-2024-08-06"):
    try:
        response = completion(
            model=model_id,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
        )
        if response and hasattr(response, 'choices') and len(response.choices) > 0:
            content = response.choices[0].message.content
            return content
        else:
            print("No content found in response")
            return None
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None

@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    # Check minimum response length (non-whitespace chars)
    if len("".join(model_output.split())) < 3:
        return {
            "correctness": False,
            "reasoning": "Model output too short (less than 3 non-whitespace chars).",
        }
    query = (
        "YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER.\n"
        "I WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER -> \n\n"
        f"Model's Answer (last 100 chars): {str(model_output)[-100:]}\n"
        f"Correct Answer: {label}\n\n"
        "Your task:\n"
        "1. State the model's predicted answer (answer only).\n"
        "2. State the ground truth (answer only).\n"
        "3. Determine if the model's final answer is correct (ignore formatting differences, etc.). "
        "RESPOND with the predicted and ground truth answer, "
        "followed with a JSON object containing the correctness encapsulated within the following delimiters:\n"
        "```json\n"
        "{ \"correctness\": true/false }\n"
        "```"
    )
    response = run_inference_openai(query, "gpt-4o-2024-08-06")
    if response is None:
        return {"correctness": False, "reasoning": "Inference failed."}
    try:
        json_start = response.index("```json") + 7
        json_end = response.index("```", json_start)
        correctness = json.loads(response[json_start:json_end].strip()).get("correctness", False)
    except (ValueError, IndexError):
        correctness = False
    return {"correctness": correctness, "reasoning": response}

# ==== LOAD DATASET (AIME 2024 Problems) ====
def load_ds():
    print("Loading AIME 2024 dataset from HuggingFace 🤗 ...")
    dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]
    return [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]

# ==== EVALUATION LOOP ====
async def run_evaluations():
    dataset = load_ds()
    print("Initializing models...")
    models = {
        "qwen3-14b-openrouter": Qwen3_14B_OpenRouter_Model(),
    }
    dataset_prepared = dataset
    print("Running evaluations...")
    scorers = [gpt4o_scorer]
    for model_name, model in models.items():
        print(f"\n=== EVALUATING {model_name.upper()} ===")
        evaluation = weave.Evaluation(
            dataset=dataset_prepared,
            scorers=scorers,
            name=f"{model_name} Evaluation",
        )
        results = await evaluation.evaluate(model)
        print(f"Results for {model_name}: {results}")
        # Print accuracy if possible
        if hasattr(results, "scores") and "gpt4o_scorer" in results.scores:
            correct = sum(1 for score in results.scores["gpt4o_scorer"] if score["correctness"])
            accuracy = correct / len(dataset_prepared) if dataset_prepared else 0
            print(f"{model_name} accuracy: {accuracy:.2%} ({correct}/{len(dataset_prepared)})")

if __name__ == "__main__":
    asyncio.run(run_evaluations())
Here I chose to use OpenRouter to avoid any memory issues that might occur during long thinking traces. Another nice feature of Weave Evaluations is that we can easily compare new evaluations to ones we ran previously. For example, I can compare this evaluation to the DeepSeek R1-14B evaluations I ran several months ago, without needing to re-run them.
Here are the results of my evaluation:

Here, Qwen3 14B scored 66.7% correctness, far ahead of R1-14B, which only managed 20%. This model is seriously impressive for its size, and it's incredible to see how fast these reasoning models are improving!
Catching bugs with Weave
Thanks to Weave’s visualizations, I caught a bug where blank model outputs were being interpreted as correct answers. The scorer was letting through empty or near-empty responses, and the judge LLM was sometimes accepting them. Once I added a check for at least three non-whitespace characters, these false positives disappeared. Subtle issues like this are much easier to spot with Weave.
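The fix itself is tiny. Here's the length guard in isolation, a standalone sketch of the same check that now lives inside gpt4o_scorer:

def is_substantive(model_output: str, min_chars: int = 3) -> bool:
    # Reject empty or near-empty outputs before they ever reach the judge LLM.
    return len("".join(model_output.split())) >= min_chars

assert not is_substantive("")                # blank output is filtered out
assert not is_substantive("  \n  ")          # whitespace-only output is filtered out
assert is_substantive("The answer is 204.")  # a real answer still reaches the judge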

Using the Weave EvaluationLogger
Weave also offers an alternative way to run evaluations through the EvaluationLogger. Instead of sticking to a rigid evaluation format, you can manually loop through your data, call your model, and log predictions and scores. This style makes it easy to fit Weave into existing evaluation code, especially if you've already written your own evaluation loop.
For example, here's a bit of code that shows how you might evaluate a model on the AIME 2024 dataset. This code block is not meant to be fully functional, and is intended to show the core methods used for the EvaluationLogger:
from weave.flow.eval_imperative import EvaluationLogger
import weave; weave.init("aime_evaluation")

eval_logger = EvaluationLogger(model="qwen3_14b_openrouter", dataset="AIME_2024")

for row in dataset:
    output = model.predict(row["text"])
    pred_logger = eval_logger.log_prediction(inputs={"text": row["text"]}, output=output)
    score = gpt4o_judge(row["label"], output)
    pred_logger.log_score("correctness", score)
    pred_logger.finish()

eval_logger.log_summary()  # call after loop exits
In this snippet, we manually loop through each row of the dataset, call the model to generate predictions, and log both the predictions and their corresponding scores using the EvaluationLogger. This approach gives you flexibility to structure the evaluation however you like, while still taking advantage of Weave’s dashboards, comparison tools, and visualizations. Here’s the full code for our evaluation using the EvaluationLogger:
import os
import asyncio
import json
import httpx
from litellm import completion
from datasets import load_dataset
import weave

weave.init("aime_evaluation")

OPENROUTER_API_KEY = "your openrouter api key"
os.environ["OPENAI_API_KEY"] = "your openai api key"

system_message = "Solve the following problem. put your final answer within \\boxed{}: "

@weave.op
async def qwen3_14b_openrouter_inference(prompt):
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://your-site.com",
        "X-Title": "Qwen3_14B_AIME_Eval",
    }
    body = {
        "model": "qwen/qwen3-14b",
        "messages": [{"role": "user", "content": f"{system_message} {prompt}"}],
    }
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                "https://openrouter.ai/api/v1/chat/completions",
                headers=headers,
                json=body,
                timeout=1200,
            )
            resp.raise_for_status()
            data = resp.json()
            return data["choices"][0]["message"]["content"].strip()
    except Exception as e:
        print(f"[OpenRouter ERROR] {e}")
        return ""

def run_inference_openai(prompt, model_id="gpt-4o-2024-08-06"):
    try:
        response = completion(
            model=model_id,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
        )
        if response and hasattr(response, 'choices') and len(response.choices) > 0:
            return response.choices[0].message.content
        return None
    except Exception as e:
        print(f"[GPT-4o scorer error] {e}")
        return None

@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    if len("".join(model_output.split())) < 3:
        return {"correctness": False, "reasoning": "Model output too short."}
    query = (
        "YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER.\n"
        f"Model's Answer (last 100 chars): {str(model_output)[-100:]}\n"
        f"Correct Answer: {label}\n\n"
        "Your task:\n"
        "1. State the model's predicted answer (answer only).\n"
        "2. State the ground truth (answer only).\n"
        "3. Determine if the model's final answer is correct.\n"
        "```json\n{ \"correctness\": true/false }\n```"
    )
    response = run_inference_openai(query, "gpt-4o-2024-08-06")
    if response is None:
        return {"correctness": False, "reasoning": "Inference failed."}
    try:
        json_start = response.index("```json") + 7
        json_end = response.index("```", json_start)
        correctness = json.loads(response[json_start:json_end].strip()).get("correctness", False)
    except (ValueError, IndexError):
        correctness = False
    return {"correctness": correctness, "reasoning": response}

async def run_evaluations():
    print("Loading AIME 2024 dataset from Hugging Face...")
    hf_dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]
    dataset = [{"text": row["Problem"], "label": row["Answer"]} for row in hf_dataset][:2]

    eval_logger = weave.flow.eval_imperative.EvaluationLogger(
        model="qwen3_14b_openrouter", dataset="AIME_2024_HF"
    )

    correct = 0
    total = 0
    print(f"Evaluating {len(dataset)} examples...")
    for row in dataset:
        prompt = row["text"]
        label = row["label"]
        model_output = await qwen3_14b_openrouter_inference(prompt)
        pred_logger = eval_logger.log_prediction(
            inputs={"text": prompt},
            output=model_output,
        )
        score_result = await gpt4o_scorer(label, model_output)
        pred_logger.log_score("correctness", score_result["correctness"])
        pred_logger.finish()
        if score_result["correctness"]:
            correct += 1
        total += 1

    accuracy = correct / total if total > 0 else 0
    eval_logger.log_summary()
    print(f"Accuracy: {correct}/{total} = {accuracy:.2%}")
    print("Evaluation logging complete. View results in the Weave UI.")

if __name__ == "__main__":
    asyncio.run(run_evaluations())
Conclusion
Qwen3 stands out as one of today's leading open large language models, offering powerful performance across reasoning, coding, and multilingual tasks. Its unique feature, which allows users to easily toggle its explicit reasoning on or off, provides exceptional flexibility, catering to different needs in experimentation and deployment. Combined with efficient fine-tuning techniques such as LoRA and intuitive evaluation tools like W&B Weave, Qwen3 can be effectively adapted and deployed in diverse, real-world scenarios, making it an attractive choice for practitioners seeking control and adaptability in their models.