Knowledge distillation: Teaching LLMs with synthetic data
Unlock the power of knowledge distillation by learning how to efficiently transfer complex model insights from teacher to student models, step by step.
Model deployment often comes with a trade-off: achieving high performance typically requires large, resource-intensive models. Knowledge distillation offers a solution by training a smaller, efficient "student" model to replicate the behavior of a larger "teacher" model. This process retains much of the original performance while significantly reducing computational requirements.
Whether you're deploying models on mobile devices, IoT systems, or looking to cut compute costs, knowledge distillation is invaluable. In this tutorial, we'll not only cover the basics but also introduce a newer technique, "distilling step-by-step," which leverages chain-of-thought prompting to generate effective distillation data for student models.

Table of contents
What is knowledge distillation?
Why is knowledge distillation important?
How does knowledge distillation help overcome challenges in deploying LLMs?
How does knowledge distillation work?
Principal components of a knowledge distillation system
The role of the teacher and student in knowledge distillation
Applications of knowledge distillation in NLP and LLMs
What is chain-of-thought prompting?
Integrating chain-of-thought prompting with Weave
Knowledge distillation, step-by-step
Our model
Experiments
The data
Knowledge distillation training
Tutorial: Knowledge distillation step-by-step
Conclusion
Recommended reading
More on knowledge distillation and synthetic data
What is knowledge distillation?
Knowledge distillation reduces a large model's size by training a smaller "student" model to replicate the behavior of a larger "teacher" model. The student learns from the teacher's outputs, such as final predictions, intermediate features, or softmax probabilities.
It allows for the deployment of high-performing models in environments with limited computational resources, such as mobile devices or edge computing, or simply allows you to use a smaller model in a similar deployment environment. The origins of knowledge distillation can be traced back to the work of Buciluă et al. in 2006, which laid the groundwork for model compression techniques. However, the concept gained significant traction with the seminal 2015 paper by Hinton et al., which formalized the idea of distilling the knowledge from a large neural network into a smaller one.
Why is knowledge distillation important?
The significance of knowledge distillation lies in its ability to reduce model size while preserving accuracy. This is particularly important in scenarios where deploying large models is impractical due to resource constraints. By distilling knowledge, smaller models can achieve performance levels comparable to their larger counterparts, making them more suitable for real-time applications and environments with limited hardware capabilities.
Traditional knowledge distillation compresses large models into smaller, efficient ones by having the smaller "student" model learn from the outputs of the larger "teacher" model. This process traditionally focuses on matching the final predictions or logits, enabling the student model to replicate the teacher's performance with reduced size and computational demands.
Recent advancements have improved upon this method. Feature-based distillation enhances the process by training the student model to mimic the teacher's intermediate representations, capturing more detailed data patterns and improving robustness. Response-based distillation further refines the technique by focusing on the probabilities or class distributions predicted by the teacher, helping the student model generalize better, especially in noisy or imbalanced datasets.
A notable advancement is the use of chain-of-thought (CoT) prompting in knowledge distillation. CoT prompting incorporates the reasoning process of the teacher model, not just its final outputs. This method provides more "supervision" to the student model, guiding it to also learn the underlying reasoning behind why the teacher model made a certain decision. Additionally, CoT-generated data enriches the training set, allowing smaller models to achieve high performance even with limited data.
These advancements make the distillation process more effective, enabling the deployment of powerful models in resource-constrained environments without sacrificing performance.
One major component necessary for distillation is a tool that allows you to record data from the teacher model so that it can be used later to train the student model. Weights & Biases has a powerful tool called Weave, a lightweight toolkit designed to monitor the performance of large language models in production.
Weave enables developers to track inputs, outputs, and code changes automatically, with minimal code. By integrating seamlessly with various LLM frameworks and APIs, Weave facilitates the creation, testing, and deployment of LLM-powered applications, making it an indispensable tool for developers in this rapidly advancing field. If you're looking to log the performance of your LLM, I highly recommend Weave.
How does knowledge distillation help overcome challenges in deploying LLMs?
Deploying large language models presents several challenges, including high computational requirements, latency issues, and energy consumption. Knowledge distillation addresses these challenges by compressing LLMs into smaller, more efficient models that retain much of the original model's performance.
Furthermore, the smaller model's reduced energy consumption makes it more sustainable and cost-effective, particularly in large-scale deployments. Knowledge distillation can also sometimes help the student model generalize better, often leading to improved performance in specific tasks while minimizing the risk of overfitting.
How does knowledge distillation work?
Various knowledge distillation methods have been developed to enhance the efficiency and effectiveness of training smaller models to replicate larger ones while preserving performance. Key variants include logit distillation, feature-based distillation, response-based distillation, and rationale distillation. Here's a breakdown of each, with a short code sketch of the most common loss terms after the list:
Logit Distillation: In this approach, the student model is trained to match the logits (pre-softmax outputs) of the teacher model. The student learns to mimic the teacher by minimizing the difference between its logits and those of the teacher. This method was popularized by Hinton et al. and serves as the foundation for many subsequent distillation techniques.
Feature-Based Distillation: In this method, the student model is trained to replicate not just the final outputs but also the intermediate feature representations of the teacher model. This approach helps the student model capture more nuanced patterns in the data, leading to improved performance. Techniques like "attention transfer" fall under this category, where the attention maps of the teacher are distilled into the student.
Response-Based Distillation: This method focuses on distilling the responses of the teacher model, such as predicted probabilities or class distributions, into the student model. By learning from the teacher's soft targets, the student model can generalize better and avoid overfitting to hard labels.
Distilling Rationales: A more recent approach involves distilling not just the outputs but also the reasoning or rationales behind the teacher model's predictions. This method leverages techniques like chain-of-thought prompting, where the teacher model generates intermediate reasoning steps that are then used to train the student. This approach has been shown to improve the student's performance with less training data and smaller model sizes.
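To make the logit- and response-based variants concrete, here is a minimal PyTorch sketch of the two loss terms: mean-squared error on raw logits for logit distillation, and KL divergence over temperature-softened probabilities for response-based distillation. The tensor shapes and the temperature value are illustrative assumptions rather than code from any particular library.

import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits):
    # Logit distillation: match the teacher's pre-softmax outputs directly.
    return F.mse_loss(student_logits, teacher_logits)

def response_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Response-based distillation: match the teacher's softened class distribution.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Random tensors standing in for a batch of 8 examples with 3 classes.
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
print(logit_distillation_loss(student_logits, teacher_logits))
print(response_distillation_loss(student_logits, teacher_logits))

Feature-based distillation follows the same pattern, except the MSE (or a similar loss) is applied to intermediate activations or attention maps rather than the final logits.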
Principal components of a knowledge distillation system
Generally speaking, a knowledge distillation system can be composed of four main components: the teacher model, the student model, the dataset, and the training objective. Each of these components plays a vital role in the process of transferring knowledge from the teacher model to the student model.
Teacher model
The teacher model is a large, pre-trained model that serves as the source of knowledge in the distillation process. It is typically a complex model with many parameters, capable of making highly accurate predictions. The outputs of the teacher model, which can include final predictions, logits, or even intermediate representations, are used to train the student model.
Student model
The student model is a smaller, more efficient model that is trained to replicate the behavior of the teacher model. It has fewer parameters and is designed to perform well with reduced computational resources. The goal of the student model is to learn from the outputs of the teacher model and achieve similar performance despite its smaller size.
Dataset
The dataset in a knowledge distillation system includes both the original data used to generate outputs from the teacher model and the synthetic data generated by the teacher model itself. This synthetic data consists of the teacher model’s outputs, such as predicted probabilities, logits, or intermediate features. The student model is trained on this combination of original and teacher-generated data to closely mimic the teacher's performance.
Training objective
The training objective is the loss function used to optimize the student model during training. It measures how closely the student model’s outputs match those of the teacher model. One common approach is minimizing the Kullback-Leibler (KL) divergence between the output distributions of the teacher and student models. This guides the student to replicate the teacher's patterns and relationships.
Another method involves logit matching, where the student model is trained to replicate the logits (pre-softmax activations) produced by the teacher model. This encourages the student to mirror the teacher's internal decision-making process, leading to better alignment in final predictions. Soft target distillation is closely related, focusing on aligning the student model with the soft targets (probabilities) generated by the teacher. This helps the student capture the teacher's nuanced understanding of the data, resulting in improved generalization.
Additionally, the student model can be trained directly to copy the teacher model's output in token space. This involves predicting the same sequence of tokens as the teacher, helping the student model generate coherent sequences while closely following the teacher's outputs.
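As a minimal sketch of that token-space approach (assuming the Hugging Face transformers library and the same google/t5-v1_1-base checkpoint used later in this tutorial), the student below is trained on a single teacher-generated output. The input text and the teacher's answer are placeholder examples.

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")

# Placeholder example: in practice, teacher_output would come from the teacher model.
source_text = "premise: A man in an orange vest leans over a pickup truck. hypothesis: A man is touching a truck."
teacher_output = "entailment"

inputs = tokenizer(source_text, return_tensors="pt")
labels = tokenizer(teacher_output, return_tensors="pt").input_ids

# The model's built-in cross-entropy loss trains the student to reproduce
# the teacher's output tokens exactly.
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
outputs.loss.backward()
print(outputs.loss.item())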
The role of the teacher and student in knowledge distillation
In knowledge distillation, the teacher and student models work together in a complementary manner to transfer knowledge effectively. The teacher model, typically a large and complex network, serves as the source of knowledge. It generates outputs such as logits, probability distributions, embeddings, or intermediate features that reflect its deep understanding of the task.
The student model is smaller and more efficient, designed to replicate the teacher's behavior and performance. Instead of learning directly from the original dataset labels, the student model is trained using the outputs provided by the teacher model, known as "soft targets." These outputs offer richer information than traditional labels, guiding the student in capturing complex patterns and relationships.
The choice of loss function in the distillation process varies depending on the specific goals. When the objective is to align the student's output probability distribution with that of the teacher, Kullback-Leibler (KL) divergence is often used. This loss function helps the student model mimic the teacher’s predictions by capturing the nuances in how the teacher model weighs different classes. In other cases, cross-entropy loss may be employed, particularly when the focus is on matching the hard labels or when the student learns from a combination of the teacher’s soft targets and the original dataset labels. For tasks like matching embeddings or regression, other loss functions such as mean squared error (MSE) or contrastive loss might be more appropriate.
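For classification-style distillation, these choices are often combined into a single objective: a temperature-scaled KL term on the teacher's soft targets plus a cross-entropy term on the original hard labels, mixed with a weighting factor. The sketch below illustrates that combination; the alpha and temperature values are illustrative assumptions.

import torch
import torch.nn.functional as F

def combined_distillation_loss(student_logits, teacher_logits, hard_labels,
                               alpha=0.5, temperature=2.0):
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard-label term: standard cross-entropy against the dataset labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
hard_labels = torch.randint(0, 3, (8,))
print(combined_distillation_loss(student_logits, teacher_logits, hard_labels))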
The student model may be trained on the same input data as the teacher or on a different dataset, which could include augmented or synthetic data generated by the teacher. The critical aspect is that the student model learns from the teacher’s outputs, guided by the chosen loss function, to closely replicate the teacher’s performance.
This dynamic between the teacher and student models allows the student to achieve high performance with significantly fewer resources, making knowledge distillation an effective strategy for deploying efficient models in resource-constrained environments.
Applications of knowledge distillation in NLP and LLMs
Knowledge distillation is widely applied in NLP and LLMs to create smaller, efficient models that retain the performance of their larger counterparts. For real-time applications like mobile chatbots or voice assistants, large models are distilled into compact versions, enabling fast responses on devices with limited computational power.
In production environments, distillation reduces latency, allowing systems like search engines or recommendation engines to maintain high accuracy while delivering quicker results. When fine-tuning models for specific tasks with limited data, distillation plays a crucial role by enabling smaller models to perform well even in specialized areas such as domain-specific sentiment analysis.
This technique is also pivotal in creating multilingual models, where large, resource-intensive models are distilled into smaller versions that handle multiple languages efficiently, supporting cross-lingual tasks with minimal data. In edge AI, knowledge distillation compresses models for deployment on devices like IoT systems or drones, facilitating real-time NLP tasks without the need for constant cloud connectivity. Additionally, distillation can enhance explainability by simplifying complex models into more interpretable versions, which is essential in fields like healthcare and finance where transparency is critical. Through these varied applications, knowledge distillation proves indispensable in optimizing LLMs and NLP systems for real-world use, where efficiency and scalability are as vital as maintaining high performance.
What is chain-of-thought prompting?
Chain-of-thought prompting is a technique used in training large language models that encourages the model to generate intermediate reasoning steps or rationales before arriving at a final answer. This method not only enhances the model’s ability to reason through complex problems but also provides greater transparency in how the model arrives at its conclusions.
In CoT prompting, the model is guided to articulate its thought process explicitly, making each decision point along the way clear. For instance, instead of directly predicting a label or answering a question, the model first breaks down the problem, explaining the logic or reasoning behind its approach. This intermediate reasoning is often crucial for tasks that require multi-step thinking, such as logical deduction, mathematical problem-solving, or interpreting ambiguous contexts.
Using CoT prompting improves the robustness and accuracy of LLMs, as it helps the model better understand and generalize from the data. Moreover, it facilitates the development of smaller, task-specific models that can perform well with less training data. The rationale provided by the model during CoT can also be used to generate synthetic data, enriching the training set with detailed examples that reflect real-world problem-solving scenarios.
Weave, a tool by Weights & Biases, excels at capturing and organizing production data, making it an ideal choice for storing synthetically generated CoT data in your LLM applications. By using CoT, where models articulate their reasoning step-by-step, you not only enhance performance across various tasks but also generate valuable datasets. These datasets, enriched with detailed rationales, can later be used to train smaller, more efficient models through distillation.
This approach ensures that the reasoning captured during production is systematically logged, providing a robust foundation for future model training and optimization. Below, I'll demonstrate how to integrate CoT prompting with Weave to capture this data during inference, setting the stage for effective distillation and model refinement.
Integrating chain-of-thought prompting with Weave
First, make sure to install the latest version of Weave with the following command:
pip install -U weave
Next, I'll share a script that uses Ollama and Llama 3.1 to generate a response using chain-of-thought prompting:
import weave
import ollama

# Initialize Weave to start logging
weave.init('cot-example')

# Define the Chain-of-Thought example
cot_prompt = '''Premise: An adult dressed in black holds a stick.
Hypothesis: An adult is walking away, empty-handed.
Rationale: The premise states that the adult is holding a stick, which means they are not empty-handed. Therefore, the hypothesis that the adult is empty-handed contradicts the premise.
Label: contradiction
Premise: A child in a yellow plastic safety swing is laughing as a dark-haired woman in pink and coral pants stands behind her.
Hypothesis: A young mother is playing with her daughter in a swing.
Rationale: The premise describes a woman standing behind a child but does not explicitly state that they are mother and daughter. The hypothesis assumes a relationship that isn’t clearly defined in the premise, making the relationship neutral.
Label: neutral
Premise: A man in an orange vest leans over a pickup truck.
Hypothesis: A man is touching a truck.
Rationale:'''

# Decorate the function to log with Weave
@weave.op
def chat_with_model(prompt: str) -> dict:
    response = ollama.chat(model='llama3.1', messages=[
        {
            'role': 'user',
            'content': prompt,
        },
    ])
    return response

# Use the Chain-of-Thought examples to chat with the model
response = chat_with_model(cot_prompt)

# Print the model's response
print(response['message']['content'])
Here, we provide a few example rationales in the prompt, which guides the model to generate a useful rationale for the new query. Because we use Weave, the model's inputs and outputs are tracked and logged, so the synthetic data we collect reflects real-world usage and can later be reused for training. While everyone has access to the latest GPT models, the real competitive edge comes from how well you manage and utilize the data generated during model interactions. By carefully tracking and analyzing this data with tools like Weave, AI developers can refine their models more effectively and stay ahead of the competition.
The @weave.op decorator tracks every input and output of our function, so all data going in and out of the model is captured and can later be leveraged for distillation training or any other future training technique.
Here's what your data will look like inside Weave after running the code:

Knowledge distillation, step-by-step
In our exploration of knowledge distillation, we will conduct a series of experiments to evaluate the effectiveness of different training methodologies. Among these, we are particularly interested in a novel approach called "distilling step-by-step." This method leverages the reasoning capabilities of large language models by extracting the rationales or step-by-step explanations that LLMs provide when making predictions.
To generate these rationales, the teacher model is prompted using a technique known as "Chain-of-Thought". These CoT-generated rationales are then used as additional supervision to train smaller, more efficient models in the distilling step-by-step approach.
Unlike traditional distillation, where the focus is solely on replicating the final predictions of a teacher model, this method incorporates the intermediate reasoning steps into the training process. This enables the student model not only to match the final outputs but also to learn the underlying reasoning process, resulting in models that require less training data and can outperform much larger LLMs. This makes distilling step-by-step a highly data-efficient and computation-efficient approach for deploying task-specific models.
Our model
The model we are using for this exploration of knowledge distillation is the T5 base model, an encoder-decoder transformer architecture with approximately 220 million parameters. It's worth noting that this model is not a decoder-only transformer, which is the more common architecture choice for many popular large language models; however, you will likely see similar results with other architectures, including decoder-only transformers.
For these experiments, we will apply the T5 base model to the e-SNLI dataset, which offers a rich collection of entailment pairs. This dataset is particularly suitable for assessing how well a model can generalize when subjected to different training techniques.
Experiments
Our knowledge distillation experiment will begin with a standard fine-tuning approach where the model is trained directly on the ground-truth labels from the e-SNLI dataset. Following this, we will explore standard distillation, a method that uses labels generated by a larger language model to train a smaller model. This will allow us to observe how effectively the smaller model can replicate the performance of its larger counterpart when given only the teacher's predictions, without rationales.
Finally, we will investigate a step-by-step distillation method, which goes beyond merely replicating final predictions. This approach involves distilling the reasoning or intermediate steps that the teacher model uses to arrive at its decisions. By comparing these different approaches, particularly in scenarios where training data is limited, we aim to uncover the strengths and limitations of each technique.
The data
The e-SNLI (explainable SNLI) dataset is an extension of the Stanford Natural Language Inference (SNLI) dataset, designed to include human-written explanations for each inference. It contains sentence pairs, where the task is to determine whether the relationship between the two sentences is entailment, contradiction, or neutral.
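If you want to inspect the raw e-SNLI data yourself, a minimal sketch using the Hugging Face datasets library looks like the following. The dataset id and the field names noted in the comments are my assumptions based on the public hub copy; the repo we use later ships its own pre-processed files.

from datasets import load_dataset

# Assumption: the public e-SNLI copy on the Hugging Face hub is named "esnli".
dataset = load_dataset("esnli", split="train")

example = dataset[0]
# Typical fields: premise, hypothesis, label (0 = entailment, 1 = neutral,
# 2 = contradiction), and human-written explanations such as explanation_1.
print(example)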
For the specific step-by-step distillation method we are using, the rationales provided in the e-SNLI dataset are supplemented or generated by a larger model (PaLM 540B). These rationales explain why a particular inference was made, making them crucial for training the student model to not only replicate the final predictions but also capture the reasoning process of the teacher model. This additional layer of information is particularly valuable for tasks where interpretability and the ability to generate explanations are important. The earlier Ollama code example shows a sample from the dataset, if you are interested in its format and structure.
Earlier, we used Weave to track our CoT responses from our LLM. However, to replicate the Distilling Step-by-Step experiments, we need a decent amount of data, so we will use a pre-generated dataset created with the PaLM 540B model, which includes rationales. This data can be found in the official repo here.
Knowledge distillation training
With CoT knowledge distillation, we are focused on training a student model to not only produce accurate predictions but also to understand and internalize the reasoning behind those predictions. This is achieved by using task prefixes during the training process. The student model learns to generate both the correct label and the underlying rationale, but at inference time, it only needs to produce the label, making the process more efficient.
Task prefixes like [label] and [rationale] are prepended to input examples, signaling the model to either generate the predicted label or the rationale that explains the prediction. This strategy allows the model to differentiate between tasks and learn how to provide accurate outputs based on the given prefix.
The advantage of using task prefixes is that the model can be trained to understand the reasoning process (rationale) during training without the need to generate these rationales during inference. This approach contrasts with an alternative where the model would be required to generate both the prediction and the rationale at inference time. In that scenario, the model would need to perform additional computations to produce detailed explanations alongside predictions, leading to increased computational costs and longer processing times. This could significantly slow down the model, especially in real-time applications or when deployed at scale.
By separating the training and inference tasks with task prefixes, our approach allows the model to internalize the rationale during training, making it capable of producing high-quality predictions without the extra step of generating explanations during inference. This results in a more efficient model, as it can deliver accurate predictions faster, without the latency associated with generating additional outputs for rationales.
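To make the prefixing scheme concrete, here is a minimal sketch of how a single e-SNLI example could be expanded into two training pairs, one per task prefix. The exact string format is an assumption for illustration; the official repo's preprocessing may format inputs differently.

def make_prefixed_examples(premise, hypothesis, label_text, rationale_text):
    source = f"premise: {premise} hypothesis: {hypothesis}"
    return [
        # The [label] prefix asks the model for the prediction only.
        {"input": f"[label] {source}", "target": label_text},
        # The [rationale] prefix asks the model for the reasoning instead.
        {"input": f"[rationale] {source}", "target": rationale_text},
    ]

pairs = make_prefixed_examples(
    premise="A man in an orange vest leans over a pickup truck.",
    hypothesis="A man is touching a truck.",
    label_text="entailment",
    rationale_text="Leaning over the truck implies physical contact with it.",
)
for pair in pairs:
    print(pair["input"], "->", pair["target"])

At inference time, only the [label]-prefixed input is needed, which is what keeps the deployed student fast.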
To implement this, Distilling Step-by-Step uses a modified version of the Seq2SeqTrainer called TaskPrefixTrainer. This custom trainer handles the dual-task learning process by computing a composite loss function that balances the prediction task and the rationale generation task. The loss is calculated as follows:
L = L_label + λ · L_rationale
In this equation, lambda acts as a weighting factor that allows us to control the importance of the rationale generation during training. By adjusting lambda, we can shift the model’s focus between improving its prediction accuracy and refining its ability to generate coherent rationales. This approach ensures that the student model is well-equipped to handle the prediction task effectively while having a deep understanding of the reasoning process learned during training.
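For intuition, here is a simplified sketch of how such a composite loss could be computed in a Seq2SeqTrainer subclass. This is not the repo's actual TaskPrefixTrainer implementation; the batch layout (separate 'pred' and 'expl' sub-batches) and the default weight are assumptions made for illustration.

from transformers import Seq2SeqTrainer

class SimpleTaskPrefixTrainer(Seq2SeqTrainer):
    """Simplified illustration of a dual-task distillation loss."""

    def __init__(self, *args, lam=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.lam = lam  # weight on the rationale-generation loss

    def compute_loss(self, model, inputs, return_outputs=False):
        # Assumed batch layout: each batch carries a label sub-batch and a
        # rationale sub-batch, both already tokenized with their task prefixes.
        label_outputs = model(**inputs["pred"])
        rationale_outputs = model(**inputs["expl"])
        loss = label_outputs.loss + self.lam * rationale_outputs.loss
        return (loss, {"pred": label_outputs, "expl": rationale_outputs}) if return_outputs else loss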
Now that we have an overall idea of the training process for step-by-step knowledge distillation, we will dive into how to reproduce the experiments from the paper.
Tutorial: Knowledge distillation step-by-step
First, you will need to clone the official repo, which can be found here. Next, we can unzip the datasets using the following command:
unzip datasets.zip
Now we are ready to train our models! We will start by simply training our T5 base model on the ground-truth examples in the dataset. This run does not use any distilled examples and serves as a baseline for comparison against the distillation methods.
Here's the run command for training on ground truth examples:
python run.py --from_pretrained google/t5-v1_1-base --dataset esnli --model_type standard --label_type gt --batch_size 4 --eval_steps 1000
This will train our model for 10,000 steps! Logging is already built into the code, and your results will appear in the "HuggingFace" project inside Weights & Biases. Here are the results for the first training run:
Run: grateful-leaf-192
Next, we will explore how regular knowledge distillation can enhance our model's performance. The first experiment involves training the T5 Base model using examples labeled by an LLM instead of ground truth labels. This approach uses regular distillation, focusing solely on the predicted labels without incorporating any rationales or explanations from the teacher model. This allows us to evaluate the model's ability to learn from synthetic data generated by a more advanced model.
Here's the command to train the model on LLM-labeled data (without rationales):
python run.py --from_pretrained google/t5-v1_1-base --dataset esnli --model_type standard --label_type llm --batch_size 4 --eval_steps 1000
Here are the results for regular distillation for 10,000 steps:
Run: flowing-vortex-193
Finally, we will further refine our approach by incorporating the "Distilling Step-by-Step" method during training, which involves the use of task prefixes. This method allows the student model to learn not only to make predictions but also to understand and replicate the reasoning process (rationale) behind those predictions.
To implement this, we'll use the following command, which includes task prefixes and uses a different LLM, PaLM, for generating the labels:
python run.py --from_pretrained google/t5-v1_1-base --dataset esnli --model_type task_prefix --label_type llm --llm palm --batch_size 4 --eval_steps 1000
Here are the results for this run, along with the previous two runs. The top-performing model is the one trained with distilling step-by-step, outperforming the models trained on ground-truth labels and on regular distillation data!
Run set: 3 runs
Conclusion
The exploration of knowledge distillation, particularly through innovative approaches like "Distilling Step-by-Step," underscores its transformative potential in optimizing model efficiency without sacrificing performance. By leveraging techniques such as Chain-of-Thought prompting and synthetic data generation, we can effectively train smaller models to replicate the behavior and reasoning of larger, more complex models. This not only enables the deployment of high-performing models in resource-constrained environments but also paves the way for faster, more efficient AI applications across various domains.
Recommended reading
Grokking: Improved generalization through over-overfitting
One of the most mysterious phenomena in deep learning: grokking is the tendency of neural networks to improve generalization through sustained overfitting.
Building a real-time answer engine with Llama 3.1 405B and W&B Weave
Infusing Llama 3.1 405B with internet search capabilities!
How to fine-tune Phi-3 Vision on a custom dataset
Here's how to fine-tune a state-of-the-art multimodal LLM on a custom dataset.
How to Fine-Tune LLaVA on a Custom Dataset
A tutorial for fine-tuning LLaVA on your own data!
More on knowledge distillation and synthetic data