SetFit: Efficient Few-Shot Learning Without Prompts
Introduction
Few-shot learning has gained popularity since GPT-3. However, GPT-3 is a 175-billion-parameter model and is much harder to deploy and use in practice. Since then, many approaches (like PEFT and PET) have been proposed to shrink the models needed for few-shot learning and make the process more efficient, especially for text classification. In this report, we will take a deep dive into a recent paper, Efficient Few-Shot Learning Without Prompts, from HuggingFace in collaboration with Intel Labs and the UKP Lab. With as few as 8 labeled samples per class, SetFit outperforms earlier methods in terms of accuracy and efficiency.
Some unique features of SetFit (as described by the authors) are given below:
🗣 No prompts or verbalisers: Current techniques for few-shot fine-tuning require handcrafted prompts or verbalisers to convert examples into a format that's suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from a small number of labeled text examples.
🏎 Fast to train: SetFit doesn't require large-scale models like T0 or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
🌎 Multilingual support: SetFit can be used with any Sentence Transformer on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.
Before diving into SetFit, let's first go through the problems with the previous approaches.
This article was written as a Weights & Biases Report which is a project management and collaboration tool for machine learning projects. Reports let you organize and embed visualizations, describe your findings, share updates with collaborators, and more. To know more about reports, check out Collaborative Reports.
Related work
In-context learning
In-context learning was popularized in the original GPT-3 paper as a way to use language models to learn tasks given only a few examples. During in-context learning, we give the language model (LM) a prompt that consists of a list of input-output pairs demonstrating a task. At the end of the prompt, we append a test input and let the LM make a prediction just by conditioning on the prompt and predicting the next tokens. No gradient updates are performed during this process. However, this approach has its challenges: it relies on very large models, and performance can vary considerably depending on how the prompt is phrased.
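To make this concrete, here is a hypothetical in-context prompt for a sentiment classification task (the examples, labels, and the generate call are made up for illustration):

# A made-up in-context learning prompt: the model sees a few labeled
# demonstrations and predicts the label for the last example by simply
# continuing the text. No gradient updates are performed.
prompt = """Review: The film was a delight from start to finish.
Sentiment: positive

Review: I walked out halfway through.
Sentiment: negative

Review: An unforgettable performance by the lead actress.
Sentiment:"""
# completion = language_model.generate(prompt)  # hypothetical call; the next token is the predicted label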
Pattern Exploiting Training (PET)
- PET was described in the paper Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference. It converts classification into a cloze-style (fill-in-the-blank) task using handcrafted prompts and verbalisers. ADAPET is an improved version of PET that is easier to train and requires less task-specific data.
Parameter Efficient Fine-tuning (PEFT)
PEFT was described in the paper Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. It introduced a model called T-Few, which is the current SOTA for few-shot classification in terms of accuracy. The approach was inspired by the earlier paper Parameter-Efficient Transfer Learning for NLP.
SetFit: Sentence Transformer Fine-Tuning
The authors of the SetFit paper propose SetFit, an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST). SetFit is based on Sentence Transformers, which are modifications of pretrained transformer models that use Siamese and triplet network structures to derive semantically meaningful sentence embeddings. The goal of these models is to minimize the distance between pairs of semantically similar sentences and maximize the distance between sentence pairs that are semantically distant.
SetFit uses a two-stage training approach. In the first stage, an ST is fine-tuned on the input data (which may consist of only a few labelled examples) in a contrastive, Siamese manner on sentence pairs. In the second stage, a text classification head is trained on the rich embeddings (sentence representations) generated by the fine-tuned ST from the first stage.

SetFit's fine-tuning and training block diagram
ST fine-tuning
- To handle the limited amount of labelled data, the authors adopt a contrastive training approach often used for image similarity.
- Given a small set of labeled examples $D = \{(x_i, y_i)\}$, where $x_i$ and $y_i$ are sentences and their class labels, respectively, for each class label $c \in C$ the authors generate a set of positive triplets $T_p^c = \{(x_i, x_j, 1)\}$ (pairs of sentences randomly chosen from the same class $c$).
- Similarly, a set of negative triplets is generated as $T_n^c = \{(x_i, x_j, 0)\}$ (pairs of sentences randomly chosen from different classes).
- Finally, the contrastive fine-tuning dataset $T$ is produced by concatenating the positive and negative triplets across all class labels: $T = \{(T_p^c, T_n^c)\}_{c \in C}$, where $|C|$ is the number of class labels and $R$ (a hyperparameter) is the number of pairs generated per class.
- We can see that this technique of creating data enlarges the training set in few-shot settings. For example, assuming a small number $K$ of labeled examples is given for a binary classification task, the potential size of the ST fine-tuning set is derived from the number of unique sentence pairs that can be generated, namely $K(K-1)/2$, which is significantly larger than $K$. A minimal illustration of this pair-generation step follows this list.
- In the process of contrastive fine-tuning, the functionality of the ST changes from generating generic sentence embeddings to generating topic (class-specific) embeddings.
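To make the pair-generation step concrete, here is a minimal sketch of how the positive and negative pairs could be built. The function and its num_iterations argument are illustrative and not SetFit's internal implementation:

import random

def generate_contrastive_pairs(sentences, labels, num_iterations=20):
    """Illustrative sketch: sample positive (same-class) and negative
    (different-class) sentence pairs for contrastive fine-tuning."""
    pairs = []
    for _ in range(num_iterations):
        for x, y in zip(sentences, labels):
            same = [s for s, l in zip(sentences, labels) if l == y and s != x]
            diff = [s for s, l in zip(sentences, labels) if l != y]
            if same:
                pairs.append((x, random.choice(same), 1.0))  # positive pair
            if diff:
                pairs.append((x, random.choice(diff), 0.0))  # negative pair
    return pairs

# Even 4 labeled sentences yield many contrastive training pairs
texts = ["great movie", "loved it", "terrible plot", "waste of time"]
labels = [1, 1, 0, 0]
print(len(generate_contrastive_pairs(texts, labels)))  # 160 pairs with the defaults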
Classification head training
- In the second stage, after the ST has been fine-tuned contrastively, the original labeled training data is encoded into one vector (sentence embedding) per training sample: $\mathrm{Emb}_{x_i} = ST(x_i)$, where $ST$ is the function representing the fine-tuned Sentence Transformer.
- The embeddings, along with their class labels, constitute the training set for the classification head: $T_{CH} = \{(\mathrm{Emb}_{x_i}, y_i)\}$.
- The authors used a logistic regression model as the classification head (however, any classifier can be plugged in here).
Inference
At inference time, the fine-tuned ST encodes an unseen input sentence and produces a rich sentence embedding. Next, the classification head trained in the second stage produces the class prediction for the input sentence based on its embedding. A minimal sketch of these two steps is shown below.
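Here is a minimal sketch of the head training and inference steps, using a Sentence Transformer body and a scikit-learn logistic regression head. For brevity it loads the base checkpoint directly; in SetFit this body would already have been contrastively fine-tuned:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Stand-in for the contrastively fine-tuned ST from stage one
fine_tuned_st = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

# Stage two: encode the original labeled examples and fit the classification head
train_texts = ["great movie", "loved it", "terrible plot", "waste of time"]
train_labels = [1, 1, 0, 0]
embeddings = fine_tuned_st.encode(train_texts)
head = LogisticRegression().fit(embeddings, train_labels)

# Inference: encode an unseen sentence and classify it from its embedding
test_embedding = fine_tuned_st.encode(["an unforgettable performance"])
print(head.predict(test_embedding))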
Now that we've seen how SetFit works, it's time to actually use it in practice for few-shot classification.
How to train a SetFit model for few-shot classification
First, we will import the required packages. We will be using the HuggingFace datasets library, which gives access to a wide range of datasets on the Hub. We also import CosineSimilarityLoss from the sentence_transformers package, which will be used for contrastive training, along with SetFitModel, SetFitTrainer, and sample_dataset from the setfit package itself.
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer, sample_dataset
We will load the sst2 dataset from the Hub using the load_dataset method, which downloads, caches, and splits the dataset into train, validation, and test splits. You can choose any dataset you want from the Hub by providing its dataset ID to the load_dataset method.
# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")
However, the dataset we loaded has a large number of labelled samples, making it appropriate for a full fine-tuning process. To test the SetFit method for few-shot classification, we'll simulate the few-shot regime by sampling 8 examples per class using the sample_dataset method.
We also create an evaluation dataset. In practical use-cases where you actually have a limited number of labelled examples, you need to decide how you want to validate your models. Since we are using a regular dataset in this example, we'll simply take the complete validation split produced by the load_dataset method.
# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_dataset = dataset["validation"]
Next, we load the SetFitModel using the from_pretrained method. This is analogous to how we load pretrained models in the HuggingFace transformers library. We provide the model ID of our choice from the Hub. You can choose any Sentence Transformer you like from the table in the Sentence Transformers documentation, filtering on performance, speed, and model size to pick whichever suits your use-case.
SetFitModel is a wrapper that combines a pretrained body from sentence_transformers with a classification head from either scikit-learn or SetFitHead (a differentiable head built on PyTorch with APIs similar to sentence_transformers).
# Load a SetFit model from the Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
Next, we initialize the SetFitTrainer (a lean version of the HuggingFace Trainer) written by the authors, which wraps the complete fine-tuning process of SetFit.
- It takes in the model, the train dataset, and the evaluation dataset.
- loss_class: the loss class from the sentence_transformers package used for contrastive learning. Different losses have been proposed for measuring the distance between positive and negative samples; the authors found that CosineSimilarityLoss works quite well.
- num_iterations: how many times the sampling process is carried out to create the augmented data for contrastive learning (this is the pair-generation hyperparameter discussed above in the explanation of the SetFit method).
- num_epochs: the number of epochs to use for contrastive learning.
- column_mapping: SetFit expects your texts and labels to be in columns named text and label, respectively. If your dataset's columns are named differently, you can provide a mapping to align the column names.
# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
    column_mapping={"sentence": "text", "label": "label"},  # Map dataset columns to text/label expected by trainer
)
After initializing the SetFitTrainer, we can simply train the model with the train() method. We can also evaluate our trained model using the evaluate() method.
# Train and evaluate
trainer.train()
metrics = trainer.evaluate()
Furthermore, we can push our trained model to the Hub so that anyone can use it by loading it with just its model ID, as shown in the example above. We can then run inference with the model by passing it a list of sentences.
# Push model to the Hub
trainer.push_to_hub("my-awesome-setfit-model")

# Download from Hub and run inference
model = SetFitModel.from_pretrained("<your_hf_username>/my-awesome-setfit-model")

# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
That's it! This is how simple it is to train and use SetFit for few-shot classification.
Using an end-to-end network with a differentiable head
Instead of using a classification head from scikit-learn (like LogisticRegression), we can train an end-to-end network by using SetFitHead, a differentiable head built on PyTorch.
The authors found that SetFitHead can achieve performance similar to a scikit-learn head. They use AdamW as the optimizer and scale down the learning rate by 0.5 every 5 epochs. The authors recommend using a large learning rate (e.g. 1e-2) for SetFitHead and a small learning rate (e.g. 1e-5) for the body.
The advantage of using an end-to-end network is that it opens up more options for optimizing inference. For example, if we want to export the model to ONNX for production or quantize it for faster inference, it is more convenient to do so with a single end-to-end neural network.
While loading the model, we need to set use_differentiable_head to True and specify the number of classes.
num_classes = 2

# Load a SetFit model from the Hub (with a differentiable head)
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    use_differentiable_head=True,
    head_params={"out_features": num_classes},
)
Next, we initialize the SetFitTrainer just as before. We first freeze the head and train only the body. Then there are two options for training the head: keep the body frozen and train only the head, or unfreeze both the head and the body and train them end-to-end.
# Train and evaluate
trainer.freeze()  # Freeze the head
trainer.train()  # Train only the body

# Unfreeze the head and keep the body frozen -> head-only training
trainer.unfreeze(keep_body_frozen=True)

# or

# Unfreeze the head and unfreeze the body -> end-to-end training
trainer.unfreeze(keep_body_frozen=False)
Next, when calling the train() method, we also pass some additional parameters as shown in the code below. Note that this functionality might change in future versions of SetFit.
trainer.train(
    num_epochs=25,  # The number of epochs to train the head or the whole model (body and head)
    batch_size=16,
    body_learning_rate=1e-5,  # The body's learning rate
    learning_rate=1e-2,  # The head's learning rate
    l2_weight=0.0,  # Weight decay on **both** the body and head. If `None`, will use 0.01.
)
This is how we can train an end-to-end SetFit model.
SetFit can also be used for multi-label data. Please refer to this part of the README of SetFit's GitHub Repository for more details.
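As a quick illustration (based on the SetFit README at the time of writing, so the exact argument values may change), a multi-label model is loaded by passing a multi_target_strategy when calling from_pretrained:

from setfit import SetFitModel

# "one-vs-rest" trains one binary classifier per label; the README also
# lists "multi-output" and "classifier-chain" strategies
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    multi_target_strategy="one-vs-rest",
)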
Scikit-learn head vs Differentiable head
In this section, we'll compare how the differentiable head stacks up against the scikit-learn head. For this experiment, several datasets from the HuggingFace Hub have been used (see the results table below).
The code has been omitted from this report for simplicity; you can view the complete code in the linked Colab notebook.
Colab notebook for comparing Scikit-learn head and differentiable head ->
We will visualize the results using the Weights & Biases Weave feature, which allows us to directly query W&B data, visualize it, and analyze it interactively.
The following table shows the accuracy of SetFit for each dataset. Note that the original data from the Hub was sampled using the sample_dataset method from SetFit to simulate the few-shot regime, and the final fine-tuned model was tested on the complete validation dataset. The avg_accuracy column averages accuracies across all datasets. We can see that the sklearn head performs better than the differentiable head, though the differentiable head is not far off. However, the differentiable head takes about twice as long to train as the sklearn head, as seen in the _wandb.runtime column.
Table showing the accuracy comparison between scikit-learn head vs differentiable head for various datasets
Here's the configuration that was used for running the above two experiments.
Configuration used for comparing scikit-learn head and differentiable head
Hyperparameter search using W&B Sweeps
SetFitTrainer provides a hyperparameter_search() method that you can use to find good hyperparameters for your data; it uses Optuna as the backend. The downside is that SetFit doesn't have a W&B integration at the time of writing this report, so you can't take advantage of the rich features of W&B for tracking your hyperparameter tuning process. The authors or the community might add a callback system that will allow easy W&B integration, as discussed in this issue.
Instead, we'll use W&B's own hyperparameter search functionality, a.k.a. Sweeps. With Sweeps, we get access to a rich suite of W&B features like the parallel coordinates chart, the parameter importance chart, and much more.
W&B Sweeps allow us to automate hyperparameter search and explore the space of possible models. For demonstration purposes, the sst2 dataset is being used; however, feel free to use your own dataset.
To create a Sweep, we first define a configuration that describes the search strategy, the parameters, and their possible values or the distributions to sample from. A minimal sketch is shown below.
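Here is a minimal sketch of such a configuration and how the Sweep is launched; the parameter ranges below are illustrative and not necessarily the ones used for the results shown later:

import wandb

# Illustrative Sweep configuration: random search over a few SetFit hyperparameters
sweep_config = {
    "method": "random",  # could also be "grid" or "bayes"
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-6, "max": 1e-3},
        "batch_size": {"values": [4, 8, 16, 32]},
        "num_iterations": {"values": [10, 20, 40]},
        "num_epochs": {"values": [1, 2, 3]},
    },
}

def train_with_config():
    with wandb.init() as run:
        config = run.config
        # ... build the SetFitTrainer with config.learning_rate, config.batch_size,
        # config.num_iterations and config.num_epochs, train, evaluate, then log:
        # run.log({"accuracy": metrics["accuracy"]})
        pass

sweep_id = wandb.sweep(sweep_config, project="setfit-sweeps")
wandb.agent(sweep_id, function=train_with_config, count=50)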
You can follow along the complete code from the colab notebook 👇
Let's look at the results of the Sweep, which was run on the sst2 dataset using the sentence-transformers/paraphrase-mpnet-base-v2 model.
- Parallel coordinates charts summarize the relationship between large numbers of hyperparameters and model metrics at a glance.
- The parameter importance chart tells you which of your hyperparameters were the best predictors of, and most highly correlated with, desirable values of your metrics. To learn how to read these charts, you can refer to the documentation here.
Hyperparameter Search using Weights & Biases Sweep
Key Insights from our Sweep
- batch_size has a strong negative correlation with accuracy, as seen in the parallel coordinates and parameter importance charts: a higher batch size leads to lower accuracy.
- learning_rate also has a strong negative correlation with accuracy: higher learning rates degrade performance.
Compressing a SetFit model with knowledge distillation
SetFit achieves state-of-the-art results in few-shot setups using paraphrase-mpnet-base-v2, a 110M-parameter model, as the underlying base model. However, in real-world deployments, even more efficient models are desirable. One of the interesting parts of this paper is the knowledge distillation of a trained SetFit model into a smaller version that can run inference much faster, with little to no drop in accuracy. Here's an end-to-end demonstration:
Note: The code for this part is taken from a pull request currently open on the SetFit GitHub repository, which will be released in the next version.
For the distillation tests, the authors used AGNews, Emotion, and SST-5. For this code example, we'll use the AGNews dataset.
First, we import the required libraries.
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer, DistillationSetFitTrainer, sample_dataset
Next, we load the dataset and sample a few-shot dataset, just as we did previously, to train the teacher model. We also create a dataset of unlabeled examples to train the student model.
# Load a dataset from the Hugging Face Hub
dataset = load_dataset("ag_news")

# Create a sample few-shot dataset to train the teacher model
train_dataset_teacher = sample_dataset(dataset["train"], label_column="label", num_samples=16)

# Create a dataset of unlabeled examples to train the student
train_dataset_student = dataset["train"].shuffle(seed=0).select(range(500))

# Dataset for evaluation
eval_dataset = dataset["test"]
After creating the datasets, we load the teacher model and train it on the few-shot dataset using the SetFitTrainer. For the teacher, we'll use the same model as before (paraphrase-mpnet-base-v2).
# Load teacher model
teacher_model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer for teacher model
teacher_trainer = SetFitTrainer(
    model=teacher_model,
    train_dataset=train_dataset_teacher,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
)

# Train teacher model
teacher_trainer.train()
After training the teacher model, we load the student model. We will use paraphrase-MiniLM-L3-v2, the fastest model in the Sentence Transformers table, as our student. The student model is trained using DistillationSetFitTrainer, which takes the teacher model, the student model, and the dataset to train the student on.
The complete knowledge distillation process that happens behind the scenes is explained below 👇
- The student model is trained using sentence pairs and the level of similarity between each pair as input. The similarity targets are generated by using the underlying Sentence Transformer (ST) of the teacher model (the one fine-tuned in the first training stage, before the classification head) to produce semantic embeddings for each pair and calculating the cosine similarity between them (a conceptual sketch of this step follows the list).
- The underlying ST of the student model is trained to mimic the teacher's ST by minimizing the error between the teacher-produced cosine similarity and its own output.
- The classification head of the student model is then trained using the embeddings produced by the student's ST and the logits produced by the SetFit teacher's classification head.
- The baseline student (used for comparison in the paper) is trained to mimic the teacher's output directly, by minimizing the error between the logits produced by the SetFit teacher's classification head and its own output.
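Conceptually, the first step can be sketched as follows; this is a simplified illustration of the idea, not the actual internals of DistillationSetFitTrainer:

import itertools
from sentence_transformers import InputExample, util

def build_distillation_pairs(unlabeled_texts, teacher_st, max_pairs=1000):
    """Sketch: label unlabeled sentence pairs with the teacher's cosine similarity.
    The student's ST can then be fine-tuned on these pairs with CosineSimilarityLoss
    so that it learns to mimic the teacher's embedding space."""
    teacher_embeddings = teacher_st.encode(unlabeled_texts, convert_to_tensor=True)
    examples = []
    pair_indices = itertools.combinations(range(len(unlabeled_texts)), 2)
    for i, j in itertools.islice(pair_indices, max_pairs):
        similarity = util.cos_sim(teacher_embeddings[i], teacher_embeddings[j]).item()
        examples.append(InputExample(texts=[unlabeled_texts[i], unlabeled_texts[j]], label=similarity))
    return examples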
# Load small student model
student_model = SetFitModel.from_pretrained("paraphrase-MiniLM-L3-v2")

# Create trainer for knowledge distillation
student_trainer = DistillationSetFitTrainer(
    teacher_model=teacher_model,
    train_dataset=train_dataset_student,
    student_model=student_model,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20,
    num_epochs=1,
)

# Train student with knowledge distillation
student_trainer.train()
Here are some of the results from the paper. The SetFit student is the student model trained using the process described above, and the baseline student is a standard transformer encoder model (MiniLM-L3-H384-uncased) of the same size as our student model. We can observe that the SetFit student significantly outperforms the baseline student when only small amounts of unlabeled data are available.
SetFit Efficiency
One of the major focuses of this work is model efficiency, since the previous SOTA methods weren't easily deployable or scalable.
The authors estimate the computational cost of inference and training in FLOPs (you can find the exact formulation in the paper). Roughly:
cost_inference ≈ FLOPs-per-token × T
cost_train ≈ FLOPs-per-token × T × s × b
where T is the input sequence length, s is the number of training steps, and b is the batch size (the FLOPs-per-token figure is higher for training than for inference because training also includes the backward pass).
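For instance, using the rough training-cost estimate above with purely hypothetical numbers (none of these figures are taken from the paper), we can compare one training run of a small and a large model:

# Hypothetical comparison using the rough training-cost estimate above.
# Assumes ~6*N training FLOPs per token (a common rule-of-thumb estimate),
# which is an assumption for illustration, not a figure from the paper.
def train_cost(flops_per_token, seq_len, steps, batch_size):
    return flops_per_token * seq_len * steps * batch_size

small = train_cost(flops_per_token=6 * 110e6, seq_len=128, steps=1000, batch_size=16)  # ~110M params
large = train_cost(flops_per_token=6 * 3e9, seq_len=128, steps=1000, batch_size=16)    # ~3B params
print(f"relative training cost: {large / small:.0f}x")  # ~27x under these assumptions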
Here are the computational cost results from the paper as calculated by the authors.
As seen in the table above, we can get as much as a 123x speed-up (using the distilled SetFit model) over the current SOTA T-Few model, with only a slight hit to accuracy. Furthermore, the storage cost of the SetFit models (70MB for the distilled model and 420MB for the SOTA SetFit model) is 26 to 163 times smaller than the T0-3B checkpoint used by T-Few 3B (11.4GB), making real-world deployment feasible.
Conclusion
SetFit is a new few-shot text classification approach. It has several advantages over comparable approaches such as T-Few, ADAPET, and PERFECT. SetFit doesn't require any prompt engineering and therefore avoids the performance variability and instability that prompts introduce. SetFit can also be used in multilingual settings by simply swapping the underlying ST backbone for any multilingual ST of choice. Furthermore, SetFit is much faster at inference and training and doesn't require huge compute to reach SOTA results. Finally, SetFit can be made even faster through knowledge distillation.
For more HuggingFace related reports, feel free to check out any of the following. Thanks for reading!
Compare Methods for Converting and Optimizing HuggingFace Models for Deployment
In this article, we'll walk through how to convert trained HuggingFace models to slimmer, leaner models for deployment with code examples.
Unconditional Image Generation Using HuggingFace Diffusers
In this article, we explore how to train unconditional image generation models using HuggingFace Diffusers, and we track these experiments and compare the results using Weights & Biases.
Running Stable Diffusion on an Apple M1 Mac With HuggingFace Diffusers
In this article, we look at running Stable Diffusion on an M1 Mac with HuggingFace diffusers, highlighting the advantages — and the things to watch out for.
Hyperparameter Search for HuggingFace Transformer Models
In this article, we will explore how to perform hyperparameter search for pre-trained HuggingFace transformer models, making use of Weights & Biases Sweeps.
Hyperparameter Optimization for HuggingFace Transformers
This article explains three strategies for hyperparameter optimization for HuggingFace Transformers, using W&B to track our experiments.
How To Fine-Tune Hugging Face Transformers on a Custom Dataset
In this article, we will learn how to easily fine-tune a HuggingFace Transformer on a custom dataset with Weights & Biases.