
Brief Introduction to P-Tuning

This article aims to provide a brief overview of the paper "GPT Understands, Too", joint work from Tsinghua University and MIT that introduced P-Tuning as a way to efficiently tune pretrained language models, along with code and interactive visualizations.
Ever since the wide-scale adoption of GPT-3, large language models (LLMs) have risen in popularity, which has led to an increased interest in studying techniques to efficiently adapt such pretrained language models.
In the paper GPT Understands, Too, the authors present a new method, titled P-Tuning, that employs a new paradigm of continuous prompts along with discrete prompts to stabilize training during PLM adaptation and improve benchmark performance.






Motivation

Pretrained language models (commonly abbreviated as PLMs) have significantly improved benchmarks in natural language understanding (NLU) with the help of smarter prompt-based methods such as discrete prompting or priming. However, the authors find that such methods lead to extremely unstable performance: changing a single word in the prompt can cause a massive drop in model performance.
In discrete prompting, we attempt to enhance the performance of PLMs by supplying manually written prompt patterns as additional input (such as the work done in It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners). The authors note that changing a single word in the prompt can sometimes lead to a roughly 20-point (around 40% relative) drop in precision! While some recent approaches in the field of automatic prompting have attempted to search for better discrete prompts for a given task, this doesn't solve the unstable nature of discrete prompts.
Using this as motivation, the authors propose a new method called P-Tuning, which aims to learn trainable continuous prompt embeddings along with discrete prompts. This method leads to stable training and improved benchmark performance, and works well for both frozen and tuned PLMs under both fully-supervised and few-shot settings.

P-Tuning Framework

Let's define some simple notation first:
  • Let $\mathcal{M}$ be a PLM with a hidden size of $h$ and a vocabulary size of $|\mathcal{V}|$.
  • Let $\{(x_i, y_i)\}$ be a labeled dataset for an NLU task, where $x_{0:n} = \{x_0, x_1, \dots, x_n\}$ is an input consisting of a sequence of discrete tokens and $y \in \mathcal{Y}$ is the corresponding label.
  • Our goal is then to estimate the conditional probability for classification $f_{\mathcal{M}}(x) = \hat{p}(y \mid x)$ with the parameters of $\mathcal{M}$.
💡 TL;DR: Given a discrete prompt as input, P-Tuning concatenates continuous prompt embeddings with the discrete prompt tokens and feeds them as the input to the model. These continuous prompts are updated using backpropagation to optimize the objective.
P-Tuning consists of two main components: Discrete Prompting and Continuous Prompting. Let's look into them one at a time.

Discrete Prompting

Discrete prompting as a method to improve model performance was introduced in the paper It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners.
Let $[D_i]$ be a discrete prompt token. Each prompt can then be described as a template $T$:
$$T = \{[D_{0:i}],\ x,\ [D_{(i+1):j}],\ y,\ [D_{(j+1):k}]\}$$

This enables us to reformulate the labeled data as a cloze task. Let's look into an example from the paper for clarification:
For the task of predicting a country’s capital, a prompt could be “The capital of [INPUT] is [LABEL].” With a piece of labeled data “(Britain, London)”, the reformulated text would be “The capital of Britain is [MASK].”, where “[MASK]" should predict the given label “London”. - Section 2.2 of GPT Understands, Too
Then, both the discrete prompts and the data are mapped together into input embeddings using the model's pretrained embedding layer:
$$\{e(D_0), \dots, e(D_i), e(x_0), \dots, e(x_n), \dots, e(D_k)\}$$
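To make this concrete, here is a minimal sketch of the cloze reformulation and the embedding step using 🤗/transformers. This is an illustration under assumptions, not the paper's code: bert-base-cased is just an example of a masked language model checkpoint.

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Reformulate the labeled pair (Britain, London) as a cloze task.
text = f"The capital of Britain is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

# Map the discrete prompt tokens and the data into input embeddings e(.)
# using the model's pretrained embedding layer.
embeddings = model.get_input_embeddings()(inputs["input_ids"])
print(embeddings.shape)  # (1, sequence_length, hidden_size)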


🧘🏻 Continuous Prompting

Let $[P_i]$ be the $i^{\text{th}}$ continuous prompt embedding. Similar to discrete prompting, we can also create a template for P-Tuning as follows:
$$T = \{[P_{0:i}],\ x,\ [P_{(i+1):j}],\ y,\ [P_{(j+1):k}]\}$$

P-Tuning leverages an extra embedding function $f: [P_i] \rightarrow h_i$ to map the template to:
$$\{h_0, \dots, h_i, e(x), h_{i+1}, \dots, h_j, e(y), h_{j+1}, \dots, h_k\}$$

This mapping function is the prompt encoder (implemented in the paper as a bidirectional LSTM followed by a two-layer MLP). It is based on the intuition that modeling the dependency between different prompt embeddings through a shared encoder is more convenient than using independent learnable embeddings. The discrete prompts and continuous prompts are then concatenated together.
Figure 1: Comparison between Discrete Prompting and P-Tuning as methods for language model adaptation.
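To illustrate, below is a minimal PyTorch sketch of such a prompt encoder, assuming the bidirectional-LSTM-plus-MLP design described in the paper; the class name and dimensions are illustrative, not the authors' released code.

import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Sketch: maps pseudo tokens [P_i] to continuous prompts h_i."""

    def __init__(self, num_prompts: int, hidden_size: int):
        super().__init__()
        # Learnable embeddings for the pseudo tokens [P_i].
        self.embedding = nn.Embedding(num_prompts, hidden_size)
        # The bi-LSTM models dependencies between prompt embeddings.
        self.lstm = nn.LSTM(
            hidden_size,
            hidden_size // 2,
            bidirectional=True,
            batch_first=True,
        )
        # Two-layer MLP head applied to the LSTM outputs.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self) -> torch.Tensor:
        indices = torch.arange(self.embedding.num_embeddings)
        prompts = self.embedding(indices).unsqueeze(0)  # (1, k, h)
        hidden, _ = self.lstm(prompts)
        return self.mlp(hidden).squeeze(0)              # h_0, ..., h_k

# The resulting h_i are concatenated with the discrete embeddings e(x)
# before being fed to the (frozen or tuned) PLM.
encoder = PromptEncoder(num_prompts=20, hidden_size=768)
print(encoder().shape)  # torch.Size([20, 768])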

P-Tuning: Code

The most widely used library for using P-Tuning and other efficient fine-tuning methods is the 🤗/peft library. It works seamlessly with the transformers ecosystem, allowing for easy integration with various Trainer APIs as well. It has excellent abstractions such as PromptEncoderConfig and get_peft_model, which allow us to convert any transformers model into a 🤗/peft model.
As P-Tuning uses a prompt encoder to optimize the prompt parameters, we'll need to use the PromptEncoderConfig class. The following is a valid config for sequence classification using P-Tuning:
from peft import PromptEncoderConfig

peft_config = PromptEncoderConfig(
    task_type="SEQ_CLS",       # sequence classification task
    num_virtual_tokens=20,     # number of continuous prompt tokens to learn
    encoder_hidden_size=128,   # hidden size of the prompt encoder
)
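
With the config defined, any transformers model can be wrapped in a single call to get_peft_model. A minimal sketch, assuming roberta-base as an illustrative checkpoint:

from transformers import AutoModelForSequenceClassification
from peft import get_peft_model

# Any sequence-classification checkpoint works here; roberta-base is illustrative.
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
model = get_peft_model(model, peft_config)

# Only the prompt encoder parameters are trained; the underlying PLM stays frozen.
model.print_trainable_parameters()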

P-Tuning: Results

Let's look at how the evaluation accuracy and F1 score vary when evaluating roberta-base and xlm-roberta-base on the GLUE benchmark using P-Tuning as an adaptation method.



Conclusion

In this article, we gave a brief overview of P-Tuning as a method for efficiently adapting pretrained language models, and showed how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other LLM-related topics like Audio Transformers and hyperparameter optimization.