
Brief Introduction to P-Tuning

This article aims to provide a brief overview of the paper "GPT Understands, Too", joint work from Tsinghua University and MIT that introduced P-Tuning as a way to efficiently tune pretrained language models, along with code and interactive visualizations.
Ever since the wide-scale adoption of GPT-3, large language models (LLMs) have risen in popularity, which has led to an increased interest in studying techniques to efficiently adapt such pretrained language models.
In the paper GPT Understands, Too, the authors present a new method, titled P-Tuning, that employs a new paradigm of continuous prompts along with discrete prompts to stabilize training during PLM adaptation and improve benchmark performance.






Motivation

Pretrained language models (commonly abbreviated as PLMs) have significantly improved benchmarks in natural language understanding (NLU) with the help of smarter prompt-based methods such as discrete prompting or priming. However, the authors find that such methods lead to extremely unstable performance: changing a single word in the prompt can cause a massive drop in model performance.
In discrete prompting, we attempt to enhance the performance of PLMs by supplying manually written prompt patterns as additional input (such as the work done in It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners). The authors note that changing a single word in the prompt can sometimes lead to a roughly 20-point (around 40% relative) drop in precision! While some recent approaches in the field of automatic prompting have attempted to search for better discrete prompts for a given task, this doesn't solve the unstable nature of discrete prompts.
Using this as motivation, the authors propose a new method called P-Tuning, which aims to learn trainable continuous prompt embeddings along with discrete prompts. This method leads to stable training and improved benchmark performance, and works well for both frozen and tuned PLMs under both fully-supervised and few-shot settings.

P-Tuning Framework

Let's define some simple notation first:
  • Let $\mathcal{M}$ be a PLM with a hidden size of $h$ and a vocabulary size of $|\mathcal{V}|$.
  • Let $\{(x_i, y_i)\}$ be a labeled dataset for an NLU task, where $x_{0:n} = \{x_0, x_1, \dots, x_n\}$ is an input consisting of a sequence of discrete tokens and $y \in \mathcal{Y}$ is the corresponding label.
  • Our goal is then to estimate the conditional probability for classification $f_{\mathcal{M}}(x) = \hat{p}(y \mid x)$ with the parameters of $\mathcal{M}$.
💡 TL;DR: Given a discrete prompt as input, P-Tuning concatenates continuous prompt embeddings with the discrete prompt tokens and feeds them as the input to the model. These continuous prompts are updated using backpropagation to optimize the objective.
P-Tuning consists of two main components: Discrete Prompting and Continuous Prompting. Let's look into them one at a time.

Discrete Prompting

Discrete prompting as a method to improve model performance was introduced in the paper It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners.
Let $[D_i]$ be a discrete prompt token. Each prompt can then be described as a template $T$:
$$T = \{[D_{0:i}],\ x,\ [D_{(i+1):j}],\ y,\ [D_{(j+1):k}]\}$$

This enables us to reformulate the labeled data as a cloze task. Let's look into an example from the paper for clarification:
For the task of predicting a country’s capital, a prompt could be “The capital of [INPUT] is [LABEL].” With a piece of labeled data “(Britain, London)”, the reformulated text would be “The capital of Britain is [MASK].”, where “[MASK]" should predict the given label “London”. - Section 2.2 of GPT Understands, Too
Then, both the discrete prompts and the data are mapped together into input embeddings using the model's pretrained embedding layer:
$$\{e(D_0), \dots, e(D_i), e(x_0), \dots, e(x_n), \dots, e(D_k)\}$$
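To make this concrete, here is a minimal sketch of the cloze reformulation and the embedding step using 🤗/transformers. This is an illustration under assumptions, not the paper's code: bert-base-cased is just an example of a masked language model checkpoint.

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Reformulate the labeled pair (Britain, London) as a cloze task.
text = f"The capital of Britain is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

# Map the discrete prompt tokens and the data into input embeddings e(.)
# using the model's pretrained embedding layer.
embeddings = model.get_input_embeddings()(inputs["input_ids"])
print(embeddings.shape)  # (1, sequence_length, hidden_size)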


🧘🏻 Continuous Prompting

Let $[P_i]$ be the $i^{\text{th}}$ continuous prompt embedding. Similar to discrete prompting, we can also create a template for P-Tuning as follows:
$$T = \{[P_{0:i}],\ x,\ [P_{(i+1):j}],\ y,\ [P_{(j+1):k}]\}$$

P-Tuning leverages an extra embedding function $f: [P_i] \rightarrow h_i$ to map the template to:
$$\{h_0, \dots, h_i, e(x), h_{i+1}, \dots, h_j, e(y), h_{j+1}, \dots, h_k\}$$

This mapping function is the prompt encoder (implemented in the paper as a bidirectional LSTM followed by a two-layer MLP). It is based on the intuition that modeling the dependency between different prompt embeddings through a shared encoder is more convenient than using independent learnable embeddings. The discrete prompts and continuous prompts are then concatenated together.
Figure 1: Comparison between Discrete Prompting and P-Tuning as methods for language model adaptation.
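To illustrate, below is a minimal PyTorch sketch of such a prompt encoder, assuming the bidirectional-LSTM-plus-MLP design described in the paper; the class name and dimensions are illustrative, not the authors' released code.

import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Sketch: maps pseudo tokens [P_i] to continuous prompts h_i."""

    def __init__(self, num_prompts: int, hidden_size: int):
        super().__init__()
        # Learnable embeddings for the pseudo tokens [P_i].
        self.embedding = nn.Embedding(num_prompts, hidden_size)
        # The bi-LSTM models dependencies between prompt embeddings.
        self.lstm = nn.LSTM(
            hidden_size,
            hidden_size // 2,
            bidirectional=True,
            batch_first=True,
        )
        # Two-layer MLP head applied to the LSTM outputs.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self) -> torch.Tensor:
        indices = torch.arange(self.embedding.num_embeddings)
        prompts = self.embedding(indices).unsqueeze(0)  # (1, k, h)
        hidden, _ = self.lstm(prompts)
        return self.mlp(hidden).squeeze(0)              # h_0, ..., h_k

# The resulting h_i are concatenated with the discrete embeddings e(x)
# before being fed to the (frozen or tuned) PLM.
encoder = PromptEncoder(num_prompts=20, hidden_size=768)
print(encoder().shape)  # torch.Size([20, 768])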

P-Tuning: Code

The most widely used library for using P-Tuning and other efficient fine-tuning methods is the 🤗/peft library. It works seamlessly with the transformers ecosystem, allowing for easy integration with various Trainer APIs as well. It has excellent abstractions such as PromptEncoderConfig and get_peft_model, which allow us to convert any transformers model into a 🤗/peft model.
As P-Tuning uses a prompt encoder to optimize the prompt parameters, we'll need to use the PromptEncoderConfig class. The following is a valid config for sequence classification using P-Tuning:
from peft import PromptEncoderConfig

peft_config = PromptEncoderConfig(
    task_type="SEQ_CLS",       # sequence classification task
    num_virtual_tokens=20,     # number of continuous prompt tokens to learn
    encoder_hidden_size=128,   # hidden size of the prompt encoder
)
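
With the config defined, any transformers model can be wrapped in a single call to get_peft_model. A minimal sketch, assuming roberta-base as an illustrative checkpoint:

from transformers import AutoModelForSequenceClassification
from peft import get_peft_model

# Any sequence-classification checkpoint works here; roberta-base is illustrative.
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
model = get_peft_model(model, peft_config)

# Only the prompt encoder parameters are trained; the underlying PLM stays frozen.
model.print_trainable_parameters()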

P-Tuning: Results

Let's look at how the evaluation accuracy and F1 score vary when evaluating roberta-base and xlm-roberta-base on the GLUE benchmark using P-Tuning as an adaptation method.



Conclusion

In this article, we gave a brief overview of P-Tuning as a method for efficiently adapting pretrained language models, and showed how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other LLM-related topics like Audio Transformers and hyperparameter optimization.