Copy is All You Need?
This new method for language modeling is very clever! However, will it be as useful as attention?
In a departure from standard text generation models, where the output is built by sequentially selecting tokens from a fixed vocabulary, a new method reformulates text generation as a series of copy-and-paste operations over existing text collections.
The Lookup Table
This approach computes contextualized representations of meaningful text segments from the text collection and organizes them into an index for efficient retrieval. Text generation then becomes a matter of finding and copying a suitable text span from this diverse collection at each step.
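As a rough illustration, here is a minimal sketch of how such a phrase table might be assembled. The `encode_phrases` function is a hypothetical stand-in for COG's Transformer-based phrase encoder, and the corpus format and dimensions are made up for the example.

```python
import numpy as np

# Hypothetical stand-in for COG's phrase encoder: maps a document and a list
# of (start, end) character spans to one contextualized vector per span.
def encode_phrases(document: str, spans: list) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(document)) % (2**32))
    return rng.standard_normal((len(spans), 768)).astype(np.float32)

def build_phrase_table(corpus):
    """Collect every phrase string together with its contextualized vector."""
    phrases, vectors = [], []
    for document, spans in corpus:
        vecs = encode_phrases(document, spans)
        for (start, end), vec in zip(spans, vecs):
            phrases.append(document[start:end])
            vectors.append(vec)
    # Stacking the vectors yields a matrix that supports inner-product lookup;
    # at scale an approximate MIPS index (e.g., FAISS) would be used instead.
    return phrases, np.stack(vectors)

corpus = [("The quick brown fox jumps over the lazy dog", [(0, 19), (20, 43)])]
phrases, phrase_matrix = build_phrase_table(corpus)  # phrase_matrix: (num_phrases, 768)
```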
Advantages
The model, appropriately named COG (short for COpy-Generator), has demonstrated impressive results in both automatic and human evaluations. Not only does it improve text generation quality, its inference efficiency is also comparable to that of token-level autoregressive models, because it reduces the number of decoding steps needed: a sequence of multiple tokens (i.e., a multi-word phrase) can be generated in a single step.
Moreover, COG exhibits remarkable scalability: it can draw on larger text collections without any additional training, which proves particularly useful for domain adaptation and for expanding or filtering the data. Likewise, the model can switch to a domain-specific text collection without further training.
In short, the COG model reimagines text generation as a series of copy-and-paste operations from existing text collections, a significant shift from the traditional approach of next-token predictions. By doing this, COG offers better performance and adaptability in text generation tasks, setting a high standard in the field of language modeling.
Comparison to the Vanilla Transformer
The COG model differs from a vanilla Transformer language model in several ways:
Unit of Generation: While standard Transformers predict the next token based on previous ones, the COG model operates at the phrase level, thus producing text with enhanced coherence and fluency over longer spans. It can, however, generate at the token level when necessary.
Data Representation: In contrast to Transformers' context-independent embeddings, COG employs a hybrid approach of context-dependent phrase representations and context-independent token representations, granting the model more versatility in producing diverse and context-appropriate outputs.
Search Mechanism: COG selects its output with a Maximum Inner Product Search (MIPS) over its phrase and token tables, a strategy that scales to large candidate sets more efficiently than the softmax over a fixed vocabulary used in Transformers (see the sketch after this list).
Training Mechanism: Alongside autoregressive training, COG incorporates an InfoNCE loss function during training, aiming to bring the representations of semantically coherent prefixes and phrases closer together in the vector space.
Scalability and Adaptability: COG demonstrates exceptional scalability and adaptability, capable of improving performance simply by scaling up to larger text collections without further training, or effectively adapting to domain-specific collections without retraining.
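To make the difference in search mechanism concrete, here is a toy comparison of the two selection rules. The sizes and random matrices are placeholders: the point is that a vanilla Transformer scores only a fixed token vocabulary with a softmax, while COG scores a combined table of context-dependent phrase vectors and context-independent token embeddings with the same inner-product operation and simply takes the best-scoring entry.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 768
prefix_vec = rng.standard_normal(hidden).astype(np.float32)  # prefix representation

# Vanilla Transformer: softmax over a fixed token vocabulary.
token_embeddings = rng.standard_normal((5_000, hidden)).astype(np.float32)
logits = token_embeddings @ prefix_vec
shifted = np.exp(logits - logits.max())
probs = shifted / shifted.sum()
next_token_id = int(probs.argmax())

# COG: maximum inner product over phrases *and* tokens in a single table.
phrase_vectors = rng.standard_normal((20_000, hidden)).astype(np.float32)
candidate_table = np.concatenate([phrase_vectors, token_embeddings], axis=0)
scores = candidate_table @ prefix_vec
best = int(scores.argmax())
picked_a_phrase = best < len(phrase_vectors)  # otherwise COG emits a single token
```

In practice the phrase table is far too large for an exact scan like this, which is why an approximate MIPS index is used at inference time.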

Training
COG's training process is multi-faceted; here are the steps involved:
1. Data Preparation: Input documents are segmented into phrases using a forward maximum matching algorithm.
2. Phrase Encoding: For each document, the phrase encoder computes vector representations of all phrases.
3. Prefix Encoding: Vector representations of the prefix preceding each phrase are also calculated.
4. Loss Calculation: The training loss is the sum of the next-phrase prediction loss (an InfoNCE contrastive loss) and the standard token-level autoregressive loss (a minimal sketch follows these steps).
5. Backpropagation and Parameter Updates: Once the loss is calculated, it's backpropagated, and the model parameters are updated using an optimization algorithm.
6. Iteration: The process of encoding, calculating loss, backpropagation, and updating parameters is repeated over multiple epochs until model performance on a validation set stops improving.
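A minimal sketch of the combined objective in step 4 is shown below, assuming a batch in which each prefix representation is paired with the representation of its gold next phrase, with the other phrases in the batch acting as InfoNCE negatives. The tensor shapes and the token-level head are illustrative rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def cog_training_loss(prefix_reps, phrase_reps, token_logits, token_targets):
    """
    prefix_reps:   (B, d)    representation of the prefix before each gold phrase
    phrase_reps:   (B, d)    representation of the corresponding gold phrase
    token_logits:  (B, T, V) standard LM logits over the token vocabulary
    token_targets: (B, T)    gold next-token ids
    """
    # InfoNCE: each prefix should score its own phrase higher than the other
    # phrases in the batch (in-batch negatives).
    sims = prefix_reps @ phrase_reps.t()                 # (B, B) inner products
    labels = torch.arange(sims.size(0), device=sims.device)
    phrase_loss = F.cross_entropy(sims, labels)

    # Standard token-level autoregressive loss.
    token_loss = F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        token_targets.reshape(-1),
    )
    return phrase_loss + token_loss

# Toy usage with random tensors.
B, d, T, V = 8, 768, 16, 1000
loss = cog_training_loss(
    torch.randn(B, d), torch.randn(B, d),
    torch.randn(B, T, V), torch.randint(0, V, (B, T)),
)
```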
Inference
COG's inference process is aligned with its training methodology:
1. Initialization: The prefix is initialized as an empty sequence or a given context sequence if available.
2. Encoding: The prefix is fed into the Prefix Encoder to obtain its vector representation.
3. Phrase Selection: A Maximum Inner Product Search (MIPS) is performed over all phrases in the phrase table using the prefix representation, and the phrase with the largest inner product is selected as the next output (a simplified decoding loop is sketched after these steps).
4. Token-Level Generation: If a suitable phrase isn't found, COG can revert to a token-level generation strategy using a standard softmax function over the fixed token vocabulary.
5. Prefix Update: The selected phrase or token is added to the prefix, which is then fed back into the Prefix Encoder in the next time step.
6. Iteration: The process is repeated until a termination condition is met.
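Putting the steps together, a simplified greedy decoding loop might look like the sketch below. The `encode_prefix` function, the tiny candidate table, and the `<eos>` termination condition are placeholders for illustration; in COG the token vocabulary lives in the same table, so the fallback in step 4 amounts to selecting a token row instead of a phrase row.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 768

# Candidate table: contextualized phrase vectors followed by token embeddings.
candidates = ["the quick brown fox", "jumps over", "the", "dog", "<eos>"]
candidate_vectors = rng.standard_normal((len(candidates), hidden)).astype(np.float32)

def encode_prefix(prefix: str) -> np.ndarray:
    """Placeholder for COG's prefix encoder (a causal Transformer)."""
    local = np.random.default_rng(abs(hash(prefix)) % (2**32))
    return local.standard_normal(hidden).astype(np.float32)

def generate(prefix: str, max_steps: int = 10) -> str:
    for _ in range(max_steps):
        query = encode_prefix(prefix)              # step 2: encode the prefix
        scores = candidate_vectors @ query         # step 3: MIPS (exact scan here)
        choice = candidates[int(scores.argmax())]  # a phrase or a single token
        if choice == "<eos>":                      # step 6: termination condition
            break
        prefix = (prefix + " " + choice).strip()   # step 5: prefix update
    return prefix

print(generate("Once upon a time"))
```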
Overall, COG stands out with its unique approach to text generation, combining phrase-level and token-level strategies and offering potential advantages in coherence, fluency, efficiency, scalability, and adaptability. The model set new state-of-the-art results on the WikiText-103 dataset, particularly in human evaluations. It will be interesting to see whether this method finds its way into existing models like ChatGPT!