
Sinks Are All You Need?

By analyzing a transformer's attention logits, researchers were able to fix a key flaw in existing LLMs!
In an effort to tackle the challenges of deploying large language models for real-time, dialogue-based applications with long context windows, researchers at MIT, Carnegie Mellon, and Meta have introduced an innovative solution known as StreamingLLM. This approach seeks to resolve two pressing issues that have long plagued LLMs: the high memory demands of caching the Key-Value (KV) states of previous tokens, and the inherent limitations these models face when confronted with text sequences longer than their training sequence length.

The Existing Challenges

Deploying large language models for infinite-length conversations or texts has long been a technical challenge. Traditional methods either consume too much memory or degrade in performance when handling long text sequences. The paper identifies two main hurdles:
1. Memory Constraints: Large language models, particularly those built on the Transformer architecture, cache the KV states of all previous tokens during the decoding stage. This results in increased memory usage and decoding latency.
2. Limited Sequence Length: Current models struggle to generalize to text sequences longer than their training data. Even efforts to expand the attention window during training haven't fully resolved this issue.
Two prevalent strategies have emerged to manage Key-Value (KV) states: window attention and sliding window with re-computation. While window attention employs a fixed-size sliding cache for recent tokens to ensure constant memory usage, it falters when the sequence length surpasses the cache size. Conversely, the sliding window with re-computation strategy, though robust in performance, proves to be computationally intensive and thus impractical for real-world streaming applications.
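To make the memory pressure concrete, here is a rough back-of-envelope estimate (my own, not from the paper) of fp16 KV-cache size for a Llama-2-7B-scale model, using its published layer and head dimensions; the fixed-window figure at the end illustrates why window attention keeps memory constant:

```python
# Rough KV-cache size estimate for a Llama-2-7B-scale model (fp16).
# Dimensions below are the published Llama-2-7B config; the formula is generic.
n_layers, n_heads, head_dim = 32, 32, 128
bytes_per_value = 2  # fp16
per_token = 2 * n_layers * n_heads * head_dim * bytes_per_value  # keys + values

def kv_cache_mb(num_cached_tokens: int) -> float:
    """Memory (MB) needed to cache keys and values for this many tokens."""
    return num_cached_tokens * per_token / 1e6

print(f"{kv_cache_mb(4_096):.0f} MB for a 4K-token cache")    # full cache at training length
print(f"{kv_cache_mb(4_000_000):.0f} MB for 4M tokens")        # grows linearly with stream length
print(f"{kv_cache_mb(1_024):.0f} MB for a fixed 1K window")    # window attention: constant
```

At roughly half a megabyte of KV state per token, caching everything in a multi-million-token stream is clearly untenable, which is why both fixed windows and re-computation were proposed in the first place.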

"Attention Sinks"

The researchers observed an interesting phenomenon: initial tokens in a sequence, despite lacking semantic significance, receive strong attention scores. These tokens act as "attention sinks," helping to stabilize the model's performance. The researchers attribute this behavior to the Softmax operation used in the attention mechanism.
In a Transformer model, the attention scores for each query must sum to one across all context tokens. When the current token has no strong contextual match among the previous tokens, the Softmax operation still has to allocate that probability mass somewhere, and the leftover attention often ends up accumulating on the initial tokens of the sequence.
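A quick way to see why the scores must land somewhere: softmax always normalizes to one, so even when no key is a good match, the probability mass has to be distributed. A minimal NumPy illustration (the logits here are made up, not taken from the paper):

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Hypothetical attention logits from one query over 8 context tokens.
# No key is a strong match, so all logits are small and similar.
logits = np.array([0.3, 0.1, 0.2, 0.1, 0.0, 0.2, 0.1, 0.3])
weights = softmax(logits)

print(weights.round(3))      # every token still receives some attention...
print(weights.sum())         # ...because the weights are forced to sum to 1
```

In practice the model learns to park this obligatory attention on the always-visible initial tokens, which is exactly the "sink" behavior described above.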
The paper's visualization of average attention logits in Llama-2-7B over 256 sentences shows that the bottom two layers attend mostly to recent tokens, exhibiting a "local" pattern, while every layer above them predominantly attends to the initial token across all heads.
To understand whether the importance of initial tokens is due to their semantics or their position, the researchers conducted an experiment. They substituted the first four tokens with the linebreak token "\n" and found that the model still significantly emphasized these tokens. This indicated that it’s not the semantic value but rather the absolute position of the tokens that is important.
Why do initial tokens become attention sinks? One reason is their visibility throughout the sequence. Because of the autoregressive nature of language models, initial tokens are visible to nearly every subsequent token, so during training the model learns to deposit excess attention on them. They end up acting as a stabilizing factor for the model's performance, especially when handling infinitely long sequences.
Based on this insight, StreamingLLM preserves these "attention sink" tokens along with a sliding window of the most recent tokens. This allows the model to handle infinitely long text sequences without any fine-tuning or performance degradation.
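The cache policy itself is simple to state. Below is a minimal sketch (my own illustration, not the authors' code) of which token positions a sink-plus-window cache retains at each step; `num_sinks=4` matches the paper's default, while the window size here is arbitrary:

```python
def tokens_to_keep(seq_len: int, num_sinks: int = 4, window: int = 1020) -> list[int]:
    """Indices of cached tokens retained by a sink + sliding-window policy.

    Keeps the first `num_sinks` positions (the attention sinks) plus the most
    recent `window` positions; everything in between is evicted.
    """
    if seq_len <= num_sinks + window:
        return list(range(seq_len))          # nothing needs to be evicted yet
    recent_start = seq_len - window
    return list(range(num_sinks)) + list(range(recent_start, seq_len))

print(tokens_to_keep(10))                    # short stream: keep everything
print(tokens_to_keep(2000, window=8))        # long stream: [0, 1, 2, 3] + last 8 positions
```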
A visualization of how StreamingLLM incorporates initial tokens
The core of this approach lies in strategically modifying the attention computation to make the model more adaptable to streaming conditions. To stabilize attention scores, StreamingLLM reintroduces the Key-Value (KV) pairs of a few starting tokens into the computation. The KV cache is divided into two parts: attention sinks and a rolling KV cache. The attention sinks are the reintroduced initial tokens that stabilize the attention, while the rolling KV cache retains the most recent tokens essential for effective language modeling.

One of the strengths of StreamingLLM is its versatility: it is designed to be compatible with any autoregressive language model that uses relative positional encoding, such as RoPE and ALiBi. This means StreamingLLM can improve the performance of existing language models in streaming applications without requiring any retraining.
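As a concrete sketch of what "attention sinks plus a rolling KV cache" means in tensor terms, the snippet below trims a Hugging Face-style legacy `past_key_values` tuple (per-layer key/value tensors of shape `[batch, heads, seq_len, head_dim]`) down to the sinks plus a recent window. This is a simplified illustration, not the official implementation; in particular, the paper assigns positions relative to the cache rather than to the original text, so a real implementation also has to re-index RoPE positions, which is only flagged in a comment here.

```python
import torch

def trim_kv_cache(past_key_values, num_sinks: int = 4, window: int = 1020):
    """Keep the first `num_sinks` and last `window` entries of each layer's KV cache.

    `past_key_values` is assumed to be the legacy tuple format: one (key, value)
    pair per layer, each of shape [batch, heads, seq_len, head_dim].
    NOTE: StreamingLLM also re-indexes positional encodings *within* the cache
    (0, 1, 2, ... over the kept tokens); that step is omitted in this sketch.
    """
    trimmed = []
    for key, value in past_key_values:
        seq_len = key.shape[2]
        if seq_len <= num_sinks + window:
            trimmed.append((key, value))     # nothing to evict yet
            continue
        def keep(t):
            return torch.cat([t[:, :, :num_sinks], t[:, :, -window:]], dim=2)
        trimmed.append((keep(key), keep(value)))
    return tuple(trimmed)
```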

Results

Results on the StreamEval Benchmark

StreamingLLM was tested on several popular models, including Llama-2, MPT, Falcon, and Pythia, demonstrating stable and efficient language modeling capabilities for up to 4 million tokens. Compared to existing methods like the sliding window with re-computation, StreamingLLM achieved a speedup of up to 22.2 times, making it highly efficient for real-world applications.

Sink Tokens

The paper also suggests a pre-training technique to further optimize performance: adding a single, learnable "attention sink" token at the beginning of every training sample. A model pre-trained this way maintains its performance in streaming settings with only that one sink token, eliminating the need to reintroduce multiple initial tokens as attention sinks and simplifying deployment.
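In practice, this pre-training change amounts to reserving one extra special token and prepending it to every training sequence. A hedged sketch using the Hugging Face tokenizer and model APIs (the token name `"<sink>"` and the use of GPT-2 are placeholders of my own, not details from the paper):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Reserve a dedicated, learnable sink token and grow the embedding table for it.
tokenizer.add_special_tokens({"additional_special_tokens": ["<sink>"]})
model.resize_token_embeddings(len(tokenizer))
sink_id = tokenizer.convert_tokens_to_ids("<sink>")

def prepend_sink(text: str) -> list[int]:
    """Tokenize a training sample with the sink token at position 0."""
    return [sink_id] + tokenizer.encode(text)

print(prepend_sink("Attention sinks stabilize streaming decoding.")[:5])
```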

Overall

StreamingLLM offers a promising solution for deploying large language models in dialogue systems, code completions, and other applications requiring long sequence generation. By effectively addressing memory and sequence length limitations, this framework opens up new possibilities for the real-time, interactive use of large language models.

The Paper: "Efficient Streaming Language Models with Attention Sinks" (Xiao et al., 2023), arXiv:2309.17453