Meta AI unveils "Multi-Token Attention"
Researchers at Meta’s FAIR lab have proposed a new mechanism called Multi-Token Attention (MTA), designed to fix a key limitation of traditional Transformer models. Standard attention operates by computing how similar each query token is to each key token—one pair at a time. This isolated computation prevents models from easily identifying context that depends on multiple relevant tokens appearing together. MTA overcomes this by letting the attention mechanism evaluate multiple query and key tokens together, using lightweight convolution operations. This upgrade significantly improves the model’s ability to pinpoint nuanced context.
Limitations of Standard Attention
Standard attention assigns weights by comparing a single query vector to each key vector. This means that if a model needs to find content based on the co-occurrence of several tokens, like identifying a sentence that mentions both “power” and “supply,” it struggles—each query focuses on only one token’s similarity. To combine multiple criteria, models must compress those ideas into a single vector beforehand, which eats up network capacity and isn’t always reliable. Moreover, separate attention heads might each find individual terms, but without a way to combine those signals, the model can’t reliably locate where all conditions are met.
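For concreteness, here is a minimal PyTorch sketch of standard scaled dot-product attention (variable names are illustrative). Note that each entry of the score matrix is computed from exactly one query-key pair, which is precisely the limitation described above:

```python
import torch

def standard_attention(Q, K, V):
    """Vanilla scaled dot-product attention: the score at position (i, j)
    depends only on query i and key j, evaluated in isolation."""
    d = Q.size(-1)                                # head dimension
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)       # normalize per query
    return weights @ V                            # weighted sum of values
```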
How Multi-Token Attention Works
MTA introduces convolution operations directly into the attention score computation. Instead of evaluating each query-key pair in isolation, MTA applies small 2D convolutions over the grid of attention scores, letting each score reflect patterns across nearby queries and keys. This allows attention to respond to patterns like token co-occurrence across local spans of text. For example, the attention score between tokens i and j can now depend on neighboring scores as well, enabling a more expressive and contextually aware mechanism.
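A minimal sketch of the key-query convolution idea: the paper learns a separate kernel per head and masks the convolution so a causal model cannot mix in scores from future positions; here a single shared, unmasked kernel is used for brevity, and odd kernel sizes are assumed so shapes are preserved.

```python
import torch
import torch.nn.functional as F

def key_query_conv(scores, kernel):
    """Convolve the attention-logit grid so each score can depend on
    neighboring query/key scores, not just its own query-key pair."""
    # scores: (batch, heads, q_len, k_len) raw attention logits
    # kernel: (c_q, c_k) learnable 2D kernel, odd-sized in each dimension
    b, h, q_len, k_len = scores.shape
    c_q, c_k = kernel.shape
    out = F.conv2d(
        scores.reshape(b * h, 1, q_len, k_len),  # treat the score grid as an image
        kernel.reshape(1, 1, c_q, c_k),
        padding=(c_q // 2, c_k // 2),            # "same" padding keeps the shape
    )
    return out.reshape(b, h, q_len, k_len)
```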

Beyond this, MTA also incorporates convolution across attention heads. Normally, each head functions independently. With head mixing, different attention heads can combine their views before final output. This allows the model to merge partial information—say, one head finding “power” and another finding “supply”—into a unified attention map that boosts attention on relevant spans containing both.
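A sketch of the head-mixing idea, assuming a learnable mixing matrix that linearly combines per-head attention maps (the paper mixes within groups of heads; a full mix across all heads is shown here to keep the sketch short):

```python
import torch

def head_mixing(attn_maps, mix):
    # attn_maps: (batch, heads, q_len, k_len) per-head attention maps
    # mix: (heads, heads) learnable matrix; output head o is a weighted
    # combination of every input head's map, letting partial signals
    # (e.g., "power" from one head, "supply" from another) reinforce
    # each other on spans where both conditions hold.
    return torch.einsum('oh,bhqk->boqk', mix, attn_maps)
```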
Experimental Performance
The effectiveness of MTA was demonstrated in a series of tasks, starting with a simple toy problem where models had to find a block of characters that matched multiple target letters. Standard Transformers failed to learn this reliably, while MTA solved it with near-perfect accuracy. This confirmed that the ability to pool multi-token signals across context gives a major edge in structured search tasks.

In large-scale training on the SlimPajama dataset, MTA-equipped Transformers outperformed baseline models on standard language modeling benchmarks, with especially strong gains on LAMBADA and Needle-in-a-Haystack evaluations, which specifically test long-range dependencies and precise retrieval. MTA handles these better thanks to its multi-token context pooling, while adding only a small number of extra parameters and leaving overall model size essentially unchanged.
Architectural Details and Flexibility
MTA’s convolutions can be applied before or after the softmax step in the attention computation. Applying them before softmax makes the influence multiplicative, allowing context interactions to shape where attention lands; applying them after softmax makes the effect additive. Both variants were tested, and while both perform well, pre-softmax convolution tends to yield slightly better results.
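As a sketch, the two placements differ only in whether the convolution sees raw logits or normalized weights. Here, conv stands for a key-query convolution like the earlier sketch; re-normalizing the post-softmax output back into a distribution is omitted for brevity.

```python
import torch

def mta_attention_weights(scores, conv, pre_softmax=True):
    if pre_softmax:
        # Convolve raw logits: because softmax exponentiates, shifts in
        # the logits act multiplicatively on the final attention weights.
        return torch.softmax(conv(scores), dim=-1)
    # Convolve normalized weights: neighboring weights are blended in
    # linearly, an additive effect on the final attention map.
    return conv(torch.softmax(scores, dim=-1))
```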
To balance performance and efficiency, MTA doesn’t need to be applied in every layer. In practice, key-query convolutions are inserted in every fourth layer, while head mixing is applied to all layers. Group normalization with depth scaling is also used to maintain stable training and gradient flow.
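In configuration terms, the placement described above might look like this hypothetical sketch (the layer count and flag names are illustrative, not the paper's exact setup):

```python
N_LAYERS = 24  # illustrative depth

layer_config = [
    {
        "key_query_conv": (i % 4 == 0),  # key-query convolution in every 4th layer
        "head_mixing": True,             # head mixing applied in every layer
    }
    for i in range(N_LAYERS)
]
```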
Conclusion
Multi-Token Attention represents a meaningful upgrade to how Transformers compute relevance in text. By letting attention scores be influenced by patterns across multiple queries, keys, and heads, MTA allows language models to find more precise and contextually relevant information. This improvement is especially valuable for tasks involving long contexts, structured data, or conditions that depend on the co-occurrence of several cues. Without increasing model size substantially, MTA shows that attention can be made sharper and more nuanced with relatively small but smart changes to the architecture.
Tags: ML News