Meta AI's Token Merging
Token Merging: Your ViT But Faster.
Created on February 13|Last edited on February 13
Meta AI's paper introduces Token Merging (ToMe), a method that merges similar tokens together to speed up computation without sacrificing much accuracy. ToMe is lightweight and can be inserted into any part of a ViT: it is simply another layer in the network.
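To make the "it's just a layer" idea concrete, here is a rough sketch of where a merge step could sit inside one transformer block. This assumes a standard residual ViT block; the `attn`, `mlp`, and `merge` callables are illustrative stand-ins, not the paper's implementation (in the paper, ToMe sits between the attention and MLP sublayers so the attention keys are available for matching):

```python
import numpy as np

def vit_block_with_tome(x, attn, mlp, merge, r):
    """One transformer block with a token-merging step.

    x: (N, d) array of tokens; attn and mlp are the block's sublayers;
    merge reduces N tokens to N - r. All names here are illustrative.
    """
    # Attention sublayer with residual connection.
    x = x + attn(x)
    # Merge r similar tokens before the MLP, shrinking N -> N - r.
    x = merge(x, r)
    # MLP sublayer (residual) now runs on fewer tokens.
    return x + mlp(x)
```

Because each block removes r tokens, every later block runs on a shorter sequence, which is where the speedup comes from.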

ToMe showed strong performance on images, video, and audio, and the drop in accuracy shrinks as model size and input resolution grow.
When designing the method, they prioritized parallelism and a gradual reduction of r tokens in each of the model's L blocks (a total of rL tokens merged). They therefore opted for a lightweight matching algorithm (bipartite soft matching), defining similarity not as distance in feature space but as the cosine similarity between the attention keys (K), since the key matrices already summarize each token's information.
Their bipartite soft matching algorithm works as follows:
1. Alternately partition the tokens into two sets, A and B, of roughly equal size.
2. Draw an edge from each token in A to its most similar token in B.
3. Keep only the r most similar edges.
4. Merge the tokens that are still connected (e.g., by averaging their features).
5. Concatenate the two sets back together.
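The matching can be sketched in a few lines of NumPy. This is a simplified version, not the paper's implementation: it merges by plain averaging (the paper tracks token sizes and uses size-weighted averages) and takes a single key vector per token (the paper averages keys over attention heads):

```python
import numpy as np

def bipartite_soft_matching(keys: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most similar token pairs.

    keys: (N, d) array, one attention-key vector per token.
    Returns the surviving tokens, shape (N - r, d).
    """
    # 1. Alternately partition tokens into two sets A and B.
    a, b = keys[::2], keys[1::2]

    # 2. Cosine similarity between every token in A and every token in B.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                      # (|A|, |B|)

    # Each A-token's best match in B, and the r most similar edges.
    best_b = sim.argmax(axis=1)
    best_sim = sim.max(axis=1)
    order = np.argsort(-best_sim)
    merge_idx, keep_idx = order[:r], order[r:]

    # 3.-4. Merge each selected A-token into its partner in B by averaging.
    merged_b = b.copy()
    for i in merge_idx:
        merged_b[best_b[i]] = (merged_b[best_b[i]] + a[i]) / 2

    # 5. Concatenate the unmerged A-tokens with the (partly merged) B-tokens.
    return np.concatenate([a[keep_idx], merged_b], axis=0)
```

Because each token in A needs only its single best match in B, the whole step is a dense matrix multiply plus a sort, which keeps it parallelizable, unlike iterative clustering.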

Reference
Bolya, Daniel, et al. “Token Merging: Your ViT But Faster.” Meta Research, Meta, 8 Feb. 2023, https://research.facebook.com/blog/2023/2/token-merging-your-vit-but-faster/.
Tags: ML News