Tokenformer: A GPT that uses tokens as parameters
Is this the future of GPT architectures?
With the increasing demand for powerful AI models across various fields, researchers and engineers are encountering a significant obstacle: how to scale up these models without constantly retraining from scratch. Transformer-based architectures, like GPT, have become essential in the AI landscape. However, as these models grow in size and complexity, the computational costs associated with scaling are growing at an unsustainable rate. Tokenformer is a novel architecture poised to redefine scalability for large models, offering a pathway to more efficient, adaptable, and cost-effective training.
The Scaling Challenge in Traditional GPT Models
In conventional GPT models, scaling involves increasing the size and complexity of the linear projections in each layer. These layers project input tokens with fixed weight matrices whose shapes are tied to the model's hidden dimensions. When scaling up, significant changes, like adding layers or widening those dimensions, change the weight shapes, so the pretrained parameters can no longer be reused and the model must be trained from scratch each time. The cost in time, compute, and energy is massive. Tokenformer addresses this issue by fundamentally changing how parameters interact with inputs, enabling incremental scaling so the model can grow without resetting entirely, significantly reducing training costs.
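To make that coupling concrete, here is a minimal PyTorch sketch of a conventional projection layer. The dimensions and the fused query/key/value projection are illustrative choices, not details of any particular GPT implementation:

```python
import torch
import torch.nn as nn

# A conventional transformer projection: a fixed weight matrix whose shape is
# tied to the hidden dimension (values here are illustrative, not from a real GPT).
d_model = 768
qkv_proj = nn.Linear(d_model, 3 * d_model)   # fused query/key/value projection

x = torch.randn(1, 16, d_model)              # (batch, sequence, hidden)
qkv = qkv_proj(x)                            # works only for this exact d_model

# Widening the model (say, d_model -> 1024) changes the weight shapes, so the
# pretrained qkv_proj can no longer be loaded and training restarts from scratch.
```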
Tokenformer’s Key Innovation: Tokenized Model Parameters
The central idea in Tokenformer is both simple and powerful: treat the model's parameters as tokens. Traditional transformers handle two types of interactions: token-token interactions and token-parameter interactions. In token-token interaction, input tokens (like words in a sentence) attend to each other to learn dependencies within the sequence. In token-parameter interaction, input tokens are multiplied by fixed matrices to project their representations into different spaces, such as generating queries, keys, and values.
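In code, the two kinds of interaction in a vanilla transformer look roughly like this (a schematic sketch with arbitrary dimensions, not an excerpt from any real implementation):

```python
import torch
import torch.nn.functional as F

d_model = 64
x = torch.randn(1, 8, d_model)               # input tokens: (batch, sequence, hidden)

# Token-parameter interaction: fixed matrices project tokens into queries, keys, values.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

# Token-token interaction: input tokens attend to each other.
scores = q @ k.transpose(-2, -1) / d_model ** 0.5
out = F.softmax(scores, dim=-1) @ v
```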
Tokenformer rethinks this second interaction. Instead of using fixed matrices, it introduces Token-Parameter Attention (Pattention), in which model parameters themselves are represented as tokens. In this setup, input tokens serve as queries in an attention mechanism, while parameter tokens (learnable tokens that replace traditional weight matrices) serve as keys and values. This gives Tokenformer unusual flexibility in scaling: the number of parameter tokens can be adjusted dynamically, so the model grows by adding tokens rather than by changing its foundational structure.
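Below is a minimal sketch of what such a Pattention layer could look like in PyTorch. The class name, the initialization scale, and the use of a plain softmax are illustrative simplifications; the paper describes a modified normalization for training stability:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pattention(nn.Module):
    """Token-parameter attention: input tokens query a set of learnable parameter tokens."""

    def __init__(self, d_in, d_out, num_param_tokens):
        super().__init__()
        # Parameter tokens stand in for the fixed weight matrix of a linear layer.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, d_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, d_out) * 0.02)

    def forward(self, x):                      # x: (batch, seq, d_in)
        scores = x @ self.key_params.t()       # each input token attends over the parameter tokens
        weights = F.softmax(scores, dim=-1)    # plain softmax here; the paper uses a modified normalization
        return weights @ self.value_params     # (batch, seq, d_out)
```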

Tokenformer’s Approach to Scaling
Tokenformer’s design offers a few transformative advantages. First, it enables incremental scaling. Tokenformer can scale simply by adding new parameter tokens. These tokens can be appended to the existing set without altering the input-output dimensions, which means the model can grow smoothly over time. Imagine a model the size of GPT-3 that can progressively expand to match the scale of GPT-4 by appending new tokens as needed, making it feasible to increase model capacity while keeping computational demands manageable.
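Reusing the hypothetical Pattention sketch above, incremental scaling might look like the following. The grow_pattention helper is an illustrative name, not part of any released codebase; the paper reports zero-initializing the new key/value tokens so the expanded layer starts out reproducing the pretrained layer's behavior, though with the plain-softmax sketch that preservation is only approximate:

```python
import torch
import torch.nn as nn

def grow_pattention(layer, extra_tokens):
    """Append new, zero-initialized parameter tokens to a Pattention layer."""
    new_keys = layer.key_params.data.new_zeros(extra_tokens, layer.key_params.shape[1])
    new_values = layer.value_params.data.new_zeros(extra_tokens, layer.value_params.shape[1])
    layer.key_params = nn.Parameter(torch.cat([layer.key_params.data, new_keys], dim=0))
    layer.value_params = nn.Parameter(torch.cat([layer.value_params.data, new_values], dim=0))
    return layer

# Usage: keep training the same model after growing its parameter-token set.
layer = Pattention(d_in=768, d_out=768, num_param_tokens=1024)
layer = grow_pattention(layer, extra_tokens=512)   # now 1536 parameter tokens, same input/output shapes
```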
Second, Tokenformer supports parameter reuse and adaptation. By using learnable parameter tokens, Tokenformer can retain and build upon previously learned parameters when scaling up. The model keeps its accumulated knowledge while new parameter tokens add depth and capacity. This fundamentally shifts the scaling approach, since the model can leverage its pretrained weights without starting over, a major improvement over current practice, where scaling often means retraining from scratch.
Finally, Tokenformer’s adaptability benefits a range of applications. Its flexible design is not only useful for scaling within a single application but also advantageous for transferring knowledge across tasks. For instance, a model trained on language tasks can add parameter tokens specifically for vision tasks, effectively expanding to a multimodal model without losing its original language understanding. This flexibility makes Tokenformer an ideal candidate for foundation models that span multiple domains.
Replacing the MLP
Tokenformer is designed as a fully attention-driven architecture: token-token interactions are handled by standard self-attention, while every parameter transformation is handled by token-parameter attention. In conventional GPT models, each block contains linear layers for both the self-attention projections and the feed-forward transformation. In Tokenformer, both the self-attention projections and the MLP (feed-forward network) are built from token-parameter attention.
In this architecture, self-attention operates similarly to standard transformers, where input tokens attend to each other to learn dependencies within the sequence. For the feed-forward network, Tokenformer replaces dense layers with token-parameter attention, where input tokens attend to MLP-specific parameter tokens. This re-imagining of parameter transformations as token-parameter interactions enables Tokenformer to maintain a scalable, tokenized parameter approach throughout.
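Putting the pieces together, a single Tokenformer block might be sketched as follows, reusing the hypothetical Pattention layer from above. It is single-head and simplified for readability; head structure, normalization placement, and parameter-token counts in the real architecture follow the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenformerBlock(nn.Module):
    """Simplified single-head block: every projection is token-parameter attention."""

    def __init__(self, d_model, num_param_tokens):
        super().__init__()
        # Query/key/value and output projections become Pattention layers...
        self.q_proj = Pattention(d_model, d_model, num_param_tokens)
        self.k_proj = Pattention(d_model, d_model, num_param_tokens)
        self.v_proj = Pattention(d_model, d_model, num_param_tokens)
        self.out_proj = Pattention(d_model, d_model, num_param_tokens)
        # ...and the feed-forward network is a single Pattention over FFN parameter tokens.
        self.ffn = Pattention(d_model, d_model, num_param_tokens)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        h = self.norm1(x)
        q, k, v = self.q_proj(h), self.k_proj(h), self.v_proj(h)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1) @ v                # standard token-token self-attention
        x = x + self.out_proj(attn)
        x = x + self.ffn(self.norm2(x))                     # token-parameter feed-forward
        return x
```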
Why Tokenformer Could Be the Future of GPTs
Tokenformer’s efficient scaling, reusability of learned knowledge, and adaptability across domains make it especially relevant to the future needs of large models. By lowering the barrier to scaling—both in terms of compute and cost—Tokenformer opens up new possibilities for practical, large-scale deployments in industry and research. Its ability to scale incrementally, adapt across tasks, and minimize training costs aligns with what’s needed in today’s model-heavy AI environment.
Tokenformer’s approach allows for cost-effective growth by supporting gradual expansion as more parameter tokens are added. Because existing tokens retain what the model has already learned, it can adapt to new domains and tasks without disrupting prior knowledge, a property especially valuable for foundation models. Finally, by reducing the need for complete retraining, Tokenformer contributes to sustainability in AI, aligning with industry goals to reduce the environmental and financial costs of training large models.
Looking Ahead
Tokenformer represents a shift toward tokenized model parameters, offering a more flexible, scalable, and efficient way to build large language models. Its architecture is built to support the continual growth of models without the need to reset and restart training. As research into token-parameter attention matures, Tokenformer (or models inspired by it) may lead the next generation of GPTs, making advanced capabilities accessible to a wider range of developers and applications than ever before.