TII unveils Falcon Mamba
The Technology Innovation Institute (TII) has introduced Falcon Mamba, a groundbreaking model that challenges the dominance of transformer-based architectures in the world of large language models. Released under the TII Falcon License 2.0, Falcon Mamba is now available as open access within the Hugging Face ecosystem. This model stands out as the first strong attention-free 7B model, designed to overcome the limitations of traditional attention mechanisms in processing large sequences.
Falcon Mamba is based on the Mamba architecture, introduced in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." Whereas transformers face compute and memory costs that grow with sequence length, Falcon Mamba relies on selective state spaces together with RMS normalization layers. This design lets the model process sequences of arbitrary length without a corresponding increase in memory usage, while maintaining competitive performance, making Falcon Mamba a viable alternative to existing state-of-the-art transformer models.
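The recurrent view of a state space model helps explain the constant-memory claim: each new token only updates a fixed-size hidden state. The sketch below is a deliberately simplified, non-selective linear recurrence written purely for illustration; the real Mamba block uses input-dependent (selective) parameters, discretization, and a hardware-aware scan, and all names and dimensions here are invented for the example.

```python
import numpy as np

# Minimal illustrative sketch of a linear state space recurrence (NOT the
# actual Mamba implementation): the hidden state has a fixed size, so the
# memory needed per generated token stays constant no matter how long the
# sequence gets.
d_state, d_model = 16, 8               # hypothetical sizes for illustration
A = np.random.randn(d_state, d_state) * 0.01
B = np.random.randn(d_state, d_model)
C = np.random.randn(d_model, d_state)

def step(h, x):
    """One recurrence step: update the fixed-size state, emit an output."""
    h = A @ h + B @ x                  # state update: h_t = A h_{t-1} + B x_t
    y = C @ h                          # output:       y_t = C h_t
    return h, y

h = np.zeros(d_state)
for t in range(10_000):                # sequence length never grows the state
    x = np.random.randn(d_model)
    h, y = step(h, x)
```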
Training
The model was trained on approximately 5,500 gigatokens (GT) of data, equivalent to around 5.5 trillion tokens, sourced primarily from RefinedWeb, a large, carefully filtered and deduplicated web-only dataset. In line with other models in the Falcon suite, training followed a multi-stage strategy that increased the context length from 2,048 to 8,192 tokens. Inspired by curriculum learning, this approach carefully selected data mixtures at each stage, balancing diversity and complexity so the model could handle a wide range of language tasks of increasing difficulty.
The training process for Falcon Mamba-7B utilized 256 H100 80GB GPUs, employing a 3D parallelism strategy with ZeRO (Zero Redundancy Optimizer). This setup allowed for efficient scaling across multiple GPUs, ensuring that the model could be trained effectively on a massive dataset without compromising performance.
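TII's exact training stack is not described here, so the following is only a hedged sketch of what a ZeRO-style configuration might look like if one used DeepSpeed; every value shown is a placeholder rather than a setting TII reported.

```python
# Hypothetical DeepSpeed-style ZeRO configuration, shown purely to illustrate
# the idea of sharding optimizer state across data-parallel ranks.
# These values are placeholders, not the settings TII actually used.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,            # shard optimizer states across ranks (ZeRO-1)
        "overlap_comm": True,  # overlap communication with computation
    },
}
# In a real setup this dict would be passed to deepspeed.initialize(...)
# alongside the model and optimizer, combined with tensor/pipeline parallelism
# for the full 3D-parallel layout.
```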
During the final stage of training, a small portion of high-quality curated data, including samples from FineWeb-Edu, was used to further enhance the model's performance. The data sources also included high-quality technical, code, and mathematical data extracted from public sources, all tokenized with the Falcon-7B/11B tokenizer.
Efficient Processing of Large Sequences
One of Falcon Mamba's key strengths is its ability to handle large sequences efficiently. Traditional transformers become increasingly expensive as context length grows: attention compute scales quadratically with sequence length, and the key-value cache needed during generation keeps growing with every token. In contrast, Falcon Mamba's architecture generates new tokens at a constant speed and with constant memory usage, regardless of context length. This makes it particularly well suited for applications that process lengthy text inputs, and gives it a clear advantage over transformer-based models where managing long sequences is crucial.
Integration with Hugging Face
Falcon Mamba is fully integrated into the Hugging Face ecosystem and will be included in the upcoming release of the Hugging Face transformers library. Users can access the model through familiar APIs such as AutoModelForCausalLM and pipeline. The model also supports bitsandbytes quantization, making it possible to run on GPUs with less memory.
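As a concrete starting point, here is a minimal sketch of loading the model through the APIs mentioned above, with optional 4-bit quantization via bitsandbytes. The Hub id tiiuae/falcon-mamba-7b and the quantization settings are assumptions for illustration; check the model card for the exact identifiers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed Hub id; verify the exact name on the TII organization page.
model_id = "tiiuae/falcon-mamba-7b"

# Optional 4-bit quantization via bitsandbytes to fit on smaller GPUs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("State space models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```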
For instruction-following tasks, TII has also released an instruction-tuned version of Falcon Mamba, fine-tuned on an additional 5 billion tokens. This version is designed to improve the model's ability to follow instructions and produce precise responses, making it a versatile tool for a wide range of applications.
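For the instruction-tuned variant, the usual chat-template workflow should apply. The Hub id tiiuae/falcon-mamba-7b-instruct below follows TII's naming pattern but is an assumption, not something stated in this post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub id for the instruction-tuned variant; verify on the Hub.
model_id = "tiiuae/falcon-mamba-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Format the conversation with the model's chat template before generating.
messages = [{"role": "user", "content": "Summarize what a state space model is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```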
Results and Performance
Falcon Mamba's performance has been evaluated across multiple benchmarks, where it outperforms both pure state space language models (SSLMs) and hybrid models that combine attention with state space mechanisms. For example, on benchmarks such as GPQA and MMLU-PRO, Falcon Mamba consistently outperformed comparable models, demonstrating its effectiveness on complex language tasks.

In comparison to other models, Falcon Mamba-7B showed strong results across various tasks. It scored 33.36 on the IFEval benchmark, 19.88 on BBH, and 14.47 on MMLU-PRO, surpassing other pure SSM models like TRI-ML/mamba-7b-rw and hybrid models like recurrentgemma-9b. Additionally, Falcon Mamba maintained competitive scores even when compared to transformer-based models, illustrating its capability to match and, in some cases, exceed traditional models in key areas.
Moreover, Falcon Mamba demonstrated exceptional efficiency in processing large sequences. Unlike transformers, which see an increase in memory usage and a decrease in generation speed as the number of generated tokens grows, Falcon Mamba maintained a constant throughput and memory usage, even with sequences up to 130k tokens long. This efficiency makes Falcon Mamba particularly well-suited for tasks that involve processing extensive text inputs.
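To get a rough sense of this behavior on your own hardware, one could time generation and peak GPU memory at a few prompt lengths. The snippet below is an illustrative sketch, assuming a CUDA GPU and the assumed Hub id tiiuae/falcon-mamba-7b; it is not a rigorous benchmark.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough, illustrative timing sketch (assumes a CUDA GPU); the Hub id is an
# assumption, so adjust it to the actual model name.
model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

for prompt_len in (1_000, 8_000, 32_000):
    # Build a synthetic prompt of roughly prompt_len tokens.
    input_ids = torch.randint(
        0, tokenizer.vocab_size, (1, prompt_len), device=model.device
    )
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    model.generate(input_ids, max_new_tokens=128, do_sample=False)
    elapsed = time.time() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"prompt={prompt_len:>6} tokens  "
          f"time={elapsed:5.1f}s  peak_mem={peak_gb:5.1f} GB")
```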