
Researchers Make BERT Exponentially Faster

Leveraging fast feedforward layers, researchers were able to improve the inference efficiency of BERT exponentially!
The paper "Exponentially Faster Language Modeling" recently released by Peter Belcak and Roger Wattenhofer presents a groundbreaking approach to language modeling that achieves significant speed improvements in the inference process. Central to this advancement is the development of UltraFastBERT, a variation of the widely recognized BERT model. UltraFastBERT is unique for its ability to effectively utilize just a minuscule portion of its neurons—merely 0.3% or about 12 out of 4095 neurons in each layer—during the inference phase, yet it manages to maintain performance levels comparable to traditional BERT models.

Fast Feedforward Networks

The cornerstone of this efficiency gain is the replacement of the conventional feedforward networks in the BERT architecture with newly designed fast feedforward networks (FFFs). Whereas a traditional feedforward layer requires every neuron to participate in every computation, an FFF reorganizes its neurons into a structured, balanced binary tree. This structure is pivotal because, during inference, only a single root-to-leaf branch of the tree is activated and executed for a given input. The cost of inference therefore grows logarithmically rather than linearly with the number of neurons, which is where the exponential speedup comes from.
Despite this reduction, UltraFastBERT does not compromise on the quality of language modeling. It successfully delivers results comparable to regular BERT models, which are renowned for their accuracy and effectiveness in various natural language processing tasks.
To dive deeper into the mechanics of FFFs: each neuron in the binary tree makes a decision based on the input, determining which path to follow, left or right, down the tree. Instead of processing an entire layer of neurons, the model traverses a single path through the tree, engaging only a handful of neurons along the way, as the sketch below illustrates. This is fundamentally more efficient than the standard approach, where every input passes through every neuron.
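Here is a minimal NumPy sketch of what inference through such a tree might look like. It is an illustration rather than the authors' implementation: the sign-based routing rule, the GeLU activation, and all variable names are assumptions chosen to mirror the description above.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fff_forward(x, w_in, w_out, depth):
    """One inference pass through a fast feedforward (FFF) layer.

    The 2**depth - 1 neurons are stored in heap order: node 0 is the
    root and node n has children 2n + 1 and 2n + 2. Only the `depth`
    neurons on a single root-to-leaf path are ever evaluated.
    """
    y = np.zeros(w_out.shape[1])
    node = 0
    for _ in range(depth):
        pre = w_in[node] @ x           # one dot product per visited neuron
        y += gelu(pre) * w_out[node]   # accumulate this neuron's contribution
        # Route on the sign of the pre-activation: right child if positive.
        node = 2 * node + (2 if pre > 0 else 1)
    return y

# Toy dimensions echoing the setup described above: 4095 nodes, 12 per path.
depth, d_model = 12, 768
n_nodes = 2**depth - 1
rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)
w_in = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
w_out = rng.standard_normal((n_nodes, d_model)) / np.sqrt(depth)
print(fff_forward(x, w_in, w_out, depth).shape)  # (768,)
```

Note that each visited neuron still behaves like an ordinary feedforward neuron (a dot product followed by a nonlinearity); the savings come entirely from evaluating 12 of them instead of all 4095.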

78x Speedup

Despite the sophisticated nature of these innovations, the researchers implemented only a high-level CPU version of the conditional matrix multiplication (CMM) at the heart of FFF inference, and it already demonstrates a striking 78x speedup over optimized traditional feedforward implementations; the toy comparison below illustrates where that gap comes from. Looking ahead, the paper indicates that even faster speeds are attainable: the authors note that current deep learning frameworks and hardware offer no native primitives for conditional execution, so purpose-built implementations could unlock the full potential of conditional neural execution and further improve the efficiency of language models.
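For intuition, here is a hypothetical micro-benchmark comparing a dense feedforward pass against the conditional path traversal, reusing the gelu, fff_forward, and weight variables from the sketch above. The measured ratio will vary widely with hardware and implementation details; the 78x figure comes from the authors' own optimized CPU comparison, not from toy code like this.

```python
import time

def dense_forward(x, w_in, w_out):
    # Conventional feedforward layer: every one of the 4095 neurons fires.
    return gelu(w_in @ x) @ w_out

def bench(fn, *args, reps=200):
    # Crude wall-clock timing over a fixed number of repetitions.
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return time.perf_counter() - t0

t_dense = bench(dense_forward, x, w_in, w_out)
t_fff = bench(fff_forward, x, w_in, w_out, depth)
print(f"dense: {t_dense:.4f}s  fff: {t_fff:.4f}s  "
      f"speedup: {t_dense / t_fff:.0f}x")
```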
In essence, this research marks a significant step forward in language modeling, introducing methods that substantially accelerate inference without compromising performance. The integration of FFFs and CMM into the BERT framework points to an exciting and promising new direction in AI language processing.

Sources: Belcak, P. and Wattenhofer, R. (2023). "Exponentially Faster Language Modeling." arXiv:2311.10770.
Tags: ML News