Grokfast: Researchers Accelerate Grokking by 50x
Researchers have recently documented a peculiar phenomenon called "grokking," where a model suddenly achieves generalization long after it has overfit the training data. While this delayed generalization is intriguing, waiting for it is often impractical because of the computational resources it demands. A team of researchers from Seoul National University has developed a novel algorithm called "Grokfast" that accelerates grokking, making the phenomenon more accessible to machine learning practitioners.
The Idea
The key idea behind Grokfast is to treat the sequence of gradient updates a model receives during training as a combination of two components: a fast-varying one and a slow-varying one. The fast-varying component changes quickly from step to step and is primarily responsible for fitting the training data. The slow-varying component evolves more gradually and, the authors argue, is the part that drives generalization to new data.
More Slow = Fast
The researchers theorized that by emphasizing the slow-varying changes, they could accelerate the grokking process and help the model generalize faster. To do this, they developed a simple technique that involves applying a low-pass filter to the gradients used to update the model's parameters during training. A low-pass filter is a signal processing tool that removes high-frequency components from a signal, letting only the low-frequency components pass through.
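Concretely, the EMA variant of this filter can be written as a pair of per-step updates (a paraphrase of the paper's filter, with $\alpha$ controlling how slowly the average moves and $\lambda$ how strongly the slow component is amplified):

$$\mu_t = \alpha\,\mu_{t-1} + (1-\alpha)\,g_t, \qquad \hat{g}_t = g_t + \lambda\,\mu_t$$

Here $g_t$ is the raw gradient at step $t$, $\mu_t$ is its low-pass-filtered (slow-varying) estimate, and $\hat{g}_t$ is the gradient actually handed to the optimizer.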

Filtering Gradients
In the context of Grokfast, the low-pass filter is applied to the gradients computed during training to update the model's parameters. By amplifying the low-frequency, slow-varying component of the gradients relative to their high-frequency, fast-varying component, Grokfast nudges the model toward the changes that are more likely to lead to generalization.
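Here is a minimal PyTorch sketch of that idea. This is not the authors' reference implementation; the function name `ema_gradient_filter` and the default values of `alpha` and `lamb` are illustrative assumptions:

```python
import torch

def ema_gradient_filter(model, ema_grads, alpha=0.98, lamb=2.0):
    """Low-pass filter each parameter's gradient with an exponential
    moving average (EMA) and add the amplified slow component back in.

    ema_grads maps parameter name -> running EMA tensor (pass an empty
    dict on the first call). alpha and lamb are illustrative defaults.
    """
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        if name not in ema_grads:
            # Initialize the running average with the first gradient seen.
            ema_grads[name] = param.grad.clone()
        else:
            # EMA update: keep mostly the old (slow) signal and mix in
            # a small fraction of the new gradient.
            ema_grads[name].mul_(alpha).add_(param.grad, alpha=1 - alpha)
        # Amplify the slow-varying component on top of the raw gradient.
        param.grad.add_(ema_grads[name], alpha=lamb)
    return ema_grads
```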
Easy Implementation
Implementing Grokfast is surprisingly simple and can be done with just a few lines of additional code in standard machine learning frameworks. The researchers proposed two variants of the algorithm: Grokfast-MA, which uses a moving-average filter over a window of recent gradients, and Grokfast-EMA, which uses an exponential moving average. Both variants maintain a running average of the gradients and add an amplified copy of it to the current gradients at each training step.
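Assuming the `ema_gradient_filter` sketch above, wiring it into a standard training loop might look like the following; `model`, `dataloader`, `criterion`, and `optimizer` are placeholders for whatever your setup uses:

```python
# The filter runs after loss.backward() (so gradients exist) and
# before optimizer.step() (so the step uses the filtered gradients).
ema_grads = {}
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    ema_grads = ema_gradient_filter(model, ema_grads)  # Grokfast-style step
    optimizer.step()
```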
Experiments on a wide range of tasks and model architectures showed that Grokfast can speed up the grokking process by up to 50 times, leading to faster generalization. This means that models trained with Grokfast can start performing well on new data much earlier than models trained without it.
The Future
The implications of Grokfast are significant for the machine learning community. By accelerating the grokking phenomenon, the technique can help researchers and practitioners save valuable computational resources and time. Moreover, the insights gained from this work contribute to our understanding of the underlying mechanisms of delayed generalization in machine learning models.
In summary, Grokfast is a simple and effective technique that speeds up grokking by emphasizing the slow-varying component of the gradients during training. By applying a low-pass filter to the gradients, Grokfast steers the model toward the changes that are more likely to lead to generalization, resulting in faster and more efficient learning.