
5 Tips for Creating Lightweight Vision Transformers

This article covers some experimentally tested ways to make your Vision Transformer more accurate without wasting resources.
Transformers have been around for a while, but they have only been applied to computer vision for a few years. Because of this, there's still some confusion about how to get the most out of your vision transformer.
In this article, we'll provide a few tips, tested experimentally on different datasets, about how you can increase your model's accuracy without wasting resources in places you don't have to.

1. Stick to one transformer layer

Vision transformers can stack multiple transformer layers on top of one another, so it may be tempting to go deep, as people do with convolutional neural networks (CNNs). However, transformers are prone to overfitting your training dataset, so unless your dataset is massive, adding more transformer layers not only slows your model down significantly, it may actually reduce its accuracy.
Instead, save yourself the trouble: use a single transformer layer and invest your resources in other model parameters.
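To make this concrete, here is a minimal sketch in PyTorch of an encoder that uses a single transformer layer. The hyperparameters (embedding size, head count, token count) are illustrative assumptions, not values from any specific experiment:

```python
import torch
import torch.nn as nn

# Illustrative sizes: 128-dim token embeddings, 4 attention heads.
embed_dim, num_heads = 128, 4

layer = nn.TransformerEncoderLayer(
    d_model=embed_dim,
    nhead=num_heads,
    dim_feedforward=embed_dim * 2,
    batch_first=True,
)
# num_layers=1: a single transformer layer instead of a deep stack.
encoder = nn.TransformerEncoder(layer, num_layers=1)

# 65 tokens = 64 patch embeddings + 1 class token, for a batch of 8 images.
tokens = torch.randn(8, 65, embed_dim)
out = encoder(tokens)  # shape: (8, 65, 128)
```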

2. Use more patches

If your images are on the small side, you might think that cutting them into larger patches is beneficial because each patch stores more information. It is actually quite the opposite: using more, smaller patches tends to increase your model's accuracy. Even patches as small as 3x3 pixels can still improve accuracy (although they will also slow your model down quite a bit).
If you cannot afford to go that small, aim to cut the image into at least 100 patches; in other words, make the patch height and width roughly a tenth of your image's height and width.
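As a quick sanity check on that rule of thumb, here is a small sketch that counts how many patches a given configuration produces (the image and patch sizes below are example assumptions):

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches covering a square image."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    return (image_size // patch_size) ** 2

print(num_patches(64, 16))  # 16 patches  -> probably too coarse
print(num_patches(64, 8))   # 64 patches  -> better
print(num_patches(64, 4))   # 256 patches -> slower, but often more accurate
```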

3. Use more and bigger attention heads

While adding more transformer layers is not a good idea if you want to stay lightweight, increasing the capacity of the one layer you have usually is. Adding more attention heads to the transformer layer, and increasing the size of those heads, barely increases the size of your model but can have a significant effect on its accuracy.
It can also overfit your training dataset (so use this tip with caution), but it is always worth experimenting with this parameter: you can often get a better model that runs almost exactly as fast as a smaller, worse one.
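As a rough sketch of what "more and bigger heads" means in practice: in a standard multi-head attention layer, the per-head size is the embedding dimension divided by the number of heads, so widening the heads means growing the embedding dimension along with them. The head count and head size below are assumptions for illustration:

```python
import torch.nn as nn

# 8 heads of size 64 each; per-head size = embed_dim // num_heads,
# so bigger heads require a proportionally larger embedding dimension.
num_heads, head_dim = 8, 64
embed_dim = num_heads * head_dim  # 512

attention = nn.MultiheadAttention(
    embed_dim=embed_dim,
    num_heads=num_heads,
    batch_first=True,
)
```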

4. Have a deep neural network at the end

Vision transformers end with a multi-layer perceptron (MLP), the part of the model that actually does the classification, while the transformer part mainly focuses on extracting features from the image.
Make sure that multi-layer perceptron is actually multi-layer. Adding more layers to it has a very small impact on the model's size and speed compared to most other parameters, yet it can have a very large impact on accuracy. It can also hurt accuracy through overfitting, so experiment with different depths for your use case and see what works best.
One more thing to keep in mind: the layers of the perceptron should decrease in size, or at least stay the same size. If you add a smaller layer followed by a larger one, the smaller layer becomes an information bottleneck, and any larger layers after it are a waste of resources.
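A minimal sketch of such a head, with layer sizes chosen purely for illustration (and decreasing at every step):

```python
import torch.nn as nn

embed_dim, num_classes = 512, 10  # illustrative assumptions

mlp_head = nn.Sequential(
    nn.Linear(embed_dim, 256),
    nn.GELU(),
    nn.Linear(256, 128),
    nn.GELU(),
    nn.Linear(128, num_classes),  # each layer is no larger than the one before it
)
```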

5. Have a strong dropout layer

As we mentioned several times in this article, vision transformers really tend to overfit. There is a layer designed to combat exactly that: dropout. A dropout layer simply zeroes out a random subset of the values it receives, so the model cannot memorize the exact training examples and is forced to generalize.
Dropout rates even as high as 0.5 (meaning half of the values are dropped during each training pass) will likely have a positive impact on your model's overall accuracy. It might seem wasteful to throw away information you worked to gain, and people sometimes forgo dropout layers altogether, but that would be a mistake when you are using vision transformers.
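Here is a sketch of what this might look like inside the classification head; the 0.5 rate matches the example above, and the layer sizes are assumptions for illustration:

```python
import torch.nn as nn

head_with_dropout = nn.Sequential(
    nn.Dropout(p=0.5),   # zeroes half of the activations during training
    nn.Linear(512, 256),
    nn.GELU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)
# Dropout is only active in training mode; model.eval() disables it automatically.
```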

Final words

No solution is one-size-fits-all, especially when it comes to machine learning, so try to find a setup suited to your use case rather than copying something someone else made. Still, these tips have tended to work for me, and I wanted to pass them along in hopes of speeding up your modeling. Thanks for reading!