
Hiera: Hierarchical Vision Transformer

A Simple ViT with no bells and whistles.
Created on June 6 | Last edited on June 6
Hiera is a Hierarchical Vision Transformer (ViT) that removes all the bells and whistles found in other ViTs. What does this mean? Much of the literature building on ViTs has incorporated complex modules that instill spatial bias in the model. Though these modules help accuracy and seem efficient, they generally make the model slower.
This paper argues that these additional techniques aren't needed, and that a strong pretext task such as masked autoencoder (MAE) pretraining is sufficient. Their line of models, Hiera, demonstrates not only faster inference but also better performance.
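To make the MAE pretext task concrete, here is a minimal sketch of the masking idea: a large fraction of patch tokens is hidden at random, the encoder sees only the visible patches, and the model is trained to reconstruct the rest. This is an illustrative toy in NumPy, not the paper's implementation; all names and shapes are assumptions.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.6, seed=0):
    """MAE-style random masking (toy sketch).

    `patches` has shape (num_patches, dim). A random subset is kept
    "visible" and would be fed to the encoder; the masked remainder is
    what the decoder learns to reconstruct.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = rng.permutation(n)[:n_keep]  # indices of visible patches
    mask = np.ones(n, dtype=bool)           # True = masked
    mask[keep_idx] = False                  # False = visible
    return patches[keep_idx], keep_idx, mask

# 8 patch tokens of dimension 4; mask half of them
patches = np.arange(8 * 4).reshape(8, 4).astype(float)
visible, keep_idx, mask = random_masking(patches, mask_ratio=0.5)
print(visible.shape)  # (4, 4): only the visible half reaches the encoder
```

With high mask ratios (the MAE paper uses 75%), the encoder processes only a small fraction of the tokens, which is where much of the pretraining speedup comes from.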


They chose MViTv2 as the starting point and iteratively removed or simplified its components:

- Relative positional embeddings were replaced with absolute positional embeddings.
- Convolution layers were replaced with max pooling layers, and stride-1 max pooling layers were removed.
- A simple trick removed the padding overhead, and the residual connections leading into the attention layers were deleted.
- The pooling attention in MViTv2 was replaced with Mask Unit Attention.
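The last change above swaps global pooling attention for attention that is local to each "mask unit" (a contiguous group of tokens). The following single-head toy sketch shows that idea: the token sequence is split into units and attention runs independently within each unit. It is an assumption-laden illustration (no learned projections, no multi-head logic), not Hiera's actual code.

```python
import numpy as np

def mask_unit_attention(x, n_units):
    """Toy sketch of local ("mask unit") attention.

    Tokens in `x` (num_tokens, dim) are split into `n_units` contiguous
    units, and plain softmax attention is computed within each unit only,
    so no token attends outside its own unit.
    """
    tokens_per_unit = x.shape[0] // n_units
    out = np.empty_like(x)
    for u in range(n_units):
        blk = x[u * tokens_per_unit:(u + 1) * tokens_per_unit]
        scores = blk @ blk.T / np.sqrt(blk.shape[1])
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)
        out[u * tokens_per_unit:(u + 1) * tokens_per_unit] = attn @ blk
    return out

x = np.random.default_rng(0).normal(size=(16, 8))  # 16 tokens, dim 8
y = mask_unit_attention(x, n_units=4)              # attention within 4 local units
print(y.shape)  # (16, 8)
```

Restricting attention to units keeps the cost linear in the number of units and lines up naturally with MAE-style masking, since whole units can be dropped during pretraining.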
The rest of the paper presents thorough experiments comparing Hiera against other ViTs on different benchmarks. For more details on Hiera and the results, check out the paper and code linked in the References below!

References

Tags: ML News