
Zorro: Capturing Human Flexibility in Transformers

Zorro, a paper by DeepMind and the University of Oxford's Visual Geometry Group (VGG, Department of Engineering Science), introduces a unique approach to handling multiple modalities.

Coined Zorro, this approach specializes in disentangling how multiple modalities are mixed inside a transformer. It can be added to existing transformer models like ViT, HiP, and Swin.
In multimodal transformers, the representations of different modalities are often entangled and indistinguishable (and potentially dependent on each other). Zorro applies a special masking operation that lets a transformer process multiple modalities (audio and video in the paper) or just one.
The authors note that the work was inspired by the flexibility humans have in relying on several senses or just one (e.g., hearing alone versus hearing and sight together) to understand something. Notably, Zorro can be trained both with supervision and self-supervised!
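To make the masking idea concrete, here is a minimal sketch (not the authors' code) of a Zorro-style attention mask: audio tokens attend only to audio, video tokens only to video, and a small set of "fusion" tokens can attend to everything. The token counts and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def zorro_style_mask(n_audio: int, n_video: int, n_fusion: int) -> np.ndarray:
    """Return a boolean attention mask of shape (N, N), where N is the total
    token count and mask[q, k] = True means query token q may attend to key k."""
    n = n_audio + n_video + n_fusion
    audio = slice(0, n_audio)
    video = slice(n_audio, n_audio + n_video)
    fusion = slice(n_audio + n_video, n)

    mask = np.zeros((n, n), dtype=bool)
    mask[audio, audio] = True    # audio stream only sees audio tokens
    mask[video, video] = True    # video stream only sees video tokens
    mask[fusion, :] = True       # fusion tokens read from every stream
    return mask

# Example: 4 audio tokens, 4 video tokens, 2 fusion tokens.
print(zorro_style_mask(4, 4, 2).astype(int))
```

Because the per-modality streams never attend to the other modality or to the fusion tokens, their outputs stay purely unimodal, which is what allows the same trained model to be run on audio only, video only, or both.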

Reference

Recasens et al., "Zorro: The Masked Multimodal Transformer" (DeepMind, 2023)

Tags: ML News