
Zorro: Capturing Human Flexibility in Transformers

Zorro, a paper by DeepMind and the University of Oxford's Visual Geometry Group (VGG, Department of Engineering Science), introduces a unique approach to handling multiple modalities.

Coined Zorro, this approach specializes in disentangling how multiple modalities are mixed inside a transformer. It can be added to existing transformer models like ViT, HiP, and Swin.
In multimodal transformers, the representations of different modalities are often entangled and indistinguishable (and potentially dependent on each other). Zorro applies a special masking operation that lets a transformer process multiple modalities (audio and video in the paper) or just one.
The authors note that the work was inspired by the flexibility humans have in relying on several senses or just one (e.g., hearing alone versus hearing and sight together) to understand something. Notably, Zorro can be trained both with supervision and self-supervised!
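To make the masking idea concrete, here is a minimal sketch (not the authors' code) of a Zorro-style attention mask: audio tokens attend only to audio, video tokens only to video, and a small set of "fusion" tokens can attend to everything. The token counts and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def zorro_style_mask(n_audio: int, n_video: int, n_fusion: int) -> np.ndarray:
    """Return a boolean attention mask of shape (N, N), where N is the total
    token count and mask[q, k] = True means query token q may attend to key k."""
    n = n_audio + n_video + n_fusion
    audio = slice(0, n_audio)
    video = slice(n_audio, n_audio + n_video)
    fusion = slice(n_audio + n_video, n)

    mask = np.zeros((n, n), dtype=bool)
    mask[audio, audio] = True    # audio stream only sees audio tokens
    mask[video, video] = True    # video stream only sees video tokens
    mask[fusion, :] = True       # fusion tokens read from every stream
    return mask

# Example: 4 audio tokens, 4 video tokens, 2 fusion tokens.
print(zorro_style_mask(4, 4, 2).astype(int))
```

Because the per-modality streams never attend to the other modality or to the fusion tokens, their outputs stay purely unimodal, which is what allows the same trained model to be run on audio only, video only, or both.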

Reference

Recasens et al., "Zorro: The Masked Multimodal Transformer" (DeepMind, 2023)

Tags: ML News