MERLOT Reserve: Multimodal Neural Script Knowledge through Vision and Language and Sound
MERLOT Reserve is a model that learns to represent videos over time and across modalities: audio, subtitles, and video frames. It was trained on roughly 20 million YouTube videos. So how does it work? As humans, we can navigate the multimodal world with little supervision, but this isn't as straightforward for an AI system. The authors therefore introduce a new pretraining objective: given a video in which a snippet of audio or text is replaced with a MASK token, MERLOT Reserve learns to predict what belongs there.
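The paper frames this masked prediction contrastively: the joint encoder's output at each MASK position is matched against independently encoded candidate audio or text spans, with the other spans in the batch acting as negatives. Below is a minimal sketch of such a contrastive masked-span loss in PyTorch; the tensor names (`mask_preds`, `span_targets`) and the temperature value are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a contrastive masked-span objective in the spirit of
# MERLOT Reserve (hypothetical names; not the authors' code).
import torch
import torch.nn.functional as F

def contrastive_span_loss(mask_preds, span_targets, temperature=0.05):
    """
    mask_preds:   (B, D) joint-encoder outputs at the MASKed positions
    span_targets: (B, D) independent encodings of the true audio/text spans
    Returns a symmetric InfoNCE loss over the in-batch candidates.
    """
    # L2-normalize so the dot product is a cosine similarity
    preds = F.normalize(mask_preds, dim=-1)
    targets = F.normalize(span_targets, dim=-1)

    # (B, B) similarity matrix: each prediction scored against every target
    logits = preds @ targets.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)

    # Each MASK prediction should select its own span, and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Usage with dummy features (batch of 8 masked spans, 512-d embeddings)
preds = torch.randn(8, 512)
targets = torch.randn(8, 512)
print(contrastive_span_loss(preds, targets).item())
```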
The empirical results show that the model learns strong representations of videos across all constituent modalities. When fine-tuned, it sets a new state of the art on both Visual Commonsense Reasoning (VCR) and video question answering (TVQA), outperforming prior work by 5% and 7% respectively. The learning objective also enables zero-shot transfer: the model obtains competitive results on four video-understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark.
Check out the paper and demo below: