MERLOT Reserve: Multimodal Neural Script Knowledge through Vision and Language and Sound
MERLOT Reserve is a model that learns to represent videos over time and across modalities: audio, subtitles, and video frames. It was trained on roughly 20 million YouTube videos. So how does it work? As humans, we can navigate the multimodal world with little supervision, but this isn't as straightforward for an AI system. The authors therefore introduce a new pretraining objective: given a video in which a snippet of audio or text is replaced with a MASK token, MERLOT Reserve learns to predict what belongs there.
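The paper frames this masked prediction contrastively: the joint encoder's output at each MASK position is matched against independently encoded candidate audio or text spans, with the other spans in the batch acting as negatives. Below is a minimal sketch of such a contrastive masked-span loss in PyTorch; the tensor names (`mask_preds`, `span_targets`) and the temperature value are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a contrastive masked-span objective in the spirit of
# MERLOT Reserve (hypothetical names; not the authors' code).
import torch
import torch.nn.functional as F

def contrastive_span_loss(mask_preds, span_targets, temperature=0.05):
    """
    mask_preds:   (B, D) joint-encoder outputs at the MASKed positions
    span_targets: (B, D) independent encodings of the true audio/text spans
    Returns a symmetric InfoNCE loss over the in-batch candidates.
    """
    # L2-normalize so the dot product is a cosine similarity
    preds = F.normalize(mask_preds, dim=-1)
    targets = F.normalize(span_targets, dim=-1)

    # (B, B) similarity matrix: each prediction scored against every target
    logits = preds @ targets.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)

    # Each MASK prediction should select its own span, and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Usage with dummy features (batch of 8 masked spans, 512-d embeddings)
preds = torch.randn(8, 512)
targets = torch.randn(8, 512)
print(contrastive_span_loss(preds, targets).item())
```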
The empirical results show that the model learns strong representations of videos across all constituent modalities. When fine-tuned, it sets a new state of the art on both Visual Commonsense Reasoning (VCR) and video question answering (TVQA), outperforming prior work by 5% and 7% respectively. The learning objective also enables zero-shot transfer: the model obtains competitive results on four video-understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark.
Check out the paper and demo below: