
Meta AI Releases MUGEN Dataset For Multimodal ML

Meta AI researchers have released their work on a new multimodal dataset: MUGEN (Multimodal Understanding and GENeration). The dataset contains 375K fully annotated video-audio-text samples for training multimodal models.
Researchers at Meta AI have been working on a new dataset for multimodal research called MUGEN (Multimodal Understanding and GENeration). The dataset aims to further research into models that work with multimodal data: video, audio, and text together in different combinations. Multimodal models are, in a sense, the ultimate end goal of AI: a generalized system capable of applying all of its senses toward a desired action.

Meta AI highlighted the project in a blog post today, which covers a paper written by the researchers, presented alongside a project website and a GitHub repository containing all the code.

What exactly is the MUGEN dataset?

The MUGEN dataset sits as a middle ground between many other multimodal datasets, which are either open world (containing real-life recordings that can be overly complex) or closed world (containing small-scope, artificially created environments). For MUGEN, the researchers wanted the rich and varied data of an open world with the precision and predictability of a closed world. To achieve this, they opted to use a modified version of the open-source CoinRun game engine.
The CoinRun game features a character (which the team has named Mugen) who runs around the game space collecting coins, killing monsters, and completing other tasks. The environment is simple to a human observer, but the action space an AI model must contend with is vast. The modified version of the game introduces audio cues tied to certain actions, an altered game engine more conducive to training, new game interactions, and more.

To create a multimodal dataset out of this game, the researchers followed a multi-step process. To get the gameplay, they trained 14 different RL agents to play the game with different goals (some played risky for short-term gain, others were more careful and methodical). The gameplay videos were then split into three-second clips. For the text descriptions, human annotators were asked to describe the actions in a sentence or two, and an algorithm was also used to produce auto-generated descriptions. A total of 375K samples were created by the end of the process.
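To give a feel for one step of that pipeline, here is a minimal sketch of how a long gameplay recording could be split into fixed three-second clips with OpenCV. This is not the authors' actual tooling; the `split_into_clips` helper and the file names are hypothetical.

```python
import cv2

def split_into_clips(video_path, clip_seconds=3, out_prefix="clip"):
    """Split a gameplay video into fixed-length clips (3 seconds by default)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frames_per_clip = int(round(fps * clip_seconds))

    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = None
    clip_idx, frame_idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Start a new output file every `frames_per_clip` frames.
        if frame_idx % frames_per_clip == 0:
            if writer is not None:
                writer.release()
            writer = cv2.VideoWriter(
                f"{out_prefix}_{clip_idx:05d}.mp4", fourcc, fps, (width, height)
            )
            clip_idx += 1
        writer.write(frame)
        frame_idx += 1

    if writer is not None:
        writer.release()
    cap.release()
    return clip_idx  # number of clips written

# Example usage (hypothetical file name):
# split_into_clips("mugen_gameplay.mp4")
```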
TL;DR: The MUGEN dataset features 375K samples of video, audio, and text, meticulously collected and annotated by AI and humans working together. The content is gameplay from a modified version of CoinRun, chosen to support a deep array of interactions while keeping the scope manageable.

Where can I get the MUGEN dataset?

The MUGEN dataset, the modified CoinRun engine, and data examples are all available on the project web page: https://mugen-org.github.io/
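Once downloaded, pairing clips with their text descriptions might look roughly like the sketch below. This is only an assumption about the release's file layout: the `annotations.json` manifest, the `videos/` folder, and the `descriptions` field are placeholders, so check the official repository for the real format.

```python
import json
from pathlib import Path

# Hypothetical layout: a JSON manifest mapping each clip to its descriptions.
DATA_ROOT = Path("mugen_data")

def load_pairs(manifest_name="annotations.json"):
    """Yield (video_path, text) pairs from a clip-level annotation manifest."""
    with open(DATA_ROOT / manifest_name) as f:
        manifest = json.load(f)
    for entry in manifest:
        video_path = DATA_ROOT / "videos" / entry["video"]
        # Each clip may carry both human-written and auto-generated descriptions.
        for text in entry.get("descriptions", []):
            yield video_path, text

for video_path, text in load_pairs():
    print(video_path, "->", text)
```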


Tags: ML News