
Crossmodal-3600: Google's New Multilingual, Multicultural Image Captioning Dataset

Google researchers have released a new multilingual, multicultural image captioning dataset called Crossmodal-3600.
Created on October 13 | Last edited on October 13
Today Google released Crossmodal-3600 (a.k.a. XM3600), an image captioning dataset that pairs a wide range of culturally varied images with captions in 36 languages. The dataset stands out from existing image captioning datasets, which the researchers behind the project found were too focused on English, with images that similarly lacked cultural variety.


Crossmodal-3600's varied content

Crossmodal-3600 includes 3,600 images collected from the Open Images dataset. These images represent cultures from across the globe and were specifically selected to match the 36 languages the dataset covers. In the spirit of the project, this selection of images better reflects the multicultural reality of the people who speak those languages.
Each image in the dataset is captioned in 36 languages. Alongside English, 30 of these languages were chosen based on their prevalence on the web, and the remaining five are under-resourced languages chosen either because they have many native speakers or because they come from continents that would otherwise not have been covered.
Rather than just one caption per language for each image, each image receives, on average, two captions in each of the 36 languages. In total, Crossmodal-3600 contains 261,375 captions, all written by human participants.
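As a quick sanity check, the reported figures are mutually consistent: 3,600 images times 36 languages at roughly two captions each lands close to the 261,375-caption total.

```python
# Headline numbers as reported in the Crossmodal-3600 announcement.
num_images = 3600
num_languages = 36
total_captions = 261_375

# Average captions per image per language: just over two.
avg_per_image_per_language = total_captions / (num_images * num_languages)
print(f"{avg_per_image_per_language:.2f}")  # ≈ 2.02
```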

You can look through Crossmodal-3600's contents on its web page.
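For a sense of how a multilingual caption set like this might be handled once downloaded, here is a minimal sketch that groups captions by language. The record structure (the `image_id`, `language`, and `caption` fields, and the sample rows) is an assumption for illustration, not the dataset's actual schema.

```python
from collections import defaultdict

# Hypothetical caption records; the real Crossmodal-3600 download may use a
# different schema, so treat these field names and values as illustrative only.
records = [
    {"image_id": "img_0001", "language": "en", "caption": "A red bus on a city street."},
    {"image_id": "img_0001", "language": "sw", "caption": "Basi jekundu barabarani mjini."},
    {"image_id": "img_0002", "language": "en", "caption": "A bowl of noodle soup."},
]

# Group captions per language so per-language subsets are easy to inspect.
captions_by_language = defaultdict(list)
for rec in records:
    captions_by_language[rec["language"]].append(rec["caption"])

for lang, captions in sorted(captions_by_language.items()):
    print(lang, len(captions))
```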

How Crossmodal-3600 was created

To ensure that each caption is accurate and, most importantly, valuable, humans wrote every caption in the dataset.
Annotators worked in batches of 15 images, each presented with an AI-generated caption written in a deliberately consistent style. Annotators first rated the quality of the AI-generated captions one by one, then went through the images again to write their own captions in the target language.
The idea behind this process is to instill the desired writing style without letting annotators memorize the English captions and simply translate them. With 15 images per batch, annotators cannot memorize individual captions for direct translation, but they do pick up the writing style and apply it naturally when captioning in the target language.
Trial runs were performed for each language as well, to ensure the process produced good results.

Find out more
