Meta Unveils IMAGEBIND
Meta learns from data across all modalities
Meta's AI prowess is not to be underestimated, as they have continued to unveil interesting and useful AI methods that rival those of OpenAI and Google.
Led by Yann LeCun, the team places a strong focus on self-supervised learning (SSL), which aims to uncover patterns in data and learn without requiring large amounts of costly annotated data. One of their latest offerings, IMAGEBIND, is such a model: it learns patterns between different forms of data that occur simultaneously (e.g., the audio and images at the same moment in a video).
Utilizing the principles of self-supervised learning, IMAGEBIND is a multimodal model that learns to align different data modalities within a shared embedding space, without the need for large amounts of paired, annotated data. This approach enables the model to understand and correlate information across different forms of media, like text, images, and audio, and even less common modalities like inertial measurement unit (IMU) data, thermal images, and depth images.
Contrastive Loss
Contrastive loss is a type of loss function that encourages the model to learn to differentiate between similar and dissimilar examples. Typically, it is applied by taking a piece of data, creating a positive example by slightly augmenting it, and then sampling another piece of data from the dataset to serve as the negative pair.
The original data, positive sample, and negative sample are then fed into the loss function. In IMAGEBIND, however, the positive pair does not come from augmentation: it is simply the pair of embedding vectors from two different modalities captured at the same moment in time. For example, the audio and images from the same point in a video form a positive pair, whereas images and audio from different points in the video form negative pairs, as sketched below.
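To make that concrete, here is a minimal PyTorch sketch of how a batch of temporally aligned image/audio clips yields positives and negatives for free. The random tensors and shapes are placeholders standing in for encoder outputs, not the actual IMAGEBIND components or data pipeline.

```python
import torch
import torch.nn.functional as F

# Toy batch: B clips, each contributing one image frame and the audio heard
# at the same timestamp. Random tensors stand in for encoder outputs.
B, D = 8, 512
image_embeddings = F.normalize(torch.randn(B, D), dim=-1)
audio_embeddings = F.normalize(torch.randn(B, D), dim=-1)

# Cosine similarity between every image and every audio clip in the batch.
similarity = image_embeddings @ audio_embeddings.T  # (B, B)

# The diagonal holds the positives (image i with the audio recorded at the
# same moment); every off-diagonal entry is a "free" negative, pairing an
# image with audio taken from a different time or a different clip.
positives = similarity.diag()
negatives = similarity[~torch.eye(B, dtype=torch.bool)]
```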

InfoNCE Loss (a form of contrastive loss)
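For reference, the loss in the figure takes roughly the following form, as given in the IMAGEBIND paper, where q_i is the normalized embedding of image i and k_i is the normalized embedding of the observation from the other modality paired with it (the paper also adds the symmetric term L_{M,I}):

$$
L_{I,M} = -\log \frac{\exp\left(q_i^{\top} k_i / \tau\right)}{\exp\left(q_i^{\top} k_i / \tau\right) + \sum_{j \neq i} \exp\left(q_i^{\top} k_j / \tau\right)}
$$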
The model architecture involves multiple different encoders (one for each modality).
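As a rough sketch of that idea, the snippet below uses one placeholder encoder per modality, all projecting into a shared embedding dimension. The MLPs and input shapes are stand-ins, not the actual IMAGEBIND components; only the shared output space matters here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 1024  # every modality projects into a space of this size

# Placeholder encoders: one per modality, all ending in the same dimension.
encoders = nn.ModuleDict({
    "image": nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, EMBED_DIM)),
    "audio": nn.Sequential(nn.Flatten(), nn.Linear(1 * 128 * 204, EMBED_DIM)),
    "depth": nn.Sequential(nn.Flatten(), nn.Linear(1 * 224 * 224, EMBED_DIM)),
})

def embed(modality: str, x: torch.Tensor) -> torch.Tensor:
    # Whatever the input modality, the output lives in the same shared space.
    return F.normalize(encoders[modality](x), dim=-1)

image_z = embed("image", torch.randn(4, 3, 224, 224))
audio_z = embed("audio", torch.randn(4, 1, 128, 204))
assert image_z.shape == audio_z.shape  # directly comparable embeddings
```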
Meta chose to use open-source versions of CLIP and their ViT model for the text and image encoders respectively. Transformer-based encoders were used for the remaining modalities (audio, IMU, thermal, and depth). Using an InfoNCE loss (a form of contrastive loss), Meta trained the encoders so that paired inputs produce similar embedding vectors regardless of which modality is fed to the encoder.
The InfoNCE loss can be seen above, where τ is a scalar temperature that controls the smoothness of the softmax distribution and j denotes unrelated observations, also known as ‘negatives’. So, for example, the embedding of an audio recording of a dog barking would be similar to the embedding produced by the image encoder when images of dogs are fed through it.
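Below is a compact PyTorch sketch of this loss over a batch, where row i of each tensor forms a positive pair and every other row serves as an in-batch negative; the temperature argument plays the role of τ. This is an illustration of the loss, not Meta's training code.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, key_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: row i of both tensors is a positive pair,
    all other rows act as in-batch negatives."""
    q = F.normalize(query_emb, dim=-1)
    k = F.normalize(key_emb, dim=-1)
    logits = q @ k.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(q.size(0))     # positives sit on the diagonal
    # Row-wise softmax + cross-entropy implements -log(exp(pos) / sum(exp(all))).
    return F.cross_entropy(logits, targets)

# Example: pull paired image/audio embeddings together, push mismatched ones apart.
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```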

Replacing text embeddings with audio embeddings for object localization
This achievement means that the underlying embedding space can be used across modalities, and even lets new modalities be plugged into existing architectures! For example, IMAGEBIND's embeddings can enhance the capabilities of existing models without any need for re-training. This could mean upgrading a text-based detector to an audio-based one, or even repurposing a diffusion model to generate images from different types of sounds.
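As a rough illustration of the "swap the conditioning embedding" idea, here is a hedged sketch. The functions below (encode_audio, detect_objects) are hypothetical placeholders, not the actual IMAGEBIND or detector APIs; the only point is that a model built to consume CLIP-style embeddings could, in principle, be handed an audio embedding from the same shared space instead of a text embedding.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins: in reality these would be IMAGEBIND's audio encoder
# and an open-vocabulary detector that conditions on CLIP-style text embeddings.
def encode_audio(waveform: torch.Tensor) -> torch.Tensor:
    return F.normalize(torch.randn(1, 1024), dim=-1)  # placeholder embedding

def detect_objects(image: torch.Tensor, query_embedding: torch.Tensor) -> list:
    # A text-based detector normally receives a text embedding here; because
    # IMAGEBIND aligns audio to the same space, an audio embedding can be
    # dropped in without retraining the detector.
    return []  # placeholder output

image = torch.randn(1, 3, 224, 224)
barking_clip = torch.randn(1, 16000)        # ~1 second of audio (placeholder)
audio_query = encode_audio(barking_clip)    # instead of a text embedding of "a dog"
boxes = detect_objects(image, audio_query)  # "find the thing making this sound"
```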
The authors also report strong few-shot and zero-shot classification performance, along with improvements on tasks like cross-modal retrieval.

Embedding arithmetic: replacing text embeddings with audio embeddings

Building Human-Like Systems
As we continue to make progress in self-supervised learning and multimodal AI, we get closer to a future where AI systems can understand and interact with the world much like we do. This work will likely have even greater implications as we build more sophisticated robotic hardware, which will need to rely on alternative modalities that humans take for granted!