
Data2vec 2.0: Meta AI's Modality-Agnostic Self-Supervised Model Training Framework

Today Meta AI announced data2vec 2.0, an updated framework for training a model in a self-supervised and modality-agnostic way. It is open-source with pre-trained models available.
Today, Meta AI announced data2vec 2.0, the follow-up to data2vec released earlier this year. The framework is self-supervised and modality-agnostic, meaning it learns vision, text, and speech in the same way, with the same model architecture, and without any labelled data.

Data2vec 2.0 is highly efficient to train, matching the performance of well-established models in significantly less training time on the same hardware. It is also fully open-source, with pre-trained models and datasets available for download on GitHub.

How data2vec 2.0 works

Data2vec takes a self-supervised learning approach, meaning the input data isn't labelled. The learning objective is also modality-agnostic, so it works with any type of data. The core of the learning process is a pair of identical encoder models: a student model and a teacher model whose weights are an exponentially moving average of the student's, so they always lag slightly behind.
The goal of data2vec is for the student model to learn an abstract vector representation of the input data rather than to train toward any modality-specific output target. It does this in a learning loop: the raw, unmasked input is fed to the teacher model while a masked version of the same input is fed to the student model, and the student must then predict the teacher's vector representation of the masked portions.
The significance of this modality-agnostic, self-supervised approach is that it better reflects the way humans learn. The encoder at the core of a data2vec system builds its own abstract representation of the patterns it discovers, regardless of the input data type. And because it predicts the entire vector representation of a masked region rather than individual pixels, words, or audio frames, it has to use the full context of the sample, which pushes it toward a deeper understanding of the data.
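
To make this loop concrete, here's a minimal sketch of the student-teacher objective in PyTorch. This isn't Meta's implementation: the tiny encoder, the masking scheme, the 50% mask ratio, and the EMA decay value are all simplified placeholders (the real models are Transformers, and data2vec averages its targets over several of the teacher's top layers), but the shape of the training step is the same.

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for the shared encoder (the real data2vec encoder is a Transformer).
encoder_student = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 128))
encoder_teacher = copy.deepcopy(encoder_student)   # teacher starts as a copy of the student
for p in encoder_teacher.parameters():
    p.requires_grad_(False)                        # the teacher is never trained directly

optimizer = torch.optim.AdamW(encoder_student.parameters(), lr=1e-4)
ema_decay = 0.999                                  # placeholder decay value

def training_step(batch):
    # batch: (batch, seq_len, feat) raw, unlabeled inputs of any modality
    with torch.no_grad():
        targets = encoder_teacher(batch)           # teacher sees the full, unmasked input

    mask = torch.rand(batch.shape[:2]) < 0.5       # mask ~50% of positions (placeholder ratio)
    student_in = batch.clone()
    student_in[mask] = 0.0                         # simplistic masking; real models use learned mask tokens
    preds = encoder_student(student_in)

    # Regression loss only on the masked positions: predict the teacher's representation there.
    loss = nn.functional.mse_loss(preds[mask], targets[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher weights track the student as an exponential moving average.
    with torch.no_grad():
        for t, s in zip(encoder_teacher.parameters(), encoder_student.parameters()):
            t.mul_(ema_decay).add_(s, alpha=1 - ema_decay)
    return loss.item()

# One step on random "data" just to show the shapes involved.
print(training_step(torch.randn(8, 16, 64)))
```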

Comparing data2vec 2.0 to others

Though data2vec 2.0 follows fundamentally the same process as its first incarnation, several changes greatly speed up training: the teacher's target representations for a sample are computed once and reused across multiple masked versions of that sample, the student encoder skips the masked positions entirely, and a lightweight convolutional decoder reconstructs the representations for those masked positions before the loss is computed. A rough sketch of these ideas follows below.
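
As a standalone illustration of where the savings come from, the toy sketch below mirrors those three ideas: one teacher pass shared by several masked views, a student that only encodes the visible positions, and a small convolutional decoder. The layer choices, mask ratio, and number of masked views are placeholders, not the architecture from the paper.

```python
import torch
import torch.nn as nn

B, T, D = 8, 16, 128                       # toy batch size, sequence length, embedding size
M = 4                                      # number of masked versions sharing one teacher pass

teacher = nn.Linear(D, D)                  # stand-ins for the real Transformer encoders
student = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
decoder = nn.Conv1d(D, D, kernel_size=3, padding=1)   # lightweight convolutional decoder

x = torch.randn(B, T, D)                   # raw, unlabeled inputs

# 1) The teacher representation is computed once per sample...
with torch.no_grad():
    targets = teacher(x)

losses = []
for _ in range(M):                         # ...and reused for several masked versions.
    keep = torch.randperm(T) < T // 2      # keep a random half of the positions
    visible = student(x[:, keep, :])       # 2) the student only encodes visible positions

    # 3) A cheap convolutional decoder rebuilds a full-length sequence
    #    from the visible encodings before comparing against the teacher.
    full = torch.zeros(B, T, D)
    full[:, keep, :] = visible
    recon = decoder(full.transpose(1, 2)).transpose(1, 2)

    losses.append(nn.functional.mse_loss(recon[:, ~keep, :], targets[:, ~keep, :]))

loss = torch.stack(losses).mean()
print(loss.item())
```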
The most important thing data2vec brings to the table is pure speed: thanks to the way the encoder trains, and especially with the system improvements in 2.0, data2vec can match the performance of traditionally trained models at a fraction of the training time and compute cost.
Data2vec 2.0 was tested in three modalities: computer vision, speech, and text (against a masked autoencoder, wav2vec 2.0, and RoBERTa, respectively). All models were trained on the same hardware, and in each category a data2vec 2.0 model was trained until it matched its competitor's performance.
For computer vision, data2vec 2.0 matched the masked autoencoder's performance 16x faster; for speech, it matched wav2vec 2.0's performance 11x faster; and for text, it matched RoBERTa's performance 2x faster. Additional training time improved data2vec 2.0's results across all three categories while still staying well under the time it took the other models to train.

Data2vec 2.0 is open-source

Data2vec 2.0 is, of course, open-source like its predecessor. Head to the GitHub repository for instructions on running data2vec 2.0 yourself and to download the pre-trained models created during the research.
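
The repository's data2vec examples document the exact, supported entry points; as a rough illustration only, a downloaded checkpoint can typically be loaded through fairseq's generic checkpoint utilities, as sketched below. The file path is a placeholder, and some checkpoints may require the additional task or model registration described in the repo.

```python
# Rough sketch: loading a downloaded data2vec 2.0 checkpoint with fairseq's
# generic utilities. The checkpoint path is a placeholder, not a real release
# artifact; see the repository's data2vec examples for the supported workflow.
import torch
from fairseq import checkpoint_utils

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/data2vec2_checkpoint.pt"]   # hypothetical local path to a downloaded model
)
model = models[0].eval()

with torch.no_grad():
    # Confirm the model loads; real inputs depend on the modality the
    # checkpoint was trained on (images, audio, or text).
    print(type(model).__name__)
```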

Find out more

Read the announcement blog post for data2vec 2.0 here, and take a look at the research paper for all the details here.
Tags: ML News