
I-JEPA, MusicGen


I-JEPA

I-JEPA, or Image Joint Embedding Predictive Architecture, is Meta AI's model built on Yann LeCun's vision of world models that learn how the world works. What makes this architecture different from the current popular LLM/transformer paradigm?
This computer vision architecture learns to predict representations of parts of the input from other parts of the input, which makes I-JEPA a hybrid between joint-embedding and generative architectures: it fills in missing regions like a generative model, but does so in representation space rather than pixel space.


I-JEPA is image-based: its context encoder is a ViT that encodes a context block, i.e., a sampled subset of the input image's patches.

Then a predictor, conditioned on the context encoding and on the locations of the target blocks, predicts the representations of those target blocks. These predictions are trained to match the outputs of a separate target encoder, whose weights are an exponential moving average of the context encoder's.
Because prediction happens in representation space and no hand-crafted data augmentations are applied to the input images, this architecture also reduces the computational cost of pre-training.
All in all, I-JEPA learns semantic image representations while relying on far fewer hand-crafted priors and interventions in the learning process.
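A minimal, illustrative PyTorch sketch of this training objective is below. It is not the authors' implementation: the real context encoder, target encoder, and predictor are ViTs operating on image patches, and the names and sizes here (context_encoder, predictor, ijepa_step, dim=64) are toy stand-ins.

import copy
import torch
import torch.nn.functional as F

dim = 64  # toy embedding width; the real model uses ViT-scale dimensions

# Stand-ins for the ViT context encoder and the narrow ViT predictor.
context_encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
predictor = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=1,
)

# Target encoder: same architecture, weights kept as an EMA of the context encoder.
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

def ijepa_step(patches, context_idx, target_idx, pos_embed):
    """One toy I-JEPA step on pre-patchified inputs of shape (B, N, dim)."""
    # 1) Encode only the visible context block.
    ctx = context_encoder(patches[:, context_idx] + pos_embed[context_idx])
    # 2) Target representations come from the EMA encoder applied to the full image.
    with torch.no_grad():
        tgt = target_encoder(patches + pos_embed)[:, target_idx]
    # 3) The predictor sees context tokens plus mask tokens carrying target positions.
    mask_tokens = pos_embed[target_idx].expand(patches.size(0), -1, -1)
    pred = predictor(torch.cat([ctx, mask_tokens], dim=1))[:, -len(target_idx):]
    # 4) The loss is computed in representation space, not pixel space.
    return F.mse_loss(pred, tgt)

@torch.no_grad()
def ema_update(momentum=0.996):
    # After each optimizer step, the target encoder slowly tracks the context encoder.
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1 - momentum)

# Example call on random "patch embeddings":
patches = torch.randn(2, 16, dim)
pos_embed = torch.randn(16, dim)
loss = ijepa_step(patches, context_idx=list(range(8)), target_idx=[12, 13, 14, 15], pos_embed=pos_embed)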

MusicGen

MusicGen is Meta's music generation model, conditioned on text and, optionally, a reference melody, much like Google's MusicLM. However, unlike MusicLM, MusicGen uses a single Transformer decoder with an efficient codebook interleaving scheme instead of a cascade or hierarchy of models. The authors leveraged EnCodec, a convolutional autoencoder with Residual Vector Quantization (RVQ), for audio tokenization and T5 for text encoding; both are fed into the single autoregressive Transformer decoder.
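To make the single-decoder design concrete, here is a rough, illustrative PyTorch sketch (not Meta's implementation): embeddings from the K EnCodec codebooks are summed at each timestep, a Transformer decoder cross-attends over projected T5 text states, and K linear heads predict the next code of each stream. The interleaving patterns from the paper are omitted, and all names and sizes (ToyMusicGenDecoder, DIM, etc.) are hypothetical.

import torch
import torch.nn as nn

K, VOCAB, DIM, TEXT_DIM = 4, 2048, 512, 768  # toy sizes, roughly in the spirit of the paper

class ToyMusicGenDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding table per RVQ codebook; embeddings are summed per timestep.
        self.code_emb = nn.ModuleList(nn.Embedding(VOCAB, DIM) for _ in range(K))
        layer = nn.TransformerDecoderLayer(d_model=DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.text_proj = nn.Linear(TEXT_DIM, DIM)  # project T5 hidden states for cross-attention
        # One linear head per codebook predicts the next token of that stream.
        self.heads = nn.ModuleList(nn.Linear(DIM, VOCAB) for _ in range(K))

    def forward(self, codes, text_states):
        # codes: (B, K, T) EnCodec token streams; text_states: (B, L, TEXT_DIM) from T5.
        x = sum(emb(codes[:, k]) for k, emb in enumerate(self.code_emb))  # (B, T, DIM)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, self.text_proj(text_states), tgt_mask=causal)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (B, K, T, VOCAB)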
The code for MusicGen is public in facebookresearch/audiocraft, Meta's PyTorch library for deep learning research on audio generation, and a demo is available on Hugging Face!
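Generating audio from a pretrained checkpoint follows the audiocraft README along these lines (checkpoint names and default arguments may differ across library versions, so treat this as a sketch):

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a pretrained checkpoint; 'facebook/musicgen-small' is the name used in the audiocraft docs.
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)  # generate 8 seconds of audio

# Text-conditioned generation; returns a batch of waveforms at model.sample_rate.
wav = model.generate(['lo-fi hip hop beat with a warm piano melody'])

for idx, one_wav in enumerate(wav):
    # Save each sample as a .wav with loudness normalization.
    audio_write(f'sample_{idx}', one_wav.cpu(), model.sample_rate, strategy='loudness')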
