
AudioPaLM: Speech Translation From Text and Audio Tokens

Meet AudioPaLM, a large language model built on a text-only decoder that handles speech-to-speech and speech-to-text translation by operating on a mixture of text and audio tokens.

AudioPaLM starts from a text-only decoder-based LLM and applies it to speech-to-speech and speech-to-text translation; the key change is that the input is a mixture of text and audio tokens. Accordingly, the transformer also outputs a mixture of text and audio tokens.

Audio Tokenization

A speech representation model, such as w2v-BERT, is used to encode audio. A clustering algorithm such as k-means then discretizes the encodings into a finite vocabulary of audio tokens. The authors tested three different speech representation models: w2v-BERT, USM-v1, and USM-v2.
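The sketch below illustrates the general idea under assumed shapes: fit k-means on frame-level encoder embeddings, then map each frame to its nearest centroid id to get audio tokens. The encoder output is faked with random vectors and the codebook size is illustrative, not the paper's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for frame-level embeddings from a w2v-BERT-style speech encoder.
# Shapes and codebook size here are illustrative, not the paper's settings.
num_frames, embed_dim = 2000, 1024
frame_embeddings = np.random.randn(num_frames, embed_dim).astype(np.float32)

# Fit k-means over the frame embeddings; each cluster id becomes an audio token.
num_audio_tokens = 128  # toy codebook; a real system would use a much larger vocabulary
kmeans = KMeans(n_clusters=num_audio_tokens, random_state=0).fit(frame_embeddings)

# Tokenize audio by assigning each frame to its nearest centroid.
audio_token_ids = kmeans.predict(frame_embeddings)  # shape: (num_frames,)
print(audio_token_ids[:10])
```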
The entire model, apart from the embedding matrices at its input and output (which embed tokens and de-embed them), is agnostic to the number of tokens. The only architectural change was expanding the embedding matrix from shape (t, m), where t is the number of text tokens and m is the embedding dimension, to shape (t + a, m), where a is the number of audio tokens. The a new rows had to be trained from scratch, while the rest of the embedding matrix came from a pretrained checkpoint.
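A minimal sketch of that expansion, assuming illustrative values for t, a, and m: the pretrained (t, m) matrix is kept as-is and a new randomly initialized (a, m) block is appended for the audio tokens.

```python
import numpy as np

# Pretrained text embedding matrix from a checkpoint: one row per text token.
t, m = 32_000, 4096  # assumed text vocab size and embedding dimension
text_embeddings = np.random.randn(t, m).astype(np.float32)  # stand-in for checkpoint weights

# Append `a` new rows for audio tokens; only these rows are trained from scratch.
a = 1024  # assumed number of audio tokens
audio_embeddings = (0.02 * np.random.randn(a, m)).astype(np.float32)  # fresh random init

# The combined (t + a, m) matrix embeds both text and audio token ids.
combined_embeddings = np.concatenate([text_embeddings, audio_embeddings], axis=0)
print(combined_embeddings.shape)  # (33024, 4096)
```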

Training Tasks

AudioPaLM covers a number of tasks: speech recognition, speech translation, speech-to-speech translation, text-to-speech, and text-to-text machine translation. Like T5, the model's input is prefixed by a special tag denoting the task and language, such as [TTS English]; a sketch of how such a sequence might be assembled follows below.
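Here is a hedged sketch of that prefixing scheme. The helper names (`build_input`, `tokenize_text`), tag strings, and vocabulary offset are assumptions for illustration, not AudioPaLM's actual tokenizer or vocabulary.

```python
TEXT_VOCAB_SIZE = 32_000  # assumed text vocabulary size

def tokenize_text(text: str) -> list[int]:
    # Toy stand-in for a real text tokenizer.
    return [hash(tok) % TEXT_VOCAB_SIZE for tok in text.split()]

def build_input(task_tag: str, text_ids: list[int], audio_ids: list[int]) -> list[int]:
    """Prepend a T5-style task/language tag, then text tokens, then audio tokens."""
    tag_ids = tokenize_text(f"[{task_tag}]")  # e.g. "[TTS English]" or "[ASR French]"
    # Audio token ids are offset past the text vocabulary so the two ranges don't collide.
    offset_audio = [i + TEXT_VOCAB_SIZE for i in audio_ids]
    return tag_ids + text_ids + offset_audio

# Example: ask the model to synthesize English speech from text.
seq = build_input("TTS English", tokenize_text("hello world"), audio_ids=[])
print(seq)
```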
More details on AudioPaLM's datasets, metrics, and training regimen can be found in their paper!
