
AudioPaLM: Speech Translation From Text and Audio Tokens

Meet AudioPaLM, a large language model built on a text-only decoder that handles speech-to-speech and speech-to-text translation by operating on a mixture of text and audio tokens.

AudioPaLM starts from a text-only decoder-based LLM and applies it to speech-to-speech and speech-to-text translation; the key change is that the input is a mixture of text and audio tokens. Accordingly, the transformer also outputs a mixture of text and audio tokens.

Audio Tokenization

A speech representation model, such as w2v-BERT, is used to encode audio. A clustering algorithm such as k-means then discretizes the encodings into a finite vocabulary of audio tokens. The authors tested three different speech representation models: w2v-BERT, USM-v1, and USM-v2.
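The sketch below illustrates the general idea under assumed shapes: fit k-means on frame-level encoder embeddings, then map each frame to its nearest centroid id to get audio tokens. The encoder output is faked with random vectors and the codebook size is illustrative, not the paper's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for frame-level embeddings from a w2v-BERT-style speech encoder.
# Shapes and codebook size here are illustrative, not the paper's settings.
num_frames, embed_dim = 2000, 1024
frame_embeddings = np.random.randn(num_frames, embed_dim).astype(np.float32)

# Fit k-means over the frame embeddings; each cluster id becomes an audio token.
num_audio_tokens = 128  # toy codebook; a real system would use a much larger vocabulary
kmeans = KMeans(n_clusters=num_audio_tokens, random_state=0).fit(frame_embeddings)

# Tokenize audio by assigning each frame to its nearest centroid.
audio_token_ids = kmeans.predict(frame_embeddings)  # shape: (num_frames,)
print(audio_token_ids[:10])
```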
The entire model, apart from the embedding matrices at its input and output (which embed tokens and de-embed them), is agnostic to the number of tokens. The only architectural change was expanding the embedding matrix from shape (t, m), where t is the number of text tokens and m is the embedding dimension, to shape (t + a, m), where a is the number of audio tokens. The a new rows had to be trained from scratch, while the rest of the embedding matrix came from a pretrained checkpoint.
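A minimal sketch of that expansion, assuming illustrative values for t, a, and m: the pretrained (t, m) matrix is kept as-is and a new randomly initialized (a, m) block is appended for the audio tokens.

```python
import numpy as np

# Pretrained text embedding matrix from a checkpoint: one row per text token.
t, m = 32_000, 4096  # assumed text vocab size and embedding dimension
text_embeddings = np.random.randn(t, m).astype(np.float32)  # stand-in for checkpoint weights

# Append `a` new rows for audio tokens; only these rows are trained from scratch.
a = 1024  # assumed number of audio tokens
audio_embeddings = (0.02 * np.random.randn(a, m)).astype(np.float32)  # fresh random init

# The combined (t + a, m) matrix embeds both text and audio token ids.
combined_embeddings = np.concatenate([text_embeddings, audio_embeddings], axis=0)
print(combined_embeddings.shape)  # (33024, 4096)
```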

Training Tasks

AudioPaLM covers a number of tasks: speech recognition, speech translation, speech-to-speech translation, text-to-speech, and text-to-text machine translation. Like T5, the model's input is prefixed by a special tag denoting the task and language, such as [TTS English]; a sketch of how such a sequence might be assembled follows below.
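Here is a hedged sketch of that prefixing scheme. The helper names (`build_input`, `tokenize_text`), tag strings, and vocabulary offset are assumptions for illustration, not AudioPaLM's actual tokenizer or vocabulary.

```python
TEXT_VOCAB_SIZE = 32_000  # assumed text vocabulary size

def tokenize_text(text: str) -> list[int]:
    # Toy stand-in for a real text tokenizer.
    return [hash(tok) % TEXT_VOCAB_SIZE for tok in text.split()]

def build_input(task_tag: str, text_ids: list[int], audio_ids: list[int]) -> list[int]:
    """Prepend a T5-style task/language tag, then text tokens, then audio tokens."""
    tag_ids = tokenize_text(f"[{task_tag}]")  # e.g. "[TTS English]" or "[ASR French]"
    # Audio token ids are offset past the text vocabulary so the two ranges don't collide.
    offset_audio = [i + TEXT_VOCAB_SIZE for i in audio_ids]
    return tag_ids + text_ids + offset_audio

# Example: ask the model to synthesize English speech from text.
seq = build_input("TTS English", tokenize_text("hello world"), audio_ids=[])
print(seq)
```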
More details on AudioPaLM's datasets, metrics, and training regimen can be found in their paper!
