
How to Track and Compare Audio Transcriptions With Whisper and Weights & Biases

In this article, we learn how to use OpenAI's Whisper to transcribe a collection of audio files and easily keep track of the progress with Weights & Biases.
Created on February 3 | Last edited on May 7
How well, and how quickly, does Whisper transcribe speech? In this article, we'll find out just that.
In case you're not familiar with Whisper, it's OpenAI's open-source automatic speech recognition (ASR) library. Whisper was trained on 680,000 hours of multilingual audio data and supports 99 languages with varying quality.
Using Whisper To Transcribe a Podcast

Today, we're going to experiment with how it performs when transcribing 75 episodes of Gradient Dissent, the machine learning podcast by Weights & Biases. Specifically, we'll compare three Whisper model sizes: tiny (39M parameters), base (74M parameters), and large (1550M parameters), and evaluate two things:
  1. Relative transcription speed of different model sizes on GPUs for the Python implementation of Whisper
  2. Relative transcription speed of different model sizes on CPUs for the C++ implementation of Whisper
Then, we’ll show how to use Weights & Biases (W&B) as a single source of truth that helps track and visualize all these experiments. In more detail, we will:
  • log experiments to W&B from outside Python, namely from C++ or a bash script running on the command line.
  • run a line-based comparison of audio snippets with their respective transcription in a W&B table.
Did you know that you can track experiment results from C++ or any other language running in a bash script? We'll show you how by using `subprocess`.
💡

A teaser of our project's outcome

The table below is a transcription of the first ~70 seconds of Phil Brown — How IPUs are Advancing Machine Intelligence, an interview between Phil Brown (Director of Applications at Graphcore at the time of recording) and Lukas Biewald (CEO and Co-founder of Weights & Biases).
Each row is a 3-7 second audio snippet, with the corresponding transcriptions produced by the tiny, base, and large Whisper models.
Take a look at rows 15 and 16! ▶️ Play the audio and compare the transcripts. What do you think about the quality of the different model sizes?
💡



Added Bonus

Here's a laser-cut version 🔥 of a W&B parallel coordinates plot for the large model! Why, you ask? You should really be asking, "Why not?"


The Colab

If you'd like to follow along, you can find code for Whisper in Python and Whisper.cpp in C++ in the Colab:



The Code, in Python

Let's see Whisper in action. To start, we'll run this command:
whisper chip-huyen-ml-research.mp3 --model tiny --language=en


1️⃣ Collect Data

For our experiment, we first need to collect the data (in this case, audio from our podcast). We'll use yt-dlp to download all the episodes from the Gradient Dissent YouTube playlist.

Installation

Now, run:
# Install yt-dlp
python3 -m pip install -U yt-dlp

# Download all episodes as MP3 files
yt-dlp -i -x --yes-playlist --audio-format mp3 https://www.youtube.com/playlist?list=PLD80i8An1OEEb1jP0sjEyiLG8ULRXFob_
This is an example of how the downloaded and renamed MP3 files look in W&B Artifacts.
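If you want to reproduce this step yourself, here's a minimal sketch (using the wandb client installed in the next step) of how the downloaded MP3s could be logged as a W&B artifact. The artifact name and type match what the transcription script downloads later; the assumption that the files sit in the current directory is ours.

import wandb
from pathlib import Path

# Log the downloaded episodes as a versioned W&B artifact
with wandb.init(project="gradient-dissent-transcription", job_type="data-upload") as run:
    raw_audio = wandb.Artifact('gradient-dissent-audio', type='raw_audio')
    for mp3_file in Path('.').glob('*.mp3'):  # assumes yt-dlp saved the MP3s here
        raw_audio.add_file(str(mp3_file))
    run.log_artifact(raw_audio)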


2️⃣ Install Dependencies

Next, we'll install the libraries and code that we need to transcribe the fresh data:
  • wandb: to track experiments and version our models with Weights & Biases
  • whisper: to transcribe audio files with Whisper
  • setuptools-rust: used by Whisper for tokenization
  • ffmpeg-python: used by Whisper for down-mixing and resampling

Installation and Initialization

!pip install wandb -qqq
!pip install git+https://github.com/openai/whisper.git -qqq
!pip install setuptools-rust
!pip install ffmpeg-python -qqq

import wandb
wandb.login(host="https://api.wandb.ai")

from google.colab import drive
drive.mount('/content/gdrive')

3️⃣ Transcribe a single file

This is a simple example of how to use Whisper in Python:
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

4️⃣ Track a minimalistic experiment

This is a simple example of how to use W&B Experiment Tracking in Python:
import wandb

with wandb.init(project="gradient-dissent-transcription"):
    train_acc = 0.7
    train_loss = 0.3
    wandb.log({'accuracy': train_acc, 'loss': train_loss})

5️⃣ Version Files With W&B Artifacts


This is a simple example of how to use W&B Artifacts in Python:
with wandb.init(project="gradient-dissent-transcription") as run:
    transcript = wandb.Artifact('transcript', type="transcription")
    transcript.add_file('transcript.txt')
    run.log_artifact(transcript)

6️⃣ Apply Steps 3️⃣ + 4️⃣ + 5️⃣ to Audio Collection

Based on the simple examples above, we'll now combine transcription, wandb.log, and wandb.Artifact to transcribe an entire collection of audio files!
import whisper
import wandb
from pathlib import Path
import time
import ffmpeg

# Specify project settings
PROJECT_NAME="gradient-dissent-transcription"
JOB_DATA_UPLOAD="data-upload"
JOB_TRANSCRIPTION="transcription"
ARTIFACT_VERSION="latest"
AUDIOFORMAT = 'mp3'
MODELSIZE = 'tiny' # tiny, base, small, medium, large

# Load the Whisper model to perform transcription
model = whisper.load_model(MODELSIZE)

# Initialize W&B project
run = wandb.init(project=PROJECT_NAME, job_type=JOB_TRANSCRIPTION)

# Download raw Gradient Dissent audio data from W&B Artifacts
artifact = run.use_artifact('hans-ramsl/gradient-dissent-transcription/gradient-dissent-audio:'+ARTIFACT_VERSION, type='raw_audio')
artifact_dir = artifact.download(root="/content/")
wandb.finish()

# Path to all audio files in `artifact_dir`
audiofiles = Path(artifact_dir).glob('*.'+AUDIOFORMAT)

# Iterate over all audio files
for audio_file in audiofiles:
    # Specify the path to where the audio file is stored
    path_to_audio_file = str(audio_file)
    # Measure start time
    start = time.time()

    # Transcribe. Here is where the heavy lifting takes place. 🫀🏋🏻
    result = model.transcribe(path_to_audio_file)

    # Measure duration of transcription time
    transcription_time = time.time() - start

    # Define a path to the transcript file
    path_to_transcript_file = path_to_audio_file + '.txt'
    # Specify parameters to be logged
    duration = ffmpeg.probe(path_to_audio_file)['format']['duration']
    transcription_factor = float(duration) / float(transcription_time)

    # Save transcription to a local text file
    with open(path_to_transcript_file, 'w') as f:
        f.write(result['text'])
    transcript_text = result['text']

    # Log the local transcript as a W&B artifact
    with wandb.init(project=PROJECT_NAME, job_type=JOB_TRANSCRIPTION) as run:
        # Build a transcription table
        transcript_table = wandb.Table(columns=['audio_file', 'transcript', 'audio_path', 'audio_format', 'modelsize', 'transcription_time', 'audio_duration', 'transcription_factor'])
        # Add audio file, transcription, and parameters to the table
        transcript_table.add_data(wandb.Audio(path_to_audio_file), transcript_text, path_to_audio_file, AUDIOFORMAT, MODELSIZE, transcription_time, duration, transcription_factor)
        transcript_artifact = wandb.Artifact('transcript', type="transcription")
        transcript_artifact.add(transcript_table, 'transcription_table')

        # Log the artifact, including the table, to W&B
        run.log_artifact(transcript_artifact)

        # Log parameters to the Runs table (Experiment Tracking)
        run.log({
            'transcript': transcript_text,
            'audio_path': path_to_audio_file,
            'audio_format': AUDIOFORMAT,
            'modelsize': MODELSIZE,
            'transcription_time': transcription_time,
            'audio_duration': duration,
            'transcription_factor': transcription_factor
        })


The Code, in C++

🤓 💻 If you want to see the C++ code we use, find it in the Colab.
Here is a short preview of the command line usage:
#!/bin/bash

input_file="$1"
model_size="$2"
cores="$3"
output_file=$(realpath "${input_file%.*}.wav")

# Convert the input audio to 16 kHz mono 16-bit PCM WAV, as expected by whisper.cpp
yes | ffmpeg -hide_banner -loglevel error -i "$input_file" -acodec pcm_s16le -ac 1 -ar 16000 "$output_file"

# Run whisper.cpp with the chosen model and thread count, writing the transcript to a file
exec /Users/hans/code/whisper.cpp/main -m /Users/hans/code/whisper.cpp/models/ggml-$model_size.bin -l en -t "$cores" -f "$output_file" > /Users/hans/code/whisper.cpp/transcriptions/temp-transcript.txt
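The bash script itself has no W&B client, so one way to track these runs, as teased earlier, is to call the script from a small Python wrapper via subprocess and log the results from there. Here is a minimal sketch: the script name transcribe.sh and the episode/model/core arguments are placeholders, while the transcript path matches the script above.

import subprocess
import time
import wandb

with wandb.init(project="gradient-dissent-transcription", job_type="transcription") as run:
    start = time.time()
    # Run the whisper.cpp wrapper script: <audio file> <model size> <number of cores>
    subprocess.run(["bash", "transcribe.sh", "pete-warden-practical-applications-of-tinyml.mp3", "base", "8"], check=True)
    transcription_time = time.time() - start

    # Read back the transcript the script just wrote
    with open("/Users/hans/code/whisper.cpp/transcriptions/temp-transcript.txt") as f:
        transcript_text = f.read()

    run.log({
        "transcript": transcript_text,
        "modelsize": "base",
        "transcription_time": transcription_time,
    })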
After successfully running steps 1️⃣-6️⃣, we can now visualize the results and see what we can learn.

7️⃣ Compare Model Sizes



W&B makes it really easy to analyze the performance of those models. During experiment tracking, we logged parameters such as transcription_factor and transcript. Below, you can see that the highest relative speed is 55.158 and the lowest is 1.199.
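These numbers don't have to be read off the dashboard by hand. Here is a minimal sketch that pulls the logged runs back through the public W&B API and computes the same range; the entity/project path mirrors the artifact path used earlier.

import wandb

api = wandb.Api()
runs = api.runs("hans-ramsl/gradient-dissent-transcription")

# Collect the relative speed logged for every transcription run
factors = [run.summary.get("transcription_factor") for run in runs]
factors = [f for f in factors if f is not None]
print(f"Highest relative speed: {max(factors):.3f}, lowest: {min(factors):.3f}")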
But why is there such a huge difference in speed performance? Let's find out.

Relative Speed Analysis






One episode, many transcriptions

To find out why there is so much variance, we'll look at one specific episode: Pete Warden — Practical Applications of TinyML.
We transcribed this episode with different Whisper model sizes, both with the original Python version on GPU and with the C++ port (Whisper.cpp):
  • 🚤 tiny
  • 🛥️ base
  • 🚢 large
Speed here is, in fact, dependent on a few different factors in tandem, such as:
  • The code we run
  • Whether we're using GPUs or CPUs
  • The size of the model
We'll dig into those factors in more detail below.
In the next figure, the Python version on GPU is shown in green 🟢 and the C++ version on CPU (Mac M1, 8 cores) in blue 🔵. The GPU we used here is an NVIDIA A100 with a max power consumption of 400 W, while the CPUs of the M1 Pro have a max power consumption of 68.5 W. The CPU's energy consumption is about 78% lower.
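As a rough sketch of where such an energy comparison comes from (the formula, not the exact numbers above): energy is the product of power draw and wall-clock transcription time, so the slower CPU run can still use far less energy than the faster GPU run.

def energy_kwh(power_watts: float, transcription_time_s: float) -> float:
    # Energy for one transcription: average power draw times wall-clock time
    return power_watts * transcription_time_s / 3600 / 1000

# For example, compare energy_kwh(400, gpu_seconds) for the A100 with
# energy_kwh(68.5, cpu_seconds) for the M1 Pro, where gpu_seconds and
# cpu_seconds are the `transcription_time` values logged above.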





🚢 Large Models


For large models, we can see that the variance in transcription speed is low: the range is from 1.199 to 2.042, a 1.7x speed Δ.





🛥️ Base Models

For base models, the variance in transcription speed is somewhat higher: the range is from 9.953 to 21.445, a 2.15x speed Δ.





🚤 Tiny Models

For tiny models, the variance in transcription speed is very high: the range is from 5.819 to 50, an 8.6x speed Δ.





Model Performance

This parallel coordinates plot shows how the different model sizes perform in terms of transcription time and relative transcription speed.




Model Performance (by model size)

This grouped parallel coordinates plot summarizes model performance and shows that tiny models have a transcription factor of ~38x on average, base models ~18x, and large models ~1.6x.
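The same grouped view can be approximated outside the UI. Here is a minimal sketch that averages the logged transcription_factor per model size via the W&B API, using the same entity/project path as above.

from collections import defaultdict
import wandb

api = wandb.Api()
by_size = defaultdict(list)
for run in api.runs("hans-ramsl/gradient-dissent-transcription"):
    factor = run.summary.get("transcription_factor")
    size = run.summary.get("modelsize")
    if factor is not None and size is not None:
        by_size[size].append(factor)

for size, factors in by_size.items():
    print(f"{size}: ~{sum(factors) / len(factors):.1f}x on average")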





Access all transcript files

All episodes that have been transcribed can be seen below.
Click on any of the transcript text files and wait a moment; you'll see the full transcript of that podcast episode.
💡
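If you'd rather pull the transcripts programmatically than browse them in the UI, here is a minimal sketch using the W&B API; the latest alias is an assumption, so use whichever artifact version you need.

import wandb

api = wandb.Api()
artifact = api.artifact("hans-ramsl/gradient-dissent-transcription/gradient-dissent-transcript:latest")
transcript_dir = artifact.download()  # downloads all transcript .txt files locally
print(transcript_dir)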

gradient-dissent-transcript
  • transcript-base-aaron-colak-ml-and-nlp-in-experience-management-3vej4ilaqao.mp3.txt (63.6 KB)
  • transcript-base-adrien-gaidon-advancing-ml-research-in-autonomous-vehicles-mujpblzb4jo.mp3.txt (92.7 KB)
  • transcript-base-adrien-treuille-building-blazingly-fast-tools-that-people-love-xapf15jyzyu.mp3.txt (66.6 KB)
  • transcript-base-alyssa-simpson-rochwerger-responsible-ml-in-the-real-world-t4fuk9ow9j4.mp3.txt (78.2 KB)
  • transcript-base-amelia-filip-how-pandora-deploys-ml-models-into-production-cssfnh-2qrm.mp3.txt (55.6 KB)
  • transcript-base-anantha-kancherla-building-level-5-autonomous-vehicles-ht5uchnazu8.mp3.txt (55.4 KB)
  • transcript-base-angela-danielle-designing-ml-models-for-millions-of-consumer-robots-w55uo4gilq4.mp3.txt (63.6 KB)
  • transcript-base-anthony-goldbloom-how-to-win-kaggle-competitions-0zjq2vsgwf0.mp3.txt (46.7 KB)
  • transcript-base-bharath-ramsundar-deep-learning-for-molecules-and-medicine-discovery-gnkpvjp117k.mp3.txt (88.2 KB)
  • transcript-base-boris-dayma-the-story-behind-dall-e-mini-the-viral-phenomenon-vxc8fkqqxgm.mp3.txt (58.3 KB)
  • transcript-base-brandon-rohrer-machine-learning-in-production-for-robots-ot35pspxw4.mp3.txt (51.3 KB)
  • transcript-base-cade-metz-the-stories-behind-the-rise-of-ai-ta2hj9b9r-e.mp3.txt (53.1 KB)
  • transcript-base-chip-huyen-ml-research-and-production-pipelines-6adnhwe5phy.mp3.txt (63.2 KB)
  • transcript-base-chris-albon-ml-models-and-infrastructure-at-wikimedia-l1flcsh-n9k.mp3.txt (64.1 KB)
  • transcript-base-chris-anderson-robocars-drones-and-wired-magazine-nagcndqoq7k.mp3.txt (85.7 KB)
  • transcript-base-chris-mattmann-ml-applications-on-earth-mars-and-beyond-rqmywmnlufo.mp3.txt (65.7 KB)
  • transcript-base-chris-padwick-smart-machines-for-more-sustainable-farming-knrwpq1ujha.mp3.txt (64.2 KB)
  • transcript-base-chris-shawn-and-lukas-the-weights-biases-journey-dzu3wjmjdam.mp3.txt (79.6 KB)
  • transcript-base-cl-ment-delangue-the-power-of-the-open-source-community-sjx9fsnr-9q.mp3.txt (52.2 KB)
  • transcript-base-daeil-kim-the-unreasonable-effectiveness-of-synthetic-data-qj6dgjxfxmg.mp3.txt (102.6 KB)
  • transcript-base-daphne-koller-digital-biology-and-the-next-epoch-of-science-prgz-6jb16m.mp3.txt (57.9 KB)
  • transcript-base-dave-selinger-ai-and-the-next-generation-of-security-systems-dsl9ttdare8.mp3.txt (77.2 KB)
  • transcript-base-dominik-moritz-building-intuitive-data-visualization-tools-bcttibpleg8.mp3.txt (46.8 KB)
  • transcript-base-drago-anguelov-robustness-safety-and-scalability-at-waymo-5qpwafctmuw.mp3.txt (76.9 KB)
  • transcript-base-emily-m.-bender-language-models-and-linguistics-vaxnn3yrhba.mp3.txt (109.3 KB)

Cost Analysis and Conclusion

It's worth noting that transcription services, whether human or machine, are never quite perfect. If you need word-for-word accuracy, we recommend spot-checking and improving where necessary.
But Whisper's outputs, especially from the larger models, were really impressive. And what's better? They're as close to free as they can be:
Model | Power Consumption | Cost in Germany in December 2022
Human transcription | - | 54
Amazon Transcribe | - | 1.44
Whisper (GPU) | 0.4 kWh | 0.2136
Whisper (M1 CPU) | 0.069 kWh | 0.037
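For the two Whisper rows, the cost is simply the energy used times the electricity price. Here is a minimal sketch that reproduces them, assuming a price of roughly 0.534 per kWh, which is inferred from the table itself (0.2136 / 0.4) rather than taken from an official source.

PRICE_PER_KWH = 0.534  # electricity price in Germany, December 2022 (inferred from the table)

for name, kwh in [("Whisper (GPU)", 0.4), ("Whisper (M1 CPU)", 0.069)]:
    print(f"{name}: {kwh * PRICE_PER_KWH:.4f}")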
