How to Track and Compare Audio Transcriptions with Whisper and Weights and Biases
Outcome of this project
OpenAI released a speech recognition neural net called Whisper. The model was trained on 680k hours of multilingual data and supports 99 languages with varying quality. Several projects build on Whisper, such as whisper.cpp, a high-performance C++ implementation of the model inference without dependencies that can run on multiple cores on Apple Silicon (e.g., M1).
In this project, we'll transcribe 75 Gradient Dissent podcast episodes and evaluate two things:
- compare the relative transcription speed of different model sizes on GPUs for the Python implementation
- compare the relative transcription speed of different model sizes on CPUs for the C++ implementation
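Here, "relative transcription speed" means the transcription factor: the duration of the audio divided by the wall-clock time the model needs to transcribe it. A minimal sketch of the metric (the numbers are placeholders, not measurements):
def transcription_factor(audio_duration_s, transcription_time_s):
    # A factor of 10 means 10 seconds of audio are transcribed per second of compute
    return audio_duration_s / transcription_time_s

# Placeholder example, not a measured result:
print(transcription_factor(audio_duration_s=3600.0, transcription_time_s=300.0))  # 12.0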
Let's get started and see Whisper in action. We run this command:
whisper chip-huyen-ml-research.mp3 --model tiny --language=en
1️⃣ Data
First, we collect the data. We use yt-dlp to download the Gradient Dissent episodes from YouTube.
Installation
python3 -m pip install -U yt-dlp
Data Download
The following script downloads all episodes in the mp3
format.
yt-dlp -i -x --yes-playlist --audio-format mp3 https://www.youtube.com/playlist?list=PLD80i8An1OEEb1jP0sjEyiLG8ULRXFob_
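If you prefer to drive the download from Python instead of the shell, yt-dlp also exposes a Python API. A minimal sketch with the same playlist and options as the command above (the output template is an assumption, adjust as needed):
import yt_dlp

PLAYLIST_URL = "https://www.youtube.com/playlist?list=PLD80i8An1OEEb1jP0sjEyiLG8ULRXFob_"

ydl_opts = {
    'ignoreerrors': True,            # -i: skip episodes that fail to download
    'format': 'bestaudio/best',      # take the best available audio stream
    'postprocessors': [{             # -x --audio-format mp3: extract audio as mp3
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
    }],
    'outtmpl': '%(title)s.%(ext)s',  # name files after the episode title
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([PLAYLIST_URL])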
2️⃣ Code
First, we install the packages we need to transcribe the fresh data:
- wandb: We track experiments and version our models with Weights & Biases
- whisper: We transcribe audio files with Whisper
- setuptools-rust: Used by Whisper for tokenization
- ffmpeg-python: Used by Whisper for down-mixing and resampling
Installation and Initialization
!pip install wandb -qqq
!pip install git+https://github.com/openai/whisper.git -qqq
!pip install setuptools-rust
!pip install ffmpeg-python -qqq
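As a quick sanity check that the installation worked, we can list the model sizes Whisper ships with; this also previews the sizes compared later. A minimal sketch:
import whisper
import wandb
import ffmpeg  # provided by the ffmpeg-python package

# Should list the 'tiny', 'base', 'small', 'medium' and 'large' variants
print(whisper.available_models())
print(wandb.__version__)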
import wandb
wandb.login(host="https://api.wandb.ai")
from google.colab import drive
drive.mount('/content/gdrive')
3️⃣ Transcribe a single file
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
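Besides the plain text, the result dictionary also contains the detected language and timestamped segments, which are handy for spot-checking a transcript. A short sketch building on the snippet above:
# The transcribe() result also exposes the detected language and per-segment timestamps
print(result["language"])
for segment in result["segments"][:3]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")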
4️⃣ Track a minimalistic experiment
import wandb
with wandb.init(project="gradient-dissent-transcription"):
    train_acc = 0.7
    train_loss = 0.3
    wandb.log({'accuracy': train_acc, 'loss': train_loss})
5️⃣ Version files with W&B artifacts
with wandb.init(project="gradient-dissent-transcription") as run:
    transcript = wandb.Artifact('transcript', type="transcription")
    transcript.add_file('transcript.txt')
    run.log_artifact(transcript)
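Once the transcript is versioned, any later run can pull a specific version back down. A minimal sketch using the artifact name from above (':latest' can be pinned to a fixed version such as ':v3'):
with wandb.init(project="gradient-dissent-transcription") as run:
    artifact = run.use_artifact('transcript:latest', type="transcription")
    artifact_dir = artifact.download()
    print(artifact_dir)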
6️⃣ Apply steps 3️⃣ + 4️⃣ + 5️⃣ to audio collection
import whisper
import wandb
from pathlib import Path
import time
import ffmpeg
# Specify project settings
PROJECT_NAME="gradient-dissent-transcription"
JOB_DATA_UPLOAD="data-upload"
JOB_TRANSCRIPTION="transcription"
ARTIFACT_VERSION="latest"
AUDIOFORMAT = 'mp3'
MODELSIZE = 'tiny' # tiny, base, small, medium, large
# Load the Whisper model to perform transcription
model = whisper.load_model(MODELSIZE)
# Initialize W&B project
run = wandb.init(project=PROJECT_NAME, job_type=JOB_TRANSCRIPTION)
# Download raw Gradient Dissent audio data from W&B Artifacts
artifact = run.use_artifact('hans-ramsl/gradient-dissent-transcription/gradient-dissent-audio:'+ARTIFACT_VERSION, type='raw_audio')
artifact_dir = artifact.download(root="/content/")
wandb.finish()
# Path to all audio files in `artifact_dir`
audiofiles = Path(artifact_dir).glob('*.'+AUDIOFORMAT)
# Iterate over all audio files
for audio_file in audiofiles:
    # Specify the path to where the audio file is stored
    path_to_audio_file = str(audio_file)
    # Measure start time
    start = time.time()
    # Transcribe. Here is where the heavy-lifting takes place. 🫀🏋🏻
    result = model.transcribe(path_to_audio_file)
    # Measure duration of transcription time
    transcription_time = time.time() - start
    # Define a path to transcript file
    path_to_transcript_file = path_to_audio_file + '.txt'
    # Specify parameters to be logged
    duration = ffmpeg.probe(path_to_audio_file)['format']['duration']
    transcription_factor = float(duration) / float(transcription_time)
    # Save transcription to local text file
    with open(path_to_transcript_file, 'w') as f:
        f.write(result['text'])
    transcript_text = result['text']
    # Log the local transcript as W&B artifact
    with wandb.init(project=PROJECT_NAME, job_type=JOB_TRANSCRIPTION) as run:
        # Build a transcription table
        transcript_table = wandb.Table(columns=['audio_file', 'transcript', 'audio_path', 'audio_format', 'modelsize', 'transcription_time', 'audio_duration', 'transcription_factor'])
        # Add audio file, transcription and parameters to table
        transcript_table.add_data(wandb.Audio(path_to_audio_file), transcript_text, path_to_audio_file, AUDIOFORMAT, MODELSIZE, transcription_time, duration, transcription_factor)
        transcript_artifact = wandb.Artifact('transcript', type="transcription")
        transcript_artifact.add(transcript_table, 'transcription_table')
        # Log artifact including table to W&B
        run.log_artifact(transcript_artifact)
        # Log parameters to Runs table (Experiment Tracking)
        run.log({
            'transcript': transcript_text,
            'audio_path': path_to_audio_file,
            'audio_format': AUDIOFORMAT,
            'modelsize': MODELSIZE,
            'transcription_time': transcription_time,
            'audio_duration': duration,
            'transcription_factor': transcription_factor
        })
After successfully running steps 1️⃣ - 6️⃣, we can now visualize the results and learn from them.
7️⃣ Compare Model Sizes
W&B makes it really easy to see that the highest relative speed is 55.158 and the lowest is 1.199. Why is there such a huge difference in speed? Let's find out.
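Those numbers are the transcription_factor values logged for each run. If you want to pull them programmatically instead of reading them off the dashboard, the W&B public API works; a sketch, where the entity name is a placeholder:
import wandb

api = wandb.Api()
# Replace <entity> with your W&B username or team
runs = api.runs("<entity>/gradient-dissent-transcription")

factors = []
for run in runs:
    factor = run.summary.get("transcription_factor")
    if factor is not None:
        factors.append(factor)

print(f"highest relative speed: {max(factors):.3f}")
print(f"lowest relative speed:  {min(factors):.3f}")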
Relative Speed Analysis
One episode, many transcriptions
To find out why there is such a variance, we'll look at one specific episode, Pete Warden — Practical Applications of TinyML. We transcribed this episode with different Whisper model sizes (a sketch of the Python side follows the list below), both with the original Python implementation on GPU and with the C++ port (Whisper.cpp):
- 🚤 tiny
- 🛥️ base
- 🚢 large
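For the Python/GPU side, the per-size comparison boils down to timing the same episode with each model size. A minimal sketch (the audio filename is a placeholder):
import time
import whisper

# Placeholder path to the episode's audio file
EPISODE = "pete-warden-practical-applications-of-tinyml.mp3"

for size in ['tiny', 'base', 'large']:
    model = whisper.load_model(size)
    start = time.time()
    model.transcribe(EPISODE)
    print(f"{size}: {time.time() - start:.1f}s")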
In the next figure, the Python version on GPU is shown in green 🟢 and the C++ version on CPUs (Mac M1, 8 cores) in blue 🔵. The GPU used here is an NVIDIA A100 with a max power consumption of 400 W, and the CPUs of the M1 Pro have a max power consumption of 68.5 W.
The CPU energy consumption is 78% lower.
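That figure follows from energy = power × time: the CPU draws far less power but also runs longer, so both factors matter. A sketch of the arithmetic with the power limits above (the transcription times are inputs you plug in from the runs, not constants):
GPU_POWER_W = 400.0   # NVIDIA A100 max power
CPU_POWER_W = 68.5    # Apple M1 Pro max power

def energy_wh(power_w, seconds):
    # Energy in watt-hours = power (W) * time (s) / 3600
    return power_w * seconds / 3600.0

def cpu_energy_saving(gpu_seconds, cpu_seconds):
    # Fraction by which CPU transcription uses less energy than GPU transcription
    return 1.0 - energy_wh(CPU_POWER_W, cpu_seconds) / energy_wh(GPU_POWER_W, gpu_seconds)

# The CPU stays more energy-efficient as long as it is less than ~5.8x slower than the GPU
print(GPU_POWER_W / CPU_POWER_W)  # ~5.84
Plugging the measured transcription times from the runs above into cpu_energy_saving is what gives the 78% reported here.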