
OpenAI Whisper: How to Transcribe Your Audio to Text, for Free (with SRTs/VTTs)

In this beginner-friendly article, we’ll provide a gentle introduction to Whisper and demonstrate how to use it to transcribe and caption audio — for free!
Created on February 3 | Last edited on June 30
In this article, we’ll show you how to automatically transcribe audio files for free, using OpenAI’s Whisper. You’ll learn how to save these transcriptions as a plain text file, as captions with time code data (aka as an SRT or VTT file), and even as a TSV or JSON file.
If you’d like to skip ahead to the code and instructions, jump to the How to Run Whisper section below!
We’ll start by answering some common background questions about Whisper, transcriptions, and captions. Then, we’ll jump into the how-to section of this article and show you how to use Whisper to transcribe your own files and save them as caption files.
We also provide a companion Colab that you can use to immediately get started with audio transcription.
Finally, if you’re interested in exploring Whisper more and comparing how well the different Whisper model sizes transcribe audio, we’ll finish this article with an introduction to a future part 2, “How to Track and Compare Audio Transcriptions with Whisper and Weights & Biases.”
Background to Whisper

This section contains helpful, but optional, background information. If you’d like to jump straight into running Whisper, skip ahead to the How to Run Whisper section.

What is OpenAI’s Whisper?

Whisper is an open source automatic speech recognition (ASR) library released by OpenAI in September 2022. Whisper takes an audio or audiovisual file as input and returns a transcription of the audio as output. This transcription can be saved as a plain text file, or as a subtitle file with time code data.
OpenAI is the AI research company behind the incredibly powerful chatbot ChatGPT and the popular text-to-image model DALL-E 2.
Exactly how Whisper creates these transcriptions is a little bit beyond the scope of this article, but in a nutshell: Whisper is a deep learning model trained on 680,000 hours of multilingual audio data and their transcriptions. Through the training process, Whisper learns to process audio input and predict the most appropriate corresponding text caption.
Whisper comes in five different sizes, with different trade-offs between transcription quality, memory requirements, and relative speed.
Whisper's available models and languages (source)
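If you want to see exactly which model names your installed version of Whisper supports, the library exposes a small helper for this. Here's a quick sketch; the exact list depends on your Whisper version:

import whisper

# Lists the bundled model names, e.g. "tiny", "base", "small", "medium", "large",
# plus the English-only ".en" variants like "medium.en"
print(whisper.available_models())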

A Quick Disclaimer

AI-powered automatic speech recognition (ASR) technology is still improving, and Whisper transcriptions are not perfect.
The transcription might lack some punctuation, incorrectly transcribe some words, or miss some words entirely. Whisper also does not distinguish between speakers, and does not indicate when or if a speaker changes. If you’re looking to publish Whisper transcriptions — as subtitles to a YouTube video, as part of a blog post, et cetera — you may wish to proofread them and make some manual corrections beforehand.
That being said, Whisper transcriptions are remarkably good, and Whisper represents a huge advance in audio-to-text technology. Plus, Whisper is open source, giving the general public completely free (!!!) access to state-of-the-art software.
A note from the author: (February 3, 2023) Whisper is an open source library in active development. If any of the code in this article or the companion Colab no longer works, please leave a comment and let me know!

A Preview of Whisper

Here’s an example of running Whisper on the first ~30 seconds of "Cristóbal Valenzuela — The Next Generation of Content Creation and AI", an interview between Cristóbal Valenzuela (CEO and co-founder of Runway) and Lukas Biewald (CEO and co-founder of Weights & Biases).


I think a big mistake of research, specifically in the way of computational creativity, is the idea that you can automate it entirely. So you see one-click ops solutions to do X, Y, or Z. I think that's the bigger picture of how most creative work should actually work. Or that probably means that you've never actually worked with an agency where the client was asking you to change things every single hour, make it bigger, make it smaller, right? You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald.

How Long Does Whisper Take?

It depends on the length of your file and what type of hardware you have access to!
When we ran the medium.en Whisper model on the 40-minute "Cristóbal Valenzuela — The Next Generation of Content Creation and AI" interview, it took:
  • About 6 minutes on GPU
  • About 1.5 hours on CPU
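If you're running Whisper yourself and want to be sure it uses a GPU when one is available, you can pass a device explicitly when loading the model. Here's a minimal sketch using PyTorch's CUDA check (Whisper will also pick the GPU automatically if one is visible):

import torch
import whisper

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running Whisper on: {device}")

model = whisper.load_model("medium.en", device=device)
result = model.transcribe("audio.mp3")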

What Types of File Formats Can Whisper Process?

Whisper uses ffmpeg to load audio — theoretically any audio or audiovisual format that ffmpeg supports should be okay, although we haven’t tested different formats extensively. MP3, FLAC, and WAV files definitely work — as do audiovisual MP4 files.
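Since ffmpeg handles the decoding, transcribing a video file looks exactly the same as transcribing an audio file. For example (the file names here are placeholders):

import whisper

model = whisper.load_model("base")

# ffmpeg extracts and resamples the audio track behind the scenes,
# so the call is identical for an MP3 and an MP4
audio_result = model.transcribe("podcast_episode.mp3")
video_result = model.transcribe("interview_recording.mp4")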

What Are Transcriptions and Captions?

Audio transcription is the process of converting the speech in an audio or audiovisual file into a written format. The resulting text document is a transcription. Captions are like transcriptions, but with time code data that synchronizes the text to the corresponding speech in the original file.
Transcriptions and captions are valuable for many reasons, including:
  • Audio content becomes accessible to people who are deaf or hard of hearing
  • Audio content that is difficult to understand (poor recording quality, background noise, speaker pronunciation, etc.) becomes more accessible
  • Written content is generally easier and faster to skim or understand than audio content
  • Written content is easier to search for than audio content, providing search engine optimization benefits
Manually creating transcripts can be quite tedious, and captions even more so. Transcribing requires listening to the audio and typing out the corresponding speech. Captioning takes it a step further, and requires indicating which timestamps (down to the millisecond) in the original file correspond to which lines of transcribed text.
Each caption typically covers a slice of audio only 2-4 seconds long, meaning that even a few minutes of audio/video can result in very, very long caption files.
Here’s an example of a plain text transcript vs the VTT captions for the same ~30 seconds transcribed above.
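For instance, here is the opening line of the interview as plain text and as VTT captions (the timestamps and segment boundaries below are illustrative placeholders, not Whisper's actual output):

Plain text:

I think a big mistake of research, specifically in the way of computational creativity, is the idea that you can automate it entirely.

VTT:

WEBVTT

00:00:00.000 --> 00:00:04.500
I think a big mistake of research, specifically in the way of

00:00:04.500 --> 00:00:08.000
computational creativity, is the idea that you can automate it entirely.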


There are many paid transcription/captioning services that will do the hard work for you, with prices ranging from $0.25 (AI-generated, transcription only) to $1.50 (human-created) per minute of audio.
For the Gradient Dissent episode above, that means you would save between $10 and $60 by using Whisper, disregarding hardware cost.
In this article, we’ll show you how to use Whisper, the popular machine learning ASR model, to automatically generate high quality transcripts and captions on your own. They won’t be perfect but they’ll be pretty close — and more importantly, they’ll be free.

What Is An SRT/VTT file?

There are many different caption file formats; SRT and VTT are arguably two of the most common ones. In a nutshell, VTT files are more complex than SRT files, because they offer more formatting options and store metadata.
In the context of this article, SRT and VTT files are functionally equivalent. Whisper can easily export captions in both formats, and although VTTs are capable of storing more information (formatting options, metadata) than SRTs, Whisper does not include any additional information when exporting captions as a VTT file.
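As a quick (hypothetical) illustration, here is the same caption written in both formats. The SRT entry is numbered and uses a comma before the milliseconds, while the VTT file starts with a WEBVTT header and uses a period:

SRT:

1
00:00:24,000 --> 00:00:28,000
You're listening to Gradient Dissent, a show about machine learning in the real world.

VTT:

WEBVTT

00:00:24.000 --> 00:00:28.000
You're listening to Gradient Dissent, a show about machine learning in the real world.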

How to Run Whisper

Whisper is available as a command line tool and as an importable Python library. In this article, we’ll focus on the Whisper Python library.

Transcribing Audio With Whisper

Transcribing audio with Whisper is pretty straightforward — there are really only two main steps:
  1. Load the desired Whisper model with whisper.load_model()
  2. Transcribe the desired audio file with the transcribe() method
Here’s a minimal example, from the Whisper repo on GitHub.
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
Whisper’s transcribe() method returns a dictionary with three key-value pairs:
  • “text”: The transcription (type: str)
  • “segments”: Segment-level details, including a segmented transcription and time code data (type: list of dictionaries)
  • “language”: The spoken language (type: str)
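For example, after running the snippet above you can check the detected language and peek at the first few segments with their start and end times (given in seconds):

# Print the detected language, e.g. "en"
print(result["language"])

# Each segment is a dictionary with (among other keys) "start", "end", and "text"
for segment in result["segments"][:3]:
    print(f'{segment["start"]:7.2f} --> {segment["end"]:7.2f} {segment["text"]}')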

Saving a Whisper Transcription As a Plain Text File

To save the transcription as a text file, open a new text file and write the value of the "text" key into that file.
import whisper

model = whisper.load_model("base")
audio = "audio.mp3"
result = model.transcribe(audio)

with open("transcription.txt", "w", encoding="utf-8") as txt:
    txt.write(result["text"])
Or, you can use the built-in get_writer() function, which ultimately calls WriteTXT.write_result().
This method creates a plain text file with hard line breaks, where each line corresponds to a “segment” of the transcription. Here, “segment” refers to how Whisper divides the transcription into smaller, caption-sized chunks.
In contrast, using the simple write() method creates a text file with no line breaks whatsoever.
import whisper
from whisper.utils import get_writer

model = whisper.load_model("base")
audio = "audio.mp3"
result = model.transcribe(audio)
output_directory = "./"

# Save as a TXT file without any line breaks
with open("transcription.txt", "w", encoding="utf-8") as txt:
    txt.write(result["text"])

# Save as a TXT file with hard line breaks
txt_writer = get_writer("txt", output_directory)
txt_writer(result, audio)
Here’s the plain text transcript for the ~30 seconds transcribed above, as a text file without any line breaks and with hard line breaks (empty new lines added for clarity).


In this article and the companion Colab, we’ll use the write() method to save a transcription without any line breaks, but ultimately the choice between write() and get_writer() just comes down to your individual use case and preference.

Saving a Whisper Transcription As An SRT/VTT File

To save the transcription as an SRT/VTT file, use the get_writer() function to call the WriteSRT.write_result() and WriteVTT.write_result() methods, respectively.
import whisper
from whisper.utils import get_writer

model = whisper.load_model("base")
audio = "audio.mp3"
result = model.transcribe(audio)
output_directory = "./"

# Save as an SRT file
srt_writer = get_writer("srt", output_directory)
srt_writer(result, audio)

# Save as a VTT file
vtt_writer = get_writer("vtt", output_directory)
vtt_writer(result, audio)

Bonus: Saving a Whisper Transcription As a TSV or JSON File

In late January 2023, OpenAI added the option to save transcripts as a TSV (tab-separated values) file! Like the other formats, simply use get_writer() to call WriteTSV.write_result().
The resulting TSV file is separated into multiple segments, like the SRT and VTT files, and has three columns:
  • start: The start time (in integer milliseconds) of the segment
  • end: The end time (in integer milliseconds) of the segment
  • text: The transcript of the segment
Whisper also supports saving the transcript as a JSON file:
import whisper
from whisper.utils import get_writer

model = whisper.load_model("base")
audio = "audio.mp3"
result = model.transcribe(audio)
output_directory = "./"

# Save as a TSV file
tsv_writer = get_writer("tsv", output_directory)
tsv_writer(result, audio)

# Save as a JSON file
json_writer = get_writer("json", output_directory)
json_writer(result, audio)

Example Transcriptions

Here are the results of running the medium.en Whisper model on "Cristóbal Valenzuela — The Next Generation of Content Creation and AI", saved in different formats:

How to Run Whisper On Google Colab

So, how do you actually run Whisper on a real file? Since we’re using the Whisper Python library, we’ll need to set up either a local or a cloud-based Python environment like Google Colab. If you’re new to programming or machine learning, we strongly recommend using Whisper via Colab.
Installing Python and setting up the appropriate environments can be challenging, and Colab does pretty much all of the hard work for you 🙏
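If you're setting up your own environment rather than using the companion Colab, you'll need to install the Whisper package and ffmpeg yourself. Here's a minimal sketch; the exact commands depend on your platform:

# Install Whisper from PyPI (in a notebook, the leading "!" runs a shell command)
!pip install -U openai-whisper

# Whisper needs ffmpeg on your PATH; on Colab it is preinstalled.
# On Ubuntu/Debian:  sudo apt install ffmpeg
# On macOS (Homebrew):  brew install ffmpeg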

What Is Google Colab?

Google Colab (short for “Colaboratory”) is a Jupyter notebook and compute environment hosted by Google for free. You can write and execute Python code within a browser-based Colab notebook, without having to install your own Python development environment.
Like Google Docs, Google Colab is cloud-based: Colab notebooks (also just called ”Colabs”) are stored in your Google Drive account, can be shared with a single link, can be edited by multiple people, and are accessible from anywhere. Colabs also allow you to combine executable code blocks with rich text and images, and create dynamic, interactive documents.
Colabs let you get started with writing and executing code immediately, and also provide access to more computing power than you might otherwise be able to use — access to a GPU can make a huge difference in how long it takes Whisper to transcribe a file!
For more information about Google Colab, read Google’s "Welcome To Colaboratory” and “Overview of Colaboratory Features".

The “Transcribe Audio With Whisper” Colab

We’ve provided a Colab that contains all of the code you need to transcribe with Whisper! You can transcribe three types of files, with three different output formats:
  • Possible inputs
    • A YouTube video (the audio stream is downloaded via the provided URL, and then transcribed; see the sketch after this list for one way to do this outside the Colab)
    • A file in your Google Drive account
    • A local file that you have uploaded to this Colab
  • Possible outputs
    • A plain text file
    • An SRT file
    • A VTT file
    • A TSV file
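The companion Colab takes care of downloading the YouTube audio stream for you. If you want to reproduce that step outside of the Colab, one common approach is the yt-dlp package. Here's a rough sketch (yt-dlp is not used in the article itself, and YouTube downloaders do occasionally break):

import whisper
import yt_dlp

url = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"  # placeholder URL

# Download the best available audio stream to a local file
ydl_opts = {"format": "bestaudio/best", "outtmpl": "audio.%(ext)s"}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(url, download=True)
    audio_path = ydl.prepare_filename(info)

# Transcribe the downloaded audio as usual
model = whisper.load_model("base")
result = model.transcribe(audio_path)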




Instructions for the Colab

  1. Optional: If you want to transcribe a local file, you will need to upload it to the Colab first. This step is not necessary if you are transcribing a YouTube video or a file in your Google Drive account.
    • Click the folder icon on the left menu to open the Files tab. Then, click the upload icon to upload your desired file. You can also drag and drop the file to upload it. This file will be deleted when the Colab runtime disconnects.
    
  2. Change the values of the variables in the “🌴 Change the values in this section” block:
    
  3. Click Runtime > Run all. That's it!
    
  4. Optional: Download your transcriptions.
    • If you set download = True, then this Colab will automatically download the specified transcription/caption files. However, if you forgot to do this, you can also download the transcribed file(s) afterwards! Just make sure to download the files before you disconnect the Colab, since they will be deleted along with the runtime.
    • As in Step 1, click the folder icon to open the Files tab. Then, click the kebab menu icon to the right of the desired file, and select Download.
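If you prefer to trigger the download from code rather than the Files tab, Colab ships a small helper for this (a sketch; the file name here is a placeholder):

from google.colab import files

# Download a finished transcription from the Colab runtime to your computer
files.download("transcription.txt")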

Summary

Congratulations on completing the tutorial! 🥳
In this article, we learned how to use Whisper to transcribe audio, and save that transcription as a text file or as an SRT/VTT file.

Tracking and Comparing Whisper Transcriptions

You might now be wondering how different Whisper models compare to each other — this article and Colab used the medium-sized, English-only model, but there are both bigger and smaller models.
In the next article, we’ll dive into how to compare and track the results of different Whisper models using Weights & Biases, a collection of tools for machine learning projects and workflows.
Thanks for reading and stay tuned!
A note from the author: Hi, I'm Angelica, a technical writer at Weights & Biases — we make tools for machine learning. If you enjoyed this article, consider following us on Twitter or YouTube :)
If you enjoyed learning about Whisper, you might also enjoy our article on fine-tuning Whisper.
