Fine-Tuning Whisper for Low-Resource Dravidian Languages
This article gives an overview of the project I completed as part of the Whisper fine-tuning sprint hosted by Hugging Face and Lambda Labs in December 2022.
Created on December 16 | Last edited on January 27
Introduction
We've recently seen deep learning widely adopted for automatic speech recognition (ASR) due to the availability of large datasets that contain speech and corresponding transcripts.
A recent trend in this domain has been self-supervised learning. For instance, the Wav2Vec family of models uses a task similar to masked language modeling to pre-train a network on unlabelled speech before fine-tuning on the ASR task. This allows the network to learn contextual speech embeddings that can then be fine-tuned using a parallel corpus of speech and transcripts.
Whisper is a different type of ASR model released by OpenAI. Unlike other recent models, Whisper is completely trained using supervised learning on weakly labeled data.
Specifically, the researchers gathered 680k hours of multilingual speech and transcription data in 96 languages from the web and trained the model to directly predict the transcript from audio. It's an encoder-decoder transformer trained using multi-task learning with tasks that include transcription, translation, and timestamp prediction. Here's an overview of the model architecture:

Here's what we'll be covering in this article:
Table of Contents
- Introduction
- Background
- Problem Statement
- Description of Software/Tools
- Description of Data
- Modeling
- Released Models
- Training Curves
- Evaluation Results
- Demonstrations
- Summary
Let's get started.
Background
The awesome folks at Hugging Face ported the Whisper model into the transformers library within a few weeks of the model's release. However, the model is not without limitations. The authors of the Whisper paper detail the following limitations (a.k.a. hackathon ideas):
- Inaccurate timestamp predictions
- Hallucinations
- Low performance on low-resource languages
- No speaker recognition
- No real-time transcription
This leads us to the Whisper fine-tuning event run by Hugging Face and Lambda Labs. The event was organized as a community sprint from December 5th to the 19th. Hugging Face provided the models, starter code, and community support via Discord, while Lambda Labs provided the necessary compute (roughly 100 hours of compute on A100 GPUs).
The main components of the sprint included:
- Open AI’s state-of-the-art Whisper model
- Public datasets like Common Voice 11, VoxPopuli, CoVoST2 and more
- Real-world audio for evaluation
With the key outcomes being:
- Fine-tuned Whisper checkpoint (e.g., Whisper-large)
- Evaluation script for fine-tuned checkpoints
- Hugging Face space to demo fine-tuned models
Problem Statement
The goal of this project is to fine-tune a sequence-to-sequence transformer (OpenAI's Whisper) for the speech-to-text task in four Dravidian languages: Tamil, Malayalam, Telugu, and Kannada. Unlike languages such as English and Spanish, these languages have very few datasets and benchmarks available; we call them low-resource languages.
To this end, the project aimed to perform rigorous data collection in these languages and train deep learning models to transcribe their speech. In doing so, it aims to extend the reach of AI technologies to speakers of these languages by removing the barrier posed by the lack of available models.
Description of Software/Tools
This project was mostly created to run on Ubuntu Linux machines with NVIDIA Graphics Cards. This is primarily due to the following reasons:
- We'll process a large amount of audio data and rely on the robust audio processing libraries that are well supported on Linux.
- Availability of a CUDA GPU (minimum 12 GB of GPU RAM) to train and evaluate the models.
- A large part of the model training happens on cloud GPU machines with enough GPU RAM to load and process speech data.
- The project uses an Anaconda virtual environment with Python 3.8. I'll provide a shell script to create the environment in the install, config, and setup section below.
- It's recommended to use a machine with a large number of CPU cores when recreating this project, since the code makes effective use of multi-processing and multi-threading libraries in Python.
- Some cloud GPU providers I used for this project include:
Description of Data
Initially, I collected data for all the South Indian languages from various heterogeneous sources. These include publicly available datasets released by open-source organizations such as the Mozilla Foundation, Google, and the Government of India.
Some datasets were curated and formatted while others had to be downloaded and preprocessed into the proper format. The following table gives you an overview of the various data sources used to create the datasets (note, you can scroll right for additional information about these datasets):
Downloading Source Datasets
While many data sources are available on the Hugging Face Datasets Hub and can be accessed and downloaded using the datasets library, others (such as the UCLA corpus) needed to be downloaded manually. Here's how we handled that dataset:
Create source file lists:
- Save the URLs as individual text files for each language. These files can be seen in the code repository here.
Next, download and save the zip files from the source file lists. This is done with the following function:
```python
import os

from keras.utils import get_file


def download_archive(url):
    filename = os.path.split(url)[-1]
    try:
        filename = get_file(fname=filename, origin=url, cache_dir="./", extract=False)
    except:
        print(f"Unable to download {url}")
        return None
    return filename
```
Extract each zip archive into its own directory, named after the archive. Each archive contains audio data in wav format and a metadata file named data.json. We extract the audio data into a wav sub-directory and the data.json file into the archive's directory. Here's the code to achieve this:
```python
import pathlib
from zipfile import ZipFile


def extract_zipfile(filename):
    filename = pathlib.Path(filename)
    dirname = filename.parent / filename.stem
    with ZipFile(filename, "r") as zipf:
        for zipinfo in zipf.infolist():
            if zipinfo.filename[-1] == "/":
                continue
            zipinfo.filename = pathlib.Path(zipinfo.filename).name
            if pathlib.Path(zipinfo.filename).suffix == ".wav":
                zipf.extract(zipinfo, dirname / "wav")
            else:
                zipf.extract(zipinfo, dirname)
    wavfiles = list((dirname / "wav").rglob("*.wav"))
    return wavfiles
```
Since the wav format is uncompressed, this generates a lot of data and can quickly fill up even a 1TB hard drive. Therefore, we convert the wav data into the compressed mp3 format using the pydub library and delete each wav file once it has been converted. Any wav files that cannot be converted (for instance, due to data corruption) are dropped and deleted. The following code achieves this:
```python
from pydub import AudioSegment, effects


def get_mp3_path(wavfile):
    mp3dir = list(wavfile.parents)[1] / "mp3"
    mp3dir.mkdir(parents=True, exist_ok=True)
    mp3file = mp3dir / wavfile.with_suffix(".mp3").name
    return mp3file


def convert_wav_to_mp3(wavfile):
    mp3file = get_mp3_path(wavfile)
    try:
        wavaudio = AudioSegment.from_wav(wavfile)
        wavaudio = wavaudio.set_frame_rate(16000).set_channels(1)
        wavaudio = effects.normalize(wavaudio)
        wavaudio.export(mp3file, format="mp3", bitrate="16k")
        wavfile.unlink()
        return mp3file
    except Exception as e:
        print(f"Unable to convert {wavfile} to mp3")
        wavfile.unlink()
```
Finally, we put together all of the above steps for each source list file from step 1, using multi-processing and multi-threading to download, extract, and convert the source data files:
```python
from concurrent.futures import as_completed, ThreadPoolExecutor
from multiprocessing import cpu_count, Pool

from tqdm import tqdm


def main():
    urls = list(map(lambda x: x.strip(), open("../data/file_list.txt").readlines()))

    print("Downloading archives")
    zipfiles = []
    with ThreadPoolExecutor(cpu_count() * 4) as executor:
        results = executor.map(download_archive, urls)
        for url in tqdm(results):
            zipfiles.append(url)

    zipfiles = list(pathlib.Path("datasets").rglob("*.zip"))
    mp3files = []
    with ThreadPoolExecutor(max_workers=len(zipfiles)) as executor:
        print("Extracting archives")
        extracted = [executor.submit(extract_zipfile, file) for file in zipfiles]
        for wavfiles in as_completed(extracted):
            wavfiles = wavfiles.result()
            with Pool(cpu_count() - 1) as pool:
                results = pool.imap_unordered(convert_wav_to_mp3, wavfiles)
                print("Converting wav to mp3")
                for mp3file in tqdm(results, total=len(wavfiles)):
                    mp3files.append(mp3file)
    return mp3files
```
Note: This takes a very long time (about 3 days) and requires a high-speed internet connection and a lot of disk space!
One final step remains: collate all the data from the individual sub-directories into a single directory per language. The code for this can be found in the create_lang_dirs.py file in the repository. It collates all the audio files for a language into a single folder with a metadata.jsonl file in the same directory. For instance, it creates a tamil directory for the Tamil language with a tamil/train sub-directory for the audio data and tamil/metadata.jsonl containing all the metadata for the audio files; a rough sketch of this step follows.
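For reference, here's a minimal sketch of what that collation step might look like. The exact logic lives in create_lang_dirs.py; the directory layout and the structure assumed for each data.json file here are my assumptions, not the repository's code.

```python
import json
import pathlib
import shutil


def collate_language(source_dirs, language, out_root="../data"):
    """Hypothetical sketch of create_lang_dirs.py: gather mp3 files and metadata into <language>/."""
    train_dir = pathlib.Path(out_root) / language / "train"
    train_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for source_dir in map(pathlib.Path, source_dirs):
        # assumed: each source directory holds a data.json mapping file stems to transcripts
        metadata = json.loads((source_dir / "data.json").read_text(encoding="utf-8"))
        for mp3file in (source_dir / "mp3").glob("*.mp3"):
            shutil.copy2(mp3file, train_dir / mp3file.name)
            records.append({"file_name": f"train/{mp3file.name}", "sentence": metadata.get(mp3file.stem, "")})
    # write one metadata.jsonl per language, next to the train/ folder
    with open(train_dir.parent / "metadata.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```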
Preprocessing Datasets
Once the source datasets have been collected, we are ready to preprocess and clean them. Since the datasets come from various heterogeneous sources, they arrive in multiple formats. For instance, the sampling rate of audio from each source varies from 8 kHz to 48 kHz.
Additionally, the metadata for each file includes the transcription, gender, source, duration, etc. Some data sources also have very short (less than 3 seconds) or very long (more than 30 seconds) recordings. Therefore, we clean the data and create a homogeneous dataset for each language using the following steps:
First, load the source datasets. We use the datasets library to load and process them. The library uses a memory-mapped PyArrow backend that speeds up the processing of large amounts of data without requiring large amounts of RAM. We also resample the audio to 16 kHz at this step. Here's an example of this step from the Kannada language conversion script:
```python
from datasets import load_dataset, Audio, DatasetDict, concatenate_datasets


def load_data_splits(is_streaming=True, stopping_strategy="all_exhausted"):
    data_dict = {}
    data_dict["openslr_dataset_train"] = load_dataset("openslr", "SLR79", split="train", use_auth_token=True)
    data_dict["ucla_dataset_train"] = load_dataset("audiofolder", data_dir="../data/kannada/", drop_labels=True)["train"]
    data_dict["fleurs_dataset_train"] = load_dataset(
        "google/fleurs", "kn_in", split="train", use_auth_token=True
    ).rename_column("transcription", "sentence")
    data_dict["fleurs_dataset_val"] = load_dataset(
        "google/fleurs", "kn_in", split="validation", use_auth_token=True
    ).rename_column("transcription", "sentence")
    data_dict["fleurs_dataset_test"] = load_dataset(
        "google/fleurs", "kn_in", split="test", use_auth_token=True
    ).rename_column("transcription", "sentence")

    for k in data_dict:
        data_dict[k] = data_dict[k].remove_columns(
            [col for col in data_dict[k].column_names if col not in ["audio", "sentence"]]
        )
        data_dict[k] = data_dict[k].cast_column("audio", Audio(sampling_rate=16000))

    dataset_dict = DatasetDict()
    train_datasets = []
    test_datasets = []
    for k in data_dict:
        if k.endswith("train") or k.endswith("val"):
            train_datasets.append(data_dict[k])
        if k.endswith("test"):
            test_datasets.append(data_dict[k])
    dataset_dict["train"] = concatenate_datasets(train_datasets)
    dataset_dict["test"] = concatenate_datasets(test_datasets)
    return dataset_dict
```
Next, we convert the audio from the source format into 16 kHz single-channel mp3 files and filter out any records that have null transcripts or very short or very long audio. Here's the code to do this:
```python
import pathlib
from io import BytesIO

import scipy.io.wavfile as wavf
from pydub import AudioSegment, effects


def audio_from_array(array):
    # wrap the raw array in an in-memory wav file and load it with pydub
    file = BytesIO()
    wavf.write(file, 16000, array)
    audio_segment = AudioSegment.from_file(file)
    audio_segment = audio_segment.set_frame_rate(16000).set_channels(1)
    audio_segment = effects.normalize(audio_segment)
    return audio_segment


def filter_nans_and_short(example):
    sentence = example["sentence"]
    length = example["length"]
    if sentence is None:
        return False
    elif length < 3 or length > 30:
        return False
    else:
        return True


def export_audio_and_sentence(example, split):
    audio, sentence = example["audio"], example["sentence"]
    # `uuid` is a helper from the preprocessing script that derives a unique filename from the source path
    new_name = pathlib.Path(uuid(audio["path"])).with_suffix(".mp3").name
    new_path = pathlib.Path(f"../data/filtered_datasets/kannada/{split}/{new_name}")
    new_path.parent.mkdir(parents=True, exist_ok=True)
    length = len(audio["array"]) / 16000
    if new_path.is_file():
        return {"path": str(new_path), "sentence": sentence, "length": length, "split": split}
    else:
        if not split == "test":
            is_not_filtered = filter_nans_and_short({"length": length, "sentence": sentence})
        else:
            is_not_filtered = True
        if is_not_filtered:
            try:
                new_path.parent.mkdir(exist_ok=True, parents=True)
                audio_segment = audio_from_array(audio["array"])
                audio_segment.export(new_path, format="mp3", bitrate="16k")
                segment = AudioSegment.from_mp3(new_path)  # sanity check that the export is readable
            except:
                return None
            return {"path": str(new_path), "sentence": sentence, "length": length, "split": split}
        else:
            return None
```
Finally, we put together all the above code for each language and create a single homogeneous dataset for each language. Here’s an example for the Kannada language:
```python
from functools import partial
from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd
from tqdm import tqdm

if __name__ == "__main__":
    dataset_dict = load_data_splits(is_streaming=False)
    exports = []
    with Pool(cpu_count() - 1) as pool:
        for k in dataset_dict:
            dataset = dataset_dict[k].shuffle(seed=np.random.randint(1000))
            exporter = partial(export_audio_and_sentence, split=k)
            results = pool.imap_unordered(exporter, dataset, chunksize=10)
            for result in tqdm(results, total=len(dataset)):
                if result:
                    exports.append(result)
    exports = pd.DataFrame(exports)
    exports.to_json("../data/filtered_datasets/kannada/metadata.jsonl", lines=True, orient="records")
```
The full source code for cleaning up the data for each language is available in the following files:
While the above results in some duplication of code, it was the quickest way to collate data for each language from multiple data sources. I hope to further refactor the code to take a language parameter and build every language's dataset from a single source file. The final statistics of the collected and cleaned data are in the table below:
| Language | Duration Total (h) | Duration Mean (h) | Duration Min (h) | Duration Max (h) | Chars Total | Chars Mean | Chars Min | Chars Max | Words Total | Words Mean | Words Min | Words Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tamil | 1293.71 | 0.0019 | 0.0008 | 0.0083 | 70519146 | 103.26 | 2 | 524 | 8018783 | 11.74 | 1 | 62 |
| Telugu | 387.21 | 0.0019 | 0.0008 | 0.0042 | 17061166 | 81.53 | 4 | 289 | 2115563 | 10.11 | 1 | 36 |
| Malayalam | 10.13 | 0.0015 | 0.0008 | 0.0042 | 480309 | 72.00 | 4 | 303 | 53283 | 7.99 | 1 | 36 |
| Kannada | 358.84 | 0.0021 | 0.0008 | 0.0074 | 12973393 | 75.11 | 2 | 249 | 1592203 | 9.22 | 1 | 31 |
Hosting the Datasets
Once cleaned and converted, we are ready to host the datasets somewhere easily accessible to anyone across the globe. I chose the Hugging Face Datasets Hub.
To do this, the audio files were compressed into tar archives, and a data-loading script was built for each language. As an example, here’s the main source code of the data-loading script for the Tamil language:
```python
# https://huggingface.co/datasets/parambharat/tamil_asr_corpus/blob/main/tamil_asr_corpus.py
class TamilASRCorpus(datasets.GeneratorBasedBuilder):
    """Tamil ASR Corpus contains transcribed speech corpus for training ASR systems for Tamil language."""

    VERSION = datasets.Version("1.1.0")

    def _info(self):
        features = datasets.Features(
            {
                "audio": datasets.Audio(sampling_rate=16_000),
                "path": datasets.Value("string"),
                "sentence": datasets.Value("string"),
                "length": datasets.Value("float"),
            }
        )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            supervised_keys=("sentence", "label"),
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        metadata_paths = dl_manager.download(_METADATA_URLS)
        train_archive = dl_manager.download(_URLS["train"])
        test_archive = dl_manager.download(_URLS["test"])
        local_extracted_train_archive = dl_manager.extract(train_archive) if not dl_manager.is_streaming else None
        local_extracted_test_archive = dl_manager.extract(test_archive) if not dl_manager.is_streaming else None
        test_archive = dl_manager.download(_URLS["test"])
        train_dir = "train"
        test_dir = "test"

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "metadata_path": metadata_paths["train"],
                    "local_extracted_archive": local_extracted_train_archive,
                    "path_to_clips": train_dir,
                    "audio_files": dl_manager.iter_archive(train_archive),
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
                    "metadata_path": metadata_paths["test"],
                    "local_extracted_archive": local_extracted_test_archive,
                    "path_to_clips": test_dir,
                    "audio_files": dl_manager.iter_archive(test_archive),
                },
            ),
        ]

    def _generate_examples(self, metadata_path, local_extracted_archive, path_to_clips, audio_files):
        """Yields examples as (key, example) tuples."""
        examples = {}
        with open(metadata_path, encoding="utf-8") as f:
            for key, row in enumerate(f):
                data = json.loads(row)
                examples[data["path"]] = data

        inside_clips_dir = False
        id_ = 0
        for path, f in audio_files:
            if path.startswith(path_to_clips):
                inside_clips_dir = True
                if path in examples:
                    result = examples[path]
                    path = os.path.join(local_extracted_archive, path) if local_extracted_archive else path
                    result["audio"] = {"path": path, "bytes": f.read()}
                    result["path"] = path
                    yield id_, result
                    id_ += 1
            elif inside_clips_dir:
                break
```
Finally, the datasets were pushed to the Datasets Hub; the following datasets were created in this manner.
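For completeness, here's a hedged sketch of that upload step using the huggingface_hub client. The repo ID matches one of this project's datasets, but the local folder layout is an assumption.

```python
from huggingface_hub import HfApi

api = HfApi()
# create the dataset repo on the Hub if it doesn't already exist
api.create_repo("parambharat/tamil_asr_corpus", repo_type="dataset", exist_ok=True)
# upload the tar archives, the metadata files, and the loading script in one call
api.upload_folder(
    folder_path="../data/hub/tamil_asr_corpus",  # assumed local layout holding the archives and the script
    repo_id="parambharat/tamil_asr_corpus",
    repo_type="dataset",
)
```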
Here's a table preview of the Tamil dataset:
These datasets are provided under a CC 4.0 license and can be reused, adapted, and shared by anyone, anywhere, with just a single line of code. The line below not only downloads the dataset but also caches it for future reuse:
```python
from datasets import load_dataset

dataset = load_dataset("parambharat/telugu_asr_corpus", split="train")
```
Additionally, if you're constrained by disk space, you can pass the streaming parameter to the load_dataset function and access the data as you would a Python iterator. Here's an example:
```python
from datasets import load_dataset

dataset = load_dataset("parambharat/telugu_asr_corpus", split="train", streaming=True)
next(iter(dataset))
```

Output:

```
{'audio': {'path': 'train/9zKDqLgBykz9FSyL6bFPRQ.mp3',
  'array': array([ 1.2912806e-03,  2.4933945e-03,  3.6223403e-03, ...,
          5.2775860e-05, -1.2118297e-05,  8.1833590e-05], dtype=float32),
  'sampling_rate': 16000},
 'path': 'train/9zKDqLgBykz9FSyL6bFPRQ.mp3',
 'sentence': 'అక్కడ మీడియాతో మాట్లాడిన అనంతరం తిరిగి హైదరాబాద్ బయలుదేరుతారు',
 'length': 3.87}
```
Dataset EDA
With the cleaned datasets in hand, we can now perform exploratory data analysis to get an understanding of the data. You can see the full analysis of the datasets in this notebook.
Note: Run git clone <dataset_link> to fetch the datasets locally if you would like to re-run the notebook. The notebook expects all datasets to be in the data directory. For instance, to fetch the Malayalam dataset you would run git clone https://huggingface.co/datasets/parambharat/malayalam_asr_corpus
Here are a few plots to show what we're working with here:
Key Insights
- All the histograms are skewed toward shorter clips, with most of the data under 15 seconds. This is useful for deciding the truncation and padding sizes when feeding the data into the network.
- The violin plots show that the audio is most dense in the 5-10 seconds region.
- Most of the transcripts across languages contain between 3 and 15 words. This is helpful for deciding the decoding sequence length of the models.
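If you'd like to reproduce these distributions without the notebook, a minimal sketch using the per-clip metadata written during preprocessing is below; the path is an assumption based on the layout described earlier.

```python
import matplotlib.pyplot as plt
import pandas as pd

# each metadata.jsonl row has path, sentence, length (seconds), and split
df = pd.read_json("../data/filtered_datasets/kannada/metadata.jsonl", lines=True)
df["n_words"] = df["sentence"].str.split().str.len()

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
df["length"].plot.hist(bins=50, ax=axes[0], title="Clip duration (seconds)")
df["n_words"].plot.hist(bins=50, ax=axes[1], title="Words per transcript")
plt.tight_layout()
plt.show()
```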
Modeling
The model takes log-mel spectrogram features as input and processes them through two 1D convolution layers before applying an encoder-decoder transformer with multi-headed cross- and self-attention to directly predict the transcript.
The model makes use of special tokens to predict the language, timestamps, and transcripts in an autoregressive fashion. The authors report near human-level performance and a significant decrease in word error rate (WER), setting new state-of-the-art (SOTA) results on multiple languages and benchmarks. The model was evaluated across 96 languages, including the Tamil, Kannada, Malayalam, and Telugu languages that we're interested in for this report.
However, the datasets used to train the model on these languages are quite small compared to the datasets we were able to gather. The screenshot below of a table from the paper shows performance on the google/fleurs benchmark dataset, with the model's performance on our languages of interest highlighted.

Table from the Whisper paper with performance on the languages of this project highlighted.
Models
In this project, we train three variants of the Whisper model for each language. The details of the original pre-trained checkpoints are in the following table:
| Model | Layers | Embedding Size | Attention Heads | Parameters | Checkpoint |
|---|---|---|---|---|---|
| whisper-tiny | 4 | 384 | 6 | 39M | https://huggingface.co/openai/whisper-tiny |
| whisper-base | 6 | 512 | 8 | 74M | https://huggingface.co/openai/whisper-base |
| whisper-small | 12 | 768 | 12 | 244M | https://huggingface.co/openai/whisper-small |
Note: Here, Layers refers to a symmetric number of encoder and decoder transformer layers, i.e., 4 layers means 4 encoder layers and 4 decoder layers.
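If you want to verify these numbers yourself, you can load a pre-trained checkpoint and inspect its configuration; this is a quick sanity check rather than part of the training code:

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
config = model.config
print(config.encoder_layers, config.decoder_layers)    # 4 encoder and 4 decoder layers
print(config.d_model, config.encoder_attention_heads)  # embedding size 384, 6 attention heads
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```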
Loading the Data
Since we're dealing with very large datasets that need to be turned into mel-spectrograms, we use the datasets library to lazily load and stream the datasets from disk while mapping the feature extraction process across the entire dataset.
While this potentially slows down model training, the slowdown is small compared to the RAM that would be required to hold and process nearly half a million audio files at once. The code snippet below shows how to load the dataset in streaming mode:
```python
def load_data_splits(is_streaming=True):
    data_dict = load_dataset("parambharat/tamil_asr_corpus", streaming=is_streaming)
    return data_dict
```
Data Augmentation
The authors of the Whisper paper note that the data used to pre-train the model was not augmented in any way. Since we want the model to be robust in noisy settings and to generalize well across speakers, we choose to augment the dataset. We use the audiomentations library to apply AddGaussianNoise, TimeStretch, and PitchShift augmentations to the audio waveform. This is performed only on the training dataset with the following code:
```python
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

augment_waveform = Compose(
    [
        AddGaussianNoise(min_amplitude=0.005, max_amplitude=0.015, p=0.2),
        TimeStretch(min_rate=0.8, max_rate=1.25, p=0.2, leave_length_unchanged=False),
        PitchShift(min_semitones=-4, max_semitones=4, p=0.2),
    ]
)


def augment_dataset(batch):
    audio = batch["audio"]["array"]
    # apply augmentation
    augmented_audio = augment_waveform(samples=audio, sample_rate=16000)
    batch["audio"]["array"] = augmented_audio
    return batch


# call augment_dataset on the training set
dataset_dict["train"] = dataset_dict["train"].map(augment_dataset)
```
Note: Each augmentation is applied randomly and independently (with probability 0.2 in the code above), so at most three augmentations are applied to any given waveform.
Preprocessing
While our data collection stage mostly preprocessed the audio data and removed noisy data, we still ensure that nulls and short transcripts are filtered out when training the models. Additionally, we need to normalize the transcripts by stripping punctuation before feeding them into the model. The following code snippet is used to filter and clean the dataset:
```python
import string


def fix_sentence(sentence):
    transcription = sentence
    if transcription.startswith('"') and transcription.endswith('"'):
        # we can remove trailing quotation marks as they do not affect the transcription
        transcription = transcription[1:-1]
    if transcription[-1] not in [".", "?", "!"]:
        # append a full-stop to sentences that do not end in punctuation
        transcription = transcription + "."
    transcription = transcription[:-1].translate(str.maketrans("", "", string.punctuation)) + transcription[-1]
    return transcription


def filter_empty_strings(sentence):
    if len(sentence) < 2:
        return False
    else:
        return True


for k in dataset_dict:
    dataset_dict[k] = dataset_dict[k].filter(filter_empty_strings, input_columns=["sentence"])
```
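As a quick illustration of what fix_sentence does (the string here is just an illustrative example, not from the datasets):

```python
# surrounding quotes and inner punctuation are stripped, and a trailing full-stop is added when missing
print(fix_sentence('"hello, world"'))  # -> hello world.
```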
Feature Extraction
Since the model takes mel-spectrogram features as input and predicts a transcription, we need to convert the audio into features and the transcription text into tokenized inputs that the model can predict. The transformers library provides two utility classes to achieve this: WhisperFeatureExtractor and WhisperTokenizer.
We use these classes to extract features from the audio and tokenize the transcripts into byte-pair tokens. The following snippet runs this step over the entire dataset, here for the Telugu language:
```python
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-tiny", language="Telugu", task="transcribe", model_max_length=225
)


def prepare_dataset(examples):
    # compute log-Mel input features from input audio array
    audio = examples["audio"]
    examples["input_features"] = feature_extractor(audio["array"], sampling_rate=16000).input_features[0]

    sentences = fix_sentence(examples["sentence"])
    # encode target text to label ids
    examples["labels"] = tokenizer(sentences, max_length=225, truncation=True).input_ids
    return examples


for k in dataset_dict:
    dataset_dict[k] = dataset_dict[k].map(prepare_dataset).with_format("torch")
```
Note: The above code also converts the dataset into an instance of IterableDataset for easy loading into the model. Furthermore, you'll notice that the tokenizer adds some special tokens to the start and end of the transcript, and that the shape of the input features is (80, 3000), i.e. 80 mel bins over 3000 time frames (30 seconds of audio). Here's a sample:
```python
features = feature_extractor(sample["audio"]["array"], sampling_rate=16000)["input_features"][0]
print(features)
print(features.shape)
```

Output:

```
[[ 0.03961742 -0.4417838  -0.423414   ... -0.6383085  -0.6383085  -0.6383085 ]
 [ 0.04983616 -0.28957796 -0.087901   ... -0.6383085  -0.6383085  -0.6383085 ]
 [ 0.13291407 -0.16156828 -0.17585051 ... -0.6383085  -0.6383085  -0.6383085 ]
 ...
 [-0.6383085  -0.6383085  -0.6383085  ... -0.6383085  -0.6383085  -0.6383085 ]
 [-0.6383085  -0.6383085  -0.6383085  ... -0.6383085  -0.6383085  -0.6383085 ]
 [-0.6383085  -0.6383085  -0.6383085  ... -0.6383085  -0.6383085  -0.6383085 ]]
(80, 3000)
```

```python
input_str = sample["sentence"]
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)
print(f"Input: {input_str}")
print(f"Decoded w/ special: {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal: {input_str == decoded_str}")
```

Output:

```
Input: అక్కడ మీడియాతో మాట్లాడిన అనంతరం తిరిగి హైదరాబాద్ బయలుదేరుతారు
Decoded w/ special: <|startoftranscript|><|te|><|transcribe|><|notimestamps|>అక్కడ మీడియాతో మాట్లాడిన అనంతరం తిరిగి హైదరాబాద్ బయలుదేరుతారు<|endoftext|>
Decoded w/out special: అక్కడ మీడియాతో మాట్లాడిన అనంతరం తిరిగి హైదరాబాద్ బయలుదేరుతారు
Are equal: True
```
Batching
So far, we have loaded, preprocessed, and converted the dataset into the model's input format.
However, we also need to feed the model batches of data during training. Dealing with variable transcription lengths means we need to dynamically pad each batch of tensors to the same size. To do this, we create a utility class to perform the data collation. Here's the code for the class:
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-tiny", language="Telugu", task="transcribe", model_max_length=225
)


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [
            {"input_ids": self.processor.tokenizer.truncate_sequences(feature["labels"])[0]}
            for feature in features
        ]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
```
Here, WhisperProcessor is a utility class in the transformers library that combines the tokenizer and the feature extractor from the feature extraction step.
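With the class defined, creating the collator is a one-liner; this is the data_collator instance passed to the trainer in the training section below:

```python
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
```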
Computing Metrics
Before we begin training the model, we need a way to compute model metrics. Since we're interested in predicting the transcript for a given audio clip, we need to quantify how much error the model makes in its predictions.
A common metric for this is the word error rate (WER), which computes the ratio of word-level errors (substitutions, insertions, and deletions) in the generated transcript to the total number of words in the reference. We use the evaluate library to compute this metric. The following code snippet computes the normalized word error rate:
```python
import evaluate

metric = evaluate.load("wer")

# evaluate with the 'normalised' WER
do_normalize_eval = True


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True, normalize=do_normalize_eval)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True, normalize=do_normalize_eval)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
```
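To build some intuition for the metric, here's a tiny standalone example with toy sentences (not from the datasets):

```python
import evaluate

wer_metric = evaluate.load("wer")
# one substituted word out of four reference words gives a WER of 0.25
print(wer_metric.compute(references=["the cat sat down"], predictions=["the cat sat up"]))  # 0.25
```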
Training
To train the model, we use the Seq2SeqTrainer utility class from the transformers library. It wraps the common training logic required to train encoder-decoder models and takes training arguments, a model, datasets, and callbacks. The following code snippet shows the training arguments used to train a whisper-tiny Telugu model:
```python
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base", use_cache=False)

training_args = Seq2SeqTrainingArguments(
    output_dir="../models/whisper-tiny-te",  # change to a repo name of your choice
    per_device_train_batch_size=72,  # batch size provided to each GPU
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,  # learning rate
    save_total_limit=4,  # no. of last model checkpoints to save
    warmup_steps=500,  # no. of batches for learning rate warmup
    max_steps=3000,  # total number of training batches to train the model
    gradient_checkpointing=True,  # whether or not to use gradient checkpointing on activations
    fp16=True,  # whether to use mixed precision training
    optim="adamw_bnb_8bit",  # 8-bit AdamW optimizer (bitsandbytes) for lower GPU RAM usage
    evaluation_strategy="steps",  # evaluate every eval_steps batches rather than every epoch
    per_device_eval_batch_size=36,  # evaluation batch size (lower than training for seq2seq models)
    predict_with_generate=True,  # whether to run predictions with model.generate (autoregressively)
    generation_max_length=225,  # maximum length of the generated sequences
    save_steps=300,  # number of batches after which to checkpoint the model
    eval_steps=300,  # number of batches after which to evaluate the model
    logging_steps=100,  # number of batches after which to log metrics from the model
    report_to="none",  # whether to report the model progress to tensorboard or any other integration
    load_best_model_at_end=True,  # whether to load the best model weights at the end of training
    metric_for_best_model="wer",  # metric used to identify the best model
    greater_is_better=False,  # needs to be lower for WER
    hub_strategy="checkpoint",  # push model checkpoints to the Hugging Face Hub
    push_to_hub=True,  # whether to push the model to the Hugging Face Hub
    remove_unused_columns=False,  # keep all dataset columns so the custom collator receives what it needs
    ignore_data_skip=True,  # whether to ignore the first n batches when resuming model training
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset_dict["train"],
    eval_dataset=samples_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor,
    callbacks=[ShuffleCallback()],
)
```
Callbacks
In addition to the built-in callbacks, we add a callback to shuffle the iterable datasets at the start of each epoch and a callback to monitor model generations periodically. For the latter, we use Weights & Biases Tables to track, organize, and visualize model training and evaluations. We create custom callbacks that we then add to the trainer.
The code for the custom callbacks:
```python
import numbers
import tempfile
from pathlib import Path

import pandas as pd
from datasets import IterableDataset
from transformers import TrainerCallback
from transformers.integrations import WandbCallback
from transformers.trainer_pt_utils import IterableDatasetShard

# dataset_to_records, decode_predictions, compute_measures, load_samples_dataset, and
# compute_spectrograms are helper functions defined in the project notebooks.


# trainer callback to reinitialise and reshuffle the streamable datasets at the beginning of each epoch
class ShuffleCallback(TrainerCallback):
    def on_epoch_begin(self, args, state, control, train_dataloader, **kwargs):
        if isinstance(train_dataloader.dataset, IterableDatasetShard):
            pass  # set_epoch() is handled by the Trainer
        elif isinstance(train_dataloader.dataset, IterableDataset):
            train_dataloader.dataset.set_epoch(train_dataloader.dataset._epoch + 1)


class WandbProgressResultsCallback(WandbCallback):
    def __init__(self, trainer, sample_dataset):
        super().__init__()
        self.trainer = trainer
        self.sample_dataset = sample_dataset
        self.records_df = dataset_to_records(sample_dataset)

    def on_log(self, args, state, control, model=None, logs=None, **kwargs):
        super().on_log(args, state, control, model, logs)
        predictions = self.trainer.predict(self.sample_dataset)
        predictions = decode_predictions(self.trainer, predictions)
        measures_df = compute_measures(predictions, self.records_df["sentence"].tolist())
        records_df = pd.concat([self.records_df, measures_df], axis=1)
        records_df["prediction"] = predictions
        records_df["step"] = state.global_step
        records_table = self._wandb.Table(dataframe=records_df)
        self._wandb.log({"sample_predictions": records_table})

    def on_save(self, args, state, control, model=None, tokenizer=None, **kwargs):
        if self._wandb is None:
            return
        if self._log_model and self._initialized and state.is_world_process_zero:
            with tempfile.TemporaryDirectory() as temp_dir:
                self.trainer.save_model(temp_dir)
                metadata = (
                    {
                        k: v
                        for k, v in dict(self._wandb.summary).items()
                        if isinstance(v, numbers.Number) and not k.startswith("_")
                    }
                    if not args.load_best_model_at_end
                    else {
                        f"eval/{args.metric_for_best_model}": state.best_metric,
                        "train/total_floss": state.total_flos,
                    }
                )
                artifact = self._wandb.Artifact(name=f"model-{self._wandb.run.id}", type="model", metadata=metadata)
                for f in Path(temp_dir).glob("*"):
                    if f.is_file():
                        with artifact.new_file(f.name, mode="wb") as fa:
                            fa.write(f.read_bytes())
                self._wandb.run.log_artifact(artifact)


# pass a sample from the test dataset to the callback
samples_dataset = load_samples_dataset(dataset_dict["test"]).map(compute_spectrograms)

# create the callback
progress_callback = WandbProgressResultsCallback(trainer, samples_dataset)

# add the callback to the trainer
trainer.add_callback(progress_callback)
```
In addition to logging training progress, the callback also generates visualizations and saves model artifacts. These artifacts can be used to resume model training whenever a run crashes. Here's a sample artifact:
(W&B model artifact panel: parambharat/whisper_finetuning/model-2k10w4qq:v5, aliases latest and v5, 10 files, 151.1 MB, created December 9th, 2022.)
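Here's a hedged sketch of how one of these artifacts could be pulled down after a crash. The artifact path matches the panel above, but the project name passed to wandb.init and the idea of reloading the weights into a fresh trainer are assumptions about how you'd wire it up.

```python
import wandb
from transformers import WhisperForConditionalGeneration

run = wandb.init(project="whisper_finetuning", resume="allow")
# download the model files saved by the custom on_save callback
artifact = run.use_artifact("parambharat/whisper_finetuning/model-2k10w4qq:v5", type="model")
checkpoint_dir = artifact.download()
# reload the weights and rebuild the trainer around this model to continue training
model = WhisperForConditionalGeneration.from_pretrained(checkpoint_dir)
```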
Finally, we can train the model by simply calling the train method of the trainer:
trainer.train()
This outputs the training progress, including a table of the training loss, evaluation loss, and the evaluation WER metrics. Here's an example from the whisper-tiny-telugu model:

Training progress of the whisper-tiny-te model
Evaluation
We evaluate each model on the test sets of the mozilla-foundation/common_voice_11 and google/fleurs datasets. The full code for evaluation can be seen in the run_streaming_evaluation.py file.
To run the evaluations, we create a transformers.AutomaticSpeechRecognitionPipeline and calculate the word error rate (WER) over each full test set. Here's the main code snippet for this process. For the full code, please refer to the evaluation script provided by the Whisper community event organizers here:
```python
batch_size = args.batch_size
whisper_asr = pipeline("automatic-speech-recognition", model=args.model_id, device=args.device)
whisper_asr.model.config.forced_decoder_ids = whisper_asr.tokenizer.get_decoder_prompt_ids(
    language=args.language, task="transcribe"
)

dataset = load_dataset(
    args.dataset,
    args.config,
    split=args.split,
    streaming=False,
    use_auth_token=True,
)
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
dataset = dataset.map(normalise)
dataset = dataset.filter(is_target_text_in_range, input_columns=["norm_text"])

predictions = []
references = []

# run streamed inference
for out in tqdm(whisper_asr(data(dataset), batch_size=batch_size)):
    predictions.append(whisper_norm(out["text"]))
    references.append(out["reference"][0])

wer = wer_metric.compute(references=references, predictions=predictions)
wer = round(100 * wer, 2)
print("WER:", wer)
```
Released Models
Since I trained multiple models across multiple machines, I created separate model training notebooks, each generating its own logs. The following table contains the links to each notebook, the corresponding training logs, and the models:
Training Curves
The interactive Weights & Biases dashboards linked in the table above are the best way to view all the training and validation logs. However, for ease of viewing, I also include the training and validation plots of each model, along with the sample prediction tables logged by the custom callback:
Tamil
Whisper Tiny
Whisper Base
Whisper Small
Malayalam
Whisper Tiny
Whisper Base
Whisper Small
Telugu
Whisper Tiny
Whisper Base
Whisper Small
Kannada
Whisper Tiny
Whisper Base
Whisper Small
Evaluation Results
As discussed earlier in the evaluation section, we report evaluation results on the mozilla-foundation/common_voice_11 and google/fleurs test sets. These results are also published along with the model cards. The following table reports the evaluation of each model on the test sets.
(While Tamil and Malayalam are available in both common_voice and google/fleurs datasets, Telugu and Kannada are only available in the google/fleurs dataset.)
Language | Model | Common Voice(WER) | Fleurs(WER) |
---|---|---|---|
Tamil | https://huggingface.co/parambharat/whisper-tiny-ta | 30.103 | 26.07 |
Tamil | https://huggingface.co/parambharat/whisper-base-ta | 15.780 | 20.410 |
Tamil | https://huggingface.co/parambharat/whisper-small-ta | 11.150 | 15.800 |
Malayalam | https://huggingface.co/parambharat/whisper-tiny-ml | 45.720 | 62.150 |
Malayalam | https://huggingface.co/parambharat/whisper-base-ml | 34.160 | 53.290 |
Malayalam | https://huggingface.co/parambharat/whisper-small-ml | 25.800 | 48.160 |
Telugu | https://huggingface.co/parambharat/whisper-tiny-te | N/A | 52.670 |
Telugu | https://huggingface.co/parambharat/whisper-base-te | N/A | 39.090 |
Telugu | https://huggingface.co/parambharat/whisper-small-te | N/A | 30.260 |
Kannada | https://huggingface.co/parambharat/whisper-tiny-kn | N/A | 43.700 |
Kannada | https://huggingface.co/parambharat/whisper-base-kn | N/A | 30.260 |
Kannada | https://huggingface.co/parambharat/whisper-small-kn | N/A | 25.540 |
Demonstrations
I provide an inference notebook where running the code initializes a Gradio application to interactively generate transcripts. These applications are also publicly available via Hugging Face Spaces; see the table below.
A web app for each model developed in this project is released as a Space on the Hugging Face Hub. The apps can transcribe audio from the microphone and from YouTube. Below are interactive widgets that you can try.
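The Spaces are built with Gradio; a minimal sketch of such an app (using one of the released checkpoints, with microphone input only and no YouTube handling) looks roughly like this:

```python
import gradio as gr
from transformers import pipeline

# load one of the fine-tuned checkpoints released by this project
asr = pipeline("automatic-speech-recognition", model="parambharat/whisper-small-ta", chunk_length_s=30)


def transcribe(audio_path):
    # gradio passes the recorded audio as a filepath when type="filepath"
    return asr(audio_path)["text"]


demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text",
    title="Tamil Whisper ASR",
)

if __name__ == "__main__":
    demo.launch()
```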
Summary
This project was a great opportunity for me to fine-tune automatic speech recognition models for low-resource languages using Whisper. I went through the entire process, from data sourcing, collection, cleaning, and preprocessing to model training, evaluation, and even deployment.
In the process, I was able to release four new datasets and twelve new models for languages that have often been overlooked by the research community, mainly due to the lack of large quantities of annotated data. This has left speakers of these languages without access to new technologies enabled by deep learning that are often readily available to speakers of more widely spoken languages.
This project demonstrates that the gap can be bridged with the right set of tools and technologies. In addition to training deep learning models, I was able to make a meaningful contribution to my native language, Tamil, by releasing open-source speech recognition models and datasets that can be easily accessed by its speakers. I was also able to release models for other languages of my country's region and neighboring states, including Kannada, Telugu, and Malayalam.