August 2023, Fine-Tuning ASR Models for Japanese and Vietnamese by using OpenAI Whisper and Azure Speech Studio

Vikas Reddy - University of Maryland

Abstract

In the rapidly evolving landscape of communication technology, recent breakthroughs such as the OpenAI Whisper model have significantly improved the accuracy and accessibility of multilingual speech-to-text. Despite these developments, there remains room for improvement, particularly in terms of accuracy. This research is dedicated to enhancing the capabilities of automatic speech recognition (ASR) models, with a specific focus on the Vietnamese and Japanese languages. Our evaluation employs standard measures to gauge performance. For Vietnamese, we use the Word Error Rate (WER) metric, which assesses how often the recognized words deviate from the actual spoken words. For Japanese, we employ the Character Error Rate (CER) metric, which scrutinizes the accuracy of individual characters. The outcomes of our study are marked by substantial improvements. Notably, the FOSD + Common Voice + Google Fleurs + Vivos Vietnamese model accomplished a WER of 9.4568%, while the Japanese ReazonSpeech + Common Voice + Google Fleurs model exhibited a CER of 8.1493%. These findings underscore the effectiveness of fine-tuning ASR models and highlight their potential in practical applications of transforming spoken language into written text.

Table of Contents

  1. Abstract
  2. Background Information
  3. Environment Setup
  4. Load Datasets
  5. Data Preprocessing
  6. Training
  7. Parameter Efficient Fine-tuning
  8. Results
  9. Evaluation
  10. Azure Speech Studio
  11. Conclusion
  12. Next Steps
  13. References

Background Information

In today's society, communication and technology have become indispensable, yet several challenges persist that impact accessibility, inclusivity, and efficient knowledge dissemination. This is where advancements like automatic speech recognition (ASR) step in, streamlining interactions between humans and computers, particularly evident in online meeting calls. ASR is the process of converting speech signals into their corresponding text. Recently, this task has gained traction among various corporations, thanks to the accessibility of large speech datasets along with their corresponding transcripts. However, traditional deep learning algorithms demand a substantial amount of labeled data and considerable training time. To address this issue, a new trend has emerged: the use of self-supervised models like Wav2Vec, which are first pre-trained on unlabeled speech data before fine-tuning. OpenAI has made groundbreaking advancements in this area, revealing their ASR model trained entirely using supervised learning on weakly labeled data, which includes 680,000 hours of speech data in 98 languages.

Figure 1: OpenAI Whisper Architecture

OpenAI Whisper is a Transformer-based encoder-decoder model, specifically designed as a sequence-to-sequence architecture. It takes audio spectrogram features as input and converts them into a sequence of text tokens. A feature extractor first transforms the raw audio into a log-Mel spectrogram, which the Transformer encoder turns into a sequence of encoder hidden states. The decoder then predicts text tokens autoregressively, conditioning on the previously predicted tokens through self-attention and on the encoder hidden states through cross-attention. The model is pre-trained and fine-tuned using the cross-entropy objective function, contributing to its performance in speech recognition tasks.
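
To make the pipeline above concrete, the following minimal sketch runs the feature extractor and the encoder-decoder generation loop end to end. It uses the small openai/whisper-tiny checkpoint for brevity and a hypothetical 16 kHz recording named sample.wav; the larger checkpoints used later in this article work the same way.

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Feature extractor: raw waveform -> log-Mel spectrogram input features
audio, sr = librosa.load("sample.wav", sr=16000, mono=True)  # "sample.wav" is a placeholder path
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# The decoder autoregressively predicts token ids, which the tokenizer maps back to text
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])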

Environment Setup

There are two distinct approaches to fine-tuning Whisper: utilizing Google Colab or running the code on a local PC. Google Colab provides the advantage of being a cloud-based platform, allowing users to access powerful hardware resources without the need for expensive on-premise equipment. However, it has its limitations, such as potential interruptions or session expirations, which can abruptly stop the fine-tuning process. On the other hand, fine-tuning on a local PC offers more control and stability, but it may require significant computational resources, which can be expensive. Despite their respective drawbacks, both methods are valid options for fine-tuning Whisper.

Many different packages are required for fine-tuning Whisper, so it is essential to set up your environment before initiating the fine-tuning process. One efficient way to create a new environment is to use Anaconda or Miniconda and run the conda create command with your desired environment name. Once the environment is created, the next step is to install the necessary packages. The essential packages for fine-tuning Whisper are listed below, including PyTorch with CUDA support. PyTorch is a popular deep learning framework that provides the required tools for training and working with neural networks. CUDA, on the other hand, is NVIDIA's parallel computing platform and programming model, which accelerates deep learning computations on compatible GPUs.

conda create -n whisper python=3.10  # example environment name and Python version; adjust as needed
conda activate whisper

python -m pip install -U pip
pip install evaluate pandas numpy huggingface_hub pydub tqdm spacy ginza audiomentations
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install "datasets>=2.6.1"
pip install git+https://github.com/huggingface/transformers
pip install librosa
pip install "evaluate>=0.30"
pip install jiwer
pip install gradio
pip install -q bitsandbytes datasets accelerate loralib
pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main
Figure 2: Setting up Conda Environment and Downloading the Necessary Packages

CUDA (Compute Unified Device Architecture), created by NVIDIA, is a parallel computing platform and application programming interface. NVIDIA GPUs (Graphics Processing Units) empower developers to handle a wide array of general-purpose computing tasks far more efficiently. In this context, we leverage these GPUs to expedite and enhance the fine-tuning process of ASR models. The computer configuration adopted for this article's fine-tuning tasks is a Windows 11 Pro PC with an AMD Ryzen 7 3700X 8-Core Processor and 80GB of RAM. This computer also includes CUDA support through a GeForce RTX 3090 NVIDIA graphics card.
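
Before launching a long training run, it is worth confirming that PyTorch actually sees the GPU. The following minimal sketch prints the CUDA availability, device name, and total VRAM.

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    print(f"Total VRAM: {total_bytes / 1024**3:.1f} GB")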

Load Datasets

There are two primary methods to load datasets for training with Whisper. The first approach involves using Hugging Face, a platform that grants access to datasets once users accept the terms of use. The process involves specifying four key elements within the load_dataset function: the dataset being imported, the desired language, the training, validation, or test split, and the option to use an authentication token. For the sake of simplicity, it is best to retain only the essential metadata elements like "audio" and "transcription". Note that in certain datasets, such as the Common Voice 13 dataset, the "transcription" column may instead be named "sentence".

from datasets import load_dataset, DatasetDict
common_voice = DatasetDict()
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="train+validation", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="test", use_auth_token=True)
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
Figure 3: Loading Datasets Using Hugging Face

In addition to importing datasets through Hugging Face, the other option is to prepare the dataset manually. This process involves handling the audio and text data through a set of defined steps. First, download the dataset from the provider, which includes all of the audio files and their transcripts. The transcript format varies between datasets: some provide a single text file for the whole dataset, while others provide a corresponding text file for each audio file, and these text files must be processed and transformed into a CSV file. Figure 4 illustrates this process for the FPT Open Speech Dataset, which contains all of the transcriptions in a single file.

import os, csv, codecs

def text_change_csv(input_path, output_path):
    # Build the output CSV path next to the input transcript file
    file_csv = os.path.splitext(output_path)[0] + ".csv"
    output_dir = os.path.dirname(input_path)
    output_file = os.path.join(output_dir, file_csv)
    encodings = ["utf-8", "latin-1"]

    # Try each encoding until the transcript can be decoded
    for encoding in encodings:
        try:
            with open(input_path, 'r', encoding=encoding) as rf:
                with codecs.open(output_file, 'w', encoding=encoding, errors='replace') as wf:
                    writer = csv.writer(wf, delimiter=',')
                    # Each transcript line is "<file name>|<sentence>"; split it into CSV columns
                    for read_text in rf.readlines():
                        writer.writerow(read_text.rstrip('\n').split('|'))
            print(f"CSV has been created using encoding: {encoding}")
            return True
        except UnicodeDecodeError:
            continue
    return False
Figure 4: Converting Transcript into a CSV File

Once the CSV file is generated, it is read into a Pandas DataFrame, which is then modified for compatibility with the Hugging Face datasets library. Concurrently, the folder of audio files is processed using the process_data() function shown in Figure 5, which resamples each file to the 16,000 Hz target sampling rate required by Whisper. The resampled audio arrays, along with their file paths and sampling rates, are collected in the audio_data list. Finally, the data is organized into a Dataset object, combining the audio arrays with the corresponding text sentences to create a comprehensive dataset suitable for fine-tuning Whisper.

import os, librosa
import pandas as pd
from tqdm import tqdm
from datasets import Dataset

# Read the generated CSV and keep only the transcription column
csv_file = "./output.csv"
df = pd.read_csv(csv_file, header=None, nrows=25921)
df.rename(columns={0: "path", 1: "sentence"}, inplace=True)
df.drop('path', axis=1, inplace=True)
df.to_csv("./df.csv", index=False)

def process_data(folder_path):
    audio_data = []  # Initialize the audio_data list

    # Note: the audio files are assumed to be iterated in the same order as the CSV rows
    for file_name in tqdm(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, file_name)

        if os.path.isfile(file_path) and file_name.endswith(".mp3") and " " not in file_name:
            # Load as mono, then resample to the 16 kHz rate expected by Whisper
            audio_array, samplerate = librosa.load(file_path, mono=True)
            target_sr = 16000
            resampled_array = librosa.resample(audio_array, orig_sr=samplerate, target_sr=target_sr)

            audio_data.append({
                'path': file_path,
                'array': resampled_array,
                'sampling_rate': target_sr
            })

    return audio_data

audio_data = process_data("./mp3")
dataset = Dataset.from_pandas(df)
dataset.set_format("numpy")
dataset = dataset.add_column('audio', audio_data)
Figure 5: Creating the Dataset object

The ASR pipeline used to fine-tune the Whisper model consists of three components: a feature extractor, a model responsible for sequence-to-sequence mapping, and a tokenizer for converting model outputs into text format. Hugging Face Transformers offer the WhisperFeatureExtractor and WhisperTokenizer as the associated components for the Whisper model. The feature extractor handles raw audio inputs by discretizing the continuous speech signal into fixed time steps with a 16000 Hz sampling rate. It further ensures all audio samples are either padded or truncated to 30-second lengths, followed by converting them into log-Mel spectrograms, serving as inputs for the Whisper model. The WhisperTokenizer facilitates the mapping of model outputs, containing predicted text indices, to actual text strings using a byte-pair vocabulary pre-trained on 96 languages. The combination of the feature extractor and tokenizer results in the creation of the Processor, which effectively streamlines the data preparation process for fine-tuning the Whisper model.

from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v2")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v2", language="Vietnamese", task="transcribe", model_max_length=225)
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2", language="Vietnamese", task="transcribe", model_max_length=225)
Figure 6: Creating the ASR pipeline

The following table presents various Vietnamese and Japanese datasets that have been used:

Dataset | Language | Usage | Speech Audio (Hours)
Common Voice 13.0 | 60 languages including Vietnamese and Japanese | load_dataset from Hugging Face | 19 hours (Vietnamese), 10 hours (Japanese)
Google Fleurs | 102 languages including Vietnamese and Japanese | load_dataset from Hugging Face | 11 hours (Vietnamese), 8 hours (Japanese)
Vivos | Vietnamese | load_dataset from Hugging Face | 15 hours
FPT Open Speech Dataset (FOSD) | Vietnamese | Download zip file and extract | 30 hours
Vietnamese Language and Speech Processing 2020 (VLSP2020) | Vietnamese | Download rar file and extract | 100 hours
Vietnamese Movies Dataset | Vietnamese | Download mov and srt files and process using the pydub library | 6 hours
ReazonSpeech | Japanese | load_dataset from Hugging Face | 5 hours
JSUT | Japanese | Download zip file and extract | 10 hours
JVS | Japanese | Download zip file and extract | 30 hours
Tatoeba | Japanese | Download zip file and extract | 1 hour
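
The Vietnamese Movies Dataset row above is the only entry without a ready-made loader, so the following minimal sketch illustrates the pydub-based processing it refers to: cutting the movie audio into clips using the .srt subtitle timings and keeping the subtitle text as the transcript. The file names (movie.mov, movie.srt) and the clips output folder are placeholder assumptions.

import os
import re
import csv
from pydub import AudioSegment

TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_ms(timestamp):
    # "00:01:23,456" -> milliseconds
    h, m, s, ms = map(int, TIME_RE.match(timestamp).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

os.makedirs("clips", exist_ok=True)
audio = AudioSegment.from_file("movie.mov").set_channels(1).set_frame_rate(16000)

rows = []
with open("movie.srt", encoding="utf-8") as f:
    blocks = f.read().strip().split("\n\n")

for i, block in enumerate(blocks):
    lines = block.strip().splitlines()
    if len(lines) < 3 or "-->" not in lines[1]:
        continue  # skip malformed subtitle blocks
    start, end = [to_ms(t.strip()) for t in lines[1].split("-->")]
    text = " ".join(lines[2:]).strip()
    clip_path = f"clips/movie_{i:05d}.wav"
    audio[start:end].export(clip_path, format="wav")  # slice by milliseconds and save as wav
    rows.append({"path": clip_path, "sentence": text})

with open("clips/metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["path", "sentence"])
    writer.writeheader()
    writer.writerows(rows)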

Data Preprocessing

Data augmentation is a technique used to enhance the performance and generalization of machine learning models like Whisper by increasing the diversity and size of the training dataset. The audiomentations library is employed to introduce realistic variations into the audio data, such as adding Gaussian noise, time stretching, and pitch shifting, closely mimicking real-world scenarios and adapting the model to different speech characteristics. By augmenting the dataset, the Whisper model can learn from a more diverse set of samples, leading to improved performance on unseen data, and it effectively combats overfitting with limited monolingual data. Furthermore, data augmentation finds practical application in emulating background noise in meeting calls for the VoicePing application, helping the model handle real-world scenarios and improve performance in environments with background noise, enhancing user experience and reliability.

from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))  # Resample the audio if using a dataset from Hugging Face
augment_waveform = Compose([
    AddGaussianNoise(min_amplitude=0.005, max_amplitude=0.015, p=0.2),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.2, leave_length_unchanged=False),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.2),
])

def augment_dataset(batch):
    audio = batch["audio"]["array"]
    augmented_audio = augment_waveform(samples=audio, sample_rate=16000)
    batch["audio"]["array"] = augmented_audio
    return batch

common_voice['train'] = common_voice['train'].map(augment_dataset, keep_in_memory=True)
Figure 7: Augmenting the Audio

Then, it is important to normalize the transcriptions, especially when dealing with multiple datasets. The provided code snippet demonstrates the process of normalizing the transcript before using it in the ASR pipeline. The function remove_punctuation(sentence) utilizes the str.maketrans() and translate() methods to remove any punctuation marks from the input sentence, creating a modified version with no punctuation. The fix_sentence(sentence) function further refines the transcript by removing quotation marks if they are present at the beginning and end. It then calls the remove_punctuation() function to eliminate any punctuation marks from the transcript and finally converts the entire text to lowercase. Normalizing the transcript in this manner ensures that the ASR model processes consistent and standardized text inputs during training and evaluation, leading to more accurate and reliable speech recognition results.

import string
def remove_punctuation(sentence):
    translator = str.maketrans('', '', string.punctuation)
    modified_sentence = sentence.translate(translator)
    return modified_sentence

def fix_sentence(sentence):
    transcription = sentence
  
    if transcription.startswith('"') and transcription.endswith('"'):
        transcription = transcription[1:-1]
  
    transcription = remove_punctuation(transcription)
    transcription = transcription.lower()
    
    return transcription
Figure 8: Normalizing the Transcript

The next step would be to prepare the dataset for the Whisper model. The process involves several steps. Firstly, we load and resample the audio data from the batch using batch["audio"]. Next, we utilize the feature extractor to compute the log-Mel spectrogram input features from the 1-dimensional audio array. This step transforms the raw audio into a visual representation of frequency information, which serves as the input for the Whisper model. Finally, we encode the transcriptions to label ids using the tokenizer. This step maps the text sequences into numerical representations, allowing the model to process and understand the textual content during training and evaluation. The following code snippet in Figure 9 demonstrates this process.

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    # Normalize the transcription before tokenizing (the column is named "sentence" in Common Voice)
    transcription = fix_sentence(batch["transcription"])

    batch["labels"] = processor.tokenizer(transcription, max_length=225, truncation=True).input_ids
    return batch

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1, keep_in_memory=True)
Figure 9: Preparing the dataset

Additionally, if using multiple datasets, you could import the concatenate_datasets function from datasets and merge multiple datasets as shown below.

from datasets import concatenate_datasets, DatasetDict
concatenated_dataset = DatasetDict({
    "train": concatenate_datasets([common_voice['train'], fleurs['train'], reazon["train"]]),
    "test": concatenate_datasets([common_voice['test'], fleurs['test']])
})
Figure 10: Concatenating Datasets

Training

Now that we have prepared the dataset, the next step is to initiate the training process. This involves defining a data collator, establishing evaluation metrics, loading a pre-trained model, and specifying the necessary training arguments.

The data collator for the sequence-to-sequence speech model handles input_features and labels separately. The input_features are already padded to 30 seconds and transformed into fixed-size log-Mel spectrograms, requiring only conversion to batched PyTorch tensors using the feature extractor's .pad method with return_tensors="pt". In contrast, the labels are unpadded initially and are first padded to the maximum length in the batch using the tokenizer's .pad method. The padding tokens are replaced with -100 to exclude them from the loss computation, and the start-of-transcript token is cut from the label sequence, as it is appended later during training.

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
Figure 11: Defining the Data Collator

For evaluating our ASR model on the evaluation set, we opt for the widely recognized Word Error Rate (WER) and Character Error Rate (CER) metrics from the evaluate library, which is commonly used to assess ASR systems. To implement this, we define the compute_metrics function, responsible for calculating the WER and CER metrics based on the model predictions.

import evaluate
metric = evaluate.load("wer")
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
Figure 12: Evaluation Metrics

In this function, we handle the label_ids by replacing any occurrences of -100 with the pad_token_id to undo the step applied in the data collator, ensuring proper treatment of padded tokens during the loss computation. Subsequently, the predicted and label ids are decoded into strings. Finally, the WER metric is computed as the percentage of errors between the predictions and reference labels for the Vietnamese text, and the result is returned for evaluation.
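
To make the metric concrete, here is a tiny worked example of the WER computation: one substitution and one deletion against a four-word reference gives 2/4 = 50% WER. The sentences are made-up illustrations, not taken from any of the datasets.

import evaluate

metric = evaluate.load("wer")
references = ["con mèo ngồi đây"]   # 4 reference words
predictions = ["con chó ngồi"]      # "mèo" -> "chó" (substitution), "đây" missing (deletion)
print(100 * metric.compute(predictions=predictions, references=references))  # 50.0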

However, since the Japanese language lacks spaces between words, slight modifications are required. The Ginza tokenizer in conjunction with mecab-ipadic-neologd will be used for processing Japanese text. This adjustment ensures that the evaluation process appropriately handles the Japanese characters, enabling accurate and reliable assessment of the ASR system's performance.

import pkg_resources, importlib, evaluate, spacy, ginza
importlib.reload(pkg_resources)
metric = evaluate.load("wer")
nlp = spacy.load("ja_ginza")
ginza.set_split_mode(nlp, "C")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    pred_str = [" ".join([ str(i) for i in nlp(j) ]) for j in pred_str]
    label_str = [" ".join([ str(i) for i in nlp(j) ]) for j in label_str]

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
Figure 13: Japanese Evaluation Metrics

Now, we can load the pre-trained whisper large-v2 checkpoint using Hugging Face Transformers. The Whisper model utilizes forced_decoder_ids as token ids, which act as model outputs before the autoregressive generation starts, enabling zero-shot ASR with control over transcription language and task. For our fine-tuning process, we set these ids to None, as we intend to train the model to predict the correct language and task. Additionally, there are suppress_tokens, tokens with their log probabilities set to -inf to prevent their sampling during generation. To override this suppression behavior, we set these tokens to an empty list, allowing all tokens to be considered during generation.

from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
Figure 14: Loading the Pre-trained Whisper Checkpoint

Finally, we need to define all of the training parameters. Figure 15 describes all of the arguments needed to fine-tune Whisper.

from transformers import Seq2SeqTrainingArguments
model.config.dropout = 0.05
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-fine-tuned",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    lr_scheduler_type='linear',
    optim="adamw_bnb_8bit",
    warmup_steps=200,
    num_train_epochs=5,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    fp16=True,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    eval_steps=500,
    logging_steps=500,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    save_total_limit=1,
)
Figure 15: Training Arguments

The following bullet points describe each of the training arguments:

  • model.config.dropout: Set the dropout rate to 0.05 or 0.10 in the Whisper model configuration. This is used to combat overfitting.
  • output_dir: Specifies the output directory for saving the fine-tuned model.
  • per_device_train_batch_size: The batch size per GPU during training is set to 16; this value should be chosen based on the memory of your GPU. For instance, with the RTX 3090, the maximum batch size that fits is 16.
  • gradient_accumulation_steps: Accumulates gradients over this many steps before performing an optimizer update. If the batch size is halved, double this value to keep the effective batch size constant.
  • learning_rate=1e-6: Sets the learning rate for the optimizer to 1e-6. In our experiments, values of 1e-5 or 1e-6 worked best.
  • lr_scheduler_type='linear': Uses a linear learning rate scheduler during training. A linear learning rate scheduler gradually decreases the learning rate in a linear fashion over the course of training, allowing the model to converge smoothly and often achieving better performance compared to fixed learning rates.
  • optim="adamw_bnb_8bit": Selects the 8-bit AdamW optimizer from bitsandbytes, which reduces optimizer memory usage. AdamW incorporates weight decay to stabilize and improve the training process.
  • warmup_steps: Sets the number of warm-up steps for the learning rate scheduler. A good rule of thumb is to use roughly 10% of the total training steps as warm-up steps (see the sketch after this list).
  • num_train_epochs: Specifies the number of training epochs. Epochs refer to the number of times the entire training dataset is passed through the machine learning model during the training process.
  • gradient_checkpointing=True: Enables gradient checkpointing to reduce memory usage during training.
  • evaluation_strategy="steps": Defines the evaluation strategy as evaluating every specified number of steps.
  • fp16=True: Activates mixed precision training using FP16. GPU is required to set fp16 to true.
  • per_device_eval_batch_size=8: Sets the batch size per GPU during evaluation to 8. It is best to set the value as half of per_device_train_batch_size.
  • predict_with_generate=True: Enables generation-based prediction during evaluation.
  • generation_max_length=225: Sets the maximum length for generated sequences to 225 tokens, and through experimentation, this value has been determined to yield the best results.
  • eval_steps=500: Specifies the evaluation interval in terms of steps.
  • logging_steps=500: Sets the logging interval in terms of steps.
  • report_to=["tensorboard"]: Sends training logs to TensorBoard for visualization.
  • load_best_model_at_end=True: Loads the best model based on the specified evaluation metric at the end of training.
  • metric_for_best_model="wer": Chooses the Word Error Rate (WER) as the evaluation metric for determining the best model.
  • greater_is_better=False: Specifies that a lower value of the evaluation metric is considered better.
  • push_to_hub=False: Disables pushing the trained model to the Hugging Face model hub.
  • save_total_limit=1: Sets the limit for the total number of saved checkpoints to 1 to save storage space on the computer.
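
As a minimal sketch of the 10% warm-up rule of thumb mentioned above, the total number of optimizer steps can be estimated from the dataset size, batch size, and epoch count; the dataset size below is a made-up example, so substitute len(common_voice["train"]) in practice.

train_examples = 40000                 # e.g. len(common_voice["train"]); made-up example value
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
num_train_epochs = 5

steps_per_epoch = train_examples // (per_device_train_batch_size * gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = int(0.10 * total_steps)
print(total_steps, warmup_steps)       # 12500 total steps -> 1250 warm-up steps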

Afterward, we can pass the training arguments, along with our model, dataset, data collator, and compute_metrics function to the Hugging Face Trainer.

from transformers import Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

model.config.use_cache = False
trainer.train()
Figure 16: Trainer Arguments

After specifying the trainer arguments, we can simply start training the model. The training time depends entirely on your GPU, the number of hours of audio used for training, and the number of epochs. In case you encounter the "CUDA out-of-memory" issue during training, decrease per_device_train_batch_size and monitor your GPU consumption using a tool such as Task Manager on Windows or the nvidia-smi command. If it still does not work, use a smaller pre-trained Whisper checkpoint such as whisper-medium.
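
If you prefer to check GPU memory from Python rather than an external tool, recent PyTorch versions expose the free and total device memory directly; the following is a minimal sketch.

import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # available in recent PyTorch releases
    print(f"free: {free_bytes / 1024**3:.1f} GiB / total: {total_bytes / 1024**3:.1f} GiB")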

Parameter-efficient Fine-tuning

Parameter-efficient Fine-tuning (PEFT) is a technique used to adapt pre-trained language models to specific downstream tasks while freezing the majority of the pre-trained model's parameters and training only a small number of added or selected parameters. By doing so, PEFT saves computational resources and time compared to fine-tuning the entire model. This approach is particularly useful in low-resource settings with limited data and computational capabilities. The method aims to retain strong performance while reducing the risk of overfitting, thanks to its focused modification of model parameters. The following table summarizes the major differences between fine-tuning and parameter-efficient fine-tuning.

Fine-tuning | Parameter Efficient Fine-tuning
Longer training time as compared to PEFT | Faster training time as compared to fine-tuning
Requires larger computational resources | Uses fewer computational resources
Re-trains the entire model | Modifies only a small subset of model parameters
More prone to overfitting | Less prone to overfitting
Typically results in better performance than PEFT | Not as good as fine-tuning, but still good enough

The steps for parameter-efficient fine-tuning are the same as for fine-tuning up until the training and evaluation section. After loading the data collator, we do not define the compute_metrics function, because evaluating with it during training consistently triggered the "CUDA out-of-memory" error. Instead, the WER is computed after the model has finished training. The next step is to load a pre-trained checkpoint and apply Low-Rank Adapters (LoRA) to it.

Low-Rank Adapters is a technique used to efficiently fine-tune pre-trained language models by introducing low-rank parameterizations in the form of adapters. These adapters are added to the pre-trained model, and they enable task-specific fine-tuning without significantly increasing the model's overall parameter count. By employing low-rank parameterizations, LoRA lessens memory and computation demands during the fine-tuning stage. This strategy effectively balances model efficiency and the capacity to tailor it to specific tasks, presenting a pragmatic option for employing language models on novel tasks where computational resources are constrained. LoRA has been shown to achieve competitive performance on a range of NLP tasks while maintaining a smaller memory footprint compared to full fine-tuning methods since it only uses 1% of the total trainable parameters.
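
A quick back-of-the-envelope sketch shows why the LoRA update is so cheap: instead of training a dense update for a weight matrix, LoRA trains two low-rank factors. The dimensions below roughly match a Whisper large-v2 attention projection (1280 x 1280) with the r=32 rank used in Figure 17 below; they are illustrative numbers, not measured results.

d, k, r = 1280, 1280, 32

full_update_params = d * k        # parameters in a dense delta-W update
lora_params = d * r + r * k       # parameters in the low-rank factors B (d x r) and A (r x k)

print(full_update_params)         # 1638400
print(lora_params)                # 81920
print(f"{100 * lora_params / full_update_params:.1f}% of a full update")  # 5.0%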

from transformers import WhisperForConditionalGeneration
from peft import prepare_model_for_int8_training, LoraConfig, get_peft_model

model_name_or_path = "openai/whisper-large-v2"
model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")
model = prepare_model_for_int8_training(model)

def make_inputs_require_grad(module, input, output):
    output.requires_grad_(True)

model.model.encoder.conv1.register_forward_hook(make_inputs_require_grad)
config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")
model = get_peft_model(model, config)
model.print_trainable_parameters()
Figure 17: PEFT Set-up

This produces the output below, showing that only about 1% of the total parameters are trainable, which is what makes the fine-tuning parameter-efficient.

trainable params: 15728640 || all params: 1559033600 || trainable%: 1.0088711365810203
Figure 18: PEFT Training Parameters

There are some slight differences between the PEFT training parameters and the fine-tuning parameters. The biggest difference is that, since PEFT uses significantly fewer resources than fine-tuning, you can increase the batch size and decrease the training time. There are also two things to consider:

  • remove_unused_columns=False and label_names=["labels"] are required since the PeftModel's forward doesn't have the signature of the base model's forward.
  • INT8 training requires autocasting, meaning that predict_with_generate can't be passed to the Trainer, because it internally calls the transformers generate method without autocasting, resulting in an error.

from transformers import Seq2SeqTrainingArguments
model.config.dropout = 0.05
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-peft",
    per_device_train_batch_size=20,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    warmup_steps=200,
    num_train_epochs=5,
    lr_scheduler_type='linear',
    optim="adamw_bnb_8bit",
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    fp16=True,
    per_device_eval_batch_size=10,
    generation_max_length=225,
    logging_steps=500,
    report_to=["tensorboard"],
    greater_is_better=False,
    push_to_hub=False,
    remove_unused_columns=False,
    label_names=["labels"],
)
Figure 19: PEFT Training Parameters

Then, we would need to specify a custom callback function, which is triggered during the training process when the model is saved at regular intervals. This enables you to save the fine-tuned model, along with the parameters of the Low-Rank Adapters, at various stages of the fine-tuning process to preserve progress. All that is left to do is specify the trainer arguments and start training.

import os
from transformers import TrainerCallback, TrainingArguments, TrainerControl, TrainerState, Seq2SeqTrainer
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)

model.config.use_cache = False
trainer.train()
Figure 20: PEFT Trainer Arguments

After the model has finished training, we would need to push the PEFT model to the Hugging Face Hub. Then we could simply load it from Hugging Face and start the evaluation process to determine the WER for Vietnamese and CER for Japanese.

from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration
from huggingface_hub import login

login()  # authenticate with your Hugging Face token before pushing

peft_model_id = "whisper-peft"
model.push_to_hub(peft_model_id)

peft_model_id = "whisper-peft"  # Use the same model ID as before (prefix it with your username, e.g. "your-username/whisper-peft", when loading from the Hub)
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)
model.config.use_cache = True
Figure 21: PEFT Configuration

In the evaluation process for the PEFT model, predict_with_generate cannot be used, meaning that we would have to implement an evaluation loop with torch.cuda.amp.autocast() to manage GPU memory. Due to the frozen base model in PEFT, language recognition during decoding may fail, so starting tokens mentioning the language are enforced using processor.get_decoder_prompt_ids, and passed to model.generate. Two metrics are reported at the end: wer (Word Error Rate) without the BasicTextNormalizer, representing the true performance metric, and normalized_wer using the normalizer. When dealing with a Japanese model, we can simply change the wer metric to the cer metric.

import gc
import torch
import evaluate
import numpy as np
from tqdm import tqdm
from torch.utils.data import DataLoader
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

metric = evaluate.load("wer")  # switch to evaluate.load("cer") when evaluating a Japanese model
eval_dataloader = DataLoader(fleurs["test"], batch_size=8, collate_fn=data_collator)
forced_decoder_ids = processor.get_decoder_prompt_ids(language='vi', task='transcribe')
normalizer = BasicTextNormalizer()

predictions = []
references = []
normalized_predictions = []
normalized_references = []

model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    forced_decoder_ids=forced_decoder_ids,
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)
            decoded_preds = processor.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = processor.tokenizer.batch_decode(labels, skip_special_tokens=True)
            predictions.extend(decoded_preds)
            references.extend(decoded_labels)
            normalized_predictions.extend([normalizer(pred).strip() for pred in decoded_preds])
            normalized_references.extend([normalizer(label).strip() for label in decoded_labels])
        del generated_tokens, labels, batch
    gc.collect()
wer = 100 * metric.compute(predictions=predictions, references=references)
normalized_wer = 100 * metric.compute(predictions=normalized_predictions, references=normalized_references)
eval_metrics = {"eval/wer": wer, "eval/normalized_wer": normalized_wer}

print(f"{wer=} and {normalized_wer=}")
print(eval_metrics)
Figure 22: PEFT Evaluation

Results

The following graphs demonstrate the results obtained by fine-tuning Whisper. The blue bars show the results of full fine-tuning of the original model, while the red bars show the results of the PEFT model.

Graph 1: Whisper Medium Results

Graph 1 displays the results for Whisper Medium fine-tuned on various datasets for Vietnamese Automatic Speech Recognition (ASR). The datasets used for fine-tuning are Google Fleurs, Common Voice, and Vivos. The results are measured in Word Error Rate (WER) for each configuration. Among the different models, the lowest WER of 12.1949% is achieved by the Whisper model fine-tuned solely on the Google Fleurs dataset.

Graph 2: Vietnamese Whisper Large Results

Graph 2 presents the results of fine-tuning the Whisper Large-v2 model on various datasets for Vietnamese Automatic Speech Recognition (ASR). The datasets used for fine-tuning include FOSD (FPT Open Speech Dataset), Google Fleurs, Vivos, Common Voice (CV), VLSP2020 (Vietnamese Language and Speech Processing 2020), and the Movies Dataset. The evaluation metric used is the Word Error Rate (WER), and lower values indicate better performance. Among the configurations, the model fine-tuned on the FOSD + Google Fleurs + Vivos + CV datasets achieved the lowest WER of 9.4568%.

Graph 3: Japanese Whisper Large Results

Graph 3 presents the results of the Whisper Large-v2 model fine-tuned on various datasets for Japanese Automatic Speech Recognition (ASR). The datasets used for fine-tuning include JSUT, ReazonSpeech, Google Xtreme-S, and Common Voice (CV). Each row represents a different fine-tuning configuration, showcasing the impact of different datasets on the model's performance. The evaluation metric used is the Character Error Rate (CER), and lower values indicate better performance. Among the configurations, the model fine-tuned on the JSUT + ReazonSpeech + Google Xtreme + CV datasets achieved the lowest CER of 8.1493%.

Graph 4: Optimization Learning Curve for the ReazonSpeech + Common Voice + Google Xtreme-S model

The optimization loss curve is a crucial visualization that provides insights into the performance of a machine learning model during the training process. It showcases how the loss function, which quantifies the discrepancy between predicted outputs and actual targets, changes over different epochs during training. In this context, the loss refers to the error in the model's predictions, and the primary goal is to minimize this error as training progresses. Graph 4 depicts an example of an optimization loss curve for the ReazonSpeech + Common Voice + Google Xtreme-S model, where the training loss and evaluation loss are plotted against the corresponding epochs. The "training_loss" and "evaluation_loss" lists contain the recorded loss values at different epochs, while "training_epoch" and "evaluation_epoch" represent the corresponding epoch values. By observing this curve, one can gain valuable insights into the convergence behavior, overfitting, and generalization capability of the model. The depicted example shows how the training loss decreases over epochs, indicating that the model learns from the training data. Similarly, the evaluation loss follows a decreasing trend, suggesting that the model is also improving its performance on unseen evaluation data. Such analyses help assess the model's training progress, and ultimately ensure that the trained model delivers optimal results on unseen data.
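
A curve like Graph 4 can be drawn directly from the logged loss values. The following minimal sketch uses matplotlib with placeholder lists standing in for the values recorded by the Trainer; they are not real results.

import matplotlib.pyplot as plt

training_epoch = [1, 2, 3, 4, 5]
training_loss = [0.80, 0.45, 0.30, 0.22, 0.18]      # placeholder values
evaluation_epoch = [1, 2, 3, 4, 5]
evaluation_loss = [0.70, 0.42, 0.31, 0.26, 0.24]    # placeholder values

plt.plot(training_epoch, training_loss, label="training loss")
plt.plot(evaluation_epoch, evaluation_loss, label="evaluation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curve.png")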

Evaluation

There are two steps to evaluate a fine-tuned whisper model. The first step would be to load the fine-tuned model depending on how you trained it. In Figure 23, we prepare the PEFT model for Vietnamese ASR tasks. We load the pre-trained PEFT model checkpoint using peft_model_id, specify the target language as "Vietnamese," and the task as "transcribe." The PEFT configuration is loaded with PeftConfig.from_pretrained, and the WhisperForConditionalGeneration model is initialized with 8-bit quantization and device mapping for efficiency. The PEFT model is then loaded using PeftModel.from_pretrained, and associated feature extractor and tokenizer are initialized using WhisperFeatureExtractor.from_pretrained and WhisperTokenizer.from_pretrained, respectively, tailored for Vietnamese text and the "transcribe" task. A WhisperProcessor object is created for streamlined data processing. Finally, the ASR pipeline is set up with AutomaticSpeechRecognitionPipeline, integrating the model, tokenizer, and feature extractor, making the PEFT model ready for fine-tuning and inference on Vietnamese ASR tasks.

from transformers import WhisperForConditionalGeneration, WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor, AutomaticSpeechRecognitionPipeline
from peft import PeftConfig, PeftModel
peft_model_id = "./whisper-peft"
language = "Vietnamese"
task = "transcribe"
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto")

model = PeftModel.from_pretrained(model, peft_model_id)
feature_extractor = WhisperFeatureExtractor.from_pretrained(peft_config.base_model_name_or_path)
tokenizer = WhisperTokenizer.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)
pipe = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)
Figure 23: Load PEFT Model

In Figure 24, we set up an Automatic Speech Recognition (ASR) pipeline for the Vietnamese language using the Whisper model. The ASR pipeline is initialized with a pre-trained model checkpoint specified by model_id, which can be either a fine-tuned checkpoint or the "openai/whisper-large-v2" base model. We create a feature extractor using WhisperFeatureExtractor.from_pretrained and a tokenizer using WhisperTokenizer.from_pretrained. A WhisperProcessor object is created to handle data preprocessing for the specific language and task. The pre-trained model is loaded, and the ASR pipeline is then established via the pipeline function, configuring the model, tokenizer, feature extractor, and inference device. This enables effortless transcription of Vietnamese speech inputs using the Whisper model.

import torch
from transformers import WhisperForConditionalGeneration, WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor, pipeline

model_id = "openai/whisper-large-v2"  # or the path to your fine-tuned checkpoint
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
tokenizer = WhisperTokenizer.from_pretrained(model_id, language="Vietnamese", task="transcribe")
processor = WhisperProcessor.from_pretrained(model_id, language="Vietnamese", task="transcribe")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="Vietnamese", task="transcribe")

model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.config.forced_decoder_ids = None
device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    chunk_length_s=30,
    device=device,
)
Figure 24: Load Fine-tuned Model

The third method would be to convert the fine-tuned model using faster-whisper. The faster-whisper model, a reimplementation of the Whisper ASR model using CTranslate2, demonstrates similar accuracy to the fine-tuned Whisper model. To achieve faster inference times, we convert the pre-trained Whisper model to CTranslate2 using the TransformersConverter. The conversion process involves loading the model weights with float16 precision for improved efficiency. By optimizing the model for faster inference, faster-whisper delivers enhanced performance without compromising accuracy in Automatic Speech Recognition tasks. This conversion enables the model to efficiently process audio data, making it well-suited for real-time applications and scenarios where low-latency response is critical such as meeting calls for the VoicePing application. Faster Whisper and fine-tuned Whisper exhibit comparable accuracy, but the former offers the advantage of faster inference by approximately 40 percent, making it a valuable choice for time-sensitive ASR applications.

from ctranslate2.converters import TransformersConverter
from faster_whisper import WhisperModel

model_id = "./whisper-fine-tuned/checkpoint-5000"
output_dir = "whisper-ct2"
converter = TransformersConverter(model_id, load_as_float16=True)
converter.convert(output_dir, quantization="float16")
model = WhisperModel(output_dir, device="cuda", compute_type="float16")
Figure 25: Convert and Load Faster-whisper Model

The next step is to load the audio files and transcribe them. To load the audio files, you can simply load the processed dataset from Hugging Face as described in the Load Datasets section, or, if using faster-whisper, manually download the mp3 or wav audio files and their corresponding transcript. Figure 26 and Figure 27 demonstrate how to transcribe the audio files.

def transcribe(audio):
    transcriptions = []
    total_files = len(audio)
    with torch.cuda.amp.autocast():
        for i, audio_item in enumerate(audio):
            prediction = pipe(audio_item, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)
            transcription = prediction["text"] if "text" in prediction else ""
            print(f"Transcribing {i+1} out of {total_files}")
            transcriptions.append(transcription)
    return transcriptions
Figure 26: Transcribing the Audio Files for Hugging Face Datasets

transcriptions = []
original_sentences = []
for i, row in df.iterrows():
    filename = row["path"]
    transcription = row["sentence"]  
    audio_file = os.path.join(audio_folder, filename)
    segments, _ = model.transcribe(audio_file, beam_size=10, vad_filter=True, vad_parameters=dict(min_silence_duration_ms=500))

    combined_transcription = ""
    for segment in segments:
        combined_transcription += segment.text
    transcriptions.append(combined_transcription)
    original_sentence = transcription.strip().lower()
    original_sentences.append(original_sentence)
Figure 27: Transcribing the Audio Files for Downloaded Datasets for Faster-Whisper

If you want to determine the evaluation word error rate of the fine-tuned model, then you would write the prediction transcriptions and the original transcriptions in two separate text files and use the evaluate library to determine the word error rate. You can then visualize this difference using an online data comparison tool such as diffchecker.com for Vietnamese or diffnow.com for Japanese.

import evaluate

with open(output_file, "w", encoding="utf-8") as f:
    for transcription in transcriptions:
        transcription = transcription.strip().lower().replace(".", "").replace(",", "").replace("?", "").replace("!", "")
        f.write(f"{transcription}\n")

with open(original_file, "w", encoding="utf-8") as f:
    for original_sentence in original_sentences:
        original_sentence = original_sentence.strip().lower().replace(".", "").replace(",", "").replace("?", "").replace("!", "")
        f.write(f"{original_sentence}\n")

wer_metric = evaluate.load("wer")
wer = 100 * wer_metric.compute(predictions=transcriptions, references=original_sentences)
print(wer)
Figure 28: Generating the Text Files to Determine the Evaluation WER

Graph 5 demonstrates the evaluation word error rate of various fine-tuned models. The blue bars show the evaluation results using the original model, the red bars are the results for the PEFT model, and the green bars are the results using the faster-whisper model. The Google Fleurs dataset was chosen for evaluating the Whisper ASR model in Vietnamese due to its diverse audio recordings that cover a broad spectrum of topics, including economics, natural disasters, politics, and cultural events. For evaluation purposes, the last 100 files from this dataset were utilized to assess the model's performance on unseen data. It was ensured that these 100 files were not used during the training phase to avoid any bias or overfitting. The Whisper ASR model was then evaluated on these unseen audio samples, and the Word Error Rate (WER) was used as the evaluation metric to measure the accuracy of the transcriptions.
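
As a minimal sketch of how that held-out subset can be taken, the last 100 examples of the Google Fleurs Vietnamese test split can be selected directly with the datasets library (assuming the google/fleurs dataset on Hugging Face).

from datasets import load_dataset

fleurs_test = load_dataset("google/fleurs", "vi_vn", split="test")
eval_subset = fleurs_test.select(range(len(fleurs_test) - 100, len(fleurs_test)))  # last 100 examples
print(len(eval_subset))  # 100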

Graph 5: Vietnamese Evaluation Graph

Among the evaluated models, the one fine-tuned on Google Fleurs + Common Voice + Vivos achieved the lowest WER of 7.843, indicating highly accurate transcriptions. In contrast, the baseline Whisper Large-v2 model run through faster-whisper obtained a WER of 28.701, highlighting the significant improvement achievable by fine-tuning Whisper for speech recognition tasks. The following graph shows the Japanese evaluation results, with the blue bars showing the original fine-tuned models, the red bar showing the PEFT model, and the orange bar showing the ESPnet model.

Graph 6: Japanese Evaluation Graph

Graph 6 presents the evaluation results for various Japanese datasets using the Whisper ASR model. The Character Error Rate (CER) is used as the evaluation metric to measure the accuracy of the transcriptions. Among these datasets, the combined ReazonSpeech + Google Xtreme + CV dataset achieved the lowest CER of 7.441, indicating highly accurate transcriptions.

import torch, librosa
import reazonspeech as rs  # assumed ReazonSpeech helper package providing the transcribe() used below
from espnet2.bin.asr_inference import Speech2Text

device = "cuda" if torch.cuda.is_available() else "cpu"
speech2text = Speech2Text.from_pretrained(
    "reazon-research/reazonspeech-espnet-v1",
    beam_size=5,
    batch_size=0,
    device=device,
)

def transcribe(audio_files):
    transcriptions = []
    durations = []
    audio_transcriptions = []
    for i, audio_file in enumerate(audio_files):
        speech, rate = librosa.load(audio_file, sr=16000)
        duration = len(speech) / rate  # Calculate the duration in seconds
        durations.append(duration)

        for cap in rs.transcribe(speech, speech2text):
            prediction = cap.text
        print(f"Transcribing {i+1} out of {len(audio_files)}")
        print(audio_file)
        transcriptions.append(prediction)
        audio_transcriptions.append((audio_file, prediction))
        print(prediction)
    return transcriptions, durations, audio_transcriptions
Figure 29: ReazonSpeech Espnet-v1 Evaluation

Among these models, ReazonSpeech ESPnet-v1 is a speech recognition model developed by the Reazon Human Interactions Lab, and its website claims a Word Error Rate (WER) of 10% when evaluated on the Common Voice dataset. However, I obtained a significantly higher WER of 38.14% when using the same model on the Common Voice dataset. This discrepancy in performance could be attributed to several factors, such as differences in model configurations.

Azure Speech Studio

Other than training the Whisper model, it is possible to train other ASR models such as the Azure Baseline Model from the Azure Speech Studio. There are four primary steps to the fine-tuning process after creating a custom project on the dashboard. The first step would be to import datasets to the hub.

Uploading a dataset to Azure Speech Studio is a straightforward process, provided the files are formatted correctly. First, gather all the audio files in either mp3 or wav format along with the corresponding transcript. Each line of the transcript should follow the format "file name + tab character + sentence + newline". Next, place all the audio files and the transcript into a single folder and compress it into a ZIP file. The ZIP file must stay below 2 GB to upload and process smoothly within the platform, and the audio files must be sampled at either 16,000 or 8,000 Hz. Once the dataset is prepared and compressed, it can be uploaded to Azure Speech Studio, as sketched below.
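
The following minimal sketch packages a dataset in that format: it writes one tab-separated transcript line per audio file and zips the folder into a single archive. The transcript file name trans.txt, the folder path, and the example sentences are assumptions; check the Speech Studio documentation for the exact naming your project expects.

import os
import zipfile

audio_dir = "./azure_dataset"                # folder containing the .wav files (placeholder path)
rows = [("audio_0001.wav", "xin chào"),      # (file name, sentence) pairs; example values only
        ("audio_0002.wav", "cảm ơn")]

# One "file name + tab + sentence + newline" entry per audio file
with open(os.path.join(audio_dir, "trans.txt"), "w", encoding="utf-8") as f:
    for file_name, sentence in rows:
        f.write(f"{file_name}\t{sentence}\n")

# Compress everything into a single ZIP file (keep it below 2 GB)
with zipfile.ZipFile("azure_dataset.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in os.listdir(audio_dir):
        zf.write(os.path.join(audio_dir, name), arcname=name)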

The training process begins by selecting the latest baseline model, which serves as the starting point for fine-tuning. Then, one or multiple training datasets can be selected to train the model. A unique name is assigned to the model for easy identification, and the training process commences. After the training phase, testing is performed to assess the model's performance. Evaluation of accuracy is conducted by selecting the testing data and comparing the current trained model with another model, such as the Azure Baseline Model. A test name is provided to identify the evaluation, and the Word Error Rate (WER) performance metric is used to quantify the effectiveness of the trained model in comparison to the baseline model. This comprehensive training and evaluation approach helps gauge the model's accuracy and suitability for real-world speech recognition tasks.

import os, evaluate, time
from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig

subscription_key = "subscription_key"  # Enter your custom subscription key here
location = "japaneast"
endpoint = "endpoint"  # Enter your custom model's endpoint ID here
wav_base_path = "./test"

# Configure the recognizer for Japanese and point it at the fine-tuned model's endpoint.
speech_config = SpeechConfig(subscription=subscription_key, region=location, speech_recognition_language="ja-JP")
speech_config.endpoint_id = endpoint

# Collect every .wav file under the test directory so progress can be reported.
wav_files = []
for root, _, files in os.walk(wav_base_path):
    for file_name in files:
        if file_name.endswith(".wav"):
            wav_files.append(os.path.join(root, file_name))
total_files = len(wav_files)

predictions = []
file_names = []
for processed_files, audio_file_path in enumerate(wav_files):
    audio_config = AudioConfig(filename=audio_file_path)
    speech_recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    print("Transcribing", processed_files + 1, "out of", total_files, "files:", audio_file_path)

    result = speech_recognizer.recognize_once()
    if result.text:
        predictions.append(result.text)
        file_names.append(os.path.basename(audio_file_path))
    else:
        print("Speech Recognition failed for file:", audio_file_path)
Figure 30: Transcribing Audio Files using the Azure Cognitive Services Speech SDK

The code snippet above imports the necessary libraries, including os, evaluate, time, and the Azure Cognitive Services Speech SDK for speech recognition. The subscription key, location, and endpoint are specified to connect to the Azure Speech service. A SpeechConfig object is created with the subscription key, region, and recognition language, in this case "ja-JP" for Japanese, and its endpoint ID is set so that requests are routed to the custom fine-tuned model. The code then walks the specified directory with os.walk to collect all files with the ".wav" extension. For each file, an AudioConfig object is created with the audio file path, and a SpeechRecognizer is initialized with the speech and audio configurations. The code prints the progress of transcription, indicating the current file being processed out of the total number of files. The SpeechRecognizer recognizes the speech once from the audio file, and if the result contains transcribed text, it is appended to the predictions list along with the corresponding file name. This approach transcribes speech from multiple audio files using the Azure Speech service, making it suitable for batch processing and speech-to-text evaluation.
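
The collected predictions can then be scored against the reference transcripts. The snippet below is a minimal sketch, assuming the references live in the tab-separated transcript file described earlier (the ./test/trans.txt path is a placeholder) and reusing the predictions and file_names lists from Figure 30.

import evaluate

cer_metric = evaluate.load("cer")

# Build a lookup of reference sentences keyed by audio file name
# from the "file name<TAB>sentence" transcript format.
references_by_file = {}
with open("./test/trans.txt", encoding="utf-8") as f:
    for line in f:
        name, sentence = line.rstrip("\n").split("\t", 1)
        references_by_file[name] = sentence

references = [references_by_file[name] for name in file_names]
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"CER: {cer:.2f}%")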

Graph 7: Vietnamese Azure Speech Studio Results

Graph 7 presents the results of various models evaluated on the Vietnamese language dataset using the Word Error Rate (WER) as the performance metric, with the baseline WER provided as a reference for comparison. The models are trained and tested on different combinations of datasets, including Common Voice 14.0, Google Fleurs, and the FPT Open Speech Dataset, with Azure Speech Studio serving as the ASR system. The baseline WER ranges from 8.56 to 8.71. Among the evaluated models, the one trained on the Common Voice 14.0 dataset achieves the lowest WER of 7.33. The models utilizing the Google Fleurs dataset also perform well, with WER values ranging from 7.70 to 8.18, while the FPT Open Speech Dataset model obtains a slightly higher WER of 8.74, still within a comparable range. These results demonstrate the effectiveness of Azure Speech Studio and the impact of the training data on the ASR model's accuracy, with certain datasets noticeably improving performance in Vietnamese speech recognition tasks.

Graph 8: Japanese Azure Speech Studio Results

Graph 8 showcases the evaluation results of different models on the Japanese language dataset using the Character Error Rate (CER) as the performance metric, with the baseline CER provided for comparison. The models are trained and tested on various combinations of datasets, including JSUT, Common Voice, Google Fleurs, and Tatoeba, again using Azure Speech Studio as the ASR system. Among the evaluated models, the one trained solely on the JSUT dataset achieves the lowest CER of 6.97, indicating highly accurate transcriptions, while the models combining Common Voice, Google Fleurs, JSUT, and Tatoeba exhibit somewhat higher CER values, ranging from 10.04 to 10.13. These results show how strongly the choice of training data affects the model's performance: a single well-matched dataset can improve accuracy, whereas certain combinations yield higher error rates.

Conclusion

In conclusion, the process of fine-tuning the Whisper ASR model emerges as a robust technique for enhancing its performance. By selectively modifying the model's parameters and adapting it to new contexts through different audio datasets, substantial improvements in accuracy can be achieved. Moreover, the integration of data augmentation methods via the audiomentations library has introduced valuable diversity into the training dataset, fostering heightened adaptability and generalizability of the ASR model.

Importantly, the choice of a specific ASR model depends on several factors, including the total hours of audio in the dataset, the clarity of the recordings, and the diversity of speakers and topics. These considerations ultimately contribute to lower Character Error Rate (CER) or Word Error Rate (WER) values, which signify greater accuracy in converting speech to text.

Graph 9: Comprehensive Vietnamese Results

Graphs 9 and 10 provide a comprehensive view of the fine-tuned Whisper ASR model's efficacy. The light blue bars show the results using Azure Speech Studio, the blue bars show the results using the original fine-tuning method, and the red bars show the results using the PEFT model. Across both languages, fine-tuning consistently yields notable performance improvements, with Word Error Rates (WER) spanning 7.33 to 12.145 for Vietnamese in Graph 9 and Character Error Rates (CER) ranging from 8.149 to 17.931 for Japanese in Graph 10.

Platform discrepancies exist between Azure Speech Studio and Whisper for fine-tuning speech recognition models. While Azure Speech Studio may yield lower Word Error Rates (WER) during training, Whisper tends to achieve better evaluation results. This means that Whisper demonstrates superior performance in real-world scenarios, particularly evident when evaluating on unseen data. On the other hand, Azure offers the advantage of easier integration into existing infrastructure. While Azure is well-suited for streamlined incorporation into established systems, Whisper's enhanced performance on diverse and complex audio data makes it a compelling choice for applications requiring higher accuracy.

Graph 10: Comprehensive Japanese Results

In summary, this research showcases the fine-tuning process for ASR models and demonstrates its efficiency. The Whisper ASR model emerges as a powerful tool whose accuracy can be further improved by fine-tuning on additional audio datasets. Its enhanced precision and adaptability hold promise for speech-to-text applications in which real-time transcription is a cornerstone of efficient communication. The implications of these findings resonate beyond the scope of this study, opening new horizons in human-computer interaction and underscoring the potential of speech-to-text technology.

Next Steps

Moving forward, the next stage of our project involves creating a prototype web application. This application will utilize FastAPI for the backend and Next.js for the frontend, all while integrating Firebase to ensure a comprehensive user experience. This application will encompass features enabling both real-time transcription and transcription via audio files. The overarching aim of this project is to offer a user-friendly interface that ensures precise, dependable, and multilingual real-time transcriptions.
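
To illustrate the planned backend, the snippet below is a minimal sketch of a FastAPI transcription endpoint; the model path ./whisper-finetuned-ja, the /transcribe route, and the use of the transformers pipeline are assumptions for this sketch, not the project's final implementation.

import tempfile

import torch
from fastapi import FastAPI, File, UploadFile
from transformers import pipeline

app = FastAPI()

# Load a fine-tuned Whisper checkpoint once at startup (placeholder path).
asr = pipeline(
    "automatic-speech-recognition",
    model="./whisper-finetuned-ja",
    device=0 if torch.cuda.is_available() else -1,
)

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    # Save the uploaded audio to a temporary file so the pipeline can decode it.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    result = asr(tmp_path)
    return {"filename": file.filename, "text": result["text"]}

Such an endpoint could then be called from the Next.js frontend, while Firebase handles authentication and storage.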

Another important aspect of our work involves making the transcription process more efficient. We plan to overcome the current limits in computing speed by exploring how Azure's cloud GPUs could accelerate the web application. By using these powerful computing units, we expect transcription to become faster and more responsive, enabling truly real-time transcription.

Simultaneously, a concurrent initiative involves the creation of a Japanese audio dataset. This project entails building an audio dataset from a diverse array of YouTube sources, including prominent channels hosting TED Talks and audiobooks. The data is collected through the youtube-dl API, which downloads the MP3 audio file and SRT subtitle file of a YouTube video; a sketch of this download step is shown below. Afterwards, we clean up the audio by removing background noise such as music or audience applause.
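
The download step could look like the minimal sketch below, assuming the youtube_dl package and ffmpeg are installed; the video URL, output template, and subtitle language are placeholders.

import youtube_dl

ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "raw_audio/%(id)s.%(ext)s",  # Where to store the downloads (placeholder)
    "writesubtitles": True,                 # Download manual subtitles when available
    "subtitleslangs": ["ja"],
    "postprocessors": [
        {"key": "FFmpegExtractAudio", "preferredcodec": "mp3"},  # Convert audio to MP3
        {"key": "FFmpegSubtitlesConvertor", "format": "srt"},    # Convert subtitles to SRT
    ],
}

urls = ["https://www.youtube.com/watch?v=VIDEO_ID"]  # Placeholder video URL

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)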

In summary, our study has shown a promising way to make automatic speech recognition models better. Looking ahead, our future research in natural language processing aims to bring together advanced technology and human communication. This vision strives to pioneer more efficient and responsive methods for transforming spoken words into precise written text, thereby enhancing communication for all.

References

[1] Radford, Alec, et al. “Robust Speech Recognition via Large-Scale Weak Supervision.” arXiv.Org, 2022, arxiv.org/abs/2212.04356.

[2] Ardila, Rosana, et al. “Common Voice: A Massively-Multilingual Speech Corpus.” arXiv.Org, 5 Mar. 2020, arxiv.org/abs/1912.06670.

[3] Brownlee, Jason. “How to Use Learning Curves to Diagnose Machine Learning Model Performance.” MachineLearningMastery.Com, 6 Aug. 2019, machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/.

[4] Conneau, Alexis, et al. “Fleurs: Few-Shot Learning Evaluation of Universal Representations of Speech.” arXiv.Org, 25 May 2022, arxiv.org/abs/2205.12446.

[5] Eric-Urban. “Human-Labeled Transcriptions Guidelines - Speech Service - Azure AI Services.” Microsoft Learn, 19 July 2023, learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-human-labeled-transcriptions.

[6] Gandhi, Sanchit. “Fine-Tune Whisper for Multilingual ASR with 🤗 Transformers.” Hugging Face – The AI Community Building the Future., 3 Nov. 2022, huggingface.co/blog/fine-tune-whisper.

[7] Mangrulkar, Sourab, and Sayak Paul. “Parameter-Efficient Fine-Tuning Using 🤗 PEFT.” Hugging Face, huggingface.co/blog/peft.

[8] nokomoro3. “Hugging Faceでopenaiの音声認識"whisper"をfine Tuningする方法が公開されました” [“A Method for Fine-Tuning OpenAI’s Whisper Speech Recognition Model with Hugging Face Has Been Published”]. DevelopersIO, 9 Nov. 2022, dev.classmethod.jp/articles/whisper-fine-tuning-by-huggingface/.

[9] Ramanathan, Bharat. “Fine-Tuning Whisper ASR Models.” Weights & Biases, wandb.ai/parambharat/whisper_finetuning/reports/Fine-tuning-Whisper-ASR-models---VmlldzozMTEzNDE5.

[10] “ReazonSpeechの最新モデルを公開しました” [“We Have Released the Latest ReazonSpeech Model”]. Reazon Human Interaction Lab, 4 Apr. 2023, research.reazon.jp/blog/2023-04-04-ReazonSpeech.html.