Linchuan Du
Department of Mathematics | The University of British Columbia
Abstract
Automatic Speech Recognition (ASR), also known as Speech to Text (STT), uses deep learning technologies to transcribe audio containing speech into text. In the field of deep learning and artificial intelligence, Large Language Models (LLMs) mimic the way human brains process words and phrases, and are able to understand and generate text data. LLMs usually contain millions of weights and are pre-trained on a wide variety of datasets. Specifically, an ASR LLM converts audio inputs into the desired input format through feature extraction and tokenization.
To customize an ASR LLM with ideal performance, fine-tuning procedures for Whisper, an ASR LLM developed by OpenAI, were first tested on Google Colaboratory. Larger models were then deployed in GPU-equipped Windows environments to speed up training and to alleviate the GPU availability and quota issues encountered on Colab and macOS. Audio data were assessed for reliability based on information such as audio quality and transcript accuracy. Models were then improved and optimized through data preprocessing and hyper-parameter tuning. When GPU memory issues could not be resolved with regular fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) was used to freeze most parameters and reduce memory allocation without sacrificing much performance. Results were visualized along with loss curves to verify the fit and optimization of the fine-tuning processes.
The possibility of multi-speaker support in Whisper was explored using neural speaker diarization. Integration with Pyannote was implemented through its pipeline and through WhisperX, a project built on similar ideas with extra features such as word-level timestamps and Voice Activity Detection (VAD). WhisperX was tested on long-form transcription with batching as well as on diarization.
Besides Whisper, other models with ASR functionality were installed and compared against the Whisper baseline, including Massively Multilingual Speech (MMS) by Meta AI Research, PaddleSpeech by PaddlePaddle, SpeechBrain, and ESPnet. Chinese datasets were used to compare these models on the CER metric. In addition, Custom Speech in Azure AI, which supports real-time STT, was introduced for performance comparison (mainly on Mandarin Chinese). A choice can then be made between trained Azure models and loadable models such as Whisper for deployment.
Overview
- Preparing Environment
- Google Colaboratory
- Anaconda
- Visual Studio Code
- CUDA GPU
- Audio Data Source
- Hugging Face
- OpenSLR
- Whisper Model Fine-tuning
- Fine-tuning on Colab
- Common Libraries
- Data Preprocessing
- Hyperparameters
- Fine-tuned Results
- PEFT with LoRA
- PEFT Results
- Loss Curves Visualization
- Baseline Results
- Speaker Diarization
- Pyannote.audio
- WhisperX
- WhisperX Results
- Other Models
- Meta MMS
- PaddleSpeech
- SpeechBrain
- ESPnet
- Baseline Results
- Azure Speech Studio
- Upload datasets
- Train models
- Test models
- Deploy model
- Results
- Prospect
- References
1. Preparing Environment
a. Google Colaboratory
Google Colaboratory is a hosted Jupyter Notebook service that provides limited free GPU and TPU computing resources. In Google Colaboratory, the .ipynb format is used to edit and execute Python scripts.
Log in to Google Colab with a Google account, share written scripts with others via "Share" in the top right corner of the page, and optionally authorize Colab with a GitHub account.
How to set up environments on Colab:
- Select the Runtime tab -> Change runtime type to enable a GPU
- Use pip or other package installers to install necessary dependencies
!pip install packageName
b. Anaconda
Besides Colab, environments can also be prepared on a local PC. Anaconda is a well-known distribution platform for the data science field, covering data analysis and building machine learning models in Python. It includes Conda, an environment and package manager that helps manage open-source Python packages and libraries.
How to set up environments with Anaconda:
- Install Anaconda from Free Download | Anaconda and add it to the PATH environment variable
- Open Command Prompt and enter the base environment, e.g. (Windows):
- Create a new Conda environment with a new name, e.g. "myenv":
- Activate a specific Conda environment whenever it is needed, or return to the base environment with deactivate:
- Install dependencies through the PyPI or Conda package managers; PyPI package version requirements can be specified with version specifiers:
(base) C:\Users\username>
conda create --name myenv
conda activate myenv
conda deactivate
pip install packageName>=0.0.1
conda install packageName
c. Visual Studio Code
Visual Studio Code, or VS Code, is a powerful source-code editor for Windows, macOS, and Linux that supports editing in many programming languages. It supports multiple tasks, including debugging, execution in integrated terminals, extending functionality through extensions, and version control with embedded Git.
How to set up environments in VS Code:
- Open the folder(s) on the left side under EXPLORER and create files inside the folder.
- In the bottom right, select the environment that is needed. For execution, run Python scripts either in the interactive window at the top right (with the IPython kernel installed) or in an integrated terminal using commands such as
python xxx.py
An alternative way is to use the ipynb extension (Jupyter Notebook).
- The Git icon on the left panel is where source code is controlled. Commits, pushes to and pulls from GitHub, merges, and branch checkouts can all be done within VS Code.
d. CUDA GPU
Compute Unified Device Architecture (CUDA) is a parallel computing platform and Application Programming Interface (API) developed by NVIDIA. It allows developers to use NVIDIA Graphics Processing Units (GPUs) for multiple computing tasks.
How to use CUDA GPU:
- Install the CUDA Toolkit, which includes necessary libraries, tools, and drivers for developing and running CUDA applications.
- Check relevant information in Command Prompt with the command
- The first table shows which NVIDIA GPU(s) are in use, how much GPU memory is occupied (e.g. 24297MiB/24576MiB in Figure 3), and the GPU utilization percentage (e.g. 4% in Figure 3). The Processes section beneath lists the processes that use GPU memory, along with the GPU index (0 in the single-GPU case).
- After setting up the CUDA Toolkit, it is necessary to download a GPU-compatible PyTorch version for deep learning purposes. Go to the official PyTorch website and find the version that matches the environment.
- A version check can be performed directly through Python
nvidia-smi
import torch
print(f' CUDA availability on PyTorch is {torch.cuda.is_available()}')
print(f' Current PyTorch version is {torch.__version__}')
print(f' Current CUDA version is {torch.version.cuda}')
print(f' cuDNN version is {torch.backends.cudnn.version()}')
print(f' The number of available GPU devices is {torch.cuda.device_count()}')
# Use CUDA on the device
device = torch.device("cuda")
On Windows, GPU information can also be visualized and monitored in Task Manager.
Utilization is a useful indicator of whether the GPU is being used as expected. If the CPU is used heavily instead of the GPU, the GPU has not been used for training, which may indicate problems with the GPU setup. Dedicated GPU memory, on the other hand, measures GPU memory usage; high memory usage will likely cause CUDA Out of Memory (OOM) errors.
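The same information can also be queried programmatically from PyTorch. A minimal sketch, assuming a single CUDA device at index 0:
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_mib = props.total_memory / 1024**2                  # total dedicated GPU memory in MiB
    allocated_mib = torch.cuda.memory_allocated(0) / 1024**2  # memory currently held by tensors
    reserved_mib = torch.cuda.memory_reserved(0) / 1024**2    # memory reserved by the caching allocator
    print(f"{props.name}: {allocated_mib:.0f} MiB allocated, {reserved_mib:.0f} MiB reserved, {total_mib:.0f} MiB total")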
2. Audio Data Source
a. Hugging Face
Hugging Face is a company and an open-source platform that is dedicated to the fields of Natural Language Processing (NLP) and Artificial Intelligence.
It is important to create a Hugging Face account in order to use published models or upload our own customized models. Personal READ and WRITE tokens can be created at https://huggingface.co/settings/tokens, where the READ token is for downloading models or datasets from the platform and the WRITE token is for uploading local models or datasets. To narrow down tasks and categories, select tags on the left side of the website.
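Once a token has been created, it can be used to authenticate from Python. A minimal sketch with a placeholder token (replace it with a personal READ or WRITE token):
from huggingface_hub import login

# READ token for downloading models/datasets, WRITE token for uploading
login(token="hf_xxxxxxxxxxxxxxxxxxxx")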
It is convenient to find the most popular models and audio datasets based on downloads and likes. Note that some models only support certain languages or certain tasks, and the same model architecture or type may come in several sizes.
On the Hugging Face platform, most well-known organizations and companies publish their speech-to-text models and collected datasets. Sometimes an agreement must be accepted for data access. There are also source files on the Hub that can be downloaded locally.
Here are some common ASR LLMs and their relevant information:
Model | Size / # Params | Languages | Task | Structure |
OpenAI Whisper | large-v2, 1550M | Most languages | Multitask | Transformer encoder-decoder (regularized) |
OpenAI Whisper | large, 1550M | Most languages | Multitask | Transformer encoder-decoder |
OpenAI Whisper | medium, 769M | Most languages | Multitask | Transformer encoder-decoder |
OpenAI Whisper | small, 244M | Most languages | Multitask | Transformer encoder-decoder |
guillaumekln faster-whisper | large-v2 | Most languages | Multitask | CTranslate2 |
facebook wav2vec2 | large-960h-lv60-self | English | Transcription | Wav2Vec2 CTC decoder |
facebook wav2vec2 | base-960h, 94.4M | English | Transcription | Wav2Vec2 CTC decoder |
facebook mms | 1b-all, 965M | Most languages | Multitask | Wav2Vec2 CTC decoder |
Here are some audio datasets with their relevant information:
Dataset | # hours / Size | languages |
mozilla-foundation common_voice_13_0 | 17689 validated hrs | 108 languages |
google fleurs | ~ 12 hrs per language | 102 languages |
LIUM tedlium | 118 to 452 hrs for 3 releases | English |
librispeech_asr | ~1000 hrs | English |
speechcolab gigaspeech | 10000 hrs | English |
PolyAI minds14 | 8.17k rows | 14 languages |
b. OpenSLR
OpenSLR is another useful website that hosts speech and language resources as compressed files. Various audio datasets are listed along with brief summaries under the Resources tab. Some datasets that are not available on Hugging Face can be accessed here or link to their own websites.
Specifically for Chinese audio, there are datasets ideal for ASR purposes that are not found on Hugging Face:
Dataset | # hours (size) | # speakers | transcript accuracy |
Aishell-1 (SLR33) | 178 hrs | 400 | 95+% |
Free ST (SLR38) | 100+ hrs | 855 | / |
aidatatang_200zh (SLR62) | 200 hrs | 600 | 98+% |
MAGICDATA (SLR68) | 755 hrs | 1080 | 98+% |
3. Whisper Model Fine-tuning
Whisper is an ASR (Automatic Speech Recognition) system released by OpenAI in September 2022. It was trained on 680,000 hours of multilingual and multitask supervised data, enabling transcription and translation across many languages. The architecture is an encoder-decoder Transformer.
Audio is chunked into 30-second segments and converted into a log-Mel spectrogram, which maps frequencies onto the Mel scale; the spectrogram is then passed into the encoder.
The official announcement for OpenAI Whisper is Introducing Whisper, and its research paper is Robust Speech Recognition via Large-Scale Weak Supervision.
a. Fine-tuning on Colab
- Login through Hugging Face token to enable datasets download
- Load the desired dataset(s) with load_dataset from datasets. Usually two or three splits from the same source are created for train, test, and/or validation; use DatasetDict to keep them together.
- Then preprocess the datasets so the data can be fed into Whisper, for example:
- Manipulate columns: e.g. remove_columns, cast_column
- Normalize transcripts: e.g. upper/lowercase, punctuation, special tokens
- Change the sampling rate to 16 kHz using Audio in the Datasets library
- Load the pre-trained feature extractor and tokenizer from the transformers library; the Processor wraps both
- Prepare batched mapping to speed up processing, as Dataset.map() parallelizes the tokenization of all samples in a batch
- Define the data collator for sequence-to-sequence training with label padding
- The evaluation metric (WER) can then be loaded from Hugging Face Evaluate
- Define the metric computation for predictions and labels
- Load the model for conditional generation
- Configure generation settings in model.config
- Before training the model, hyper-parameters are defined in Seq2SeqTrainingArguments
- Finally start training with customized settings using trainer.train()
from huggingface_hub import notebook_login
notebook_login()
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor
feature_extractor = WhisperFeatureExtractor.from_pretrained("model_id")
tokenizer = WhisperTokenizer.from_pretrained("model_id")
processor = WhisperProcessor.from_pretrained("model_id")
In the tokenizer and processor, the target language and task are usually specified:
language="lang", task="transcribe"  # or "translate" for speech translation
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # pad the log-Mel input features and the tokenized labels separately
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # replace padding with -100 so padded positions are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        # drop the BOS token if it was prepended during tokenization
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # replace -100 with the pad token id before decoding
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("model_id")
Whisper has token ids that are forced as model outputs before autoregressive generation starts; forced_decoder_ids is set to None here because both the target language and task have already been specified. suppress_tokens are tokens whose log probabilities are set to -inf so they are never generated; an empty list indicates that no tokens are suppressed.
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
from transformers import Seq2SeqTrainingArguments

# arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="kawadlc/whisperv1",   # own repo name
    per_device_train_batch_size=16,   # batch size per GPU for training
    gradient_accumulation_steps=1,    # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,               # important param for handling overfitting and underfitting
    weight_decay=1e-2,                # regularization mechanism
    warmup_steps=200,                 # improve early-training stability
    max_steps=3000,                   # total optimization steps
    gradient_checkpointing=True,      # save memory
    evaluation_strategy="steps",      # evaluation strategy, alternative: "epoch"
    fp16=True,                        # half-precision floating point format
    per_device_eval_batch_size=8,     # batch size per GPU for evaluation
    predict_with_generate=True,       # run generation during evaluation
    generation_max_length=200,        # max number of tokens for autoregressive generation
    eval_steps=500,                   # number of steps per evaluation
    report_to=["tensorboard"],        # save training logs to tensorboard
    load_best_model_at_end=True,      # load the best model at the end of training
    metric_for_best_model="wer",      # metric defining the best model
    greater_is_better=False,          # lower WER is better
    push_to_hub=False,                # push to the Hub, optional
)
from transformers import Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
Training loss, evaluation loss, and WER are reported at each evaluation step or epoch during training. A fine-tuned model is complete after training, and checkpoints are stored in the specified output directory.
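The fine-tuned weights can then be saved for later inference. A minimal sketch, assuming the trainer and processor defined above and a hypothetical local directory name:
# save the fine-tuned model weights and configuration
trainer.save_model("whisper-small-finetuned")
# save the feature extractor and tokenizer alongside the model
processor.save_pretrained("whisper-small-finetuned")
# optionally push to the Hub (requires a WRITE token and push_to_hub settings)
# trainer.push_to_hub()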
If a CUDA Out of Memory (OOM) error occurs during training, adjust the previous steps to reduce GPU memory usage:
- The first priority is to reduce the batch size, trading extra training time for memory savings. Combine this with gradient accumulation, which accumulates gradients over multiple smaller batches before performing the weight update step.
- Gradient checkpointing is another useful strategy that trades a small increase in computation time for significant reductions in memory usage.
- If that is not enough, consider mixed precision training if it is not already applied; it reduces the memory footprint significantly while maintaining training stability.
- Free occupied GPU memory and cache with the garbage collector and the empty-cache method
import gc
import torch
gc.collect()
torch.cuda.empty_cache()
b. Common Libraries
There are many popular Python libraries suitable for machine learning and data processing tasks. In the automatic speech recognition field, several are particularly useful for working with audio: torch, transformers, datasets, and evaluate for modeling and evaluation, and librosa, pydub, and soundfile for reading, manipulating, and writing audio files.
c. Data Preprocessing
1) Hugging Face Dataset
Load the dataset using the load_dataset function. In many popular audio datasets, the splits of train, test and validation have already been preprocessed.
- Specify the subset with their corresponding names
- Choose the split name with "split"; a plus sign can combine multiple splits
- "token" grants remote access to datasets (use_auth_token will be deprecated)
- The returned dataset is of type datasets.Dataset. Put the splits into a dictionary with DatasetDict.
from datasets import load_dataset, DatasetDict

# common_voice for data source
common_voice = DatasetDict()
# split datasets
common_voice["train"] = load_dataset("common_voice", "ja", split="train+validated", use_auth_token=True)
common_voice["validation"] = load_dataset("common_voice", "ja", split="validation", use_auth_token=True)
common_voice["test"] = load_dataset("common_voice", "ja", split="test", use_auth_token=True)
# create the DatasetDict and choose the sample sizes for training and evaluation
common_voice = DatasetDict({
    "train": common_voice["train"].select(range(3500)),
    "validation": common_voice["validation"].select(range(500)),
    "test": common_voice["test"].select(range(100)),
})
# remove columns that are not needed for training
common_voice = common_voice.remove_columns(["age", "client_id", "down_votes", "gender", "path", "up_votes"])
For non-streaming mode, ensure there is enough disk space for the download before calling the load_dataset function. To introduce randomization, use shuffle(seed=42) to randomly rearrange the rows. select() and filter() reduce the number of examples and return rows that match a specified condition.
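A short sketch of these calls on the Common Voice splits defined above; the "sentence" column name and the selection size are illustrative:
# shuffle, subsample, and drop examples with empty transcripts
common_voice["train"] = common_voice["train"].shuffle(seed=42)
common_voice["train"] = common_voice["train"].select(range(1000))
common_voice["train"] = common_voice["train"].filter(lambda example: len(example["sentence"]) > 0)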
As described in the Fine-tuning on Colab section, columns can be manipulated by removing and renaming them. Use cast_column() to change the feature type of a single column (or several with cast()).
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
Here the sampling rate should always be converted to 16 kHz because the Whisper architecture requires it. flatten() is called when nested features need to be brought up to the top level of the dataset.
To combine datasets that come from different sources (as sketched below):
- concatenate_datasets() combines datasets end to end
- interleave_datasets() mixes several datasets by taking alternating examples from each one to create a new dataset
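A minimal sketch, assuming two hypothetical train splits (cv_train and fleurs_train) that already share the same columns and sampling rate:
from datasets import concatenate_datasets, interleave_datasets

combined = concatenate_datasets([cv_train, fleurs_train])        # joins the datasets end to end
mixed = interleave_datasets([cv_train, fleurs_train], seed=42)   # alternates examples from each source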
In streaming mode, many call functions are similar to the non-streaming mode, but some are completely different. To shuffle, set a buffer size; examples are then randomly sampled from this buffer.
shuffled_dataset = dataset.shuffle(buffer_size=10_000, seed=42)
It is also possible to create a new dataset from the first n examples with take(n), or from the remaining examples by skipping the first n with skip(n).
Detailed explanations can be found in the Hugging Face documentation: Process
The map function in HuggingFace's datasets is used to apply a given function to each example in a dataset. It performs transformation and pre-processing in a convenient and efficient way.
Here is the basic syntax:
dataset.map(function, batched=True, num_proc=1, remove_columns=None, **function_args)
A self-defined function is applied to every sample in the dataset; the boolean parameter batched decides whether samples are processed in batches or one by one. The num_proc parameter sets the number of processes used for parallel processing, which helps speed up data processing, especially with large datasets; however, multi-processing issues may occur when the number is increased. Unnecessary columns should be removed after execution, and if the desired dataset is created within the defined function, the initial train columns should be removed.
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

librispeech = librispeech.map(prepare_dataset, remove_columns=librispeech.column_names["train"], num_proc=1)
2) Transcript Cleaning
Sometimes human-labeled transcripts are not ideal, containing extra symbols or formats that differ from other data sources. Transcript cleaning is therefore an essential preprocessing step. It refines or corrects the transcriptions of speech data and directly impacts the performance of ASR models, so it is crucial to ensure accurate, error-free transcriptions. It is sometimes also called text normalization.
- Punctuation and Capitalization
- Tokenization
import string

# lowercase texts and remove punctuation except apostrophes
text = [s.lower() for s in text]
punctuation_without_apostrophe = string.punctuation.replace("'", "")
translator = str.maketrans('', '', punctuation_without_apostrophe)
text = [s.translate(translator) for s in text]
Convert all text to lowercase with the lower() method. Most punctuation marks are in string.punctuation, and they can be replaced with empty strings. The re.sub() function replaces all occurrences of matched patterns, and the str.maketrans() method can be used in the same way.
import re

# remove special tokens such as markup tags
def remove_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
Tokenization involves breaking the text into individual units, or tokens, such as words; text can usually be split with split(). Remove special tokens like timestamps, silence, or non-speech tokens if they exist.
3) Audio Chunking
Audio chunking, also known as audio segmentation or audio slicing, is the process of breaking down a continuous audio stream into smaller, manageable segments or chunks.
The Librosa, Pydub, and SoundFile libraries are recommended for audio chunking. Use the librosa.load call to read an audio file. The AudioSegment class in Pydub is helpful for creating chunks, e.g. AudioSegment.from_file(). The SoundFile library can read and write sound files in various formats.
import librosa
import soundfile as sf
from pydub import AudioSegment
# load the audio file with pydub or librosa
audio_path = "input_audio.mp3"
audio = AudioSegment.from_file(audio_path)
# audio, sr = librosa.load(audio_path, sr=None)
# chunking algorithms
# after getting chunked files, use sf to write them out
# sf.write(chunk_filename, chunk, sr)
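A minimal chunking sketch, assuming the pydub AudioSegment loaded above and a fixed 30-second chunk length:
# split the audio into fixed 30-second chunks and write each one to disk
chunk_length_ms = 30 * 1000
for i, start in enumerate(range(0, len(audio), chunk_length_ms)):
    chunk = audio[start:start + chunk_length_ms]       # pydub slices by milliseconds
    chunk.export(f"chunk_{i:03d}.wav", format="wav")   # export the chunk as a wav file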
d. Hyperparameters
In Whisper fine-tuning, Seq2SeqTrainingArguments and Seq2SeqTrainer are used for training because of the model's sequence-to-sequence architecture. Several parameters in the training arguments can significantly affect the learning ability of LLMs.
e. Fine-tuned Results
Abbreviations: #ts = number of training samples, #es = number of evaluation samples, lr = learning_rate, wd = weight_decay, ws = warmup_steps, ms = max_steps, es = evaluation_strategy, ml = generation_max_length, tbz = per-device train batch size, ebz = per-device eval batch size.
Dataset / Size / Split | Model / Lang / Task | Hyperparameters | Result |
mozilla-foundation/common_voice_11_0, #ts = 100, #es = 100, train/test | Whisper small, Hindi, Transcribe | lr = 1e-5, wd = 0, ws = 5, ms = 40, es = steps, ml = 225, tbz = 4, ebz = 8 | WER: 67.442% |
mozilla-foundation/common_voice_11_0, #ts = 100, #es = 100, train+validation/test | Whisper small, Hindi, Transcribe | lr = 1e-5, wd = 0, ws = 0, ms = 60, es = steps, ml = 50, tbz = 16, ebz = 8 | WER: 69.240% |
mozilla-foundation/common_voice_11_0, #ts = 100, #es = 100, train+validation/test | Whisper small, Hindi, Transcribe | lr = 1e-5, wd = 0, ws = 0, ms = 60, es = steps, ml = 100, tbz = 16, ebz = 8 | WER: 64.656% |
mozilla-foundation/common_voice_11_0, #ts = 500, #es = 500, train+validation/test | Whisper small, Hindi, Transcribe | lr = 1e-5, wd = 0, ws = 0, ms = 60, es = steps, ml = 50, tbz = 16, ebz = 8 | WER: 62.207% |
common_voice, #ts = 100, #es = 100, train+validated/validation | Whisper small, Japanese, Transcribe | lr = 1e-5, wd = 0, ws = 0, ms = 80, es = steps, ml = 225, tbz = 16, ebz = 8 | WER: 64.0% |
common_voice, #ts = 3500, #es = 500, train+validated/validation | Whisper small, Japanese, Transcribe | lr = 1e-6, wd = 0, ws = 50, ms = 3500, es = steps, ml = 200, tbz = 16, ebz = 8 | WER: 2.4% |
librispeech_asr, #ts = 750, #es = 250, train.100/validation | Whisper medium, English, Transcribe | lr = 1e-5, wd = 0.01, ws = 10, ms = 750, es = steps, ml = 80, tbz = 1, ebz = 1 | WER: 13.095% |
cv & fleurs (50:50), #ts = 3500, #es = 500, train+validated & train / validation & validation | Whisper small, Japanese, Transcribe | lr = 1e-6, wd = 0, ws = 50, ms = 3500, es = steps, ml = 200, tbz = 16, ebz = 8 | WER: 55.424% |
According to Graph 1, Whisper-small had poor WER performance (above 60%) on Hindi transcription with a small sample size. While combining training data sources and increasing the sample size improved the models somewhat, better performance may be limited by model capacity, so increasing model size is likely more effective. At the same time, overfitting or catastrophic forgetting may occur when training with a large sample size or a single source, so such WER results should not be treated as strong references. In addition, the Whisper model had a much lower WER on English transcription than on Hindi and Japanese. Since Japanese is character-based, a more suitable evaluation metric is the Character Error Rate (CER).
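For character-based languages, CER can be loaded from Hugging Face Evaluate in the same way as WER. A minimal sketch reusing the decoded predictions and references from the compute_metrics function above:
import evaluate

cer_metric = evaluate.load("cer")
cer = 100 * cer_metric.compute(predictions=pred_str, references=label_str)
print(f"CER: {cer:.3f}%")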
f. PEFT with LoRA
Sometimes out-of-memory errors occur during full fine-tuning, usually caused by low-resource hardware and large model weights. If full fine-tuning cannot be completed even after adjusting hyper-parameters such as batch size, Parameter-Efficient Fine-Tuning (PEFT) approaches can be a useful alternative. This approach fine-tunes only a small number of model parameters while freezing most parameters of the pre-trained LLM, thereby greatly decreasing computational and storage costs. It also mitigates catastrophic forgetting, a behavior observed during full fine-tuning of LLMs.
LoRA stands for Low-Rank Adaptation. Its goal is to improve the efficiency and performance of adapting a pre-trained model to a new task by reducing the dimensionality of the trainable parameters. LoRA decomposes the weight updates of the pre-trained model into low-rank matrices and significantly reduces the number of parameters that need to be fine-tuned.
Compared to normal full fine-tuning, the difference starts when loading the pre-trained checkpoint for conditional generation. load_in_8bit=True quantizes the model to 8-bit precision, a quarter of the memory of float32, with minimal loss of performance. The device_map="auto" argument automatically determines how to load and store the model weights. prepare_model_for_int8_training casts all non-int8 modules to full precision (FP32) for stability, adds a forward hook to the input embedding layer, and enables gradient checkpointing for memory-efficient training.
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_int8_training

model = WhisperForConditionalGeneration.from_pretrained('openai/whisper-large-v2', load_in_8bit=True, device_map="auto")
model = prepare_model_for_int8_training(model)

def make_inputs_require_grad(module, input, output):
    output.requires_grad_(True)

model.model.encoder.conv1.register_forward_hook(make_inputs_require_grad)

config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")
model = get_peft_model(model, config)
model.print_trainable_parameters()
Configurations for the PEFT model are set using LoraConfig. The model.print_trainable_parameters() call prints the number of trainable parameters, the total number of parameters, and the percentage of trainable parameters, which shows how much is trained compared to full fine-tuning; usually the percentage is around 1%.
Compared to full fine-tuning, it is necessary to explicitly set remove_unused_columns=False and label_names=["labels"], since the PEFT model does not inherit the signature of the base model. Additionally, predict_with_generate cannot be used because it internally calls generate without auto-casting; for the same reason, compute_metrics cannot be passed in the training arguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="jackdu/whisper-peft",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=0,
    num_train_epochs=3,
    evaluation_strategy="steps",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=150,
    logging_steps=100,
    # max_steps=100,  # only for testing purposes
    remove_unused_columns=False,  # required
    label_names=["labels"],       # required
)
A custom TrainerCallback can be written to save model checkpoints during training. The callback saves the adapter_model weights and removes the base model weights in pytorch_model.bin. Pass the Seq2SeqTrainingArguments, model, datasets, data collator, tokenizer, and callbacks to the Seq2SeqTrainer, and set model.config.use_cache = False to silence warnings. The PEFT model is then ready for training.
import os
from transformers import TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR

class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")
        # keep only the adapter weights and drop the full base-model weights
        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)
        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings; re-enable for inference
The trained PEFT model can be pushed to or loaded from Hugging Face. A PEFT model has only two files, adapter_model.bin and adapter_config.json. Set model.config.use_cache = True to enable inference. An evaluation loop should then be written to evaluate model performance; since predict_with_generate is disabled as explained above, the eval loop can be hand-rolled with torch.cuda.amp.autocast().
from huggingface_hub import login

login(token='hf_writetoken')  # personal WRITE token
peft_model_id = "jackdu/whisper-peft"
model.push_to_hub(peft_model_id)
from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer
peft_model_id = "jackdu/whisper-peft" # Use the same model ID as before.
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)
model.config.use_cache = True
A DataLoader loads the test dataset in batches. The model generates predictions for the given input features using the specified decoder prompt IDs and a maximum number of new tokens. The predictions and labels are converted from tensors to NumPy arrays, decoded back into text with the tokenizer's batch_decode method, and appended to their respective lists. After each batch, intermediate variables are deleted and garbage collection is triggered to manage memory. Finally, WER and normalized WER are shown in the output.
import gc
import numpy as np
from tqdm import tqdm
from torch.utils.data import DataLoader
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

eval_dataloader = DataLoader(common_voice["test"], batch_size=8, collate_fn=data_collator)
forced_decoder_ids = processor.get_decoder_prompt_ids(language='vi', task='transcribe')
normalizer = BasicTextNormalizer()

predictions = []
references = []
normalized_predictions = []
normalized_references = []

model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    forced_decoder_ids=forced_decoder_ids,
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
        labels = batch["labels"].cpu().numpy()
        labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)
        decoded_preds = processor.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        decoded_labels = processor.tokenizer.batch_decode(labels, skip_special_tokens=True)
        predictions.extend(decoded_preds)
        references.extend(decoded_labels)
        normalized_predictions.extend([normalizer(pred).strip() for pred in decoded_preds])
        normalized_references.extend([normalizer(label).strip() for label in decoded_labels])
    del generated_tokens, labels, batch
    gc.collect()

wer = 100 * metric.compute(predictions=predictions, references=references)
normalized_wer = 100 * metric.compute(predictions=normalized_predictions, references=normalized_references)
eval_metrics = {"eval/wer": wer, "eval/normalized_wer": normalized_wer}
print(f"{wer=} and {normalized_wer=}")
print(eval_metrics)
g. PEFT Results
Abbreviations are as in the fine-tuned results table above, with #e = number of epochs.
Dataset / Size / Split | Model / Lang / Task | Hyperparameters | Result |
mozilla-foundation/common_voice_13_0, #ts = 1000, #es = 100, train+validation/test | Whisper medium, Japanese, Transcribe | lr = 1e-3, wd = 0, ws = 50, #e = 3, es = steps, ml = 128, tbz = 8, ebz = 8 | WER: 73%, NormWER: 70.186% |
mozilla-foundation/common_voice_13_0, #ts = 10000, #es = 100, train+validation/test | Whisper medium, Japanese, Transcribe | lr = 1e-5, wd = 0, ws = 200, #e = 5, es = steps, ml = 150, tbz = 20, ebz = 20 | WER: 79.920%, NormWER: 85.582% |
mozilla-foundation/common_voice_13_0, #ts = 7000, #es = 1500, train/test | Whisper large-v2, Japanese, Transcribe | lr = 1e-5, wd = 0, ws = 200, #e = 4, es = steps, ml = 200, tbz = 16, ebz = 8, dropout = 0.05, lr_scheduler = linear | WER: 81.346%, NormWER: 77.364% |
mozilla-foundation/common_voice_13_0, #ts = 100, #es = 30, train+validation/test | Whisper large-v2, Vietnamese, Transcribe | lr = 1e-4, wd = 0.01, ws = 0, #e = 3, es = steps, ml = 150, tbz = 8, ebz = 8 | WER: 26.577%, NormWER: 22.523% |
From Graph 2, Whisper with PEFT had poor WER performance on Japanese transcription regardless of model size or sample size. However, since Japanese is a character-based language, WER may not fully represent model performance. After switching to Vietnamese, a language written in a space-delimited Latin-based script similar to English, the results were within expectations.
h. Loss Curves Visualization
Loss curves are often used in machine learning to monitor the performance of a model during training. They show how the loss functions change over epochs or steps. The loss function quantifies how well the model's predictions match the actual target values. Matplotlib is a powerful Python library for data visualization in graphs.
import matplotlib.pyplot as plt

# training_epoch/training_loss and evaluation_epoch/evaluation_loss are lists
# collected from the training logs
plt.figure(figsize=(10, 6))
plt.plot(training_epoch, training_loss, label="Training Loss")
plt.plot(evaluation_epoch, evaluation_loss, label="Evaluation Loss")
plt.xlabel("Training Epochs")
plt.ylabel("Loss")
plt.title("Loss Curves for Whisper Fine-Tuning")
plt.legend()
plt.grid(True)
plt.show()
After entering the loss and epoch data manually, the graph is created.
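Instead of entering the numbers by hand, the same values can be pulled from the trainer's log history after training. A sketch, assuming the trainer object defined earlier:
# training entries contain "loss", evaluation entries contain "eval_loss"
logs = trainer.state.log_history
training_epoch = [log["epoch"] for log in logs if "loss" in log]
training_loss = [log["loss"] for log in logs if "loss" in log]
evaluation_epoch = [log["epoch"] for log in logs if "eval_loss" in log]
evaluation_loss = [log["eval_loss"] for log in logs if "eval_loss" in log]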
There are several patterns and rules to identify if the loss curves indicate a good fit for Whisper or any other machine learning models. Strategies can be applied based on the information from plotted visualized graphs. This helps us speed up finding an ideal model.
- Overfitting and Underfitting
- Smoothness and Stability
- Loss Plateau and Rebound
Look for both the training and validation loss to converge to relatively low values. A convergence with a low value range typically means that the model is learning well, with neither underfitting (high training and validation loss) nor overfitting (low training loss but high validation loss).
Smooth loss curves are indicative of a well-behaved training process. If the curves are not smooth, with large fluctuations or irregular patterns, it might suggest instability or other issues in the training process.
After a certain number of epochs, the loss curves may flatten and stop decreasing. This can indicate that the model has reached a state where it struggles to learn further from the available data. If the evaluation loss stops decreasing and even starts to rebound, it is a sign that the model is beginning to overfit; strategies like early stopping can prevent this, as in the sketch below.
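A sketch of early stopping with transformers, assuming the Seq2SeqTrainer setup from the fine-tuning section; it requires load_best_model_at_end=True and metric_for_best_model in the training arguments, and the patience value is illustrative:
from transformers import EarlyStoppingCallback, Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    # stop training after 3 evaluations without improvement in the chosen metric
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)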
i. Baseline Results
Dataset / Split / Size | Model / Task | Result |
distil-whisper/tedlium-long-form, test dataset | Whisper medium baseline, en->en | WER: 28.418% |
distil-whisper/tedlium-long-form, validation dataset | Whisper large-v2 baseline, en->en | WER: 26.671% |
distil-whisper/tedlium-long-form, validation dataset | Whisper medium baseline, en->en | WER: 24.049% |
librispeech_asr, clean test dataset | Whisper large-v2 baseline, en->en | WER: 4.746% |
mozilla-foundation/common_voice_13_0, test dataset, 1000 examples | Whisper large-v2 baseline, en->en | WER: 21.712% |
GigaSpeech, test dataset, 1000 examples (actual: 777, excluding music & noise) | Whisper large-v2 baseline, en->en | WER: 12.819% |
Aishell S0770, test dataset, 353 examples | Whisper large-v2 baseline, zh-CN->zh-CN | CER: 8.595% |
Aishell S0768, test dataset, 367 examples | Whisper large-v2 baseline, zh-CN->zh-CN | CER: 12.379% |
MagicData 38_5837, test dataset, 585 examples | Whisper large-v2 baseline, zh-CN->zh-CN | CER: 21.750% |
MagicData 4 speakers, test dataset, 2372 examples | Whisper large-v2 baseline, zh-CN->zh-CN | CER: 24.747% |
The Graph 4 results for Whisper baseline inference suggest that the quality and splits of the audio data sources affect model performance.
In the English category, TED-LIUM (long form) had the worst accuracy, with a best WER of 24.049%; Common Voice was second worst, GigaSpeech third, and LibriSpeech the best at 4.746% WER. This is reasonable: TED-LIUM consists of TED talks that contain noise and were not recorded primarily for ASR. Common Voice has significant variation in audio quality, since it collects data from a vast number of volunteers and contributors. GigaSpeech has the most trainable hours of audio, but it was recorded from podcasts and YouTube and may therefore suffer some loss in sound quality. LibriSpeech consists of narrated audiobooks from the LibriVox project and has been carefully segmented and cleaned by researchers.
In the Chinese category, Aishell and MagicData produced significantly different CERs. As both are split by speaker, performance may fluctuate considerably within each dataset. However, MagicData has better claimed transcript accuracy, more training hours, and more speakers, so a likely explanation is audio quality: Aishell audio was recorded with high-fidelity microphones and then downsampled to 16 kHz during preprocessing, while MagicData audio was recorded on mobile phones.
4. Speaker Diarization
Speaker Diarization is a field in speech recognition that involves segmenting a speech audio into distinct segments corresponding to different speakers. The goal is to identify and differentiate individual speakers in an audio stream, making it possible to assign diarized time segments to specific speakers. This is particularly useful in scenarios where there are multiple speakers in a recording, such as conference calls, interviews, podcasts, and recorded meetings.
a. Pyannote.audio
Pyannote-audio is an open-source toolkit developed by the Pyannote team for various speech and audio processing tasks. It provides a collection of pre-built models, algorithms, and utilities to perform tasks like speaker diarization, voice activity detection, and speech turn segmentation.
How to use Pyannote.audio with Whisper:
- Install Pyannote.audio using PyPI:
- Login to Hugging Face and set up CUDA GPU on the device
- Import Pipeline and Audio from Pyannote.audio and prepare the wav audio files
- Set up the pipeline with the Whisper large-v2 base model for the ASR task, a 30-second chunk length (based on the Whisper architecture), and the CUDA GPU.
- In the pipeline, the number of speakers can be defined in advance if the minimum, maximum, or exact number is known. As Pyannote.audio supports multi-channel audio diarization, mono='random' or mono='downmix' selects either a random single channel or a new channel down-mixed by averaging all channels. Diarization produces diarized time segments; the audio can then be cropped to those exact segments as waveforms and passed to the Whisper pipeline, with a target language set for decoding.
- Finally, pair the speakers with their corresponding diarized text
pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
import os
import gc
import glob
import torch
from collections import defaultdict
from huggingface_hub import login
from transformers import pipeline
from pyannote.audio import Pipeline, Audio

login(read_token)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# speaker diarization pipeline from pyannote.audio
sd_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token=True)
wav_files = glob.glob(os.path.join(audio_dirpath, '*.wav'))

# Whisper ASR pipeline for transcribing each diarized segment
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    device=device,
)

results = []
for audio_file in wav_files:
    diarization = sd_pipeline(audio_file, min_speakers=min_speakers, max_speakers=max_speakers)
    audio = Audio(sample_rate=16000, mono='random')
    for segment, _, speaker in diarization.itertracks(yield_label=True):
        waveform, sample_rate = audio.crop(audio_file, segment)
        text = pipe({"raw": waveform.squeeze().numpy(), "sampling_rate": sample_rate}, batch_size=8,
                    generate_kwargs={"language": "<|zh|>", "task": "transcribe"})["text"]
        results.append({
            'start': segment.start,
            'stop': segment.end,
            'speaker': speaker,
            'text': text
        })
        del waveform, sample_rate, text
        gc.collect()
        torch.cuda.empty_cache()
    del diarization, audio

# group diarized segments by speaker and join their texts in time order
grouped_results = defaultdict(list)
for result in results:
    grouped_results[result['speaker']].append(result)

final_results = []
for speaker, group in grouped_results.items():
    group.sort(key=lambda x: x['start'])
    text = ' '.join(result['text'] for result in group)
    final_results.append({
        'speaker': speaker,
        'text': text
    })
print(final_results)
b. WhisperX
WhisperX is a system that integrates Whisper, a phoneme-based model (Wav2Vec2), and Pyannote.audio. It is claimed to reach roughly 70x real-time transcription speed with the large-v2 model, and it adds word-level timestamps and speaker diarization with VAD. It uses the faster-whisper backend and requires less than 8 GB of GPU memory.
How to use WhisperX:
- Install dependencies at the specified versions (PyTorch 2.0.0 with a matching CUDA build, as in the commands below)
- Use WhisperX pipeline to get Pyannote diarization model
- Then configure the Whisper model size, GPU usage, computation type, and transcript language. The currently supported languages are {en, fr, de, es, it, ja, zh, nl, uk, pt}. Local audio can be loaded and diarized into transcript segments; after assigning a speaker to each audio segment, words in multi-speaker speech can be attributed to a unique speaker ID.
conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install git+https://github.com/m-bain/whisperx.git
diarize_model = whisperx.DiarizationPipeline(model_name="pyannote/speaker-diarization", use_auth_token='hf_token', device=device)
model = whisperx.load_model(whisper_arch=model, device=device, compute_type=compute_type, language=language_abbr)
audio = whisperx.load_audio(matching_file_path)
diarize_segments = diarize_model(matching_file_path, min_speakers=6, max_speakers=6)
result = model.transcribe(audio, batch_size=batch_size)
result = whisperx.assign_word_speakers(diarize_segments, result)
However, WhisperX is not perfect for such multi-speaker recognition-with-diarization tasks. It can be less accurate when speakers speak on different channels, the diarization can become disordered if the correct number of speakers is not identified, and it performs poorly with overlapping voices and speech containing interjections.
Of course, the transcription functionality of WhisperX can also be used on its own. As the architecture includes Voice Activity Detection and cut & merge features, long audio can be fed to the model, which makes a comparison with the Whisper pipeline in Transformers (for long audio transcription) possible.
device = "cuda"
directory = testaudio_directory
batch_size = 1 # reduce if low on GPU mem
compute_type = "int8" # change to "int8" if low on GPU mem (may reduce accuracy)
model = whisperx.load_model("medium", device, compute_type=compute_type, language="en")
datatest = {
'audio': [load_wav(os.path.join(testaudio_directory, f)) for f in wav_files],
'transcript': [],
}
for file_name in os.listdir(directory):
if file_name.endswith(".wav"):
audio_file = os.path.join(directory, file_name)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
strtext = [seg['text'] for seg in result["segments"]]
strtext = ' '.join(strtext)
datatest['transcript'].append(strtext)
del model
c. WhisperX Results
Dataset | Model / Task / Compute Type | Result |
TED LIUM 1st release (SLR7), test dataset | WhisperX medium, en->en, int8 | WER: 37.041% |
TED LIUM 1st release (SLR7), test dataset | WhisperX large-v2, en->en, int8 | WER: 36.917% |
TED LIUM 1st release (SLR7), test dataset | WhisperX medium, en->en, float16 | WER: 36.906% |
distil-whisper/tedlium-long-form, validation dataset | WhisperX large-v2, en->en, int8, batch size = 1 | WER: 24.651% |
distil-whisper/tedlium-long-form, validation dataset | WhisperX medium, en->en, int8, batch size = 1 | WER: 24.353% |
AISHELL-4, a selected audio file | WhisperX, manual check | CER: 15.6%~24.658% |
The results show little difference in accuracy between model sizes and computation types. The WER results were not much different from the original Whisper model, indicating that WhisperX is a good alternative to the Whisper pipeline. However, this held only for single-channel, clean audio datasets in English and a few other languages; on Chinese datasets the outcomes became unreliable.
Here are several possible reasons:
- WhisperX only supports traditional Chinese transcription, and when using HanziConv, a Python library that converts traditional Chinese to simplified Chinese, we cannot ensure the characters are converted perfectly as expected (see the sketch after this list)
- TED-LIUM is collected from TED talks, which are mostly delivered by a single speaker, reducing the difficulty of multi-speaker transcription. Further investigation on English-language meeting scenarios is needed to test diarization abilities.
- AISHELL-4, unlike AISHELL-1, was collected with an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. It provides realistic acoustics and rich, natural conversational speech characteristics such as short pauses, speech overlap, quick speaker turns, and noise, which makes it difficult to transcribe the correct sentences from the correct speaker. In addition, the meetings usually contain large numbers of interjections that carry no real meaning in Chinese.
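A minimal sketch of the traditional-to-simplified conversion mentioned above, using the hanziconv package (pip install hanziconv); the example string is illustrative and the output should still be spot-checked:
from hanziconv import HanziConv

# convert a traditional Chinese transcript to simplified Chinese
simplified = HanziConv.toSimplified("語音辨識")
print(simplified)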
5. Other Models
Besides Whisper, there exist other competitive models in the ASR field built with different architectures and techniques, including Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Connectionist Temporal Classification (CTC), etc. Several popular models were researched and evaluated with model inference.
a. Meta MMS
Massively Multilingual Speech (MMS) is a project led by Meta (Facebook Research), included in Fairseq (a sequence-to-sequence toolkit). It expands speech technology from around 100 languages to more than 1,100 languages, more than 10 times as many as before, and its language identification models can recognize more than 4,000 spoken languages, 40 times more than before.
MMS uses religious audio and texts, such as the Bible, as a data source because most have been translated into many different languages. MMS is built on pre-trained wav2vec 2.0 models. The researchers claim that MMS halves the WER of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
Several model sizes are available: mms-1b-fl102, mms-1b-l1107, and mms-1b-all, with mms-1b-all being the largest. Unlike Whisper, it uses Wav2Vec2FeatureExtractor and Wav2Vec2CTCTokenizer.
import os
import gc
import torch
from evaluate import load
from huggingface_hub import login
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, AutoProcessor
model_id = "facebook/mms-1b-all"
target_lang = "cmn-script_simplified"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
The target language ids can be found in tokenizer.vocab.keys(); for example, the id for Chinese is cmn-script_simplified. Different language adapter weights can be loaded for different languages via load_adapter(). MMS loads English weights by default, so target_lang=<target-lang> and ignore_mismatched_sizes=True need to be specified when loading other languages. An alternative is to use an ASR pipeline from transformers and set the "model_kwargs" dictionary with the two settings above.
from transformers import pipeline
model_id = "facebook/mms-1b-all"
target_lang = "lang"
pipe = pipeline(model=model_id, model_kwargs={"target_lang": "lang", "ignore_mismatched_sizes": True})
For inference purposes, switch the language adapter away from the default English weights using the following code.
processor.tokenizer.vocab.keys()
processor.tokenizer.set_target_lang(target_lang)
model.load_adapter(target_lang)
model = model.to(device)
The transcription process with the Hugging Face MMS checkpoint follows the Wav2Vec2ForCTC model.
transcriptions = []
for i, item in enumerate(dataset):
    zhcn_sample = item["audio"]["array"]
    inputs = processor(zhcn_sample, sampling_rate=16_000, return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs).logits
    ids = torch.argmax(outputs, dim=-1)[0]
    transcription = processor.decode(ids)
    transcriptions.append(transcription)
    del inputs, outputs, ids, zhcn_sample
    torch.cuda.empty_cache()
    gc.collect()
The model generates logits, which are the raw output scores from the model. After obtaining the logits from the model, it calculates the index of the highest logit value along the last dimension. This effectively finds the predicted token index with the highest probability. The [0] indexing is used to access the only example in the batch. Finally, the predicted token index ids is decoded to original text transcriptions.
b. PaddleSpeech
PaddleSpeech is a Chinese open-source toolkit on the PaddlePaddle platform for a variety of speech and audio tasks, with state-of-the-art and influential models. It provides production-ready streaming ASR and streaming TTS systems, and offers a variety of speech models with different architectures and pre-training data sources.
The recommended OS is Linux, but others are sufficient for certain tasks.
There is also a detailed introduction to the exact datasets, models, decoding and augmentation techniques used in the feature list in the official GitHub repo: https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/asr/feature_list.md
There are several model architectures available in PaddleSpeech.
1) DeepSpeech2
The DeepSpeech2 online model is a modified DeepSpeech2 version. The model is mainly composed of 2D convolution subsampling layers and stacked single-direction RNN layers.
It also has separate vocabularies for English and Chinese data. A technique called Cepstral Mean and Variance Normalization (CMVN) is used: a subset of (or the full) training set is selected and used to compute the feature mean and standard deviation. For feature extraction, the released DeepSpeech2 online model uses a linear method (Fast Fourier Transform without a filter bank). The encoder and decoder architectures are shown below.
The DeepSpeech2 offline (non-streaming) model is similar to the online one; the main difference is that the offline model uses stacked bi-directional RNN layers. The data preparation and decoder architecture are identical.
2) Conformer
The Conformer is a convolution-augmented transformer for speech recognition. It combines Convolution Neural Networks (CNN) and Transformers to improve speech recognition performance in a parameter-efficient way.
The Conformer comprises two feed-forward layers with half-step residual connections, with Multi-Head Self-Attention and Convolution modules in the middle, followed by a final layernorm. By combining CNNs for local feature extraction with transformers for global context understanding, the Conformer model achieves state-of-the-art performance on various speech recognition benchmarks.
3) U2
Unified Streaming and Non-streaming Two-pass End-to-end Model, also known as U2, applies hybrid CTC/attention architecture with dynamic chunk-based attention and streaming CTC decoding. By adjusting the chunk size during inference, the latency of the speech recognition system can be controlled. After the CTC decoder generates n-best hypotheses, the attention decoder is used to rescore these hypotheses and generate the final result. It allows for more efficient and flexible speech recognition, making it suitable for real-time applications and scenarios with variable-length audio input.
4) Usage
The installation depends on how much of PaddleSpeech will be used. The specific version can be chosen on the PaddlePaddle official website (Baidu mirror / Tsinghua mirror): https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/install/conda/windows-conda_en.html
How to use PaddleSpeech:
- Install with PyPI
pip install pytest-runner
pip install paddlespeech
Another approach is to compile source code with commands given on the official website
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install pytest-runner
pip install .
- Now use the PaddleSpeech CLI's ASRExecutor to perform model inference (on multiple files):
from paddlespeech.cli.asr.infer import ASRExecutor
asr = ASRExecutor()
result = asr(audio_file=audio_file)
The default model, conformer_wenetspeech, will be used to transcribe the audio:
import paddle

transcript = []
for audio_file in wav_files:
    result = asr(
        model='conformer_wenetspeech',
        lang='zh',
        sample_rate=16000,
        config=None,
        ckpt_path=None,
        audio_file=audio_file,
        device=paddle.get_device())
    transcript.append(result)
However, executor parameters can also be customized in the asr arguments; the available parameters are shown in Figure 15.
The list of available PaddleSpeech models for ASR inference with target language:
c. SpeechBrain
SpeechBrain is an open-source conversational AI toolkit developed by the Speech and Audio Processing Group at the University of Montreal. It aims to provide a flexible and comprehensive platform for speech-related research, development, and applications. SpeechBrain is PyTorch-based and supports state-of-the-art methods for end-to-end speech recognition, including models based on CTC, CTC+attention, transducers, transformers, and neural language models relying on recurrent neural networks and transformers.
SpeechBrain offers a wide range of functionalities for various speech and audio processing tasks, including ASR. Documentation is available on the official SpeechBrain website.
How to use SpeechBrain:
- SpeechBrain hosts pretrained models on Hugging Face; install SpeechBrain with pip:
pip install speechbrain
- Perform Inference on SpeechBrain Models
from speechbrain.pretrained import EncoderDecoderASR
from speechbrain.pretrained.interfaces import foreign_class  # for models that define a custom inference interface
# Load a transformer ASR model trained on AISHELL-1 and run inference on the GPU
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-aishell",
                                           savedir="pretrained_models/asr-transformer-aishell",
                                           run_opts={"device": "cuda"})
result = asr_model.transcribe_file(audio_file)
d. ESPnet
ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speaker diarization, etc. ESPnet uses PyTorch as the engine and also follows Kaldi style data processing, feature extraction/format, and recipes.
ESPnet is now upgrading to version 2 (ESPnet2), to which most development has shifted. ESPnet2 supports on-the-fly feature extraction and reduces data preparation complexity. It contains various ASR recipes such as hybrid CTC/attention and Transducer-based end-to-end models. The recommended OS is Linux (Ubuntu).
How to use ESPnet:
- ESPnet models can be loaded from espnet_model_zoo to perform model inference. Install espnet_model_zoo with PyPI:
- Then import speech to text module from espnet2.bin.asr_inference
- In Speech2Text, decoding parameters can be customized, e.g. beam size, ctc weight.
- model_id is the name of the ESPnet model on Hugging Face. The rest of the parameters are decoding parameters that are not stored in the model file.
- maxlenratio and minlenratio control the allowed ratio between the length of the input audio and the length of the decoded output text.
- The beam_size parameter sets the width of the beam search, which affects the number of hypotheses kept during decoding. A larger beam size can improve accuracy but also increases computation time.
- The ctc_weight parameter controls the relative weight of the CTC loss during decoding.
- LM (Language Model) is a separate model that can be used to improve ASR accuracy by incorporating language knowledge. The lm_weight parameter sets the relative weight of the language model during decoding.
- The penalty parameter applies a length penalty during decoding. It can be used to discourage longer output sequences.
- The nbest parameter controls the number of hypotheses (output sequences) to return during decoding. Setting nbest = 1 means we will get only the top 1 hypothesis.
- After completing these settings, check that the sampling rate matches the one used in the training corpus before making the speech-to-text call:
pip install espnet_model_zoo
import soundfile
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained(
    model_id,              # name of the ESPnet model on Hugging Face
    maxlenratio=0.0,
    minlenratio=0.0,
    beam_size=20,
    ctc_weight=0.3,
    lm_weight=0.5,
    penalty=0.0,
    nbest=1
)
predictions = []
for file in matching_files:
    speech, rate = soundfile.read(file)   # rate should match the training corpus (e.g. 16 kHz)
    nbests = speech2text(speech)
    text, *_ = nbests[0]                  # take the top hypothesis
    predictions.append(text)
print(predictions)
e. Baseline Results
The baseline tests primarily investigate Chinese transcription accuracy. Whisper is used as the base reference for the other models.
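The WER/CER numbers in the tables below are obtained by comparing model predictions with reference transcripts. A minimal sketch of one way to compute CER, using the Hugging Face evaluate library (not necessarily the exact scripts used for these tables; the strings are toy examples):
import evaluate

cer_metric = evaluate.load("cer")
predictions = ["今天天气很好"]   # model outputs
references = ["今天天气真好"]    # ground-truth transcripts
cer = cer_metric.compute(predictions=predictions, references=references)
print(f"CER: {100 * cer:.3f}%")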
English test results (WER):
| Test Dataset | Model / Method | WER |
| librispeech_asr clean | Meta MMS mms-1b-all | 4.331% |
| mozilla-foundation/common_voice_13_0 (1000 utterances) | Meta MMS mms-1b-all | 23.963% |

Chinese test results (CER):
| Test Dataset | Model / Method | CER |
| Aishell S0770 (353 utterances) | PaddleSpeech default (conformer_u2pp_online_wenetspeech), decode_method: attention_rescoring | 4.062% |
| Aishell S0768 (367 utterances) | PaddleSpeech default (conformer_u2pp_online_wenetspeech), decode_method: attention_rescoring | 10.322% |
| Aishell S0768 (367 utterances) | SpeechBrain wav2vec2-transformer-aishell | 8.436% |
| Aishell S0768 (367 utterances) | Meta MMS mms-1b-all | 34.241% |
| Aishell S0768 (367 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (maxlenratio=0, minlenratio=0, beam_size=20, ctc_weight=0.3, lm_weight=0.5, penalty=0.0, nbest=1) | 11.084% |
| MagicData 38_5837 (585 utterances) | Meta MMS mms-1b-all | 43.296% |
| MagicData 38_5837 (585 utterances) | PaddleSpeech default (conformer_u2pp_online_wenetspeech), decode_method: attention_rescoring | 30.422% |
| MagicData 38_5837 (585 utterances) | SpeechBrain wav2vec2-transformer-aishell | 32.852% |
| MagicData 38_5837 (585 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (lm_weight=0.5; other parameters as above) | 55.324% |
| MagicData 38_5837 (585 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (lm_weight=0; other parameters as above) | 52.878% |
| MagicData 4 speakers (2372 utterances) | Meta MMS mms-1b-all | 34.511% |
| MagicData 4 speakers (2372 utterances) | PaddleSpeech conformer-wenetspeech, decode_method: attention_rescoring | 9.79% |
| MagicData 4 speakers (2372 utterances) | PaddleSpeech conformer-aishell, decode_method: attention_rescoring | 23.135% |
| MagicData 4 speakers (2372 utterances) | SpeechBrain wav2vec2-transformer-aishell | 23.728% |
| MagicData 4 speakers (2372 utterances) | SpeechBrain wav2vec2-ctc-aishell | 15.911% |
| MagicData 4 speakers (2372 utterances) | SpeechBrain transformer-aishell | 26.166% |
| MagicData 4 speakers (2372 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (lm_weight=0.5; other parameters as above) | 38.697% |
| MagicData 4 speakers (2372 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (lm_weight=0; other parameters as above) | 36.779% |
For the English inference results, Meta MMS had similar transcript accuracy to Whisper.
For the Chinese inference results, PaddleSpeech performed better than Whisper. While conformer-aishell with attention rescoring and the Whisper large-v2 baseline have similar CER results, conformer-wenetspeech, which uses WeNetSpeech as its training source, performed better on the MagicData test datasets. Among the SpeechBrain models, wav2vec2-ctc-aishell appeared to perform best on unseen data, while the other models performed similarly to the Whisper large-v2 baseline. Meta MMS Chinese transcription results were worse than Whisper's, and the ESPnet models were the least accurate.
6. Azure Speech Studio
Azure AI Speech Services is a collection of cloud-based speech-related services offered by Microsoft Azure. These services enable developers to integrate various speech capabilities into their applications, including speech recognition, text-to-speech, and speech translation. These services leverage advanced machine learning algorithms to provide accurate and natural language processing functionalities.
Speech Studio in Azure AI is a set of UI-based tools for building and integrating features from the Azure Speech service. Custom Speech projects in Speech Studio can be created in different languages. Endpoints are then available for deployment using the Speech SDK, Speech CLI, or REST APIs. There are mainly four sections in a Custom Speech project: Speech datasets for uploading datasets, Train custom models with uploaded training datasets, Test models with uploaded test datasets, and Deploy models, which creates endpoints for the customized (trained) models.
a. Upload datasets
There are three methods for uploading training and testing datasets for Custom Speech: direct upload in Speech Studio, the REST API, and the CLI.
1) Speech Studio
For direct upload, prepare the data in local directories and follow the upload steps in Speech Studio.
There are options for choosing the dataset location: either a local file or a remote location such as an Azure Blob URL. For local files, simply provide the path of the directory on the local device. However, to ensure maximum security of dataset files through a trusted Azure services security mechanism, Azure Blob is a good option.
Azure Blob Storage is a scalable and cost-effective cloud-based object storage service. It is designed to store and manage unstructured data, such as documents, images, audio files, videos, backups, logs, and more. Azure Blob Storage provides a reliable and secure way to store massive amounts of data, making it an essential component for many cloud-based applications and services.
How to use Azure Blob:
- Install the official Python Azure Blob Storage library
- Import the clients from the library
- Then upload the zipped files to Azure Blob
- Execute the function in the Python script with the detailed storage information
pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

def upload_zip_to_azure_blob(account_name, account_key, container_name, local_zip_path, zip_blob_name):
    try:
        connection_string = (
            f"DefaultEndpointsProtocol=https;AccountName={account_name};"
            f"AccountKey={account_key};EndpointSuffix=core.windows.net"
        )
        blob_service_client = BlobServiceClient.from_connection_string(connection_string)
        container_client = blob_service_client.get_container_client(container_name)
        if not container_client.exists():
            container_client.create_container()      # create the container on first use
        zip_blob_client = container_client.get_blob_client(zip_blob_name)
        with open(local_zip_path, "rb") as zip_file:
            zip_blob_client.upload_blob(zip_file)
        print("Zip file uploaded successfully!")
    except Exception as e:
        print(f"Error uploading the zip file: {e}")

if __name__ == "__main__":
    storage_account_name = "storage"       # placeholder values, replace with real storage details
    storage_account_key = "key"
    container_name = "container"
    local_zip_file_path = "local_path"
    zip_blob_name = "data.zip"
    upload_zip_to_azure_blob(storage_account_name, storage_account_key, container_name,
                             local_zip_file_path, zip_blob_name)
At this point the desired zipped file has been uploaded to the container in Azure Blob Storage under the specified container name. Click on the file; the URL attribute can be seen at the top of Properties in the Overview, along with other attributes such as content type, creation time and modified time. Copy this URL and paste it into the required field to perform the secure upload.
2) REST
Unlike Speech Studio, with the Speech to Text REST API it is not necessary to choose whether a dataset is for testing or training at upload time; the dataset kind is determined by the data format.
The Speech to Text REST API v3.1 is used here. This version covers several features (Datasets, Endpoints, Evaluations, Models, Projects, etc.) and provides GET, POST, DELETE and PATCH methods.
How to use REST API:
- First, we need a request URL, headers and a request body.
- The next step is to dump the JSON body and create an HTTPS connection. Upload the dataset through the POST method; the exception handler will report whether an HTTPS error occurred.
- Finally, check on Speech Studio whether the dataset was successfully uploaded, and pay attention to upload failures or an expected process that does not show up.
To find the right resource, write the name of the resource in the request URL and put the subscription key for the resource in the headers. Name the uploaded dataset and give a description in the request body. By specifying a project with a unique project id, the dataset can be uploaded directly into that project. If the uploaded file is from an Azure Blob Storage container, its location can be specified in the content URL. A sketch of these pieces is given below.
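A hedged sketch of the headers and request body referenced above (field names follow the v3.1 Datasets API as I understand it; all values are placeholders to be replaced with real resource details):
import json
import http.client

headers = {
    "Ocp-Apim-Subscription-Key": "your_speech_resource_key",   # subscription key of the Speech resource
    "Content-Type": "application/json",
}
request_body = {
    "kind": "Acoustic",                      # audio + human-labeled transcript dataset
    "displayName": "magicdata-train",        # hypothetical dataset name
    "description": "Uploaded via REST API",
    "locale": "zh-CN",
    "contentUrl": "https://<account>.blob.core.windows.net/<container>/data.zip",
    "project": {"self": "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/projects/<project-id>"},
}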
json_body = json.dumps(request_body)
try:
    # Replace 'website.com' with the host of your Speech resource endpoint
    conn = http.client.HTTPSConnection('website.com')
    conn.request("POST", "/speechtotext/v3.1/datasets", json_body, headers)
    response = conn.getresponse()
    data = response.read()
    print(data)
    conn.close()
except Exception as e:
    print(f"Dataset upload request failed: {e}")
3) Format
There are restrictions on the data to upload. Usually the data composition determines whether it can be used for training or testing; it is best to include the audios and their transcripts together.
Audio files should be in WAV format, with a sampling rate of 8 kHz or 16 kHz, single channel. The maximum audio length also differs between testing and training data. The archive should be a ZIP file under 2 GB containing at most 10,000 files. A small conversion sketch follows.
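A minimal sketch of converting an arbitrary audio file to 16 kHz mono WAV before upload (assuming librosa and soundfile are installed; the helper name and file names are hypothetical):
import librosa
import soundfile as sf

def to_16k_mono_wav(src_path, dst_path):
    audio, _ = librosa.load(src_path, sr=16000, mono=True)   # resample and downmix to mono
    sf.write(dst_path, audio, 16000, subtype="PCM_16")       # write 16-bit PCM WAV

to_16k_mono_wav("speech01.mp3", "speech01.wav")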
The transcriptions for all WAV files are contained in a single plain-text file (.txt or .tsv). Each line of the transcription file contains the name of one of the audio files, followed by the corresponding transcription. The file name and transcription are separated by a tab.
For different languages, the texts should also be normalized with a defined recipe.
Take zh-CN (Chinese) for example: human-labeled transcriptions for Mandarin Chinese audio must be UTF-8 encoded with a byte-order marker. Avoid half-width punctuation characters, and write out abbreviations in spoken form. A minimal example of the transcript layout follows.
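A sketch of the expected transcript file layout (hypothetical file names and sentences; each line is the audio file name, a tab, then the transcription):
speech01.wav	今天天气真好
speech02.wav	我们明天上午开会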
Additionally, the Editor helps edit and combine uploaded datasets within the same project. It can automatically select audios of certain lengths and manually export a subset as a new dataset. Quality and correctness can also be checked by playing each audio with its corresponding transcript shown beside it. Quantity shows the total audio duration of a dataset.
b. Train models
The training process is much easier than the local one: there is no need to write fine-tuning code for the Azure models; we only feed in speech data.
Name the model and choose a baseline model as a starting point, then select one or more datasets from the Speech Studio datasets. Training often takes hours to complete, sometimes even a day.
c. Test models
After training, inspect the models or compare their error rates using a single test dataset, which should also exist among the uploaded datasets.
For inspection, select two models, either customized or baseline. Inspection works on audio-only datasets, and its purpose is to check audio quality.
For evaluation with error rates, select two (customized/baseline) models. The evaluation follows the WER calculation logic; the error rate and the numbers of insertions, substitutions and deletions for each model can be seen in Speech Studio. In addition, the original labels, lexical transcripts (the raw model output) and normalized transcripts are shown in detail under the table in Figure 23, and the error rate of every single audio is visible beside these texts.
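For reference, the standard definition underlying this evaluation is:
WER = (S + D + I) / N
where S, D and I are the numbers of substitutions, deletions and insertions against the reference, and N is the number of words in the reference (characters in the case of CER).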
d. Deploy models
We can deploy models to applications to integrate with other features, or simply redo the evaluation in local scripts. Deploying a model creates an endpoint.
How to deploy a model in Azure Speech Studio:
- First, install Speech SDK
- Import config and recognizer from Azure Speech SDK
- To perform evaluations locally with the Azure model, set up configurations with the subscription key, service region and generated endpoint. Then use the method recognize_once() to transcribe the audio files in the local directory.
pip install azure-cognitiveservices-speech
import os
from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig

# Configure the service with the subscription key, region and the endpoint generated on deployment
speech_config = SpeechConfig(subscription=subscription_key, region=service_region)
speech_config.speech_recognition_language = "zh-CN"
speech_config.endpoint_id = endpoint_id      # endpoint id of the deployed custom model

predictions = []
for root, _, files in os.walk(wav_base_path):
    for file_name in files:
        if file_name.endswith(".wav") and file_name in appeared_filenames:
            audio_file_path = os.path.join(root, file_name)
            audio_config = AudioConfig(filename=audio_file_path)
            speech_recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
            result = speech_recognizer.recognize_once()   # single-utterance recognition
            predictions.append(result.text)
print(len(predictions))
e. Results
Abbreviations: Aishell = AISHELL-1; CV13 = mozilla-foundation/common_voice_13_0; Fleurs = google/fleurs.
| Test Dataset (Split / Size / Duration) | Train Datasets (Split & Duration) | Error Rate (Custom / Baseline) |
| MagicData (9452 utterances, 11:27:39) | Aishell (12+ hrs) | 4.69% / 4.24% |
| MagicData (9452 utterances, 11:27:39) | Aishell + Minds14 (32+ hrs : 1+ hr) | 4.67% / 4.23% |
| Aishell + MagicData + ST-CMDS (5:4:1, 6105 utterances, 10:53:49) | Aishell + Minds14 + ST-CMDS (15 hrs : 1+ hr : 15+ hrs) | 3.51% / 3.20% |
| MagicData + Aishell + CV13 (15:13:7, 8721 utterances, 11:45:52) | Aishell + CV13 (8+ hrs : 7+ hrs) | 2.51% / 3.70% |
| MagicData + Aishell + CV13 (15:13:7, 8721 utterances, 11:45:52) | Aishell + CV13 + Fleurs (8+ hrs : 7+ hrs : 9+ hrs) | 2.48% / 3.70% |
The results showed that Minds14 is not ideal for ASR tasks, as its original purpose is intent detection. While a large amount of test data might reduce the randomness in audio quality and accuracy, a mixture of multiple test data sources with reasonable proportions may be more insightful. The best Azure model so far was trained with AISHELL-1, mozilla-foundation/common_voice_13_0 and google/fleurs, resulting in a 2.48% error rate.
7. Prospect
In this project, audio sources in English and Chinese datasets were investigated, and Whisper models were fine-tuned mainly on these two languages. While English audio data sources have always been sufficient for training purposes, Chinese sources that are both available and of high transcript quality are far fewer. Additionally, Chinese audio data are often organized by speaker, which suggests that mixing different speakers might resolve potential speaker-bias issues.
Because of hardware computing resource limits, full fine-tuning sometimes produces CUDA OOM errors on a single GPU. A better, larger-weight model could very likely be obtained if multi-GPU training or a more advanced GPU (e.g., the NVIDIA 40 series) were available.
A research point that could be stretched further is the effect of LoRA configurations on Parameter-Efficient Fine-tuned model performance. Beyond that, trying different optimizers and applying data augmentation are also possible improvement strategies. If the Linux environment issues are overcome, training other models for a specific language is also a promising way to enhance performance (e.g., PaddleSpeech models for Chinese; Meta MMS and SpeechBrain for English).
In the Speaker Diarization field, Pyannote.audio with Whisper integration has proven its potential. While the transcript accuracy has improved to be close to that of pure Whisper, the current diarization ability of the relevant models in multi-speaker meeting scenarios is still not sufficient for multi-speaker speech recognition support.
In the context of Azure Speech Services, the most important rule is to keep good audio quality and word-level accuracy in transcripts. While adding variety to audio sources is important, filtering out training audio files that are of poor quality or too short to carry identifiable meaning can also potentially enhance model performance.
8. References
[1] Anaconda, Inc. (2017). Command reference - conda 23.7.3.dev30 documentation. conda.io/projects/conda/en/latest/commands
[2] OpenAI (2022, September 21). Introducing Whisper. openai.com/research/whisper.
[3] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, & Ilya Sutskever (2022). Robust Speech Recognition via Large-Scale Weak Supervision.
[4] Wikipedia contributors. (2023). Word error rate. In Wikipedia, The Free Encyclopedia.
[5] Gandhi, Sanchit (2022, November 3). Fine-Tune Whisper for Multilingual ASR with 🤗 Transformers. Hugging Face, Inc. huggingface.co/blog/fine-tune-whisper.
[6] The Linux Foundation (2023). Previous PyTorch Versions | PyTorch. pytorch.org/get-started/previous-versions
[7] Hugging Face, Inc. (2023). Hugging Face - Documentations. huggingface.co/docs
[8] Vaibhav Srivastav (2023). fast-whisper-finetuning, GitHub repository. github.com/Vaibhavs10/fast-whisper-finetuning
[9] Mangrulkar, Sourab, and Sayak Paul (2023, February 10). Parameter-Efficient Fine-Tuning Using 🤗 PEFT. Hugging Face, Inc. huggingface.co/blog/peft
[10] Bredin, H., Yin, R., Coria, J., Gelly, G., Korshunov, P., Lavechin, M., Fustes, D., Titeux, H., Bouaziz, W., & Gill, M.P. (2020). pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing.
[11] Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. INTERSPEECH 2023.
[12] Meta AI (2023, May 22). Introducing speech-to-text, text-to-speech, and more for 1,100+ languages. ai.meta.com/blog/multilingual-model-speech-recognition
[13] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, & Michael Auli (2023). Scaling Speech Technology to 1,000+ Languages. arXiv.
[14] Zhang, H., et al. (2022). PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations. Association for Computational Linguistics.
[15] Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, & Yoshua Bengio. (2021). SpeechBrain: A General-Purpose Speech Toolkit.
[16] Gao, D., Shi, J., Chuang, S.P., Garcia, L., Lee, H.y., Watanabe, S., & Khudanpur, S. (2022). EURO: ESPnet Unsupervised ASR Open-source Toolkit. arXiv preprint arXiv:2211.17196.
[17] ESPnet (2021). espnet_model_zoo, GitHub repository. github.com/espnet/espnet_model_zoo
[18] Eric-Urban (2023, August 2). Custom Speech overview - Speech service - Azure AI services. Microsoft Learn. learn.microsoft.com/en-us/azure/ai-services/speech-service/custom-speech-overview
[19] Microsoft (2023). Speech service documentation - Tutorials, API Reference - Azure AI services. Microsoft Learn. learn.microsoft.com/en-us/azure/ai-services/speech-service/