Linchuan Du
Department of Mathematics | The University of British Columbia
Abstract
Automatic Speech Recognition (ASR), also known as Speech to Text (STT), uses deep learning technologies to transcribe audio containing speech into text. In the field of deep learning and artificial intelligence, Large Language Models (LLMs) mimic the way human brains process words and phrases, and are able to understand and generate text data. LLMs usually contain millions of weights and are pre-trained on a wide variety of datasets. Specifically, an ASR LLM converts audio inputs into the desired input format through feature extraction and tokenization.
To customize an ASR LLM with ideal performance, fine-tuning procedures for Whisper, an ASR LLM developed by OpenAI, were first tested on Google Colaboratory. Larger models were then deployed in GPU-equipped Windows environments to speed up training and to alleviate the GPU availability and quota issues encountered on Colab and macOS. Audio data were assessed for reliability based on information such as audio quality and transcript accuracy. Models were then improved and optimized through data preprocessing and hyper-parameter tuning. When GPU memory issues could not be resolved with regular fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) was used to freeze most parameters and reduce memory allocation without sacrificing much performance. Results were visualized along with loss curves to verify the fit and optimization of the fine-tuning processes.
The possibility of multi-speaker support in Whisper was explored using neural speaker diarization. Integration with Pyannote was implemented through its pipeline and through WhisperX, a project built on similar ideas with extra features such as word-level timestamps and Voice Activity Detection (VAD). WhisperX was tested on long-form transcription with batching as well as on diarization.
Besides Whisper, other models with ASR functionality were installed and compared against the Whisper baseline, including Massively Multilingual Speech (MMS) by Meta AI Research, PaddleSpeech by PaddlePaddle, SpeechBrain, and ESPnet. Chinese datasets were used to compare these models on the CER metric. In addition, Custom Speech in Azure AI, which supports real-time STT, was introduced for performance comparison (mainly on Mandarin Chinese). A choice can then be made between trained Azure models and loadable models such as Whisper for deployment.
Overview
- Preparing Environment
- Google Colaboratory
- Anaconda
- Visual Studio Code
- CUDA GPU
- Audio Data Source
- Hugging Face
- OpenSLR
- Whisper Model Fine-tuning
- Fine-tuning on Colab
- Common Libraries
- Data Preprocessing
- Hyperparameters
- Fine-tuned Results
- PEFT with LoRA
- PEFT Results
- Loss Curves Visualization
- Baseline Results
- Speaker Diarization
- Pyannote.audio
- WhisperX
- WhisperX Results
- Other Models
- Meta MMS
- PaddleSpeech
- SpeechBrain
- ESPnet
- Baseline Results
- Azure Speech Studio
- Upload datasets
- Train models
- Test models
- Deploy model
- Results
- Prospect
- References
1. Preparing Environment
a. Google Colaboratory
Google Colaboratory is a hosted Jupyter Notebook service that provides limited free GPU and TPU computing resources. In Google Colaboratory, the .ipynb format is used to edit and execute Python scripts.
Log in to Google Colab with a Google account, share written scripts with others via "Share" in the top right corner of the page, and optionally authorize Colab with a GitHub account.
How to set up environments on Colab:
- Select the Runtime tab -> Change runtime type to enable a GPU
- Use pip or other package installers to install necessary dependencies
!pip install packageName
b. Anaconda
Besides Colab, environments can also be prepared on a local PC. Anaconda is a well-known distribution platform for the data science field, covering data analysis and building machine learning models in Python. It includes Conda, an environment and package manager that helps manage open-source Python packages and libraries.
How to set up environments with Anaconda:
- Install Anaconda from Free Download | Anaconda and add it to the PATH environment variable
- Open Command Prompt and enter the base environment, e.g. (Windows):
- Create a new Conda environment with a new name, e.g. "myenv":
- Activate a specific Conda environment whenever it is needed, or return to the base environment with deactivate:
- Install dependencies through the PyPI or Conda package managers; PyPI package version requirements can be specified with version specifiers:
(base) C:\Users\username>
conda create --name myenv
conda activate myenv
conda deactivate
pip install packageName>=0.0.1
conda install packageName
c. Visual Studio Code
Visual Studio Code, or VS Code, is a powerful source-code editor for Windows, macOS, and Linux that supports editing in many programming languages. It supports multiple tasks, including debugging, execution in integrated terminals, extending functionality through extensions, and version control with embedded Git.
How to set up environments in VS Code:
- Open the folder(s) on the left side under EXPLORER and create files inside the folder.
- In the bottom right, select the environment that is needed. For execution, run Python scripts either in the interactive window at the top right (with the IPython kernel installed) or in an integrated terminal using commands such as
python xxx.py
An alternative way is to use the ipynb extension (Jupyter Notebook).
- The Git icon on the left panel is where source code is controlled. Commits, pushes to and pulls from GitHub, merges, and branch checkouts can all be done within VS Code.
d. CUDA GPU
Compute Unified Device Architecture (CUDA) is a parallel computing platform and Application Programming Interface (API) developed by NVIDIA. It allows developers to use NVIDIA Graphics Processing Units (GPUs) for multiple computing tasks.
How to use CUDA GPU:
- Install the CUDA Toolkit, which includes necessary libraries, tools, and drivers for developing and running CUDA applications.
- Check relevant information in Command Prompt with the command
- The first table shows which NVIDIA GPU(s) are in use, how much GPU memory is occupied (e.g. 24297MiB/24576MiB in Figure 3), and the GPU utilization percentage (e.g. 4% in Figure 3). The Processes section beneath lists the processes that use GPU memory, along with the GPU index (0 in the single-GPU case).
- After setting up the CUDA Toolkit, it is necessary to download a GPU-compatible PyTorch version for deep learning purposes. Go to the official PyTorch website and find the version that matches the environment.
- A version check can be performed directly through Python
nvidia-smi
import torch
print(f' CUDA availability on PyTorch is {torch.cuda.is_available()}')
print(f' Current PyTorch version is {torch.__version__}')
print(f' Current CUDA version is {torch.version.cuda}')
print(f' cuDNN version is {torch.backends.cudnn.version()}')
print(f' The number of available GPU devices is {torch.cuda.device_count()}')
# Use CUDA on the device
device = torch.device("cuda")
On Windows, GPU information can also be visualized and monitored in Task Manager.
Utilization is a useful indicator of whether the GPU is being used as expected. If the CPU is used heavily instead of the GPU, the GPU has not been used for training, which may indicate problems with the GPU setup. Dedicated GPU memory, on the other hand, measures GPU memory usage; high memory usage will likely cause CUDA Out of Memory (OOM) errors.
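The same information can also be queried programmatically from PyTorch. A minimal sketch, assuming a single CUDA device at index 0:
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_mib = props.total_memory / 1024**2                  # total dedicated GPU memory in MiB
    allocated_mib = torch.cuda.memory_allocated(0) / 1024**2  # memory currently held by tensors
    reserved_mib = torch.cuda.memory_reserved(0) / 1024**2    # memory reserved by the caching allocator
    print(f"{props.name}: {allocated_mib:.0f} MiB allocated, {reserved_mib:.0f} MiB reserved, {total_mib:.0f} MiB total")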
2. Audio Data Source
a. Hugging Face
Hugging Face is a company and an open-source platform that is dedicated to the fields of Natural Language Processing (NLP) and Artificial Intelligence.
It is important to create a Hugging Face account in order to use published models or upload our own customized models. Personal READ and WRITE tokens can be created at https://huggingface.co/settings/tokens, where the READ token is for downloading models or datasets from the platform and the WRITE token is for uploading local models or datasets. To narrow down tasks and categories, select tags on the left side of the website.
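Once a token has been created, it can be used to authenticate from Python. A minimal sketch with a placeholder token (replace it with a personal READ or WRITE token):
from huggingface_hub import login

# READ token for downloading models/datasets, WRITE token for uploading
login(token="hf_xxxxxxxxxxxxxxxxxxxx")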
It is convenient to find the most popular models and audio datasets based on downloads and likes. Note that some models only support certain languages or certain tasks, and the same model architecture or type may come in several sizes.
On the Hugging Face platform, most well-known organizations and companies publish their speech-to-text models and collected datasets. Sometimes an agreement must be accepted for data access. There are also source files on the Hub that can be downloaded locally.
Here are some common ASR LLMs and their relevant information:
Model | Size / # Params | Languages | Task | Structure |
OpenAI Whisper | large-v2, 1550M | Most languages | Multitask | Transformer encoder-decoder (regularized) |
OpenAI Whisper | large, 1550M | Most languages | Multitask | Transformer encoder-decoder |
OpenAI Whisper | medium, 769M | Most languages | Multitask | Transformer encoder-decoder |
OpenAI Whisper | small, 244M | Most languages | Multitask | Transformer encoder-decoder |
guillaumekln faster-whisper | large-v2 | Most languages | Multitask | CTranslate2 |
facebook wav2vec2 | large-960h-lv60-self | English | Transcription | Wav2Vec2 CTC decoder |
facebook wav2vec2 | base-960h, 94.4M | English | Transcription | Wav2Vec2 CTC decoder |
facebook mms | 1b-all, 965M | Most languages | Multitask | Wav2Vec2 CTC decoder |
Here are some audio datasets with their relevant information:
Dataset | # hours / Size | languages |
mozilla-foundation common_voice_13_0 | 17689 validated hrs | 108 languages |
google fleurs | ~ 12 hrs per language | 102 languages |
LIUM tedlium | 118 to 452 hrs for 3 releases | English |
librispeech_asr | ~1000 hrs | English |
speechcolab gigaspeech | 10000 hrs | English |
PolyAI minds14 | 8.17k rows | 14 languages |
b. OpenSLR
OpenSLR is another useful website that hosts speech and language resources as compressed files. Various audio datasets are listed along with brief summaries under the Resources tab. Some datasets that are not available on Hugging Face can be accessed here or link to their own websites.
Specifically for Chinese audio, there are datasets ideal for ASR purposes that are not found on Hugging Face:
Dataset | # hours (size) | # speakers | transcript accuracy |
Aishell-1 (SLR33) | 178 hrs | 400 | 95+% |
Free ST (SLR38) | 100+ hrs | 855 | / |
aidatatang_200zh (SLR62) | 200 hrs | 600 | 98+% |
MAGICDATA (SLR68) | 755 hrs | 1080 | 98+% |
3. Whisper Model Fine-tuning
Whisper is an ASR (Automatic Speech Recognition) system released by OpenAI in September 2022. It was trained on 680,000 hours of multilingual and multitask supervised data, enabling transcription and translation across many languages. The architecture is an encoder-decoder Transformer.
Audio is chunked into 30-second segments and converted into a log-Mel spectrogram, which maps frequencies onto the Mel scale; the spectrogram is then passed into the encoder.
The official announcement for OpenAI Whisper is Introducing Whisper, and its research paper is Robust Speech Recognition via Large-Scale Weak Supervision.
a. Fine-tuning on Colab
- Login through Hugging Face token to enable datasets download
- Load the desired dataset(s) with load_dataset from datasets. Usually two or three splits from the same source are created for train, test, and/or validation; use DatasetDict to keep them together.
- Then preprocess the datasets so the data can be fed into Whisper, for example:
- Manipulate columns: e.g. remove_columns, cast_column
- Normalize transcripts: e.g. upper/lowercase, punctuation, special tokens
- Change the sampling rate to 16 kHz using Audio in the Datasets library
- Load the pre-trained feature extractor and tokenizer from the transformers library; the Processor wraps both
- Prepare batched mapping to speed up processing, as Dataset.map() parallelizes the tokenization of all samples in a batch
- Define the data collator for sequence-to-sequence training with label padding
- The evaluation metric (WER) can then be loaded from Hugging Face Evaluate
- Define the metric computation for predictions and labels
- Load the model for conditional generation
- Configure generation settings in model.config
- Before training the model, hyper-parameters are defined in Seq2SeqTrainingArguments
- Finally start training with customized settings using trainer.train()
from huggingface_hub import notebook_login
notebook_login()
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor
feature_extractor = WhisperFeatureExtractor.from_pretrained("model_id")
tokenizer = WhisperTokenizer.from_pretrained("model_id")
processor = WhisperProcessor.from_pretrained("model_id")
In the tokenizer and processor, the target language and task are usually specified:
language="lang", task="transcribe"  # or "translate" for speech translation
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # pad the log-Mel input features and the tokenized labels separately
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # replace padding with -100 so padded positions are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        # drop the BOS token if it was prepended during tokenization
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # replace -100 with the pad token id before decoding
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("model_id")
Whisper has token ids that are forced as model outputs before autoregressive generation starts; forced_decoder_ids is set to None here because both the target language and task have already been specified. suppress_tokens are tokens whose log probabilities are set to -inf so they are never generated; an empty list indicates that no tokens are suppressed.
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
from transformers import Seq2SeqTrainingArguments

# arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="kawadlc/whisperv1",   # own repo name
    per_device_train_batch_size=16,   # batch size per GPU for training
    gradient_accumulation_steps=1,    # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,               # important param for handling overfitting and underfitting
    weight_decay=1e-2,                # regularization mechanism
    warmup_steps=200,                 # improve early-training stability
    max_steps=3000,                   # total optimization steps
    gradient_checkpointing=True,      # save memory
    evaluation_strategy="steps",      # evaluation strategy, alternative: "epoch"
    fp16=True,                        # half-precision floating point format
    per_device_eval_batch_size=8,     # batch size per GPU for evaluation
    predict_with_generate=True,       # run generation during evaluation
    generation_max_length=200,        # max number of tokens for autoregressive generation
    eval_steps=500,                   # number of steps per evaluation
    report_to=["tensorboard"],        # save training logs to tensorboard
    load_best_model_at_end=True,      # load the best model at the end of training
    metric_for_best_model="wer",      # metric defining the best model
    greater_is_better=False,          # lower WER is better
    push_to_hub=False,                # push to the Hub, optional
)
from transformers import Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
Training loss, evaluation loss, and WER are reported at each evaluation step or epoch during training. A fine-tuned model is complete after training, and checkpoints are stored in the specified output directory.
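The fine-tuned weights can then be saved for later inference. A minimal sketch, assuming the trainer and processor defined above and a hypothetical local directory name:
# save the fine-tuned model weights and configuration
trainer.save_model("whisper-small-finetuned")
# save the feature extractor and tokenizer alongside the model
processor.save_pretrained("whisper-small-finetuned")
# optionally push to the Hub (requires a WRITE token and push_to_hub settings)
# trainer.push_to_hub()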
If a CUDA Out of Memory (OOM) error occurs during training, adjust the previous steps to reduce GPU memory usage:
- The first priority is to reduce the batch size, trading extra training time for memory savings. Combine this with gradient accumulation, which accumulates gradients over multiple smaller batches before performing the weight update step.
- Gradient checkpointing is another useful strategy that trades a small increase in computation time for significant reductions in memory usage.
- If that is not enough, consider mixed precision training if it is not already applied; it reduces the memory footprint significantly while maintaining training stability.
- Free occupied GPU memory and cache with the garbage collector and the empty-cache method
import gc
import torch
gc.collect()
torch.cuda.empty_cache()
b. Common Libraries
There are many popular Python libraries suitable for machine learning and data processing tasks. In the automatic speech recognition field, several are particularly useful for working with audio: torch, transformers, datasets, and evaluate for modeling and evaluation, and librosa, pydub, and soundfile for reading, manipulating, and writing audio files.
c. Data Preprocessing
1) Hugging Face Dataset
Load the dataset using the load_dataset function. In many popular audio datasets, the splits of train, test and validation have already been preprocessed.
- Specify the subset with their corresponding names
- Choose the split name with "split"; a plus sign can combine multiple splits
- "token" grants remote access to datasets (use_auth_token will be deprecated)
- The returned dataset is of type datasets.Dataset. Put the splits into a dictionary with DatasetDict.
from datasets import load_dataset, DatasetDict

# common_voice for data source
common_voice = DatasetDict()
# split datasets
common_voice["train"] = load_dataset("common_voice", "ja", split="train+validated", use_auth_token=True)
common_voice["validation"] = load_dataset("common_voice", "ja", split="validation", use_auth_token=True)
common_voice["test"] = load_dataset("common_voice", "ja", split="test", use_auth_token=True)
# create the DatasetDict and choose the sample sizes for training and evaluation
common_voice = DatasetDict({
    "train": common_voice["train"].select(range(3500)),
    "validation": common_voice["validation"].select(range(500)),
    "test": common_voice["test"].select(range(100)),
})
# remove columns that are not needed for training
common_voice = common_voice.remove_columns(["age", "client_id", "down_votes", "gender", "path", "up_votes"])
For non-streaming mode, ensure there is enough disk space for the download before calling the load_dataset function. To introduce randomization, use shuffle(seed=42) to randomly rearrange the rows. select() and filter() reduce the number of examples and return rows that match a specified condition.
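A short sketch of these calls on the Common Voice splits defined above; the "sentence" column name and the selection size are illustrative:
# shuffle, subsample, and drop examples with empty transcripts
common_voice["train"] = common_voice["train"].shuffle(seed=42)
common_voice["train"] = common_voice["train"].select(range(1000))
common_voice["train"] = common_voice["train"].filter(lambda example: len(example["sentence"]) > 0)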
As described in the Fine-tuning on Colab section, columns can be manipulated by removing and renaming them. Use cast_column() to change the feature type of a single column (or several with cast()).
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
Here the sampling rate should always be converted to 16 kHz because the Whisper architecture requires it. flatten() is called when nested features need to be brought up to the top level of the dataset.
To combine datasets that come from different sources (as sketched below):
- concatenate_datasets() combines datasets end to end
- interleave_datasets() mixes several datasets by taking alternating examples from each one to create a new dataset
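A minimal sketch, assuming two hypothetical train splits (cv_train and fleurs_train) that already share the same columns and sampling rate:
from datasets import concatenate_datasets, interleave_datasets

combined = concatenate_datasets([cv_train, fleurs_train])        # joins the datasets end to end
mixed = interleave_datasets([cv_train, fleurs_train], seed=42)   # alternates examples from each source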
In streaming mode, many call functions are similar to the non-streaming mode, but some are completely different. To shuffle, set a buffer size; examples are then randomly sampled from this buffer.
shuffled_dataset = dataset.shuffle(buffer_size=10_000, seed=42)
It is also possible to create a new dataset from the first n examples with take(n), or from the remaining examples by skipping the first n with skip(n).
Detailed explanations can be found in the Hugging Face documentation: Process
The map function in HuggingFace's datasets is used to apply a given function to each example in a dataset. It performs transformation and pre-processing in a convenient and efficient way.
Here is the basic syntax:
dataset.map(function, batched=True, num_proc=1, remove_columns=None, **function_args)
A self-defined function is applied to every sample in the dataset; the boolean parameter batched decides whether samples are processed in batches or one by one. The num_proc parameter sets the number of processes used for parallel processing, which helps speed up data processing, especially with large datasets; however, multi-processing issues may occur when the number is increased. Unnecessary columns should be removed after execution, and if the desired dataset is created within the defined function, the initial train columns should be removed.
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

librispeech = librispeech.map(prepare_dataset, remove_columns=librispeech.column_names["train"], num_proc=1)
2) Transcript Cleaning
Sometimes human-labeled transcripts are not ideal, containing extra symbols or formats that differ from other data sources. Transcript cleaning is therefore an essential preprocessing step. It refines or corrects the transcriptions of speech data and directly impacts the performance of ASR models, so it is crucial to ensure accurate, error-free transcriptions. It is sometimes also called text normalization.
- Punctuation and Capitalization
- Tokenization
import string

# lowercase texts and remove punctuation except apostrophes
text = [s.lower() for s in text]
punctuation_without_apostrophe = string.punctuation.replace("'", "")
translator = str.maketrans('', '', punctuation_without_apostrophe)
text = [s.translate(translator) for s in text]
Convert all text to lowercase with the lower() method. Most punctuation marks are in string.punctuation, and they can be replaced with empty strings. The re.sub() function replaces all occurrences of matched patterns, and the str.maketrans() method can be used in the same way.
import re

# remove special tokens such as markup tags
def remove_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
Tokenization involves breaking the text into individual units, or tokens, such as words; text can usually be split with split(). Remove special tokens like timestamps, silence, or non-speech tokens if they exist.
3) Audio Chunking
Audio chunking, also known as audio segmentation or audio slicing, is the process of breaking down a continuous audio stream into smaller, manageable segments or chunks.
The Librosa, Pydub, and SoundFile libraries are recommended for audio chunking. Use the librosa.load call to read an audio file. The AudioSegment class in Pydub is helpful for creating chunks, e.g. AudioSegment.from_file(). The SoundFile library can read and write sound files in various formats.
import librosa
import soundfile as sf
from pydub import AudioSegment
# load the audio file with pydub or librosa
audio_path = "input_audio.mp3"
audio = AudioSegment.from_file(audio_path)
# audio, sr = librosa.load(audio_path, sr=None)
# chunking algorithms
# after getting chunked files, use sf to write them out
# sf.write(chunk_filename, chunk, sr)
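A minimal chunking sketch, assuming the pydub AudioSegment loaded above and a fixed 30-second chunk length:
# split the audio into fixed 30-second chunks and write each one to disk
chunk_length_ms = 30 * 1000
for i, start in enumerate(range(0, len(audio), chunk_length_ms)):
    chunk = audio[start:start + chunk_length_ms]       # pydub slices by milliseconds
    chunk.export(f"chunk_{i:03d}.wav", format="wav")   # export the chunk as a wav file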
d. Hyperparameters
In Whisper fine-tuning, Seq2SeqTrainingArguments and Seq2SeqTrainer are used for training because of the model's sequence-to-sequence architecture. Several parameters in the training arguments can significantly affect the learning ability of LLMs.
e. Fine-tuned Results
Abbreviations: #ts = number of training samples, #es = number of evaluation samples, lr = learning_rate, wd = weight_decay, ws = warmup_steps, ms = max_steps, es = evaluation_strategy, ml = generation_max_length, tbz = per-device train batch size, ebz = per-device eval batch size.
Dataset / Size / Split | Model / Lang / Task | Hyperparameters | Result |
mozilla-foundation/common_voice_11_0, #ts = 100, #es = 100, train/test | Whisper small, Hindi, Transcribe | lr = 1e-5, wd = 0, ws = 5, ms = 40, es = steps, ml = 225, tbz = 4, ebz = 8 | WER: 67.442% |
mozilla-foundation/common_voice_11_0, #ts = 100, #es = 100, train+validation/test | Whisper small, Hindi, Transcribe | lr = 1e-5, wd = 0, ws = 0, ms = 60, es = steps, ml = 50, tbz = 16, ebz = 8 | WER: 69.240% |
mozilla-foundation/common_voice_11_0, #ts = 100, #es = 100, train+validation/test | Whisper small, Hindi, Transcribe | lr = 1e-5, wd = 0, ws = 0, ms = 60, es = steps, ml = 100, tbz = 16, ebz = 8 | WER: 64.656% |
mozilla-foundation/common_voice_11_0, #ts = 500, #es = 500, train+validation/test | Whisper small, Hindi, Transcribe | lr = 1e-5, wd = 0, ws = 0, ms = 60, es = steps, ml = 50, tbz = 16, ebz = 8 | WER: 62.207% |
common_voice, #ts = 100, #es = 100, train+validated/validation | Whisper small, Japanese, Transcribe | lr = 1e-5, wd = 0, ws = 0, ms = 80, es = steps, ml = 225, tbz = 16, ebz = 8 | WER: 64.0% |
common_voice, #ts = 3500, #es = 500, train+validated/validation | Whisper small, Japanese, Transcribe | lr = 1e-6, wd = 0, ws = 50, ms = 3500, es = steps, ml = 200, tbz = 16, ebz = 8 | WER: 2.4% |
librispeech_asr, #ts = 750, #es = 250, train.100/validation | Whisper medium, English, Transcribe | lr = 1e-5, wd = 0.01, ws = 10, ms = 750, es = steps, ml = 80, tbz = 1, ebz = 1 | WER: 13.095% |
cv & fleurs (50:50), #ts = 3500, #es = 500, train+validated & train / validation & validation | Whisper small, Japanese, Transcribe | lr = 1e-6, wd = 0, ws = 50, ms = 3500, es = steps, ml = 200, tbz = 16, ebz = 8 | WER: 55.424% |
According to Graph 1, Whisper-small had poor WER performance (above 60%) on Hindi transcription with a small sample size. While combining training data sources and increasing the sample size improved the models somewhat, better performance may be limited by model capacity, so increasing model size is likely more effective. At the same time, overfitting or catastrophic forgetting may occur when training with a large sample size or a single source, so such WER results should not be treated as strong references. In addition, the Whisper model had a much lower WER on English transcription than on Hindi and Japanese. Since Japanese is character-based, a more suitable evaluation metric is the Character Error Rate (CER).
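For character-based languages, CER can be loaded from Hugging Face Evaluate in the same way as WER. A minimal sketch reusing the decoded predictions and references from the compute_metrics function above:
import evaluate

cer_metric = evaluate.load("cer")
cer = 100 * cer_metric.compute(predictions=pred_str, references=label_str)
print(f"CER: {cer:.3f}%")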
f. PEFT with LoRA
Sometimes out-of-memory errors occur during full fine-tuning, usually caused by low-resource hardware and large model weights. If full fine-tuning cannot be completed even after adjusting hyper-parameters such as batch size, Parameter-Efficient Fine-Tuning (PEFT) approaches can be a useful alternative. This approach fine-tunes only a small number of model parameters while freezing most parameters of the pre-trained LLM, thereby greatly decreasing computational and storage costs. It also mitigates catastrophic forgetting, a behavior observed during full fine-tuning of LLMs.
LoRA stands for Low-Rank Adaptation. Its goal is to improve the efficiency and performance of adapting a pre-trained model to a new task by reducing the dimensionality of the trainable parameters. LoRA decomposes the weight updates of the pre-trained model into low-rank matrices and significantly reduces the number of parameters that need to be fine-tuned.
Compared to normal full fine-tuning, the difference starts when loading the pre-trained checkpoint for conditional generation. load_in_8bit=True quantizes the model to 8-bit precision, a quarter of the memory of float32, with minimal loss of performance. The device_map="auto" argument automatically determines how to load and store the model weights. prepare_model_for_int8_training casts all non-int8 modules to full precision (FP32) for stability, adds a forward hook to the input embedding layer, and enables gradient checkpointing for memory-efficient training.
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_int8_training

model = WhisperForConditionalGeneration.from_pretrained('openai/whisper-large-v2', load_in_8bit=True, device_map="auto")
model = prepare_model_for_int8_training(model)

def make_inputs_require_grad(module, input, output):
    output.requires_grad_(True)

model.model.encoder.conv1.register_forward_hook(make_inputs_require_grad)

config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")
model = get_peft_model(model, config)
model.print_trainable_parameters()
Configurations for the PEFT model are set using LoraConfig. The model.print_trainable_parameters() call prints the number of trainable parameters, the total number of parameters, and the percentage of trainable parameters, which shows how much is trained compared to full fine-tuning; usually the percentage is around 1%.
Compared to full fine-tuning, it is necessary to explicitly set remove_unused_columns=False and label_names=["labels"], since the PEFT model does not inherit the signature of the base model. Additionally, predict_with_generate cannot be used because it internally calls generate without auto-casting; for the same reason, compute_metrics cannot be passed in the training arguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="jackdu/whisper-peft",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=0,
    num_train_epochs=3,
    evaluation_strategy="steps",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=150,
    logging_steps=100,
    # max_steps=100,  # only for testing purposes
    remove_unused_columns=False,  # required
    label_names=["labels"],       # required
)
A custom TrainerCallback can be written to save model checkpoints during training. The callback saves the adapter_model weights and removes the base model weights in pytorch_model.bin. Pass the Seq2SeqTrainingArguments, model, datasets, data collator, tokenizer, and callbacks to the Seq2SeqTrainer, and set model.config.use_cache = False to silence warnings. The PEFT model is then ready for training.
import os
from transformers import TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR

class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")
        # keep only the adapter weights and drop the full base-model weights
        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)
        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings; re-enable for inference
The trained PEFT model can be pushed to or loaded from Hugging Face. A PEFT model has only two files, adapter_model.bin and adapter_config.json. Set model.config.use_cache = True to enable inference. An evaluation loop should then be written to evaluate model performance; since predict_with_generate is disabled as explained above, the eval loop can be hand-rolled with torch.cuda.amp.autocast().
from huggingface_hub import login

login(token='hf_writetoken')  # personal WRITE token
peft_model_id = "jackdu/whisper-peft"
model.push_to_hub(peft_model_id)
from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer
peft_model_id = "jackdu/whisper-peft" # Use the same model ID as before.
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)
model.config.use_cache = True
A DataLoader loads the test dataset in batches. The model generates predictions for the given input features using the specified decoder prompt IDs and a maximum number of new tokens. The predictions and labels are converted from tensors to NumPy arrays, decoded back into text with the tokenizer's batch_decode method, and appended to their respective lists. After each batch, intermediate variables are deleted and garbage collection is triggered to manage memory. Finally, WER and normalized WER are shown in the output.
import gc
import numpy as np
from tqdm import tqdm
from torch.utils.data import DataLoader
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

eval_dataloader = DataLoader(common_voice["test"], batch_size=8, collate_fn=data_collator)
forced_decoder_ids = processor.get_decoder_prompt_ids(language='vi', task='transcribe')
normalizer = BasicTextNormalizer()

predictions = []
references = []
normalized_predictions = []
normalized_references = []

model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    forced_decoder_ids=forced_decoder_ids,
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
        labels = batch["labels"].cpu().numpy()
        labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)
        decoded_preds = processor.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        decoded_labels = processor.tokenizer.batch_decode(labels, skip_special_tokens=True)
        predictions.extend(decoded_preds)
        references.extend(decoded_labels)
        normalized_predictions.extend([normalizer(pred).strip() for pred in decoded_preds])
        normalized_references.extend([normalizer(label).strip() for label in decoded_labels])
    del generated_tokens, labels, batch
    gc.collect()

wer = 100 * metric.compute(predictions=predictions, references=references)
normalized_wer = 100 * metric.compute(predictions=normalized_predictions, references=normalized_references)
eval_metrics = {"eval/wer": wer, "eval/normalized_wer": normalized_wer}
print(f"{wer=} and {normalized_wer=}")
print(eval_metrics)
g. PEFT Results
Abbreviations are as in the fine-tuned results table above, with #e = number of epochs.
Dataset / Size / Split | Model / Lang / Task | Hyperparameters | Result |
mozilla-foundation/common_voice_13_0, #ts = 1000, #es = 100, train+validation/test | Whisper medium, Japanese, Transcribe | lr = 1e-3, wd = 0, ws = 50, #e = 3, es = steps, ml = 128, tbz = 8, ebz = 8 | WER: 73%, NormWER: 70.186% |
mozilla-foundation/common_voice_13_0, #ts = 10000, #es = 100, train+validation/test | Whisper medium, Japanese, Transcribe | lr = 1e-5, wd = 0, ws = 200, #e = 5, es = steps, ml = 150, tbz = 20, ebz = 20 | WER: 79.920%, NormWER: 85.582% |
mozilla-foundation/common_voice_13_0, #ts = 7000, #es = 1500, train/test | Whisper large-v2, Japanese, Transcribe | lr = 1e-5, wd = 0, ws = 200, #e = 4, es = steps, ml = 200, tbz = 16, ebz = 8, dropout = 0.05, lr_scheduler = linear | WER: 81.346%, NormWER: 77.364% |
mozilla-foundation/common_voice_13_0, #ts = 100, #es = 30, train+validation/test | Whisper large-v2, Vietnamese, Transcribe | lr = 1e-4, wd = 0.01, ws = 0, #e = 3, es = steps, ml = 150, tbz = 8, ebz = 8 | WER: 26.577%, NormWER: 22.523% |
From Graph 2, Whisper with PEFT had poor WER performance on Japanese transcription regardless of model size or sample size. However, since Japanese is a character-based language, WER may not fully represent model performance. After switching to Vietnamese, a language written in a space-delimited Latin-based script similar to English, the results were within expectations.
h. Loss Curves Visualization
Loss curves are often used in machine learning to monitor the performance of a model during training. They show how the loss functions change over epochs or steps. The loss function quantifies how well the model's predictions match the actual target values. Matplotlib is a powerful Python library for data visualization in graphs.
import matplotlib.pyplot as plt

# training_epoch/training_loss and evaluation_epoch/evaluation_loss are lists
# collected from the training logs
plt.figure(figsize=(10, 6))
plt.plot(training_epoch, training_loss, label="Training Loss")
plt.plot(evaluation_epoch, evaluation_loss, label="Evaluation Loss")
plt.xlabel("Training Epochs")
plt.ylabel("Loss")
plt.title("Loss Curves for Whisper Fine-Tuning")
plt.legend()
plt.grid(True)
plt.show()
After entering the loss and epoch data manually, the graph is created.
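Instead of entering the numbers by hand, the same values can be pulled from the trainer's log history after training. A sketch, assuming the trainer object defined earlier:
# training entries contain "loss", evaluation entries contain "eval_loss"
logs = trainer.state.log_history
training_epoch = [log["epoch"] for log in logs if "loss" in log]
training_loss = [log["loss"] for log in logs if "loss" in log]
evaluation_epoch = [log["epoch"] for log in logs if "eval_loss" in log]
evaluation_loss = [log["eval_loss"] for log in logs if "eval_loss" in log]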
There are several patterns and rules to identify if the loss curves indicate a good fit for Whisper or any other machine learning models. Strategies can be applied based on the information from plotted visualized graphs. This helps us speed up finding an ideal model.
- Overfitting and Underfitting
- Smoothness and Stability
- Loss Plateau and Rebound
Look for both the training and validation loss to converge to relatively low values. A convergence with a low value range typically means that the model is learning well, with neither underfitting (high training and validation loss) nor overfitting (low training loss but high validation loss).
Smooth loss curves are indicative of a well-behaved training process. If the curves are not smooth, with large fluctuations or irregular patterns, it might suggest instability or other issues in the training process.
After a certain number of epochs, the loss curves may flatten and stop decreasing. This can indicate that the model has reached a state where it struggles to learn further from the available data. If the evaluation loss stops decreasing and even starts to rebound, it is a sign that the model is beginning to overfit; strategies like early stopping can prevent this, as in the sketch below.
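A sketch of early stopping with transformers, assuming the Seq2SeqTrainer setup from the fine-tuning section; it requires load_best_model_at_end=True and metric_for_best_model in the training arguments, and the patience value is illustrative:
from transformers import EarlyStoppingCallback, Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    # stop training after 3 evaluations without improvement in the chosen metric
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)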
i. Baseline Results
Dataset / Split / Size | Model / Task | Result |
distil-whisper/tedlium-long-form, test dataset | Whisper medium baseline, en->en | WER: 28.418% |
distil-whisper/tedlium-long-form, validation dataset | Whisper large-v2 baseline, en->en | WER: 26.671% |
distil-whisper/tedlium-long-form, validation dataset | Whisper medium baseline, en->en | WER: 24.049% |
librispeech_asr, clean test dataset | Whisper large-v2 baseline, en->en | WER: 4.746% |
mozilla-foundation/common_voice_13_0, test dataset, 1000 examples | Whisper large-v2 baseline, en->en | WER: 21.712% |
GigaSpeech, test dataset, 1000 examples (actual: 777, excluding music & noise) | Whisper large-v2 baseline, en->en | WER: 12.819% |
Aishell S0770, test dataset, 353 examples | Whisper large-v2 baseline, zh-CN->zh-CN | CER: 8.595% |
Aishell S0768, test dataset, 367 examples | Whisper large-v2 baseline, zh-CN->zh-CN | CER: 12.379% |
MagicData 38_5837, test dataset, 585 examples | Whisper large-v2 baseline, zh-CN->zh-CN | CER: 21.750% |
MagicData 4 speakers, test dataset, 2372 examples | Whisper large-v2 baseline, zh-CN->zh-CN | CER: 24.747% |
The Graph 4 results for Whisper baseline inference suggest that the quality and splits of the audio data sources affect model performance.
In the English category, TED-LIUM (long form) had the worst accuracy, with a best WER of 24.049%; Common Voice was second worst, GigaSpeech third, and LibriSpeech the best at 4.746% WER. This is reasonable: TED-LIUM consists of TED talks that contain noise and were not recorded primarily for ASR. Common Voice has significant variation in audio quality, since it collects data from a vast number of volunteers and contributors. GigaSpeech has the most trainable hours of audio, but it was recorded from podcasts and YouTube and may therefore suffer some loss in sound quality. LibriSpeech consists of narrated audiobooks from the LibriVox project and has been carefully segmented and cleaned by researchers.
In the Chinese category, Aishell and MagicData produced significantly different CERs. As both are split by speaker, performance may fluctuate considerably within each dataset. However, MagicData has better claimed transcript accuracy, more training hours, and more speakers, so a likely explanation is audio quality: Aishell audio was recorded with high-fidelity microphones and then downsampled to 16 kHz during preprocessing, while MagicData audio was recorded on mobile phones.
4. Speaker Diarization
Speaker Diarization is a field in speech recognition that involves segmenting a speech audio into distinct segments corresponding to different speakers. The goal is to identify and differentiate individual speakers in an audio stream, making it possible to assign diarized time segments to specific speakers. This is particularly useful in scenarios where there are multiple speakers in a recording, such as conference calls, interviews, podcasts, and recorded meetings.
a. Pyannote.audio
Pyannote-audio is an open-source toolkit developed by the Pyannote team for various speech and audio processing tasks. It provides a collection of pre-built models, algorithms, and utilities to perform tasks like speaker diarization, voice activity detection, and speech turn segmentation.
How to use Pyannote.audio with Whisper:
- Install Pyannote.audio using PyPI:
- Login to Hugging Face and set up CUDA GPU on the device
- Import Pipeline and Audio from Pyannote.audio and prepare the wav audio files
- Set up the pipeline with the Whisper large-v2 base model for the ASR task, a 30-second chunk length (based on the Whisper architecture), and the CUDA GPU.
- In the pipeline, the number of speakers can be defined in advance if the minimum, maximum, or exact number is known. As Pyannote.audio supports multi-channel audio diarization, mono='random' or mono='downmix' selects either a random single channel or a new channel down-mixed by averaging all channels. Diarization produces diarized time segments; the audio can then be cropped to those exact segments as waveforms and passed to the Whisper pipeline, with a target language set for decoding.
- Finally, pair the speakers with their corresponding diarized text
pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
import os
import gc
import glob
import torch
from collections import defaultdict
from huggingface_hub import login
from transformers import pipeline
from pyannote.audio import Pipeline, Audio

login(read_token)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# speaker diarization pipeline from pyannote.audio
sd_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token=True)
wav_files = glob.glob(os.path.join(audio_dirpath, '*.wav'))

# Whisper ASR pipeline for transcribing each diarized segment
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    device=device,
)

results = []
for audio_file in wav_files:
    diarization = sd_pipeline(audio_file, min_speakers=min_speakers, max_speakers=max_speakers)
    audio = Audio(sample_rate=16000, mono='random')
    for segment, _, speaker in diarization.itertracks(yield_label=True):
        waveform, sample_rate = audio.crop(audio_file, segment)
        text = pipe({"raw": waveform.squeeze().numpy(), "sampling_rate": sample_rate}, batch_size=8,
                    generate_kwargs={"language": "<|zh|>", "task": "transcribe"})["text"]
        results.append({
            'start': segment.start,
            'stop': segment.end,
            'speaker': speaker,
            'text': text
        })
        del waveform, sample_rate, text
        gc.collect()
        torch.cuda.empty_cache()
    del diarization, audio

# group diarized segments by speaker and join their texts in time order
grouped_results = defaultdict(list)
for result in results:
    grouped_results[result['speaker']].append(result)

final_results = []
for speaker, group in grouped_results.items():
    group.sort(key=lambda x: x['start'])
    text = ' '.join(result['text'] for result in group)
    final_results.append({
        'speaker': speaker,
        'text': text
    })
print(final_results)
b. WhisperX
WhisperX is a system that integrates Whisper, a phoneme-based model (Wav2Vec2), and Pyannote.audio. It is claimed to reach roughly 70x real-time transcription speed with the large-v2 model, and it adds word-level timestamps and speaker diarization with VAD. It uses the faster-whisper backend and requires less than 8 GB of GPU memory.
How to use WhisperX:
- Install dependencies at the specified versions (PyTorch 2.0.0 with a matching CUDA build, as in the commands below)
- Use WhisperX pipeline to get Pyannote diarization model
- Then configure the Whisper model size, GPU usage, computation type, and transcript language. The currently supported languages are {en, fr, de, es, it, ja, zh, nl, uk, pt}. Local audio can be loaded and diarized into transcript segments; after assigning a speaker to each audio segment, words in multi-speaker speech can be attributed to a unique speaker ID.
conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install git+https://github.com/m-bain/whisperx.git
diarize_model = whisperx.DiarizationPipeline(model_name="pyannote/speaker-diarization", use_auth_token='hf_token', device=device)
model = whisperx.load_model(whisper_arch=model, device=device, compute_type=compute_type, language=language_abbr)
audio = whisperx.load_audio(matching_file_path)
diarize_segments = diarize_model(matching_file_path, min_speakers=6, max_speakers=6)
result = model.transcribe(audio, batch_size=batch_size)
result = whisperx.assign_word_speakers(diarize_segments, result)
However, WhisperX is not perfect for such multi-speaker recognition-with-diarization tasks. It can be less accurate when speakers speak on different channels, the diarization can become disordered if the correct number of speakers is not identified, and it performs poorly with overlapping voices and speech containing interjections.
Of course, the transcription functionality of WhisperX can also be used on its own. As the architecture includes Voice Activity Detection and cut & merge features, long audio can be fed to the model, which makes a comparison with the Whisper pipeline in Transformers (for long audio transcription) possible.
device = "cuda"
directory = testaudio_directory
batch_size = 1 # reduce if low on GPU mem
compute_type = "int8" # change to "int8" if low on GPU mem (may reduce accuracy)
model = whisperx.load_model("medium", device, compute_type=compute_type, language="en")
datatest = {
'audio': [load_wav(os.path.join(testaudio_directory, f)) for f in wav_files],
'transcript': [],
}
for file_name in os.listdir(directory):
if file_name.endswith(".wav"):
audio_file = os.path.join(directory, file_name)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
strtext = [seg['text'] for seg in result["segments"]]
strtext = ' '.join(strtext)
datatest['transcript'].append(strtext)
del model
c. WhisperX Results
Dataset | Model / Task / Compute Type | Result |
TED LIUM 1st release (SLR7), test dataset | WhisperX medium, en->en, int8 | WER: 37.041% |
TED LIUM 1st release (SLR7), test dataset | WhisperX large-v2, en->en, int8 | WER: 36.917% |
TED LIUM 1st release (SLR7), test dataset | WhisperX medium, en->en, float16 | WER: 36.906% |
distil-whisper/tedlium-long-form, validation dataset | WhisperX large-v2, en->en, int8, batch size = 1 | WER: 24.651% |
distil-whisper/tedlium-long-form, validation dataset | WhisperX medium, en->en, int8, batch size = 1 | WER: 24.353% |
AISHELL-4, a selected audio file | WhisperX, manual check | CER: 15.6%~24.658% |
The results show little difference in accuracy between model sizes and computation types. The WER results were not much different from the original Whisper model, indicating that WhisperX is a good alternative to the Whisper pipeline. However, this held only for single-channel, clean audio datasets in English and a few other languages; on Chinese datasets the outcomes became unreliable.
Here are several possible reasons:
- WhisperX only supports traditional Chinese transcription, and when using HanziConv, a Python library that converts traditional Chinese to simplified Chinese, we cannot ensure the characters are converted perfectly as expected (see the sketch after this list)
- TED-LIUM is collected from TED talks, which are mostly delivered by a single speaker, reducing the difficulty of multi-speaker transcription. Further investigation on English-language meeting scenarios is needed to test diarization abilities.
- AISHELL-4, unlike AISHELL-1, was collected with an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. It provides realistic acoustics and rich, natural conversational speech characteristics such as short pauses, speech overlap, quick speaker turns, and noise, which makes it difficult to transcribe the correct sentences from the correct speaker. In addition, the meetings usually contain large numbers of interjections that carry no real meaning in Chinese.
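A minimal sketch of the traditional-to-simplified conversion mentioned above, using the hanziconv package (pip install hanziconv); the example string is illustrative and the output should still be spot-checked:
from hanziconv import HanziConv

# convert a traditional Chinese transcript to simplified Chinese
simplified = HanziConv.toSimplified("語音辨識")
print(simplified)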
5. Other Models
Besides Whisper, there exist other competitive models in the ASR field built with different architectures and techniques, including Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Connectionist Temporal Classification (CTC), etc. Several popular models were researched and evaluated with model inference.
a. Meta MMS
Massively Multilingual Speech (MMS) is a project led by Meta (Facebook Research), included in Fairseq (a sequence-to-sequence toolkit). It expands speech technology from around 100 languages to more than 1,100 languages, more than 10 times as many as before, and its language identification models can recognize more than 4,000 spoken languages, 40 times more than before.
MMS uses religious audio and texts, such as the Bible, as a data source because most have been translated into many different languages. MMS is built on pre-trained wav2vec 2.0 models. The researchers claim that MMS halves the WER of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
Several model sizes are available: mms-1b-fl102, mms-1b-l1107, and mms-1b-all, with mms-1b-all being the largest. Unlike Whisper, it uses Wav2Vec2FeatureExtractor and Wav2Vec2CTCTokenizer.
import os
import gc
import torch
from evaluate import load
from huggingface_hub import login
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, AutoProcessor
model_id = "facebook/mms-1b-all"
target_lang = "cmn-script_simplified"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
The target language ids can be found in tokenizer.vocab.keys(); for example, the id for Chinese is cmn-script_simplified. Different language adapter weights can be loaded for different languages via load_adapter(). MMS loads English weights by default, so target_lang=<target-lang> and ignore_mismatched_sizes=True need to be specified when loading other languages. An alternative is to use an ASR pipeline from transformers and set the "model_kwargs" dictionary with the two settings above.
from transformers import pipeline
model_id = "facebook/mms-1b-all"
target_lang = "lang"
pipe = pipeline(model=model_id, model_kwargs={"target_lang": "lang", "ignore_mismatched_sizes": True})
For inference purposes, switch the language adapter away from the default English weights using the following code.
processor.tokenizer.vocab.keys()
processor.tokenizer.set_target_lang(target_lang)
model.load_adapter(target_lang)
model = model.to(device)
The transcription process with the Hugging Face MMS checkpoint follows the Wav2Vec2ForCTC model.
transcriptions = []
for i, item in enumerate(dataset):
    zhcn_sample = item["audio"]["array"]
    inputs = processor(zhcn_sample, sampling_rate=16_000, return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs).logits
    ids = torch.argmax(outputs, dim=-1)[0]
    transcription = processor.decode(ids)
    transcriptions.append(transcription)
    del inputs, outputs, ids, zhcn_sample
    torch.cuda.empty_cache()
    gc.collect()
The model generates logits, which are the raw output scores from the model. After obtaining the logits from the model, it calculates the index of the highest logit value along the last dimension. This effectively finds the predicted token index with the highest probability. The [0] indexing is used to access the only example in the batch. Finally, the predicted token index ids is decoded to original text transcriptions.
b. PaddleSpeech
PaddleSpeech is a Chinese open-source toolkit on the PaddlePaddle platform for a variety of speech and audio tasks, with state-of-the-art and influential models. It provides production-ready streaming ASR and streaming TTS systems, and offers a variety of speech models with different architectures and pre-training data sources.
The recommended OS is Linux, but others are sufficient for certain tasks.
There is also a detailed introduction to the exact datasets, models, decoding and augmentation techniques used in the feature list in the official GitHub repo: https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/asr/feature_list.md
There are several model architectures available in PaddleSpeech.
1) DeepSpeech2
The DeepSpeech2 online model is a modified DeepSpeech2 version. The model is mainly composed of 2D convolution subsampling layers and stacked single-direction RNN layers.
It also has separate vocabularies for English and Chinese data. A technique called Cepstral Mean and Variance Normalization (CMVN) is used: a subset of (or the full) training set is selected and used to compute the feature mean and standard deviation. For feature extraction, the released DeepSpeech2 online model uses a linear method (Fast Fourier Transform without a filter bank). The encoder and decoder architectures are shown below.
The DeepSpeech2 offline (non-streaming) model is similar to the online one; the main difference is that the offline model uses stacked bi-directional RNN layers. The data preparation and decoder architecture are identical.
2) Conformer
The Conformer is a convolution-augmented transformer for speech recognition. It combines Convolution Neural Networks (CNN) and Transformers to improve speech recognition performance in a parameter-efficient way.
The Conformer comprises two feed-forward layers with half-step residual connections, with Multi-Head Self-Attention and Convolution modules in the middle, followed by a final layernorm. By combining CNNs for local feature extraction with transformers for global context understanding, the Conformer model achieves state-of-the-art performance on various speech recognition benchmarks.
3) U2
Unified Streaming and Non-streaming Two-pass End-to-end Model, also known as U2, applies hybrid CTC/attention architecture with dynamic chunk-based attention and streaming CTC decoding. By adjusting the chunk size during inference, the latency of the speech recognition system can be controlled. After the CTC decoder generates n-best hypotheses, the attention decoder is used to rescore these hypotheses and generate the final result. It allows for more efficient and flexible speech recognition, making it suitable for real-time applications and scenarios with variable-length audio input.
4) Usage
The installation depends on how much of PaddleSpeech will be used. The specific version can be chosen on the PaddlePaddle official website (Baidu mirror / Tsinghua mirror): https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/install/conda/windows-conda_en.html
How to use PaddleSpeech:
- Install with PyPI
pip install pytest-runner
pip install paddlespeech
Another approach is to compile source code with commands given on the official website
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install pytest-runner
pip install .
- Now use the PaddleSpeech CLI's ASRExecutor to perform model inference (on multiple files):
from paddlespeech.cli.asr.infer import ASRExecutor
asr = ASRExecutor()
result = asr(audio_file=audio_file)
The default model, conformer_wenetspeech, will be used to transcribe the audio:
import paddle

transcript = []
for audio_file in wav_files:
    result = asr(
        model='conformer_wenetspeech',
        lang='zh',
        sample_rate=16000,
        config=None,
        ckpt_path=None,
        audio_file=audio_file,
        device=paddle.get_device())
    transcript.append(result)
However, executor parameters can also be customized in the asr arguments; the available parameters are shown in Figure 15.
The list of available PaddleSpeech models for ASR inference with target language:
c. SpeechBrain
SpeechBrain is an open-source conversational AI toolkit developed by the Speech and Audio Processing Group at the University of Montreal. It aims to provide a flexible and comprehensive platform for speech-related research, development, and applications. SpeechBrain is PyTorch-based and supports state-of-the-art methods for end-to-end speech recognition, including models based on CTC, CTC+attention, transducers, transformers, and neural language models relying on recurrent neural networks and transformers.
SpeechBrain offers a wide range of functionalities for various speech and audio processing tasks, including ASR. Documentation is available on the official SpeechBrain website.
How to use SpeechBrain:
- SpeechBrain hosts pretrained models on Hugging Face; install SpeechBrain with pip:
pip install speechbrain
- Perform Inference on SpeechBrain Models
from speechbrain.pretrained import EncoderDecoderASR
from speechbrain.pretrained.interfaces import foreign_class  # for models that define a custom inference interface
# Load a transformer ASR model trained on AISHELL-1 and run inference on the GPU
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-aishell",
                                           savedir="pretrained_models/asr-transformer-aishell",
                                           run_opts={"device": "cuda"})
result = asr_model.transcribe_file(audio_file)
d. ESPnet
ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speaker diarization, etc. ESPnet uses PyTorch as the engine and also follows Kaldi style data processing, feature extraction/format, and recipes.
ESPnet is now upgrading to version 2 (ESPnet2), to which most development has shifted. ESPnet2 supports on-the-fly feature extraction and reduces data preparation complexity. It contains various ASR recipes such as hybrid CTC/attention and Transducer-based end-to-end models. The recommended OS is Linux (Ubuntu).
How to use ESPnet:
- ESPnet models can be loaded from espnet_model_zoo to perform model inference. Install espnet_model_zoo with PyPI:
- Then import speech to text module from espnet2.bin.asr_inference
- In Speech2Text, decoding parameters can be customized, e.g. beam size, ctc weight.
- model_id is the name of the ESPnet model on Hugging Face. The rest of the parameters are decoding parameters that are not stored in the model file.
- maxlenratio and minlenratio control the allowed ratio between the length of the input audio and the length of the decoded output text.
- The beam_size parameter sets the width of the beam search, which affects the number of hypotheses kept during decoding. A larger beam size can improve accuracy but also increases computation time.
- The ctc_weight parameter controls the relative weight of the CTC loss during decoding.
- LM (Language Model) is a separate model that can be used to improve ASR accuracy by incorporating language knowledge. The lm_weight parameter sets the relative weight of the language model during decoding.
- The penalty parameter applies a length penalty during decoding. It can be used to discourage longer output sequences.
- The nbest parameter controls the number of hypotheses (output sequences) to return during decoding. Setting nbest = 1 means we will get only the top 1 hypothesis.
- After completing these settings, check that the sampling rate matches the one used in the training corpus before making the speech-to-text call:
pip install espnet_model_zoo
import soundfile
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained(
    model_id,              # name of the ESPnet model on Hugging Face
    maxlenratio=0.0,
    minlenratio=0.0,
    beam_size=20,
    ctc_weight=0.3,
    lm_weight=0.5,
    penalty=0.0,
    nbest=1
)
predictions = []
for file in matching_files:
    speech, rate = soundfile.read(file)   # rate should match the training corpus (e.g. 16 kHz)
    nbests = speech2text(speech)
    text, *_ = nbests[0]                  # take the top hypothesis
    predictions.append(text)
print(predictions)
e. Baseline Results
The baseline tests primarily investigate Chinese transcription accuracy. Whisper is used as the base reference for the other models.
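The WER/CER numbers in the tables below are obtained by comparing model predictions with reference transcripts. A minimal sketch of one way to compute CER, using the Hugging Face evaluate library (not necessarily the exact scripts used for these tables; the strings are toy examples):
import evaluate

cer_metric = evaluate.load("cer")
predictions = ["今天天气很好"]   # model outputs
references = ["今天天气真好"]    # ground-truth transcripts
cer = cer_metric.compute(predictions=predictions, references=references)
print(f"CER: {100 * cer:.3f}%")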
English test results (WER):
| Test Dataset | Model / Method | WER |
| librispeech_asr clean | Meta MMS mms-1b-all | 4.331% |
| mozilla-foundation/common_voice_13_0 (1000 utterances) | Meta MMS mms-1b-all | 23.963% |

Chinese test results (CER):
| Test Dataset | Model / Method | CER |
| Aishell S0770 (353 utterances) | PaddleSpeech default (conformer_u2pp_online_wenetspeech), decode_method: attention_rescoring | 4.062% |
| Aishell S0768 (367 utterances) | PaddleSpeech default (conformer_u2pp_online_wenetspeech), decode_method: attention_rescoring | 10.322% |
| Aishell S0768 (367 utterances) | SpeechBrain wav2vec2-transformer-aishell | 8.436% |
| Aishell S0768 (367 utterances) | Meta MMS mms-1b-all | 34.241% |
| Aishell S0768 (367 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (maxlenratio=0, minlenratio=0, beam_size=20, ctc_weight=0.3, lm_weight=0.5, penalty=0.0, nbest=1) | 11.084% |
| MagicData 38_5837 (585 utterances) | Meta MMS mms-1b-all | 43.296% |
| MagicData 38_5837 (585 utterances) | PaddleSpeech default (conformer_u2pp_online_wenetspeech), decode_method: attention_rescoring | 30.422% |
| MagicData 38_5837 (585 utterances) | SpeechBrain wav2vec2-transformer-aishell | 32.852% |
| MagicData 38_5837 (585 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (lm_weight=0.5; other parameters as above) | 55.324% |
| MagicData 38_5837 (585 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (lm_weight=0; other parameters as above) | 52.878% |
| MagicData 4 speakers (2372 utterances) | Meta MMS mms-1b-all | 34.511% |
| MagicData 4 speakers (2372 utterances) | PaddleSpeech conformer-wenetspeech, decode_method: attention_rescoring | 9.79% |
| MagicData 4 speakers (2372 utterances) | PaddleSpeech conformer-aishell, decode_method: attention_rescoring | 23.135% |
| MagicData 4 speakers (2372 utterances) | SpeechBrain wav2vec2-transformer-aishell | 23.728% |
| MagicData 4 speakers (2372 utterances) | SpeechBrain wav2vec2-ctc-aishell | 15.911% |
| MagicData 4 speakers (2372 utterances) | SpeechBrain transformer-aishell | 26.166% |
| MagicData 4 speakers (2372 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (lm_weight=0.5; other parameters as above) | 38.697% |
| MagicData 4 speakers (2372 utterances) | ESPnet Emiru Tsunoo/aishell_asr streaming (lm_weight=0; other parameters as above) | 36.779% |
For the English inference results, Meta MMS had similar transcript accuracy to Whisper.
For the Chinese inference results, PaddleSpeech performed better than Whisper. While conformer-aishell with attention rescoring and the Whisper large-v2 baseline have similar CER results, conformer-wenetspeech, which uses WeNetSpeech as its training source, performed better on the MagicData test datasets. Among the SpeechBrain models, wav2vec2-ctc-aishell appeared to perform best on unseen data, while the other models performed similarly to the Whisper large-v2 baseline. Meta MMS Chinese transcription results were worse than Whisper's, and the ESPnet models were the least accurate.
6. Azure Speech Studio
Azure AI Speech Services is a collection of cloud-based speech-related services offered by Microsoft Azure. These services enable developers to integrate various speech capabilities into their applications, including speech recognition, text-to-speech, and speech translation. These services leverage advanced machine learning algorithms to provide accurate and natural language processing functionalities.
Speech Studio in Azure AI is a set of UI-based tools for building and integrating features from the Azure Speech service. Custom Speech projects in Speech Studio can be created in different languages. Endpoints are then available for deployment using the Speech SDK, Speech CLI, or REST APIs. There are mainly four sections in a Custom Speech project: Speech datasets for uploading datasets, Train custom models with uploaded training datasets, Test models with uploaded test datasets, and Deploy models, which creates endpoints for the customized (trained) models.
a. Upload datasets
There are three methods for uploading training and testing datasets for Custom Speech: direct upload in Speech Studio, the REST API, and the CLI.
1) Speech Studio
For direct upload, prepare the data in local directories and follow the upload steps in Speech Studio.
There are options for choosing the dataset location: either a local file or a remote location such as an Azure Blob URL. For local files, simply provide the path of the directory on the local device. However, to ensure maximum security of dataset files through a trusted Azure services security mechanism, Azure Blob is a good option.
Azure Blob Storage is a scalable and cost-effective cloud-based object storage service. It is designed to store and manage unstructured data, such as documents, images, audio files, videos, backups, logs, and more. Azure Blob Storage provides a reliable and secure way to store massive amounts of data, making it an essential component for many cloud-based applications and services.
How to use Azure Blob:
- Install the official Python Azure Blob Storage library
- Import the clients from the library
- Then upload the zipped files to Azure Blob
- Execute the function in the Python script with the detailed storage information
pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

def upload_zip_to_azure_blob(account_name, account_key, container_name, local_zip_path, zip_blob_name):
    try:
        connection_string = (
            f"DefaultEndpointsProtocol=https;AccountName={account_name};"
            f"AccountKey={account_key};EndpointSuffix=core.windows.net"
        )
        blob_service_client = BlobServiceClient.from_connection_string(connection_string)
        container_client = blob_service_client.get_container_client(container_name)
        if not container_client.exists():
            container_client.create_container()      # create the container on first use
        zip_blob_client = container_client.get_blob_client(zip_blob_name)
        with open(local_zip_path, "rb") as zip_file:
            zip_blob_client.upload_blob(zip_file)
        print("Zip file uploaded successfully!")
    except Exception as e:
        print(f"Error uploading the zip file: {e}")

if __name__ == "__main__":
    storage_account_name = "storage"       # placeholder values, replace with real storage details
    storage_account_key = "key"
    container_name = "container"
    local_zip_file_path = "local_path"
    zip_blob_name = "data.zip"
    upload_zip_to_azure_blob(storage_account_name, storage_account_key, container_name,
                             local_zip_file_path, zip_blob_name)
At this point the desired zipped file has been uploaded to the container in Azure Blob Storage under the specified container name. Click on the file; the URL attribute can be seen at the top of Properties in the Overview, along with other attributes such as content type, creation time and modified time. Copy this URL and paste it into the required field to perform the secure upload.
2) REST
Unlike Speech Studio, with the Speech to Text REST API it is not necessary to choose whether a dataset is for testing or training at upload time; the dataset kind is determined by the data format.
The Speech to Text REST API v3.1 is used here. This version covers several features (Datasets, Endpoints, Evaluations, Models, Projects, etc.) and provides GET, POST, DELETE and PATCH methods.
How to use REST API:
- First, we need a request URL, headers and a request body.
- The next step is to dump the JSON body and create an HTTPS connection. Upload the dataset through the POST method; the exception handler will report whether an HTTPS error occurred.
- Finally, check on Speech Studio whether the dataset was successfully uploaded, and pay attention to upload failures or an expected process that does not show up.
To find the right resource, write the name of the resource in the request URL and put the subscription key for the resource in the headers. Name the uploaded dataset and give a description in the request body. By specifying a project with a unique project id, the dataset can be uploaded directly into that project. If the uploaded file is from an Azure Blob Storage container, its location can be specified in the content URL. A sketch of these pieces is given below.
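A hedged sketch of the headers and request body referenced above (field names follow the v3.1 Datasets API as I understand it; all values are placeholders to be replaced with real resource details):
import json
import http.client

headers = {
    "Ocp-Apim-Subscription-Key": "your_speech_resource_key",   # subscription key of the Speech resource
    "Content-Type": "application/json",
}
request_body = {
    "kind": "Acoustic",                      # audio + human-labeled transcript dataset
    "displayName": "magicdata-train",        # hypothetical dataset name
    "description": "Uploaded via REST API",
    "locale": "zh-CN",
    "contentUrl": "https://<account>.blob.core.windows.net/<container>/data.zip",
    "project": {"self": "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/projects/<project-id>"},
}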
json_body = json.dumps(request_body)
try:
    # Replace 'website.com' with the host of your Speech resource endpoint
    conn = http.client.HTTPSConnection('website.com')
    conn.request("POST", "/speechtotext/v3.1/datasets", json_body, headers)
    response = conn.getresponse()
    data = response.read()
    print(data)
    conn.close()
except Exception as e:
    print(f"Dataset upload request failed: {e}")
3) Format
There are restrictions on the data to upload. Usually the data composition determines whether it can be used for training or testing; it is best to include the audios and their transcripts together.
Audio files should be in WAV format, with a sampling rate of 8 kHz or 16 kHz, single channel. The maximum audio length also differs between testing and training data. The archive should be a ZIP file under 2 GB containing at most 10,000 files. A small conversion sketch follows.
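A minimal sketch of converting an arbitrary audio file to 16 kHz mono WAV before upload (assuming librosa and soundfile are installed; the helper name and file names are hypothetical):
import librosa
import soundfile as sf

def to_16k_mono_wav(src_path, dst_path):
    audio, _ = librosa.load(src_path, sr=16000, mono=True)   # resample and downmix to mono
    sf.write(dst_path, audio, 16000, subtype="PCM_16")       # write 16-bit PCM WAV

to_16k_mono_wav("speech01.mp3", "speech01.wav")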
The transcriptions for all WAV files are contained in a single plain-text file (.txt or .tsv). Each line of the transcription file contains the name of one of the audio files, followed by the corresponding transcription. The file name and transcription are separated by a tab.
For different languages, the texts should also be normalized with a defined recipe.
Take zh-CN (Chinese) for example: human-labeled transcriptions for Mandarin Chinese audio must be UTF-8 encoded with a byte-order marker. Avoid half-width punctuation characters, and write out abbreviations in spoken form. A minimal example of the transcript layout follows.
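A sketch of the expected transcript file layout (hypothetical file names and sentences; each line is the audio file name, a tab, then the transcription):
speech01.wav	今天天气真好
speech02.wav	我们明天上午开会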
Additionally, the Editor helps edit and combine uploaded datasets within the same project. It can automatically select audios of certain lengths and manually export a subset as a new dataset. Quality and correctness can also be checked by playing each audio with its corresponding transcript shown beside it. Quantity shows the total audio duration of a dataset.
b. Train models
The training process is much easier than the local one: there is no need to write fine-tuning code for the Azure models; we only feed in speech data.
Name the model and choose a baseline model as a starting point, then select one or more datasets from the Speech Studio datasets. Training often takes hours to complete, sometimes even a day.
c. Test models
After training, inspect the models or compare their error rates using a single test dataset, which should also exist among the uploaded datasets.
For inspection, select two models, either customized or baseline. Inspection works on audio-only datasets, and its purpose is to check audio quality.
For evaluation with error rates, select two (customized/baseline) models. The evaluation follows the WER calculation logic; the error rate and the numbers of insertions, substitutions and deletions for each model can be seen in Speech Studio. In addition, the original labels, lexical transcripts (the raw model output) and normalized transcripts are shown in detail under the table in Figure 23, and the error rate of every single audio is visible beside these texts.
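For reference, the standard definition underlying this evaluation is:
WER = (S + D + I) / N
where S, D and I are the numbers of substitutions, deletions and insertions against the reference, and N is the number of words in the reference (characters in the case of CER).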
d. Deploy models
We can deploy models to applications to integrate with other features, or simply redo the evaluation in local scripts. Deploying a model creates an endpoint.
How to deploy a model in Azure Speech Studio:
- First, install Speech SDK
- Import config and recognizer from Azure Speech SDK
- To perform evaluations locally with the Azure model, set up configurations with the subscription key, service region and generated endpoint. Then use the method recognize_once() to transcribe the audio files in the local directory.
pip install azure-cognitiveservices-speech
import os
from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig

# Configure the service with the subscription key, region and the endpoint generated on deployment
speech_config = SpeechConfig(subscription=subscription_key, region=service_region)
speech_config.speech_recognition_language = "zh-CN"
speech_config.endpoint_id = endpoint_id      # endpoint id of the deployed custom model

predictions = []
for root, _, files in os.walk(wav_base_path):
    for file_name in files:
        if file_name.endswith(".wav") and file_name in appeared_filenames:
            audio_file_path = os.path.join(root, file_name)
            audio_config = AudioConfig(filename=audio_file_path)
            speech_recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
            result = speech_recognizer.recognize_once()   # single-utterance recognition
            predictions.append(result.text)
print(len(predictions))
e. Results
Abbreviations: Aishell = AISHELL-1; CV13 = mozilla-foundation/common_voice_13_0; Fleurs = google/fleurs.
| Test Dataset (Split / Size / Duration) | Train Datasets (Split & Duration) | Error Rate (Custom / Baseline) |
| MagicData (9452 utterances, 11:27:39) | Aishell (12+ hrs) | 4.69% / 4.24% |
| MagicData (9452 utterances, 11:27:39) | Aishell + Minds14 (32+ hrs : 1+ hr) | 4.67% / 4.23% |
| Aishell + MagicData + ST-CMDS (5:4:1, 6105 utterances, 10:53:49) | Aishell + Minds14 + ST-CMDS (15 hrs : 1+ hr : 15+ hrs) | 3.51% / 3.20% |
| MagicData + Aishell + CV13 (15:13:7, 8721 utterances, 11:45:52) | Aishell + CV13 (8+ hrs : 7+ hrs) | 2.51% / 3.70% |
| MagicData + Aishell + CV13 (15:13:7, 8721 utterances, 11:45:52) | Aishell + CV13 + Fleurs (8+ hrs : 7+ hrs : 9+ hrs) | 2.48% / 3.70% |
The results showed that Minds14 is not ideal for ASR tasks, as its original purpose is intent detection. While a large amount of test data might reduce the randomness in audio quality and accuracy, a mixture of multiple test data sources with reasonable proportions may be more insightful. The best Azure model so far was trained with AISHELL-1, mozilla-foundation/common_voice_13_0 and google/fleurs, resulting in a 2.48% error rate.
7. Prospect
In this project, audio sources in English and Chinese datasets were investigated, and Whisper models were fine-tuned mainly on these two languages. While English audio data sources have always been sufficient for training purposes, Chinese sources that are both available and of high transcript quality are far fewer. Additionally, Chinese audio data are often organized by speaker, which suggests that mixing different speakers might resolve potential speaker-bias issues.
Because of hardware computing resource limits, full fine-tuning sometimes produces CUDA OOM errors on a single GPU. A better, larger-weight model could very likely be obtained if multi-GPU training or a more advanced GPU (e.g., the NVIDIA 40 series) were available.
A research point that could be stretched further is the effect of LoRA configurations on Parameter-Efficient Fine-tuned model performance. Beyond that, trying different optimizers and applying data augmentation are also possible improvement strategies. If the Linux environment issues are overcome, training other models for a specific language is also a promising way to enhance performance (e.g., PaddleSpeech models for Chinese; Meta MMS and SpeechBrain for English).
In the Speaker Diarization field, Pyannote.audio with Whisper integration has proven its potential. While the transcript accuracy has improved to be close to that of pure Whisper, the current diarization ability of the relevant models in multi-speaker meeting scenarios is still not sufficient for multi-speaker speech recognition support.
In the context of Azure Speech Services, the most important rule is to keep good audio quality and word-level accuracy in transcripts. While adding variety to audio sources is important, filtering out training audio files that are of poor quality or too short to carry identifiable meaning can also potentially enhance model performance.
8. References
[1] Anaconda, Inc. (2017). Command reference - conda 23.7.3.dev30 documentation. conda.io/projects/conda/en/latest/commands
[2] OpenAI (2022, September 21). Introducing Whisper. openai.com/research/whisper.
[3] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, & Ilya Sutskever (2022). Robust Speech Recognition via Large-Scale Weak Supervision.
[4] Wikipedia contributors. (2023). Word error rate. In Wikipedia, The Free Encyclopedia.
[5] Gandhi, Sanchit (2022, November 3). Fine-Tune Whisper for Multilingual ASR with 🤗 Transformers. Hugging Face, Inc. huggingface.co/blog/fine-tune-whisper.
[6] The Linux Foundation (2023). Previous PyTorch Versions | PyTorch. pytorch.org/get-started/previous-versions
[7] Hugging Face, Inc. (2023). Hugging Face - Documentations. huggingface.co/docs
[8] Vaibhav Srivastav (2023). fast-whisper-finetuning, GitHub repository. github.com/Vaibhavs10/fast-whisper-finetuning
[9] Mangrulkar, Sourab, and Sayak Paul (2023, February 10). Parameter-Efficient Fine-Tuning Using 🤗 PEFT. Hugging Face, Inc. huggingface.co/blog/peft
[10] Bredin, H., Yin, R., Coria, J., Gelly, G., Korshunov, P., Lavechin, M., Fustes, D., Titeux, H., Bouaziz, W., & Gill, M.P. (2020). pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing.
[11] Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. INTERSPEECH 2023.
[12] Meta AI (2023, May 22). Introducing speech-to-text, text-to-speech, and more for 1,100+ languages. ai.meta.com/blog/multilingual-model-speech-recognition
[13] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, & Michael Auli (2023). Scaling Speech Technology to 1,000+ Languages. arXiv.
[14] Zhang, H., et al. (2022). PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations. Association for Computational Linguistics.
[15] Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, & Yoshua Bengio. (2021). SpeechBrain: A General-Purpose Speech Toolkit.
[16] Gao, D., Shi, J., Chuang, S.P., Garcia, L., Lee, H.y., Watanabe, S., & Khudanpur, S. (2022). EURO: ESPnet Unsupervised ASR Open-source Toolkit. arXiv preprint arXiv:2211.17196.
[17] ESPnet (2021). espnet_model_zoo, GitHub repository. github.com/espnet/espnet_model_zoo
[18] Eric-Urban (2023, August 2). Custom Speech overview - Speech service - Azure AI services. Microsoft Learn. learn.microsoft.com/en-us/azure/ai-services/speech-service/custom-speech-overview
[19] Microsoft (2023). Speech service documentation - Tutorials, API Reference - Azure AI services. Microsoft Learn. learn.microsoft.com/en-us/azure/ai-services/speech-service/