Boxuan Lyu - Tokyo Institute of Technology
Table of Contents
- 1. Introduction
- 2. Methodology
- 2.1 Dataset
- 2.2 Model Architecture
- 2.3 Training Process
- 2.4 Evaluation Methods
- 3. Results and Discussion
- 3.1 Training Process Analysis
- 4. Conclusions and future work
- 5. References
1. Introduction
What is Text-to-Speech?
Text-to-Speech (TTS) technology has emerged as a cornerstone of human-computer interaction, enabling machines to convert written text into natural-sounding speech. This technology has witnessed significant advancements in recent years, driven by the rapid progress in deep learning and artificial intelligence. TTS systems have found widespread applications across various domains, including intelligent assistants, accessible reading solutions, navigation systems, and automated customer service, providing a more intuitive and user-friendly interface for information access and communication.
The development of TTS systems typically involves three main steps: text analysis, speech synthesis, and sound output[1-3]. Modern TTS technology leverages deep learning algorithms to generate increasingly natural and expressive speech, with some advanced systems even capable of mimicking the voice characteristics of specific individuals. This level of sophistication has opened up new possibilities for personalized and context-aware speech synthesis.
Why Mandarin?
Among the world's languages, Mandarin Chinese stands out as the most widely spoken, with over a billion speakers primarily concentrated in China, Taiwan, and Hong Kong. As such, the development of high-quality Mandarin TTS systems holds immense potential for improving communication and accessibility for a significant portion of the global population. However, Mandarin presents unique challenges for TTS systems due to its tonal nature and complex linguistic structure, making it an interesting and important area of research in speech synthesis.
What is Bert-VITS2?
In this study, we focus on building a fast and natural TTS system for Mandarin Chinese, specifically tailored for meeting scenarios. Our objective is to create a system that can generate clear, expressive, and context-appropriate speech that closely mimics natural human speech in professional settings. To achieve this goal, we have leveraged the Bert-VITS2 framework, a state-of-the-art approach that combines the power of pre-trained language models with advanced voice synthesis techniques and is built on VITS2 [4].
The Bert-VITS2 framework represents a significant advancement in TTS technology. It incorporates a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model [5], which enables a deep understanding of the input text's semantic and contextual nuances. This understanding is crucial for generating speech with appropriate intonation, stress, and emotion. The framework also employs a GAN (Generative Adversarial Network) [6] style training approach, which aims to produce highly realistic speech by pitting a generator against a discriminator in an adversarial learning process.
Our research contributes to the field of Mandarin TTS in several ways:
- We provide a comprehensive evaluation of the Bert-VITS2 framework for Mandarin TTS, specifically in the context of meeting scenarios.
- We explore the challenges and solutions in training a complex TTS model, including issues such as mode collapse and the selection of appropriate training data.
- We offer insights into the fine-tuning process of large language models for specific TTS applications, which can be valuable for researchers and practitioners in the field.
The remainder of this paper is structured as follows: Section 2 details our methodology, including the dataset selection, model architecture, training process, and evaluation methods. Section 3 presents and discusses our results. Section 4 concludes the paper and outlines directions for future work.
Through this research, we aim to contribute to the ongoing efforts to improve the quality and applicability of Mandarin TTS systems, ultimately working towards more natural and effective human-computer interaction in Mandarin-speaking contexts.
2. Methodology
This chapter details our approach to developing a Mandarin Chinese Text-to-Speech (TTS) system using the Bert-VITS2 framework. We describe the dataset selection process, the model architecture, and the training methodology employed in our study. We used the Bert-VITS2 code (https://github.com/fishaudio/Bert-VITS2) directly without any changes, apart from the naming of some paths.
2.1 Dataset
2.1.1 Dataset Selection
The choice of an appropriate dataset is crucial for training a high-quality TTS system. Initially, we considered the Alimeeting dataset [7] (https://www.openslr.org/119/), which contains 118.75 hours of speech data recorded from more than 450 speakers. However, our preliminary experiments revealed that this dataset was not suitable for our TTS task: fine-tuning on Alimeeting resulted in the model generating either meaningless or blank audio. We hypothesize that this issue stemmed from the dataset's poor transcription quality and the low average amount of audio per speaker.
After careful consideration, we selected the AISHELL-3 [8] (https://www.openslr.org/93/) dataset for our study. AISHELL-3 is a high-quality, multi-speaker Mandarin speech corpus specifically designed for speech synthesis tasks. It consists of approximately 85 hours of audio recorded by 218 speakers, or roughly 23 minutes of audio per speaker on average. The dataset's high transcription quality and substantial per-speaker audio duration make it well suited for our TTS task.
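For reference, the per-speaker figures behind this choice follow directly from the corpus statistics quoted above; the short calculation below simply restates those published numbers rather than reporting new measurements.

# Rough per-speaker duration comparison, using the corpus statistics quoted above.
corpora = {
    "Alimeeting": (118.75, 450),  # hours, speakers ("more than 450", so this is an upper bound per speaker)
    "AISHELL-3": (85.0, 218),     # hours, speakers
}
for name, (hours, speakers) in corpora.items():
    print(f"{name}: ~{hours * 60 / speakers:.1f} minutes of audio per speaker")
# Alimeeting: ~15.8 minutes of audio per speaker (and much of it is overlapped meeting speech)
# AISHELL-3:  ~23.4 minutes of audio per speaker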
2.1.2 Data Preprocessing
To prepare the AISHELL-3 dataset for training, we followed the preprocessing guidelines provided in the Bert-VITS2 repository, specifically the webui_preprocess.py script. The preprocessing steps included:
- Audio segmentation: Splitting long audio files into shorter segments to facilitate training.
- Text normalization: Converting Chinese characters to pinyin and applying text cleaning techniques.
- Feature extraction: Generating mel-spectrograms and other relevant features from the audio files.
- Speaker embedding: Creating speaker embeddings to enable multi-speaker synthesis.
Fortunately, this seemingly cumbersome series of preprocessing steps is handled by a single interactive script, whose interface is shown in Figure 2. Following the instructions in https://github.com/fishaudio/Bert-VITS2/blob/master/webui_preprocess.py, we simply run:
python ./webui_preprocess.py
Then the interactive web UI appears:
It is also important to follow exactly the directory hierarchy described in the web UI when placing the dataset (we show below how AISHELL-3 can be converted into this layout):
"## Prepare in advance :\\n"
"download BERT and WavLM checkpoints:\\n"
"- [Chinese RoBERTa](<https://huggingface.co/hfl/chinese-roberta-wwm-ext-large>)\\n"
"- [Japanese DeBERTa](<https://huggingface.co/ku-nlp/deberta-v2-large-japanese-char-wwm>)\\n"
"- [English DeBERTa](<https://huggingface.co/microsoft/deberta-v3-large>)\\n"
"- [WavLM](<https://huggingface.co/microsoft/wavlm-base-plus>)\\n"
"\\n"
"Place the BERT model in the `bert` folder and the WavLM model in the `slm` folder, overwriting the folders with the same names.\\n"
"\\n"
"Data preparation:\\n"
"Place the data in the data folder, organized as follows: \\n"
"\\n"
"```\\n"
"├── data\\n"
"│ ├── {Your datasets name}\\n"
"│ │ ├── esd.list\\n"
"│ │ ├── raw\\n"
"│ │ │ ├── ****.wav\\n"
"│ │ │ ├── ****.wav\\n"
"│ │ │ ├── ...\\n"
"```\\n"
"\\n"
"The `raw` folder contains all the audio files, and the `esd.list` file contains the tag text in the format \\n"
"\\n"
"```\\n"
"****.wav|{speaker}|{language ID}|{text label}\\n"
"```\\n"
"\\n"
"such as:\\n"
"```\\n"
"vo_ABDLQ001_1_paimon_02.wav|派蒙|ZH|没什么没什么,只是平时他总是站在这里,有点奇怪而已。\\n"
"noa_501_0001.wav|NOA|JP|そうだね、油断しないのはとても大事なことだと思う\\n"
"Albedo_vo_ABDLQ002_4_albedo_01.wav|Albedo|EN|Who are you? Why did you alarm them?\\n"
"...\\n"
"```\\n"
2.2 Model Architecture
Our TTS system is based on the Bert-VITS2 framework, which combines the strengths of pre-trained language models with advanced voice synthesis techniques. How the data flows through the individual modules during forward and backward propagation can be traced in https://github.com/fishaudio/Bert-VITS2/blob/master/train_ms.py. The main components of the model include:
2.2.1 TextEncoder
The TextEncoder is responsible for processing the input text and extracting relevant linguistic features. It incorporates a pre-trained BERT model, which enables the system to capture deep semantic and contextual information from the input text. This enhanced text understanding is crucial for generating speech with appropriate prosody and emotion.
2.2.2 DurationPredictor and StochasticDurationPredictor
The DurationPredictor estimates the duration of each phoneme in the input text. The StochasticDurationPredictor adds a level of randomness to these predictions, allowing for more natural variations in speech timing. This stochastic element helps prevent the generated speech from sounding monotonous or robotic.
2.2.3 Flow
The Flow component uses normalizing flows, a type of invertible neural network, to transform the latent representation and capture fine-grained prosodic characteristics of speech, such as pitch and energy patterns.
2.2.4 Decoder
The Decoder is the final component that synthesizes the speech waveform based on the information provided by the previous components. It takes into account the linguistic features, predicted durations, and prosodic information to generate the final audio output.
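To make the interplay of these components concrete, the sketch below mirrors the generator's inference-time data flow using toy stand-in modules. Only the hand-off between the text encoder, duration predictor, flow, and decoder is meant to be illustrative; the real module implementations (and their exact interfaces) live in models.py of the Bert-VITS2 repository.

# Highly simplified sketch of the generator's inference-time data flow (toy stand-ins only).
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    def __init__(self, n_phones=100, hidden=192, hop=512):
        super().__init__()
        self.text_encoder = nn.Embedding(n_phones, hidden)   # stands in for TextEncoder (phones + BERT features)
        self.duration_predictor = nn.Linear(hidden, 1)        # stands in for the (Stochastic)DurationPredictor
        self.flow = nn.Linear(hidden, hidden)                 # stands in for the normalizing-flow module
        self.decoder = nn.Linear(hidden, hop)                 # stands in for the HiFi-GAN-style Decoder

    @torch.no_grad()
    def forward(self, phone_ids):
        h = self.text_encoder(phone_ids)                                   # (T_text, hidden) linguistic features
        dur = self.duration_predictor(h).exp().round().clamp(min=1).long().squeeze(-1)
        h = torch.repeat_interleave(h, dur, dim=0)                         # length regulation: phones -> frames
        z = self.flow(h)                                                   # latent frames shaped by the flow
        return self.decoder(z).reshape(-1)                                 # upsample frames to a waveform

gen = ToyGenerator()
waveform = gen(torch.randint(0, 100, (12,)))   # 12 phones -> a short waveform
print(waveform.shape)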
2.3 Training Process
Our training process follows a GAN-style approach, where we simultaneously train a generator (the TTS model) and a discriminator. This adversarial training scheme aims to produce more realistic and natural-sounding speech.
2.3.1 Loss Functions
The training process involves optimizing multiple loss functions:
Reconstruction Loss: Ensures that the generated speech closely matches the ground truth.
Duration Loss: Minimizes the difference between predicted and actual phoneme durations.
Adversarial Loss: Encourages the generator to produce speech that can fool the discriminator.
Feature Matching Loss: Aligns the intermediate features of the generated and real speech in the discriminator.
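As a rough illustration of how these terms are combined on the generator side in VITS-style training, consider the sketch below. The weight c_mel = 45 appears in the config shown in Section 2.3.3; the exact definition of each term (and the additional KL term weighted by c_kl) should be checked against the training code in the Bert-VITS2 repository.

# Illustrative combination of the generator-side losses (simplified; not the repository's exact code).
import torch
import torch.nn.functional as F

def generator_loss_terms(mel_fake, mel_real, logw_hat, logw, disc_fake_outs, feats_fake, feats_real,
                         c_mel=45.0):
    # Reconstruction loss: L1 distance between generated and ground-truth mel-spectrograms.
    loss_mel = F.l1_loss(mel_fake, mel_real) * c_mel
    # Duration loss: squared error between predicted and target log-durations.
    loss_dur = torch.sum((logw_hat - logw) ** 2)
    # Adversarial loss (least-squares GAN): push discriminator scores on generated audio towards 1.
    loss_adv = sum(torch.mean((1.0 - d) ** 2) for d in disc_fake_outs)
    # Feature-matching loss: L1 distance between discriminator features for real and generated audio.
    loss_fm = sum(F.l1_loss(ff, fr) for ff, fr in zip(feats_fake, feats_real))
    # The full objective also includes a KL term (weighted by c_kl) from the variational model.
    return loss_mel + loss_dur + loss_adv + loss_fm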
2.3.2 Training Strategy
To mitigate the risk of mode collapse, a common issue in GAN training where the generator produces limited varieties of outputs, we employed the following strategies:
Gradient Penalty: Applied to the discriminator to enforce Lipschitz continuity, stabilizing the training process.
Spectral Normalization: Used in both the generator and discriminator to constrain the Lipschitz constant of the networks.
Progressive Training: Gradually increasing the complexity of the generated samples during training.
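For reference, the first two regularizers named above can be written compactly in PyTorch. The snippet below is a generic sketch of the techniques themselves and is not taken from the Bert-VITS2 code.

# Generic sketches of spectral normalization and a gradient penalty (illustration only).
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization: wrap a layer so that the spectral norm of its weight is kept close to 1.
sn_conv = spectral_norm(nn.Conv1d(128, 128, kernel_size=3, padding=1))

def gradient_penalty(discriminator, real, fake):
    # WGAN-GP-style penalty: keep the discriminator's gradient norm near 1 at interpolated samples.
    # real/fake: (batch, channels, time) tensors such as waveforms or spectrograms.
    alpha = torch.rand(real.size(0), 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()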
2.3.3 Hyperparameters
We made several customizations to the official hyperparameters to suit our specific setup:
Batch Size: Set to 20 due to hardware constraints (single RTX 4090 GPU).
Learning Rate: Reduced to accommodate the smaller batch size.
Precision: Training conducted at bfloat16 precision to balance between training speed and numerical stability.
These hyperparameter settings are written to the corresponding config file; the configuration we used is shown below:
{
"train": {
"log_interval": 200,
"eval_interval": 1000,
"seed": 42,
"epochs": 100,
"learning_rate": 0.00001,
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"batch_size": 20,
"bf16_run": true,
"lr_decay": 0.99995,
"segment_size": 16384,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0,
"c_commit": 100,
"skip_optimizer": true,
"freeze_ZH_bert": false,
"freeze_JP_bert": true,
"freeze_EN_bert": true,
"freeze_emo": false
},
"data": {
"training_files": "data/train/train.list",
"validation_files": "data/train/val.list",
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"filter_length": 2048,
"hop_length": 512,
"win_length": 2048,
"n_mel_channels": 128,
"mel_fmin": 0.0,
"mel_fmax": null,
"add_blank": true,
"n_speakers": 174,
"cleaned_text": true,
"spk2id": {
"SSB0043": 0,
"SSB0018": 1,
"SSB0057": 2,
"SSB0080": 3,
"SSB0033": 4,
"SSB0016": 5,
"SSB0009": 6,
"SSB0012": 7,
"SSB0385": 8,
"SSB0038": 9,
"SSB0011": 10,
"SSB0073": 11,
"SSB0005": 12,
"SSB0382": 13,
"SSB0394": 14,
"SSB0393": 15,
"SSB0395": 16,
"SSB0380": 17,
"SSB0375": 18,
"SSB0366": 19,
"SSB0379": 20,
"SSB0323": 21,
"SSB0316": 22,
"SSB0309": 23,
"SSB0299": 24,
"SSB0315": 25,
"SSB0307": 26,
"SSB0288": 27,
"SSB0287": 28,
"SSB0261": 29,
"SSB0267": 30,
"SSB0273": 31,
"SSB0246": 32,
"SSB0241": 33,
"SSB0200": 34,
"SSB0197": 35,
"SSB0193": 36,
"SSB0578": 37,
"SSB1437": 38,
"SSB1630": 39,
"SSB0590": 40,
"SSB0544": 41,
"SSB0710": 42,
"SSB0570": 43,
"SSB0919": 44,
"SSB1408": 45,
"SSB1392": 46,
"SSB1204": 47,
"SSB1320": 48,
"SSB0700": 49,
"SSB0342": 50,
"SSB1136": 51,
"SSB1383": 52,
"SSB1050": 53,
"SSB0588": 54,
"SSB1221": 55,
"SSB0723": 56,
"SSB1138": 57,
"SSB1072": 58,
"SSB0751": 59,
"SSB0338": 60,
"SSB1891": 61,
"SSB0915": 62,
"SSB0427": 63,
"SSB1831": 64,
"SSB1008": 65,
"SSB0935": 66,
"SSB1218": 67,
"SSB1918": 68,
"SSB0778": 69,
"SSB1203": 70,
"SSB0794": 71,
"SSB1431": 72,
"SSB0435": 73,
"SSB1055": 74,
"SSB0913": 75,
"SSB1806": 76,
"SSB1393": 77,
"SSB1878": 78,
"SSB1108": 79,
"SSB1593": 80,
"SSB1846": 81,
"SSB1064": 82,
"SSB1385": 83,
"SSB0851": 84,
"SSB0817": 85,
"SSB1624": 86,
"SSB0599": 87,
"SSB1024": 88,
"SSB0887": 89,
"SSB0594": 90,
"SSB1555": 91,
"SSB1575": 92,
"SSB1759": 93,
"SSB1377": 94,
"SSB0720": 95,
"SSB1091": 96,
"SSB1650": 97,
"SSB0863": 98,
"SSB1935": 99,
"SSB1711": 100,
"SSB1100": 101,
"SSB0614": 102,
"SSB1670": 103,
"SSB1607": 104,
"SSB0762": 105,
"SSB0426": 106,
"SSB0871": 107,
"SSB0354": 108,
"SSB0339": 109,
"SSB0341": 110,
"SSB0786": 111,
"SSB0784": 112,
"SSB0748": 113,
"SSB0746": 114,
"SSB0760": 115,
"SSB0758": 116,
"SSB0780": 117,
"SSB0686": 118,
"SSB0737": 119,
"SSB0671": 120,
"SSB0668": 121,
"SSB0666": 122,
"SSB0601": 123,
"SSB0607": 124,
"SSB0603": 125,
"SSB0609": 126,
"SSB0629": 127,
"SSB0606": 128,
"SSB0565": 129,
"SSB0539": 130,
"SSB0535": 131,
"SSB0534": 132,
"SSB0139": 133,
"SSB0133": 134,
"SSB0149": 135,
"SSB0112": 136,
"SSB1956": 137,
"SSB1939": 138,
"SSB1161": 139,
"SSB1366": 140,
"SSB1448": 141,
"SSB1684": 142,
"SSB1686": 143,
"SSB1699": 144,
"SSB1837": 145,
"SSB1832": 146,
"SSB1341": 147,
"SSB1253": 148,
"SSB0966": 149,
"SSB0987": 150,
"SSB1863": 151,
"SSB1828": 152,
"SSB1056": 153,
"SSB1115": 154,
"SSB1096": 155,
"SSB1131": 156,
"SSB1125": 157,
"SSB1020": 158,
"SSB1625": 159,
"SSB1563": 160,
"SSB1567": 161,
"SSB1585": 162,
"SSB0632": 163,
"SSB0631": 164,
"SSB0623": 165,
"SSB0482": 166,
"SSB0470": 167,
"SSB0502": 168,
"SSB0415": 169,
"SSB0407": 170,
"SSB0434": 171,
"SSB0122": 172,
"SSB0145": 173
}
},
"model": {
"use_spk_conditioned_encoder": true,
"use_noise_scaled_mas": true,
"use_mel_posterior_encoder": false,
"use_duration_discriminator": true,
"inter_channels": 192,
"hidden_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 6,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [
3,
7,
11
],
"resblock_dilation_sizes": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates": [
8,
8,
2,
2,
2
],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [
16,
16,
8,
2,
2
],
"n_layers_q": 3,
"use_spectral_norm": false,
"gin_channels": 512,
"slm": {
"model": "./slm/wavlm-base-plus",
"sr": 16000,
"hidden": 768,
"nlayers": 13,
"initial_channel": 64
}
},
"version": "2.3"
}
Finally, following the instructions in the code https://github.com/fishaudio/Bert-VITS2/blob/master/webui_preprocess.py, we can use the following command to start training on a GPU:
torchrun --nproc_per_node=1 train_ms.py
2.4 Evaluation Methods
Given the subjective nature of speech quality assessment, we primarily relied on human evaluation for our model selection and final assessment. We employed the following evaluation strategy:
Checkpoint Selection: Regular human evaluation of model checkpoints on a held-out development set to track progress and select the best model.
While we acknowledge the importance of objective metrics in TTS evaluation, we found that human evaluation provided more reliable insights into the perceived quality and naturalness of the generated speech, especially considering the nuances of tonal languages like Mandarin.
In addition, we provide automatic evaluation results based on WER and MOS. These metrics are reported both for comparing different systems and for tracking different checkpoints during training.
In the next chapter, we present our experimental results, including an analysis of the training process, automatic evaluation at different checkpoints, and qualitative examples of the generated speech.
3. Results and Discussion
This chapter presents the results of our experiments with the Bert-VITS2 based Mandarin Chinese Text-to-Speech (TTS) system and provides a detailed discussion of our findings. We analyze both the quantitative metrics and qualitative assessments to evaluate the performance of our model.
3.1 Training Process Analysis
3.1.1 Convergence and Loss Curves
The training process of our Bert-VITS2 model exhibited interesting dynamics, which we analyze here to provide insights into the model's learning behavior.
Initially, we observed a seemingly normal loss curve:
This curve showed higher loss at the beginning of training, followed by a rapid drop and eventual convergence at a relatively low level. However, upon inference, we discovered that the model was only generating blank speech. We hypothesize that this was due to a phenomenon known as mode collapse, a common issue in GAN-style training where the generator finds a way to "cheat" the discriminator.
After refining the training setup, we obtained the following loss curves:
The refined total loss curve (Figure 4) shows oscillations, indicating that not all components of the network fully converged. However, the discriminator loss (Figure 5) appears to have stabilized, while the generator loss (Figure 6) shows a clear downward trend. These patterns suggest that our refined training process successfully mitigated the mode collapse issue and led to meaningful learning.
3.1.2 Automatic evaluation
We also recorded the change in Word Error Rate (WER) [9] during training, using BELLE-2/Belle-whisper-large-v3-zh [11] as the ASR model:
The curve in Figure 7 shows that the WER has dropped from the initial value of around 0.5 to around 0.2.
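These WER values were obtained by transcribing each synthesized utterance with the ASR model and comparing the transcript against the input text. The sketch below illustrates the procedure for a single utterance; it assumes the jiwer package, scores at the character level (the usual choice for Mandarin), and omits the text normalization details of our actual script.

# Sketch of the per-utterance WER computation (character level, as is usual for Mandarin).
# Assumes the jiwer package; punctuation and other text normalization are omitted here.
import jiwer
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="BELLE-2/Belle-whisper-large-v3-zh")

def mandarin_wer(reference_text, wav_path):
    hypothesis = transcriber(wav_path)["text"]
    # Split both strings into single characters so the "words" scored by jiwer are characters.
    ref = " ".join(reference_text.replace(" ", ""))
    hyp = " ".join(hypothesis.replace(" ", ""))
    return jiwer.wer(ref, hyp)

# Example: mandarin_wer("平淡无奇到没有任何悬念", "tmp.wav")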
3.1.3 Quality of the final checkpoint
Following the instructions at https://github.com/fishaudio/Bert-VITS2/blob/master/webui.py, we can simply use the trained checkpoint for inference by launching an interactive webui:
python ./webui.py
Then we can access the webui like this:
We use the Mean Opinion Score (MOS), predicted automatically with UTMOS [10], to measure the naturalness of our model's speech.
An evaluation can be obtained by running the following script:
import os

import librosa
import numpy as np
import soundfile as sf
import torch
from tqdm import tqdm
from transformers import pipeline

import utils
from infer import infer, get_net_g

# UTMOS predictor used to estimate MOS automatically
predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)


def text_to_speech(text, speaker, language="ZH", model_path="path/to/your/model",
                   config_path="path/to/your/config.json", device="cuda"):
    # Load model and config
    hps = utils.get_hparams_from_file(config_path)
    net_g = get_net_g(model_path=model_path, version="2.3", device=device, hps=hps)

    # Set up inference parameters
    sdp_ratio = 0.2
    noise_scale = 0.6
    noise_scale_w = 0.8
    length_scale = 1.0

    # Generate audio
    with torch.no_grad():
        audio = infer(
            text,
            sdp_ratio=sdp_ratio,
            noise_scale=noise_scale,
            noise_scale_w=noise_scale_w,
            length_scale=length_scale,
            sid=speaker,
            language=language,
            hps=hps,
            emotion=None,
            net_g=net_g,
            device=device,
        )

    # Convert to 16-bit PCM
    audio_16bit = np.interp(audio, (audio.min(), audio.max()), (-32768, 32767)).astype(np.int16)
    return audio_16bit, hps.data.sampling_rate


valid_dataset_path = "/home/voiceping/lyu/try_t2s/Bert-VITS2/data/train/val.list"

# ASR model (not used for the MOS scores below; the same setup serves the WER evaluation)
transcriber = pipeline(
    "automatic-speech-recognition",
    model="BELLE-2/Belle-whisper-large-v3-zh"
)
transcriber.model.config.forced_decoder_ids = (
    transcriber.tokenizer.get_decoder_prompt_ids(
        language="zh",
        task="transcribe"
    )
)

# Read the validation list: path|speaker|language|text|phones|tones|word2ph
with open(valid_dataset_path, "r", encoding="utf-8") as f:
    lines = f.readlines()
data_paths, speakers, texts = [], [], []
for line in lines:
    data_path, speaker, _, text, _, _, _ = line.strip().split("|")
    data_paths.append(data_path)
    speakers.append(speaker)
    texts.append(text)

mos_results = []
for data_path, speaker, text in tqdm(zip(data_paths, speakers, texts), total=len(data_paths)):
    audio, sample_rate = text_to_speech(
        text,
        speaker,
        model_path="/home/voiceping/lyu/try_t2s/Bert-VITS2/data/train/models/G_94000.pth",
        config_path="/home/voiceping/lyu/try_t2s/Bert-VITS2/data/train/configs/config.json",
    )
    # Write to a temporary file and score it with the UTMOS predictor
    sf.write("tmp.wav", audio, sample_rate)
    wave, sr = librosa.load("tmp.wav", sr=None, mono=True)
    score = predictor(torch.from_numpy(wave).unsqueeze(0), sr)
    mos_results.append(score.item())
    # Delete the temporary file
    os.remove("tmp.wav")

assert len(mos_results) == len(texts), f"len(mos_results)={len(mos_results)}, len(texts)={len(texts)}"

with open("/home/voiceping/lyu/try_t2s/Bert-VITS2/bert_vits2.mos.txt", "w", encoding="utf-8") as f:
    for mos in mos_results:
        f.write(f"{mos}\n")
print("mean MOS:", np.mean(mos_results))
We show the final evaluation results and compare them with some popular open source TTS models:
| System | WER ↓ | MOS ↑ |
| --- | --- | --- |
| Ours (Bert-VITS2) | 0.27 | 2.90 |
| myshell-ai/MeloTTS-Chinese | 5.62 | 3.04 |
| fish-speech (GPT) w/o ref | 0.49 | 3.57 |

ASR model used for WER: BELLE-2/Belle-whisper-large-v3-zh
Our model achieved the lowest WER, which indicates that it generates speech whose content is accurately intelligible. However, it falls below the state-of-the-art fish-speech model in terms of MOS, so the naturalness of the audio still needs improvement. Note, however, that our model has far fewer parameters than fish-speech and its inference is much faster.
3.1.4 Some generation examples
Next, we show a few examples of the output:
- 平淡无奇到没有任何悬念 (“unremarkable and without any suspense.”):
- 是衡阳市祁东县太和堂镇大堂村的一个留守儿童 ("He is a left-behind child from Datang Village, Taihetang Town, Qidong County, Hengyang City.")
- 是唯一的负增长品类 (”is the only category with negative growth”)
- 我敢打赌,我们会成为一家伟大的公司!(”I bet we'll be a great company! ”)
Next, we try a more challenging example: a 22-second synthesis. This sample is clearly outside the scope of the training data, which mostly consists of short utterances of 2 to 10 seconds.
- 语音合成是将人类语音用人工的方式所产生。若是将电脑系统用在语音合成上,则称为语音合成器,而语音合成器可以用软/硬件所实现。文字转语音系统则是将一般语言的文字转换为语音,其他的系统可以描绘语言符号的表示方式,就像音标转换至语音一样。(”Speech synthesis is the artificial production of human speech. If a computer system is used for speech synthesis, it is called a speech synthesizer, and speech synthesizers can be implemented using software or hardware. Text-to-speech systems convert text into speech, while other systems can describe the representation of language symbols, such as phonetic transcriptions into speech. ”)
Quite surprisingly, the model handled this case well, and every word in the generated speech was clearly intelligible.
However, we found one weakness in the model: it cannot handle text with code switching.
Let's look at an example:
- 语音处理(Speech processing),又称语音信号处理、人声处理,其目的是希望做出想要的信号,进一步做语音辨识,应用到手机界面甚至一般生活中,使人与电脑能进行沟通。(”Speech processing, also known as speech signal processing and human voice processing, aims to make the desired signal and further perform speech recognition, which is applied to mobile phone interfaces and even in general life, enabling people to communicate with computers.”)
Unfortunately, the generated speech omits the code-switched English phrase ("Speech processing").
Although the output is not perfectly natural, owing to the limited size of the dataset, every word can be heard clearly.
Importantly, the output of the fine-tuned model is far better than that of the original checkpoint without fine-tuning. The first sentence above (平淡无奇到没有任何悬念, "unremarkable and without any suspense"), generated by the original checkpoint, sounds as follows:
Even a listener who does not speak Mandarin can tell that this output is not recognizable speech at all. This contrast demonstrates the necessity of our work, namely fine-tuning the Bert-VITS2 model.
4. Conclusions and future work
We have fine-tuned Bert-VITS2 on a Mandarin Chinese TTS dataset and obtained promising initial results. Although GAN-style training makes the model difficult to train, we believe we have established a workable methodology for obtaining high-quality synthesized speech. Next, we will train for more steps to further improve generation quality.
5. References
[1] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, "FastSpeech: Fast, robust and controllable text to speech," in Proc. NeurIPS, 2019, pp. 3165–3174.
[2] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. Interspeech, 2017, pp. 4006–4010.
[3] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. ICML, 2021, pp. 5530–5540.
[4] J. Kong, J. Park, B. Kim, J. Kim, D. Kong, and S. Kim, "VITS2: Improving quality and efficiency of single-stage text-to-speech with adversarial learning and architecture design," in Proc. Interspeech, 2023, pp. 4374–4378.
[5] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. NeurIPS, 2014.
[7] F. Yu et al., "M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge," in Proc. ICASSP, 2022, pp. 6167–6171.
[8] Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, "AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines," arXiv preprint arXiv:2010.11567, 2020.
[9] Y. Wang, A. Acero, and C. Chelba, "Is word error rate a good indicator for spoken language understanding accuracy," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), St. Thomas, US Virgin Islands, 2003.
[10] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, "UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022," in Proc. Interspeech, 2022, pp. 4521–4525.
[11] LianjiaTech/BELLE: Be Everyone's Large Language Model Engine (open-source Chinese dialogue LLM). GitHub repository, https://github.com/LianjiaTech/BELLE