Taishin Maeda - Waseda University
Table of Contents
- Table of Contents
- 1. Introduction
- 2. Proposed Evaluation Method
- 3. Experimental Setup
- 4. Results and Discussion
- 5. Supplementary Materials and Applications
- 6. Conclusion and Next Steps
- 7. References
- 8. Greatest Appreciation
1. Introduction
What is Speaker Diarization?
Speaker diarization is the process of segmenting and labeling an audio recording according to who is speaking. In other words, it answers the question "who spoke when?" in a given audio. It is a useful conversation-analysis tool, especially when coupled with Automatic Speech Recognition (ASR). A speaker diarization system consists of a Voice Activity Detection (VAD) model, which produces timestamps for the regions where speech is present, and an audio embedding model, which extracts embeddings for those time-stamped segments. The embedding vectors are then grouped into clusters to estimate the number of speakers and assign speaker labels.
In this blog, two state-of-the-art open-source frameworks for speaker diarization are discussed: Pyannote.audio and Nvidia Nemo. The focus is on evaluating the performance of these two frameworks in various audio scenarios. In addition, post-processing with OpenAI's GPT-4 Turbo is conducted as another approach, to observe how it affects the diarization performance.
What is Pyannote.audio?
Pyannote.audio is an open-source Python toolkit for speaker diarization and speaker embedding, built on the PyTorch machine learning (ML) framework.
What is Nvidia Nemo?
Nvidia Nemo speaker diarization takes a somewhat different approach from Pyannote.audio. It is an open-source deep learning framework developed by NVIDIA, also based on PyTorch. In the Nemo speaker diarization pipeline, a neural diarizer labels speakers, including overlapping speech, based on the speaker profiles created from the clustering results. The pipeline can be seen in the figure below:
Nemo speaker diarization uses a multi-scale approach for segmentation, speaker embedding extraction, and clustering. After Voice Activity Detection (VAD) is performed, the audio is segmented and a speaker embedding vector is extracted from each segment [7]. The key point is the trade-off between speaker identification quality and granularity: the longer the segments, the higher the quality of the speaker representations, but the lower the temporal resolution, which leads to potential errors. Conversely, the shorter the segments, the lower the quality of the speaker representations, but the higher the temporal resolution. This trade-off led to the idea of multi-scale segmentation.
The purpose of this idea is to resolve the trade-off between long and short segment lengths. Multiple layers, or scales, each representing a segment length, are used, and the affinity values from each scale's results are fused. The figure below shows what the multi-scale segmentation looks like.
In figure 3, the scale assigned to the highest scale index is called the base scale and has the shortest segment length. A mapping between scales is calculated: the middle point of each segment is matched to the segment in every other scale whose middle point is closest [7]. Note that the blue blocks represent the segmentation mapping. To sum up, the multi-scale approach embeds segments at different time scales, which lets the diarization handle the trade-off between accurate voice embeddings, which require longer segments, and good temporal granularity of the segmentation.
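To make the fusion step concrete, here is a simplified sketch (not Nemo's implementation; the number of scales, embedding dimension, and equal scale weights are assumptions): per-scale affinity matrices, already mapped onto the base-scale segments, are combined as a weighted sum before clustering.

import numpy as np

def cosine_affinity(embs: np.ndarray) -> np.ndarray:
    # embs: (num_base_segments, dim) embeddings mapped onto the base scale
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return normed @ normed.T

# Hypothetical embeddings for 3 scales (e.g., 1.5 s, 1.0 s, 0.5 s windows),
# each already mapped onto the same 6 base-scale segments.
rng = np.random.default_rng(0)
per_scale_embeddings = [rng.standard_normal((6, 192)) for _ in range(3)]
scale_weights = np.array([1.0, 1.0, 1.0])          # equal weights, as an example
scale_weights = scale_weights / scale_weights.sum()

fused_affinity = sum(
    w * cosine_affinity(e) for w, e in zip(scale_weights, per_scale_embeddings)
)
print(fused_affinity.shape)  # (6, 6) affinity matrix fed to the clustering step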
Neural Diarizer
As mentioned earlier, a neural diarizer is a trainable neural model that estimates speaker labels from the given audio. The neural diarizer runs once the clustering is done, and it is needed to diarize overlapping speech. The Multi-scale Diarization Decoder (MSDD) model is used as the neural diarizer [7]. The basic idea is that the MSDD model uses the clustering diarizer to obtain the estimated number of speakers and a predicted speaker profile for each speaker.
Comparing the main models involved in two diarization pipelines
| | Pyannote.audio | Nvidia Nemo |
| --- | --- | --- |
| Voice Activity Detection (VAD) | PyanNet (SincNet-based) | Multilingual MarbleNet |
| Audio/Speaker Embedding | ECAPA-TDNN | Titanet Large |
| Clustering | Hidden Markov Model Clustering | Multi-scale Clustering (MSDD Telephonic) |
The table above compares the models used in the two diarization pipelines [10].
Rich Transcription Time Marked (RTTM)
The RTTM file format is a standard format for speaker diarization output, which can later be used to evaluate the predicted diarization.
SPEAKER obama_zach(5min).wav 1 66.32 0.27 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 66.60 0.17 <NA> <NA> SPEAKER_00 <NA> <NA>
The plain text above is an example of the RTTM output produced by speaker diarization. In an RTTM line there are three crucial fields: the segment start time, the segment duration, and the speaker label. In this case, the first line shows a segment starting at 66.32 s with a duration of 0.27 s, during which "SPEAKER_01" is speaking.
2. Proposed Evaluation Method
This section explains the methods and metrics used to evaluate speaker diarization.
Diarization Error Rate (DER) Metric
In general, the standard metric for the speaker diarization problem is the Diarization Error Rate (DER). It was introduced by the National Institute of Standards and Technology (NIST) in 2000, when the evaluation of speaker diarization was conducted for broadcast news and conversational telephone speech in English [11]. The DER metric measures the error in diarization, and the equation can be seen below:
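In its standard form, the DER sums the three error types defined below and normalizes by the total duration of reference speech:

DER = (False Alarm + Missed Detection + Confusion) / (Total duration of reference speech)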
False Alarm: the error that occurs when speech is detected but no one is actually speaking (also called a false positive)
Missed Detection: the error that occurs when no speech is detected even though someone is speaking (also called a false negative)
Confusion: the error that occurs when speech is attributed to the wrong speaker cluster
Ideally, the goal is to reduce the DER to 0, meaning that there is no error at all. For example, if a reference contains 100 seconds of speech and the system produces 5 s of false alarm, 3 s of missed detection, and 4 s of confusion, the DER is (5 + 3 + 4) / 100 = 0.12. Reaching exactly 0 is difficult in practice, so getting the DER as close to 0 as possible is the target of this study.
Ground Truth Labeling File or Referencing File
In order to evaluate the DER, a reference RTTM file is needed to compare against the diarized file. The file details are discussed in more depth in the Experimental Setup section.
Time Performance and Hardware Resources
In addition to the DER, this study takes into account the execution time and the GPU memory usage of both Pyannote and Nemo. This helps visualize not only the accuracy of the models but also the time and hardware costs that need to be considered for real business usage.
Post-Processing Approach Using ChatGPT-4-Turbo
After acquiring the diarization results from both Pyannote and Nemo, post-processing is applied to each result in order to obtain additional results for DER comparison. The idea is to use OpenAI's GPT-4 Turbo to guess and rearrange the diarization results from both Pyannote and Nemo and then measure the accuracy. Note that GPT-4 Turbo is the latest generation model developed by OpenAI, with knowledge up to April 2023 and a context window of 128,000 tokens in one prompt (about 300 pages of text) [4]. To conduct the experiment, the OpenAI API is used as the basis for a GPT script in Python. Just like the ChatGPT chatbot available on the web, the Python script allows the user to input a prompt and get answers. For the post-processing, three inputs are required: 1. the speech-to-text (STT) transcription of the audio file, 2. the diarization results in RTTM format from the specific framework, and 3. a prompt instructing the chatbot. A more detailed procedure is discussed in the Experimental Setup section.
3. Experimental Setup
3.1 Experimental procedure
In this section, the process of the experiment is elaborated. First and foremost, multiple datasets, consisting of audio files and reference files, are used to measure the performance of the Pyannote.audio framework and the Nvidia Nemo framework; each audio file presents a different scenario and set of conditions so that each library can exhibit its maximum potential. To evaluate whether the frameworks are optimal or suitable for various use cases, the experiment measures the Diarization Error Rate (DER) using pyannote.metrics as well as the time performance and hardware resources. The experiment covers several scenarios, and for each scenario and framework it is conducted twice.
To be more specific, the diarization is first run without pre-identifying the number of speakers in the code, which is discussed later. For the second run, the number of speakers in the audio is pre-identified in the code before executing the program. This shows the difference in performance when running each framework with and without that hint. The first scenario is an audio file with two speakers, and the second scenario is an audio file with more than five speakers.
Then, the post-processing using GPT-4 Turbo is conducted last, after all the results are gathered. An STT transcription script in Python is created and used to transcribe both the 5-minute and the 9-minute audio. Those transcriptions are included as input for the GPT-4 Turbo script together with the diarization results in RTTM format from both Pyannote.audio and Nemo.
3.2 Time Performance and Hardware Resources
The time performance and hardware resource utilization were recorded for each scenario and framework execution. The time performance was measured with Python's time module to calculate the time taken by the code execution; a minimal measurement sketch is shown after the list below.
Execution Time: The total time taken for the execution of the code.
GPU Usage: The memory used by the GPU.
GPU: Nvidia GeForce RTX 3090
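In this study the time module and nvidia-smi were used for these measurements. As a minimal sketch, the same quantities could also be recorded programmatically (the pipeline call is a placeholder, and torch.cuda.max_memory_allocated reports only PyTorch's allocations, not the full process usage shown by nvidia-smi):

import time
import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()

start = time.time()
# result = pipeline(audio_path)  # the diarization call being measured (placeholder)
elapsed = time.time() - start

print(f"Execution time: {elapsed:.3f} s")
if torch.cuda.is_available():
    peak_mib = torch.cuda.max_memory_allocated() / 1024**2
    print(f"Peak GPU memory allocated by PyTorch: {peak_mib:.1f} MiB")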
3.3 Audio Files and Datasets
There are two audio files that are being used in this experiment.
The first is a five-minute audio file with only two speakers. The audio is a part of the "obama-zach.wav" file. The reference file, or so-called ground truth labelling file, for this case was created manually using Audacity, a free and open-source digital audio editor and recording application. Using Audacity, the segmentation labelling is created and the label track is exported as a .txt file, which is later converted to the RTTM format (a conversion sketch is shown at the end of this subsection).
The second, nine-minute audio file and its ground truth labelling file are retrieved from the VoxConverse speaker diarization dataset (see: https://github.com/joonson/voxconverse?tab=readme-ov-file). This audio file contains more than five speakers, in order to observe the full potential of the Nvidia Nemo model.
Audio file lists:
obama_zach(5min).wav
bvqnu(9min).wav
Segmentation Labeling Using Audacity
The segmentation labeling can be seen at the bottom of the figure, where "who spoke when?" is manually annotated.
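As a hedged sketch of the conversion mentioned above: Audacity exports label tracks as tab-separated "start<TAB>end<TAB>label" lines (times in seconds), which can be rewritten as RTTM reference lines. The function and file names below are illustrative.

def audacity_labels_to_rttm(txt_path: str, audio_name: str, rttm_path: str) -> None:
    # Convert an Audacity label-track export into RTTM reference lines
    with open(txt_path) as src, open(rttm_path, "w") as dst:
        for line in src:
            start, end, speaker = line.strip().split("\t")
            duration = float(end) - float(start)
            dst.write(
                f"SPEAKER {audio_name} 1 {float(start):.2f} {duration:.2f} "
                f"<NA> <NA> {speaker} <NA> <NA>\n"
            )

# audacity_labels_to_rttm("labels.txt", "obama_zach(5min).wav", "ground-truth.rttm")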
3.4 Programming Code Using Python
Pyannote code
from tempfile import NamedTemporaryFile
from pydub import AudioSegment
from pyannote.audio import Pipeline
import torch
import os
import time
# Load the pretrained diarization pipeline and send it to GPU if available
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_auth"  # Replace with your Hugging Face token
)

if torch.cuda.is_available():
    device = torch.device("cuda")
    pipeline.to(device)
    use_cuda = True
else:
    use_cuda = False

def diarization(audio_path):
    # Perform speaker diarization using the pretrained pipeline
    start_time = time.time()
    diarization = pipeline(audio_path)  # In case you want to pre-identify the number of speakers, add it here
    end_time = time.time()
    execution_time = end_time - start_time

    # Get the diarization result and format it to RTTM
    rttm = "SPEAKER {file} 1 {start:.2f} {duration:.2f} <NA> <NA> {speaker} <NA> <NA>"
    diarization_result = [
        rttm.format(file=audio_path, start=turn.start, duration=turn.duration, speaker=speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]

    # Save the RTTM file
    rttm_file_path = f"{audio_path}.rttm"
    with open(rttm_file_path, "w") as rttm_file:
        for line in diarization_result:
            rttm_file.write(line + "\n")

    return rttm_file_path, execution_time

if __name__ == '__main__':
    audio_path = "your_file_path"  # Provide the path to your audio file
    audio = AudioSegment.from_file(audio_path)
    rttm_file_path, execution_time = diarization(audio_path)
    print(f"RTTM file saved as: {rttm_file_path}")
    print(f"Execution time: {execution_time} seconds")
    if use_cuda:
        os.system("nvidia-smi")
    else:
        print("No GPU in use.")
Code 1 performs speaker diarization on a given audio file using the pretrained model from Hugging Face's Pyannote library. The program loads the pretrained diarization pipeline and sends it to the GPU if available. Next, the diarization function takes the audio file path, processes the audio, and diarizes it. The result is formatted as RTTM and saved to a file.
The code is inspired from: https://github.com/pyannote/pyannote-audio
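As a brief aside on the "pre-identify the number of speakers" comment in Code 1: pyannote.audio 3.1 diarization pipelines accept speaker-count hints as keyword arguments, so the pre-identified runs can be expressed roughly as follows (a sketch reusing the pipeline and audio_path objects from Code 1):

# Reuses `pipeline` and `audio_path` from Code 1 above.
diarization = pipeline(audio_path, num_speakers=2)  # exact speaker count known
# diarization = pipeline(audio_path, min_speakers=2, max_speakers=6)  # bounded guess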
Nvidia Nemo Code
import json
import os
from nemo.collections.asr.models import NeuralDiarizer
from omegaconf import OmegaConf
import wget
def diarize_audio(input_file):
    # Diarization configuration for a single audio file
    meta = {
        'audio_filepath': input_file,
        'offset': 0,
        'duration': None,
        'label': 'infer',
        'text': '-',
        'num_speakers': None,  # In case you want to pre-identify the number of speakers, add it here
        'rttm_filepath': None,
        'uem_filepath': None
    }

    # Write the manifest file expected by NeMo
    with open('input_manifest.json', 'w') as fp:
        json.dump(meta, fp)
        fp.write('\n')

    output_dir = os.path.join('output')
    os.makedirs(output_dir, exist_ok=True)

    # Load the model config, downloading a default one if it is not present locally
    model_config = 'diar_infer_telephonic.yaml'
    if not os.path.exists(model_config):
        config_url = "https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/speaker_tasks/diarization/conf/inference/diar_infer_general.yaml"
        model_config = wget.download(config_url)  # Update the path to the MSDD model configuration

    config = OmegaConf.load(model_config)
    config.diarizer.manifest_filepath = 'input_manifest.json'
    config.diarizer.out_dir = output_dir
    config.diarizer.msdd_model.model_path = 'diar_msdd_telephonic'  # telephonic speaker diarization model
    config.diarizer.msdd_model.parameters.sigmoid_threshold = [0.7, 1.0]  # Evaluate with T=0.7 and T=1.0

    # Initialize diarizer
    msdd_model = NeuralDiarizer(cfg=config)

    # Diarize audio
    diarization_result = msdd_model.diarize()
    return diarization_result

if __name__ == "__main__":
    input_file = 'your_audio_file_path'  # mono .wav
    result = diarize_audio(input_file)
Code 2 performs speaker diarization on an audio file using the NeMo library. The process involves preparing a manifest file with the diarization configuration, loading the diarization model configuration (a .yaml file), initializing the diarizer with the model, and diarizing the audio. The Multi-scale Diarization Decoder (MSDD) model is used; in this case, the telephonic speaker diarization model is chosen, which is efficient in telephonic scenarios, and two sigmoid thresholds are set for evaluation.
The code is inspired from: https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb#scrollTo=CwtVUgVNBR_P
The starting parameter files can be downloaded from: https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/inference
Note that the parameter file is essential for operating the Nvidia Nemo framework in this case. The configuration .yaml file must be adjusted to suit the audio scenario (e.g., the number of speakers) in order to produce an accurate outcome.
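As a sketch of the kind of adjustments mentioned above, the same settings can also be overridden in Python after loading the .yaml file (the key names follow the diar_infer_*.yaml files shipped with NeMo; verify them against your NeMo version):

from omegaconf import OmegaConf

config = OmegaConf.load("diar_infer_telephonic.yaml")
config.diarizer.manifest_filepath = "input_manifest.json"   # manifest written in Code 2
config.diarizer.out_dir = "output"                          # where RTTM predictions are stored
# Speaker-count handling for the clustering stage:
config.diarizer.clustering.parameters.oracle_num_speakers = True  # use num_speakers from the manifest
config.diarizer.clustering.parameters.max_num_speakers = 8        # upper bound when not using the oracle count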
Diarization Error Rate Evaluation Code
from pyannote.core import Segment, Annotation
from pyannote.metrics.diarization import DiarizationErrorRate
# Read reference and hypothesis files
def read_rttm(file_path):
    data = Annotation()
    with open(file_path, 'r') as file:
        for line in file:
            parts = line.strip().split()
            if len(parts) >= 8:  # need at least the start, duration and speaker fields
                start_time = float(parts[3])
                end_time = start_time + float(parts[4])
                speaker_id = parts[7]
                segment = Segment(start_time, end_time)
                data[segment] = speaker_id
    return data
ref_file_path = "ground-truth.rttm"
hyp_rttm_file_path1 = "pyannote.rttm"
hyp_rttm_file_path2 = "pyannote(pre-identified speaker no).rttm"
hyp_rttm_file_path3 = "Nemo.rttm"
hyp_rttm_file_path4 = "Nemo(pre-identified speaker no).rttm"
reference = read_rttm(ref_file_path)
hypothesis1 = read_rttm(hyp_rttm_file_path1)
hypothesis2 = read_rttm(hyp_rttm_file_path2)
hypothesis3 = read_rttm(hyp_rttm_file_path3)
hypothesis4 = read_rttm(hyp_rttm_file_path4)
# Initialize Diarization Error Rate
diarization_error_rate = DiarizationErrorRate()
# Evaluate DER
der1 = diarization_error_rate(reference, hypothesis1)
der2 = diarization_error_rate(reference, hypothesis2)
der3 = diarization_error_rate(reference, hypothesis3)
der4 = diarization_error_rate(reference, hypothesis4)
print(f'DER for pyannote: {der1:.3f}')
print(f'DER for pyannote with number of speakers pre-identified: {der2:.3f}')
print(f'DER for Nemo: {der3:.3f}')
print(f'DER for Nemo with number of speakers pre-identified: {der4:.3f}')
Code 3 is designed to evaluate the performance of the different speaker diarization results using the Diarization Error Rate (DER) metric. It reads the reference and hypothesis RTTM files, which represent the ground truth and the result files respectively, and then calculates the DER for each hypothesis against the reference.
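As an optional variant (not used for the numbers reported below), pyannote.metrics also lets the DER be computed with a forgiveness collar around segment boundaries and with overlapped regions skipped, which is common in published DER figures:

from pyannote.metrics.diarization import DiarizationErrorRate

# Remove a collar (in seconds) around reference segment boundaries and skip overlapped speech
lenient_der = DiarizationErrorRate(collar=0.25, skip_overlap=True)
# der1_lenient = lenient_der(reference, hypothesis1)  # reuses the objects from Code 3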
ChatGPT-4-Turbo Code
from openai import OpenAI

api_key = "input your api key here"
client = OpenAI(api_key=api_key)

# gpt4Turbo = "gpt-4-1106-preview"
def generate_responses(prompt, model="gpt-4-turbo-2024-04-09"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant who provides information to users."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.9,
        max_tokens=4096,
    )
    return response.choices[0].message.content

print(generate_responses("Input prompt here."))
STT Transcription Using Whisper-1 Model Code
from openai import OpenAI, OpenAIError

# Set your OpenAI API key
api_key = "input your api key here"

# Initialize OpenAI client with API key
client = OpenAI(api_key=api_key)

# Open the audio file
audio_file = open("input your audio here", "rb")

try:
    # Create transcription using OpenAI client
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
    # Print the transcription text
    print(transcription.text)
except OpenAIError as e:
    print("OpenAI API Error:", e)
finally:
    # Close the audio file
    audio_file.close()
Input Prompt for ChatGPT-4 Turbo
“ Based on the Transcription Results (In this part, the whole transcription should be included) and Diarization Results in RTTM (In this part, the whole RTTM should be included), can you guess or rearrange the speaker ID of the diarization results to the results that you think it is correct? ”
The quote above is the prompt used to instruct ChatGPT to produce the rearranged diarization results; it is passed in the print statement (the last statement) of Code 4. Results are shown in the upcoming section.
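To show how the prompt above can be assembled in practice, here is a sketch that reuses the generate_responses() helper from Code 4; the file names are hypothetical placeholders.

# File names are hypothetical placeholders.
with open("transcription.txt") as f:
    transcription = f.read()
with open("pyannote.rttm") as f:
    rttm = f.read()

prompt = (
    f"Based on the Transcription Results {transcription} and Diarization Results in RTTM {rttm}, "
    "can you guess or rearrange the speaker ID of the diarization results "
    "to the results that you think it is correct?"
)
print(generate_responses(prompt))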
4. Results and Discussion
Diarization Error Rate (DER) For Two Speakers Audio (5 min)
| Framework | DER |
| --- | --- |
| Pyannote.audio | 0.252 |
| Pyannote.audio with the number of speakers pre-identified | 0.214 |
| Nvidia Nemo | 0.161 |
| Nvidia Nemo with the number of speakers pre-identified | 0.161 |
Table 1: The DER results from the 5 minutes audio containing two speakers
Diarization Error Rate (DER) For Nine Speakers Audio (9 min)
| Framework | DER |
| --- | --- |
| Pyannote.audio | 0.083 |
| Pyannote.audio with the number of speakers pre-identified | 0.098 |
| Nvidia Nemo with the number of speakers pre-identified | 0.097 |
Table 2: The DER results from the 9 minutes audio containing nine speakers
The results of comparing the speaker diarization outputs of the Pyannote.audio framework and the Nvidia Nemo framework are displayed above. The Diarization Error Rate (DER) results for the 5-minute and 9-minute audio are shown in Table 1 and Table 2 respectively. For the five-minute audio containing two speakers, the overall DER is higher than expected. One reason could be the accuracy of the ground truth annotation, which was created manually in Audacity; since the audio file was annotated by hand, it may not be highly accurate. More specifically, the Nvidia Nemo framework produced a DER roughly 9 percentage points lower than the Pyannote.audio framework. For Pyannote.audio, pre-identifying the number of speakers made the DER slightly lower than without it, whereas the DER did not change for the Nvidia Nemo framework. As can be seen, Nvidia Nemo gives the better DER when there are two speakers and the ground truth file is manually annotated.
For the nine-minute audio containing nine speakers, with the ground truth annotation file taken from VoxConverse, the overall DER is significantly lower than for the five-minute audio. As mentioned before, one likely reason is the more accurate ground truth annotation provided by VoxConverse. Note that, in this case, only the Nvidia Nemo framework with the number of speakers pre-identified is used in the experiment: the default mode of the Nvidia Nemo code is not suitable for more than two speakers, so the parameters inside the .yaml configuration file must be adjusted by assigning the number of speakers beforehand in order for the code to diarize the audio. The outcome shows a different trend from the previous one: the Pyannote.audio framework produced a DER roughly 1.4 percentage points lower than the pre-identified Nvidia Nemo framework. This could be due to the telephonic MSDD model used in the configuration file; since that model is designed to work best in telephonic situations and this audio is not telephonic, the results might not be optimal. However, the Pyannote.audio framework with the number of speakers pre-identified exhibited almost the same DER as the Nvidia Nemo framework.
DER of Post-Processing Approach Using OpenAI’s Chat-GPT
| Framework | DER for GPT-4-Turbo | DER for GPT-3.5 |
| --- | --- | --- |
| Pyannote.audio for Two Speaker Audio (5min) | 0.427 | 0.494 |
| Nvidia Nemo for Two Speaker Audio (5min) | 0.179 | 0.544 |
Table 3: The DER results after post-processing the 5 minutes audio containing two speakers
| Framework | DER for GPT-4-Turbo | DER for GPT-3.5 |
| --- | --- | --- |
| Pyannote.audio for Nine Speaker Audio (9min) | 0.103 | 0.214 |
| Nvidia Nemo for Nine Speaker Audio (9min) | 0.128 | 0.179 |
Table 4: The DER results after post-processing the 9 minutes audio containing nine speakers
After using GPT-4 Turbo for post-processing of the two frameworks' outputs, the results are shown in Table 3 and Table 4. The tables compare the DER of post-processing with ChatGPT-4 Turbo and with ChatGPT-3.5. As can be seen, the DERs obtained with ChatGPT-4 Turbo are lower than those of ChatGPT-3.5 for both the 5-minute and the 9-minute audio. Although the post-processed DERs are higher than the original diarization results, this shows that GPT-4 Turbo gives more promising results than GPT-3.5. That said, there is room for improvement. One potential reason the post-processing gives higher DERs is that ChatGPT is not given the audio file itself, so it cannot understand the conditions in the audio, which results in a lack of timestamp information and of the exact number of speakers. The analysis is done based only on the transcription text produced by the STT step, so ChatGPT does not know numerically "who spoke when". Since the topic is diarization of audio, giving ChatGPT accurate access to the audio would be necessary to obtain the best results. On top of that, the prompt or instructions given to ChatGPT could be more detailed: specifying the timestamp conditions or the number of speakers in the conversation beforehand could help ChatGPT diarize the conversation sufficiently. However, in real-world situations the timestamps and number of speakers are not always known, so it is a challenge to use this post-processing approach when those significant parameters are needed to obtain optimal results.
Transcription Texts for 5 minutes Audio File Conducted By Using Whisper-1 Model
Sorry, I had to cancel a few times. My mouse pad broke last week, and I had to get my great-aunt some diabetes shoes.
And, uh, You know what, Zach? It's no problem. I mean, I have to say, when I heard that, like, people actually watch this show, I was actually pretty surprised.
Hi. Welcome to another edition of Between Two Ferns. I'm your host, Zach Galifianakis. And my guest today is Barack Obama. President Barack Obama.
Good to be with you, Zach. First question. In 2013, you pardoned a turkey. What do you have planned for 2014? We'll probably pardon another turkey.
....
Did you say invisible? Because I just think that's impolite. Not invisible, invincible, meaning that they don't think they can get hurt.
I'm just saying that nobody could be invisible if you had said invisible. I understand that. If they get that health insurance, it can really make a big difference.
And they've got till March 31 to sign up. I don't have a computer, so how does? Well, then you can call 1-800-318-2596. I don't have a phone.
I'm off the grid. I don't want you people looking at my text, if you know what I mean.
First of all, Zach, nobody's interested in your texts. Second of all, you can do it in person. And the law means that insurers can't discriminate against you if you've got a pre-existing condition anymore.
Yeah, but what about this, though?
Transcription Texts for 9 minutes Audio File Conducted By Using Whisper-1 Model
President Trump's rally tonight in Pennsylvania scattered among the crowd are people who believe in some conspiracy theories that are so broad and often bizarre it's difficult to believe to put it mildly.
It's no longer an isolated thing. Take a look. The sign with a Q on it stands for QAnon.
This video is from the presidential rally in Tampa two nights ago. Last night on the broadcast we focused more closely on what the group believes in and their views from the fringes of American political thought.
Tonight we wanted to give them a chance to have their say but because so much has been written about their reluctance to talk we weren't sure what we would get when we sent Gary Tuchman to tonight's Trump rally.
Gary joins us now. So what happened? Well Anderson the rally just ended a short time ago.
.....
Yeah it's very confusing I mean sort of the and it's constantly growing like sort of with the with something will happen in the news
and they'll claim oh the deep state tried to shoot down Air Force One sort of the gist of it is that Trump has teamed up with the military and sort of various virtuous world leaders including Vladimir Putin and Kim Jong-un to take on this global cabal of Democrats and Hollywood elites and bankers
and all this kind of stuff who they claim are essentially responsible for all the evil in the world and soon Trump will have all these people arrested.
Example Outputs Before and After Conducting Post-Processing Using ChatGPT-4 Turbo (5 minutes audio diarization using Pyannote.audio)
Before
SPEAKER obama_zach(5min).wav 1 0.01 2.33 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 3.73 2.11 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 5.83 7.28 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 13.39 3.62 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 17.94 4.75 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 23.18 3.40 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 27.43 3.80 <NA> <NA> SPEAKER_00 <NA> <NA>
.....
.....
SPEAKER obama_zach(5min).wav 1 175.70 0.03 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 175.73 0.42 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 176.77 6.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 183.85 2.00 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 187.28 2.82 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 190.40 0.44 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 190.99 1.66 <NA> <NA> SPEAKER_01 <NA> <NA>
.....
.....
SPEAKER obama_zach(5min).wav 1 297.90 0.93 <NA> <NA> SPEAKER_00 <NA> <NA>
After
SPEAKER obama_zach(5min).wav 1 0.01 2.33 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 3.73 2.11 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 5.83 7.28 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 13.39 3.62 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 17.94 4.75 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 23.18 3.40 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 27.43 3.80 <NA> <NA> SPEAKER_01 <NA> <NA>
.....
.....
SPEAKER obama_zach(5min).wav 1 175.70 0.03 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 175.73 0.42 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 176.77 6.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 183.85 2.00 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 187.28 2.82 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 190.40 0.44 <NA> <NA> SPEAKER_01 <NA> <NA>
.....
.....
SPEAKER obama_zach(5min).wav 1 297.90 0.93 <NA> <NA> SPEAKER_00 <NA> <NA>
Example Outputs Before and After Conducting Post-Processing Using ChatGPT-4 Turbo (9 minutes audio diarization using Pyannote.audio)
Before
SPEAKER bvqnu.wav 1 0.26 8.66 <NA> <NA> SPEAKER_07 <NA> <NA>
SPEAKER bvqnu.wav 1 9.38 24.89 <NA> <NA> SPEAKER_07 <NA> <NA>
SPEAKER bvqnu.wav 1 36.10 14.82 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 52.67 12.85 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 65.63 0.73 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 66.00 0.22 <NA> <NA> SPEAKER_03 <NA> <NA>
SPEAKER bvqnu.wav 1 66.36 0.20 <NA> <NA> SPEAKER_03 <NA> <NA>
.....
.....
SPEAKER bvqnu.wav 1 156.56 4.02 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 160.87 9.32 <NA> <NA> SPEAKER_04 <NA> <NA>
SPEAKER bvqnu.wav 1 161.49 0.03 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 161.52 0.73 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 163.71 0.95 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 165.00 0.58 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 167.07 6.18 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 172.49 0.02 <NA> <NA> SPEAKER_04 <NA> <NA>
SPEAKER bvqnu.wav 1 172.50 0.61 <NA> <NA> SPEAKER_03 <NA> <NA>
SPEAKER bvqnu.wav 1 173.25 0.41 <NA> <NA> SPEAKER_03 <NA> <NA>
.....
.....
SPEAKER bvqnu.wav 1 543.71 29.02 <NA> <NA> SPEAKER_06 <NA> <NA>
After
SPEAKER bvqnu.wav 1 0.26 8.66 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER bvqnu.wav 1 9.38 24.89 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER bvqnu.wav 1 36.10 14.82 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 52.67 12.85 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 65.63 0.73 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 66.00 0.22 <NA> <NA> SPEAKER_02 <NA> <NA>
SPEAKER bvqnu.wav 1 66.36 0.20 <NA> <NA> SPEAKER_02 <NA> <NA>
.....
.....
SPEAKER bvqnu.wav 1 156.56 4.02 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 160.87 9.32 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 161.49 0.03 <NA> <NA> SPEAKER_03 <NA> <NA>
SPEAKER bvqnu.wav 1 161.52 0.73 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 163.71 0.95 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 165.00 0.58 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 167.07 6.18 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 172.49 0.02 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 172.50 0.61 <NA> <NA> SPEAKER_03 <NA> <NA>
SPEAKER bvqnu.wav 1 173.25 0.41 <NA> <NA> SPEAKER_03 <NA> <NA>
.....
.....
SPEAKER bvqnu.wav 1 543.71 29.02 <NA> <NA> SPEAKER_07 <NA> <NA>
As can be seen, a couple of rearrangements occurred after the post-processing. The timestamps remained the same, but the speaker IDs were changed according to what ChatGPT-4 Turbo expected the diarization of the conversation to be. Based on these results, the lack of consideration of the timestamps and of the duration of each speaker's turns is likely one of the factors affecting the DER, which is consistent with the fact that ChatGPT has no access to the audio file itself.
Regarding the numerical results and example outputs, since ChatGPT generates responses based on context, giving detailed prior information or conditions for the desired task could lead to more precise results. In this case, providing the number of speakers in the audio, the duration of the audio, the conversation topics, the conversation type (phone call, broadcast, or face-to-face), or even visual data could give ChatGPT more insight and produce more promising outcomes. However, real-world situations in which this information is not available are likely to arise, and ChatGPT is not specifically trained for this diarization task. Therefore, performing the post-processing with other large language models specifically trained for this task could be one possible solution.
Execution Time Performance For Two Speakers Audio (5min)
| Framework | Execution Time (seconds) |
| --- | --- |
| Pyannote.audio | 31.281 |
| Pyannote.audio with the number of speakers pre-identified | 29.803 |
| Nvidia Nemo | 63.868 |
| Nvidia Nemo with the number of speakers pre-identified | 49.932 |
Table 5: The time performance and hardware usages for the 5 minutes audio containing two speakers
Execution Time Performance For Nine Speakers Audio (9 min)
| Framework | Execution Time (seconds) |
| --- | --- |
| Pyannote.audio | 44.550 |
| Pyannote.audio with the number of speakers pre-identified | 41.509 |
| Nvidia Nemo with the number of speakers pre-identified | 108.162 |
Table 6: The time performance and hardware usages for the 9 minutes audio containing nine speakers
Next, Table 5 and Table 6 provide the execution times for the audio with two speakers and with nine speakers respectively. For both audio lengths, the Nvidia Nemo framework took roughly double the time of the Pyannote.audio framework to produce the diarization results. As expected, the longer the audio file, the longer both frameworks take to execute. Furthermore, pre-identifying the number of speakers reduces the execution time for both frameworks. These results are therefore consistent with the hypothesis.
NVIDIA System Management Interface (SMI) for Two Speakers Audio
NVIDIA System Management Interface (SMI) for Nine Speakers Audio
These are the GPU usage results in the different scenarios, and they are slightly difficult to interpret. For the Pyannote.audio framework, the GPU memory usage for the nine-minute audio is slightly lower than for the five-minute audio. However, for Nvidia Nemo the GPU memory usage for the nine-minute audio is significantly higher than for the five-minute audio, the opposite of the Pyannote.audio behaviour. Moreover, comparing the two frameworks on the same audio, Nvidia Nemo used less GPU memory than Pyannote.audio for the five-minute audio, but more GPU memory than Pyannote.audio for the nine-minute audio.
For the Nvidia Nemo framework, since the pipeline involves a large model (Titanet Large in this case), a high-performance GPU is required to run the multi-scale embedding approach and perform segmentation efficiently. Hence, if a CPU or a small GPU is used, changing the default model to a lighter one (e.g., from Titanet Large to Titanet Small) could be more beneficial, as sketched below.
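As a hedged sketch of that swap (the config key follows NeMo's inference .yaml layout, and "titanet_small" is the lighter checkpoint named in the text above; confirm its availability in your NeMo release):

# Reuses the OmegaConf `config` object from Code 2.
config.diarizer.speaker_embeddings.model_path = "titanet_small"  # lighter alternative to "titanet_large"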
5. Supplementary Materials and Applications
In this section, an application of speaker diarization is presented. One of the crucial use cases is making the diarization work in real time. This can be significantly useful for real business usage across multiple industries, such as video meetings, presentations, conferences, and many more. For the real-time speaker diarization web application, the Pyannote.audio framework is used.
Real-time Speaker Diarization Web application Using WebSockets and FastAPI
from pyannote.audio import Pipeline, Model, Inference
from scipy.spatial.distance import cdist
import torch
from pydub import AudioSegment
import os
from tempfile import TemporaryDirectory
from core.common.config import Config
from core.common.logger import use_logger
config = Config.get_instance()
logger = use_logger(__name__)
class PyannoteService:
    def __init__(self):
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=config.hugging_face.token,
        )
        self.pipeline.to(torch.device("cuda"))
        self.embedding_model = Model.from_pretrained(
            "pyannote/embedding", use_auth_token=config.hugging_face.token
        )
        self.embedding_inference = Inference(self.embedding_model, window="whole")
        self.embedding_inference.to(torch.device("cuda"))

    def diarize(self, audio_file: str):
        diarization = self.pipeline(audio_file)
        return [
            {"timestamp": (turn.start, turn.end), "speaker": speaker}
            for turn, _, speaker in diarization.itertracks(yield_label=True)
        ]

    def create_embedding(self, audio_file: str):
        return self.embedding_inference(audio_file).reshape(1, -1)

    def calculate_embeddings_distance(self, embedding1, embedding2):
        return cdist(embedding1, embedding2, metric="cosine")[0, 0]

    def update_speaker_voice(self, audio_file, speakers, speaker_dir):
        original_audio = AudioSegment.from_wav(audio_file)
        chunk_size_ms = 3 * 1000  # 3 seconds in milliseconds
        embeddings = {}
        for entry in speakers:
            start, end = entry["timestamp"]  # timestamps in seconds
            speaker = entry["speaker"]
            # Process each 3-second chunk
            chunk_start = start
            while chunk_start < end:
                chunk_end = min(chunk_start + chunk_size_ms / 1000, end)
                segment = original_audio[int(chunk_start * 1000) : int(chunk_end * 1000)]
                speaker_path = os.path.join(speaker_dir, f"{speaker}_{chunk_start}-{chunk_end}.wav")
                if os.path.exists(speaker_path):
                    existing_audio = AudioSegment.from_wav(speaker_path)
                    combined_audio = existing_audio + segment
                    if len(combined_audio) > chunk_size_ms:
                        combined_audio = combined_audio[-chunk_size_ms:]
                    new_audio = combined_audio
                else:
                    if len(segment) > chunk_size_ms:
                        segment = segment[-chunk_size_ms:]
                    new_audio = segment
                # Extend the audio chunk to prevent embedding errors
                if len(new_audio) < chunk_size_ms:
                    silence = AudioSegment.silent(chunk_size_ms - len(new_audio))
                    new_audio = silence + new_audio
                new_audio.export(speaker_path, format="wav")
                embeddings[f"{speaker}_{chunk_start}-{chunk_end}"] = self.create_embedding(speaker_path)
                chunk_start = chunk_end
        return embeddings

    def identify_each_chunk_speaker(self, audio_path, speakers, voices_embeddings):
        audio = AudioSegment.from_wav(audio_path)
        result = []
        distances = {}
        chunk_size_ms = 3 * 1000  # 3 seconds in milliseconds
        with TemporaryDirectory() as chunks_dir:
            for entry in speakers:
                timestamp = entry["timestamp"]
                speaker = entry["speaker"]
                start, end = timestamp
                chunk_start = start
                while chunk_start < end:
                    chunk_end = min(chunk_start + chunk_size_ms / 1000, end)  # Convert chunk size to seconds
                    chunk_start_ms = int(chunk_start * 1000)
                    chunk_end_ms = int(chunk_end * 1000)
                    audio_chunk_path = os.path.join(
                        chunks_dir, f"{os.path.basename(audio_path)}_{chunk_start_ms}-{chunk_end_ms}.wav"
                    )
                    audio_chunk = audio[chunk_start_ms:chunk_end_ms]
                    # Extend the audio chunk to prevent embedding errors
                    if chunk_end - chunk_start < 1:
                        silence = AudioSegment.silent(int((1 - (chunk_end - chunk_start)) * 1000 / 2))
                        audio_chunk = silence + audio_chunk + silence
                    audio_chunk.export(audio_chunk_path, format="wav")
                    chunk_key = f"{chunk_start_ms}-{chunk_end_ms}"  # Use chunk start and end times as key
                    distances[chunk_key] = {}
                    for voice in voices_embeddings:
                        distances[chunk_key][voice] = self.calculate_embeddings_distance(
                            voices_embeddings[voice],
                            self.create_embedding(audio_chunk_path),
                        )
                    # Set distance to 0.5 if there's no example of the speaker's voice
                    if not distances[chunk_key].get(speaker):
                        distances[chunk_key][speaker] = 0.5
                    chunk_start = chunk_end
        logger.info("DISTANCES\n" + str(distances))
        for entry in speakers:
            timestamp = entry["timestamp"]
            start, end = timestamp
            # Calculate average distance for the chunk
            chunk_key = f"{chunk_start_ms}-{chunk_end_ms}"
            avg_distance = sum(distances[chunk_key].values()) / len(distances[chunk_key])
            # Determine speaker based on average distance
            if avg_distance > 0.65:
                speaker = f"SPEAKER_{len(voices_embeddings) + 1:02}"
            else:
                speaker = min(distances[chunk_key], key=distances[chunk_key].get)
            result.append({"timestamp": timestamp, "speaker": speaker})
        return result
Code 6 shows the Python code defining the "PyannoteService" class for speaker diarization and voice identification using the Pyannote library. The "diarize" method performs speaker diarization on an input audio file, returning speaker turns with timestamps. "create_embedding" extracts a voice embedding from an audio file, and "calculate_embeddings_distance" calculates the cosine distance between embeddings. "update_speaker_voice" updates the stored speaker audio recordings based on the provided segments and then extracts embeddings from them. "identify_each_chunk_speaker" identifies the speaker for each audio chunk by comparing its embedding with the known speakers' embeddings.
To elaborate on the "update_speaker_voice" and "identify_each_chunk_speaker" logic: in this code, the chunking logic was modified from 30-second chunks to 3-second chunks. These chunks are later compared with the embeddings of known speakers. Instead of comparing the embedding of one large 30-second chunk with the embedding of a smaller chunk, the distance between the embedding of each smaller 3-second chunk and the known speakers is calculated, and the average of those distances is taken, aiming to obtain a better measure of similarity between audio samples.
from fastapi import APIRouter, WebSocket, WebSocketDisconnect
from tempfile import TemporaryDirectory
import shutil
from core.common.logger import use_logger
from core.common.config import Config
from core.controllers.diarization_controller import diarize
config = Config.get_instance()
logger = use_logger(__name__)
router = APIRouter(
    prefix="/diarization",
    tags=["diarization"],
    responses={404: {"description": "Not found"}},
)

@router.websocket("/ws/diarize")
async def websocket_diarize(websocket: WebSocket):
    await websocket.accept()
    logger.info("WebSocket connection established")
    voices_embeddings = {}
    with TemporaryDirectory() as voices_dir:
        try:
            while True:
                audio_data = await websocket.receive_bytes()
                logger.info("Received audio data from client")
                chunks, voices_embeddings = diarize(
                    audio_data, voices_dir, voices_embeddings
                )
                logger.info("Diarization completed")
                diarization_result = "\n".join(
                    f"{chunk['speaker']}: {chunk['text']}" for chunk in chunks
                )
                await websocket.send_text(diarization_result)
                logger.info("Diarization result sent to client")
        except WebSocketDisconnect:
            logger.info("WebSocket connection closed")
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from .routes import diarization_routes
from core.common.logger import use_logger
logger = use_logger(__name__)
app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["*"],
)
app.include_router(diarization_routes.router)

@app.get("/")
def health_check():
    return "ok"
Test-case Scenario
The aforementioned real-time speaker diarization web app was used to diarize two speakers talking to each other live. The diarization starts after the user presses the "start" button on the UI and stops when the user presses the "end" button. The results are shown below.
Diarization Results and Discussion
As can be seen from the two figures, the modified version provides smoother transitions and eliminates the abrupt start of the conversation. To illustrate, in line 1 of the diarization and transcription, figure 10 shows a confusion error when another speaker starts to speak: from "You know what ...", the diarization should detect the new speaker. Similarly, in lines 4, 6, and 9, the program before the chunk-logic modification produced multiple confusion errors. In addition, the timing of switching to another speaker is inaccurate; when a different person starts to speak, the diarization and transcription should move to a new line. In contrast, figure 11 correctly diarizes the moment a different speaker enters the conversation. It can be concluded that the code after modifying the chunk logic significantly reduced the timing errors of the diarization. However, there is still a problem: even though the switch timing is correctly shown on a new line, the number of speakers diarized is incorrect. As mentioned before, there are only two speakers in this case, yet Figure 11 reports up to eight speakers. This suggests that the logic for assigning speaker IDs needs to be adjusted. One possible solution is adjusting the threshold for creating a new speaker when the cosine distance is too large (the avg_distance > 0.65 check in Code 6). In addition, Code 6 sets the cosine distance to 0.5 when there is no example of a speaker's voice; adjusting that default distance could also impact the diarization results.
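To make the two tunables discussed above concrete, here is a small sketch that mirrors the assignment logic in Code 6 (the constants are the values currently used, not recommendations):

NEW_SPEAKER_THRESHOLD = 0.65    # average cosine distance above which a new speaker ID is created
DEFAULT_UNSEEN_DISTANCE = 0.5   # distance assumed when a speaker has no stored voice sample yet

def assign_speaker(chunk_distances: dict, num_known_speakers: int) -> str:
    # chunk_distances maps known speaker IDs to cosine distances for one chunk
    avg_distance = sum(chunk_distances.values()) / len(chunk_distances)
    if avg_distance > NEW_SPEAKER_THRESHOLD:
        return f"SPEAKER_{num_known_speakers + 1:02}"
    return min(chunk_distances, key=chunk_distances.get)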
6. Conclusion and Next Steps
In conclusion, the details above demonstrate the range of DER outcomes obtained with the Pyannote.audio and Nvidia Nemo frameworks, together with post-processing using ChatGPT-4 Turbo and ChatGPT-3.5. Nvidia Nemo exhibited superiority on the shorter audio file with two speakers, while Pyannote.audio gave promising results when the speaker count is pre-identified. The post-processing approach using ChatGPT-4 Turbo can be counted as an alternative way to leverage the capabilities of large language models for improving diarization accuracy. However, based on the results, further enhancements are necessary to address its limitations in understanding the audio conditions and speaker timestamps, which points to the need for a more integrated approach to minimize the DER. The execution time and GPU usage analysis revealed differences between the frameworks: Nvidia Nemo required more time, and more GPU resources when dealing with the longer audio file. This suggests that the choice of lighter models for Nvidia Nemo should vary depending on the usage situation; likewise, optimizing framework configurations could enhance efficiency and scalability on different hardware setups.
For the supplementary materials and applications, the real-time speaker diarization application demonstrated a real-world usage. Despite the progress in achieving smoother transitions between speakers and more accurate timing of speaker changes, several challenges remain. Inaccuracy in identifying the number of speakers is one of the key issues that makes this topic interesting and in need of improvement. To address it, refining parameters such as the cosine-distance threshold for creating new speakers and the default distance used when no example of a speaker's voice is available could be sufficient. Calibrating the balance between accurately capturing speaker changes and minimizing false identifications can noticeably affect the results, leading to smoother transcription and diarization.
For future studies and improvements of the Pyannote.audio and Nvidia Nemo evaluation, the next steps should focus on model selection and the integration of audio context. Especially for Nvidia Nemo, choosing a model suited to scenarios beyond telephonic audio, so that diverse audio contexts are accommodated, could lead to more robust performance. Integrating audio context into the post-processing pipeline, for example by giving ChatGPT more information about speaker timing and conditions, could enhance the diarization accuracy. Furthermore, performing the post-processing with other large language models specifically trained for the speaker diarization task could be an alternative way to enhance performance. For the real-time speaker diarization, fine-tuning parameters and improving the chunk-acquisition logic could be the next steps; iterative testing and evaluation of those steps could surface issues and guide further system improvements. In addition, implementing the Nvidia Nemo framework in the system instead of Pyannote.audio could give different results, since it has been shown to deliver a lower DER. However, as mentioned earlier, choosing the model and adjusting the configuration parameters are essential in order to obtain the best results.
All in all, advancing speaker diarization capabilities requires creative approaches that address model suitability, integration of audio context, and parameter adjustment. By considering such aspects, the accuracy, efficiency, and usability of speaker diarization systems can be enhanced and applied in various real-world applications, unlocking new possibilities not only in business but also in education and many other fields.
7. References
[1]. Methodologies for the evaluation of speaker diarization and automatic speech recognition in the presence of overlapping speech. (n.d.). https://www.researchgate.net/publication/258119065_Methodologies_for_the_evaluation_of_Speaker_Diarization_and_Automatic_Speech_Recognition_in_the_presence_of_overlapping_speech
[2]. Diarization error rate. (n.d.). https://xavieranguera.com/phdthesis/node108.html
[3]. M. Sinclair and S. King, "Where are the challenges in speaker diarization?," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 7741-7745, doi: 10.1109/ICASSP.2013.6639170.
[4]. GPT-4 Turbo in the OpenAI API. (n.d.). https://help.openai.com/en/articles/8555510-gpt-4-turbo-in-the-openai-api
[5]. New models and developer products announced at DevDay. (n.d.). https://openai.com/blog/new-models-and-developer-products-announced-at-devday
[6]. NVIDIA Docs. (n.d.). Speaker diarization. https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html
[7]. NVIDIA. (n.d.). NeMo/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb at main · NVIDIA/NeMo. GitHub. https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb
[8]. La Javaness R&D. (2023, July 17). Deep-dive into NeMo: How to efficiently tune a speaker diarization pipeline? Medium. https://lajavaness.medium.com/deep-dive-into-nemo-how-to-efficiently-tune-a-speaker-diarization-pipeline-d6de291302bf
[9]. La Javaness R&D. (2023, July 17). Speaker diarization: An introductory overview. Medium. https://lajavaness.medium.com/speaker-diarization-an-introductory-overview-c070a3bfea70
[10]. La Javaness R&D. (2023, October 2). Comparing state-of-the-art speaker diarization frameworks: Pyannote vs NeMo. Medium. https://lajavaness.medium.com/comparing-state-of-the-art-speaker-diarization-frameworks-pyannote-vs-nemo-31a191c6300
[11]. Rich transcription evaluation. NIST. (2022, August 22). https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation
8. Greatest Appreciation
- Akinori Nakajima: Representative Director of VoicePing Corporation
- Melnikov Ivan: AI Developer of VoicePing Corporation