Apr 2024, Speaker Diarization Performance Evaluation: Pyannote.audio vs Nvidia Nemo, and Post-Processing Approach Using OpenAI’s GPT-4 Turbo

Taishin Maeda - Waseda University

1. Introduction

What is Speaker Diarization?

Speaker diarization is the process of segmenting an audio recording and labeling each segment by speaker. In other words, it answers the question “who spoke when?” for a given recording. It is a useful conversation-analysis tool, especially when coupled with Automatic Speech Recognition (ASR). A speaker diarization system consists of a Voice Activity Detection (VAD) model, which finds the timestamps where speech occurs, and an audio embedding model, which computes embeddings for those time-stamped segments. The embedding vectors are then grouped into clusters to estimate the number of speakers.

Figure 1: Speaker Diarization Pipeline

In this blog, two state-of-the-art open-source frameworks for speaker diarization are discussed: Pyannote.audio and Nvidia Nemo. The focus is on evaluating the performance of the two frameworks in different audio scenarios. The diarization results are then post-processed with OpenAI’s GPT-4 Turbo to see how this additional step affects diarization performance.

What is Pyannote.audio?

Pyannote.audio is an open-source toolkit written in Python for speaker diarization and speaker embedding, built on the PyTorch machine learning (ML) framework.

What is Nvidia Nemo?

Nvidia Nemo takes a somewhat different approach to speaker diarization than Pyannote.audio. Nemo is an open-source deep learning framework developed by NVIDIA and is also based on PyTorch. In the Nemo speaker diarization pipeline, a neural diarizer assigns speaker labels, including for overlapping speech, based on the speaker profiles created from the clustering results. The pipeline can be seen in the figure below:

Figure 2: Nvidia Nemo Speaker Diarization pipeline

Nemo speaker diarization uses a multi-scale approach for segmentation, speaker embedding extraction, and clustering. After Voice Activity Detection (VAD), the detected speech is split into segments and a speaker embedding vector is extracted from each segment [7]. The difficulty is a trade-off between speaker identification quality and temporal granularity. Longer segments give higher-quality speaker representations but low temporal resolution, which can lead to errors at speaker boundaries. Shorter segments give high temporal resolution but lower-quality speaker representations. This trade-off led to the idea of multi-scale segmentation.

The purpose of this idea is to resolve the trade-off between long and short segment lengths. Multiple scales, each with its own segment length, are used, and the affinity values obtained at each scale are fused. The figure below shows what multi-scale segmentation looks like.

Figure 3: The Multi-scale segmentation

In Figure 3, the scale with the highest scale index is called the base scale and has the shortest segment length. Segments are mapped across scales: for each base-scale segment, the segment in every other scale whose midpoint is closest to the base segment’s midpoint is selected [7]. The blue blocks in the figure represent this mapping. To sum up, the multi-scale approach embeds segments at several time scales, which lets the diarization balance accurate voice embeddings, which require longer segments, against fine segmentation granularity, which requires shorter ones.
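The midpoint-based mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the window lengths and shifts in `scales`, the 3-second duration, and the helper names `make_segments` and `map_to_base` are hypothetical and are not taken from the Nemo configuration used in this experiment.

```python
def make_segments(total_duration, window, shift):
    """Cut [0, total_duration) into windows of `window` seconds every `shift` seconds."""
    segments = []
    start = 0.0
    while start < total_duration:
        segments.append((start, min(start + window, total_duration)))
        start += shift
    return segments


def midpoint(segment):
    return (segment[0] + segment[1]) / 2.0


def map_to_base(base_segments, other_segments):
    """For each base-scale segment, return the index of the segment in
    `other_segments` whose midpoint is closest to the base segment's midpoint."""
    mapping = []
    for base in base_segments:
        target = midpoint(base)
        closest = min(
            range(len(other_segments)),
            key=lambda i: abs(midpoint(other_segments[i]) - target),
        )
        mapping.append(closest)
    return mapping


# Illustrative (window, shift) pairs in seconds, longest first.
# The last entry plays the role of the base scale (shortest segments).
scales = [(1.5, 0.75), (1.0, 0.5), (0.5, 0.25)]
total_duration = 3.0

segments_per_scale = [make_segments(total_duration, w, s) for w, s in scales]
base_segments = segments_per_scale[-1]
mappings = [map_to_base(base_segments, segs) for segs in segments_per_scale]

for (window, _), mapping in zip(scales, mappings):
    print(f"window {window:.2f}s -> segment index per base segment: {mapping}")
```

Each base-scale segment ends up with one matching segment per scale, so the embeddings from all scales can be combined for the same short time span.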

Neural Diarizer

As mentioned earlier, a neural diarizer is a trainable neural model that estimates speaker labels from the given audio. The neural diarizer runs once clustering is done, and it is needed to handle overlapping speech. The Multi-scale Diarization Decoder (MSDD) model is used as the neural diarizer [7]. The basic idea is that the MSDD model uses the clustering diarizer to obtain the estimated number of speakers and a predicted speaker profile for each speaker.

Comparing the main models involved in two diarization pipelines

| Component | Pyannote.audio | Nvidia Nemo |
| --- | --- | --- |
| Voice Activity Detection (VAD) | PyanNet (SincNet-based) | Multilingual MarbleNet |
| Audio/Speaker Embedding | ECAPA-TDNN | TitaNet Large |
| Clustering | Hidden Markov Model Clustering | Multi-scale Clustering (MSDD Telephonic) |

The table above compares the models used in the two diarization pipelines [10].

Rich Transcription Time Marked (RTTM)

The RTTM file format is a standard format for speaker diarization output, and it can later be used to evaluate the predicted diarization against a reference.

SPEAKER obama_zach(5min).wav 1 66.32 0.27 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 66.60 0.17 <NA> <NA> SPEAKER_00 <NA> <NA>

The plain text above is an example of RTTM output produced by speaker diarization. Each line has three crucial fields: the segment start time, the segment duration, and the speaker label. Here, the first line shows a segment that starts at 66.32 seconds, lasts 0.27 seconds, and is attributed to “SPEAKER_01”.
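As a small illustration of how these fields are laid out, the sketch below parses one RTTM line using only the standard library. The names `RttmTurn` and `parse_rttm_line` are hypothetical helpers introduced here, not part of any diarization framework.

```python
from typing import NamedTuple


class RttmTurn(NamedTuple):
    file_id: str
    start: float      # segment start time in seconds
    duration: float   # segment duration in seconds
    speaker: str

    @property
    def end(self):
        return self.start + self.duration


def parse_rttm_line(line):
    """Parse a single SPEAKER line of an RTTM file.

    Fields are whitespace-separated: type, file, channel, start, duration,
    <NA>, <NA>, speaker label, <NA>, <NA>.
    """
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        raise ValueError(f"Not a SPEAKER line: {line!r}")
    return RttmTurn(
        file_id=fields[1],
        start=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )


turn = parse_rttm_line(
    "SPEAKER obama_zach(5min).wav 1 66.32 0.27 <NA> <NA> SPEAKER_01 <NA> <NA>"
)
print(turn.speaker, turn.start, round(turn.end, 2))
```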

2. Proposed Evaluation Method

This section explains the methods and metrics used to evaluate speaker diarization.

Diarization Error Rate (DER) Metric

The standard metric for speaker diarization is the Diarization Error Rate (DER). It was introduced by the National Institute of Standards and Technology (NIST) in 2000, when speaker diarization was evaluated on broadcast news and conversational telephone speech in English [11]. The DER measures the error in a diarization, and its equation is shown below:

$$\mathrm{DER} = \frac{\text{False Alarm} + \text{Missed Detection} + \text{Confusion}}{\text{Total}}$$

False Alarm: Speech is detected where no speaker is actually talking (a false positive).

Missed Detection: No speech is detected where a speaker is actually talking (a false negative).

Confusion: Speech is detected but assigned to the wrong speaker cluster.

Ideally, the DER would be 0, meaning there is no error. That is difficult to achieve in practice, so the aim of this study is to get the DER as close to 0 as possible.
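To make the formula concrete, here is a minimal worked example. The durations are made-up numbers chosen for easy arithmetic, not measurements from this study, and the sketch assumes that Total is the total duration of speech in the reference, which is the usual convention.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total):
    """DER = (False Alarm + Missed Detection + Confusion) / Total.

    All arguments are durations in seconds; `total` is assumed to be the
    total duration of speech in the reference annotation.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (false_alarm + missed_detection + confusion) / total


# Hypothetical example: 300 s of reference speech with 6 s of false alarm,
# 9 s of missed detection, and 15 s attributed to the wrong speaker.
der = diarization_error_rate(
    false_alarm=6.0, missed_detection=9.0, confusion=15.0, total=300.0
)
print(f"DER = {der:.3f}")
```

With these numbers the three error terms add up to 30 seconds out of 300, so the DER is 0.1, or 10 percent.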

Ground Truth Labeling File or Referencing File

To compute the DER, a reference RTTM file is needed for comparison with the diarized output. The reference files are discussed in more detail in the Experimental Setup section.

Time Performance and Hardware Resources

In addition to the DER, this study takes into account the execution time and GPU memory usage of both Pyannote and Nemo. This shows not only the accuracy of each model but also the time and hardware costs that need to be considered for real business use.

Post-Processing Approach Using ChatGPT-4-Turbo

After the diarization results are obtained from both Pyannote and Nemo, each result is post-processed to produce an additional set of results for DER comparison. The study uses OpenAI’s GPT-4 Turbo to guess and rearrange the speaker labels in the diarization results from both Pyannote and Nemo, and then measures the accuracy. At the time of writing, GPT-4 Turbo is OpenAI’s latest-generation model, with knowledge up to April 2023 and support for 128,000 tokens in one prompt (roughly 300 pages of text) [4]. For the experiment, a GPT script is written in Python following the OpenAI API. Like the ChatGPT chatbot available on the internet, the Python script lets the user input a prompt and get an answer. The post-processing requires three inputs:

1. The speech-to-text (STT) transcription of the audio file
2. The diarization results in RTTM format from the specific framework
3. A prompt instructing the chatbot

The detailed procedure is discussed in the Experimental Setup section.

Figure 4: The diagram exhibiting the flow of the post-processing

3. Experimental Setup

3.1  Experimental procedure

This section describes the experimental process. Multiple datasets, each consisting of an audio file and a reference file, are used to demonstrate the performance of the Pyannote.audio and Nvidia Nemo frameworks. Each audio file has a different scenario and conditions, chosen to suit each library so that it can show its full potential. To evaluate whether the frameworks are suitable for various use cases, the experiment measures the Diarization Error Rate (DER) using pyannote.metrics, together with the execution time and hardware resources. The experiment covers several scenarios, and for each scenario and framework it is run twice.

Specifically, the first run performs diarization without pre-identifying the number of speakers in the code, which is discussed later. The second run pre-identifies the number of speakers in the code before the program is executed. Comparing the two runs shows how this setting affects the performance of each framework. The first scenario is an audio file with two speakers, and the second scenario is an audio file with more than five speakers.

Finally, after all diarization results are gathered, the post-processing with ChatGPT-4-Turbo is conducted. A Python STT transcription script is used to transcribe both the 5-minute and 9-minute audio files. The transcriptions are given as input to the ChatGPT-4-Turbo script together with the diarization results in RTTM format from both Pyannote.audio and Nemo.

3.2 Time Performance and Hardware Resources

The time performance and hardware resource utilization were recorded for each scenario and framework execution. The execution time was measured with Python’s built-in “time” module, by taking a timestamp before and after the diarization call.

Execution Time: The total time taken for the execution of the code.

GPU Usage: The memory used by the GPU.

GPU: Nvidia GeForce RTX 3090
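The timing measurement can be wrapped in a small reusable helper, sketched below with the standard library only. The `timed` context manager and the dummy workload are hypothetical stand-ins; in the experiment the timed statement is the diarization call, and GPU memory is read separately from the output of nvidia-smi, as in Code 1.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload; replace with the diarization call being measured.
with timed("dummy_workload", timings):
    sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} seconds")
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring short durations; either works for runs that take several seconds.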

3.3  Audio Files and Datasets

Two audio files are used in this experiment.

The first is a five-minute audio file with only two speakers, taken from the “obama-zach.wav” file. Its reference file (the ground-truth labeling file) was created manually in Audacity, a free and open-source digital audio editor and recording application. The segment labels were created in Audacity, and the label track was exported as a .txt file and later converted to RTTM format.
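The conversion from an Audacity label track to RTTM can be sketched as follows. This is an assumption about how such a conversion might look rather than the exact script used here: it relies on Audacity’s exported label format of one tab-separated line per label (start time, end time, label text), and the function name `audacity_labels_to_rttm` and the example label values are hypothetical.

```python
def audacity_labels_to_rttm(label_text, file_id):
    """Convert Audacity label-track text (start<TAB>end<TAB>label per line)
    into RTTM SPEAKER lines. The label text is used as the speaker ID."""
    rttm_lines = []
    for raw in label_text.splitlines():
        parts = raw.strip().split("\t")
        if len(parts) < 3:
            continue  # skip blank lines and labels without text
        try:
            start, end = float(parts[0]), float(parts[1])
        except ValueError:
            continue  # skip lines that do not begin with two numbers
        speaker = parts[2].strip().replace(" ", "_")  # RTTM fields cannot contain spaces
        rttm_lines.append(
            f"SPEAKER {file_id} 1 {start:.2f} {end - start:.2f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(rttm_lines)


# Hypothetical two-label export, for illustration only.
example_labels = "0.010000\t2.340000\tSPEAKER_00\n5.830000\t13.110000\tSPEAKER_01\n"
rttm = audacity_labels_to_rttm(example_labels, "obama_zach(5min).wav")
print(rttm)
```

RTTM stores a start time and a duration rather than a start and an end, which is why each line writes `end - start`.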

The second is a nine-minute audio file. Both the audio and its ground-truth labeling file are taken from the VoxConverse speaker diarization dataset (see: https://github.com/joonson/voxconverse?tab=readme-ov-file). This audio file contains more than five speakers, in order to observe the full potential of the Nvidia Nemo model.

Audio file lists:

obama_zach(5min).wav

bvqnu(9min).wav

Segmentation Labeling Using Audacity

Figure 5: The segmentation labeling created manually in Audacity

The segmentation labels can be seen at the bottom of the figure, where “who spoke when” is annotated manually.

3.4  Programming Code Using Python

Pyannote code

from pyannote.audio import Pipeline
import torch
import os
import time

# Load the pretrained diarization pipeline and send it to the GPU if available
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_auth"  # Replace with your Hugging Face token
)
use_cuda = torch.cuda.is_available()
if use_cuda:
    pipeline.to(torch.device("cuda"))

def diarization(audio_path, num_speakers=None):
    # Perform speaker diarization using the pretrained pipeline.
    # To pre-identify the number of speakers, pass it as num_speakers (e.g. 2);
    # leave it as None to let the pipeline estimate the number itself.
    start_time = time.time()
    if num_speakers is None:
        result = pipeline(audio_path)
    else:
        result = pipeline(audio_path, num_speakers=num_speakers)
    end_time = time.time()
    execution_time = end_time - start_time

    # Get the diarization result and format it as RTTM lines.
    # The file field uses the file name only, since RTTM fields cannot contain spaces.
    rttm = "SPEAKER {file} 1 {start:.2f} {duration:.2f} <NA> <NA> {speaker} <NA> <NA>"
    diarization_result = [
        rttm.format(file=os.path.basename(audio_path), start=turn.start,
                    duration=turn.duration, speaker=speaker)
        for turn, _, speaker in result.itertracks(yield_label=True)
    ]

    # Save the RTTM file next to the audio file
    rttm_file_path = f"{audio_path}.rttm"
    with open(rttm_file_path, "w") as rttm_file:
        for line in diarization_result:
            rttm_file.write(line + "\n")

    return rttm_file_path, execution_time

if __name__ == '__main__':
    audio_path = "your_file_path"  # Provide the path to your audio file
    rttm_file_path, execution_time = diarization(audio_path)
    print(f"RTTM file saved as: {rttm_file_path}")
    print(f"Execution time: {execution_time} seconds")

    # Print GPU status (including memory usage) when a GPU is in use
    if use_cuda:
        os.system("nvidia-smi")
    else:
        print("No GPU in use.")
Code 1: A diarization program using the pyannote.audio model

Code 1 performs speaker diarization on a given audio file using the pretrained Pyannote pipeline from Hugging Face. The program loads the pretrained diarization pipeline and sends it to the GPU if one is available. The diarization function then takes the audio file path, runs the pipeline on the audio, and saves the results in RTTM format.

The code is adapted from: https://github.com/pyannote/pyannote-audio

Nvidia Nemo Code

import json
import os
from nemo.collections.asr.models import NeuralDiarizer
from omegaconf import OmegaConf
import wget

def diarize_audio(input_file):
    # Manifest entry describing the audio to diarize
    meta = {
        'audio_filepath': input_file,
        'offset': 0,
        'duration': None,
        'label': 'infer',
        'text': '-',
        # To pre-identify the number of speakers, set it here. It is only used
        # when clustering.parameters.oracle_num_speakers is True in the .yaml config.
        'num_speakers': None,
        'rttm_filepath': None,
        'uem_filepath': None
    }

    # Write manifest
    manifest_path = 'input_manifest.json'
    with open(manifest_path, 'w') as fp:
        json.dump(meta, fp)
        fp.write('\n')

    output_dir = 'output'
    os.makedirs(output_dir, exist_ok=True)

    # Load the model config. If the local telephonic config is missing,
    # fall back to downloading the general inference config from the NeMo repository.
    model_config = 'diar_infer_telephonic.yaml'
    if not os.path.exists(model_config):
        config_url = "https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/speaker_tasks/diarization/conf/inference/diar_infer_general.yaml"
        model_config = wget.download(config_url)
    config = OmegaConf.load(model_config)

    # Point the diarizer at the manifest and the output directory
    config.diarizer.manifest_filepath = manifest_path
    config.diarizer.out_dir = output_dir

    config.diarizer.msdd_model.model_path = 'diar_msdd_telephonic'  # Telephonic speaker diarization model
    config.diarizer.msdd_model.parameters.sigmoid_threshold = [0.7, 1.0]  # Evaluate with T=0.7 and T=1.0

    # Initialize diarizer
    msdd_model = NeuralDiarizer(cfg=config)

    # Diarize audio
    diarization_result = msdd_model.diarize()

    return diarization_result

if __name__ == "__main__":
    input_file = 'your_audio_file_path'  # mono .wav
    result = diarize_audio(input_file)
Code 2: A diarization program using Nvidia Nemo model

Code 2 performs speaker diarization on an audio file using the NeMo library. The process involves preparing a manifest file that describes the input, loading the diarization model configuration (a .yaml file), initializing the diarizer with the model, and diarizing the audio. The Multi-Scale Diarization Decoder (MSDD) model is used in the code. In this case, the telephonic speaker diarization model is used, which is designed for telephonic scenarios, and two sigmoid thresholds are set for evaluation.

The code is adapted from: https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb#scrollTo=CwtVUgVNBR_P

Starting configuration files can be downloaded from: https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/inference

Note that the configuration file is essential for running the Nvidia Nemo framework in this case. The .yaml configuration file must be adjusted to suit the audio scenario (e.g. the number of speakers) in order to produce an accurate result.

Diarization Error Rate Evaluation Code

from pyannote.core import Segment, Annotation
from pyannote.metrics.diarization import DiarizationErrorRate

# Read an RTTM file (reference or hypothesis) into a pyannote Annotation
def read_rttm(file_path):
    data = Annotation()
    with open(file_path, 'r') as file:
        for line in file:
            parts = line.strip().split()
            # A SPEAKER line has at least 8 fields; the speaker label is at index 7
            if len(parts) >= 8 and parts[0] == "SPEAKER":
                start_time = float(parts[3])
                end_time = start_time + float(parts[4])
                speaker_id = parts[7]
                segment = Segment(start_time, end_time)
                data[segment] = speaker_id
    return data

ref_file_path = "ground-truth.rttm"
hyp_rttm_file_path1 = "pyannote.rttm"
hyp_rttm_file_path2 = "pyannote(pre-identified speaker no).rttm"
hyp_rttm_file_path3 = "Nemo.rttm"
hyp_rttm_file_path4 = "Nemo(pre-identified speaker no).rttm"

reference = read_rttm(ref_file_path)
hypothesis1 = read_rttm(hyp_rttm_file_path1)
hypothesis2 = read_rttm(hyp_rttm_file_path2)
hypothesis3 = read_rttm(hyp_rttm_file_path3)
hypothesis4 = read_rttm(hyp_rttm_file_path4)

# Initialize Diarization Error Rate
diarization_error_rate = DiarizationErrorRate()

# Evaluate DER
der1 = diarization_error_rate(reference, hypothesis1)
der2 = diarization_error_rate(reference, hypothesis2)
der3 = diarization_error_rate(reference, hypothesis3)
der4 = diarization_error_rate(reference, hypothesis4)

print(f'DER for pyannote: {der1:.3f}')
print(f'DER for pyannote with number of speakers pre-identified: {der2:.3f}')
print(f'DER for Nemo: {der3:.3f}')
print(f'DER for Nemo with number of speakers pre-identified: {der4:.3f}')
Code 3: A program to check the Diarization Error Rate of both pyannote.audio framework and Nvidia Nemo framework

Code 3 evaluates the different speaker diarization results using the Diarization Error Rate (DER) metric. It reads the reference and hypothesis RTTM files, which represent the ground truth and the diarization results respectively, and then calculates the DER of each hypothesis against the reference.

ChatGPT-4-Turbo Code

from openai import OpenAI

api_key = "input your api key here"
client = OpenAI(api_key=api_key)

# Earlier GPT-4 Turbo preview model: "gpt-4-1106-preview"

def generate_responses(prompt, model="gpt-4-turbo-2024-04-09"):
    # Send the prompt as a single user message and return the reply text
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant who provides information to users."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.9,
        max_tokens=4096,
    )

    return response.choices[0].message.content

print(generate_responses("Input prompt here."))
Code 4: The ChatGPT-4-Turbo script

STT Transcription Using Whisper-1 Model Code

from openai import OpenAI, OpenAIError

# Set your OpenAI API key
api_key = "input your api key here"

# Initialize OpenAI client with API key
client = OpenAI(api_key=api_key)

try:
    # Open the audio file; the with-block closes it automatically
    with open("input your audio here", "rb") as audio_file:
        # Create transcription using OpenAI client
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )

    # Print the transcription text
    print(transcription.text)

except OpenAIError as e:
    print("OpenAI API Error:", e)
Code 5: STT Transcription

Input Prompt for ChatGPT-4 Turbo

“ Based on the Transcription Results (In this part, the whole transcription should be included) and Diarization Results in RTTM (In this part, the whole RTTM should be included), can you guess or rearrange the speaker ID of the diarization results to the results that you think it is correct? ”

The quote above is the prompt used to instruct ChatGPT to revise the diarization results. It is passed to generate_responses in the print statement (the last statement) of Code 4. The results are shown in the next section.
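One way to assemble this prompt programmatically is sketched below. The template follows the wording of the quoted prompt, with the two parenthesised placeholders replaced by the actual transcription and RTTM text. The names `PROMPT_TEMPLATE` and `build_prompt`, and the short example inputs, are hypothetical; the resulting string would be passed to `generate_responses` from Code 4.

```python
PROMPT_TEMPLATE = (
    "Based on the Transcription Results ({transcription}) and "
    "Diarization Results in RTTM ({rttm}), can you guess or rearrange "
    "the speaker ID of the diarization results to the results that you "
    "think it is correct?"
)


def build_prompt(transcription, rttm):
    """Insert the full transcription and the full RTTM text into the prompt."""
    return PROMPT_TEMPLATE.format(
        transcription=transcription.strip(),
        rttm=rttm.strip(),
    )


# Short stand-ins for the full transcription and RTTM file contents.
example_transcription = "Good to be with you, Zach."
example_rttm = "SPEAKER obama_zach(5min).wav 1 0.01 2.33 <NA> <NA> SPEAKER_00 <NA> <NA>\n"

prompt = build_prompt(example_transcription, example_rttm)
print(prompt)
```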

4. Results and Discussion

Diarization Error Rate (DER) For Two Speakers Audio (5 min)

| Framework | DER |
| --- | --- |
| Pyannote.audio | 0.252 |
| Pyannote.audio with the number of speakers pre-identified | 0.214 |
| Nvidia Nemo | 0.161 |
| Nvidia Nemo with the number of speakers pre-identified | 0.161 |

Table 1: The DER results from the 5 minutes audio containing two speakers

Diarization Error Rate (DER) For Nine Speakers Audio (9 min)

| Framework | DER |
| --- | --- |
| Pyannote.audio | 0.083 |
| Pyannote.audio with the number of speakers pre-identified | 0.098 |
| Nvidia Nemo with the number of speakers pre-identified | 0.097 |

Table 2: The DER results from the 9 minutes audio containing nine speakers

The results of comparing the Pyannote.audio and Nvidia Nemo frameworks are displayed above. The Diarization Error Rate (DER) results for the 5-minute and 9-minute audio files are shown in Table 1 and Table 2, respectively. For the five-minute audio containing two speakers, the overall DER is higher than expected. One likely reason is the accuracy of the ground-truth annotation: because it was created manually in Audacity, its segment boundaries may not be precise. Comparing the frameworks, Nvidia Nemo produced a DER about 9 percentage points lower than Pyannote.audio (0.161 versus 0.252). For Pyannote.audio, pre-identifying the number of speakers lowered the DER slightly (from 0.252 to 0.214), whereas for Nvidia Nemo the DER did not change. In short, Nvidia Nemo gave the better DER in the two-speaker case with a manually annotated ground-truth file.

For the nine-minute audio containing nine speakers, with the ground-truth annotation file taken from VoxConverse, the overall DER is considerably lower than for the five-minute audio. As mentioned above, a likely reason is the more accurate ground-truth annotation provided by VoxConverse. Note that in this case Nvidia Nemo was only run with the number of speakers pre-identified, because the Nemo code in its default mode was not suitable for more than two speakers. The parameters in the .yaml configuration file therefore had to be adjusted by specifying the number of speakers in advance before the code could diarize the audio. The outcome shows a different trend from the previous case: Pyannote.audio produced a DER about 1.4 percentage points lower than Nvidia Nemo with the number of speakers pre-identified (0.083 versus 0.097). This could be due to the telephonic MSDD model used in the configuration file. The model is designed to work best on telephone conversations, and this audio is not a telephone conversation, so the results may not be the best Nemo can achieve. However, Pyannote.audio with the number of speakers pre-identified gave almost the same DER as Nvidia Nemo (0.098 versus 0.097).

DER of Post-Processing Approach Using OpenAI’s ChatGPT

| Framework | DER for GPT-4-Turbo | DER for GPT-3.5 |
| --- | --- | --- |
| Pyannote.audio for Two Speaker Audio (5min) | 0.427 | 0.494 |
| Nvidia Nemo for Two Speaker Audio (5min) | 0.179 | 0.544 |

Table 3: The DER results after post-processing the 5 minutes audio containing two speakers

| Framework | DER for GPT-4-Turbo | DER for GPT-3.5 |
| --- | --- | --- |
| Pyannote.audio for Nine Speaker Audio (9min) | 0.103 | 0.214 |
| Nvidia Nemo for Nine Speaker Audio (9min) | 0.128 | 0.179 |

Table 4: The DER results after post-processing the 9 minutes audio containing nine speakers

The results of post-processing the output of both frameworks with ChatGPT-4 Turbo are shown in Tables 3 and 4, which compare the DER after post-processing with ChatGPT-4 Turbo and with ChatGPT-3.5. The DERs with ChatGPT-4 Turbo are lower than those with ChatGPT-3.5 for both the 5-minute and 9-minute audio files. Although the DERs after post-processing are higher than without it, the comparison shows that GPT-4 Turbo gives more promising results than GPT-3.5. That said, there is room for improvement. One likely reason the post-processing gives higher DERs is that ChatGPT has no direct access to the audio file and so cannot take the conditions in the recording into account, which leaves it without reliable timing information or the exact number of speakers. Its analysis relies on the transcription text from the STT step, which carries no timestamps, so ChatGPT cannot know numerically “who spoke when”. Since diarization is fundamentally based on the audio, giving ChatGPT accurate access to the audio would be necessary to get the best results. In addition, the prompt given to ChatGPT could be more detailed. For example, the user could state the timestamp conditions or the number of speakers in the conversation beforehand, which would help ChatGPT diarize the conversation adequately. In real-world situations, however, the timestamps and the number of speakers are not always known. It is therefore a challenge to use this post-processing approach when those parameters are needed to get optimal results.

Transcription Text for the 5-Minute Audio File, Produced with the Whisper-1 Model

Sorry, I had to cancel a few times. My mouse pad broke last week, and I had to get my great-aunt some diabetes shoes. 
And, uh, You know what, Zach? It's no problem. I mean, I have to say, when I heard that, like, people actually watch this show, I was actually pretty surprised. 
Hi. Welcome to another edition of Between Two Ferns. I'm your host, Zach Galifianakis. And my guest today is Barack Obama. President Barack Obama. 
Good to be with you, Zach. First question. In 2013, you pardoned a turkey. What do you have planned for 2014? We'll probably pardon another turkey.
....
Did you say invisible? Because I just think that's impolite. Not invisible, invincible, meaning that they don't think they can get hurt. 
I'm just saying that nobody could be invisible if you had said invisible. I understand that. If they get that health insurance, it can really make a big difference. 
And they've got till March 31 to sign up. I don't have a computer, so how does? Well, then you can call 1-800-318-2596. I don't have a phone. 
I'm off the grid. I don't want you people looking at my text, if you know what I mean. 
First of all, Zach, nobody's interested in your texts. Second of all, you can do it in person. And the law means that insurers can't discriminate against you if you've got a pre-existing condition anymore. 
Yeah, but what about this, though?

Transcription Text for the 9-Minute Audio File, Produced with the Whisper-1 Model

President Trump's rally tonight in Pennsylvania scattered among the crowd are people who believe in some conspiracy theories that are so broad and often bizarre it's difficult to believe to put it mildly. 
It's no longer an isolated thing. Take a look. The sign with a Q on it stands for QAnon. 
This video is from the presidential rally in Tampa two nights ago. Last night on the broadcast we focused more closely on what the group believes in and their views from the fringes of American political thought. 
Tonight we wanted to give them a chance to have their say but because so much has been written about their reluctance to talk we weren't sure what we would get when we sent Gary Tuchman to tonight's Trump rally. 
Gary joins us now. So what happened? Well Anderson the rally just ended a short time ago.
.....
Yeah it's very confusing I mean sort of the and it's constantly growing like sort of with the with something will happen in the news 
and they'll claim oh the deep state tried to shoot down Air Force One sort of the gist of it is that Trump has teamed up with the military and sort of various virtuous world leaders including Vladimir Putin and Kim Jong-un to take on this global cabal of Democrats and Hollywood elites and bankers 
and all this kind of stuff who they claim are essentially responsible for all the evil in the world and soon Trump will have all these people arrested.

Example Outputs Before and After Conducting Post-Processing Using ChatGPT-4 Turbo (5 minutes audio diarization using Pyannote.audio)

Before

SPEAKER obama_zach(5min).wav 1 0.01 2.33 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 3.73 2.11 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 5.83 7.28 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 13.39 3.62 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 17.94 4.75 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 23.18 3.40 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 27.43 3.80 <NA> <NA> SPEAKER_00 <NA> <NA>
.....
.....
SPEAKER obama_zach(5min).wav 1 175.70 0.03 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 175.73 0.42 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 176.77 6.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 183.85 2.00 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 187.28 2.82 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 190.40 0.44 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 190.99 1.66 <NA> <NA> SPEAKER_01 <NA> <NA>
.....
.....
SPEAKER obama_zach(5min).wav 1 297.90 0.93 <NA> <NA> SPEAKER_00 <NA> <NA>

After

SPEAKER obama_zach(5min).wav 1 0.01 2.33 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 3.73 2.11 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 5.83 7.28 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 13.39 3.62 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 17.94 4.75 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 23.18 3.40 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 27.43 3.80 <NA> <NA> SPEAKER_01 <NA> <NA>
.....
.....
SPEAKER obama_zach(5min).wav 1 175.70 0.03 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 175.73 0.42 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 176.77 6.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 183.85 2.00 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 187.28 2.82 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 190.40 0.44 <NA> <NA> SPEAKER_01 <NA> <NA>
.....
.....
SPEAKER obama_zach(5min).wav 1 297.90 0.93 <NA> <NA> SPEAKER_00 <NA> <NA>
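To see exactly which segments were relabeled by the post-processing, the two RTTM outputs can be compared segment by segment. The sketch below is a hypothetical helper, not part of the original experiment. It matches segments by their start time and duration strings and lists those whose speaker label changed, using three lines from the example above as input.

```python
def relabeled_segments(before_rttm, after_rttm):
    """Return (start, duration, old_speaker, new_speaker) for every segment
    present in both RTTM texts whose speaker label differs."""
    def turns(text):
        result = {}
        for line in text.splitlines():
            fields = line.split()
            if len(fields) >= 8 and fields[0] == "SPEAKER":
                # Key on the start/duration strings exactly as written
                result[(fields[3], fields[4])] = fields[7]
        return result

    before, after = turns(before_rttm), turns(after_rttm)
    return [
        (start, duration, before[(start, duration)], after[(start, duration)])
        for (start, duration) in before
        if (start, duration) in after
        and before[(start, duration)] != after[(start, duration)]
    ]


before_text = """\
SPEAKER obama_zach(5min).wav 1 3.73 2.11 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 5.83 7.28 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 27.43 3.80 <NA> <NA> SPEAKER_00 <NA> <NA>
"""
after_text = """\
SPEAKER obama_zach(5min).wav 1 3.73 2.11 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 5.83 7.28 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 27.43 3.80 <NA> <NA> SPEAKER_01 <NA> <NA>
"""

changes = relabeled_segments(before_text, after_text)
for start, duration, old, new in changes:
    print(f"{start} (+{duration}): {old} -> {new}")
```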

Example Outputs Before and After Conducting Post-Processing Using ChatGPT-4 Turbo (9 minutes audio diarization using Pyannote.audio)

Before

SPEAKER bvqnu.wav 1 0.26 8.66 <NA> <NA> SPEAKER_07 <NA> <NA>
SPEAKER bvqnu.wav 1 9.38 24.89 <NA> <NA> SPEAKER_07 <NA> <NA>
SPEAKER bvqnu.wav 1 36.10 14.82 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 52.67 12.85 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 65.63 0.73 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 66.00 0.22 <NA> <NA> SPEAKER_03 <NA> <NA>
SPEAKER bvqnu.wav 1 66.36 0.20 <NA> <NA> SPEAKER_03 <NA> <NA>
.....
.....
SPEAKER bvqnu.wav 1 156.56 4.02 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 160.87 9.32 <NA> <NA> SPEAKER_04 <NA> <NA>
SPEAKER bvqnu.wav 1 161.49 0.03 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 161.52 0.73 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 163.71 0.95 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 165.00 0.58 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 167.07 6.18 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 172.49 0.02 <NA> <NA> SPEAKER_04 <NA> <NA>
SPEAKER bvqnu.wav 1 172.50 0.61 <NA> <NA> SPEAKER_03 <NA> <NA>
SPEAKER bvqnu.wav 1 173.25 0.41 <NA> <NA> SPEAKER_03 <NA> <NA>
.....
.....
SPEAKER bvqnu.wav 1 543.71 29.02 <NA> <NA> SPEAKER_06 <NA> <NA>

After

SPEAKER bvqnu.wav 1 0.26 8.66 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER bvqnu.wav 1 9.38 24.89 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER bvqnu.wav 1 36.10 14.82 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 52.67 12.85 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 65.63 0.73 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 66.00 0.22 <NA> <NA> SPEAKER_02 <NA> <NA>
SPEAKER bvqnu.wav 1 66.36 0.20 <NA> <NA> SPEAKER_02 <NA> <NA>
.....
.....
SPEAKER bvqnu.wav 1 156.56 4.02 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 160.87 9.32 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 161.49 0.03 <NA> <NA> SPEAKER_03 <NA> <NA>
SPEAKER bvqnu.wav 1 161.52 0.73 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 163.71 0.95 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 165.00 0.58 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 167.07 6.18 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER bvqnu.wav 1 172.49 0.02 <NA> <NA> SPEAKER_05 <NA> <NA>
SPEAKER bvqnu.wav 1 172.50 0.61 <NA> <NA> SPEAKER_03 <NA> <NA>
SPEAKER bvqnu.wav 1 173.25 0.41 <NA> <NA> SPEAKER_03 <NA> <NA>
.....
.....
SPEAKER bvqnu.wav 1 543.71 29.02 <NA> <NA> SPEAKER_07 <NA> <NA>

As can be seen, a couple of rearrangements occurred after the post-processing. The timestamps remained the same, but the speaker IDs were reassigned according to how ChatGPT-4 Turbo expected the conversation to be diarized. These results suggest that the timestamps and durations of each speaker’s turns are among the factors that affect the DER, which is consistent with the fact that ChatGPT has no access to the audio file itself.
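The label changes described above can also be listed programmatically instead of by eye. The following is a minimal sketch, assuming both outputs are available as RTTM text and that file names contain no spaces; the helper names `parse_rttm` and `changed_labels` are illustrative and are not part of Pyannote.audio or Nvidia Nemo.

```python
from typing import List, Tuple

# (start time in seconds, duration in seconds, speaker label)
Segment = Tuple[float, float, str]


def parse_rttm(text: str) -> List[Segment]:
    """Parse RTTM text into (start, duration, speaker) tuples.

    Lines that do not start with "SPEAKER" (such as "....." elisions) are skipped.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        # RTTM columns: type, file, channel, start, duration, <NA>, <NA>, speaker, ...
        segments.append((float(fields[3]), float(fields[4]), fields[7]))
    return segments


def changed_labels(before: List[Segment], after: List[Segment]):
    """Return (start, duration, old_label, new_label) for relabeled segments."""
    after_by_time = {(start, dur): spk for start, dur, spk in after}
    changes = []
    for start, dur, spk in before:
        new_spk = after_by_time.get((start, dur))
        if new_spk is not None and new_spk != spk:
            changes.append((start, dur, spk, new_spk))
    return changes


# Two lines taken from the five-minute example, before and after post-processing
before_text = """SPEAKER obama_zach(5min).wav 1 27.43 3.80 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 175.70 0.03 <NA> <NA> SPEAKER_00 <NA> <NA>"""
after_text = """SPEAKER obama_zach(5min).wav 1 27.43 3.80 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER obama_zach(5min).wav 1 175.70 0.03 <NA> <NA> SPEAKER_00 <NA> <NA>"""

for change in changed_labels(parse_rttm(before_text), parse_rttm(after_text)):
    print(change)
```

Segments are matched on identical start time and duration, which works here because the post-processing left the timestamps untouched.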

Regarding the numerical results and example outcomes, since ChatGPT generates its response based on context, describing the task and its conditions in detail beforehand could lead to more precise results. In this case, providing the number of speakers, the duration of the audio, the conversation topics, the conversation type (phone call, broadcast, or face-to-face), or visual data could give ChatGPT more insight and produce better outcomes. In real-world situations, however, this information is often unknown. ChatGPT might also not be specifically trained for this diarization task. Therefore, performing the post-processing with other large language models trained specifically for this task could be one solution.
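As a rough illustration of supplying such prior information, the sketch below assembles whatever context is known into a prompt. The function name `build_context_prompt`, its parameters, and the prompt wording are all hypothetical; this is not the prompt used in the experiments, and no API call is made.

```python
def build_context_prompt(rttm_text, num_speakers=None, duration_s=None,
                         topic=None, setting=None):
    """Assemble a post-processing prompt that states the known prior information."""
    context = []
    if num_speakers is not None:
        context.append(f"Number of speakers: {num_speakers}")
    if duration_s is not None:
        context.append(f"Audio duration: {duration_s:.0f} seconds")
    if topic:
        context.append(f"Conversation topic: {topic}")
    if setting:
        context.append(f"Conversation setting: {setting}")

    # Fall back to an explicit statement when nothing is known in advance
    if context:
        header = "Known context:\n" + "\n".join(context)
    else:
        header = "No prior context is available."

    return (
        f"{header}\n\n"
        "Correct the speaker labels in the RTTM lines below. "
        "Keep every start time and duration unchanged.\n\n"
        f"{rttm_text}"
    )


sample_rttm = "SPEAKER obama_zach(5min).wav 1 27.43 3.80 <NA> <NA> SPEAKER_00 <NA> <NA>"
print(build_context_prompt(sample_rttm, num_speakers=2, duration_s=300, setting="face-to-face"))
```

Leaving a parameter unset simply omits that line, which mirrors the real-world case where some of this information is unknown.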

Execution Time Performance For Two Speakers Audio (5 min)

| Framework | Execution Time (seconds) |
| --- | --- |
| Pyannote.audio | 31.281 |
| Pyannote.audio with the number of speakers pre-identified | 29.803 |
| Nvidia Nemo | 63.868 |
| Nvidia Nemo with the number of speakers pre-identified | 49.932 |

Table 5: The time performance and hardware usages for the 5 minutes audio containing two speakers

Execution Time Performance For Nine Speakers Audio (9 min)

| Framework | Execution Time (seconds) |
| --- | --- |
| Pyannote.audio | 44.550 |
| Pyannote.audio with the number of speakers pre-identified | 41.509 |
| Nvidia Nemo with the number of speakers pre-identified | 108.162 |

Table 6: The time performance and hardware usages for the 9 minutes audio containing nine speakers

Next, Table 5 and Table 6 provide the execution times for the audio containing two speakers and nine speakers, respectively. As can be seen, for both audio lengths, the Nvidia Nemo framework took roughly 1.7 to 2.6 times as long as the Pyannote.audio framework to produce the diarization results. As expected, the longer the audio file, the longer both frameworks took to execute. Furthermore, pre-identifying the number of speakers reduced the execution time of both frameworks. These results are therefore consistent with the hypothesis.
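The comparison can be made concrete with a few lines of arithmetic. The sketch below uses only the timings reported in Table 5 and Table 6; the helper names `slowdown` and `saving_percent` are illustrative.

```python
# Execution times in seconds, copied from Table 5 and Table 6
two_speakers = {
    "pyannote": 31.281,
    "pyannote_known": 29.803,
    "nemo": 63.868,
    "nemo_known": 49.932,
}
nine_speakers = {
    "pyannote": 44.550,
    "pyannote_known": 41.509,
    "nemo_known": 108.162,
}


def slowdown(nemo_seconds, pyannote_seconds):
    """How many times longer Nvidia Nemo took than Pyannote.audio."""
    return nemo_seconds / pyannote_seconds


def saving_percent(baseline_seconds, known_seconds):
    """Relative time saved by pre-identifying the number of speakers."""
    return 100 * (baseline_seconds - known_seconds) / baseline_seconds


print(f"5 min, unknown count: {slowdown(two_speakers['nemo'], two_speakers['pyannote']):.2f}x")
print(f"5 min, known count:   {slowdown(two_speakers['nemo_known'], two_speakers['pyannote_known']):.2f}x")
print(f"9 min, known count:   {slowdown(nine_speakers['nemo_known'], nine_speakers['pyannote_known']):.2f}x")
print(f"Pyannote saving (5 min): {saving_percent(two_speakers['pyannote'], two_speakers['pyannote_known']):.1f}%")
print(f"Nemo saving (5 min):     {saving_percent(two_speakers['nemo'], two_speakers['nemo_known']):.1f}%")
```

The ratios come out at about 2.04, 1.68, and 2.61, and pre-identifying the speaker count saves a much larger share of the run time for Nvidia Nemo than for Pyannote.audio on the five-minute audio.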

NVIDIA System Management Interface (SMI) for Two Speakers Audio

Figure 6: The output of the NVIDIA System Management Interface (SMI) command for the Pyannote.audio framework (two speakers)
Figure 7: The output of the NVIDIA System Management Interface (SMI) command for the Nvidia Nemo framework (two speakers)

NVIDIA System Management Interface (SMI) for Nine Speakers Audio

Figure 8: The output of the NVIDIA System Management Interface (SMI) command for the Pyannote.audio framework (nine speakers)
Figure 9: The output of the NVIDIA System Management Interface (SMI) command for the Nvidia Nemo framework (nine speakers)

These figures show the GPU usage in the different scenarios. The results are somewhat difficult to interpret. For the Pyannote.audio framework, the GPU memory usage for the nine-minute audio is slightly lower than for the five-minute audio. For Nvidia Nemo, however, the GPU memory usage for the nine-minute audio is significantly higher than for the five-minute audio, which is the opposite of what the Pyannote.audio framework showed. Moreover, comparing the two frameworks on the same audio, the Nvidia Nemo framework used less GPU memory than the Pyannote.audio framework for the five-minute audio but more for the nine-minute audio.
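Since a single screenshot captures only one instant, logging the memory figures as numbers may make such comparisons easier. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names are illustrative, and the sample string is made-up data used only to show the parsing.

```python
import subprocess

# Ask nvidia-smi for used and total memory (in MiB) as bare CSV values
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(output: str):
    """Parse nvidia-smi CSV output into one (used_mib, total_mib) tuple per GPU."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(value.strip()) for value in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Run nvidia-smi once and return the parsed memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


# Parsing a made-up sample line (not a measurement from this evaluation)
print(parse_memory_csv("3421, 16384\n"))
```

Calling `gpu_memory()` at intervals while a pipeline runs gives a peak value rather than a snapshot. Note that `nvidia-smi` reports memory held by the process, which for PyTorch includes cached blocks, so the figure can exceed what the model is actively using.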

For the Nvidia Nemo framework, since the pipeline relies on a large model (TitaNet Large in this case), it requires a high-performance GPU to run the multi-scale embedding approach and perform segmentation efficiently. Hence, if a CPU or a small GPU is used, replacing the default model with a lighter one (from TitaNet Large to TitaNet Small) could be more beneficial.

5. Supplementary Materials and Applications

In this section, speaker diarization is put to practical use. One of the crucial use cases is making diarization work in real time. This can be highly useful in business settings across multiple industries, such as video meetings, presentations, and conferences. The real-time speaker diarization web application uses the Pyannote.audio framework.

Real-time Speaker Diarization Web application Using WebSockets and FastAPI

from pyannote.audio import Pipeline, Model, Inference
from scipy.spatial.distance import cdist
import torch
from pydub import AudioSegment
import os
from tempfile import TemporaryDirectory

from core.common.config import Config
from core.common.logger import use_logger

config = Config.get_instance()

logger = use_logger(__name__)


class PyannoteService:
    def __init__(self):
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=config.hugging_face.token,
        )
        self.pipeline.to(torch.device("cuda"))

        self.embedding_model = Model.from_pretrained(
            "pyannote/embedding", use_auth_token=config.hugging_face.token
        )
        self.embedding_inference = Inference(self.embedding_model, window="whole")
        self.embedding_inference.to(torch.device("cuda"))

    def diarize(self, audio_file: str):
        diarization = self.pipeline(audio_file)

        return [
            {"timestamp": (turn.start, turn.end), "speaker": speaker}
            for turn, _, speaker in diarization.itertracks(yield_label=True)
        ]

    def create_embedding(self, audio_file: str):
        return self.embedding_inference(audio_file).reshape(1, -1)

    def calculate_embeddings_distance(self, embedding1, embedding2):
        return cdist(embedding1, embedding2, metric="cosine")[0, 0]

    def update_speaker_voice(self, audio_file, speakers, speaker_dir):
        original_audio = AudioSegment.from_wav(audio_file)
        chunk_size_ms = 3 * 1000  # 3 seconds in milliseconds

        embeddings = {}
        for entry in speakers:
            start, end = entry["timestamp"]
            speaker = entry["speaker"]

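            # Process each 3-second chunk (start and end are in seconds)
            chunk_start = start
            while chunk_start < end:
                # Convert the chunk size to seconds before adding it to the start time
                chunk_end = min(chunk_start + chunk_size_ms / 1000, end)
                segment = original_audio[int(chunk_start * 1000) : int(chunk_end * 1000)]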

                speaker_path = os.path.join(speaker_dir, f"{speaker}_{chunk_start}-{chunk_end}.wav")

                if os.path.exists(speaker_path):
                    existing_audio = AudioSegment.from_wav(speaker_path)
                    combined_audio = existing_audio + segment
                    if len(combined_audio) > chunk_size_ms:
                        combined_audio = combined_audio[-chunk_size_ms:]
                    new_audio = combined_audio
                else:
                    if len(segment) > chunk_size_ms:
                        segment = segment[-chunk_size_ms:]
                    new_audio = segment

                # Extend the audio chunk to prevent embedding error
                if len(new_audio) < chunk_size_ms:
                    silence = AudioSegment.silent(chunk_size_ms - len(new_audio))
                    new_audio = silence + new_audio

                new_audio.export(speaker_path, format="wav")
                embeddings[f"{speaker}_{chunk_start}-{chunk_end}"] = self.create_embedding(speaker_path)

                chunk_start = chunk_end

        return embeddings

    def identify_each_chunk_speaker(self, audio_path, speakers, voices_embeddings):
        audio = AudioSegment.from_wav(audio_path)
        result = []
        distances = {}  # all chunk distances, kept for logging
        entry_distances = []  # per-entry list of chunk distance dicts

        chunk_size_ms = 3 * 1000  # 3 seconds in milliseconds

        with TemporaryDirectory() as chunks_dir:
            for entry in speakers:
                speaker = entry["speaker"]
                start, end = entry["timestamp"]
                chunk_distances = []

                chunk_start = start
                while chunk_start < end:
                    chunk_end = min(chunk_start + chunk_size_ms / 1000, end)  # Convert chunk size to seconds
                    chunk_start_ms = int(chunk_start * 1000)
                    chunk_end_ms = int(chunk_end * 1000)

                    audio_chunk_path = os.path.join(
                        chunks_dir, f"{os.path.basename(audio_path)}_{chunk_start_ms}-{chunk_end_ms}.wav"
                    )
                    audio_chunk = audio[chunk_start_ms:chunk_end_ms]

                    # Extend the audio chunk to prevent embedding error
                    if chunk_end - chunk_start < 1:
                        silence = AudioSegment.silent(int((1 - (chunk_end - chunk_start)) * 1000 / 2))
                        audio_chunk = silence + audio_chunk + silence

                    audio_chunk.export(audio_chunk_path, format="wav")

                    # Embed the chunk once, then compare it with every known voice
                    chunk_embedding = self.create_embedding(audio_chunk_path)
                    chunk_distance = {}
                    for voice in voices_embeddings:
                        chunk_distance[voice] = self.calculate_embeddings_distance(
                            voices_embeddings[voice],
                            chunk_embedding,
                        )

                    # Set distance to 0.5 if there's no example of the speaker's voice
                    if speaker not in chunk_distance:
                        chunk_distance[speaker] = 0.5

                    chunk_key = f"{chunk_start_ms}-{chunk_end_ms}"  # Use chunk start and end times as key
                    distances[chunk_key] = chunk_distance
                    chunk_distances.append(chunk_distance)

                    chunk_start = chunk_end

                entry_distances.append(chunk_distances)

        logger.info("DISTANCES\n" + str(distances))

        for entry, chunk_distances in zip(speakers, entry_distances):
            timestamp = entry["timestamp"]

            # Keep the diarized label when the turn produced no chunks
            if not chunk_distances:
                result.append({"timestamp": timestamp, "speaker": entry["speaker"]})
                continue

            # Average each voice's distance over all chunks of this turn
            voice_distances = {
                voice: sum(chunk[voice] for chunk in chunk_distances) / len(chunk_distances)
                for voice in chunk_distances[0]
            }
            avg_distance = sum(voice_distances.values()) / len(voice_distances)

            # Determine speaker based on average distance
            if avg_distance > 0.65:
                speaker = f"SPEAKER_{len(voices_embeddings) + 1:02}"
            else:
                speaker = min(voice_distances, key=voice_distances.get)

            result.append({"timestamp": timestamp, "speaker": speaker})

        return result
Code 6: A modified code for real-time speaker diarization

Code 6 shows the Python code defining the “PyannoteService” class, which handles speaker diarization and voice identification using the Pyannote library. The “diarize” method performs speaker diarization on an input audio file and returns speaker turns with timestamps. “create_embedding” extracts a voice embedding from an audio file, and “calculate_embeddings_distance” calculates the cosine distance between two embeddings. “update_speaker_voice” updates the stored speaker audio recordings from the provided segments and then extracts their embeddings. Finally, “identify_each_chunk_speaker” identifies the speaker of each audio chunk by comparing its embedding with the embeddings of known speakers.

To elaborate on the logic of “update_speaker_voice” and “identify_each_chunk_speaker”: first, the chunking logic was changed from 30-second chunks to 3-second chunks. These chunks are later compared with the embeddings of known speakers. Instead of comparing the embedding of one large 30-second chunk with the embedding of a smaller chunk, the code computes the distance between the embedding of each 3-second chunk and those of the known speakers. The average of those distances is then taken, with the aim of obtaining a better measure of similarity between audio samples.
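The chunk-and-average idea can be shown in isolation from the audio handling. The following is a minimal sketch using plain Python lists in place of real embeddings; the function names and the toy two-dimensional vectors are illustrative assumptions, not part of the application code.

```python
import math


def split_into_chunks(start, end, chunk_s=3.0):
    """Split a speaker turn [start, end) in seconds into chunks of at most chunk_s seconds."""
    chunks = []
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + chunk_s, end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end
    return chunks


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_distance(chunk_embeddings, voice_embedding):
    """Average the distance from every chunk embedding to one known voice."""
    values = [cosine_distance(e, voice_embedding) for e in chunk_embeddings]
    return sum(values) / len(values)


# A 7-second turn becomes two full chunks and one shorter remainder
print(split_into_chunks(0.0, 7.0))

# Toy embeddings: one chunk matches the known voice, the other is orthogonal to it
print(mean_distance([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]))
```

Averaging over several short chunks means a single unrepresentative chunk shifts the score only partially, whereas one long chunk yields a single distance with nothing to balance it.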

from fastapi import APIRouter, WebSocket, WebSocketDisconnect
from tempfile import TemporaryDirectory
import shutil

from core.common.logger import use_logger
from core.common.config import Config
from core.controllers.diarization_controller import diarize

config = Config.get_instance()

logger = use_logger(__name__)

router = APIRouter(
    prefix="/diarization",
    tags=["diarization"],
    responses={404: {"description": "Not found"}},
)


@router.websocket("/ws/diarize")
async def websocket_diarize(websocket: WebSocket):
    await websocket.accept()
    logger.info("WebSocket connection established")
    voices_embeddings = {}

    with TemporaryDirectory() as voices_dir:
        try:
            while True:
                audio_data = await websocket.receive_bytes()
                logger.info("Received audio data from client")

                chunks, voices_embeddings = diarize(
                    audio_data, voices_dir, voices_embeddings
                )
                logger.info("Diarization completed")

                diarization_result = "\n".join(
                    f"{chunk['speaker']}: {chunk['text']}" for chunk in chunks
                )
                await websocket.send_text(diarization_result)
                logger.info("Diarization result sent to client")
        except WebSocketDisconnect:
            logger.info("WebSocket connection closed")
Code 7: A code for using the WebSocket for real-time speaker diarization

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from .routes import diarization_routes
from core.common.logger import use_logger

logger = use_logger(__name__)

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(diarization_routes.router)


@app.get("/")
def health_check():
    return "ok"
Code 8: A code for using FastAPI for real-time speaker diarization

Test-case Scenario

The real-time speaker diarization web app described above was tested on a live conversation between two speakers. Diarization started when the user pressed the “start button” on the UI and stopped when the user pressed the “end button”. The results are shown below.

Diarization Results and Discussion

Figure 10: The real-time diarization results before modifying the chunk logic
Figure 11: The real-time diarization results after modifying the chunk logic

As can be seen from the two figures, the modified version provides smoother transitions and eliminates the abrupt start of the conversation. To illustrate, in line 1 of the diarization and transcription, Figure 10 shows a confusion error when the other speaker started to speak: from “You know what …”, the diarization should have detected a new speaker. Likewise, in lines 4, 6, and 9, the program before the chunk logic was modified produced multiple confusion errors. In addition, the timing of the switch to another speaker is inaccurate: when a different person starts to speak, the diarization and transcription should move to a new line. Figure 11, on the other hand, correctly captures the moments when a different speaker starts talking. It can be concluded that modifying the chunk logic significantly reduced the timing errors of the diarization. However, one problem remains. Even though each speaker switch is correctly placed on a new line, the number of diarized speakers is incorrect. Although there are only two speakers in this case, Figure 11 shows up to eight speakers. This suggests that the logic for assigning speaker IDs needs to be adjusted. One possible solution is to tune the threshold above which a new speaker is created when the cosine distance is too large (the 0.65 threshold in “identify_each_chunk_speaker” in Code 6). In addition, Code 6 sets the cosine distance to 0.5 when there is no example of the speaker’s voice; adjusting this default could also affect the diarization results.
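To see how sensitive the speaker count is to this threshold, the decision rule can be isolated in a few lines. The sketch below mirrors the average-distance rule from Code 6; the function name `assign_speaker` and the distance values are made up for illustration and are not measurements from the test case.

```python
def assign_speaker(distances, known_count, threshold=0.65):
    """Pick a speaker label from per-voice cosine distances.

    distances: mapping of known speaker label -> cosine distance to the chunk
    known_count: number of speakers registered so far
    threshold: average distance above which a new speaker is created
    """
    avg_distance = sum(distances.values()) / len(distances)
    if avg_distance > threshold:
        # Too far from the known voices on average: create a new speaker label
        return f"SPEAKER_{known_count + 1:02}"
    # Otherwise pick the closest known voice
    return min(distances, key=distances.get)


# Made-up distances: the chunk is close to SPEAKER_00 and far from the other two
example = {"SPEAKER_00": 0.30, "SPEAKER_01": 0.90, "SPEAKER_02": 0.95}

print(assign_speaker(example, known_count=3))                 # default threshold
print(assign_speaker(example, known_count=3, threshold=0.8))  # looser threshold
```

With the default threshold the average distance of about 0.72 triggers a new label even though one known voice is a close match, while the looser threshold returns that voice. This may be one reason the number of speakers grows as more voices are registered, although the actual cause in the test case would need to be confirmed from the logged distances.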

6. Conclusion and Next Steps

In conclusion, the results above demonstrate the range of DER outcomes obtained with the Pyannote.audio and Nvidia Nemo frameworks, together with post-processing using ChatGPT-4 Turbo and ChatGPT-3.5. Nvidia Nemo performed better on shorter audio files with two speakers, while Pyannote.audio gave promising results when the number of speakers was pre-identified. Post-processing with ChatGPT-4 Turbo can be regarded as an alternative way to leverage large language models for improving diarization accuracy. However, based on the results, further enhancements are needed to address its limited understanding of the audio conditions and speaker timestamps, which points to the need for a more integrated approach to minimize the DER. The execution time and GPU usage analysis revealed differences between the frameworks: Nvidia Nemo required more time and GPU resources when processing longer audio files. This suggests that selecting a lighter model for Nvidia Nemo may be worthwhile depending on the usage situation. Likewise, optimizing framework configurations could improve efficiency and scalability across different hardware setups.

As for the supplementary materials and applications, the real-time speaker diarization application demonstrated a real-world use of the technology. Despite the progress in achieving smoother transitions between speakers and more accurate timing of speaker changes, several challenges remain. The inaccurate number of identified speakers is one of the key issues that needs to be improved. To address this, candidate refinements include tuning the cosine-distance threshold for creating new speakers and the default distance used when no example of a speaker's voice is available. Calibrating the balance between accurately capturing speaker changes and minimizing false identifications could lead to smoother transcription and diarization.

For future studies and improvement of the Pyannote.audio and Nvidia Nemo evaluation, the next steps should focus on model selection and the integration of audio context. For Nvidia Nemo in particular, choosing a model suited to scenarios beyond telephone speech could lead to more robust performance across diverse audio contexts. Integrating audio context into the post-processing pipeline, for example by giving ChatGPT more information on speaker timing and conditions, could improve the diarization accuracy. Furthermore, performing post-processing with other large language models trained specifically for the speaker diarization task could be an alternative way to improve performance. For real-time speaker diarization, fine-tuning parameters and improving the chunk acquisition logic could be the next steps. Iterative testing and evaluation of those steps could help uncover further issues and guide subsequent improvements to the system. In addition, using the Nvidia Nemo framework in the system instead of Pyannote.audio could give different results, since it was shown to deliver a lower DER in some scenarios. However, as mentioned earlier, choosing the model and adjusting the configuration parameters are essential for obtaining the best results.

All in all, advancing speaker diarization capabilities requires creative approaches that address model suitability, integration of audio context, and parameter adjustment. By considering these aspects, the accuracy, efficiency, and usability of speaker diarization systems can be enhanced for use in various real-world applications, unlocking new possibilities not only in business but also in education and beyond.

7. References

[1]. Methodologies for the evaluation of speaker diarization and automatic speech recognition in the presence of overlapping speech. (n.d.). https://www.researchgate.net/publication/258119065_Methodologies_for_the_evaluation_of_Speaker_Diarization_and_Automatic_Speech_Recognition_in_the_presence_of_overlapping_speech

[2]. Diarization error rate. (n.d.). https://xavieranguera.com/phdthesis/node108.html

[3]. M. Sinclair and S. King, "Where are the challenges in speaker diarization?," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 7741-7745, doi: 10.1109/ICASSP.2013.6639170.

[4]. GPT-4 Turbo in the openai API. (n.d.-a). https://help.openai.com/en/articles/8555510-gpt-4-turbo-in-the-openai-api

[5]. New models and developer products announced at DevDay. (n.d.). https://openai.com/blog/new-models-and-developer-products-announced-at-devday

[6]. NVIDIA Docs. (n.d.). Speaker diarization. NVIDIA Docs. https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html 

[7]. NVIDIA. (n.d.). NeMo/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb at main · NVIDIA/NeMo. GitHub. https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb

[8]. R&D, L. J. (2023a, July 17). Deep-dive into nemo : How to efficiently tune a speaker Diarization Pipeline ?. Medium. https://lajavaness.medium.com/deep-dive-into-nemo-how-to-efficiently-tune-a-speaker-diarization-pipeline-d6de291302bf 

[9]. R&D, L. J. (2023a, July 17). Speaker diarization: An introductory overview. Medium. https://lajavaness.medium.com/speaker-diarization-an-introductory-overview-c070a3bfea70

[10]. R&D, L. J. (2023, October 2). Comparing state-of-the-art speaker diarization Frameworks : Pyannote vs Nemo. Medium. https://lajavaness.medium.com/comparing-state-of-the-art-speaker-diarization-frameworks-pyannote-vs-nemo-31a191c6300 

[11]. Rich transcription evaluation. NIST. (2022, August 22). https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation 

8. Greatest Appreciation

  • Akinori Nakajima: Representative Director of VoicePing Corporation
  • Melnikov Ivan: AI Developer of VoicePing Corporation