August 2023, Evaluate Machine Translation And Improve Accuracy

Chen Yufeng - Waseda University


We are addressing the challenge of accurately assessing machine translation quality while also striving to enhance its accuracy to a level comparable to human translation. Our approach involves employing five distinct benchmark translation models and evaluating their performance using three diverse evaluation metrics. Concurrently, we are dedicated to refining the accuracy of these models through insights gained from prior research and studies.

Table of Contents

  1. Introduction
  2. Dataset
  3. How To Evaluate Machine Translation Accuracy

3.1. BLEU Score

3.2. BLEURT Score

3.3. COMET Score

  4. Five Basic Machine Translation Models And Their Accuracies

4.1. Azure Baseline Model

4.1.1. How To Use Azure Baseline Model

4.1.2. Results of Its Accuracy

4.2. Azure Custom Model

4.2.1. How To Use Azure Custom Model

4.2.2. Results of Its Accuracy

4.3. DeepL Model

4.3.1. How To Use DeepL Model

4.3.2. Results of Its Accuracy

4.4. Google Translator

4.4.1. How To Use Google Translator

4.4.2. Results of Its Accuracy

4.5. GPT-4 Model

4.5.1. How To Use GPT-4 Model

4.5.2. Results of Its Accuracy

4.6. Comparison And Conclusion

  5. Improve Machine Translation Accuracy

5.1. In-Context Learning for GPT-4

5.2. Hybrid Model

5.2.1. Background of Hybrid Model

5.2.2. Different Model Combinations

5.2.3. Different Thresholds

5.2.4. Conclusions of Hybrid Model

5.3. GPT-4 as a data cleaning tool

  6. Conclusion
  7. References

1. Introduction

With the advancement of AI technology, particularly following the inception of ChatGPT by OpenAI last year, people are increasingly placing greater trust in the AI industry. As a pivotal component within the realm of natural language processing, machine translation has garnered ever-growing significance.

This paper focuses on the evaluation of five fundamental translation models using diverse evaluation metrics, while also delving into methods to enhance the precision of these models to the fullest extent possible.

2. Dataset

The research is centered on the Opus100 (ZH-EN) dataset available on Hugging Face. This dataset comprises one million Chinese-to-English translation instances spanning various domains, which makes Opus100 a fitting choice for training translation models. However, the dataset also contains translation inaccuracies. These inaccuracies may reduce training accuracy, although the noise they introduce may also act as a mild deterrent against overfitting. Furthermore, before it is uploaded to the Azure AI platform, Opus100 requires a preprocessing step that removes anomalous symbols from each sentence. This step is essential to ensure data integrity and effective use within the Azure AI platform.
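The preprocessing step described above could look roughly like the following sketch. The original does not list which symbols were removed, so the definition used here (control characters, zero-width characters, and repeated whitespace) is an assumption, and the function names `clean_sentence` and `clean_pairs` are hypothetical.

```python
import re
import unicodedata

# Assumption: "anomalous symbols" means zero-width characters, control
# characters, and runs of whitespace. The original does not specify them.
_ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")
_WHITESPACE = re.compile(r"\s+")


def clean_sentence(text):
    """Normalize one sentence and strip invisible or control characters."""
    text = unicodedata.normalize("NFC", text)
    text = _ZERO_WIDTH.sub("", text)
    # Drop control characters (Unicode category "Cc") except plain whitespace
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\t\n "
    )
    # Collapse any run of whitespace into a single space
    return _WHITESPACE.sub(" ", text).strip()


def clean_pairs(pairs):
    """Clean (zh, en) pairs and drop pairs where either side ends up empty."""
    cleaned = []
    for zh, en in pairs:
        zh, en = clean_sentence(zh), clean_sentence(en)
        if zh and en:
            cleaned.append((zh, en))
    return cleaned
```

Dropping a pair when either side becomes empty keeps the source and reference lists aligned, which matters for every metric used later.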


3. How To Evaluate Machine Translation Accuracy

When faced with a multitude of translation models, selecting the most suitable one for a specific purpose becomes a challenging endeavor. Consequently, the comparative analysis of diverse models assumes great significance. In essence, there are two fundamental approaches for assessing translation models. The first approach, often referred to as the traditional method, centers on the BLEU score. The second approach, based on neural metrics, includes scores such as BLEURT and COMET. Notably, both BLEURT and COMET rely on pre-trained models, using their respective checkpoints to gauge the accuracy of translations.

It is important to recognize that the choice between these evaluation techniques hinges upon the specific requirements and nuances of the translation task at hand. As such, a thorough evaluation considering both traditional and neural metrics would contribute to a comprehensive understanding of the performance of various translation models.

3.1 Bleu Score

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another (Papineni et al., 2002).

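import nltk
from nltk.translate.bleu_score import SmoothingFunction

# Assumption: the original call is truncated before the smoothing argument,
# so method1 is used here only as an example of the seven options.
smoothing = SmoothingFunction().method1

bleu_scores = []
for reference, pre in zip(reference_translations, prediction):
    reference_tokens = nltk.word_tokenize(reference.lower())
    pre_tokens = nltk.word_tokenize(pre.lower())

    # Skip pairs where either side is empty after tokenization
    if not reference_tokens or not pre_tokens:
        continue

    bleu_score = nltk.translate.bleu_score.sentence_bleu(
        [reference_tokens], pre_tokens,
        smoothing_function=smoothing)
    bleu_scores.append(bleu_score)

average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print("Average BLEU score:", average_bleu_score)
Code of Bleu score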

Above is a Python code snippet demonstrating the utilization of the BLEU score for assessing translation accuracy. The process involves importing the NLTK package, utilizing reference sentences as ground truth, and translations from various models as predictions. The subsequent steps entail tokenization of both the reference and translation sentences using the word_tokenize method. The final BLEU score is then computed by incorporating smoothing functions, of which there are seven distinct options, each yielding potentially disparate outcomes.

Here's an overview of the seven smoothing functions, following the descriptions of methods 1 to 7 in NLTK's SmoothingFunction class:

Epsilon Smoothing (Smoothing Function 1):

This approach adds a small constant (epsilon) to the numerator of any n-gram precision whose match count is zero. It prevents a single zero precision from driving the whole score to zero.

Add-One Smoothing (Smoothing Function 2):

This approach adds 1 to both the numerator and the denominator of the n-gram precision calculation.

NIST Geometric Sequence Smoothing (Smoothing Function 3):

For each n-gram order with zero matches, the precision is replaced by 1/2^k instead of 0, where k starts at 1 for the first zero-match order and increases with each subsequent one.

Length-Scaled Smoothing (Smoothing Function 4):

Shorter translations can have inflated precision values because of their smaller denominators. This method therefore gives them proportionally smaller smoothed counts by scaling the smoothing value with the length of the candidate translation.

N-gram Averaging (Smoothing Function 5):

This method assumes that matched counts for similar values of n should be similar, so it averages the matched counts of the (n-1)-gram, n-gram, and (n+1)-gram orders.

Bayesian Smoothing (Smoothing Function 6):

This method, attributed to Gao and He, interpolates the maximum-likelihood estimate of each precision with a prior estimate. The prior assumes that the ratio between successive n-gram precisions stays the same.

Interpolated Smoothing (Smoothing Function 7):

This method interpolates smoothing functions 4 and 5.
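To make the role of smoothing concrete, here is a minimal pure-Python sketch of a single-reference, sentence-level BLEU with an epsilon-style smoothing for zero match counts. It is a simplified illustration, not NLTK's implementation; the function names and the default `epsilon=0.1` are assumptions chosen for the example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu_sketch(reference, candidate, max_n=4, epsilon=0.1):
    """Simplified single-reference BLEU with epsilon smoothing (a sketch)."""
    if not reference or not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = max(1, sum(cand.values()))
        if matched == 0:
            matched = epsilon  # smoothing: avoid log(0) for this order
        log_precisions.append(math.log(matched / total))
    c, r = len(candidate), len(reference)
    # Brevity penalty punishes candidates shorter than the reference
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

Without the `epsilon` substitution, any candidate that shares no 4-gram with its reference would score exactly zero, which is common for short sentences and is the reason sentence-level BLEU is usually smoothed.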

Nevertheless, despite its widespread use in evaluating the quality of machine-generated translations, BLEU has several drawbacks and limitations that warrant consideration. The foremost limitation is its weak sensitivity to word order and syntax: BLEU only sees word order within its n-gram window, so it does not fully account for these critical elements of accurate and coherent translations. This constraint can lead to inflated scores for translations that arrange words in unnatural ways. The second limitation concerns its deficiency in achieving a human-like assessment. Because it relies primarily on n-gram overlap between candidate and reference translations, BLEU often falls short of capturing the more intricate facets of translation quality, such as fluency, idiomatic expressions, grammar, and overall coherence.

3.2 Bleurt Score

BLEURT is an evaluation metric for Natural Language Generation. It takes a pair of sentences as input, a reference and a candidate, and it returns a score that indicates to what extent the candidate is fluent and conveys the meaning of the reference (Sellam, 2021). Please make sure to install TensorFlow beforehand in order to use BLEURT.

from bleurt import score

# Local path to the downloaded BLEURT-20 checkpoint
checkpoint = r"/Users/chenyufeng/bleurt/bleurt/BLEURT-20"

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=reference_translations, candidates=prediction)
# The evaluation set used here contains 100 sentences
assert isinstance(scores, list) and len(scores) == 100

# Report the mean of the per-sentence scores
average_score = sum(scores) / len(scores)
print("Average BLEURT score:", average_score)
Code of Bleurt

3.3 Comet Score

Comet is a neural framework for training multilingual machine translation evaluation models. Comet is designed to predict human judgments of translation quality.

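from comet import download_model, load_from_checkpoint

# Example checkpoint only; the original does not name the one that was used.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Build the evaluation set: one dict per sentence (source, prediction, reference)
data = []
for src, pre, reference in zip(source_sentences, preds, reference_translations):
    data.append({"src": src, "mt": pre, "ref": reference})

model_output = model.predict(data, batch_size=8, gpus=0)
print("COMET score:", model_output.system_score)
Code of Comet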

Above is a Python code snippet demonstrating the utilization of the Comet score for assessing translation accuracy. Since Comet is also a pre-trained model, the initial step comprises importing the checkpoint. Once the checkpoint is successfully imported, the subsequent procedure involves creating our own evaluation dataset, which includes source sentences, predictions, and reference sentences. The final step entails utilizing the predict method while configuring the batch size and, if available, GPUs. Given that both Bleurt and Comet are neural metrics, their drawbacks exhibit similarities.

4. Five Basic Machine Translation Models And Their Accuracies

The subsequent phase involves employing these three evaluation metrics to assess the performance of the primary five benchmark models, which encompass the Azure baseline model, Azure custom model, DeepL, Google Translator, and GPT-4, respectively.
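Applying several metrics to several models is easier to keep consistent with a small harness. The sketch below assumes each metric can be wrapped as a function that takes the references and predictions and returns per-sentence scores; `evaluate_models` and the toy `exact_match` metric are hypothetical names used for illustration, not part of any of the libraries above.

```python
from statistics import mean


def evaluate_models(outputs_by_model, references, metrics):
    """Return {model: {metric: mean score}} for every model/metric pair.

    outputs_by_model: dict mapping a model name to its list of translations.
    metrics: dict mapping a metric name to a function
             (references, predictions) -> list of per-sentence scores.
    """
    table = {}
    for model_name, predictions in outputs_by_model.items():
        if len(predictions) != len(references):
            raise ValueError(
                f"{model_name}: expected {len(references)} translations"
            )
        table[model_name] = {
            metric_name: mean(metric(references, predictions))
            for metric_name, metric in metrics.items()
        }
    return table


def exact_match(references, predictions):
    """Toy stand-in metric: 1.0 when the strings match exactly, else 0.0."""
    return [float(r == p) for r, p in zip(references, predictions)]
```

In practice the `metrics` dict would hold wrappers around the BLEU, BLEURT, and COMET code shown earlier, so that all five models are scored on exactly the same references.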

4.1.1 How to use Azure baseline model

To assess the translation accuracy of the Azure baseline model, the process involves obtaining translations using the RESTful API.

import requests, uuid, json

# Fill these in from the Azure portal
endpoint = ""
subscription_key = ""
location = ""

path = '/translate'
constructed_url = endpoint + path

params = {
    'api-version': '3.0',
    'from': 'zh',
    'to': 'en'
}

headers = {
    'Ocp-Apim-Subscription-Key': subscription_key,
    'Ocp-Apim-Subscription-Region': location,
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}

# One {'text': ...} object per source sentence
body = []
for i in source_sentences:
    body.append({'text': i})

# Send the sentences with POST and decode the JSON response
request = requests.post(constructed_url, params=params, headers=headers, json=body)
response = request.json()
Code for how to use Azure baseline model

Obtaining translations from the Azure baseline model requires a correct endpoint, API key, and location, all of which can be found on the Azure platform. The POST method is then used to send the necessary information, including the source sentences, to the Azure server. Finally, the server's response, which contains the translated sentences, is parsed from JSON for further analysis.
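Once the JSON response arrives, the translated sentences still have to be pulled out of it. The sketch below assumes the version 3.0 response shape, a list with one item per input sentence, each holding a `translations` list; the helper names and the batch size are hypothetical and should be adjusted to the service limits that apply to your subscription.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_body(sentences):
    """Build the request body: one {'text': ...} object per sentence."""
    return [{"text": s} for s in sentences]


def extract_translations(response_json):
    """Take the first translation of each item in a /translate response."""
    return [item["translations"][0]["text"] for item in response_json]


# Example with a hand-written response in the assumed shape
sample_response = [
    {"translations": [{"text": "Hello", "to": "en"}]},
    {"translations": [{"text": "Thank you", "to": "en"}]},
]
print(extract_translations(sample_response))
```

Sending the sentences in batches through `chunk` and concatenating the extracted lists keeps the output aligned with `source_sentences`, which the evaluation code relies on.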

4.1.2 Results of its accuracy

Upon obtaining the translation results from the Azure baseline model, the subsequent phase entails employing the three evaluation metrics to assess the model's accuracy. The final outcomes are as follows:


4.2.1 How to use Azure custom model

The Azure custom model is an enhanced version obtained by further training the Azure baseline model on additional data. The custom model is therefore expected to outperform the baseline model. The custom model used here has a BLEU score of 39.45 as reported on the Azure platform.


(Details of Azure Custom Model)

The code structure for obtaining the translation remains largely consistent with that of the Azure baseline model. However, when working with the custom model, a key distinction arises – the custom model must be published on the Azure platform. This enables the custom model to be invoked and responsive when the API is called.

4.2.2 Results of its accuracy

Upon obtaining the translation results from the Azure custom model, the subsequent phase entails employing the three evaluation metrics to assess the model's accuracy. The final outcomes are as follows:


It is clear that the Azure custom model outperforms the baseline model.

4.3.1 How to use DeepL model

DeepL Translator is a neural machine translation service. Its algorithm uses convolutional neural networks and an English pivot.

The code to use DeepL is very simple. Designate the source and target languages, then use the API key to communicate with the server and retrieve the translation results.

import deepl

API_KEY = ' ' 

source_lang = 'ZH'
target_lang = 'EN-US'

translator = deepl.Translator(API_KEY)

results = translator.translate_text(source_sentences, source_lang=source_lang, target_lang=target_lang)
code for how to use DeepL

4.3.2 Results of its accuracy

Upon obtaining the translation results from DeepL, the subsequent phase entails employing the three evaluation metrics to assess the model's accuracy. The final outcomes are as follows:


4.4.1 How to use Google translator

Similarly, for obtaining translations from Google Translator, the process involves utilizing the RESTful API.

import requests

def translate_texts(texts, target_language):
    api_key = ' '

    url = ' '
    translations = []

    for text in texts:
        params = {
            'key': api_key,
            'q': text,
            'target': target_language
        }

        # Each sentence is sent as a separate GET request
        response = requests.get(url, params=params)
        if response.status_code == 200:
            data = response.json()
            translated_text = data['data']['translations'][0]['translatedText']
            translations.append(translated_text)
        else:
            print('Translation failed. Error:', response.status_code)

    return translations

target_language = 'EN'
translated_texts = translate_texts(source_sentences, target_language)
code for how to use Google translator

The provided parameters must be correct, since otherwise the server will not return a valid response. This time, the GET method is used to obtain the response directly, with the parameters passed in the query string rather than sent in a POST body.

4.4.2 Results of its accuracy

Upon obtaining the translation results from Google Translator, the subsequent phase entails employing the three evaluation metrics to assess the model's accuracy. The final outcomes are as follows:


4.5.1 How to use the GPT-4 model

GPT-4, developed by OpenAI, is a versatile large language model with capabilities extending beyond translation. However, a growing number of people are now using GPT-4 as a translation tool across diverse domains. Here is the Python code to get the translation from GPT-4.

import openai  # uses the legacy (pre-1.0) openai.ChatCompletion interface

def translate_text(text_list):
    openai.api_key = ''
    translations = []

    for text in text_list:
        messages = [
            {"role": "system", "content": "You are a translation assistant from Chinese to English. Some rules to remember:\n\n- Do not add extra blank lines.\n- It is important to maintain the accuracy of the contents, but we don't want the output to read like it's been translated. So instead of translating word by word, prioritize naturalness and ease of communication."},
            {"role": "user", "content": text}
        ]

        model = 'gpt-4'

        # The original call is truncated after the opening parenthesis.
        # model and messages are the minimum it must pass; max_tokens
        # can be set here as well.
        response = openai.ChatCompletion.create(
            model=model,
            messages=messages
        )

        choices = response['choices']
        if len(choices) > 0:
            translation = choices[0]['message']['content']
            translations.append(translation)

    return translations

translations = translate_text(source_sentences)
code for how to use GPT-4

Crafting a well-structured prompt is very important when using GPT-4 for any task. It is also crucial to choose the max_tokens parameter carefully: it caps the length of the response, so a value that is too low truncates the translation, while the prompt and the requested tokens together must still fit within the model's context limit. Furthermore, GPT-4 tends to respond more slowly than the other four models in this context.

4.5.2 Results of its accuracy

Upon obtaining the translation results from the GPT-4 model, the subsequent phase entails employing the three evaluation metrics to assess the model's accuracy. The final outcomes are as follows:


4.6  Comparison And Conclusion

Based on the insights from the preceding sections, the following shows final outcomes.

Bleu score comparison between benchmark models

Bleurt score comparison between benchmark models

Comet score comparison between benchmark models

These results make it evident that the Azure custom model is the top performer. DeepL follows closely, and the Azure baseline model takes third place. Google Translator and GPT-4 share a similar standing, with GPT-4 held back by the initial limitations in my prompt formulation. Nevertheless, since the Azure custom model was additionally trained on 433,339 entries from Opus100, it is reasonable to conclude that DeepL is currently the most effective model for translating Chinese to English, especially when users are unable to train a custom model.

5. Improve Machine Translation Accuracy

After conducting a comprehensive analysis of the accuracy exhibited by the five benchmark models, the focus now shifts to enhancing machine translation accuracy to the greatest extent possible. There are three distinct approaches that hold potential for achieving this goal:

(1) In-Context Learning: An effective strategy involves incorporating in-context learning to enhance the accuracy of GPT-4. By enabling GPT-4 to grasp and adapt to contextual nuances, its translation capabilities can be significantly improved.

(2) Hybrid Model: Implementing a hybrid model presents another avenue for improvement. This entails establishing a specific threshold and using a different model to retranslate any sentence whose score fails to meet it. This dynamic approach capitalizes on the strengths of various models to optimize translation accuracy.

(3) Dataset Enhancement: The quality of the dataset itself is pivotal. Since instances of incorrect translations can exist within the dataset, leveraging GPT-4 to rectify these inaccuracies holds promise. By using GPT-4 to correct flawed sentences, a refined dataset for training and testing can be obtained, ultimately leading to better translation outcomes.

By judiciously employing these three strategies, the aim is to push the boundaries of machine translation accuracy and pave the way for more precise and reliable translation results.

5.1 In-context learning for GPT-4

Large language models have shown impressive performance on downstream tasks by simply conditioning on a few input-label pairs. This type of inference has been referred to as in-context learning (Brown et al. 2020).

Bashir, 2023

In simpler terms, GPT-4 could enhance its capabilities without changing any gradients. This could be achieved by providing specific task examples in the prompts given to GPT-4. This approach would allow GPT-4 to understand and perform tasks better, even without a complete retraining process.

Here are task examples I used, ordered from easiest to most challenging.

task examples for In-context learning

And this is the prompt for GPT-4.

Prompt for GPT-4

The final result shows that the BLEURT score increased from 0.6486 to 0.6755, which demonstrates the effectiveness of in-context learning.
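As an illustration of how task examples can be supplied in the prompt, the sketch below builds a chat message list in which each example pair appears as a user turn followed by an assistant turn. This is one possible layout and not necessarily the prompt used above; the function name and the two example pairs are hypothetical.

```python
def build_few_shot_messages(system_prompt, examples, source_text):
    """Build chat messages with (zh, en) example pairs before the real input."""
    messages = [{"role": "system", "content": system_prompt}]
    for zh, en in examples:
        # Each example is shown as a question and the desired answer
        messages.append({"role": "user", "content": zh})
        messages.append({"role": "assistant", "content": en})
    # The sentence to translate always comes last
    messages.append({"role": "user", "content": source_text})
    return messages


# Hypothetical examples, ordered from easier to harder
examples = [
    ("你好。", "Hello."),
    ("今天天气很好。", "The weather is nice today."),
]
messages = build_few_shot_messages(
    "You are a translation assistant from Chinese to English.",
    examples,
    "我明天去东京。",
)
```

The resulting `messages` list can be passed to the same chat completion call used for plain GPT-4 translation, so no model weights are changed.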

5.2.1 Hybrid Model

A hybrid threshold model establishes a specific threshold and uses a different model to retranslate any sentence whose score fails to meet that threshold.


Below is the code for a hybrid model that combines the Azure baseline model with GPT-4. The threshold for this combination is determined by the Comet score.

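import requests, uuid, json
import openai
from comet import download_model, load_from_checkpoint

def translate_with_fallback(text):
    # Relies on module-level source_sentences, translation_from_Azure,
    # reference_translations, model_path, and a gpt_translation() helper.
    model = load_from_checkpoint(model_path)
    indices_to_correct = []
    for i in range(len(translation_from_Azure)):
        # Assumption: scoring uses the same src/mt/ref format as in section 3.3
        data = [{"src": source_sentences[i],
                 "mt": translation_from_Azure[i],
                 "ref": reference_translations[i]}]
        res = model.predict(data, batch_size=8, gpus=0)
        if res.scores[0] < 0.81:
            indices_to_correct.append(i)
    sentences_to_correct = [source_sentences[i] for i in indices_to_correct]
    corrected_sentences = gpt_translation(sentences_to_correct)

    # Merge: use the GPT-4 output for low-scoring sentences, Azure otherwise
    refined_translation = []
    corrected_index = 0
    for i in range(len(translation_from_Azure)):
        if i in indices_to_correct:
            refined_translation.append(corrected_sentences[corrected_index])
            corrected_index += 1
        else:
            refined_translation.append(translation_from_Azure[i])

    return refined_translation
code for Hybrid model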

If a sentence's COMET score falls below 0.81, GPT-4 will handle its translation instead of the Azure baseline model. A list of indexes is used to keep track of the sentences translated by GPT-4. The underlying code remains largely the same whether different models are combined or the thresholds are adjusted.
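The same routing logic can be written independently of any particular service. The sketch below is a generic version under stated assumptions: `score_fn` and `fallback_fn` are hypothetical callables that would wrap a metric such as COMET and a second translation model such as GPT-4, and the stubs at the bottom exist only to show the control flow.

```python
def hybrid_translate(sources, primary_translations, score_fn, fallback_fn,
                     threshold=0.81):
    """Replace primary translations that score below the threshold.

    score_fn(source, translation) -> float quality score (hypothetical).
    fallback_fn(list_of_sources) -> list of translations (hypothetical).
    """
    # Indexes of sentences whose primary translation is not good enough
    low_indices = [
        i for i, (src, mt) in enumerate(zip(sources, primary_translations))
        if score_fn(src, mt) < threshold
    ]
    # Retranslate only those sentences, in one batch
    fallback = fallback_fn([sources[i] for i in low_indices])
    replacement = dict(zip(low_indices, fallback))
    # Keep the original order; swap in the fallback where one exists
    return [replacement.get(i, mt) for i, mt in enumerate(primary_translations)]


# Stub scorer and fallback model, for illustration only
def stub_score(src, mt):
    return 0.5 if mt == "bad" else 0.9


def stub_fallback(batch):
    return [s.upper() + "!" for s in batch]


print(hybrid_translate(["a", "b", "c"], ["A", "bad", "C"],
                       stub_score, stub_fallback))
```

Keeping the scorer and the fallback model as parameters means that switching the model combination or the threshold metric only changes the arguments, not the routing code.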

5.2.2 Different Model Combinations

First, let's explore various combinations of benchmark models and compare their outcomes.

(1) Azure baseline model and GPT-4 (threshold: COMET score = 0.81)

Following is the result:


(2) Azure custom model and GPT-4 (threshold: COMET score = 0.81)

Following is the result:


(3) Azure custom model and DeepL (threshold: COMET score = 0.81)

Following is the result:


(4) DeepL and GPT-4 (threshold: COMET score = 0.81)

Following is the result:


5.2.3 Different Thresholds

Subsequently, we can use thresholds based on the other metrics, namely the BLEURT score and the BLEU score, to assess and contrast their effects when employing hybrid models.

(1) Azure baseline model and GPT-4 (threshold: BLEU score = 0.4501)

Following is the result:


(2) Azure baseline model and GPT-4 (threshold: BLEURT score = 0.666)

Following is the result:


5.2.4 Conclusions of Hybrid Model

Upon examining various combinations of benchmark models and diverse thresholds, several conclusions can be drawn:

(1) The COMET score appears to be the most effective metric to use as the threshold for a hybrid model.

(2) The most promising performance observed so far comes from combining the Azure custom model with DeepL, or DeepL with GPT-4. Notably, both of these combinations use the COMET score as their threshold.

(3) Nearly all hybrid model scores surpass those of individual models, underscoring the potency of hybridization in enhancing translation accuracy.

(4)Importantly, a higher threshold does not necessarily guarantee improved scores. Careful consideration is warranted when determining the threshold value.
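The last point can be seen with a small offline calculation. The sketch below assumes per-sentence scores are already available for both the primary and the fallback model; the numbers are made up purely to illustrate the effect and are not results from this study.

```python
def hybrid_mean_score(primary_scores, fallback_scores, threshold):
    """Mean score when sentences below the threshold use the fallback model."""
    chosen = [
        f if p < threshold else p
        for p, f in zip(primary_scores, fallback_scores)
    ]
    return sum(chosen) / len(chosen)


def sweep_thresholds(primary_scores, fallback_scores, thresholds):
    """Map each candidate threshold to the resulting mean hybrid score."""
    return {
        t: hybrid_mean_score(primary_scores, fallback_scores, t)
        for t in thresholds
    }


# Hypothetical per-sentence scores, for illustration only
primary = [0.90, 0.70, 0.85, 0.60]
fallback = [0.80, 0.88, 0.75, 0.82]
results = sweep_thresholds(primary, fallback, [0.65, 0.81, 0.95])
for threshold, score in results.items():
    print(threshold, round(score, 4))
```

With these made-up numbers the middle threshold gives the best mean, because the highest threshold also replaces sentences that the primary model had already translated better than the fallback model.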

5.3 GPT-4 as a data cleaning tool

Both In-context learning and hybrid models are methodologies grounded in translation models themselves. However, leveraging the capabilities of GPT-4 as a data cleaning tool can also enhance translation accuracy. This is particularly relevant because certain datasets may inadvertently include numerous inaccurate translations posing as ground truth or some strange symbols.

I employed GPT-4 to preprocess the Opus100 dataset, which contains both Chinese and English texts. These texts are denoted as source sentences and reference sentences, respectively.

import openai  # uses the legacy (pre-1.0) openai.ChatCompletion interface
import json

# Map each Chinese source sentence to its English reference
pair = {}
for zh, en in zip(source_sentences, reference_translations):
    pair[zh] = en

def translate_text(pair):
    openai.api_key = ' '
    translations = []
    for zh, en in pair.items():
        messages = [
            {"role": "system", "content": "You are a Chinese to English translation corrector. You need to modify the incorrect English translations below and correct it by given Chinese sentences, please remember not to use English abbreviations and not add extra blank lines. Fix weird punctuation. And the result should be English sentences only"},
            {"role": "user", "content": json.dumps({"zh": zh, "en": en})}
        ]

        model = 'gpt-4'

        # The original call is truncated; model and messages are the
        # minimum it must pass.
        response = openai.ChatCompletion.create(
            model=model,
            messages=messages
        )

        choices = response['choices']
        if len(choices) > 0:
            model_response = choices[0]['message']['content']
            translations.append(model_response)

    return translations

translations = translate_text(pair)
Data cleaning for translations
Prompt for translation correction

GPT-4 will rectify erroneous translations within the dataset based on the provided prompts and Chinese source sentences.

Additionally, Chinese source sentences also require correction, as errors exist within the dataset. It is indisputable that the trained model cannot attain high accuracy when source sentences are riddled with numerous errors. Thus, ensuring the accuracy of source sentences is imperative.

import openai  # uses the legacy (pre-1.0) openai.ChatCompletion interface
import json

def correct_text(chinese):
    openai.api_key = ' '
    corrected = []
    for zh in chinese:
        messages = [
            {"role": "system", "content": "You are a Chinese text corrector. You need to modify the incorrect Chinese sentences below and correct it , please fix weird punctuation and  do not add extra blank lines. "},
            {"role": "user", "content": zh}
        ]

        model = 'gpt-4'

        # The original call is truncated; model and messages are the
        # minimum it must pass.
        response = openai.ChatCompletion.create(
            model=model,
            messages=messages
        )

        choices = response['choices']
        if len(choices) > 0:
            model_response = choices[0]['message']['content']
            corrected.append(model_response)

    return corrected

# Renamed from text() so the result no longer overwrites the function name
text = correct_text(source_sentences)
Data cleaning for source sentences correction
Prompt for source sentences correction

The Python code and prompt enable GPT-4 to autonomously identify and rectify errors within the dataset.

Upon utilizing GPT-4 as a data cleaning tool for both Chinese and English, the ensuing table presents the ultimate results.


In conclusion, leveraging GPT-4 for data cleaning on both the original text and target text proves to be a viable choice, provided the prompt is accurate. The scores achieved by Azure baseline on a refined dataset can align with the performance of DeepL on a subpar dataset.

6. Conclusion

This paper aimed to investigate the accuracy of machine translation and explore methods for enhancing this accuracy through three distinct evaluation metrics and five benchmark models. By employing code, analysis, and experimental procedures, the following conclusions have been drawn:

To begin with, the most proficient Chinese to English translation observed so far is achieved by DeepL. Furthermore, the Azure baseline model exhibits the potential for higher performance when provided with substantial data and adequate training time.

Moreover, despite variations in translation model performance, we have identified at least two avenues for enhancing accuracy (three for GPT-4). Firstly, through the utilization of hybrid models, where a combination of different models can lead to accuracy improvement. Secondly, leveraging GPT-4 for data cleaning can improve the quality of the original dataset so that the dataset will allow models to achieve a high performance.

However, this study acknowledges certain limitations, notably that manual inspection of the translated sentences revealed instances where the translation quality did not align with the high scores obtained.

Additionally, there were cases where the application of accuracy enhancement methods led to decreased scores. Future research endeavors could be directed towards addressing these limitations.

In summation, this study presents novel insights into the realm of machine translation. It is anticipated that these findings will serve as valuable reference points for subsequent research and practical applications.

7. References

[1] Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL) (pp. 311-318). Association for Computational Linguistics.

[2] Thibault Sellam (2021). BLEURT.

[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei (2020). Language models are few-shot learners.

[4] Daniel Bashir (2023). In-Context Learning, in Context. The Gradient.

[5] Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, Hany Hassan Awadalla (2023). How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation. Microsoft.

[6] Ricardo Rei (2022). COMET.