August 2024, Machine Translation with Large Language Models and Hallucination Reduction

Shuang LIANG - The University of Tokyo

Abstract

Large Language Models (LLMs) released in recent years have shown outstanding performance on many natural language tasks. One can easily customize an LLM by fine-tuning a pre-trained model to follow instructions with post-training methods such as Parameter-Efficient Fine-Tuning (PEFT), improving the model's capabilities on specific tasks. In this article, we fine-tune Llama 3.1 for the Chinese-to-English machine translation task while reducing the resulting hallucinations by adjusting our training and decoding strategies.

Background

Large Language Models

LLMs have revolutionized the field of Natural Language Processing (NLP) across various tasks, including neural machine translation. Models like Llama have shown remarkable capabilities in understanding and generating human-like text in multiple languages [1]. These models are pre-trained on vast amounts of data and can be fine-tuned for specific tasks, making them ideal candidates for enhancing machine translation performance [2].

Parameter-Efficient Fine-Tuning (LoRA)

PEFT techniques allow the adaptation of large pre-trained models to specific tasks without updating all model parameters. Instead, PEFT methods fine-tune only a relatively small number of parameters, decreasing the computational and storage cost and making it more accessible to train a large language model on local hardware.

Among these methods, Low-Rank Adaptation (LoRA) [3] is a popular and lightweight fine-tuning technique. Its main idea is to learn a low-rank decomposition of the weight-update (delta) matrix during fine-tuning. The parameters of the pre-trained model are frozen while these small trainable matrices are inserted into the model, significantly reducing the cost and speeding up training.
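For intuition, a LoRA-augmented linear layer can be sketched as below. This is an illustrative toy, not the implementation used by the peft library; the rank and scaling values are arbitrary.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: y = W x + (alpha / r) * B A x, with W frozen and A, B trainable."""
    def __init__(self, base_linear: nn.Linear, r: int = 16, lora_alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze the pre-trained weight (and bias)
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # low-rank factor A (r x d_in)
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))         # B starts at zero, so the delta starts at zero
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# usage: wrap an existing projection layer
layer = LoRALinear(nn.Linear(4096, 4096), r=16, lora_alpha=16)
A minimal LoRA-style linear layer (illustrative only).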

Neural Machine Translation and Hallucination

Neural Machine Translation (NMT) is the task of translating text from a source language into a target language using neural networks. Compared to Statistical Machine Translation (SMT), NMT methods tend to generate more fluent, human-like text, but they also carry the risk of hallucination. Hallucination in the context of machine translation refers to unfaithful, fabricated, inconsistent, or nonsensical content [4, 5].

Unlike in other natural language generation tasks, the categories of hallucination in NMT are often blurred and overlapping. Here we follow the categorization of [5]:

  • Intrinsic and Extrinsic Hallucinations. Outputs of these types are disconnected from the source and are distinguished by how they are disconnected:
    1. Intrinsic Hallucinations: the output contains information that contradicts the source.
    2. Extrinsic Hallucinations: the model generates additional content that has nothing to do with the source.
  • Other Categories and Types of Hallucinations. [4] divides hallucinations into two categories:
    1. Hallucinations under perturbation: observed when a model tested on perturbed and unperturbed test sets returns drastically different content.
    2. Natural hallucinations: arise from, and are connected to, noise in the training data.

Decoding Strategy

LLMs are trained to predict the next token in a text corpus. For each prefix, the model computes a probability distribution over a fixed vocabulary for the next token.

How the next token is sampled from this distribution is decided by the decoding strategy. Decoding strategies range from deterministic to stochastic; common methods are briefly reviewed below [7], followed by a small sampling sketch after the list.

  • Deterministic Methods
    1. Greedy Search. The simplest decoding strategy is to choose the token with the highest probability at every step. It is straightforward, fast, and efficient, but usually leads to repetitive and non-creative results.
    2. Beam Search. Instead of choosing only the single most probable token, beam search keeps a beam of the N most probable token sequences at each step, where N is referred to as the beam width. This usually produces better-quality text than greedy search, but is slower due to the extra computation.
  • Stochastic Methods
    1. To overcome the limited variety and creativity of deterministic methods, stochastic methods introduce randomness into the sampling process, which makes the results less predictable.

    2. Temperature Sampling. This method increases the likelihood of high-probability tokens and decreases the likelihood of low-probability ones by adjusting a hyperparameter $T$, called the temperature, in the softmax. The adjusted softmax is

      $q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$

      where the softmax transforms the logit (raw output) $z_i$ of the $i^\text{th}$ token into a probability $q_i$. Normally the temperature $T$ is set to 1, which gives the standard softmax; otherwise it can be any positive value.

      • $T < 1$: the distribution becomes sharper as the differences between logits are amplified. Higher-probability tokens are boosted and lower-probability ones are dampened.
      • $T > 1$: on the contrary, the distribution becomes softer and flatter as the differences shrink, bringing the logits closer to each other. More tokens then get a chance of being chosen, enabling more diversity and creativity in the results.
    3. Top-$p$ Sampling. Also known as nucleus sampling, this method selects the next token from the smallest subset of tokens whose cumulative probability exceeds the hyperparameter $p$. For example, when $p = 0.95$, it considers the minimal set of tokens whose probabilities sum to more than 0.95. Top-$p$ sampling removes very low-probability tokens, helping to generate diverse yet coherent text.
    4. Top-$k$ Sampling. This method selects the subset containing the $k$ most likely tokens and redistributes the probability mass among only that subset. Top-$k$ can also be combined with top-$p$, filtering out very low-probability tokens while still allowing a dynamic set of candidates, which makes the result sound more natural.
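The sketch below makes the stochastic strategies concrete by applying temperature scaling followed by top-p filtering to a toy logit vector; it mirrors the formula above rather than any particular library's internal implementation.

import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 0.7, top_p: float = 0.9) -> int:
    """Temperature + nucleus (top-p) sampling over a vector of next-token logits."""
    probs = F.softmax(logits / temperature, dim=-1)       # q_i = exp(z_i / T) / sum_j exp(z_j / T)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = (cumulative - sorted_probs) < top_p            # smallest set whose cumulative probability exceeds top_p
    filtered = sorted_probs * keep                        # zero out everything outside the nucleus
    filtered = filtered / filtered.sum()                  # renormalize over the nucleus
    choice = torch.multinomial(filtered, num_samples=1)   # sample one position within the sorted order
    return sorted_idx[choice].item()

# toy example with a 5-token vocabulary
logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])
print(sample_next_token(logits))
Minimal temperature + top-p sampling over a toy distribution.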

Experiments

Datasets Pre-processing

Considering the possible usage scenarios of the model, we chose NewsCommentary[1] as the written language data and Ted Talks[8] as the spoken language data.

For the NewsCommentary dataset, we use the parallel Chinese-English data. For the Ted Talks dataset, we use its English transcripts and generate their Chinese translations with the commercial translation system DeepL.

The detailed information of the two datasets is as follows:

| Dataset | Documents | Sentences | S/D | Words (src/tgt) |
| --- | --- | --- | --- | --- |
| NewsCommentary-v18.1 | 11147 | 443677 | 39.80 | 16452969/9733728 |
| Ted Talks | 22 | 1949 | 88.60 | 51469/32234 |

The original data of both datasets is derived from document-level sources, such as news releases and speech transcripts. These documents have subsequently been segmented into sentence-level data, with identifiers demarcating the boundaries between documents.

Considering the potential context length of inputs in real-world scenarios and the internal consistency of contexts, the two datasets can be further divided into document-level and sentence-level subsets, enabling a more comprehensive evaluation of the model by mixing the subsets in an appropriate ratio. A sketch of such a split follows.
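As an illustration of how such a split can be produced, the sketch below groups sentence pairs into documents. It assumes the raw corpus is a list of (zh, en) sentence pairs in which an empty pair marks a document boundary; this marker is only a stand-in for the actual boundary identifiers in the corpora.

from datasets import Dataset

def build_subsets(pairs):
    """pairs: list of (zh, en) sentence tuples; an empty pair ("", "") marks a document boundary (assumed)."""
    sent_rows, doc_rows = [], []
    doc_zh, doc_en = [], []
    for zh, en in pairs:
        if zh == "" and en == "":                # assumed document-boundary marker
            if doc_zh:
                doc_rows.append({"translation": {"zh": "".join(doc_zh), "en": " ".join(doc_en)}})
            doc_zh, doc_en = [], []
            continue
        sent_rows.append({"translation": {"zh": zh, "en": en}})
        doc_zh.append(zh)
        doc_en.append(en)
    if doc_zh:                                   # flush the final document
        doc_rows.append({"translation": {"zh": "".join(doc_zh), "en": " ".join(doc_en)}})
    return Dataset.from_list(sent_rows), Dataset.from_list(doc_rows)

sentence_level, document_level = build_subsets(pairs)   # `pairs` loaded from the corpus files
Group sentence pairs into document-level samples (a sketch under assumed boundary markers).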

All adjusted datasets are saved in the following format in the Datasets library:

Dataset({
    features: Features({
        'translation': Sequence({
            'zh': Value(dtype='string'),
            'en': Value(dtype='string')
        })
    }),
    num_rows: ...
})
Dataset format used in this report.

Machine Translation Evaluation Metrics

Evaluating the performance of NMT models is crucial for understanding their effectiveness and guiding further improvements. Automatic evaluation metrics play an important role in this process, providing quick and reproducible assessments of translation quality over the evaluation set.

Among the various metrics, our study uses two widely adopted ones: BLEU and COMET.

  1. BLEU. Bilingual Evaluation Understudy (BLEU), introduced by [9], is one of the most established metrics in machine translation evaluation. It assesses translation quality by measuring the n-gram overlap between the model output and a set of reference translations (its standard formulation is recalled just below this list).
  2. COMET. Developed by [10], Cross-lingual Optimized Metric for Evaluation of Translation (COMET) is a neural framework for training multilingual machine translation evaluation models. COMET is designed to predict human judgments of translation quality using a trained neural network.
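For reference, the standard corpus-level BLEU from [9] combines modified n-gram precisions $p_n$ (typically up to $N = 4$, with uniform weights $w_n = 1/N$) with a brevity penalty $\mathrm{BP}$, where $c$ is the candidate length and $r$ the reference length:

$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \mathrm{BP} = \min\left(1,\ e^{\,1 - r/c}\right)$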

Using both metrics lets us evaluate performance with a mature metric that provides a direct basis for comparison (BLEU), while COMET offers insight into deeper aspects such as fluency and semantics.

Both BLEU and COMET metrics have been integrated into the evaluate library and can be called by:

from evaluate import load as load_metrics

bleu = load_metrics("bleu")
comet = load_metrics("comet")

bleu_score = bleu.compute(
    predictions=predictions,
    references=references
)["bleu"]
comet_scores = comet.compute(
    predictions=predictions,
    references=references,
    sources=sources
)["scores"]
Call the evaluation metrics and compute scores.

Environment

All experiments were conducted on a Linux server that runs Ubuntu 20.04 LTS version and is equipped with a single NVIDIA GeForce RTX 3090 GPU, featuring 24GB of memory.

The software environment was managed using conda, a popular open-source cross-platform package management and environment management system. The core components in the environment include Python 3.10 and PyTorch 2.3, leveraging CUDA 12.1 for GPU acceleration.

Baseline Model

We chose Llama as our pre-trained model and baseline, specifically Llama 3.1 [2]. Developed by the Llama team at Meta, Llama is a family of language models that natively support multilinguality, coding, reasoning, and tool usage.

The Llama series consists of models ranging from 7B to 405B parameters, which are currently among the most competitive open-source LLMs.

As the latest release in the series, Llama 3.1 shows significant improvements across various tasks compared to its predecessors.

Taking into account our actual experimental setup and resources, we selected the 8B-parameter instruct version as our baseline and fine-tuning target. When loaded on an NVIDIA RTX 3090 GPU using the unsloth library, this model occupies approximately 17GB of GPU memory.

Fine-tuning Details

Chat Template

When fine-tuning an instruct model, it is typically necessary to use a specific chat template to unify the format of the training data and prompts. The formatted data are then fed into the tokenizer and model during training and inference. Using a template also helps to reduce hallucination to some extent, as it assists the model in understanding the task and producing the expected outputs.

Conversely, if one uses a format different from the one the model was trained with, severe and silent performance degradation usually occurs. Therefore, in our experiments, we formatted all data using a template similar to Llama's own style. The template is as follows:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert Chinese to English translator!

<|start_header_id|>user<|end_header_id|>

Translate the following Chinese text to English: {src}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>

{tgt}<|eot_id|>
Template used for fine-tuning the instruct model.

Here, "src" stands for the Chinese source text and "tgt" stands for the target translation in English. In inference mode, the final "{tgt}<|eot_id|>" line of the template above is removed.

In the implementation, this template can be applied easily using the map() function in the Datasets library.

from datasets import load_dataset

def formatting_prompts_func_llama(examples):
    outputs = []
    for item in examples["translation"]:
        src = item["zh"]
        tgt = item["en"]
        formatted_text = (
            f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
            f"You are an expert Chinese to English translator! "
            f"<|start_header_id|>user<|end_header_id|>\n\n"
            f"Translate the following Chinese text to English: {src}<|eot_id|>"
            f"<|start_header_id|>assistant<|end_header_id|>\n\n"
            f"{tgt}<|eot_id|>"
        )
        outputs.append(formatted_text)
    return {
        "text": outputs,
    }

ds = load_dataset(DATASET_NAME, split="train")
ds = ds.map(formatting_prompts_func_llama, batched=True)
Batch formatting to apply the template over the whole dataset.

Accelerate Fine-tuning Using Unsloth

Unsloth is a Python library that makes fine-tuning LLMs, including Llama 3.1, with LoRA [3] about 2 times faster, with lower memory usage and almost no degradation in accuracy. It is fully compatible with SFTTrainer in TRL. One can load the model and set up the training loop as follows:

from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name, 
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, 
    bias = "none",
    use_gradient_checkpointing = "unsloth", 
    use_rslora = False,
    loftq_config = None,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = trainset,
    eval_dataset = evalset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        per_device_eval_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1.0,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        output_dir = output_dir,
    ),
)

trainer.train()
Simple fine-tuning process using unsloth.

Inference

Inference can also be accelerated using Unsloth with the following:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name, 
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = False,
)

FastLanguageModel.for_inference(model)
Load the model and modify the mode for faster inference.

In the inference phase, we continue to use the typical user-assistant-style chat template for the fine-tuned model to standardize the input format. As mentioned above, this helps the tuned model better understand the task objective. There are several ways to implement this; the following is one example. One can pass tokenize=False to verify the text after the template has been applied.

text = "法国的首都是哪里?"
messages = [
    {"role": "system", "content": "You are an expert Chinese to English translator! "},
    {"role": "user", "content": (f"Translate the following Chinese text to English: {text}")},
]

input_context = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)
print(input_context)
Apply the instruct template.
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert Chinese to English translator!<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following Chinese text to English: 法国的首都是哪里?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Input context after applying the template.

Inference results can then be generated as follows. For the decoding strategy, one can use the default greedy search by not passing any sampling parameters, or specify a strategy by passing the relevant parameters, such as temperature for temperature sampling, top_p for nucleus sampling, top_k for top-k sampling, and num_beams for beam search.

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
Simple inference with a given input context.

One can also use a TextStreamer for continuous inference, generating the results token by token.

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)
_ = model.generate(
    **inputs,
    streamer=text_streamer
 )
Use TextStreamer to print the output token by token.

Evaluation

To conduct a comprehensive and accurate evaluation of model performance, we need to perform batch evaluation on the test set. Concretely, this involves generating a translation for each text sample in the test set using the inference process described above, and then calculating numerical metrics such as BLEU and COMET scores.

Similar to the fine-tuning process described earlier, we can use the map() function to standardize the input into the user-assistant-style format and add the special tokens. It is important to note that the template should leave space for the assistant response. The implementation is as follows:

def formatting_prompts_func_llama_eval(examples):
    texts, prompts, references = [], [], []
    for item in examples["translation"]:
        src = item["zh"]
        tgt = item["en"]
        texts.append(src)
        references.append(tgt)
        formatted_text = (
            f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
            f"You are an expert Chinese to English translator! "
            f"<|start_header_id|>user<|end_header_id|>\n\n"
            f"Translate the following Chinese text to English: {src}<|eot_id|>"
            f"<|start_header_id|>assistant<|end_header_id|>\n\n"
        )
        prompts.append(formatted_text)
    return {
        "text": texts,
        "prompt": prompts,
        "reference": references,
    }
    
testset = load_dataset(testset, split="test")
testset = testset.map(formatting_prompts_func_llama_eval, batched=True)
Apply the template for batch evaluation. Note that the target section is left blank.

In the actual implementation, since calculating COMET scores requires loading an additional model, it is preferable to compute them over all samples in a batch after generation has finished. BLEU scores, however, can be calculated immediately after each sample is generated. The evaluation can be implemented as follows:

from evaluate import load as load_metrics

bleu = load_metrics("bleu")
comet = load_metrics("comet")

model, tokenizer = FastLanguageModel.from_pretrained(
    **inputs, # input parameters
)
FastLanguageModel.for_inference(model)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

def evaluate_sample(sample):
    input_ids = tokenizer(
        sample["prompt"],
        return_tensors="pt"
    )["input_ids"].to(model.device)
    outputs = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.1,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )
    response = outputs[0][input_ids.shape[-1]:]
    response = tokenizer.decode(
        response,
        skip_special_tokens=True
    )
    bleu_score = bleu.compute(
        predictions=[response],
        references=[[sample["reference"]]]
    )["bleu"]
    return {
        "pred": response,
        "bleu": bleu_score,
    }
    
results = testset.map(evaluate_sample)

### compute comet scores
references = [x["reference"] for x in results]
predictions = [x["pred"] for x in results]
sources = [x["text"] for x in results]
comet_scores = comet.compute(
    predictions=predictions,
    references=references,
    sources=sources
)["scores"]

### compute average scores
bleu_scores = [x["bleu"] for x in results]
comet_score_mean = sum(comet_scores) / len(comet_scores)
bleu_score_mean = sum(bleu_scores) / len(bleu_scores)
Compute the BLEU and COMET scores.

Result Analysis

In-distribution Performance

In this section, we aim to validate the effectiveness of the fine-tuning process mentioned above. Our goal is to ensure that the model has successfully learned the relevant task objectives and knowledge during fine-tuning. To achieve this, we will conduct a comprehensive evaluation of the model’s performance.

We fine-tuned two separate models: one on a document-level dataset and the other on a sentence-level dataset. It is crucial to note that we took great care to prevent any mixing or overlap between the two levels of data during training.

To assess the models' in-distribution performance, we created test sets by randomly sampling data from the respective level's dataset, ensuring that the test samples are independent and do not appear in the training set. The size of each test set is as follows:

| Level | Test Data Pairs | Tokens |
| --- | --- | --- |
| Document-level | 55 | 73947 |
| Sentence-level | 1728 | 60711 |
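One simple way to carve out such held-out test sets with the Datasets library is its built-in train_test_split; the dataset name and split size below are illustrative placeholders.

from datasets import load_dataset

doc_ds = load_dataset(DOC_LEVEL_DATASET, split="train")   # hypothetical dataset name
split = doc_ds.train_test_split(test_size=55, seed=42)    # hold out 55 document-level pairs
doc_train, doc_test = split["train"], split["test"]
Create an independent held-out test split (illustrative).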

Building on the experimental setup above, we implemented a structured method to evaluate model performance at various stages of training. Here is a detailed description of our methodology:

We set a maximum of 10k paired training samples for the document-level model and 15k for the sentence-level model. To gain insight into the learning progression of the models, we established multiple evaluation checkpoints: we tested the models on the entire test set when the number of processed training samples reached (10, 100, 1k, 10k) and (1k, 5k, 10k, 15k), respectively.
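A sketch of how this checkpoint-wise evaluation could be wired up, assuming the LoRA adapter checkpoints were saved during training (e.g. with save_steps) and reusing the evaluate_sample function defined above; the checkpoint directory names are hypothetical.

from unsloth import FastLanguageModel

# hypothetical adapter checkpoint directories roughly corresponding to 10 / 100 / 1k / 10k samples
checkpoints = ["outputs/checkpoint-10", "outputs/checkpoint-100",
               "outputs/checkpoint-1000", "outputs/checkpoint-10000"]

for ckpt in checkpoints:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = ckpt,               # load the saved LoRA adapter together with its base model
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = False,
    )
    FastLanguageModel.for_inference(model)
    results = testset.map(evaluate_sample)
    bleu_scores = [x["bleu"] for x in results]
    print(ckpt, "avg BLEU:", sum(bleu_scores) / len(bleu_scores))
Evaluate saved checkpoints on the whole test set (a sketch with hypothetical paths).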

To assess and compare the effectiveness of our fine-tuning, we also evaluated the baseline model on the same test sets. It should be noted that, in order to make the baseline generate results in a form similar to the training sets, we slightly adjusted the template to explicitly instruct the model to output only the translation, avoiding redundant content that would affect the score calculation and comparison (the additional instruction in the following template).

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert Chinese to English translator!

<|start_header_id|>user<|end_header_id|>

Translate the following Chinese text to English, only output the translated English context in one line. Here is the source: {src}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>

{tgt}<|eot_id|>
Additional instruction for the baseline model telling it to output only the translation result.

The fine-tuning and comparison results on document-level data and sentence-level data are as follows:

Document-level results:

| | Training Samples | BLEU | COMET |
| --- | --- | --- | --- |
| Fine-tuned model | 10 | 35.8 | 0.885 |
| Fine-tuned model | 100 | 36.9 | 0.889 |
| Fine-tuned model | 1k | 39.7 | 0.890 |
| Fine-tuned model | 10k | 40.8 | 0.891 |
| Baseline | - | 19.6 | 0.820 |

Sentence-level results:

| | Training Samples | BLEU | COMET |
| --- | --- | --- | --- |
| Fine-tuned model | 1k | 31.3 | 0.861 |
| Fine-tuned model | 5k | 31.4 | 0.862 |
| Fine-tuned model | 10k | 33.1 | 0.863 |
| Fine-tuned model | 15k | 33.1 | 0.864 |
| Baseline | - | 30.9 | 0.864 |

The results provide several interesting insights. For the document-level models, there is a clear positive correlation between the number of training samples and model performance: a steady improvement in both BLEU and COMET scores can be observed as the sample size increases. For the sentence-level models, while the improvements are less pronounced, there is still a consistent upward trend.

Out-of-distribution Performance

In this section, our goal is to examine the stability and generalization capability of the fine-tuned models on out-of-distribution data. We are particularly interested in how these models perform when faced with data at a different level from what they were fed during training. This analysis is important for understanding the robustness and versatility of our models.

We use the models from the previous section, saved at the final tuning stage for the document-level and sentence-level data respectively. To gain a deeper understanding, we conduct a cross-level qualitative analysis by feeding document-level prompts to the sentence-level tuned model and sentence-level prompts to the document-level tuned one. Some examples follow.

The resulting hallucinations occur in two main ways: 1) stopping generation too early, and 2) generating redundant (although relevant) content. Both reflect a level-specific bias.

  1. Stopping generation too early

The sentence-level tuned models performed well on short contexts or individual sentences but struggled with longer inputs. These models developed a strong bias towards a "single-input-single-output" task style. The primary issue is not that they cannot translate longer contexts correctly, but rather a tendency to assign a high probability to the End-Of-Sequence (EOS) token ("<|eot_id|>") and select it prematurely, cutting off the output too early.

"source": "他喜欢在公园里跑步,他每天早上都会去那里锻炼身体。公园里的花草树木让他感到非常放松,他觉得这是一种很好的生活方式。她是他的好朋友,她也喜欢在公园里散步。她说,早晨的新鲜空气和美丽的风景能让她一整天都有好心情。每次他们在公园里碰面,都会一起跑步或者散步,有时候还会坐在长椅上聊天。",
"predict": "He likes to run in the park, and he goes there every morning to exercise. The flowers and trees in the park make him feel very relaxed, and he thinks it's a good way to live."
"reference": "He likes to run in the park; he goes there every morning to exercise. The flowers, grass, and trees in the park make him feel very relaxed, and he thinks it is a great way of living. She is his good friend, and she also likes to walk in the park. She says that the fresh air and beautiful scenery in the morning can keep her in a good mood all day. Whenever they meet in the park, they either run or walk together, and sometimes they sit on the bench and chat.",
Long context input with sentence-level tuned model.

To understand more precisely why the model stopped before finishing the translation, we can output the probability distribution calculated by the model at each token prediction, along with the top few tokens with the highest probabilities. To achieve this, we customize a LogitsProcessor, a crucial component in the token generation pipeline.

from transformers import LogitsProcessor, LogitsProcessorList
import torch
import torch.nn.functional as F

class TokenProbabilityLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer, top_k=5, eos_token_id=None, eos_threshold=0.9):
        self.tokenizer = tokenizer
        self.top_k = top_k
        self.step = 0
        self.eos_token_id = eos_token_id
        self.eos_threshold = eos_threshold

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        probs = F.softmax(scores, dim=-1)
        top_probs, top_indices = torch.topk(probs, self.top_k)
        
        prob_str = " ".join([f"({self.tokenizer.decode([idx.item()])} {prob.item():.2%})" for idx, prob in zip(top_indices[0], top_probs[0])])
        print(f"Generating (token {self.step + 1 :03d}) [{prob_str}]")
        
        self.step += 1
        return scores

token_prob_processor = TokenProbabilityLogitsProcessor(
    tokenizer,
    top_k=5
)
Print the top-k highest probabilities and the corresponding tokens during inference.

During generation, pass the customized LogitsProcessor to the generate() method.

outputs = model.generate(
    **inputs,
    logits_processor=LogitsProcessorList([token_prob_processor])
)
Keep other hyperparameters and pass the customized function.

The text data used in the example is:

"source": "他喜欢在公园里跑步,他每天早上都会去那里锻炼身体。公园里的花草树木让他感到非常放松,他觉得这是一种很好的生活方式。她是他的好朋友,她也喜欢在公园里散步。她说,早晨的新鲜空气和美丽的风景能让她一整天都有好心情。每次他们在公园里碰面,都会一起跑步或者散步,有时候还会坐在长椅上聊天。",
"reference": "He likes to run in the park; he goes there every morning to exercise. The flowers, grass, and trees in the park make him feel very relaxed, and he thinks it is a great way of living. She is his good friend, and she also likes to walk in the park. She says that the fresh air and beautiful scenery in the morning can keep her in a good mood all day. Whenever they meet in the park, they either run or walk together, and sometimes they sit on the bench and chat.",
The input sample context.

During generation, the model prints the top few probabilities and the corresponding tokens at each prediction step.

...
Generating (token 038) [( way 76.70%) ( life 13.33%) ( lifestyle 6.30%) ( kind 0.52%) ( form 0.36%)]
Generating (token 039) [( to 77.38%) ( of 22.17%) ( for 0.36%) (. 0.02%) (to 0.01%)]
Generating (token 040) [( live 93.15%) ( life 1.51%) ( start 0.81%) ( get 0.81%) ( lead 0.63%)]
Generating (token 041) [(. 97.10%) ( his 0.95%) ( life 0.84%) (, 0.31%) ( a 0.14%)]
Generating (token 042) [(<|eot_id|> 62.18%) ( She 33.28%) ( His 1.29%) ( Her 0.65%) ( He 0.47%)]

Final output: He likes to run in the park, and he goes there every morning to exercise. The flowers and trees in the park make him feel very relaxed, and he thinks it's a good way to live.
Top-5 probabilities with tokens. The translation ends early because the EOS token receives a large probability and is easily selected.

If we manually set a threshold for the EOS token and exclude it during decoding unless its probability is very high, we can continue generating an accurate translation.

To achieve this, we modify the LogitsProcessor so that it performs an additional thresholding step when processing the EOS token. In fact, such customization can also implement other functions, such as sensitive-word filtering. The implementation is as follows:

from transformers import LogitsProcessor, LogitsProcessorList
import torch
import torch.nn.functional as F

class TokenProbabilityLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer, top_k=5, eos_token_id=None, eos_threshold=0.9):
        self.tokenizer = tokenizer
        self.top_k = top_k
        self.step = 0
        self.eos_token_id = eos_token_id
        self.eos_threshold = eos_threshold

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        probs = F.softmax(scores, dim=-1)
        top_probs, top_indices = torch.topk(probs, self.top_k)
        
        prob_str = " ".join([f"({self.tokenizer.decode([idx.item()])} {prob.item():.2%})" for idx, prob in zip(top_indices[0], top_probs[0])])
        print(f"Generating (token {self.step + 1 :03d}) [{prob_str}]")
        
        self.step += 1

        if self.eos_token_id is not None:
            eos_prob = probs[0, self.eos_token_id].item()
            if eos_prob < self.eos_threshold:
                scores[0, self.eos_token_id] = float('-inf')

        return scores
        
threshold_processor = TokenProbabilityLogitsProcessor(
    tokenizer,
    top_k=5,
    eos_token_id=tokenizer.eos_token_id,
    eos_threshold=0.95
)
Add an if statement to set a threshold for the EOS token.
Generating (token 071) [( feel 86.36%) ( happy 9.10%) ( have 1.58%) ( very 0.58%) ( a 0.45%)]
Generating (token 072) [( good 70.51%) ( happy 17.83%) ( cheerful 2.73%) ( great 1.88%) ( very 1.46%)]
Generating (token 073) [( all 71.00%) ( for 17.95%) ( the 8.48%) ( throughout 0.79%) ( every 0.37%)]
Generating (token 074) [( day 97.53%) ( morning 1.39%) ( the 0.66%) ( through 0.21%) ( that 0.05%)]
Generating (token 075) [(. 62.95%) ( long 33.70%) (, 2.77%) ( and 0.23%) (; 0.09%)]
Generating (token 076) [(<|eot_id|> 85.54%) ( Every 7.96%) ( Whenever 2.01%) ( They 1.57%) ( When 0.95%)]
...
Generating (token 095) [( a 58.38%) ( the 27.58%) ( benches 13.03%) ( bench 0.24%) ( one 0.16%)]
Generating (token 096) [( bench 99.18%) ( park 0.36%) ( chair 0.13%) ( seat 0.08%) ( benches 0.07%)]
Generating (token 097) [( and 85.97%) ( to 9.06%) ( talking 1.57%) (. 1.23%) (, 0.66%)]
Generating (token 098) [( talk 51.77%) ( chat 40.32%) ( have 5.46%) ( just 0.74%) ( they 0.40%)]
Generating (token 099) [(. 99.26%) ( for 0.25%) ( a 0.09%) (, 0.09%) ( about 0.05%)]
Generating (token 100) [(<|eot_id|> 99.63%) ( They 0.06%) ( He 0.04%) ( She 0.03%) ( It 0.02%)]

Final output: He likes to run in the park, and he goes there every morning to exercise. The flowers, grass, and trees make him feel very relaxed, and he thinks it's a good way to live. She's his good friend, and she likes to walk in the park, too. She says the fresh morning air and the beautiful scenery make her feel good all day. Every time they meet in the park, they run or walk together, or sometimes they sit on a bench and chat.
Force the EOS token not to be chosen unless its probability is overwhelmingly large.

This implies that one can alleviate the problem of premature stopping by manually setting a threshold for the EOS token. However, how to choose the threshold value becomes another issue that requires careful consideration; one simple approach is sketched below.
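One pragmatic (and admittedly crude) way to pick the threshold is to sweep a few candidate values on a small held-out set and compare BLEU. The sketch below reuses the thresholded TokenProbabilityLogitsProcessor from above together with the bleu metric, model, and tokenizer defined earlier; the candidate values are arbitrary and devset is a hypothetical held-out set.

from transformers import LogitsProcessorList

candidate_thresholds = [0.5, 0.7, 0.9, 0.95, 0.99]   # arbitrary candidates

for tau in candidate_thresholds:
    processor = TokenProbabilityLogitsProcessor(      # defined above; its print can be removed for batch runs
        tokenizer,
        top_k=5,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        eos_threshold=tau,
    )
    bleus = []
    for sample in devset:                             # hypothetical small dev set
        input_ids = tokenizer(sample["prompt"], return_tensors="pt")["input_ids"].to(model.device)
        outputs = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            logits_processor=LogitsProcessorList([processor]),
            pad_token_id=tokenizer.eos_token_id,
        )
        pred = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
        bleus.append(bleu.compute(predictions=[pred], references=[[sample["reference"]]])["bleu"])
    print(f"eos_threshold={tau}: avg BLEU={sum(bleus) / len(bleus):.3f}")
Sweep candidate EOS thresholds on a held-out set (a sketch with assumed names).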

  2. Generating redundant content
"source": "劳工组织行政法庭相对于联合国行政法庭似乎有些优势。",
"predict": "The International Labor Organization’s Administrative Tribunal (ILOAT) seems to have an advantage over the United Nations Administrative Tribunal (UNAT).The ILOAT has a much higher rate of compliance with its decisions than the UNAT. The ILOAT has a compliance rate of 92%, while the UNAT has a compliance rate of 63%.The ILOAT has a much higher rate of compliance with its decisions than the UNAT. The ILOAT has a compliance rate of 92%, while the UNAT has a compliance rate of 63%.The ILOAT has a much higher rate of compliance with its decisions than the UNAT. The ILOAT has a compliance rate of 92%, while the UNAT has a compliance rate of 63%.The ILOAT has a much higher rate of compliance with its decisions than the UNAT. The ILOAT has a compliance rate of 92%, while the UNAT has a compliance rate of 63%.The ILOAT has a much higher rate of compliance with its decisions than the UNAT. The ILOAT has a compliance rate of 92%, while the UNAT has a compliance rate of 63%.The ILOAT has a much higher rate of compliance with its decisions than the UNAT. The ILOAT has a compliance rate of 92%, while the UNAT has a compliance rate of 63%.The ILOAT has a much higher rate of compliance with its decisions than the UNAT. The ILOAT has a compliance rate of 92%, while the UNAT has a compliance rate of 63%.",
"reference": "ILOAT appears to have some advantages over UNAT."
Short context input with document-level tuned model.

In addition, the document-level tuned models are influenced by the longer training contexts and show a preference for generating lengthy outputs. Interestingly, they do not always strictly generate a translation of the input. Instead, they often use implicit prior knowledge gained during pre-training to sample and generate tokens, producing content that is presented as fact but goes beyond the translation objective, such as explaining named entities, and stopping only after reaching the maximum length. This behavior corroborates the previous observations about the models' tendency to align with their training data.

Final Fine-tuning

Based on our previous observations and analyses, we consider that the behavior of instruction-tuned models is largely determined by the data they are fed, which aligns with the observations reported in the official Llama 3.1 documentation [2].

To ensure that a tuned model performs well in real scenarios, it is important to prepare datasets that cover the wide range of potential input patterns in real-world applications.

Our current objective is to develop a model that performs well in translation regardless of whether the input is a short text (sentence-level) or a long text (document-level). To achieve this, we removed the text-level restriction during fine-tuning and instead combined document-level and sentence-level data in specific ratios for a more comprehensive fine-tuning approach. We experimented with various mixing ratios of sentence-level to document-level data and, interestingly, found that different ratios did not significantly impact the model's performance. The results on the evaluation set during training are as follows:

Evaluation loss for different "sentence:document" ratios.
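The mixing itself can be implemented with the Datasets library; a minimal sketch using interleave_datasets at roughly a 30:1 sentence-to-document probability is shown below. The dataset variable names are assumed, and this is one possible way to build the mixture, not necessarily the exact procedure used here.

from datasets import interleave_datasets

# sent_ds and doc_ds: formatted sentence-level and document-level training sets (names assumed)
mixed_ds = interleave_datasets(
    [sent_ds, doc_ds],
    probabilities=[30 / 31, 1 / 31],   # roughly a 30:1 sentence-to-document ratio
    seed=42,
)
mixed_ds = mixed_ds.select(range(min(10_000, len(mixed_ds))))   # cap the mixed set at 10k samples
Mix sentence-level and document-level data at roughly 30:1 (a sketch).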

Despite the minimal differences in performance across ratios, we ultimately chose a 30:1 ratio for the final fine-tuning and conducted an evaluation similar to the previous one, across training stages and against the baseline. The two types of training data together amount to 10k samples.

Although the fine-tuned model's scores on short contexts are numerically similar to the baseline's, its translation performance on long contexts improves significantly.

| | Document-level BLEU | Document-level COMET | Sentence-level BLEU | Sentence-level COMET |
| --- | --- | --- | --- | --- |
| Fine-tuned model | 37.7 | 0.890 | 30.7 | 0.862 |
| Baseline | 19.6 | 0.820 | 30.9 | 0.864 |

In the sentence-level evaluation, the fine-tuned model's performance is very close to the baseline's. The differences in BLEU and COMET scores are minimal, indicating that sentence-level translation quality is essentially on par.

At the same time, the fine-tuned model demonstrates significant improvements in the document-level evaluation. The BLEU score nearly doubles, and the COMET score also shows a notable improvement. This suggests that the fine-tuned model has made substantial progress in translation quality when handling longer texts.

The fine-tuned model thus enhances document-level performance while largely maintaining sentence-level translation quality. Considering that the improvement of the sentence-level tuned model over the baseline in the previous section was not particularly large, this indicates that the optimization successfully improved long-context translation without significantly compromising short-context translation.

Conclusion

In this report, we explored fine-tuning and evaluation of Llama 3.1 for the Chinese-to-English machine translation task while addressing the challenge of hallucination in model outputs. Our experiments showed that, with the right dataset preparation and fine-tuning techniques, it is possible to mitigate hallucination, leading to more reliable and coherent translations.

Future Work

In future work, we recommend carefully preparing the datasets for training and evaluation and fully considering the various possible input scenarios, including different language styles, cultural backgrounds, and dialogue topics. Balancing the proportion of the various types of content in the training data is crucial to avoid bias. During inference, we also observed slight hallucinations involving inconsistencies between the input and the translation result, such as named-entity errors. In further study, this can be one of the main issues to address with post-generation methods.

References

[1] Kocmi, Tom, et al. "Findings of the 2022 conference on machine translation (WMT22)." Proceedings of the Seventh Conference on Machine Translation (WMT). 2022.

[2] Dubey, Abhimanyu, et al. "The Llama 3 Herd of Models." arXiv preprint arXiv:2407.21783 (2024).

[3] Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).

[4] Raunak, Vikas, Arul Menezes, and Marcin Junczys-Dowmunt. "The curious case of hallucinations in neural machine translation." arXiv preprint arXiv:2104.06683 (2021).

[5] Ji, Ziwei, et al. "Survey of hallucination in natural language generation." ACM Computing Surveys 55.12 (2023): 1-38.

[6] Huang, Lei, et al. "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions." arXiv preprint arXiv:2311.05232 (2023).

[7] Shi, Chufan, et al. "A thorough examination of decoding methods in the era of LLMs." arXiv preprint arXiv:2402.06925 (2024).

[8] Cettolo, Mauro, Christian Girardi, and Marcello Federico. "WIT3: Web inventory of transcribed and translated talks." Proceedings of the Conference of the European Association for Machine Translation (EAMT). 2012.

[9] Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). 2002.

[10] Rei, Ricardo, et al. "COMET: A neural framework for MT evaluation." arXiv preprint arXiv:2009.09025 (2020).