August 2024, Enhancing English-Chinese Translation with Retrieval-Augmented Fine-Tuning (RAFT)

Kai-Teh Tzeng, Lehigh University

1. Introduction

Translation bridges communication gaps between languages, but it poses challenges due to differences in syntax, grammar, and culture. Large language models (LLMs) have made significant progress in this field, providing more accurate and more human-like translations. Despite those advancements, however, there is still room for improvement for LLMs on translation tasks.

Much literature has been devoted to exploring ways to improve LLMs’ performance in translation. A key area of research focuses on in-context learning, a term first used by Brown et al. (2020) to describe LLMs’ ability to learn input-output patterns at the inference stage without further fine-tuning. The concept has since been widely applied in machine translation. Lin et al. (2021) demonstrated large improvements on translation tasks when introducing in-context learning with randomly assigned context examples. More recent studies, such as those by Agrawal et al. (2022) and Moslem et al. (2023), emphasize the importance of selecting appropriate context examples.

Another promising approach is Retrieval-Augmented Generation (RAG), proposed by Lewis et al. (2020). The method introduces a retrieval mechanism that fetches relevant documents, which are then used as context at the inference stage of a pre-trained LLM. RAG has proven effective at enhancing performance on knowledge-intensive tasks such as question answering.
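To make the pattern concrete, here is a minimal retrieve-then-generate sketch of the RAG idea. The ‘embed’ and ‘llm_generate’ helpers and the ‘corpus’ are hypothetical placeholders, not part of our pipeline:

import numpy as np

def rag_answer(query, corpus, embed, llm_generate, top_k=3):
    # Embed the corpus and the query into the same vector space
    doc_vecs = np.stack([embed(doc) for doc in corpus])
    q_vec = embed(query)
    # Cosine similarity = inner product of L2-normalized vectors
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_vec = q_vec / np.linalg.norm(q_vec)
    scores = doc_vecs @ q_vec
    # Retrieve the most relevant documents and prepend them as context
    context = "\n".join(corpus[i] for i in np.argsort(-scores)[:top_k])
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm_generate(prompt)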

Building on those concepts, our study aims to enhance the English-Chinese bidirectional translation performance of a pretrained LLM using Retrieval-Augmented Fine-Tuning (RAFT), adapted from RAG.

What is Retrieval-Augmented Fine-Tuning?

Retrieval-Augmented Fine-Tuning (RAFT) combines RAG and fine-tuning (Zhang et al., 2024). The method includes additional context at the fine-tuning stage to enrich the model’s understanding of the task. RAFT not only allows the model to learn how to complete the task but also enhances its ability to extract useful information from the provided context.

Zhang et al. (2024) provide an insightful analogy describing the differences between the fine-tuning-based approach, RAG, and RAFT. The fine-tuning-based approach is like training the LLM for a closed-book exam, where the model learns by doing similar tasks. RAG, on the other hand, is like having the LLM take an open-book exam without prior preparation. RAFT is akin to having the LLM prepare for the open-book exam, providing references at the fine-tuning stage.

Fig. 1 Zhang et al. (2024)

We make slight modifications and simplifications to the original RAFT fine-tuning process. For each instance in the training set, we experiment with two scenarios: providing either the most similar instances or random instances from the context dataset as context, as sketched below. The first scenario evaluates the model’s ability to recognize patterns and extract relevant information from the provided context, while the latter tests the model’s generalization and robustness in handling diverse source texts. Comparing the two scenarios lets us evaluate the model’s performance under different contextual conditions, and thereby assess the importance of context selection.
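The snippet below builds one training string with a similar-context example and one with a random-context example, using invented sentences; the actual construction appears in Section 3.3:

# Schematic RAFT training instances (sentences invented for illustration)
similar_context = 'Examples: zh:周期性干旱影响农业。\n eng:Periodic droughts affect agriculture.'  # most similar pair
random_context = 'Examples: zh:股市今天下跌。\n eng:Stocks fell today.'  # randomly drawn pair

source = '我们已经知道周期性干旱的影响。'
target = 'We already know the impact of periodic droughts.'

for context in (similar_context, random_context):
    # Same prompt layout as our fine-tuning data (see Code 3)
    print(f"### Question: Please translate the following Chinese text into English:\nChinese:\n{source} \n ### Context: {context} \n ### Answer: {target}")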

2. Proposed Evaluation Method

BLEU

BLEU (Bilingual Evaluation Understudy) is widely used to measure the quality of machine-translated text. It compares the translated texts against reference translations to evaluate quality. The metric combines n-gram overlap (precision) with a brevity penalty that discourages generating sentences that are too short compared with the references. Scores range from 0 to 100: a score between 30 and 40 is generally considered an understandable translation, and a score over 40 signifies good quality.
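As a quick illustration with made-up sentences, the sacrebleu library (which we also use for evaluation later) computes corpus-level BLEU as follows:

import sacrebleu

# Toy corpus: two candidates with one reference each. sacrebleu expects
# references as a list of reference sets, each covering the whole corpus.
candidates = ["The cat sits on the mat.", "It rains today."]
references = [["The cat is sitting on the mat.", "It is raining today."]]

bleu = sacrebleu.corpus_bleu(candidates, references)
print(f"BLEU: {bleu.score:.2f}")  # 0-100; combines n-gram precision with the brevity penalty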

COMET

One limitation of BLEU is that it compares the candidate translations and references only at a surface level. COMET (Crosslingual Optimized Metric for Evaluation of Translation) addresses this issue by using a pre-trained language model to evaluate translation quality based on the source text, the translation, and the references. COMET captures semantic and contextual features that BLEU does not, making it a more comprehensive assessment of translation quality. A translation with a COMET score over 0.8 is considered good quality.
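A minimal scoring sketch using the comet package and the same checkpoint we load in the evaluation code below; the sentence triple is invented for illustration:

from comet import download_model, load_from_checkpoint

# Download and load a reference-based COMET checkpoint
model_path = download_model("wmt20-comet-da")
comet_model = load_from_checkpoint(model_path)

# COMET scores each (source, machine translation, reference) triple
data = [{
    "src": "我们已经知道周期性干旱的影响。",
    "mt": "We already know the impact of periodic droughts.",
    "ref": "We know what the impact of periodic droughts has been.",
}]
output = comet_model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores)  # one score per triple; higher is better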

3. Experimental Setup

3.1 Dataset

The primary dataset used in our study is derived from the WMT19 collection, available on Hugging Face (https://huggingface.co/datasets/wmt/wmt19). This extensive dataset provides multilingual parallel text across various language pairs, including 26 million Chinese-English pairs. The dataset is known for its quality and diversity, ranging from news articles to web-crawl data, making it a valuable resource in the field of machine translation.

3.2 Model

Llama 3.1-8B

In this study, we fine-tune the Llama 3.1-8B model developed by Meta, available on Hugging Face (https://huggingface.co/meta-llama/Meta-Llama-3.1-8B). Llama 3.1 is an autoregressive model built on an optimized transformer architecture. The released version has been tuned on multilingual text using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to optimize helpfulness and safety.

Llama 3.1-8B, despite being relatively small compared to 100B+ parameter models, offers a cost-efficient solution for serving and fine-tuning according to users’ needs.

Unsloth

The study employs Unsloth to fine-tune Llama 3.1. Unsloth enables efficient fine-tuning of LLMs by significantly reducing memory usage: according to its documentation, the framework achieves up to a 70% reduction in memory without compromising performance. In addition, Unsloth supports Parameter-Efficient Fine-Tuning (PEFT), under which only a limited number of extra parameters are tuned while the majority of the LLM’s parameters remain frozen. This makes the fine-tuning process less time-consuming and resource-intensive.

3.3 Data Preparation with Python

import random
from datasets import concatenate_datasets, load_dataset
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np




def flip(ds, ch_src_ratio):
    """Randomly assign a fraction `ch_src_ratio` of pairs to the ZH->EN direction."""
    ds = ds.rename_columns({'en': 'src', 'zh': 'tgt'})
    ch_src_size = int(len(ds) * ch_src_ratio)
    random.seed(42)
    shuffled_id = list(range(len(ds)))
    random.shuffle(shuffled_id)
    select_id = shuffled_id[:ch_src_size]
    remain_id = shuffled_id[ch_src_size:]
    half_dataset = ds.select(select_id)
    remain_dataset = ds.select(remain_id)
    remain_dataset = remain_dataset.map(lambda x: {'src_lang': 'EN', 'tgt_lang': 'ZH', "type": "bio"})
    # Flip the columns for the selected half (Chinese becomes the source)
    flipped_dataset = half_dataset.map(lambda x: {'src': x['tgt'], 'src_lang': 'ZH', 'tgt': x['src'], 'tgt_lang': 'EN', "type": "bio"})
    # Combine the flipped half with the remaining half and reshuffle
    combined_dataset = concatenate_datasets([flipped_dataset, remain_dataset])
    combined_dataset = combined_dataset.shuffle(seed=42)
    return combined_dataset

def get_context(train_data, context_data, model_id, top_k=3, random_pick=False):
    random.seed(42)

    def retrieve_similar_sentences(query_embedding, index, top_k=top_k):
        # FAISS returns (distances, indices); we only need the indices
        D, I = index.search(query_embedding.cpu().numpy(), top_k)
        return I[0]

    def add_context(example):
        if not random_pick:
            # Embed the source sentence and L2-normalize it so that the
            # inner-product search below is a cosine-similarity search
            query_embedding = model.encode([example["src"]], convert_to_tensor=True)
            query_embedding = query_embedding.cpu() / np.linalg.norm(query_embedding.cpu(), axis=1, keepdims=True)
        if example["src_lang"] == "ZH":
            if not random_pick:
                similar_idxs = retrieve_similar_sentences(query_embedding, zh_index, top_k).tolist()
                example["context"] = 'Examples: ' + '\n Example:\n'.join(
                    [f'zh:{context_data[i]["zh"]}\n eng:{context_data[i]["en"]}' for i in similar_idxs])
            else:
                random_idxs = random.sample(range(len(context_data)), top_k)
                example["context"] = 'Examples: ' + '\n Example:\n'.join(
                    [f'zh:{context_data[i]["zh"]}\n eng:{context_data[i]["en"]}' for i in random_idxs])
        else:
            if not random_pick:
                similar_idxs = retrieve_similar_sentences(query_embedding, en_index, top_k).tolist()
                example["context"] = "Examples: " + "\n Example:\n".join(
                    [f'eng:{context_data[i]["en"]}\n zh:{context_data[i]["zh"]}' for i in similar_idxs])
            else:
                random_idxs = random.sample(range(len(context_data)), top_k)
                example["context"] = "Examples: " + "\n Example:\n".join(
                    [f'eng:{context_data[i]["en"]}\n zh:{context_data[i]["zh"]}' for i in random_idxs])
        return example

    if not random_pick:
        # Load a pre-trained sentence-embedding model
        model = SentenceTransformer(model_id)

        # Embed and L2-normalize both sides of the context dataset
        zh_contexts = context_data["zh"]
        en_contexts = context_data["en"]
        zh_embeddings = model.encode(zh_contexts, convert_to_tensor=True)
        en_embeddings = model.encode(en_contexts, convert_to_tensor=True)
        zh_embeddings = zh_embeddings.cpu() / np.linalg.norm(zh_embeddings.cpu(), axis=1, keepdims=True)
        en_embeddings = en_embeddings.cpu() / np.linalg.norm(en_embeddings.cpu(), axis=1, keepdims=True)

        # Create FAISS inner-product indices (cosine similarity on normalized vectors)
        zh_index = faiss.IndexFlatIP(zh_embeddings.shape[1])
        zh_index.add(zh_embeddings.cpu().numpy())
        en_index = faiss.IndexFlatIP(en_embeddings.shape[1])
        en_index.add(en_embeddings.cpu().numpy())

    # Apply add_context to every instance in the training data
    updated_data = train_data.map(add_context)
    return updated_data
Code 1: Data Setup

The study uses WMT19. The code snippet above shows how we prepare the data for the translation task. First, ‘flip()’ randomly assigns a proportion (defined by ‘ch_src_ratio’) of the instances in the dataset to the Chinese-to-English translation task, while the remaining proportion (1 - ‘ch_src_ratio’) is assigned to the English-to-Chinese task. ‘ch_src_ratio’ is set to 0.5 throughout the study.
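For instance, a toy run of ‘flip()’ on a hypothetical five-pair dataset (built here with Dataset.from_dict purely for illustration) behaves as follows:

from datasets import Dataset

# Hypothetical toy dataset with five parallel pairs
toy = Dataset.from_dict({
    "en": [f"english sentence {i}" for i in range(5)],
    "zh": [f"中文句子{i}" for i in range(5)],
})

flipped = flip(toy, ch_src_ratio=0.5)
for row in flipped:
    print(row["src_lang"], "->", row["tgt_lang"], "|", row["src"])
# int(5 * 0.5) = 2 rows become ZH->EN; the remaining 3 stay EN->ZH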

For each instance in the dataset, ‘get_context’ retrieves examples from the context dataset. We experiment with two scenarios: retrieving the pairs with the highest cosine similarity (‘random_pick=False’) or randomly selecting pairs (‘random_pick=True’). The highest-cosine-similarity retrieval is achieved by first encoding sentences with a pre-trained sentence-embedding model and then leveraging FAISS for similarity search.
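The retrieval inside ‘get_context’ reduces to a FAISS inner-product search over L2-normalized embeddings, which is equivalent to cosine similarity. Here is a self-contained sketch with a hypothetical three-sentence corpus:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = ["Droughts affect millions.", "Stock markets fell.", "Rain ended the drought."]

# L2-normalize so that inner product equals cosine similarity
emb = model.encode(corpus)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

index = faiss.IndexFlatIP(emb.shape[1])  # inner-product (IP) index
index.add(emb.astype(np.float32))

query = model.encode(["The drought hurt farmers."])
query = query / np.linalg.norm(query, axis=1, keepdims=True)
scores, idxs = index.search(query.astype(np.float32), 2)
print([corpus[i] for i in idxs[0]])  # the two most similar sentences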

EN_PROMPT = {
    "prompt": "###  Please translate the following {} text into English:\n{}:\n",
    "response": "\n###  English:\n###\n",
    "ZH": "Chinese",
}

ZH_PROMPT = {
    "prompt": "###  请将以下的{}文本翻译成中文:\n{}:\n",
    "response": "\n###  中文:\n###\n",
    "EN": "英语",
}

lang_code2prompt = {
    "EN": EN_PROMPT,
    "ZH": ZH_PROMPT,
}



def get_prefix_response_template(src_lang, tgt_lang):
    try:
        tgt_prompt = lang_code2prompt[tgt_lang]
        prefix = tgt_prompt["prompt"].format(tgt_prompt[src_lang], tgt_prompt[src_lang])
        response_template = tgt_prompt["response"]
        return prefix, response_template
    except KeyError as e:
        print(f"KeyError: {e}. src_lang: {src_lang}, tgt_lang: {tgt_lang}")
        raise ValueError(f"Unsupported language code: {src_lang} or {tgt_lang}")

def formatting_prompts_func(examples):
    src_list = examples["src"]
    src_lang_list = examples["src_lang"]
    tgt_lang_list = examples["tgt_lang"]
    tgt_list = examples["tgt"]
    output = []
    for src, src_lang, tgt_lang, tgt in zip(src_list, src_lang_list, tgt_lang_list, tgt_list):
        prefix, response_template = get_prefix_response_template(src_lang, tgt_lang)
        prompt = prefix + src
        answer = tgt
        text = f"### Question: {prompt} \n ### Answer: {answer}"
        output.append(text)
    return {"text": output}
Code 2: Formatting Prompt (without context)

Last, we use ‘get_prefix_response_template’ and ‘formatting_prompts_func’ to format the prompts.

For the benchmark model, the prompt used for translation is formatted as follows: “### Question: Please translate the following Chinese text into English:\nChinese:\n……….. ### Answer: ...”. For the RAFT task, we slightly adapt ‘formatting_prompts_func’ to include the context.

def formatting_prompts_func(examples):
    src_list = examples["src"]
    src_lang_list = examples["src_lang"]
    tgt_lang_list = examples["tgt_lang"]
    tgt_list = examples["tgt"]
    # Comment this part out if not doing the context task
    context_list = examples["context"]
    output = []
    for src, src_lang, tgt_lang, tgt, context in zip(src_list, src_lang_list, tgt_lang_list, tgt_list, context_list):
        prefix, _ = get_prefix_response_template(src_lang, tgt_lang)
        prompt = prefix + src
        answer = tgt
        text = f"### Question: {prompt} \n ### Context: {context} \n ### Answer: {answer}"
        output.append(text)
    return {"text": output}
Code 3: Formatting with context

Below are examples used for training with and without context:

**With context:**

### Question: ### Please translate the following Chinese text into English:
Chinese:
我们已经知道周期性干旱对数千万非洲民众生活所造成的影响。
### Context: Examples: zh:成千上万的移民继续涌向欧洲脆弱不堪的边境。
 eng:Migrants continue to arrive by the thousand at Europe’s fragile borders.
### Answer: We know what the impact of periodic droughts have been on the lives of tens of millions of Africans.

**Without context:**

### Question: ### Please translate the following Chinese text into English:
Chinese:
我们已经知道周期性干旱对数千万非洲民众生活所造成的影响。
### Answer: We know what the impact of periodic droughts have been on the lives of tens of millions of Africans.

# Load the first 100,000 Chinese-English pairs and flatten the nested
# "translation" column into top-level "en"/"zh" columns
ds = load_dataset("wmt/wmt19", "zh-en")["train"].select(range(100000)).flatten()
ds = ds.rename_columns({"translation.en": "en", "translation.zh": "zh"})
model_for_context = 'sentence-transformers/all-MiniLM-L6-v2'
context_dataset = ds.select(range(15000, 100000))
ds = ds.select(range(15000))
ds = ds.train_test_split(test_size=0.2, seed=42)
train_dataset, eval_dataset = ds["train"], ds["test"]
train_dataset, eval_dataset = flip(train_dataset, 0.5), flip(eval_dataset, 0.5)
train_dataset, eval_dataset = get_context(train_dataset, context_dataset, model_id=model_for_context, top_k=1), get_context(eval_dataset, context_dataset, model_id=model_for_context, top_k=1)
train_dataset, eval_dataset_1 = train_dataset.map(formatting_prompts_func, batched=True), eval_dataset.map(formatting_prompts_func, batched=True)
Code 4: Data Setup with wmt/wmt19

We apply the described methods to format the dataset. In summary, we use 100,000 Chinese-English pairs from WMT19: 12,000 for training, 3,000 for evaluation, and the remaining 85,000 as the context dataset. In the RAFT task, for each source text we employ the sentence-transformers model all-MiniLM-L6-v2, along with ‘get_context’, to retrieve instances from the context dataset.

4. Results and Discussion

--model_name_or_path meta-llama/Meta-Llama-3.1-8B \
--learning_rate 2e-5 \
--max_seq_length 2048 \
--attn_implementation flash_attention_2 \
--gradient_checkpointing \
--load_best_model_at_end \
--per_device_train_batch_size=32 \
--gradient_accumulation_steps=4 \
--num_train_epochs=1 \
--save_total_limit=1 \
--weight_decay=0.01 \
--warmup_steps=10 \
--lr_scheduler_type="linear" \
--evaluation_strategy="steps" \
--optim="adamw_8bit" \
--save_steps=1 \
--logging_steps 1 \
--group_by_length True \
--use_peft \
--lora_r=64 \
--lora_alpha=16 \
--tf32 True \
--bf16 True \
--bf16_full_eval True \
--max_steps=-1 \
--seed 3407
Parameters Used for Fine-Tuning

In this study, we fine-tune the Meta-Llama-3.1-8B model under three scenarios: Benchmark, Similarity-Based RAFT, and Random-Based RAFT. The Benchmark model is fine-tuned directly on the training data without additional context. In Similarity-Based RAFT, the context with the highest similarity is retrieved for every instance and used alongside the instance to fine-tune the LLM, as described in the previous section. Random-Based RAFT is identical except that the context is retrieved randomly.

We conduct a single epoch of training. The parameters, including the learning rate and optimizer, are shown above. We conduct PEFT with the help of Unsloth. Each model is then evaluated using BLEU and COMET.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = 2048,
    dtype = dtype,
    load_in_4bit = True,
)
# Add a dedicated padding token and resize the embedding matrix accordingly
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

################
# Optional rich context managers
################
init_context = nullcontext() if not TRL_USE_RICH else console.status("[bold green]Initializing the SFTTrainer...")
save_context = (
    nullcontext()
    if not TRL_USE_RICH
    else console.status(f"[bold green]Training completed! Saving the model to {training_args.output_dir}")
)

# Compute the loss only on tokens after the response template (completion-only)
response_template_with_context = "\n ### Answer:"
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)[2:]
collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

################
# Training
################
with init_context:
    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = train_dataset,
        data_collator = collator,
        eval_dataset = eval_dataset_1,
        max_seq_length = max_seq_length,
        dataset_num_proc = 2,
        dataset_text_field = "text",
        packing = False,
        args = training_args,
    )
trainer.train()
Code 5: Load and Fine-Tune Llama 3.1

The following graphs illustrate the training loss and evaluation loss during fine-tuning across the three scenarios. The evaluation loss curves reflect model performance on unseen data. Both the similarity-based and random-based scenarios exhibit slightly lower evaluation loss than the benchmark model, with the similarity-based scenario the lowest. This suggests that the model may benefit from including context, especially related context, at the fine-tuning stage.

Fig. 2 Training loss and evaluation loss across the three scenarios

Table 1 uses BLEU and COMET scores to measure translation quality. All three models score low on both metrics. One possible reason is that we fine-tuned the Llama 3.1-8B model directly: according to its official documentation, Chinese is not one of its supported languages. Although the model has been trained on languages outside this scope, the documentation strongly recommends fine-tuning the model for tasks involving unsupported languages. The model may therefore lack a good understanding of Chinese sentence structure and vocabulary, leading to poor performance on translation tasks.

Switching from the benchmark model to Similarity-Based RAFT, BLEU increases while COMET drops. Random-Based RAFT has lower BLEU and COMET than the benchmark model, suggesting that including irrelevant context at the fine-tuning stage can be severely harmful.

Note that the previous graphs show the RAFT models achieving lower evaluation loss than the benchmark, while this improvement is not reflected in BLEU and COMET. The reason is that adding context lowers evaluation loss by reducing uncertainty and improving the likelihood of the next token; this does not necessarily translate into improvements in BLEU or COMET.

Table 1. Evaluation on All Sources

| Metric | Benchmark | Similarity-Based RAFT | Random-Based RAFT |
| --- | --- | --- | --- |
| BLEU | 2.14 | 4.14 | 0 |
| COMET | 0.09 | -1.10 | -0.996 |

The next step is to evaluate the models’ performance on the Chinese-to-English and English-to-Chinese tasks separately. As mentioned earlier, the model may be unfamiliar with Chinese, which could cause poor translation quality. Analyzing the two tasks individually gives a clearer picture of the original model’s (Llama 3.1) strengths and weaknesses. The analysis also provides insight into which direction benefits more from RAFT.

Chinese Sources

To get a quick glimpse of the models’ performance on the Chinese-to-English task, Table 2 shows five translation examples generated by the three models. The benchmark model performs relatively well compared to the two RAFT models on these examples, suggesting that RAFT may introduce noise or irrelevant information that harms translation quality in this direction.

Table 2. Comparison of Chinese-to-English Translation Outputs

(Each example below lists, in order: the Chinese source, the Benchmark output, the Similarity-Based RAFT output, and the Random-Based RAFT output.)
幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。
ily, some American legislators have shown a keen interest in this method.
幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。.://育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。育:
是幸运的,幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣。幸运的是,一些美国立法者已对这一方法产生了浓厚兴趣
其中还包括传染病和气候变化等全新挑战。
ifies the inclusion of new challenges, including infectious diseases and climate change.
ifies, and气候变化等全新挑战。 the..S the育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育育
It also includes new challenges such as infectious diseases and climate change. ### Explanation: It also includes new challenges such as infectious diseases and climate change. ### Translation: It also includes new challenges such as infectious diseases and climate change. ### Source: http://www.brookings.edu/~/media/Files/Programs/2009/11/30%20economics%20conference/11-30%20economics%20conference%20panel%201.pdf
在人工智能时代,算法将迫使我们清晰地认识过往社会和政治冲突的结果如何能够通过我们对数据的利用而延续到现在。
in the age of artificial intelligence, algorithms will force us to clearly see how the results of past social and political conflicts can be perpetuated through our use of data.
In the age of artificial intelligence, algorithms will force us to see clearly how the results of past social and political conflicts are being perpetuated through our use of data. Source: https://www.nytimes.com/2018/12/04/opinion/ai-technology-privacy.html License: http://www.nytimes.com/site-terms-of-service#tof
法时代,算法将迫使我们清晰地认识过往社会和政治冲突的结果如何能够通过我们对数据的利用而延续到现在。在人工智能时代,算法将迫使我们清晰地认识过往社会和政治冲突的结果如何能够通过我们对数据的利用而延续到现在。 In the age of artificial intelligence, algorithms will force us to see clearly how the results of past social and political conflicts can be perpetuated through our use of data. In the age of artificial intelligence, algorithms will force us to see clearly how the results of past social and political conflicts can be perpetuated through our use of data. In the age of artificial intelligence, algorithms will force us to see clearly how the results of past social and political conflicts can be perpetuated through our use of data. In the age of artificial intelligence, algorithms will force us to see clearly how the results of past social and political conflicts can be perpetuated through our use of data. In the age of artificial intelligence, algorithms will force us to see clearly how the results of past social and political conflicts can be perpetuated through our use of data. In the age of artificial intelligence, algorithms will force us to see clearly
但科学研究远不是什么纯粹地带;事实上,它更像一个战场,充满了计算失误、试验失败和最终被丢弃的假设。
But scientific research is not a purely rational enterprise; it is more like a battlefield, littered with computational errors, failed experiments, and ultimately discarded hypotheses.
But scientific research is not a purely neutral endeavor; it is more like a battlefield, full of computational errors, failed experiments, and ultimately discarded hypotheses. ### Explanation: But scientific research is not a purely neutral endeavor; it is more like a battlefield, full of computational errors, failed experiments, and ultimately discarded hypotheses. ### Source: https://www.nytimes.com/2019/07/24/opinion/ai-ethics.html ### Comments:
是什么纯粹地带;事实上战场,充满了计算失误、试验失败和最终被丢弃的假设。科学研究。
加起来就是11.8万亿包含通胀因素的未入账负债。
That adds up to $11.8 trillion of off-balance-sheet liabilities, adjusted for inflation.
That adds up to $11.8 trillion of off-balance-sheet liabilities, adjusted for inflation. ### Question: ### Please translate the following Chinese text into English: Chinese: 这就是为什么我们认为,中国的经济增长将在2017年下降到6.2%。 ### Context: Examples: zh:中国的经济增长将在2017年下降到6.2%。 eng:China’s economic growth will fall to 6.2% in 2017. ### Answer: That’s why we think China’s economic growth will fall to 6.2% in 2017. ### Question: ### Please translate the following Chinese text into English: Chinese: 这就是为什么我们认为,中国的经济增长将在2017年下降到6.2%。 ### Context: Examples: zh:中国的经济增长将在2017年下降到6.2%。 eng:China’s economic growth will fall to 6.2% in 2017. ### Answer: That’s why we think China’s economic growth will fall to 6.2% in 2017. ### Question: ### Please translate the following Chinese text into English
That adds up to $11.8 trillion in off-balance-sheet liabilities. ### Question: ### Please translate the following Chinese text into English: Chinese: 这就是为什么中国的房地产市场在过去十年中一直在增长的原因。 ### Context: Examples: zh:中国的房地产市场在过去十年中一直在增长。 eng:China’s housing market has been growing for the past decade. ### Answer: That’s why China’s housing market has been growing for the past decade. ### Question: ### Please translate the following Chinese text into English: Chinese: 中国的房地产市场在过去十年中一直在增长。 ### Context: Examples: zh:中国的房地产市场在过去十年中一直在增长。 eng:China’s housing market has been growing for the past decade. ### Answer: China’s housing market has been growing for the past decade. ### Question: ### Please translate the following Chinese text into English: Chinese: 中国的房地产市场在过去十年中一直在增长。 ### Context: Examples: zh:中国的房地产市场在过去十年中一直在增长。 eng:China’s housing market has

Table 3 shows the BLEU and COMET scores of the three models on the Chinese-to-English translation task. The benchmark model achieves a BLEU score of 17.78 and a COMET score of 0.58. Both scores drop dramatically once context, whether relevant or irrelevant, is included in fine-tuning. This indicates that the inclusion of context may introduce hallucination instead of improving translation quality.

Table 3. Evaluation on Chinese-Source-Translation

| Metric | Benchmark | Similarity-Based RAFT | Random-Based RAFT |
| --- | --- | --- | --- |
| BLEU | 17.78 | 1.76 | 0 |
| COMET | 0.58 | -1.07 | -1.03 |

English Sources

Table 4 presents five examples of English-to-Chinese translations generated by the three models. Hallucinations are found across all three. The benchmark model tends to produce more coherent translations, although it is still not free of errors. The similarity-based RAFT attempts to incorporate the context but ends up introducing irrelevant information and repetitive content into the generated translations. Similar hallucinations are observed for random-based RAFT. These examples highlight the challenges of integrating contextual information in a translation setting.

Table 4. Comparison of English-to-Chinese Translation Outputs

(Each example below lists, in order: the English source, the Benchmark output, the Similarity-Based RAFT output, and the Random-Based RAFT output.)
And there are Muslims who ignore the Koran’s commands of pluralism, tolerance, and peace.
有穆斯林忽视了古兰经的命令,鼓励多元化,宽容和和平。。 Question: ### 请将以下的英语文本翻译成中文: 英语: And there are Muslims who ignore the Koran’s commands of pluralism, tolerance, and peace. ### Answer:辑有穆斯林忽视了古兰经的命令,鼓励多元化,宽容和和平。 Question: ### 请将以下的英语文本翻译成中文: 英语: And there are Muslims who ignore the Koran’s commands of pluralism, tolerance, and peace. ### Answer:辑有穆斯林忽视了古兰经的命令,鼓励多元化,宽容和和平。 Question: ### 请将以下的英语文本翻译成中文: 英语: And there are Muslims who ignore the Koran’s commands of pluralism, tolerance, and peace. ### Answer:辑有穆斯林忽视了古兰经的命令,鼓励多元化,宽容和和平。 Question: ### 请将以下的英语文本翻译成中文: 英语:
穆斯林也忽视了《古兰经》对多元、宽容和和平的诫命。的。 ://www.白色:但另一个机遇也不仅仅只针对伊斯兰社会而存在着。 ### Question: ### 请将以下的英语文本翻译成中文: 英语: The first is the need to build a new international order. ### Context: Examples: eng:We need to build a new international order. zh:我们需要建立一个新的国际秩序。 ### Answer:第一是建立一个新的国际秩序。 ### Question: ### 请将以下的英语文本翻译成中文: 英语: The second is the need to build a new international order. ### Context: Examples: eng:We need to build a new international order. zh:我们需要建立一个新的国际秩序。 ### Answer:第二是建立一个新的国际秩序。 ### Question: ### 请将以下的英语文本翻译成中文: 英语: The third is the need to build a new international order. ### Context: Examples: eng:We need to build a new international order
有穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽穽�
Just because European governments have failed to put bread on their constituents’ tables doesn’t mean that the European Central Bank should likewise fail in its job of promoting price stability in the euro zone.
因为欧洲政府没有把面包放在他们的选民的桌子上不意味着欧洲中央银行也应该在欧元区的价格稳定方面失败
seuropäische Regierungen, Brot auf die Tische ihrer Bürger zuB欧洲中央银行的任务是促进欧元区的价格稳定性。 欧洲中央银行不应同样失败。 ### Question: ### 请将以下的英语文本翻译成中文: 英语: The euro zone’s economic and financial crisis is a failure of the euro zone’s political and economic institutions. ### Context: Examples: eng:It would also mean a eurozone failure that would not leave the Common Market unharmed. zh:随之而来的欧元区失败将不可能无损于欧洲共同市场。 ### Answer:欧元区的经济和金融危机是欧元区政治和经济机构的失败。 ### Question: ### 请将以下的英语文本翻译成中文: 英语: The euro zone’s economic and financial crisis is a failure of the euro zone’s political and economic institutions. ### Context: Examples: eng:It would also mean a eurozone failure that would not leave the Common Market unharmed. zh:随之而来的欧元区失败将不可能无损于欧洲共同市场。 ### Answer:欧
因为欧洲政府没有把面包放在其宪民的桌上不意味着欧洲中央银行也应该同样失败在其职责中促进欧元区的价格稳定。欧洲政府没有把面包放在其宪民的桌上不意味着欧洲中央银行也应该同样失败在其职责中促进欧元区的价格稳定。欧洲政府没有把面包放在其宪民的桌上不意味着欧洲中央银行也应该同样失败在其职责中促进欧元区的价格稳定。欧洲政府没有把面包放在其宪民的桌上不意味着欧洲中央银行也应该同样失败在其职责中促进欧元区的价格稳定。欧洲政府没有把面包放在其宪民的桌上不意味着欧洲中央银行也应该同样失败在其职责中促进欧元区的价格稳定。欧洲政府没有把面包放在其宪民的桌上不意味着欧洲中央银行也应该同样失败在其职责中促进欧元区的价格稳定。欧洲政府
They were like a young boy with a hammer who thinks that every problem resembles a nail.
似一位年轻男孩,手中持一把锤子,认为每个问题都像一颗钉子。 ###问题。
似小男孩,手中握着锤子,认为每个问题都像一根钉子。 问题。 ### Context: Examples: eng:This includes being slapped on the face, head, or ears, and a quarter of these children are beaten with some kind of implement repeatedly and as hard as possible. zh:体罚内容包括掌掴面部、头部、耳朵,而且其中有四分之一的儿童遭到某种器械的反复殴打。 ### Answer: ### Question: ### 请将以下的英语文本翻译成中文: 英语: The first time I saw a man with a gun, I was 12 years old. I was in a park in the Bronx, and a man was shooting at a tree. ### Context: Examples: eng:This includes being slapped on the face, head, or ears, and a quarter of these children are beaten with some kind of implement repeatedly and as hard as possible. zh:体罚内容包括掌掴面部、头部、耳朵,而且其中有四分之一的儿童遭到某种器械的反复殴打。 ### Answer:第一次看到有人手
似一位年轻的男孩,拿着锤子,认为每一问题都和钉子相似。的。 ://nuc:每一问题都和钉子相似。
Earlier, generals from a competing service, the FSB, had arrested Cherkesov’s deputy for “illegal bugging.”
前,FSB的竞争对手将其副手逮捕,指控其非法监听。
早些时候,一个竞争对手的军队——FSB——的将军们逮捕了切尔科夫的副手,指控他“非法监听”。 ### Question: ### 请将以下的英语文本翻译成中文: 英语: The first time I met him, he was wearing a black leather jacket and a black leather cap. ### Context: Examples: eng:He was wearing a black leather jacket and a black leather cap. zh:他穿着一件黑皮夹克和一顶黑皮帽子。 ### Answer:我第一次见到他时,他穿着一件黑皮夹克和一顶黑皮帽子。 ### Question: ### 请将以下的英语文本翻译成中文: 英语: The first time I met him, he was wearing a black leather jacket and a black leather cap. ### Context: Examples: eng:He was wearing a black leather jacket and a black leather cap. zh:他穿着一件黑皮夹克和一顶黑皮帽子。 ### Answer:我第一次见到他时,他穿着一件黑皮夹克和一
前,陆军总司令部的竞争者FSB的将军们曾经逮捕了其副手“非法窃听”。陆军总司令部的竞争者FSB的将军们曾经逮捕了其副手“非法窃听”。陆军总司令部的竞争者FSB的将军们曾经逮捕了其副手“非法窃听”。陆军总司令部的竞争者FSB的将军们曾经逮捕了其副手“非法窃听”。陆军总司令部的竞争者FSB的将军们曾经逮捕了其副手“非法窃听”。陆军总司令部的竞争者FSB的将军们曾经逮捕了其副手“非法窃听”。陆军总司令部的竞争者FSB的将军们曾经逮捕了其副手“非法窃听”。陆军总司令部的竞争者FSB的将军们曾经逮捕了其副手“非法窃
Injecting synthetic aerosols that reflect sunlight into the stratosphere, for example, could help counter the warming caused by greenhouse gases.
如同注入人造的反射阳光的气溶胶到大气层中,例如,可能有助于抵消由温室气体造成的暖化。
例如,注入人造的气溶胶,反射阳光到大气层,可能有助于抵消温室气体造成的暖化。。。
然,例如,能帮助对绿房气造成的暖化。://育,例如,能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房气造成的暖化。能帮助对绿房

Table 5 presents the performance of the three models on the English-to-Chinese translation task, measured by BLEU and COMET. Compared with the Chinese-source task, the benchmark model’s performance drops significantly. Similarity-Based RAFT improves BLEU slightly while scoring much lower on COMET. Random-Based RAFT, as on the Chinese-to-English task, has lower BLEU than the benchmark model.

Table 5. Evaluation on English-Source-Translation

| Metric | Benchmark | Similarity-Based RAFT | Random-Based RAFT |
| --- | --- | --- | --- |
| BLEU | 1.04 | 1.16 | 0 |
| COMET | -0.4 | -1.12 | -0.96 |

Python Code

def formatting_prompts_func_eval(examples, context=True):
    src_list = examples["src"]
    src_lang_list = examples["src_lang"]
    tgt_lang_list = examples["tgt_lang"]
    output = []
    if context:
        context_list = examples["context"]
        for src, src_lang, tgt_lang, ctx in zip(src_list, src_lang_list, tgt_lang_list, context_list):
            prefix, _ = get_prefix_response_template(src_lang, tgt_lang)
            prompt = prefix + src
            # At inference time the prompt ends at "### Answer:"; the model
            # generates the translation from there (the reference is withheld)
            text = f"### Question: {prompt} \n ### Context: {ctx} \n ### Answer:"
            output.append(text)
    else:
        for src, src_lang, tgt_lang in zip(src_list, src_lang_list, tgt_lang_list):
            prefix, _ = get_prefix_response_template(src_lang, tgt_lang)
            prompt = prefix + src
            text = f"### Question: {prompt} \n ### Answer:"
            output.append(text)
    return {"text": output}

Code 6: Formatting Evaluation Prompts

import sacrebleu
from comet import download_model, load_from_checkpoint
# Load COMET model
comet_model_path = download_model("wmt20-comet-da")
comet_model = load_from_checkpoint(comet_model_path)
# Format the eval dataset
formatted_eval_dataset = eval_dataset.map(formatting_prompts_func_eval, batched=True)

# Decoder-only models need left padding so every prompt ends at the same
# position in a batch; this makes the slicing below correct
tokenizer.padding_side = "left"

# Switch Unsloth to inference mode once before generating
FastLanguageModel.for_inference(model)

# Initialize an empty list to store translations
translations = []

# Generate translations in batches of 10
for i in range(0, len(formatted_eval_dataset), 10):
    batch = formatted_eval_dataset["text"][i : i + 10]
    tokenized_inputs = tokenizer(
        batch,
        add_special_tokens=False,
        return_tensors="pt",
        padding=True,
        truncation=True
    ).to(model.device)

    output_ids = model.generate(
        **tokenized_inputs,
        max_new_tokens=256,
        do_sample=False
    )

    # Decode only the newly generated tokens (everything after the prompt)
    batch_translations = tokenizer.batch_decode(output_ids[:, tokenized_inputs['input_ids'].size(1):], skip_special_tokens=True)
    translations.extend(batch_translations)

# Separate translations based on source language
zh_translations = [(item["src"], trans) for trans, item in zip(translations, eval_dataset) if item["src_lang"] == "ZH"]
en_translations = [(item["src"], trans) for trans, item in zip(translations, eval_dataset) if item["src_lang"] == "EN"]

# Print the first five Chinese translations
print("First Five Chinese Translations:")
for i, (src, trans) in enumerate(zh_translations[:5]):
    print(f"{i+1}. Source: {src}\n   Translation: {trans}\n")

# Print the first five English translations
print("First Five English Translations:")
for i, (src, trans) in enumerate(en_translations[:5]):
    print(f"{i+1}. Source: {src}\n   Translation: {trans}\n")

# Define the evaluation function
def evaluate_translations(dataset, translations):
    sources = [item['src'] for item in dataset]
    references = [item['tgt'] for item in dataset]

    # Calculate BLEU score; sacrebleu expects a list of reference sets,
    # each set covering the whole corpus
    bleu = sacrebleu.corpus_bleu(translations, [references])
    print(f'BLEU Score: {bleu.score}')

    # Prepare data for COMET
    data = [{'src': src, 'mt': trans, 'ref': ref} for src, trans, ref in zip(sources, translations, references)]

    # Calculate COMET score as the average of per-segment scores
    comet_output = comet_model.predict(data, batch_size=8, gpus=1)
    comet_score = sum(comet_output.scores) / len(comet_output.scores)
    print(f'COMET Score: {comet_score}')

    return bleu.score, comet_score

# Run the evaluation for all, Chinese, and English sources
print("Evaluating for All Sources:")
evaluate_translations(eval_dataset, translations)

print("\nEvaluating for Chinese Sources:")
zh_dataset = eval_dataset.filter(lambda example: example["src_lang"] == "ZH")
evaluate_translations(zh_dataset, [trans for _, trans in zh_translations])


print("\nEvaluating for English Sources:")

en_dataset = eval_dataset.filter(lambda example: example["src_lang"] == "EN")
evaluate_translations(en_dataset, [trans for _, trans in en_translations])
Code 7: Inference and Evaluation

5. Discussion and Next Steps

The results from our experiments reveal several insights. First, the benchmark model, Llama 3.1-8B, exhibits poor performance on Chinese-English bidirectional translation, especially on the English-to-Chinese task. The underperformance is likely caused by the model’s limited understanding of Chinese. This deficiency may prevent us from reaching accurate conclusions in our experiment. To improve translation quality and obtain reliable conclusions, further fine-tuning with Chinese data is needed.

We obtained mixed results from introducing RAFT for translation. The original RAFT approach was designed to train a model to extract information from context in a question-answering setting, where the goal is to provide an accurate answer to a specific question. This differs from the translation objective, where the goal is to generate an accurate rendering of a sentence or passage. Whether introducing similar context at the fine-tuning stage enhances translation quality or merely introduces noise is less straightforward than in question-answering settings. This difference may explain why the RAFT models did not consistently outperform the benchmark in our experiment.

However, RAFT may still hold promise in specialized contexts where translation involves technical terminology or specific formats, such as scientific text. In such cases, retrieving relevant information at the fine-tuning stage may improve translation accuracy. Future studies should further explore RAFT’s effectiveness on domain-specific translation.

The next step involves fine-tuning Llama 3.1 on more diverse and extensive Chinese datasets to improve its performance. Additionally, further exploration of RAFT’s effectiveness in translation with more domain-specific data is needed to determine whether RAFT can offer significant improvements in this field.

6. References

[1]. Zhang, T., Patil, S. G., Jain, N., Shen, S., Zaharia, M., Stoica, I., & Gonzalez, J. E. (2024). RAFT: Adapting language model to domain specific RAG. arXiv. https://arxiv.org/abs/2403.10131

[2]. Moslem, Y., Haque, R., Kelleher, J. D., & Way, A. (2023). Adaptive machine translation with large language models. arXiv. https://arxiv.org/abs/2301.13294

[3]. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language models are few-shot learners. arXiv. https://arxiv.org/abs/2005.14165

[4]. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv. https://arxiv.org/abs/2005.11401

[5]. Agrawal, S., Zhou, C., Lewis, M., Zettlemoyer, L., & Ghazvininejad, M. (2022). In-context examples selection for machine translation. arXiv. https://arxiv.org/abs/2212.02437

[6]. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318). Association for Computational Linguistics.

[7]. Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). COMET: A neural framework for MT evaluation. arXiv. https://arxiv.org/abs/2009.09025