October 2023, How To Improve In-Context Learning Performance in Machine Translation

Chen Yufeng - Waseda University

1. Introduction

Large language models (LLMs) have exhibited impressive proficiency in downstream tasks simply by conditioning on a small set of input-label pairs. This mode of inference is called in-context learning (Brown et al., 2020) [1]. In simpler terms, GPT-4 can improve its translation capability without any fine-tuning of the model itself: when provided with task-specific examples, GPT-4 understands and performs the task better than it would without in-context learning.

(Figure 1: A few-shot prompt containing task-specific examples enables in-context learning for a Chinese-to-English translation task.)

In fact, the potency of in-context learning for LLMs can be explained by an equation grounded in implicit Bayesian inference (Xie et al., 2022) [2]. Specifically,

$$p(\text{output} \mid \text{prompt}) = \int_{\text{concept}} p(\text{output} \mid \text{concept}, \text{prompt}) \, p(\text{concept} \mid \text{prompt}) \, d(\text{concept})$$

According to this equation, if $p(\text{concept} \mid \text{prompt})$ concentrates on the prompt concept as more examples are provided, the output of an LLM improves because the prompt concept is identified more effectively (Bashir, 2023) [3]. Random selection of examples, however, cannot effectively help GPT-4 acquire a comprehensive understanding of the prompt concept (Das et al., 2021 [4]; Liu et al., 2022 [5]; Margatina et al., 2023 [6]). Consequently, the primary objective becomes the strategic selection of more suitable examples based on the user's input prompt, thereby enhancing GPT-4's performance. The following section presents a method for selecting better translation examples from a dataset given the input, enabling GPT-4 to achieve high-accuracy translation from Chinese to English (ZH-EN), Japanese to English (JA-EN), and Vietnamese to English (VI-EN).

2. Proposed Method

This method assumes access to a dataset $D_s = \{(X_0, Y_0), \dots, (X_t, Y_t)\}$ of translation pairs in a given language pair, from which examples for in-context learning are chosen. $D_s$ can be large; the implications of dataset size for translation quality are explored later in this paper. A text retriever (Gao et al., 2023) [7] is designed to locate the top-K sentences in $D_s$ whose meaning is most similar to the user's prompt sentence. The retriever consists of two components: the first is the TF-IDF matrix, and the second is cosine similarity, both discussed below. The top-K examples selected by the retriever are combined with the user prompt; GPT-4 then translates the prompt, and evaluation metrics (BLEU and COMET) are employed to assess the accuracy of the translation.

TF-IDF Score

The TF-IDF matrix is composed of TF-IDF scores. The first ingredient, term frequency (TF), measures how often a word appears in a document:

$$TF(t,d)=\frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$

The second ingredient, inverse document frequency (IDF), must also be considered.

$$IDF(t,D)=\log\left(\frac{\text{Total number of documents in the corpus } D}{\text{Number of documents containing term } t}\right)$$

IDF measures the significance of a word across a collection of documents. In the present study, $D$ is the selected dataset $D_s$, and $d$ is an individual sentence within $D_s$. From these two quantities, the TF-IDF scores that make up the TF-IDF matrix are computed as $TF(t,d)\times IDF(t,D)$, which quantifies the significance of a given word within a particular document.
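To make the arithmetic concrete, here is a minimal sketch that computes a TF-IDF score by hand for a toy corpus; the corpus, the chosen term, and the function names are invented for illustration and are not part of the original experiment.

```python
import math

# Toy corpus: each "document" is a single sentence, as in D_s.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def tf(term: str, doc: str) -> float:
    # How often the term appears in one document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term: str, docs: list[str]) -> float:
    # Significance of the term across the whole corpus
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / containing)

# TF-IDF of "cat" in the first sentence: (1/6) * log(3/2) ≈ 0.0676
print(tf("cat", corpus[0]) * idf("cat", corpus))
```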

Cosine Similarity

Cosine similarity is a method for assessing the similarity between two vectors in an inner product space, and it finds significant application with TF-IDF vectors. For document similarity in particular, it gauges how alike two documents are by considering the angle between their vector representations. The cosine similarity between vectors $A$ and $B$ is computed as:

$$\text{cosine similarity}(A,B)=\frac{A \cdot B}{\|A\| \, \|B\|}$$

In this study, $A$ is the TF-IDF vector of the user prompt, while $B$ is the TF-IDF vector of another document in the dataset $D_s$. A higher cosine similarity score signifies a greater likeness between the user prompt and the document, so the top-K examples can be selected from $D_s$ based on their similarity scores to serve as in-context learning examples.
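As a quick illustration of the formula (the vectors below are hypothetical, not real TF-IDF rows from $D_s$):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

prompt_vec = np.array([0.2, 0.0, 0.7, 0.1])      # TF-IDF vector of the prompt (invented)
candidate_vec = np.array([0.1, 0.3, 0.6, 0.0])   # TF-IDF vector of one sentence in D_s
print(cos_sim(prompt_vec, candidate_vec))        # ~0.88
```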

(Figure 2: The retriever builds the TF-IDF matrix and computes cosine similarity scores to select the top-K examples from $D_s$ for in-context learning.)

3. Experimental Setup

3.1  Experimental Procedure

Next, the proposed method is applied to evaluate GPT-4's performance across three translation language pairs: ZH-EN, JA-EN, and VI-EN. To demonstrate that the proposed method is feasible, the experiment encompasses three distinct scenarios. In the first scenario, GPT-4 translates these language pairs without in-context learning, i.e., no translation examples are provided. In the second scenario, in-context learning is introduced, but all examples are chosen entirely at random. The final scenario implements the proposed method: the retriever computes the TF-IDF matrix, which enables the computation of cosine similarity scores between the user prompt and the other sentences in the dataset, and the top 4 examples by similarity score are provided to GPT-4 as in-context learning examples. After translation, two evaluation metrics, BLEU and COMET, are used to quantify GPT-4's translation accuracy for each language pair under the three scenarios.

BLEU Score

The BLEU score (bilingual evaluation understudy) is calculated for each translated segment by comparing it with reference translations; the segment scores are then averaged over the entire corpus to assess the overall quality of the translation (Papineni et al., 2002) [8]. Notably, this approach correlates well with human judgments of quality, so BLEU is employed here as a metric of GPT-4's translation accuracy.
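The paper does not name a particular BLEU implementation; as one hedged example, the sacreBLEU library (`pip install sacrebleu`) computes a corpus-level score, with the sentences below invented for illustration:

```python
import sacrebleu

hypotheses = ["The cat is on the mat."]       # GPT-4 translations (invented)
references = [["The cat sat on the mat."]]    # one list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on a 0-100 scale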

COMET Score

COMET is a neural framework for training multilingual machine translation evaluation models, achieving new state-of-the-art levels of correlation with human judgments (Rei et al., 2020) [9]. A pre-trained COMET model requires three inputs: the translated text, the source text, and the reference translation. These three inputs are encoded by a pre-trained encoder, and the encoded results are passed through a feed-forward regressor to produce the score.
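The paper does not specify which COMET checkpoint was used; a common choice is Unbabel's wmt22-comet-da via the `unbabel-comet` package, sketched below with invented sentences:

```python
from comet import download_model, load_from_checkpoint

# Download and load a pre-trained COMET checkpoint (assumed, not from the paper)
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [{
    "src": "猫坐在垫子上。",                # source text (invented)
    "mt":  "The cat is on the mat.",       # GPT-4 translation
    "ref": "The cat sat on the mat.",      # reference translation
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level COMET score
```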

3.2  Datasets

In the dataset selection process, OPUS-100 (Zhang et al., 2020) [10] was chosen for two primary reasons. First, OPUS-100 comprises a broad spectrum of translation language pairs, including ZH-EN, JA-EN, and VI-EN, so a single source adequately meets the requirements of this study without the need for multiple datasets. Second, OPUS-100 covers diverse domains, aligning with the goal of incorporating a wide range of domains into $D_s$; this diversity makes it more likely that relevant examples exist for any given user prompt. The OPUS-100 data was divided into two segments: for each language pair, 10,000 instances of the training data were employed to construct $D_s$, and the first 100 sentences of the test set were chosen as the testing data.
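The paper does not describe how OPUS-100 was loaded; one way, assuming the Hugging Face `datasets` hub copy under Helsinki-NLP/opus-100, is:

```python
from datasets import load_dataset

# Assumption: dataset id and config name follow the Hugging Face hub copy.
opus = load_dataset("Helsinki-NLP/opus-100", "en-zh")
d_s = opus["train"].select(range(10_000))   # 10,000 training pairs for D_s
test = opus["test"].select(range(100))      # first 100 test sentences

pair = d_s[0]["translation"]                # e.g. {"en": "...", "zh": "..."}
print(pair["zh"], "->", pair["en"])
```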

3.3  Programming Code

The programming code required to achieve the objective is not intricate. Initially, the TfidfVectorizer and cosine_similarity functions are imported from the scikit-learn package and used to construct the retriever. A critical step in the retriever is merging the user prompt with $D_s$ so that cosine similarity scores between the prompt and every sentence in $D_s$ can be calculated. This process allows the top 4 examples from $D_s$ to be identified for the prompt; all four examples are then embedded into the prompt sent to GPT-4.
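The paper does not reproduce the code itself, so the following is a minimal sketch of the retriever as described, assuming `d_s` is a list of (source, target) pairs; the function names and the few-shot prompt format are our own:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_top_k(prompt: str, d_s: list[tuple[str, str]], k: int = 4):
    sources = [src for src, _ in d_s]
    # Fit TF-IDF on the user prompt merged with all source sentences in D_s
    matrix = TfidfVectorizer().fit_transform([prompt] + sources)
    # Row 0 is the prompt; score it against every sentence in D_s
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    top = scores.argsort()[::-1][:k]
    return [d_s[i] for i in top]

def build_prompt(prompt: str, examples: list[tuple[str, str]]) -> str:
    # Embed the retrieved pairs as few-shot examples ahead of the user prompt
    shots = "\n".join(f"Chinese: {src}\nEnglish: {tgt}" for src, tgt in examples)
    return f"Translate Chinese to English.\n{shots}\nChinese: {prompt}\nEnglish:"
```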

(Figure 3: The final prompt instructs GPT-4 to perform Chinese-to-English translation, incorporating the best four examples from $D_s$.)

In Figure 3, the prompt, enriched with in-context learning, is complete, featuring the four examples identified by the retriever. The next phase is the analysis of the results derived from the experiment.

4. Results and Discussion

Table 1 summarizes all results. Based on these findings, the proposed approach demonstrates superior translation accuracy compared to the other scenarios across all three language pairs. Although the increase may seem modest, a 1% improvement in BLEU score is significant in the context of machine translation. It is also noteworthy that random in-context learning occasionally lags behind not employing in-context learning at all. This highlights the critical importance of judiciously selecting examples for GPT-4 during in-context learning: inappropriate examples may adversely affect its overall performance.

(Table 1: Translation accuracy across the three scenarios for all language pairs.)

Another aspect to consider is the size of $D_s$. The initial expectation was that a larger $D_s$ would yield better results, since a sizable dataset can encompass a diverse range of domains and thus provide more effective examples for GPT-4. This expectation was validated through experimentation with a larger $D_s$ comprising 1 million sentences. Table 2 shows the results obtained when this expanded dataset is used as $D_s$ for selecting in-context learning examples.

(Table 2: Translation accuracy as the size of $D_s$ is incrementally increased.)

There is no doubt that using a larger dataset as $D_s$ can enhance GPT-4's task performance. Therefore, combining this method with a carefully crafted, extensive dataset is essential for GPT-4 to attain high performance, particularly in machine translation.

5. Conclusion and Next Steps

This paper introduces a method for enhancing GPT-4's translation capability through in-context learning. The pivotal component of the approach is a retriever, built from a TF-IDF matrix and cosine similarity scores, that identifies sentences in the dataset $D_s$ that closely align with the user prompt; a subset of these examples is then selected from $D_s$ for GPT-4's in-context learning. Experimental translation evaluations indicate that the approach is feasible, as evidenced by improvements in BLEU and COMET scores over scenarios without in-context learning or with random in-context learning. Moreover, the study reveals that a larger $D_s$ contributes significantly to the method's effectiveness, as a more extensive dataset increases the retriever's likelihood of identifying superior examples. In light of these findings, the author advocates implementing this approach with a sizable dataset.

Nevertheless, two key aspects require further exploration in future research. First, how to construct a reliable $D_s$ must be determined. While the OPUS-100 dataset was used in this study, a comprehensive dataset spanning various domains and containing highly accurate reference translations is needed to maximize the method's potential. Although building such a dataset is time-consuming, GPT-4's translation proficiency is anticipated to peak once it is available. Second, the impact of the number of in-context learning examples on translation accuracy should be investigated. Currently, only the top 4 examples by cosine similarity score are employed; exploring scenarios with 5 or 10 examples would provide insight into how quantity influences translation accuracy. This paper, focusing on the quality of in-context learning examples, lays the groundwork for such future investigations, and their anticipated outcomes will serve as valuable reference points for subsequent research and practical applications.

6. References

[1]. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... Amodei, D. (2020). "Language Models are Few-Shot Learners."

[2]. Xie, S. M., Raghunathan, A., Liang, P., & Ma, T. (2022). "An Explanation of In-context Learning as Implicit Bayesian Inference."

[3]. Bashir, D. (2023). "In-Context Learning, in Context." The Gradient. https://thegradient.pub/in-context-learning-in-context/

[4]. Das, R., Zaheer, M., Thai, D., Godbole, A., Perez, E., Lee, J. Y., Tan, L., Polymenakos, L., & McCallum, A. (2021). "Case-based Reasoning for Natural Language Queries over Knowledge Bases." In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 9594-9611). Association for Computational Linguistics.

[5]. Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., & Chen, W. (2022). "What Makes Good In-Context Examples for GPT-3?" In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures (pp. 100-114). Association for Computational Linguistics.

[6]. Margatina, K., Schick, T., Aletras, N., & Dwivedi-Yu, J. (2023). "Active Learning Principles for In-Context Learning with Large Language Models."

[7]. Gao, L., Chaudhary, A., Srinivasan, K., Hashimoto, K., Raman, K., & Bendersky, M. (2023). "Ambiguity-Aware In-Context Learning with Large Language Models."

[8]. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 311-318). Association for Computational Linguistics.

[9]. Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). "COMET: A Neural Framework for MT Evaluation."

[10]. Zhang, B., Williams, P., Titov, I., & Sennrich, R. (2020). "Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation."