This paper investigates a critical design decision in the practice of massively multilingual continual pre-training --- the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama3 family of models to 500 languages. To this end, we construct a bilingual translation corpus named MaLA, containing data from more than 2,500 language pairs. Subsequently, we develop the EMMA-500 Llama 3 suite of four massively multilingual models --- continually pre-trained from the Llama 3 family of base models extensively on diverse data mixes up to 671B tokens --- and explore the effect of continual pre-training with or without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks demonstrates that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, EMMA-500 Llama 3 suite artefacts, code, and model generations.
1 We use MaLA and EMMA to name the corpus and models, following the naming convention of EMMA-500 Llama 2 (Ji et al., 2024), which is not an artifact of this paper. MaLA names a collection of corpora and EMMA a collection of models. In this paper, "Mono" and "Bi" denote CPT on the monolingual (Mix 2) and bilingual (Mix 1) mixes, respectively.
While multilingual models can achieve broad coverage, perfect uniformity across all tasks and languages remains an unattainable goal. However, we show that multilingual performance and language equality can be pushed forward with parallel training data.
2 A monolingual mix (Mix 2) contains monolingual data in many different languages, but not in the aligned format of parallel data.
| Category | Language Pairs | Tokens |
|---|---|---|
| very high | 4 | 85B |
| high | 83 | 210B |
| medium-high | 67 | 47B |
| medium | 281 | 64B |
| medium-low | 508 | 20B |
| low | 655 | 2.5B |
| very low | 909 | 0.18B |
| Total | 2,507 | 430B |
Data mix 1: bilingual
Data mix 2: monolingual
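To make the difference between the two mixes concrete, here is a minimal sketch of how monolingual and bilingual examples could be serialized into plain training sequences for CPT. The field names and the translation template are illustrative assumptions, not the exact preprocessing used for EMMA-500.

```python
# Illustrative sketch (not the exact EMMA-500 preprocessing): serializing
# monolingual vs. bilingual examples into plain training sequences for CPT.

# A monolingual example (Mix 2): raw text in a single language.
mono_example = {"lang": "swh_Latn", "text": "Habari ya asubuhi."}

# A bilingual example (Mix 1): an aligned source/target translation pair.
bi_example = {
    "src_lang": "eng_Latn",
    "tgt_lang": "swh_Latn",
    "src_text": "Good morning.",
    "tgt_text": "Habari ya asubuhi.",
}

def serialize_mono(ex: dict) -> str:
    """Monolingual text is used as-is."""
    return ex["text"]

def serialize_bi(ex: dict) -> str:
    """Hypothetical template joining the aligned pair into one training sequence."""
    return (
        f"Translate from {ex['src_lang']} to {ex['tgt_lang']}:\n"
        f"{ex['src_text']}\n{ex['tgt_text']}"
    )

print(serialize_mono(mono_example))
print(serialize_bi(bi_example))
```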
| Base Model | Data Mix | Our Models | Tokens |
|---|---|---|---|
| Llama 3 | Monolingual (Mix 2) | 🤗 EMMA Llama 3 8B Mono | 419B |
| Llama 3 | Bilingual (Mix 1) | 🤗 EMMA Llama 3 8B Bi | 671B |
| Llama 3.1 | Monolingual (Mix 2) | 🤗 EMMA Llama 3.1 8B Mono | 419B |
| Llama 3.1 | Bilingual (Mix 1) | 🤗 EMMA Llama 3.1 8B Bi | 671B |
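The released checkpoints are standard causal language models, so they can be loaded like any other 🤗 Transformers model. The snippet below is a minimal sketch; the Hub repository ID is an assumption inferred from the model names above and may differ from the actual one.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face Hub ID; check the released model cards for the exact name.
model_id = "MaLA-LM/emma-500-llama3-8b-bi"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs `accelerate`

prompt = "Translate from English to Swahili: Good morning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```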
| Tasks | Dataset | Metric | Samples/Lang | Languages | Domain | Results |
|---|---|---|---|---|---|---|
| Text Classification | 🤗 SIB200 | Accuracy | 204 | 205 | Misc | [Results] |
| Text Classification | Taxi1500 | Accuracy | 111 | 1507 | Bible | [Results] |
| Commonsense Reasoning | 🤗 XCOPA | Accuracy | 600 | 11 | Misc | [Results] |
| Commonsense Reasoning | 🤗 XStoryCloze | Accuracy | 1870 | 11 | Misc | [Results] |
| Natural Language Inference | 🤗 XNLI | Accuracy | 2490 | 15 | Misc | [Results] |
| Machine Translation | 🤗 FLORES-200 | BLEU, chrF++ | 1012 | 204 | Misc | [Results] |
| Summarization | 🤗 XL-Sum | ROUGE-L, BERTScore | 2537 | 44 | News | [Results] |
| Summarization | 🤗 MassiveSumm Long | ROUGE-L, BERTScore | 3908 | 55 | News | [Results] |
| Summarization | 🤗 MassiveSumm Short | ROUGE-L, BERTScore | 5538 | 88 | News | [Results] |
| Machine Comprehension | 🤗 BELEBELE | Accuracy | 900 | 122 | Misc | [Results] |
| Machine Comprehension | ARC multilingual | Accuracy | 1170 | 31 | Misc | [Results] |
| Math | 🤗 MGSM direct | Accuracy | 250 | 10 | Misc | [Results] |
| Math | 🤗 MGSM CoT | Accuracy | 250 | 10 | Misc | [Results] |
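For the machine translation rows, BLEU and chrF++ can be computed from model generations with the sacrebleu library, as in the toy sketch below. The hypotheses and references here are placeholders, and this is not necessarily the exact evaluation pipeline used for the reported results.

```python
# Toy sketch of scoring MT outputs with sacrebleu; not necessarily the paper's pipeline.
import sacrebleu

hypotheses = ["Habari ya asubuhi.", "Asante sana."]    # model generations (placeholders)
references = [["Habari za asubuhi.", "Asante sana."]]  # one reference stream, aligned to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
# chrF++ is chrF with word n-grams up to order 2 (word_order=2).
chrf_pp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)

print(f"BLEU:   {bleu.score:.2f}")
print(f"chrF++: {chrf_pp.score:.2f}")
```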
@article{ji2025emma2,
  title={Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data},
  author={Shaoxiong Ji and Zihao Li and Jaakko Paavola and Indraneil Paul and Hengyu Luo and Jörg Tiedemann},
  year={2025},
  journal={arXiv preprint arXiv:2506.00469},
  url={https://arxiv.org/abs/2506.00469},
}