MaLA-500: Massive Language Adaptation of Large Language Models

Peiqin Lin*, Shaoxiong Ji*, Jörg Tiedemann, André F. T. Martins, Hinrich Schütze
1Center for Information and Language Processing, LMU Munich 2Munich Center for Machine Learning
3University of Helsinki 4Instituto Superior Técnico (Lisbon ELLIS Unit) 5Instituto de Telecomunicações 6Unbabel
linpq@cis.lmu.de, shaoxiong.ji@helsinki.fi

*Indicates Equal Contribution

Abstract

Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting text in low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., by 11.68% and 4.82% macro-average accuracy across languages, respectively.
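
The adaptation recipe named in the abstract, vocabulary extension followed by continued pretraining on LLaMA 2, can be sketched with the Hugging Face transformers library. This is a minimal illustration and not the authors' training code: the checkpoint name, the toy token list, and the library choice are assumptions, and in practice the new subwords would come from a tokenizer trained on the multilingual corpus (e.g., Glot500-c) rather than a hand-written list.

# Minimal sketch of vocabulary extension before continued pretraining.
# Assumptions: the base checkpoint and the token list below are illustrative only.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Placeholder subword tokens standing in for pieces learned from
# low-resource-language corpora; not the actual MaLA-500 vocabulary.
new_tokens = ["▁ɛ", "▁ŋa", "▁tlh"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding (and tied output) matrix so the new token ids have rows.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

The newly added embedding rows are randomly initialized, so it is the continued pretraining step on the multilingual corpus that gives the extended vocabulary useful representations for low-resource languages.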

BibTeX


@article{lin2024mala,
    title={MaLA-500: Massive Language Adaptation of Large Language Models},
    author={Lin, Peiqin and Ji, Shaoxiong and Tiedemann, J{\"o}rg and Martins, Andr{\'e} FT and Sch{\"u}tze, Hinrich},
    journal={arXiv preprint arXiv:2401.13303},
    year={2024}
}