Massive Language Adaptation of Large Language Models

MaLA-LM

Welcome to MaLA-LM (Massive Language Adaptation of Large Language Models)! 🌍

MaLA-LM focuses on adapting large language models to support hundreds of languages, including many underrepresented ones. Our models are multilingual, scalable, and optimized for diverse linguistic tasks. We work on data construction (e.g., MaLA corpus and PolyWrite), continual pretraining (e.g., EMMA-500, MaLA-500, and MixCPT), instruction fine-tuning (e.g., monolingual vs. multilingual Alpaca and Lucky 52), and evaluation (e.g., GlotEval).

Featured 🗣️ Check out our multilingual LLM collections, featuring models trained to handle 500+ languages and suited to global, multilingual applications.

Dive into the HuggingFace collections: EMMA-500 | MaLA corpus | MaLA-500
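As a quick start, the sketch below shows how one of these checkpoints could be loaded with the Hugging Face Transformers library. The model ID `MaLA-LM/emma-500-llama2-7b` is an assumption for illustration; substitute the exact name from the collections above.

```python
# Minimal sketch: load a MaLA-LM checkpoint from the Hugging Face Hub and generate text.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint ID -- check the MaLA-LM collections for the exact model name.
model_id = "MaLA-LM/emma-500-llama2-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Prompt in any supported language; here a Swahili continuation as an example.
prompt = "Kiswahili ni lugha"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```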

Latest Updates

Our Works

Continual Pretraining 📜

Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

MaLA-500: Massive Language Adaptation of Large Language Models

Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

Instruction Fine-tuning 🔮

How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM

Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

Evaluation 🛠️

GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models