MaLA-LM: Massive Language Adaptation of Large Language Models

Welcome to MaLA-LM (Massive Language Adaptation of Large Language Models)! 🌍

MaLA-LM focuses on adapting large language models to support hundreds of languages, including many underrepresented ones. Our models are multilingual, scalable, and optimized for diverse linguistic tasks. We work on data construction (e.g., the MaLA corpus and PolyWrite), continual pretraining (e.g., EMMA-500, MaLA-500, and MixCPT), instruction fine-tuning (e.g., monolingual vs. multilingual Alpaca, and Lucky 52), and evaluation (e.g., GlotEval).

Featured 🗣️

Check out our multilingual LLM collections, featuring models trained to handle 500+ languages, ideal for global, multilingual applications.

Dive into the HuggingFace collections: EMMA-500 | MaLA corpus | MaLA-500
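
The models in these collections can be loaded with the standard Hugging Face `transformers` API. Below is a minimal sketch; the model id `MaLA-LM/emma-500-llama2-7b` and the example prompt are assumptions for illustration, so substitute the exact repository id from the collection you want to use.

```python
# Minimal sketch: load a MaLA-LM model from the Hugging Face Hub and generate text.
# The model id below is an assumption; check the EMMA-500 collection for the exact id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaLA-LM/emma-500-llama2-7b"  # assumed id, replace as needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Translate to Swahili: Good morning, friends."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```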

Continual Pretraining 📜

EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

MaLA-500: Massive Language Adaptation of Large Language Models

Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

GlotEval 🛠️

GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models