Welcome to MaLA-LM (Massive Language Adaptation of Large Language Models)!
MaLA-LM focuses on adapting large language models to support hundreds of languages, including many underrepresented ones. Our models are multilingual, scalable, and optimized for diverse linguistic tasks. We work on data construction (e.g., the MaLA corpus and PolyWrite), continual pretraining (e.g., EMMA-500, MaLA-500, and MixCPT), instruction fine-tuning (e.g., monolingual vs. multilingual Alpaca and Lucky 52), and evaluation (e.g., GlotEval).
Featured
Check out our multilingual LLM collections: models trained to handle 500+ languages, ideal for global, multilingual applications.
Dive into the Hugging Face collections: EMMA-500 | MaLA corpus | MaLA-500
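The models in these collections can be used with the standard `transformers` API. Below is a minimal sketch of loading a checkpoint from the Hub and generating a continuation; the repository name `MaLA-LM/emma-500-llama2-7b` is an assumption, so please check the collections above for the exact model IDs.

```python
# Minimal sketch: load a MaLA-LM checkpoint from the Hugging Face Hub and
# generate a short continuation. The model ID below is an assumption; see the
# MaLA-LM collections on the Hub for the actual repository names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaLA-LM/emma-500-llama2-7b"  # assumed repo name; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Translate to Swahili: Good morning, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```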
Continual Pretraining
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
MaLA-500: Massive Language Adaptation of Large Language Models
Instruction Fine-tuning
How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM
Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
Evaluation
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models