Nov 21, 2025
Data is the fuel of modern AI, but in the realm of multilingual models, quality is quickly becoming more valuable than quantity.
For years, the OPUS collection has served as the backbone of open-source machine translation, offering a massive repository of parallel texts. However, as we push the boundaries of what Large Language Models (LLMs) and Machine Translation (MT) systems can do, the noise inherent in web-scraped and aggregated data has become a bottleneck.
Today, we announce FineOPUS: a project dedicated to transforming the vast OPUS collection into a foundational, high-quality parallel corpus for the global AI community. Our mission is simple yet ambitious: we are applying a rigorous, empirically driven data curation philosophy to parallel data. Inspired by the success of the FineWeb project, we aren't just cleaning data; we are systematically engineering a resource to be reliable, equitable, and state-of-the-art.
Current open datasets often suffer from critical issues that hinder model performance.
FineOPUS aims to mitigate these issues through a transparent, reproducible pipeline.
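To make the idea of a transparent cleaning pipeline concrete, here is a minimal sketch of the kind of rule-based filter such a pipeline might include. The function name, thresholds, and heuristics below are illustrative assumptions, not FineOPUS's actual rules:

```python
def keep_pair(src: str, tgt: str,
              min_chars: int = 3,
              max_chars: int = 2000,
              max_len_ratio: float = 3.0) -> bool:
    """Heuristic filter for one sentence pair (illustrative thresholds)."""
    s, t = src.strip(), tgt.strip()
    # Drop empty or extreme-length segments.
    if not (min_chars <= len(s) <= max_chars) or not (min_chars <= len(t) <= max_chars):
        return False
    # Drop pairs whose lengths are wildly mismatched (likely misalignments).
    ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
    if ratio > max_len_ratio:
        return False
    # Drop identical source/target (untranslated copies).
    if s.lower() == t.lower():
        return False
    return True

pairs = [
    ("Hello, world!", "Bonjour le monde !"),
    ("Hello, world!", "Hello, world!"),  # untranslated copy, dropped
    ("Hi", "Une très longue phrase qui ne correspond pas du tout."),  # mismatch, dropped
]
cleaned = [p for p in pairs if keep_pair(*p)]
```

Real pipelines layer many such filters (language identification, deduplication, quality scoring); the point is that every rule is explicit and reproducible.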
We are not guessing which cleaning methods work best. We are proving it. The core of the FineOPUS strategy is validation through ablation. We will train dozens of models to empirically justify every single design choice in our pipeline. If a filtering step doesn't improve model performance, it doesn't make it into the final pipeline.
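The accept-or-reject logic above can be sketched as a simple greedy ablation loop. In practice the score would come from training a model on the filtered corpus and evaluating it on held-out data; the `proxy_score` function and the toy filters below are stand-in assumptions for illustration only:

```python
def ablate(corpus, candidate_filters, score_fn):
    """Greedily accept each candidate filter only if it improves the score."""
    kept, baseline = [], score_fn(corpus)
    for name, keep_pair in candidate_filters:
        filtered = [p for p in corpus if keep_pair(p)]
        score = score_fn(filtered)
        if score > baseline:  # reject any step that doesn't help
            corpus, baseline = filtered, score
            kept.append(name)
    return corpus, kept

def proxy_score(corpus):
    """Toy quality proxy: fraction of pairs with both sides non-empty."""
    if not corpus:
        return 0.0
    return sum(1 for s, t in corpus if s and t) / len(corpus)

corpus = [("hello", "bonjour"), ("", "vide"), ("cat", "chat")]
filters = [
    ("drop_empty", lambda p: bool(p[0] and p[1])),
    ("drop_everything", lambda p: False),  # harmful step: rejected by ablation
]
cleaned, accepted = ablate(corpus, filters, proxy_score)
```

Here the empty-pair filter raises the proxy score and is accepted, while the degenerate filter that removes everything lowers it and is rejected, mirroring the "no improvement, no inclusion" rule.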
We believe in open science. Upon completion, FineOPUS will release three key assets to the community.
We are building FineOPUS to empower researchers and developers to build more powerful, inclusive language technologies. By ensuring the world's linguistic diversity is represented in the future of AI, we hope to bridge the digital divide one sentence pair at a time.
Stay tuned for our FineOPUS release and technical report.
Discord: MaLA-LM