Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
MCML Authors
Peiqin Lin
Dr.
* Former Member
Nora Kassner
* Former Member
Abstract
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, 'help' from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures.
inproceedings ILK+23
ACL 2023
61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023.
Authors
A. Imani • P. Lin • A. H. Kargaran • S. Severini • M. J. Sabet • N. Kassner • C. Ma • H. Schmid • A. Martins • F. Yvon • H. Schütze
Links
DOI GitHub
In Collaboration
Unbabel
Research Area
BibTeXKey: ILK+23