MaLA-500: Massive Language Adaptation of Large Language Models

MCML Authors

Peiqin Lin

Dr.

* Former Member

→ Group Hinrich Schütze
Computational Linguistics

Hinrich Schütze

Prof. Dr.

Core PI

Computational Linguistics

Abstract

Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages, respectively.
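
For readers unfamiliar with the metric, macro-average accuracy across languages gives every language equal weight regardless of how many test examples it has. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name, the language labels, and the counts are hypothetical and chosen only for illustration.

```python
from statistics import mean


def macro_average_accuracy(per_language):
    """Average the per-language accuracies, weighting each language equally.

    per_language maps a language label to (num_correct, num_total).
    """
    accuracies = [correct / total for correct, total in per_language.values()]
    return mean(accuracies)


# Hypothetical counts for illustration only (not results from the paper).
toy_counts = {"lang_a": (90, 100), "lang_b": (3, 10)}

macro = macro_average_accuracy(toy_counts)

# Micro-average, for contrast: pool all examples before dividing,
# so languages with more test examples dominate the score.
micro = sum(c for c, _ in toy_counts.values()) / sum(t for _, t in toy_counts.values())
```

With these toy counts the macro-average is lower than the micro-average, because the small, low-accuracy language counts as much as the large one. This is why a macro-average is a natural choice when the goal is coverage of many low-resource languages.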

misc LJT+24

Preprint

Apr. 2024

Authors

P. Lin • S. Ji • J. Tiedemann • A. F. T. Martins • H. Schütze

Links

arXiv GitHub

In Collaboration

Unbabel

Research Area

B2 | Natural Language Processing

BibTeXKey: LJT+24
