
GlotWeb: Web Indexing for Minority Languages

Abstract

We introduce GlotWeb, a system for indexing webpages written in minority languages. Popular search engines allow filtering query results by language only for high-resource languages; GlotWeb instead focuses on minority languages, systematically indexing webpages and providing highly accurate links for each language. We start with seed data from multiple sources, including search engine queries and Common Crawl datasets, then iteratively crawl the corresponding websites to collect additional pages in the same language. Language identification is applied to determine the language of each webpage, and language-specific wordlist filtering is used to ensure high accuracy. GlotWeb v1.0 contains over 169K linguistically verified links across more than 400 languages, with most religious content removed. Notably, 47% of these languages are absent from major multilingual datasets such as FLORES-200, MADLAD-400, and Glot500, highlighting GlotWeb's role in expanding digital resources for underrepresented languages. The data indices are available at hf.co/spaces/cis-lmu/GlotWeb, and the pipeline is at github.com/cisnlp/GlotWeb.
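
The wordlist filtering step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the GlotWeb implementation: the function names, the tokenization, the 0.5 threshold, and the toy wordlist are all assumptions made for illustration. The idea shown is only that a page already labeled with a language is kept if enough of its tokens appear in that language's wordlist.

```python
import re


def wordlist_coverage(text, wordlist):
    """Return the fraction of word tokens in `text` found in `wordlist`.

    Hypothetical helper: tokenization is a simple lowercase \\w+ split,
    which is an assumption and may not suit every script.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in wordlist)
    return hits / len(tokens)


def passes_wordlist_filter(text, wordlist, threshold=0.5):
    """Keep a page only if its wordlist coverage reaches `threshold`.

    The threshold value is an illustrative assumption, not a value
    taken from the paper.
    """
    return wordlist_coverage(text, wordlist) >= threshold


# Toy wordlist with placeholder tokens standing in for a real language.
toy_wordlist = {"alpha", "beta", "gamma"}

coverage = wordlist_coverage("alpha beta delta gamma", toy_wordlist)
keep_page = passes_wordlist_filter("alpha beta delta gamma", toy_wordlist)
drop_page = passes_wordlist_filter("delta epsilon zeta", toy_wordlist)
```

In this sketch the first page has three of four tokens in the wordlist and is kept, while the second has none and is dropped. A real pipeline would run such a check after language identification and would need wordlists and tokenization suited to each language.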

inproceedings SKY+26


WWW 2026

ACM Web Conference. Dubai, United Arab Emirates, Jun 29-Jul 03, 2026.
A* Conference

Authors

A. A. Sefat • A. H. Kargaran • F. Yvon • H. Schütze

Links

DOI GitHub

Research Area

 B2 | Natural Language Processing

BibTeXKey: SKY+26
