We introduce GlotWeb, a system for indexing webpages written in minority languages. Popular search engines allow query results to be filtered by language only for high-resource languages; GlotWeb instead focuses on minority languages, systematically indexing webpages and providing highly accurate links for each language. We start with seed data from multiple sources, including search engine queries and Common Crawl datasets, then iteratively crawl the corresponding websites to collect additional pages in the same language. Language identification is applied to determine the language of each webpage, and language-specific wordlist filtering is then used to ensure high accuracy. GlotWeb v1.0 contains over 169K linguistically verified links across more than 400 languages, with most religious content removed. Notably, 47% of these languages are absent from major multilingual datasets such as FLORES-200, MADLAD-400, and Glot500, highlighting GlotWeb's role in expanding digital resources for underrepresented languages. The data indices are available at hf.co/spaces/cis-lmu/GlotWeb, and the pipeline is at github.com/cisnlp/GlotWeb.
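The wordlist filtering step mentioned above can be illustrated with a minimal sketch. This is not the GlotWeb implementation: the function names, the tokenization rule, and the `min_coverage` and `min_tokens` thresholds are all assumptions chosen for illustration. The idea is that a page already labeled by a language identifier is kept only if a sufficient share of its tokens appears in a wordlist for that language.

```python
import re

# Hypothetical tokenizer: runs of Unicode letters (no digits or underscores).
# The actual GlotWeb pipeline may tokenize differently.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def wordlist_coverage(text, wordlist):
    """Return the fraction of tokens in `text` that occur in `wordlist`."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in wordlist)
    return hits / len(tokens)


def passes_wordlist_filter(text, wordlist, min_coverage=0.5, min_tokens=5):
    """Keep a page only if it is long enough and well covered by the wordlist.

    `min_coverage` and `min_tokens` are illustrative defaults, not values
    taken from the paper.
    """
    if len(TOKEN_RE.findall(text)) < min_tokens:
        return False
    return wordlist_coverage(text, wordlist) >= min_coverage


if __name__ == "__main__":
    # Placeholder vocabulary standing in for a real language-specific wordlist.
    toy_wordlist = {"alpha", "beta", "gamma", "delta"}
    print(passes_wordlist_filter("Alpha beta gamma delta alpha", toy_wordlist))
    print(passes_wordlist_filter("alpha beta the quick brown fox", toy_wordlist))
```

A coverage threshold of this kind trades recall for precision: raising `min_coverage` discards more borderline pages but reduces the chance of indexing a page whose language label is wrong.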
BibTeXKey: SKY+26