
Measuring Bias of Web-Filtered Text Datasets and Bias Propagation Through Training


Abstract

We investigate biases in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl, including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite these datasets being obtained with similar filtering and deduplication steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that popular pretraining datasets have their own unique biases or fingerprints. These biases remain even when the text is rewritten with LLMs. Moreover, these biases propagate through training: random sequences generated by models trained on those datasets can be classified well by a classifier trained on the original datasets.
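The dataset classification setup described in the abstract can be sketched in miniature. The paper trains neural networks on sequences labeled by their source dataset; the sketch below substitutes a simple bag-of-words Naive Bayes classifier, and the two mini-corpora ("C4-like", "DCLM-like") are invented stand-ins for dataset-specific fingerprints, not samples from the real datasets.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_nb(labeled_texts):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    class_counts = Counter()  # number of sequences per source dataset
    word_counts = {}          # per-class word frequencies
    vocab = set()
    for text, label in labeled_texts:
        class_counts[label] += 1
        wc = word_counts.setdefault(label, Counter())
        for w in tokenize(text):
            wc[w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(text, model):
    """Return the most likely source-dataset label for one text sequence."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_logprob = None, float("-inf")
    for label in class_counts:
        logprob = math.log(class_counts[label] / total)  # class prior
        # Laplace-smoothed word likelihoods
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(text):
            logprob += math.log((word_counts[label][w] + 1) / denom)
        if logprob > best_logprob:
            best_label, best_logprob = label, logprob
    return best_label

# Invented mini-corpora standing in for two web-filtered datasets;
# labels and texts are illustrative only.
corpus = [
    ("buy now limited offer click here", "C4-like"),
    ("subscribe today great deals online shop", "C4-like"),
    ("theorem proof lemma convergence bound", "DCLM-like"),
    ("gradient descent optimization convergence analysis", "DCLM-like"),
]
model = train_nb(corpus)
print(predict("proof of the convergence theorem", model))  # DCLM-like
```

The point of the experiment is that such a classifier reaches far-above-chance accuracy on real web-filtered datasets, which is what reveals each dataset's fingerprint.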

Preprint

Dec. 2024

Authors

Y. Mansour, R. Heckel

Research Area

 A2 | Mathematical Foundations

BibTeX Key: MH24a
