Models trained on crowdsourced annotations may not reflect the views of the broader population if those who work as annotators do not represent that population. In this paper, we propose PAIR: Population-Aligned Instance Replication, a post-processing method that adjusts training data to better reflect target population characteristics without collecting additional annotations. Using simulation studies on offensive language and hate speech detection with varying annotator compositions, we show that non-representative annotator pools degrade model calibration while leaving accuracy largely unchanged. PAIR corrects these calibration problems by replicating annotations from underrepresented annotator groups to match population proportions. We conclude with recommendations for improving the representativeness of training data and model performance.
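The replication step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's released implementation: it assumes each annotation carries its annotator's group label and that target population shares are known; the function and column names (replicate_to_population, annotator_group, target_shares) are illustrative.

```python
# Sketch: replicate annotations from underrepresented annotator groups so the
# group composition of the training data matches target population shares.
import pandas as pd

def replicate_to_population(annotations: pd.DataFrame,
                            group_col: str,
                            target_shares: dict) -> pd.DataFrame:
    """Return a copy of `annotations` with rows repeated so that the share of
    each annotator group approximates `target_shares`."""
    observed_shares = annotations[group_col].value_counts(normalize=True)
    # Replication factor per group: target share relative to observed share.
    factors = {g: target_shares[g] / observed_shares[g] for g in target_shares}
    # Rescale so every group keeps at least one copy (only over-represent,
    # never drop annotations).
    min_factor = min(factors.values())
    counts = {g: max(1, round(f / min_factor)) for g, f in factors.items()}
    replicated = annotations.loc[
        annotations.index.repeat(annotations[group_col].map(counts))
    ].reset_index(drop=True)
    return replicated

# Example: the annotator pool is 80% group A / 20% group B, but the target
# population is 50/50, so group B's annotations are replicated four times.
pool = pd.DataFrame({
    "text_id": range(10),
    "label": [1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
    "annotator_group": ["A"] * 8 + ["B"] * 2,
})
balanced = replicate_to_population(pool, "annotator_group",
                                   {"A": 0.5, "B": 0.5})
print(balanced["annotator_group"].value_counts(normalize=True))
```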
Entry type: inproceedings
BibTeXKey: EMK+25