Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora

MCML Authors

Dr. Matthias Aßenmacher

→ Group Bernd Bischl
Statistical Learning and Data Science

Abstract

Language corpora are the foundation of most natural language processing research, yet they often reproduce structural inequalities. One such inequality is gender discrimination in how actors are represented, which can distort analyses and perpetuate discriminatory outcomes. This paper introduces a user-centric, actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. By combining discourse-aware analysis with metrics for sentiment, syntactic agency, and quotation styles, our method enables both fine-grained auditing and exclusion-based balancing. Applied to the taz2024full corpus of German newspaper articles (1980–2024), the pipeline yields a more gender-balanced dataset while preserving core dynamics of the source material. Our findings show that structural asymmetries can be reduced through systematic filtering, though subtler biases in sentiment and framing remain. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.

inproceedings UTA+25a

Eval4NLP @AACL 2025

5th Workshop on Evaluation and Comparison of NLP Systems at the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. Mumbai, India, Dec 20–24, 2025.