Home | Research | Groups | Barbara Plank

Research Group Barbara Plank

Barbara Plank

Prof. Dr.

Principal Investigator

B2 | Natural Language Processing

AI and Computational Linguistics

Barbara Plank

heads the Chair for AI and Computational Linguistics at LMU Munich.

Her lab carries out research in Natural Language Processing, an interdisciplinary subdiscipline of Artificial Intelligence at the interface of computer science, linguistics and cognitive science. In broad terms, the aim is human-facing NLP: to make NLP models more robust and inclusive, so that they can deal better with underlying shifts in data due to language variation, are fairer and embrace human label variation.

Team members @MCML

PostDocs

Silvia Casola

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Yang Janet Liu

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

PhD Students

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Beiduo Chen

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Felicia Körner

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Robert Litschko

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Andreas Säuberli

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Soh-Eun Shim

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Shijia Zhou

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Publications @MCML

2025

[93]

A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. S. Aditya K. Surikuchi, E. Takmaz and A. Testoni.
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

MCML Authors

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[92]

F. Eichin, Y. J. Liu, B. Plank and M. A. Hedderich.
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.

MCML Authors

Florian Eichin

B2 | Natural Language Processing
→ Group Michael Hedderich

AI and Computational Linguistics

Yang Janet Liu

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Michael Hedderich

Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[91]

M. A. Hedderich, A. Wang, R. Zhao, F. Eichin, J. Fischer and B. Plank.
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods, either automated metrics or human evaluation, have limitations, such as providing limited insights or being labor-intensive. We propose Spotlight, a new approach that combines both automation and human analysis. Based on data mining techniques, we automatically distinguish between random (decoding) variations and systematic differences in language model outputs. This process provides token patterns that describe the systematic differences and guide the user in manually analyzing the effects of their prompt and model changes efficiently. We create three benchmarks to quantitatively test the reliability of token pattern extraction methods and demonstrate that our approach provides new insights into established prompt data. From a human-centric perspective, through demonstration studies and a user study, we show that our token pattern approach helps users understand the systematic differences of language model outputs, and we are able to discover relevant differences caused by prompt and model changes (e.g. related to gender or culture), thus supporting the prompt engineering process and human-centric model behavior research.

MCML Authors

Michael Hedderich

Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Raoyuan Zhao

B2 | Natural Language Processing
→ Group Michael Hedderich

AI and Computational Linguistics

Florian Eichin

B2 | Natural Language Processing
→ Group Michael Hedderich

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[90]

B. Ma, Y. Li, W. Zhou, Z. Gong, Y. J. Liu, K. Jasinskaja, A. Friedrich, J. Hirschberg, F. Kreuter and B. Plank.
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Understanding pragmatics-the use of language in context-is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatics phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

MCML Authors

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Yang Janet Liu

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

C4 | Computational Social Sciences

Social Data Science and AI

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[89]

B. Ma, B. Yoztyurk, A.-C. Haensch, X. Wang, M. Herklotz, F. Kreuter, B. Plank and M. Aßenmacher.
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

In recent research, large language models (LLMs) have been increasingly used to investigate public opinions. This study investigates the algorithmic fidelity of LLMs, i.e., the ability to replicate the socio-cultural context and nuanced opinions of human participants. Using open-ended survey data from the German Longitudinal Election Studies (GLES), we prompt different LLMs to generate synthetic public opinions reflective of German subpopulations by incorporating demographic features into the persona prompts. Our results show that Llama performs better than other LLMs at representing subpopulations, particularly when there is lower opinion diversity within those groups. Our findings further reveal that the LLM performs better for supporters of left-leaning parties like The Greens and The Left compared to other parties, and matches the least with the right-party AfD. Additionally, the inclusion or exclusion of specific variables in the prompts can significantly impact the models’ predictions. These findings underscore the importance of aligning LLMs to more effectively model diverse public opinions while minimizing political biases and enhancing robustness in representativeness.

MCML Authors

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Anna-Carolina Haensch

Dr.

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

C4 | Computational Social Sciences

Social Data Science and AI

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[88]

P. Mondorf, S. Wold and B. Plank.
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions via subnetworks that can be composed to perform more complex tasks. Recent developments in mechanistic interpretability have made progress in identifying subnetworks, often referred to as circuits, which represent the minimal computational subgraph responsible for a model’s behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we examine the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through subnetwork set operations to represent more complex functional capabilities of the model.

MCML Authors

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[87]

V. Blaschke, M. Fedzechkina and M. Ter Hoeve.
Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Cross-lingual transfer is a popular approach to increase the amount of training data for NLP tasks in a low-resource context. However, the best strategy to decide which cross-lingual data to include is unclear. Prior research often focuses on a small set of languages from a few language families and/or a single task. It is still an open question how these findings extend to a wider variety of languages and tasks. In this work, we analyze cross-lingual transfer for 263 languages from a wide variety of language families. Moreover, we include three popular NLP tasks: POS tagging, dependency parsing, and topic classification. Our findings indicate that the effect of linguistic similarity on transfer performance depends on a range of factors: the NLP task, the (mono- or multilingual) input representations, and the definition of linguistic similarity.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

[86]

B. Chen, S. Peng, A. Korhonen and B. Plank.
A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Disagreement in human labeling is ubiquitous, and can be captured in human judgment distributions (HJDs). Recent research has shown that explanations provide valuable information for understanding human label variation (HLV) and large language models (LLMs) can approximate HJD from a few human-provided label-explanation pairs. However, collecting explanations for every label is still time-consuming. This paper examines whether LLMs can be used to replace humans in generating explanations for approximating HJD. Specifically, we use LLMs as annotators to generate model explanations for a few given human labels. We test ways to obtain and combine these label-explanations with the goal to approximate human judgment distribution. We further compare the resulting human with model-generated explanations, and test automatic and human explanation selection. Our experiments show that LLM explanations are promising for NLI: to estimate HJD, generated explanations yield comparable results to human’s when provided with human labels. Importantly, our results generalize from datasets with human explanations to i) datasets where they are not available and ii) challenging out-of-distribution test sets.

MCML Authors

Beiduo Chen

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[85]

A. Säuberli, D. Frassinelli and B. Plank.
Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?
BEA @ACL 2025 - 20th Workshop on Innovative Use of NLP for Building Educational Applications at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.

MCML Authors

Andreas Säuberli

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[84]

Y. Du, P. Mondorf, S. Casola, Y. Yao, R. Litschko and B. Plank.
Reason to Rote: Rethinking Memorization in Reasoning.
Preprint (Jul. 2025). arXiv

Abstract

Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding lights on the intriguing phenomenon of benign memorization.

MCML Authors

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Silvia Casola

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Robert Litschko

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[83]

C. Gruber, H. Alber, B. Bischl, G. Kauermann, B. Plank and M. Aßenmacher.
Revisiting Active Learning under (Human) Label Variation.
Preprint (Jul. 2025). arXiv

Abstract

Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed – or neglected – these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

MCML Authors

Helen Alber

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Göran Kauermann

Prof. Dr.

A1 | Statistical Foundations & Explainability

Applied Statistics in Social Sciences, Economics and Business

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[82]

V. Blaschke, M. Winkler, C. Förster, G. Wenger-Glemser and B. Plank.
A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation.
Preprint (Jun. 2025). arXiv

Abstract

Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. We provide both dialectal and Standard German transcriptions, and analyze the linguistic differences between them. We benchmark several multilingual state-of-the-art ASR models on speech translation into Standard German, and find differences between how much the output resembles the dialectal vs. standardized transcriptions. Qualitative error analyses of the best ASR model reveal that it sometimes normalizes grammatical differences, but often stays closer to the dialectal constructions.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[81]

S. Casola, Y. J. Liu, S. Peng, O. Kraus, A. Gatt and B. Plank.
Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics.
Preprint (Jun. 2025). arXiv

Abstract

Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of using different reference sets on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.

MCML Authors

Silvia Casola

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Yang Janet Liu

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[80]

D. N. Jakobi, M. Stegenwallner-Schütz, N. Hollenstein, C. Ding, R. Kaspere, A. M. Škorić, E. Pavlinusic Vilus, S. Frank, M.-L. Müller, K. M. Jensen de López, N. Kharlamov, H. B. Søndergaard Knudsen, Y. Berzak, E. Lion, I. A. S. Irina A. Sekerina, C. Acarturk, M. F. Ansari, K. Harezlak, P. Kasprowski, A. Bautista, L. Beinborn, A. Bondar, A. Boznou, L. Bradshaw, J. M. Hofmann, T. Krosness, N. B. Soliva, A. Çepani, K. Cergol, A. Došen, M. Palmovic, A. Çerpja, D. Chirino, J. Chromý, V. Demberg, I. Škrjanec, N. D. Deniz, I. Fajardo, M. Giménez-Salvador, X. Mínguez-López, M. Filip, Z. Freibergs, J. Gomes, A. Janeiro, P. Luegi, J. Veríssimo, S. Gramatikov, J. Hasenäcker, A. Haveriku, N. Kote, M. M. Kamal, H. Kędzierska, D. Klimek-Jankowska, S. Kosutar, D. G. Krakowczyk, I. Krejtz, M. Łockiewicz, K. Lõo, J. Motiejūnienė, J. A. Nasir, J. S. Krog Nedergård, A. Özkan, M. Preininger, L. Pungă, D. R. Reich, C. Tschirner, Š. Rot, A. Säuberli, J. Solé-Casals, E. Strati, I. Svoboda, E. Trandafili, S. Varlokosta, M. Vulchanova and L. A. .
MultiplEYE: Creating a multilingual eye-tracking-while-reading corpus.
ETRA 2025 - ACM Symposium on Eye Tracking Research and Applications. Tokyo, Japan, May 26-29, 2025. DOI

Abstract

Eye-tracking-while-reading data provide valuable insights across multiple disciplines, including psychology, linguistics, natural language processing, education, and human-computer interaction. Despite its potential, the availability of large, high-quality, multilingual datasets remains limited, hindering both foundational reading research and advancements in applications. The MultiplEYE project addresses this gap by establishing a large-scale, international eye-tracking data collection initiative. It aims to create a multilingual dataset of eye movements recorded during natural reading, balancing linguistic diversity, while ensuring methodological consistency for reliable cross-linguistic comparisons. The dataset spans numerous languages and follows strict procedural, documentation, and data pre-processing standards to enhance eye-tracking data transparency and reproducibility. A novel data-sharing framework, integrated with data quality reports, allows for selective data filtering based on research needs. Researchers and labs worldwide are invited to join the initiative. By establishing and promoting standardized practices and open data sharing, MultiplEYE facilitates interdisciplinary research and advances reading research and gaze-augmented applications.

MCML Authors

Andreas Säuberli

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

[79]

B. Chen, Y. Liu, A. Korhonen and B. Plank.
Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation.
Preprint (May. 2025). arXiv

Abstract

The recent rise of reasoning-tuned Large Language Models (LLMs)–which generate chains of thought (CoTs) before giving the final answer–has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopt a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.

MCML Authors

Beiduo Chen

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[78]

F. Eichin, Y. Du, P. Mondorf, B. Plank and M. A. Hedderich.
Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior.
Preprint (May. 2025). arXiv GitHub

Abstract

Post-hoc interpretability methods typically attribute a model’s behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, these approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all three perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to more realistic training settings. Empirically, we find that both a CNN and a Transformer model are replicated accurately by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. We show their effectiveness in parameter pruning that is comparable to existing methods, reinforcing their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Among other things, our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.

MCML Authors

Florian Eichin

B2 | Natural Language Processing
→ Group Michael Hedderich

AI and Computational Linguistics

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Michael Hedderich

Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[77]

P. Hong, B. Chen, S. Peng, M.-C. de Marneffe and B. Plank.
LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference.
Preprint (May. 2025). arXiv

Abstract

There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation–cases where annotators agree on the same label but provide divergent reasoning–poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators’ reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LITEX, a linguistically-informed taxonomy for categorizing free-text explanations. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy’s reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy’s usefulness in explanation generation, demonstrating that conditioning generation on LITEX yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.

MCML Authors

Beiduo Chen

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[76]

Y. Liu, M. Wang, A. H. Kargaran, F. Körner, E. Nie, B. Plank, F. Yvon and H. Schütze.
Tracing Multilingual Factual Knowledge Acquisition in Pretraining.
Preprint (May. 2025). arXiv GitHub

Abstract

Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts – an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities.

MCML Authors

Yihong Liu

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Mingyang Wang

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Amir Hossein Kargaran

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Felicia Körner

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Ercong Nie

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

[75]

R. S.-E. Shim, D. De Cristofaro, C. M. Hu, A. Vietti and B. Plank.
Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically.
Preprint (May. 2025). arXiv

Abstract

Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[74]

X. Wang, M. Wang, Y. Liu, H. Schütze and B. Plank.
Refusal Direction is Universal Across Safety-Aligned Languages.
Preprint (May. 2025). arXiv

Abstract

Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.

MCML Authors

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Mingyang Wang

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Yihong Liu

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[73]

R. Zhao, B. Chen, B. Plank and M. A. Hedderich.
MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs.
Preprint (May. 2025). arXiv

Abstract

Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata’s multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.

MCML Authors

Raoyuan Zhao

B2 | Natural Language Processing
→ Group Michael Hedderich

AI and Computational Linguistics

Beiduo Chen

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Michael Hedderich

Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[72]

S. Zhou, S. Peng, S. Luebke, J. Haßler, M. Haim, S. M. Mohammad and B. Plank.
What Media Frames Reveal About Stance: A Dataset and Study about Memes in Climate Change Discourse.
Preprint (May. 2025). arXiv

Abstract

Media framing refers to the emphasis on specific aspects of perceived reality to shape how an issue is defined and understood. Its primary purpose is to shape public perceptions often in alignment with the authors’ opinions and stances. However, the interaction between stance and media frame remains largely unexplored. In this work, we apply an interdisciplinary approach to conceptualize and computationally explore this interaction with internet memes on climate change. We curate CLIMATEMEMES, the first dataset of climate-change memes annotated with both stance and media frames, inspired by research in communication science. CLIMATEMEMES includes 1,184 memes sourced from 47 subreddits, enabling analysis of frame prominence over time and communities, and sheds light on the framing preferences of different stance holders. We propose two meme understanding tasks: stance detection and media frame detection. We evaluate LLaVA-NeXT and Molmo in various setups, and report the corresponding results on their LLM backbone. Human captions consistently enhance performance. Synthetic captions and human-corrected OCR also help occasionally. Our findings highlight that VLMs perform well on stance, but struggle on frames, where LLMs outperform VLMs. Finally, we analyze VLMs’ limitations in handling nuanced frames and stance expressions on climate change internet memes.

MCML Authors

Shijia Zhou

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[71]

L. Madaan, D. Esiobu, P. Stenetorp, B. Plank and D. Hupkes.
Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. To be published. Preprint available. arXiv

Abstract

In the recent past, a popular way of evaluating natural language understanding (NLU), was to consider a model’s ability to perform natural language inference (NLI) tasks. In this paper, we investigate if NLI tasks, that are rarely used for LLM evaluation, can still be informative for evaluating LLMs. Focusing on five different NLI benchmarks across six models of different scales, we investigate if they are able to discriminate models of different size and quality and how their accuracies develop during training. Furthermore, we investigate the extent to which the softmax distributions of models align with human distributions in cases where statements are ambiguous or vague. Overall, our results paint a positive picture for the NLI tasks: we find that they are able to discriminate well between models at various stages of training, yet are not (all) saturated. Furthermore, we find that while the similarity of model distributions with human label distributions increases with scale, it is still much higher than the similarity between two populations of humans, making it a potentially interesting statistic to consider.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[70]

R. Shim and B. Plank.
Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. To be published. Preprint available.

Abstract

There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Venacular English as homogeneous categories (Faisal et al., 2024; Ziems et al., 2023), yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and interpret the performance disparity to be due to a bias towards dialects that are more similar to the standard variety in the speech-to-text model examined. We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance, indicating there to be geographical structure in the performance distribution.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[69]

V. Blaschke.
Beyond 'noisy' text: How (and why) to process dialect data.
W-NUT @NAACL 2025 - 10th Workshop on Noisy and User-generated Text at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025). Albuquerque, NM, USA, Apr 29-May 04, 2025. Keynote Talk. PDF

Abstract

Processing data from non-standard dialects links two lines of research: creating NLP tools that are robust to ’noisy’ inputs, and extending the coverage of NLP tools to underserved language communities. In this talk, I will describe ways in which processing dialect data differs from processing standard-language data, and discuss some of the current challenges in dialect NLP research. For instance, I will talk about strategies to mitigate the effect of infelicitous subword tokenization caused by ad-hoc pronunciation spellings. Additionally, I argue that we should not only consider how to tackle dialectal variation in NLP, but also why. To this end, I will highlight perspectives of some dialect speaker communities on which language technologies should (or should not) be able to process or produce dialectal in- or output.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

[68]

X. Wang, C. Hu, P. Röttger and B. Plank.
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: Models should refuse to follow malicious instructions or give harmful advice (e.g.‘how do I kill someone?’), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. ‘how do I kill a Python process?’). Avoiding such false refusal, as prior work has shown, is challenging even for highly-capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces false refusal rate while preserving the model’s safety and general capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.

MCML Authors

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[67]

P. Mondorf, S. Zhou, M. Riedler and B. Plank.
Enabling Systematic Generalization in Abstract Spatial Reasoning through Meta-Learning for Compositionality.
Preprint (Apr. 2025). arXiv

Abstract

Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their knowledge to novel compositional scenarios, revealing notable limitations in systematic generalization. There has been an ongoing debate about whether neural networks possess the capacity for systematic generalization, with recent studies suggesting that meta-learning approaches designed for compositionality can significantly enhance this ability. However, these insights have largely been confined to linguistic problems, leaving their applicability to other tasks an open question. In this study, we extend the approach of meta-learning for compositionality to the domain of abstract spatial reasoning. To this end, we introduce SYGAR-a dataset designed to evaluate the capacity of models to systematically generalize from known geometric transformations (e.g., translation, rotation) of two-dimensional objects to novel combinations of these transformations (e.g., translation+rotation). Our results show that a transformer-based encoder-decoder model, trained via meta-learning for compositionality, can systematically generalize to previously unseen transformation compositions, significantly outperforming state-of-the-art LLMs, including o3-mini, GPT-4o, and Gemini 2.0 Flash, which fail to exhibit similar systematic behavior. Our findings highlight the effectiveness of meta-learning in promoting systematicity beyond linguistic tasks, suggesting a promising direction toward more robust and generalizable models.

MCML Authors

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Shijia Zhou

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Monica Riedler

B1 | Computer Vision
→ Group Almut Sophia Koepke

Computer Vision & Artificial Intelligence

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[66]

S. Si, X. Wang, G. Zhai, N. Navab and B. Plank.
Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior.
Preprint (Mar. 2025). arXiv

Abstract

Recent advancements in large language models (LLMs) have demonstrated that fine-tuning and human alignment can render LLMs harmless. In practice, such ‘harmlessness’ behavior is mainly achieved by training models to reject harmful requests, such as ‘Explain how to burn down my neighbor’s house’, where the model appropriately declines to respond. However, this approach can inadvertently result in false refusal, where models reject benign queries as well, such as ‘Tell me how to kill a Python process’. In this work, we demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. Building on this finding, we introduce the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning incorporating safety reflection. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior while maintaining safety and overall performance compared to those fine-tuned without safety reflection.

MCML Authors

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Guangyao Zhai

C1 | Medicine
→ Group Nassir Navab

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

C1 | Medicine

Computer Aided Medical Procedures & Augmented Reality

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[65]

J. Lan, D. Frassinelli and B. Plank.
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA.
AAAI 2025 - 39th Conference on Artificial Intelligence. Philadelphia, PA, USA, Feb 25-Mar 04, 2025. DOI

Abstract

Large vision-language models struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit high uncertainty. In this study, we focus on a Visual Question Answering (VQA) task and comprehensively evaluate how well the output of the state-of-the-art vision-language model correlates with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ, not only accuracy, but also three new human-correlated metrics for the first time in VQA, to investigate the impact of HUD. We also verify the effect of common calibration and human calibration (Baan et al. 2022) on the alignment of models and humans. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3’s ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, to better align model confidence with human uncertainty. Our findings highlight that for VQA, the alignment between human responses and model predictions is understudied and is an important target for future studies.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[64]

S. Feuerriegel, A. Maarouf, D. Bär, D. Geißler, J. Schweisthal, N. Pröllochs, C. E. Robertson, S. Rathje, J. Hartmann, S. M. Mohammad, O. Netzer, A. A. Siegel, B. Plank and J. J. Van Bavel.
Using natural language processing to analyse text data in behavioural science.
Nature Reviews Psychology 4 (Feb. 2025). DOI

Abstract

Language is a uniquely human trait at the core of human interactions. The language people use often reflects their personality, intentions and state of mind. With the integration of the Internet and social media into everyday life, much of human communication is documented as written text. These online forms of communication (for example, blogs, reviews, social media posts and emails) provide a window into human behaviour and therefore present abundant research opportunities for behavioural science. In this Review, we describe how natural language processing (NLP) can be used to analyse text data in behavioural science. First, we review applications of text data in behavioural science. Second, we describe the NLP pipeline and explain the underlying modelling approaches (for example, dictionary-based approaches and large language models). We discuss the advantages and disadvantages of these methods for behavioural science, in particular with respect to the trade-off between interpretability and accuracy. Finally, we provide actionable recommendations for using NLP to ensure rigour and reproducibility.

MCML Authors

Stefan Feuerriegel

Prof. Dr.

C4 | Computational Social Sciences

Artificial Intelligence in Management

Abdurahman Maarouf

C4 | Computational Social Sciences
→ Group Stefan Feuerriegel

Artificial Intelligence in Management

Dominik Bär

C4 | Computational Social Sciences
→ Group Stefan Feuerriegel

Artificial Intelligence in Management

Jonas Schweisthal

C4 | Computational Social Sciences
→ Group Stefan Feuerriegel

Artificial Intelligence in Management

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[63]

L. Bertolazzi, P. Mondorf, B. Plank and R. Bernardi.
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It.
Preprint (Feb. 2025). arXiv

Abstract

The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models’ internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on consistency heads–attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models’ internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why current LLMs struggle to detect even simple arithmetic errors.

MCML Authors

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[62]

C. Wu, B. Ma, Y. Liu, Z. Zhang, N. Deng, Y. Li, B. Chen, Y. Zhang, Y. Xue and B. Plank.
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis.
Preprint (Feb. 2025). arXiv

Abstract

Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-centric, limiting the scope for multilingual evaluation and research. To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7 domains and 21 languages, making it the most extensive multilingual parallel dataset for ABSA to date. Our primary focus is on triplet extraction, which involves identifying aspect terms, aspect categories, and sentiment polarities. The dataset is constructed through an automatic translation process with human review to ensure quality. We perform extensive experiments using various baselines to assess performance and compatibility on M-ABSA. Our empirical findings highlight that the dataset enables diverse evaluation tasks, such as multilingual and multi-domain transfer learning, and large language model evaluation, underscoring its inclusivity and its potential to drive advancements in multilingual ABSA research.

MCML Authors

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Yihong Liu

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[61]

S. Xu, T. Y. S. S. Santosh, Y. Elazar, Q. Vogel, B. Plank and M. Grabmair.
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases.
Preprint (Feb. 2025). arXiv

Abstract

The increased adoption of Large Language Models (LLMs) and their potential to shape public opinion have sparked interest in assessing these models’ political leanings. Building on previous research that compared LLMs and human opinions and observed political bias in system responses, we take a step further to investigate the underlying causes of such biases by empirically examining how the values and biases embedded in training corpora shape model outputs. Specifically, we propose a method to quantitatively evaluate political leanings embedded in the large pretraining corpora. Subsequently we investigate to whom are the LLMs’ political leanings more aligned with, their pretrainig corpora or the surveyed human opinions. As a case study, we focus on probing the political leanings of LLMs in 32 U.S. Supreme Court cases, addressing contentious topics such as abortion and voting rights. Our findings reveal that LLMs strongly reflect the political leanings in their training data, and no strong correlation is observed with their alignment to human opinions as expressed in surveys. These results underscore the importance of responsible curation of training data and the need for robust evaluation metrics to ensure LLMs’ alignment with human-centered values.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[60]

R. Litschko, O. Kraus, V. Blaschke and B. Plank.
Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While there has been extensive research conducted on cross-lingual information retrieval (CLIR), the field of cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources to train retrieval models and the high variability in non-standardized languages. We study these challenges on the example of German dialects and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with high lexical variation in dialects. We further show that commonly used zero-shot cross-lingual transfer approach with multilingual encoders do not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models. We finally demonstrate that (document) translation is an effective way to reduce the dialect gap in CDIR.

MCML Authors

Robert Litschko

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[59]

A. Muñoz-Ortiz, V. Blaschke and B. Plank.
Evaluating Pixel Language Models on Non-Standardized Languages.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves especially useful for out-of-vocabulary words common in dialectal data. Using German as a case study, we compare the performance of pixel-based models to token-based models across various syntactic and semantic tasks. Our results show that pixel-based models outperform token-based models in part-of-speech tagging, dependency parsing and intent detection for zero-shot dialect evaluation by up to 26 percentage points in some scenarios, though not in Standard German. However, pixel-based models fall short in topic classification. These findings emphasize the potential of pixel-based models for handling dialectal data, though further research should be conducted to assess their effectiveness in various linguistic contexts.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[58]

V. Blaschke, F. Körner and B. Plank.
Add Noise, Tasks, or Layers? MaiNLP at the VarDial 2025 Shared Task on Norwegian Dialectal Slot and Intent Detection.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the The 31st International Conference on Computational Linguistics (COLING 2025). Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

Slot and intent detection (SID) is a classic natural language understanding task. Despite this, research has only more recently begun focusing on SID for dialectal and colloquial varieties. Many approaches for low-resource scenarios have not yet been applied to dialectal SID data, or compared to each other on the same datasets. We participate in the VarDial 2025 shared task on slot and intent detection in Norwegian varieties, and compare multiple set-ups: varying the training data (English, Norwegian, or dialectal Norwegian), injecting character-level noise, training on auxiliary tasks, and applying Layer Swapping, a technique in which layers of models fine-tuned on different datasets are assembled into a model. We find noise injection to be beneficial while the effects of auxiliary tasks are mixed. Though some experimentation was required to successfully assemble a model from layers, it worked surprisingly well; a combination of models trained on English and small amounts of dialectal data produced the most robust slot predictions. Our best models achieve 97.6% intent accuracy and 85.6% slot F1 in the shared task.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Felicia Körner

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[57]

X. Krückl, V. Blaschke and B. Plank.
Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the The 31st International Conference on Computational Linguistics (COLING 2025). Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, they struggle with dialectal data, where no standardized form exists and training data is scarce and costly to produce. We explore zero-shot transfer learning for SID, focusing on multiple Bavarian dialects, for which we release a new dataset for the Munich dialect. We evaluate models trained on auxiliary tasks in Bavarian, and compare joint multi-task learning with intermediate-task training. We also compare three types of auxiliary tasks: token-level syntactic tasks, named entity recognition (NER), and language modelling. We find that the included auxiliary tasks have a more positive effect on slot filling than intent classification (with NER having the most positive effect), and that intermediate-task training yields more consistent performance gains. Our best-performing approach improves intent classification performance on Bavarian dialects by 5.1 and slot filling F1 by 8.4 percentage points.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[56]

A.-M. Lutgen, A. Plum, C. Purschke and B. Plank.
Neural Text Normalization for Luxembourgish Using Real-Life Variation Data.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the The 31st International Conference on Computational Linguistics (COLING 2025). Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[55]

S. Eckman, B. Ma, C. Kern, R. Chew, B. Plank and F. Kreuter.
Correcting Annotator Bias in Training Data: Population-Aligned Instance Replication (PAIR).
Preprint (Jan. 2025). arXiv

Abstract

Models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative. Since collecting representative labels is challenging, we propose Population-Aligned Instance Replication (PAIR), a method to address this bias through statistical adjustment. Using a simulation study of hate speech and offensive language detection, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. Models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. However, PAIR, which duplicates labels from underrepresented annotator groups to match population proportions, significantly reduces bias without requiring new data collection. These results suggest statistical techniques from survey research can help align model training with target populations even when representative annotator pools are unavailable. We conclude with three practical recommendations for improving training data quality.

MCML Authors

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Christoph Kern

Prof. Dr.

C4 | Computational Social Sciences

Social Data Science and AI Lab

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

C4 | Computational Social Sciences

Social Data Science and AI

2024

[54]

Y. Zhang, Y. Li, X. Wang, Q. Shen, B. Plank, B. Bischl, M. Rezaei and K. Kawaguchi.
FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models.
NeurIPS 2024 - Workshop on Machine Learning and Compression at the 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL

Abstract

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters making large compute a necessity, while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning, which in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alternation to the model’s output – contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, we observe intriguing results with FinerCut: 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance – without additional fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, allowing to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often at deeper consecutive decoder layers. We hope our insights inspire future efficient LLM architecture designs.

MCML Authors

Yawei Li

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Mina Rezaei

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

[53]

V. Basile, S. Casola, S. Frenda and S. M. Lo.
PERSEID - Perspectivist Irony Detection: A CALAMITA Challenge.
CLiC-it 2024 - 10th Italian Conference on Computational Linguistics. Pisa, Italy, Dec 04-06, 2024. URL

Abstract

Works in perspectivism and human label variation have emphasized the need to collect and leverage various voices and points of view in the whole Natural Language Processing pipeline. PERSEID places itself in this line of work. We consider the task of irony detection from short social media conversations in Italian collected from Twitter (X) and Reddit. To do so, we leverage data from MultiPICO, a recent multilingual dataset with disaggregated annotations and annotators’ metadata, containing 1000 Post, Reply pairs with five annotations each on average. We aim to evaluate whether prompting LLMs with additional annotators’ demographic information (namely gender only, age only, and the combination of the two) results in improved performance compared to a baseline in which only the input text is provided. The evaluation is zero-shot; and we evaluate the results on the disaggregated annotations using f1.

MCML Authors

Silvia Casola

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

[52]

T. Bourgeade, S. Casola, A. M. Wizani and C. Bosco.
Data Augmentation through Back-Translation for Stereotypes and Irony Detection.
CLiC-it 2024 - 10th Italian Conference on Computational Linguistics. Pisa, Italy, Dec 04-06, 2024. URL

Abstract

Complex linguistic phenomena such as stereotypes or irony are still challenging to detect, particularly due to the lower availability of annotated data. In this paper, we explore Back-Translation (BT) as a data augmentation method to enhance such datasets by artificially introducing semantics-preserving variations. We investigate French and Italian as source languages on two multilingual datasets annotated for the presence of stereotypes or irony and evaluate French/Italian, English, and Arabic as pivot languages for the BT process. We also investigate cross-translation, i.e., augmenting one language subset of a multilingual dataset with translated instances from the other languages. We conduct an intrinsic evaluation of the quality of back-translated instances, identifying linguistic or translation model-specific errors that may occur with BT. We also perform an extrinsic evaluation of different data augmentation configurations to train a multilingual Transformer-based classifier for stereotype or irony detection on mono-lingual data.

MCML Authors

Silvia Casola

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

[51]

S. Frenda, A. Piergentili, B. Savoldi, M. Madeddu, M. Rosola, S. Casola, C. Ferrando, V. Patti, M. Negri and L. Bentivogli.
GFG - Gender-Fair Generation: A CALAMITA Challenge.
CLiC-it 2024 - 10th Italian Conference on Computational Linguistics. Pisa, Italy, Dec 04-06, 2024. URL

Abstract

Gender-fair language aims at promoting gender equality by using terms and expressions that include all identities and avoid reinforcing gender stereotypes. Implementing gender-fair strategies is particularly challenging in heavily gender-marked languages, such as Italian. To address this, the Gender-Fair Generation challenge intends to help shift toward gender-fair language in written communication. The challenge, designed to assess and monitor the recognition and generation of gender-fair language in both mono- and cross-lingual scenarios, includes three tasks: (1) the detection of gendered expressions in Italian sentences, (2) the reformulation of gendered expressions into gender-fair alternatives, and (3) the generation of gender-fair language in automatic translation from English to Italian. The challenge relies on three different annotated datasets: the GFL-it corpus, which contains Italian texts extracted from administrative documents provided by the University of Brescia; GeNTE, a bilingual test set for gender-neutral rewriting and translation built upon a subset of the Europarl dataset; and Neo-GATE, a bilingual test set designed to assess the use of non-binary neomorphemes in Italian for both fair formulation and translation tasks. Finally, each task is evaluated with specific metrics: average of F1-score obtained by means of BERTScore computed on each entry of the datasets for task 1, an accuracy measured with a gender-neutral classifier, and a coverage-weighted accuracy for tasks 2 and 3.

MCML Authors

Silvia Casola

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

[50]

A. Testoni, B. Plank and R. Fernández.
RACQUET: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs.
Preprint (Dec. 2024). arXiv

Abstract

Ambiguity resolution is key to effective communication. While humans effortlessly address ambiguity through conversational grounding strategies, the extent to which current language models can emulate these strategies remains unclear. In this work, we examine referential ambiguity in image-based question answering by introducing RACQUET, a carefully curated dataset targeting distinct aspects of ambiguity. Through a series of evaluations, we reveal significant limitations and problems of overconfidence of state-of-the-art large multimodal language models in addressing ambiguity in their responses. The overconfidence issue becomes particularly relevant for RACQUET-BIAS, a subset designed to analyze a critical yet underexplored problem: failing to address ambiguity leads to stereotypical, socially biased responses. Our results underscore the urgency of equipping models with robust strategies to deal with uncertainty without resorting to undesirable stereotypes.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[49]

Y. J. Liu, T. Aoyama, W. Scivetti, Y. Zhu, S. Behzad, L. E. Levine, J. Lin, D. Tiwari and A. Zeldes.
GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper, we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which can be alleviated by joint training on both datasets.

MCML Authors

Yang Janet Liu

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

[48]

P. Mondorf and B. Plank.
Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character’s identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce TruthQuest, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical statements involved. Evaluations on TruthQuest show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks. A detailed error analysis of the models’ output reveals that lower-performing models exhibit a diverse range of reasoning errors, frequently failing to grasp the concept of truth and lies. In comparison, more proficient models primarily struggle with accurately inferring the logical implications of potentially false statements.

MCML Authors

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[47]

P. F. Balestrucci, S. Casola, S. M. Lo, V. Basile and A. Mazzei.
I’m sure you’re a real scholar yourself: Exploring Ironic Content Generation by Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Generating ironic content is challenging: it requires a nuanced understanding of context and implicit references and balancing seriousness and playfulness. Moreover, irony is highly subjective and can depend on various factors, such as social, cultural, or generational aspects. This paper explores whether Large Language Models (LLMs) can learn to generate ironic responses to social media posts. To do so, we fine-tune two models to generate ironic and non-ironic content and deeply analyze their outputs’ linguistic characteristics, their connection to the original post, and their similarity to the human-written replies. We also conduct a large-scale human evaluation of the outputs. Additionally, we investigate whether LLMs can learn a form of irony tied to a generational perspective, with mixed results.

MCML Authors

Silvia Casola

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

[46]

B. Chen, X. Wang, S. Peng, R. Litschko, A. Korhonen and B. Plank.
'Seeing the Big through the Small': Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations?
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Human label variation (HLV) is a valuable source of information that arises when multiple human annotators provide different labels for valid reasons. In Natural Language Inference (NLI) earlier approaches to capturing HLV involve either collecting annotations from many crowd workers to represent human judgment distribution (HJD) or use expert linguists to provide detailed explanations for their chosen labels. While the former method provides denser HJD information, obtaining it is resource-intensive. In contrast, the latter offers richer textual information but it is challenging to scale up to many human judges. Besides, large language models (LLMs) are increasingly used as evaluators (‘LLM judges’) but with mixed results, and few works aim to study HJDs. This study proposes to exploit LLMs to approximate HJDs using a small number of expert labels and explanations. Our experiments show that a few explanations significantly improve LLMs’ ability to approximate HJDs with and without explicit labels, thereby providing a solution to scale up annotations for HJD. However, fine-tuning smaller soft-label aware models with the LLM-generated model judgment distributions (MJDs) presents partially inconsistent results: while similar in distance, their resulting fine-tuned models and visualized distributions differ substantially. We show the importance of complementing instance-level distance measures with a global-level shape metric and visualization to more effectively evaluate MJDs against human judgment distributions.

MCML Authors

Beiduo Chen

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Robert Litschko

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[45]

B. Ma, X. Wang, T. Hu, A.-C. Haensch, M. A. Hedderich, B. Plank and F. Kreuter.
The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Recent advances in Large Language Models (LLMs) have sparked wide interest in validating and comprehending the human-like cognitive-behavioral traits LLMs may capture and convey. These cognitive-behavioral traits include typically Attitudes, Opinions, Values (AOVs). However, measuring AOVs embedded within LLMs remains opaque, and different evaluation methods may yield different results. This has led to a lack of clarity on how different studies are related to each other and how they can be interpreted. This paper aims to bridge this gap by providing a comprehensive overview of recent works on the evaluation of AOVs in LLMs. Moreover, we survey related approaches in different stages of the evaluation pipeline in these works. By doing so, we address the potential and challenges with respect to understanding the model, human-AI alignment, and downstream application in social sciences. Finally, we provide practical insights into evaluation methods, model enhancement, and interdisciplinary collaboration, thereby contributing to the evolving landscape of evaluating AOVs in LLMs.

MCML Authors

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Anna-Carolina Haensch

Dr.

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Michael Hedderich

Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

C4 | Computational Social Sciences

Social Data Science and AI

[44]

A. Sedova, R. Litschko, D. Frassinelli, B. Roth and B. Plank.
To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs.

MCML Authors

Robert Litschko

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[43]

J. Wang, L. Zuo, S. Peng and B. Plank.
MultiClimate: Multimodal Stance Detection on Climate Change Videos.
NLP4PI @EMNLP 2024 - 3rd Workshop on NLP for Positive Impact at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

Abstract

Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source manually-annotated stance detection dataset with 100 CC-related YouTube videos and 4,209 frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art, 0.747/0.749 in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models.

MCML Authors

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[42]

P. Mondorf and B. Plank.
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models--A Survey.
COLM 2024 - Conference on Language Modeling. Philadelphia, PA, USA, Oct 07-09, 2024. PDF

Abstract

Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs’ reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models’ reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models’ reasoning processes. Furthermore, we survey prevalent methodologies to evaluate the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on sophisticated reasoning abilities. Additionally, we identify the need for further research that delineates the key differences between human and LLM-based reasoning. Through this survey, we aim to shed light on the complex reasoning processes within LLMs.

MCML Authors

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[41]

X. Wang, C. Hu, B. Ma, P. Rottger and B. Plank.
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think.
COLM 2024 - Conference on Language Modeling. Philadelphia, PA, USA, Oct 07-09, 2024. PDF

Abstract

Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.

MCML Authors

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[40]

V. Blaschke, B. Kovačić, S. Peng and B. Plank.
MaiBaam Annotation Guidelines.
Preprint (Oct. 2024). arXiv

Abstract

This document provides the annotation guidelines for MaiBaam, a Bavarian corpus manually annotated with part-of-speech (POS) tags, syntactic dependencies, and German lemmas. MaiBaam belongs to the Universal Dependencies (UD) project, and our annotations elaborate on the general and German UD version 2 guidelines. In this document, we detail how to preprocess and tokenize Bavarian data, provide an overview of the POS tags and dependencies we use, explain annotation decisions that would also apply to closely related languages like German, and lastly we introduce and motivate decisions that are specific to Bavarian grammar.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[39]

Q. Chen, X. Wang, P. Mondorf, M. A. Hedderich and B. Plank.
Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination.
Preprint (Oct. 2024). arXiv

Abstract

Tree of Thoughts (ToT) is a reasoning strategy for Large Language Models (LLMs) that employs a generator to suggest reasoning steps and a discriminator to decide which steps to implement. ToT demonstrates strong performance on reasoning tasks, often surpassing simple methods such as Input-Output (IO) prompting and Chain-of-Thought (CoT) reasoning. However, ToT does not consistently outperform such simpler methods across all models, leaving large knowledge gaps on the conditions under which ToT is most beneficial. In this paper, we analyze the roles of the generator and discriminator separately to better understand the conditions when ToT is beneficial. We find that the generator plays a more critical role than the discriminator in driving the success of ToT. Scaling the generator leads to notable improvements in ToT performance, even when using a smaller model as the discriminator, whereas scaling the discriminator with a fixed generator yields only marginal gains. Our results show that models across different scales exhibit comparable discrimination capabilities, yet differ significantly in their generative performance for ToT.

MCML Authors

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Michael Hedderich

Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[38]

V. Blaschke, C. Purschke, H. Schütze and B. Plank.
What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Natural language processing (NLP) has largely focused on modelling standardized languages. More recently, attention has increasingly shifted to local, non-standardized languages and dialects. However, the relevant speaker populations’ needs and wishes with respect to NLP tools are largely unknown. In this paper, we focus on dialects and regional languages related to German – a group of varieties that is heterogeneous in terms of prestige and standardization. We survey speakers of these varieties (N=327) and present their opinions on hypothetical language technologies for their dialects. Although attitudes vary among subgroups of our respondents, we find that respondents are especially in favour of potential NLP tools that work with dialectal input (especially audio input) such as virtual assistants, and less so for applications that produce dialectal output such as machine translation or spellcheckers.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[37]

P. Mondorf and B. Plank.
Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily assesses the accuracy of LLMs in solving such tasks, often overlooking a deeper analysis of their reasoning behavior. In this study, we draw upon principles from cognitive psychology to examine inferential strategies employed by LLMs, through a detailed evaluation of their responses to propositional logic problems. Our findings indicate that LLMs display reasoning patterns akin to those observed in humans, including strategies like supposition following or chain construction. Moreover, our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning, with more advanced models tending to adopt strategies more frequently than less sophisticated ones. Importantly, we assert that a model’s accuracy, that is the correctness of its final conclusion, does not necessarily reflect the validity of its reasoning process. This distinction underscores the necessity for more nuanced evaluation procedures in the field.

MCML Authors

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[36]

L. Weber-Genzel, S. Peng, M.-C. De Marneffe and B. Plank.
VariErr NLI: Separating Annotation Error from Human Label Variation.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white.To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs.VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

MCML Authors

Leon Weber-Genzel

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

* Former Member

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[35]

S. Xu, S. T.y.s.s, O. Ichim, B. Plank and M. Grabmair.
Through the Lens of Split Vote: Exploring Disagreement, Difficulty and Calibration in Legal Case Outcome Classification.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

In legal decisions, split votes (SV) occur when judges cannot reach a unanimous decision, posing a difficulty for lawyers who must navigate diverse legal arguments and opinions. In high-stakes domains, %as human-AI interaction systems become increasingly important, understanding the alignment of perceived difficulty between humans and AI systems is crucial to build trust. However, existing NLP calibration methods focus on a classifier’s awareness of predictive performance, measured against the human majority class, overlooking inherent human label variation (HLV). This paper explores split votes as naturally observable human disagreement and value pluralism. We collect judges’ vote distributions from the European Court of Human Rights (ECHR), and present SV-ECHR, a case outcome classification (COC) dataset with SV information. We build a taxonomy of disagreement with SV-specific subcategories. We further assess the alignment of perceived difficulty between models and humans, as well as confidence- and human-calibration of COC models. We observe limited alignment with the judge vote distribution. To our knowledge, this is the first systematic exploration of calibration to human judgements in legal NLP. Our study underscores the necessity for further research on measuring and enhancing model calibration considering HLV in legal decision tasks.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[34]

X. Wang, B. Ma, C. Hu, L. Weber-Genzel, P. Röttger, F. Kreuter, D. Hovy and B. Plank.
My Answer is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first-tokens may not consistently reflect the final response output, due to model’s diverse response styles such as starting with ‘Sure’ or refusing to answer. Consequently, first-token evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.

MCML Authors

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Leon Weber-Genzel

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

* Former Member

Frauke Kreuter

Prof. Dr.

C4 | Computational Social Sciences

Social Data Science and AI

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[33]

S. Zhou, S. Peng and B. Plank.
CLIMATELI: Evaluating Entity Linking on Climate Change Data.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Climate Change (CC) is a pressing topic of global importance, attracting increasing attention across research fields, from social sciences to Natural Language Processing (NLP). CC is also discussed in various settings and communication platforms, from academic publications to social media forums. Understanding who and what is mentioned in such data is a first critical step to gaining new insights into CC. We present CLIMATELI (CLIMATe Entity LInking), the first manually annotated CC dataset that links 3,087 entity spans to Wikipedia. Using CLIMATELI (CLIMATe Entity LInking), we evaluate existing entity linking (EL) systems on the CC topic across various genres and propose automated filtering methods for CC entities. We find that the performance of EL models notably lags behind humans at both token and entity levels. Testing within the scope of retaining or excluding non-nominal and/or non-CC entities particularly impacts the models’ performances.

MCML Authors

Shijia Zhou

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[32]

S. Eckman, B. Plank and F. Kreuter.
Position: Insights from Survey Methodology can Improve Training Data.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL

Abstract

Whether future AI models are fair, trustworthy, and aligned with the public’s interests rests in part on our ability to collect accurate data about what we want the models to do. However, collecting high-quality data is difficult, and few AI/ML researchers are trained in data collection methods. Recent research in data-centric AI has show that higher quality training data leads to better performing models, making this the right moment to introduce AI/ML researchers to the field of survey methodology, the science of data collection. We summarize insights from the survey methodology literature and discuss how they can improve the quality of training and feedback data. We also suggest collaborative research ideas into how biases in data collection can be mitigated, making models more accurate and human-centric.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

C4 | Computational Social Sciences

Social Data Science and AI

[31]

S. Zhou, H. Shan, B. Plank and R. Litschko.
MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness.
SemEval @NAACL 2024 - 18th International Workshop on Semantic Evaluation at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024. URL

Abstract

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness (STR), on Track C: Cross-lingual. The task aims to detect semantic relatedness of two sentences from the same languages. For cross-lingual approach we developed a set of linguistics-inspired models trained with several task-specific strategies. We 1) utilize language vectors for selection of donor languages; 2) investigate the multi-source approach for training; 3) use transliteration of non-latin script to study impact of ‘script gap’; 4) opt machine translation for data augmentation. We additionally compare the performance of XLM-RoBERTa and Furina with the same training strategy. Our submission achieved the first place in the C8 (Kinyarwanda) test.

MCML Authors

Shijia Zhou

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Robert Litschko

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

[30]

V. Blaschke, B. Kovačić, S. Peng, H. Schütze and B. Plank.
MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evalutaion. Torino, Italy, May 20-25, 2024. URL

Abstract

Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack in `within-language breadth’: most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers’ orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[29]

C. Müller and B. Plank.
IndirectQA: Understanding Indirect Answers to Implicit Polar Questions in French and Spanish.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evalutaion. Torino, Italy, May 20-25, 2024. URL

Abstract

Polar questions are common in dialogue and expect exactly one of two answers (yes/no). It is however not uncommon for speakers to bypass these expected choices and answer, for example, ‘Islands are generally by the sea’ to the question: ‘An island? By the sea?’. While such answers are natural in spoken dialogues, conversational systems still struggle to interpret them. Seminal work to interpret indirect answers were made in recent years—but only for English and with strict question formulations. In this work, we present a new corpus for French and Spanish—IndirectQA —where we mine subtitle data for indirect answers to study the labeling task with six different labels, while broadening polar questions to include also implicit polar questions (statements that trigger a yes/no-answer which are not necessarily formulated as a question). We opted for subtitles since they are a readily available source of conversation in various languages, but also come with peculiarities and challenges which we will discuss. Overall, we provide the first results on French and Spanish. They show that the task is challenging: the baseline accuracy scores drop from 61.43 on English to 44.06 for French and Spanish.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[28]

S. Peng, Z. Sun, H. Shan, M. Kolm, V. Blaschke, E. Artemova and B. Plank.
Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evalutaion. Torino, Italy, May 20-25, 2024. URL

Abstract

Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

MCML Authors

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[27]

M. Winkler, V. Juozapaityte, R. van der Goot and B. Plank.
Slot and Intent Detection Resources for Bavarian and Lithuanian: Assessing Translations vs Natural Queries to Digital Assistants.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evalutaion. Torino, Italy, May 20-25, 2024. URL

Abstract

Digital assistants perform well in high-resource languages like English, where tasks like slot and intent detection (SID) are well-supported. Many recent SID datasets start including multiple language varieties. However, it is unclear how realistic these translated datasets are. Therefore, we extend one such dataset, namely xSID-0.4, to include two underrepresented languages: Bavarian, a German dialect, and Lithuanian, a Baltic language. Both language variants have limited speaker populations and are often not included in multilingual projects. In addition to translations we provide “natural” queries to digital assistants generated by native speakers. We further include utterances from another dataset for Bavarian to build the richest SID dataset available today for a low-resource dialect without standard orthography. We then set out to evaluate models trained on English in a zero-shot scenario on our target language variants. Our evaluation reveals that translated data can produce overly optimistic scores. However, the error patterns in translated and natural datasets are highly similar. Cross-dataset experiments demonstrate that data collection methods influence performance, with scores lower than those achieved with single-dataset translations. This work contributes to enhancing SID datasets for underrepresented languages, yielding NaLiBaSID, a new evaluation dataset for Bavarian and Lithuanian.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[26]

S. Zhou, L. Weissweiler, T. He, H. Schütze, D. R. Mortensen and L. Levin.
Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evalutaion. Torino, Italy, May 20-25, 2024. URL

Abstract

In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions with three classes of adjectives which cannot be distinguished by surface features. This enables us to probe for LLM’s understanding of these constructions in various ways, and we find that they fail in a variety of ways to distinguish between them, suggesting that they don’t adequately represent their meaning or capture the lexical properties of phrasal heads.

MCML Authors

Shijia Zhou

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Leonie Weissweiler

B2 | Natural Language Processing
→ Group Hinrich Schütze

* Former Member

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

[25]

C. Gruber, K. Hechinger, M. Aßenmacher, G. Kauermann and B. Plank.
More Labels or Cases? Assessing Label Variation in Natural Language Inference.
UnImplicit 2024 - 3rd Workshop on Understanding Implicit and Underspecified Language. Malta, Mar 21, 2024. URL

Abstract

In this work, we analyze the uncertainty that is inherently present in the labels used for supervised machine learning in natural language inference (NLI). In cases where multiple annotations per instance are available, neither the majority vote nor the frequency of individual class votes is a trustworthy representation of the labeling uncertainty. We propose modeling the votes via a Bayesian mixture model to recover the data-generating process, i.e., the “true” latent classes, and thus gain insight into the class variations. This will enable a better understanding of the confusion happening during the annotation process. We also assess the stability of the proposed estimation procedure by systematically varying the numbers of i) instances and ii) labels. Thereby, we observe that few instances with many labels can predict the latent class borders reasonably well, while the estimation fails for many instances with only a few labels. This leads us to conclude that multiple labels are a crucial building block for properly analyzing label uncertainty.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[24]

S. Peng, Z. Sun, S. Loftus and B. Plank.
Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations.
UnImplicit 2024 - 3rd Workshop on Understanding Implicit and Underspecified Language. Malta, Mar 21, 2024. URL

Abstract

Named Entity Recognition (NER) is a key information extraction task with a long-standing tradition. While recent studies address and aim to correct annotation errors via re-labeling efforts, little is known about the sources of human label variation, such as text ambiguity, annotation error, or guideline divergence. This is especially the case for high-quality datasets and beyond English CoNLL03. This paper studies disagreements in expert-annotated named entity datasets for three languages: English, Danish, and Bavarian. We show that text ambiguity and artificial guideline changes are dominant factors for diverse annotations among high-quality revisions. We survey student annotations on a subset of difficult entities and substantiate the feasibility and necessity of manifold annotations for understanding named entity ambiguities from a distributional perspective.

MCML Authors

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[23]

E. Artemova, V. Blaschke and B. Plank.
Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages.We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data.Inspired by prior work on English varieties, we craft and manually evaluate perturbation rules that transform German sentences into colloquial forms and use them to synthesize test sets in four ToD datasets.Our perturbation rules cover 18 distinct language phenomena, enabling us to explore the impact of each perturbation on slot and intent performance.Using these new datasets, we conduct an experimental evaluation across six different transformers.Here, we demonstrate that when applied to colloquial varieties, ToD systems maintain their intent recognition performance, losing 6% (4.62 percentage points) in accuracy on average. However, they exhibit a significant drop in slot detection, with a decrease of 31% (21 percentage points) in slot F1 score.Our findings are further supported by a transfer experiment from Standard American English to synthetic Urban African American Vernacular English.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[22]

J. Baan, R. Fernández, B. Plank and W. Aziz.
Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

With the rise of increasingly powerful and user-facing NLP systems, there is growing interest in assessing whether they have a good representation of uncertainty by evaluating the quality of their predictive distribution over outcomes. We identify two main perspectives that drive starkly different evaluation protocols. The first treats predictive probability as an indication of model confidence; the second as an indication of human label variation. We discuss their merits and limitations, and take the position that both are crucial for trustworthy and fair NLP systems, but that exploiting a single predictive distribution is limiting. We recommend tools and highlight exciting directions towards models with disentangled representations of uncertainty about predictions and uncertainty about human labels.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[21]

M. Zhang, R. van der Goot, M.-Y. Kan and B. Plank.
NNOSE: Nearest Neighbor Occupational Skill Extraction.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

The labor market is changing rapidly, prompting increased interest in the automatic extraction of occupational skills from text. With the advent of English benchmark job description datasets, there is a need for systems that handle their diversity well. We tackle the complexity in occupational skill datasets tasks—combining and leveraging multiple datasets for skill extraction, to identify rarely observed skills within a dataset, and overcoming the scarcity of skills across datasets. In particular, we investigate the retrieval-augmentation of language models, employing an external datastore for retrieving similar skills in a dataset-unifying manner. Our proposed method, Nearest Neighbor Occupational Skill Extraction (NNOSE) effectively leverages multiple datasets by retrieving neighboring skills from other datasets in the datastore. This improves skill extraction without additional fine-tuning. Crucially, we observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[20]

M. Zhang, R. van der Goot and B. Plank.
Entity Linking in the Job Market Domain.
EACL 2024 - Findings of the 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

In Natural Language Processing, entity linking (EL) has centered around Wikipedia, but yet remains underexplored for the job market domain. Disambiguating skill mentions can help us get insight into the current labor market demands. In this work, we are the first to explore EL in this domain, specifically targeting the linkage of occupational skills to the ESCO taxonomy (le Vrang et al., 2014). Previous efforts linked coarse-grained (full) sentences to a corresponding ESCO skill. In this work, we link more fine-grained span-level mentions of skills. We tune two high-performing neural EL models, a bi-encoder (Wu et al., 2020) and an autoregressive model (Cao et al., 2021), on a synthetically generated mention–skill pair dataset and evaluate them on a human-annotated skill-linking benchmark. Our findings reveal that both models are capable of linking implicit mentions of skills to their correct taxonomy counterparts. Empirically, BLINK outperforms GENRE in strict evaluation, but GENRE performs better in loose evaluation (accuracy@k).

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[19]

A. Sorensen, S. Peng, B. Plank and R. Goot.
EEVEE: An Easy Annotation Tool for Natural Language Processing.
LAW @EACL 2024 - 18th Linguistic Annotation Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024). St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

Annotation tools are the starting point for creating Natural Language Processing (NLP) datasets. There is a wide variety of tools available; setting up these tools is however a hindrance. We propose EEVEE, an annotation tool focused on simplicity, efficiency, and ease of use. It can run directly in the browser (no setup required) and uses tab-separated files (as opposed to character offsets or task-specific formats) for annotation. It allows for annotation of multiple tasks on a single dataset and supports four task-types: sequence labeling, span labeling, text classification and seq2seq.

MCML Authors

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[18]

L. Weber-Genzel, R. Litschko, E. Artemova and B. Plank.
Donkii: Characterizing and Detecting Errors in Instruction-Tuning Datasets.
LAW @EACL 2024 - 18th Linguistic Annotation Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024). St. Julians, Malta, Mar 17-22, 2024. URL

Abstract

Instruction tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality problems in gold standard labels. So far, however, the application of AED methods has been limited to classification tasks. It is an open question how well AED methods generalize to language generation settings, which are becoming more widespread via LLMs. In this paper, we present a first and novel benchmark for AED on instruction tuning data: Donkii.It comprises three instruction-tuning datasets enriched with error annotations by experts and semi-automatic methods. We also provide a novel taxonomy of error types for instruction-tuning data.We find that all three datasets contain clear errors, which sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them extensively on the newly introduced dataset. Our results show that the choice of the right AED method and model size is indeed crucial and derive practical recommendations for how to use AED methods to clean instruction-tuning data.

MCML Authors

Leon Weber-Genzel

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

* Former Member

Robert Litschko

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

2023

[17]

S. Zhang, P. Wicke, L. K. Senel, L. Figueredo, A. Naceri, S. Haddadin, B. Plank and H. Schütze.
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation.
NeurIPS 2023 - 6th Robot Learning Workshop: Pretraining, Fine-Tuning, and Generalization with Large Scale Models at the 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Dec 10-16, 2023. URL

Abstract

The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following.Particularly, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations.However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletopmanipulation task and releases a simulation benchmark,textit{LoHoRavens}, which covers various long-horizonreasoning aspects spanning color, size, space, arithmeticsand reference.Furthermore, there is a key modality bridging problem forlong-horizon manipulation tasks with LLMs: how toincorporate the observation feedback during robot executionfor the LLM’s closed-loop planning, which is however less studied by prior work. We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM, respectively.These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve most tasks, indicating long-horizon manipulation tasks are still challenging for current popular models.We expect the proposed public benchmark and baselines can help the community develop better models for long-horizon tabletop manipulation tasks.

MCML Authors

Shengqiang Zhang

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Philipp Wicke

Dr.

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Lütfi Kerem Senel

Dr.

B2 | Natural Language Processing
→ Group Hinrich Schütze

* Former Member

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

[16]

M. Giulianelli, J. Baan, W. Aziz, R. Fernández and B. Plank.
What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

In Natural Language Generation (NLG) tasks, for any input, multiple communicative goals are plausible, and any goal can be put into words, or produced, in multiple ways. We characterise the extent to which human production varies lexically, syntactically, and semantically across four NLG tasks, connecting human production variability to aleatoric or data uncertainty. We then inspect the space of output strings shaped by a generation system’s predicted probability distribution and decoding algorithm to probe its uncertainty. For each test input, we measure the generator’s calibration to human production variability. Following this instance-level approach, we analyse NLG models and decoding strategies, demonstrating that probing a generator with multiple samples and, when possible, multiple references, provides the level of detail necessary to gain understanding of a model’s representation of uncertainty.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[15]

R. Litschko, M. Müller-Eberstein, R. van der Goot, L. Weber-Genzel and B. Plank.
Establishing Trustworthiness: Rethinking Tasks and Model Evaluation.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has witnessed a dramatic shift towards general purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, followed by an increasing challenge for evaluation and analysis. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in NLP, and pursue a more holistic view on language, placing trustworthiness at the center. Towards this goal, we review existing compartmentalized approaches for understanding the origins of a model’s functional capacity, and provide recommendations for more multi-faceted evaluation protocols.

MCML Authors

Robert Litschko

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Leon Weber-Genzel

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

* Former Member

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[14]

X. Wang and B. Plank.
ACTOR: Active Learning with Annotator-specific Classification Heads to Embrace Human Label Variation.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Label aggregation such as majority voting is commonly used to resolve annotator disagreement in dataset creation. However, this may disregard minority values and opinions. Recent studies indicate that learning from individual annotations outperforms learning from aggregated labels, though they require a considerable amount of annotation. Active learning, as an annotation cost-saving strategy, has not been fully explored in the context of learning from disagreement. We show that in the active learning setting, a multi-head model performs significantly better than a single-head model in terms of uncertainty estimation. By designing and evaluating acquisition functions with annotator-specific heads on two datasets, we show that group-level entropy works generally well on both datasets. Importantly, it achieves performance in terms of both prediction and uncertainty estimation comparable to full-scale training from disagreement, while saving 70% of the annotation budget.

MCML Authors

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[13]

S. Xu, S. T.y.s.s, O. Ichim, I. Risini, B. Plank and M. Grabmair.
From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well-known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset RaVE: Rationale Variation in ECHR, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in the legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainablility of state-of-the-art COC models on RaVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case’s facts supposedly relevant for its outcome.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[12]

M. Müller-Eberstein, R. van der Goot, B. Plank and I. Titov.
Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Representational spaces learned via language modeling are fundamental to Natural Language Processing (NLP), however there has been limited understanding regarding how and when during training various types of linguistic information emerge and interact. Leveraging a novel information theoretic probing suite, which enables direct comparisons of not just task performance, but their representational subspaces, we analyze nine tasks covering syntax, semantics and reasoning, across 2M pre-training steps and five seeds. We identify critical learning phases across tasks and time, during which subspaces emerge, share information, and later disentangle to specialize. Across these phases, syntactic knowledge is acquired rapidly after 0.5% of full training. Continued performance improvements primarily stem from the acquisition of open-domain knowledge, while semantics and reasoning tasks benefit from later boosts to long-range contextualization and higher specialization. Measuring cross-task similarity further reveals that linguistically related tasks share information throughout training, and do so more during the critical phase of learning than before or after. Our findings have implications for model interpretability, multi-task learning, and learning from limited data.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[11]

L. Weber and B. Plank.
ActiveAED: A Human in the Loop Improves Annotation Error Detection.
ACL 2023 - Findings of the 61th Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Manually annotated datasets are crucial for training and evaluating Natural Language Processing models. However, recent work has discovered that even widely-used benchmark datasets contain a substantial number of erroneous annotations. This problem has been addressed with Annotation Error Detection (AED) models, which can flag such errors for human re-annotation. However, even though many of these AED methods assume a final curation step in which a human annotator decides whether the annotation is erroneous, they have been developed as static models without any human-in-the-loop component. In this work, we propose ActiveAED, an AED method that can detect errors more accurately by repeatedly querying a human for error corrections in its prediction loop. We evaluate ActiveAED on eight datasets spanning five different tasks and find that it leads to improvements over the state of the art on seven of them, with gains of up to six percentage points in average precision.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[10]

J. Baan, N. Daheim, E. Ilia, D. Ulmer, H.-S. Li, R. Fernández, B. Plank, R. Sennrich, C. Zerva and W. Aziz.
Uncertainty in Natural Language Generation: From Theory to Applications.
Preprint (Jul. 2023). arXiv

Abstract

Recent advances of powerful Language Models have allowed Natural Language Generation (NLG) to emerge as an important technology that can not only perform traditional tasks like summarisation or translation, but also serve as a natural language interface to a variety of applications. As such, it is crucial that NLG systems are trustworthy and reliable, for example by indicating when they are likely to be wrong; and supporting multiple views, backgrounds and writing styles – reflecting diverse human sub-populations. In this paper, we argue that a principled treatment of uncertainty can assist in creating systems and evaluation protocols better aligned with these goals. We first present the fundamental theory, frameworks and vocabulary required to represent uncertainty. We then characterise the main sources of uncertainty in NLG from a linguistic perspective, and propose a two-dimensional taxonomy that is more informative and faithful than the popular aleatoric/epistemic dichotomy. Finally, we move from theory to applications and highlight exciting research directions that exploit uncertainty to power decoding, controllable generation, self-assessment, selective answering, active learning and more.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[9]

V. Blaschke, H. Schütze and B. Plank.
A Survey of Corpora for Germanic Low-Resource Languages and Dialects.
NoDaLiDa 2023 - 24th Nordic Conference on Computational Linguistics. Tórshavn, Faroe Islands, May 22-24, 2023. URL

Abstract

Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[8]

X. Wang, L. Weissweiler, H. Schütze and B. Plank.
How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives.
EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia, May 02-06, 2023. DOI

Abstract

Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on the performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves consistently under different initialisation settings. We release our code as an efficient transformer-based model distillation framework for further studies.

MCML Authors

Xinpeng Wang

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Leonie Weissweiler

B2 | Natural Language Processing
→ Group Hinrich Schütze

* Former Member

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[7]

V. Blaschke, H. Schütze and B. Plank.
Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages.
VarDial @EACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects at the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). Dubrovnik, Croatia, May 02-06, 2023. DOI

Abstract

One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for instance be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures for the divergence in the tokenization of the source and target data, and the way they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor for model performance on target data.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

2022

[6]

J. Baan, W. Aziz, B. Plank and R. Fernandez.
Stop Measuring Calibration When Humans Disagree.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - including class frequency, ranking and entropy.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[5]

E. Bassignana, M. Müller-Eberstein, M. Zhang and B. Plank.
Evidence > Intuition: Transferability Estimation for Encoder Selection.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

With the increase in availability of large pre-trained language models (LMs) in Natural Language Processing (NLP), it becomes critical to assess their fit for a specific target task a priori—as fine-tuning the entire space of available LMs is computationally prohibitive and unsustainable. However, encoder transferability estimation has received little to no attention in NLP. In this paper, we propose to generate quantitative evidence to predict which LM, out of a pool of models, will perform best on a target task without having to fine-tune all candidates. We provide a comprehensive study on LM ranking for 10 NLP tasks spanning the two fundamental problem types of classification and structured prediction. We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94% of the setups.In the first study of its kind, we further compare transferability measures with the de facto standard of human practitioner ranking, finding that evidence from quantitative metrics is more robust than pure intuition and can help identify unexpected LM candidates.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[4]

M. Müller-Eberstein, R. van der Goot and B. Plank.
Spectral Probing.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Linguistic information is encoded at varying timescales (subwords, phrases, etc.) and communicative levels, such as syntax and semantics. Contextualized embeddings have analogously been found to capture these phenomena at distinctive layers and frequencies. Leveraging these findings, we develop a fully learnable frequency filter to identify spectral profiles for any given task. It enables vastly more granular analyses than prior handcrafted filters, and improves on efficiency. After demonstrating the informativeness of spectral probing over manual filters in a monolingual setting, we investigate its multilingual characteristics across seven diverse NLP tasks in six languages. Our analyses identify distinctive spectral profiles which quantify cross-task similarity in a linguistically intuitive manner, while remaining consistent across languages—highlighting their potential as robust, lightweight task descriptors.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[3]

B. Plank.
The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Human variation in labeling is often considered noise. Annotation projects for machine learning (ML) aim at minimizing human label variation, with the assumption to maximize data quality and in turn optimize and maximize machine learning metrics. However, thisconventional practice assumes that there exists a ground truth, and neglects that there exists genuine human variation in labeling due to disagreement, subjectivity in annotation or multiple plausible answers. In this position paper, we argue that this big open problem of human label variation persists and critically needs more attention to move our field forward. This is because human label variation impacts all stages of the ML pipeline: data, modeling and evaluation. However, few works consider all of these dimensions jointly; and existing research is fragmented. We reconcile different previously proposed notions of human label variation, provide a repository of publicly-available datasets with un-aggregated labels, depict approaches proposed so far, identify gaps and suggest ways forward. As datasets are becoming increasingly available, we hope that this synthesized view on the ‘problem’ will lead to an open discussion on possible strategies to devise fundamentally new directions.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[2]

E. Bassignana and B. Plank.
CrossRE: A Cross-Domain Dataset for Relation Extraction.
EMNLP 2022 - Findings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Relation Extraction (RE) has attracted increasing attention, but current RE evaluation is limited to in-domain evaluation setups. Little is known on how well a RE system fares in challenging, but realistic out-of-distribution evaluation setups. To address this gap, we propose CrossRE, a new, freely-available cross-domain benchmark for RE, which comprises six distinct text domains and includes multi-label annotations. An additional innovation is that we release meta-data collected during annotation, to include explanations and flags of difficult instances. We provide an empirical evaluation with a state-of-the-art model for relation classification. As the meta-data enables us to shed new light on the state-of-the-art model, we provide a comprehensive analysis on the impact of difficult cases and find correlations between model and human annotations. Overall, our empirical investigation highlights the difficulty of cross-domain RE. We release our dataset, to spur more research in this direction.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

[1]

D. Ulmer, E. Bassignana, M. Müller-Eberstein, D. Varab, M. Zhang, R. van der Goot, C. Hardmeier and B. Plank.
Experimental Standards for Deep Learning in Natural Language Processing Research.
EMNLP 2022 - Findings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

The field of Deep Learning (DL) has undergone explosive growth during the last decade, with a substantial impact on Natural Language Processing (NLP) as well. Yet, compared to more established disciplines, a lack of common experimental standards remains an open challenge to the field at large. Starting from fundamental scientific principles, we distill ongoing discussions on experimental standards in NLP into a single, widely-applicable methodology. Following these best practices is crucial to strengthen experimental evidence, improve reproducibility and enable scientific progress. These standards are further collected in a public repository to help them transparently adapt to future needs.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Research Group Barbara Plank

Barbara Plank

Team members @MCML

PostDocs

PhD Students

Recent News @MCML

Barbara Plank Awarded 2025 Imminent Research Grant for Work on Language Data

MCML Researchers With 36 Papers at ACL 2025

Call for Papers for NLPOR - First Workshop on Bridging NLP and Public Opinion Research

MCML Researchers With Twelve Papers at NAACL 2025

MCML Researchers With 52 Papers at ICLR 2025

Publications @MCML

2025

2024

2023

2022