05.08.2024

MCML Researchers With 15 Papers at ACL 2024

62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, 11.08.2024–16.08.2024

We are happy to announce that MCML researchers are represented with 15 papers at ACL 2024. Congrats to our researchers!

Main Track (9 papers)

V. Blaschke, C. Purschke, H. Schütze and B. Plank.
What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Natural language processing (NLP) has largely focused on modelling standardized languages. More recently, attention has increasingly shifted to local, non-standardized languages and dialects. However, the relevant speaker populations’ needs and wishes with respect to NLP tools are largely unknown. In this paper, we focus on dialects and regional languages related to German – a group of varieties that is heterogeneous in terms of prestige and standardization. We survey speakers of these varieties (N=327) and present their opinions on hypothetical language technologies for their dialects. Although attitudes vary among subgroups of our respondents, we find that respondents are especially in favour of potential NLP tools that work with dialectal input (especially audio input) such as virtual assistants, and less so for applications that produce dialectal output such as machine translation or spellcheckers.

MCML Authors

Verena Blaschke

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

A. H. Kargaran, F. Yvon and H. Schütze.
MaskLID: Code-Switching Language Identification through Iterative Masking.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI GitHub

Abstract

We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a sentence is composed in both L1 and L2 languages, the LID classifier often only returns the dominant label L1. To address this limitation, MaskLID employs a strategy to mask text features associated with L1, allowing the LID to classify the text as L2 in the next round. This method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID for two open-source LIDs (GlotLID and OpenLID), that are both based on the FastText architecture.

MCML Authors

Amir Hossein Kargaran

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

Y. Liu, C. Ma, H. Ye and H. Schütze.
TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

The world’s more than 7000 languages are written in at least 293 scripts. Due to various reasons, many closely related languages use different scripts, which poses a difficulty for multilingual pretrained language models (mPLMs) in learning crosslingual knowledge through lexical overlap. As a consequence, mPLMs are faced with a script barrier: representations from different scripts are located in different subspaces, which can result in crosslingual transfer involving languages of different scripts performing suboptimally. To address this problem, we propose TransliCo, a framework that optimizes the Transliteration Contrastive Modeling (TCM) objective to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script (in our case Latin), which enhances uniformity in the representation space for different scripts. Using Glot500-m, an mPLM pretrained on over 500 languages, as our source model, we fine-tune it on a small portion (5%) of its training data, and refer to the resulting model as Furina. We show that Furina not only better aligns representations from distinct scripts but also outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks. Additionally, we achieve consistent improvement in a case study on the Indic group where the languages exhibit areal features but use different scripts. We make our code and models publicly available.

MCML Authors

Yihong Liu

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Chunlan Ma

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Haotian Ye

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

T. Liu, I. Škrjanec and V. Demberg.
Temperature-scaling surprisal estimates improve fit to human reading times – but does it do so for the 'right reasons'?
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

A wide body of evidence shows that human language processing difficulty is predicted by the information-theoretic measure surprisal, a word’s negative log probability in context. However, it is still unclear how to best estimate these probabilities needed for predicting human processing difficulty – while a long-standing belief held that models with lower perplexity would provide more accurate estimates of word predictability, and therefore lead to better reading time predictions, recent work has shown that for very large models, psycholinguistic predictive power decreases. One reason could be that language models might be more confident of their predictions than humans, because they have had exposure to several magnitudes more data. In this paper, we test what effect temperature-scaling of large language model (LLM) predictions has on surprisal estimates and their predictive power of reading times of English texts. Firstly, we show that calibration of large language models typically improves with model size, i.e. poorer calibration cannot account for poorer fit to reading times. Secondly, we find that temperature-scaling probabilities lead to a systematically better fit to reading times (up to 89% improvement in delta log likelihood), across several reading time corpora. Finally, we show that this improvement in fit is chiefly driven by words that are composed of multiple subword tokens.

MCML Authors

Tong Liu

A3 | Computational Models
→ Group Volker Tresp

Database Systems and Data Mining

P. Mondorf and B. Plank.
Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily assesses the accuracy of LLMs in solving such tasks, often overlooking a deeper analysis of their reasoning behavior. In this study, we draw upon principles from cognitive psychology to examine inferential strategies employed by LLMs, through a detailed evaluation of their responses to propositional logic problems. Our findings indicate that LLMs display reasoning patterns akin to those observed in humans, including strategies like supposition following or chain construction. Moreover, our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning, with more advanced models tending to adopt strategies more frequently than less sophisticated ones. Importantly, we assert that a model’s accuracy, that is the correctness of its final conclusion, does not necessarily reflect the validity of its reasoning process. This distinction underscores the necessity for more nuanced evaluation procedures in the field.

MCML Authors

Philipp Mondorf

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

L. K. Senel, B. Fetahu, D. Yoshida, Z. Chen, G. Castellucci, N. Vedula, J. I. Choi and S. Malmasi.
Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Recommender systems are widely used to suggest engaging content, and Large Language Models (LLMs) have given rise to generative recommenders. Such systems can directly generate items, including for open-set tasks like question suggestion. While the world knowledge of LLMs enable good recommendations, improving the generated content through user feedback is challenging as continuously fine-tuning LLMs is prohibitively expensive. We present a training-free approach for optimizing generative recommenders by connecting user feedback loops to LLM-based optimizers. We propose a generative explore-exploit method that can not only exploit generated items with known high engagement, but also actively explore and discover hidden population preferences to improve recommendation quality. We evaluate our approach on question generation in two domains (e-commerce and general knowledge), and model user feedback with Click Through Rate (CTR). Experiments show our LLM-based explore-exploit approach can iteratively improve recommendations, and consistently increase CTR. Ablation analysis shows that generative exploration is key to learning user preferences, avoiding the pitfalls of greedy exploit-only approaches. A human evaluation strongly supports our quantitative findings.

MCML Authors

Lütfi Kerem Senel

B2 | Natural Language Processing
→ Group Hinrich Schütze

* Former Member

C. Tomani, D. Vilar, M. Freitag, C. Cherry, S. Naskar, M. Finkelstein, X. Garcia and D. Cremers.
Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Maximum-a-posteriori (MAP) decoding is the most widely used decoding strategy for neural machine translation (NMT) models. The underlying assumption is that model probability correlates well with human judgment, with better translations getting assigned a higher score by the model. However, research has shown that this assumption does not always hold, and generation quality can be improved by decoding to optimize a utility function backed by a metric or quality-estimation signal, as is done by Minimum Bayes Risk (MBR) or Quality-Aware decoding. The main disadvantage of these approaches is that they require an additional model to calculate the utility function during decoding, significantly increasing the computational cost. In this paper, we propose to make the NMT models themselves quality-aware by training them to estimate the quality of their own output. Using this approach for MBR decoding we can drastically reduce the size of the candidate list, resulting in a speed-up of two-orders of magnitude. When applying our method to MAP decoding we obtain quality gains similar or even superior to quality reranking approaches, but with the efficiency of single pass decoding.

MCML Authors

Christian Tomani

B1 | Computer Vision
→ Group Daniel Cremers

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

B1 | Computer Vision

Computer Vision & Artificial Intelligence

L. Weber-Genzel, S. Peng, M.-C. De Marneffe and B. Plank.
VariErr NLI: Separating Annotation Error from Human Label Variation.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white.To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs.VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

MCML Authors

Leon Weber-Genzel

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

* Former Member

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

S. Xu, S. T.y.s.s, O. Ichim, B. Plank and M. Grabmair.
Through the Lens of Split Vote: Exploring Disagreement, Difficulty and Calibration in Legal Case Outcome Classification.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

In legal decisions, split votes (SV) occur when judges cannot reach a unanimous decision, posing a difficulty for lawyers who must navigate diverse legal arguments and opinions. In high-stakes domains, %as human-AI interaction systems become increasingly important, understanding the alignment of perceived difficulty between humans and AI systems is crucial to build trust. However, existing NLP calibration methods focus on a classifier’s awareness of predictive performance, measured against the human majority class, overlooking inherent human label variation (HLV). This paper explores split votes as naturally observable human disagreement and value pluralism. We collect judges’ vote distributions from the European Court of Human Rights (ECHR), and present SV-ECHR, a case outcome classification (COC) dataset with SV information. We build a taxonomy of disagreement with SV-specific subcategories. We further assess the alignment of perceived difficulty between models and humans, as well as confidence- and human-calibration of COC models. We observe limited alignment with the judge vote distribution. To our knowledge, this is the first systematic exploration of calibration to human judgements in legal NLP. Our study underscores the necessity for further research on measuring and enhancing model calibration considering HLV in legal decision tasks.

MCML Authors

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

Workshops (6 papers)

M. Aßenmacher, A. Stephan, L. Weissweiler, E. Çano, I. Ziegler, M. Härttrich, B. Bischl, B. Roth, C. Heumann and H. Schütze.
Collaborative Development of Modular Open Source Educational Resources for Natural Language Processing.
TeachingNLP @ACL 2024 - 6th Workshop on Teaching NLP at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. URL

Abstract

In this work, we present a collaboratively and continuously developed open-source educational resource (OSER) for teaching natural language processing at two different universities. We shed light on the principles we followed for the initial design of the course and the rationale for ongoing developments, followed by a reflection on the inter-university collaboration for designing and maintaining teaching material. When reflecting on the latter, we explicitly emphasize the considerations that need to be made when facing heterogeneous groups and when having to accommodate multiple examination regulations within one single course framework. Relying on the fundamental principles of OSER developments as defined by Bothmann et al. (2023) proved to be an important guideline during this process. The final part pertains to open-sourcing our teaching material, coping with the increasing speed of developments in the field, and integrating the course digitally, also addressing conflicting priorities and challenges we are currently facing.

MCML Authors

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Leonie Weissweiler

Dr.

B2 | Natural Language Processing
→ Group Hinrich Schütze

* Former Member

Bernd Bischl

Prof. Dr.

A1 | Statistical Foundations & Explainability

Statistical Learning and Data Science

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

A. Dimmelmeier, H. Doll, M. Schierholz, E. Kormanyos, M. Fehr, B. Ma, J. Beck, A. Fraser and F. Kreuter.
Informing climate risk analysis using textual information - A research agenda.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

We present a research agenda focused on efficiently extracting, assuring quality, and consolidating textual company sustainability information to address urgent climate change decision-making needs. Starting from the goal to create integrated FAIR (Findable, Accessible, Interoperable, Reusable) climate-related data, we identify research needs pertaining to the technical aspects of information extraction as well as to the design of the integrated sustainability datasets that we seek to compile. Regarding extraction, we leverage technological advancements, particularly in large language models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines, to unlock the underutilized potential of unstructured textual information contained in corporate sustainability reports. In applying these techniques, we review key challenges, which include the retrieval and extraction of CO2 emission values from PDF documents, especially from unstructured tables and graphs therein, and the validation of automatically extracted data through comparisons with human-annotated values. We also review how existing use cases and practices in climate risk analytics relate to choices of what textual information should be extracted and how it could be linked to existing structured data.

MCML Authors

Malte Schierholz

Dr.

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Jacob Beck

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

Frauke Kreuter

Prof. Dr.

C4 | Computational Social Sciences

Social Data Science and AI

B. Ma.
Evaluating Lexical Aspect with Large Language Models.
CMCL @ACL 2024 - Workshop on Cognitive Modeling and Computational Linguistics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

In this study, we explore the proficiency of large language models (LLMs) in understanding two key lexical aspects: duration (durative/stative) and telicity (telic/atelic). Through experiments on datasets featuring sentences, verbs, and verb positions, we prompt the LLMs to identify aspectual features of verbs in sentences. Our findings reveal that certain LLMs, particularly those closed-source ones, are able to capture information on duration and telicity, albeit with some performance variations and weaker results compared to the baseline. By employing prompts at three levels (sentence-only, sentence with verb, and sentence with verb and its position), we demonstrate that integrating verb information generally enhances performance in aspectual feature recognition, though it introduces instability. We call for future research to look deeper into methods aimed at optimizing LLMs for aspectual feature comprehension.

MCML Authors

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

J. Pavlopoulos, V. Kougia, E. Garces Arias, P. Platanou, S. Shabalin, K. Liagkou, E. Papadatos, H. Essler, J.-B. Camps and F. Fischer.
Challenging Error Correction in Recognised Byzantine Greek.
ML4AL @ACL 2024 - 1st Workshop on Machine Learning for Ancient Languages at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Automatic correction of errors in Handwritten Text Recognition (HTR) output poses persistent challenges yet to be fully resolved. In this study, we introduce a shared task aimed at addressing this challenge, which attracted 271 submissions, yielding only a handful of promising approaches. This paper presents the datasets, the most effective methods, and an experimental analysis in error-correcting HTRed manuscripts and papyri in Byzantine Greek, the language that followed Classical and preceded Modern Greek. By using recognised and transcribed data from seven centuries, the two best-performing methods are compared, one based on a neural encoder-decoder architecture and the other based on engineered linguistic rules. We show that the recognition error rate can be reduced by both, up to 2.5 points at the level of characters and up to 15 at the level of words, while also elucidating their respective strengths and weaknesses.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

A. Yüksel, A. Köksal, L. K. Senel, A. Korhonen and H. Schütze.
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish.
SIGTURK @ACL 2024 - 1st Workshop on Natural Language Processing for Turkic Languages at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. Invited talk. arXiv GitHub

Abstract

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs’ understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language.

MCML Authors

Abdullatif Köksal

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Lütfi Kerem Senel

B2 | Natural Language Processing
→ Group Hinrich Schütze

* Former Member

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

S. Zhou, S. Peng and B. Plank.
CLIMATELI: Evaluating Entity Linking on Climate Change Data.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Climate Change (CC) is a pressing topic of global importance, attracting increasing attention across research fields, from social sciences to Natural Language Processing (NLP). CC is also discussed in various settings and communication platforms, from academic publications to social media forums. Understanding who and what is mentioned in such data is a first critical step to gaining new insights into CC. We present CLIMATELI (CLIMATe Entity LInking), the first manually annotated CC dataset that links 3,087 entity spans to Wikipedia. Using CLIMATELI (CLIMATe Entity LInking), we evaluate existing entity linking (EL) systems on the CC topic across various genres and propose automated filtering methods for CC entities. We find that the performance of EL models notably lags behind humans at both token and entity levels. Testing within the scope of retaining or excluding non-nominal and/or non-CC entities particularly impacts the models’ performances.

MCML Authors

Shijia Zhou

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Siyao Peng

Dr.

B2 | Natural Language Processing
→ Group Barbara Plank

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

B2 | Natural Language Processing

AI and Computational Linguistics

ACL 2024

05.08.2024

Subscribe to RSS News feed

25.06.2025

When Clinical Expertise Meets AI Innovation – With Michael Ingrisch

The new research film features Michael Ingrisch, who shows how AI and clinical expertise can solve real challenges in radiology together.

23.06.2025

Autonomous Driving: From Infinite Possibilities to Safe Decisions— With Matthias Althoff

The new research film features Matthias Althoff explaining how his team verifies autonomous vehicle safety using EDGAR and rigorous testing.

20.06.2025

ERC Advanced Grant for Massimo Fornasier

Massimo Fornasier was awarded ERC Advanced Grant to develop advanced algorithms for solving complex nonconvex optimization problems.

18.06.2025

ERC Advanced Grant for Albrecht Schmidt

Albrecht Schmidt receives ERC Advanced Grant for research on personalized generative AI to support memory, planning, and creativity.

11.06.2025

Better Data, Smarter AI: Why Quality Matters – With Frauke Kreuter

In our new research film, Frauke Kreuter explains how data quality shapes fair, reliable, and socially responsible AI systems.

MCML Researchers With 15 Papers at ACL 2024

62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, 11.08.2024–16.08.2024

Main Track (9 papers)

Workshops (6 papers)

Related