06.11.2024

MCML Researchers With 22 Papers at EMNLP 2024

Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Miami, FL, USA, 12.11.2024–16.11.2024

We are happy to announce that MCML researchers are represented with 22 papers at EMNLP 2024. Congrats to our researchers!

Main Track (6 papers)

M. Di Marco and A. Fraser.
Subword Segmentation in LLMs: Looking at Inflection and Consistency.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

MCML Authors

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


L. Edman, H. Schmid and A. Fraser.
CUTE: Measuring LLMs’ Understanding of Their Tokens.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.

MCML Authors

Lukas Edman

Dr.

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


W. Lai, V. Hangya and A. Fraser.
Style-Specific Neurons for Steering LLMs in Text Style Transfer.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

MCML Authors

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


Y. J. Liu, T. Aoyama, W. Scivetti, Y. Zhu, S. Behzad, L. E. Levine, J. Lin, D. Tiwari and A. Zeldes.
GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

MCML Authors

Yang Janet Liu

AI and Computational Linguistics


Y. Liu, Y. Zhang, Q. Li, T. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

MCML Authors

Tong Liu

Database Systems and Data Mining

Hinrich Schütze

Prof. Dr.

Computational Linguistics


P. Mondorf and B. Plank.
Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

MCML Authors

Philipp Mondorf

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


Findings Track (14 papers)

P. F. Balestrucci, S. Casola, S. M. Lo, V. Basile and A. Mazzei.
I’m sure you’re a real scholar yourself: Exploring Ironic Content Generation by Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Generating ironic content is challenging: it requires a nuanced understanding of context and implicit references and balancing seriousness and playfulness. Moreover, irony is highly subjective and can depend on various factors, such as social, cultural, or generational aspects. This paper explores whether Large Language Models (LLMs) can learn to generate ironic responses to social media posts. To do so, we fine-tune two models to generate ironic and non-ironic content and deeply analyze their outputs’ linguistic characteristics, their connection to the original post, and their similarity to the human-written replies. We also conduct a large-scale human evaluation of the outputs. Additionally, we investigate whether LLMs can learn a form of irony tied to a generational perspective, with mixed results.

MCML Authors

Silvia Casola

Dr.

AI and Computational Linguistics


B. Chen, X. Wang, S. Peng, R. Litschko, A. Korhonen and B. Plank.
'Seeing the Big through the Small': Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations?
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Human label variation (HLV) is a valuable source of information that arises when multiple human annotators provide different labels for valid reasons. In Natural Language Inference (NLI), earlier approaches to capturing HLV involve either collecting annotations from many crowd workers to represent human judgment distribution (HJD) or using expert linguists to provide detailed explanations for their chosen labels. While the former method provides denser HJD information, obtaining it is resource-intensive. In contrast, the latter offers richer textual information but is challenging to scale up to many human judges. Besides, large language models (LLMs) are increasingly used as evaluators (‘LLM judges’) but with mixed results, and few works aim to study HJDs. This study proposes to exploit LLMs to approximate HJDs using a small number of expert labels and explanations. Our experiments show that a few explanations significantly improve LLMs’ ability to approximate HJDs with and without explicit labels, thereby providing a solution to scale up annotations for HJD. However, fine-tuning smaller soft-label aware models with the LLM-generated model judgment distributions (MJDs) presents partially inconsistent results: while similar in distance, their resulting fine-tuned models and visualized distributions differ substantially. We show the importance of complementing instance-level distance measures with a global-level shape metric and visualization to more effectively evaluate MJDs against human judgment distributions.
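
To make the idea of an instance-level distance between a human judgment distribution (HJD) and a model judgment distribution (MJD) concrete, here is a minimal sketch. KL divergence and total variation distance are used only as common examples; the abstract does not name the measures used in the paper, and the label set, counts, and function names below are illustrative assumptions.

```python
import math

# Assumed three-way NLI label order, used only for this illustration.
LABELS = ("entailment", "neutral", "contradiction")


def normalize(counts):
    """Turn raw annotation counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def total_variation(p, q):
    """Half the L1 distance between two distributions over the same labels."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical item: 100 crowd annotations versus an LLM-derived distribution.
hjd = normalize([60, 30, 10])
mjd = [0.5, 0.4, 0.1]

kl = kl_divergence(hjd, mjd)
tvd = total_variation(hjd, mjd)
```

Both values are per-item scores. The abstract's point is that such scores can look similar across systems while the overall shape of the distributions differs, which is why a global-level view is needed as well.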

MCML Authors

Beiduo Chen

AI and Computational Linguistics

Xinpeng Wang

AI and Computational Linguistics

Siyao Peng

Dr.

AI and Computational Linguistics

Robert Litschko

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


Z. Ding, J. Wu, J. Wu, Y. Xia and V. Tresp.
Temporal Fact Reasoning over Hyper-Relational Knowledge Graphs.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Stemming from traditional knowledge graphs (KGs), hyper-relational KGs (HKGs) provide additional key-value pairs (i.e., qualifiers) for each KG fact that help to better restrict the fact validity. In recent years, there has been an increasing interest in studying graph reasoning over HKGs. Meanwhile, as discussed in recent works that focus on temporal KGs (TKGs), world knowledge is ever-evolving, making it important to reason over temporal facts in KGs. Previous mainstream benchmark HKGs do not explicitly specify temporal information for each HKG fact. Therefore, almost all existing HKG reasoning approaches do not devise any module specifically for temporal reasoning. To better study temporal fact reasoning over HKGs, we propose a new type of data structure named hyper-relational TKG (HTKG). Every fact in an HTKG is coupled with a timestamp explicitly indicating its time validity. We develop two new benchmark HTKG datasets, i.e., Wiki-hy and YAGO-hy, and propose an HTKG reasoning model that efficiently models hyper-relational temporal facts. To support future research on this topic, we open-source our datasets and model.
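
As a rough illustration of the data structure described above, the following sketch models a hyper-relational temporal fact as a primary triple, a set of qualifier key-value pairs, and a timestamp. The class name, field names, and example facts are hypothetical and are not taken from Wiki-hy, YAGO-hy, or the authors' released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HTKGFact:
    """One hyper-relational temporal fact (illustrative layout only)."""
    subject: str
    relation: str
    obj: str
    timestamp: int          # time validity, here simplified to a year
    qualifiers: tuple = ()  # tuple of (key, value) pairs refining the fact


def facts_at(facts, timestamp):
    """Return all facts whose timestamp equals the requested one."""
    return [f for f in facts if f.timestamp == timestamp]


# Invented example facts, for demonstration only.
facts = [
    HTKGFact("Alice", "educated_at", "Example University", 2010,
             (("degree", "PhD"), ("field", "physics"))),
    HTKGFact("Alice", "employed_by", "Example Lab", 2015),
]

valid_in_2010 = facts_at(facts, 2010)
degree = dict(valid_in_2010[0].qualifiers)["degree"]
```

Storing qualifiers as a tuple of pairs keeps each fact immutable and hashable, which is convenient when facts are collected in sets or used as dictionary keys.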

MCML Authors

Zifeng Ding

Database Systems and Data Mining

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


E. Garces Arias, J. Rodemann, M. Li, C. Heumann and M. Aßenmacher.
Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Decoding from the output distributions of large language models to produce high-quality text is a complex challenge in language modeling. Various approaches, such as beam search, sampling with temperature, k-sampling, nucleus p-sampling, typical decoding, contrastive decoding, and contrastive search, have been proposed to address this problem, aiming to improve coherence, diversity, as well as resemblance to human-generated text. In this study, we introduce adaptive contrastive search, a novel decoding strategy extending contrastive search by incorporating an adaptive degeneration penalty, guided by the estimated uncertainty of the model at each generation step. This strategy is designed to enhance both the creativity and diversity of the language modeling process while at the same time producing coherent and high-quality generated text output. Our findings indicate performance enhancement in both aspects, across different model architectures and datasets, underscoring the effectiveness of our method in text generation tasks. Our code base, datasets, and models are publicly available.
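
The sketch below shows one way a contrastive search step with an uncertainty-dependent penalty weight could look. The scoring rule (model confidence minus a weighted maximum cosine similarity to previous hidden states) follows standard contrastive search. The mapping from uncertainty to the weight, here the normalized entropy of the next-token distribution, is an assumption made for illustration and is not the formula from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def adaptive_alpha(probs):
    """Assumed mapping: normalized entropy in [0, 1] as the penalty weight."""
    if len(probs) < 2:
        return 0.0
    return entropy(probs) / math.log(len(probs))


def select_token(probs, cand_states, context_states, k=2):
    """Pick one of the top-k tokens by confidence minus degeneration penalty."""
    alpha = adaptive_alpha(probs)
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    def score(i):
        # Penalty: highest similarity between the candidate's hidden state
        # and any hidden state already in the context.
        penalty = max((cosine(cand_states[i], h) for h in context_states),
                      default=0.0)
        return (1 - alpha) * probs[i] - alpha * penalty

    return max(top_k, key=score)


# Toy example: two candidate tokens with 2-dimensional hidden states.
states = [[1, 0], [0, 1]]
context = [[1, 0]]
uncertain_choice = select_token([0.5, 0.5], states, context)
confident_choice = select_token([1.0, 0.0], states, context)
```

In this toy setup, a maximally uncertain distribution puts all weight on the penalty, so the token least similar to the context wins, while a fully confident distribution reduces the step to greedy selection.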

MCML Authors

Esteban Garces Arias

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

Statistical Learning and Data Science


A. Köksal, T. Schick, A. Korhonen and H. Schütze.
LongForm: Effective Instruction Tuning with Reverse Instructions.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

MCML Authors

Abdullatif Köksal

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


R. Liao, M. Erler, H. Wang, G. Zhai, G. Zhang, Y. Ma and V. Tresp.
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

MCML Authors

Ruotong Liao

Database Systems and Data Mining

Guangyao Zhai

Computer Aided Medical Procedures & Augmented Reality

Gengyuan Zhang

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


B. Ma, X. Wang, T. Hu, A.-C. Haensch, M. A. Hedderich, B. Plank and F. Kreuter.
The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

MCML Authors

Xinpeng Wang

AI and Computational Linguistics

Anna-Carolina Haensch

Dr.

Social Data Science and AI

Michael Hedderich

Dr.

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

Frauke Kreuter

Prof. Dr.

Social Data Science and AI


A. Modarressi, A. Köksal and H. Schütze.
Consistent Document-Level Relation Extraction via Counterfactuals.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

MCML Authors

Ali Modarressi

Computational Linguistics

Abdullatif Köksal

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


A. Sedova, R. Litschko, D. Frassinelli, B. Roth and B. Plank.
To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs.

MCML Authors

Robert Litschko

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics


M. Wang, L. Lange, H. Adel, J. Strötgen and H. Schütze.
Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

To ensure large language models contain up-to-date knowledge, they need to be updated regularly. However, model editing is challenging as it might also affect knowledge that is unrelated to the new data. State-of-the-art methods identify parameters associated with specific knowledge and then modify them via direct weight updates. However, these locate-and-edit methods suffer from heavy computational overhead and lack theoretical validation. In contrast, directly fine-tuning the model on requested edits affects the model’s behavior on unrelated knowledge, and significantly damages the model’s generation fluency and consistency. To address these challenges, we propose SAUL, a streamlined model editing method that uses sentence concatenation with augmented random facts for generation regularization. Evaluations on three model editing benchmarks show that SAUL is a practical and reliable solution for model editing outperforming state-of-the-art methods while maintaining generation quality and reducing computational overhead.

MCML Authors

Mingyang Wang

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


O. Xhelili, Y. Liu and H. Schütze.
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, Mediterranean-Amharic-Farsi and South+East Asian Languages, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements.

MCML Authors

Hinrich Schütze

Prof. Dr.

Computational Linguistics


A. Yüksel, A. Köksal, L. K. Senel, A. Korhonen and H. Schütze.
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs’ understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language.

MCML Authors

Abdullatif Köksal

Computational Linguistics

Hinrich Schütze

Prof. Dr.

Computational Linguistics


H. Zhang, J. Liu, Z. Han, S. Chen, B. He, V. Tresp, Z. Xu and J. Gu.
Visual Question Decomposition on Multimodal Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, while existing methods primarily focus on unimodal language models, the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. To this end, this paper explores visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework including a dataset and several evaluation criteria to assess the quality of the decomposed sub-questions, revealing that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a specific finetuning dataset, DecoVQA+, for enhancing the model’s question decomposition capability. Aiming at enabling models to perform appropriate selective decomposition, we propose an efficient finetuning pipeline. The finetuning pipeline consists of our proposed dataset and a training objective for selective decomposition. Finetuned MLLMs demonstrate significant improvements in the quality of sub-questions and the policy of selective question decomposition. Additionally, the models also achieve higher accuracy with selective decomposition on VQA benchmark datasets.

MCML Authors

Shuo Chen

Database Systems and Data Mining

Volker Tresp

Prof. Dr.

Database Systems and Data Mining


R. Zhao, A. Köksal, Y. Liu, L. Weissweiler, A. Korhonen and H. Schütze.
SynthEval: Hybrid Behavioral Testing of NLP Models with Synthetic Evaluation.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Traditional benchmarking in NLP typically involves using static held-out test sets. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021) and CheckList (Ribeiro et al., 2020) have addressed these limitations through behavioral testing of NLP models with test types generated by a multistep human-annotated pipeline. Unfortunately, manually creating a variety of test types requires much human labor, often at prohibitive cost. In this work, we propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via LLMs using controlled generation, and then identifies challenging examples by comparing the predictions made by LLMs with task-specific NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
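
The second stage described above, finding challenging examples by comparing LLM predictions with those of a task-specific model, can be sketched as a simple disagreement filter. This is only an illustration of the idea, not the authors' pipeline; the function names and the two toy predictors are hypothetical stand-ins for real model calls.

```python
def find_challenging(sentences, llm_predict, task_model_predict):
    """Collect sentences on which the two predictors disagree."""
    challenging = []
    for sentence in sentences:
        llm_label = llm_predict(sentence)
        model_label = task_model_predict(sentence)
        if llm_label != model_label:
            challenging.append((sentence, llm_label, model_label))
    return challenging


# Toy stand-ins for an LLM and a task-specific sentiment classifier.
def toy_llm(sentence):
    # Treats "not bad" as positive before falling back to the keyword rule.
    if "not bad" in sentence:
        return "positive"
    return "negative" if "bad" in sentence else "positive"


def toy_classifier(sentence):
    # Keyword rule only, so it misreads negated phrases.
    return "negative" if "bad" in sentence else "positive"


sentences = ["This film is bad.", "This film is not bad.", "A lovely film."]
hard_cases = find_challenging(sentences, toy_llm, toy_classifier)
```

In the framework as described, the disagreements would then go to human experts, who design templates and categorize the failure types.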

MCML Authors

Raoyuan Zhao

AI and Computational Linguistics

Abdullatif Köksal

Computational Linguistics

Leonie Weissweiler

Dr.

* Former Member

Hinrich Schütze

Prof. Dr.

Computational Linguistics


Workshops (2 papers)

K. Hämmerl, A. Manea, G. Vico, J. Helcl and J. Libovický.
CUNI and LMU Submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval.
MRL @EMNLP 2024 - 4th Multilingual Representation Learning Workshop at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

We present the joint CUNI and LMU submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval. The shared task objective was to explore how we can deploy modern methods in NLP in multi-lingual low-resource settings, tested on two sub-tasks: Named-entity recognition and question answering. Our solutions to the subtasks are based on data acquisition and model adaptation. We compare the performance of our submitted systems with the translate-test approach which proved to be the most useful in the previous edition of the shared task. Our results show that using more data as well as fine-tuning recent multilingual pre-trained models leads to considerable improvements over the translate-test baseline.

MCML Authors

Katharina Hämmerl

Data Analytics & Statistics


J. Wang, L. Zuo, S. Peng and B. Plank.
MultiClimate: Multimodal Stance Detection on Climate Change Videos.
NLP4PI @EMNLP 2024 - 3rd Workshop on NLP for Positive Impact at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source manually-annotated stance detection dataset with 100 CC-related YouTube videos and 4,209 frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art, 0.747/0.749 in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models.

MCML Authors

Siyao Peng

Dr.

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics

