11.06.2024

MCML researchers with ten papers at NAACL 2024

Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, 16.06.2024–21.06.2024

We are happy to announce that MCML researchers are represented with ten papers at NAACL 2024:

H. Chen, J. Büssing, D. Rügamer and E. Nie.
Leveraging (Sentence) Transformer Models with Contrastive Learning for Identifying Machine-Generated Text.
18th International Workshop on Semantic Evaluation (SemEval 2024) at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

This paper outlines our approach to SemEval-2024 Task 8 (Subtask B), which focuses on discerning machine-generated text from human-written content, while also identifying the text sources, i.e., from which Large Language Model (LLM) the target text is generated. Our detection system is built upon Transformer-based techniques, leveraging various pre-trained language models (PLMs), including sentence transformer models. Additionally, we incorporate Contrastive Learning (CL) into the classifier to improve the detecting capabilities and employ Data Augmentation methods. Ultimately, our system achieves a peak accuracy of 76.96% on the test set of the competition, configured using a sentence transformer model integrated with CL methodology.

MCML Authors
David Rügamer

Prof. Dr.

Data Science Group

Ercong Nie

Statistical NLP and Deep Learning


B. Deiseroth, M. Meuer, N. Gritsch, C. Eichenberg, P. Schramowski, M. Aßenmacher and K. Kersting.
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components’ impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually—and that FDTM can identify those—while standard metrics result in deteriorated outcomes.

MCML Authors
Matthias Aßenmacher

Dr.

Statistical Learning & Data Science


Z. Ding, H. Cai, J. Wu, Y. Ma, R. Liao, B. Xiong and V. Tresp.
zrLLM: Zero-Shot Relational Learning on Temporal Knowledge Graphs with Large Language Models.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

Modeling evolving knowledge over temporal knowledge graphs (TKGs) has become a heated topic. Various methods have been proposed to forecast links on TKGs. Most of them are embedding-based, where hidden representations are learned to represent knowledge graph (KG) entities and relations based on the observed graph contexts. Although these methods show strong performance on traditional TKG forecasting (TKGF) benchmarks, they face a strong challenge in modeling the unseen zero-shot relations that have no prior graph context. In this paper, we try to mitigate this problem as follows. We first input the text descriptions of KG relations into large language models (LLMs) for generating relation representations, and then introduce them into embedding-based TKGF methods. LLM-empowered representations can capture the semantic information in the relation descriptions. This makes the relations, whether seen or unseen, with similar semantic meanings stay close in the embedding space, enabling TKGF models to recognize zero-shot relations even without any observed graph context. Experimental results show that our approach helps TKGF models to achieve much better performance in forecasting the facts with previously unseen relations, while still maintaining their ability in link forecasting regarding seen relations.

MCML Authors
Zifeng Ding

Database Systems & Data Mining

Yunpu Ma

Dr.

Artificial Intelligence & Machine Learning

Ruotong Liao

Database Systems & Data Mining

Volker Tresp

Prof. Dr.

Database Systems & Data Mining


R. Liao, X. Jia, Y. Li, Y. Ma and V. Tresp.
GenTKG: Generative Forecasting on Temporal Knowledge Graph.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

The rapid advancements in large language models (LLMs) have ignited interest in the temporal knowledge graph (tKG) domain, where conventional embedding-based and rule-based methods dominate. The question remains open of whether pre-trained LLMs can understand structured temporal relational data and replace them as the foundation model for temporal relational forecasting. Therefore, we bring temporal knowledge forecasting into the generative setting. However, challenges occur in the huge chasms between complex temporal graph data structure and sequential natural expressions LLMs can handle, and between the enormous data sizes of tKGs and heavy computation costs of finetuning LLMs. To address these challenges, we propose a novel retrieval-augmented generation framework named GenTKG combining a temporal logical rule-based retrieval strategy and few-shot parameter-efficient instruction tuning to solve the above challenges, respectively. Extensive experiments have shown that GenTKG outperforms conventional methods of temporal relational forecasting with low computation resources using extremely limited training data as few as 16 samples. GenTKG also highlights remarkable cross-domain generalizability with outperforming performance on unseen datasets without re-training, and in-domain generalizability regardless of time split in the same dataset. Our work reveals the huge potential of LLMs in the tKG domain and opens a new frontier for generative forecasting on tKGs.

MCML Authors
Ruotong Liao

Database Systems & Data Mining

Yunpu Ma

Dr.

Artificial Intelligence & Machine Learning

Volker Tresp

Prof. Dr.

Database Systems & Data Mining


Y. Liu, P. Lin, M. Wang and H. Schütze.
OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining.
Findings of Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the model, thus weakening the efficiency. To address these issues, we propose a novel framework: One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which largely reduces the number of parameters. We show OFA accelerates the convergence of continued pretraining, which is environmentally friendly as much fewer carbon footprints are generated. Through extensive experiments, we demonstrate OFA can achieve competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.

MCML Authors
Yihong Liu

Statistical NLP and Deep Learning

Peiqin Lin

Statistical NLP and Deep Learning

Mingyang Wang

Statistical NLP and Deep Learning

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


P. Resnik, B. Ma, A. Hoyle, P. Goel, R. Sarkar, M. Gearing, A.-C. Haensch and F. Kreuter.
TOPCAT: Topic-Oriented Protocol for Content Analysis of Text – A Preliminary Study.
6th Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024) at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

Identifying constructs in text data is a labor-intensive task in social science research. Despite the potential richness of open-ended survey responses, the complexity of analyzing them often leads researchers to underutilize or ignore them entirely. While topic modeling offers a technological solution, qualitative researchers may remain skeptical of its rigor. In this paper, we introduce TOPCAT: Topic-Oriented Protocol for Content Analysis of Text, a systematic approach that integrates off-the-shelf topic modeling with human decision-making and curation. Our method aims to provide a viable solution for topicalizing open-ended responses in survey research, ensuring both efficiency and trustworthiness. We present the TOPCAT protocol, define an evaluation process, and demonstrate its effectiveness using open-ended responses from a U.S. survey on COVID-19 impact. Our findings suggest that TOPCAT enables efficient and rigorous qualitative analysis, offering a promising avenue for future research in this domain. Furthermore, our findings challenge the adequacy of expert coding schemes as “gold” standards, emphasizing the subjectivity inherent in qualitative content interpretation.

MCML Authors
Bolei Ma

Social Data Science and AI Lab

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab


M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
Rehearsal-Free Modular and Compositional Continual Learning for Language Models.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

Continual learning aims at incrementally acquiring new knowledge while not forgetting existing knowledge. To overcome catastrophic forgetting, methods are either rehearsal-based, i.e., store data examples from previous tasks for data replay, or isolate parameters dedicated to each task. However, rehearsal-based methods raise privacy and memory issues, and parameter-isolation continual learning does not consider interaction between tasks, thus hindering knowledge transfer. In this work, we propose MoCL, a rehearsal-free Modular and Compositional Continual Learning framework which continually adds new modules to language models and composes them with existing modules. Experiments on various benchmarks show that MoCL outperforms state of the art and effectively facilitates knowledge transfer.

MCML Authors
Mingyang Wang

Statistical NLP and Deep Learning

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


H. Ye, Y. Liu, C. Ma and H. Schütze.
MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer.
5th Workshop on Insights from Negative Results in NLP at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models can take considerable resources that are almost only available to high-resource languages. On the contrary, static word embeddings are easier to train in terms of computing resources and the amount of data required. In this paper, we introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer), a novel and challenging task that is especially relevant to low-resource languages for which static word embeddings are available. To tackle the task, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language. In this way, we can train the PLM on source-language training data and perform zero-shot transfer to the target language by simply swapping the embedding layer. However, through extensive experiments on two classification datasets, we show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines. In this paper, we attempt to explain this negative result and provide several thoughts on possible improvement.

MCML Authors
Haotian Ye

Statistical NLP and Deep Learning

Yihong Liu

Statistical NLP and Deep Learning

Chunlan Ma

Statistical NLP and Deep Learning

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


Y. Zhang, V. Hangya and A. Fraser.
A Study of the Class Imbalance Problem in Abusive Language Detection.
8th Workshop on Online Abuse and Harms (WOAH 2024) at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

Abusive language detection has drawn increasing interest in recent years. However, a less systematically explored obstacle is label imbalance, i.e., the amount of abusive data is much lower than non-abusive data, leading to performance issues. The aim of this work is to conduct a comprehensive comparative study of popular methods for addressing the class imbalance issue. We explore 10 well-known approaches on 8 datasets with distinct characteristics: binary or multi-class, moderately or largely imbalanced, focusing on various types of abuse, etc. Additionally, we propose two novel methods specialized for abuse detection: AbusiveLexiconAug and ExternalDataAug, which enrich the training data using abusive lexicons and external abusive datasets, respectively. We conclude that: 1) our AbusiveLexiconAug approach, random oversampling, and focal loss are the most versatile methods on various datasets; 2) focal loss tends to yield peak model performance; 3) oversampling and focal loss provide promising results for binary datasets and small multi-class sets, while undersampling and weighted cross-entropy are more suitable for large multi-class sets; 4) most methods are sensitive to hyperparameters, yet our suggested choice of hyperparameters provides a good starting point.

MCML Authors
Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


S. Zhou, H. Shan, B. Plank and R. Litschko.
MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness.
18th International Workshop on Semantic Evaluation (SemEval 2024) at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024.
Abstract

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness (STR), on Track C: Cross-lingual. The task aims to detect semantic relatedness of two sentences from the same languages. For cross-lingual approach we developed a set of linguistics-inspired models trained with several task-specific strategies. We 1) utilize language vectors for selection of donor languages; 2) investigate the multi-source approach for training; 3) use transliteration of non-latin script to study impact of ‘script gap’; 4) opt machine translation for data augmentation. We additionally compare the performance of XLM-RoBERTa and Furina with the same training strategy. Our submission achieved the first place in the C8 (Kinyarwanda) test.

MCML Authors
Shijia Zhou

Artificial Intelligence and Computational Linguistics

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics

Robert Litschko

Artificial Intelligence and Computational Linguistics


