
Research Group Hinrich Schütze


Prof. Dr. Hinrich Schütze
Principal Investigator
Statistical NLP and Deep Learning

Hinrich Schütze holds the Chair of Statistical NLP and Deep Learning at LMU Munich.

His primary focus is linguistically informed neural NLP: his team brings a deep understanding of language to its research and holds that learning is as central to successful NLP as it is to human language ability. Its research areas are representation learning, multilinguality, machine learning for low-resource scenarios, cognitively motivated deep learning, linguistically informed deep learning (especially for morphology), digital humanities, and the intersection of NLP and robotics.

Team members @MCML

Sebastian Gerstner (Statistical NLP and Deep Learning)
Ahmad Dawar Hakimi (Statistical NLP and Deep Learning)
Ayyoob Imani (Statistical NLP and Deep Learning)
Amir Hossein Kargaran (Statistical NLP and Deep Learning)
Molly Kennedy (Statistical NLP and Deep Learning)
Abdullatif Köksal (Statistical NLP and Deep Learning)
Sheng Liang (Statistical NLP and Deep Learning)
Peiqin Lin (Statistical NLP and Deep Learning)
Yihong Liu (Statistical NLP and Deep Learning)
Yongkang Liu (Statistical NLP and Deep Learning)
Chunlan Ma (Statistical NLP and Deep Learning)
Antonis Maronikolakis (Statistical NLP and Deep Learning)
Ali Modarressi (Statistical NLP and Deep Learning)
Ercong Nie (Statistical NLP and Deep Learning)
Lütfi Kerem Şenel (Statistical NLP and Deep Learning)
Victor Steinborn (Statistical NLP and Deep Learning)
Leonor Veloso (Statistical NLP and Deep Learning)
Mingyang Wang (Statistical NLP and Deep Learning)
Philipp Wicke, Dr. (Statistical NLP and Deep Learning)
Axel Wisiorek, Dr. (Statistical NLP and Deep Learning)
Haotian Ye (Statistical NLP and Deep Learning)
Shengqiang Zhang (Statistical NLP and Deep Learning)

Publications @MCML

2025


[111]
Y. Liu, C. Ma, H. Ye and H. Schütze.
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL GitHub
Abstract

Transliterating related languages that use different scripts into a common script shows effectiveness in improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is not desired because it takes a lot of computation budget for pretraining. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script by exploiting an mPLM and its accompanied tokenizer. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We applied TransMI to three recent strong mPLMs, and our experiments demonstrate that TransMI not only preserves their ability to handle non-transliterated data, but also enables the models to effectively process transliterated data: the results show a consistent improvement of 3% to 34%, varying across different models and tasks.
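
The three TransMI stages map naturally onto a small worked example. The sketch below is illustrative only and not the authors' released code: the toy vocabulary, the character table behind `transliterate`, and the mean-of-sources initialization rule are assumptions standing in for a real mPLM, a real transliteration tool, and whatever initialization the paper uses.

```python
import numpy as np

# Toy "mPLM" vocabulary with 4-dimensional embeddings (illustrative only).
rng = np.random.default_rng(0)
vocab = {"привет": 0, "мир": 1, "hello": 2, "world": 3}
emb = rng.normal(size=(len(vocab), 4))

# (a) Transliterate: map each subword into a common (Latin) script.
#     A real system would use a transliteration library; here a tiny table.
CYR2LAT = str.maketrans({"п": "p", "р": "r", "и": "i", "в": "v",
                         "е": "e", "т": "t", "м": "m"})
def transliterate(s: str) -> str:
    return s.translate(CYR2LAT)

# (b) Merge: add transliterated forms that are not already in the vocabulary.
new_vocab = dict(vocab)
sources = {}                      # new subword -> original subwords it came from
for sub in vocab:
    t = transliterate(sub)
    if t not in new_vocab:
        sources.setdefault(t, []).append(sub)
for t in sources:
    new_vocab[t] = len(new_vocab)

# (c) Initialize: embed each new subword as the mean of its source embeddings,
#     so transliterated data starts close to the original representation space.
new_emb = np.vstack([emb] + [emb[[vocab[s] for s in srcs]].mean(axis=0, keepdims=True)
                             for srcs in sources.values()])

print(new_vocab)        # original vocab plus e.g. 'privet', 'mir' with new ids
print(new_emb.shape)    # (len(new_vocab), 4)
```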

MCML Authors: Yihong Liu, Chunlan Ma, Haotian Ye and Hinrich Schütze (all Statistical NLP and Deep Learning)


[110]
Y. Liu, M. Wang, A. H. Kargaran, A. Imani, O. Xhelili, H. Ye, C. Ma, F. Yvon and H. Schütze.
How Transliterations Improve Crosslingual Alignment.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL
Abstract

Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliterations, and does not use any parallel data. This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance. For this, we train multiple models under varying setups for two pairs of related languages: (1) Polish and Ukrainian and (2) Hindi and Urdu. To assess alignment, we define four types of similarities based on sentence representations. Our experiments show that adding transliterations alone improves the overall similarities, even for random sentence pairs. With the help of auxiliary alignment objectives, especially the contrastive objective, the model learns to distinguish matched from random pairs, leading to better alignments. However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance.
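
The kind of measurement the paper builds on, comparing similarities of matched translation pairs against random pairs, can be reproduced in a few lines. A minimal numpy sketch with synthetic sentence representations; the four similarity types defined in the paper are not reproduced here.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy sentence representations (e.g. mean-pooled hidden states) for two
# related languages; row i of each matrix is assumed to be a translation pair.
rng = np.random.default_rng(1)
reps_l1 = rng.normal(size=(100, 16))
reps_l2 = reps_l1 + 0.5 * rng.normal(size=(100, 16))   # noisy "translations"

# Matched-pair similarity vs. similarity of random (misaligned) pairs:
matched = np.mean([cosine(a, b) for a, b in zip(reps_l1, reps_l2)])
perm = rng.permutation(len(reps_l2))
random_pairs = np.mean([cosine(a, reps_l2[j]) for a, j in zip(reps_l1, perm)])

# A well-aligned space should separate the two; a contrastive objective
# explicitly pushes matched pairs above random pairs.
print(f"matched: {matched:.3f}  random: {random_pairs:.3f}")
```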

MCML Authors: Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Ayyoob Imani, Haotian Ye, Chunlan Ma and Hinrich Schütze (all Statistical NLP and Deep Learning)


2024


[109]
A. H. Kargaran, F. Yvon and H. Schütze.
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv
Abstract

The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community.

MCML Authors: Amir Hossein Kargaran and Hinrich Schütze (Statistical NLP and Deep Learning)


[108]
H. Ye, A. Wisiorek, A. Maronikolakis, Ö. Alaçam and H. Schütze.
A Federated Approach to Few-Shot Hate Speech Detection for Marginalized Communities.
Preprint (Dec. 2024). arXiv

MCML Authors: Haotian Ye, Axel Wisiorek, Antonis Maronikolakis and Hinrich Schütze (all Statistical NLP and Deep Learning)


[107]
Y. Liu, Y. Zhang, Q. Li, T. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizer to conserve GPU memory, which can potentially compromise the performance of LMs as non-zero order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT can significantly reduce the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance to parameter-efficient fine-tuning and standard full parameter fine-tuning. (2) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. (3) HiFT can save more than 60% GPU memory compared with standard full-parameter fine-tuning for 7B model. (4) HiFT enables full-parameter fine-tuning of a 7B model on single 48G A6000 with a precision of 32 using the AdamW optimizer, without using any memory saving techniques.
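
The memory-saving idea, updating only one block of parameters per step so that gradients and optimizer state exist for a single group at a time, can be sketched as follows. This assumes PyTorch and a toy model; re-creating the optimizer every step is a simplification of how HiFT actually manages optimizer state.

```python
import torch
from torch import nn

# Toy "language model": a stack of blocks treated as hierarchy levels.
model = nn.Sequential(*[nn.Linear(32, 32) for _ in range(4)])
groups = [list(block.parameters()) for block in model]   # one group per block

def train_step(step, batch_x, batch_y):
    group = groups[step % len(groups)]
    for p in model.parameters():
        p.requires_grad_(False)
    for p in group:
        p.requires_grad_(True)

    # A fresh optimizer keeps state only for the current group; this resets
    # AdamW moments each cycle, which the paper handles more carefully.
    opt = torch.optim.AdamW(group, lr=1e-3)
    loss = nn.functional.mse_loss(model(batch_x), batch_y)
    loss.backward()          # gradients are only materialized for `group`
    opt.step()
    opt.zero_grad(set_to_none=True)
    return loss.item()

x, y = torch.randn(8, 32), torch.randn(8, 32)
for step in range(8):        # cycles through the four blocks twice
    print(step, round(train_step(step, x, y), 4))
```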

MCML Authors: Yongkang Liu (Statistical NLP and Deep Learning), Tong Liu (Database Systems & Data Mining) and Hinrich Schütze (Statistical NLP and Deep Learning)


[106]
A. Köksal, T. Schick, A. Korhonen and H. Schütze.
LongForm: Effective Instruction Tuning with Reverse Instructions.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Instruction tuning enables language models to more effectively generalize and better follow user intent. However, obtaining instruction data is costly and challenging. Prior work employs methods such as expensive human annotation, crowd-sourced datasets with alignment issues, and generating noisy examples via LLMs. We introduce the LongForm-C dataset, which is created by reverse instructions. We generate instructions via LLMs for human-written corpus examples using reverse instructions. First we select a diverse set of human-written documents from corpora such as C4 and Wikipedia; then we generate instructions for these documents via LLMs. This approach provides a cheaper and cleaner instruction-tuning dataset with natural output and one suitable for long text generation. Our models outperform 10x larger language models without instruction tuning on tasks such as story/recipe generation and long-form question answering. Moreover, LongForm models outperform prior instruction-tuned models such as FLAN-T5 and Alpaca by a large margin, and improve language understanding capabilities further.
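
Reverse instructions amount to asking an LLM to invent the instruction that an existing human-written document answers. A minimal sketch of that data-generation loop; the prompt wording is an assumption rather than the paper's template, and `generate` is a stub standing in for an actual LLM call.

```python
def reverse_instruction_prompt(document: str) -> str:
    # Ask the LLM to produce the instruction for which `document` is the answer.
    return (
        "Below is a text written by a person.\n"
        "Write the instruction that this text is a good response to.\n\n"
        f"Text:\n{document}\n\nInstruction:"
    )

def generate(prompt: str) -> str:
    # Stand-in for an LLM call (e.g. an API request); returns a dummy string here.
    return "Write a short recipe for vegetable soup."

corpus = ["Chop the onions and carrots, simmer them in broth for 20 minutes, "
          "season with salt and pepper, and serve hot."]

# Each (instruction, document) pair becomes one instruction-tuning example.
dataset = [{"instruction": generate(reverse_instruction_prompt(doc)), "output": doc}
           for doc in corpus]
print(dataset[0])
```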

MCML Authors: Abdullatif Köksal and Hinrich Schütze (Statistical NLP and Deep Learning)


[105]
A. Modarressi, A. Köksal and H. Schütze.
Consistent Document-Level Relation Extraction via Counterfactuals.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Many datasets have been developed to train and evaluate document-level relation extraction (RE) models. Most of these are constructed using real-world data. It has been shown that RE models trained on real-world data suffer from factual biases. To evaluate and address this issue, we present CovEReD, a counterfactual data generation approach for document-level relation extraction datasets using entity replacement. We first demonstrate that models trained on factual data exhibit inconsistent behavior: while they accurately extract triples from factual data, they fail to extract the same triples after counterfactual modification. This inconsistency suggests that models trained on factual data rely on spurious signals such as specific entities and external knowledge – rather than on the input context – to extract triples. We show that by generating document-level counterfactual data with CovEReD and training models on them, consistency is maintained with minimal impact on RE performance. We release our CovEReD pipeline as well as Re-DocRED-CF, a dataset of counterfactual RE documents, to assist in evaluating and addressing inconsistency in document-level RE.
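
The counterfactual generation idea, replacing entities consistently in both the document and its gold triples so that correct extraction requires reading the modified context, can be illustrated with a toy example. The example document, entities and replacement map below are assumptions, not the released CovEReD pipeline.

```python
# A toy document-level RE example: text, entity mentions and one gold triple.
doc = {
    "text": "Marie Curie was born in Warsaw.",
    "entities": {"E1": "Marie Curie", "E2": "Warsaw"},
    "triples": [("E1", "born_in", "E2")],
}

# Counterfactual replacement: map entities to other entities of the same type.
replacement = {"Marie Curie": "Ada Lovelace", "Warsaw": "Lisbon"}

def counterfactual(example, mapping):
    text = example["text"]
    entities = {}
    for eid, name in example["entities"].items():
        new_name = mapping.get(name, name)
        text = text.replace(name, new_name)
        entities[eid] = new_name
    # Triples keep their structure; only the surface entities change,
    # so extracting them now requires reading the modified context.
    return {"text": text, "entities": entities, "triples": example["triples"]}

print(counterfactual(doc, replacement))
# {'text': 'Ada Lovelace was born in Lisbon.', ...}
```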

MCML Authors: Ali Modarressi, Abdullatif Köksal and Hinrich Schütze (all Statistical NLP and Deep Learning)


[104]
M. Wang, L. Lange, H. Adel, J. Strötgen and H. Schütze.
Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

MCML Authors: Mingyang Wang and Hinrich Schütze (Statistical NLP and Deep Learning)


[103]
O. Xhelili, Y. Liu and H. Schütze.
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

MCML Authors: Yihong Liu and Hinrich Schütze (Statistical NLP and Deep Learning)


[102]
A. Yüksel, A. Köksal, L. K. Senel, A. Korhonen and H. Schütze.
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

MCML Authors: Abdullatif Köksal, Lütfi Kerem Şenel and Hinrich Schütze (all Statistical NLP and Deep Learning)


[101]
R. Zhao, A. Köksal, Y. Liu, L. Weissweiler, A. Korhonen and H. Schütze.
SynthEval: Hybrid Behavioral Testing of NLP Models with Synthetic Evaluation.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub

MCML Authors: Raoyuan Zhao (Artificial Intelligence and Computational Linguistics), Abdullatif Köksal and Yihong Liu (Statistical NLP and Deep Learning), Leonie Weissweiler (Dr., former member) and Hinrich Schütze (Statistical NLP and Deep Learning)


[100]
L. He, E. Nie, H. Schmid, H. Schütze, N. Mesgarani and J. Brennan.
Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning.
Preprint (Nov. 2024). arXiv
Abstract

This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM evaluation paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical biases that may misrepresent LLMs’ true linguistic capabilities. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. Our contributions are three-fold: (1) We compare neurolinguistic and psycholinguistic methods, revealing distinct patterns in LLM assessment; (2) We demonstrate that LLMs exhibit higher competence in form compared to meaning, with the latter largely correlated to the former; (3) We present new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.
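
The diagnostic-probing half of the method can be pictured as training a small classifier per layer on frozen activations for the two members of each minimal pair and reading off where the distinction becomes linearly decodable. The sketch below uses random features with a synthetic depth-dependent signal in place of real LLM activations and assumes scikit-learn is available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, dim, n_layers = 200, 32, 4

# Toy activations: for each layer, features for the acceptable and the
# unacceptable member of every minimal pair. Higher layers get a stronger
# (synthetic) signal, mimicking information emerging with depth.
for layer in range(n_layers):
    signal = 0.3 * (layer + 1)
    good = rng.normal(size=(n_pairs, dim)) + signal
    bad = rng.normal(size=(n_pairs, dim)) - signal
    X = np.vstack([good, bad])
    y = np.array([1] * n_pairs + [0] * n_pairs)

    # Diagnostic probe: a linear classifier on frozen activations.
    split = rng.permutation(len(X))
    train, test = split[:300], split[300:]
    probe = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    print(f"layer {layer}: probe accuracy {probe.score(X[test], y[test]):.2f}")
```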

MCML Authors: Ercong Nie and Hinrich Schütze (Statistical NLP and Deep Learning)


[99]
V. Hofmann, L. Weissweiler, D. Mortensen, H. Schütze and J. Pierrehumbert.
Derivational Morphology Reveals Analogical Generalization in Large Language Models.
Preprint (Nov. 2024). arXiv

MCML Authors: Leonie Weissweiler (Dr., former member) and Hinrich Schütze (Statistical NLP and Deep Learning)


[98]
M. Thaler, A. Köksal, A. Leidinger, A. Korhonen and H. Schütze.
How far can bias go? -- Tracing bias from pretraining data to alignment.
Preprint (Nov. 2024). arXiv

MCML Authors: Abdullatif Köksal and Hinrich Schütze (Statistical NLP and Deep Learning)


[97]
Y. Liu, F. Shi, D. Wang, Y. Zhang and H. Schütze.
ChatZero: Zero-Shot Cross-Lingual Dialogue Generation via Pseudo-Target Language.
ECAI 2024 - 27th European Conference on Artificial Intelligence. Santiago de Compostela, Spain, Oct 19-24, 2024. DOI
Abstract

Although large language models (LLMs) show amazing capabilities, many of the exciting applications discovered for LLMs fall short in low-resource languages. Besides, most existing methods depend on large-scale dialogue corpora, so building systems for dialogue generation in a zero-shot scenario remains a considerable challenge. To address this challenge, we propose ChatZero, a novel end-to-end zero-shot dialogue generation model based on a cross-lingual code-switching method. First, we construct a code-switching language and a pseudo-target language with placeholders. Then, for cross-lingual semantic transfer, we employ unsupervised contrastive learning to minimize the semantic gap between the source language, the code-switching language, and the pseudo-target language, which are mutually positive examples in the high-dimensional semantic space. Experiments on the multilingual DailyDialog and DSTC7-AVSD datasets demonstrate that ChatZero achieves more than 90% of the original supervised performance in the zero-shot case and state-of-the-art performance compared with other baselines.

MCML Authors: Yongkang Liu and Hinrich Schütze (Statistical NLP and Deep Learning)


[96]
A. H. Kargaran, A. Modarressi, N. Nikeghbal, J. Diesner, F. Yvon and H. Schütze.
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment.
Preprint (Oct. 2024). arXiv
Abstract

English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs.
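
At its core, the alignment computation checks whether each English sentence's nearest neighbour among embedded sentences of another language is its own translation. A minimal numpy sketch with toy embeddings; the choice of layers, pooling and datasets in MEXA is not modelled here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 64

# Toy sentence embeddings from an intermediate layer: row i in `eng` and
# `other` are assumed to be a parallel sentence pair.
eng = rng.normal(size=(n, d))
other = eng + 0.8 * rng.normal(size=(n, d))     # noisy counterpart language

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

sims = normalize(eng) @ normalize(other).T       # (n, n) cosine similarities

# Alignment score: fraction of English sentences whose nearest neighbour in
# the other language is exactly their own translation (the diagonal).
alignment = float(np.mean(sims.argmax(axis=1) == np.arange(n)))
print(f"alignment score: {alignment:.2f}")
```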

MCML Authors: Amir Hossein Kargaran, Ali Modarressi and Hinrich Schütze (all Statistical NLP and Deep Learning)


[95]
Y. Liu, E. Nie, S. Feng, Z. Hua, Z. Ding, D. Wang, Y. Zhang and H. Schütze.
A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation.
ECML-PKDD 2024 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Vilnius, Lithuania, Sep 09-13, 2024. DOI GitHub
Abstract

Current state-of-the-art dialogue systems heavily rely on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMDG. The AMDG framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expressive patterns. To construct domain-agnostic dialogue corpora, we employ a de-domaining data processing technique used to remove domain-specific features. By mitigating the effects of domain-specific features, the model trained on the de-domained corpora can effectively learn common expression patterns in different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMDG achieves superior performance compared to both direct training on the target domain corpus and collective training on all five domain corpora. Our work underscores AMDG as a viable alternative solution for low-resource multi-domain dialogue generation.

MCML Authors: Yongkang Liu and Ercong Nie (Statistical NLP and Deep Learning), Zifeng Ding (Database Systems & Data Mining) and Hinrich Schütze (Statistical NLP and Deep Learning)


[94]
A. Köksal, M. Thaler, A. Imani, A. Üstün, A. Korhonen and H. Schütze.
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions.
Preprint (Sep. 2024). arXiv GitHub
Abstract

Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach’s effectiveness for both NLU and open-ended generation.

MCML Authors: Abdullatif Köksal, Ayyoob Imani and Hinrich Schütze (all Statistical NLP and Deep Learning)


[93]
Y. Liu, H. Ye, C. Ma, M. Wang and H. Schütze.
LangSAMP: Language-Script Aware Multilingual Pretraining.
Preprint (Sep. 2024). arXiv GitHub
Abstract

Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings – learnable vectors assigned to different languages. These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model’s ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer.
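
The architectural change is deliberately small: language and script embeddings are added to the transformer output before the language-modeling head. A toy PyTorch sketch of that addition; the dimensions, ID lookups and the stand-in transformer output are illustrative assumptions.

```python
import torch
from torch import nn

vocab_size, hidden, n_langs, n_scripts = 1000, 64, 500, 30

lang_emb = nn.Embedding(n_langs, hidden)
script_emb = nn.Embedding(n_scripts, hidden)
lm_head = nn.Linear(hidden, vocab_size)

def lm_logits(transformer_output, lang_id, script_id):
    # transformer_output: (batch, seq_len, hidden) from the transformer blocks.
    # Language/script embeddings are added to every position before prediction,
    # relieving token embeddings from having to encode this information.
    h = (transformer_output
         + lang_emb(lang_id)[:, None, :]
         + script_emb(script_id)[:, None, :])
    return lm_head(h)

out = torch.randn(2, 8, hidden)                      # stand-in for XLM-R output
logits = lm_logits(out, torch.tensor([3, 7]), torch.tensor([0, 2]))
print(logits.shape)                                  # torch.Size([2, 8, 1000])
```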

MCML Authors: Yihong Liu, Haotian Ye, Chunlan Ma, Mingyang Wang and Hinrich Schütze (all Statistical NLP and Deep Learning)


[92]
I. Ziegler, A. Köksal, D. Elliott and H. Schütze.
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation.
Preprint (Sep. 2024). arXiv

MCML Authors: Abdullatif Köksal and Hinrich Schütze (Statistical NLP and Deep Learning)


[91]
V. Blaschke, C. Purschke, H. Schütze and B. Plank.
What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Natural language processing (NLP) has largely focused on modelling standardized languages. More recently, attention has increasingly shifted to local, non-standardized languages and dialects. However, the relevant speaker populations’ needs and wishes with respect to NLP tools are largely unknown. In this paper, we focus on dialects and regional languages related to German – a group of varieties that is heterogeneous in terms of prestige and standardization. We survey speakers of these varieties (N=327) and present their opinions on hypothetical language technologies for their dialects. Although attitudes vary among subgroups of our respondents, we find that respondents are especially in favour of potential NLP tools that work with dialectal input (especially audio input) such as virtual assistants, and less so for applications that produce dialectal output such as machine translation or spellcheckers.

MCML Authors: Verena Blaschke (Artificial Intelligence and Computational Linguistics), Hinrich Schütze (Statistical NLP and Deep Learning) and Barbara Plank (Prof. Dr., Artificial Intelligence and Computational Linguistics)


[90]
A. H. Kargaran, F. Yvon and H. Schütze.
MaskLID: Code-Switching Language Identification through Iterative Masking.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI GitHub
Abstract

We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a sentence is composed in both L1 and L2 languages, the LID classifier often only returns the dominant label L1. To address this limitation, MaskLID employs a strategy to mask text features associated with L1, allowing the LID to classify the text as L2 in the next round. This method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID for two open-source LIDs (GlotLID and OpenLID), that are both based on the FastText architecture.
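
The iterative masking loop can be sketched independently of FastText: classify the sentence, mask the features associated with the predicted language, and classify what remains. The word-level cue table below is a toy stand-in for the feature weights of a real sentence-level LID such as GlotLID or OpenLID.

```python
# Toy word-level language cues standing in for an LID model's feature weights.
CUES = {
    "der": "deu", "und": "deu", "ist": "deu", "heute": "deu",
    "the": "eng", "and": "eng", "is": "eng", "today": "eng",
}

def predict(words):
    # "Sentence-level LID": vote of the per-word cues (dominant label wins).
    votes = [CUES[w] for w in words if w in CUES]
    return max(set(votes), key=votes.count) if votes else None

def masklid(sentence, rounds=2):
    words = sentence.lower().split()
    labels = []
    for _ in range(rounds):
        lang = predict(words)
        if lang is None:
            break
        labels.append(lang)
        # Mask features associated with the detected language, so the next
        # round can surface the other language in a code-switched sentence.
        words = [w for w in words if CUES.get(w) != lang]
    return labels

print(masklid("der Zug ist late today and the weather is schlecht"))
# ['eng', 'deu']: English dominates the first round, German the second.
```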

MCML Authors: Amir Hossein Kargaran and Hinrich Schütze (Statistical NLP and Deep Learning)


[89]
Y. Liu, C. Ma, H. Ye and H. Schütze.
TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

The world’s more than 7000 languages are written in at least 293 scripts. Due to various reasons, many closely related languages use different scripts, which poses a difficulty for multilingual pretrained language models (mPLMs) in learning crosslingual knowledge through lexical overlap. As a consequence, mPLMs are faced with a script barrier: representations from different scripts are located in different subspaces, which can result in crosslingual transfer involving languages of different scripts performing suboptimally. To address this problem, we propose TransliCo, a framework that optimizes the Transliteration Contrastive Modeling (TCM) objective to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script (in our case Latin), which enhances uniformity in the representation space for different scripts. Using Glot500-m, an mPLM pretrained on over 500 languages, as our source model, we fine-tune it on a small portion (5%) of its training data, and refer to the resulting model as Furina. We show that Furina not only better aligns representations from distinct scripts but also outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks. Additionally, we achieve consistent improvement in a case study on the Indic group where the languages exhibit areal features but use different scripts. We make our code and models publicly available.
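
The Transliteration Contrastive Modeling objective treats a sentence and its Latin transliteration as a positive pair and the other sentences in the batch as negatives. Below is a generic symmetric InfoNCE-style loss over toy sentence embeddings (assuming PyTorch); it sketches that kind of objective rather than reproducing the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def transliteration_contrastive_loss(orig, translit, temperature=0.07):
    # orig / translit: (batch, dim) sentence embeddings of the original-script
    # sentences and their transliterations; row i of each is a positive pair.
    orig = F.normalize(orig, dim=-1)
    translit = F.normalize(translit, dim=-1)
    logits = orig @ translit.T / temperature          # (batch, batch) similarities
    targets = torch.arange(orig.size(0))              # positives on the diagonal
    # Symmetric cross-entropy: pull pairs together, push other sentences apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

batch = torch.randn(16, 128)
loss = transliteration_contrastive_loss(batch, batch + 0.1 * torch.randn(16, 128))
print(loss.item())
```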

MCML Authors: Yihong Liu, Chunlan Ma, Haotian Ye and Hinrich Schütze (all Statistical NLP and Deep Learning)


[88]
L. K. Senel, B. Fetahu, D. Yoshida, Z. Chen, G. Castellucci, N. Vedula, J. I. Choi and S. Malmasi.
Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

MCML Authors: Lütfi Kerem Şenel (Statistical NLP and Deep Learning)


[87]
P. Wicke and L. Wachowiak.
Exploring Spatial Schemas in Large Language Models.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI GitHub

MCML Authors: Philipp Wicke (Dr., Statistical NLP and Deep Learning)


[86]
S. Yuan, E. Nie, M. Färber, H. Schmid and H. Schütze.
GNNAVI: Navigating the Information Flow in Large Language Models by Graph Neural Network.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

MCML Authors: Ercong Nie and Hinrich Schütze (Statistical NLP and Deep Learning)


[85]
M. Zhang, V. Gautam, M. Wang, J. Alabi, X. Shen, D. Klakow and M. Mosbach.
The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

MCML Authors: Mingyang Wang (Statistical NLP and Deep Learning)


[84]
A. Yüksel, A. Köksal, L. K. Senel, A. Korhonen and H. Schütze.
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish.
SIGTURK @ACL 2024 - 1st Workshop on Natural Language Processing for Turkic Languages at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. Invited talk. arXiv GitHub

MCML Authors: Abdullatif Köksal, Lütfi Kerem Şenel and Hinrich Schütze (all Statistical NLP and Deep Learning)


[83]
M. Aßenmacher, A. Stephan, L. Weissweiler, E. Çano, I. Ziegler, M. Härttrich, B. Bischl, B. Roth, C. Heumann and H. Schütze.
Collaborative Development of Modular Open Source Educational Resources for Natural Language Processing.
TeachingNLP @ACL 2024 - 6th Workshop on Teaching NLP at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. URL
Abstract

In this work, we present a collaboratively and continuously developed open-source educational resource (OSER) for teaching natural language processing at two different universities. We shed light on the principles we followed for the initial design of the course and the rationale for ongoing developments, followed by a reflection on the inter-university collaboration for designing and maintaining teaching material. When reflecting on the latter, we explicitly emphasize the considerations that need to be made when facing heterogeneous groups and when having to accommodate multiple examination regulations within one single course framework. Relying on the fundamental principles of OSER developments as defined by Bothmann et al. (2023) proved to be an important guideline during this process. The final part pertains to open-sourcing our teaching material, coping with the increasing speed of developments in the field, and integrating the course digitally, also addressing conflicting priorities and challenges we are currently facing.

MCML Authors: Matthias Aßenmacher (Dr., Statistical Learning & Data Science), Leonie Weissweiler (Dr., former member), Bernd Bischl (Prof. Dr., Statistical Learning & Data Science) and Hinrich Schütze (Statistical NLP and Deep Learning)


[82]
P. Wicke, L. Hirlimann and J. M. Cunha.
Using Analogical Reasoning to Prompt LLMs for their Intuitions of Abstract Spatial Schemas.
Analogy-ANGLE @IJCAI 2024 - 1st Workshop on Analogical Abstraction in Cognition, Perception, and Language at the 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024). Jeju, Korea, Aug 03-09, 2024. PDF

MCML Authors: Philipp Wicke (Dr., Statistical NLP and Deep Learning)


[81]
P. Lin, A. F. T. Martins and H. Schütze.
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models.
Preprint (Jul. 2024). arXiv
Abstract

Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.

MCML Authors: Peiqin Lin and Hinrich Schütze (Statistical NLP and Deep Learning)


[80]
C. Ma, Y. Liu, H. Ye and H. Schütze.
Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts.
Preprint (Jul. 2024). arXiv
Abstract

Decoder-only large language models (LLMs) excel in high-resource languages across various tasks through few-shot or even zero-shot in-context learning (ICL). However, their performance often does not transfer well to low-resource languages, especially those written in non-Latin scripts. Inspired by recent work that leverages transliteration in encoder-only models, we investigate whether transliteration is also effective in improving LLMs’ performance for low-resource languages written in non-Latin scripts. To this end, we propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. We apply these methods to several representative LLMs of different sizes on various tasks including text classification and sequential labeling. Our findings show that the effectiveness of transliteration varies by task type and model size. For instance, all models benefit from transliterations for sequential labeling (with increases of up to 25%).
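
The three prompt variants differ only in how the target-language text is rendered: original script, Latin script, or both. The sketch below shows that construction with a placeholder transliteration table and placeholder prompt wording; neither is taken from the paper.

```python
# Tiny stand-in transliteration table; a real setup would use a proper
# transliteration tool rather than this illustrative mapping.
HYPOTHETICAL_TRANSLIT = {"नमस्ते": "namaste", "दुनिया": "duniya"}

def to_latin(text: str) -> str:
    return " ".join(HYPOTHETICAL_TRANSLIT.get(w, w) for w in text.split())

def build_prompt(text: str, variant: str) -> str:
    if variant == "original":
        shown = text
    elif variant == "latin":
        shown = to_latin(text)
    else:  # "both": original script plus its romanization
        shown = f"{text} (romanized: {to_latin(text)})"
    return f"Classify the sentiment of the following sentence.\nSentence: {shown}\nLabel:"

sentence = "नमस्ते दुनिया"
for variant in ("original", "latin", "both"):
    print(f"--- {variant} ---\n{build_prompt(sentence, variant)}\n")
```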

MCML Authors: Chunlan Ma, Yihong Liu, Haotian Ye and Hinrich Schütze (all Statistical NLP and Deep Learning)


[79]
H. Ye, Y. Liu, C. Ma and H. Schütze.
MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer.
NAACL 2024 - 5th Workshop on Insights from Negative Results in NLP at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL

MCML Authors: Haotian Ye, Yihong Liu, Chunlan Ma and Hinrich Schütze (all Statistical NLP and Deep Learning)


[78]
M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
Rehearsal-Free Modular and Compositional Continual Learning for Language Models.
NAACL 2024 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL

MCML Authors: Mingyang Wang and Hinrich Schütze (Statistical NLP and Deep Learning)


[77]
Y. Liu, P. Lin, M. Wang and H. Schütze.
OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining.
NAACL 2024 - Findings of Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL
Abstract

Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the model, thus weakening the efficiency. To address these issues, we propose a novel framework: One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which largely reduces the number of parameters. We show OFA accelerates the convergence of continued pretraining, which is environmentally friendly as much fewer carbon footprints are generated. Through extensive experiments, we demonstrate OFA can achieve competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.
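
OFA's initialization can be pictured in two moves: factorize the source embedding matrix into low-dimensional coordinates over shared primitives, then give each unseen subword coordinates that are a weighted combination of related source subwords. In the sketch below the combination weights are random placeholders; in OFA they come from aligned multilingual static word vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
V_src, dim, rank, V_new = 500, 64, 16, 40

# Source PLM subword embeddings and a low-rank factorization E ~ F_src @ P,
# where P (rank x dim) holds shared "primitive" embeddings.
E = rng.normal(size=(V_src, dim))
U, S, Vt = np.linalg.svd(E, full_matrices=False)
F_src = U[:, :rank] * S[:rank]        # (V_src, rank) per-subword coordinates
P = Vt[:rank]                         # (rank, dim)  shared primitives

# For each unseen subword, weights over related source subwords. In OFA these
# come from well-aligned multilingual static word vectors; random here.
W = rng.random(size=(V_new, V_src))
W /= W.sum(axis=1, keepdims=True)

# New subwords inherit coordinates as weighted combinations of source ones;
# only the small factorized matrices grow, not a full (V_new x dim) table.
F_new = W @ F_src                      # (V_new, rank)
new_embeddings = F_new @ P             # materialized only if needed
print(F_new.shape, new_embeddings.shape)   # (40, 16) (40, 64)
```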

MCML Authors: Yihong Liu, Peiqin Lin, Mingyang Wang and Hinrich Schütze (all Statistical NLP and Deep Learning)


[76]
L. Hirlimann, S. Zhang, H. Schütze and P. Wicke.
Robustness Testing of Multi-Modal Models in Varied Home Environments for Assistive Robots.
Preprint (Jun. 2024). arXiv
Abstract

The development of assistive robotic agents to support household tasks is advancing, yet the underlying models often operate in virtual settings that do not reflect real-world complexity. For assistive care robots to be effective in diverse environments, their models must be robust and integrate multiple modalities. Consider a caretaker needing assistance in a dimly lit room or navigating around a newly installed glass door. Models relying solely on visual input might fail in low light, while those using depth information could avoid the door. This demonstrates the necessity for models that can process various sensory inputs. Our ongoing study evaluates state-of-the-art robotic models in the AI2Thor virtual environment. We introduce disturbances, such as dimmed lighting and mirrored walls, to assess their impact on modalities like movement or vision, and object recognition. Our goal is to gather input from the Geriatronics community to understand and model the challenges faced by practitioners.

MCML Authors: Shengqiang Zhang, Hinrich Schütze and Philipp Wicke (all Statistical NLP and Deep Learning)


[75]
P. Lin, A. F. T. Martins and H. Schütze.
XAMPLER: Learning to Retrieve Cross-Lingual In-Context Examples.
Preprint (Jun. 2024). arXiv GitHub
Abstract

Recent studies indicate that leveraging off-the-shelf or fine-tuned retrievers, capable of retrieving relevant in-context examples tailored to the input query, enhances few-shot in-context learning of English. However, adapting these methods to other languages, especially low-resource ones, poses challenges due to the scarcity of cross-lingual retrievers and annotated data. Thus, we introduce XAMPLER: Cross-Lingual Example Retrieval, a method tailored to tackle the challenge of cross-lingual in-context learning using only annotated English data. XAMPLER first trains a retriever based on Glot500, a multilingual small language model, using positive and negative English examples constructed from the predictions of a multilingual large language model, i.e., MaLA500. Leveraging the cross-lingual capacity of the retriever, it can directly retrieve English examples as few-shot examples for in-context learning of target languages. Experiments on the multilingual text classification benchmark SIB200 with 176 languages show that XAMPLER substantially improves the in-context learning performance across languages.
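
At inference time the method reduces to: encode the target-language query with the trained retriever, score the pool of labeled English examples, and place the top-k into the prompt. The sketch below substitutes random vectors for Glot500-based encodings and cosine scoring for the trained retriever.

```python
import numpy as np

rng = np.random.default_rng(0)

english_pool = [f"english example #{i}" for i in range(100)]   # labeled candidates
pool_emb = rng.normal(size=(100, 32))                          # stand-in encodings

def embed(text: str) -> np.ndarray:
    # Stand-in for encoding the target-language query with the retriever.
    rng_local = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng_local.normal(size=32)

def retrieve_few_shots(query: str, k: int = 4):
    q = embed(query)
    scores = pool_emb @ q / (np.linalg.norm(pool_emb, axis=1) * np.linalg.norm(q))
    top = np.argsort(-scores)[:k]
    return [english_pool[i] for i in top]

shots = retrieve_few_shots("kichwa cha habari kwa kiswahili")   # low-resource query
prompt = "\n\n".join(shots) + "\n\nkichwa cha habari kwa kiswahili ->"
print(prompt)
```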

MCML Authors: Peiqin Lin and Hinrich Schütze (Statistical NLP and Deep Learning)


[74]
C. Ma, A. ImaniGooghari, H. Ye, R. Pei, E. Asgari and H. Schütze.
Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages.
Preprint (Jun. 2024). arXiv
Abstract

While natural language processing tools have been developed extensively for some of the world’s languages, a significant portion of the world’s over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.

MCML Authors: Chunlan Ma, Haotian Ye and Hinrich Schütze (all Statistical NLP and Deep Learning)


[73]
E. Nie, B. Shao, Z. Ding, M. Wang, H. Schmid and H. Schütze.
BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning.
Preprint (Jun. 2024). arXiv GitHub
Abstract

Large language models (LLMs) possess extensive parametric knowledge, but this knowledge is difficult to update with new information because retraining is very expensive and infeasible for closed-source models. Knowledge editing (KE) has emerged as a viable solution for updating the knowledge of LLMs without compromising their overall performance. On-the-fly KE methods, inspired by in-context learning (ICL), have shown great promise and allow LLMs to be treated as black boxes. In the past, KE was primarily employed in English contexts, whereas the potential for cross-lingual KE in current English-centric LLMs has not been fully explored. To foster more research in this direction, we introduce the BMIKE-53 benchmark for evaluating cross-lingual KE on 53 diverse languages across three KE task types. We also propose a gradient-free KE method called Multilingual In-context Knowledge Editing (MIKE) and evaluate it on BMIKE-53. Our evaluation focuses on cross-lingual knowledge transfer in terms of reliability, generality, locality, and portability, offering valuable insights and a framework for future research in cross-lingual KE.

MCML Authors: Ercong Nie (Statistical NLP and Deep Learning), Zifeng Ding (Database Systems & Data Mining), Mingyang Wang and Hinrich Schütze (Statistical NLP and Deep Learning)


[72]
M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
Learn it or Leave it: Module Composition and Pruning for Continual Learning.
Preprint (Jun. 2024). arXiv

MCML Authors: Mingyang Wang and Hinrich Schütze (Statistical NLP and Deep Learning)


[71]
V. Blaschke, B. Kovačić, S. Peng, H. Schütze and B. Plank.
MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack in 'within-language breadth': most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.

MCML Authors: Verena Blaschke and Siyao Peng (Dr.) (Artificial Intelligence and Computational Linguistics), Hinrich Schütze (Statistical NLP and Deep Learning) and Barbara Plank (Prof. Dr., Artificial Intelligence and Computational Linguistics)


[70]
A. H. Kargaran, F. Yvon and H. Schütze.
GlotScript: A Resource and Tool for Low Resource Writing System Identification.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL GitHub
Abstract

We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help cleaning multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights on the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community.
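
The tool side boils down to mapping characters to scripts and reporting a distribution. Python's standard library exposes no direct Unicode Script property, so the sketch below approximates it via `unicodedata` character names; a faithful implementation would use full script-property data and return ISO 15924 codes as GlotScript-T does.

```python
import unicodedata
from collections import Counter

def char_script(ch: str) -> str:
    # Approximation: take the leading word of the Unicode character name
    # ("LATIN SMALL LETTER A" -> "LATIN"). Real script identification would
    # use the Unicode Script property and return ISO 15924 codes instead.
    if not ch.strip():
        return "COMMON"
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"

def script_distribution(text: str) -> dict:
    counts = Counter(char_script(ch) for ch in text if ch.strip())
    total = sum(counts.values())
    return {script: round(n / total, 3) for script, n in counts.most_common()}

print(script_distribution("Hello Привет مرحبا"))
# e.g. {'CYRILLIC': ..., 'LATIN': ..., 'ARABIC': ...}
```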

MCML Authors: Amir Hossein Kargaran and Hinrich Schütze (Statistical NLP and Deep Learning)


[69]
A. Köksal, S. Severini and H. Schütze.
SilverAlign: MT-Based Silver Data Algorithm for Evaluating Word Alignment.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Word alignments are essential for a variety of NLP tasks. Therefore, choosing the best approaches for their creation is crucial. However, the scarce availability of gold evaluation data makes the choice difficult. We propose SilverAlign, a new method to automatically create silver data for the evaluation of word aligners by exploiting machine translation and minimal pairs. We show that performance on our silver data correlates well with gold benchmarks for 9 language pairs, making our approach a valid resource for evaluation of different domains and languages when gold data are not available. This addresses the important scenario of missing gold data alignments for low-resource languages.

MCML Authors: Abdullatif Köksal and Hinrich Schütze (Statistical NLP and Deep Learning)


[68]
D. R. Mortensen, V. Izrailevitch, Y. Xiao, H. Schütze and L. Weissweiler.
Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Lexical-syntactic flexibility, in the form of conversion (or zero-derivation) is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility—the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models—two proprietary models (GPT-3.5 and GPT-4), three open source model (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open source language models are also able to perform it and that the 7-billion parameter Mistral displays as little difference between its baseline performance on the natural language inference task and the non-prototypical syntactic category task, as the massive GPT-4.

MCML Authors: Hinrich Schütze (Statistical NLP and Deep Learning) and Leonie Weissweiler (Dr., former member)


[67]
L. Weissweiler, N. Böbel, K. Herrera, W. Scivetti, A. Lorenzi, N. Melnik, A. Bhatia, H. Schütze, L. Levin, A. Zeldes, J. Nivre, W. Croft and N. Schneider.
UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

MCML Authors: Leonie Weissweiler (Dr., former member) and Hinrich Schütze (Statistical NLP and Deep Learning)


[66]
S. Zhou, L. Weissweiler, T. He, H. Schütze, D. R. Mortensen and L. Levin.
Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

MCML Authors
Link to website

Shijia Zhou

Artificial Intelligence and Computational Linguistics

Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[65]
P. Lin, S. Ji, J. Tiedemann, A. F. T. Martins and H. Schütze.
MaLA-500: Massive Language Adaptation of Large Language Models.
Preprint (Apr. 2024). arXiv GitHub
Abstract

Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages.
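
A minimal sketch of the vocabulary-extension step using standard Hugging Face APIs; the base checkpoint and the added tokens below are placeholders, not the actual MaLA-500 setup, and continued pretraining on Glot500-c-style data would follow afterwards.

from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # assumed base model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical subwords for scripts the original vocabulary covers poorly.
added = tokenizer.add_tokens(["ኣብ", "ᠮᠣᠩᠭᠣᠯ", "ꕙꔤ"])

# Grow the embedding matrices so the new ids have trainable rows.
model.resize_token_embeddings(len(tokenizer))
print(f"added {added} tokens; vocabulary size is now {len(tokenizer)}")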

MCML Authors
Link to website

Peiqin Lin

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[64]
A. Modarressi, A. Köksal, A. Imani, M. Fayyaz and H. Schütze.
MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory.
Preprint (Apr. 2024). arXiv
Abstract

While current large language models (LLMs) demonstrate some capabilities in knowledge-intensive tasks, they are limited by relying on their parameters as an implicit storage mechanism. As a result, they struggle with infrequent knowledge and temporal degradation. In addition, the uninterpretable nature of parametric memorization makes it challenging to understand and prevent hallucination. Parametric memory pools and model editing are only partial solutions. Retrieval Augmented Generation (RAG) – though non-parametric – has its own limitations: it lacks structure, complicates interpretability and makes it hard to effectively manage stored knowledge. In this paper, we introduce MemLLM, a novel method of enhancing LLMs by integrating a structured and explicit read-and-write memory module. MemLLM tackles the aforementioned challenges by enabling dynamic interaction with the memory and improving the LLM’s capabilities in using stored knowledge. Our experiments indicate that MemLLM enhances the LLM’s performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular. We see MemLLM as an important step towards making LLMs more grounded and factual through memory augmentation.
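
An illustrative sketch of an explicit read-write memory over relation triplets that memory-write and memory-read calls emitted by the model could be routed to; the class and method names are invented here for illustration and are not the paper's interface.

from collections import defaultdict

class TripletMemory:
    """Toy explicit memory storing (subject, relation, object) triplets."""

    def __init__(self):
        self._facts = defaultdict(set)

    def write(self, subject, relation, obj):
        self._facts[subject].add((relation, obj))

    def read(self, subject, relation=None):
        facts = self._facts.get(subject, set())
        if relation is None:
            return sorted(facts)
        return sorted(obj for rel, obj in facts if rel == relation)

memory = TripletMemory()
memory.write("Danube", "flows_through", "Vienna")
memory.write("Danube", "flows_through", "Budapest")
print(memory.read("Danube", "flows_through"))  # ['Budapest', 'Vienna']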

MCML Authors
Link to website

Ali Modarressi

Statistical NLP and Deep Learning

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[63]
A. Maronikolakis, A. Köksal and H. Schütze.
Sociocultural knowledge is needed for selection of shots in hate speech detection tasks.
LT-EDI 2024 - 4th Workshop on Language Technology for Equality, Diversity, Inclusion. St. Julian’s, Malta, Mar 21, 2024. URL
Abstract

We introduce HATELEXICON, a lexicon of slurs and targets of hate speech for Brazil, Germany, India and Kenya, to aid model development and interpretability. First, we demonstrate how HATELEXICON can be used to interpret model predictions, showing that models developed to classify extreme speech rely heavily on target group names. Further, we propose a culturally-informed method to aid shot selection for training in low-resource settings. In few-shot learning, shot selection is of paramount importance to model performance and we need to ensure we make the most of available data. We work with HASOC German and Hindi data for training and the Multilingual HateCheck (MHC) benchmark for evaluation. We show that selecting shots based on our lexicon leads to models performing better than models trained on shots sampled randomly. Thus, when given only a few training examples, using HATELEXICON to select shots containing more sociocultural information leads to better few-shot performance. With these two use-cases we show how our HATELEXICON can be used for more effective hate speech detection.
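
A toy sketch of lexicon-guided shot selection: rank candidate training examples by how many lexicon entries (slurs or target-group names) they contain and keep the top k. The data and lexicon below are placeholders, not HATELEXICON itself.

def select_shots(candidates, lexicon, k):
    lexicon = {term.lower() for term in lexicon}

    def coverage(example):
        # Number of lexicon entries mentioned in the example.
        return sum(token in lexicon for token in example["text"].lower().split())

    return sorted(candidates, key=coverage, reverse=True)[:k]

candidates = [
    {"text": "went hiking with friends yesterday", "label": "neutral"},
    {"text": "GROUP_A people ruin everything", "label": "hateful"},
    {"text": "typical GROUP_B and GROUP_A behaviour", "label": "hateful"},
]
shots = select_shots(candidates, lexicon={"group_a", "group_b"}, k=2)
print([shot["text"] for shot in shots])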

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[62]
B. Ma, E. Nie, S. Yuan, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt Decomposition (ToPro), which facilitates the prompt-based method for token-level sequence labeling tasks. The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token. Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer, especially for languages that are typologically different from the source language English. Our method also attains state-of-the-art performance when employed with the mT5 model. Besides, our exploratory study in multilingual large language models shows that ToPro performs much better than the current in-context learning method. Overall, the performance improvements show that ToPro could potentially serve as a novel and simple benchmarking method for sequence labeling tasks.
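
A minimal sketch of the decomposition step: one cloze-style prompt per input token, each asking for that token's label. The template wording is an illustrative assumption rather than ToPro's exact prompt.

def token_prompts(sentence, template):
    return [template.format(sentence=sentence, token=token) for token in sentence.split()]

TEMPLATE = '{sentence} In this sentence, the entity type of "{token}" is'

for prompt in token_prompts("Munich is in Germany", TEMPLATE):
    print(prompt)
# A labeller would score each prompt against the tag set (e.g. LOC, PER, O)
# and map the best-scoring tag back to the corresponding token position.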

MCML Authors
Link to website

Bolei Ma

Social Data Science and AI Lab

Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[61]
L. K. Senel, B. Ebing, K. Baghirova, H. Schütze and G. Glavaš.
Kardeş-NLU: Transfer to Low-Resource Languages with Big Brother’s Help – A Benchmark and Evaluation for Turkic Languages.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[60]
P. Lin, C. Hu, Z. Zhang, A. Martins and H. Schütze.
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models.
EACL 2024 - Findings of the 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.
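
The measurement can be sketched as follows: embed the sentences of a multi-parallel corpus separately per language, mean-pool them into one vector per language, and compare languages by cosine similarity. The embed function below is a stub; a real setup would pool hidden states of a multilingual encoder at a chosen layer.

import numpy as np

def embed(sentence, language):
    # Placeholder for an mPLM sentence representation (deterministic toy vector).
    seed = sum(ord(ch) for ch in language + sentence) % (2**32)
    return np.random.default_rng(seed).normal(size=128)

def language_vector(sentences, language):
    return np.mean([embed(s, language) for s in sentences], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

parallel = {  # toy multi-parallel corpus
    "deu": ["Die Katze schläft.", "Es regnet."],
    "nld": ["De kat slaapt.", "Het regent."],
    "fin": ["Kissa nukkuu.", "Sataa."],
}
vectors = {lang: language_vector(sents, lang) for lang, sents in parallel.items()}
# With a real mPLM, the deu-nld similarity should come out clearly higher than deu-fin.
print(cosine(vectors["deu"], vectors["nld"]), cosine(vectors["deu"], vectors["fin"]))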

MCML Authors
Link to website

Peiqin Lin

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[59]
L. Weissweiler, A. Köksal and H. Schütze.
Hybrid Human-LLM Corpus Construction and LLM Evaluation for Rare Linguistic Phenomena.
Preprint (Mar. 2024). arXiv

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[58]
E. Nie, S. Yuan, B. Ma, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models.
Preprint (Feb. 2024). arXiv
Abstract

Despite the predominance of English in their training data, English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks, raising questions about the depth and nature of their cross-lingual capabilities. This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks. Diverging from the single text-to-text prompt, our method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We assess our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, utilizing both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Further analysis reveals the influence of evaluation methods and the use of instructions in prompts. Our multilingual investigation shows that English-centric language models perform better on average than multilingual models. Our study offers insights into the multilingual transferability of English-centric LLMs, contributing to the understanding of their multilingual linguistic knowledge.

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to website

Bolei Ma

Social Data Science and AI Lab

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


2023


[57]
S. Zhang, P. Wicke, L. K. Senel, L. Figueredo, A. Naceri, S. Haddadin, B. Plank and H. Schütze.
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation.
NeurIPS 2023 - 6th Robot Learning Workshop: Pretraining, Fine-Tuning, and Generalization with Large Scale Models at the 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Dec 10-16, 2023. URL

MCML Authors
Link to website

Shengqiang Zhang

Statistical NLP and Deep Learning

Link to website

Philipp Wicke

Dr.

Statistical NLP and Deep Learning

Link to website

Lütfi Kerem Şenel

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[56]
X. Li, E. Nie and S. Liang.
From Classification to Generation: Insights into Crosslingual Retrieval Augmented ICL.
NeurIPS 2023 - Workshop Instruction Tuning and Instruction Following at the 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Dec 10-16, 2023. URL
Abstract

The remarkable ability of Large Language Models (LLMs) to understand and follow instructions has sometimes been limited by their in-context learning (ICL) performance in low-resource languages. To address this, we introduce a novel approach that leverages cross-lingual retrieval-augmented in-context learning (CREA-ICL). By extracting semantically similar prompts from high-resource languages, we aim to bolster the zero-shot performance of multilingual pretrained language models (MPLMs) across diverse tasks. Though our approach yields steady improvements in classification tasks, it faces challenges in generation tasks, with Bangla serving as a key case study. Our evaluation offers insights into the performance dynamics of retrieval-augmented in-context learning across both classification and generation domains.

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to website

Sheng Liang

Statistical NLP and Deep Learning


[55]
X. Li, E. Nie and S. Liang.
Crosslingual Retrieval Augmented In-context Learning for Bangla.
BLP-2023 - 1st Workshop on Bangla Language Processing. Singapore, Dec 07, 2023. DOI
Abstract

The promise of Large Language Models (LLMs) in Natural Language Processing has often been overshadowed by their limited performance in low-resource languages such as Bangla. To address this, our paper presents a pioneering approach that utilizes cross-lingual retrieval augmented in-context learning. By strategically sourcing semantically similar prompts from high-resource language, we enable multilingual pretrained language models (MPLMs), especially the generative model BLOOMZ, to successfully boost performance on Bangla tasks. Our extensive evaluation highlights that the cross-lingual retrieval augmented prompts bring steady improvements to MPLMs over the zero-shot performance.

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to website

Sheng Liang

Statistical NLP and Deep Learning


[54]
N. Kassner, O. Tafjord, A. Sabharwal, K. Richardson, H. Schütze and P. Clark.
Language Models with Rationality.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

While large language models (LLMs) are proficient at question-answering (QA), it is not always clear how (or even if) an answer follows from their latent ‘beliefs’. This lack of interpretability is a growing impediment to widespread use of LLMs. To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that answers are supported by interpretable chains of reasoning drawn from a consistent network of beliefs. Our approach, which we call REFLEX, is to add a rational, self-reflecting layer on top of the LLM. First, given a question, we construct a belief graph using a backward-chaining process to materialize relevant model beliefs (including beliefs about answer candidates) and their inferential relationships. Second, we identify and minimize contradictions in that graph using a formal constraint reasoner. We find that REFLEX significantly improves consistency (by 8%-11% absolute) without harming overall answer accuracy, resulting in answers supported by faithful chains of reasoning drawn from a more consistent belief system. This suggests a new style of system architecture in which an LLM extended with a rational layer can provide an interpretable window into system beliefs, add a systematic reasoning capability, and repair latent inconsistencies present in the LLM.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[53]
M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
GradSim: Gradient-Based Language Grouping for Effective Multilingual Training.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

MCML Authors
Link to website

Mingyang Wang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[52]
L. Weissweiler, V. Hofmann, A. Kantharuban, A. Cai, R. Dutt, A. Hengle, A. Kabra, A. Kulkarni, A. Vijayakumar, H. Yu, H. Schütze, K. Oflazer and D. Mortensen.
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[51]
A. H. Kargaran, A. Imani, F. Yvon and H. Schütze.
GlotLID: Language Identification for Low-Resource Languages.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI GitHub
Abstract

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures.
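
A usage sketch for a fastText-based LID model such as GlotLID-M; the Hugging Face repository id and file name below are assumptions about where the released binary is hosted and may need adjusting.

import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")  # assumed location
model = fasttext.load_model(model_path)

labels, probs = model.predict("Mba'éichapa, iporãnte?", k=3)  # Guarani greeting
for label, prob in zip(labels, probs):
    print(label, round(float(prob), 3))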

MCML Authors
Link to website

Amir Hossein Kargaran

Statistical NLP and Deep Learning

Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[50]
A. Köksal, T. Schick and H. Schütze.
MEAL: Stable and Active Learning for Few-Shot Prompting.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI GitHub
Abstract

Few-shot classification has made great strides due to foundation models that, through priming and prompting, are highly effective few-shot learners. However, this approach has high variance both across different sets of few shots (data selection) and across different finetuning runs (run variability). This is problematic not only because it impedes the fair comparison of different approaches, but especially because it makes few-shot learning too unreliable for many real-world applications. To alleviate these issues, we make two contributions for more stable and effective few-shot learning: First, we propose novel ensembling methods and show that they substantially reduce run variability. Second, we introduce a new active learning (AL) criterion for data selection and present the first AL-based approach specifically tailored towards prompt-based learning. In our experiments, we show that our combined method, MEAL (Multiprompt finetuning and prediction Ensembling with Active Learning), improves overall performance of prompt-based finetuning by 2.3 points on five diverse tasks.
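
The ensembling half of the recipe can be sketched with toy numbers: average the class-probability distributions of several finetuning runs (or prompts) and predict the argmax of the averaged distribution.

import numpy as np

# Probabilities from three finetuning runs for two examples and two classes
# (made-up numbers); shape: (runs, examples, classes).
run_probs = np.array([
    [[0.70, 0.30], [0.40, 0.60]],
    [[0.60, 0.40], [0.20, 0.80]],
    [[0.80, 0.20], [0.55, 0.45]],
])

ensembled = run_probs.mean(axis=0)
print(ensembled)                 # averaged distributions
print(ensembled.argmax(axis=1))  # [0 1]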

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[49]
A. Köksal, O. Yalcin, A. Akbiyik, M. T. Kilavuz, A. Korhonen and H. Schütze.
Language-Agnostic Bias Detection in Language Models with Bias Probing.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI GitHub
Abstract

Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to slight changes in input. To address this, we propose a bias probing technique called LABDet, for evaluating social bias in PLMs with a robust and language-agnostic method. For nationality as a case study, we show that LABDet “surfaces” nationality bias by training a classifier on top of a frozen PLM on non-nationality sentiment detection. We find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context. We also show for English BERT that bias surfaced by LABDet correlates well with bias in the pretraining data; thus, our work is one of the few studies that directly links pretraining data to PLM behavior. Finally, we verify LABDet’s reliability and applicability to different templates and languages through an extensive set of robustness checks.

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[48]
Y. Liu, H. Ye, L. Weissweiler, R. Pei and H. Schütze.
Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

In comparative linguistics, colexification refers to the phenomenon of a lexical form conveying two or more distinct meanings. Existing work on colexification patterns relies on annotated word lists, limiting scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. We then propose simple and effective methods to build multilingual graphs from the colexification patterns: ColexNet and ColexNet+. ColexNet’s nodes are concepts and its edges are colexifications. In ColexNet+, concept nodes are additionally linked through intermediate nodes, each representing an ngram in one of 1,334 languages. We use ColexNet+ to train high-quality multilingual embeddings (ColexNet+ embeddings) that are well-suited for transfer learning. In our experiments, we first show that ColexNet achieves high recall on CLICS, a dataset of crosslingual colexifications. We then evaluate these embeddings on roundtrip translation, sentence retrieval and sentence classification and show that they surpass several transfer learning baselines. This demonstrates the benefits of using colexification as a source of information in multilingual NLP.
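
The graph construction can be sketched with networkx and a handful of well-known colexifications (toy witnesses, not the corpus-derived data): concepts are nodes, and an edge records that some language expresses both concepts with the same form.

import networkx as nx

# (concept_a, concept_b, language, shared_form) -- illustrative examples.
colexifications = [
    ("BELLY", "WOMB", "swh", "tumbo"),
    ("HAND", "ARM", "rus", "рука"),
    ("HAND", "ARM", "bul", "ръка"),
    ("TREE", "WOOD", "dan", "træ"),
]

graph = nx.Graph()
for a, b, language, form in colexifications:
    if graph.has_edge(a, b):
        graph[a][b]["witnesses"].append((language, form))
    else:
        graph.add_edge(a, b, witnesses=[(language, form)])

print(graph.number_of_nodes(), graph.number_of_edges())  # 6 3
print(graph["HAND"]["ARM"]["witnesses"])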

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[47]
E. Nie, H. Schmid and H. Schütze.
Unleashing the Multilingual Encoder Potential: Boosting Zero-Shot Performance via Probability Calibration.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

Pretrained multilingual encoder models can directly perform zero-shot multilingual tasks or linguistic probing by reformulating the input examples into cloze-style prompts. This is accomplished by predicting the probabilities of the label words at the masked token position, without requiring any updates to the model parameters. However, the performance of this method is limited by the model’s bias toward predicting label words which frequently occurred during the pretraining. These words typically receive high probabilities. To address this issue, we combine the models with calibration techniques which modify the probabilities of label words predicted by the models. We first validate the effectiveness of a proposed simple calibration method together with other existing techniques on monolingual encoders in both zero- and few-shot scenarios. We subsequently employ these calibration techniques on multilingual encoders, resulting in substantial performance improvements across a wide range of tasks.
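
A generic calibration recipe of this kind (not necessarily the paper's exact method) divides the label-word probabilities obtained for a real input by those obtained for a content-free input and renormalizes.

import numpy as np

label_words = ["good", "bad"]
p_input = np.array([0.12, 0.03])         # mask-position probabilities for the test sentence (toy numbers)
p_content_free = np.array([0.20, 0.02])  # probabilities for e.g. "N/A. It was [MASK]."

calibrated = p_input / p_content_free
calibrated /= calibrated.sum()
raw = p_input / p_input.sum()
for word, r, c in zip(label_words, raw, calibrated):
    print(f"{word}: raw={r:.2f} calibrated={c:.2f}")
# Calibration corrects the bias toward the frequent label word 'good' here.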

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[46]
V. Hangya, S. Severini, R. Ralev, A. Fraser and H. Schütze.
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages.
MRL @EMNLP 2023 - 3rd Workshop on Multi-lingual Representation Learning at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). Singapore, Dec 06-10, 2023. DOI
Abstract

Very low-resource languages, having only a few million tokens worth of data, are not well-supported by multilingual NLP approaches due to poor quality cross-lingual word representations. Recent work showed that good crosslingual performance can be achieved if a source language is related to the low-resource target language. However, not all language pairs are related. In this paper, we propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach, that incorporates intermediate related languages to bridge the gap between the distant source and target. We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target. We extend a semi-joint bilingual approach to multiple languages in order to eliminate the main weakness of previous works, i.e., independently trained monolingual embeddings, by anchoring the target language around the multilingual space. We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (≤ 5M tokens) and 4 moderately low-resource (≤ 50M) target languages, showing improved performance in both categories. Additionally, our analysis reveals the importance of good quality embeddings for intermediate languages as well as the importance of leveraging anchor points from all languages in the multilingual space.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[45]
A. Köksal, R. Aksitov and C.-C. Chang.
Hallucination Augmented Recitations for Language Models.
Preprint (Nov. 2023). arXiv
Abstract

Attribution is a key concept in large language models (LLMs) as it enables control over information sources and enhances the factuality of LLMs. While existing approaches utilize open book question answering to improve attribution, factual datasets may reward language models for recalling facts that they already know from their pretraining data, not for attribution. In contrast, counterfactual open book QA datasets would further improve attribution because the answer could only be grounded in the given text. We propose Hallucination Augmented Recitations (HAR) for creating counterfactual datasets by utilizing hallucination in LLMs to improve attribution. For open book QA as a case study, we demonstrate that models finetuned with our counterfactual datasets improve text grounding, leading to better open book QA performance, with up to an 8.0% increase in F1 score. Our counterfactual dataset leads to significantly better performance than using human-annotated factual datasets, even with 4x smaller datasets and 4x smaller models. We observe that improvements are consistent across various model sizes and datasets, including multi-hop, biomedical, and adversarial QA datasets.

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning


[44]
L. Weissweiler, V. Hofmann, A. Köksal and H. Schütze.
Explaining pretrained language models' understanding of linguistic structures using construction grammar.
Frontiers in Artificial Intelligence 6 (Oct. 2023). DOI

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[43]
B. Ma, E. Nie, H. Schmid and H. Schütze.
Is Prompt-Based Finetuning Always Better than Vanilla Finetuning? Insights from Cross-Lingual Language Understanding.
KONVENS 2023 - 19th Conference on Natural Language Processing. Ingolstadt, Germany, Sep 18-22, 2023. URL
Abstract

Multilingual pretrained language models (MPLMs) have demonstrated substantial performance improvements in zero-shot cross-lingual transfer across various natural language understanding tasks by finetuning MPLMs on task-specific labelled data of a source language (e.g. English) and evaluating on a wide range of target languages. Recent studies show that prompt-based finetuning surpasses regular finetuning in few-shot scenarios. However, the exploration of prompt-based learning in multilingual tasks remains limited. In this study, we propose the PROFIT pipeline to investigate the cross-lingual capabilities of Prompt-based Finetuning. We conduct comprehensive experiments on diverse cross-lingual language understanding tasks (sentiment classification, paraphrase identification, and natural language inference) and empirically analyze the variation trends of prompt-based finetuning performance in cross-lingual transfer across different few-shot and full-data settings. Our results reveal the effectiveness and versatility of prompt-based finetuning in cross-lingual language understanding. Our findings indicate that prompt-based finetuning outperforms vanilla finetuning in full-data scenarios and exhibits greater advantages in few-shot scenarios, with different performance patterns dependent on task types. Additionally, we analyze underlying factors such as language similarity and pretraining data size that impact the cross-lingual performance of prompt-based finetuning. Overall, our work provides valuable insights into the cross-lingual prowess of prompt-based finetuning.

MCML Authors
Link to website

Bolei Ma

Social Data Science and AI Lab

Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[42]
A. Maronikolakis, P. O’Grady, H. Schütze and M. Lyra.
Improving Few-Shot Learning with Multilingual Transfer and Monte Carlo Training Set Selection.
LSD 2023 - CLASP Conference on Learning with Small Data. Gothenburg, Sweden, Sep 11-12, 2023. URL
Abstract

In industry settings, machine learning is an attractive tool to automatize processes. Unfortunately, annotated and high-quality data is expensive to source. This problem is exacerbated in settings spanning multiple markets and languages. Thus, developing solutions for multilingual tasks with little available data is challenging. Few-shot learning is a compelling approach when building solutions in multilingual and low-resource settings, since the method not only requires just a few training examples to achieve high performance, but is also a technique agnostic to language. Even though the technique can be applied to multilingual settings, optimizing performance is an open question. In our work we show that leveraging higher-resource, task-specific language data can boost overall performance and we propose a method to select training examples per their average performance in a Monte Carlo simulation, resulting in a training set more conducive to learning. We demonstrate the effectiveness of our methods in fashion text reviews moderation, classifying reviews as related or unrelated to the given product. We show that our methodology boosts performance in multilingual (English, French, German) settings, increasing F1 score and significantly decreasing false positives.
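
A toy sketch of the selection idea: repeatedly sample candidate training sets, score each sampled set on held-out data (a stub here), and keep the examples with the highest average score over the simulations they took part in.

import random
from collections import defaultdict

random.seed(0)
candidates = [f"example_{i}" for i in range(20)]

def heldout_score(train_set):
    # Stub for "train on train_set, evaluate on held-out data"; pretends that
    # even-indexed examples are more useful.
    return sum(1.0 if int(name.split("_")[1]) % 2 == 0 else 0.4 for name in train_set) / len(train_set)

scores = defaultdict(list)
for _ in range(200):  # Monte Carlo simulations
    sample = random.sample(candidates, k=5)
    result = heldout_score(sample)
    for example in sample:
        scores[example].append(result)

ranked = sorted(candidates, key=lambda e: sum(scores[e]) / max(len(scores[e]), 1), reverse=True)
print(ranked[:5])  # examples most conducive to learning under the stub scorer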

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[41]
E. Nie, H. Schmid and H. Schütze.
Cross-Lingual Constituency Parsing for Middle High German: A Delexicalized Approach.
ALP @RANLP 2023 - 1st Workshop on Ancient Language Processing co-located with the Conference on Recent Advances in Natural Language Processing (RANLP 2023). Varna, Bulgaria, Sep 08, 2023. URL
Abstract

Constituency parsing plays a fundamental role in advancing natural language processing (NLP) tasks. However, training an automatic syntactic analysis system for ancient languages solely relying on annotated parse data is a formidable task due to the inherent challenges in building treebanks for such languages. It demands extensive linguistic expertise, leading to a scarcity of available resources. To overcome this hurdle, cross-lingual transfer techniques which require minimal or even no annotated data for low-resource target languages offer a promising solution. In this study, we focus on building a constituency parser for Middle High German (MHG) under realistic conditions, where no annotated MHG treebank is available for training. In our approach, we leverage the linguistic continuity and structural similarity between MHG and Modern German (MG), along with the abundance of MG treebank resources. Specifically, by employing the delexicalization method, we train a constituency parser on MG parse datasets and perform cross-lingual transfer to MHG parsing. Our delexicalized constituency parser demonstrates remarkable performance on the MHG test set, achieving an F1-score of 67.3%. It outperforms the best zero-shot cross-lingual baseline by a margin of 28.6% points. The encouraging results underscore the practicality and potential for automatic syntactic analysis in other ancient languages that face similar challenges as MHG.
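
The core transfer trick, delexicalization, amounts to replacing word forms by their POS tags so that a parser trained on Modern German treebanks and the Middle High German input share the same symbol vocabulary; the tagged clause below is a toy example.

def delexicalize(tagged_sentence):
    # Keep only the POS tags, discarding the (language-specific) word forms.
    return [pos for _, pos in tagged_sentence]

mhg_clause = [("dô", "ADV"), ("sprach", "VVFIN"), ("der", "ART"), ("künec", "NN")]
print(delexicalize(mhg_clause))  # ['ADV', 'VVFIN', 'ART', 'NN']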

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[40]
A. Imani, P. Lin, A. H. Kargaran, S. Severini, M. J. Sabet, N. Kassner, C. Ma, H. Schmid, A. Martins, F. Yvon and H. Schütze.
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI GitHub
Abstract

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, ‘help’ from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures.

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to website

Peiqin Lin

Statistical NLP and Deep Learning

Link to website

Amir Hossein Kargaran

Statistical NLP and Deep Learning

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to website

Chunlan Ma

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[39]
Y. Liu, H. Ye, L. Weissweiler, P. Wicke, R. Pei, R. Zangenfeind and H. Schütze.
A Crosslingual Investigation of Conceptualization in 1335 Languages.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Languages differ in how they divide up the world into concepts and words; e.g., in contrast to English, Swahili has a single concept for ‘belly’ and ‘womb’. We investigate these differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus. To this end, we propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language concepts and sets of target language strings. In a detailed linguistic analysis across all languages for one concept (‘bird’) and an evaluation on gold standard data for 32 Swadesh concepts, we show that Conceptualizer has good alignment accuracy. We demonstrate the potential of research on conceptualization in NLP with two experiments. (1) We define crosslingual stability of a concept as the degree to which it has 1-1 correspondences across languages, and show that concreteness predicts stability. (2) We represent each language by its conceptualization pattern for 83 concepts, and define a similarity measure on these representations. The resulting measure for the conceptual similarity between two languages is complementary to standard genealogical, typological, and surface similarity measures. For four out of six language families, we can assign languages to their correct family based on conceptual similarity with accuracies between 54% and 87%.

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Leonie Weissweiler

Dr.

* Former member

Link to website

Philipp Wicke

Dr.

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[38]
Y. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

We investigate response generation for multi-turn dialogue in generative chatbots. Existing generative models based on RNNs (Recurrent Neural Networks) usually employ the last hidden state to summarize the history, which makes models unable to capture the subtle variability observed in different dialogues and to distinguish the differences between dialogues that are similar in composition. In this paper, we propose the Pseudo-Variational Gated Recurrent Unit (PVGRU). The key novelty of PVGRU is a recurrent summarizing variable that aggregates the accumulated distribution variations of subsequences. We train PVGRU without relying on posterior knowledge, thus avoiding the training-inference inconsistency problem. PVGRU can perceive subtle semantic variability through summarizing variables that are optimized by two objectives we employ for training: distribution consistency and reconstruction. In addition, we build a Pseudo-Variational Hierarchical Dialogue (PVHD) model based on PVGRU. Experimental results demonstrate that PVGRU can broadly improve the diversity and relevance of responses on two benchmark datasets.

MCML Authors
Link to website

Yongkang Liu

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[37]
A. Modarressi, M. Fayyaz, E. Aghazadeh, Y. Yaghoobzadeh and M. T. Pilehvar.
DecompX: Explaining Transformers Decisions by Propagating Token Decomposition.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI GitHub
Abstract

An emerging solution for explaining Transformer-based models is to use vector-based analysis on how the representations are formed. However, providing a faithful vector-based explanation for a multi-layer model could be challenging in three aspects: (1) Incorporating all components into the analysis, (2) Aggregating the layer dynamics to determine the information flow and mixture throughout the entire model, and (3) Identifying the connection between the vector-based analysis and the model’s predictions. In this paper, we present DecompX to tackle these challenges. DecompX is based on the construction of decomposed token representations and their successive propagation throughout the model without mixing them in between layers. Additionally, our proposal provides multiple advantages over existing solutions for its inclusion of all encoder components (especially nonlinear feed-forward networks) and the classification head. The former allows acquiring precise vectors while the latter transforms the decomposition into meaningful prediction-based values, eliminating the need for norm- or summation-based vector aggregation. According to the standard faithfulness evaluations, DecompX consistently outperforms existing gradient-based and vector-based approaches on various datasets.

MCML Authors
Link to website

Ali Modarressi

Statistical NLP and Deep Learning


[36]
Z. Han, R. Liao, J. Gu, Y. Zhang, Z. Ding, Y. Gu, H. Köppl, H. Schütze and V. Tresp.
ECOLA: Enhancing Temporal Knowledge Embeddings with Contextualized Language Representations.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Since conventional knowledge embedding models cannot take full advantage of the abundant textual information, there have been extensive research efforts in enhancing knowledge embedding using texts. However, existing enhancement approaches cannot apply to temporal knowledge graphs (tKGs), which contain time-dependent event knowledge with complex temporal dynamics. Specifically, existing enhancement approaches often assume knowledge embedding is time-independent. In contrast, the entity embedding in tKG models usually evolves, which poses the challenge of aligning temporally relevant texts with entities. To this end, we propose to study enhancing temporal knowledge embedding with textual data in this paper. As an approach to this task, we propose Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations (ECOLA), which takes the temporal aspect into account and injects textual information into temporal knowledge embedding. To evaluate ECOLA, we introduce three new datasets for training and evaluating ECOLA. Extensive experiments show that ECOLA significantly enhances temporal KG embedding models with up to 287% relative improvements regarding Hits@1 on the link prediction task.

MCML Authors
Link to website

Ruotong Liao

Database Systems & Data Mining

Link to website

Yao Zhang

Database Systems & Data Mining

Link to website

Zifeng Ding

Database Systems & Data Mining

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems & Data Mining


[35]
E. Nie, S. Liang, H. Schmid and H. Schütze.
Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Multilingual Pretrained Language Models (MPLMs) perform strongly in cross-lingual transfer. We propose Prompts Augmented by Retrieval Crosslingually (PARC) to improve zero-shot performance on low-resource languages (LRLs) by augmenting the context with prompts consisting of semantically similar sentences retrieved from a high-resource language (HRL). PARC improves zero-shot performance on three downstream tasks (sentiment classification, topic categorization, natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in unlabeled (+5.1%) and labeled settings (+16.3%). PARC also outperforms finetuning by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
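
The retrieval-augmented prompt can be sketched as follows; the lexical-overlap similarity is only a stand-in for the multilingual sentence representations a real system would use, and the pool of labelled high-resource examples is made up.

def similarity(a, b):
    # Crude overlap proxy; replace with multilingual sentence embeddings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def build_prompt(lrl_input, hrl_pool, k=2):
    retrieved = sorted(hrl_pool, key=lambda ex: similarity(lrl_input, ex["text"]), reverse=True)[:k]
    context = "\n".join(f'{ex["text"]} Sentiment: {ex["label"]}' for ex in retrieved)
    return f"{context}\n{lrl_input} Sentiment:"

hrl_pool = [
    {"text": "the food was wonderful", "label": "positive"},
    {"text": "the film was boring", "label": "negative"},
    {"text": "great service and friendly staff", "label": "positive"},
]
print(build_prompt("das essen war wunderbar", hrl_pool))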

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to website

Sheng Liang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[34]
P. Wicke.
LMs stand their Ground: Investigating the Effect of Embodiment in Figurative Language Interpretation by Language Models.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

MCML Authors
Link to website

Philipp Wicke

Dr.

Statistical NLP and Deep Learning


[33]
Y. Liu, A. Chronopoulou, H. Schütze and A. Fraser.
On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss.
IWSLT 2023 - 20th International Conference on Spoken Language Translation. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Although unsupervised neural machine translation (UNMT) has achieved success in many language pairs, the copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs, especially when low-resource languages are involved. We find this issue is closely related to an unexpected copying behavior during online back-translation (BT). In this work, we propose a simple but effective training schedule that incorporates a language discriminator loss. The loss imposes constraints on the intermediate translation so that the translation is in the desired language. By conducting extensive experiments on different language pairs, including similar and distant, high and low-resource languages, we find that our method alleviates the copying problem, thus improving the translation performance on low-resource languages.

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Alexandra Chronopoulou

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[32]
P. Wicke, L. K. Senel, S. Zhang, L. Figueredo, A. Naceri, S. Haddadin and H. Schütze.
Towards Language-Based Modulation of Assistive Robots through Multimodal Models.
Geriatronics Summit 2023 - 2nd Geriatronics Summit. Garmisch-Partenkirchen, Germany, Jul 02-03, 2023. arXiv

MCML Authors
Link to website

Philipp Wicke

Dr.

Statistical NLP and Deep Learning

Link to website

Shengqiang Zhang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[31]
V. Steinborn, A. Maronikolakis and H. Schütze.
Politeness Stereotypes and Attack Vectors: Gender Stereotypes in Japanese and Korean Language Models.
Preprint (Jun. 2023). arXiv

MCML Authors
Link to website

Victor Steinborn

Statistical NLP and Deep Learning

Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[30]
V. Blaschke, H. Schütze and B. Plank.
A Survey of Corpora for Germanic Low-Resource Languages and Dialects.
NoDaLiDa 2023 - 24th Nordic Conference on Computational Linguistics. Tórshavn, Faroe Islands, May 22-24, 2023. URL
Abstract

Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available.

MCML Authors
Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[29]
X. Wang, L. Weissweiler, H. Schütze and B. Plank.
How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives.
EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia, May 02-06, 2023. DOI

MCML Authors
Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[28]
V. Blaschke, H. Schütze and B. Plank.
Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages.
VarDial @EACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects at the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). Dubrovnik, Croatia, May 02-06, 2023. DOI
Abstract

One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for instance be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures for the divergence in the tokenization of the source and target data, and the way they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor for model performance on target data.
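
The predictor itself is cheap to compute: the share of whitespace words that the tokenizer splits into more than one subword, compared between source and target data. The checkpoint and the two toy sentences below are placeholders.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # assumed PLM

def split_word_ratio(sentence):
    words = sentence.split()
    return sum(len(tokenizer.tokenize(word)) > 1 for word in words) / len(words)

source = "Das Kind spielt im Garten"   # standard German (toy)
target = "S Chind spilt im Garte"      # non-standard variety (toy)
print(abs(split_word_ratio(source) - split_word_ratio(target)))  # split word ratio difference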

MCML Authors
Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[27]
Y. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response.
Preprint (May. 2023). arXiv
Abstract

LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs show better human alignment than traditional reference-based evaluators, there are many challenges in using reference-free evaluators based on LLMs. Reference-free evaluators are more suitable for open-ended examples that admit semantically different responses. But not all examples are open-ended. For closed-ended examples with a unique correct semantic response, reference-free evaluators will still rate a response as high quality even when it is inconsistent with the facts and the semantics of the reference. In order to comprehensively evaluate the reliability of evaluators based on LLMs, we construct two adversarial meta-evaluation dialogue generation datasets, KdConv-ADV and DSTC7-ADV, based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging since they require evaluators to reasonably assess closed-ended examples with the help of external knowledge or even their own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient. There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.

MCML Authors
Link to website

Yongkang Liu

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[26]
A. Modarressi, A. Imani, M. Fayyaz and H. Schütze.
RET-LLM: Towards a General Read-Write Memory for Large Language Models.
Preprint (May. 2023). arXiv
Abstract

Large language models (LLMs) have significantly advanced the field of natural language processing (NLP) through their extensive parameters and comprehensive data utilization. However, existing LLMs lack a dedicated memory unit, limiting their ability to explicitly store and retrieve knowledge for various tasks. In this paper, we propose RET-LLM a novel framework that equips LLMs with a general write-read memory unit, allowing them to extract, store, and recall knowledge from the text as needed for task performance. Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets. The memory unit is designed to be scalable, aggregatable, updatable, and interpretable. Through qualitative evaluations, we demonstrate the superiority of our proposed framework over baseline approaches in question answering tasks. Moreover, our framework exhibits robust performance in handling temporal-based question answering tasks, showcasing its ability to effectively manage time-dependent information.

MCML Authors
Link to website

Ali Modarressi

Statistical NLP and Deep Learning

Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[25]
H. Ye, Y. Liu and H. Schütze.
A study of conceptual language similarity: comparison and evaluation.
Preprint (May. 2023). arXiv

MCML Authors
Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[24]
L. He, N. Otani, D. R. Mortensen, L. Levin and H. Schütze.
Construction Grammar Provides Unique Insight into Neural Language Models.
GURT 2023 - Georgetown University Round Table on Linguistics. Washington D.C., USA, Mar 09-12, 2023. URL

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


2022


[23]
J. Li, M. Zhao, Y. Xie, A. Maronikolakis, P. Pu and H. Schütze.
This joke is [MASK]: Recognizing Humor and Offense with Prompting.
TL4NLP @NeurIPS 2022 - 1st Transfer Learning for Natural Language Processing Workshop at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL
Abstract

Humor is a magnetic component in everyday human interactions and communications. Computationally modeling humor enables NLP systems to entertain and engage with users. We investigate the effectiveness of prompting, a new transfer learning paradigm for NLP, for humor recognition. We show that prompting performs similarly to finetuning when numerous annotations are available, but gives stellar performance in low-resource humor recognition. The relationship between humor and offense is also inspected by applying influence functions to prompting; we show that models could rely on offense to determine humor during transfer.
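
The cloze prompt from the title can be tried directly with the standard fill-mask pipeline; the checkpoint is a generic placeholder, not the model used in the paper.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
prompt = "Why did the chicken cross the road? To get to the other side. This joke is [MASK]."
for candidate in fill_mask(prompt, targets=["funny", "offensive"]):
    print(candidate["token_str"], round(candidate["score"], 4))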

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[22]
A. Imani, S. Severini, M. J. Sabet, F. Yvon and H. Schütze.
Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source languages to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.
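
A very small stand-in (pure Python, not the paper's implementation) for the data structure described above: words are nodes, alignment links are edges, and POS labels flow from annotated high-resource words to unannotated target words. The paper propagates labels with a Graph Neural Network; a single round of neighbourhood majority voting is shown here only to illustrate the graph.

from collections import Counter, defaultdict

# Nodes are (language, token) pairs; edges come from word alignment.
edges = [(("en", "house"), ("de", "Haus")),
         (("fr", "maison"), ("de", "Haus")),
         (("en", "green"), ("de", "grün"))]
labels = {("en", "house"): "NOUN", ("fr", "maison"): "NOUN", ("en", "green"): "ADJ"}

graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

def propagate(node):
    votes = Counter(labels[n] for n in graph[node] if n in labels)
    return votes.most_common(1)[0][0] if votes else None

print(propagate(("de", "Haus")))  # NOUN
print(propagate(("de", "grün")))  # ADJ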

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[21]
L. Weissweiler, V. Hofmann, A. Köksal and H. Schütze.
The better your Syntax, the better your Semantics? Probing Pretrained Language Models for the English Comparative Correlative.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[20]
A. Maronikolakis, P. Baader and H. Schütze.
Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes.
GeBNLP 2022 - 4th Workshop on Gender Bias in Natural Language Processing. Seattle, WA, USA, Jul 15, 2022. DOI
Abstract

To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it comes to analysis of bias, previous work has focused predominantly on race. In our work, we further investigate bias in hate speech datasets along racial, gender and intersectional axes. We identify strong bias against African American English (AAE), masculine and AAE+Masculine tweets, which are annotated as disproportionately more hateful and offensive than tweets from other demographics. We provide evidence that BERT-based models propagate this bias and show that balancing the training data for these protected attributes can lead to fairer models with regard to gender, but not race.
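
An illustrative sketch only (not the paper's analysis code): measuring how often examples from each demographic group are annotated as hateful, and balancing the training data by downsampling the over-represented groups. The field names and labels are hypothetical.

import random
from collections import defaultdict

def hateful_rate_by_group(examples):
    counts = defaultdict(lambda: [0, 0])  # group -> [hateful, total]
    for ex in examples:
        counts[ex["group"]][0] += int(ex["label"] == "hateful")
        counts[ex["group"]][1] += 1
    return {g: h / t for g, (h, t) in counts.items()}

def downsample_to_smallest_group(examples, seed=0):
    by_group = defaultdict(list)
    for ex in examples:
        by_group[ex["group"]].append(ex)
    n = min(len(v) for v in by_group.values())
    rng = random.Random(seed)
    balanced = []
    for group_examples in by_group.values():
        balanced.extend(rng.sample(group_examples, n))
    return balanced

examples = [{"text": "...", "group": "aae_masculine", "label": "hateful"},
            {"text": "...", "group": "other", "label": "not_hateful"},
            {"text": "...", "group": "other", "label": "not_hateful"}]
print(hateful_rate_by_group(examples))
print(len(downsample_to_smallest_group(examples)))  # 2 after balancing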

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[19]
S. Yuan, A. Maronikolakis and H. Schütze.
Separating Hate Speech and Offensive Language Classes via Adversarial Debiasing.
WOAH 2022 - 6th Workshop on Online Abuse and Harms. Seattle, WA, USA, Jul 14, 2022. DOI

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[18]
S. Severini, V. Hangya, M. J. Sabet, A. Fraser and H. Schütze.
Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings.
BUCC @LREC 2022 - 15th Workshop on Building and Using Comparable Corpora at the 13th International Conference on Language Resources and Evaluation (LREC 2022). Marseille, France, Jun 21-23, 2022. URL

MCML Authors
Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[17]
S. Severini, A. Imani, P. Dufter and H. Schütze.
Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages.
LREC 2022 - 13th International Conference on Language Resources and Evaluation. Marseille, France, Jun 21-23, 2022. URL

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[16]
V. Steinborn, P. Dufter, H. Jabbar and H. Schütze.
An Information-Theoretic Approach and Dataset for Probing Gender Stereotypes in Multilingual Masked Language Models.
NAACL 2022 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Seattle, WA, USA, Jun 10-15, 2022. DOI

MCML Authors
Link to website

Victor Steinborn

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[15]
M. Zhao, F. Mi, Y. Wang, M. Li, X. Jiang, Q. Liu and H. Schütze.
LMTurk: Few-Shot Learners as Crowdsourcing Workers in a Language-Model-as-a-Service Framework.
NAACL 2022 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Seattle, WA, USA, Jun 10-15, 2022. DOI

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[14]
L. Weissweiler, V. Hofmann, M. J. Sabet and H. Schütze.
CaMEL: Case Marker Extraction without Labels.
ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland, May 22-27, 2022. DOI

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[13]
S. Sharifzadeh, S. M. Baharlou, M. Schmitt, H. Schütze and V. Tresp.
Improving Scene Graph Classification by Exploiting Knowledge from Texts.
AAAI 2022 - 36th Conference on Artificial Intelligence. Virtual, Feb 22-Mar 01, 2022. DOI

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems & Data Mining


2021


[12]
Y. Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Schütze and Y. Goldberg.
Measuring and Improving Consistency in Pretrained Language Models.
Transactions of the Association for Computational Linguistics 9 (Dec. 2021). DOI
Abstract

Consistency of a model—that is, the invariance of its behavior under meaning-preserving alternations in its input—is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel, we show that the consistency of all PLMs we experiment with is poor— though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.
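
A small sketch of the consistency notion used above: a model is consistent on a relation if meaning-preserving paraphrases of the same query yield the same top-1 prediction. The predictor below is a toy stand-in for a cloze-style PLM query; the exact aggregation used in the paper may differ.

from itertools import combinations
from typing import Callable, List

def consistency(paraphrases: List[str], predict: Callable[[str], str]) -> float:
    """Fraction of paraphrase pairs with identical top-1 predictions."""
    answers = [predict(p) for p in paraphrases]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy stand-in predictor.
toy_predict = lambda q: "Paris" if "capital" in q else "France"
print(consistency(["The capital of France is [MASK].",
                   "France's capital is [MASK].",
                   "[MASK] is where the government of France sits."],
                  toy_predict))  # 0.33: the model is inconsistent across paraphrases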

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[11]
A. Imani, M. J. Sabet, L. K. Senel, P. Dufter, F. Yvon and H. Schütze.
Graph Algorithms for Multiparallel Word Alignment.
EMNLP 2021 - Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic, Nov 07-11, 2021. DOI
Abstract

With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and pretrained language and machine translation models that are truly multilingual are proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our experimental results show absolute improvements in F1 of up to 28% over the baseline bilingual word aligner in different datasets.
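
A toy illustration (not the paper's implementation) of predicting additional alignment edges in a multiparallel graph with a common-neighbour score, one of the simplest link-prediction heuristics in the spirit of the network-based variant mentioned above.

from collections import defaultdict

edges = [("en:house", "de:Haus"), ("en:house", "fr:maison"),
         ("fr:maison", "es:casa"), ("de:Haus", "es:casa")]

neigh = defaultdict(set)
for u, v in edges:
    neigh[u].add(v)
    neigh[v].add(u)

def common_neighbour_score(u: str, v: str) -> int:
    return len(neigh[u] & neigh[v])

# "en:house" and "es:casa" are not directly aligned, but share two aligned
# neighbours, suggesting a missing alignment edge.
print(common_neighbour_score("en:house", "es:casa"))  # 2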

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to website

Lütfi Kerem Şenel

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[10]
N. Kassner, O. Tafjord, H. Schütze and P. Clark.
BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief.
EMNLP 2021 - Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic, Nov 07-11, 2021. DOI
Abstract

Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually “believes” about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our approach is to embed a PTLM in a broader system that also includes an evolving, symbolic memory of beliefs – a BeliefBank – that records but then may modify the raw PTLM answers. We describe two mechanisms to improve belief consistency in the overall system. First, a reasoning component – a weighted MaxSAT solver – revises beliefs that significantly clash with others. Second, a feedback component issues future queries to the PTLM using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms result in more consistent beliefs in the overall system, improving both the accuracy and consistency of its answers over time. This is significant as it is a first step towards PTLM-based architectures with a systematic notion of belief, enabling them to construct a more coherent picture of the world, and improve over time without model retraining.
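
A minimal sketch of the BeliefBank idea: raw PLM answers are recorded with a confidence, and a constraint-checking step revises clashing beliefs. The paper uses a weighted MaxSAT solver and a feedback loop; here a clash is resolved simply by dropping the lower-confidence belief, purely for illustration.

beliefs = {"a swallow is a bird": 0.9,
           "a swallow is a mammal": 0.6}
# Mutual-exclusion constraints between statements.
constraints = [("a swallow is a bird", "a swallow is a mammal")]

def revise(beliefs, constraints):
    revised = dict(beliefs)
    for a, b in constraints:
        if a in revised and b in revised:
            weaker = a if revised[a] < revised[b] else b
            del revised[weaker]  # keep the higher-confidence belief
    return revised

print(revise(beliefs, constraints))  # keeps 'a swallow is a bird'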

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[9]
A. Imani, M. J. Sabet, P. Dufter, M. Cysouw and H. Schütze.
ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus.
ACL-IJCNLP 2021 - Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Bangkok, Thailand, Aug 01-06, 2021. DOI
Abstract

With more than 7000 languages worldwide, multilingual natural language processing (NLP) is essential both from an academic and commercial perspective. Researching typological properties of languages is fundamental for progress in multilingual NLP. Examples include assessing language similarity for effective transfer learning, injecting inductive biases into machine learning models or creating resources such as dictionaries and inflection tables. We provide ParCourE, an online tool that allows users to browse a word-aligned parallel corpus, covering 1334 languages. We give evidence that this is useful for typological research. ParCourE can be set up for any parallel corpus and can thus be used for typological research on other corpora as well as for exploring their quality and properties.

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[8]
P. Dufter, N. Kassner and H. Schütze.
Static Embeddings as Efficient Knowledge Bases?.
NAACL 2021 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Virtual, Jun 06-11, 2021. DOI
Abstract

Recent research investigates factual knowledge stored in large pretrained language models (PLMs). Instead of structural knowledge base (KB) queries, masked sentences such as ‘Paris is the capital of [MASK]’ are used as probes. The good performance on this analysis task has been interpreted as PLMs becoming potential repositories of factual knowledge. In experiments across ten linguistically diverse languages, we study knowledge contained in static embeddings. We show that, when restricting the output space to a candidate set, simple nearest neighbor matching using static embeddings performs better than PLMs. E.g., static embeddings perform 1.6 percentage points better than BERT while using just 0.3% of the energy for training. One important factor in their good comparative performance is that static embeddings are standardly learned for a large vocabulary. In contrast, BERT exploits its more sophisticated, but expensive ability to compose meaningful representations from a much smaller subword vocabulary.
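
A sketch of the probing setup described above: given a query word, answer by nearest-neighbour matching with static word embeddings, restricting the output space to a candidate set. The vectors below are random stand-ins for pretrained embeddings, so the printed answer is arbitrary; with real embeddings the nearest candidate is returned.

import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in ["Paris", "France", "Berlin", "Germany"]}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query_word, candidates):
    # 'Paris is the capital of [MASK]' with the output space restricted to countries.
    q = emb[query_word]
    return max(candidates, key=lambda c: cosine(q, emb[c]))

print(answer("Paris", ["France", "Germany"]))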

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[7]
N. Kassner, P. Dufter and H. Schütze.
Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models.
EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics. Virtual, Apr 19-23, 2021. DOI
Abstract

Recently, it has been found that monolingual English language models can be used as knowledge bases. Instead of structural knowledge base queries, masked sentences such as “Paris is the capital of [MASK]” are used as probes. We translate the established benchmarks TREx and GoogleRE into 53 languages. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowledge base? Most prior work only considers English. Extending research to multiple languages is important for diversity and accessibility. (ii) Is mBERT’s performance as knowledge base language-independent or does it vary from language to language? (iii) A multilingual model is trained on more text, e.g., mBERT is trained on 104 Wikipedias. Can mBERT leverage this for better performance? We find that using mBERT as a knowledge base yields varying performance across languages and pooling predictions across languages improves performance. Conversely, mBERT exhibits a language bias; e.g., when queried in Italian, it tends to predict Italy as the country of origin.
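
A sketch of the pooling idea mentioned above: the same cloze query is asked in several languages and candidate scores are aggregated, which the paper finds improves over any single language. The scores below are made-up placeholders, not model outputs.

from collections import defaultdict

# language -> {candidate: probability} for "Paris is the capital of [MASK]".
per_language = {
    "en": {"France": 0.6, "Italy": 0.2},
    "de": {"France": 0.5, "Italy": 0.3},
    "it": {"Italy": 0.55, "France": 0.35},  # illustrating a language bias
}

pooled = defaultdict(float)
for scores in per_language.values():
    for candidate, p in scores.items():
        pooled[candidate] += p

print(max(pooled, key=pooled.get))  # France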

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


2020


[6]
E. Asgari, M. J. Sabet, P. Dufter, C. Ringlstetter and H. Schütze.
Subword Sampling for Low Resource Word Alignment.
Preprint (Dec. 2020). arXiv
Abstract

Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most existing word alignment methods are designed for high-resource machine translation settings where millions of parallel sentences are available. This amount shrinks to a few thousand sentences for low-resource languages, which causes the established IBM models to fail. In this paper, we propose subword sampling-based alignment of text units. The hypothesis behind this method is that aggregating different granularities of text for certain language pairs can help word-level alignment. For languages for which gold-standard alignments exist, we propose an iterative Bayesian optimization framework that selects subwords from the space of possible subword representations of the source and target sentences. We show that the subword sampling method consistently outperforms word-level alignment on six language pairs: English-German, English-French, English-Romanian, English-Persian, English-Hindi, and English-Inuktitut. In addition, we show that the hyperparameters learned for certain language pairs can be applied to other languages without supervision and consistently improve the alignment results. We observe that with 5K parallel sentences and our subword sampling approach, we obtain F1 scores similar to those achieved with 100K parallel sentences by existing word-level fast-align/eflomal alignment methods.
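
An illustrative sketch only: alignments obtained under different subword granularities are projected back to word level and aggregated by voting, loosely following the aggregation idea in the abstract (the actual method samples subword segmentations and optimises over them with Bayesian optimisation). The vote threshold is an assumption.

from collections import Counter

# Word-level alignment links (src_index, tgt_index) from three granularities.
alignments_per_granularity = [
    {(0, 0), (1, 1), (2, 2)},
    {(0, 0), (1, 2), (2, 2)},
    {(0, 0), (1, 1), (2, 2)},
]

votes = Counter(link for links in alignments_per_granularity for link in links)
min_votes = 2  # keep links supported by a majority of granularities
merged = {link for link, c in votes.items() if c >= min_votes}
print(sorted(merged))  # [(0, 0), (1, 1), (2, 2)]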

MCML Authors
Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[5]
N. Kassner, B. Krojer and H. Schütze.
Are Pretrained Language Models Symbolic Reasoners over Knowledge?.
CoNLL 2020 - 24th Conference on Computational Natural Language Learning. Virtual, Nov 19-20, 2020. DOI
Abstract

How can pretrained language models (PLMs) learn factual knowledge from the training set? We investigate the two most important mechanisms: reasoning and memorization. Prior work has attempted to quantify the number of facts PLMs learn, but we present, using synthetic data, the first study that investigates the causal relation between facts present in training and facts learned by the PLM. For reasoning, we show that PLMs seem to learn to apply some symbolic reasoning rules correctly but struggle with others, including two-hop reasoning. Further analysis suggests that even the application of learned reasoning rules is flawed. For memorization, we identify schema conformity (facts systematically supported by other facts) and frequency as key factors for its success.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[4]
N. Kassner and H. Schütze.
BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA.
EMNLP 2020 - Findings of the Conference on Empirical Methods in Natural Language Processing. Virtual, Nov 16-20, 2020. DOI
Abstract

Khandelwal et al. (2020) use a k-nearest-neighbor (kNN) component to improve language model performance. We show that this idea is beneficial for open-domain question answering (QA). To improve the recall of facts encountered during training, we combine BERT (Devlin et al., 2019) with a traditional information retrieval step (IR) and a kNN search over a large datastore of an embedded text collection. Our contributions are as follows: i) BERT-kNN outperforms BERT on cloze-style QA by large margins without any further training. ii) We show that BERT often identifies the correct response category (e.g., US city), but only kNN recovers the factually correct answer (e.g.,“Miami”). iii) Compared to BERT, BERT-kNN excels for rare facts. iv) BERT-kNN can easily handle facts not covered by BERT’s training set, e.g., recent events.
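
A sketch of the kNN component only: a datastore maps embedded contexts to the answers observed with them, and a query embedding retrieves the k nearest entries. The embeddings here are random placeholders; in BERT-kNN the keys come from BERT representations of an IR-retrieved text collection, and the retrieved answers are combined with BERT's own prediction.

import numpy as np

rng = np.random.default_rng(1)
datastore_keys = rng.standard_normal((1000, 128))            # embedded contexts
datastore_vals = [f"answer_{i % 50}" for i in range(1000)]   # associated answers

def knn_answers(query_vec, k=8):
    keys = datastore_keys / np.linalg.norm(datastore_keys, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = keys @ q                       # cosine similarity to every key
    top = np.argsort(-sims)[:k]
    return [datastore_vals[i] for i in top]

print(knn_answers(rng.standard_normal(128)))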

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[3]
N. Kassner and H. Schütze.
Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly.
ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics. Virtual, Jul 05-10, 2020. DOI
Abstract

Building on Petroni et al. 2019, we propose two new probing tasks analyzing factual knowledge stored in Pretrained Language Models (PLMs). (1) Negation. We find that PLMs do not distinguish between negated (‘‘Birds cannot [MASK]”) and non-negated (‘‘Birds can [MASK]”) cloze questions. (2) Mispriming. Inspired by priming methods in human psychology, we add “misprimes” to cloze questions (‘‘Talk? Birds can [MASK]”). We find that PLMs are easily distracted by misprimes. These results suggest that PLMs still have a long way to go to adequately learn human-like factual knowledge.
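
A tiny sketch of the two probe constructions named above, built from a cloze template; the concrete wording follows the examples in the abstract, and the helper names are illustrative.

def negate(cloze: str) -> str:
    # "Birds can [MASK]." -> "Birds cannot [MASK]."
    return cloze.replace(" can ", " cannot ", 1)

def misprime(cloze: str, misprime_word: str) -> str:
    # "Birds can [MASK]." -> "Talk? Birds can [MASK]."
    return f"{misprime_word.capitalize()}? {cloze}"

base = "Birds can [MASK]."
print(negate(base))            # Birds cannot [MASK].
print(misprime(base, "talk"))  # Talk? Birds can [MASK].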

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[2]
A. Beyer, G. Kauermann and H. Schütze.
Embedding Space Correlation as a Measure of Domain Similarity.
LREC 2020 - 12th International Conference on Language Resources and Evaluation. Marseille, France, May 13-15, 2020. URL
Abstract

Prior work has determined domain similarity using text-based features of a corpus. However, when using pre-trained word embeddings, the underlying text corpus might not be accessible anymore. Therefore, we propose the CCA measure, a new measure of domain similarity based directly on the dimension-wise correlations between corresponding embedding spaces. Our results suggest that an inherent notion of domain can be captured this way, as we are able to reproduce our findings for different domain comparisons for English, German, Spanish and Czech as well as in cross-lingual comparisons. We further find a threshold at which the CCA measure indicates that two corpora come from the same domain in a monolingual setting by applying permutation tests. By evaluating the usability of the CCA measure in a domain adaptation application, we also show that it can be used to determine which corpora are more similar to each other in a cross-domain sentiment detection task.
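
A rough numpy sketch of a dimension-wise correlation measure between two embedding spaces trained on different corpora, in the spirit of the CCA measure described above; details such as how dimensions are matched and aggregated are simplified assumptions.

import numpy as np

def dimensionwise_correlation(A: np.ndarray, B: np.ndarray) -> float:
    """A, B: (vocab, dim) embeddings over the same shared vocabulary."""
    A = (A - A.mean(axis=0)) / A.std(axis=0)
    B = (B - B.mean(axis=0)) / B.std(axis=0)
    per_dim = (A * B).mean(axis=0)        # Pearson r for each dimension
    return float(np.abs(per_dim).mean())  # aggregate over dimensions

rng = np.random.default_rng(2)
emb_domain1 = rng.standard_normal((500, 100))
emb_domain2 = emb_domain1 + 0.5 * rng.standard_normal((500, 100))  # similar domain
print(dimensionwise_correlation(emb_domain1, emb_domain2))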

MCML Authors
Link to Profile Göran Kauermann

Göran Kauermann

Prof. Dr.

Applied Statistics in Social Sciences, Economics and Business

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[1]
J. Jungmaier, N. Kassner and B. Roth.
Dirichlet-Smoothed Word Embeddings for Low-Resource Settings.
LREC 2020 - 12th International Conference on Language Resources and Evaluation. Marseille, France, May 13-15, 2020. URL
Abstract

Nowadays, classical count-based word embeddings using positive pointwise mutual information (PPMI) weighted co-occurrence matrices have been widely superseded by machine-learning-based methods like word2vec and GloVe. But these methods are usually applied using very large amounts of text data. In many cases, however, there is not much text data available, for example for specific domains or low-resource languages. This paper revisits PPMI by adding Dirichlet smoothing to correct its bias towards rare words. We evaluate on standard word similarity data sets and compare to word2vec and the recent state of the art for low-resource settings: Positive and Unlabeled (PU) Learning for word embeddings. The proposed method outperforms PU-Learning for low-resource settings and obtains competitive results for Maltese and Luxembourgish.
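
A sketch of PPMI with additive (Dirichlet-style) pseudo-counts on the co-occurrence matrix, which dampens the bias of plain PPMI towards rare words; the smoothing constant and the toy counts are illustrative, not taken from the paper.

import numpy as np

def smoothed_ppmi(cooc: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    counts = cooc + alpha                      # add-alpha (Dirichlet) smoothing
    total = counts.sum()
    p_wc = counts / total                      # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)      # word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)      # context marginals
    pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)                # keep only positive PMI

cooc = np.array([[10, 0, 2],
                 [0, 5, 1],
                 [2, 1, 8]], dtype=float)
print(smoothed_ppmi(cooc).round(2))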

MCML Authors