Research Group Alexander Fraser

Alexander Fraser

Prof. Dr.

Principal Investigator

B2 | Natural Language Processing

Data Analytics & Statistics

Alexander Fraser holds the Chair for Data Analytics & Statistics at TU Munich.

He is renowned for his work in machine learning approaches to machine translation, language modeling, and multilingual natural language processing. He focuses on addressing data sparsity and integrating linguistic and world knowledge in AI systems. Additionally, he collaborates with language communities to develop technology for their languages. His contributions to natural language processing and machine learning emphasize both theoretical advancements and practical applications.

Team members @MCML

PostDocs

Daryna Dementieva

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Lukas Edman

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Shu Okabe

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

PhD Students

Faeze Ghorbanpour

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Katharina Hämmerl

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Wen Lai

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Tsedeniya Kinfe Temesgen

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Recent News @MCML

25.07.2025

MCML Researchers With 36 Papers at ACL 2025

10.07.2025

MCML Researchers With 24 Papers at ICML 2025

28.04.2025

MCML Researchers With Twelve Papers at NAACL 2025

06.11.2024

MCML Researchers With 22 Papers at EMNLP 2024

16.10.2024

Alexander Fraser Receives EU Funding for Research on LLMs

Publications @MCML

2025

[43]

F. Friedrich, K. Hämmerl, P. Schramowski, M. Brack, J. Libovicky, K. Kersting and A. Fraser.
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Text-to-image generation models have recently achieved astonishing results in image quality, flexibility, and text alignment, and are consequently employed in a fast-growing number of applications. Through improvements in multilingual abilities, a larger community now has access to this technology. However, our results show that multilingual models suffer from significant gender biases just as monolingual models do. Furthermore, the natural expectation that multilingual models will provide similar results across languages does not hold up. Instead, there are important differences between languages. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. We use MAGBIG to investigate the effect of multilingualism on gender bias in T2I models. To this end, we construct multilingual prompts requesting portraits of people with a certain occupation or trait. Our results show that not only do models exhibit strong gender biases but they also behave differently across languages. Furthermore, we investigate prompt engineering strategies, such as indirect, neutral formulations, to mitigate these biases. Unfortunately, these approaches have limited success and result in worse text-to-image alignment. Consequently, we call for more research into diverse representations across languages in image generators, as well as into steerability to address biased model behavior.

MCML Authors

Katharina Hämmerl

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[42]

L. Kinder, L. Edman, A. Fraser and T. Käfer.
Positional Overload: Positional Debiasing and Context Window Extension for Large Language Models using Set Encoding.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

Large Language Models (LLMs) typically track the order of tokens using positional encoding, which causes the following problems: positional bias, where the model is influenced by an ordering within the prompt, and a fixed context window, as models struggle to generalize to positions beyond those encountered during training. To address these limitations, we developed a novel method called set encoding. This method allows multiple pieces of text to be encoded in the same position, thereby eliminating positional bias entirely. Another promising use case for set encoding is to increase the size of the input an LLM can handle. Our experiments demonstrate that set encoding allows an LLM to solve tasks with far more tokens than without set encoding. To our knowledge, set encoding is the first technique to effectively extend an LLM’s context window without requiring any additional training.
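
The idea of encoding several pieces of text "in the same position" can be illustrated with a minimal sketch of position-id assignment. This is not the authors' implementation: the function name `set_encoding_position_ids`, the prefix/set/suffix layout, and the choice to continue the suffix after the longest set element are all assumptions made for illustration, and any attention-masking details are left out.

```python
def set_encoding_position_ids(prefix_len, set_lens, suffix_len):
    """Assign position ids so that every set element starts at the same position.

    Hypothetical layout: a shared prefix, a set of text pieces, and a shared suffix.
    """
    # The prefix is numbered as usual: 0, 1, 2, ...
    prefix = list(range(prefix_len))
    # Every set element restarts right after the prefix, so the ids do not
    # depend on the order in which the elements are listed.
    elements = [list(range(prefix_len, prefix_len + n)) for n in set_lens]
    # Assumption of this sketch: the suffix continues after the longest element.
    suffix_start = prefix_len + max(set_lens, default=0)
    suffix = list(range(suffix_start, suffix_start + suffix_len))
    return prefix, elements, suffix


prefix, elements, suffix = set_encoding_position_ids(3, [2, 4], 2)
print(prefix)    # [0, 1, 2]
print(elements)  # [[3, 4], [3, 4, 5, 6]]
print(suffix)    # [7, 8]
```

In this toy layout, 11 tokens occupy positions 0 to 8 instead of 0 to 10, which hints at why sharing positions can stretch the usable input size.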

MCML Authors

Lukas Edman

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[41]

S. Okabe, K. Hämmerl and A. Fraser.
Improving Parallel Sentence Mining for Low-Resource and Endangered Languages.
ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

While parallel sentence mining has been extensively covered for fairly well-resourced languages, pairs involving low-resource languages have received comparatively little attention. To address this gap, we present Belopsem, a benchmark of new datasets for parallel sentence mining on three language pairs where the source side is low-resource and endangered: Occitan-Spanish, Upper Sorbian-German, and Chuvash-Russian. These combinations also reflect varying linguistic similarity within each pair. We compare three language models in an established parallel sentence mining pipeline and apply two types of improvements to one of them, Glot500. We observe better mining quality overall by both applying alignment post-processing with an unsupervised aligner and using a cluster-based isotropy enhancement technique. These findings are crucial for optimising parallel data extraction for low-resource languages in a realistic way.

MCML Authors

Shu Okabe

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Katharina Hämmerl

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[40]

L. Edman, H. Schmid and A. Fraser.
EXECUTE: A Multilingual Benchmark for LLM Token Understanding.
ACL 2025 - Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025. URL

Abstract

The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs’ understanding of character components.

MCML Authors

Lukas Edman

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[39]

W. Lai, A. Fraser and I. Titov.
Joint Localization and Activation Editing for Low-Resource Fine-Tuning.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv

Abstract

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing techniques, which modify the activations of specific model components. These methods, due to their extremely small parameter counts, show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves, namely the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods.
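
The kind of intervention described here, an additive offset and a multiplicative scaling applied to a head output, can be sketched in a few lines. This is only an illustration: the function `edit_head_output`, the fixed scalar gates, and the exact functional form `(1 + gate_mul * m) * z + gate_add * a` are assumptions, and the joint learning of gates and vectors that the abstract describes is not shown.

```python
def edit_head_output(z, add_vec, mul_vec, gate_add, gate_mul):
    """Apply a gated additive offset and multiplicative scaling to one head output.

    z        : head output (list of floats)
    add_vec  : additive offset vector (hypothetical name)
    mul_vec  : multiplicative scaling vector (hypothetical name)
    gate_add : gate in [0, 1] switching the additive part on or off
    gate_mul : gate in [0, 1] switching the multiplicative part on or off
    """
    return [
        (1.0 + gate_mul * m) * x + gate_add * a
        for x, a, m in zip(z, add_vec, mul_vec)
    ]


z = [1.0, 2.0]
a = [0.5, -0.5]
m = [0.5, -0.25]

print(edit_head_output(z, a, m, gate_add=0.0, gate_mul=0.0))  # head left unedited
print(edit_head_output(z, a, m, gate_add=1.0, gate_mul=0.0))  # additive only
print(edit_head_output(z, a, m, gate_add=0.0, gate_mul=1.0))  # multiplicative only
```

With both gates at zero the head passes through unchanged, which is how a gate of this form could express the decision not to edit a head at all.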

MCML Authors

Wen Lai

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[38]

F. Ghorbanpour, T. Z. Malaguth and A. Akbaritabar.
Differentiating Emigration from Return Migration of Scholars Using Name-Based Nationality Detection Models.
ICWSM 2025 - 19th International AAAI Conference on Web and Social Media. Copenhagen, Denmark, Jun 23-26, 2025. DOI

Abstract

Most web and digital trace data do not include information about an individual’s nationality due to privacy concerns. The lack of data on nationality can create challenges for migration research. It can lead to a left-censoring issue since we are uncertain about the migrant’s country of origin. Once we observe an emigration event, if we know the nationality, we can differentiate it from return migration. We propose methods to detect the nationality with the least available data, i.e., full names. We use the detected nationality in comparison with the country of academic origin, which is a common approach in studying the migration of researchers. We gathered 2.6 million unique name-nationality pairs from Wikipedia and categorized them into families of nationalities with three granularity levels to use as our training data. Using a character-based machine learning model, we achieved a weighted F1 score of 84% for the broadest and 67% for the most granular, country-level categorization. In our empirical study, we used the trained and tested model to assign nationality to 8+ million scholars’ full names in Scopus data. Our results show that using the country of first publication as a proxy for nationality underestimates the size of return flows, especially for countries with a more diverse academic workforce, such as the USA, Australia, and Canada. We found that around 48% of emigration from the USA was return migration once we used the country of name origin in contrast to 33% based on academic origin. In the most recent period, 79% of scholars whose affiliation has consistently changed from the USA to China, and are considered emigrants, have Chinese names in contrast to 41% with a Chinese academic origin. Our proposed methods in addressing left-censoring issues are beneficial for other research that uses digital trace data to study migration.
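
For readers unfamiliar with the metric, a weighted F1 score is commonly computed as the average of per-class F1 scores, weighted by how many true examples each class has. The sketch below shows that standard definition on toy labels; the function name `weighted_f1` and the labels "A" and "B" are placeholders, and nothing here reproduces the paper's data or model.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its share of the true labels.
        score += support[label] / total * f1
    return score


y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.7667
```

Because frequent classes dominate the average, a weighted F1 can look better than a macro-averaged one when rare classes are harder to predict.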

MCML Authors

Faeze Ghorbanpour

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

[37]

D. Dementieva, N. Babakov and A. Fraser.
EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian.
Preprint (May. 2025). arXiv

Abstract

While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the guidelines of previous English-centric work on emotion detection (Mohammad et al., 2018; Mohammad, 2022). The dataset was created through crowdsourcing using the this http URL platform, ensuring a high-quality annotation process. Then, we evaluate a range of approaches on the collected dataset, starting from linguistic-based baselines, synthetic data translated from English, to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.

MCML Authors

Daryna Dementieva

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[36]

F. Ghorbanpour, D. Dementieva and A. Fraser.
Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study.
Preprint (May. 2025). arXiv

Abstract

Despite growing interest in automated hate speech detection, most existing approaches overlook the linguistic diversity of online content. Multilingual instruction-tuned large language models such as LLaMA, Aya, Qwen, and BloomZ offer promising capabilities across languages, but their effectiveness in identifying hate speech through zero-shot and few-shot prompting remains underexplored. This work evaluates LLM prompting-based detection across eight non-English languages, utilizing several prompting techniques and comparing them to fine-tuned encoder models. We show that while zero-shot and few-shot prompting lag behind fine-tuned encoder models on most of the real-world evaluation sets, they achieve better generalization on functional tests for hate speech detection. Our study also reveals that prompt design plays a critical role, with each language often requiring customized prompting techniques to maximize performance.

MCML Authors

Faeze Ghorbanpour

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Daryna Dementieva

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[35]

F. Ghorbanpour, D. Dementieva and A. Fraser.
Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data.
Preprint (May. 2025). arXiv

Abstract

Despite the importance of detecting hateful language, labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language data. Furthermore, in most cases, our method surpasses the current state-of-the-art. Notably, our approach is highly data-efficient, retrieving as few as 200 instances in some cases while maintaining superior performance. Moreover, it is scalable, as the retrieval pool can be easily expanded, and the method can be readily adapted to new languages and tasks. We also apply maximum marginal relevance to mitigate redundancy and filter out highly similar retrieved instances, resulting in improvements in some languages.
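
Maximum marginal relevance (MMR), mentioned at the end of this abstract, is a general selection rule that trades relevance to a query against redundancy with items already chosen. The sketch below shows the textbook greedy form on toy 2-d vectors; the function names, the cosine similarity, the `lam` trade-off value, and the vectors themselves are illustrative assumptions rather than the paper's setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedily pick k candidate indices by maximum marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.2]
pool = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
print(mmr_select(query, pool, k=2, lam=1.0))  # [1, 0]: relevance only
print(mmr_select(query, pool, k=2, lam=0.5))  # [1, 2]: near-duplicate skipped
```

With `lam=1.0` the rule reduces to plain nearest-neighbor retrieval, while lower values push the second pick away from the near-duplicate of the first.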

MCML Authors

Faeze Ghorbanpour

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Daryna Dementieva

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[34]

A. Karamolegkou, A. Borah, E. Cho, S. R. Choudhury, M. Galletti, R. Ghosh, P. Gupta, O. Ignat, P. Kargupta, N. Kotonya, H. Lamba, S.-J. Lee, A. Mangla, I. Mondal, D. Nazarova, P. Nemkova, D. Pisarevskaya, N. Rizwan, N. Sabri, D. Stammbach, A. Steinberg, D. Tomás, S. R. Wilson, B. Yi, J. H. Zhu, A. Zubiaga, A. Søgaard, A. Fraser, Z. Jin, R. Mihalcea, J. R. Tetreault and D. Dementieva.
NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment.
Preprint (May. 2025). arXiv

Abstract

Recent advancements in large language models (LLMs) have unlocked unprecedented possibilities across a range of applications. However, as a community, we believe that the field of Natural Language Processing (NLP) has a growing need to approach deployment with greater intentionality and responsibility. In alignment with the broader vision of AI for Social Good (Tomašev et al., 2020), this paper examines the role of NLP in addressing pressing societal challenges. Through a cross-disciplinary analysis of social goals and emerging risks, we highlight promising research directions and outline challenges that must be addressed to ensure responsible and equitable progress in NLP4SG research.

MCML Authors

Anna Steinberg

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

Daryna Dementieva

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

[33]

Y. Shen, W. Lai, S. Wang, K. Luo, A. Fraser and M. Sun.
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora.
Preprint (May. 2025). arXiv

Abstract

Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.

MCML Authors

Wen Lai

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[32]

F. Ghorbanpour, V. Hangya and A. Fraser.
Fine-Grained Transfer Learning for Harmful Content Detection through Label-Specific Soft Prompt Tuning.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI

Abstract

The spread of harmful content online is a dynamic issue evolving over time. Existing detection models, reliant on static data, are becoming less effective and generalizable. Developing new models requires sufficient up-to-date data, which is challenging. A potential solution is to combine existing datasets with minimal new data. However, detection tasks vary: some focus on hate speech, offensive, or abusive content, which differ in the intent to harm, while others focus on identifying targets of harmful speech such as racism, sexism, etc., which raises the challenge of handling nuanced class differences. To address these issues, we introduce a novel transfer learning method that leverages class-specific knowledge to enhance harmful content detection. In our approach, we first present label-specific soft prompt tuning, which captures and represents class-level information. Secondly, we propose two approaches to transfer this fine-grained knowledge from source (existing tasks) to target (unseen and new tasks): initializing the target task prompts from source prompts and using an attention mechanism that learns and adjusts attention scores to utilize the most relevant information from source prompts. Experiments demonstrate significant improvements in harmful content detection across English and German datasets, highlighting the effectiveness of label-specific representations and knowledge transfer.

MCML Authors

Faeze Ghorbanpour

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[31]

K. Hämmerl, T. Limisiewicz, J. Libovický and A. Fraser.
Beyond Literal Token Overlap: Token Alignability for Multilinguality.
NAACL 2025 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. DOI

Abstract

Previous work has considered token overlap, or even similarity of token distributions, as predictors for multilinguality and cross-lingual knowledge transfer in language models. However, these very literal metrics assign large distances to language pairs with different scripts, which can nevertheless show good cross-linguality. This limits the explanatory strength of token overlap for knowledge transfer between language pairs that use distinct scripts or follow different orthographic conventions. In this paper, we propose subword token alignability as a new way to understand the impact and quality of multilingual tokenisation. In particular, this metric predicts multilinguality much better when scripts are disparate and the overlap of literal tokens is low. We analyse this metric in the context of both encoder and decoder models, look at data size as a potential distractor, and discuss how this insight may be applied to multilingual tokenisation in future work. We recommend our subword token alignability metric for identifying optimal language pairs for cross-lingual transfer, as well as to guide the construction of better multilingual tokenisers in the future. We publish our code and reproducibility details.

MCML Authors

Katharina Hämmerl

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[30]

S. Okabe and A. Fraser.
Bilingual Sentence Mining for Low-Resource Languages: a Case Study on Upper and Lower Sorbian.
Compute-EL @ICLDC 2025 - 8th Workshop on The Use of Computational Methods in the Study of Endangered Languages at the 9th International Conference on Language Documentation and Conservation (ICLDC 2025). Honolulu, Hawaii, USA, Mar 06-06, 2025. URL

Abstract

Parallel sentence mining is crucial for downstream tasks such as Machine Translation, especially for low-resource languages, where such resources are scarce. In this context, we apply a pipeline approach with contextual embeddings on two endangered Slavic languages spoken in Germany, Upper and Lower Sorbian, to evaluate mining quality. To this end, we compare off-the-shelf multilingual language models and word encoders pre-trained on Upper Sorbian to understand their impact on sentence mining. Moreover, to filter out irrelevant pairs, we experiment with a post-processing of mined sentences through an unsupervised word aligner based on word embeddings. We observe the usefulness of additional pre-training in Upper Sorbian, which leads to direct improvements when mining the same language but also its related language, Lower Sorbian.

MCML Authors

Shu Okabe

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[29]

Y. Shen, W. Lai, S. Wang, X. Zhang, K. Luo, A. Fraser and M. Sun.
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection.
Preprint (Feb. 2025). arXiv

Abstract

The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and clean multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus built using newly extracted Common Crawl data and existing multilingual datasets. DCAD-2000 includes over 2,282 languages, 46.72TB of data, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of current data cleaning methods, which rely on manual heuristic thresholds, we propose reframing data cleaning as an anomaly detection task. This dynamic filtering approach significantly enhances data quality by identifying and removing noisy or anomalous content. We evaluate the quality of DCAD-2000 on the FineTask benchmark, demonstrating substantial improvements in multilingual dataset quality and task performance.

MCML Authors

Wen Lai

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[28]

Y. Zhang, V. Hangya and A. Fraser.
LLM Sensitivity Challenges in Abusive Language Detection: Instruction-Tuned vs. Human Feedback.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, United Arab Emirates, Jan 19-24, 2025. URL

Abstract

The capacity of large language models (LLMs) to understand and distinguish socially unacceptable texts enables them to play a promising role in abusive language detection. However, various factors can affect their sensitivity. In this work, we test whether LLMs have an unintended bias in abusive language detection, i.e., whether they predict more or less of a given abusive class than expected in zero-shot settings. Our results show that instruction-tuned LLMs tend to under-predict positive classes, since datasets used for tuning are dominated by the negative class. On the contrary, models fine-tuned with human feedback tend to be overly sensitive. In an exploratory approach to mitigate these issues, we show that label frequency in the prompt helps with the significant over-prediction.

MCML Authors

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

2024

[27]

M. Di Marco and A. Fraser.
Subword Segmentation in LLMs: Looking at Inflection and Consistency.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

The role of subword segmentation in relation to capturing morphological patterns in LLMs is currently not well explored. Ideally, one would train models like GPT using various segmentations and evaluate how well word meanings are captured. Since this is not computationally feasible, we group words according to their segmentation properties and compare how well a model can solve a linguistic task for these groups. We study two criteria: (i) adherence to morpheme boundaries and (ii) the segmentation consistency of the different inflected forms of a lemma. We select word forms with high and low values for these criteria and carry out experiments on GPT-4o’s ability to capture verbal inflection for 10 languages. Our results indicate that in particular the criterion of segmentation consistency can help to predict the model’s ability to recognize and generate the lemma from an inflected form, providing evidence that subword segmentation is relevant.

MCML Authors

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[26]

L. Edman, H. Schmid and A. Fraser.
CUTE: Measuring LLMs’ Understanding of Their Tokens.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.

MCML Authors

Lukas Edman

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[25]

W. Lai, V. Hangya and A. Fraser.
Style-Specific Neurons for Steering LLMs in Text Style Transfer.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

Text style transfer (TST) aims to modify the style of a text without altering its original meaning. Large language models (LLMs) demonstrate superior performance across multiple tasks, including TST. However, in zero-shot setups, they tend to directly copy a significant portion of the input text to the output without effectively changing its style. To enhance the stylistic variety and fluency of the text, we present sNeuron-TST, a novel approach for steering LLMs using style-specific neurons in TST. Specifically, we identify neurons associated with the source and target styles and deactivate source-style-only neurons to give target-style words a higher probability, aiming to enhance the stylistic diversity of the generated text. However, we find that this deactivation negatively impacts the fluency of the generated text, which we address by proposing an improved contrastive decoding method that accounts for rapid token probability shifts across layers caused by deactivated source-style neurons. Empirical experiments demonstrate the effectiveness of the proposed method on six benchmarks, encompassing formality, toxicity, politics, politeness, authorship, and sentiment.

MCML Authors

Wen Lai

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[24]

K. Hämmerl, A. Manea, G. Vico, J. Helcl and J. Libovický.
CUNI and LMU Submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval.
MRL @EMNLP 2024 - 4th Multilingual Representation Learning Workshop at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Miami, FL, USA, Nov 12-16, 2024. DOI

Abstract

We present the joint CUNI and LMU submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval. The shared task objective was to explore how we can deploy modern methods in NLP in multi-lingual low-resource settings, tested on two sub-tasks: Named-entity recognition and question answering. Our solutions to the subtasks are based on data acquisition and model adaptation. We compare the performance of our submitted systems with the translate-test approach which proved to be the most useful in the previous edition of the shared task. Our results show that using more data as well as fine-tuning recent multilingual pre-trained models leads to considerable improvements over the translate-test baseline.

MCML Authors

Link to website

Katharina Hämmerl

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

[23]

L. Edman, L. Bylinina, F. Ghorbanpour and A. Fraser.
Are BabyLMs Second Language Learners?
Preprint (Oct. 2024). arXiv

Abstract

This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge (Warstadt et al. 2023). Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective. In L2 learning, there is a stronger focus on learning explicit linguistic information, such as grammatical notions, definitions of words or different ways of expressing a meaning. This makes L2 learning potentially more efficient and concise. We approximate this using data from Wiktionary, grammar examples either generated by an LLM or sourced from grammar books, and paraphrase data. We find that explicit information about word meaning (in our case, Wiktionary) does not boost model performance, while grammatical information can give a small improvement. The most impactful data ingredient is sentence paraphrases, with our two best models being trained on 1) a mix of paraphrase data and data from the BabyLM pretraining dataset, and 2) exclusively paraphrase data.

MCML Authors

Link to website

Lukas Edman

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Link to website

Faeze Ghorbanpour

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[22]

K. Hämmerl, J. Libovický and A. Fraser.
Understanding Cross-Lingual Alignment—A Survey.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

Cross-lingual alignment, the meaningful similarity of representations across languages in multilingual language models, has been an active field of research in recent years. We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field. We present different understandings of cross-lingual alignment and their limitations. We provide a qualitative summary of results from a number of surveyed papers. Finally, we discuss how these insights may be applied not only to encoder models, where this topic has been heavily studied, but also to encoder-decoder or even decoder-only models, and argue that an effective trade-off between language-neutral and language-specific information is key.

MCML Authors

Link to website

Katharina Hämmerl

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[21]

W. Lai, M. Mesgar and A. Fraser.
LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

To democratize large language models (LLMs) to most natural languages, it is imperative to make these models capable of understanding and generating texts in many languages, in particular low-resource ones. While recent multilingual LLMs demonstrate remarkable performance in such capabilities, these LLMs still support a limited number of human languages due to the lack of training data for low resource languages. Moreover, these LLMs are not yet aligned with human preference for downstream tasks, which is crucial for the success of LLMs in English. In this paper, we introduce xLLaMA-100 and xBLOOM-100 (collectively xLLMs-100), which scale the multilingual capabilities of LLaMA and BLOOM to 100 languages. To do so, we construct two datasets: a multilingual instruction dataset including 100 languages, which represents the largest language coverage to date, and a cross-lingual human feedback dataset encompassing 30 languages. We perform multilingual instruction tuning on the constructed instruction data and further align the LLMs with human feedback using the DPO algorithm on our cross-lingual human feedback dataset. We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks. Experimental results show that xLLMs-100 consistently outperforms its peers across the benchmarks by considerable margins, defining a new state-of-the-art multilingual LLM that supports 100 languages.
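The alignment step mentioned above relies on DPO. As a rough illustration, the standard DPO objective for a single preference pair can be sketched as follows; the function names, the `beta` value, and the log-probabilities are illustrative assumptions and do not reflect the paper's actual training setup.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    All inputs are sequence log-probabilities: the policy model and the
    frozen reference model each score the chosen and rejected response.
    """
    chosen_ratio = policy_chosen - ref_chosen
    rejected_ratio = policy_rejected - ref_rejected
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))


# If the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# If the policy prefers the chosen response more than the reference does, the loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In a cross-lingual feedback setting, the chosen and rejected responses would come from the preference dataset; the loss itself is language-agnostic.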

MCML Authors

Link to website

Wen Lai

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[20]

A. Dimmelmeier, H. Doll, M. Schierholz, E. Kormanyos, M. Fehr, B. Ma, J. Beck, A. Fraser and F. Kreuter.
Informing climate risk analysis using textual information - A research agenda.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI

Abstract

We present a research agenda focused on efficiently extracting, assuring quality, and consolidating textual company sustainability information to address urgent climate change decision-making needs. Starting from the goal to create integrated FAIR (Findable, Accessible, Interoperable, Reusable) climate-related data, we identify research needs pertaining to the technical aspects of information extraction as well as to the design of the integrated sustainability datasets that we seek to compile. Regarding extraction, we leverage technological advancements, particularly in large language models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines, to unlock the underutilized potential of unstructured textual information contained in corporate sustainability reports. In applying these techniques, we review key challenges, which include the retrieval and extraction of CO2 emission values from PDF documents, especially from unstructured tables and graphs therein, and the validation of automatically extracted data through comparisons with human-annotated values. We also review how existing use cases and practices in climate risk analytics relate to choices of what textual information should be extracted and how it could be linked to existing structured data.

MCML Authors

Link to website

Malte Schierholz

Dr.

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Link to website

Bolei Ma

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Link to website

Jacob Beck

C4 | Computational Social Sciences
→ Group Frauke Kreuter

Social Data Science and AI

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

Frauke Kreuter

Prof. Dr.

C4 | Computational Social Sciences

Social Data Science and AI

[19]

P. Piccirilli, A. Fraser and S. Schulte im Walde.
VOLIMET: A Parallel Corpus of Literal and Metaphorical Verb-Object Pairs for English–German and English–French.
*SEM 2024 - 13th Joint Conference on Lexical and Computational Semantics co-located with NAACL 2024. Mexico City, Mexico, Jun 20-21, 2024. DOI

Abstract

The interplay of cultural and linguistic elements that characterizes metaphorical language poses a substantial challenge for both human comprehension and machine processing. This challenge goes beyond monolingual settings and becomes particularly complex in translation, even more so in automatic translation. We present VOLIMET, a corpus of 2,916 parallel sentences containing gold standard alignments of metaphorical verb-object pairs and their literal paraphrases, e.g., tackle/address question, from English to German and French. On the one hand, the parallel nature of our corpus enables us to explore monolingual patterns for metaphorical vs. literal uses in English. On the other hand, we investigate different aspects of cross-lingual translations into German and French and the extent to which metaphoricity and literalness in the source language are transferred to the target languages. Monolingually, our findings reveal clear preferences in using metaphorical or literal uses of verb-object pairs. Cross-lingually, we observe a rich variability in translations as well as different behaviors for our two target languages.

MCML Authors

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[18]

Y. Zhang, V. Hangya and A. Fraser.
A Study of the Class Imbalance Problem in Abusive Language Detection.
WOAH @NAACL 2024 - 8th Workshop on Online Abuse and Harms at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024. DOI

Abstract

Abusive language detection has drawn increasing interest in recent years. However, a less systematically explored obstacle is label imbalance, i.e., the amount of abusive data is much lower than non-abusive data, leading to performance issues. The aim of this work is to conduct a comprehensive comparative study of popular methods for addressing the class imbalance issue. We explore 10 well-known approaches on 8 datasets with distinct characteristics: binary or multi-class, moderately or largely imbalanced, focusing on various types of abuse, etc. Additionally, we propose two novel methods specialized for abuse detection: AbusiveLexiconAug and ExternalDataAug, which enrich the training data using abusive lexicons and external abusive datasets, respectively. We conclude that: 1) our AbusiveLexiconAug approach, random oversampling, and focal loss are the most versatile methods on various datasets; 2) focal loss tends to yield peak model performance; 3) oversampling and focal loss provide promising results for binary datasets and small multi-class sets, while undersampling and weighted cross-entropy are more suitable for large multi-class sets; 4) most methods are sensitive to hyperparameters, yet our suggested choice of hyperparameters provides a good starting point.
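Two of the general-purpose methods named above, focal loss and random oversampling, can be sketched in a few lines. This is a minimal illustration under assumptions: the function names and toy data are hypothetical, and `gamma=2.0` is a commonly used default rather than the hyperparameter choice suggested in the paper.

```python
import math
import random
from collections import Counter


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the predicted probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: down-weights examples the model already classifies confidently."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


def random_oversample(texts, labels, seed: int = 0):
    """Duplicate randomly chosen minority-class examples until all classes are equally large."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy, made-up data: four non-abusive examples and one abusive example.
texts = ["ok 1", "ok 2", "ok 3", "ok 4", "abusive 1"]
labels = ["none", "none", "none", "none", "abuse"]
balanced_texts, balanced_labels = random_oversample(texts, labels)
counts = Counter(balanced_labels)
```

With `gamma=0` and `alpha=1` the focal loss reduces to plain cross-entropy, which makes the extra factor easy to inspect in isolation.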

MCML Authors

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[17]

V. Hangya and A. Fraser.
How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

Due to the broad range of social media platforms, the requirements of abusive language detection systems are varied and ever-changing. A large set of annotated corpora with different properties and label sets has already been created, for tasks such as hate or misogyny detection, but the form and targets of abusive speech are constantly evolving. Since the annotation of new corpora is expensive, in this work we leverage datasets we already have, covering a wide range of tasks related to abusive language detection. Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain. We propose a two-step approach: first we train our model in a multitask fashion. We then carry out few-shot adaptation to the target requirements. Our experiments show that using already existing datasets and only a few shots of the target task, the performance of models improves both monolingually and across languages. Our analysis also shows that our models acquire a general understanding of abusive language, since they improve the prediction of labels which are present only in the target dataset and can benefit from knowledge about labels which are not directly used for the target task.

MCML Authors

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[16]

M. Marco and A. Fraser.
Analyzing the Understanding of Morphologically Complex Words in Large Language Models.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL

Abstract

We empirically study the ability of a Large Language Model (gpt-3.5-turbo-instruct) to understand morphologically complex words. In our experiments, we looked at a variety of tasks to analyse German compounds with regard to compositional word formation and derivation, such as identifying the head noun of existing and novel compounds, identifying the shared verb stem between two words, or recognizing words constructed with inappropriately used derivation morphemes as invalid. Our results show that the language model is generally capable of solving most tasks, except for the task of identifying ill-formed word forms. While the model demonstrated a good overall understanding of complex words and their word-internal structure, the results also suggest that there is no formal knowledge of derivational rules, but rather an interpretation of the observed word parts to derive the meaning of a word.

MCML Authors

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[15]

A. Chronopoulou.
Efficient multilingual and domain adaptation of language models under resource constraints.
Dissertation 2024. DOI

Abstract

This dissertation develops methods to improve natural language processing (NLP) systems for low-resource languages and diverse domains. For machine translation, it explores bilingual language models, static embeddings, and multilingual systems with adapters, achieving robust performance in low-resource settings. To enhance domain adaptation, it introduces hierarchical tree structures and efficient adapters, enabling better generalization and robustness to domain shifts. These approaches address data disparities and domain variability, advancing adaptable and efficient NLP systems. (Shortened).

MCML Authors

Alexandra Chronopoulou

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

2023

[14]

M. Weller-Di Marco, K. Hämmerl and A. Fraser.
A Study on Accessing Linguistic Information in Pre-Trained Language Models by Using Prompts.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

We study whether linguistic information in pre-trained multilingual language models can be accessed by human language: So far, there is no easy method to directly obtain linguistic information and gain insights into the linguistic principles encoded in such models. We use the technique of prompting and formulate linguistic tasks to test the LM’s access to explicit grammatical principles and study how effective this method is at providing access to linguistic features. Our experiments on German, Icelandic and Spanish show that some linguistic properties can in fact be accessed through prompting, whereas others are harder to capture.

MCML Authors

Link to website

Katharina Hämmerl

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[13]

W. Lai, A. Chronopoulou and A. Fraser.
Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI

Abstract

Despite advances in multilingual neural machine translation (MNMT), we argue that there are still two major challenges in this area: data imbalance and representation degeneration. The data imbalance problem refers to the imbalance in the amount of parallel corpora for all language pairs, especially for long-tail languages (i.e., very low-resource languages). The representation degeneration problem refers to the problem of encoded tokens tending to appear only in a small subspace of the full space available to the MNMT model. To solve these two issues, we propose Bi-ACL, a framework which only requires target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We define two modules, named bidirectional autoencoder and bidirectional contrastive learning, which we combine with an online constrained beam search and a curriculum learning sampling strategy. Extensive experiments show that our proposed method is more effective than strong baselines both in long-tail languages and in high-resource languages. We also demonstrate that our approach is capable of transferring knowledge between domains and languages in zero-shot scenarios.

MCML Authors

Link to website

Wen Lai

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexandra Chronopoulou

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[12]

V. Hangya, S. Severini, R. Ralev, A. Fraser and H. Schütze.
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages.
MRL @EMNLP 2023 - 3rd Workshop on Multi-lingual Representation Learning at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). Singapore, Dec 06-10, 2023. DOI

Abstract

Very low-resource languages, having only a few million tokens worth of data, are not well-supported by multilingual NLP approaches due to poor quality cross-lingual word representations. Recent work showed that good crosslingual performance can be achieved if a source language is related to the low-resource target language. However, not all language pairs are related. In this paper, we propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach, that incorporates intermediate related languages to bridge the gap between the distant source and target. We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target. We extend a semi-joint bilingual approach to multiple languages in order to eliminate the main weakness of previous works, i.e., independently trained monolingual embeddings, by anchoring the target language around the multilingual space. We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (≤ 5M tokens) and 4 moderately low-resource (≤ 50M) target languages, showing improved performance in both categories. Additionally, our analysis reveals the importance of good quality embeddings for intermediate languages as well as the importance of leveraging anchor points from all languages in the multilingual space.

MCML Authors

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

[11]

W. Lai, V. Hangya and A. Fraser.
Extending Multilingual Machine Translation through Imitation Learning.
Preprint (Nov. 2023). arXiv

Abstract

Despite the growing variety of languages supported by existing multilingual neural machine translation (MNMT) models, most of the world’s languages are still being left behind. We aim to extend large-scale MNMT models to a new language, allowing for translation between the newly added and all of the already supported languages in a challenging scenario: using only a parallel corpus between the new language and English. Previous approaches, such as continued training on parallel data including the new language, suffer from catastrophic forgetting (i.e., performance on other languages is reduced). Our novel approach Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert, a technique widely used in the computer vision area, but not well explored in NLP. More specifically, we construct a pseudo multi-parallel corpus of the new and the original languages by pivoting through English, and imitate the output distribution of the original MNMT model. Extensive experiments show that our approach significantly improves the translation performance between the new and the original languages, without severe catastrophic forgetting. We also demonstrate that our approach is capable of solving copy and off-target problems, which are two common issues in current large-scale MNMT models.

MCML Authors

Link to website

Wen Lai

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[10]

V. Hangya and A. Fraser.
LMU at HaSpeeDe3: Multi-Dataset Training for Cross-Domain Hate Speech Detection.
EVALITA 2023 - Final Workshop of the 8th evaluation campaign. Parma, Italy, Sep 07-08, 2023. PDF

Abstract

We describe LMU Munich’s hate speech detection system for participating in the cross-domain track of the HaSpeeDe3 shared task at EVALITA 2023. The task focuses on the politics and religion domains, having no in-domain training data for the latter. Our submission combines multiple training sets from various domains in a multitask prompt-training system. We experimented with both Italian and English source datasets as well as monolingual Italian and multilingual pre-trained language models. We found that the Italian out-of-domain datasets are the most influential on the performance in the test domains and that combining both monolingual and multilingual language models using an ensemble gives the best results. Our system ranked second in both domains.

MCML Authors

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[9]

K. Hämmerl, B. Deiseroth, P. Schramowski, J. Libovický, C. Rothkopf, A. Fraser and K. Kersting.
Speaking Multiple Languages Affects the Moral Bias of Language Models.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer. However, PMLMs are trained on varying amounts of data for each language. In practice this means their performance is often much better on English than many other languages. We explore to what extent this also applies to moral norms. Do the models capture moral norms from English and impose them on other languages? Do the models exhibit random and thus potentially harmful beliefs in certain languages? Both these issues could negatively impact cross-lingual transfer and potentially lead to harmful outcomes. In this paper, we (1) apply the MORALDIRECTION framework to multilingual models, comparing results in German, Czech, Arabic, Chinese, and English, (2) analyse model behaviour on filtered parallel subtitles corpora, and (3) apply the models to a Moral Foundations Questionnaire, comparing with human responses from different countries. Our experiments demonstrate that, indeed, PMLMs encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions. We release our code and models.

MCML Authors

Link to website

Katharina Hämmerl

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[8]

K. Hämmerl, A. Fastowski, J. Libovický and A. Fraser.
Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings, and typically display outlier dimensions. This seems to be true for both monolingual and multilingual models, although much less work has been done on the multilingual context. Why these outliers occur and how they affect the representations is still an active area of research. We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models. We focus on cross-lingual semantic similarity tasks, as these are natural tasks for evaluating multilingual representations. Specifically, we examine sentence representations. Sentence transformers which are fine-tuned on parallel resources (that are not always available) perform better on this task, and we show that their representations are more isotropic. However, we aim to improve multilingual representations in general. We investigate how much of the performance difference can be made up by only transforming the embedding space without fine-tuning, and visualise the resulting spaces. We test different operations: Removing individual outlier dimensions, cluster-based isotropy enhancement, and ZCA whitening. We publish our code for reproducibility.
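The first of the operations listed above, removing individual outlier dimensions, can be illustrated on toy vectors. This is only a sketch under assumptions: mean pairwise cosine similarity is used here as one common proxy for anisotropy, the outlier criterion (largest mean absolute value) and all function names are hypothetical, and the vectors are made up rather than taken from any model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all pairs; values near 1 suggest an anisotropic space."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def outlier_dimensions(vectors, k=1):
    """Return the k dimensions with the largest mean absolute value (assumed criterion)."""
    dim = len(vectors[0])
    mean_abs = [sum(abs(v[d]) for v in vectors) / len(vectors) for d in range(dim)]
    return sorted(range(dim), key=lambda d: mean_abs[d], reverse=True)[:k]


def remove_dimensions(vectors, dims):
    """Drop the given dimensions from every vector."""
    drop = set(dims)
    return [[x for d, x in enumerate(v) if d not in drop] for v in vectors]


# Toy sentence vectors: the third dimension dominates every vector.
vectors = [
    [1.0, 0.0, 50.0],
    [0.0, 1.0, 52.0],
    [-1.0, 0.0, 49.0],
    [0.0, -1.0, 51.0],
]
before = mean_pairwise_cosine(vectors)
outliers = outlier_dimensions(vectors, k=1)
after = mean_pairwise_cosine(remove_dimensions(vectors, outliers))
```

On this toy data, the dominant dimension makes all vectors look nearly identical under cosine similarity; once it is dropped, the remaining dimensions determine similarity.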

MCML Authors

Link to website

Katharina Hämmerl

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[7]

Y. Liu, A. Chronopoulou, H. Schütze and A. Fraser.
On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss.
IWSLT 2023 - 20th International Conference on Spoken Language Translation. Toronto, Canada, Jul 09-14, 2023. DOI

Abstract

Although unsupervised neural machine translation (UNMT) has achieved success in many language pairs, the copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs, especially when low-resource languages are involved. We find this issue is closely related to an unexpected copying behavior during online back-translation (BT). In this work, we propose a simple but effective training schedule that incorporates a language discriminator loss. The loss imposes constraints on the intermediate translation so that the translation is in the desired language. By conducting extensive experiments on different language pairs, including similar and distant, high and low-resource languages, we find that our method alleviates the copying problem, thus improving the translation performance on low-resource languages.

MCML Authors

Link to website

Yihong Liu

B2 | Natural Language Processing
→ Group Hinrich Schütze

Computational Linguistics

Alexandra Chronopoulou

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[6]

A. Chronopoulou, M. Peters, A. Fraser and J. Dodge.
AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models.
EACL 2023 - Findings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia, May 02-06, 2023. DOI

Abstract

Pretrained language models (PLMs) are trained on massive corpora, but often need to specialize to specific domains. A parameter-efficient adaptation method suggests training an adapter for each domain on the task of language modeling. This leads to good in-domain scores but can be impractical for domain- or resource-restricted settings. A solution is to use a related-domain adapter for the novel domain at test time. In this paper, we introduce AdapterSoup, an approach that performs weight-space averaging of adapters trained on different domains. Our approach is embarrassingly parallel: first, we train a set of domain-specific adapters; then, for each novel domain, we determine which adapters should be averaged at test time. We present extensive experiments showing that AdapterSoup consistently improves performance to new domains without extra training. We also explore weight averaging of adapters trained on the same domain with different hyper-parameters, and show that it preserves the performance of a PLM on new domains while obtaining strong in-domain results. We explore various approaches for choosing which adapters to combine, such as text clustering and semantic similarity. We find that using clustering leads to the most competitive results on novel domains.
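The core operation described above, weight-space averaging of selected adapters, can be sketched as follows. The adapter representation (a dictionary of flat float lists), the parameter name `down.weight`, the domain names, and the top-k selection by a precomputed similarity score are all assumptions made for illustration; the similarity score stands in for the clustering and semantic-similarity selection strategies mentioned in the abstract.

```python
from typing import Dict, List

Adapter = Dict[str, List[float]]


def select_top_k(similarity: Dict[str, float], k: int) -> List[str]:
    """Pick the k training domains most similar to the novel domain (hypothetical criterion)."""
    return sorted(similarity, key=similarity.get, reverse=True)[:k]


def average_adapters(adapters: List[Adapter]) -> Adapter:
    """Uniformly average the weights of adapters that share the same parameter names."""
    if not adapters:
        raise ValueError("need at least one adapter to average")
    n = len(adapters)
    return {
        name: [sum(values) / n for values in zip(*(a[name] for a in adapters))]
        for name in adapters[0]
    }


# Made-up adapters trained on three domains, each with a single tiny parameter.
adapters = {
    "news": {"down.weight": [0.0, 2.0]},
    "legal": {"down.weight": [2.0, 4.0]},
    "reviews": {"down.weight": [10.0, 10.0]},
}
# Made-up similarity of each training domain to the novel test domain.
similarity = {"news": 0.9, "legal": 0.8, "reviews": 0.1}

chosen = select_top_k(similarity, k=2)
soup = average_adapters([adapters[name] for name in chosen])
```

Because averaging happens once in weight space, the resulting adapter costs the same at inference time as a single adapter, and no additional training is involved.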

MCML Authors

Alexandra Chronopoulou

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[5]

A. Chronopoulou, D. Stojanovski and A. Fraser.
Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation.
LoResMT @EACL 2023 - 6th Workshop on Technologies for Machine Translation of Low-Resource Languages at the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). Dubrovnik, Croatia, May 02-06, 2023. DOI

Abstract

Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks. Self-supervised pretrained models are often fine-tuned on parallel data from one or multiple language pairs for machine translation. Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive. Training a new adapter on each language pair or training a single adapter on all language pairs without updating the pretrained model has been proposed as a parameter-efficient alternative. However, the former does not permit any sharing between languages, while the latter shares parameters for all languages and is susceptible to negative interference. In this paper, we propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer. Our approach outperforms related baselines, yielding higher translation scores on average when translating from English to 17 different low-resource languages. We also show that language-family adapters provide an effective method to translate to languages unseen during pretraining.

MCML Authors

Alexandra Chronopoulou

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

2022

[4]

V. Hangya, H. S. Saadi and A. Fraser.
Improving Low-Resource Languages in Pre-Trained Multilingual Language Models.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Pre-trained multilingual language models are the foundation of many NLP approaches, including cross-lingual transfer solutions. However, languages with small available monolingual corpora are often not well supported by these models, leading to poor performance. We propose an unsupervised approach to improve the cross-lingual representations of low-resource languages by bootstrapping word translation pairs from monolingual corpora and using them to improve language alignment in pre-trained language models. We perform experiments on nine languages, using contextual word retrieval and zero-shot named entity recognition to measure both intrinsic cross-lingual word representation quality and downstream task performance, showing improvements on both tasks. Our results show that it is possible to improve pre-trained multilingual language models by relying only on non-parallel resources.

MCML Authors

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[3]

W. Lai, A. Chronopoulou and A. Fraser.
m4 Adapter: Multilingual Multi-Domain Adaptation for Machine Translation with a Meta-Adapter.
EMNLP 2022 - Findings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Multilingual neural machine translation (MNMT) models yield state-of-the-art performance when evaluated on data from a domain and language pair seen at training time. However, when an MNMT model is used to translate under domain shift or to a new language pair, performance drops dramatically. We consider a very challenging scenario: adapting the MNMT model both to a new domain and to a new language pair at the same time. In this paper, we propose m4Adapter (Multilingual Multi-Domain Adaptation for Machine Translation with a Meta-Adapter), which combines domain and language knowledge using meta-learning with adapters. We present results showing that our approach is a parameter-efficient solution that effectively adapts a model to both a new language pair and a new domain, while outperforming other adapter methods. An ablation study also shows that our approach more effectively transfers domain knowledge across different languages and language information across different domains.

MCML Authors

Wen Lai

B2 | Natural Language Processing
→ Group Alexander Fraser

Data Analytics & Statistics

Alexandra Chronopoulou

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[2]

H. S. Saadi, V. Hangya, T. Eder and A. Fraser.
Comparative Analysis of Cross-lingual Contextualized Word Embeddings.
MRL @EMNLP 2022 - 2nd Workshop on Multi-lingual Representation Learning at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI

Abstract

Contextualized word embeddings have emerged as the most important tool for performing NLP tasks in a large variety of languages. In order to improve the cross-lingual representation and transfer learning quality, contextualized embedding alignment techniques, such as mapping and model fine-tuning, are employed. Existing techniques, however, are time-, data- and computational resource-intensive. In this paper we analyze these techniques by utilizing three tasks: bilingual lexicon induction (BLI), word retrieval and cross-lingual natural language inference (XNLI) for a high-resource (German-English) and a low-resource (Bengali-English) language pair. In contrast to previous works which focus only on a few popular models, we compare five multilingual and seven monolingual language models and investigate the effect of various aspects on their performance, such as vocabulary size, number of languages used for training and number of parameters. Additionally, we propose a parameter-, data- and runtime-efficient technique which can be trained with 10% of the data and less than 10% of the time, and which has less than 5% of the trainable parameters compared to model fine-tuning. We show that our proposed method is competitive with resource-heavy models, even outperforming them in some cases, even though it relies on fewer resources.

MCML Authors

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

[1]

S. Severini, V. Hangya, M. J. Sabet, A. Fraser and H. Schütze.
Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings.
BUCC @LREC 2022 - 15th Workshop on Building and Using Comparable Corpora at the 13th International Conference on Language Resources and Evaluation (LREC 2022). Marseille, France, Jun 21-23, 2022. URL

Abstract

Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual transfer of NLP models. They can be built using only monolingual corpora without supervision, leading to numerous works focusing on unsupervised BWEs. However, most of the current approaches to building unsupervised BWEs do not compare their results with methods based on easy-to-access cross-lingual signals. In this paper, we argue that such signals should always be considered when developing unsupervised BWE methods. The two approaches we find most effective are: 1) using identical words as seed lexicons (which unsupervised approaches incorrectly assume are not available for orthographically distinct language pairs) and 2) combining such lexicons with pairs extracted by matching romanized versions of words with an edit distance threshold. We experiment on thirteen non-Latin languages (and English) and show that such cheap signals work well and that they outperform more complex unsupervised methods on distant language pairs such as Chinese, Japanese, Kannada, Tamil, and Thai. In addition, they are even competitive with the use of high-quality lexicons in supervised approaches. Our results show that these training signals should not be neglected when building BWEs, even for distant languages.
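The two cheap signals named in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: romanization is assumed to have been done beforehand (the hypothetical `tgt_romanized` mapping goes from a target word to its romanized form), the function names are invented for this example, and the threshold `max_dist=1` is an arbitrary choice rather than a value from the paper.

```python
from typing import Dict, Iterable, Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def seed_lexicon(src_vocab: Iterable[str],
                 tgt_romanized: Dict[str, str],
                 max_dist: int = 1) -> Set[Tuple[str, str]]:
    """Collect (source, target) seed pairs from two cheap signals."""
    src_words = list(src_vocab)
    pairs: Set[Tuple[str, str]] = set()
    # Signal 1: identical strings in both vocabularies (numbers, names, ...).
    for word in src_words:
        if word in tgt_romanized:
            pairs.add((word, word))
    # Signal 2: source word close to the romanized form of a target word.
    for word in src_words:
        for tgt_word, roman in tgt_romanized.items():
            if edit_distance(word.lower(), roman.lower()) <= max_dist:
                pairs.add((word, tgt_word))
    return pairs


if __name__ == "__main__":
    source = ["tokio", "2022", "river"]
    target = {"東京": "tokyo", "2022": "2022", "川": "kawa"}
    print(sorted(seed_lexicon(source, target)))
```

The quadratic comparison loop is fine for a small illustration; a realistic vocabulary would need candidate filtering before computing edit distances.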

MCML Authors

Viktor Hangya

Dr.

B2 | Natural Language Processing
→ Group Alexander Fraser

* Former Member

Masoud Jalili Sabet

Dr.

B2 | Natural Language Processing
→ Group Hinrich Schütze

* Former Member

Alexander Fraser

Prof. Dr.

B2 | Natural Language Processing

Data Analytics & Statistics

Hinrich Schütze

Prof. Dr.

B2 | Natural Language Processing

Computational Linguistics

©all images: LMU | TUM

2024-12-27 - Last modified: 2024-12-27