04.12.2023

MCML Researchers With 17 Papers at EMNLP 2023

Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). Singapore, 06.12.2023–10.12.2023

We are happy to announce that MCML researchers are represented with 17 papers at EMNLP 2023. Congrats to our researchers!

Main Track (9 papers)

E. Garces Arias, V. Pai, M. Schöffel, C. Heumann and M. Aßenmacher.
Automatic transcription of handwritten Old Occitan language.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023.

Abstract

While existing neural network-based approaches have shown promising results in Handwritten Text Recognition (HTR) for high-resource languages and standardized/machine-written text, their application to low-resource languages often presents challenges, resulting in reduced effectiveness. In this paper, we propose an innovative HTR approach that leverages the Transformer architecture for recognizing handwritten Old Occitan language. Given the limited availability of data, which comprises only word pairs of graphical variants and lemmas, we develop and rely on elaborate data augmentation techniques for both text and image data. Our model combines a custom-trained Swin image encoder with a BERT text decoder, which we pre-train using a large-scale augmented synthetic data set and fine-tune on the small human-labeled data set. Experimental results reveal that our approach surpasses the performance of current state-of-the-art models for Old Occitan HTR, including open-source Transformer-based models such as a fine-tuned TrOCR and commercial applications like Google Cloud Vision. To nurture further research and development, we make our models, data sets, and code publicly available.

MCML Authors

Esteban Garces Arias

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

Matthias Aßenmacher

Dr.

A1 | Statistical Foundations & Explainability
→ Group Bernd Bischl

Statistical Learning and Data Science

M. Giulianelli, J. Baan, W. Aziz, R. Fernández and B. Plank.
What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023.

Abstract

In Natural Language Generation (NLG) tasks, for any input, multiple communicative goals are plausible, and any goal can be put into words, or produced, in multiple ways. We characterise the extent to which human production varies lexically, syntactically, and semantically across four NLG tasks, connecting human production variability to aleatoric or data uncertainty. We then inspect the space of output strings shaped by a generation system’s predicted probability distribution and decoding algorithm to probe its uncertainty. For each test input, we measure the generator’s calibration to human production variability. Following this instance-level approach, we analyse NLG models and decoding strategies, demonstrating that probing a generator with multiple samples and, when possible, multiple references, provides the level of detail necessary to gain understanding of a model’s representation of uncertainty.

MCML Authors

Barbara Plank

Prof. Dr.

Findings Track (7 papers)

Workshops (1 paper)