Home  | Publications | SP25

Dialetto, Ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum

MCML Authors

Link to Profile Barbara Plank PI Matchmaking

Barbara Plank

Prof. Dr.

Principal Investigator

Abstract

There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Venacular English as homogeneous categories (Faisal et al., 2024; Ziems et al., 2023), yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and interpret the performance disparity to be due to a bias towards dialects that are more similar to the standard variety in the speech-to-text model examined. We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance, indicating there to be geographical structure in the performance distribution.

inproceedings


Findings @NAACL 2025

Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025.
Conference logo
A Conference

Authors

R. Shim • B. Plank

Links

DOI

Research Area

 B2 | Natural Language Processing

BibTeXKey: SP25

Back to Top