Embedding Space Correlation as a Measure of Domain Similarity

MCML Authors

Göran Kauermann

Prof. Dr.

Principal Investigator

Hinrich Schütze

Prof. Dr.

Principal Investigator

Abstract

Prior work has determined domain similarity using text-based features of a corpus. However, when using pre-trained word embeddings, the underlying text corpus might not be accessible anymore. Therefore, we propose the CCA measure, a new measure of domain similarity based directly on the dimension-wise correlations between corresponding embedding spaces. Our results suggest that an inherent notion of domain can be captured this way, as we are able to reproduce our findings for different domain comparisons for English, German, Spanish and Czech as well as in cross-lingual comparisons. We further find a threshold at which the CCA measure indicates that two corpora come from the same domain in a monolingual setting by applying permutation tests. By evaluating the usability of the CCA measure in a domain adaptation application, we also show that it can be used to determine which corpora are more similar to each other in a cross-domain sentiment detection task.
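To illustrate the idea of comparing two embedding spaces through correlations, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not give the exact definition of the CCA measure, so the sketch assumes that both spaces are given as matrices whose rows are aligned on a shared vocabulary, and that the similarity score is the mean canonical correlation. The function names `canonical_correlations` and `cca_similarity` are hypothetical.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between two row-aligned embedding matrices.

    X, Y: arrays of shape (n_words, dim), where row i of both matrices is
    the vector of the same word. Assumes n_words > dim and full column rank.
    """
    # Center each dimension so that inner products behave like covariances.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered matrices.
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # The singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    # Guard against tiny numerical overshoot outside [0, 1].
    return np.clip(s, 0.0, 1.0)


def cca_similarity(X, Y):
    """Assumed summary score: the mean canonical correlation."""
    return float(canonical_correlations(X, Y).mean())


# Toy data standing in for embeddings (random, not real word vectors).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
R = rng.normal(size=(10, 10))      # arbitrary linear map of the same space
Z = rng.normal(size=(500, 10))     # unrelated space

same_space_score = cca_similarity(X, X @ R)
unrelated_score = cca_similarity(X, Z)
```

In this toy setting, a space that is only a linear transformation of another scores close to 1, while an unrelated space scores much lower. How scores between real corpora translate into a "same domain" threshold is what the permutation tests described in the abstract address; that step is not reproduced here.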

inproceedings


LREC 2020

12th International Conference on Language Resources and Evaluation. Marseille, France, May 13-15, 2020.

Authors

A. Beyer • G. Kauermann • H. Schütze

Research Areas

 A1 | Statistical Foundations & Explainability

 B2 | Natural Language Processing

BibTeXKey: BKS20
