
EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions


Abstract

Recent advances in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, self-contained statements about vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment of long-form captions than traditional metrics, which exhibit negative correlations due to their sensitivity to caption length.
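The decompose-then-verify idea described in the abstract can be illustrated with a minimal sketch. All names here (`decompose`, `score_caption`, the toy verifier) are hypothetical stand-ins, not the authors' implementation: in the paper, unit extraction and audio-grounded verification would involve models operating on the speech signal, whereas this sketch uses naive sentence splitting and a set lookup.

```python
# Hypothetical sketch of atomic-verification scoring.
# decompose() stands in for Atomic Perceptual Unit extraction;
# the verifier stands in for audio-grounded checking of each unit.
import re
from typing import Callable, List


def decompose(caption: str) -> List[str]:
    """Split a long caption into self-contained atomic statements
    (naive sentence splitting as a placeholder)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", caption) if s.strip()]


def score_caption(caption: str, verify: Callable[[str], bool]) -> float:
    """Fraction of atomic units the verifier accepts against the audio."""
    units = decompose(caption)
    if not units:
        return 0.0
    return sum(verify(u) for u in units) / len(units)


# Toy verifier: pretend these two statements are supported by the speech signal.
supported = {"The speaker sounds anxious.", "The pitch rises at the end."}
caption = ("The speaker sounds anxious. The pitch rises at the end. "
           "The voice is deep.")
print(score_caption(caption, lambda u: u in supported))  # two of three units verified
```

Scoring per unit rather than holistically is what makes the metric robust to caption length: an unverifiable extra sentence lowers the score proportionally instead of collapsing the judgment of the whole caption.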

misc JTW+26


Preprint

Mar. 2026

Authors

X. Jing • A. Triantafyllopoulos • J. Wang • S. Amiriparian • J. Luo • B. W. Schuller

Links

arXiv

Research Area

B3 | Multimodal Perception

BibTeX Key: JTW+26
