
Towards Audio-Based Zero-Shot Action Recognition in Kitchen Environments

Abstract

Human actions often generate sounds that can be recognized to infer their cause. In action recognition, actions can usually be broken down into a combination of verbs and nouns, of which there exists a very large number of possible combinations. Contemporary datasets, like EPIC-KITCHENS, cover a wide gamut of the potential action space, but not its entirety. Arguably, the holistic characterization of human actions through the sounds they generate therefore requires zero-shot learning (ZSL). In this contribution, we explore the feasibility of ZSL for recognizing a) nouns, b) verbs, or c) actions on EPIC-KITCHENS. To achieve this, we use linguistic intermediation: we generate descriptions of each word corresponding to our classes using a pre-trained large language model (LLAMA-2). Our results show that human action recognition from sounds is possible in a zero-shot fashion, as we consistently obtain above-chance results.
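
The abstract does not spell out how audio is matched to the generated class descriptions. As a rough illustration of one common zero-shot setup, the sketch below assumes that audio clips and class descriptions have already been embedded into a shared vector space, and it predicts the class whose description embedding is most similar to the audio embedding. The function names, the toy vectors, and the use of cosine similarity are all assumptions made for illustration, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_predict(audio_embedding, description_embeddings):
    """Return the label whose description embedding is closest to the audio.

    Hypothetical helper: no class-specific training is involved, only a
    similarity lookup against embeddings of textual class descriptions.
    """
    return max(
        description_embeddings,
        key=lambda label: cosine(audio_embedding, description_embeddings[label]),
    )


# Toy, hand-written vectors standing in for embeddings of LLM-generated
# descriptions of three verb classes (assumed values, not real data).
description_embeddings = {
    "cut": [0.9, 0.1, 0.0],
    "pour": [0.1, 0.9, 0.1],
    "wash": [0.0, 0.2, 0.9],
}

# Toy vector standing in for the embedding of an unseen audio clip.
audio_embedding = [0.2, 0.8, 0.3]

predicted = zero_shot_predict(audio_embedding, description_embeddings)

# With uniformly random guessing, accuracy would be 1 / number of classes.
chance_level = 1 / len(description_embeddings)

print(predicted, round(chance_level, 3))
```

In this toy setting, "over chance" means that accuracy across many clips exceeds `chance_level`; in practice the embeddings would come from trained audio and text encoders rather than hand-written vectors.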

DCASE 2025

Workshop on Detection and Classification of Acoustic Scenes and Events. Barcelona, Spain, Oct 30-31, 2025.

Authors

A. Gebhard, A. Triantafyllopoulos, I. Tsangko, B. W. Schuller

Links

DOI

Research Area

 B3 | Multimodal Perception

BibTeXKey: GTT+25
