Human actions often generate sounds that can be recognized to infer their cause. In action recognition, actions are usually decomposed into combinations of verbs and nouns, and the number of possible combinations is very large. Contemporary datasets, such as EPIC-KITCHENS, cover a wide gamut of this potential action space, but not its entirety. Arguably, the holistic characterization of human actions through the sounds they generate therefore requires zero-shot learning (ZSL). In this contribution, we explore the feasibility of ZSL for recognizing a) nouns, b) verbs, or c) actions on EPIC-KITCHENS. To achieve this, we use linguistic intermediation: we generate a description of each word corresponding to our classes using a pre-trained large language model (LLAMA-2). Our results show that human action recognition from sounds is possible in a zero-shot fashion, as we consistently obtain results above chance.
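The abstract does not specify the encoders used, so the following is only a minimal sketch of the general zero-shot pipeline it describes: LLM-generated class descriptions are embedded as text, an audio clip is embedded in the same space, and the nearest description determines the predicted unseen class. The CLAP checkpoint, the example description strings, and the `classify` helper are illustrative assumptions, not the authors' implementation.

```python
import torch
from transformers import ClapModel, ClapProcessor

# Assumption: a CLAP-style joint audio-text embedding model stands in for
# the (unspecified) encoders used in the paper.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Illustrative class descriptions of the kind an LLM might generate for
# verb classes; these are NOT the paper's actual LLAMA-2 outputs.
descriptions = {
    "cut":   "A short rhythmic sound of a blade slicing food on a board.",
    "wash":  "Running water splashing over dishes or hands in a sink.",
    "close": "A brief thud or click as a door, drawer, or lid shuts.",
}

# Embed all class descriptions once and L2-normalize them.
text_inputs = processor(text=list(descriptions.values()),
                        return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def classify(waveform, sampling_rate=48000):
    """Return the unseen class whose description embedding is closest
    (by cosine similarity) to the audio embedding."""
    audio_inputs = processor(audios=waveform, sampling_rate=sampling_rate,
                             return_tensors="pt")
    with torch.no_grad():
        audio_emb = model.get_audio_features(**audio_inputs)
    audio_emb = audio_emb / audio_emb.norm(dim=-1, keepdim=True)
    scores = audio_emb @ text_emb.T  # cosine similarities to each class
    return list(descriptions)[scores.argmax().item()]
```

Because the class set appears only as text, new verbs, nouns, or verb-noun actions can be added by generating a description and embedding it, with no retraining of the audio encoder.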