Human actions often generate sounds that can be recognized to infer their cause. In action recognition, actions are usually decomposed into combinations of verbs and nouns, and the number of possible combinations is very large. Contemporary datasets, such as EPIC-KITCHENS, cover a wide gamut of this potential action space, but not its entirety. Arguably, the holistic characterization of human actions through the sounds they generate therefore requires zero-shot learning (ZSL). In this contribution, we explore the feasibility of ZSL for recognizing a) nouns, b) verbs, or c) actions on EPIC-KITCHENS. To achieve this, we use linguistic intermediation: we generate a description of each word corresponding to our classes using a pre-trained large language model (LLAMA-2). Our results show that human action recognition from sounds is possible in a zero-shot fashion, as we consistently obtain results above chance.
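The abstract does not specify the encoders used, so the following is only a minimal sketch of the general zero-shot pipeline it describes: LLM-generated class descriptions are embedded as text, an audio clip is embedded in the same space, and the nearest description determines the predicted unseen class. The CLAP checkpoint, the example description strings, and the `classify` helper are illustrative assumptions, not the authors' implementation.

```python
import torch
from transformers import ClapModel, ClapProcessor

# Assumption: a CLAP-style joint audio-text embedding model stands in for
# the (unspecified) encoders used in the paper.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Illustrative class descriptions of the kind an LLM might generate for
# verb classes; these are NOT the paper's actual LLAMA-2 outputs.
descriptions = {
    "cut":   "A short rhythmic sound of a blade slicing food on a board.",
    "wash":  "Running water splashing over dishes or hands in a sink.",
    "close": "A brief thud or click as a door, drawer, or lid shuts.",
}

# Embed all class descriptions once and L2-normalize them.
text_inputs = processor(text=list(descriptions.values()),
                        return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def classify(waveform, sampling_rate=48000):
    """Return the unseen class whose description embedding is closest
    (by cosine similarity) to the audio embedding."""
    audio_inputs = processor(audios=waveform, sampling_rate=sampling_rate,
                             return_tensors="pt")
    with torch.no_grad():
        audio_emb = model.get_audio_features(**audio_inputs)
    audio_emb = audio_emb / audio_emb.norm(dim=-1, keepdim=True)
    scores = audio_emb @ text_emb.T  # cosine similarities to each class
    return list(descriptions)[scores.argmax().item()]
```

Because the class set appears only as text, new verbs, nouns, or verb-noun actions can be added by generating a description and embedding it, with no retraining of the audio encoder.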