Social perception, i.e., the assessment of people with respect to different attributes, is a ubiquitous and crucial aspect of human communal life. Hence, its automatic prediction is a longstanding machine learning problem, which has largely been attempted on the basis of face images. Applications of such machine learning models lie, e.g., in large-scale automatic data analysis for social science studies and in communication training. We advance efforts in automatic face-based personality assessment by proposing the novel LMU-ELP dataset, featuring 177 videos of CEOs' presentations annotated for 35 dimensions, including assertiveness, competence, kindness, and trustworthiness. We attempt to model each of the 35 dimensions from sequences of face images only. This is achieved by combining a contemporary Vision Transformer (ViT) with a recurrent neural network. Beyond this standard machine learning setup, we investigate few-shot and zero-shot scenarios, in which one of the 35 dimensions is considered unseen at training time. We leverage conceptual similarities among the dimensions, obtained via Large Language Models (LLMs). Both our few-shot and zero-shot approaches select dimensions semantically similar to the unseen one and fuse their predictions to estimate the unseen dimension. Our experiments show that the proposed methods can effectively model social perception across most dimensions, both in standard setups and in zero-shot and few-shot scenarios. While the results are highly dependent on the predicted dimension, training on the full dataset yields a mean Concordance Correlation Coefficient (CCC) of .2442 across all 35 dimensions, while mean CCC values of .1447 and .1608 are obtained with the zero-shot and few-shot methods, respectively.
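The pipeline described above (per-frame ViT features aggregated by a recurrent network into per-dimension scores, plus similarity-based fusion for an unseen dimension) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the GRU choice, feature dimensionality, hidden sizes, cosine-similarity selection, top-k fusion, and all names (`FacePerceptionModel`, `fuse_for_unseen`) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FacePerceptionModel(nn.Module):
    """Per-frame ViT features -> recurrent aggregation -> one score per dimension.

    Sketch only: the paper combines a ViT with a recurrent neural network;
    the specific backbone, hidden size, and head layout here are assumptions.
    """

    def __init__(self, vit_backbone: nn.Module, feat_dim: int = 768,
                 hidden: int = 256, n_dims: int = 35):
        super().__init__()
        self.vit = vit_backbone          # assumed to map (B*T, 3, H, W) -> (B*T, feat_dim)
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_dims)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -- a sequence of face crops per video
        b, t = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(feats)           # final hidden state summarizes the clip
        return self.head(h[-1])          # (B, n_dims) social-perception scores


def fuse_for_unseen(preds: torch.Tensor, dim_emb: torch.Tensor,
                    unseen_emb: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Zero-shot fusion sketch: average the predictions of the k seen dimensions
    whose LLM-derived text embeddings are most cosine-similar to the unseen one.

    preds:      (B, n_seen) model outputs for the seen dimensions
    dim_emb:    (n_seen, D) embeddings of the seen dimension names
    unseen_emb: (D,)        embedding of the unseen dimension name
    """
    sim = F.cosine_similarity(dim_emb, unseen_emb.unsqueeze(0), dim=1)
    top = sim.topk(k).indices
    # In a few-shot variant, a handful of labeled examples for the unseen
    # dimension could additionally calibrate this fused estimate.
    return preds[:, top].mean(dim=1)     # (B,) estimate for the unseen dimension
```

For reference, the CCC between predictions $x$ and labels $y$ is the standard agreement measure $\mathrm{CCC} = \frac{2\rho\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$, which penalizes both low correlation and systematic offsets between predicted and annotated scores.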