
Exploring Automated Recognition of Instructional Activity and Discourse From Multimodal Classroom Data

Abstract

Observation of classroom interactions can provide concrete feedback to teachers, but current methods rely on manual annotation, which is resource-intensive and hard to scale. This work explores AI-driven analysis of classroom recordings, focusing on multimodal instructional activity and discourse recognition as a foundation for actionable feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, we design parallel, modality-specific pipelines. For video, we evaluate zero-shot multimodal LLMs, fine-tuned vision-language models, and self-supervised video transformers on 24 activity labels. For transcripts, we fine-tune a transformer-based classifier with contextualized inputs and compare it against prompting-based LLMs on 19 discourse labels. To handle class imbalance and multi-label complexity, we apply per-label thresholding, context windows, and imbalance-aware loss functions. The results show that fine-tuned models consistently outperform prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.460 for transcripts. These results demonstrate the feasibility of automated classroom analysis and establish a foundation for scalable teacher feedback systems.
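To make the multi-label handling in the abstract concrete, the following is a minimal sketch of per-label thresholding and an imbalance-aware loss in PyTorch. It is not the authors' implementation: the random placeholder annotations, the pos_weight-based weighting scheme, and the grid-search threshold tuner are all assumptions chosen for illustration; only the label count (24 video activity labels) comes from the abstract.

    import numpy as np
    import torch
    from sklearn.metrics import f1_score

    NUM_LABELS = 24  # activity labels in the video pipeline (per the abstract)

    # Imbalance-aware loss (assumed scheme): weight each label's positives
    # by its negative-to-positive frequency ratio so rare labels are not
    # drowned out. `train_targets` is an (N, 24) binary annotation matrix;
    # random placeholder data stands in for the real dataset here.
    train_targets = torch.randint(0, 2, (1000, NUM_LABELS)).float()
    pos_counts = train_targets.sum(dim=0).clamp(min=1)
    neg_counts = len(train_targets) - pos_counts
    criterion = torch.nn.BCEWithLogitsLoss(pos_weight=neg_counts / pos_counts)

    # Per-label thresholding (assumed procedure): instead of a global 0.5
    # cutoff, pick for each label the decision threshold that maximizes its
    # F1 score on held-out validation probabilities.
    def tune_thresholds(val_probs: np.ndarray, val_targets: np.ndarray) -> np.ndarray:
        thresholds = np.full(val_probs.shape[1], 0.5)
        for label in range(val_probs.shape[1]):
            best_f1 = 0.0
            for t in np.linspace(0.05, 0.95, 19):
                f1 = f1_score(val_targets[:, label],
                              val_probs[:, label] >= t,
                              zero_division=0)
                if f1 > best_f1:
                    best_f1, thresholds[label] = f1, t
        return thresholds

At inference time, each label would be predicted positive whenever its sigmoid probability exceeds its tuned threshold, which directly targets the macro-F1 metric reported in the abstract.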



WACV 2026

IEEE/CVF Winter Conference on Applications of Computer Vision. Tucson, AZ, USA, Mar 06-10, 2026. To be published. Preprint available.
A Conference

Authors

I. Bueno • R. Hou • B. Bühler • T. Fütterer • J. Drimalla • J. K. Foster • P. Youngs • P. Gerjets • U. Trautwein • E. Kasneci

Links

arXiv

Research Area

B3 | Multimodal Perception

BibTeX Key: BHB+26
