Recognition and forecasting of surgical events from video sequences are crucial for advancing computer-assisted surgery. Surgical events are often characterized by specific tool-tissue interactions; for example, “bleeding damage” occurs when a tool unintentionally cuts tissue, leading to blood flow. Despite progress in general event classification, recognizing and forecasting events in medical contexts remain challenging due to data scarcity and the complexity of these events. To address these challenges, we propose a method based on video masked autoencoders (VideoMAE) for surgical event recognition. This approach focuses the network on the most informative areas of the video while minimizing the need for extensive annotations. We introduce a novel mask sampling technique based on an estimated prior probability map derived from optical flow, hypothesizing that prior knowledge of tool-tissue interactions will enable the network to concentrate on the most relevant regions of the video. We propose two methods for estimating the prior probability map: (a) retaining the areas with the fastest motion and (b) incorporating an additional encoding pathway for optical flow. Extensive experiments on the public CATARACTS dataset and on our in-house neurosurgical data demonstrate that optical flow-based masking consistently outperforms VideoMAE’s random masking strategy in phase and event classification tasks. We further find that an optical flow encoder improves classification accuracy by directing the network’s focus to the most relevant information, even in regions without rapid motion. Finally, we investigate sequential and multi-task training strategies to identify the best-performing model, which surpasses the current state of the art by 5% on the CATARACTS dataset and by 27% on our in-house neurosurgical data.
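For method (a), the following is a minimal PyTorch sketch of what flow-guided tube masking could look like, assuming the standard VideoMAE setup (16×16 spatial patches, tubelets of two frames, a 90% masking ratio). The function name `flow_prior_mask` and the deterministic top-k selection are illustrative assumptions, not the paper’s implementation:

```python
# Sketch of flow-guided tube masking for a VideoMAE-style encoder
# (method (a): retain the fastest-moving regions as visible tokens).
# All names here are illustrative, not taken from the paper.
import torch

def flow_prior_mask(flow, patch_size=16, tubelet=2, mask_ratio=0.9):
    """Build a boolean mask over space-time tubes from dense optical flow.

    flow: (T, 2, H, W) optical flow between consecutive frames.
    Returns a mask of shape (T // tubelet, H // patch_size, W // patch_size),
    where True = masked (hidden from the encoder) and False = visible.
    """
    T, _, H, W = flow.shape
    mag = flow.norm(dim=1)  # (T, H, W) per-pixel motion magnitude
    # Average the magnitude inside each space-time tube -> prior probability map.
    prior = mag.reshape(T // tubelet, tubelet,
                        H // patch_size, patch_size,
                        W // patch_size, patch_size).mean(dim=(1, 3, 5))
    prior = prior.flatten()  # one motion score per tube
    n_visible = int(prior.numel() * (1 - mask_ratio))
    # Keep the n_visible highest-motion tubes visible; mask everything else.
    visible_idx = prior.topk(n_visible).indices
    mask = torch.ones_like(prior, dtype=torch.bool)
    mask[visible_idx] = False
    return mask.reshape(T // tubelet, H // patch_size, W // patch_size)

# Example: a 16-frame clip at 224x224 with random stand-in "flow".
mask = flow_prior_mask(torch.randn(16, 2, 224, 224))
print(mask.shape, mask.float().mean())  # (8, 14, 14), ~90% masked
```

Sampling visible tubes stochastically from the normalized prior (e.g., via `torch.multinomial`) instead of a hard top-k would preserve some of the randomness of the original VideoMAE masking while still biasing the visible tokens toward high-motion regions.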