09.07.2025

Capturing Complexity in Surgical Environments

MCML Research Insight - With Ege Özsoy, Chantal Pellegrini, Felix Tristram, Kun Yuan, David Bani-Harouni, Matthias Keicher, Benjamin Busam and Nassir Navab

Imagine an operating room - a space filled with intricate interactions, rapid decisions, and precise movements. Now, imagine capturing every detail of such a complex environment not just visually but also through sound, dialogue, robot movements, and much more. This is exactly what "MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments" accomplishes.


Figure 1: Visual Summary of a single timepoint in MM-OR, illustrating the multimodal data provided for each sample: RGB-D video from multiple angles, detailed RGB views, low-exposure video, point cloud data, robot screen and tracker logs, audio and speech transcripts, panoptic segmentations, semantic scene graphs, and downstream task annotations such as robot phase, next action, and sterility breach status.

Introduced in a CVPR 2025 paper by MCML Junior Members Ege Özsoy, Chantal Pellegrini, Felix Tristram, Kun Yuan, David Bani-Harouni, Benjamin Busam and Matthias Keicher, MCML PI Nassir Navab and collaborators Tobias Czempiel and Ulrich Eck, MM-OR is a comprehensive multimodal dataset created to significantly enhance our understanding of operating room (OR) dynamics. It captures robotic knee replacement surgeries (see Figure 1) using various sophisticated sensors, including multiple RGB-D cameras, audio recorders, infrared trackers, and real-time robotic system logs.


Semantic Scene Graph Generation with MM2SG

This detailed data collection allows for generating semantic scene graphs - structured representations of interactions between people, tools, and equipment in the OR. Alongside this, the researchers present MM2SG, an innovative multimodal model capable of interpreting and integrating diverse data types to generate detailed semantic scene graphs that accurately represent OR activities (see Figure 2).
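
To make the idea of a semantic scene graph more concrete, the following minimal Python sketch stores one timepoint as a set of (subject, predicate, object) triplets. The class names and the entity and predicate labels used here (such as `head_surgeon` or `drilling`) are illustrative assumptions, not the actual label set or data format of MM-OR.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    """One directed relation: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


class SceneGraph:
    """Scene graph for a single timepoint, stored as a set of triplets."""

    def __init__(self, triplets):
        self.triplets = set(triplets)
        # Index triplets by subject for quick lookup of outgoing relations.
        self._outgoing = defaultdict(list)
        for t in self.triplets:
            self._outgoing[t.subject].append(t)

    def entities(self):
        """Return every entity that appears as a subject or an object."""
        return {t.subject for t in self.triplets} | {t.obj for t in self.triplets}

    def relations_of(self, subject):
        """Return sorted (predicate, object) pairs for one subject."""
        return sorted((t.predicate, t.obj) for t in self._outgoing.get(subject, []))


# Hypothetical example labels, chosen only for illustration.
graph = SceneGraph([
    Triplet("head_surgeon", "drilling", "patient"),
    Triplet("head_surgeon", "holding", "drill"),
    Triplet("nurse", "close to", "instrument_table"),
])

print(graph.relations_of("head_surgeon"))
```

Storing the graph as plain triplets keeps it easy to compare two graphs or to collect them into a sequence over time.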

Figure 2: Overview of the proposed MM2SG architecture for multimodal scene graph generation. MM2SG processes a variety of data sources through specialized encoders, projecting them into a shared space. The language model generates scene graph triplets describing SGs with entities E_i and predicates p_i. Downstream tasks leverage entire sequences of scene graphs rather than individual ones.
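
Because the language model emits scene graphs as text, that output has to be turned back into structured triplets before it can be used downstream. The sketch below assumes a hypothetical serialization in which triplets are separated by `;` and their fields by `,`. The actual output format of MM2SG is not described in this post, so the function name and format are assumptions.

```python
def parse_triplets(text):
    """Parse 'subj,pred,obj; subj,pred,obj' into a list of 3-tuples.

    Malformed chunks (wrong number of fields, or empty fields) are skipped
    instead of raising, since generated text may be imperfect.
    """
    triplets = []
    for chunk in text.split(";"):
        fields = [field.strip() for field in chunk.split(",")]
        if len(fields) == 3 and all(fields):
            triplets.append(tuple(fields))
    return triplets


# Hypothetical model output; the last chunk is deliberately malformed.
generated = "head surgeon,drilling,patient; nurse,close to,instrument table; broken,"
parsed = parse_triplets(generated)
print(parsed)
```

Skipping malformed chunks is one possible design choice; a stricter pipeline might log or reject the whole output instead.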


«This research is laying a foundation for advancing multimodal scene analysis in high-stakes environments.»


Ege Özsoy et al.

MCML Junior Members

A New Benchmark

What sets MM-OR apart from existing datasets is its unprecedented scale, realism, and multimodality, significantly surpassing previous efforts that often suffered from limited size, narrow scope, or lack of diverse data types. By providing panoptic segmentation annotations and supporting complex downstream tasks like sterility breach detection and action anticipation, MM-OR establishes a robust new benchmark for evaluating and developing advanced OR modeling techniques.


Open Challenges

Despite these advancements, open challenges remain, including accurately modeling rare surgical actions and effectively generalizing models to diverse surgical scenarios beyond knee replacement procedures.


Why It Matters

A better understanding of surgical environments means enhanced situational awareness, improved safety, and more effective surgical assistance.


Curious to Explore More?

The full paper, published at CVPR 2025, one of the highest-ranked AI/ML conferences, explores the potential of this new dataset and technology in more depth, laying the foundations for future advancements in OR systems.

E. Özsoy, C. Pellegrini, T. Czempiel, F. Tristram, K. Yuan, D. Bani-Harouni, U. Eck, B. Busam, M. Keicher and N. Navab.
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published.

Abstract

Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale, realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments, demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding, and open the path towards multimodal scene analysis in complex, high-stakes environments.

MCML Authors

Ege Özsoy, Computer Aided Medical Procedures & Augmented Reality
Chantal Pellegrini, Computer Aided Medical Procedures & Augmented Reality
Felix Tristram, Computer Aided Medical Procedures & Augmented Reality
Kun Yuan, Computer Aided Medical Procedures & Augmented Reality
David Bani-Harouni, Computer Aided Medical Procedures & Augmented Reality
Dr. Benjamin Busam, Computer Aided Medical Procedures & Augmented Reality
Dr. Matthias Keicher, Computer Aided Medical Procedures & Augmented Reality
Prof. Dr. Nassir Navab, Computer Aided Medical Procedures & Augmented Reality

Also check out the YouTube Video showing the dataset.


Additional Material: Practical Examples and Recording Setup

Figure 3: Recording setup and sensors overview. A grey circle by each sensor shows quantity; if absent, the sensor count is one.

To provide deeper insight into how MM-OR and MM2SG work in practice, the figures in this section show the technical setup used to collect the data (see Figure 3) and a real example of MM2SG-generated output (see Figure 4).

Figure 4: Qualitative examples from a test take in MM-OR, illustrating scene graph generation performance of MM2SG. Unlabeled edges indicate the "close to" predicate.

