24.07.2025

SceneDINO: How AI Learns to See and Understand Images in 3D–Without Human Labels
MCML Research Insight - With Christoph Reich, Felix Wimbauer, and Daniel Cremers
Imagine looking at a single image and trying to understand the entire 3D scene–not just what’s visible, but also what’s occluded. Humans do this effortlessly: when we see a photo of a tree, we intuitively grasp its 3D structure and semantic meaning. We learn this ability through interaction and movement in the 3D world, without explicit supervision. Inspired by this natural capability, our Junior Members–Christoph Reich and Felix Wimbauer–together with MCML PI Daniel Cremers and collaborators Aleksandar Jevtić, Oliver Hahn, Christian Rupprecht, and Stefan Roth from TUM, TU Darmstadt, and the University of Oxford, developed a novel approach: 🦖 Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion. SceneDINO can infer both 3D geometry and semantics from a single image–entirely without labeled training data.

©Jevtić et al.
SceneDINO overview. Given a single input image, SceneDINO estimates 3D scene geometry, expressive features, and unsupervised semantics in a feed-forward manner, without using any human annotations.
«We present SceneDINO, the first approach for unsupervised semantic scene completion.»
Christoph Reich et al.
MCML Junior Members
Why Unsupervised Geometric and Semantic Understanding in 3D Matters
Autonomous agents–from self-driving cars to mobile robots–navigate a world that is fundamentally three-dimensional. To operate safely and intelligently, they must understand both the geometry and semantics of their surroundings in 3D. While modern sensors like LiDAR provide accurate geometric measurements, they are expensive, deliver only sparse information, and are not applicable in some domains, such as medical endoscopy. On top of that, obtaining semantic annotations in 3D is immensely time- and resource-intensive, preventing the collection of large-scale annotated training data.
Learning geometric and semantic 3D understanding purely from images, without human annotations, offers a compelling alternative that sidesteps both expensive sensors and costly labels. SceneDINO is the first approach to estimate both 3D geometry and semantics from a single image without requiring human annotations for training.
«Trained using 2D self-supervised features and multi-view self-supervision, SceneDINO predicts 3D geometry and 3D features from a single image.»
Christoph Reich et al.
MCML Junior Members
How To Understand the 3D World Without Human Supervision
Humans naturally perceive a scene from multiple viewpoints by simply moving through the world. SceneDINO draws inspiration from this process by learning from multi-view images during training. The training consists of two stages: First, SceneDINO uses multi-view self-supervision to learn a 3D feature field that captures both the 3D geometry of the scene and semantically meaningful features, all without human labels. In the second stage, the rich feature field is distilled and clustered to produce unsupervised semantic predictions in 3D.

©Jevtić et al.
Multi-view self-supervision
Given a single input image, SceneDINO estimates a 3D feature field. Through volumetric rendering, it synthesizes images and 2D features from novel viewpoints. For self-supervision, SceneDINO’s training leverages additional images captured from different viewpoints and learns to reconstruct them. By also reconstructing 2D multi-view features from a self-supervised image model, SceneDINO learns an expressive and multi-view-consistent feature field in 3D.
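To give a flavor of how such a feature field is rendered, here is a minimal PyTorch sketch of alpha compositing densities and features along camera rays. The toy MLPs, feature dimension, and ray sampling are illustrative stand-ins, not SceneDINO’s actual architecture:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the conditioned density/feature field (not the paper's networks).
density_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())
feature_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 32))

def render(points, deltas):
    """Alpha-composite density and features along rays.
    points: (R, S, 3) samples per ray; deltas: (R, S) sample spacings."""
    sigma = density_mlp(points).squeeze(-1)              # (R, S) volume density
    feats = feature_mlp(points)                          # (R, S, C) point features
    alpha = 1.0 - torch.exp(-sigma * deltas)             # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], -1), -1
    )[:, :-1]                                            # accumulated transmittance
    weights = (alpha * trans).unsqueeze(-1)              # rendering weights
    return (weights * feats).sum(1)                      # (R, C) rendered 2D features

rays, samples = 1024, 32
points = torch.randn(rays, samples, 3)                   # stand-in ray samples
deltas = torch.full((rays, samples), 0.05)
rendered = render(points, deltas)
dino_target = torch.randn(rays, 32)                      # 2D DINO features from another view
loss = (rendered - dino_target).pow(2).mean()            # feature reconstruction term
```

In the full method, a photometric term for reconstructing the images of the other views complements this feature reconstruction term.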

©Jevtić et al.
Distillation and clustering
SceneDINO’s features are expressive and capture semantic concepts. To enhance these concepts, SceneDINO’s features are distilled in 3D: by amplifying similarities and dissimilarities between 3D features, sampled using the predicted scene geometry, we obtain semantically enhanced features. Clustering these features yields semantic predictions, grouping semantically coherent regions in a fully unsupervised way.
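For intuition, here is a minimal sketch of a STEGO-style correlation distillation step followed by clustering. The head size, the similarity shift, and the use of k-means with 19 clusters are assumptions for illustration, not the paper’s exact objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

head = nn.Linear(32, 16)                         # distillation head (assumed sizes)

def corr(x):
    """Pairwise cosine similarities between feature vectors."""
    x = F.normalize(x, dim=-1)
    return x @ x.T

feats = torch.randn(512, 32)                     # frozen 3D features at sampled points
with torch.no_grad():
    target = corr(feats)                         # correlations of the source features
pred = corr(head(feats))                         # correlations of distilled features
shift = 0.2                                      # pushes weakly similar pairs apart
loss = -((target - shift) * pred).mean()         # amplify (dis)similarities

# Unsupervised semantics: cluster the distilled features into pseudo-classes.
with torch.no_grad():
    labels = KMeans(n_clusters=19).fit_predict(head(feats).numpy())
```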
«Multi-view feature consistency, linear probing, and domain generalization results highlight the potential of SceneDINO as a strong foundation for 3D scene-understanding.»
Christoph Reich et al.
MCML Junior Members
Highlights of SceneDINO
SceneDINO performs fully unsupervised semantic scene completion. Three highlights stand out:
- Unsupervised Semantic Scene Completion: SceneDINO is the first approach to perform semantic scene completion (SSC)–the computer vision task of estimating dense 3D geometry and semantics–without requiring any human labels
- Unsupervised SSC Accuracy: SceneDINO achieves state-of-the-art accuracy in unsupervised SSC, outperforming a competitive baseline that is also proposed in the paper
- A Strong Foundation: Beyond unsupervised SSC, SceneDINO offers general, expressive, and multi-view-consistent 3D features, providing a strong foundation for approaching 3D scene understanding with limited human annotations (see the linear-probing sketch below)
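To illustrate the last point: linear probing trains a single linear layer on top of frozen features, so segmentation accuracy directly reflects how much semantic information the features already carry. A minimal sketch, with assumed shapes and class count:

```python
import torch
import torch.nn as nn

num_classes, dim = 19, 32                        # assumed class count / feature size
probe = nn.Linear(dim, num_classes)              # the only trainable component
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

feats = torch.randn(4096, dim)                   # frozen SceneDINO features (stand-in)
labels = torch.randint(0, num_classes, (4096,))  # semantic labels used for probing

for _ in range(100):                             # features stay fixed; only the probe learns
    loss = nn.functional.cross_entropy(probe(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```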
«Our novel 3D distillation approach yields state-of-the-art results in unsupervised SSC.»
Christoph Reich et al.
MCML Junior Members
SceneDINO In Action
Given a single input image, SceneDINO achieves impressive 3D reconstruction and segmentation results without using any human annotations. Compared to the proposed unsupervised baseline (S4C + STEGO), SceneDINO better captures the semantic structure of the scene; especially for distant structures, SceneDINO’s semantic predictions are significantly improved. SceneDINO’s high-dimensional feature field, visualized using a dimensionality reduction approach, reveals semantically rich features.
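A common recipe for such feature visualizations–assumed here, since the post does not name the exact method–is to project the features onto their top three principal components and read them off as RGB colors:

```python
import torch

feats = torch.randn(4096, 32)                    # stand-in for sampled field features
centered = feats - feats.mean(dim=0)
_, _, v = torch.pca_lowrank(centered, q=3)       # top-3 principal directions
rgb = centered @ v                               # (N, 3) low-dimensional projection
rgb -= rgb.min(dim=0).values                     # normalize each channel to [0, 1]
rgb /= rgb.max(dim=0).values + 1e-8
```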

©Jevtić et al.
SceneDINO in action. Given a single input image of a complex scene, SceneDINO estimates an expressive feature field. From this feature field, SceneDINO can accurately segment the scene in 3D.
Further Reading & Reference
While this blog post highlights the core ideas of SceneDINO, the ICCV 2025 paper takes a deep look at the entire methodology, including how training, distillation, and clustering are performed.
If you’re interested in how SceneDINO compares to other methods–or want to explore the technical innovations behind its strong performance–check out the full paper accepted to ICCV 2025, one of the most prestigious conferences in the field of computer vision.
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-23, 2025. To be published. Preprint available. arXiv GitHub
Abstract
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
MCML Authors
Computer Vision & Artificial Intelligence
Curious to test SceneDINO on your own images?
The authors provide an online demo, and the code is open-source–check out the Hugging Face demo and the GitHub repository.
Share Your Research!
Are you an MCML Junior Member and interested in showcasing your research on our blog?
We’re happy to feature your work—get in touch with us to present your paper.