24.07.2025

Teaser image to SceneDINO: How AI Learns to See and Understand Images in 3D–Without Human Labels

SceneDINO: How AI Learns to See and Understand Images in 3D–Without Human Labels

MCML Research Insight - With Christoph Reich, Felix Wimbauer, and Daniel Cremers

Imagine looking at a single image and trying to understand the entire 3D scenery–not just what’s visible, but also what’s occluded. Humans do this effortlessly: when we see a photo of a tree, we intuitively grasp its 3D structure and semantic meaning. We learn this ability through interaction and movement in the 3D world, without explicit supervision. Inspired by this natural capability, our Junior Members–Christoph Reich and Felix Wimbauer–together with MCML PI Daniel Cremers and collaborators Aleksandar Jevtić, Oliver Hahn, Christian Rupprecht, and Stefan Roth from TUM, TU Darmstadt, and the University of Oxford, developed a novel approach: 🦖 Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion. SceneDINO can infer both 3D geometry and semantics from a single image–entirely without labeled training data.


SceneDINO overview

SceneDINO overview. Given a single input image, SceneDINO estimates 3D scene geometry, expressive features, and unsupervised semantics in a feed-forward manner, without using any human annotations.


«We present SceneDINO, the first approach for unsupervised semantic scene completion.»


Christoph Reich et al.

MCML Junior Members

Why Unsupervised Geometric and Semantic Understanding in 3D Matters

Autonomous agents–from self-driving cars to mobile robots–navigate a world that is fundamentally three-dimensional. To operate safely and intelligently, they must understand both the geometry and semantics of their surroundings in 3D. While modern sensors like LiDAR can provide accurate geometric measurements, these sensors are expensive, offer only sparse information, and are not applicable in some domains, such as medical endoscopy. On top of that, obtaining semantic annotations in 3D is immensely time and resource-intensive, preventing the collection of large amounts of annotated training examples.

Approaching the understanding of geometry and semantics in 3D purely from images without human annotations offers a compelling alternative to overcome the limitations associated with expensive sensors and semantic annotations. SceneDINO builds the first approach to estimate both 3D geometry and semantics from a single image, without requiring human annotations for training.


«Trained using 2D self-supervised features and multi-view self-supervision SceneDINO predicts 3D geometry and 3D features from a single image.»


Christoph Reich et al.

MCML Junior Members

How To Understand the 3D World Without Human Supervision

Humans naturally perceive a scene from multiple viewpoints by simply moving through the world. SceneDINO draws inspiration from this process by learning from multi-view images during training. The training consists of two stages: First, SceneDINO uses multi-view self-supervision to learn a 3D feature field that captures both the 3D geometry of the scene and semantically meaningful features, all without human labels. In the second stage, the rich feature field is distilled and clustered to produce unsupervised semantic predictions in 3D.


Multi-view self-supervision

Multi-view self-supervision

Multi-view self-supervision

Given a single input image, SceneDINO estimates a 3D feature field. Through volumetric rendering, synthesized images and 2D features from novel viewpoints can be estimated. For self-supervision, SceneDINO’s training leverages additional images captured from different viewpoints and learns to reconstruct them. By also reconstructing 2D multi-view features from a self-supervised image model, SceneDINO learns an expressive and multi-view consistent feature field in 3D.


Distillation and clustering

Distillation and clustering

Distillation and clustering

SceneDINO’s features are expressive and capture semantic concepts. To enhance these semantic concepts, SceneDINO’s features are distilled in 3D. By amplifying similarities and dissimilarities between 3D features using our pr geometry, we obtain semantically enhanced features. Clustering these features results in semantic prediction, grouping semantically coherent regions in a fully unsupervised way.


«Multi-view feature consistency, linear probing, and domain generalization results highlight the potential of SceneDINO as a strong foundation for 3D scene-understanding.»


Christoph Reich et al.

MCML Junior Members

Highlights of SceneDINO

SceneDINO can perform fully unsupervised semantic scene completion–the computer vision task of estimating dense 3D geometry and semantics.

  • Unsupervised Semantic Scene Completion: SceneDINO is the first approach to perform semantic scene completion (SSC)–the computer vision task of estimating dense 3D geometry and semantics–without requiring any human labels
  • Unsupervised SSC Accuracy: SceneDINO achieves state-of-the-art accuracy in SSC compared to a competitive baseline (also proposed in the paper)
  • A Strong Foundation: Beyond unsupervised SSC, SceneDINO offers general, expressive, and multi-view consistent 3D features, providing a strong foundation for approaching 3D scene-understanding using limited human annotations

«Our novel 3D distillation approach yields state-of-the-art results in unsupervised SSC.»


Christoph Reich et al.

MCML Junior Members

SceneDINO In Action

Given a single input image, SceneDINO achieves impressive 3D reconstructions and segmentation results, without using any human annotations. Compared to the proposed unsupervised baseline (S4C + STEGO), SceneDINO better captures the semantic structure of the scene. Especially for distant structures, SceneDINOs’ semantic predictions are significantly improved. SceneDINO’s high-dimensional feature field, visualized using a dimensionality reduction approach, includes highly semantically rich features.

SceneDINO in action.

SceneDINO in action. Given a single input image of a complex scene, SceneDINO estimates an expressive feature field. From this feature field, SceneDINO can accurately segment the scene in 3D.


Further Reading & Reference

While this blog post highlights the core ideas of SceneDINO, the respective ICCV 2025 paper takes a deep look at the entire methodology, including how training, distillation, and clustering are performed.

If you’re interested in how SceneDINO compares to other methods–or want to explore the technical innovations behind its strong performance–check out the full paper accepted to ICCV 2025, one of the most prestigious conferences in the field of computer vision.

A. Jevtić, C. Reich, F. Wimbauer, O. Hahn, C. Rupprecht, S. Roth and D. Cremers.
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion.
ICCV 2025 - IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai’i, Oct 19-13, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.

MCML Authors
Link to website

Christoph Reich

Computer Vision & Artificial Intelligence

Link to website

Felix Wimbauer

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


Curious to test SceneDINO on your own images?

The authors provide an online demo, and the code is open-source–check out the Hugging Face demo and the GitHub repository.

SceneDINO Demo
SceneDINO Code on GitHub

Share Your Research!


Get in touch with us!

Are you an MCML Junior Member and interested in showcasing your research on our blog?

We’re happy to feature your work—get in touch with us to present your paper.

24.07.2025


Subscribe to RSS News feed

Related

Link to From Physics Dreams to Algorithm Discovery - with Niki Kilbertus

13.08.2025

From Physics Dreams to Algorithm Discovery - With Niki Kilbertus

Niki Kilbertus develops AI algorithms to uncover cause and effect, making science smarter and decisions in fields like medicine more reliable.

Link to Tracking Actions in Space and Time: ICCV 2025 Challenge & Workshop

12.08.2025

Tracking Actions in Space and Time: ICCV 2025 Challenge & Workshop

ICCV 2025 workshop: Advancing AI to detect who does what, when, and where — across space, time, and complex real-world videos.

Link to AI for Dynamic Urban Mapping - with researcher Shanshan Bai

11.08.2025

AI for Dynamic Urban Mapping - With Researcher Shanshan Bai

Shanshan Bai uses geo-tagged social media and AI to map cities in real time. Part of KI Trans, funded by DATIpilot to support AI in education.

Link to Precise and Subject-Specific Attribute Control in AI Image Generation

07.08.2025

Precise and Subject-Specific Attribute Control in AI Image Generation

At CVPR 2025, MCML researchers introduce a method for fine-grained image control—tweaking attributes like age or mood without changing the scene.

Link to What is intelligence—and what kind of intelligence do we want in our future? With Sven Nyholm

06.08.2025

What Is Intelligence—and What Kind of Intelligence Do We Want in Our Future? With Sven Nyholm

Sven Nyholm explores how AI reshapes authorship, responsibility and creativity, calling for democratic oversight in shaping our AI future.