The Diashow Paradox: Stronger 3D-Aware Representations Emerge From Image Sets, Not Videos

Abstract

Image-based vision foundation models (VFMs) have demonstrated surprising 3D geometric awareness, despite receiving no explicit 3D supervision and no pre-training on multi-view data. While image-based models are widely adopted across a range of downstream tasks, video-based models have so far remained on the sidelines of this success. In this work, we conduct a comparative study of image and video models on three tasks that encapsulate 3D awareness: multi-view consistency, depth estimation, and surface normal estimation. To enable a fair and reproducible evaluation of both image and video models, we develop AnyProbe, a unified framework for probing network representations. The results of our study reveal a surprising conclusion, which we refer to as the diashow paradox: video-based pre-training provides no consistent advantage over image-based pre-training on downstream tasks involving 3D understanding. We formulate two hypotheses to explain our observations, which underscore the need for high-quality video datasets and highlight the inherent complexity of video-based pre-training. AnyProbe will be publicly released to streamline the evaluation of image- and video-based VFMs alike in a consistent fashion.
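
The abstract does not spell out the probing protocol, and AnyProbe's API is not public, but a linear probe on frozen backbone features is the standard way such 3D awareness is measured. The sketch below is a hypothetical illustration of that idea for depth estimation, not the paper's actual method: a single linear layer maps frozen patch tokens to a dense depth map, so task performance reflects only what the pre-trained representation already encodes. The names LinearDepthProbe, extract_tokens, and probe_step are ours, and the DINOv2-style feature hook is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDepthProbe(nn.Module):
    """Linear read-out: frozen patch tokens (B, N, C) -> dense depth (B, 1, H, W)."""
    def __init__(self, embed_dim: int, patch_grid: tuple[int, int]):
        super().__init__()
        self.head = nn.Linear(embed_dim, 1)  # the only trainable parameters
        self.patch_grid = patch_grid         # (h, w): token grid of the backbone

    def forward(self, tokens: torch.Tensor, out_size: tuple[int, int]) -> torch.Tensor:
        b, n, _ = tokens.shape
        h, w = self.patch_grid
        depth = self.head(tokens)                         # (B, N, 1)
        depth = depth.transpose(1, 2).reshape(b, 1, h, w)
        return F.interpolate(depth, size=out_size, mode="bilinear", align_corners=False)

@torch.no_grad()
def extract_tokens(backbone: nn.Module, images: torch.Tensor) -> torch.Tensor:
    # Assumes a DINOv2-style backbone whose forward_features() returns a dict
    # with per-patch tokens; other VFMs would need a different extraction hook.
    backbone.eval()
    return backbone.forward_features(images)["x_norm_patchtokens"]

def probe_step(probe, backbone, images, gt_depth, optimizer):
    tokens = extract_tokens(backbone, images)           # (B, N, C), gradient-free
    pred = probe(tokens, out_size=gt_depth.shape[-2:])  # (B, 1, H, W)
    loss = F.l1_loss(pred, gt_depth)                    # simple L1 depth loss
    optimizer.zero_grad()
    loss.backward()                                     # gradients reach only probe.head
    optimizer.step()
    return loss.item()

With an optimizer over probe.parameters() (e.g., torch.optim.AdamW), only the head is trained while the backbone stays frozen, so any depth accuracy must come from the representation itself; the same read-out design transfers directly to surface normal estimation by widening the head's output.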

SP4V @ ICCV 2025

Structural Priors for Vision Workshop at the IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai'i, Oct 19-23, 2025. To be published. Preprint available.

Authors

N. T. Duc • A. Sonnweber • M. Weber • N. Araslanov • D. Cremers

Research Area

B1 | Computer Vision

BibTeX Key: DSW+25