The Diashow Paradox: Stronger 3D-Aware Representations Emerge From Image Sets, Not Videos

Abstract

Image-based vision foundation models (VFMs) have demonstrated surprising 3D geometric awareness, despite receiving no explicit 3D supervision and no pre-training on multi-view data. While image-based models are widely adopted across a range of downstream tasks, video-based models have so far remained on the sidelines of this success. In this work, we conduct a comparative study of image and video models on three tasks that encapsulate 3D awareness: multi-view consistency, depth estimation, and surface normal estimation. To enable a fair and reproducible evaluation of both image and video models, we develop AnyProbe, a unified framework for probing network representations. The results of our study reveal a surprising conclusion, which we refer to as the diashow paradox: video-based pre-training provides no consistent advantage over image-based pre-training on downstream tasks involving 3D understanding. We formulate two hypotheses to explain our observations, which underscore the need for high-quality video datasets and highlight the inherent complexity of video-based pre-training. AnyProbe will be publicly released to streamline the evaluation of image- and video-based VFMs alike in a consistent fashion.
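
The abstract does not spell out the probing protocol, and AnyProbe's API is not public, but a linear probe on frozen backbone features is the standard way such 3D awareness is measured. The sketch below is a hypothetical illustration of that idea for depth estimation, not the paper's actual method: a single linear layer maps frozen patch tokens to a dense depth map, so task performance reflects only what the pre-trained representation already encodes. The names LinearDepthProbe, extract_tokens, and probe_step are ours, and the DINOv2-style feature hook is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDepthProbe(nn.Module):
    """Linear read-out: frozen patch tokens (B, N, C) -> dense depth (B, 1, H, W)."""
    def __init__(self, embed_dim: int, patch_grid: tuple[int, int]):
        super().__init__()
        self.head = nn.Linear(embed_dim, 1)  # the only trainable parameters
        self.patch_grid = patch_grid         # (h, w): token grid of the backbone

    def forward(self, tokens: torch.Tensor, out_size: tuple[int, int]) -> torch.Tensor:
        b, n, _ = tokens.shape
        h, w = self.patch_grid
        depth = self.head(tokens)                         # (B, N, 1)
        depth = depth.transpose(1, 2).reshape(b, 1, h, w)
        return F.interpolate(depth, size=out_size, mode="bilinear", align_corners=False)

@torch.no_grad()
def extract_tokens(backbone: nn.Module, images: torch.Tensor) -> torch.Tensor:
    # Assumes a DINOv2-style backbone whose forward_features() returns a dict
    # with per-patch tokens; other VFMs would need a different extraction hook.
    backbone.eval()
    return backbone.forward_features(images)["x_norm_patchtokens"]

def probe_step(probe, backbone, images, gt_depth, optimizer):
    tokens = extract_tokens(backbone, images)           # (B, N, C), gradient-free
    pred = probe(tokens, out_size=gt_depth.shape[-2:])  # (B, 1, H, W)
    loss = F.l1_loss(pred, gt_depth)                    # simple L1 depth loss
    optimizer.zero_grad()
    loss.backward()                                     # gradients reach only probe.head
    optimizer.step()
    return loss.item()

With an optimizer over probe.parameters() (e.g., torch.optim.AdamW), only the head is trained while the backbone stays frozen, so any depth accuracy must come from the representation itself; the same read-out design transfers directly to surface normal estimation by widening the head's output.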

SP4V @ ICCV 2025

Structural Priors for Vision Workshop at the IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai'i, Oct 19-23, 2025. To be published. Preprint available.

Authors

N. T. Duc • A. Sonnweber • M. Weber • N. Araslanov • D. Cremers

Research Area

B1 | Computer Vision

BibTeX Key: DSW+25