
Stepper: Stepwise Immersive Scene Generation With Multiview Panoramas

MCML Authors

Abstract

The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches face a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches and thereby setting a new standard for immersive scene generation.

Preprint

Mar. 2026

Authors

F. Wimbauer • F. Manhardt • M. Oechsle • N. Kalischek • C. Rupprecht • D. Cremers • F. Tombari

Links

arXiv GitHub

Research Area

B1 | Computer Vision

BibTeXKey: WMO+26
