03.04.2025

CUPS: Teaching AI to Understand Scenes Without Human Labels

MCML Research Insight - With Christoph Reich, Nikita Araslanov, and Daniel Cremers

What matters now

Understanding the location and semantics of objects in a scene is a fundamental task, enabling robots to navigate complex environments and supporting autonomous driving. However, recent AI models for understanding scenes from images require extensive human guidance in the form of pixel-level annotations to achieve accurate predictions.

To overcome the reliance on human guidance, our Junior Members - Christoph Reich and Nikita Araslanov - together with MCML PI Daniel Cremers and collaborators Oliver Hahn, Christian Rupprecht, and Stefan Roth from TU Darmstadt and the University of Oxford, proposed a novel approach: 🥤🥤 CUPS: Scene-Centric Unsupervised Panoptic Segmentation.

«We present the first unsupervised panoptic method that directly trains on scene-centric imagery.»


Christoph Reich et al.

MCML Junior Members

Why Unsupervised Segmentation Matters

The vast majority of current AI models for segmenting and localizing objects in an image are supervised. This means humans must collect a large dataset of hundreds or thousands of images and manually annotate every pixel of each image. Equipped with these annotated example images, an AI model can be trained. The annotation process, however, is immensely time and resource-intensive. Additionally, human annotations can entail significant biases.

Removing the need for annotated data circumvents this time- and resource-intensive annotation process and avoids introducing annotation biases. CUPS is the first approach that segments and localizes objects in images of complex scenes containing many objects, without requiring any human annotations for training.


«We derive high-quality panoptic pseudo labels of scene-centric images by leveraging self-supervised visual representations, depth, and motion.»


Christoph Reich et al.

MCML Junior Members

How To Segment Images Without Human Supervision

Humans group visual elements and objects based on specific perceptual cues: (1) similarity: elements that look alike are grouped together; (2) invariance: objects are recognized independently of their rotation, translation, or scale; and (3) common fate: elements that move together belong to the same object. CUPS builds on these perceptual cues to obtain unsupervised pseudo-labels. In particular, optical flow, depth, and self-supervised visual representations are used to detect objects and infer semantics. This combined object- and semantic-level understanding is called panoptic segmentation.
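To make these ideas concrete, here is a minimal Python sketch of how such cues could be combined into pseudo-labels: self-supervised per-pixel features are clustered into semantic groups, and moving pixels with similar flow and depth are grouped into object instances via common fate. The shapes, thresholds, and the use of k-means below are illustrative assumptions, not the authors' actual pipeline.

# Illustrative sketch (not the CUPS pipeline) of deriving panoptic pseudo-labels
# from self-supervised features, optical flow, and monocular depth.
import numpy as np
from sklearn.cluster import KMeans

def semantic_pseudo_labels(features, num_classes=8):
    # features: (H, W, C) per-pixel self-supervised representations
    h, w, c = features.shape
    labels = KMeans(n_clusters=num_classes, n_init=10, random_state=0).fit_predict(
        features.reshape(-1, c)
    )
    return labels.reshape(h, w)  # (H, W) semantic pseudo-label map

def instance_pseudo_labels(flow, depth, motion_thresh=1.0, num_instances=5):
    # flow: (H, W, 2) optical flow, depth: (H, W) monocular depth
    h, w, _ = flow.shape
    moving = np.linalg.norm(flow, axis=-1) > motion_thresh   # common-fate candidates
    instance_map = np.zeros((h, w), dtype=int)               # 0 = background / "stuff"
    if moving.sum() < num_instances:
        return instance_map
    cues = np.concatenate([flow, depth[..., None]], axis=-1)[moving]  # (N, 3)
    labels = KMeans(n_clusters=num_instances, n_init=10, random_state=0).fit_predict(cues)
    instance_map[moving] = labels + 1                         # 1..K = object instances
    return instance_map

def panoptic_pseudo_labels(features, flow, depth):
    # Combine both cues into one (H, W, 2) map: semantic id and instance id per pixel.
    return np.stack(
        [semantic_pseudo_labels(features), instance_pseudo_labels(flow, depth)], axis=-1
    )

# Toy usage with random arrays standing in for real features, flow, and depth.
H, W = 64, 64
pseudo = panoptic_pseudo_labels(
    np.random.rand(H, W, 64), np.random.randn(H, W, 2), np.random.rand(H, W)
)
print(pseudo.shape)  # (64, 64, 2)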


CUPS architecture

CUPS architecture: Motion and depth are used to generate scene-centric panoptic pseudo-labels. Given a monocular image (bottom right), CUPS learns a panoptic network using pseudo-labels and self-training.

«Our approach brings the quality of unsupervised panoptic, instance, and semantic segmentation to a new level.»


Christoph Reich et al.

MCML Junior Members

Using these pseudo-labels, CUPS first trains a panoptic segmentation network that both detects objects and predicts semantic categories. After this pseudo-label training, CUPS performs self-training, which enables the panoptic network to detect objects not captured by the pseudo-labels and further improves its segmentation accuracy.
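Schematically, the two training stages could look like the following simplified PyTorch sketch: the network is first fitted to the pseudo-labels and then re-trained on its own high-confidence predictions. The toy network, data, and confidence threshold are assumptions for illustration, not the CUPS implementation.

# Schematic two-stage training: pseudo-label training followed by self-training.
import torch
import torch.nn as nn

num_classes = 8
net = nn.Sequential(                        # stand-in for a panoptic network
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, num_classes, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=255)     # 255 marks unlabeled pixels

images = torch.rand(4, 3, 64, 64)                      # toy image batch
pseudo = torch.randint(0, num_classes, (4, 64, 64))    # pseudo-labels from stage 1

# Stage 1: supervised training on the (noisy) pseudo-labels.
for _ in range(100):
    opt.zero_grad()
    loss = criterion(net(images), pseudo)
    loss.backward()
    opt.step()

# Stage 2: self-training - re-label with the network's own confident predictions.
with torch.no_grad():
    probs = net(images).softmax(dim=1)
    conf, new_labels = probs.max(dim=1)
    new_labels[conf < 0.5] = 255                        # ignore low-confidence pixels

if (new_labels != 255).any():                           # guard against an all-ignored batch
    for _ in range(100):
        opt.zero_grad()
        loss = criterion(net(images), new_labels)
        loss.backward()
        opt.step()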

Key Benefits of CUPS

  • Unsupervised Panoptic Segmentation - CUPS does not require any human annotations to perform segmentation
  • Scene-Centric Performance - Accurate segmentations of complex scenes containing a large number of objects
  • Competitive Results - CUPS achieves state-of-the-art segmentation accuracy across multiple benchmarks, outperforming existing unsupervised models

CUPS In Action

CUPS achieves impressive segmentation results on various datasets. Compared to a recent approach (U2Seg), CUPS detects more objects (e.g., persons and cars) and better captures the semantics of a scene (e.g., distinguishing between road and sidewalk). Quantitatively, CUPS surpasses the previous state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 PQ points.
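For context, panoptic segmentation is typically scored with the panoptic quality (PQ) metric: predicted and ground-truth segments that overlap sufficiently count as true positives, and PQ averages their overlap (IoU) while penalizing missed and spurious segments. A minimal sketch of this standard formula:

# Minimal sketch of the standard panoptic quality (PQ) metric.
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    # matched_ious: IoU values of all matched (true-positive) segment pairs
    tp = len(matched_ious)
    denom = tp + 0.5 * num_false_positives + 0.5 * num_false_negatives
    return sum(matched_ious) / denom if denom > 0 else 0.0

# Example: 3 matched segments, 1 spurious prediction, 2 missed objects.
print(panoptic_quality([0.9, 0.8, 0.7], num_false_positives=1, num_false_negatives=2))  # ~0.53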

CUPS in action

CUPS in action. Given a single image of a crowded scene, CUPS predicts accurate panoptic segmentations, while the recent U2Seg model struggles.


Further Reading & Reference

While this blog post highlights the core ideas of CUPS, the corresponding CVPR 2025 paper takes a detailed look at the entire methodology - including how the pseudo-labels are generated and how training is performed.

If you’re interested in how CUPS compares to other unsupervised and supervised segmentation methods - or want to explore the technical innovations behind its strong performance - check out the full paper accepted to CVPR 2025, one of the most prestigious conferences in the field of computer vision.

O. Hahn, C. Reich, N. Araslanov, D. Cremers, C. Rupprecht and S. Roth.
Scene-Centric Unsupervised Panoptic Segmentation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. Preprint available on arXiv; code on GitHub.
Abstract

Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4% points in PQ.


Curious to test CUPS on your own images? The code is open source - check out the GitHub repository and watch the example video below to see CUPS in action.

CUPS at GitHub
CUPS example video

Share Your Research!


Get in touch with us!

Are you an MCML Junior Member and interested in showcasing your research on our blog?

We’re happy to feature your work - get in touch with us to present your paper.




Related


15.09.2025

Robots Seeing in the Dark - With Researcher Yannick Burkhardt

Yannick Burkhardt researches event cameras that enable robots to react at lightning speed and to see even in the dark.


08.09.2025

3D Machine Perception Beyond Vision - With Researcher Riccardo Marin

Researcher Riccardo Marin explores 3D geometry and AI, from manufacturing to VR, making machine perception more human-like.


01.09.2025

AI for Personalized Psychiatry - With Researcher Clara Vetter

AI research by Clara Vetter uses brain, genetic and smartphone data to personalize psychiatry and improve diagnosis and treatment.


25.08.2025

Satellite Insights for a Sustainable Future - With Researcher Ivica Obadic

AI from satellite imagery helps design livable cities, improve well-being & food systems with transparent models by Ivica Obadić.


18.08.2025

Digital Twins for Surgery - With Researcher Azade Farshad

Azade Farshad develops patient digital twins at TUM & MCML to improve personalized treatment, surgical planning, and training.