03.04.2025

CUPS: Teaching AI to Understand Scenes Without Human Labels

MCML Research Insight - With Christoph Reich, Nikita Araslanov, and Daniel Cremers

What matters now

Understanding the location and semantics of objects in a scene is a fundamental task, enabling robots to navigate complex environments and supporting autonomous driving. However, recent AI models for understanding scenes from images require extensive human guidance in the form of pixel-level annotations to achieve accurate predictions.

To overcome the reliance on human guidance, our Junior Members - Christoph Reich and Nikita Araslanov - together with MCML PI Daniel Cremers and collaborators Oliver Hahn, Christian Rupprecht, and Stefan Roth from TU Darmstadt and the University of Oxford, proposed a novel approach: 🥤🥤 CUPS: Scene-Centric Unsupervised Panoptic Segmentation.

«We present the first unsupervised panoptic method that directly trains on scene-centric imagery.»


Christoph Reich et al.

MCML Junior Members

Why Unsupervised Segmentation Matters

The vast majority of current AI models for segmenting and localizing objects in an image are supervised. This means humans must collect a large dataset of hundreds or thousands of images and manually annotate every pixel of each image. Equipped with these annotated example images, an AI model can be trained. The annotation process, however, is immensely time and resource-intensive. Additionally, human annotations can entail significant biases.

Removing the need for annotated data circumvents this time- and resource-intensive annotation process and avoids introducing annotation biases. CUPS is the first approach that segments and localizes objects in images of complex scenes containing many objects, without requiring any human annotations for training.


«We derive high-quality panoptic pseudo labels of scene-centric images by leveraging self-supervised visual representations, depth, and motion.»


Christoph Reich et al.

MCML Junior Members

How To Segment Images Without Human Supervision

Humans group visual elements and objects based on specific perceptual cues: (1) similarity: elements that look alike are grouped together; (2) invariance: objects are recognized independently of their rotation, translation, or scale; and (3) common fate: elements that move together belong to the same object. CUPS builds on these perceptual cues to obtain unsupervised pseudo-labels. In particular, optical flow, depth, and self-supervised visual representations are used to detect objects and infer semantics. This combined object- and semantic-level understanding is called panoptic segmentation.
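To make these ideas concrete, here is a minimal Python sketch of how such cues could be combined into pseudo-labels: self-supervised per-pixel features are clustered into semantic groups, and moving pixels with similar flow and depth are grouped into object instances via common fate. The shapes, thresholds, and the use of k-means below are illustrative assumptions, not the authors' actual pipeline.

# Illustrative sketch (not the CUPS pipeline) of deriving panoptic pseudo-labels
# from self-supervised features, optical flow, and monocular depth.
import numpy as np
from sklearn.cluster import KMeans

def semantic_pseudo_labels(features, num_classes=8):
    # features: (H, W, C) per-pixel self-supervised representations
    h, w, c = features.shape
    labels = KMeans(n_clusters=num_classes, n_init=10, random_state=0).fit_predict(
        features.reshape(-1, c)
    )
    return labels.reshape(h, w)  # (H, W) semantic pseudo-label map

def instance_pseudo_labels(flow, depth, motion_thresh=1.0, num_instances=5):
    # flow: (H, W, 2) optical flow, depth: (H, W) monocular depth
    h, w, _ = flow.shape
    moving = np.linalg.norm(flow, axis=-1) > motion_thresh   # common-fate candidates
    instance_map = np.zeros((h, w), dtype=int)               # 0 = background / "stuff"
    if moving.sum() < num_instances:
        return instance_map
    cues = np.concatenate([flow, depth[..., None]], axis=-1)[moving]  # (N, 3)
    labels = KMeans(n_clusters=num_instances, n_init=10, random_state=0).fit_predict(cues)
    instance_map[moving] = labels + 1                         # 1..K = object instances
    return instance_map

def panoptic_pseudo_labels(features, flow, depth):
    # Combine both cues into one (H, W, 2) map: semantic id and instance id per pixel.
    return np.stack(
        [semantic_pseudo_labels(features), instance_pseudo_labels(flow, depth)], axis=-1
    )

# Toy usage with random arrays standing in for real features, flow, and depth.
H, W = 64, 64
pseudo = panoptic_pseudo_labels(
    np.random.rand(H, W, 64), np.random.randn(H, W, 2), np.random.rand(H, W)
)
print(pseudo.shape)  # (64, 64, 2)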


CUPS architecture

CUPS architecture: Motion and depth are used to generate scene-centric panoptic pseudo-labels. Given a monocular image (bottom right), CUPS learns a panoptic network using pseudo-labels and self-training.

«Our approach brings the quality of unsupervised panoptic, instance, and semantic segmentation to a new level.»


Christoph Reich et al.

MCML Junior Members

Using these pseudo-labels, CUPS first trains a panoptic segmentation network that both detects objects and predicts semantic categories. After this pseudo-label training, CUPS performs self-training, which enables the panoptic network to detect objects not captured by the pseudo-labels and further improves its segmentation accuracy.
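Schematically, the two training stages could look like the following simplified PyTorch sketch: the network is first fitted to the pseudo-labels and then re-trained on its own high-confidence predictions. The toy network, data, and confidence threshold are assumptions for illustration, not the CUPS implementation.

# Schematic two-stage training: pseudo-label training followed by self-training.
import torch
import torch.nn as nn

num_classes = 8
net = nn.Sequential(                        # stand-in for a panoptic network
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, num_classes, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=255)     # 255 marks unlabeled pixels

images = torch.rand(4, 3, 64, 64)                      # toy image batch
pseudo = torch.randint(0, num_classes, (4, 64, 64))    # pseudo-labels from stage 1

# Stage 1: supervised training on the (noisy) pseudo-labels.
for _ in range(100):
    opt.zero_grad()
    loss = criterion(net(images), pseudo)
    loss.backward()
    opt.step()

# Stage 2: self-training - re-label with the network's own confident predictions.
with torch.no_grad():
    probs = net(images).softmax(dim=1)
    conf, new_labels = probs.max(dim=1)
    new_labels[conf < 0.5] = 255                        # ignore low-confidence pixels

if (new_labels != 255).any():                           # guard against an all-ignored batch
    for _ in range(100):
        opt.zero_grad()
        loss = criterion(net(images), new_labels)
        loss.backward()
        opt.step()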

Key Benefits of CUPS

  • Unsupervised Panoptic Segmentation - CUPS does not require any human annotations to perform segmentation
  • Scene-Centric Performance - Accurate segmentations of complex scenes containing a large number of objects
  • Competitive Results - CUPS achieves state-of-the-art segmentation accuracy across multiple benchmarks, outperforming existing unsupervised models

CUPS In Action

CUPS achieves impressive segmentation results on various datasets. Compared to a recent approach (U2Seg), CUPS detects more objects (e.g., persons and cars) and better captures the semantics of a scene (e.g., distinguishing between road and sidewalk). Quantitatively, CUPS surpasses the previous state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 PQ points.
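For context, panoptic segmentation is typically scored with the panoptic quality (PQ) metric: predicted and ground-truth segments that overlap sufficiently count as true positives, and PQ averages their overlap (IoU) while penalizing missed and spurious segments. A minimal sketch of this standard formula:

# Minimal sketch of the standard panoptic quality (PQ) metric.
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    # matched_ious: IoU values of all matched (true-positive) segment pairs
    tp = len(matched_ious)
    denom = tp + 0.5 * num_false_positives + 0.5 * num_false_negatives
    return sum(matched_ious) / denom if denom > 0 else 0.0

# Example: 3 matched segments, 1 spurious prediction, 2 missed objects.
print(panoptic_quality([0.9, 0.8, 0.7], num_false_positives=1, num_false_negatives=2))  # ~0.53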

CUPS in action

CUPS in action. Given a single image of a crowded scene, CUPS predicts accurate panoptic segmentations, while the recent U2Seg model struggles.


Further Reading & Reference

While this blog post highlights the core ideas of CUPS, the corresponding CVPR 2025 paper takes a detailed look at the entire methodology - including how the pseudo-labels are generated and how training is performed.

If you’re interested in how CUPS compares to other unsupervised and supervised segmentation methods - or want to explore the technical innovations behind its strong performance - check out the full paper accepted to CVPR 2025, one of the most prestigious conferences in the field of computer vision.

O. Hahn, C. Reich, N. Araslanov, D. Cremers, C. Rupprecht and S. Roth.
Scene-Centric Unsupervised Panoptic Segmentation.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. To be published. Preprint available on arXiv; code on GitHub.
Abstract

Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4% points in PQ.


Curious to test CUPS on your own images? The code is open source - check out the GitHub repository and watch the example video below to see CUPS in action.

CUPS at GitHub
CUPS example video

Share Your Research!


Get in touch with us!

Are you an MCML Junior Member and interested in showcasing your research on our blog?

We’re happy to feature your work - get in touch with us to present your paper.




Related


15.09.2025

Robots Seeing in the Dark - With Researcher Yannick Burkhardt

Yannick Burkhardt researches event cameras that enable robots to react at lightning speed and to see even in the dark.


08.09.2025

3D Machine Perception Beyond Vision - With Researcher Riccardo Marin

Researcher Riccardo Marin explores 3D geometry and AI, from manufacturing to VR, making machine perception more human-like.


01.09.2025

AI for Personalized Psychiatry - With Researcher Clara Vetter

AI research by Clara Vetter uses brain, genetic and smartphone data to personalize psychiatry and improve diagnosis and treatment.


25.08.2025

Satellite Insights for a Sustainable Future - With Researcher Ivica Obadic

AI from satellite imagery helps design livable cities, improve well-being & food systems with transparent models by Ivica Obadić.


18.08.2025

Digital Twins for Surgery - With Researcher Azade Farshad

Azade Farshad develops patient digital twins at TUM & MCML to improve personalized treatment, surgical planning, and training.