Location-Free Scene Graph Generation

MCML Authors

Ege Özsoy

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Felix Holm

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Chantal Pellegrini

→ Group Nassir Navab
Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Principal Investigator

Computer Aided Medical Procedures & Augmented Reality

Benjamin Busam

Prof. Dr.

Principal Investigator

Photogrammetry and Remote Sensing

Abstract

Scene Graph Generation (SGG) is a visual understanding task that describes a scene as a graph of entities and their relationships, traditionally relying on spatial labels like bounding boxes or segmentation masks. These requirements increase annotation costs and complicate integration with other modalities where spatial synchronization may be unavailable. In this work, we investigate the feasibility and effectiveness of scene graphs without location information, offering an alternative paradigm for scenarios where spatial data is unavailable. To this end, we propose the first method to generate location-free scene graphs directly from images, evaluate their correctness, and show their usefulness in several downstream tasks. Our proposed method, Pix2SG, models scene graph generation as an autoregressive sequence modeling task, predicting all instances and their relations as one output sequence. To enable evaluation without location matching, we propose a heuristic tree search algorithm that matches predicted scene graphs with ground truth graphs, bypassing the need for location-based metrics. We demonstrate the effectiveness of location-free scene graphs on three benchmark datasets and two downstream tasks, image retrieval and visual question answering, showing that they can achieve competitive performance with significantly fewer annotations. Our findings suggest that scene graphs can be generated and utilized effectively without location information, thus opening new avenues for scalable, structured and efficient visual representations, such as for multimodal scene understanding, by reducing dependency on modality-specific annotations. The code will be made available upon acceptance.
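To make the idea of matching without locations concrete, the following is a minimal sketch and not the authors' algorithm. The abstract describes a heuristic tree search; this illustration swaps in a plain exhaustive backtracking search that is only practical for very small graphs. The graph representation (a list of class labels plus index-based triplets) and the names `count_matched` and `match_graphs` are assumptions made for this example.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical representation: a triplet is (subject index, predicate, object index),
# where indices refer to positions in a list of instance class labels.
Triplet = Tuple[int, str, int]


def count_matched(pred_triplets: List[Triplet],
                  gt_triplets: List[Triplet],
                  mapping: Dict[int, int]) -> int:
    """Count predicted triplets that land on a ground-truth triplet under `mapping`."""
    gt_set = set(gt_triplets)
    hits = 0
    for s, p, o in set(pred_triplets):
        ms: Optional[int] = mapping.get(s)
        mo: Optional[int] = mapping.get(o)
        if ms is not None and mo is not None and (ms, p, mo) in gt_set:
            hits += 1
    return hits


def match_graphs(pred_labels: List[str], pred_triplets: List[Triplet],
                 gt_labels: List[str], gt_triplets: List[Triplet]
                 ) -> Tuple[int, Dict[int, int]]:
    """Exhaustively search instance assignments (same class only, one-to-one)
    and return the best triplet overlap with its instance mapping."""
    best_score = -1
    best_map: Dict[int, int] = {}

    def search(i: int, mapping: Dict[int, int], used: Set[int]) -> None:
        nonlocal best_score, best_map
        if i == len(pred_labels):
            score = count_matched(pred_triplets, gt_triplets, mapping)
            if score > best_score:
                best_score, best_map = score, dict(mapping)
            return
        # Branch 1: pair predicted instance i with any unused GT instance of the same class.
        for j, label in enumerate(gt_labels):
            if j not in used and label == pred_labels[i]:
                mapping[i] = j
                used.add(j)
                search(i + 1, mapping, used)
                used.remove(j)
                del mapping[i]
        # Branch 2: leave predicted instance i unmatched.
        search(i + 1, mapping, used)

    search(0, {}, set())
    return best_score, best_map


# Toy example: two people and a horse, listed in different orders.
pred_labels = ["person", "person", "horse"]
pred_triplets = [(0, "riding", 2), (1, "next to", 2)]
gt_labels = ["horse", "person", "person"]
gt_triplets = [(2, "riding", 0), (1, "next to", 0)]

score, mapping = match_graphs(pred_labels, pred_triplets, gt_labels, gt_triplets)
print(score, mapping)
```

In the toy example both predicted triplets can be aligned with the ground truth even though the instances are listed in a different order and no boxes or masks are available. Class labels alone cannot tell the two people apart, so the relations decide which predicted person corresponds to which ground-truth person. The exhaustive search grows exponentially with the number of same-class instances, which is presumably why a heuristic search is preferable in practice.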

inproceedings OHP+25

MULA @CVPR 2025

8th Multimodal Learning and Applications Workshop at IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025.

Authors

E. Özsoy • F. Holm • C. Pellegrini • T. Czempiel • M. Saleh • N. Navab • B. Busam

Research Areas

B1 | Computer Vision

C1 | Medicine

BibTeXKey: OHP+25