MCML researchers with ten papers at ICCV 2023

H. Chen, A. Frikha, D. Krompass, J. Gu and V. Tresp.
FRAug: Tackling Federated Learning with Non-IID Features via Representation Augmentation.
IEEE/CVF International Conference on Computer Vision (ICCV 2023). Paris, France, Oct 02-06, 2023. DOI.

Abstract

Federated Learning (FL) is a decentralized machine learning paradigm, in which multiple clients collaboratively train neural networks without centralizing their local data, and hence preserve data privacy. However, real-world FL applications usually encounter challenges arising from distribution shifts across the local datasets of individual clients. These shifts may drift the global model aggregation or result in convergence to deflected local optimum. While existing efforts have addressed distribution shifts in the label space, an equally important challenge remains relatively unexplored. This challenge involves situations where the local data of different clients indicate identical label distributions but exhibit divergent feature distributions. This issue can significantly impact the global model performance in the FL framework. In this work, we propose Federated Representation Augmentation (FRAug) to resolve this practical and challenging problem. FRAug optimizes a shared embedding generator to capture client consensus. Its output synthetic embeddings are transformed into client-specific by a locally optimized RTNet to augment the training space of each client. Our empirical evaluation on three public benchmarks and a real-world medical dataset demonstrates the effectiveness of the proposed method, which substantially outperforms the current state-of-the-art FL methods for feature distribution shifts, including PartialFed and FedBN.

MCML Authors

Ahmed Frikha

Dr.

* Former member

Volker Tresp

Prof. Dr.

Database Systems & Data Mining

M. B. Colomer, P. L. Dovesi, T. Panagiotakopoulos, J. F. Carvalho, L. Härenstam-Nielsen, H. Azizpour, H. Kjellström, D. Cremers and M. Poggi.
To adapt or not to adapt? Real-time adaptation for semantic segmentation.
IEEE/CVF International Conference on Computer Vision (ICCV 2023). Paris, France, Oct 02-06, 2023. DOI.

Abstract

The goal of Online Domain Adaptation for semantic segmentation is to handle unforeseeable domain changes that occur during deployment, like sudden weather events. However, the high computational costs associated with brute-force adaptation make this paradigm unfeasible for real-world applications. In this paper we propose HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. Our approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advancements, our approach is capable of performing semantic segmentation while simultaneously adapting at more than 29FPS on a single consumer-grade GPU. Our framework’s encouraging accuracy and speed trade-off is demonstrated on OnDA and SHIFT benchmarks through experimental results.

MCML Authors

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

A. Farshad, Y. Yeganeh, Y. Chi, C. Shen, B. Ommer and N. Navab.
Scenegenie: Scene graph guided diffusion models for image synthesis.
Workshop at the IEEE/CVF International Conference on Computer Vision (ICCV 2023). Paris, France, Oct 02-06, 2023. DOI. GitHub.

Abstract

Text-conditioned image generation has made significant progress in recent years with generative adversarial networks and more recently, diffusion models. While diffusion models conditioned on text prompts have produced impressive and high-quality images, accurately representing complex text prompts such as the number of instances of a specific object remains challenging.To address this limitation, we propose a novel guidance approach for the sampling process in the diffusion model that leverages bounding box and segmentation map information at inference time without additional training data. Through a novel loss in the sampling process, our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints, leading to high-resolution images that accurately represent the scene. To obtain bounding box and segmentation map information, we structure the text prompt as a scene graph and enrich the nodes with CLIP embeddings. Our proposed model achieves state-of-the-art performance on two public benchmarks for image generation from scene graphs, surpassing both scene graph to image and text-based diffusion models in various metrics. Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process for more accurate text-to-image generation.

MCML Authors

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality

Yousef Yeganeh

Computer Aided Medical Procedures & Augmented Reality

Björn Ommer

Prof. Dr.

Machine Vision & Learning

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

C1 | Medicine

M. Gao, P. Roetzer, M. Eisenberger, Z. Lähner, M. Moeller, D. Cremers and F. Bernard.
ΣIGMA: Scale-Invariant Global Sparse Shape Matching.
IEEE/CVF International Conference on Computer Vision (ICCV 2023). Paris, France, Oct 02-06, 2023. DOI.

Abstract

We propose a novel mixed-integer programming (MIP) formulation for generating precise sparse correspondences for highly non-rigid shapes. To this end, we introduce a projected Laplace-Beltrami operator (PLBO) which combines intrinsic and extrinsic geometric information to measure the deformation quality induced by predicted correspondences. We integrate the PLBO, together with an orientation-aware regulariser, into a novel MIP formulation that can be solved to global optimality for many practical problems. In contrast to previous methods, our approach is provably invariant to rigid transformations and global scaling, initialisation-free, has optimality guarantees, and scales to high resolution meshes with (empirically observed) linear time. We show state-of-the-art results for sparse non-rigid matching on several challenging 3D datasets, including data with inconsistent meshing, as well as applications in mesh-to-point-cloud matching.

MCML Authors

Maolin Gao

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

H. Li, J. Dong, B. Wen, M. Gao, T. Huang, Y.-H. Liu and D. Cremers.
DDIT: Semantic Scene Completion via Deformable Deep Implicit Templates.
IEEE/CVF International Conference on Computer Vision (ICCV 2023). Paris, France, Oct 02-06, 2023. DOI.

Abstract

Scene reconstructions are often incomplete due to occlusions and limited viewpoints. There have been efforts to use semantic information for scene completion. However, the completed shapes may be rough and imprecise since respective methods rely on 3D convolution and/or lack effective shape constraints. To overcome these limitations, we propose a semantic scene completion method based on deformable deep implicit templates (DDIT). Specifically, we complete each segmented instance in a scene by deforming a template with a latent code. Such a template is expressed by a deep implicit function in the canonical frame. It abstracts the shape prior of a category, and thus can provide constraints on the overall shape of an instance. Latent code controls the deformation of template to guarantee fine details of an instance. For code prediction, we design a neural network that leverages both intra-and inter-instance information. We also introduce an algorithm to transform instances between the world and canonical frames based on geometric constraints and a hierarchical tree. To further improve accuracy, we jointly optimize the latent code and transformation by enforcing the zero-valued isosurface constraint. In addition, we establish a new dataset to solve different problems of existing datasets. Experiments showed that our DDIT outperforms state-of-the-art approaches.

MCML Authors

Haoang Li

Dr.

* Former member

Maolin Gao

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

H. Li, J. Gu, R. Koner, S. Sharifzadeh and V. Tresp.
Do DALL-E and Flamingo Understand Each Other?.
IEEE/CVF International Conference on Computer Vision (ICCV 2023). Paris, France, Oct 02-06, 2023. DOI. GitHub.

Abstract

The field of multimodal research focusing on the comprehension and creation of both images and text has witnessed significant strides. This progress is exemplified by the emergence of sophisticated models dedicated to image captioning at scale, such as the notable Flamingo model and text-to-image generative models, with DALL-E serving as a prominent example. An interesting question worth exploring in this domain is whether Flamingo and DALL-E understand each other. To study this question, we propose a reconstruction task where Flamingo generates a description for a given image and DALL-E uses this description as input to synthesize a new image. We argue that these models understand each other if the generated image is similar to the given image. Specifically, we study the relationship between the quality of the image reconstruction and that of the text generation. We find that an optimal description of an image is one that gives rise to a generated image similar to the original one. The finding motivates us to propose a unified framework to finetune the text-to-image and image-to-text models. Concretely, the reconstruction part forms a regularization loss to guide the tuning of the models. Extensive experiments on multiple datasets with different image captioning and image generation models validate our findings and demonstrate the effectiveness of our proposed unified framework. As DALL-E and Flamingo are not publicly available, we use Stable Diffusion and BLIP in the remaining work.

MCML Authors

Hang Li

Database Systems & Data Mining

Rajat Koner

Database Systems & Data Mining

Volker Tresp

Prof. Dr.

Database Systems & Data Mining

Y. Xia, M. Gladkova, R. Wang, Q. Li, U. Stilla, J. F. Henriques and D. Cremers.
CASSPR: Cross Attention Single Scan Place Recognition.
IEEE/CVF International Conference on Computer Vision (ICCV 2023). Paris, France, Oct 02-06, 2023. DOI.

Abstract

Place recognition based on point clouds (LiDAR) is an important component for autonomous robots or self-driving vehicles. Current SOTA performance is achieved on accumulated LiDAR submaps using either point-based or voxel-based structures. While voxel-based approaches nicely integrate spatial context across multiple scales, they do not exhibit the local precision of point-based methods. As a result, existing methods struggle with fine-grained matching of subtle geometric features in sparse single-shot Li-DAR scans. To overcome these limitations, we propose CASSPR as a method to fuse point-based and voxel-based approaches using cross attention transformers. CASSPR leverages a sparse voxel branch for extracting and aggregating information at lower resolution and a point-wise branch for obtaining fine-grained local information. CASSPR uses queries from one branch to try to match structures in the other branch, ensuring that both extract self-contained descriptors of the point cloud (rather than one branch dominating), but using both to inform the out-put global descriptor of the point cloud. Extensive experiments show that CASSPR surpasses the state-of-the-art by a large margin on several datasets (Oxford RobotCar, TUM, USyd). For instance, it achieves AR@1 of 85.6% on the TUM dataset, surpassing the strongest prior model by ~15%. Our code is publicly available.

MCML Authors

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Mariia Gladkova

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Y. Yeganeh, A. Farshad, P. Weinberger, S.-A. Ahmadi, E. Adeli and N. Navab.
Transformers pay attention to convolutions leveraging emerging properties of vits by dual attention-image network.
Workshop at the IEEE/CVF International Conference on Computer Vision (ICCV 2023). Paris, France, Oct 02-06, 2023. DOI. GitHub.

Abstract

Although purely transformer-based architectures pretrained on large datasets are introduced as foundation models for general computer vision tasks, hybrid models that incorporate combinations of convolution and transformer blocks showed state-of-the-art performance in more specialized tasks. Nevertheless, despite the performance gain of both pure and hybrid transformer-based architectures compared to convolutional networks, their high training cost and complexity make it challenging to use them in real scenarios. In this work, we propose a novel and simple architecture based on only convolutional layers and show that by just taking advantage of the attention map visualizations obtained from a self-supervised pretrained vision transformer network, complex transformer-based networks, and even 3D architectures are outperformed with much fewer computation costs. The proposed architecture is composed of two encoder branches with the original image as input in one branch and the attention map visualizations of the same image from multiple self-attention heads from a pre-trained DINO model in the other branch. The results of our experiments on medical imaging datasets show that the extracted attention map visualizations from the attention heads of a pre-trained transformer architecture combined with the image provide strong prior knowledge for a pure CNN architecture to outperform CNN-based and transformer-based architectures.

MCML Authors

Yousef Yeganeh

Computer Aided Medical Procedures & Augmented Reality

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality

C1 | Medicine

G. Zhang, J. Ren, J. Gu and V. Tresp.
Multi-event Video-Text Retrieval.
IEEE/CVF International Conference on Computer Vision (ICCV 2023). Paris, France, Oct 02-06, 2023. DOI. GitHub.

MCML Authors

Gengyuan Zhang

Database Systems & Data Mining

Volker Tresp

Prof. Dr.

Database Systems & Data Mining