
Research Group Xi Wang


Link to website at TUM

Xi Wang

Dr.

JRG Leader Egocentric Vision

Computer Vision & Artificial Intelligence

Xi Wang leads the MCML Junior Research Group ‘Egocentric Vision’ at TU Munich.

Xi Wang and her team conduct cutting-edge research in egocentric vision, focusing on learning from first-person human videos to understand behavior patterns and extract information valuable for potential applications in robotics. Their ongoing projects include 3D reconstruction using Gaussian splatting and multimodal learning with vision-language models. Funded as a BMBF project, the group maintains close ties with MCML and actively seeks collaborations that bridge egocentric vision with other research domains, extending beyond the group's own focus.

Team members @MCML

PhD Students

Link to website

Abhishek Saroha

Computer Vision & Artificial Intelligence

Link to website

Dominik Schnaus

Computer Vision & Artificial Intelligence

Publications @MCML

2025


[4]
O. Kuzyk, Z. Li, M. Pollefeys and X. Wang.
VisualChef: Generating Visual Aids in Cooking via Mask Inpainting.
Preprint (Jun. 2025). arXiv
Abstract

Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action’s execution and the resulting appearance of the object, while preserving the initial frame’s environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.
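As a rough illustration of the mask-based visual grounding idea described in the abstract, the sketch below uses an off-the-shelf inpainting pipeline to regenerate only a masked, action-relevant region of an initial frame while leaving the surrounding environment untouched. The model checkpoint, file names, and prompt are assumptions for illustration; this is not the VisualChef implementation.

# Minimal sketch of mask-conditioned inpainting as a visual-aid generator.
# NOT the VisualChef code; model choice, mask source and prompt are assumed.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # assumed off-the-shelf inpainting model
    torch_dtype=torch.float16,
).to("cuda")

frame = Image.open("initial_frame.png").convert("RGB")    # hypothetical initial frame
mask = Image.open("action_object_mask.png").convert("L")  # mask over action-relevant objects

# Only the masked (white) region is regenerated; the environment in the
# unmasked part of the initial frame is preserved.
result = pipe(
    prompt="chop the onion on the cutting board",  # the specified cooking action
    image=frame,
    mask_image=mask,
).images[0]
result.save("action_frame.png")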

MCML Authors
Link to Profile Xi Wang

Xi Wang

Dr.

Computer Vision & Artificial Intelligence


[3]
C. Koke, D. Schnaus, Y. Shen, A. Saroha, M. Eisenberger, B. Rieck, M. M. Bronstein and D. Cremers.
On multi-scale Graph Representation Learning.
LMRL @ICLR 2025 - Workshop on Learning Meaningful Representations of Life at the 13th International Conference on Learning Representations (ICLR 2025). Singapore, Apr 24-28, 2025. To be published. Preprint available. URL
Abstract

While Graph Neural Networks (GNNs) are widely used in modern computational biology, an underexplored drawback of common GNN methods is that they are not inherently multiscale-consistent: Two graphs describing the same object or situation at different resolution scales are assigned vastly different latent representations. This prevents graph networks from generating data representations that are consistent across scales. It also complicates the integration of representations at the molecular scale with those generated at the biological scale. Here we discuss why existing GNNs struggle with multiscale consistency and show how to overcome this problem by modifying the message passing paradigm within GNNs.
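For context, the snippet below sketches the standard message-passing update that the abstract refers to: each node aggregates messages from its neighbours and updates its own state. Because this aggregation depends directly on the node and edge sets, two graphs describing the same object at different resolutions generally receive different representations, which is the inconsistency the paper addresses. The layer sizes and sum aggregation are assumptions; this shows the baseline paradigm, not the proposed modification.

# Minimal sketch of a standard message-passing layer (baseline paradigm only).
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)  # message from (sender, receiver) features
        self.update = nn.Linear(2 * dim, dim)   # node update from (old state, aggregate)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, dim] node features; edge_index: [2, num_edges] as (src, dst)
        src, dst = edge_index
        msgs = self.message(torch.cat([x[src], x[dst]], dim=-1))
        # Sum-aggregate incoming messages per destination node.
        agg = torch.zeros_like(x).index_add_(0, dst, msgs)
        return torch.relu(self.update(torch.cat([x, agg], dim=-1)))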

MCML Authors
Link to website

Christian Koke

Computer Vision & Artificial Intelligence

Link to website

Dominik Schnaus

Computer Vision & Artificial Intelligence

Yuesong Shen

Dr.

* Former Member

Link to website

Abhishek Saroha

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[2]
N. P. A. Vu, A. Saroha, O. Litany and D. Cremers.
GAS-NeRF: Geometry-Aware Stylization of Dynamic Radiance Fields.
Preprint (Mar. 2025). arXiv
Abstract

Current 3D stylization techniques primarily focus on static scenes, while our world is inherently dynamic, filled with moving objects and changing environments. Existing style transfer methods primarily target appearance – such as color and texture transformation – but often neglect the geometric characteristics of the style image, which are crucial for achieving a complete and coherent stylization effect. To overcome these shortcomings, we propose GAS-NeRF, a novel approach for joint appearance and geometry stylization in dynamic Radiance Fields. Our method leverages depth maps to extract and transfer geometric details into the radiance field, followed by appearance transfer. Experimental results on synthetic and real-world datasets demonstrate that our approach significantly enhances the stylization quality while maintaining temporal coherence in dynamic scenes.
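A minimal sketch of what a joint geometry-and-appearance stylization objective could look like is given below: a depth term carries the style's geometric detail, while a feature-statistics term carries its appearance. The specific losses, weights, and Gram-matrix style term are illustrative assumptions, not the GAS-NeRF formulation.

# Minimal sketch of a joint geometry + appearance stylization loss (assumptions only).
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: [C, H, W] feature map -> [C, C] channel correlations
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.t() / (c * h * w)

def stylization_loss(rendered_rgb_feat, style_rgb_feat,
                     rendered_depth, style_depth,
                     w_geo: float = 1.0, w_app: float = 1.0) -> torch.Tensor:
    # Geometry term: match the rendered depth to the geometry-stylized target depth.
    geo = torch.nn.functional.l1_loss(rendered_depth, style_depth)
    # Appearance term: match channel statistics of rendered vs. style features.
    app = torch.nn.functional.mse_loss(gram_matrix(rendered_rgb_feat),
                                       gram_matrix(style_rgb_feat))
    return w_geo * geo + w_app * app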

MCML Authors
Link to website

Abhishek Saroha

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


2024


[1]
L. Sang, M. Gao, A. Saroha and D. Cremers.
Enhancing Surface Neural Implicits with Curvature-Guided Sampling and Uncertainty-Augmented Representations.
Wild3D @ECCV 2024 - Workshop 3D Modeling, Reconstruction, and Generation in the Wild at the 18th European Conference on Computer Vision (ECCV 2024). Milano, Italy, Sep 29-Oct 04, 2024. URL
Abstract

Neural implicits are a widely used surface representation because they offer an adaptive resolution and support arbitrary topology changes. While previous works rely on ground truth point clouds or meshes, they often do not discuss the data acquisition and ignore the effect of input quality and sampling methods during reconstruction. In this paper, we introduce an uncertainty-augmented surface implicit representation together with a sampling technique that considers the geometric characteristics of the inputs. To this end, we introduce a strategy that efficiently computes differentiable geometric features, namely mean curvatures, to guide the sampling phase during training. The uncertainty augmentation offers insights into the occupancy and reliability of the output signed distance value, thereby expanding representation capabilities to open surfaces. Finally, we demonstrate that our method improves the reconstruction of both synthetic and real-world data.
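As a simple illustration of curvature-guided sampling, the sketch below draws surface points with probability proportional to their absolute mean curvature, so highly curved, detail-rich regions are sampled more densely during training. The precomputed curvature input and the multinomial sampling scheme are assumptions for illustration, not the paper's pipeline.

# Minimal sketch of curvature-weighted point sampling (illustrative assumptions).
import torch

def curvature_guided_sample(points: torch.Tensor,
                            mean_curvature: torch.Tensor,
                            num_samples: int,
                            eps: float = 1e-6) -> torch.Tensor:
    # points: [N, 3] surface points; mean_curvature: [N] per-point mean curvature
    weights = mean_curvature.abs() + eps   # avoid zero probability in flat regions
    probs = weights / weights.sum()
    idx = torch.multinomial(probs, num_samples, replacement=True)
    return points[idx]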

MCML Authors
Link to website

Lu Sang

Computer Vision & Artificial Intelligence

Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to website

Abhishek Saroha

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence