
Research Group Stefan Leutenegger



Stefan Leutenegger

Prof. Dr.

Principal Investigator

Machine Learning for Robotics

Stefan Leutenegger is Assistant Professor of Machine Learning for Robotics at TU Munich.

His research focuses on mobile robotics, in particular robot navigation through potentially unknown environments. He develops algorithms and software that allow a robot (e.g. a drone) to use its sensors (e.g. cameras) to reconstruct the 3D structure of its surroundings and to categorise it with the help of modern Machine Learning (including Deep Learning). This understanding enables safe navigation through challenging environments as well as interaction with them, including with humans.

Team members @MCML

PhD Students


Yannick Burkhardt

Machine Learning for Robotics


Hanzhi Chen

Machine Learning for Robotics


Simon Schaefer

Machine Learning for Robotics

Publications @MCML

2025


[6]
Y. Burkhardt, S. Schaefer and S. Leutenegger.
SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection.
Preprint (Apr. 2025). arXiv GitHub
Abstract

Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin.
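
As a rough illustration of the self-supervision idea above, the minimal sketch below (not the authors' code) runs an off-the-shelf frame-based detector on an event-aligned grayscale frame to obtain sparse keypoint pseudo-labels, and accumulates events into a simple polarity voxel grid as a stand-in for an event representation. The detector choice (ORB), grid shape, and the random toy data are illustrative assumptions; the paper's representation and labelling scheme differ in detail.

```python
# Minimal sketch, not the authors' code: sparse keypoint pseudo-labels from an
# event-aligned grayscale frame plus a simple polarity voxel-grid event
# representation. Detector choice (ORB) and the toy data are assumptions.
import numpy as np
import cv2

def keypoint_pseudo_labels(gray_frame, max_kp=200):
    """Detect keypoints on a grayscale frame that is time-synchronized with the
    event stream; the detections act as sparse pseudo-labels."""
    orb = cv2.ORB_create(nfeatures=max_kp)
    kps = orb.detect(gray_frame, None)
    return np.array([kp.pt for kp in kps], dtype=np.float32)  # (N, 2) pixel coords

def event_voxel_grid(events, height, width, bins=5):
    """Accumulate events (t, x, y, polarity) into a (bins, H, W) grid, one
    temporal slice per bin, signed by polarity."""
    t, x, y, p = events[:, 0], events[:, 1].astype(int), events[:, 2].astype(int), events[:, 3]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    b = np.clip((t_norm * bins).astype(int), 0, bins - 1)
    grid = np.zeros((bins, height, width), dtype=np.float32)
    np.add.at(grid, (b, y, x), np.where(p > 0, 1.0, -1.0))
    return grid

# Toy usage with random data, just to show the shapes involved.
frame = np.random.randint(0, 255, (180, 240), dtype=np.uint8)
events = np.column_stack([np.sort(np.random.rand(1000)),      # timestamps
                          np.random.randint(0, 240, 1000),    # x
                          np.random.randint(0, 180, 1000),    # y
                          np.random.choice([-1, 1], 1000)])   # polarity
labels = keypoint_pseudo_labels(frame)
grid = event_voxel_grid(events, 180, 240)
```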

MCML Authors

Yannick Burkhardt

Machine Learning for Robotics


Simon Schaefer

Machine Learning for Robotics


Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


2024


[5]
S. Papatheodorou, S. Boche, S. Laina and S. Leutenegger.
Efficient Submap-based Autonomous MAV Exploration using Visual-Inertial SLAM Configurable for LiDARs or Depth Cameras.
Preprint (Sep. 2024). arXiv
Abstract

Autonomous exploration of unknown space is an essential component for the deployment of mobile robots in the real world. Safe navigation is crucial for all robotics applications and requires accurate and consistent maps of the robot’s surroundings. To achieve full autonomy and allow deployment in a wide variety of environments, the robot must rely on on-board state estimation which is prone to drift over time. We propose a Micro Aerial Vehicle (MAV) exploration framework based on local submaps to allow retaining global consistency by applying loop-closure corrections to the relative submap poses. To enable large-scale exploration we efficiently compute global, environment-wide frontiers from the local submap frontiers and use a sampling-based next-best-view exploration planner. Our method seamlessly supports using either a LiDAR sensor or a depth camera, making it suitable for different kinds of MAV platforms. We perform comparative evaluations in simulation against a state-of-the-art submap-based exploration framework to showcase the efficiency and reconstruction quality of our approach. Finally, we demonstrate the applicability of our method to real-world MAVs, one equipped with a LiDAR and the other with a depth camera.
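
The frontier computation mentioned above can be illustrated on a single 2D occupancy grid; the sketch below is an assumption-laden toy, not the paper's submap pipeline. Cell codes (0 = free, 1 = occupied, -1 = unknown) and the 4-neighbourhood are illustrative choices.

```python
# Minimal sketch, not the paper's submap pipeline: frontier extraction on a
# single 2D occupancy grid with assumed cell codes.
import numpy as np

def frontier_cells(grid):
    """Return indices of free cells that border at least one unknown cell."""
    free = grid == 0
    unknown = grid == -1
    near_unknown = np.zeros_like(unknown)
    near_unknown[1:, :]  |= unknown[:-1, :]   # shift the unknown mask down
    near_unknown[:-1, :] |= unknown[1:, :]    # ... up
    near_unknown[:, 1:]  |= unknown[:, :-1]   # ... right
    near_unknown[:, :-1] |= unknown[:, 1:]    # ... left
    return np.argwhere(free & near_unknown)

# Toy map: mostly unknown, with an explored free corridor and one obstacle.
grid = -np.ones((20, 20), dtype=int)
grid[8:12, 0:10] = 0
grid[10, 5] = 1
print(frontier_cells(grid))
```

In the paper, such frontiers are computed per local submap and merged into global, environment-wide frontiers that feed the sampling-based next-best-view planner.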

MCML Authors

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[4]
J. Naumann, B. Xu, S. Leutenegger and X. Zuo.
NeRF-VO: Real-Time Sparse Visual Odometry With Neural Radiance Fields.
IEEE Robotics and Automation Letters 9.8 (Aug. 2024). DOI
Abstract

We introduce a novel monocular visual odometry (VO) system, NeRF-VO, that integrates learning-based sparse visual odometry for low-latency camera tracking and a neural radiance scene representation for fine-detailed dense reconstruction and novel view synthesis. Our system initializes camera poses using sparse visual odometry and obtains view-dependent dense geometry priors from a monocular prediction network. We harmonize the scale of poses and dense geometry, treating them as supervisory cues to train a neural implicit scene representation. NeRF-VO demonstrates exceptional performance in both photometric and geometric fidelity of the scene representation by jointly optimizing a sliding window of keyframed poses and the underlying dense geometry, which is accomplished through training the radiance field with volume rendering. We surpass SOTA methods in pose estimation accuracy, novel view synthesis fidelity, and dense reconstruction quality across a variety of synthetic and real-world datasets while achieving a higher camera tracking frequency and consuming less GPU memory.
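
One plausible reading of the scale harmonization step above is a least-squares scale-and-shift fit of the monocular dense-depth prior to the sparse VO depths; the sketch below shows that standard fit and is not taken from the paper.

```python
# Minimal sketch of a standard least-squares scale-and-shift fit, assuming this
# is roughly what harmonizing the dense prior with sparse VO depths amounts to.
import numpy as np

def align_scale_shift(pred_depth, sparse_depth, mask):
    """Solve min_{s,t} || s * pred + t - sparse ||^2 over the valid sparse pixels."""
    p = pred_depth[mask]
    d = sparse_depth[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)        # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s * pred_depth + t, s, t

# Toy usage: a depth prediction that is off by an unknown scale and offset.
true_depth = np.random.uniform(1.0, 5.0, (48, 64))
pred = 0.4 * true_depth - 0.1                         # wrong scale and shift
mask = np.random.rand(48, 64) < 0.02                  # sparse "VO" depth samples
aligned, s, t = align_scale_shift(pred, true_depth, mask)
```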

MCML Authors

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


2023


[3]
Y. Xin, X. Zuo, D. Lu and S. Leutenegger.
SimpleMapping: Real-time visual-inertial dense mapping with deep multi-view stereo.
ISMAR 2023 - IEEE/ACM International Symposium on Mixed and Augmented Reality. Sydney, Australia, Oct 16-20, 2023. DOI
Abstract

We present a real-time visual-inertial dense mapping method capable of performing incremental 3D mesh reconstruction with high quality using only sequential monocular images and inertial measurement unit (IMU) readings. 6-DoF camera poses are estimated by a robust feature-based visual-inertial odometry (VIO), which also generates noisy sparse 3D map points as a by-product. We propose a sparse point aided multi-view stereo neural network (SPA-MVSNet) that can effectively leverage the informative but noisy sparse points from the VIO system. The sparse depth from VIO is firstly completed by a single-view depth completion network. This dense depth map, although naturally limited in accuracy, is then used as a prior to guide our MVS network in the cost volume generation and regularization for accurate dense depth prediction. Predicted depth maps of keyframe images by the MVS network are incrementally fused into a global map using TSDF-Fusion. We extensively evaluate both the proposed SPA-MVSNet and the entire dense mapping system on several public datasets as well as our own dataset, demonstrating the system’s impressive generalization capabilities and its ability to deliver high-quality 3D reconstruction online. Our proposed dense mapping system achieves a 39.7% improvement in F-score over existing systems when evaluated on the challenging scenarios of the EuRoC dataset.
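
The final TSDF-Fusion step mentioned above can be sketched as a weighted running average of truncated signed distances per voxel. The toy below uses a dense NumPy grid and assumed pinhole intrinsics, truncation distance, and grid layout; it is not the paper's implementation.

```python
# Minimal sketch, under assumed intrinsics, grid extents, and truncation
# distance: fusing one predicted keyframe depth map into a dense TSDF volume
# with a weighted running average.
import numpy as np

def integrate_depth(tsdf, weights, origin, voxel_size, trunc, depth, K, T_wc):
    """Update TSDF and weight volumes with a depth map taken at camera pose T_wc."""
    dims = tsdf.shape
    # World coordinates of all voxel centres (grid axes aligned with world axes).
    idx = np.stack(np.meshgrid(*(np.arange(n) for n in dims),
                               indexing="ij"), axis=-1).reshape(-1, 3)
    pts_w = origin + (idx + 0.5) * voxel_size
    # Transform into the camera frame and project with a pinhole model.
    T_cw = np.linalg.inv(T_wc)
    pts_c = (T_cw[:3, :3] @ pts_w.T + T_cw[:3, 3:4]).T
    z = pts_c[:, 2]
    z_safe = np.maximum(z, 1e-6)                      # avoid division by zero
    u = np.round(K[0, 0] * pts_c[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[:, 1] / z_safe + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.where(valid, depth[np.clip(v, 0, h - 1), np.clip(u, 0, w - 1)], 0.0)
    sdf = d - z                                       # signed distance along the ray
    keep = valid & (d > 0) & (sdf > -trunc)
    new_val = np.clip(sdf / trunc, -1.0, 1.0)
    flat_t, flat_w = tsdf.reshape(-1), weights.reshape(-1)
    flat_t[keep] = (flat_w[keep] * flat_t[keep] + new_val[keep]) / (flat_w[keep] + 1.0)
    flat_w[keep] += 1.0
```

A dense grid is used here only for readability; practical systems use optimized (e.g. hashed) voxel structures, and a marching-cubes pass over the fused volume then yields the incremental mesh the abstract describes.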

MCML Authors

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[2]
X. Zuo, N. Yang, N. Merrill, B. Xu and S. Leutenegger.
Incremental Dense Reconstruction from Monocular Video with Guided Sparse Feature Volume Fusion.
IEEE Robotics and Automation Letters 8.6 (Jun. 2023). DOI
Abstract

Incrementally recovering 3D dense structures from monocular videos is of paramount importance since it enables various robotics and AR applications. Feature volumes have recently been shown to enable efficient and accurate incremental dense reconstruction without the need to first estimate depth, but they are not able to achieve as high of a resolution as depth-based methods due to the large memory consumption of high-resolution feature volumes. This letter proposes a real-time feature volume-based dense reconstruction method that predicts TSDF (Truncated Signed Distance Function) values from a novel sparsified deep feature volume, which is able to achieve higher resolutions than previous feature volume-based methods, and is favorable in outdoor large-scale scenarios where the majority of voxels are empty. An uncertainty-aware multi-view stereo (MVS) network is leveraged to infer initial voxel locations of the physical surface in a sparse feature volume. Then for refining the recovered 3D geometry, deep features are attentively aggregated from multi-view images at potential surface locations, and temporally fused. Besides achieving higher resolutions than before, our method is shown to produce more complete reconstructions with finer detail in many cases. Extensive evaluations on both public and self-collected datasets demonstrate a very competitive real-time reconstruction result for our method compared to state-of-the-art reconstruction methods in both indoor and outdoor settings.
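
The sparsification idea above, keeping feature voxels only near surfaces predicted by the MVS network, can be illustrated by back-projecting a predicted depth map and collecting the voxels it touches within a thin band. The sketch below is a toy with assumed intrinsics, voxel size, and band width, not the paper's method.

```python
# Minimal sketch, with assumed intrinsics, voxel size, and band width: collect
# the voxels touched by an initial depth prediction within a thin band around
# the surface, in the spirit of the sparsified feature volume described above.
import numpy as np

def surface_voxels(depth, K, T_wc, voxel_size=0.05, band=0.10):
    """Back-project pixels at depth +/- band and return unique voxel indices."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    rays = np.stack([(u - K[0, 2]) / K[0, 0],
                     (v - K[1, 2]) / K[1, 1],
                     np.ones_like(depth)], axis=-1)[valid]     # (N, 3) unit-z rays
    voxels = []
    for offset in (-band, 0.0, band):                          # sample a thin band
        pts_c = rays * (depth[valid][:, None] + offset)
        pts_w = (T_wc[:3, :3] @ pts_c.T + T_wc[:3, 3:4]).T
        voxels.append(np.floor(pts_w / voxel_size).astype(int))
    return np.unique(np.concatenate(voxels, axis=0), axis=0)   # sparse voxel set
```

Deep image features would then be aggregated and fused only at these sparse locations, which is what keeps memory low enough to afford higher resolutions than dense feature volumes.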

MCML Authors

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[1]
L. Sang, B. Häfner, X. Zuo and D. Cremers.
High-Quality RGB-D Reconstruction via Multi-View Uncalibrated Photometric Stereo and Gradient-SDF.
WACV 2023 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 03-07, 2023. DOI
Abstract

Fine-detailed reconstructions are in high demand in many applications. However, most of the existing RGB-D reconstruction methods rely on pre-calculated accurate camera poses to recover the detailed surface geometry, where the representation of a surface needs to be adapted when optimizing different quantities. In this paper, we present a novel multi-view RGB-D based reconstruction method that tackles camera pose, lighting, albedo, and surface normal estimation via the utilization of a gradient signed distance field (gradient-SDF). The proposed method formulates the image rendering process using specific physically-based model(s) and optimizes the surface’s quantities on the actual surface using its volumetric representation, as opposed to other works which estimate surface quantities only near the actual surface. To validate our method, we investigate two physically-based image formation models for natural light and point light source applications. The experimental results on synthetic and real-world datasets demonstrate that the proposed method can recover high-quality geometry of the surface more faithfully than the state-of-the-art and further improve the accuracy of estimated camera poses.
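
As a small illustration of the point-light image formation model the abstract refers to, the sketch below evaluates a Lambertian shading term with inverse-square falloff, I = albedo · max(0, ⟨n, l⟩) / r². The co-located light and all variable names are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of a Lambertian point-light image formation model of the kind
# the abstract mentions, I = albedo * max(0, <n, l>) / r^2; the co-located
# light is an illustrative assumption.
import numpy as np

def render_point_light(points, normals, albedo, light_pos, intensity=1.0):
    """Shade surface points under a single point light with inverse-square falloff."""
    to_light = light_pos - points                      # (N, 3) vectors to the light
    r2 = np.sum(to_light ** 2, axis=1)                 # squared distances
    l = to_light / np.sqrt(r2)[:, None]                # unit light directions
    n_dot_l = np.clip(np.sum(normals * l, axis=1), 0.0, None)
    return intensity * albedo * n_dot_l / r2           # per-point irradiance

# Toy usage: two points on a plane facing a light at the camera centre (origin).
pts = np.array([[0.0, 0.0, 1.0], [0.2, 0.0, 1.5]])
nrm = np.array([[0.0, 0.0, -1.0], [0.0, 0.0, -1.0]])
print(render_point_light(pts, nrm, albedo=0.8, light_pos=np.zeros(3)))
```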

MCML Authors

Lu Sang

Computer Vision & Artificial Intelligence

Björn Häfner

* Former Member


Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence