
Research Group Stefan Leutenegger



Stefan Leutenegger

Prof. Dr.

Principal Investigator

Machine Learning for Robotics

Stefan Leutenegger is Assistant Professor of Machine Learning for Robotics at TU Munich.

His research focuses on mobile robotics, in particular robot navigation through potentially unknown environments. He develops algorithms and software that allow a robot (e.g. a drone) to use its sensors (e.g. cameras) to reconstruct the 3D structure of its surroundings and to categorise it with the help of modern Machine Learning (including Deep Learning). This understanding enables safe navigation through challenging environments as well as interaction with them, including with humans.

Team members @MCML

PhD Students


Yannick Burkhardt

Machine Learning for Robotics


Hanzhi Chen

Machine Learning for Robotics


Simon Schaefer

Machine Learning for Robotics

Publications @MCML

2025


[6]
Y. Burkhardt, S. Schaefer and S. Leutenegger.
SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection.
Preprint (Apr. 2025). arXiv GitHub
Abstract

Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin.
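
As a rough illustration of the self-supervision idea above, the minimal sketch below (not the authors' code) runs an off-the-shelf frame-based detector on an event-aligned grayscale frame to obtain sparse keypoint pseudo-labels, and accumulates events into a simple polarity voxel grid as a stand-in for an event representation. The detector choice (ORB), grid shape, and the random toy data are illustrative assumptions; the paper's representation and labelling scheme differ in detail.

```python
# Minimal sketch, not the authors' code: sparse keypoint pseudo-labels from an
# event-aligned grayscale frame plus a simple polarity voxel-grid event
# representation. Detector choice (ORB) and the toy data are assumptions.
import numpy as np
import cv2

def keypoint_pseudo_labels(gray_frame, max_kp=200):
    """Detect keypoints on a grayscale frame that is time-synchronized with the
    event stream; the detections act as sparse pseudo-labels."""
    orb = cv2.ORB_create(nfeatures=max_kp)
    kps = orb.detect(gray_frame, None)
    return np.array([kp.pt for kp in kps], dtype=np.float32)  # (N, 2) pixel coords

def event_voxel_grid(events, height, width, bins=5):
    """Accumulate events (t, x, y, polarity) into a (bins, H, W) grid, one
    temporal slice per bin, signed by polarity."""
    t, x, y, p = events[:, 0], events[:, 1].astype(int), events[:, 2].astype(int), events[:, 3]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    b = np.clip((t_norm * bins).astype(int), 0, bins - 1)
    grid = np.zeros((bins, height, width), dtype=np.float32)
    np.add.at(grid, (b, y, x), np.where(p > 0, 1.0, -1.0))
    return grid

# Toy usage with random data, just to show the shapes involved.
frame = np.random.randint(0, 255, (180, 240), dtype=np.uint8)
events = np.column_stack([np.sort(np.random.rand(1000)),      # timestamps
                          np.random.randint(0, 240, 1000),    # x
                          np.random.randint(0, 180, 1000),    # y
                          np.random.choice([-1, 1], 1000)])   # polarity
labels = keypoint_pseudo_labels(frame)
grid = event_voxel_grid(events, 180, 240)
```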

MCML Authors

Yannick Burkhardt

Machine Learning for Robotics


Simon Schaefer

Machine Learning for Robotics


Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


2024


[5]
S. Papatheodorou, S. Boche, S. Laina and S. Leutenegger.
Efficient Submap-based Autonomous MAV Exploration using Visual-Inertial SLAM Configurable for LiDARs or Depth Cameras.
Preprint (Sep. 2024). arXiv
Abstract

Autonomous exploration of unknown space is an essential component for the deployment of mobile robots in the real world. Safe navigation is crucial for all robotics applications and requires accurate and consistent maps of the robot’s surroundings. To achieve full autonomy and allow deployment in a wide variety of environments, the robot must rely on on-board state estimation which is prone to drift over time. We propose a Micro Aerial Vehicle (MAV) exploration framework based on local submaps to allow retaining global consistency by applying loop-closure corrections to the relative submap poses. To enable large-scale exploration we efficiently compute global, environment-wide frontiers from the local submap frontiers and use a sampling-based next-best-view exploration planner. Our method seamlessly supports using either a LiDAR sensor or a depth camera, making it suitable for different kinds of MAV platforms. We perform comparative evaluations in simulation against a state-of-the-art submap-based exploration framework to showcase the efficiency and reconstruction quality of our approach. Finally, we demonstrate the applicability of our method to real-world MAVs, one equipped with a LiDAR and the other with a depth camera.
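
The frontier computation mentioned above can be illustrated on a single 2D occupancy grid; the sketch below is an assumption-laden toy, not the paper's submap pipeline. Cell codes (0 = free, 1 = occupied, -1 = unknown) and the 4-neighbourhood are illustrative choices.

```python
# Minimal sketch, not the paper's submap pipeline: frontier extraction on a
# single 2D occupancy grid with assumed cell codes.
import numpy as np

def frontier_cells(grid):
    """Return indices of free cells that border at least one unknown cell."""
    free = grid == 0
    unknown = grid == -1
    near_unknown = np.zeros_like(unknown)
    near_unknown[1:, :]  |= unknown[:-1, :]   # shift the unknown mask down
    near_unknown[:-1, :] |= unknown[1:, :]    # ... up
    near_unknown[:, 1:]  |= unknown[:, :-1]   # ... right
    near_unknown[:, :-1] |= unknown[:, 1:]    # ... left
    return np.argwhere(free & near_unknown)

# Toy map: mostly unknown, with an explored free corridor and one obstacle.
grid = -np.ones((20, 20), dtype=int)
grid[8:12, 0:10] = 0
grid[10, 5] = 1
print(frontier_cells(grid))
```

In the paper, such frontiers are computed per local submap and merged into global, environment-wide frontiers that feed the sampling-based next-best-view planner.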

MCML Authors

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[4]
J. Naumann, B. Xu, S. Leutenegger and X. Zuo.
NeRF-VO: Real-Time Sparse Visual Odometry With Neural Radiance Fields.
IEEE Robotics and Automation Letters 9.8 (Aug. 2024). DOI
Abstract

We introduce a novel monocular visual odometry (VO) system, NeRF-VO, that integrates learning-based sparse visual odometry for low-latency camera tracking and a neural radiance scene representation for fine-detailed dense reconstruction and novel view synthesis. Our system initializes camera poses using sparse visual odometry and obtains view-dependent dense geometry priors from a monocular prediction network. We harmonize the scale of poses and dense geometry, treating them as supervisory cues to train a neural implicit scene representation. NeRF-VO demonstrates exceptional performance in both photometric and geometric fidelity of the scene representation by jointly optimizing a sliding window of keyframed poses and the underlying dense geometry, which is accomplished through training the radiance field with volume rendering. We surpass SOTA methods in pose estimation accuracy, novel view synthesis fidelity, and dense reconstruction quality across a variety of synthetic and real-world datasets while achieving a higher camera tracking frequency and consuming less GPU memory.
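
One plausible reading of the scale harmonization step above is a least-squares scale-and-shift fit of the monocular dense-depth prior to the sparse VO depths; the sketch below shows that standard fit and is not taken from the paper.

```python
# Minimal sketch of a standard least-squares scale-and-shift fit, assuming this
# is roughly what harmonizing the dense prior with sparse VO depths amounts to.
import numpy as np

def align_scale_shift(pred_depth, sparse_depth, mask):
    """Solve min_{s,t} || s * pred + t - sparse ||^2 over the valid sparse pixels."""
    p = pred_depth[mask]
    d = sparse_depth[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)        # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s * pred_depth + t, s, t

# Toy usage: a depth prediction that is off by an unknown scale and offset.
true_depth = np.random.uniform(1.0, 5.0, (48, 64))
pred = 0.4 * true_depth - 0.1                         # wrong scale and shift
mask = np.random.rand(48, 64) < 0.02                  # sparse "VO" depth samples
aligned, s, t = align_scale_shift(pred, true_depth, mask)
```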

MCML Authors

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


2023


[3]
Y. Xin, X. Zuo, D. Lu and S. Leutenegger.
SimpleMapping: Real-time visual-inertial dense mapping with deep multi-view stereo.
ISMAR 2023 - IEEE/ACM International Symposium on Mixed and Augmented Reality. Sydney, Australia, Oct 16-20, 2023. DOI
Abstract

We present a real-time visual-inertial dense mapping method capable of performing incremental 3D mesh reconstruction with high quality using only sequential monocular images and inertial measurement unit (IMU) readings. 6-DoF camera poses are estimated by a robust feature-based visual-inertial odometry (VIO), which also generates noisy sparse 3D map points as a by-product. We propose a sparse point aided multi-view stereo neural network (SPA-MVSNet) that can effectively leverage the informative but noisy sparse points from the VIO system. The sparse depth from VIO is firstly completed by a single-view depth completion network. This dense depth map, although naturally limited in accuracy, is then used as a prior to guide our MVS network in the cost volume generation and regularization for accurate dense depth prediction. Predicted depth maps of keyframe images by the MVS network are incrementally fused into a global map using TSDF-Fusion. We extensively evaluate both the proposed SPA-MVSNet and the entire dense mapping system on several public datasets as well as our own dataset, demonstrating the system’s impressive generalization capabilities and its ability to deliver high-quality 3D reconstruction online. Our proposed dense mapping system achieves a 39.7% improvement in F-score over existing systems when evaluated on the challenging scenarios of the EuRoC dataset.
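
The final TSDF-Fusion step mentioned above can be sketched as a weighted running average of truncated signed distances per voxel. The toy below uses a dense NumPy grid and assumed pinhole intrinsics, truncation distance, and grid layout; it is not the paper's implementation.

```python
# Minimal sketch, under assumed intrinsics, grid extents, and truncation
# distance: fusing one predicted keyframe depth map into a dense TSDF volume
# with a weighted running average.
import numpy as np

def integrate_depth(tsdf, weights, origin, voxel_size, trunc, depth, K, T_wc):
    """Update TSDF and weight volumes with a depth map taken at camera pose T_wc."""
    dims = tsdf.shape
    # World coordinates of all voxel centres (grid axes aligned with world axes).
    idx = np.stack(np.meshgrid(*(np.arange(n) for n in dims),
                               indexing="ij"), axis=-1).reshape(-1, 3)
    pts_w = origin + (idx + 0.5) * voxel_size
    # Transform into the camera frame and project with a pinhole model.
    T_cw = np.linalg.inv(T_wc)
    pts_c = (T_cw[:3, :3] @ pts_w.T + T_cw[:3, 3:4]).T
    z = pts_c[:, 2]
    z_safe = np.maximum(z, 1e-6)                      # avoid division by zero
    u = np.round(K[0, 0] * pts_c[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[:, 1] / z_safe + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.where(valid, depth[np.clip(v, 0, h - 1), np.clip(u, 0, w - 1)], 0.0)
    sdf = d - z                                       # signed distance along the ray
    keep = valid & (d > 0) & (sdf > -trunc)
    new_val = np.clip(sdf / trunc, -1.0, 1.0)
    flat_t, flat_w = tsdf.reshape(-1), weights.reshape(-1)
    flat_t[keep] = (flat_w[keep] * flat_t[keep] + new_val[keep]) / (flat_w[keep] + 1.0)
    flat_w[keep] += 1.0
```

A dense grid is used here only for readability; practical systems use optimized (e.g. hashed) voxel structures, and a marching-cubes pass over the fused volume then yields the incremental mesh the abstract describes.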

MCML Authors

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[2]
X. Zuo, N. Yang, N. Merrill, B. Xu and S. Leutenegger.
Incremental Dense Reconstruction from Monocular Video with Guided Sparse Feature Volume Fusion.
IEEE Robotics and Automation Letters 8.6 (Jun. 2023). DOI
Abstract

Incrementally recovering 3D dense structures from monocular videos is of paramount importance since it enables various robotics and AR applications. Feature volumes have recently been shown to enable efficient and accurate incremental dense reconstruction without the need to first estimate depth, but they are not able to achieve as high of a resolution as depth-based methods due to the large memory consumption of high-resolution feature volumes. This letter proposes a real-time feature volume-based dense reconstruction method that predicts TSDF (Truncated Signed Distance Function) values from a novel sparsified deep feature volume, which is able to achieve higher resolutions than previous feature volume-based methods, and is favorable in outdoor large-scale scenarios where the majority of voxels are empty. An uncertainty-aware multi-view stereo (MVS) network is leveraged to infer initial voxel locations of the physical surface in a sparse feature volume. Then for refining the recovered 3D geometry, deep features are attentively aggregated from multi-view images at potential surface locations, and temporally fused. Besides achieving higher resolutions than before, our method is shown to produce more complete reconstructions with finer detail in many cases. Extensive evaluations on both public and self-collected datasets demonstrate a very competitive real-time reconstruction result for our method compared to state-of-the-art reconstruction methods in both indoor and outdoor settings.
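
The sparsification idea above, keeping feature voxels only near surfaces predicted by the MVS network, can be illustrated by back-projecting a predicted depth map and collecting the voxels it touches within a thin band. The sketch below is a toy with assumed intrinsics, voxel size, and band width, not the paper's method.

```python
# Minimal sketch, with assumed intrinsics, voxel size, and band width: collect
# the voxels touched by an initial depth prediction within a thin band around
# the surface, in the spirit of the sparsified feature volume described above.
import numpy as np

def surface_voxels(depth, K, T_wc, voxel_size=0.05, band=0.10):
    """Back-project pixels at depth +/- band and return unique voxel indices."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    rays = np.stack([(u - K[0, 2]) / K[0, 0],
                     (v - K[1, 2]) / K[1, 1],
                     np.ones_like(depth)], axis=-1)[valid]     # (N, 3) unit-z rays
    voxels = []
    for offset in (-band, 0.0, band):                          # sample a thin band
        pts_c = rays * (depth[valid][:, None] + offset)
        pts_w = (T_wc[:3, :3] @ pts_c.T + T_wc[:3, 3:4]).T
        voxels.append(np.floor(pts_w / voxel_size).astype(int))
    return np.unique(np.concatenate(voxels, axis=0), axis=0)   # sparse voxel set
```

Deep image features would then be aggregated and fused only at these sparse locations, which is what keeps memory low enough to afford higher resolutions than dense feature volumes.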

MCML Authors

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[1]
L. Sang, B. Häfner, X. Zuo and D. Cremers.
High-Quality RGB-D Reconstruction via Multi-View Uncalibrated Photometric Stereo and Gradient-SDF.
WACV 2023 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 03-07, 2023. DOI
Abstract

Fine-detailed reconstructions are in high demand in many applications. However, most of the existing RGB-D reconstruction methods rely on pre-calculated accurate camera poses to recover the detailed surface geometry, where the representation of a surface needs to be adapted when optimizing different quantities. In this paper, we present a novel multi-view RGB-D based reconstruction method that tackles camera pose, lighting, albedo, and surface normal estimation via the utilization of a gradient signed distance field (gradient-SDF). The proposed method formulates the image rendering process using specific physically-based model(s) and optimizes the surface’s quantities on the actual surface using its volumetric representation, as opposed to other works which estimate surface quantities only near the actual surface. To validate our method, we investigate two physically-based image formation models for natural light and point light source applications. The experimental results on synthetic and real-world datasets demonstrate that the proposed method can recover high-quality geometry of the surface more faithfully than the state-of-the-art and further improve the accuracy of estimated camera poses.
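
As a small illustration of the point-light image formation model the abstract refers to, the sketch below evaluates a Lambertian shading term with inverse-square falloff, I = albedo · max(0, ⟨n, l⟩) / r². The co-located light and all variable names are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of a Lambertian point-light image formation model of the kind
# the abstract mentions, I = albedo * max(0, <n, l>) / r^2; the co-located
# light is an illustrative assumption.
import numpy as np

def render_point_light(points, normals, albedo, light_pos, intensity=1.0):
    """Shade surface points under a single point light with inverse-square falloff."""
    to_light = light_pos - points                      # (N, 3) vectors to the light
    r2 = np.sum(to_light ** 2, axis=1)                 # squared distances
    l = to_light / np.sqrt(r2)[:, None]                # unit light directions
    n_dot_l = np.clip(np.sum(normals * l, axis=1), 0.0, None)
    return intensity * albedo * n_dot_l / r2           # per-point irradiance

# Toy usage: two points on a plane facing a light at the camera centre (origin).
pts = np.array([[0.0, 0.0, 1.0], [0.2, 0.0, 1.5]])
nrm = np.array([[0.0, 0.0, -1.0], [0.0, 0.0, -1.0]])
print(render_point_light(pts, nrm, albedo=0.8, light_pos=np.zeros(3)))
```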

MCML Authors

Lu Sang

Computer Vision & Artificial Intelligence

Björn Häfner

* Former Member


Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence