10.04.2025


Text2Loc: A Smarter Way to Navigate With Words

MCML Research Insight - With Yan Xia, Zifeng Ding and Daniel Cremers

Imagine standing in an unfamiliar part of a city, no GPS in sight. All you can say is, "I’m west of a green building, near a black garage." That might be vague to a machine, but Text2Loc understands you perfectly. With this powerful new system, AI can find your exact location in a 3D map - just from how you describe the world around you.

«3D localization using natural language descriptions in a city-scale map is crucial for enabling autonomous agents to cooperate with humans to plan their trajectories in applications such as goods delivery or vehicle pickup.»


Yan Xia et al.

MCML Junior Members

The Challenge of 3D Localization

Navigating a city without GPS, relying only on descriptions like “I’m near a black pole, west of a gray-green road,” is easy for humans. For AI systems, however - autonomous robots and self-driving cars among them - it is hard. These systems rely on 3D point clouds: detailed digital maps of the environment made up of millions of tiny points, captured with sensors like LiDAR that scan the surroundings to build a 3D representation of objects, roads, and buildings.
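
To make “point cloud” and “submap” concrete, here is a minimal Python sketch. The random points, the crop_submap helper, and the window size are illustrative assumptions, not values from the paper; the point is simply that a point cloud is a large array of 3D coordinates, and that a city-scale map can be tiled into local submaps.

```python
import numpy as np

# A LiDAR point cloud is essentially a large array of 3D coordinates,
# one row per point (x, y, z); real clouds also carry channels such as
# intensity or semantic labels. We fabricate a small cloud for illustration.
rng = np.random.default_rng(0)
points = rng.uniform(low=-50.0, high=50.0, size=(100_000, 3))  # (N, 3), meters

def crop_submap(cloud: np.ndarray, center_xy: np.ndarray,
                half_size: float = 15.0) -> np.ndarray:
    """Return all points whose x/y coordinates fall in a square window."""
    offset = np.abs(cloud[:, :2] - center_xy)
    mask = (offset[:, 0] <= half_size) & (offset[:, 1] <= half_size)
    return cloud[mask]

# City-scale maps are typically split into such local "submaps".
submap = crop_submap(points, center_xy=np.array([10.0, -5.0]))
print(submap.shape)  # (M, 3): the points inside one submap
```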

Traditional methods try to match each phrase in a user’s description (like “black pole” or “gray road”) to a specific object in the environment, an approach called text-instance matching. That is slow, unreliable, and not very human-like. Worse, when the description is a bit vague - or there are multiple similar objects - these systems fail to pinpoint the exact spot, a failure often referred to as the “last mile problem”.

The recent paper “Text2Loc: 3D Point Cloud Localization from Natural Language”, developed by our MCML Junior Members Yan Xia and Zifeng Ding, our PI Daniel Cremers, and collaborators Letian Shi and Joao F. Henriques, takes a different, hierarchical approach: it first retrieves relevant submaps and then refines the location within them. This speeds up localization and improves accuracy, making it a major step forward for text-based navigation and for tackling the “last mile problem”.


How Text2Loc Works

The proposed Text2Loc architecture.

The proposed Text2Loc architecture consists of two tandem modules:

  • Global place recognition: Given a text-based position description, Text2Loc first identifies a set of coarse candidate locations, “submaps”, that potentially contain the target position. It does so by retrieving the top-k nearest submaps from a previously constructed submap database with a novel text-to-submap retrieval model.
  • Fine localization: Text2Loc then refines the center coordinates of the retrieved submaps with a matching-free position estimation module, which adjusts the predicted target location to increase accuracy (see the sketch below).
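
To make the two stages concrete, here is a minimal sketch of the pipeline in Python. The precomputed embeddings, the cosine-similarity ranking, and the refine_fn callback are illustrative assumptions; in Text2Loc both stages are learned neural networks.

```python
import numpy as np

def localize(text_embedding: np.ndarray,
             submap_embeddings: np.ndarray,  # (num_submaps, d) database
             submap_centers: np.ndarray,     # (num_submaps, 2) x/y centers
             refine_fn,                      # fine stage: (text_emb, submap_id) -> (dx, dy)
             k: int = 3) -> np.ndarray:
    # Stage 1 - global place recognition: rank all submaps by cosine
    # similarity between the text embedding and the submap embeddings.
    sims = submap_embeddings @ text_embedding
    sims /= np.linalg.norm(submap_embeddings, axis=1) * np.linalg.norm(text_embedding)
    top_k = np.argsort(-sims)[:k]

    # Stage 2 - fine localization: shift the best candidate's center by a
    # predicted offset instead of matching individual words to objects.
    best = top_k[0]
    return submap_centers[best] + refine_fn(text_embedding, best)
```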

«Extensive experiments demonstrate that Text2Loc improves the localization performance over the state-of-the-art by a large margin.»


Yan Xia et al.

MCML Junior Members

Text Descriptions as Input

Instead of relying on precise coordinates, Text2Loc understands natural language descriptions such as:

  • “The pose is on top of a gray road.”
  • “The pose is west of a black vegetation.”

These descriptions allow for intuitive, human-like localization, making it possible for systems to process spatial information the way people naturally describe their surroundings.
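
As the examples show, hints of this kind follow a fairly regular template. Purely as a toy illustration - Text2Loc itself feeds the raw sentences to a neural text encoder rather than a regex - one could decompose a hint into a (direction, color, category) triple as below; the Hint type and the pattern are hypothetical.

```python
import re
from typing import NamedTuple

class Hint(NamedTuple):
    direction: str  # spatial relation, e.g. "west of", "on top of"
    color: str      # object color, e.g. "gray", "black"
    category: str   # object category, e.g. "road", "vegetation"

# Hypothetical pattern for the templated hints quoted above.
PATTERN = re.compile(
    r"The pose is (?P<direction>[\w\s]+?) (?:a|the) (?P<color>\w+) (?P<category>\w+)\."
)

def parse_hint(sentence: str) -> Hint:
    m = PATTERN.match(sentence)
    if m is None:
        raise ValueError(f"unrecognized hint: {sentence!r}")
    return Hint(m["direction"].strip(), m["color"], m["category"])

print(parse_hint("The pose is west of a black vegetation."))
# Hint(direction='west of', color='black', category='vegetation')
```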

Global Place Recognition (Text-to-Submap Retrieval)

Rather than searching for a single matching object, Text2Loc retrieves the most relevant submaps from a large 3D environment. This reduces the complexity of the search while increasing accuracy.
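
The paper trains this retrieval model with text-submap contrastive learning (see the abstract below). A common way to implement such an objective - shown here as a generic, CLIP-style sketch assuming in-batch negatives, not as the paper's exact formulation - is a symmetric cross-entropy over a batch of matched text/submap embedding pairs:

```python
import torch
import torch.nn.functional as F

def text_submap_contrastive_loss(text_emb: torch.Tensor,
                                 submap_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: row i of each (B, d) tensor is assumed
    to describe the same place, so the diagonal pairs are the positives."""
    text_emb = F.normalize(text_emb, dim=-1)
    submap_emb = F.normalize(submap_emb, dim=-1)
    logits = text_emb @ submap_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # All off-diagonal pairings in the batch act as negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Treating every other pair in the batch as a negative keeps positives and negatives balanced without mining them explicitly.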

«We are the first to completely remove the usage of text-instance matcher in the final localization stage.»


Yan Xia et al.

MCML Junior Members

Fine Localization (Instances in Retrieved Submaps)

After retrieving relevant submaps, Text2Loc performs fine localization by estimating the precise position within each submap. Unlike previous methods, it does not use a text-instance matching module to explicitly link words in the description to objects in the environment.

Instead, Text2Loc applies a neural network that operates directly on the spatial and semantic features of the submap and the input text. The model jointly encodes these features using a hierarchical architecture that captures both coarse and fine-grained spatial relationships.

This approach allows the system to predict a location without requiring object-level matching, which simplifies training and reduces inference time. This enables last-mile localization, ensuring that the system can pinpoint a precise location even in dense, real-world environments where GPS might fail.
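
As a rough illustration of what “matching-free” means in code, the sketch below fuses the text and submap embeddings and directly regresses a 2D offset from the retrieved submap’s center. The MLP fusion and the dimensions are placeholder assumptions, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class OffsetRegressor(nn.Module):
    """Toy matching-free head: predict (dx, dy) from fused embeddings,
    with no explicit word-to-object matching step."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 2),  # offset from the submap center, in meters
        )

    def forward(self, text_emb: torch.Tensor,
                submap_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([text_emb, submap_emb], dim=-1))

# Predicted position = retrieved submap center + regressed offset;
# training would minimize, e.g., the L2 error to the ground-truth pose.
```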


Why Text2Loc is a True Innovation

t-SNE visualization for global place recognition. Text2Loc (labeled “Ours”) clearly outperforms Text2Pos, a pioneering work that uses text-instance matching.

Text2Loc introduces a new paradigm in text-based 3D localization. Unlike older approaches that struggle with direct text-object matching, this method simplifies the process while improving precision. Instead of processing every object individually, it retrieves relevant submaps, understands spatial relationships hierarchically, and refines the localization using a novel matching-free position estimation module.

By leveraging natural language and retrieval-based localization, Text2Loc enables AI to navigate the world the way humans do - through words. This makes it incredibly useful for applications such as goods delivery and vehicle pickup, where precise, text-driven localization is essential.


Read More

While this article introduces the key concepts, the original paper offers detailed insights into global place recognition and fine localization. If you want to understand the full methodology behind Text2Loc, including its advantages over traditional approaches, check out the original paper published at CVPR 2024, one of the highest-ranked AI/ML conferences.

Y. Xia, L. Shi, Z. Ding, J. F. Henriques and D. Cremers.
Text2Loc: 3D Point Cloud Localization from Natural Language.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024.
Abstract

We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2× over the state-of-the-art on the KITTI360Pose dataset.

MCML Authors

Yan Xia (Dr.) - Computer Vision & Artificial Intelligence
Zifeng Ding - Database Systems and Data Mining
Daniel Cremers (Prof. Dr.) - Computer Vision & Artificial Intelligence

The authors at CVPR 2024

Poster session at CVPR 2024

