10.04.2025


Text2Loc: A Smarter Way to Navigate With Words

MCML Research Insight - With Yan Xia, Zifeng Ding and Daniel Cremers

Imagine standing in an unfamiliar part of a city, no GPS in sight. All you can say is, "I’m west of a green building, near a black garage." That might be vague to a machine, but Text2Loc understands you perfectly. With this powerful new system, AI can find your exact location in a 3D map - just from how you describe the world around you.

«3D localization using natural language descriptions in a city-scale map is crucial for enabling autonomous agents to cooperate with humans to plan their trajectories in applications such as goods delivery or vehicle pickup.»


Yan Xia et al.

MCML Junior Members

The Challenge of 3D Localization

Navigating a city without GPS, relying only on descriptions like “I’m near a black pole, west of a gray-green road,” is easy for humans. For AI systems, however - autonomous robots and self-driving cars among them - it is hard. These systems rely on 3D point clouds: detailed digital maps of the environment made up of millions of tiny points, captured with sensors like LiDAR that scan the surroundings to build a 3D representation of objects, roads, and buildings.
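
To make “point cloud” and “submap” concrete, here is a minimal Python sketch. The random points, the crop_submap helper, and the window size are illustrative assumptions, not values from the paper; the point is simply that a point cloud is a large array of 3D coordinates, and that a city-scale map can be tiled into local submaps.

```python
import numpy as np

# A LiDAR point cloud is essentially a large array of 3D coordinates,
# one row per point (x, y, z); real clouds also carry channels such as
# intensity or semantic labels. We fabricate a small cloud for illustration.
rng = np.random.default_rng(0)
points = rng.uniform(low=-50.0, high=50.0, size=(100_000, 3))  # (N, 3), meters

def crop_submap(cloud: np.ndarray, center_xy: np.ndarray,
                half_size: float = 15.0) -> np.ndarray:
    """Return all points whose x/y coordinates fall in a square window."""
    offset = np.abs(cloud[:, :2] - center_xy)
    mask = (offset[:, 0] <= half_size) & (offset[:, 1] <= half_size)
    return cloud[mask]

# City-scale maps are typically split into such local "submaps".
submap = crop_submap(points, center_xy=np.array([10.0, -5.0]))
print(submap.shape)  # (M, 3): the points inside one submap
```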

Traditional methods try to match each phrase in a user’s description (like “black pole” or “gray road”) to a specific object in the environment, an approach called text-instance matching. That is slow, unreliable, and not very human-like. Worse, when the description is a bit vague - or there are multiple similar objects - these systems fail to pinpoint the exact spot, a failure often referred to as the “last mile problem”.

The recent paper “Text2Loc: 3D Point Cloud Localization from Natural Language”, developed by our MCML Junior Members Yan Xia and Zifeng Ding, our PI Daniel Cremers, and collaborators Letian Shi and Joao F. Henriques, takes a different, hierarchical approach: it first retrieves relevant submaps and then refines the location within them. This speeds up localization and improves accuracy, making it a major step forward for text-based navigation and for tackling the “last mile problem”.


How Text2Loc Works

The proposed Text2Loc architecture.

The proposed Text2Loc architecture consists of two tandem modules:

  • Global place recognition: Given a text-based position description, Text2Loc first identifies a set of coarse candidate locations, “submaps”, that potentially contain the target position. It does so by retrieving the top-k nearest submaps from a previously constructed submap database with a novel text-to-submap retrieval model.
  • Fine localization: Text2Loc then refines the center coordinates of the retrieved submaps with a matching-free position estimation module, which adjusts the predicted target location to increase accuracy (see the sketch below).
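
To make the two stages concrete, here is a minimal sketch of the pipeline in Python. The precomputed embeddings, the cosine-similarity ranking, and the refine_fn callback are illustrative assumptions; in Text2Loc both stages are learned neural networks.

```python
import numpy as np

def localize(text_embedding: np.ndarray,
             submap_embeddings: np.ndarray,  # (num_submaps, d) database
             submap_centers: np.ndarray,     # (num_submaps, 2) x/y centers
             refine_fn,                      # fine stage: (text_emb, submap_id) -> (dx, dy)
             k: int = 3) -> np.ndarray:
    # Stage 1 - global place recognition: rank all submaps by cosine
    # similarity between the text embedding and the submap embeddings.
    sims = submap_embeddings @ text_embedding
    sims /= np.linalg.norm(submap_embeddings, axis=1) * np.linalg.norm(text_embedding)
    top_k = np.argsort(-sims)[:k]

    # Stage 2 - fine localization: shift the best candidate's center by a
    # predicted offset instead of matching individual words to objects.
    best = top_k[0]
    return submap_centers[best] + refine_fn(text_embedding, best)
```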

«Extensive experiments demonstrate that Text2Loc improves the localization performance over the state-of-the-art by a large margin.»


Yan Xia et al.

MCML Junior Members

Text Descriptions as Input

Instead of relying on precise coordinates, Text2Loc understands natural language descriptions such as:

  • “The pose is on top of a gray road.”
  • “The pose is west of a black vegetation.”

These descriptions allow for intuitive, human-like localization, making it possible for systems to process spatial information the way people naturally describe their surroundings.
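
As the examples show, hints of this kind follow a fairly regular template. Purely as a toy illustration - Text2Loc itself feeds the raw sentences to a neural text encoder rather than a regex - one could decompose a hint into a (direction, color, category) triple as below; the Hint type and the pattern are hypothetical.

```python
import re
from typing import NamedTuple

class Hint(NamedTuple):
    direction: str  # spatial relation, e.g. "west of", "on top of"
    color: str      # object color, e.g. "gray", "black"
    category: str   # object category, e.g. "road", "vegetation"

# Hypothetical pattern for the templated hints quoted above.
PATTERN = re.compile(
    r"The pose is (?P<direction>[\w\s]+?) (?:a|the) (?P<color>\w+) (?P<category>\w+)\."
)

def parse_hint(sentence: str) -> Hint:
    m = PATTERN.match(sentence)
    if m is None:
        raise ValueError(f"unrecognized hint: {sentence!r}")
    return Hint(m["direction"].strip(), m["color"], m["category"])

print(parse_hint("The pose is west of a black vegetation."))
# Hint(direction='west of', color='black', category='vegetation')
```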

Global Place Recognition (Text-to-Submap Retrieval)

Rather than searching for a single matching object, Text2Loc retrieves the most relevant submaps from a large 3D environment. This reduces the complexity of the search while increasing accuracy.
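
The paper trains this retrieval model with text-submap contrastive learning (see the abstract below). A common way to implement such an objective - shown here as a generic, CLIP-style sketch assuming in-batch negatives, not as the paper's exact formulation - is a symmetric cross-entropy over a batch of matched text/submap embedding pairs:

```python
import torch
import torch.nn.functional as F

def text_submap_contrastive_loss(text_emb: torch.Tensor,
                                 submap_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: row i of each (B, d) tensor is assumed
    to describe the same place, so the diagonal pairs are the positives."""
    text_emb = F.normalize(text_emb, dim=-1)
    submap_emb = F.normalize(submap_emb, dim=-1)
    logits = text_emb @ submap_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # All off-diagonal pairings in the batch act as negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Treating every other pair in the batch as a negative keeps positives and negatives balanced without mining them explicitly.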

«We are the first to completely remove the usage of text-instance matcher in the final localization stage.»


Yan Xia et al.

MCML Junior Members

Fine Localization (Instances in Retrieved Submaps)

After retrieving relevant submaps, Text2Loc performs fine localization by estimating the precise position within each submap. Unlike previous methods, it does not use a text-instance matching module to explicitly link words in the description to objects in the environment.

Instead, Text2Loc applies a neural network that operates directly on the spatial and semantic features of the submap and the input text. The model jointly encodes these features using a hierarchical architecture that captures both coarse and fine-grained spatial relationships.

This approach allows the system to predict a location without requiring object-level matching, which simplifies training and reduces inference time. This enables last-mile localization, ensuring that the system can pinpoint a precise location even in dense, real-world environments where GPS might fail.
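
As a rough illustration of what “matching-free” means in code, the sketch below fuses the text and submap embeddings and directly regresses a 2D offset from the retrieved submap’s center. The MLP fusion and the dimensions are placeholder assumptions, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class OffsetRegressor(nn.Module):
    """Toy matching-free head: predict (dx, dy) from fused embeddings,
    with no explicit word-to-object matching step."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 2),  # offset from the submap center, in meters
        )

    def forward(self, text_emb: torch.Tensor,
                submap_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([text_emb, submap_emb], dim=-1))

# Predicted position = retrieved submap center + regressed offset;
# training would minimize, e.g., the L2 error to the ground-truth pose.
```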


Why Text2Loc is a True Innovation

t-SNE visualization for global place recognition. Text2Loc (labeled “Ours”) clearly outperforms Text2Pos, a pioneering work that uses text-instance matching.

Text2Loc introduces a new paradigm in text-based 3D localization. Unlike older approaches that struggle with direct text-object matching, this method simplifies the process while improving precision. Instead of processing every object individually, it retrieves relevant submaps, understands spatial relationships hierarchically, and refines the localization using a novel matching-free position estimation module.

By leveraging natural language and retrieval-based localization, Text2Loc enables AI to navigate the world the way humans do - through words. This makes it incredibly useful for applications such as goods delivery and vehicle pickup, where precise, text-driven localization is essential.


Read More

While this article introduces the key concepts, the original paper offers detailed insights into global place recognition and fine localization. If you want to understand the full methodology behind Text2Loc, including its advantages over traditional approaches, check out the original paper published at CVPR 2024, one of the highest-ranked AI/ML conferences.

Y. Xia, L. Shi, Z. Ding, J. F. Henriques and D. Cremers.
Text2Loc: 3D Point Cloud Localization from Natural Language.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024.
Abstract

We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2× over the state-of-the-art on the KITTI360Pose dataset.

MCML Authors

Yan Xia (Dr.) - Computer Vision & Artificial Intelligence
Zifeng Ding - Database Systems and Data Mining
Daniel Cremers (Prof. Dr.) - Computer Vision & Artificial Intelligence

The authors at CVPR 2024

Poster session at CVPR 2024

