Home | Publications | Her24

Dimensionality and Distance: Curse or Blessing? Geometrical Aspects of Nearest Neighbor Computation in High-Dimensional Data

MCML Authors

Moritz Herrmann

Dr.

Transfer Coordinator

→ Group Anne-Laure Boulesteix
Biometry in Molecular Medicine

Abstract

When it comes to computation, it is often said that high-dimensional data is particularly challenging, known as the curse of dimensionality. For example, in their seminal work, Beyer et al [1] study the impact of high-dimensional data on nearest neighbor computation. They show that in a wide range of settings, including IID data, the difference between the distance to the nearest neighbor and the distance to the most distant neighbor vanishes as the dimension increases. However, it is arguably often overlooked that they also point out that this result does not hold in certain situations, in particular when the intrinsic dimension of the data is low and/or when the data is distributed in well separable subsets. More generally, it is probably less well known that high dimensionality can make computation easier, to the extent that Kainen [2] even speaks of a blessing of dimensionality. Given these different aspects, a natural question to ask is: when is high dimensionality a curse and when is it not (or even a blessing)? In this talk we approach this question from a geometric point of view. Focusing on the aspect of nearest neighbor (and hence distance) computation, we show that high-dimensional data need not be more challenging than low-dimensional data in many practically relevant situations. In particular, using results from extensive experiments on synthetic and real data, we show that this can be the case for both outlier detection and cluster analysis, and for a range of different data types, including image and functional data [3, 4]. Moreover, based on concepts from manifold learning and topological data analysis, we show that these observations can be explained using a common conceptual foundation.

inproceedings Her24

Statistical Computing 2024

54. Arbeitstagung der Arbeitsgruppen Statistical Computing, Klassifikation und Datenanalyse in den Biowissenschaften. Günzburg, Germany, Jul 28-31, 2024.