Given a collection of n points x1, x2, . . . , xn in R^d and an integer k << n, the task of finding the k nearest neighbors of each xi is known as the “Nearest Neighbors Problem”; it is ubiquitous across a number of areas of Computer Science: Machine Learning, Data Mining, Artificial Intelligence, etc. The obvious algorithm costs on the order of d · n^2 · log(k) operations, which tends to be prohibitively expensive in most non-trivial settings. There exist “fast” schemes based on various “tree” structures. In very low dimensions such methods are quite satisfactory; as the dimensionality increases, the algorithms slow down and are replaced with approximate ones (i.e., instead of the nearest neighbors, they find neighbors that are “somewhat close”). At some point, existing tree-based techniques become ineffective due to the notorious “curse of dimensionality”; many Machine Learning techniques can be viewed simply as attempts to avoid situations where the Nearest Neighbors Problem has to be solved. I will discuss a randomized algorithm for the approximate nearest neighbor problem that is effective for fairly large values of d. The algorithm is iterative, and its CPU time requirements are of the order T · n · (d · log(d) + k · (d + log(k)) · log(n)) + n · k^2 · (d + log(k)), where T is the number of iterations performed; the probability of error decreases exponentially with T. The memory requirements of the procedure are of the order n · (d + k).
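To make the d · n^2 · log(k) cost of the obvious algorithm concrete, here is a minimal pure-Python sketch of the brute-force scheme: for each point it scans all n − 1 others (each distance evaluation costs O(d)) while maintaining a size-k heap of the closest candidates seen so far (each update costs O(log k)). The function name and return convention are illustrative, not taken from the text.

```python
import heapq


def brute_force_knn(points, k):
    """Exact k nearest neighbors of every point: the 'obvious' scheme.

    n^2 distance evaluations at O(d) each, plus a size-k max-heap update
    at O(log k) each, giving the O(d * n^2 * log k) total operation count.
    Returns, for each point, the indices of its k nearest neighbors,
    ordered from nearest to farthest.
    """
    n = len(points)
    result = []
    for i in range(n):
        heap = []  # max-heap of the k best candidates, via negated distances
        for j in range(n):
            if j == i:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if len(heap) < k:
                heapq.heappush(heap, (-d2, j))
            elif -heap[0][0] > d2:
                heapq.heapreplace(heap, (-d2, j))
        # sort surviving candidates nearest-first
        result.append([j for _, j in sorted(heap, reverse=True)])
    return result
```

For example, on four collinear points forming two well-separated pairs, each point's nearest neighbor is its partner in the pair.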
A byproduct of the scheme is a data structure permitting a rapid search for the k nearest neighbors among {xj} of an arbitrary point x in R^d. The cost of each such query is proportional to T · (d · log(d) + log(n/k) · k · (d + log(k))), and the memory requirements of the requisite data structure are of the order n · (d + k) + T · (d + n). The algorithm utilizes random rotations and a basic divide-and-conquer scheme, followed by a local graph search. We analyze the scheme’s behavior for certain types of distributions of {xj}, and illustrate its performance via several numerical examples.
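The random-rotation and divide-and-conquer ingredients can be sketched as follows. This is a simplified illustration under my own assumptions, not the authors' procedure: the rotation is approximated by a sequence of random Givens rotations, each pass bisects the rotated point set at a coordinate median down to small boxes, brute force is applied inside each box, and candidates are accumulated over T passes (the local graph search refinement is omitted). All names (`approximate_knn`, `leaf_size`, etc.) are hypothetical.

```python
import math
import random


def random_givens_rotation(points, num_rotations):
    """Apply random Givens rotations: a cheap stand-in for a random rotation of R^d."""
    d = len(points[0])
    pts = [list(p) for p in points]
    for _ in range(num_rotations):
        i, j = random.sample(range(d), 2)          # random coordinate pair
        theta = random.uniform(0.0, 2.0 * math.pi)  # random rotation angle
        c, s = math.cos(theta), math.sin(theta)
        for p in pts:
            p[i], p[j] = c * p[i] - s * p[j], s * p[i] + c * p[j]
    return pts


def split_into_boxes(indices, pts, leaf_size, coord=0):
    """Recursively bisect the point set at the median of one coordinate,
    cycling through coordinates, until each box holds at most leaf_size points."""
    if len(indices) <= leaf_size:
        return [indices]
    indices = sorted(indices, key=lambda t: pts[t][coord])
    mid = len(indices) // 2
    nxt = (coord + 1) % len(pts[0])
    return (split_into_boxes(indices[:mid], pts, leaf_size, nxt)
            + split_into_boxes(indices[mid:], pts, leaf_size, nxt))


def approximate_knn(points, k, iterations, leaf_size=16):
    """Accumulate candidate neighbors over `iterations` randomly rotated
    divide-and-conquer passes; more passes mean fewer missed neighbors."""
    n, d = len(points), len(points[0])
    candidates = [set() for _ in range(n)]
    for _ in range(iterations):
        pts = random_givens_rotation(points, num_rotations=4 * d)
        for box in split_into_boxes(list(range(n)), pts, leaf_size):
            for i in box:  # everyone sharing a box becomes a candidate pair
                candidates[i].update(j for j in box if j != i)

    def dist2(a, b):  # true distances, in the original (unrotated) coordinates
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b]))

    return [sorted(cand, key=lambda j: dist2(i, j))[:k]
            for i, cand in enumerate(candidates)]
```

When `leaf_size >= n` the single box makes the result exact, which is a convenient sanity check; in the intended regime `leaf_size` is small and accuracy comes from repeating the rotated passes.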