Penn Arts & Sciences Logo

Friday, February 11, 2011 - 2:00pm

Valdimir Rokhlin

Yale

Location

University of Pennsylvania

Heilmeier Hall (Towne 100)

Given a collection of n points x1, x2, . . . , xn in R^d and an integer k << n, the task of finding the k nearest neighbors for each xi is known as the “Nearest Neighbors Problem”; it is ubiquitous in a number of areas of Computer Science: Machine Learning, Data Mining, Artificial Intelligence, etc. The obvious algorithm costs order d n^2 log(k) operations, which tends to be prohibitively expensive in most non-trivial environments. There exist “fast” schemes, based on various “tree” structures. In very low dimensions, such methods are quite satisfactory; as the dimensionality increases, the algorithms become slow, and are replaced with approximate ones (i.e., instead of nearest neighbors, they find neighbors that are “somewhat close”). At some point, existing tree-based techniques become ineffective due to the notorious “curse of dimensionality”; many Machine Learning techniques can be viewed simply as attempts to avoid situations where the Nearest Neighbors Problem has to be solved. I will discuss a randomized algorithm for the approximate nearest neighbor problem that is effective for fairly large values of d. The algorithm is iterative, and its CPU time requirements are of the order T · N · (d · (log d) + k · (d + log k) · (logN)) + N · k^2· (d + log k), with T the number of iterations performed; the probability of errors decreases exponentially with T. The memory requirements of the procedure are of the order N · (d + k).

A byproduct of the scheme is a data structure permitting a rapid search for the k nearest neighbors among {xj} for an arbitrary point x in R^d. The cost of each such query is proportional to T · (d · (log d) + log(N/k) · k · (d + log k)) , and the memory requirements for the requisite data structure are of the order N · (d + k) + T · (d + N). The algorithm utilizes random rotations and a basic divide-and-conquer scheme, followed by a local graph search. We analyze the scheme’s behavior for certain types of distributions of {xj}, and illustrate its performance via several numerical examples.