A Randomized Approximate Nearest Neighbors Algorithm

Friday, February 11, 2011 - 2:00pm

Vladimir Rokhlin

Yale

Location

University of Pennsylvania

Heilmeier Hall (Towne 100)

Given a collection of N points x_1, x_2, ..., x_N in R^d and an integer k << N, the task of finding the k nearest neighbors for each x_i is known as the "Nearest Neighbors Problem"; it is ubiquitous in a number of areas of Computer Science: Machine Learning, Data Mining, Artificial Intelligence, etc. The obvious algorithm costs order d · N^2 · log(k) operations, which tends to be prohibitively expensive in most non-trivial environments. There exist "fast" schemes, based on various "tree" structures. In very low dimensions, such methods are quite satisfactory; as the dimensionality increases, the algorithms become slow, and are replaced with approximate ones (i.e., instead of nearest neighbors, they find neighbors that are "somewhat close"). At some point, existing tree-based techniques become ineffective due to the notorious "curse of dimensionality"; many Machine Learning techniques can be viewed simply as attempts to avoid situations where the Nearest Neighbors Problem has to be solved. I will discuss a randomized algorithm for the approximate nearest neighbor problem that is effective for fairly large values of d. The algorithm is iterative, and its CPU time requirements are of the order T · N · (d · (log d) + k · (d + log k) · (log N)) + N · k^2 · (d + log k), with T the number of iterations performed; the probability of errors decreases exponentially with T. The memory requirements of the procedure are of the order N · (d + k).
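For reference, the "obvious algorithm" mentioned in the abstract can be sketched as follows. This is only an illustrative baseline, not code from the talk: the function name brute_force_knn and the choice of a bounded heap are assumptions made here. Each of the N points is compared against all others (order d work per pair), and a heap of size k keeps the best candidates (order log k work per update), which gives the d · N^2 · log(k) cost quoted above.

```python
import heapq

def brute_force_knn(points, k):
    """For each point, return the indices of its k nearest neighbors, closest first.

    Hypothetical helper for illustration; assumes len(points) > k.
    """
    n = len(points)
    result = []
    for i in range(n):
        heap = []  # min-heap of (-squared_distance, index); root is the farthest kept point
        for j in range(n):
            if i == j:
                continue
            # O(d) squared Euclidean distance
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if len(heap) < k:
                heapq.heappush(heap, (-d2, j))        # O(log k)
            elif -d2 > heap[0][0]:
                heapq.heapreplace(heap, (-d2, j))     # O(log k): evict the farthest
        # Descending by -d2 means ascending by distance
        result.append([j for _, j in sorted(heap, reverse=True)])
    return result
```

Calling brute_force_knn([(0, 0), (1, 0), (5, 0), (6, 0)], 1) pairs each point with its single closest neighbor. The quadratic dependence on N is what makes this approach impractical for large collections.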

A byproduct of the scheme is a data structure permitting a rapid search for the k nearest neighbors among {x_j} for an arbitrary point x in R^d. The cost of each such query is proportional to T · (d · (log d) + log(N/k) · k · (d + log k)), and the memory requirements for the requisite data structure are of the order N · (d + k) + T · (d + N). The algorithm utilizes random rotations and a basic divide-and-conquer scheme, followed by a local graph search. We analyze the scheme's behavior for certain types of distributions of {x_j}, and illustrate its performance via several numerical examples.
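The three ingredients named above (random rotations, divide-and-conquer, local graph search) might be combined roughly as in the sketch below. This is a simplified illustration under stated assumptions, not the speaker's implementation: all function names and parameters (approx_knn, leaf_size, seed) are hypothetical, the rotation is built by Gram-Schmidt at order d^3 cost rather than by a fast order d · log d transform, and the divide-and-conquer step is assumed to be median splitting along successive rotated coordinates.

```python
import math
import random

def random_rotation(d, rng):
    """Random orthogonal d x d matrix via Gram-Schmidt on Gaussian vectors.
    Assumption: a simple stand-in for a fast pseudo-random rotation."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:
            rows.append([a / norm for a in v])
    return rows

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def split_into_boxes(idx, rotated, depth, leaf_size):
    """Divide-and-conquer: halve at the median of one coordinate per level."""
    if len(idx) <= leaf_size:
        return [idx]
    axis = depth % len(rotated[0])
    idx = sorted(idx, key=lambda i: rotated[i][axis])
    mid = len(idx) // 2
    return (split_into_boxes(idx[:mid], rotated, depth + 1, leaf_size)
            + split_into_boxes(idx[mid:], rotated, depth + 1, leaf_size))

def approx_knn(points, k, iterations=5, leaf_size=None, seed=0):
    """Approximate k nearest neighbors of every point; assumes len(points) > k."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # Leaves of size <= 2k+1 arise from halving, so each holds at least k+1 points.
    leaf_size = max(leaf_size or 0, 2 * k + 1)
    candidates = [set() for _ in range(n)]
    for _ in range(iterations):
        rot = random_rotation(d, rng)
        rotated = [[sum(r[c] * p[c] for c in range(d)) for r in rot] for p in points]
        for box in split_into_boxes(list(range(n)), rotated, 0, leaf_size):
            for i in box:
                candidates[i].update(j for j in box if j != i)

    def best(i, pool):
        # Rotations preserve distances, so rank in the original coordinates.
        return sorted(pool, key=lambda j: (dist2(points[i], points[j]), j))[:k]

    neighbors = [best(i, candidates[i]) for i in range(n)]
    # Local graph search: also consider the neighbors of each current neighbor.
    refined = []
    for i in range(n):
        pool = set(neighbors[i])
        for j in neighbors[i]:
            pool.update(neighbors[j])
        pool.discard(i)
        refined.append(best(i, pool))
    return refined
```

Each iteration uses a fresh rotation, so two nearby points that one split happens to separate have a new chance of landing in the same box on the next pass; this is the intuition behind the error probability shrinking as T (here, iterations) grows. The sketch makes no attempt to match the operation counts quoted in the abstract.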
