Friday, October 29, 2010 - 2:00pm

Mark Tygert

NYU

Location

University of Pennsylvania

Heilmeier Hall (Towne 100)

An underutilized statistic: the Euclidean distance (instead of chi-squared)

Mark Tygert, Courant Institute (NYU)

Abstract: A basic task in statistics is to ascertain whether a given model agrees with a set of independent and identically distributed experiments suspected of drawing from the model probability distribution. This task is known as testing "goodness-of-fit." When the draws can take only finitely many possible values, the canonical approach is the chi-squared test (including related, asymptotically equivalent variants such as the likelihood-ratio, "G," or power-divergence tests). The chi-squared test is based on the root-mean-square difference between the model probability distribution and the empirical distribution estimated from the experimental draws, with the weights in the (weighted) average underlying the root-mean-square being the inverses of the model probabilities. Thus, the canonical approach via chi-squared involves dividing by the model probabilities. This is a bad idea when model probabilities can be small (and especially bad when many are small, as in "long-tailed" distributions). With computers now widely available, it is no longer necessary to divide by small numbers in order to simplify the computation of statistical significance; a more useful measure of the discrepancy between the empirical and model distributions is simply their standard root-mean-square difference (using the usual, uniformly weighted average in the root-mean-square). Chi-squared should now be deprecated (at least for goodness-of-fit and allied problems); the unadulterated Euclidean norm is more natural, more powerful, and easier to use. This is joint work with Will Perkins and Rachel Ward; technical details are available at http://arxiv.org/abs/1006.0042 and http://arxiv.org/abs/1009.2260.
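As a rough illustration of the contrast the abstract describes, the sketch below computes both statistics on a toy long-tailed example. This is not code from the talk or the papers; the function names and the toy data are illustrative assumptions. It shows how the chi-squared statistic, which divides by the model probabilities, can be dominated by a single small-probability bin, while the plain (uniformly weighted) Euclidean statistic is not.

```python
import numpy as np

def chi_squared_stat(counts, model_probs):
    """Classical chi-squared goodness-of-fit statistic:
    sum over bins of (observed - expected)^2 / expected,
    i.e. squared differences weighted by 1 / model probability."""
    n = counts.sum()
    expected = n * model_probs
    return float(np.sum((counts - expected) ** 2 / expected))

def euclidean_stat(counts, model_probs):
    """Plain Euclidean alternative: the squared Euclidean distance
    between the empirical and model distributions, scaled by n,
    with no division by the model probabilities."""
    n = counts.sum()
    empirical = counts / n
    return float(n * np.sum((empirical - model_probs) ** 2))

# Toy example: one bin has model probability 0.01 and a modest
# absolute discrepancy (4 observed draws vs. 1 expected).
counts = np.array([48, 48, 4])
model = np.array([0.50, 0.49, 0.01])

print(chi_squared_stat(counts, model))  # about 9.10, dominated by the small bin
print(euclidean_stat(counts, model))    # about 0.14
```

The division by `expected` in `chi_squared_stat` is exactly the step the abstract objects to: the term for the 0.01-probability bin contributes (4 - 1)^2 / 1 = 9 of the roughly 9.10 total, whereas the uniformly weighted Euclidean statistic treats all bins alike.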