K-optimal Randomization Tests for Association in Practical Metric Spaces Using Nearest Neighbor Methods

James McQueen, Jan Stallaert

Let (Xi, Yi), i = 1, 2, …, N, be pairs of random variables, with each Xi in M1 and each Yi in M2, where M1 and M2 are metric spaces with distances d1 and d2, respectively. The goal is to test the hypothesis H0 that the Yi are identically distributed and independent of the Xi. Let Si be the set of K nearest neighbors of Xi in distance d1, and let Vi be the average of the distances d2(Yj, Yk) over those j and k for which Xj and Xk are in Si. Then V = (1/N) Σ Vi will tend to be small if there is an association between the Xi and the Yi, so V can be used to test H0. An approximate randomization test based on the normal approximation to V was proposed by MacQueen (1991a). That test was found to work in a wide range of situations, but a satisfactory objective method for choosing K was not established.
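The statistic V described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name knn_statistic is hypothetical, the inputs are assumed to be precomputed N x N distance matrices for d1 and d2, Si is assumed to exclude Xi itself, and each Vi is assumed to average over unordered pairs j < k in Si.

```python
from itertools import combinations


def knn_statistic(D1, D2, K):
    """Return V = (1/N) * sum_i V_i for neighborhood size K (K >= 2).

    D1[i][j] = d1(X_i, X_j) and D2[i][j] = d2(Y_i, Y_j).
    Assumptions (not stated in the source): S_i excludes X_i itself,
    and V_i averages over unordered pairs j < k within S_i.
    """
    n = len(D1)
    total = 0.0
    for i in range(n):
        # Indices of the other points, ordered by d1-distance from X_i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: D1[i][j])
        S_i = others[:K]
        pairs = list(combinations(S_i, 2))
        # V_i: mean d2-distance between the Y values of neighbor pairs.
        total += sum(D2[j][k] for j, k in pairs) / len(pairs)
    return total / n
```

When Y varies smoothly with X, neighbors in M1 have nearby Y values, so V comes out smaller than it does under a random pairing of the same X and Y values.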

This paper provides a practical, objective method of choosing K as part of a new test. The test statistic p* is defined as the minimum over K = 2, 3, …, N-1 of the significance probabilities pK of the above test. Because this statistic is generally too difficult to obtain exactly, it is approximated by p-hat*, the minimum over K of the normal approximation estimates p-hatK of the pK. The statistic p-hat* is then evaluated by approximate randomization: a sample of random pairings of the Xi and the Yi is drawn, p-hat* is computed for each pairing, and the approximate significance probability p-hat is the proportion of these values less than or equal to the p-hat* obtained from the original data. Because p-hat is an unbiased estimate of the true significance probability and can be made as accurate as desired by increasing the number of random pairings, it can be used as a measure of strength in the usual way. If a formal procedure is desired, rejecting H0 when p-hat ≤ alpha gives a Type I error of not more than alpha + epsilon, with epsilon small.
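The procedure can be sketched as follows. This is a rough illustration rather than the authors' algorithm, and one step is swapped in: the abstract does not give the moments used in the normal approximation of MacQueen (1991a), so the sketch estimates the null mean and standard deviation of V for each K from the random pairings themselves. The function names, the parameters n_perm and seed, and the conventions for Si (excluding Xi, unordered pairs) are all assumptions.

```python
import random
from itertools import combinations
from statistics import NormalDist, mean, stdev


def neighbor_pairs(D1, K):
    """For each i, unordered pairs among the K nearest neighbors of X_i."""
    n = len(D1)
    out = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: D1[i][j])
        out.append(list(combinations(order[:K], 2)))
    return out


def v_stat(pairs_by_point, D2, perm):
    """V when Y_perm[j] is paired with X_j."""
    total = 0.0
    for pairs in pairs_by_point:
        total += sum(D2[perm[j]][perm[k]] for j, k in pairs) / len(pairs)
    return total / len(pairs_by_point)


def min_p_test(D1, D2, Ks, n_perm=200, seed=0):
    """Return (p_hat_star, p_hat) for the minimum-over-K test.

    Assumption: the null mean and sd of V for each K are estimated from
    the sampled pairings, standing in for the normal approximation
    referenced in the text.
    """
    rng = random.Random(seed)
    n = len(D1)
    identity = list(range(n))
    perms = [rng.sample(identity, n) for _ in range(n_perm)]
    std_normal = NormalDist()
    star_obs = float("inf")
    star_null = [float("inf")] * n_perm
    for K in Ks:
        pairs = neighbor_pairs(D1, K)
        v_obs = v_stat(pairs, D2, identity)
        v_null = [v_stat(pairs, D2, p) for p in perms]
        mu, sd = mean(v_null), stdev(v_null)
        if sd == 0:
            continue  # V does not vary for this K; it carries no information
        # Small V indicates association, so use the lower tail.
        star_obs = min(star_obs, std_normal.cdf((v_obs - mu) / sd))
        for b, v in enumerate(v_null):
            star_null[b] = min(star_null[b], std_normal.cdf((v - mu) / sd))
    # Proportion of random pairings at least as extreme as the data.
    p_hat = sum(s <= star_obs for s in star_null) / n_perm
    return star_obs, p_hat
```

Because the neighbor sets Si depend only on the Xi, they are computed once per K and reused for every random pairing; only the Y indices are permuted.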

The test is applied to a variety of simulated data sets of quite different kinds and is found to be practical and convincing. In the multivariate situation, the test performs respectably well in comparison to the F test when all the assumptions of the F test hold.

1999-09-01