{"title": "Speeding up Permutation Testing in Neuroimaging", "book": "Advances in Neural Information Processing Systems", "page_first": 890, "page_last": 898, "abstract": "Multiple hypothesis testing is a significant problem in nearly all neuroimaging studies. In order to correct for this phenomena, we require a reliable estimate of the Family-Wise Error Rate (FWER). The well known Bonferroni correction method, while being simple to implement, is quite conservative, and can substantially under-power a study because it ignores dependencies between test statistics. Permutation testing, on the other hand, is an exact, non parametric method of estimating the FWER for a given \u03b1 threshold, but for acceptably low thresholds the computational burden can be prohibitive. In this paper, we observe that permutation testing in fact amounts to populating the columns of a very large matrix P. By analyzing the spectrum of this matrix, under certain conditions, we see that P has a low-rank plus a low-variance residual decomposition which makes it suitable for highly sub\u2013sampled \u2014 on the order of 0.5% \u2014 matrix completion methods. Thus, we propose a novel permutation testing methodology which offers a large speedup, without sacrificing the fidelity of the estimated FWER. Our valuations on four different neuroimaging datasets show that a computational speedup factor of roughly 50\u00d7 can be achieved while recovering the FWER distribution up to very high accuracy. Further, we show that the estimated \u03b1 threshold is also recovered faithfully, and is stable.", "full_text": "Speeding up Permutation Testing in Neuroimaging \u2217\n\nChris Hinrichs\u2020 Vamsi K. Ithapu\u2020 Qinyuan Sun\u2020 Sterling C. Johnson\u00a7\u2020 Vikas Singh\u2020\n\n\u00a7William S. 
Middleton Memorial VA Hospital    †University of Wisconsin–Madison

{hinrichs,vamsi}@cs.wisc.edu  {qsun28}@wisc.edu  {scj}@medicine.wisc.edu  {vsingh}@biostat.wisc.edu

http://pages.cs.wisc.edu/~vamsi/pt_fast

Abstract

Multiple hypothesis testing is a significant problem in nearly all neuroimaging studies. In order to correct for this phenomenon, we require a reliable estimate of the Family-Wise Error Rate (FWER). The well-known Bonferroni correction method, while simple to implement, is quite conservative, and can substantially under-power a study because it ignores dependencies between test statistics. Permutation testing, on the other hand, is an exact, non-parametric method of estimating the FWER for a given α-threshold, but for acceptably low thresholds the computational burden can be prohibitive. In this paper, we show that permutation testing in fact amounts to populating the columns of a very large matrix P. By analyzing the spectrum of this matrix, under certain conditions, we see that P has a low-rank plus a low-variance residual decomposition which makes it suitable for highly sub-sampled – on the order of 0.5% – matrix completion methods. Based on this observation, we propose a novel permutation testing methodology which offers a large speedup, without sacrificing the fidelity of the estimated FWER. Our evaluations on four different neuroimaging datasets show that a computational speedup factor of roughly 50× can be achieved while recovering the FWER distribution up to very high accuracy. Further, we show that the estimated α-threshold is also recovered faithfully, and is stable.

1 Introduction

Suppose we have completed a placebo-controlled clinical trial of a promising new drug for a neurodegenerative disorder such as Alzheimer's disease (AD) on a small-sized cohort. 
The study is designed such that, in addition to assessing improvements in standard cognitive outcomes (e.g., MMSE), the purported treatment effects will also be assessed using neuroimaging data. The rationale here is that, even if the drug does induce variations in cognitive symptoms, the brain changes are observable much earlier in the imaging data. On the imaging front, this analysis checks for statistically significant differences between brain images of subjects assigned to the two trial arms: treatment and placebo. Alternatively, consider a second scenario where we have completed a neuroimaging research study of a particular controlled factor, such as genotype, and the interest is to evaluate group-wise differences in the brain images: to identify which regions are affected as a function of class membership. In either case, the standard image processing workflow yields for each subject a 3-D image (or voxel-wise "map"). Depending on the image modality acquired, these maps are of cerebral gray matter density, longitudinal deformation (local growth or contraction) or metabolism. It is assumed that these maps have been 'co-registered' across different subjects so that each voxel corresponds to approximately the same anatomical location [1, 2].

∗Hinrichs and Ithapu are joint first authors and contributed equally to this work.

In order to localize the effect under investigation (i.e., treatment or genotype), we then have to calculate a very large number (say, v) of univariate voxel-wise statistics – typically up to several million voxels. 
For example, consider group-contrast t-statistics (here we will mainly consider t-statistics; however, other test statistics are also applicable, such as the F statistic used in ANOVA testing, Pearson's correlation as used in functional imaging studies, or the χ² test of dependence between variates, so long as certain conditions described in Section 2.3 are satisfied). In some voxels, it may turn out that a group-level effect has been indicated, but it is not clear right away what its true significance level should be, if any. As one might expect, given the number of hypothesis tests v, multiple testing issues in this setting are quite severe, making it difficult to assess the true Family-Wise Type I Error Rate (FWER) [3]. If we were to address this issue via Bonferroni correction [4], the enormous number of separate tests implies that certain weaker signals will almost certainly never be detected, even if they are real. This directly affects studies of neurodegenerative disorders, in which atrophy proceeds at a very slow rate and the therapeutic effects of a drug are likely to be mild to moderate anyway. This is a critical bottleneck which makes localizing real, albeit slight, short-term treatment effects problematic. Already, this restriction prevents us from using a smaller-sized study (fewer subjects), increasing the cost of pharmaceutical research. In the worst case, an otherwise real treatment effect of a drug may not survive correction, and the trial may be deemed a failure.

Bonferroni versus true FWER threshold. Observe that, theoretically, there is a case in which the Bonferroni corrected threshold is close to the true FWER threshold: when point-wise statistics are i.i.d. If so, then the extremely low Bonferroni corrected α-threshold crossings effectively become mutually exclusive, which makes the Union Bound (on which Bonferroni correction is based) nearly tight. 
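This near-tightness under independence is easy to check numerically. The sketch below is not from the paper; the choices v = 1000, α = 0.05, 5000 Monte Carlo trials, and one-sided Gaussian statistics are arbitrary illustrative assumptions. It compares the Bonferroni threshold with a direct estimate of the exact FWER threshold for i.i.d. statistics:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
v, alpha, trials = 1000, 0.05, 5000    # illustrative choices, not the paper's

# Bonferroni: run each one-sided test at level alpha / v.
z_bonf = NormalDist().inv_cdf(1 - alpha / v)

# Exact FWER threshold for i.i.d. N(0, 1) statistics: the 1 - alpha
# quantile of the sample maximum over v tests, estimated by Monte Carlo.
max_null = rng.standard_normal((trials, v)).max(axis=1)
z_exact = np.quantile(max_null, 1 - alpha)

# Under independence the two thresholds nearly coincide:
# the Union Bound is almost tight.
print(z_bonf, z_exact)
```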
However, when variables are highly dependent – and indeed, even without smoothing, there are many sources of strong non-Gaussian dependencies between voxels – the true FWER threshold can be much more relaxed, and it is precisely this phenomenon which drives the search for alternatives to Bonferroni correction. Thus, many methods have been developed to more accurately and efficiently estimate or approximate the FWER [5, 6, 7, 8], which is a subject of much interest in statistics [9], machine learning [10], bioinformatics [11], and neuroimaging [12].

Permutation testing. A commonly used method of directly and non-parametrically estimating the FWER is permutation testing [12, 13], which is a method of sampling from the Global (i.e., Family-Wise) Null distribution. Permutation testing ensures that any relevant dependencies present in the data carry through to the test statistics, giving an unbiased estimator of the FWER. If we want to choose a threshold sufficient to exclude all spurious results with probability 1 − α, we can construct a histogram of sample maxima taken from permutation samples, and choose a threshold giving the 1 − α/2 quantile. Unfortunately, reliable FWER estimates derived via permutation testing come at excessive (and often infeasible) computational cost – often tens of thousands or even millions of permutation samples are required, each of which requires a complete pass over the entire data set. This step alone can run from a few days up to many weeks and even longer [14, 15].

Observe that the very same dependencies between voxels that forced the usage of permutation testing indicate that the overwhelming majority of work in computing so many highly correlated Null statistics is redundant. 
Note that regardless of their description, strong dependencies of almost\nany kind will tend to concentrate most of their co-variation into a low-rank subspace, leaving a\nhigh-rank, low-variance residual [5]. In fact, for Genome wide Association studies (GWAS), many\nstrategies calculate the \u2018effective number\u2019 (Me\ufb00) of independent tests corresponding to the rank of\nthis subspace [16, 5]. This paper is based on the observation that such a low-rank structure must also\nappear in permutation test samples. Using ideas from online low-rank matrix completion [17] we can\nsample a few of the Null statistics and reconstruct the remainder as long as we properly account for\nthe residual. This allows us to sub-sample at extremely low rates, generally < 1%. The contribution\nof our work is to signi\ufb01cantly speed up permutation testing in neuroimaging, delivering running time\nimprovements of up to 50\u00d7. In other words, our algorithm does the same job as permutation testing,\nbut takes anywhere from a few minutes up to a few hours, rather than days or weeks. Further, based\non recent work in random matrix theory, we provide an analysis which sheds additional light on\nthe use of matrix completion methods in this context. 
To ensure that our conclusions are not an\nartifact of a speci\ufb01c dataset, we present strong empirical evidence via evaluations on four separate\nneuroimaging datasets of Alzheimer\u2019s disease (AD) and Mild Cognitive Impairment (MCI) patients\nas well as cognitively healthy age-matched controls (CN), showing that the proposed method can\nrecover highly faithful Global Null distributions, while offering substantial speedups.\n\n2\n\n\f2 The Proposed Algorithm\n\nWe \ufb01rst cover some basic concepts underlying permutation testing and low rank matrix completion\nin more detail, before presenting our algorithm and the associated analysis.\n\n2.1 Permutation testing\n\nRandomly sampled permutation testing [18] is a methodology for drawing samples under the Global\n(Family-Wise) Null hypothesis. Recall that although point-wise test statistics have well character-\nized univariate Null distributions, the sample maximum usually has no analytic form due to the\nstrong correlations across voxels. Permutation is particularly desirable in this setting because it is\nfree of any distribution assumption whatsoever [12]. The basic idea of permutation testing is very\nsimple, yet extremely powerful. Suppose we have a set of labeled high dimensional data points,\nand a univariate test statistic which measures some interaction between labeled groups for every\ndimension (or feature). If we randomly permute the labels and recalculate each test statistic, then\nby construction we get a sample from the Global Null distribution. The maximum over all of these\nstatistics for every permutation sample is then used to construct a histogram, which therefore is a\nnon-parametric estimate of the distribution of the sample maximum of Null statistics. For a test\nstatistic derived from the real labels, the FWER corrected p-value is then equal to the fraction of\npermutation samples which were more extreme. 
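On synthetic data, the whole procedure fits in a few lines. The sketch below is illustrative, not the paper's setup: the voxel count, group sizes, and permutation count are arbitrary toy values, and the pooled-variance two-sample t-statistic is one concrete choice of test statistic. It draws the permutation samples, builds the max-Null histogram, and reads off the corrected p-value:

```python
import numpy as np

rng = np.random.default_rng(1)
v, n1, n2, T = 500, 20, 20, 1000       # voxels, group sizes, permutations

X = rng.standard_normal((n1 + n2, v))  # subjects x voxels (synthetic maps)
labels = np.array([0] * n1 + [1] * n2)

def tstats(X, labels):
    """Voxel-wise two-sample t-statistics with pooled variance."""
    a, b = X[labels == 0], X[labels == 1]
    pooled = ((a.var(0, ddof=1) * (n1 - 1) + b.var(0, ddof=1) * (n2 - 1))
              / (n1 + n2 - 2))
    return (a.mean(0) - b.mean(0)) / np.sqrt(pooled * (1 / n1 + 1 / n2))

# Each random relabeling yields one sample from the Global Null; the
# sample maxima form the non-parametric estimate of the max-Null distribution.
max_null = np.empty(T)
for t in range(T):
    max_null[t] = tstats(X, rng.permutation(labels)).max()

# FWER-corrected p-value of the most extreme observed statistic.
p_corr = (max_null >= tstats(X, labels).max()).mean()
```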
Note that all of the permutation samples can be assembled into a matrix P ∈ R^{v×T} where v is the number of comparisons (voxels for images), and T is the number of permutation samples.

There is a drawback to this approach, however. Observe that it is in the nature of random sampling methods that we get many samples from near the mode(s) of the distribution of interest, but fewer from the tails. Hence, to characterize the threshold for a small portion of the tail of this distribution, we must draw a very large number of samples just so that the estimate converges. Thus, if we want an α = 0.01 threshold from the Null sample maximum distribution, we require many thousands of permutation samples; each requires randomizing the labels and recalculating all test statistics, a very computationally expensive procedure when v is large. To be certain, we would like to ensure an especially low FWER by first setting α very low, and then getting a very precise estimate of the corresponding threshold. The smallest possible p-value we can derive this way is 1/T, so for very low p-values, T must be very large.

2.2 Low-rank Matrix completion

Low-rank matrix completion [19] seeks to reconstruct missing entries from a matrix, given only a small fraction of its entries. The problem is ill-posed unless we assume this matrix has a low-rank column space. If so, then a much smaller number of observations, on the order of r log(v), where r is the column space's rank and v is its ambient dimension [19], is sufficient to recover both an orthogonal basis for the row space as well as the expansion coefficients for each column, giving the recovery. By placing an ℓ1-norm penalty on the eigenvalues of the recovered matrix via the nuclear norm [20, 21] we can ensure that the solution is as low rank as possible. 
Alternatively, we can specify a rank r ahead of time, and estimate an orthogonal basis of that rank by following a gradient along the Grassmannian manifold [22, 17]. Denoting the set of randomly subsampled entries as Ω, the matrix completion problem is given as,

min_P̃ ‖P_Ω − P̃_Ω‖²_F   s.t.   P̃ = UW; U is orthogonal                         (1)

where U ∈ R^{v×r} is the low-rank basis of P, Ω gives the measured entries, and W is the set of expansion coefficients which reconstructs P̃ in U. Two recent methods operate in an online setting, i.e., where rows of P arrive one at a time, and both U and W are updated accordingly [22, 17].

2.3 Low rank plus a long tail

Real-world data often have a dominant low-rank component. While the data may not be exactly characterized by a low-rank basis, the residual will not significantly alter the eigen-spectrum of the sample covariance in such cases. Having strong correlations is nearly synonymous with having a skewed eigen-spectrum, because the flatter the eigen-spectrum becomes, the sparser the resulting covariance matrix tends to be (the "uncertainty principle" between low-rank and sparse matrices [23]). This low-rank structure carries through for purely linear statistics (such as sample means). However, non-linearities in the test statistic calculation, e.g., normalizing by pooled variances, will contribute a long tail of eigenvalues, and so we require that this long tail either decays rapidly, or that it does not overlap with the dominant eigenvalues. For t-statistics, the pooled variances are unlikely to change very much from one permutation sample to another (barring outliers); hence we expect that the spectrum of P will resemble that of the data covariance, with the addition of a long, exponentially decaying tail. 
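With U fixed, the fixed-rank variant of (1) reduces, for each incoming column, to an ordinary least-squares problem. The sketch below uses synthetic low-rank data; the truncated SVD stands in for a Grassmannian tracking method such as GRASTA, and all sizes, noise levels and the 2% sampling rate are illustrative assumptions (the toy v here is far smaller than a real image, hence a rate above the paper's 0.5%):

```python
import numpy as np

rng = np.random.default_rng(2)
v, r, n_train = 2000, 10, 100          # toy sizes; real v is in the millions

# Synthetic columns with a planted rank-r structure plus a small residual.
U0 = np.linalg.qr(rng.standard_normal((v, r)))[0]
train = (U0 @ rng.standard_normal((r, n_train))
         + 1e-3 * rng.standard_normal((v, n_train)))

# "Training": estimate the rank-r basis from fully sampled columns
# (a stand-in for online subspace tracking).
U = np.linalg.svd(train, full_matrices=False)[0][:, :r]

# "Recovery": observe ~2% of a new column, fit its coefficients by
# least squares, and reconstruct the unobserved entries.
p_new = U0 @ rng.standard_normal(r) + 1e-3 * rng.standard_normal(v)
omega = rng.choice(v, size=v // 50, replace=False)
w, *_ = np.linalg.lstsq(U[omega], p_new[omega], rcond=None)
p_hat = U @ w

rel_err = np.linalg.norm(p_hat - p_new) / np.linalg.norm(p_new)
```

With 40 observed entries and only r = 10 unknown coefficients, the least-squares fit is overdetermined and the relative reconstruction error stays small.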
More generally, if the non-linearity does not de-correlate the test statistics too much, it will preserve the low-rank structure.

If this long tail is indeed dominated by the low-rank structure, then its contribution to P can be modeled as a low variance Gaussian i.i.d. residual. A Central Limit argument appeals to the number of independent eigenfunctions that contribute to this residual, and the orthogonality of eigenfunctions implies that as more of them meaningfully contribute to each entry in the residual, the more independent those entries become. In other words, if this long tail begins at a low magnitude and decays slowly, then we can treat it as a Gaussian i.i.d. residual; and if it decays rapidly, then the residual will perhaps be less Gaussian, but also more negligible. Thus, our development in the next section makes no direct assumption about these eigenvalues themselves, but rather assumes that the residual corresponds to a low-variance i.i.d. Gaussian random matrix; its contribution to the covariance of test statistics will be Wishart distributed, and from that we can characterize its eigenvalues.

2.4 Our Method

It still remains to model the residual numerically. By sub-sampling we can reconstruct the low-rank portion of P via matrix completion, but in order to obtain the desired sample maximum distribution we must also recover the residual. Exact recovery of the residual is essentially impossible; fortunately, for our purposes we only need its effect on the distribution of the maximum per permutation test. 
So, we estimate its variance (its mean is zero by assumption), and then randomly sample from that distribution to recover the unobserved remainder of the matrix.

A large component of the running time of online subspace tracking algorithms is spent in updating the basis set U; yet, once a good estimate for U has been found this becomes superfluous. We therefore divide the entire process into two steps: training, and recovery. During the training phase we conduct a small number of fully sampled permutation tests (100 permutations in our experiments). From these permutation tests, we estimate U using sub-sampled matrix completion methods [22, 17], making multiple passes over the training set (with fixed sub-sampling rate), until convergence. In our evaluations, three passes sufficed. Then, we obtain a distribution of the residual S over the entire training set. Next is the recovery phase, in which we sub-sample a small fraction of the entries of each successive column t, solve for the reconstruction coefficients W(·, t) in the basis U by least-squares, and then add random residuals using parameters estimated during training. After that, we proceed exactly as in normal permutation testing, to recover the statistics.

Bias-Variance tradeoff. By using a very sparse subsampling method, there is a bias-variance dilemma in estimating S. That is, if we use the entire matrix P to estimate U, W and S, we will obtain reliable estimates of S. But there is an overfitting problem: the least-squares objective used in fitting W(·, t) to such a small sample of entries is likely to grossly underestimate the variance of S compared to when we use the entire matrix (the sub-sampled problem is not nearly as over-constrained as that for the whole matrix). This sampling artifact reduces the apparent variance of S, and induces a bias in the distribution of the sample maximum, because extreme values are found less frequently. 
This sampling artifact has the effect of 'shifting' the distribution of the sample maximum towards 0. We correct for this bias by estimating the amount of the shift during the training phase, and then shifting the recovered sample max distribution by this estimated amount.

3 Analysis

We now discuss two results which show that as long as the variance of the residual is below a certain level, we can recover the distribution of the sample maximum. Recall from (1) that for low-rank matrix completion methods to be applied we must assume that the permutation matrix P can be decomposed into a low-rank component plus a high-rank residual matrix S:

P = UW + S,                                                                      (2)

where U is a v × r orthogonal matrix that spans the r ≪ min(v, t)-dimensional column subspace of P, and W is the corresponding coefficient matrix. We can then treat the residual S as a random matrix whose entries are i.i.d. zero-mean Gaussian with variance σ². We arrive at our first result by analyzing how the low-rank portion of P's singular spectrum interlaces with the contribution coming from the residual, by treating P as a low-rank perturbation of a random matrix. If this low-rank perturbation is sufficient to dominate the eigenvalues of the random matrix, then P can be recovered with high fidelity at a low sampling rate [22, 17]. Consequently, we can estimate the distribution of the maximum as well, as shown by our second result.

The following development relies on the observation that the eigenvalues of PPᵀ are the squared singular values of P. Thus, rather than analyzing the singular value spectrum of P directly, we can analyze the eigenvalues of PPᵀ using a recent result from [24]. This is important because in order to ensure recovery of P, we require that its singular value spectrum will approximately retain the shape of UW's. 
More precisely, we require that for some 0 < δ < 1,

|φ̃_i − φ_i| < δφ_i,  i = 1, . . . , r;        φ̃_i < δφ_r,  i = r + 1, . . . , v        (3)

where φ_i and φ̃_i are the singular values of UW and P respectively. (Recall that in this analysis P is the perturbation of UW.) Thm. 3.1 relates the rate at which eigenvalues are perturbed, δ, to the parameterization of S in terms of σ². The theorem's principal assumption also relates σ² inversely with the number of columns of P, which is just the number of trials t. Note however that the process may be split up between several matrices P_i, and the results can then be combined. For purposes of applying this result in practice we may then choose a number of columns t which gives the best bound. Theorem 3.1 also assumes that the number of trials t is greater than the number of voxels v, which is a difficult regime to explore empirically. Thus, our numerical evaluations cover the case where t < v, while Thm. 3.1 covers the case where t is larger.

From the definition of P in (2), we have,

PPᵀ = UWWᵀUᵀ + SSᵀ + UWSᵀ + SWᵀUᵀ.                                               (4)

We first analyze the change in eigenvalue structure of SSᵀ when perturbed by UWWᵀUᵀ (which has r non-zero eigenvalues). The influence of the cross-terms (UWSᵀ and SWᵀUᵀ) is addressed later. Thus, we have the following theorem.

Theorem 3.1. Denote the r non-zero eigenvalues of Q = UWWᵀUᵀ ∈ R^{v×v} by λ_1 ≥ λ_2 ≥ · · · ≥ λ_r > 0; and let S be a v × t random matrix such that S_{i,j} ∼ N(0, σ²), with unknown σ². As v, t → ∞ such that v/t ≪ 1, the eigenvalues λ̃_i of the perturbed matrix Q + SSᵀ will satisfy

|λ̃_i − λ_i| < δλ_i,  i = 1, . . . , r;        λ̃_i < δλ_r,  i = r + 1, . . . , v        (⋆)

for some 0 < δ < 1, whenever σ² < δλ_r/t.

Proof. (Sketch) The proof proceeds by constructing the asymptotic eigenvalues λ̃_i (for i = 1, . . . , v), and later bounding them to satisfy (⋆). The construction of λ̃_i is based on Theorem 2.1 from [24]. Firstly, an asymptotic spectral measure µ of (1/t)SSᵀ is calculated, followed by its Cauchy transform G_µ(z). Using G_µ(z) and its functional inverse G_µ^{−1}(θ), we get λ̃_i in terms of λ_i, σ², v and t. Finally, the constraints in (⋆) are applied to λ̃_i to upper bound σ². The supplement includes the proof.

Note that the missing cross-terms would not change the result of Theorem 3.1 drastically, because UW has r non-zero singular values and hence UWSᵀ is a low-rank projection of a low-variance random matrix, and this will clearly be dominated by either of the other terms. Having justified the model in (2), the following theorem shows that the empirical distribution of the maximum Null statistic approximates the true distribution.

Theorem 3.2. Let m_t = max_i P_{i,t} be the maximum observed test statistic at permutation trial t, and similarly let m̂_t = max_i P̂_{i,t} be the maximum reconstructed test statistic. Further, let the maximum reconstruction error be ε, such that |P_{i,t} − P̂_{i,t}| ≤ ε. Then, for any real number k > 0, we have,

Pr[ m_t − m̂_t − (b − b̂) > kε ] < 1/k²

where b is the bias term described in Section 2, and b̂ is its estimate from the training phase.

The result is an application of Chebyshev's bound. 
The complete proof is given in the supplement.

4 Experimental evaluations

Our experimental evaluations include four separate neuroimaging datasets of Alzheimer's Disease (AD) patients, cognitively healthy age-matched controls (CN), and in some cases Mild Cognitive Impairment (MCI) patients. The first of these is the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, a nation-wide multi-site study. ADNI is a landmark study sponsored by the NIH, major pharmaceuticals and others to determine the extent to which multimodal brain imaging can help predict onset, and monitor progression of, AD. The others were collected as part of other studies of AD and MCI. We refer to these datasets as Datasets A–D. Their demographic characteristics are as follows: Dataset A: 40 subjects, AD vs. CN, median age: 76; Dataset B: 50 subjects, AD vs. CN, median age: 68; Dataset C: 55 subjects, CN vs. MCI, median age: 65.16; Dataset D: 70 subjects, CN vs. MCI, median age: 66.24.

Our evaluations focus on three main questions: (i) Can we recover an acceptable approximation of the maximum statistic Null distribution from an approximation of the permutation test matrix? (ii) What degree of computational speedup can we expect at various subsampling rates, and how does this affect the trade-off with approximation error? (iii) How sensitive is the estimated α-level threshold with respect to the recovered Null distribution? In all our experiments, the rank estimate for subspace tracking (to construct the low-rank basis U) was taken as the number of subjects.

4.1 Can we recover the Maximum Null?

Our experiments suggest that our model can recover the maximum Null. We use Kullback–Leibler (KL) divergence and Bhattacharyya Distance (BD) to compare the estimated maximum Null from our model to the true one. 
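Both measures can be computed directly from binned max-Null samples. The following is a generic sketch, not the paper's code: the bin count, the smoothing constant, and the Gumbel draws standing in for two max-Null samples are all arbitrary choices.

```python
import numpy as np

def hist_divergences(x, y, bins=100, eps=1e-12):
    """KL divergence KL(p||q) and Bhattacharyya distance between two
    samples, estimated from histograms over a shared set of bin edges."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, edges = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=edges)
    p = p / p.sum() + eps                # smooth to avoid log(0)
    q = q / q.sum() + eps
    kl = float(np.sum(p * np.log(p / q)))
    bd = float(-np.log(np.sum(np.sqrt(p * q))))
    return kl, bd

rng = np.random.default_rng(3)
# Sample maxima are roughly extreme-value distributed, so Gumbel draws
# make a plausible stand-in for two estimates of the same max-Null.
a = rng.gumbel(3.5, 0.3, 50000)
b = rng.gumbel(3.5, 0.3, 50000)
kl, bd = hist_divergences(a, b)          # both near zero for a close match
```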
We also construct a "Naive-Null", where the subsampled statistics are pooled and the Null distribution is constructed with no further processing (i.e., no completion). Using this as a baseline, Fig. 1 shows the KL and BD values obtained from three datasets, at 20 different sub-sampling rates (ranging from 0.1% to 10%). Note that our model involves a training module where the approximate 'bias' of the residuals is estimated. This estimation is prone to noise (depending, for example, on the number of training frames). Hence Fig. 1 also shows the error bars pertaining to 5 realizations at the 20 sampling rates.

Figure 1: KL (blue) and BD (red) measures between the true max Null distribution (given by the full matrix P) and that recovered by our method (thick lines), along with the baseline naive subsampling method (dotted lines). Panels: (a) Dataset A; (b) Dataset B; (c) Dataset C. The plot for Dataset D is in the extended version of the paper.

The first observation from Fig. 1 is that both the KL and BD measures between the recovered Null and the true distribution are < e−5 for sampling rates above 0.4%. This suggests that our model recovers both the shape (low BD) and position (low KL) of the null to high accuracy at extremely low sub-sampling. We also see that above a certain minimum subsampling rate (∼ 0.3%), the KL and BD do not change drastically as the rate is increased. This is expected from the theory on matrix completion, where after observing a minimum number of data samples, adding in new samples does not substantially increase information content. Further, the error bars (although very small in magnitude) of both KL and BD show that the recovery is noisy. We believe this is due to the approximate estimate of the bias from the training module.

4.2 What is the computational speedup?

Our experiments suggest that the speedup is substantial. Figs. 3 and 2 compare the time taken to perform the complete permutation testing to that of our model. The three plots in Fig. 3 correspond to the datasets used in Fig. 1, in that order. Each plot contains 4 curves, representing the time taken by our model, the corresponding sampling and GRASTA [17] recovery (plus training) times, and the total time to construct the entire matrix P (horizontal line). Fig. 2 shows the scatter plot of computational speedup vs. KL divergence (over 3 repeated sets of experiments on all the datasets and sampling rates). Our model achieved at least a 30× decrease in computation time in the low sampling regime (< 1%). Around 0.5%–0.6% subsampling (where the KL and BD are already < e−5), the computation speed-up factor averaged over all datasets was 45×. This shows that our model achieved good accuracy (low KL and BD) together with high computational speedup in tandem, especially for 0.4%–0.7% sampling rates. However, note from Fig. 2 that there is a trade-off between the speedup factor and approximation error (KL or BD).

Figure 3: Computation time (in minutes) of our model compared to that of computing the entire matrix P, for the same three datasets as in Fig. 1; the plot for Dataset D is in the extended version of the paper. The horizontal line (magenta) shows the time taken for computing the full matrix P. The other three curves are: subsampling (blue), GRASTA recovery (red), and the total time taken by our model (black). Plots correspond to the low sampling regime (< 1%); note the jump in the y-axis (black boxes). For reference, the speedup factor at the 0.4% sampling rate is reported at the bottom of each plot: (a) 45.1; (b) 45.6; (c) 48.5.

Figure 2: Scatter plot of computational speedup vs. KL. The plot corresponds to the 20 different samplings on all 4 datasets (for 5 repeated sets of experiments) and the colormap is from 0.1% to 10% sampling rate. The x-axis is in log scale.

Overall, the highest computational speedup factor achieved at a recovery level of e−5 on KL and BD is around 50× (and this occurred around the 0.4%–0.5% sampling rate; refer to Fig. 2). It was observed that a speedup factor of up to 55× was obtained for Datasets C and D at 0.3% subsampling, where the KL and BD were as low as e−5.5 (refer to Fig. 1 and the extended version of the paper).

4.3 How stable is the estimated α-threshold (clinical significance)?

Our experiments suggest that the threshold is stable. Fig. 4 and Table 1 summarize the clinical significance of our model. Fig. 4 shows the error in estimating the true max threshold, at the 1 − α = 0.95 level of confidence. The x-axis corresponds to the 20 different sampling rates used and the y-axis shows the absolute difference of thresholds in log scale. Observe that for sampling rates higher than 3%, the mean and maximum differences were 0.04 and 0.18. Note that the binning resolution of the max. statistic used for constructing the Null was 0.01. These results show that not only is the global shape of the maximum Null distribution estimated to high accuracy (see Section 4.1), but also the shape and area in the tail.

Figure 4: Error of estimated t statistic thresholds (red) for the 20 different subsampling rates on the four Datasets: (a) Datasets A, B; (b) Datasets C, D. The confidence level is 1 − α = 0.95. The y-axis is in log-scale. For reference, the thresholds given by the baseline model (blue) are included. Note that each plot corresponds to two datasets.

To support this observation, we show the absolute differences of the estimated thresholds on all the datasets at 4 different alpha levels in Table 1. The errors for 1 - alpha = 0.95, 0.99 are at most 0.16. The increase in error for 1 - alpha > 0.995 is a sampling artifact and is expected. Note that in a few cases, the error at 0.5% is slightly higher than that at 0.3%, suggesting that the recovery is noisy (see Sec. 4.1 and the error bars of Fig. 1). Overall, the estimated alpha-thresholds are both faithful and stable.

Table 1: Errors of estimated t-statistic thresholds on all datasets at two different subsampling rates.

Data   Sampling            1 - alpha level
name   rate        0.95    0.99    0.995   0.999
A      0.3%        0.16    0.14    0.21    0.07
       0.5%        0.13    0.11    0.13    0.03
B      0.3%        0.02    0.08    0.07    0.13
       0.5%        0.02    0.10    0.07    0.04
C      0.3%        0.04    0.03    0.10    0.20
       0.5%        0.01    0.05    0.27    0.05
D      0.3%        0.08    0.08    0.25    0.31
       0.5%        0.12    0.07    0.13    0.22

5 Conclusions and future directions

In this paper, we have proposed a novel method of efficiently approximating the permutation testing matrix by first estimating the major singular vectors, then filling in the missing values via matrix completion, and finally estimating the distribution of residual values. Experiments on four different neuroimaging datasets show that we can recover the distribution of the maximum Null statistic to a high degree of accuracy, while maintaining a computational speedup factor of roughly 50x. While our focus has been on neuroimaging problems, we note that multiple testing and False Discovery Rate (FDR) correction are important issues in genomic and RNA analyses, and our contribution may offer enhanced leverage to existing methodologies which use permutation testing in these settings [6].

Acknowledgments: We thank Robert Nowak, Grace Wahba, Moo K.
Chung and the anonymous reviewers for their helpful comments, and Jia Xu for helping with a preliminary implementation of the model. This work was supported in part by NIH R01 AG040396; NSF CAREER grant 1252725; NSF RI 1116584; Wisconsin Partnership Fund; UW ADRC P50 AG033514; UW ICTR 1UL1RR025011 and a Veterans Administration Merit Review Grant I01CX000165. Hinrichs is supported by a CIBM post-doctoral fellowship via NLM grant 2T15LM007359. The contents do not represent views of the Dept. of Veterans Affairs or the United States Government.

References

[1] J. Ashburner and K. J. Friston. Voxel-based morphometry: the methods. NeuroImage, 11(6):805-821, 2000.

[2] J. Ashburner and K. J. Friston. Why voxel-based morphometry should be used. NeuroImage, 14(6):1238-1243, 2001.

[3] P. H. Westfall and S. S. Young. Resampling-based multiple testing: examples and methods for p-value adjustment, volume 279. Wiley-Interscience, 1993.

[4] J. M. Bland and D. G. Altman. Multiple significance tests: the Bonferroni method. British Medical Journal, 310(6973):170, 1995.

[5] J. Li and L. Ji. Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity, 95(3):221-227, 2005.

[6] J. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440-9445, 2003.

[7] H. Finner and V. Gontscharuk. Controlling the familywise error rate with plug-in estimator for the proportion of true null hypotheses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1031-1048, 2009.

[8] J. T. Leek and J. D. Storey. A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences, 105(48):18718-18723, 2008.

[9] S. Clarke and P. Hall. Robustness of multiple testing procedures against dependence.
The Annals of Statistics, pages 332-358, 2009.

[10] S. García, A. Fernández, J. Luengo, and F. Herrera. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10):2044-2064, 2010.

[11] Y. Ge, S. Dudoit, and T. P. Speed. Resampling-based multiple testing for microarray data analysis. Test, 12(1):1-77, 2003.

[12] T. Nichols and S. Hayasaka. Controlling the familywise error rate in functional neuroimaging: a comparative review. Statistical Methods in Medical Research, 12:419-446, 2003.

[13] K. D. Singh, G. R. Barnes, and A. Hillebrand. Group imaging of task-related changes in cortical synchronisation using nonparametric permutation testing. NeuroImage, 19(4):1589-1601, 2003.

[14] D. Pantazis, T. E. Nichols, S. Baillet, and R. M. Leahy. A comparison of random field theory and permutation methods for the statistical analysis of MEG data. NeuroImage, 25(2):383-394, 2005.

[15] B. Gaonkar and C. Davatzikos. Analytic estimation of statistical significance maps for support vector machine based multi-variate image analysis and classification. NeuroImage, 78:270-283, 2013.

[16] J. M. Cheverud. A simple correction for multiple comparisons in interval mapping genome scans. Heredity, 87(1):52-58, 2001.

[17] J. He, L. Balzano, and A. Szlam. Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video. In CVPR, 2012.

[18] M. Dwass. Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics, 28(1):181-187, 1957.

[19] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080, 2010.

[20] M. Fazel, H. Hindi, and S. Boyd.
Rank minimization and applications in system theory. In American Control Conference, volume 4, 2004.

[21] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. arXiv preprint, 2007. arXiv:0706.4138.

[22] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. arXiv preprint, 2010. arXiv:1006.4046.

[23] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572-596, 2011.

[24] F. Benaych-Georges and R. R. Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494-521, 2011.