{"title": "Evaluating the statistical significance of biclusters", "book": "Advances in Neural Information Processing Systems", "page_first": 1324, "page_last": 1332, "abstract": "Biclustering (also known as submatrix localization) is a problem of high practical relevance in exploratory analysis of high-dimensional data. We develop a framework for performing statistical inference on biclusters found by score-based algorithms. Since the bicluster was selected in a data dependent manner by a biclustering or localization algorithm, this is a form of selective inference. Our framework gives exact (non-asymptotic) confidence intervals and p-values for the significance of the selected biclusters. Further, we generalize our approach to obtain exact inference for Gaussian statistics.", "full_text": "Evaluating the statistical signi\ufb01cance of biclusters\n\nJason D. Lee, Yuekai Sun, and Jonathan Taylor\n\nInstitute of Computational and Mathematical Engineering\n\n{jdl17,yuekai,jonathan.taylor}@stanford.edu\n\nStanford University\nStanford, CA 94305\n\nAbstract\n\nBiclustering (also known as submatrix localization) is a problem of high prac-\ntical relevance in exploratory analysis of high-dimensional data. We develop a\nframework for performing statistical inference on biclusters found by score-based\nalgorithms. Since the bicluster was selected in a data dependent manner by a\nbiclustering or localization algorithm, this is a form of selective inference. Our\nframework gives exact (non-asymptotic) con\ufb01dence intervals and p-values for the\nsigni\ufb01cance of the selected biclusters.\n\nIntroduction\n\n1\nGiven a matrix X \u2208 Rm\u00d7n, biclustering or submatrix localization is the problem of identifying a\nsubset of the rows and columns of X such that the bicluster or submatrix consisting of the selected\nrows and columns are \u201csigni\ufb01cant\u201d compared to the rest of X. 
An important application of biclustering is the identification of significant genotype-phenotype associations in the (unsupervised) analysis of gene expression data. The data are usually represented by an expression matrix X whose rows correspond to genes and columns correspond to samples. Thus genotype-phenotype associations correspond to salient submatrices of X. The location and significance of such biclusters, in conjunction with relevant clinical information, give preliminary results on the genetic underpinnings of the phenotypes being studied.

More generally, given a matrix X ∈ R^{m×n} whose rows correspond to variables and columns correspond to samples, biclustering seeks sample-variable associations in the form of salient submatrices. Without loss of generality, we consider square matrices X ∈ R^{n×n} of the form

X = M + Z,  Z_ij ~ N(0, σ^2),
M = µ e_{I_0} e_{J_0}^T,  µ ≥ 0,  I_0, J_0 ⊂ [n],    (1.1)

where the components of e_I, I ⊂ [n], are given by (e_I)_i = 1 if i ∈ I and 0 otherwise.

For our theoretical results, we assume the size of the embedded submatrix |I_0| = |J_0| = k and the noise variance σ^2 are known.

The biclustering problem, due to its practical relevance, has attracted considerable attention. Most previous work focuses on finding significant submatrices. A large class of algorithms for biclustering are score-based, i.e. they search for submatrices that maximize some score function that measures the "significance" of a submatrix. In this paper, we focus on evaluating the significance of submatrices found by score-based algorithms for biclustering. More precisely, let I(X), J(X) ⊂ [n] be a (random) pair output by a biclustering algorithm. We seek to test whether the localized submatrix X_{I(X),J(X)} contains any signal, i.e.
test the hypothesis

H_0 : ∑_{i ∈ I(X), j ∈ J(X)} M_ij = 0.    (1.2)

Since the hypothesis depends on the (random) output of the biclustering algorithm, this is a form of selective inference. The distribution of the test statistic ∑_{i ∈ I(X), j ∈ J(X)} X_ij depends on the specific algorithm, and is extremely difficult to derive for many heuristic biclustering algorithms.

Our main contribution is a test of whether a biclustering algorithm has found a statistically significant bicluster. The tests and confidence intervals we construct are exact, meaning that in finite samples the type 1 error is exactly α.

This paper is organized as follows. First, we review recent work on biclustering and related problems. Then, in section 2, we describe our framework for performing inference in the context of a simple biclustering algorithm based on a scan statistic. We show

1. the framework gives exact (non-asymptotic) Unif(0, 1) p-values under H_0, and the p-values can be "inverted" to form confidence intervals for the amount of signal in X_{I(X),J(X)};

2. under the minimax signal-to-noise ratio (SNR) regime µ ≳ √(log n / k), the test has full asymptotic power.

In section 4, we show the framework handles more computationally tractable biclustering algorithms, including a greedy algorithm originally proposed by Shabalin et al. [12]. In the supplementary materials, we discuss the problem in the more general setting where there are multiple embedded submatrices. Finally, we present experimental validation of the various tests and biclustering algorithms.

1.1 Related work

A slightly easier problem is submatrix detection: test whether a matrix has an embedded submatrix with nonzero mean [1, 4].
This problem was recently studied by Ma and Wu [11], who characterized the minimum signal strength µ needed for any test, and for any computationally tractable test, to reliably detect an embedded submatrix.

We emphasize that the problem we consider is not the submatrix detection problem, but a complementary problem. Submatrix detection asks whether there are any hidden row-column associations in a matrix. We ask whether a submatrix selected by a biclustering algorithm captures the hidden association(s). In practice, given a matrix, a practitioner might perform (in order)

1. submatrix detection: check for a hidden submatrix with elevated mean.
2. submatrix localization: attempt to find the hidden submatrix.
3. selective inference: check whether the selected submatrix captures any signal.

We focus on the third step in the pipeline. Results on evaluating the significance of selected submatrices are scarce. The only result we know of is by Bhamidi, Dey and Nobel, who characterized the asymptotic distribution of the largest k × k average submatrix in Gaussian random matrices [6]. Their result may be used to form an asymptotic test of (1.2).

The submatrix localization problem, due to its practical relevance, has attracted considerable attention [5, 2, 3]. Most prior work focuses on finding significant submatrices. Broadly speaking, submatrix localization procedures fall into one of two types: score-based search procedures and spectral algorithms. The main idea behind the score-based approach to submatrix localization is that significant submatrices should maximize some score that measures the "significance" of a submatrix, e.g. the average of its entries [12] or the goodness-of-fit of a two-way ANOVA model [8, 9]. Since there are exponentially many submatrices, many score-based search procedures use heuristics to reduce the search space.
Such heuristics are not guaranteed to succeed, but often perform well in practice. One of the purposes of our work is to test whether a heuristic algorithm has identified a significant submatrix.

The submatrix localization problem exhibits a statistical and computational trade-off that was first studied by Balakrishnan et al. [5]. They compare the SNR required by several computationally efficient algorithms to the minimax SNR. Recently, Chen and Xu [7] studied the trade-off when there are several embedded submatrices. In this more general setting, they show the SNR required by convex relaxation is smaller than the SNR required by entry-wise thresholding. Thus the power of convex relaxation is in separating clusters/submatrices, not in identifying one cluster/submatrix.

2 A framework for evaluating the significance of a submatrix

Our main contribution is a framework for evaluating the significance of a submatrix selected by a biclustering algorithm. The framework allows us to perform exact (non-asymptotic) inference on the selected submatrix. In this section, we develop the framework on a (very) simple score-based algorithm that outputs the largest average submatrix. At a high level, our framework consists of characterizing the selection event {(I(X), J(X)) = (I, J)} and applying the key distributional result in [10] to obtain a pivotal quantity.

2.1 The significance of the largest average submatrix

To begin, we consider performing inference on the output of the simple algorithm that returns the k × k submatrix with largest sum. Let S be the set of indices of all k × k submatrices of X, i.e. S = {(I, J) | I, J ⊂ [n], |I| = |J| = k}.
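As a concrete point of reference, the model (1.1) and the exhaustive search over S can be sketched in a few lines. This is a minimal sketch, not the paper's code; the function names are ours. One simplification is used: for a fixed row set I, the optimal column set J is just the k columns with the largest sums over I, so only the (n choose k) row subsets need to be enumerated.

```python
import numpy as np
from itertools import combinations

def planted_submatrix(n, k, mu, sigma=1.0, rng=None):
    """Draw X = M + Z from model (1.1): a k x k block with mean mu at
    random rows I0 and columns J0, plus i.i.d. N(0, sigma^2) noise."""
    rng = np.random.default_rng(rng)
    I0 = np.sort(rng.choice(n, size=k, replace=False))
    J0 = np.sort(rng.choice(n, size=k, replace=False))
    X = sigma * rng.standard_normal((n, n))
    X[np.ix_(I0, J0)] += mu
    return X, I0, J0

def las_bruteforce(X, k):
    """Exhaustive search: maximize e_I^T X e_J over all pairs in S.
    For a fixed row set I, the optimal J is the k columns with the
    largest sums over I, so only row subsets are enumerated."""
    n = X.shape[0]
    best_val, best_pair = -np.inf, None
    for I in combinations(range(n), k):
        col_sums = X[list(I), :].sum(axis=0)
        J = np.sort(np.argsort(col_sums)[-k:])
        val = col_sums[J].sum()
        if val > best_val:
            best_val, best_pair = val, (list(I), J.tolist())
    return best_pair, best_val
```

With µ well above the √(log n / k) scale, this exhaustive search recovers (I_0, J_0) with high probability even for small n; its cost is what motivates the tractable algorithms of section 4.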
The Largest Average Submatrix (LAS) algorithm returns a pair (I_LAS(X), J_LAS(X)) given by

(I_LAS(X), J_LAS(X)) = argmax_{(I,J) ∈ S} e_I^T X e_J.

The optimal value S(1) = tr(e_{J_LAS(X)} e_{I_LAS(X)}^T X) is distributed like the maximum of (n choose k)^2 (correlated) normal random variables. Although results on the asymptotic distribution (k fixed, n growing) of S(1) (under H_0 : µ = 0) are known (e.g. Theorem 2.1 in [6]), we are not aware of any results that characterize the finite sample distribution of the optimal value. To sidestep this difficulty, we condition on the selection event

E_LAS(I, J) = {(I_LAS(X), J_LAS(X)) = (I, J)}    (2.1)

and work with the distribution of X | {(I_LAS(X), J_LAS(X)) = (I, J)}.

We begin by making a key observation. The selection event given by (2.1) is equivalent to X satisfying a set of linear inequalities:

tr(e_J e_I^T X) ≥ tr(e_{J'} e_{I'}^T X) for any (I', J') ∈ S \ (I, J).    (2.2)

Thus the selection event is equivalent to X falling in the polyhedral set

C_LAS(I, J) = {X ∈ R^{n×n} | tr(e_J e_I^T X) ≥ tr(e_{J'} e_{I'}^T X) for any (I', J') ∈ S \ (I, J)},    (2.3)

and X | {(I_LAS(X), J_LAS(X)) = (I, J)} = X | {X ∈ C_LAS(I, J)} is a constrained Gaussian random variable.

Recall our goal is to perform inference on the amount of signal in the selected submatrix X_{I_LAS(X),J_LAS(X)}. This task is akin to performing inference on the mean parameter^1 of a constrained Gaussian random variable, namely X | {X ∈ C_LAS(I, J)}. We apply the selective inference framework by Lee et al. [10] to accomplish this task.

Before we delve into the details of how we perform inference on the mean parameter of a constrained Gaussian random variable, we review the key distributional result in [10] concerning constrained Gaussian random variables.

Theorem 2.1.
Consider a Gaussian random variable y ∈ R^n with mean ν ∈ R^n and covariance Σ ∈ S^{n×n}_{++}, constrained to a polyhedral set

C = {y ∈ R^n | Ay ≤ b} for some A ∈ R^{m×n}, b ∈ R^m.

^1 The mean parameter is the mean of the Gaussian prior to truncation.

Let η ∈ R^n represent a linear function of y. Define α = AΣη / (η^T Σ η) and

V+(y) = min_{j: α_j > 0} (1/α_j)(b_j − (Ay)_j + α_j η^T y),    (2.4)

V−(y) = max_{j: α_j < 0} (1/α_j)(b_j − (Ay)_j + α_j η^T y),    (2.5)

V0(y) = min_{j: α_j = 0} (b_j − (Ay)_j).    (2.6)

Further define

F(x, ν, σ^2, a, b) = [Φ((x − ν)/σ) − Φ((a − ν)/σ)] / [Φ((b − ν)/σ) − Φ((a − ν)/σ)].    (2.7)

The expression F(η^T y, η^T ν, η^T Σ η, V−(y), V+(y)) is a pivotal quantity with a Unif(0, 1) distribution, i.e.

F(η^T y, η^T ν, η^T Σ η, V−(y), V+(y)) | {Ay ≤ b} ∼ Unif(0, 1).    (2.8)

Remark 2.2. The truncation limits V+(y) and V−(y) (and V0(y)) depend on η and the polyhedral set C. We omit the dependence to keep our notation manageable.

Recall X | {E_LAS(I, J)} is a constrained Gaussian random variable (constrained to the polyhedral set C_LAS(I, J) given by (2.3)). By Theorem 2.1 and the characterization of the selection event E_LAS(I, J), the random variable

F(S(1), tr(e_J e_I^T M), σ^2 k^2, V−(X), V+(X)) | {E_LAS(I, J)},

where V+(X) and V−(X) (and V0(X)) are evaluated on the polyhedral set C_LAS(I, J), is uniformly distributed on the unit interval. The mean parameter tr(e_J e_I^T M) = |I ∩ I_0||J ∩ J_0| µ is the amount of signal captured by X_{I,J}.

What are V+(X) and V−(X)?
Let E_{I',J'} = e_{I'} e_{J'}^T for any I', J' ⊂ [n]. For convenience, we index the constraints (2.2) by the pairs (I', J'). The term α_{I',J'} is given by

α_{I',J'} = (|I ∩ I'||J ∩ J'| − k^2) / k^2.

Since |I ∩ I'||J ∩ J'| < k^2, α_{I',J'} is negative for any (I', J') ∈ S \ (I, J), and the upper truncation limit V+(X) is ∞. The lower truncation limit V−(X) simplifies to

V−(X) = max_{(I',J'): α_{I',J'} < 0} [ tr(E_{I,J}^T X) − k^2 tr((E_{I,J} − E_{I',J'})^T X) / (k^2 − |I ∩ I'||J ∩ J'|) ].    (2.9)

We summarize the developments thus far in a corollary.

Corollary 2.3. We have

F(S(1), tr(e_J e_I^T M), k^2 σ^2, V−(X), ∞) | {E_LAS(I, J)} ∼ Unif(0, 1),    (2.10)

where

V−(X) = max_{(I',J'): α_{I',J'} < 0} [ tr(E_{I,J}^T X) − k^2 tr((E_{I,J} − E_{I',J'})^T X) / (k^2 − |I ∩ I'||J ∩ J'|) ].    (2.11)

Under the hypothesis

H_0 : tr(e_{J_LAS(X)} e_{I_LAS(X)}^T M) = 0,    (2.12)

we expect

F(S(1), 0, k^2 σ^2, V−(X), ∞) | {E_LAS(I, J)} ∼ Unif(0, 1).

Thus 1 − F(S(1), 0, k^2 σ^2, V−(X), ∞) is a p-value for the hypothesis (2.12). Under the alternative, we expect the selected submatrix to be (stochastically) larger than under the null. Thus rejecting H_0 when the p-value is smaller than α is an exact α-level test for H_0; i.e. Pr_0(reject H_0 | {E_LAS(I, J)}) = α.
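Numerically, the pivot above is just a ratio of normal CDFs. The following is a minimal sketch of the resulting p-value computation (function names are ours, scipy is assumed; it is not the authors' released code):

```python
import numpy as np
from scipy.stats import norm

def trunc_gauss_cdf(x, nu, sigma2, a, b):
    """F(x, nu, sigma^2, a, b): the CDF at x of a N(nu, sigma^2)
    random variable truncated to the interval [a, b]."""
    sd = np.sqrt(sigma2)
    num = norm.cdf((x - nu) / sd) - norm.cdf((a - nu) / sd)
    den = norm.cdf((b - nu) / sd) - norm.cdf((a - nu) / sd)
    return num / den

def las_selective_pvalue(s1, v_minus, k, sigma=1.0):
    """Selective p-value 1 - F(S(1), 0, k^2 sigma^2, V-(X), infinity)
    for H0: tr(e_J e_I^T M) = 0, conditional on the LAS selection."""
    return 1.0 - trunc_gauss_cdf(s1, 0.0, (k * sigma) ** 2, v_minus, np.inf)
```

The p-value is small exactly when the optimal value S(1) clears the lower truncation limit V−(X) by a wide margin, which matches the intuition developed for scan statistics in section 3.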
Since the test controls Type I error at α for all possible selection events (i.e. all possible outcomes of the LAS algorithm), the test also controls Type I error unconditionally:

Pr_0(reject H_0) = ∑_{I,J ⊂ [n]} Pr_0(reject H_0 | {E_LAS(I, J)}) Pr_0({E_LAS(I, J)}) ≤ α ∑_{I,J ⊂ [n]} Pr_0({E_LAS(I, J)}) = α.

Thus the test is an exact α-level test of H_0. We summarize the result in a theorem.

Theorem 2.4. The test that rejects H_0 when

F(S(1), 0, k^2 σ^2, V−(X), ∞) ≥ 1 − α,

where

V−(X) = max_{(I',J'): α_{I',J'} < 0} [ tr(E_{I_LAS(X),J_LAS(X)}^T X) − k^2 tr((E_{I_LAS(X),J_LAS(X)} − E_{I',J'})^T X) / (k^2 − |I_LAS(X) ∩ I'||J_LAS(X) ∩ J'|) ],

is a valid α-level test for H_0 : ∑_{i ∈ I(X), j ∈ J(X)} M_ij = 0.

To obtain confidence intervals for the amount of signal in the selected submatrix, we "invert" the pivotal quantity given by (2.10). By Corollary 2.3, the interval

{ν ∈ R : α/2 ≤ F(S(1), ν, k^2 σ^2, V−(X), ∞) ≤ 1 − α/2}    (2.13)

is an exact 1 − α confidence interval for ∑_{i ∈ I(X), j ∈ J(X)} M_ij. When (I_LAS(X), J_LAS(X)) = (I_0, J_0), (2.13) is a confidence interval for k^2 µ (and hence, after rescaling, for µ). Like the test given by Theorem 2.4, the confidence intervals given by (2.13) are also valid unconditionally.

2.2 Power under minimax signal-to-noise ratio

In section 2.1, we derived an exact (non-asymptotically valid) test for the hypothesis (2.12). In this section, we study the power of the test.
Before we delve into the details, we review some relevant results to place our result in the correct context.

Balakrishnan et al. [5] show µ must be at least Θ(σ √(log(n−k)/k)) for any algorithm to succeed (find the embedded submatrix) with high probability. They also show the LAS algorithm is minimax rate optimal; i.e. the LAS algorithm finds the embedded submatrix with probability 1 − 4/(n−k) when µ ≥ 4σ √(2 log(n−k)/k). We show that the test given by Theorem 2.4 has full asymptotic power under the same signal strength. The proof is given in the appendix.

Theorem 2.5. Let µ = C √(2 log(n−k)/k) and k ≤ n/2. When C exceeds an explicit threshold depending on α, n, and k (given in the appendix), the α-level test given by Corollary 2.3 has power at least 1 − 5/(n−k); i.e.

Pr(reject H_0) ≥ 1 − 5/(n−k).

Further, for any sequence (n, k) such that n → ∞, when C > 4 and k ≤ n/2, Pr(reject H_0) → 1.

3 General scan statistics

Although we have elected to present our framework in the context of biclustering, the framework readily extends to scan statistics. Let z ∼ N(µ, Σ), where E[z] has the form

E[z_i] = µ if i ∈ S and 0 otherwise, for some µ > 0 and S ⊂ [n].

The set S belongs to a collection C = {S_1, . . . , S_N}.
We decide which index set in C generated the data by

Ŝ = argmax_{S ∈ C} ∑_{i ∈ S} z_i.    (3.1)

Given Ŝ, we are interested in testing the null hypothesis

H_0 : E[z_Ŝ] = 0.    (3.2)

To perform exact inference for the selected effect µ_Ŝ, we must first characterize the selection event. We observe that the selection event {Ŝ = S} is equivalent to z satisfying a set of linear inequalities given by

e_S^T z ≥ e_{S'}^T z for any S' ∈ C \ S.    (3.3)

Given the form of the constraints (3.3),

a_{S'} = (e_{S'} − e_S)^T e_S / (e_S^T e_S) = (1/|S|)(|S ∩ S'| − |S|) for any S' ∈ C \ S.

Since |S ∩ S'| ≤ |S|, we have a_{S'} ∈ [−1, 0], which implies V+(z) = ∞. The term V−(z) also simplifies:

V−(z) = sup_{S'} (1/a_{S'})((e_S − e_{S'})^T z + a_{S'} e_S^T z) = e_S^T z + sup_{S'} (1/a_{S'})((e_S − e_{S'})^T z).

Let z(1), z(2) be the largest and second largest scan statistics. We have

V−(z) ≤ z(1) + sup_{S'} ((e_{S'} − e_S)^T z) = z(1) + z(2) − z(1) = z(2).

Intuitively, the pivot will be large (the p-value will be small) when e_S^T z exceeds the lower truncation limit V− by a large margin. Since the second largest scan statistic is an upper bound for the lower truncation limit, the test will reject when z(1) exceeds z(2) by a large margin.

Theorem 3.1. The test that rejects when

F(z(1), 0, e_Ŝ^T Σ e_Ŝ, V−(z), ∞) ≥ 1 − α,

where V−(z) = e_Ŝ^T z + sup_{S'} (1/a_{S'})((e_Ŝ − e_{S'})^T z), is a valid α-level test for H_0 : e_Ŝ^T µ = 0.

To our knowledge, most procedures for obtaining valid inference on scan statistics require careful characterization of the asymptotic distribution of e_Ŝ^T z.
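The simplest instance of this setup is the collection of singletons C = {{1}, . . . , {n}} with i.i.d. N(0, σ^2) noise under the null: there a_{S'} = −1 for every S' ≠ Ŝ, so V+(z) = ∞ and V−(z) is exactly the second-largest coordinate z(2). A minimal sketch of the resulting selective p-value (function name ours, scipy assumed):

```python
import numpy as np
from scipy.stats import norm

def max_coord_selective_pvalue(z, sigma=1.0):
    """Selective p-value for H0: E[z_Shat] = 0, where Shat = argmax_i z_i
    and the scan collection C is the n singletons.  Every a_S' equals -1,
    so V+ = infinity and V- is exactly the second-largest coordinate."""
    z1, z2 = np.sort(z)[[-1, -2]]
    num = norm.cdf(z1 / sigma) - norm.cdf(z2 / sigma)
    den = 1.0 - norm.cdf(z2 / sigma)
    return 1.0 - num / den
```

As the derivation above predicts, the p-value is small only when the largest coordinate z(1) beats the runner-up z(2) by a wide margin; a bare maximum with a close second is not, by itself, evidence of signal.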
Such results are usually only valid when the components of z are independent with identical variances (e.g. see [6]), and can only be used to test the global null H_0 : E[z] = 0. Our framework not only relaxes the independence and homoscedasticity assumptions, but also allows us to form confidence intervals for the selected effect size.

4 Extensions to other score-based approaches

Returning to the submatrix localization problem, we note that the framework described in section 2 also readily handles other score-based approaches, as long as the scores are affine functions of the entries. The main idea is to partition R^{n×n} into non-overlapping regions that correspond to the possible outcomes of the algorithm; i.e. the event that the algorithm outputs a particular submatrix is equivalent to X falling in the corresponding region of R^{n×n}. In this section, we show how to perform exact inference on biclusters found by more computationally tractable algorithms.

4.1 Greedy search

Searching over all (n choose k)^2 submatrices to find the largest average submatrix is computationally intractable for all but the smallest matrices. Here we consider a family of heuristics based on a greedy search algorithm proposed by Shabalin et al. [12] that looks for "local" largest average submatrices. Their approach is widely used to discover genotype-phenotype associations in high-dimensional gene expression data. Here the score is simply the sum of the entries in a submatrix.

Algorithm 1 Greedy search algorithm
1: Initialize: select J^0 ⊂ [n].
2: repeat
3:    I^{l+1} ← the indices of the k rows with the largest sums over the columns in J^l
4:    J^{l+1} ← the indices of the k columns with the largest sums over the rows in I^{l+1}
5: until convergence

To adapt the framework laid out in section 2 to the greedy search algorithm, we must characterize the selection event.
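A minimal sketch of Algorithm 1 (the function name is ours and k is assumed known; this is an illustration, not the authors' implementation):

```python
import numpy as np

def greedy_las(X, k, J_init, max_iter=100):
    """Algorithm 1: alternate between taking the k rows with the largest
    sums over the current column set and the k columns with the largest
    sums over the current row set, until the pair stops changing."""
    J = np.sort(np.asarray(J_init))
    I = np.array([], dtype=int)
    for _ in range(max_iter):
        I_new = np.sort(np.argsort(X[:, J].sum(axis=1))[-k:])
        J_new = np.sort(np.argsort(X[I_new, :].sum(axis=0))[-k:])
        if np.array_equal(I_new, I) and np.array_equal(J_new, J):
            break
        I, J = I_new, J_new
    return I, J
```

Each update can only increase the submatrix sum, so the iteration converges to a local maximum; which local maximum it finds depends on the initialization, which is why the selection event below must record the whole search path.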
Here the selection event is the "path" of the greedy search:

E_GrS = E_GrS((I^1, J^1), (I^2, J^2), . . .)

is the event that the greedy search selected (I^1, J^1) at the first step, (I^2, J^2) at the second step, etc.

In practice, to ensure stable performance of the greedy algorithm, Shabalin et al. propose to run the greedy search with random initialization 1000 times and select the largest local maximum. Suppose the m*-th greedy search outputs the largest local maximum. The selection event is

E_GrS,1 ∩ ··· ∩ E_GrS,1000 ∩ { m* = argmax_{m=1,...,1000} e_{I_GrS,m(X)}^T X e_{J_GrS,m(X)} },

where

E_GrS,m = E_GrS((I^1_m, J^1_m), (I^2_m, J^2_m), . . .), m = 1, . . . , 1000,

is the event that the m-th greedy search selected (I^1_m, J^1_m) at the first step, (I^2_m, J^2_m) at the second step, etc.

An alternative to running the greedy search with random initialization many times and picking the largest local maximum is to initialize the greedy search intelligently. Let J_greedy(X) be the output of the intelligent initialization. The selection event is given by

E_GrS ∩ {J_greedy(X) = J^0},    (4.1)

where E_GrS is the event that the greedy search selected (I^1, J^1) at the first step, (I^2, J^2) at the second step, etc. The intelligent initialization selects J^0 when

e_{[n]}^T X e_j ≥ e_{[n]}^T X e_{j'} for any j ∈ J^0, j' ∈ [n] \ J^0,    (4.2)

which corresponds to selecting the k columns with largest sum.
Thus the selection event is equivalent to X falling in the polyhedral set

C_GrS ∩ {X ∈ R^{n×n} | tr(e_j e_{[n]}^T X) ≥ tr(e_{j'} e_{[n]}^T X) for any j ∈ J^0, j' ∈ [n] \ J^0},

where C_GrS is the constraint set corresponding to the selection event E_GrS (see the appendix for an explicit characterization).

4.2 Largest row/column sum test

An alternative to running the greedy search is to use a test statistic based on choosing the k rows and columns with largest sum. The largest row/column sum test selects a subset of columns J^0 when

e_{[n]}^T X e_j ≥ e_{[n]}^T X e_{j'} for any j ∈ J^0, j' ∈ [n] \ J^0,    (4.3)

which corresponds to selecting the k columns with largest sum. Similarly, it selects the rows I^0 with largest sum. Thus the selection event for initialization at (I^0, J^0) is equivalent to X falling in the polyhedral set

{X ∈ R^{n×n} | tr(e_j e_{[n]}^T X) ≥ tr(e_{j'} e_{[n]}^T X) for any j ∈ J^0, j' ∈ [n] \ J^0}
∩ {X ∈ R^{n×n} | tr(e_i e_{[n]}^T X) ≥ tr(e_{i'} e_{[n]}^T X) for any i ∈ I^0, i' ∈ [n] \ I^0}.    (4.4)

The procedure of selecting the k largest rows/columns was analyzed in [5].
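The selection rule (4.3)-(4.4) amounts to one argsort per axis; a minimal sketch (the function name is ours):

```python
import numpy as np

def topk_rows_cols(X, k):
    """Largest row/column sum selection, as in (4.3)-(4.4): return the
    k rows and the k columns of X with the largest sums."""
    I0 = np.sort(np.argsort(X.sum(axis=1))[-k:])
    J0 = np.sort(np.argsort(X.sum(axis=0))[-k:])
    return I0, J0
```

This is the "intelligent initialization" fed to Algorithm 1 in the experiments: one pass over the row and column sums, with no iteration.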
They proved that when µ ≥ (4/k)√(n log(n−k)), the procedure recovers the planted submatrix. We show a similar result for the test statistic based on the intelligent initialization,

F( tr(e_{J^0(X)} e_{I^0(X)}^T X), 0, σ^2 k^2, V−(X), V+(X) ).    (4.5)

Under the null of µ = 0, the statistic (4.5) is uniformly distributed, so type 1 error is controlled at level α. The theorem below shows that this computationally tractable test has power tending to 1 for µ > (4/k)√(n log(n−k)).

Theorem 4.1. Let µ = (C/k)√(n log(n−k)), and assume n ≥ 2 exp(1) and k ≤ n/2. When C exceeds an explicit threshold depending on α, n, and k (given in the appendix), the α-level test given by Corollary 2.3 has power at least 1 − 9/(n−k); i.e.

Pr(reject H_0) ≥ 1 − 9/(n−k).

Further, for any sequence (n, k) such that n → ∞, when C > 4 and k ≤ n/2, Pr(reject H_0) → 1.

Figure 1: Random initialization with 10 restarts

Figure 2: Intelligent initialization

In practice, we have found that initializing the greedy algorithm with the rows and columns identified by the largest row/column sum test stabilizes the performance of the greedy algorithm and preserves power. By intersecting the selection events from the largest row/column sum test and the greedy algorithm, the test also controls type 1 error. Let (I_loc(X), J_loc(X)) be the pair of indices returned by the greedy algorithm initialized with (I^0, J^0) from the largest row/column sum test.
The test statistic is given by

F( tr(e_{J_loc(X)} e_{I_loc(X)}^T X), 0, σ^2 k^2, V−(X), V+(X) ),    (4.6)

where V+(X), V−(X) are now computed using the intersection of the greedy and the largest row/column sum selection events. This statistic is also uniformly distributed under the null.

We test the performance of three biclustering procedures: Algorithm 1 with a single random initialization, Algorithm 1 with 10 random restarts, and Algorithm 1 with the intelligent initialization in (4.4). We generate data from the model (1.1) for various values of n and k. We only test the power of each procedure, since all of the algorithms discussed provably control type 1 error.

The results are in Figures 1 and 2. The y-axis shows power (the probability of rejecting) and the x-axis is the rescaled signal strength µ / √(2 log(n−k)/k). The tests were calibrated to control type 1 error at α = .1, so any power over .1 is nontrivial. From the k = log n plot, we see that the intelligently initialized greedy procedure outperforms the greedy algorithm with a single random initialization and the greedy algorithm with 10 random initializations.

5 Conclusion

In this paper, we considered the problem of evaluating the statistical significance of the output of several biclustering algorithms. By considering the problem as a selective inference problem, we are able to devise exact significance tests and confidence intervals for the selected bicluster. We also show how the framework generalizes to the more practical problem of evaluating the significance of multiple biclusters.
In this setting, our approach gives sequential tests that control the family-wise error rate in the strong sense.

[Figures 1 and 2 each show three power-curve panels (k = log n, k = sqrt(n), k = .2n) with power on the y-axis, rescaled signal strength on the x-axis, and curves for n = 50, 100, 500, 1000.]

References

[1] Louigi Addario-Berry, Nicolas Broutin, Luc Devroye, Gábor Lugosi, et al. On combinatorial testing problems. The Annals of Statistics, 38(5):3063-3092, 2010.

[2] Brendan PW Ames. Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming, pages 1-37, 2012.

[3] Brendan PW Ames and Stephen A Vavasis. Convex optimization for the planted k-disjoint-clique problem. Mathematical Programming, 143(1-2):299-337, 2014.

[4] Ery Arias-Castro, Emmanuel J Candes, Arnaud Durand, et al. Detection of an anomalous cluster in a network. The Annals of Statistics, 39(1):278-304, 2011.

[5] Sivaraman Balakrishnan, Mladen Kolar, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Statistical and computational tradeoffs in biclustering. In NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning, 2011.

[6] Shankar Bhamidi, Partha S Dey, and Andrew B Nobel. Energy landscape for large average submatrix detection problems in gaussian random matrices. arXiv preprint arXiv:1211.2284, 2012.

[7] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv preprint arXiv:1402.1267, 2014.

[8] Yizong Cheng and George M Church. Biclustering of expression data. In ISMB, volume 8, pages 93-103, 2000.

[9] Laura Lazzeroni and Art Owen. Plaid models for gene expression data.
Statistica Sinica,\n\n12(1):61\u201386, 2002.\n\n[10] Jason D Lee, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. Exact post-selection inference\n\nwith the lasso. arXiv preprint arXiv:1311.6238, 2013.\n\n[11] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. arXiv\n\npreprint arXiv:1309.5914, 2013.\n\n[12] Andrey A Shabalin, Victor J Weigman, Charles M Perou, and Andrew B Nobel. Finding\nlarge average submatrices in high dimensional data. The Annals of Applied Statistics, pages\n985\u20131012, 2009.\n\n9\n\n\f", "award": [], "sourceid": 821, "authors": [{"given_name": "Jason", "family_name": "Lee", "institution": "Stanford"}, {"given_name": "Yuekai", "family_name": "Sun", "institution": "Stanford University"}, {"given_name": "Jonathan", "family_name": "Taylor", "institution": "Stanford University"}]}