{"title": "Practical Methods for Graph Two-Sample Testing", "book": "Advances in Neural Information Processing Systems", "page_first": 3019, "page_last": 3028, "abstract": "Hypothesis testing for graphs has been an important tool in applied research fields for more than two decades, and still remains a challenging problem as one often needs to draw inference from few replicates of large graphs. Recent studies in statistics and learning theory have provided some theoretical insights about such high-dimensional graph testing problems, but the practicality of the developed theoretical methods remains an open question.\n\nIn this paper, we consider the problem of two-sample testing of large graphs. We demonstrate the practical merits and limitations of existing theoretical tests and their bootstrapped variants. We also propose two new tests based on asymptotic distributions. We show that these tests are computationally less expensive and, in some cases, more reliable than the existing methods.", "full_text": "Practical Methods for Graph Two-Sample Testing\n\nDebarghya Ghoshdastidar\n\nDepartment of Computer Science\n\nUniversity of T\u00fcbingen\n\nghoshdas@informatik.uni-tuebingen.de\n\nUlrike von Luxburg\n\nDepartment of Computer Science\n\nUniversity of T\u00fcbingen\n\nMax Planck Institute for Intelligent Systems\nluxburg@informatik.uni-tuebingen.de\n\nAbstract\n\nHypothesis testing for graphs has been an important tool in applied research \ufb01elds\nfor more than two decades, and still remains a challenging problem as one often\nneeds to draw inference from few replicates of large graphs. Recent studies in\nstatistics and learning theory have provided some theoretical insights about such\nhigh-dimensional graph testing problems, but the practicality of the developed\ntheoretical methods remains an open question.\nIn this paper, we consider the problem of two-sample testing of large graphs. 
We\ndemonstrate the practical merits and limitations of existing theoretical tests and\ntheir bootstrapped variants. We also propose two new tests based on asymptotic\ndistributions. We show that these tests are computationally less expensive and, in\nsome cases, more reliable than the existing methods.\n\n1\n\nIntroduction\n\nHypothesis testing is one of the most commonly encountered statistical problems that naturally arises\nin nearly all scienti\ufb01c disciplines. With the widespread use of networks in bioinformatics, social\nsciences and other \ufb01elds since the turn of the century, it was obvious that the hypothesis testing of\ngraphs would soon become a key statistical tool in studies based on network analysis. The problem\nof testing for differences in networks arises quite naturally in various situations. For instance, Bassett\net al. (2008) study the differences in anatomical brain networks of schizophrenic patients and healthy\nindividuals, whereas Zhang et al. (2009) test for statistically signi\ufb01cant topological changes in gene\nregulatory networks arising from two different treatments of breast cancer. As Clarke et al. (2008)\nand Hyduke et al. (2013) point out, the statistical challenge associated with network testing is the\ncurse of dimensionality as one needs to test large graphs based on few independent samples. Ginestet\net al. 
(2014) show that complications can also arise due to the widespread use of multiple testing\nprinciples that rely on performing independent tests for every edge.\nAlthough network analysis has been a primary research topic in statistics and machine learning,\ntheoretical developments related to testing random graphs have been rather limited until recent times.\nProperty testing of graphs has been well studied in computer science (Goldreich et al., 1998), but\nprobably the earliest instances of the theory of random graph testing are the works on community\ndetection, which use hypothesis testing to detect if a network has planted communities or to determine\nthe number of communities in a block model (Arias-Castro and Verzelen, 2014, Bickel and Sarkar,\n2016, Lei, 2016). In the present work, we are interested in the more general and practically important\nproblem of two-sample testing: Given two populations of random graphs, decide whether both\npopulations are generated from the same distribution or not. While there have been machine learning\napproaches to quantify similarities between graphs for the purpose of classi\ufb01cation, clustering\netc. (Borgwardt et al., 2005, Shervashidze et al., 2011), the use of graph distances for the purpose of\nhypothesis testing is more recent (Ginestet et al., 2017). Most approaches for graph testing based\non classical two-sample tests are applicable in the relatively low-dimensional setting, where the\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fpopulation size (number of graphs) is larger than the size of the graphs (number of vertices). However,\nHyduke et al. (2013) note that this scenario does not always apply because the number of samples\ncould be potentially much smaller \u2014 for instance, one may need to test between two large regulatory\nnetworks (that is, population size is one). 
Such scenarios can be better tackled from a perspective of\nhigh-dimensional statistics as shown in Tang et al. (2016), Ghoshdastidar et al. (2017a), where the\nauthors study two-sample testing for speci\ufb01c classes of random graphs with particular focus on the\nsmall population size.\nIn this work, we focus on the framework of the graph two-sample problem considered in Tang et al.\n(2016), Ginestet et al. (2017), Ghoshdastidar et al. (2017a), where all graphs are de\ufb01ned on a common\nset of vertices. Assume that the number of vertices in each graph is n, and the sample size of either\npopulation is m. One can consider the two-sample problem in three different regimes: (i) m is large;\n(ii) m > 1, but much smaller than n; and (iii) m = 1. The \ufb01rst setting is the simplest one, and\npractical tests are known in this case (Gretton et al., 2012, Ginestet et al., 2017). However, there\nexist many application domains where already the availability of only a small population of graphs\nis a challenge, and large populations are completely out of bounds. The latter two cases of small\nm > 1 and m = 1 have been studied in Ghoshdastidar et al. (2017a) and Tang et al. (2016), where\ntheoretical tests based on concentration inequalities have been developed and practical bootstrapped\nvariants of the tests have been suggested. The contribution of the present work is three-fold:\n\n1. For the cases of m > 1 and m = 1, we propose new tests that are based on asymptotic null\ndistributions under certain model assumptions and we prove their statistical consistency\n(Sections 4 and 5 respectively). The proposed tests are devoid of bootstrapping, and hence,\ncomputationally faster than existing bootstrapped tests for small m. Detailed descriptions of\nthe tests are provided in the supplementary material.\n\n2. We compare the practical merits and limitations of existing tests with the proposed tests\n(Section 6 and supplementary). 
We show that the proposed tests are more powerful and\nreliable than existing methods in some situations.\n\n3. Our aim is also to make the existing and proposed tests more accessible for applied research.\nWe provide Matlab implementations of the tests in the supplementary material.\n\nThe present work is focused on the assumption that all networks are de\ufb01ned over the same set\nof vertices. This may seem restrictive in some application areas, but it is commonly encountered\nin other areas such as brain network analysis or molecular interaction networks, where vertices\ncorrespond to well-de\ufb01ned regions of the brain or protein structures. A few works study the case where\ngraphs do not have vertex correspondences in the context of clustering (Mukherjee et al., 2017) and\ntesting (Ghoshdastidar et al., 2017b, Tang et al., 2017). However, theoretical guarantees are only known\nfor speci\ufb01c choices of network functions (triangle counts or graph spectra), or under the assumption\nof an underlying embedding of the vertices.\nNotation. We use the asymptotic notation $o_n(\\cdot)$ and $\\omega_n(\\cdot)$, where the asymptotics are with respect\nto the number of vertices n. We say $x = o_n(y)$ and $y = \\omega_n(x)$ when $\\lim_{n \\to \\infty} x/y = 0$. We denote the\nmatrix Frobenius norm by $\\|\\cdot\\|_F$ and the spectral norm or largest singular value by $\\|\\cdot\\|_2$.\n\n2 Problem Statement\n\nWe consider the following two-sample setting. Let V be a set of n vertices. Let\n$G_1, \\ldots, G_m$ and $H_1, \\ldots, H_m$ be two populations of undirected unweighted graphs de\ufb01ned on the\ncommon vertex set V, where each population consists of independent and identically distributed\nsamples. The two-sample hypothesis testing problem is as follows:\n\nTest whether $(G_i)_{i=1,\\ldots,m}$ and $(H_i)_{i=1,\\ldots,m}$ are generated from the same random model or not.\n\nThere exists a plethora of nonparametric tests that are provably consistent for m \u2192 \u221e. 
In particular,\nkernel based tests (Gretton et al., 2012) are known to be suitable for two-sample problems in large\ndimensions. These tests, in conjunction with graph kernels (Shervashidze et al., 2011, Kondor and\nPan, 2016) or distances (Mukherjee et al., 2017), may be used to derive consistent procedures for\ntesting between two large populations of graphs. Such principles are applicable even under a more\ngeneral framework without vertex correspondence (see Gretton et al., 2012). However, given graphs\non a common vertex set, the most natural approach is to construct tests based on the graph adjacency\nmatrix or the graph Laplacian (Ginestet et al., 2017). To be precise, one may view each undirected\ngraph on n vertices as a $\\binom{n}{2}$-dimensional vector and use classical two-sample tests based on the\n$\\chi^2$ or $T^2$ statistics (Anderson, 1984). Unfortunately, such tests require an estimate of the\n$\\binom{n}{2} \\times \\binom{n}{2}$-dimensional sample covariance matrix, which cannot be accurately obtained from a moderate sample\nsize. For instance, Ginestet et al. (2017) need regularisation of the covariance estimate even for\nmoderate sized problems (n = 40, m = 100), and it is unknown whether such methods work for\nbrain networks obtained from a single-lab experimental setup (m < 20). For $m \\ll n$, it is indeed\nhard to prove consistency results under the general two-sample framework described above since\nthe correlation among the edges can be arbitrary. Hence, we develop our theory for random graphs\nwith independent edges. Tang et al. (2016) show that tests derived for such graphs are also useful in\npractice.\nWe assume that the graphs are generated from the inhomogeneous Erd\u0151s-R\u00e9nyi (IER) model (Bollobas\net al., 2007). This model has been considered in the work of Ghoshdastidar et al. 
(2017a) and subsumes\nother models studied in the context of graph testing such as dot product graphs (Tang et al., 2016) and\nstochastic block models (Lei, 2016). Given a symmetric matrix $P \\in [0, 1]^{n \\times n}$ with zero diagonal,\na graph G is said to be an IER graph with population adjacency P, denoted as $G \\sim \\mathrm{IER}(P)$, if its\nsymmetric adjacency matrix $A_G \\in \\{0, 1\\}^{n \\times n}$ satis\ufb01es:\n\n$(A_G)_{ij} \\sim \\mathrm{Bernoulli}(P_{ij})$ for all $i < j$, and $\\{(A_G)_{ij} : i < j\\}$ are mutually independent.\n\nFor any n, we state the two-sample problem as follows. Let $P^{(n)}, Q^{(n)} \\in [0, 1]^{n \\times n}$ be two symmetric\nmatrices. Given $G_1, \\ldots, G_m \\sim_{iid} \\mathrm{IER}(P^{(n)})$ and $H_1, \\ldots, H_m \\sim_{iid} \\mathrm{IER}(Q^{(n)})$, test the hypotheses\n\n$H_0 : P^{(n)} = Q^{(n)}$ against $H_1 : P^{(n)} \\neq Q^{(n)}$. (1)\n\nOur theoretical results in subsequent sections will often be in the asymptotic case as n \u2192 \u221e. For this,\nwe assume that there are two sequences of models $(P^{(n)})_{n \\geq 1}$ and $(Q^{(n)})_{n \\geq 1}$, and the sequences are\nidentical under the null hypothesis $H_0$. We derive asymptotic powers of the proposed tests assuming\ncertain separation rates under the alternative hypothesis.\n\n3 Testing large population of graphs (m \u2192 \u221e)\n\nBefore proceeding to the case of small population size, we discuss a baseline approach that is designed\nfor the large m regime (m \u2192 \u221e). The following discussion provides a $\\chi^2$-type test statistic for\nnetworks, which is a simpli\ufb01cation of Ginestet et al. (2017) under the IER assumption. Given the\nadjacency matrices $A_{G_1}, \\ldots, A_{G_m}$ and $A_{H_1}, \\ldots, A_{H_m}$, consider the test statistic\n\n$T_{\\chi^2} = \\sum_{i < j} \\frac{\\left( (\\bar{A}_G)_{ij} - (\\bar{A}_H)_{ij} \\right)^2}{\\frac{1}{m(m-1)} \\sum_{k=1}^{m} \\left( (A_{G_k})_{ij} - (\\bar{A}_G)_{ij} \\right)^2 + \\frac{1}{m(m-1)} \\sum_{k=1}^{m} \\left( (A_{H_k})_{ij} - (\\bar{A}_H)_{ij} \\right)^2}$, (2)\n\nwhere $(\\bar{A}_G)_{ij} = \\frac{1}{m} \\sum_{k=1}^{m} (A_{G_k})_{ij}$. It is easy to see that under $H_0$, $T_{\\chi^2} \\to \\chi^2\\left( \\frac{n(n-1)}{2} \\right)$ in distribution\nas m \u2192 \u221e for any \ufb01xed n. This suggests a $\\chi^2$-type test similar to Ginestet et al. (2017).\nHowever, like any classical test, no performance guarantee can be given for small m and our numerical\nresults show that such a test is powerless for small m and sparse graphs. Hence, in the rest of the\npaper, we consider tests that are powerful even for small m.\n\n4 Testing small populations of large graphs (m > 1)\n\nThe case of small m > 1 for IER graphs was \ufb01rst studied from a theoretical perspective in Ghoshdastidar\net al. (2017a), and the authors also show that, under a minimax testing framework, the testing\nproblem is quite different for m = 1 and m > 1. From a practical perspective, small m > 1 is a\ncommon situation in neural imaging with only few subjects. The case of m = 2 is also interesting\nfor testing between two individuals based on test-retest diffusion MRI data, where two scans are\ncollected from each subject with a separation of multiple weeks (Landman et al., 2011).\nUnder the assumption of IER models described in Section 2 and given the adjacency matrices\n$A_{G_1}, \\ldots, A_{G_m}$ and $A_{H_1}, \\ldots, A_{H_m}$, Ghoshdastidar et al. (2017a) propose test statistics based on\nestimates of the distances $\\|P^{(n)} - Q^{(n)}\\|_2$ and $\\|P^{(n)} - Q^{(n)}\\|_F$ up to certain normalisation factors\nthat account for sparsity of the graphs. 
They consider the following two test statistics\n\n$T_{spec} = \\frac{\\left\\| \\sum_{k=1}^{m} A_{G_k} - A_{H_k} \\right\\|_2}{\\sqrt{\\max_{1 \\leq i \\leq n} \\sum_{j=1}^{n} \\sum_{k=1}^{m} (A_{G_k})_{ij} + (A_{H_k})_{ij}}}$, and (3)\n\n$T_{fro} = \\frac{\\sum_{i < j} \\left( \\sum_{k \\leq m/2} (A_{G_k})_{ij} - (A_{H_k})_{ij} \\right) \\left( \\sum_{k > m/2} (A_{G_k})_{ij} - (A_{H_k})_{ij} \\right)}{\\sqrt{\\sum_{i < j} \\left( \\sum_{k \\leq m/2} (A_{G_k})_{ij} + (A_{H_k})_{ij} \\right) \\left( \\sum_{k > m/2} (A_{G_k})_{ij} + (A_{H_k})_{ij} \\right)}}$. (4)\n\nSubsequently, theoretical tests are constructed based on concentration inequalities: one can show that\nwith high probability, the test statistics are smaller than some speci\ufb01ed threshold under the null\nhypothesis, but they exceed the same threshold if the separation between $P^{(n)}$ and $Q^{(n)}$ is large enough.\nIn practice, however, the authors note that the theoretical thresholds are too large to be exceeded\nfor moderate n, and recommend estimation of the threshold through bootstrapping. Each bootstrap\nsample is generated by randomly partitioning the entire population $G_1, \\ldots, G_m, H_1, \\ldots, H_m$ into\ntwo parts, and $T_{spec}$ or $T_{fro}$ are computed based on this random partition. This procedure provides\nan approximation of the statistic under the null model. We refer to these tests as Boot-Spectral\nand Boot-Frobenius, and show their limitations for small m via simulations. Detailed descriptions\nof these tests are included in Appendix B in the supplementary.\nWe now propose a test based on the asymptotic behaviour of $T_{fro}$ in (4) as n \u2192 \u221e. We state the\nasymptotic behaviour in the following result.\nTheorem 1 (Asymptotic test based on $T_{fro}$). 
In the two-sample framework of Section 2, assume\nthat $P^{(n)}, Q^{(n)}$ have entries bounded away from 1, and satisfy $\\max\\{\\|P^{(n)}\\|_F, \\|Q^{(n)}\\|_F\\} = \\omega_n(1)$.\nUnder the null hypothesis, $\\lim_{n \\to \\infty} T_{fro}$ is dominated by a standard normal random variable, and hence,\nfor any $\\alpha \\in (0, 1)$,\n\n$\\mathbb{P}\\left( T_{fro} \\notin [-t_\\alpha, t_\\alpha] \\right) \\leq \\alpha + o_n(1)$, (5)\n\nwhere $t_\\alpha = \\Phi^{-1}(1 - \\frac{\\alpha}{2})$ is the $\\frac{\\alpha}{2}$ upper quantile of the standard normal distribution.\nOn the other hand, if $\\|P^{(n)} - Q^{(n)}\\|_F^2 = \\omega_n\\left( \\frac{1}{m} \\max\\{\\|P^{(n)}\\|_F, \\|Q^{(n)}\\|_F\\} \\right)$, then\n\n$\\mathbb{P}\\left( T_{fro} \\in [-t_\\alpha, t_\\alpha] \\right) = o_n(1)$. (6)\n\nThe proof, given in Appendix A, is based on the use of the Berry-Esseen theorem (Berry, 1941).\nUsing Theorem 1, we propose an \u03b1-level test based on asymptotic normal dominance of $T_{fro}$.\n\nProposed Test Asymp-Normal: Reject the null hypothesis if $|T_{fro}| > t_\\alpha$.\n\nA detailed description of this test is given in Appendix B. The assumption $\\|P^{(n)}\\|_F, \\|Q^{(n)}\\|_F = \\omega_n(1)$\nis not restrictive since it is quite similar to assuming that the number of edges is super-linear\nin n, that is, the graphs are not too sparse. We note that unlike the $\\chi^2$-test of Section 3, here the\nasymptotics are for n \u2192 \u221e instead of m \u2192 \u221e, and hence, the behaviour under the null hypothesis\nmay not improve for larger m. The asymptotic unit power of the Asymp-Normal test, as shown in\nTheorem 1, is proved under a separation condition, which is not surprising since we have access to\nonly a \ufb01nite number of graphs. 
The result also shows that for large m, smaller separations can be\ndetected by the proposed test.\nRemark 2 (Computational effort). Note that the computational complexity for computing the test\nstatistics in (3) and (4) is linear in the total number of edges in the entire population. However, the\nbootstrap tests require computation of the test statistic multiple times (equal to the number of bootstrap\nsamples b; we use b = 200 in our experiments). On the other hand, the proposed test computes the\nstatistic once, and is much faster (\u223c200 times). Moreover, if the graphs are too large to be stored in\nmemory, bootstrapping requires multiple passes over the data, while the proposed test requires only a\nsingle pass.\n\n5 Testing difference between two large graphs (m = 1)\n\nThe case of m = 1 is perhaps the most interesting from a theoretical perspective: the objective is to\ndetect whether two large graphs G and H are identically distributed or not. This \ufb01nds application\nin detecting differences in regulatory networks (Zhang et al., 2009) or comparing brain networks\nof individuals (Tang et al., 2016). Although the concentration based test using $T_{spec}$ is applicable\neven for m = 1 (Ghoshdastidar et al., 2017a), bootstrapping based on label permutation is infeasible\nfor m = 1 since there is no scope for permuting labels with unit population size. Tang et al. (2016),\nhowever, propose a concentration based test in this case and suggest a bootstrapping procedure based on a low\nrank assumption on the population adjacency. Tang et al. (2016) study the two-sample problem for\nrandom dot product graphs, which are IER graphs with low rank population adjacency matrices\n(ignoring the effect of zero diagonal). This class includes the stochastic block model, where the rank\nequals the number of communities. Let $G \\sim \\mathrm{IER}(P^{(n)})$ and $H \\sim \\mathrm{IER}(Q^{(n)})$, and assume that\n$P^{(n)}$ and $Q^{(n)}$ are of rank r. 
One de\ufb01nes the adjacency spectral embedding (ASE) of graph G as\n$X_G = U_G \\Sigma_G^{1/2}$, where $\\Sigma_G \\in \\mathbb{R}^{r \\times r}$ is a diagonal matrix containing the r largest singular values of $A_G$\nand $U_G \\in \\mathbb{R}^{n \\times r}$ is the matrix of corresponding left singular vectors. Tang et al. (2016) propose the\ntest statistic\n\n$T_{ASE} = \\min\\left\\{ \\|X_G - X_H W\\|_F : W \\in \\mathbb{R}^{r \\times r}, W W^T = I \\right\\}$, (7)\n\nwhere the rank r is assumed to be known. The rotation matrix W aligns the ASE of the two graphs.\nTang et al. (2016) theoretically analyse a concentration based test, where the null hypothesis is rejected\nif $T_{ASE}$ crosses a suitably chosen threshold. In practice, they suggest the following bootstrapping\nto determine the threshold (Algorithm 1 in Tang et al., 2016). One may approximate $P^{(n)}$ by the\nestimated population adjacency (EPA) $\\widehat{P} = X_G X_G^T$. More random dot product graphs can be\nsimulated from $\\widehat{P}$, and a bootstrapped threshold can be obtained by computing $T_{ASE}$ for pairs of\ngraphs generated from $\\widehat{P}$. Instead of the $T_{ASE}$ statistic, one may also use a statistic based on EPA as\n\n$T_{EPA} = \\|\\widehat{P} - \\widehat{Q}\\|_F$. (8)\n\nThis statistic has been used as a distance measure in the context of graph clustering (Mukherjee et al.,\n2017). We refer to the tests based on the statistics in (7) and (8), and the above bootstrapping\nprocedure by Boot-ASE and Boot-EPA (see Appendix B for detailed descriptions). We \ufb01nd that the\nlatter performs better, but both tests work under the condition that the population adjacency is of low\nrank, and the rank is precisely known. Our numerical results demonstrate the limitations of these\ntests when the rank is not correctly known.\nAlternatively, we propose a test based on the asymptotic distribution of eigenvalues that is not\nrestricted to graphs with low rank population adjacencies. 
Given $G \\sim \\mathrm{IER}(P^{(n)})$ and\n$H \\sim \\mathrm{IER}(Q^{(n)})$, consider the matrix $C \\in \\mathbb{R}^{n \\times n}$ with zero diagonal and for $i \\neq j$,\n\n$C_{ij} = \\frac{(A_G)_{ij} - (A_H)_{ij}}{\\sqrt{(n-1) \\left( P^{(n)}_{ij} (1 - P^{(n)}_{ij}) + Q^{(n)}_{ij} (1 - Q^{(n)}_{ij}) \\right)}}$. (9)\n\nWe assume that the entries of $P^{(n)}$ and $Q^{(n)}$ are not arbitrarily close to 1, and de\ufb01ne $C_{ij} = 0$ when\n$C_{ij} = 0/0$. We show that the extreme eigenvalues of C asymptotically follow the Tracy-Widom law,\nwhich characterises the distribution of the largest eigenvalues of matrices with independent standard\nnormal entries (Tracy and Widom, 1996). Subsequently, we show that $\\|C\\|_2$ is a useful test statistic.\nTheorem 3 (Asymptotic test based on $\\|C\\|_2$). Consider the above setting of two-sample testing,\nand let C be as de\ufb01ned in (9). Let $\\lambda_1(C)$ and $\\lambda_n(C)$ be the largest and smallest eigenvalues of C.\nUnder the null hypothesis, that is, if $P^{(n)} = Q^{(n)}$ for all n, then\n\n$n^{2/3}(\\lambda_1(C) - 2) \\to TW_1$ and $n^{2/3}(-\\lambda_n(C) - 2) \\to TW_1$\n\nin distribution as n \u2192 \u221e, where $TW_1$ is the Tracy-Widom law for orthogonal ensembles. 
Hence,\n\n$\\mathbb{P}\\left( n^{2/3}(\\|C\\|_2 - 2) > \\tau_\\alpha \\right) \\leq \\alpha + o_n(1)$, (10)\n\nfor any $\\alpha \\in (0, 1)$, where $\\tau_\\alpha$ is the $\\frac{\\alpha}{2}$ upper quantile of the $TW_1$ distribution.\nOn the other hand, if $P^{(n)}$ and $Q^{(n)}$ are such that $\\|\\mathbb{E}[C]\\|_2 \\geq 4 + \\omega_n(n^{-2/3})$, then\n\n$\\mathbb{P}\\left( n^{2/3}(\\|C\\|_2 - 2) \\leq \\tau_\\alpha \\right) = o_n(1)$. (11)\n\nThe proof, given in Appendix A, relies on results on the spectrum of random matrices (Erd\u0151s et al.,\n2012, Lee and Yin, 2014), which have been previously used for the special case of determining the\nnumber of communities in a block model (Bickel and Sarkar, 2016, Lei, 2016). If the graphs are\nassumed to be block models, then asymptotic power can be proved under more precise conditions on\nthe difference in population adjacencies $P^{(n)} - Q^{(n)}$ (see Appendix A.3). From a practical perspective,\nC cannot be computed since $P^{(n)}$ and $Q^{(n)}$ are unknown. Still, one may approximate them by\nrelying on a weaker version of Szemer\u00e9di\u2019s regularity lemma, which implies that large graphs can\nbe approximated by stochastic block models with a possibly large number of blocks (Lov\u00e1sz, 2012).\nTo this end, we propose to estimate $P^{(n)}$ from $A_G$ as follows. We use a community detection\nalgorithm, such as normalised spectral clustering (Ng et al., 2002), to \ufb01nd r communities in G (r is a\nparameter for the test). Subsequently $P^{(n)}$ is approximated by a block matrix $\\widetilde{P}$ such that if i, j lie in\ncommunities $V_1, V_2$ respectively, then $\\widetilde{P}_{ij}$ is the mean of the sub-matrix of $A_G$ restricted to $V_1 \\times V_2$.\nSimilarly one can also compute $\\widetilde{Q}$ from $A_H$. 
Hence, we propose a Tracy-Widom test statistic as\n\n$T_{TW} = n^{2/3} \\left( \\|\\widetilde{C}\\|_2 - 2 \\right)$, where $\\widetilde{C}_{ij} = \\frac{(A_G)_{ij} - (A_H)_{ij}}{\\sqrt{(n-1) \\left( \\widetilde{P}_{ij} (1 - \\widetilde{P}_{ij}) + \\widetilde{Q}_{ij} (1 - \\widetilde{Q}_{ij}) \\right)}}$ for all $i \\neq j$ (12)\n\nand the diagonal is zero. The proposed \u03b1-level test based on $T_{TW}$ and Theorem 3 is the following.\n\nProposed Test Asymp-TW: Reject the null hypothesis if $T_{TW} > \\tau_\\alpha$.\n\nA detailed description of the test, as used in our implementations, is given in Appendix B. We note\nthat unlike bootstrap tests based on $T_{ASE}$ or $T_{EPA}$, the proposed test uses the number of communities\n(or rank) r only for approximation of $P^{(n)}, Q^{(n)}$, and the power of the test is not sensitive to the\nchoice of r. In addition, the computational bene\ufb01t of a distribution based test over bootstrap tests, as\nnoted in Remark 2, is also applicable in this case.\n\n6 Numerical results\n\nIn this section, we empirically compare the merits and limitations of the tests discussed in the paper.\nWe present our numerical results in three groups: (i) results for random graphs for m > 1, (ii) results\nfor random graphs for m = 1, and (iii) results for testing real networks. For m > 1, we consider\nfour tests. Boot-Spectral and Boot-Frobenius are the bootstrap tests based on $T_{spec}$ (3) and\n$T_{fro}$ (4), respectively. Asymp-Chi2 is the $\\chi^2$-type test based on $T_{\\chi^2}$ (2), which is suited for the large\nm setting, and \ufb01nally, the proposed test Asymp-Normal is based on the normal dominance of $T_{fro}$\nas n \u2192 \u221e as shown in Theorem 1. For m = 1, we consider three tests. Boot-ASE and Boot-EPA\nare the bootstrap tests based on $T_{ASE}$ (7) and $T_{EPA}$ (8), respectively. Asymp-TW is the proposed test\nbased on $T_{TW}$ (12) and Theorem 3. 
Appendices B and C in the supplementary contain descriptions\nof all tests and additional numerical results. Matlab codes are provided in the supplementary\n(also available at https://github.com/gdebarghya/Network-TwoSampleTesting).\n\n6.1 Comparative study on random graphs for m > 1\n\nFor this study, we generate graphs from stochastic block models with 2 communities as considered\nin Tang et al. (2016). We de\ufb01ne $P^{(n)}$ and $Q^{(n)}$ as follows. The vertex set of size n is partitioned into\ntwo communities, each of size n/2. In $P^{(n)}$, edges occur independently with probability p within\neach community, and with probability q between two communities. $Q^{(n)}$ has the same block structure\nas $P^{(n)}$, but edges occur with probability (p + \u03b5) within each community. Under the null hypothesis\n\u03b5 = 0 and hence $Q^{(n)} = P^{(n)}$, whereas under the alternative hypothesis, we set \u03b5 > 0.\n\n[Plot omitted. Panels: under null hypothesis and under alternative hypothesis, for m = 2 and m = 4; x-axis: number of vertices n; y-axis: test power (null rejection rate); curves: Boot-Spectral, Boot-Frobenius, Asymp-Normal.]\n\nFigure 1: Power of different tests for increasing number of vertices n, and for m = 2, 4. The dotted\nline for the case of the null hypothesis corresponds to the signi\ufb01cance level of 5%.\n\nIn our \ufb01rst experiment, we study the performance of different tests for varying m and n. We let n\ngrow from 100 to 1000 in steps of 100, and set p = 0.1 and q = 0.05. We set \u03b5 = 0 and 0.04 for\nnull and alternative hypotheses, respectively. We use two values of population size, m \u2208 {2, 4}, and\n\ufb01x the signi\ufb01cance level at \u03b1 = 5%. Figure 1 shows the rate of rejecting the null hypothesis (test\npower) computed from 1000 independent runs of the experiment. Under the null model, the test\npower should be smaller than \u03b1 = 5%, whereas under the alternative model, a high test power (close\nto 1) is desirable. 
We see that for m = 2, only Asymp-Normal has power while the bootstrap tests\nhave zero rejection rate. This is not surprising as bootstrapping is impossible for m = 2. For m = 4,\nBoot-Frobenius has a behaviour similar to Asymp-Normal although the latter is computationally\nmuch faster. Boot-Spectral achieves a higher power for small n but cannot achieve unit power.\nAsymp-Chi2 has an erratic behaviour for small m, and hence, we study it for larger sample size in\nFigure 3 (in Appendix C). As is expected, Asymp-Chi2 has the desired performance only for $m \\gg n$.\nWe also study the effect of edge sparsity on the performance of the tests. For this, we consider the\nabove setting, but scale the edge probabilities by a factor of \u03c1, where \u03c1 = 1 is exactly same as the\nabove setting while larger \u03c1 corresponds to denser graphs. Figure 4 in the appendix shows the results\nin this case, where we \ufb01x n = 500 and vary $\\rho \\in \\{\\frac{1}{4}, \\frac{1}{2}, 1, 2, 4\\}$ and $m \\in \\{2, 4, 6, 8, 10\\}$. We again\n\ufb01nd that Asymp-Normal and Boot-Frobenius have similar trends for m \u2265 4. All tests perform\nbetter for dense graphs, but Boot-Spectral may be preferred for sparse graphs when m \u2265 6.\n\n6.2 Comparative study on random graphs for m = 1\n\nWe conduct similar experiments for the case of m = 1. Recall that bootstrap tests for m = 1 work\nunder the assumption that the population adjacencies are of low rank. This holds in the above\nsetting of block models, where the rank is 2. We \ufb01rst demonstrate the effect of knowledge of true\nrank on the test power. We use r \u2208 {2, 4} to specify the rank parameter for bootstrap tests, and\nalso as the number of blocks used for community detection step of Asymp-TW. Figure 2 shows the\npower of the tests for the above setting with \u03c1 = 1 and growing n. 
We \ufb01nd that when r = 2, that is,\nthe true rank is known, both bootstrap tests perform well under the alternative hypothesis, and outperform\nAsymp-TW, although Boot-ASE has a high type-I error rate. However, when an over-estimate of\nrank is used (r = 4), both bootstrap tests break down (Boot-EPA always rejects while Boot-ASE\nalways accepts), but the performance of Asymp-TW is robust to this parameter change.\nWe also study the effect of sparsity by varying \u03c1 (see Figure 5 in Appendix C). We only consider the\ncase r = 2. We \ufb01nd that all tests perform better in the dense regime, and the rejection rate of Asymp-TW\nunder the null is below 5% even for small graphs. However, the performance of both Boot-ASE and\nAsymp-TW is poor if the graphs are too sparse. Hence, Boot-EPA may be preferable for sparse\ngraphs, but only if the rank is correctly known.\n\n[Plot omitted. Panels: under null hypothesis and under alternative hypothesis, for r = 2 and r = 4; x-axis: number of vertices n; y-axis: test power (null rejection rate); curves: Boot-ASE, Boot-EPA, Asymp-TW.]\n\nFigure 2: Power of different tests with increasing number of vertices n, and for rank parameter r = 2, 4.\nThe dotted line under null hypothesis corresponds to the signi\ufb01cance level of 5%.\n\n6.3 Qualitative results for testing real networks\n\nWe use the proposed asymptotic tests to analyse two real datasets. These experiments demonstrate\nthat the proposed tests are applicable beyond the setting of IER graphs. In the \ufb01rst setup, we\nconsider moderate sized graphs (n = 178) constructed by thresholding autocorrelation matrices of\nEEG recordings (Andrzejak et al., 2001, Dua and Taniskidou, 2017). The network construction is\ndescribed in Appendix C.2. 
Each group of networks corresponds to either epileptic seizure activity\nor four other resting states. In Tables 1\u20134 in Appendix C, we report the test powers and p-values\nfor Asymp-Normal and Asymp-TW. We find that, except for one pair of resting states, networks from\ndifferent groups can be distinguished by both tests. Further observations and discussions are also\nprovided in the appendix.\nWe also study networks corresponding to peering information of autonomous systems, that is,\ngraphs defined on the routers comprising the Internet with the edges representing who-talks-to-whom\n(Leskovec et al., 2005, Leskovec and Krevl, 2014). The information for n = 11806 systems\nwas collected once a week for nine consecutive weeks, and two networks are available for each date\nbased on two sets of information (m = 2). We run the Asymp-Normal test for every pair of dates and\nreport the p-values in Table 5 (Appendix C.3). It is interesting to observe that as the interval between\ntwo dates increases, the p-values decrease at an exponential rate, that is, the networks differ drastically\naccording to our tests. We also conduct semi-synthetic experiments by randomly perturbing the\nnetworks, and study the performance of Asymp-Normal and Asymp-TW as the perturbations increase\n(see Figures 6\u20137). Since the networks are large and sparse, we perform the community detection step\nof Asymp-TW using BigClam (Yang and Leskovec, 2013) instead of spectral clustering. We infer that\nthe limitation of Asymp-TW in the sparse regime (observed in Figure 5) could possibly be caused by the poor\nperformance of standard spectral clustering in the sparse regime.\n\n7 Concluding remarks\n\nIn this work, we consider the two-sample testing problem for undirected unweighted graphs defined\non a common vertex set. This problem finds application in various domains, and is often challenging\ndue to the unavailability of a large number of samples (small m). 
We study the practicality of existing\ntheoretical tests, and propose two new tests based on asymptotics for large graphs (Theorems 1 and 3).\nWe perform a numerical comparison of various tests, and also provide their Matlab implementations.\nIn the m > 1 case, we find that Boot-Spectral is effective for m \u2265 6, but Asymp-Normal is\nrecommended for smaller m since it is more reliable and requires less computation. For m = 1, we\nrecommend Asymp-TW due to its robustness to the rank parameter and its computational advantage. For\nlarge sparse graphs, Asymp-TW should be used with a robust community detection step (BigClam).\nOne can certainly extend some of these tests to more general frameworks of graph testing. For\ninstance, directed graphs can be tackled by modifying T_fro such that the summation is over all i, j\nand Theorem 1 would hold even in this case. For weighted graphs, Theorem 3 can be used if one\nmodifies C in (9) by normalising with the variance of (A_G)_ij \u2212 (A_H)_ij. Subsequently, these variances can\nbe approximated again through block modelling. For m > 1, we believe that unequal population\nsizes can be handled by rescaling the matrices appropriately, but we have not verified this.\n\nAcknowledgements\n\nThis work is supported by the German Research Foundation (Research Unit 1735) and the Institutional\nStrategy of the University of T\u00fcbingen (DFG, ZUK 63).\n\nReferences\n\nT. W. Anderson. An introduction to multivariate statistical analysis. John Wiley and Sons, 1984.\n\nR. G. Andrzejak, K. Lehnertz, C. Rieke, F. Mormann, P. David, and C. E. Elger. Indications of\nnonlinear deterministic and finite dimensional structures in time series of brain electrical activity:\nDependence on recording region and brain state. Physical Review E, 64:061907, 2001.\n\nE. Arias-Castro and N. Verzelen. 
Community detection in dense random networks. Annals of\nStatistics, 42(3):940\u2013969, 2014.\n\nD. S. Bassett, E. Bullmore, B. A. Verchinski, V. S. Mattay, D. R. Weinberger, and A. Meyer-Lindenberg.\nHierarchical organization of human cortical networks in health and schizophrenia.\nThe Journal of Neuroscience, 28(37):9239\u20139248, 2008.\n\nA. C. Berry. The accuracy of the Gaussian approximation to the sum of independent variates.\nTransactions of the American Mathematical Society, 49(1):122\u2013136, 1941.\n\nP. J. Bickel and P. Sarkar. Hypothesis testing for automated community detection in networks. Journal\nof the Royal Statistical Society Series B: Statistical Methodology, 78(1):253\u2013273, 2016.\n\nB. Bollob\u00e1s, S. Janson, and O. Riordan. The phase transition in inhomogeneous random graphs.\nRandom Structures and Algorithms, 31(1):3\u2013122, 2007.\n\nK. M. Borgwardt, C. S. Ong, S. Sch\u00f6nauer, S. V. Vishwanathan, A. J. Smola, and H. P. Kriegel.\nProtein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47\u201356, 2005.\n\nR. Clarke, H. W. Ressom, A. Wang, J. Xuan, M. C. Liu, E. A. Gehan, and Y. Wang. The properties of\nhigh-dimensional data spaces: Implications for exploring gene and protein expression data. Nature\nReviews Cancer, 8:37\u201349, 2008.\n\nD. Dua and K. Taniskidou. UCI machine learning repository. http://archive.ics.uci.edu/ml,\n2017.\n\nL. Erd\u0151s, H.-T. Yau, and J. Yin. Rigidity of eigenvalues of generalized Wigner matrices. Advances in\nMathematics, 229(3):1435\u20131515, 2012.\n\nD. Ghoshdastidar, M. Gutzeit, A. Carpentier, and U. von Luxburg. Two-sample hypothesis testing for\ninhomogeneous random graphs. arXiv preprint (arXiv:1707.00833), 2017a.\n\nD. Ghoshdastidar, M. Gutzeit, A. Carpentier, and U. von Luxburg. Two-sample tests for large random\ngraphs using network statistics. In Conference on Learning Theory (COLT), 2017b.\n\nC. E. Ginestet, A. P. Fournel, and A. Simmons. 
Statistical network analysis for functional MRI:\nSummary networks and group comparisons. Frontiers in Computational Neuroscience, 8(51):\n10.3389/fncom.2014.00051, 2014.\n\nC. E. Ginestet, J. Li, P. Balachandran, S. Rosenberg, and E. D. Kolaczyk. Hypothesis testing for\nnetwork data in functional neuroimaging. The Annals of Applied Statistics, 11(2):725\u2013750, 2017.\n\nO. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and\napproximation. Journal of the ACM, 45(4):653\u2013750, 1998.\n\nA. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel two-sample test.\nJournal of Machine Learning Research, 13:723\u2013733, 2012.\n\nD. R. Hyduke, N. E. Lewis, and B. Palsson. Analysis of omics data with genome-scale models of\nmetabolism. Molecular BioSystems, 9(2):167\u2013174, 2013.\n\nR. Kondor and H. Pan. The multiscale Laplacian graph kernel. In Advances in Neural Information\nProcessing Systems (NIPS), 2016.\n\nB. A. Landman, A. J. Huang, A. Gifford, D. S. Vikram, I. A. Lim, J. A. Farrell, J. A. Bogovic, J. Hua,\nM. Chen, S. Jarso, S. A. Smith, S. Joel, S. Mori, J. J. Pekar, P. B. Barker, J. L. Prince, and P. C. van\nZijl. Multi-parametric neuroimaging reproducibility: A 3-T resource study. Neuroimage, 54(4):\n2854\u20132866, 2011.\n\nJ. O. Lee and J. Yin. A necessary and sufficient condition for edge universality of Wigner matrices.\nDuke Mathematical Journal, 163(1):117\u2013173, 2014.\n\nJ. Lei. A goodness-of-fit test for stochastic block models. The Annals of Statistics, 44(1):401\u2013424,\n2016.\n\nJ. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection.\nhttp://snap.stanford.edu/data, 2014.\n\nJ. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: Densification laws, shrinking diameters\nand possible explanations. In ACM SIGKDD International Conference on Knowledge Discovery\nand Data Mining, 2005.\n\nL. Lov\u00e1sz. 
Large networks and graph limits. American Mathematical Society, 2012.\n\nS. S. Mukherjee, P. Sarkar, and L. Lin. On clustering network-valued data. In Advances in Neural\nInformation Processing Systems (NIPS), 2017.\n\nA. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in\nNeural Information Processing Systems (NIPS), 2002.\n\nN. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman\ngraph kernels. Journal of Machine Learning Research, 12:2539\u20132561, 2011.\n\nM. Tang, A. Athreya, D. L. Sussman, V. Lyzinski, and C. E. Priebe. A semiparametric two-sample\nhypothesis testing problem for random graphs. Journal of Computational and Graphical Statistics,\n26(2):344\u2013354, 2016.\n\nM. Tang, A. Athreya, D. L. Sussman, V. Lyzinski, and C. E. Priebe. A nonparametric two-sample\nhypothesis testing problem for random graphs. Bernoulli, 23:1599\u20131630, 2017.\n\nC. A. Tracy and H. Widom. On orthogonal and symplectic matrix ensembles. Communications in\nMathematical Physics, 177:727\u2013754, 1996.\n\nJ. Yang and J. Leskovec. Overlapping community detection at scale: A nonnegative matrix factorization\napproach. In Proceedings of the sixth ACM international conference on Web search and data\nmining (WSDM), pages 587\u2013596, 2013.\n\nB. Zhang, H. Li, R. B. Riggins, M. Zhan, J. Xuan, Z. Zhang, E. P. Hoffman, R. Clarke, and Y. Wang.\nDifferential dependency network analysis to identify condition-specific topological changes in\nbiological networks. Bioinformatics, 25(4):526\u2013532, 2009.", "award": [], "sourceid": 1568, "authors": [{"given_name": "Debarghya", "family_name": "Ghoshdastidar", "institution": "University of T\u00fcbingen"}, {"given_name": "Ulrike", "family_name": "von Luxburg", "institution": "University of T\u00fcbingen"}]}