{"title": "Confidence Sets for Network Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 2097, "page_last": 2105, "abstract": "Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of \\emph{residual} network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximum-likelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure.", "full_text": "Con\ufb01dence Sets for Network Structure\n\nDavid S. Choi\n\nHarvard University\n\nCambridge, MA 02138\n\nPatrick Wolfe\n\nHarvard University\n\nCambridge, MA 02138\n\nSchool of Engineering and Applied Sciences\n\nSchool of Engineering and Applied Sciences\n\ndchoi@seas.harvard.edu\n\npatrick@seas.harvard.edu\n\nEdoardo M. 
Airoldi\nDepartment of Statistics\n\nHarvard University\n\nCambridge, MA 02138\n\nairoldi@fas.harvard.edu\n\nAbstract\n\nLatent variable models are frequently used to identify structure in dichotomous\nnetwork data, in part because they give rise to a Bernoulli product likelihood that\nis both well understood and consistent with the notion of exchangeable random\ngraphs. In this article we propose conservative confidence sets that hold with\nrespect to these underlying Bernoulli parameters as a function of any given partition\nof network nodes, enabling us to assess estimates of residual network structure,\nthat is, structure that cannot be explained by known covariates and thus cannot be\neasily verified by manual inspection. We demonstrate the proposed methodology\nby analyzing student friendship networks from the National Longitudinal Survey\nof Adolescent Health that include race, gender, and school year as covariates. We\nemploy a stochastic expectation-maximization algorithm to fit a logistic\nregression model that includes these explanatory variables as well as a latent stochastic\nblockmodel component and additional node-specific effects. Although\nmaximum-likelihood estimates do not appear consistent in this context, we are able to\nevaluate confidence sets as a function of different blockmodel partitions, which enables\nus to qualitatively assess the significance of estimated residual network structure\nrelative to a baseline, which models covariates but lacks block structure.\n\n1 Introduction\nNetwork datasets comprising edge measurements Aij \u2208 {0, 1} of a binary, symmetric, and\nanti-reflexive relation on a set of n nodes, 1 \u2264 i < j \u2264 n, are fast becoming of paramount interest in the\nstatistical analysis and data mining literatures [1].
A common aim of many models for such data is\nto test for and explain the presence of network structure, primary examples being communities and\nblocks of nodes that are equivalent in some formal sense. Algorithmic formulations of this problem\ntake varied forms and span many literatures, touching on subjects such as statistical physics [2, 3],\ntheoretical computer science [4], economics [5], and social network analysis [6].\nOne popular modeling assumption for network data is dyadic independence of the edge\nmeasurements when conditioned on a set of latent variables [7, 8, 9, 10]. The number of latent\nparameters in such models generally increases with the size of the graph, however, meaning that\ncomputationally intensive fitting algorithms may be required and standard consistency results may\nnot always hold. As a result, it can often be difficult to assess statistical significance or quantify\nthe uncertainty associated with parameter estimates. This issue is evident in literatures focused\non community detection, where common practice is to examine whether algorithmically identified\ncommunities agree with prior knowledge or intuition [11, 12]; this practice is less useful if additional\nconfirmatory information is unavailable, or if detailed uncertainty quantification is desired.\nConfidence sets are a standard statistical tool for uncertainty quantification, but they are not yet\nwell developed for network data. In this paper, we propose a family of confidence sets for\nnetwork structure that apply under the assumption of a Bernoulli product likelihood. The form of\nthese sets stems from a stochastic blockmodel formulation which reflects the notion of latent nodal\nclasses, and they provide a new tool for the analysis of estimated or algorithmically determined\nnetwork structure.
We demonstrate the use of these confidence sets by analyzing a sample of 26\nadolescent friendship networks from the National Longitudinal Survey of Adolescent Health (available\nat http://www.cpc.unc.edu/addhealth), using a baseline model that only includes explanatory\ncovariates and heterogeneity in the nodal degrees. We employ these confidence sets to validate departures\nfrom this baseline model taking the form of residual community structure. Though the confidence\nsets we employ are conservative, we show that they are effective in identifying putative residual\nstructure in these friendship network data.\n\n2 Model Specification and Inference\nWe represent network data via a sociomatrix A \u2208 {0, 1}^{N\u00d7N} that reflects the adjacency structure\nof a simple, undirected graph on N nodes. In keeping with the latent variable network analysis\nliterature, we assume entries {Aij} for i < j to be independent Bernoulli random variables with\nassociated success probabilities {Pij}i 1 using the modified\nversion of Algorithm 1 detailed above, is \u201cfar\u201d from its nominal value under the baseline model fitted\nwith K = 1, in the sense that the corresponding 95% Bonferroni-corrected confidence set bound is\nexceeded. We observe that in each partition, the number of apparently visible communities exceeds\nK, and that they consist of small numbers of students. This effect is due to the intersection of\ngrade and z-induced clustering.\nWe take as our definition of nominal value the quantity $\\bar{\\Phi}^{(z)}$ computed under the\nbaseline model, which we denote by $\\Phi^{(z)}$.\nTable 2 lists normalized divergence terms\n$\\binom{N}{2}^{-1} \\sum_{a \\le b} n_{ab} D(\\hat{\\Phi}^{(z)}_{ab} \\| \\Phi^{(z)}_{ab})$, Bonferroni-corrected 95% confidence bounds, and measures of\nalignment between the corresponding partitions z and the explanatory variables. The alignment\nwith the covariates is small, as measured by the Jaccard similarity coefficient and the ratio of\nwithin-class to total variance (see footnote 2), signifying the residual quality of the partitions, while the relatively large\ndivergence terms signify that the Bonferroni-corrected confidence set bounds for each school have\nbeen met or exceeded.\n\nFootnote 2: The alignment scores are defined as follows. The Jaccard similarity coefficient is defined as |A \u2229 B|/|A \u222a B|,\nwhere A, B \u2282 $\\binom{N}{2}$ are the student pairings sharing the same latent class or the same covariate value,\nrespectively. See [12] for further network-related discussion. Variance ratio denotes the within-class degree\nvariance divided by the total variance, averaged over all classes.\n\n(Gender, Race, Grade: Jaccard coefficient; Degree: variance ratio)\nSchool  Students  Edges  K  Div. (Bound)     Gender  Race   Grade  Degree\n10      678       2795   6  0.0064 (0.0062)  0.14    0.097  0.16   0.93\n18      284       1189   5  0.0150 (0.0150)  0.17    0.14   0.19   0.88\n21      377       1531   6  0.0140 (0.0120)  0.15    0.12   0.16   0.95\n22      614       2450   5  0.0064 (0.0061)  0.18    0.11   0.14   0.99\n26      551       2066   3  0.0049 (0.0045)  0.25    0.13   0.21   0.99\n29      569       2534   6  0.0091 (0.0075)  0.15    0.10   0.16   0.88\n38      521       1925   5  0.0073 (0.0073)  0.17    0.17   0.18   0.86\n55      336       803    4  0.0100 (0.0100)  0.20    0.21   0.18   0.97\n56      446       1394   6  0.0120 (0.0099)  0.15    0.15   0.14   0.98\n66      644       2865   6  0.0069 (0.0066)  0.15    0.099  0.16   0.91\n67      456       926    3  0.0055 (0.0055)  0.25    0.25   0.23   1.00\n72      352       1398   4  0.0099 (0.0095)  0.21    0.12   0.21   0.96\n78      432       1334   6  0.0100 (0.0100)  0.15    0.15   0.12   0.98\n80      594       1745   4  0.0054 (0.0053)  0.20    0.15   0.19   0.99\n\nTable 2: Block structure assessments corresponding to Fig. 3.
Small Jaccard coefficient values (for\ngender, race, and grade) and variance ratios approaching 1 for degree indicate a lack of alignment\nwith covariates and hence the identification of residual structure in the corresponding partition.\n\nWe note that the use of covariate information was necessary to detect small student groups;\nwithout the incorporation of grade effects, we would require a much larger value of K for Algorithm 1 to\ndetect the observed network structure (a concern noted by [23] in the absence of covariates), which\nin turn would inflate the confidence set, leaving us unable to distinguish the observed structure\nfrom that predicted by a baseline model.\n\n4 Concluding Remarks\n\nIn this article we have developed confidence sets for assessing inferred network structure, by\nleveraging our result derived in [14]. We explored the use of these confidence sets with an application to\nthe analysis of Adolescent Health survey data comprising friendship networks from 26 schools.\nOur methodology can be summarized as follows. In lieu of a parametric model, we assume dyadic\nindependence with Bernoulli parameters {Pij}. We introduced a baseline model (K = 1) that\nincorporates degree and covariate effects, without block structure. Algorithm 1 was then used to find\nhighly assortative partitions of students which are also far from partitions induced by the\nexplanatory covariates in the baseline model. Differences in assortativity were quantified by an empirical\ndivergence statistic, which was compared to an upper bound computed from Eq. (3) to check for\nsignificance and to generate confidence sets for {Pij}. While the upper bound in Eq.
(3) is known to be\nloose, simulation results in Figure 1 suggest that the slack is moderate, leading to useful confidence\nsets in practice.\nIn our procedure, we cannot quantify the uncertainty associated with the estimated baseline model,\nsince the parameter estimates lack consistency. As a result, we cannot conduct a formal hypothesis\ntest for \u0398 = 0. However, for a baseline model where the MLE is known to be consistent, we\nconjecture that such a hypothesis test should be possible by incorporating the confidence set associated\nwith the MLE.\nDespite concerns regarding estimator consistency in this and other latent variable models, we were\nable to show that the notion of confidence sets may instead be used to provide a (conservative)\nmeasure of residual block structure. We note that many open questions remain, and are hopeful\nthat this analysis may help to shed light on some important current issues facing practitioners and\ntheorists alike in statistical network analysis.\n\n[Figure 3 panels: (a) School 10, K = 6; (b) School 18, K = 5; (c) School 21, K = 6; (d) School 22, K = 5; (e) School 26, K = 3; (f) School 29, K = 6; (g) School 38, K = 5; (h) School 55, K = 4; (i) School 56, K = 6; (j) School 66, K = 6; (k) School 67, K = 3; (l) School 72, K = 4; (m) School 78, K = 6; (n) School 80, K = 4]\n\nFigure 3: Adjacency matrices for schools exhibiting residual block structure as described in\nSection 3.2, with nodes ordered by grade (solid lines) and corresponding latent classes (dotted lines).\n\nReferences\n\n[1] A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi, \u201cA survey of statistical\nnetwork models\u201d, Foundations and Trends in Machine Learning, vol. 2, pp. 1\u2013117, Feb. 2010.\n[2] R. Albert and A. L. Barabasi, \u201cStatistical mechanics of complex networks\u201d, Reviews of\nModern Physics, vol. 74, pp. 47\u201397, Jan. 2002.\n[3] M. E. J.
Newman, \u201cThe structure and function of complex networks\u201d, SIAM Review, vol. 45,\npp. 167\u2013256, June 2003.\n[4] C. Cooper and A. M. Frieze, \u201cA general model of web graphs\u201d, Random Structures and\nAlgorithms, vol. 22, no. 3, pp. 311\u2013335, Mar. 2003.\n[5] M. O. Jackson, Social and Economic Networks, Princeton University Press, 2008.\n[6] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications, Cambridge\nUniversity Press, Cambridge, U.K., 1994.\n[7] T. A. B. Snijders and K. Nowicki, \u201cEstimation and prediction for stochastic blockmodels for\ngraphs with latent block structure\u201d, J. Classif., vol. 14, pp. 75\u2013100, Jan. 1997.\n[8] M. S. Handcock, A. E. Raftery, and J. M. Tantrum, \u201cModel-based clustering for social\nnetworks\u201d, J. R. Stat. Soc. A, vol. 170, pp. 301\u2013354, Mar. 2007.\n[9] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, \u201cMixed membership stochastic\nblockmodels\u201d, J. Mach. Learn. Res., vol. 9, pp. 1981\u20132014, June 2008.\n[10] P. D. Hoff, \u201cMultiplicative latent factor models for description and prediction of social\nnetworks\u201d, Computational Math. Organization Theory, vol. 15, pp. 261\u2013272, Dec. 2009.\n[11] M. E. J. Newman, \u201cModularity and community structure in networks\u201d, Proc. Natl Acad. Sci.\nU.S.A., vol. 103, pp. 8577\u20138582, June 2006.\n[12] A. L. Traud, E. D. Kelsic, P. J. Mucha, and M. A. Porter, \u201cComparing community structure\nto characteristics in online collegiate social networks\u201d, SIAM Rev., 2011, to appear.\n[13] P. J. Bickel and A. Chen, \u201cA nonparametric view of network models and Newman-Girvan\nand other modularities\u201d, Proc. Natl Acad. Sci. U.S.A., vol. 106, pp. 21068\u201321073, Dec. 2009.\n[14] D. S. Choi, P. J. Wolfe, and E. M. Airoldi, \u201cStochastic blockmodels with growing numbers of\nclasses\u201d, Biometrika, 2011, to appear.\n[15] K. Rohe, S.
Chatterjee, and B. Yu, \u201cSpectral clustering and the high-dimensional stochastic\nblockmodel\u201d, Ann. Stat., 2011, to appear.\n[16] A. Celisse, J. J. Daudin, and L. Pierre, \u201cConsistency of maximum-likelihood and variational\nestimators in the stochastic block model\u201d, arXiv preprint 1105.3288, 2011.\n[17] B. Karrer, E. Levina, and M. E. J. Newman, \u201cRobustness of community structure in networks\u201d,\nPhys. Rev. E, vol. 77, pp. 46119\u201346128, Apr. 2008.\n[18] C. P. Massen and J. P. K. Doye, \u201cThermodynamics of community structure\u201d, arXiv preprint\ncond-mat/0610077, 2006.\n[19] J. Copic, M. O. Jackson, and A. Kirman, \u201cIdentifying community structures from network\ndata via maximum likelihood methods\u201d, B.E. J. Theoretical Economics, vol. 9, Sept. 2009.\n[20] P. W. Holland and S. Leinhardt, \u201cAn exponential family of probability distributions for\ndirected graphs\u201d, J. Am. Stat. Assoc., vol. 76, pp. 33\u201350, Mar. 1981.\n[21] S. J. Haberman, \u201cComment on Holland and Leinhardt\u201d, J. Am. Stat. Assoc., vol. 76, pp. 60\u201362,\nMar. 1981.\n[22] S. Wasserman and S. O. L. Weaver, \u201cStatistical analysis of binary relational data: parameter\nestimation\u201d, J. Math. Psychol., vol. 29, pp. 406\u2013427, Dec. 1985.\n[23] P. D. Hoff, \u201cModeling homophily and stochastic equivalence in symmetric relational data\u201d,\nin Adv. in Neural Information Processing Systems, pp. 657\u2013664. MIT Press, 2008.\n[24] S. M. Goodreau, J. A. Kitts, and M. Morris, \u201cBirds of a feather, or friend of a friend? Using\nexponential random graph models to investigate adolescent social networks\u201d, Demography,\nvol. 46, pp. 103\u2013125, Feb. 2009.\n[25] M. C. Gonz\u00e1lez, H. J. Herrmann, J. Kert\u00e9sz, and T. Vicsek, \u201cCommunity structure and ethnic\npreferences in school friendship networks\u201d, Physica A, vol. 379, no. 1, pp.
307\u2013316, 2007.\n", "award": [], "sourceid": 1176, "authors": [{"given_name": "David", "family_name": "Choi", "institution": null}, {"given_name": "Patrick", "family_name": "Wolfe", "institution": null}, {"given_name": "Edo", "family_name": "Airoldi", "institution": null}]}