{"title": "Efficient Online Inference for Bayesian Nonparametric Relational Models", "book": "Advances in Neural Information Processing Systems", "page_first": 962, "page_last": 970, "abstract": "Stochastic block models characterize observed network relationships via latent community memberships. In large social networks, we expect entities to participate in multiple communities, and the number of communities to grow with the network size. We introduce a new model for these phenomena, the hierarchical Dirichlet process relational model, which allows nodes to have mixed membership in an unbounded set of communities. To allow scalable learning, we derive an online stochastic variational inference algorithm. Focusing on assortative models of undirected networks, we also propose an efficient structured mean field variational bound, and online methods for automatically pruning unused communities. Compared to state-of-the-art online learning methods for parametric relational models, we show significantly improved perplexity and link prediction accuracy for sparse networks with tens of thousands of nodes. We also showcase an analysis of LittleSis, a large network of who-knows-who at the heights of business and government.", "full_text": "Ef\ufb01cient Online Inference for Bayesian\n\nNonparametric Relational Models\n\nDae Il Kim1, Prem Gopalan2, David M. Blei2, and Erik B. Sudderth1\n\n1Department of Computer Science, Brown University, {daeil,sudderth}@cs.brown.edu\n\n2Department of Computer Science, Princeton University, {pgopalan,blei}@cs.princeton.edu\n\nAbstract\n\nStochastic block models characterize observed network relationships via latent\ncommunity memberships. In large social networks, we expect entities to partic-\nipate in multiple communities, and the number of communities to grow with the\nnetwork size. 
We introduce a new model for these phenomena, the hierarchical\nDirichlet process relational model, which allows nodes to have mixed membership\nin an unbounded set of communities. To allow scalable learning, we derive an on-\nline stochastic variational inference algorithm. Focusing on assortative models of\nundirected networks, we also propose an ef\ufb01cient structured mean \ufb01eld variational\nbound, and online methods for automatically pruning unused communities. Com-\npared to state-of-the-art online learning methods for parametric relational models,\nwe show signi\ufb01cantly improved perplexity and link prediction accuracy for sparse\nnetworks with tens of thousands of nodes. We also showcase an analysis of Little-\nSis, a large network of who-knows-who at the heights of business and government.\n\n1\n\nIntroduction\n\nA wide range of statistical models have been proposed for the discovery of hidden communities\nwithin observed networks. The simplest stochastic block models [20] create communities by clus-\ntering nodes, aiming to identify demographic similarities in social networks, or proteins with related\nfunctional interactions. The mixed-membership stochastic blockmodel (MMSB) [1] allows nodes\nto be members of multiple communities; this generalization substantially improves predictive accu-\nracy in real-world networks. These models are practically limited by the need to externally specify\nthe number of latent communities. 
We propose a novel hierarchical Dirichlet process relational\n(HDPR) model, which allows mixed membership in an unbounded collection of latent communities.\nBy adapting the HDP [18], we allow data-driven inference of the number of communities underlying\na given network, and growth in the community structure as additional nodes are observed.\nThe in\ufb01nite relational model (IRM) [10] previously adapted the Dirichlet process to de\ufb01ne a non-\nparametric relational model, but restrictively associates each node with only one community. The\nmore \ufb02exible nonparametric latent feature model (NLFM) [14] uses an Indian buffet process\n(IBP) [7] to associate nodes with a subset of latent communities. The in\ufb01nite multiple member-\nship relational model (IMRM) [15] also uses an IBP to allow multiple memberships, but uses a\nnon-conjugate observation model to allow more scalable inference for sparse networks. The non-\nparametric metadata dependent relational (NMDR) model [11] employs a logistic stick-breaking\nprior on the node-speci\ufb01c community frequencies, and thereby models relationships between com-\nmunities and metadata. All of these previous nonparametric relational models employed MCMC\nlearning algorithms. In contrast, the conditionally conjugate structure of our HDPR model allows us\nto easily develop a stochastic variational inference algorithm [17, 2, 9]. Its online structure, which\nincrementally updates global community parameters based on random subsets of the full graph, is\nhighly scalable; our experiments consider social networks with tens of thousands of nodes.\n\n1\n\n\fWhile the HDPR is more broadly applicable, our focus in this paper is on assortative models for\nundirected networks, which assume that the probability of linking distinct communities is small.\nThis modeling choice is appropriate for the clustered relationships found in friendship and collab-\noration networks. 
Our work builds on stochastic variational inference methods developed for the\nassortative MMSB (aMMSB) [6], but makes three key technical innovations. First, adapting work\non HDP topic models [19], we develop a nested family of variational bounds which assign posi-\ntive probability to dynamically varying subsets of the unbounded collection of global communities.\nSecond, we use these nested bounds to dynamically prune unused communities, improving compu-\ntational speed, predictive accuracy, and model interpretability. Finally, we derive a structured mean\n\ufb01eld variational bound which models dependence among the pair of community assignments associ-\nated with each edge. Crucially, this avoids the expensive and inaccurate local optimizations required\nby naive mean \ufb01eld approximations [1, 6], while maintaining computation and storage requirements\nthat scale linearly (rather than quadratically) with the number of hypothesized communities.\nIn this paper, we use our assortative HDPR (aHDPR) model to recover latent communities in so-\ncial networks previously examined with the aMMSB [6], and demonstrate substantially improved\nperplexity scores and link prediction accuracy. We also use our learned community structure to\nvisualize business and governmental relationships extracted from the LittleSis database [13].\n\n2 Assortative Hierarchical Dirichlet Process Relational Models\n\nWe introduce the assortative HDP relational (aHDPR) model, a nonparametric generalization of the\naMMSB for discovering shared memberships in an unbounded collection of latent communities.\nWe focus on undirected binary graphs with N nodes and E = N (N \u2212 1)/2 possible edges, and let\nyij = yji = 1 if there is an edge between nodes i and j. For some experiments, we assume the yij\nvariables are only partially observed to compare the predictive performance of different models.\nAs summarized in the graphical models of Fig. 
1, we begin by defining a global Dirichlet process to capture the parameters associated with each community. Letting βk denote the expected frequency of community k, and γ > 0 the concentration, we define a stick-breaking representation of β:

βk = vk ∏_{ℓ=1}^{k−1} (1 − vℓ),  vk ∼ Beta(1, γ),  k = 1, 2, . . .  (1)

Adapting a two-layer hierarchical DP [18], the mixed community memberships for each node i are then drawn from a DP with base measure β, πi ∼ DP(αβ). Here, E[πi | α, β] = β, and small precisions α encourage nodes to place most of their mass on a sparse subset of communities.
To generate a possible edge yij between nodes i and j, we first sample a pair of indicator variables from their corresponding community membership distributions, sij ∼ Cat(πi), rij ∼ Cat(πj). We then determine edge presence as follows:

p(yij = 1 | sij = rij = k) = wk,  p(yij = 1 | sij ≠ rij) = ε.  (2)

For our assortative aHDPR model, each community has its own self-connection probability wk ∼ Beta(τa, τb). To capture the sparsity of real networks, we fix a very small probability of between-community connection, ε = 10^−30. Our HDPR model could easily be generalized to more flexible likelihoods in which each pair of communities k, ℓ has its own interaction probability [1], but motivated by work on the aMMSB [6], we do not pursue this generalization here.

3 Scalable Variational Inference

Previous applications of the MMSB associate a pair of community assignments, sij and rij, with each potential edge yij. In assortative models these variables are strongly dependent, since present edges only have non-negligible probability for consistent community assignments.
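As a concrete illustration, the generative process of Eqs. (1)–(2) can be forward-simulated under a finite truncation of the unbounded community set. This is a minimal sketch, not the paper's implementation: the function name, truncation level K, and hyperparameter values are our own choices.

```python
import numpy as np

def sample_ahdpr_network(N=30, K=20, gamma=1.0, alpha=5.0,
                         tau_a=5.0, tau_b=1.0, eps=1e-30, seed=0):
    """Forward-sample a network from a K-truncated aHDPR generative model."""
    rng = np.random.default_rng(seed)

    # Eq. (1): stick-breaking weights beta_k = v_k * prod_{l<k} (1 - v_l).
    v = rng.beta(1.0, gamma, size=K)
    beta = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    beta /= beta.sum()  # renormalize the truncated sticks

    # Node memberships pi_i ~ DP(alpha * beta), finite Dirichlet analogue.
    pi = rng.dirichlet(alpha * beta, size=N)

    # Community self-connection probabilities w_k ~ Beta(tau_a, tau_b).
    w = rng.beta(tau_a, tau_b, size=K)

    # Eq. (2): sample (s_ij, r_ij) and the edge y_ij for each node pair.
    y = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(i + 1, N):
            s = rng.choice(K, p=pi[i])
            r = rng.choice(K, p=pi[j])
            y[i, j] = y[j, i] = rng.binomial(1, w[s] if s == r else eps)
    return beta, pi, w, y
```

Because between-community probability ε is tiny, sampled graphs are sparse and assortative, mirroring the networks this model targets.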
To improve accuracy and reduce local optima, we thus develop a structured variational method based on joint configurations of these assignment pairs, which we denote by eij = (sij, rij). See Figure 1.

Figure 1: Alternative graphical representations of the aHDPR model, in which each of N nodes has mixed membership πi in an unbounded set of latent communities, wk are the community self-connection probabilities, and yij indicates whether an edge is observed between nodes i and j. Left: Conventional representation, in which source sij and receiver rij community assignments are independently sampled. Right: Blocked representation in which eij = (sij, rij) denotes the pair of community assignments underlying yij.

Given this alternative representation, we aim to approximate the joint distribution of the observed edges y, local community assignments e, and global community parameters π, w, β given fixed hyperparameters τ, α, γ. Mean field variational methods minimize the KL divergence between a family of approximating distributions q(e, π, w, β) and the true posterior, or equivalently maximize the following evidence lower bound (ELBO) on the marginal likelihood of the observed edges y:

L(q) ≜ Eq[log p(y, e, π, w, β | τ, α, γ)] − Eq[log q(e, π, w, β)].  (3)

For the nonparametric aHDPR model, the number of latent community parameters wk, βk, and the dimensions of the community membership distributions πi, are both infinite. Care must thus be taken to define a tractably factorized, and finitely parameterized, variational bound.

3.1 Variational Bounds via Nested Truncations

We begin by defining categorical edge assignment distributions q(eij | φij) = Cat(eij | φij), where φijkℓ = q(eij = (k, ℓ)) = q(sij = k, rij = ℓ).
For some truncation level K, which will be dynamically varied by our inference algorithms, we constrain φijkℓ = 0 if k > K or ℓ > K. Given this restriction, all observed interactions are explained by one of the first (and under the stick-breaking prior, most probable) K communities. The resulting variational distribution has K² parameters. This truncation approach extends prior work for HDP topic models [19, 5].
For the global community parameters, we define an untruncated factorized variational distribution:

q(β, w | v*, λ) = ∏_{k=1}^{∞} δ_{v*k}(vk) Beta(wk | λka, λkb),  βk(v*) = v*k ∏_{ℓ=1}^{k−1} (1 − v*ℓ).  (4)

Our later derivations show that for communities k > K above the truncation level, the optimal variational parameters equal the prior: λka = τa, λkb = τb. These distributions thus need not be explicitly represented. Similarly, the objective only depends on v*k for k ≤ K, defining K + 1 probabilities: the frequencies of the first K communities, and the aggregate frequency of all others. Matched to this, we associate a (K + 1)-dimensional community membership distribution πi with each node, where the final component contains the sum of all mass not assigned to the first K. Exploiting the fact that the Dirichlet process induces a Dirichlet distribution on any finite partition, we let q(πi | θi) = Dir(πi | θi), θi ∈ R^{K+1}.
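A small helper makes this (K + 1)-dimensional representation concrete: the first K entries are βk(v*) from Eq. (4), and the final entry aggregates all mass above the truncation. The function name is ours, not from the paper.

```python
import numpy as np

def truncated_beta(v_star):
    """Map K stick variables v*_k to K+1 probabilities: beta_1..beta_K
    from Eq. (4), plus the aggregate mass of all truncated communities."""
    remaining = np.cumprod(1.0 - v_star)            # prod_{l<=k} (1 - v*_l)
    beta = v_star * np.concatenate(([1.0], remaining[:-1]))
    return np.concatenate((beta, [remaining[-1]]))  # sums to 1 by construction
```

Because the leftover stick mass is kept as an explicit final component, the output is a proper distribution and can parameterize the Dirichlet over πi directly.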
The overall variational objective is then

L(q) = Σ_k ( Eq[log p(wk | τa, τb)] − Eq[log q(wk | λka, λkb)] + Eq[log p(v*k | γ)] )
  + Σ_i ( Eq[log p(πi | α, β(v*))] − Eq[log q(πi | θi)] )
  + Σ_{ij} ( Eq[log p(yij | w, eij)] + Eq[log p(eij | πi, πj)] − Eq[log q(eij | φij)] ).  (5)

Unlike truncations of the global stick-breaking process [4], our variational bounds are nested, so that lower-order approximations are special cases of higher-order ones with some zeroed parameters.

3.2 Structured Variational Inference with Linear Time and Storage Complexity

Conventional coordinate ascent variational inference algorithms iteratively optimize each parameter given fixed values for all others. Community membership and interaction parameters are updated as

λka = τa + Σ_{(i,j)∈E} φijkk yij,  λkb = τb + Σ_{(i,j)∈E} φijkk (1 − yij),  (6)

θik = αβk + Σ_{(i,j)∈E} Σ_{ℓ=1}^{K} φijkℓ.  (7)

Here, the final summation is over all potential edges (i, j) linked to node i.
Updates for assignment distributions depend on expectations of log community assignment probabilities:

Eq[log(wk)] = ψ(λka) − ψ(λka + λkb),  Eq[log(1 − wk)] = ψ(λkb) − ψ(λka + λkb),  (8)

π̃ik ≜ exp{Eq[log(πik)]} = exp{ψ(θik) − ψ(Σ_{ℓ=1}^{K+1} θiℓ)},  π̃i ≜ Σ_{k=1}^{K} π̃ik.  (9)

Given these sufficient statistics, the assignment distributions can be updated as follows:

φijkk ∝ π̃ik π̃jk f(wk, yij),  (10)
φijkℓ ∝ π̃ik π̃jℓ f(ε, yij),  ℓ ≠ k.  (11)

Here, f(wk, yij) = exp{yij Eq[log(wk)] + (1 − yij) Eq[log(1 − wk)]}. More detailed derivations of related updates have been developed for the MMSB [1].
A naive implementation of these updates would require O(K²) computation and storage for each assignment distribution q(eij | φij). Note, however, that the updates for q(wk | λk) in Eq. (6) depend only on the K probabilities φijkk that nodes select the same community. Using the updates for φijkℓ from Eq. (11), the update of q(πi | θi) in Eq. (7) can be expanded as follows:

θik = αβk + Σ_{(i,j)∈E} ( φijkk + (1/Zij) Σ_{ℓ≠k} π̃ik π̃jℓ f(ε, yij) )
  = αβk + Σ_{(i,j)∈E} ( φijkk + (1/Zij) π̃ik f(ε, yij)(π̃j − π̃jk) ).  (12)

Note that π̃j need only be computed once, in O(K) operations.
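To make the O(K) bookkeeping explicit, the sketch below computes the diagonal responsibilities φijkk, the normalizer Zij, and the row sums needed by Eq. (12), without ever materializing the K × K table. The function name and vectorized form are our own; inputs are the sufficient statistics of Eqs. (8)–(9).

```python
import numpy as np

def edge_update(tpi_i, tpi_j, f_w, f_eps):
    """O(K) structured update for one potential edge.

    tpi_i, tpi_j : length-K vectors of pi-tilde values (Eq. 9);
    f_w          : length-K vector of f(w_k, y_ij);
    f_eps        : scalar f(eps, y_ij).
    Returns phi_ijkk, Z_ij, and sum_l phi_ijkl for node i.
    """
    # Normalizer Z_ij in O(K): off-diagonal mass plus diagonal correction.
    Z = tpi_i.sum() * tpi_j.sum() * f_eps + np.dot(tpi_i * tpi_j, f_w - f_eps)
    # Eq. (10): probability both endpoints choose community k.
    phi_diag = tpi_i * tpi_j * f_w / Z
    # Eq. (12): node i's sufficient statistic sum_l phi_ijkl.
    row_sum_i = phi_diag + tpi_i * f_eps * (tpi_j.sum() - tpi_j) / Z
    return phi_diag, Z, row_sum_i
```

The row sums form a proper distribution over k (they total one), which is exactly the quantity accumulated in the θ update.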
The normalization constant Zij, which is defined so that φij is a valid categorical distribution, can also be computed in linear time:

Zij = π̃i π̃j f(ε, yij) + Σ_{k=1}^{K} π̃ik π̃jk (f(wk, yij) − f(ε, yij)).  (13)

Finally, to evaluate our variational bound and assess algorithm convergence, we still need to calculate the likelihood and entropy terms dependent on φijkℓ. However, we can compute part of our bound by caching our partition function Zij in linear time. See ‡A.2 for details regarding the full derivation of this ELBO and its extensions.

3.3 Stochastic Variational Inference

Standard variational batch updates become computationally intractable when N becomes very large. Recent advancements in applying stochastic optimization techniques within variational inference [8] showed that if our variational mean-field family of distributions are members of the exponential family, we can derive a simple stochastic natural gradient update for our global parameters λ, θ, v. These gradients can be calculated from only a subset of the data; they are noisy approximations of the true natural gradient of the variational objective, but represent unbiased estimates of that gradient. To accomplish this, we define a new variational objective with respect to our current set of observations. This function, in expectation, is equivalent to our true ELBO. By taking natural gradients of this new variational objective with respect to our global variables λ, θ, we have

∇λ*ka = (1/g(i,j)) φijkk yij + τa − λka,  (14)

∇θ*ik = (1/g(i,j)) Σ_{(i,j)∈E} Σ_{ℓ=1}^{K} φijkℓ + αβk − θik,  (15)

where the natural gradient ∇λ*kb is symmetric to ∇λ*ka, with yij in Eq. (14) replaced by (1 − yij).
Note that Σ_{(i,j)∈E} Σ_{ℓ=1}^{K} φijkℓ was shown in the previous section to be computable in O(K). The scaling term g(i, j) is needed for an unbiased update to our expectation. If g(i, j) = 2/N(N − 1), then this would represent a uniform distribution over possible edge selections in our undirected graphs. In general, g(i, j) can be an arbitrary distribution over possible edge selections, such as a distribution over sets of edges, as long as the expectation with respect to this distribution is equivalent to the original ELBO [6]. When referring to the scaling constant associated with sets, we use the notation h(T ) instead of g(i, j).
We optimize this ELBO with a Robbins-Monro algorithm which iteratively steps along the direction of this noisy gradient. We specify a learning rate ρt ≜ (μ0 + t)^{−κ} at time t, where κ ∈ (.5, 1] and μ0 ≥ 0 downweights the influence of earlier updates. With the requirements that Σ_t ρt² < ∞ and Σ_t ρt = ∞, we will provably converge to a local optimum. For our global variational parameters {λ, θ}, the updates at iteration t are now

λka^t = (1 − ρt) λka^{t−1} + ρt ( (1/g(i,j)) φijkk yij + τa ),  (16)

θik^t = (1 − ρt) θik^{t−1} + ρt ( (1/g(i,j)) Σ_{(i,j)∈E} Σ_{ℓ=1}^{K} φijkℓ + αβk ),  (17)

vk^t = (1 − ρt) vk^{t−1} + ρt v*k,  (18)

where v*k is obtained via a constrained optimization task using the gradients derived in ‡A.3. Defining an update on our global parameters given a single edge observation can result in very poor local optima. In practice, we specify a mini-batch T , a set of unique observations, to determine a noisy gradient that is more informative.
This results in a simple summation over the sufficient statistics associated with the set of observations, as well as a change to g(i, j) to reflect the necessary rescaling of our gradients when we can no longer assume our samples are chosen uniformly from the dataset.

3.4 Restricted Stratified Node Sampling

Stochastic variational inference provides the ability to choose a sampling scheme that better exploits the sparsity of real-world networks. Given the success of stratified node sampling [6], we use this technique in all our experiments. Briefly, stratified node sampling randomly selects a single node i and either chooses its associated links, or a set of edges from one of m equally sized non-link edge sets. For this mini-batch strategy, h(T ) = 1/N for link sets and h(T ) = 1/(N m) for a partitioned non-link set. In [6], all nodes in π were considered global parameters and updated after each mini-batch. For our model, we also treat π similarly, but maintain a separate learning rate ρi for each node. This allows us to restrict updates to the nodes relevant to our mini-batch, and limits the computational cost of this global update. To ensure that our Robbins-Monro conditions are still satisfied, we set the learning rate of nodes that are not part of the mini-batch to 0. When a new mini-batch contains such a node, we resume from its most recent nonzero learning rate. This modified subsequence of learning rates maintains our convergence criteria, Σ_t ρit² < ∞ and Σ_t ρit = ∞.
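A minimal sketch of one stochastic update with restricted per-node learning rates follows. It assumes precomputed noisy targets (the parenthesized terms in Eqs. (16)–(17)); the function name, counter bookkeeping, and default hyperparameters are our own illustration, not the paper's code.

```python
import numpy as np

def svi_step(theta, lam_a, target_theta, target_lam_a, t_node, t_glob,
             batch_nodes, mu0=1.0, kappa=0.9):
    """One Robbins-Monro step in the style of Eqs. (16)-(17).

    target_* hold noisy estimates (1/g) * stats + prior. Nodes outside
    `batch_nodes` keep rho_i = 0 for this iteration, i.e. are untouched;
    their counters resume from the last mini-batch that contained them."""
    rho = (mu0 + t_glob) ** -kappa          # global rate for lambda
    lam_a = (1.0 - rho) * lam_a + rho * target_lam_a
    for i in batch_nodes:
        t_node[i] += 1                      # node i's private iteration count
        rho_i = (mu0 + t_node[i]) ** -kappa
        theta[i] = (1.0 - rho_i) * theta[i] + rho_i * target_theta[i]
    return theta, lam_a
```

Only rows of θ belonging to the mini-batch move, which is exactly what makes the restricted update cheap on sparse graphs.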
We show that this simple modification results in significant improvements in both perplexity and link prediction scores.

3.5 Pruning Moves

Our nested truncation requires setting an initial number of communities K. A large truncation lets the posterior find the best number of communities, but can be computationally costly. A truncation set too small may not be expressive enough to capture the best approximate posterior. To remedy this, we define a set of pruning moves aimed at improving inference by removing communities that have very small posterior mass. Pruning moves provide the model with a more parsimonious and interpretable latent structure, and may also significantly reduce the computational cost of subsequent iterations. Figure 2 provides an example illustrating how pruning occurs in our model.
To determine communities which are good candidates for pruning, for each community k we first compute Θk = (Σ_{i=1}^{N} θik) / (Σ_{i=1}^{N} Σ_{k=1}^{K} θik). Any community for which Θk < (log K)/N for t* = N/2 consecutive iterations is then evaluated in more depth. We scale t* with the number of nodes N in the graph to ensure that a broad set of observations is accounted for. To estimate an approximate but still informative ELBO for the pruned model, we must associate a set of relevant observations with each pruning candidate. In particular, we approximate the pruned ELBO L(qprune) by considering observations yij among pairs of nodes with significant mass in the pruned community. We also calculate L(qold) from these same observations, but with the old model parameters. We then compare these two values to accept or reject the pruning of the low-weight community.

4 Experiments

In this section we perform experiments that compare the performance of the aHDPR model to the aMMSB.
Figure 2: Pruning extraneous communities. Suppose that community k = 3 is considered for removal. We specify a new model by redistributing its mass Σ_{i=1}^{N} θi3 uniformly across the remaining communities θiℓ, ℓ ≠ 3. An analogous operation is used to generate {v*, β*, λ*a, λ*b, θ*}. To accurately estimate the true change in ELBO for this pruning, we select the n* = 10 nodes with greatest participation θi3 in community 3. Let S denote the set of all pairs of these nodes, and yij∈S their observed relationships. From these observations we can estimate φ*ij∈S for a model in which community k = 3 is pruned, and a corresponding ELBO L(qprune). Using the data from the same sub-graph, but the old un-pruned model parameters, we estimate an alternative ELBO L(qold). We accept if L(qprune) > L(qold), and reject otherwise. Because our structured mean-field approach provides simple direct updates for φ*ij∈S, the calculation of L(qold) and L(qprune) is efficient.

We show significant gains in AUC and perplexity scores by using the restricted form of stratified node sampling, a quick K-means initialization1 for θ, and our efficient structured mean-field approach combined with pruning moves. We perform a detailed comparison on a synthetic toy dataset, as well as the real-world relativity collaboration network, using a variety of metrics to show the benefits of each contribution. We then show significant improvements over the baseline aMMSB model in both AUC and perplexity metrics on several real-world datasets previously analyzed by [6]. Finally, we perform a qualitative analysis on the LittleSis network and demonstrate the usefulness of using our learned latent community structure to create visualizations of large networks.
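The candidate test of Sec. 3.5 can be sketched as follows; the consecutive-iteration counter and all names are our own scaffolding around the paper's Θk < (log K)/N criterion.

```python
import numpy as np

def prune_candidates(theta, counts, t_star):
    """Flag communities with Theta_k < log(K)/N for t_star consecutive
    iterations (Sec. 3.5). `counts` tracks consecutive low-mass hits."""
    N, K = theta.shape
    mass = theta.sum(axis=0) / theta.sum()        # Theta_k
    low = mass < np.log(K) / N
    counts = np.where(low, counts + 1, 0)         # reset whenever k recovers
    return np.flatnonzero(counts >= t_star), counts
```

Returned candidates would then be evaluated with the restricted sub-graph ELBO comparison of Figure 2 before any community is actually removed.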
For additional details on the parameters used in these experiments, please see ‡A.1.

4.1 Synthetic and Collaboration Networks

The synthetic network we use for testing is generated from the standards and software outlined in [12] to produce realistic synthetic networks with overlapping communities and power-law degree distributions. For these purposes, we set the number of nodes N = 1000, with the minimum degree per node set to 10 and the maximum to 60. On this network the true number of latent communities was found to be K = 56. Our real-world networks include 5 undirected networks originally ranging from N = 5,242 to N = 27,770. These raw networks, however, contain several disconnected components. Both the aMMSB and aHDPR achieve highest posterior probability by assigning each connected component distinct, non-overlapping communities; effectively, they analyze each connected sub-graph independently. To focus on the more challenging problem of identifying overlapping community structure, we take the largest connected component of each graph for analysis.
Initialization and Node-Specific Learning Rates. The upper-left panels in Fig. 3 compare different aHDPR inference algorithms, and the perplexity scores achieved on various networks. Here we demonstrate the benefits of initializing θ via K-means, and of our restricted stratified node sampling procedure. For our random initializations, we initialized θ in the same fashion as the aMMSB. Using a combination of both modifications, we achieve the best perplexity scores on these datasets. The node-specific learning rates intuitively restrict updates for θ to batches containing relevant observations, while our K-means initialization quickly provides a reasonable single-membership partition as a starting point for inference.
Naive Mean-Field vs. Structured Mean-Field.
The naive mean-field approach is the aHDPR model where the community indicator assignments are split into sij and rij. This can result in severe local optima due to their coupling, as seen in some experiments in Fig. 4. The aMMSB in some instances performs better than the naive mean-field approach, but this can be due to differences in our initialization procedures.

1Our K-means initialization views the rows of the adjacency matrix as distinct data points and produces a single community assignment zi for each node. To initialize community membership distributions based on these assignments, we set θizi = N − 1 and θi\zi = α.

Figure 3: The upper left shows benefits of a restricted update and a K-means initialization for stratified node sampling on both synthetic and relativity networks. The upper right shows the sensitivity of the aMMSB as K varies versus the aHDPR. The lower left shows various perplexity scores for the synthetic and relativity networks, with the best performing model (aHDPR-Pruning) scoring an average AUC of 0.9675 ± .0017 on the synthetic network and 0.9466 ± .0062 on the relativity network. The lower right shows the pruning process for the toy data and the final K communities discovered on our real-world networks.
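Footnote 1's initialization is simple to state in code. This is a hypothetical helper (names and the two-node example are ours); the assignments z would come from any K-means run on the adjacency rows.

```python
import numpy as np

def init_theta_from_kmeans(z, K, alpha):
    """Footnote 1: set theta_{i, z_i} = N - 1 and alpha elsewhere.
    The final (K+1)-th column holds the aggregate mass above the truncation."""
    N = len(z)
    theta = np.full((N, K + 1), float(alpha))
    theta[np.arange(N), z] = N - 1.0
    return theta
```

The heavy weight N − 1 on each node's assigned community gives inference a confident single-membership starting point that later updates can soften.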
However, by changing our inference procedure to an ef\ufb01cient structured\nmean-\ufb01eld approach, we see signi\ufb01cant improvements across all datasets.\nBene\ufb01ts of Pruning Moves. Pruning moves were applied every N/2 iterations with a maximum\nof K/10 communities removed per move.\nIf the number of prune candidates was greater than\nK/10, then K/10 communities with the lowest mass were chosen. The lower right portion of Fig. 3\nshows that our pruning moves can learn close to the true underlying number of clusters (K=56) on a\nsynthetic network even when signi\ufb01cantly altering its initial K. Across several real world networks,\nthere was low variance between runs with respect to the \ufb01nal K communities discovered, suggesting\na degree of robustness. Furthermore, pruning moves improved perplexity and AUC scores across\nevery dataset as well as reducing computational costs during inference.\n\nFigure 4: Analysis of four real-world collaboration networks. The \ufb01gures above show that the aHDPR with\npruning moves has the best performance, in terms of both perplexity (top) and AUC (bottom) scores.\n\n4.2 The LittleSis Network\n\nThe LittleSis network was extracted from the website (http://littlesis.org), which is an organization\nthat acts as a watchdog network to connect the dots between the world\u2019s most powerful people\n\n7\n\n246810121416x 1063456789101112Mini\u2212Batch Strategies TOY, K=56 (aHDPR\u2212Fixed)Number of Observed EdgesPerplexity Random Init\u2212AllRandom Init\u2212RestrictedKmeans Init\u2212AllKmeans Init\u2212Restricted12345x 107510152025303540Mini\u2212Batch Strategies Relativity, K=250 (aHDPR\u2212Fixed)Number of Observed EdgesPerplexity Random Init\u2212AllRandom Init\u2212RestrictedKmeans Init\u2212AllKmeans Init\u2212Restricted20405660801002.533.544.555.566.57Average Perplexity vs. 
K (Toy)PerplexityNumber of Communities K (aMMSB) aMMSBaHDPR\u2212K100aHDPR\u2212Pruning\u2212K100aHDPR\u2212Pruning\u2212K2001502002503003504002468101214161820Average Perplexity vs. K (Relativity)PerplexityNumber of Communities K (aMMSB) aMMSBaHDPR\u2212K500aHDPR\u2212Pruning246810121416x 1062345678910Perplexity TOY N=1000Number of Observed EdgesPerplexity aMMSB\u2212K20aMMSB\u2212K56aHDPR\u2212Naive\u2212K56aHDPR\u2212Batch\u2212K56aHDPR\u2212K56aHDPR\u2212Pruning\u2212K200aHDPR\u2212Truth12345x 10751015202530Perplexity Relativity N=4158Number of Observed EdgesPerplexity aMMSB\u2212K150aMMSB\u2212K200aHDPR\u2212Naive\u2212K500aHDPR\u2212K500aHDPR\u2212Pruning105106107020406080100120140160180200Pruning process for Toy DataNumber of Observed Edges (Log Axis)Number of Communities Init K=100Init K=200True K=56relhep2hepastrocm250300350400Number of Communities UsedK after Pruning (aHDPR Initial K=500)123456789x 10710152025303540455055Perplexity Hep2 N=7464Number of Observed EdgesPerplexity aMMSB\u2212K150aMMSB\u2212K200aHDPR\u2212Naive\u2212K500aHDPR\u2212K500aHDPR\u2212Pruning2468101214x 10751015202530Perplexity Hep N=11204Number of Observed EdgesPerplexity aMMSB\u2212K250aMMSB\u2212K300aHDPR\u2212Naive\u2212K500aHDPR\u2212K500aHDPR\u2212Pruning0.511.52x 108101520253035Perplexity AstroPhysics N=17903Number of Observed EdgesPerplexity aMMSB\u2212K300aMMSB\u2212K350aHDPR\u2212Naive\u2212K500aHDPR\u2212K500aHDPR\u2212Pruning0.511.522.5x 10810203040506070Perplexity Condensed Matter N=21363Number of Observed EdgesPerplexity aMMSB\u2212K400aMMSB\u2212K450aHDPR\u2212Naive\u2212K500aHDPR\u2212K500aHDPR\u2212Pruning0.60.650.70.750.80.850.90.951AUC Hep2 N=7464AUC QuantilesaMMSBK200aMMSBK250aHDPRNaive\u2212K500aHDPRK500aHDPRPruning0.60.650.70.750.80.850.90.951AUC Hep N=11204AUC QuantilesaMMSBK250aMMSBK300aHDPRNaive\u2212K500aHDPRK500aHDPRPruning0.60.650.70.750.80.850.90.951AUC AstroPhysics N=17903AUC 
QuantilesaMMSBK300aMMSBK350aHDPRNaive\u2212K500aHDPRK500aHDPRPruning0.60.650.70.750.80.850.90.951AUC Condensed Matter N=21363AUC QuantilesaMMSBK400aMMSBK450aHDPRNaive\u2212K500aHDPRK500aHDPRPruning\fFigure 5: The LittleSis Network. Near the center in violet we have prominent government \ufb01gures such as\nLarry H. Summers (71st US Treasury Secretary) and Robert E. Rubin (70th US Treasury Secretary) with ties\nto several distinct communities, representative of their high posterior bridgness. Conversely, within the beige\ncolored community, individuals with small posterior bridgness such as Wendy Neu can re\ufb02ect a career that was\nhighly focused in one organization. A quick internet search shows that she is currently the CEO of Hugo Neu,\na green-technology \ufb01rm where she has worked for over 30 years. An analysis on this type of network might\nprovide insights into the structures of power that shape our world and the key individuals that de\ufb01ne them.\n\nand organizations. Our \ufb01nal graph contained 18,831 nodes and 626,881 edges, which represents\na relatively sparse graph with edge density of 0.35% (for details on how this dataset was pro-\ncessed see \u2021A.3). For this analysis, we ran the aHDPR with pruning on the entire dataset using\nthe same settings from our previous experiments. We then took the top 200 degree nodes and gen-\nerated weighted edges based off of a variational distance between their learned expected variational\nposteriors such that dij = 1 \u2212 |Eq [\u03c0i]\u2212Eq [\u03c0j ]|\n. This weighted edge was then included in our visual-\nization software [3] if dij > 0.5. Node sizes were determined by posterior bridgness [16] where\nk=1(Eq[\u03c0ik] \u2212 1\nK )2and measures the extent to which a node is involved\nwith multiple communities. Larger nodes have greater posterior bridgeness while node colors rep-\nresent its dominant community membership. 
Our learned latent communities can drive these types of visualizations, which otherwise might not have been possible given the raw subgraph (see ‡A.4).

5 Discussion

Our model represents the first Bayesian nonparametric relational model to use a stochastic variational approach for efficient inference. Our pruning moves save computation and improve inference in a principled manner, while our efficient structured mean-field inference procedure helps escape local optima. Future extensions of interest include advanced split-merge moves that can grow the number of communities, as well as extending these scalable inference algorithms to more sophisticated relational models.
References

[1] E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels. JMLR, 9, 2008.
[2] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[3] M. Bastian, S. Heymann, and M. Jacomy. Gephi: An open source software for exploring and manipulating networks, 2009.
[4] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In ICML, 2004.
[5] M. Bryant and E. B. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In NIPS, pages 2708–2716, 2012.
[6] P. Gopalan, D. M. Mimno, S. Gerrish, M. J. Freedman, and D. M. Blei. Scalable inference of overlapping communities. In NIPS, pages 2258–2266, 2012.
[7] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process.
Technical Report 2005-001, Gatsby Computational Neuroscience Unit, May 2005.
[8] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. arXiv preprint arXiv:1206.7051, 2012.
[9] M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation. In NIPS, pages 856–864, 2010.
[10] C. Kemp, J. Tenenbaum, T. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In AAAI, 2006.
[11] D. Kim, M. C. Hughes, and E. B. Sudderth. The nonparametric metadata dependent relational model. In ICML, 2012.
[12] A. Lancichinetti and S. Fortunato. Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys. Rev. E, 80(1):016118, July 2009.
[13] littlesis.org. LittleSis is a free database detailing the connections between powerful people and organizations, June 2009.
[14] K. Miller, T. Griffiths, and M. Jordan. Nonparametric latent feature models for link prediction. In NIPS, 2009.
[15] M. Morup, M. N. Schmidt, and L. K. Hansen. Infinite multiple membership relational modeling for complex networks. In Machine Learning for Signal Processing (MLSP), 2011 IEEE International Workshop on, pages 1–6. IEEE, 2011.
[16] T. Nepusz, A. Petróczi, L. Négyessy, and F. Bazsó. Fuzzy communities and the concept of bridgeness in complex networks. Phys. Rev. E, 77(1):016107, 2008.
[17] M. Sato. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, 2001.
[18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. JASA, 101(476):1566–1581, Dec. 2006.
[19] Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In NIPS, 2007.
[20] Y. Wang and G. Wong. Stochastic blockmodels for directed graphs. JASA, 82(397):8–19, 1987.