{"title": "A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 422, "page_last": 430, "abstract": "We propose a scalable approach for making inference about latent spaces of large networks. With a succinct representation of networks as a bag of triangular motifs, a parsimonious statistical model, and an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices and hundreds of latent roles on a single machine in a matter of hours, a setting that is out of reach for many existing methods. When compared to the state-of-the-art probabilistic approaches, our method is several orders of magnitude faster, with competitive or improved accuracy for latent space recovery and link prediction.", "full_text": "A Scalable Approach to Probabilistic Latent Space\n\nInference of Large-Scale Networks\n\nJunming Yin\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\njunmingy@cs.cmu.edu\n\nQirong Ho\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nqho@cs.cmu.edu\n\nEric P. Xing\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nepxing@cs.cmu.edu\n\nAbstract\n\nWe propose a scalable approach for making inference about latent spaces of large\nnetworks. With a succinct representation of networks as a bag of triangular motifs,\na parsimonious statistical model, and an ef\ufb01cient stochastic variational inference\nalgorithm, we are able to analyze real networks with over a million vertices and\nhundreds of latent roles on a single machine in a matter of hours, a setting that is\nout of reach for many existing methods. 
When compared to the state-of-the-art\nprobabilistic approaches, our method is several orders of magnitude faster, with\ncompetitive or improved accuracy for latent space recovery and link prediction.\n\nIntroduction\n\n1\nIn the context of network analysis, a latent space refers to a space of unobserved latent represen-\ntations of individual entities (i.e., topics, roles, or simply embeddings, depending on how users\nwould interpret them) that govern the potential patterns of network relations. The problem of latent\nspace inference amounts to learning the bases of such a space and reducing the high-dimensional\nnetwork data to such a lower-dimensional space, in which each entity has a position vector. De-\npending on model semantics, the position vectors can be used for diverse tasks such as community\ndetection [1, 5], user personalization [4, 13], link prediction [14] and exploratory analysis [9, 19, 8].\nHowever, scalability is a key challenge for many existing probabilistic methods, as even recent state-\nof-the-art methods [5, 8] still require days to process modest networks of around 100, 000 nodes.\nTo perform latent space analysis on at least million-node (if not larger) real social networks with\nmany distinct latent roles [24], one must design inferential mechanisms that scale in both the number\nof vertices N and the number of latent roles K. In this paper, we argue that the following three\nprinciples are crucial for successful large-scale inference: (1) succinct but informative representation\nof networks; (2) parsimonious statistical modeling; (3) scalable and parallel inference algorithms.\nExisting approaches [1, 5, 7, 8, 14] are limited in that they consider only one or two of the above\nprinciples, and therefore can not simultaneously achieve scalability and suf\ufb01cient accuracy. For\nexample, the mixed-membership stochastic blockmodel (MMSB) [1] is a probabilistic latent space\nmodel for edge representation of networks. 
Its batch variational inference algorithm has O(N 2K 2)\ntime complexity and hence cannot be scaled to large networks. The a-MMSB [5] improves upon\nMMSB by applying principles (2) and (3): it reduces the dimension of the parameter space from\nO(K 2) to O(K), and applies a stochastic variational algorithm for fast inference. Fundamentally,\nhowever, the a-MMSB still depends on the O(N 2) adjacency matrix representation of networks,\njust like the MMSB. The a-MMSB inference algorithm mitigates this issue by downsampling zero\nelements in the matrix, but is still not fast enough to handle networks with N \u2265 100, 000.\nBut looking beyond the edge-based relations and features, other higher-order structural statistics\n(such as the counts of triangles and k-stars) are also widely used to represent the probability dis-\ntribution over the space of networks, and are viewed as crucial elements in building a good-\ufb01tting\nexponential random graph model (ERGM) [11]. These higher-order relations have motivated the\ndevelopment of the triangular representation of networks [8], in which each network is represented\nsuccinctly as a bag of triangular motifs with size typically much smaller than \u0398(N 2). This suc-\ncinct representation has proven effective in extracting informative mixed-membership roles from\n\n1\n\n\fnetworks with high \ufb01delity, thus achieving the \ufb01rst principle (1). However, the corresponding statis-\ntical model, called the mixed-membership triangular model (MMTM), only scales well against the\nsize of a network, but does not scale to large numbers of latent roles (i.e., dimension of the latent\nspace). To be precise, if there are K distinct latent roles, its tensor of triangle-generating parame-\nters is of size O(K 3), and its blocked Gibbs sampler requires O(K 3) time per iteration. 
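The parameter-count gap between the two models is easy to quantify. The following back-of-envelope sketch (our own illustrative code; it counts parameter blocks, each a small simplex, following the O(·) statements above and the PTM parameterization defined later in Section 3) makes the scaling concrete:

```python
def mmtm_params(K):
    """MMTM (see [8]): one triangle-generating parameter block B_xyz
    per latent-role triplet, i.e. O(K^3) blocks."""
    return K ** 3

def ptm_params(K):
    """PTM (Section 3): B0 plus one Bxx and one Bxxx block per role x,
    i.e. 2K + 1 = O(K) blocks."""
    return 2 * K + 1

# At K = 100 roles, the MMTM tensor already has roughly 5,000x more
# parameter blocks than the PTM.
ratio = mmtm_params(100) / ptm_params(100)
```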
Our own\nexperiments show that the MMTM Gibbs algorithm is unusable for K > 10.\nWe now present a scalable approach to both latent space modeling and inference algorithm design\nthat encompasses all three aforementioned principles for large networks. Speci\ufb01cally, we build our\napproach on the bag-of-triangles representation of networks [8] and apply principles (2) and (3),\nyielding a fast inference procedure that has time complexity O(N K). In Section 3, we propose the\nparsimonious triangular model (PTM), in which the dimension of the triangle-generating parameters\nonly grows linearly in K. The dramatic reduction is principally achieved by sharing parameters\namong certain groups of latent roles. Then, in Section 4, we develop a fast stochastic natural gradient\nascent algorithm for performing variational inference, where an unbiased estimate of the natural\ngradient is obtained by subsampling a \u201cmini-batch\u201d of triangular motifs. Instead of adopting a fully\nfactorized, naive mean-\ufb01eld approximation, which we \ufb01nd performs poorly in practice, we pursue\na structured mean-\ufb01eld approach that captures higher-order dependencies between latent variables.\nThese new developments all combine to yield an ef\ufb01cient inference algorithm that usually converges\nafter 2 passes on each triangular motif (or up to 4-5 passes at worst), and achieves competitive or\nimproved accuracy for latent space recovery and link prediction on synthetic and real networks.\nFinally, in Section 5, we demonstrate that our algorithm converges and infers a 100-role latent space\non a 1M-node Youtube social network in just 4 hours, using a single machine with 8 threads.\n2 Triangular Representation of Networks\nWe take a scalable approach to network modeling by representing each network succinctly as a\nbag of triangular motifs [8]. 
Each triangular motif is a connected subgraph over a vertex triple containing 2 or 3 edges (called open triangle and closed triangle respectively). Empty and single-edge triples are ignored. Although this triangular format does not preserve all network information found in an edge representation, these three-node connected subgraphs are able to capture a number of informative structural features in the network. For example, in social network theory, the notion of triadic closure [21, 6] is commonly measured by the relative number of closed triangles compared to the total number of connected triples, known as the global clustering coefficient or transitivity [17]. The same quantity is treated as a general network statistic in the exponential random graph model (ERGM) literature [16]. Furthermore, the most significant and recurrent structural patterns in many complex networks, so-called "network motifs", turn out to be connected three-node subgraphs [15]. Most importantly of all, triangular modeling requires much less computational cost compared to edge-based models, with little or no degradation of performance for latent space recovery [8]. In networks with N vertices and low maximum vertex degree D, the number of triangular motifs Θ(ND²) is normally much smaller than Θ(N²), allowing us to construct more efficient inference algorithms scalable to larger networks. For high-maximum-degree networks, the triangular motifs can be subsampled in a node-centric fashion as a local data reduction step. For each vertex i with degree higher than a user-chosen threshold δ, uniformly sample $\binom{\delta}{2}$ triangles from the set composed of (a) its adjacent closed triangles, and (b) its adjacent open triangles that are centered on i. Vertices with degree ≤ δ keep all triangles from their set. 
It has been shown that this \u03b4-subsampling pro-\ncedure can approximately preserve the distribution over open and closed triangles, and allows for\nmuch faster inference algorithms (linear growth in N) at a small cost in accuracy [8].\nIn what follows, we assume that a preprocessing step has been performed \u2014 namely, extracting and\n\u03b4-subsampling triangular motifs (which can be done in O(1) time per sample, and requires < 1% of\nthe actual inference time) \u2014 to yield a bag-of-triangles representation of the input network. For each\ntriplet of vertices i, j, k \u2208 {1, . . . , N} , i < j < k, let Eijk denote the observed type of triangular\nmotif formed among these three vertices: Eijk = 1, 2 and 3 represent an open triangle with i, j\nand k in the center respectively, and Eijk = 4 if a closed triangle is formed. Because empty and\nsingle-edge triples are discarded, the set of triples with triangular motifs formed, I = {(i, j, k) : i <\nj < k, Eijk = 1, 2, 3 or 4}, is of size O(N \u03b42) after \u03b4-subsampling [8].\n3 Parsimonious Triangular Model\nGiven the input network, now represented as a bag of triangular motifs, our goal is to make infer-\nence about the latent position vector \u03b8i of each vertex i \u2208 {1, . . . , N}. 
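As a concrete sketch of this preprocessing step (our own illustrative Python, not the authors' implementation; for clarity it enumerates all motifs and omits the δ-subsampling step), an edge list can be turned into a bag of coded triangular motifs Eijk as follows:

```python
from itertools import combinations

def bag_of_triangles(edges):
    """Extract open/closed triangular motifs from an undirected edge list.

    Returns a dict mapping each triple (i, j, k) with i < j < k to its
    motif code E_ijk: 1, 2, 3 = open triangle centered at i, j, k
    respectively; 4 = closed triangle. Empty and single-edge triples
    are never enumerated, matching the paper's representation.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    motifs = {}
    for center, nbrs in adj.items():
        # Every connected 3-node subgraph has a vertex adjacent to the
        # other two, so enumerating neighbor pairs finds them all.
        for u, v in combinations(sorted(nbrs), 2):
            i, j, k = sorted((center, u, v))
            if v in adj[u]:                      # third edge present
                motifs[(i, j, k)] = 4            # closed triangle
            else:
                # open triangle: code marks which vertex is the center
                motifs[(i, j, k)] = 1 + (i, j, k).index(center)
    return motifs
```

For example, on the path 1-2-3 this returns {(1, 2, 3): 2}, an open triangle centered at vertex 2; adding the edge (1, 3) closes it to code 4.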
(si,jk, sj,ik, sk,ij) | Equivalence classes | Conditional probability of Eijk ∈ {1, 2, 3, 4}
x = si,jk = sj,ik = sk,ij | {1, 2, 3}, {4} | Discrete([Bxxx,1/3, Bxxx,1/3, Bxxx,1/3, Bxxx,2])
x = si,jk = sj,ik ≠ sk,ij | {1, 2}, {3}, {4} | Discrete([Bxx,1/2, Bxx,1/2, Bxx,2, Bxx,3])
x = si,jk = sk,ij ≠ sj,ik | {1, 3}, {2}, {4} | Discrete([Bxx,1/2, Bxx,2, Bxx,1/2, Bxx,3])
x = sj,ik = sk,ij ≠ si,jk | {2, 3}, {1}, {4} | Discrete([Bxx,2, Bxx,1/2, Bxx,1/2, Bxx,3])
si,jk, sj,ik, sk,ij all distinct | {1, 2, 3}, {4} | Discrete([B0,1/3, B0,1/3, B0,1/3, B0,2])

Table 1: Equivalence classes and conditional probabilities of Eijk given si,jk, sj,ik, sk,ij (see text for details).

We take a mixed-membership approach: each vertex i can take a mixture distribution over K latent roles governed by a mixed-membership vector θi ∈ Δ^(K−1) restricted to the (K − 1)-simplex. Such vectors can be used for performing community detection and link prediction, as demonstrated in Section 5. Following a design principle similar to the Mixed-Membership Triangular Model (MMTM) [8], our Parsimonious Triangular Model (PTM) is essentially a latent-space model that defines the generative process for a bag of triangular motifs. However, compared to the MMTM, the major advantage of the PTM lies in its more compact and lower-dimensional nature that allows for more efficient inference algorithms (see Global Update step in Section 4). 
The dimension of triangle-generating parameters in the PTM is just O(K), rather than O(K³) in the MMTM (see below for further discussion).
To form a triangular motif Eijk for each triplet of vertices (i, j, k), a triplet of role indices si,jk, sj,ik, sk,ij ∈ {1, . . . , K} is first chosen based on the mixed-membership vectors θi, θj, θk. These indices designate the roles taken by each vertex participating in this triangular motif. There are O(K³) distinct configurations of such latent role triplets, and the MMTM uses a tensor of triangle-generating parameters of the same size to define the probability of Eijk, one entry Bxyz for each possible configuration (x, y, z). In the PTM, we reduce the number of such parameters by partitioning the O(K³) configuration space into several groups, and then sharing parameters within the same group. The partitioning is based on the number of distinct states in the configuration of the role triplet: 1) if the three role indices are all in the same state x, the triangle-generating probability is determined by Bxxx; 2) if only two role indices exhibit the same state x (called the majority role), the probability of triangles is governed by Bxx, which is shared across different minority roles; 3) if the three role indices are all distinct, the probability of triangular motifs depends on B0, a single parameter independent of the role configurations. This sharing yields just O(K) parameters B0, Bxx, Bxxx, x ∈ {1, . . . , K}, allowing the PTM to scale to far more latent roles than the MMTM. A similar idea was proposed in the a-MMSB [5], using one parameter ε to determine inter-role link probabilities, rather than O(K²) parameters for all pairs of distinct roles, as in the original MMSB [1].
Once the role triplet (si,jk, sj,ik, sk,ij) is chosen, some of the triangular motifs can become indistinguishable. 
To illustrate, in the case of x = si,jk = sj,ik (cid:54)= sk,ij, one cannot distinguish the\nopen triangle with i in the center (Eijk = 1) from that with j in the center (Eijk = 2), because both\nare open triangles centered at a vertex with majority role x, and are thus structurally equivalent\nunder the given role con\ufb01guration. Formally, this con\ufb01guration induces a set of triangle equivalence\nclasses {{1, 2},{3},{4}} of all possible triangular motifs {1, 2, 3, 4}. We treat the triangular motifs\nwithin the same equivalence class as stochastically equivalent; that is, the conditional probabilities\nof events Eijk = 1 and Eijk = 2 are the same if x = si,jk = sj,ik (cid:54)= sk,ij. All possible cases are\nenumerated as follows (see also Table 1):\n1. If all three vertices have the same role x, all three open triangles are equivalent and the induced set of\nequivalence classes is {{1, 2, 3},{4}}. The probability of Eijk is determined by Bxxx \u2208 \u22061, where\nBxxx,1 represents the total probability of sampling an open triangle from {1, 2, 3} and Bxxx,2 represents\nthe closed triangle probability. Thus, the probability of a particular open triangle is Bxxx,1/3.\n2. If only two vertices have the same role x (majority role), the probability of Eijk is governed by Bxx \u2208 \u22062.\nHere, Bxx,1 and Bxx,2 represent the open triangle probabilities (for open triangles centered at a vertex in\nmajority and minority role respectively), and Bxx,3 represents the closed triangle probability. There are two\npossible open triangles with a vertex in majority role at the center, and hence each has probability Bxx,1/2.\n3. 
If all three vertices have distinct roles, the probability of Eijk depends on B0 ∈ Δ¹, where B0,1 represents the total probability of sampling an open triangle from {1, 2, 3} (regardless of the center vertex's role) and B0,2 represents the closed triangle probability.

To summarize, the PTM assumes the following generative process for a bag of triangular motifs:
• Choose B0 ∈ Δ¹, Bxx ∈ Δ² and Bxxx ∈ Δ¹ for each role x ∈ {1, . . . , K} according to symmetric Dirichlet distributions Dirichlet(λ).
• For each vertex i ∈ {1, . . . , N}, draw a mixed-membership vector θi ∼ Dirichlet(α).
• For each triplet of vertices (i, j, k), i < j < k,
  − Draw role indices si,jk ∼ Discrete(θi), sj,ik ∼ Discrete(θj), sk,ij ∼ Discrete(θk).
  − Choose a triangular motif Eijk ∈ {1, 2, 3, 4} based on B0, Bxx, Bxxx and the configuration of (si,jk, sj,ik, sk,ij) (see Table 1 for the conditional probabilities).

It is worth pointing out that, similar to the MMTM, our PTM is not a generative model of networks per se because (a) empty and single-edge motifs are not modeled, and (b) one can generate a set of triangles that does not correspond to any network, because the generative process does not force overlapping triangles to have consistent edge values. However, given a bag of triangular motifs E extracted from a network, the above procedure defines a valid probabilistic model p(E | α, λ) and we can legitimately use it for performing posterior inference p(s, θ, B | E, α, λ). We stress that our goal is latent space inference, not network simulation.
4 Scalable Stochastic Variational Inference
In this section, we present a stochastic variational inference algorithm [10] for performing approximate inference under our model. 
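As a concrete companion to the generative process of Section 3, here is a minimal simulation sketch (our own illustrative code, not the authors'; numpy is the only dependency, and the helper `motif_probs` simply tabulates Table 1's shared-parameter distributions):

```python
import numpy as np

def motif_probs(x, y, z, B0, Bxx, Bxxx):
    """Conditional distribution of E_ijk in {1,2,3,4} given roles (x,y,z).

    B0: length-2 simplex; Bxx: (K,3) array, rows on the simplex;
    Bxxx: (K,2) array, rows on the simplex (Table 1).
    """
    if x == y == z:                       # all three roles equal
        o, c = Bxxx[x]
        return np.array([o / 3, o / 3, o / 3, c])
    if x == y or x == z or y == z:        # exactly two roles equal
        maj = x if (x == y or x == z) else y
        p = np.empty(4)
        roles = (x, y, z)
        # motif e+1 is the open triangle centered at vertex e:
        # majority-centered gets Bxx[maj,0]/2, minority-centered Bxx[maj,1]
        for e, r in enumerate(roles):
            p[e] = Bxx[maj, 0] / 2 if r == maj else Bxx[maj, 1]
        p[3] = Bxx[maj, 2]                # closed triangle
        return p
    o, c = B0                             # all three roles distinct
    return np.array([o / 3, o / 3, o / 3, c])

def sample_ptm(triples, alpha, lam, K, rng):
    """Draw (theta, motifs) from the PTM for a given set of vertex triples."""
    N = 1 + max(v for t in triples for v in t)
    theta = rng.dirichlet([alpha] * K, size=N)
    B0 = rng.dirichlet([lam] * 2)
    Bxx = rng.dirichlet([lam] * 3, size=K)
    Bxxx = rng.dirichlet([lam] * 2, size=K)
    E = {}
    for (i, j, k) in triples:
        x, y, z = (rng.choice(K, p=theta[v]) for v in (i, j, k))
        E[(i, j, k)] = 1 + rng.choice(4, p=motif_probs(x, y, z, B0, Bxx, Bxxx))
    return theta, E
```

Note that each row of Table 1 sums to one by construction, e.g. in the majority-role case 2 · Bxx,1/2 + Bxx,2 + Bxx,3 = 1.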
Although it is also feasible to develop such an algorithm for the MMTM [8], the O(NK³) computational complexity precludes its application to large numbers of latent roles. However, due to the parsimonious O(K) parameterization of the PTM, our efficient algorithm has only O(NK) complexity.
We adopt a structured mean-field approximation, in which the true posterior of latent variables p(s, θ, B | E, α, λ) is approximated by a partially factorized distribution q(s, θ, B),

q(s, θ, B) = ∏_{(i,j,k)∈I} q(si,jk, sj,ik, sk,ij | φijk) · ∏_{i=1}^{N} q(θi | γi) · ∏_{x=1}^{K} q(Bxxx | ηxxx) · ∏_{x=1}^{K} q(Bxx | ηxx) · q(B0 | η0),

where I = {(i, j, k) : i < j < k, Eijk = 1, 2, 3 or 4} and |I| = O(Nδ²). The strong dependencies among the per-triangle latent roles (si,jk, sj,ik, sk,ij) suggest that we should model them as a group, rather than as completely independent as in a naive mean-field approximation¹. Thus, the variational posterior of (si,jk, sj,ik, sk,ij) is the discrete distribution

q(si,jk = x, sj,ik = y, sk,ij = z) ≐ qijk(x, y, z) = φijk^{xyz},  x, y, z = 1, . . . , K.  (1)

The posterior q(θi) is a Dirichlet(γi); and the posteriors of Bxxx, Bxx, B0 are parameterized as: q(Bxxx) = Dirichlet(ηxxx), q(Bxx) = Dirichlet(ηxx), and q(B0) = Dirichlet(η0).
The mean-field approximation aims to minimize the KL divergence KL(q ∥ p) between the approximating distribution q and the true posterior p; this is equivalent to maximizing a lower bound L(φ, η, γ) of the log marginal likelihood of the triangular motifs (based on Jensen's inequality) with respect to the variational parameters {φ, η, γ} [22].
log p(E | α, λ) ≥ Eq[log p(E, s, θ, B | α, λ)] − Eq[log q(s, θ, B)] ≐ L(φ, η, γ).  (2)

To simplify the notation, we decompose the variational objective L(φ, η, γ) into a global term and a summation of local terms, one term for each triangular motif (see Appendix for details):

L(φ, η, γ) = g(η, γ) + Σ_{(i,j,k)∈I} ℓ(φijk, η, γ).  (3)

The global term g(η, γ) depends only on the global variational parameters η, which govern the posterior of the triangle-generating probabilities B, as well as the per-node mixed-membership parameters γ. Each local term ℓ(φijk, η, γ) depends on the per-triangle parameters φijk as well as the global parameters. Define L(η, γ) ≐ max_φ L(φ, η, γ), which is the variational objective achieved by fixing the global parameters η, γ and optimizing the local parameters φ. By equation (3),

L(η, γ) = g(η, γ) + Σ_{(i,j,k)∈I} max_{φijk} ℓ(φijk, η, γ).  (4)

Stochastic variational inference is a stochastic gradient ascent algorithm [3] that maximizes L(η, γ) based on noisy estimates of its gradient with respect to η and γ. Whereas computing the true gradient ∇L(η, γ) involves a costly summation over all triangular motifs as in (4), an unbiased noisy approximation of the gradient can be obtained much more cheaply by summing over a small subsample of triangles. With this unbiased estimate of the gradient and a suitable adaptive step size, the algorithm is guaranteed to converge to a stationary point of the variational objective L(η, γ) [18].
¹ We tested a naive mean-field approximation, and it performed very poorly. 
This is because the tensor of role probabilities q(x, y, z) is often of high rank, whereas naive mean-field is a rank-1 approximation.

Algorithm 1 Stochastic Variational Inference
1: t = 0. Initialize the global parameters η and γ.
2: Repeat the following steps until convergence.
   (1) Sample a mini-batch of triangles S.
   (2) Optimize the local parameters qijk(x, y, z) for all sampled triangles in parallel by (6).
   (3) Accumulate sufficient statistics for the natural gradients of η, γ (and then discard qijk(x, y, z)).
   (4) Optimize the global parameters η and γ by the stochastic natural gradient ascent rule (7).
   (5) ρt ← τ0(τ1 + t)^{−κ}, t ← t + 1.

In our setting, the most natural way to obtain an unbiased gradient of L(η, γ) is to sample a "mini-batch" of triangular motifs at each iteration, and then average the gradient of local terms in (4) only for these sampled triangles. Formally, let m be the total number of triangles and define

L_S(η, γ) = g(η, γ) + (m / |S|) Σ_{(i,j,k)∈S} max_{φijk} ℓ(φijk, η, γ),  (5)

where S is a mini-batch of triangles sampled uniformly at random. It is easy to verify that E_S[L_S(η, γ)] = L(η, γ); hence ∇L_S(η, γ) is unbiased: E_S[∇L_S(η, γ)] = ∇L(η, γ).
Exact Local Update. To obtain the gradient ∇L_S(η, γ), one needs to compute the optimal local variational parameters φijk (keeping η and γ fixed) for each sampled triangle (i, j, k) in the mini-batch S; these optimal φijk's are then used in equation (5) to compute ∇L_S(η, γ). 
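The m/|S| rescaling in (5) is exactly what makes the mini-batch estimate unbiased. A quick numerical check (illustrative only; arbitrary per-triangle values stand in for the local terms ℓ, and the global term g is omitted):

```python
import random

def minibatch_estimate(local_terms, batch_size, rng):
    """Unbiased estimate of sum(local_terms) from a uniform mini-batch,
    rescaled by m/|S| as in equation (5)."""
    m = len(local_terms)
    batch = rng.sample(range(m), batch_size)
    return (m / batch_size) * sum(local_terms[i] for i in batch)

rng = random.Random(0)
terms = [rng.uniform(0, 1) for _ in range(1000)]  # stand-ins for l(phi_ijk, eta, gamma)
exact = sum(terms)
# Averaging many independent mini-batch estimates recovers the full sum,
# since each single estimate is unbiased.
estimates = [minibatch_estimate(terms, 100, rng) for _ in range(2000)]
avg = sum(estimates) / len(estimates)
print(abs(avg - exact) / exact)  # small relative error
```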
Taking partial derivatives of (3) with respect to each local term φijk^{xyz} and setting them to zero, we get for distinct x, y, z ∈ {1, . . . , K},

φijk^{xyz} ∝ exp{ Eq[log B0,2] I[Eijk = 4] + Eq[log(B0,1/3)] I[Eijk ≠ 4] + Eq[log θi,x + log θj,y + log θk,z] }.  (6)

See Appendix for the update equations of φijk^{xxx} and φijk^{xxy} (x ≠ y).
O(K) Approximation to Local Update. For each sampled triangle (i, j, k), the exact local update requires O(K³) work to solve for all φijk^{xyz}, making it unscalable. To enable a faster local update, we replace qijk(x, y, z | φijk) in (1) with a simpler "mixture-of-deltas" variational distribution,

qijk(x, y, z | δijk) = Σ_a δijk^{aaa} I[x = y = z = a] + Σ_{(a,b,c)∈A} δijk^{abc} I[x = a, y = b, z = c],

where A is a randomly chosen set of triples (a, b, c) with size O(K), and Σ_a δijk^{aaa} + Σ_{(a,b,c)∈A} δijk^{abc} = 1. In other words, we assume the probability mass of the variational posterior q(si,jk, sj,ik, sk,ij) falls entirely on the K "diagonal" role combinations (a, a, a) as well as O(K) randomly chosen "off-diagonals" (a, b, c). Conveniently, the δ update equations are identical to their φ counterparts as in (6), except that we normalize over the δ's instead.
In our implementation, we generate A by picking 3K combinations of the form (a, a, b), (a, b, a) or (b, a, a), and another 3K combinations of the form (a, b, c), thus mirroring the parameter structure of B. Furthermore, we re-pick A every time we perform the local update on some triangle (i, j, k), thus avoiding any bias due to a single choice of A. We find that this approximation works as well as the full parameterization in (1), yet requires only O(K) work per sampled triangle. Note that any choice of A yields a valid lower bound to the true log-likelihood; this follows from standard variational inference theory.
Global Update. 
We appeal to stochastic natural gradient ascent [2, 20, 10] to optimize the global parameters η and γ, as it greatly simplifies the update rules while maintaining the same asymptotic convergence properties as classical stochastic gradient. The natural gradient ∇̃L_S(η, γ) is obtained by premultiplying the ordinary gradient ∇L_S(η, γ) with the inverse of the Fisher information of the variational posterior q. See Appendix for the exact forms of the natural gradients with respect to η and γ. To update the parameters η and γ, we apply the stochastic natural gradient ascent rule

η_{t+1} = η_t + ρ_t ∇̃_η L_S(η_t, γ_t),  γ_{t+1} = γ_t + ρ_t ∇̃_γ L_S(η_t, γ_t),  (7)

where the step size is given by ρ_t = τ0(τ1 + t)^{−κ}. To ensure convergence, τ0, τ1, κ are set such that Σ_t ρ_t² < ∞ and Σ_t ρ_t = ∞ (Section 5 has our experimental values). The global update only costs O(NK) time per iteration due to the parsimonious O(K) parameterization of our PTM.
Our full inferential procedure is summarized in Algorithm 1. Within a mini-batch S, steps 2-3 can be trivially parallelized across triangles. Furthermore, the local parameters qijk(x, y, z) can be discarded between iterations, since all natural gradient sufficient statistics can be accumulated during the local update. This saves up to tens of gigabytes of memory on million-node networks.
5 Experiments
We demonstrate that our stochastic variational algorithm achieves latent space recovery accuracy comparable to or better than prior work, but in only a fraction of the time. In addition, we perform heldout link prediction and likelihood lower bound (i.e. 
perplexity) experiments on several large\nreal networks, showing that our approach is orders of magnitude more scalable than previous work.\n5.1 Generating Synthetic Data\nWe use two latent space models as the simulator for our experiments \u2014 the MMSB model\n[1]\n(which the MMSB batch variational algorithm solves for), and a model that produces power-law net-\nworks from a latent space (see Appendix for details). Brie\ufb02y, the MMSB model produces networks\nwith \u201cblocks\u201d of nodes characterized by high edge probabilities, whereas the Power-Law model pro-\nduces \u201ccommunities\u201d centered around a high-degree hub node. We show that our algorithm rapidly\nand accurately recovers latent space roles based on these two notions of node-relatedness.\nFor both models, we synthesized ground truth role vectors \u03b8i\u2019s to generate networks of varying dif\ufb01-\nculty. We generated networks with N \u2208 {500, 1000, 2000, 5000, 10000} nodes, with the number of\nroles growing as K = N/100, to simulate the fact that large networks can have more roles [24]. We\ngenerated \u201ceasy\u201d networks where each \u03b8i contains 1 to 2 nonzero roles, and \u201chard\u201d networks with 1\nto 4 roles per \u03b8i. A full technical description of our networks can be found in the Appendix.\n5.2 Latent Space Recovery on Synthetic Data\nTask and Evaluation. Given one of the synthetic networks, the task is to recover estimates \u02c6\u03b8i\u2019s\nof the original latent space vectors \u03b8i\u2019s used to generate the network. 
Because we are comparing different algorithms (with varying model assumptions) on different networks (generated under their own assumptions), we standardize our evaluation by thresholding all outputs θ̂i's at 1/8 = 0.125 (because there are no more than 4 roles per θi), and use Normalized Mutual Information (NMI) [12, 23], a commonly-used measure of overlapping cluster accuracy, to compare the θ̂i's with the true θi's (thresholded similarly). In other words, we want to recover the set of non-zero roles.
Competing Algorithms and Initialization. We tested the following algorithms:
• Our PTM stochastic variational algorithm. We used δ = 50 subsampling² (i.e. $\binom{50}{2} = 1225$ triangles per node), hyperparameters α = λ = 0.1, and a 10% minibatch size with step-size τ0(τ1 + t)^κ, where τ0 = 100, τ1 = 10000, κ = −0.5, and t is the iteration number. Our algorithm has a runtime complexity of O(Nδ²K). Since our algorithm can be run in parallel, we conduct all experiments using 4 threads; compared to single-threaded execution, we observe this reduces runtime to about 40%.
• MMTM collapsed blocked Gibbs sampler, according to [8]. We also used δ = 50 subsampling. The algorithm has O(Nδ²K³) time complexity, and is single-threaded.
• PTM collapsed blocked Gibbs sampler. Like the above MMTM Gibbs, but using our PTM model. Because of block sampling, complexity is still O(Nδ²K³). Single-threaded.
• MMSB batch variational [1]. This algorithm has O(N²K²) time complexity, and is single-threaded.
All these algorithms are locally-optimal search procedures, and thus sensitive to initial values. In particular, if nodes from two different roles are initialized to have the same role, the output is likely to merge all nodes in both roles into a single role. 
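The thresholding step of this evaluation protocol is simple to state in code (a sketch with made-up vectors; the overlapping-NMI computation itself follows [12, 23] and is not reproduced here):

```python
import numpy as np

def role_sets(theta, tau=1/8):
    """Threshold mixed-membership vectors into sets of active roles.

    theta: (N, K) array of (estimated or true) mixed-membership vectors.
    The paper thresholds at tau = 1/8 = 0.125 since each theta_i has
    at most 4 nonzero roles.
    """
    return [frozenset(np.flatnonzero(row >= tau)) for row in theta]

# Illustrative comparison of recovered role sets against ground truth
# (an exact-match rate, NOT the NMI itself).
truth = np.array([[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]])
est   = np.array([[0.45, 0.52, 0.03], [0.9, 0.06, 0.04]])
exact_match = np.mean([a == b for a, b in zip(role_sets(truth), role_sets(est))])
```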
To ensure a meaningful comparison, we therefore provide the same fixed initialization to all algorithms: for every role x, we provide 2 example nodes i, and initialize the remaining nodes to have random roles. In other words, we seed 2% of the nodes with one of their true roles, and let the algorithms proceed from there³.
Recovery Accuracy. Results of our method, MMSB Variational, MMTM Gibbs and PTM Gibbs are in Figure 1. Our method exhibits high accuracy (i.e. NMI close to 1) across almost all networks, validating its ability to recover latent roles under a range of network sizes N and roles K. In contrast, as N (and thus K) increases, MMSB Variational exhibits degraded performance despite having converged, while MMTM/PTM Gibbs converge to and become stuck in local minima
² We chose δ = 50 because almost all our synthetic networks have median degree ≤ 50. Choosing δ above the median degree ensures that more than 50% of the nodes will receive all their assigned triangles.
³ In general, one might not have any ground truth roles or labels to seed the algorithm with. For such cases, our algorithm can be initialized as follows: rank all nodes according to the number of 3-triangles they touch, and then seed the top K nodes with different roles x. The intuition is that "good" roles may be defined as having a high ratio of 3-triangles to 2-triangles among participating nodes.

Figure 1: Synthetic Experiments. Left/Center: Latent space recovery accuracy (measured using Normalized Mutual Information) and runtime per data pass for our method and baselines. With the MMTM/PTM Gibbs and MMSB Variational algorithms, the larger networks did not complete within 12 hours. 
The runtime plots\nfor MMSB easy and Power-Law easy experiments are very similar to the hard experiments, so we omit them.\nRight: Convergence of our stochastic variational algorithm (with 10% minibatches) versus a batch variational\nversion of our algorithm. On N = 1, 000 networks, our minibatch algorithm converges within 1-2 data passes.\n\nLink Prediction on Synthetic and Real Networks\n\nDictionary\n\nRoget\n1.0K\n3.6K\n0.65\n0.72\n\nOdlis\n2.9K\n16K\n0.81\n0.88\n\nBiological\n\nYeast\n2.4K\n6.6K\n0.75\n0.81\n\narXiv Collaboration\nAstroPh\nGrQc\n5.2K\n18.7K\n200K\n14K\n0.86\n0.82\n0.77\n\u2014\n\nInternet\nStanford\n\n282K\n2.0M\n0.94\n\u2014\n\nSocial\nYoutube\n1.1M\n3.0M\n0.71\n\u2014\n\nNetwork Type\n\nName\nNodes N\nEdges\n\nOur Method AUC\n\nMMSB Variational AUC\n\nMMSB\n2.0K\n40K\n0.93\n0.91\n\nSynthetic\n\nPower-law\n\n2.0K\n40K\n0.97\n0.94\n\nTable 2: Link Prediction Experiments, measured using AUC. Our method performs similarly to MMSB\nVariational on synthetic data. MMSB performs better on smaller, non-social networks, while we perform\nbetter on larger, social networks (or MMSB fails to complete due to lack of scalability). Roget, Odlis and\nYeast networks are from Pajek datasets (http://vlado.fmf.uni-lj.si/pub/networks/data/);\nthe rest are from Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/).\n(even after many iterations and trials), without reaching a good solution4. We believe our method\nmaintains high accuracy due to its parsimonious O(K) parameter structure \u2014 compared to MMSB\nVariational\u2019s O(K 2) block matrix and MMTM Gibbs\u2019s O(K 3) tensor of triangle parameters.\nHaving fewer parameters may lead to better parameter estimates, and better task performance.\nRuntime. On the larger networks, MMSB Variational and MMTM/PTM Gibbs did not even\n\ufb01nish execution due to their high runtime complexity. 
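To make the asymptotic gap concrete, the per-pass costs quoted in this section can be compared numerically. This is a back-of-the-envelope sketch with constants and lower-order terms dropped; the function name and the specific (N, K) pairs are ours, chosen to mimic the synthetic regime where K grows linearly with N.

```python
# Asymptotic per-pass costs from the text (up to constants):
#   ours:           N * delta^2 * K
#   MMSB Var.:      N^2 * K^2
#   MMTM/PTM Gibbs: N * delta^2 * K^3
def cost(method, N, K, delta=50):
    return {"ours":  N * delta**2 * K,
            "mmsb":  N**2 * K**2,
            "gibbs": N * delta**2 * K**3}[method]

# Doubling N (and K proportionally, since K = O(N)) multiplies our
# per-pass cost by 4, but the MMSB Variational cost by 16 -- i.e.
# O(N^2) versus O(N^4) growth on these synthetic networks:
r_ours = cost("ours", 2000, 200) / cost("ours", 1000, 100)
r_mmsb = cost("mmsb", 2000, 200) / cost("mmsb", 1000, 100)
# r_ours == 4.0, r_mmsb == 16.0
```

The same doubling multiplies the Gibbs samplers' cost by 2 · 2³ = 16 as well, since their K³ term dominates once K is large.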
This can be seen in the runtime graphs, which plot the time taken per data pass⁵: at N = 5,000, all 3 baselines require orders of magnitude more time than our method does at N = 10,000. Recall that K = O(N), and that our method has time complexity O(Nδ²K), while MMSB Variational has O(N²K²), and MMTM/PTM Gibbs has O(Nδ²K³) — hence, our method runs in O(N²) on these synthetic networks, while the others run in O(N⁴). This highlights the need for network methods that are linear in N and K.
Convergence of stochastic vs. batch algorithms. We also demonstrate that our stochastic variational algorithm with 10% mini-batches converges much faster to the correct solution than a non-stochastic, full-batch implementation. The convergence graphs in Figure 1 plot NMI as a function of data passes, and show that our method converges to the (almost) correct solution in 1-2 data passes. In contrast, the batch algorithm takes 10 or more data passes to converge.
5.3 Heldout Link Prediction on Real and Synthetic Networks
We compare MMSB Variational and our method on a link prediction task, in which 10% of the edges are randomly removed (set to zero) from the network, and, given this modified network, the task is to rank these heldout edges against an equal number of randomly chosen non-edges. For MMSB, we simply ranked according to the link probability under the MMSB model. For our method, we ranked possible links i − j by the probability that the triangle (i, j, k) will include edge i − j, marginalizing over all choices of the third node k and over all possible role choices for nodes i, j, k. Table 2 displays results for a variety of networks, and our triangle-based method does better on larger social networks than the edge-based MMSB. This matches what has been observed in the network literature [24], and further validates our triangle modeling assumptions.

⁴ With more generous initializations (20 out of 100 ground truth nodes per role), MMTM/PTM Gibbs converge correctly. In practice, however, this is an unrealistic amount of prior knowledge to expect. We believe that more sophisticated MCMC schemes may fix this convergence issue with MMTM/PTM models.
⁵ One data pass is defined as performing variational inference on m triangles, where m is equal to the total number of triangles. This takes the same amount of time for both the stochastic and batch algorithms.

Table 3: Real Network Experiments — Statistics, Experimental Settings and Runtime.

Name                  | Nodes | Edges | δ  | 2,3-Tris (for δ) | Frac. 3-Tris | Roles K | Threads | Runtime (10 data passes)
Brightkite            | 58K   | 214K  | 50 | 3.5M             | 0.11         | 64      | 4       | 34 min
Brightkite            | 58K   | 214K  | 50 | 3.5M             | 0.11         | 300     | 4       | 2.6 h
Slashdot Feb 2009     | 82K   | 504K  | 50 | 9.0M             | 0.030        | 100     | 4       | 2.4 h
Slashdot Feb 2009     | 82K   | 504K  | 50 | 9.0M             | 0.030        | 300     | 4       | 6.7 h
Stanford Web          | 282K  | 2.0M  | 20 | 11.4M            | 0.57         | 5       | 4       | 10 min
Stanford Web          | 282K  | 2.0M  | 50 | 25.0M            | 0.42         | 100     | 4       | 6.3 h
Berkeley-Stanford Web | 685K  | 6.6M  | 30 | 57.6M            | 0.55         | 100     | 8       | 15.2 h
Youtube               | 1.1M  | 3.0M  | 50 | 36.0M            | 0.053        | 100     | 8       | 9.1 h

All networks were taken from the Stanford Large Network Dataset Collection; directed networks were converted to undirected networks via symmetrization. Some networks were run with more than one choice of settings. Runtime is the time taken for 10 data passes (which was more than sufficient for convergence on all networks, see Figure 2).

Figure 2: Real Network Experiments — heldout lower bound of our method. Training and heldout variational lower bound (equivalent to perplexity) convergence plots for all experiments in Table 3. Each plot shows both lower bounds over 10 data passes (i.e. 100 iterations with 10% minibatches). In all cases, we observe convergence between 2-5 data passes, and the shape of the heldout curve closely mirrors the training curve (i.e. no overfitting).

5.4 Real World Networks — Convergence on Heldout Data
Finally, we demonstrate that our approach is capable of scaling to large real-world networks, achieving convergence in a fraction of the time reported by recent work on scalable network modeling. Table 3 lists the networks that we tested on, ranging in size from N = 58K to N = 1.1M. With a few exceptions, the experiments were conducted with δ = 50 and 4 computational threads. In particular, for every network, we picked δ to be larger than the average degree, thus minimizing the amount of triangle data lost to subsampling.
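The role of δ in the subsampling scheme can be sketched as follows. Under δ-subsampling, each node draws its triangles from at most δ sampled neighbors, so a node contributes at most C(δ, 2) triangles, and any node whose degree is at most δ keeps every adjacent triangle. The helper names and the toy degree sequence below are ours, for illustration only.

```python
def triangle_budget(delta):
    """Maximum triangles assigned per node under delta-subsampling:
    choose 2 of delta sampled neighbors, i.e. C(delta, 2)."""
    return delta * (delta - 1) // 2

def fully_covered_fraction(degrees, delta):
    """Fraction of nodes with degree <= delta; such nodes keep all
    of their adjacent triangles (no subsampling loss)."""
    return sum(1 for d in degrees if d <= delta) / len(degrees)

budget = triangle_budget(50)            # 1225, the per-node count quoted earlier
toy_degrees = [3, 10, 20, 45, 80, 120]  # hypothetical degree sequence
frac = fully_covered_fraction(toy_degrees, 50)  # 4 of 6 nodes fully covered
```

This is why choosing δ above the median (or average) degree matters: it guarantees that most nodes incur no subsampling loss at all, while capping the per-node work at C(δ, 2).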
Figure 2 plots the training and heldout variational lower bound for several experiments, and shows that our algorithm always converges in 2-5 data passes.
We wish to highlight two experiments, namely the Brightkite network for K = 64, and the Stanford network for K = 5 (the first and fifth rows respectively in Table 3). Gopalan et al. [5] reported convergence on Brightkite in 8 days using their scalable a-MMSB algorithm with 4 threads, while Ho et al. [8] converged on Stanford in 18.5 hours using the MMTM Gibbs algorithm on 1 thread. In both settings, our algorithm is orders of magnitude faster — using 4 threads, it converged on Brightkite and Stanford in just 12 and 4 minutes respectively, as seen in Figure 2.
In summary, we have constructed a latent space network model with O(NK) parameters and devised a stochastic variational algorithm for O(NK) inference. Our implementation allows network analysis with millions of nodes N and hundreds of roles K in hours on a single multi-core machine, with competitive or improved accuracy for latent space recovery and link prediction. These results are orders of magnitude faster than recent work on scalable latent space network modeling [5, 8].
Acknowledgments
This work was supported by AFOSR FA9550010247, NIH 1R01GM093156 and DARPA FA87501220324 to Eric P. Xing.
Qirong Ho is supported by an A-STAR, Singapore fellowship. Junming Yin is supported by a Ray and Stephanie Lane Research Fellowship.

References
[1] E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.
[2] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[3] L. Bottou. Stochastic learning. Advanced Lectures on Machine Learning, pages 146–168, 2004.
[4] M. Carman, F. Crestani, M. Harvey, and M. Baillie. Towards query log based personalization using topic models.
In Proceedings of the 19th ACM international conference on Information and knowledge management (CIKM '10), pages 1849–1852, 2010.
[5] P. Gopalan, D. Mimno, S. Gerrish, M. Freedman, and D. Blei. Scalable inference of overlapping communities. In Advances in Neural Information Processing Systems 25, pages 2258–2266. 2012.
[6] M. Granovetter. The strength of weak ties. American Journal of Sociology, 78(6):1360–1380, 1973.
[7] Q. Ho, A. Parikh, and E. Xing. A multiscale community blockmodel for network exploration. Journal of the American Statistical Association, 107(499), 2012.
[8] Q. Ho, J. Yin, and E. Xing. On triangular versus edge representations — towards scalable modeling of networks. In Advances in Neural Information Processing Systems 25, pages 2141–2149. 2012.
[9] P. Hoff, A. Raftery, and M. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.
[10] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
[11] D. Hunter, S. Goodreau, and M. Handcock. Goodness of fit of social network models. Journal of the American Statistical Association, 103(481):248–258, 2008.
[12] A. Lancichinetti, S. Fortunato, and J. Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015+, 2009.
[13] Y. Low, D. Agarwal, and A. Smola. Multiple domain user personalization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '11), pages 123–131, 2011.
[14] K. Miller, T. Griffiths, and M. Jordan. Nonparametric latent feature models for link prediction. In Advances in Neural Information Processing Systems 22, pages 1276–1284. 2009.
[15] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
[16] M. Morris, M. Handcock, and D. Hunter. Specification of exponential-family random graph models: Terms and computational aspects. Journal of Statistical Software, 24(4), 2008.
[17] M. Newman, S. Strogatz, and D. Watts. Random graphs with arbitrary degree distributions and their applications. Physical Review E, 64(2), 2001.
[18] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[19] P. Sarkar and A. Moore. Dynamic social network analysis using latent space models. ACM SIGKDD Explorations Newsletter, 7(2):31–40, 2005.
[20] M. Sato. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, 2001.
[21] G. Simmel and K. Wolff. The Sociology of Georg Simmel. Free Press, 1950.
[22] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[23] J. Xie, S. Kelley, and B. Szymanski. Overlapping community detection in networks: the state of the art and comparative study. ACM Computing Surveys, 45(4), 2013.
[24] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics. ACM, 2012.