{"title": "Scalable inference of topic evolution via models for latent geometric structures", "book": "Advances in Neural Information Processing Systems", "page_first": 5951, "page_last": 5961, "abstract": "We develop new models and algorithms for learning the temporal dynamics of the topic polytopes and related geometric objects that arise in topic model based inference. Our model is nonparametric Bayesian and the corresponding inference algorithm is able to discover new topics as the time progresses. By exploiting the connection between the modeling of topic polytope evolution, Beta-Bernoulli process and the Hungarian matching algorithm, our method is shown to be several orders of magnitude faster than existing topic modeling approaches, as demonstrated by experiments working with several million documents in under two dozens of minutes.", "full_text": "Scalable inference of topic evolution via models for\n\nlatent geometric structures\n\nMikhail Yurochkin\n\nIBM Research\n\nmikhail.yurochkin@ibm.com\n\nZhiwei Fan\n\nUniversity of Wisconsin-Madison\n\nzhiwei@cs.wisc.edu\n\nAritra Guha\n\nUniversity of Michigan\naritra@umich.edu\n\nParaschos Koutris\n\nUniversity of Wisconsin-Madison\n\nparis@cs.wisc.edu\n\nXuanLong Nguyen\nUniversity of Michigan\n\nxuanlong@umich.edu\n\nAbstract\n\nWe develop new models and algorithms for learning the temporal dynamics of\nthe topic polytopes and related geometric objects that arise in topic model based\ninference. Our model is nonparametric Bayesian and the corresponding inference\nalgorithm is able to discover new topics as the time progresses. 
By exploiting the connection between the modeling of topic polytope evolution, the Beta-Bernoulli process and the Hungarian matching algorithm, our method is shown to be several orders of magnitude faster than existing topic modeling approaches, as demonstrated by experiments working with several million documents in under two dozen minutes.¹\n\n1 Introduction\n\nThe topic or population polytope is a fundamental geometric object that underlies the presence of latent topic variables in topic and admixture models [4, 19, 21]. The geometry of topic models provides the theoretical basis for posterior contraction analysis of latent topics, in addition to helping to develop fast and quite accurate inference algorithms in parametric and nonparametric settings [18, 28, 29, 32]. When data and the associated topics are indexed by a time dimension, it is of interest to study the temporal dynamics of such latent geometric structures. In this paper, we study models and algorithms for learning the temporal dynamics of the topic polytope that arises in the analysis of text corpora.\nSeveral authors have extended the basic topic modeling framework to analyze how topics evolve over time. The Dynamic Topic Models (DTM) [3] demonstrated the importance of accounting for non-exchangeability between document groups, particularly when a time index is provided. Another approach is to keep topics fixed and consider only evolving topic popularity [26]. Hong et al. [13] extended such an approach to multiple corpora. Ahmed and Xing [1] proposed a nonparametric construction extending DTM where topics can appear or eventually die out. Although the evolution of the latent geometric structure (i.e., the topic polytope) is implicitly present in these works, it was not explicitly addressed, nor was the geometry exploited. 
A related limitation shared by these modeling frameworks is the lack of scalability, due to inefficient joint modeling and learning of topics at each time point and of topic evolution over time. To improve scalability, a natural solution is to decouple the two phases of inference.\n\n¹Code: https://github.com/moonfolk/SDDM\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTo this end, we seek to develop a series of topic meta-models, i.e., models for the temporal dynamics of topic polytopes, assuming that the topic estimates from each time point have already been obtained via some efficient static topic inference technique. The focus on inference of topic evolution offers novel opportunities and challenges. To start, what is a suitable ambient space in which the topic polytope is represented? As topics evolve, so does the number of topics that may become active or dormant, raising distinct modeling choices. Interesting issues arise in the inference, too. For instance, what is the principled way of matching vertices of a collection of polytopes to their next reincarnations? Such a question arises because we consider modeling of topics learned independently across timestamps and text corpora, which entails preserving the permutation invariance of the topic polytope's vertex labels.\nWe consider an isometric embedding of the unit sphere in the word simplex, so that the evolution of topic polytopes may be represented by a collection of (random) trajectories of points residing on the unit sphere. Instead of attempting to mix-match vertices in an ad hoc fashion, we appeal to a Bayesian nonparametric modeling framework that allows the number of topic vertices to be random and to vary across time. The mix-matching between topics is guided by a smoothness assumption on the collection of global trajectories on the sphere, using von Mises-Fisher dynamics [15]. 
The selection of active topics at each time point is enabled by a nonparametric prior on random binary matrices via the (hierarchical) Beta-Bernoulli process [24].\nOur contribution includes a sequence of Bayesian nonparametric models of increasing complexity: the simplest model describes a single topic polytope evolving over time, while the full model describes the temporal dynamics of a collection of topic polytopes as they arise from multiple corpora. The semantics of topics can be summarized as follows: there is a collection of latent global topics of unknown cardinality evolving over time (e.g., topics in science or social topics on Twitter). Each year (or day) a subset of the global topics is elucidated by the community (some topics may be dormant at a given time point). The nature of each global topic may change smoothly (via varying word frequencies). Additionally, different subsets of global topics are associated with different groups (e.g., journals or Twitter location stamps), some becoming active/inactive over time.\nAnother key contribution is a suite of scalable approximate inference algorithms suitable for online and distributed settings. In particular, we focus mainly on MAP updates rather than full Bayesian integration. This is appropriate in an online learning setting; moreover, such updates of the latent topic polytope can be viewed as solving an optimal matching problem, for which the fast Hungarian matching algorithm can be applied. Our approach is able to perform dynamic nonparametric topic inference on 3 million documents in 20 minutes, which is significantly faster than prior static online and/or distributed topic modeling algorithms [16, 12, 25, 6, 5].\nThe remainder of the paper is organized as follows. In Section 2 we define a Markov process over the space of topic polytopes (simplices). 
In Section 3 we present a series of models for polytope dynamics and describe our algorithms for online dynamic and/or distributed inference. Section 4 demonstrates experimental results. We conclude with a discussion in Section 5.\n\n2 Temporal dynamics of a topic polytope\n\nThe fundamental object of inference in this work is the topic polytope arising in topic modeling, which we now define [4, 18]. Given a vocabulary of V words, a topic is defined as a probability distribution on the vocabulary. Thus a topic is taken to be a point in the vocabulary simplex, namely $\Delta^{V-1}$, and a topic polytope for a corpus of documents is defined as the convex hull of the topics associated with the documents. Geometrically, the topics correspond to the vertices (extreme points) of the (latent) topic polytope to be inferred from data.\nIn order to infer the temporal dynamics of a topic polytope, one might consider the evolution of each topic variable, say $\theta^{(t)}$, which represents a vertex of the polytope at time t. A standard approach is due to Blei and Lafferty [3], who proposed to use a Gaussian Markov chain $\theta^{(t)} \mid \theta^{(t-1)} \sim \mathcal{N}(\theta^{(t-1)}, \sigma I)$ in $\mathbb{R}^V$ for modeling temporal dynamics and a logistic normal transformation $\pi(\theta^{(t)})_i := \exp(\theta^{(t)}_i) / \sum_i \exp(\theta^{(t)}_i)$, which sends elements of $\mathbb{R}^V$ into $\Delta^{V-1}$.\n\nFigure 1: Invertible transformation between unit sphere and a standard simplex; dynamics example\n\nIn our meta-modeling approach, we consider topics, i.e., points in $\Delta^{V-1}$, learned independently across time and corpora. The logistic normal map is many-to-one, hence it is undesirably ambiguous in mapping a collection of topic polytopes to $\mathbb{R}^V$.\nWe propose to represent each topic variable as a point on a unit sphere $S^{V-2}$, which possesses a natural isometric embedding (i.e., 
one-to-one) in the vocabulary simplex $\Delta^{V-1}$, so that the temporal dynamics of a topic variable can be identified with a (random) trajectory on $S^{V-2}$. This trajectory is modeled as a Markovian process on $S^{V-2}$: $\theta^{(t)} \mid \theta^{(t-1)} \sim \mathrm{vMF}(\theta^{(t-1)}, \tau_0)$. The von Mises-Fisher (vMF) distribution is commonly used in the field of directional statistics [15] to model points on a unit sphere and was previously utilized for text modeling [2, 20].\n\nIsometric embedding of $S^{V-2}$ into the vocabulary simplex We start with the directional representation of a topic polytope [29]: let $B = \{\beta_1, \ldots, \beta_K\}$ be the collection of vertices of a topic polytope. Each vertex is represented as $\beta_k := C + R_k \tilde\beta_k$, where $C \in \mathrm{Conv}(B)$ is a reference point in the convex hull of B, $\tilde\beta_k \in \mathbb{R}^V$ is a topic direction and $R_k \in \mathbb{R}_+$. Moreover, $R_k \in [0, 1]$ is determined so that the tip of the direction vector $\tilde\beta_k$ resides on the boundary of $\Delta^{V-1}$. Since the effective dimensionality of $\tilde\beta_k$ is V − 2, we can now define a one-to-one and isometric map sending $\tilde\beta_k$ onto $S^{V-2}$ as follows: the vocabulary simplex $\Delta^{V-1} \subset \mathbb{R}^V$ is first translated so that C becomes the origin and then rotated into $\mathbb{R}^{V-1}$, where the resulting topics, say $\theta_1, \ldots, \theta_K \in S^{V-2}$, are normalized to unit length. Observe that this geometric map is an isometry and hence invertible. It preserves angles between vectors, therefore we can evaluate the vMF density without performing the map explicitly, by simply setting $\theta_k := (\beta_k - C)/\|\beta_k - C\|$. The following lemma formalizes this idea.\nLemma 1. $\Gamma : \{\beta = (\beta_1, \ldots, \beta_V) \in \Delta^{V-1} : \beta_i = 0 \text{ for some } i\} \to \{\theta \in S^{V-1} : \mathbf{1}_V^T \theta = 0\}$ is a homeomorphism, where $\Gamma(\beta) = (\beta - C)/\|\beta - C\|_2$ and $\Gamma^{-1}(\theta) = \theta / \max_i(-\theta_i/c_i) + C$, for any $C = (c_1, \ldots, c_V) \in \Delta^{V-1}$.\nProofs of this lemma and subsequent technical results are given in the Supplement. The intuition behind the construction is provided via Figure 1, which gives a geometric illustration for V = 3, with the vocabulary simplex $\Delta^{V-1}$ shown as a red triangle. Two topics on the boundary (face) of the vocabulary simplex are $\beta_1 = C + \tilde\beta_1$ and $\beta_2 = C + \tilde\beta_2$. The green dot C is the reference point, and $\alpha = \angle(\tilde\beta_1, \tilde\beta_2)$. In Fig. 1 (left) we move C by translation to the origin and rotate $\Delta^{V-1}$ from the xyz plane to the xy plane. In Fig. 1 (center left) we show the resulting image of $\Delta^{V-1}$ and add a unit sphere (blue) in $\mathbb{R}^2$. Corresponding to $\beta_1, \beta_2$ are the points $\theta_1, \theta_2$ on the sphere with $\angle(\theta_1, \theta_2) = \alpha$. Now, applying the inverse translation and rotation to both $\Delta^{V-1}$ and $S^{V-2}$, the result is shown in Fig. 1 (center right): we are back in $\mathbb{R}^3$ and $\angle(\theta_1, \theta_2) = \angle(\tilde\beta_1, \tilde\beta_2) = \alpha$, where $\theta_k = (\beta_k - C)/\|\beta_k - C\|_2$. In Fig. 1 (right) we give a geometric illustration of the temporal dynamics.\nAs described above, each topic evolves along a random trajectory residing on a unit sphere, so the evolution of a collection of topics can be modeled by a collection of corresponding trajectories on the sphere. Note that the number of "active" topics may be unknown and vary over time. Moreover, a topic may be activated, become dormant, and then resurface after some time. 
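Lemma 1 is simple to implement. Below is a minimal sketch (our own notation, not the authors' released code), using the inverse in the equivalent form $\Gamma^{-1}(\theta) = \theta / \max_i(-\theta_i/c_i) + C$, with a round-trip check on a boundary point of the simplex:

```python
import numpy as np

def Gamma(beta, C):
    # Γ(β) = (β - C) / ||β - C||_2 : boundary of the simplex -> unit sphere
    d = beta - C
    return d / np.linalg.norm(d)

def Gamma_inv(theta, C):
    # inverse map: scale θ until some coordinate of C + r·θ hits zero
    r = 1.0 / np.max(-theta / C)
    return r * theta + C

C = np.array([0.4, 0.3, 0.2, 0.1])        # reference point inside the simplex
beta = np.array([0.5, 0.0, 0.3, 0.2])     # topic on the boundary (one zero coordinate)
theta = Gamma(beta, C)
assert abs(np.linalg.norm(theta) - 1.0) < 1e-12   # lands on the unit sphere
assert abs(theta.sum()) < 1e-12                   # and on the hyperplane 1ᵀθ = 0
assert np.allclose(Gamma_inv(theta, C), beta)     # the map is invertible
```

The assertions confirm what the lemma states: the image lies on the unit sphere intersected with the hyperplane $\mathbf{1}_V^T \theta = 0$, and the map round-trips exactly.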
New modeling elements are introduced in the next section to account for these phenomena.\n\n3 Hierarchical Bayesian modeling for single or multiple topic polytopes\n\nWe present a sequence of models of increasing complexity: we start by introducing a hierarchical model for online learning of the temporal dynamics of a single topic polytope, allowing for a varying number of vertices over time. Next, we give a static model for multiple topic polytopes learned on different corpora, drawing on a common pool of global topics. Finally, we present a "full" model for the evolution of global topic trajectories over time and across groups of corpora.\n\n3.1 Dynamic model for a single topic polytope\n\nAt a high level, our model maintains a collection of global trajectories taking values on a unit sphere. Each trajectory is endowed with the von Mises-Fisher dynamics described in the previous section. At each time point, a random topic polytope is constructed by selecting a (random) subset of points on the trajectories evaluated at time t. The random selection is guided by a Beta-Bernoulli process prior [24]. This construction is motivated by a modeling technique of Nguyen [17], who studied a Bayesian hierarchical model for inference of smooth trajectories on a Euclidean domain using Dirichlet process priors. Our generative model, using the Beta-Bernoulli process as a building block, is more appropriate for the purpose of topic discovery. 
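The construction just described (global trajectories on the sphere, Beta-Bernoulli selection of the active subset) can be sketched generatively. This is an illustrative simulation, not the authors' code: it uses truncated stick-breaking weights, a standard representation of the Beta process, and a crude perturb-and-renormalize stand-in for exact vMF transition draws; all sizes and hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, N_trunc, gamma0, tau0 = 10, 5, 50, 3.0, 100.0

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stick-breaking weights of the Beta process: mu_i ~ Beta(gamma0, 1),
# q_i = prod_{j<=i} mu_j; q_i is the activation probability of trajectory i.
mu = rng.beta(gamma0, 1.0, size=N_trunc)
q = np.cumprod(mu)

# Global trajectories on the sphere: theta^{(0)} uniform on the sphere; each
# later step is a small random move followed by renormalization, a crude
# stand-in for an exact vMF(theta^{(t-1)}, tau0) draw.
theta = np.zeros((N_trunc, T + 1, V - 1))
theta[:, 0] = unit(rng.normal(size=(N_trunc, V - 1)))
for t in range(1, T + 1):
    step = rng.normal(size=(N_trunc, V - 1)) / np.sqrt(tau0)
    theta[:, t] = unit(theta[:, t - 1] + step)

# Beta-Bernoulli selection: topic i is active at time t with probability q_i.
active = rng.random((T, N_trunc)) < q[None, :]
# active.sum(axis=1) gives the (random) number of active topics per time point
```

Because the weights q_i decay geometrically, only a finite number of trajectories are active at any time point even though the pool is unbounded, which is what lets the model discover (or retire) topics over time.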
Due to the isometric embedding of $S^{V-2}$ in $\Delta^{V-1}$ described in the previous section, from here on we refer to topics as points on $S^{V-2}$.\nFirst, generate a collection of global topic trajectories using a Beta process prior (cf. Thibaux and Jordan [24])² with a base measure H on the space of trajectories on $S^{V-2}$ and mass parameter $\gamma_0$:\n\n$$Q \mid \gamma_0, H \sim \mathrm{BP}(\gamma_0, H). \qquad (1)$$\n\nIt follows that $Q = \sum_i q_i \delta_{\theta_i}$, where $\{q_i\}_{i=1}^\infty$ follows a stick-breaking construction [23]: $\mu_i \sim \mathrm{Beta}(\gamma_0, 1)$, $q_i = \prod_{j=1}^i \mu_j$, and each $\theta_i \sim H$ is a sequence of T random elements on the unit sphere, $\theta_i := \{\theta_i^{(t)}\}_{t=1}^T$, generated as follows:\n\n$$\theta_i^{(t)} \mid \theta_i^{(t-1)} \sim \mathrm{vMF}(\theta_i^{(t-1)}, \tau_0) \text{ for } t = 1, \ldots, T; \quad \theta_i^{(0)} \sim \mathrm{vMF}(\cdot, 0), \text{ i.e., uniform on } S^{V-2}. \qquad (2)$$\n\nAt any given time t = 1, . . . , T, the process Q induces a marginal measure $Q_t$, whose support is given by the atoms of Q as they are evaluated at time t. Now, select a subset of the global topics that are active at t via the Bernoulli process $\mathcal{T}^{(t)} \mid Q_t \sim \mathrm{BeP}(Q_t)$. Then $\mathcal{T}^{(t)} := \sum_i b_i^{(t)} \delta_{\theta_i^{(t)}}$, where $b_i^{(t)} \mid q_i \sim \mathrm{Bern}(q_i)$ for all i. $\mathcal{T}^{(t)}$ is supported by the atoms $\{\theta_i^{(t)} : b_i^{(t)} = 1,\ i = 1, 2, \ldots\}$, representing the topics active at time t. Finally, assume that noisy measurements of each of these topic variables are generated via:\n\n$$v_k^{(t)} \mid \mathcal{T}^{(t)} \sim \mathrm{vMF}(\mathcal{T}_k^{(t)}, \tau_1), \quad k = 1, \ldots, K^{(t)}, \text{ where } K^{(t)} := \mathrm{card}(\mathcal{T}^{(t)}) \text{ and } \mathcal{T}_k^{(t)} \text{ is the k-th atom of } \mathcal{T}^{(t)}. \qquad (3)$$\n\nNoisy estimates for the topics at any particular time point may come either from the global topics observed up to the previous time point or from a topic yet unexplored. We emphasize that the topics $\{v_k^{(t)}\}_{k=1}^{K^{(t)}}$ for t = 1, . . . 
, T are the quantities we aim to model; hence we refer to our approach as a meta-model. These topics may be learned, for each time point independently, by any stationary topic modeling algorithm, and then transformed to the sphere by applying Lemma 1.\nLet $B^{(t)}$ denote the binary matrix representing the assignment of observed topic estimates to global topics at time point t, i.e., $B_{ik}^{(t)} = 1$ if the vector $v_k^{(t)}$ is a noisy estimate of $\theta_i^{(t)}$. In words, these random variables "link up" the noisy estimates at any time point to the global topics observed thus far. By conditional independence, the joint posterior of the hidden $\theta^{(t)}$ given the observed noisy $v^{(t)}$ is:\n\n$$\mathbb{P}\left(\theta^{(0)}, \{\theta^{(t)}, B^{(t)}\}_{t=1}^T \mid \{v^{(t)}\}_{t=1}^T\right) \propto \mathbb{P}(\theta^{(0)}) \prod_{t=1}^T \mathbb{P}(\theta^{(t)}, B^{(t)} \mid \theta^{(t-1)}, \{B^{(a)}\}_{a=1}^{t-1})\, \mathbb{P}(v^{(t)} \mid \theta^{(t)}, B^{(t)}).$$\n\n²Thibaux and Jordan [24] write BP(c, H), H(Ω) = γ₀; we set c = 1, H = H/γ₀ and write BP(γ₀, H).\n\nAt time t, $\mathbb{P}(\theta^{(t)}, B^{(t)} \mid \theta^{(t-1)}, \{B^{(a)}\}_{a=1}^{t-1})\, \mathbb{P}(v^{(t)} \mid \theta^{(t)}, B^{(t)}) \propto \mathbb{P}(\theta^{(t)}, B^{(t)} \mid \theta^{(t-1)}, v^{(t)}, \{B^{(a)}\}_{a=1}^{t-1}) \propto$
(2)),\n(3) number of new global topics at time t, Lt \u2212 Lt\u22121 \u223c Pois(\u03b30/t), and (4) emission probability\nP(v(t)|\u03b8(t), B(t)) (cf. Eq. (3)). Derivation details are given in the Supplement.\n\ni given \u03b8(t\u22121)\n\ni\n\nStreaming Dynamic Matching (SDM) To perform MAP estimation in the streaming setting, we\nhighlight the connection of the maximization of the posterior (4) to the objective of an optimal\nmatching problem: given a cost matrix, workers should be assigned to tasks, at most one worker per\ntask and one task per worker. The solution of this problem is obtained by employing the well-known\nHungarian algorithm [14]. In the context of dynamic topic modeling, our goal is to match topics\nlearned on the new timestamp to the trajectories of topics learned over the previous timestamps,\nwhere the cost is governed by our model. This connection is formalized by the following.\n\n\uf8f1\uf8f2\uf8f3(cid:107)\u03c41v(t)\nk + \u03c40\u03b8(t\u22121)\n(cid:80)\n\n\u03c41 + log \u03b30\ni,k B(t)\n\ni\n\n(cid:107)2 \u2212 \u03c40 + log m(t\u22121)\nt\u2212m(t\u22121)\n\ni\n\n, i \u2264 Lt\u22121\n\nt \u2212 log(i \u2212 Lt\u22121), Lt\u22121 < i \u2264 Lt\u22121 + K (t)\nik C (t)\n\ni\n\nik subject to the constraints that (a) for each\nconsider the optimization problem maxB(t)\n\ufb01xed i, at most one of B(t)\nik is 1\nand the rest are 0. Then, the MAP estimate for Eq. (4) can be obtained by the Hungarian algorithm,\nwhich solves for ((B(t)\n\nik is 1 and the rest are 0, and (b) for each \ufb01xed k, exactly one of B(t)\n\nas\n\nProposition 1. Given the cost C (t)\n\nik =\n\ni\n\nik )) to obtain \u03b8(t)\nk +\u03c40\u03b8(t\u22121)\nk +\u03c40\u03b8(t\u22121)\n(cid:107)2\n\n,\n\ni\n\ni\n\n\uf8f1\uf8f4\uf8f4\uf8f2\uf8f4\uf8f4\uf8f3\n\n\u03c41v(t)\n(cid:107)\u03c41v(t)\nv(t)\nk ,\n\u03b8(t\u22121)\n\ni\n\nif \u2203 k s.t. B(t)\nif \u2203 k s.t. 
B(t)\notherwise (topic is dormant at t).\n\nik = 1 and i \u2264 Lt\u22121\nik = 1 and i > Lt\u22121 (new topic)\n\n(5)\n\nm \u2208 NV }Mt\n\nWe defer proof to the Supplement. To complete description of the inference we shall discuss how\nnoisy estimates are obtained from the bag-of-words representation of the documents observed at\ntime point t. We choose to use CoSAC [29] algorithm to obtain topics {\u03b2(t)\nk=1 from\n{x(t)\nm=1, collection of Mt documents at time point t. CoSAC is a stationary topic modeling\nalgorithm which can infer number of topics from the data and is computationally ef\ufb01cient for\nmoderately sized corpora. We note that other topic modeling algorithms, e.g., variational inference\n[4] or Gibbs sampling [11, 22], can be used in place of CoSAC. Estimated topics are then transformed\nto {v(t)\na=1 Ma,\nwhere N (a)\nm is the number of words in the corresponding document. Our reference point is simply an\naverage (computed dynamically) of the normalized documents observed thus far. Finally we update\nMAP estimates of global topics dynamics based on Proposition 1. Streaming Dynamic Matching\n(SDM) is summarized in Algorithm 1.\n\nk=1 using Lemma 1 and reference point Ct = (cid:80)t\n\nk \u2208 \u2206V \u22121}K(t)\n\nk \u2208 SV \u22122}K(t)\n\n/(cid:80)t\n\n(cid:80)Ma\n\nx(a)\nm\nN (a)\nm\n\nm=1\n\na=1\n\nAdditional related literature\nutilizing similar technical building blocks in different contexts. Fox\net al. [8] utilized Beta-Bernoulli process in time series modeling to capture switching regimes of an\nautoregressive process, where the corresponding Indian Buffet Process was used to select subsets of\nthe latent states of the Hidden Markov Model. Williamson et al. [27] used Indian Buffet Process in\ntopic models to sparsify document topic proportions. Campbell et al. 
[7] utilized the Hungarian algorithm for streaming mean-field variational inference of the Dirichlet process mixture model.\n\nAlgorithm 1 Streaming Dynamic Matching (SDM)\n1: for t = 1, . . . , T do\n2:   Observe documents $\{x_m^{(t)}\}_{m=1}^{M_t}$\n3:   Estimate topics $\{\beta_k^{(t)}\}_{k=1}^{K^{(t)}} = \mathrm{CoSAC}(\{x_m^{(t)}\}_{m=1}^{M_t})$\n4:   Map topics to the sphere $\{v_k^{(t)}\}_{k=1}^{K^{(t)}}$ (Lemma 1)\n5:   Given $\{\theta_i^{(t-1)}\}_{i=1}^{L_{t-1}}$ and $\{v_k^{(t)}\}_{k=1}^{K^{(t)}}$, compute the cost matrix as in Proposition 1\n6:   Using the Hungarian algorithm, solve the corresponding matching problem to obtain $B^{(t)}$\n7:   Compute $\{\theta_i^{(t)}\}_{i=1}^{L_t}$ as in Eq. (5)\n\n3.2 Beta-Bernoulli Process for multiple topic polytopes\n\nWe now consider meta-modeling in the presence of multiple corpora, each of which maintains its own topic polytope. Large text corpora can often be partitioned based on some grouping criterion, e.g., scientific papers by journal, news by media agency, or tweets by location stamp. In this subsection we model the collection of topic polytopes observed at a single time point by employing the Beta-Bernoulli process prior [24]. The modeling of a collection of polytopes evolving over time is described in the following subsection.\nFirst, generate the global topic measure Q as in Eq. (1). Since we are interested here in only a single time point, the base measure H is simply vMF(·, 0), the uniform distribution over $S^{V-2}$. Next, for each group j = 1, . . . , J, select a subset of the global topics:\n\n$$\mathcal{T}_j \mid Q \sim \mathrm{BeP}(Q), \quad \mathcal{T}_j := \sum_i b_{ji} \delta_{\theta_i}, \text{ where } b_{ji} \mid q_i \sim \mathrm{Bern}(q_i)\ \forall i. \qquad (6)$$\n\nNotice that each group $\mathcal{T}_j := \{\theta_i : b_{ji} = 1, i = 1, 2, \ldots\}$ selects only a subset of the collection of global topics, which is consistent with the idea of partitioning by journals: some topics of ICML are not represented in SIGGRAPH and vice versa. The next step is analogous to Eq. (3):\n\n$$v_{jk} \mid \mathcal{T}_j \sim \mathrm{vMF}(\mathcal{T}_{jk}, \tau_1) \text{ for } k = 1, \ldots, K_j, \text{ where } K_j := \mathrm{card}(\mathcal{T}_j). \qquad (7)$$\n\nWe again use B to denote the binary matrix representing the assignment of global topics to the noisy topic estimates, i.e., $B_{jik} = 1$ if the k-th topic estimate for group j arises as a noisy estimate of global topic $\theta_i$. However, the matching problem is now different from before: we do not have any information about the global topics, as there is no history; instead, we should match a collection of topic polytopes to a global topic polytope. The matrix of topic assignments is distributed a priori as an Indian Buffet Process (IBP) with parameter $\gamma_0$. The conditional probability of the global topics $\theta_i$ and the assignment matrix B given the topic estimates $v_{jk}$ has the following form:\n\n$$\mathbb{P}(B, \theta \mid v) \propto \exp\left(\tau_1 \sum_{j,i,k} B_{jik} \langle \theta_i, v_{jk} \rangle\right) \mathrm{IBP}(\{m_i\}), \text{ where } m_i = \sum_{j,k} B_{jik} \qquad (8)$$\n\nand IBP is the prior (see Eq. (15) in [10]) with $m_i$ denoting the popularity of global topic i.\n\nDistributed Matching (DM) Similar to Section 3.1, we look for point estimates of the topic directions θ and the topic assignment matrix B. Direct computation of the global MAP estimate for Eq. (8) is not straightforward: the problem of matching across groups and topics is not amenable to a closed-form Hungarian algorithm. However, we show that for a fixed group the assignment optimization reduces to a case of the Hungarian algorithm. This motivates using the Hungarian algorithm iteratively, which guarantees convergence to a local optimum.\nProposition 2. 
Given the cost\n\n$$C_{jik} = \begin{cases} \tau_1 \left\|v_{jk} + \sum_{-j,i,k} B_{-jik} v_{-jk}\right\|_2 - \tau_1 \left\|\sum_{-j,i,k} B_{-jik} v_{-jk}\right\|_2 + \log\dfrac{m_{-ji}}{J - m_{-ji}}, & \text{if } i \le L_{-j} \\ \tau_1 + \log(\gamma_0/J) - \log(i - L_{-j}), & \text{if } L_{-j} < i \le L_{-j} + K_j, \end{cases}$$\n\nwhere −j denotes the groups excluding group j and $L_{-j}$ is the number of global topics before group j (due to the exchangeability of the IBP, group j can always be considered last), a locally optimal MAP estimate for Eq. (8) can be obtained by iteratively employing the Hungarian algorithm to solve, for each group j, for the $((B_{jik}))$ which maximize $\sum_{j,i,k} B_{jik} C_{jik}$, subject to the constraints: (a) for each fixed i and j, at most one of $B_{jik}$ is 1 and the rest are 0, and (b) for each fixed k and j, exactly one of $B_{jik}$ is 1 and the rest are 0. After solving for $((B_{jik}))$, $\theta_i$ is obtained as $\theta_i = \sum_{j,k} B_{jik} v_{jk} \big/ \left\|\sum_{j,k} B_{jik} v_{jk}\right\|_2$.\nThe noisy topics for each of the groups can be obtained by applying CoSAC to the corresponding documents, which is trivially parallel. The Distributed Matching algorithm and the proof of Proposition 2 are given in the Supplement.\n\n3.3 Dynamic Hierarchical Beta Process\n\nOur "full" model, the Dynamic Hierarchical Beta Process (dHBP), builds on the constructions described in Sections 3.1 and 3.2 to enable inference of the temporal dynamics of collections of topic polytopes. We start by specifying the upper-level Beta process given by Eq. (1) with base measure H given by Eq. (2). Next, for each group j = 1, . . . , J, we introduce an additional level of hierarchy to model group-specific distributions over topics:\n\n$$Q_j \mid Q \sim \mathrm{BP}(\gamma_j, Q), \quad Q_j := \sum_i p_{ji} \delta_{\theta_i}, \qquad (9)$$\n\nwhere the $p_{ji}$ vary around the corresponding $q_i$. 
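Both matching steps in Propositions 1 and 2 are instances of the classical assignment problem, so an off-the-shelf Hungarian solver applies. Below is a sketch of a single SDM-style update in the spirit of Proposition 1 and Eq. (5), not the authors' implementation: the hyperparameter values are illustrative, and `scipy.optimize.linear_sum_assignment` plays the role of the Hungarian algorithm (iterating such a step over groups gives the flavor of Proposition 2):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unit(x):
    return x / np.linalg.norm(x)

def sdm_step(theta_prev, m_prev, v, t, tau0=5.0, tau1=10.0, gamma0=1.0):
    """Match K noisy topic estimates v (rows, unit vectors) to L global
    trajectories theta_prev with popularity counts m_prev, or open new topics."""
    L, K = len(theta_prev), len(v)
    cost = np.empty((L + K, K))
    for i in range(L):   # existing-topic rows (first case of the cost)
        cost[i] = (np.linalg.norm(tau1 * v + tau0 * theta_prev[i], axis=1)
                   - tau0 + np.log(m_prev[i] / (t - m_prev[i])))
    for r in range(K):   # new-topic rows (second case of the cost)
        cost[L + r] = tau1 + np.log(gamma0 / t) - np.log(r + 1)
    rows, cols = linear_sum_assignment(cost, maximize=True)  # Hungarian solver

    theta, m = list(theta_prev), list(m_prev)
    for i, k in zip(rows, cols):
        if i < L:                         # topic i matched: move it toward v_k
            theta[i] = unit(tau1 * v[k] + tau0 * theta[i])
            m[i] += 1
        else:                             # a new global topic is born at v_k
            theta.append(v[k])
            m.append(1)
    return np.array(theta), np.array(m)   # unmatched topics stay dormant
```

Each observed topic is forced to pick exactly one row (an existing trajectory or a fresh one), while each trajectory absorbs at most one observation, mirroring the constraints of Proposition 1.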
The distributional properties of the $p_{ji}$ are described in [24]. At any given time t, each group j selects a subset from the common pool of global topics:\n\n$$\mathcal{T}_j^{(t)} \mid Q_{jt} \sim \mathrm{BeP}(Q_{jt}), \quad \mathcal{T}_j^{(t)} := \sum_i b_{ji}^{(t)} \delta_{\theta_i^{(t)}}, \text{ where } b_{ji}^{(t)} \mid p_{ji} \sim \mathrm{Bern}(p_{ji})\ \forall i. \qquad (10)$$\n\nLet $\mathcal{T}_j^{(t)} := \{\theta_i^{(t)} : b_{ji}^{(t)} = 1, i = 1, 2, \ldots\}$ be the corresponding collection of atoms, i.e., the topics active at time t in group j. Noisy measurements of these topics are generated by:\n\n$$v_{jk}^{(t)} \mid \mathcal{T}_j^{(t)} \sim \mathrm{vMF}(\mathcal{T}_{jk}^{(t)}, \tau_1) \text{ for } k = 1, \ldots, K_j^{(t)}, \text{ where } K_j^{(t)} := \mathrm{card}(\mathcal{T}_j^{(t)}). \qquad (11)$$\n\nThe conditional distribution of the global topics at t given the state of the global topics at t − 1 is\n\n$$\mathbb{P}(\theta^{(t)}, B^{(t)} \mid \theta^{(t-1)}, v^{(t)}, \{B^{(a)}\}_{a=1}^{t-1}) \propto F(\{m_{ji}^{(t-1)}\}, \{m_{ji}^{(t)}\}) \cdot \exp\left(\sum_{j,i,k} \tau_1 B_{jik}^{(t)} \langle \theta_i^{(t)}, v_{jk}^{(t)} \rangle\right) \cdot \exp\left(\tau_0 \sum_i \langle \theta_i^{(t)}, \theta_i^{(t-1)} \rangle\right), \qquad (12)$$\n\nwhere $F(\{m_{ji}^{(t-1)}\}, \{m_{ji}^{(t)}\})$ is the prior term dependent on the history of popularity counts from the current and previous time points. Analogous to the Chinese Restaurant Franchise [22], one can think of an Indian Buffet Franchise in the case of the HBP: a headquarters buffet provides some dishes each day and the local branches serve a subset of those dishes. 
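The group-level selection of Eqs. (9)-(10) can be simulated directly given the global weights. We assume here the Beta-distributed atom weights of the hierarchical Beta process, $p_{ji} \sim \mathrm{Beta}(\gamma_j q_i, \gamma_j (1 - q_i))$ (cf. Thibaux and Jordan [24]), under which $\mathbb{E}[p_{ji}] = q_i$; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
J, gamma_j = 4, 10.0
q = np.array([0.8, 0.5, 0.1])   # global topic weights q_i (illustrative)

# Group-specific weights concentrate around q_i as gamma_j grows:
# p_ji ~ Beta(gamma_j * q_i, gamma_j * (1 - q_i)), so E[p_ji] = q_i.
p = rng.beta(gamma_j * q, gamma_j * (1.0 - q), size=(J, len(q)))

# Each group j then activates topic i with probability p_ji, as in Eq. (10).
b = rng.random((J, len(q))) < p
```

The groups thus share the same pool of atoms but differ in which atoms they activate, which is exactly the "local branches serve a subset of the headquarters' dishes" picture above.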
Although this analogy seems intuitive, we are not aware of a corresponding Gibbs sampler, and it remains a question for future study. Therefore, unfortunately, we are unable to handle this prior term directly and instead propose a heuristic replacement: stripping away the popularity of topics across groups and considering only group-specific topic popularity (the groups still remain dependent through the atom locations).\n\nStreaming Dynamic Distributed Matching (SDDM) We combine our results to perform approximate inference for the model of Section 3.3. Using the Hungarian algorithm, iterating over the groups at time t, we obtain estimates of $((B_{jik}^{(t)}))$ based on the following cost:\n\n$$C_{jik}^{(t)} = \begin{cases} \left\|\tau_1 v_{jk}^{(t)} + \tau_1 \sum_{-j,i,k} B_{-jik}^{(t)} v_{-jk}^{(t)} + \tau_0 \theta_i^{(t-1)}\right\|_2 - \left\|\tau_1 \sum_{-j,i,k} B_{-jik}^{(t)} v_{-jk}^{(t)} + \tau_0 \theta_i^{(t-1)}\right\|_2 + \log\dfrac{1 + m_{ji}^{(t)}}{t - m_{ji}^{(t)}}, & \text{if } i \le L_{-j}^{(t)} \\ \tau_1 + \log(\gamma_0/J) - \log(i - L_{-j}^{(t)}), & \text{if } L_{-j}^{(t)} < i \le L_{-j}^{(t)} + K_j^{(t)}, \end{cases}$$\n\nwhere $m_{ji}^{(t)}$ denotes the popularity of topic i in group j up to time t (the plus one indicates that global topic i exists even when $m_{ji}^{(t)} = 0$). We then compute the global topic estimates $\theta_i^{(t)} = \left(\tau_1 \sum_{j,k} B_{jik}^{(t)} v_{jk}^{(t)} + \tau_0 \theta_i^{(t-1)}\right) \big/ \left\|\tau_1 \sum_{j,k} B_{jik}^{(t)} v_{jk}^{(t)} + \tau_0 \theta_i^{(t-1)}\right\|_2$. At time point t, the noisy topics for each of the groups can be obtained by applying CoSAC to the corresponding documents in parallel. The SDDM algorithm and the cost derivations are presented in the Supplement.\n\n4 Experiments\n\nWe study the ability of our models to learn latent temporal dynamics and discover new topics that change over time. 
Next, we show that our models scale well by exploiting the temporal and group structure inherent in the data. We also study hyperparameter choices. We analyze two datasets: the Early Journal Content (http://www.jstor.org/dfr/about/sample-datasets), and a collection of Wikipedia articles partitioned by category and in time according to their popularity.\n\nFigure 2: Qualitative examples of topics learned by the SDM and DM algorithms on the EJC data. (a) SDM, Epidemics: evolution of the top 15 words; (b) DM, Law: matched topics from journals.\n\n4.1 Temporal Dynamics and Topic Discovery\n\nEarly Journal Content. The Early Journal Content (EJC) dataset spans the years 1665 to 1922. Years before 1882 contain very few articles, so we aggregated them into a single time point. After preprocessing, the dataset has 400k scientific articles from over 400 unique journals. The vocabulary was truncated to 4516 words. We set all articles from the last available year (1922) aside for testing purposes.\n\nCase study: epidemics. The beginning of the 20th century is known for a vast history of disease epidemics of various kinds, such as smallpox, typhoid and yellow fever, to name a few. Vaccines or effective treatments for the majority of them were developed shortly after. One of the journals represented in the EJC dataset is the "Public Health Report"; however, publications from it are only available starting in 1896. The primary objective of the journal was to reflect epidemic disease infections. As one of the goals of our modeling approach is topic discovery, we verify that the model can discover an epidemics-related topic around 1896. Figure 2a shows that SDM correctly discovered a new topic in 1896 semantically related to epidemics. We plot the evolution of the probabilities of the top 15 words in this topic across time. 
We observe that the word "typhoid" increases in probability towards 1910 in the "epidemics" topic, which aligns with historical events such as Typhoid Mary in 1907 and the chlorination of public drinking water in the US in 1908 for controlling typhoid fever. The probability of "tuberculosis" also increases, aligning with the foundation of the National Association for the Study and Prevention of Tuberculosis in 1904.

Case study: law. Some of the EJC journals are related to the topic of law. Our DM algorithm identified a global topic semantically similar to law by matching similar topics present in 32 out of the 417 journals. In Figure 2b we present the learned global topic and 4 examples of the matched local topics with the corresponding journal names. Our algorithm correctly identified that these journals share a law topic.

4.2 Scalability

Wiki Corpus. We collected articles from Wikipedia, their page view counts for the 12 months of 2017, and category information (e.g., Arts, History). We used categories as groups and partitioned the data across time according to the page view counts. Dataset construction details are given in the Supplement. The total number of documents is about 3 million, and we reduced the vocabulary to 7359 words similarly to [12]. For testing we set aside documents from the category Art from December 2017.

Figure 3: Comparison on Wiki Data (20 cores).

Modeling Grouping. In Fig. 3 we present comparisons on Wiki data: CoSAC [29] vs. DM in the static distributed setting and SDM vs. SDDM in the dynamic streaming setting. Fig. 3 (left) shows that for data accessible in groups, DM outperforms CoSAC by ∼25×, as DM runs CoSAC on different data groups in parallel and then matches the outputs.
Matching time adds only a small overhead compared to the runtime of CoSAC. Similarly, in Fig. 3 (right), SDDM is ∼6× faster than SDM, since SDDM can process documents of different groups in parallel and interleaves CoSAC with matching: while matching is being performed on data groups with timestamp t, CoSAC can process the data that arrives with timestamp t + 1 in parallel.

Table 1: Modeling topics of EJC || Modeling Wikipedia articles

                 EJC                              | Wikipedia
        Perplexity  Time     Topics  Cores        | Perplexity  Time       Topics  Cores
SDM     1179        22min    125     1            | 1254        2.4hours   182     1
DM      1361        5min     125     20           | 1260        15min      182     20
SDDM    1241        2.3min   103     20           | 1201        20min      238     20
DTM     1194        56hours  100     1            | NA          >72hours   100     1
SVB     1840        3hours   100     20           | 1219        29.5hours  100     20
CoSAC   1191        51min    132     1            | 1227        4.4hours   173     1

Modeling temporality also benefits scalability.
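The interleaving of CoSAC and matching described above can be sketched as a simple driver loop. This is an illustrative stand-in, not the released implementation; `extract_topics` and `match` are hypothetical placeholder callables, and thread-based overlap stands in for the paper's multi-core setup.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(batches, extract_topics, match, init_state):
    """Illustrative SDDM-style schedule: extraction for the batch at time t+1
    is submitted before matching for time t runs, so the two stages overlap.
    batches[t] is a list of per-group document collections at timestamp t."""
    def extract_batch(batch):
        # In the paper each group is processed by CoSAC in parallel;
        # here a simple list comprehension stands in for that step.
        return [extract_topics(group) for group in batch]

    state = init_state
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(extract_batch, batches[0])
        for t in range(len(batches)):
            local_topics = pending.result()       # extraction for time t done
            if t + 1 < len(batches):              # kick off time t+1 now...
                pending = pool.submit(extract_batch, batches[t + 1])
            state = match(state, local_topics)    # ...while matching time t
    return state
```

With dummy callables, e.g. `extract_topics=lambda g: g * 10` and `match=lambda s, loc: s + sum(loc)`, the driver visits every batch exactly once while keeping the two stages pipelined.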
We compare our methods with other topic models on both the Wiki and EJC datasets: Streaming Variational Bayes (SVB) [5] and Dynamic Topic Models (DTM) [3], trained with 100 topics. Perplexity scores on the held-out data, training times, computing resources, and numbers of topics are reported in Table 1. On the Wiki dataset, SDDM took only 20min to process approximately 3 million documents, which is much faster than the other approaches. Regarding perplexity scores, SDDM generally outperforms DM, which suggests that modeling time is beneficial. For the EJC dataset, SDM outperforms SDDM. Modeling groups might negatively affect perplexity because the majority of the EJC journals (groups) have very few articles (i.e., fewer than 100, a setup challenging for many topic modeling algorithms). On the Wiki corpus each category (group) has a sufficient amount of training documents, and the time-group partitioning considered by SDDM achieves the best perplexity score.

4.3 Parameter choices

The rate of topic dynamics in SDM and SDDM is effectively controlled by τ0, where smaller values imply a higher dynamics rate. Parameter τ1 controls the variance of local topics around the corresponding global topics in all of our models. This variance dictates how likely a local topic is to be matched to an existing global topic. When this variance is small, the model will tend to identify local topics as new global topics more often. Lastly, γ0 affects the probability of new topic discovery, which scales with time and the number of groups. In the preceding experiments we set τ0 = 2, τ1 = 1, γ0 = 1 for SDM; τ1 = 2, γ0 = 1 for DM; and τ0 = 4, τ1 = 2, γ0 = 2 for SDDM. In Figure 4 we show heatmaps of perplexity and the number of learned topics, fixing γ0 = 1 and varying τ0 and τ1.
We see that for large τ1, SDM identifies more topics to fit the smaller variability constraint imposed by the parameter.

Fig. 4: SDM parameters. (a) EJC perplexity. (b) # of topics in EJC.

5 Discussion and Conclusion

Our work suggests the naturalness of incorporating sophisticated Bayesian nonparametric techniques in the inference of rich latent geometric structures of interest. We demonstrated the feasibility of approximate nonparametric learning at scale by utilizing suitable geometric representations and devising fast algorithms for obtaining reasonable point estimates of such representations. Further directions include incorporating more meaningful geometric features into the models (e.g., via more elaborate base measure modeling for the Beta process) and developing efficient algorithms for full Bayesian inference. For instance, the latent geometric structure of the problem is solely encoded in the base measure. We want to explore choices of base measures for other geometric structures such as collections of k-means centroids, principal components, etc. Once an appropriate base measure is constructed, our Beta process based models can be utilized to enable a new class of Bayesian nonparametric models amenable to scalable inference and suitable for the analysis of large datasets.
In our concurrent work we have utilized a model construction similar to the one from Section 3.2 to perform federated learning of neural networks trained on heterogeneous data [31] and proposed a general framework for model fusion [30].

Acknowledgments

This research is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303, a research gift from Adobe Research, and a Margaret and Herman Sokol Faculty Award to XN.

References

[1] Ahmed, A. and Xing, E. P. (2012). Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. arXiv preprint arXiv:1203.3463.

[2] Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6, 1345–1382.

[3] Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120.

[4] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.

[5] Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013). Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735.

[6] Bryant, M. and Sudderth, E. B. (2012). Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pages 2699–2707.

[7] Campbell, T., Straub, J., Fisher III, J. W., and How, J. P. (2015).
Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Information Processing Systems, pages 280–288.

[8] Fox, E., Jordan, M. I., Sudderth, E. B., and Willsky, A. S. (2009). Sharing features among dynamical systems with Beta processes. In Advances in Neural Information Processing Systems, pages 549–557.

[9] Ghahramani, Z. and Griffiths, T. L. (2005). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.

[10] Griffiths, T. L. and Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12, 1185–1224.

[11] Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. PNAS, 101(suppl. 1), 5228–5235.

[12] Hoffman, M., Bach, F. R., and Blei, D. M. (2010). Online learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pages 856–864.

[13] Hong, L., Dom, B., Gurumurthy, S., and Tsioutsiouliklis, K. (2011). A time-dependent topic model for multiple text streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 832–840. ACM.

[14] Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2), 83–97.

[15] Mardia, K. V. and Jupp, P. E. (2009). Directional Statistics, volume 494. John Wiley & Sons.

[16] Newman, D., Smyth, P., Welling, M., and Asuncion, A. U. (2008). Distributed inference for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pages 1081–1088.

[17] Nguyen, X. (2010). Inference of global clusters from locally distributed data. Bayesian Analysis, 5(4), 817–845.

[18] Nguyen, X. (2015).
Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1), 618–646.

[19] Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959.

[20] Reisinger, J., Waters, A., Silverthorn, B., and Mooney, R. J. (2010). Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning, pages 903–910.

[21] Tang, J., Meng, Z., Nguyen, X., Mei, Q., and Zhang, M. (2014). Understanding the limiting factors of topic modeling via posterior contraction analysis. In Proceedings of the 31st International Conference on Machine Learning, pages 190–198.

[22] Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476).

[23] Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In Artificial Intelligence and Statistics, pages 556–563.

[24] Thibaux, R. and Jordan, M. I. (2007). Hierarchical Beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, pages 564–571.

[25] Wang, C., Paisley, J., and Blei, D. (2011). Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 752–760.

[26] Wang, X. and McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 424–433. ACM.

[27] Williamson, S., Wang, C., Heller, K. A., and Blei, D. M. (2010). The IBP compound Dirichlet process and its application to focused topic modeling.
In Proceedings of the 27th International Conference on Machine Learning, pages 1151–1158.

[28] Yurochkin, M. and Nguyen, X. (2016). Geometric Dirichlet Means Algorithm for topic inference. In Advances in Neural Information Processing Systems, pages 2505–2513.

[29] Yurochkin, M., Guha, A., and Nguyen, X. (2017). Conic Scan-and-Cover algorithms for nonparametric topic modeling. In Advances in Neural Information Processing Systems, pages 3881–3890.

[30] Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., and Hoang, N. (2019a). Statistical model aggregation via parameter matching. In Advances in Neural Information Processing Systems, pages 10954–10964.

[31] Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. (2019b). Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pages 7252–7261.

[32] Yurochkin, M., Guha, A., Sun, Y., and Nguyen, X. (2019c). Dirichlet simplex nest and geometric inference. In International Conference on Machine Learning, pages 7262–7271.