{"title": "Statistical Model Aggregation via Parameter Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 10956, "page_last": 10966, "abstract": "We consider the problem of aggregating models learned from sequestered, possibly heterogeneous datasets. Exploiting tools from Bayesian nonparametrics, we develop a general meta-modeling framework that learns shared global latent structures by identifying correspondences among local model parameterizations. Our proposed framework is model-independent and is applicable to a wide range of model types. After verifying our approach on simulated data, we demonstrate its utility in aggregating Gaussian topic models, hierarchical Dirichlet process based hidden Markov models, and sparse Gaussian processes with applications spanning text summarization, motion capture analysis, and temperature forecasting.", "full_text": "Statistical Model Aggregation via\n\nParameter Matching\n\nMikhail Yurochkin1,2\n\nmikhail.yurochkin@ibm.com\n\nMayank Agarwal1,2\n\nmayank.agarwal@ibm.com\n\nSoumya Ghosh1,2,3\nghoshso@us.ibm.com\n\nKristjan Greenewald1,2\n\nkristjan.h.greenewald@ibm.com\n\nTrong Nghia Hoang1,2\n\nnghiaht@ibm.com\n\nIBM Research,1 MIT-IBM Watson AI Lab,2 Center for Computational Health3.\n\nAbstract\n\nWe consider the problem of aggregating models learned from sequestered, pos-\nsibly heterogeneous datasets. Exploiting tools from Bayesian nonparametrics,\nwe develop a general meta-modeling framework that learns shared global latent\nstructures by identifying correspondences among local model parameterizations.\nOur proposed framework is model-independent and is applicable to a wide range\nof model types. 
After verifying our approach on simulated data, we demonstrate its\nutility in aggregating Gaussian topic models, hierarchical Dirichlet process based\nhidden Markov models, and sparse Gaussian processes with applications spanning\ntext summarization, motion capture analysis, and temperature forecasting.1\n\n1\n\nIntroduction\n\nOne is often interested in learning from groups of heterogeneous data produced by related, but\nunique, generative processes. For instance, consider the problem of discovering shared topics from a\ncollection of documents, or extracting common patterns from physiological signals of a cohort of\npatients. Learning such shared representations can be relevant to many heterogeneous, federated, and\ntransfer learning tasks. Hierarchical Bayesian models [3, 12, 29] are widely used for performing such\nanalyses, as they are able to both naturally model heterogeneity in data and share statistical strength\nacross heterogeneous groups.\nHowever, when the data is large and scattered across disparate silos, as is increasingly the case in\nmany real-world applications, use of standard hierarchical Bayesian machinery becomes fraught with\ndif\ufb01culties. In addition to costs associated with moving large volumes of data, the computational\ncost of full Bayesian inference may be prohibitive. Moreover, pooling sequestered data may also\nbe undesirable owing to concerns such as privacy [11]. While distributed variants [18] have been\ndeveloped, they require frequent communication with a central server and hence are restricted to\nsituations where suf\ufb01cient communication bandwidth is available. Yet others [26] have proposed\nfederated learning algorithms to deal with such scenarios. 
However, these algorithms tend to be\nbespoke and can require signi\ufb01cant modi\ufb01cations based on the models being federated.\nMotivated by these challenges, in this paper we develop Bayesian nonparametric meta-models that\nare able to coherently combine models trained on independent partitions of data (model fusion).\nRelying on tools from Bayesian nonparametrics (BNP), our meta model treats the parameters of\nthe locally trained models as noisy realizations of latent global parameters, of which there can be\n\n1Code: https://github.com/IBM/SPAHM\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fin\ufb01nitely many. The generative process is formally characterized through a Beta-Bernoulli process\n(BBP) [31]. Model fusion, rather than being an ad-hoc procedure, then reduces to posterior inference\nover the meta-model. Governed by the BBP posterior, the meta-model allows local parameters to\neither match existing global parameters or create new ones. This ability to grow or shrink the number\nof parameters is crucial for combining local models of varying complexity \u2013 for instance, hidden\nMarkov models with differing numbers of states.\nOur construction provides several key advantages over alternatives in terms of scalability and\n\ufb02exibility. First, scaling to large data through parallelization is trivially easy in our framework. One\nwould simply train the local models in parallel and fuse them. Armed with a Hungarian algorithm-\nbased ef\ufb01cient MAP inference procedure for the BBP model, we \ufb01nd that our train-in-parallel and fuse\nscheme affords signi\ufb01cant speedups. Since our model fusion procedure is independent of the learning\nand inference algorithms that may have been used to train individual models, we can seamlessly\ncombine models trained using disparate algorithms. 
Furthermore, since we only require access to\ntrained local models and not the original data, our framework is also applicable in cases where only\npre-trained models are available but not the actual data, a setting that is dif\ufb01cult for existing federated\nor distributed learning algorithms.\nFinally, we note that our development is largely agnostic to the form of the local models and is reusable\nacross a wide variety of domains. In fact, up to the choice of an appropriate base measure to describe\nthe local parameters, the exact same algorithm can be used for fusion across qualitatively different\nsettings. We illustrate this \ufb02exibility by demonstrating pro\ufb01ciency at combining a diverse class of\nmodels, which include sparse Gaussian processes, mixture models, topic models and hierarchical\nDirichlet process based hidden Markov models.\n\n2 Background and Related Work\n\nHere, we brie\ufb02y review the building blocks of our approach and highlight the differences of our\napproach from existing work.\n\nIndian Buffet Process and the Beta Bernoulli Process The Indian buffet process (IBP) speci\ufb01es a\ndistribution over sparse binary matrices with in\ufb01nitely many columns [17]. It is commonly described\nthrough the following culinary metaphor. Imagine J customers arrive sequentially at a buffet and\nchoose dishes to sample. The \ufb01rst customer to arrive samples Poisson(\u03b30) dishes. The j-th subsequent\ncustomer then tries each of the dishes selected by previous customers with probability proportional to\nthe dish\u2019s popularity, and then additionally samples Poisson(\u03b30/j) new dishes that have not yet been\nsampled by any customer. Thibaux and Jordan [31] showed that the de Finetti mixing distribution\nunderlying the IBP is a Beta Bernoulli Process (BBP). 
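As a concrete illustration, the culinary metaphor above can be simulated in a few lines. This is a sketch of our own (function name and defaults are ours, not from the paper's code release): customer j tries each previously sampled dish i with probability m_i / j and then samples Poisson(gamma0 / j) new dishes.

```python
import numpy as np

def sample_ibp(num_customers, gamma0, seed=None):
    """Simulate the IBP culinary metaphor: customer j tries a previously
    sampled dish i with probability m_i / j (m_i = its popularity) and
    then samples Poisson(gamma0 / j) brand-new dishes."""
    rng = np.random.default_rng(seed)
    popularity = []          # m_i for each dish seen so far
    choices = []             # per-customer selections
    for j in range(1, num_customers + 1):
        row = [int(rng.random() < m / j) for m in popularity]
        for i, taken in enumerate(row):
            popularity[i] += taken
        new_dishes = rng.poisson(gamma0 / j)
        popularity.extend([1] * new_dishes)
        row.extend([1] * new_dishes)
        choices.append(row)
    Z = np.zeros((num_customers, len(popularity)), dtype=int)
    for j, row in enumerate(choices):
        Z[j, :len(row)] = row
    return Z
```

The expected total number of dishes grows like the harmonic sum gamma0 * sum_j 1/j, i.e. logarithmically in the number of customers, which is the sparse growth the nonparametric prior exploits.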
Let $Q$ be a random measure drawn from a Beta process, $Q \mid \alpha, \gamma_0, H \sim \mathrm{BP}(\alpha, \gamma_0 H)$, with mass parameter $\gamma_0$, base measure $H$ over $\Omega$ such that $H(\Omega) = 1$, and concentration parameter $\alpha$. It can be shown that $Q$ is a discrete measure $Q = \sum_i q_i \delta_{\theta_i}$ formed by a countably infinite set of (weight, atom) pairs $(q_i, \theta_i) \in [0,1] \times \Omega$. The weights $\{q_i\}_{i=1}^{\infty}$ are distributed by a stick-breaking process [30], $\nu_j \sim \mathrm{Beta}(\gamma_0, 1)$ i.i.d., $q_i = \prod_{j=1}^{i} \nu_j$, and the atoms $\theta_i$ are drawn i.i.d. from $H$. Subsets of atoms in $Q$ are then selected via a Bernoulli process. That is, each subset $\mathcal{T}_j$, $j = 1, \ldots, J$, is characterized by a Bernoulli process with base measure $Q$, $\mathcal{T}_j \mid Q \sim \mathrm{BeP}(Q)$. Consequently, subset $\mathcal{T}_j$ is also a discrete measure formed by pairs $(b_{ji}, \theta_i) \in \{0,1\} \times \Omega$, $\mathcal{T}_j := \sum_i b_{ji}\delta_{\theta_i}$, where $b_{ji} \mid q_i \sim \mathrm{Bernoulli}(q_i)\ \forall i$ is a binary random variable indicating whether atom $\theta_i$ belongs to subset $\mathcal{T}_j$. The collection of such subsets is then said to be distributed by a Beta-Bernoulli process. Marginalizing over the Beta process distributed $Q$ we recover the predictive distribution

$$\mathcal{T}_J \mid \mathcal{T}_1, \ldots, \mathcal{T}_{J-1} \sim \mathrm{BeP}\Big(\frac{\alpha\gamma_0}{J+\alpha-1} H + \sum_i \frac{m_i}{J+\alpha-1}\delta_{\theta_i}\Big),$$

where $m_i = \sum_{j=1}^{J-1} b_{ji}$ (dependency on $J$ is suppressed for notational simplicity), which can be shown to be equivalent to the IBP. Our work is related to recent advances [33] in efficient BBP MAP inference.

Distributed, Decentralized, and Federated Learning  Similarly to us, federated and distributed learning approaches also attempt to learn from sequestered data. 
These approaches roughly fall into two groups: those [9, 15, 21, 22, 23] that decompose a global, centralized learning objective into localized ones that can be optimized separately using local data, and those that iterate between training local models on private data sources and distilling them into a global model [8, 6, 18, 26]. The former group carefully exploits properties of the local models being combined. It is unclear how methods developed for a particular class of local models (for example, Gaussian processes) can be adapted to a different class of models (say, hidden Markov models). More recently, [32] also exploited a BBP construction for federated learning, but was restricted to neural networks. Alternatively, [20] follows a different development that requires local models of different classes to be distilled into the same class of surrogate models before aggregating them, which, however, accumulates local distillation error (especially when the number of local models is large). Members of the latter group require frequent communication with a central server, are poorly suited to bandwidth-limited cases, and are not applicable when the pretrained models cannot share their associated data. Others [5] have proposed decentralized approximate Bayesian algorithms. However, unlike us, they assume that each of the local models has the same number of parameters, which is unsuitable for federating models with different complexities.

3 Bayesian Nonparametric Meta Model

We propose a Bayesian nonparametric meta model based on the Beta-Bernoulli process [31]. In seeking a "meta model", our goal will be to describe a model that generates collections of parameters that describe the local models. 
This meta model can then be used to infer the parameters of a global model from a set of local models learned independently on private datasets.
Our key assumption is that there is an unknown shared set of parameters of unknown size across datasets, which we call global parameters, and we are able to learn subsets of noisy realizations of these parameters from each of the datasets, which we call local parameters. The noise in local parameters is motivated by estimation error due to finite sample size and by variations in the distributions of each of the datasets. Additionally, local parameters are allowed to be permutation invariant, which is the case in a variety of widely used models (e.g., any mixture or an HMM).
We start with a Beta process prior on the collection of global parameters, $G \sim \mathrm{BP}(\alpha, \gamma_0 H)$; then $G = \sum_i p_i \delta_{\theta_i}$, $\theta_i \sim H$, where $H$ is a base measure, the $\theta_i$ are the global parameters, and the $p_i$ are the stick-breaking weights. To devise a meta-model applicable to a broad range of existing models, we do not assume any specific base measure and instead proceed with a general exponential family base measure,

$$p_\theta(\theta \mid \tau, n_0) = H(\tau, n_0)\exp(\tau^T\theta - n_0 A(\theta)). \quad (1)$$

Local models do not necessarily have to use all global parameters; e.g., a hidden Markov model for a given time series may only contain a subset of the latent dynamic behaviors observed across a collection of time series. We use a Bernoulli process to allow the $J$ local models to select subsets of the global parameters,

$$Q_j \mid G \sim \mathrm{BeP}(G) \quad \text{for } j = 1, \ldots, J. \quad (2)$$

Then $Q_j = \sum_i b_{ji}\delta_{\theta_i}$, where $b_{ji} \mid p_i \sim \mathrm{Bern}(p_i)$, is a random measure representing the subset of global parameters characterizing model $j$. We denote the corresponding subset of indices of the global parameters induced by $Q_j$ as $C_j = \{i : b_{ji} = 1\}$. 
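Up to this point, the generative process (global atoms from the base measure, stick-breaking weights, Bernoulli subsets $Q_j$) can be sketched in a truncated simulation. This is our own illustration, not the paper's code: we use a Gaussian base measure as in Section 4.1, the stick-breaking construction stated in Section 2, and a truncation level of our choosing.

```python
import numpy as np

def sample_meta_model(J, gamma0=5.0, mu0=0.0, sigma0=1.0, trunc=200, seed=None):
    """Truncated sketch of the BBP meta model: global atoms theta_i ~ H,
    stick-breaking weights p_i, and Bernoulli subsets Q_j per local model."""
    rng = np.random.default_rng(seed)
    nu = rng.beta(gamma0, 1.0, size=trunc)       # nu_k ~ Beta(gamma0, 1)
    p = np.cumprod(nu)                           # p_i = prod_{k<=i} nu_k
    theta = rng.normal(mu0, sigma0, size=trunc)  # theta_i ~ H = N(mu0, sigma0^2)
    b = rng.random((J, trunc)) < p               # b_ji ~ Bern(p_i): atom selections
    return theta, p, b

theta, p, b = sample_meta_model(J=3, seed=0)
C = [np.flatnonzero(b[j]) for j in range(3)]     # index sets C_j = {i : b_ji = 1}
```

The weights p decay geometrically on average, so only a finite prefix of the infinitely many atoms is selected in practice, which is what lets a finite truncation approximate the process.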
The noisy, permutation invariant local parameters estimated from dataset $j$ are modeled as

$$v_{jl} \mid \theta_{c(j,l)} \sim F(\cdot \mid \theta_{c(j,l)}) \quad \text{for } l = 1, \ldots, L_j, \quad (3)$$

where $L_j = \mathrm{card}(C_j)$ and $c(j, l): \{1, \ldots, L_j\} \to C_j$ is an unknown mapping of indices of local parameters to indices of global parameters corresponding to dataset $j$. Parameters of different models of potential interest may have different domain spaces and domain-specific structure. To preserve the generality of our meta-modeling approach we again consider a general exponential family density for the local parameters,

$$p_v(v \mid \theta) = h(v)\exp(\theta^T T(v) - A(\theta)), \quad (4)$$

where $T(\cdot)$ is the sufficient statistic function.

Interpreting the model.  We emphasize that our construction describes a meta model; in particular, it describes a generative process for the parameters of the local models rather than the data itself. These parameters are "observed" either when pre-trained local models are made available or when the local models are learned independently and potentially in parallel across datasets. The meta model then infers shared latent structure among the datasets. The Beta process concentration parameter $\alpha$ controls the degree of sharing across local models, while the mass parameter $\gamma_0$ controls the number of global parameters. The interpretation of the exponential family parameters, $\tau$ and $n_0$, depends on the choice of the particular exponential family. We provide a concrete example with the Gaussian distribution in Section 4.1.
Several prior works [10, 19, 4] explore the meta modeling perspective. The key difference with our approach is that we consider a broader model class allowing for inherent permutation invariant structure of the parameter space, e.g. mixture models, topic models, hidden Markov models and sparse Gaussian processes. 
The aforementioned approaches are only applicable to models with a natural parameter ordering, e.g. linear regression, which is a simpler special case of our construction. Permutation invariance leads to inferential challenges associated with finding correspondences across sets of local parameters and learning the size of the global model, which we address in the next section.

4 Efficient Meta Model Inference

Taking the optimization perspective, our goal is to maximize the posterior probability of the global parameters given the local ones. Before discussing the objective function we re-parametrize (4) to side-step the index mappings $c(\cdot,\cdot)$ as follows:

$$v_{jl} \mid B, \theta \sim F\Big(\cdot \,\Big|\, \sum_i B^j_{il}\theta_i\Big) \quad \text{s.t.} \quad \sum_i B^j_{il} = 1,\;\; b_{ji} = \sum_l B^j_{il},\;\; B^j_{il} \in \{0,1\}, \quad (5)$$

where $B = \{B^j_{il}\}_{i,j,l}$ are the assignment variables such that $B^j_{il} = 1$ denotes that $v_{jl}$ is matched to $\theta_i$, i.e. $v_{jl}$ is the local parameter realization of the global parameter $\theta_i$; $B^j_{il} = 0$ implies the opposite. The objective function is then $P(\theta, B \mid v, \Theta)$, where $\Theta = \{\tau, n_0\}$ are the hyperparameters and indexing is suppressed for simplicity. In the context of our meta model this problem has been studied when the distributions in (1) and (4) are Gaussian [32] or von Mises-Fisher [33], which are both special cases of our meta model. 
However, this objective requires $\Theta$ to be chosen a priori, leading to potentially sub-optimal solutions, or to be selected via expensive cross-validation.
We show that it is possible to simplify the optimization problem by integrating out $\theta$ and jointly learning the hyperparameters and the matching variables $B$, all while maintaining the generality of our meta model. Define $Z_i = \{(j, l) \mid B^j_{il} = 1\}$ to be the index set of the local parameters assigned to the $i$th global parameter; then the objective function we consider is

$$\mathcal{L}(B, \Theta) = P(B \mid v) \propto P(B)\int p_v(v \mid B, \theta)\,p_\theta(\theta)\,d\theta = P(B)\prod_i \int \prod_{z \in Z_i} p_v(v_z \mid \theta_i)\, p_\theta(\theta_i)\, d\theta_i$$
$$= P(B)\prod_i \int \Big(\prod_{z \in Z_i} h(v_z)\Big) H(\tau, n_0)\exp\Big(\big(\tau + \textstyle\sum_{z \in Z_i} T(v_z)\big)^T\theta_i - (\mathrm{card}(Z_i) + n_0)A(\theta_i)\Big)\, d\theta_i$$
$$= P(B)\prod_i \Big(\prod_{z \in Z_i} h(v_z)\Big)\frac{H(\tau, n_0)}{H\big(\tau + \sum_{z \in Z_i} T(v_z),\, \mathrm{card}(Z_i) + n_0\big)}. \quad (6)$$

Holding $\Theta$ fixed, taking the logarithm, and noting that $\sum_{j,l}\log h(v_{jl})$ is constant in $B$, we wish to maximize

$$\mathcal{L}_\Theta(B) = \log P(B) - \sum_i \log H\Big(\tau + \sum_{j,l} B^j_{il}T(v_{jl}),\; \sum_{j,l} B^j_{il} + n_0\Big), \quad (7)$$

where we have used $\mathcal{L}_\Theta(B)$ to denote the objective when $\Theta$ is held constant. 
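For intuition, under the Gaussian specialization worked out in Section 4.1 (where $T(v) = v/\sigma^2$ and $\log H$ has a closed form), the matching-dependent part of (7) can be evaluated in a few lines. This is a sketch with our own names and a flat assignment-matrix encoding, not the released implementation:

```python
import numpy as np

def log_H_gauss(tau, n0, sigma2):
    """Closed-form log normalizer of the Gaussian prior (cf. Section 4.1)."""
    return (-tau**2 * sigma2 / (2.0 * n0)
            + 0.5 * (np.log(n0) - np.log(sigma2) - np.log(2.0 * np.pi)))

def neg_log_H_term(v, B, tau, n0, sigma2):
    """-sum_i log H(tau + sum_{j,l} B^j_il T(v_jl), sum_{j,l} B^j_il + n0),
    i.e. the matching-dependent part of eq. (7). `v` is a flat array of all
    local parameters; `B` a binary (L x N) matrix over (atom, local param)."""
    counts = B.sum(axis=1)     # sum_{j,l} B^j_il per global atom
    suff = B @ v / sigma2      # sum_{j,l} B^j_il T(v_jl), with T(v) = v / sigma^2
    return -np.sum(log_H_gauss(tau + suff, counts + n0, sigma2))
```

For example, a single atom with one assigned observation $v = 1$, $\tau = 0$, $n_0 = 1$, $\sigma^2 = 1$ gives $-\log H(1, 2) = 1/4 + \tfrac{1}{2}\log\pi$, which the function reproduces.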
Despite the large number of discrete variables, we show that $\mathcal{L}_\Theta(B)$ admits a reformulation that permits efficient inference by iteratively solving small linear sum assignment problems (e.g., with the Hungarian algorithm [25]).
We consider iterative optimization where we optimize the assignments $B^{j_0}$ for some group $j_0$ given that the assignments for all other groups, denoted $B^{\setminus j_0}$, are held fixed. Let $m_i = \sum_{j,l} B^j_{il}$ denote the number of local parameters assigned to the global parameter $i$, let $m_i^{\setminus j_0} = \sum_{j \neq j_0, l} B^j_{il}$ be the same count outside of group $j_0$, and let $L^{\setminus j_0}$ denote the number of unique global parameters corresponding to the local parameters outside of $j_0$. The corresponding objective function is given by

$$\mathcal{L}_{B^{\setminus j_0}, \Theta}(B^{j_0}) = \log P(B^{j_0} \mid B^{\setminus j_0}) - \sum_{i=1}^{L^{\setminus j_0} + L_{j_0}} \log H\Big(\tau + \sum_l B^{j_0}_{il}T(v_{j_0 l}) + \sum_{j \neq j_0, l} B^j_{il}T(v_{jl}),\; \sum_l B^{j_0}_{il} + m_i^{\setminus j_0} + n_0\Big). \quad (8)$$

To arrive at the form of a linear sum assignment problem we define a subtraction trick:

Proposition 1 (Subtraction trick). When $\sum_l B_{il} \in \{0,1\}$ and $B_{il} \in \{0,1\}$ for all $i, l$, optimizing $\sum_i f(\sum_l B_{il}x_l + C)$ over $B$ is equivalent to optimizing $\sum_{i,l} B_{il}\big(f(x_l + C) - f(C)\big)$ for any function $f$, $\{x_l\}$ and $C$ independent of $B$.

Proof. 
This result simply follows by observing that both objectives are equal for any values of $B$ satisfying the constraint.

Applying the subtraction trick to (8) (the conditions on $B$ are satisfied per (5)), we arrive at a linear sum assignment formulation, $\mathcal{L}_{B^{\setminus j_0},\Theta}(B^{j_0}) = \sum_{i,l} B^{j_0}_{il} C^{j_0}_{il}$ up to additive terms independent of $B^{j_0}$, where the cost is

$$C^{j_0}_{il} = \begin{cases} \log\dfrac{m_i^{\setminus j_0}}{\alpha + J - 1 - m_i^{\setminus j_0}} - \log\dfrac{H\big(\tau + T(v_{j_0 l}) + \sum_{j \neq j_0, l} B^j_{il}T(v_{jl}),\, 1 + m_i^{\setminus j_0} + n_0\big)}{H\big(\tau + \sum_{j \neq j_0, l} B^j_{il}T(v_{jl}),\, m_i^{\setminus j_0} + n_0\big)}, & i \leq L^{\setminus j_0}, \\[2ex] \log\dfrac{\alpha\gamma_0}{(\alpha + J - 1)(i - L^{\setminus j_0})} - \log\dfrac{H(\tau + T(v_{j_0 l}),\, 1 + n_0)}{H(\tau, n_0)}, & L^{\setminus j_0} < i \leq L^{\setminus j_0} + L_{j_0}. \end{cases} \quad (9)$$

The first term in each case is due to $\log P(B^{j_0} \mid B^{\setminus j_0})$. Details are provided in the supplement. Our algorithm consists of alternating the Hungarian algorithm with the above cost and hyperparameter optimization using $\log P(B \mid v)$ from (6), ignoring $P(B)$ as it is constant with respect to the hyperparameters. Specifically, the hyperparameter optimization step is

$$\hat{\tau}, \hat{n}_0 = \arg\max_{\tau, n_0} \sum_{i=1}^{L}\Big(\log H(\tau, n_0) - \log H\big(\tau + \sum_{j,l} B^j_{il}T(v_{jl}),\; \sum_{j,l} B^j_{il} + n_0\big)\Big), \quad (10)$$

where $B$ is held fixed. After obtaining estimates for $B$ and the hyperparameters $\Theta$, it only remains to compute the global parameter estimates $\{\theta_i\}_{i=1}^{L} = \arg\max_{\theta_1, \ldots, \theta_L} P(\{\theta_i\}_{i=1}^{L} \mid B, v, \Theta)$. Given the assignments, expressions for hyperparameter and global parameter estimates can be obtained using gradient-based optimization. In Section 4.1 we give a concrete example where the derivations may be done in closed form. Our method, Statistical Parameter Aggregation via Heterogeneous Matching (SPAHM, pronounced "spam"), is summarized as Algorithm 1.

Algorithm 1 Statistical Parameter Aggregation via Heterogeneous Matching (SPAHM)
input Observed local $v_{jl}$, number of iterations $M$, initial hyperparameter guesses $\hat{\tau}, \hat{n}_0$.
1: while not converged do
2:   for $M$ iterations do
3:     $j \sim \mathrm{Unif}(\{1, \ldots, J\})$.
4:     Form matching cost matrix $C^j$ using eq. (9).
5:     Use the Hungarian algorithm to optimize assignments $B^j$, holding all other assignments fixed.
6:   end for
7:   Given $B$, optimize (10) to update hyperparameters $\hat{\tau}, \hat{n}_0$.
8: end while
output Matching assignments $B$, global atom estimates $\theta_i$.

4.1 Meta Models with Gaussian Base Measure

We present an example of how a statistical modeler may apply SPAHM in practice. The only choice the modeler has to make is the prior over the parameters of their local models, i.e. (4). In many practical scenarios (as we will demonstrate in the experiments section) model parameters are real-valued and the Gaussian distribution is a reasonable choice for the prior on the parameters. The Gaussian case is further of interest as it introduces an additional parameter. For simplicity we consider the 1-dimensional Gaussian, which is also straightforward to generalize to the multi-dimensional isotropic case.
The modeler starts by writing the density

$$p_v(v \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(v - \theta)^2}{2\sigma^2}\Big).$$

Here the subscript $\sigma$ indicates dependence on the additional parameter, i.e. the variance. Next,

$$h_\sigma(v) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{v^2}{2\sigma^2}\Big), \quad T_\sigma(v) = \frac{v}{\sigma^2}, \quad A_\sigma(\theta) = \frac{\theta^2}{2\sigma^2},$$

and noticing that

$$H_\sigma(\tau, n_0) = \Big(\int \exp\Big(\tau\theta - \frac{n_0\theta^2}{2\sigma^2}\Big)\, d\theta\Big)^{-1} = \Big(\exp\Big(\frac{\tau^2\sigma^2}{2n_0}\Big)\sqrt{\frac{2\pi\sigma^2}{n_0}}\Big)^{-1}$$

ensures that $p_\theta(\theta \mid \tau, n_0)$ integrates to unity. 
Hence,

$$\log H_\sigma(\tau, n_0) = -\frac{\tau^2\sigma^2}{2n_0} + \frac{\log n_0 - \log\sigma^2 - \log 2\pi}{2}.$$

These are all we need to customize (9) and (10) to the Gaussian case, which then allows the modeler to use Algorithm 1 to compute the shared parameters across the datasets. Note that in this case $\sum_{j,l}\log h_\sigma(v_{jl})$ should be added to eq. (10) if it is desired to learn the additional parameter $\sigma^2$. We recognize that not every exponential family allows for closed-form evaluation of the prior normalizing constant $H_\sigma(\tau, n_0)$; however, it remains possible to use SPAHM by employing Monte Carlo techniques for estimating entries of the cost (9) and auto-differentiation [2] to optimize the hyperparameters.
Continuing our Gaussian example, we note that setting $\tau = \mu_0/\sigma_0^2$ and $n_0 = \sigma^2/\sigma_0^2$ recovers (1) as the density of a Gaussian random variable with mean $\mu_0$ and variance $\sigma_0^2$, as expected. A further benefit of the Gaussian choice is the closed-form solution to the hyperparameter estimation problem. Under the mild assumption that $\sigma_0^2 + \sigma^2/m_i \approx \sigma_0^2\ \forall i$ (i.e., the global parameters are sufficiently distant from each other in comparison to the noise in the local parameters) we obtain

$$\hat{\mu}_0 = \frac{1}{L}\sum_{i=1}^{L}\frac{\sum_{j,l} B^j_{il}v_{jl}}{m_i}, \qquad \hat{\sigma}^2 = \frac{1}{N - L}\sum_{i=1}^{L}\Big(\sum_{j,l} B^j_{il}v_{jl}^2 - \frac{\big(\sum_{j,l} B^j_{il}v_{jl}\big)^2}{m_i}\Big),$$
$$\hat{\sigma}_0^2 = \frac{1}{L}\sum_{i=1}^{L}\Big(\frac{\sum_{j,l} B^j_{il}v_{jl}}{m_i} - \hat{\mu}_0\Big)^2 - \frac{1}{L}\sum_{i=1}^{L}\frac{\hat{\sigma}^2}{m_i},$$

where $N = \sum_j L_j$ is the total number of observed local parameters. The result may be verified by setting the corresponding derivatives of eq. (10) $+ \sum_{j,l}\log h_\sigma(v_{jl})$ to 0 and solving the system of equations. Derivations are long but straightforward. Given assignments $B$, our example reduces to a hierarchical Gaussian model; see Section 5.4 of Gelman et al. [16] for analogous hyperparameter derivations. Finally we obtain

$$\theta_i = \frac{\mu_0\sigma^2 + \sigma_0^2\sum_{j,l} B^j_{il}v_{jl}}{\sigma^2 + m_i\sigma_0^2}.$$

For completeness we provide the cost expression corresponding to eq. (9):

$$C^{j_0}_{il} = \begin{cases} 2\log\dfrac{m_i^{\setminus j_0}}{\alpha + J - 1 - m_i^{\setminus j_0}} + \dfrac{\big(\mu_0/\sigma_0^2 + v_{j_0 l}/\sigma^2 + \sum_{j \neq j_0, l} B^j_{il}v_{jl}/\sigma^2\big)^2\sigma^2}{1 + m_i^{\setminus j_0} + \sigma^2/\sigma_0^2} - \dfrac{\big(\mu_0/\sigma_0^2 + \sum_{j \neq j_0, l} B^j_{il}v_{jl}/\sigma^2\big)^2\sigma^2}{m_i^{\setminus j_0} + \sigma^2/\sigma_0^2} + \log\dfrac{m_i^{\setminus j_0} + \sigma^2/\sigma_0^2}{1 + m_i^{\setminus j_0} + \sigma^2/\sigma_0^2}, \\[2ex] 2\log\dfrac{\alpha\gamma_0}{(\alpha + J - 1)(i - L^{\setminus j_0})} + \dfrac{\big(\mu_0/\sigma_0^2 + v_{j_0 l}/\sigma^2\big)^2\sigma^2}{1 + \sigma^2/\sigma_0^2} - \dfrac{\mu_0^2}{\sigma_0^2} + \log\dfrac{\sigma^2}{\sigma_0^2 + \sigma^2}, \end{cases} \quad (11)$$

where the first case is for $i \leq L^{\setminus j_0}$ and the second is for $L^{\setminus j_0} < i \leq L^{\setminus j_0} + L_{j_0}$ (the cost is scaled by 2, which does not affect the optimal assignment).

4.2 Convergence Analysis

Lemma 1 (Algorithmic convergence). Algorithm 1 creates a sequence of iterates for which $\log P(B \mid v)$ converges as the number of iterations $n \to \infty$. See Section B of the supplement for a proof sketch.

Hyperparameter Consistency. 
While the exponential family hyperparameter objective function (10) is too general to be tractable, the consistency of the hyperparameter estimates can be analyzed for specific choices of distributional families. Following the specialization to Gaussian distributions in Section 4.1, the following result establishes that the closed-form hyperparameter estimates are consistent in the case of Gaussian priors, subject to the assignments $B$ being correct.

Theorem 1. Assume that the binary assignment variables $B$ are known or estimated correctly. The estimator $\hat{\mu}_0$ for the hyperparameter $\mu_0$ in the Gaussian case is then consistent as the number of global atoms $L \to \infty$. Furthermore, the estimators $\hat{\sigma}_0^2$ and $\hat{\sigma}^2$ for the hyperparameters $\sigma_0^2$ and $\sigma^2$ are consistent as the total number of global atoms with multiple assignments, $\sum_{i=1}^{L}\mathbb{I}\big(\sum_{j,l} B^j_{il} > 1\big)$, tends to infinity, where $\mathbb{I}(\cdot)$ is the indicator function. See Section C of the supplement for a detailed proof.

5 Experiments

Simulated Data. We begin with a correctness verification of our inference procedure via a simulated experiment. We randomly sample $L = 50$ global centroids $\theta_i \in \mathbb{R}^{50}$ from a Gaussian distribution $\theta_i \sim \mathcal{N}(\mu_0, \sigma_0^2 I)$. We then simulate $j = 1, \ldots, J$ heterogeneous datasets by picking a random subset of the global centroids and adding white noise with variance $\sigma^2$ to obtain the "true" local centroids $\{v_{jl}\}_{l=1}^{L_j}$ (following the generative process in Section 3 with Gaussian densities). Each dataset is then sampled from a Gaussian mixture model with the corresponding set of centroids. We want to estimate the global centroids and the parameters $\mu_0, \sigma_0^2, \sigma^2$. 
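The simulated-data setup just described can be sketched as follows. The dimensions follow the text ($L = 50$ centroids in $\mathbb{R}^{50}$ by default); the subset probability, per-centroid sample count, mixture covariance, and all names are our own assumptions, since the text leaves them unspecified:

```python
import numpy as np

def simulate_datasets(J=20, L=50, D=50, mu0=0.0, sigma0=3.0, sigma=1.0,
                      n_per_centroid=100, subset_prob=0.5, seed=0):
    """Draw global centroids theta_i ~ N(mu0, sigma0^2 I), give each dataset
    a random subset of noisy copies v_jl = theta_i + N(0, sigma^2 I), and
    sample each dataset from a Gaussian mixture around its local centroids
    (unit mixture covariance is our assumption)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu0, sigma0, size=(L, D))        # global centroids
    datasets, local_centroids = [], []
    for _ in range(J):
        mask = rng.random(L) < subset_prob
        mask[rng.integers(L)] = True                    # keep at least one atom
        v = theta[mask] + rng.normal(0.0, sigma, size=(mask.sum(), D))
        x = np.concatenate([c + rng.normal(0.0, 1.0, size=(n_per_centroid, D))
                            for c in v])                # mixture samples
        local_centroids.append(v)
        datasets.append(x)
    return theta, local_centroids, datasets
```

Running local k-means on each element of `datasets` then yields the local centroid estimates that matching procedures consume.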
We consider two basic baselines: k-means clustering of all datasets pooled into one (k-means pooled) and k-means clustering of the local centroid estimates (this can be seen as another form of parameter aggregation, i.e. k-means "matching"). Both, unlike SPAHM, enjoy access to the true $L$. To obtain local centroid estimates for SPAHM and k-means "matching", we run (another) k-means on each of the simulated datasets. Additionally, to quantify how local estimation error may affect our approach, we compare to SPAHM using the true data-generating local centroids. To measure the quality of the different approaches we evaluate the Hausdorff distance between the estimates and the true data-generating global centroids. This experiment is presented in Figures 1 and 2. The white noise level $\sigma$ controls the degree of heterogeneity across the $J = 20$ datasets, and as it grows the estimation problem becomes harder; however, SPAHM degrades more gracefully than the baselines. Fixing $\sigma^2 = 1$ and varying the number of datasets $J$ may make the problem harder as there is more overlap among the datasets; however, SPAHM is able to maintain low estimation error. We empirically verify hyperparameter estimation quality in the supplement.

Figure 1: Increasing noise $\sigma$
Figure 2: Increasing $J$

Gaussian Topic Models. We present a practical scenario where a problem similar to our simulated experiment arises: learning Gaussian topic models [7], where local topic models are learned from the Gutenberg dataset comprising 40 books. We then build the global topic model using SPAHM. We use basic k-means with $k = 25$ to cluster word embeddings of the words present in a book to obtain local topics and then apply SPAHM, resulting in 155 topics. We compare to the Gibbs sampler of Das et al. [7] in terms of the UCI coherence score [27]: $-2.1$ for SPAHM and $-4.6$ for [7], where higher is better. 
Moreover, [7] took 16 hours to run 100 MCMC iterations, while SPAHM + k-means takes only 40 seconds, over 1400 times faster. We present a topic interpretation in Fig. 3 and defer additional details to the supplement.

Figure 3: Topic related to war found by SPAHM and Gaussian LDA. The five boxes pointing to the Matched topic represent local topics that SPAHM fused into the global one. The headers of these five boxes state the book names along with their Gutenberg IDs.

Table 1: Temperature prediction using sparse Gaussian processes

EXPERIMENT SETUP                          RMSE (SELF)       RMSE (ACROSS USA)
GROUP SGPS WITH 50 LOCAL PARAMETERS       5.509 ± 0.0135    14.565 ± 0.0528
SPAHM WITH 289 ± 9.8 GLOBAL PARAMETERS    5.860 ± 0.0390     8.917 ± 0.1988
GROUP SGPS WITH 300 LOCAL PARAMETERS      5.267 ± 0.0084    15.848 ± 0.0303

Gaussian Processes. We next demonstrate the effectiveness of our approach on the task of temperature prediction using sparse Gaussian processes (SGPs) [1]. For this task, we utilize the GSOD data available from the National Oceanic and Atmospheric Administration (https://data.noaa.gov/dataset/dataset/global-surface-summary-of-the-day-gsod) containing the daily global surface weather summary from over 9000 stations across the world. We limit the geography of our data to the United States alone and also filter the observations to the year 2015 and after. We further select the following 7 features to create the final dataset: date (day, month, year), latitude, longitude, and elevation of the weather stations, and the previous day's temperature. 
We treat each state as a dataset of weather station observations.
We proceed by training an SGP on each of the 50 states' data and evaluating it on a test set consisting of a random subset drawn from all states. Such locally trained SGPs do not generalize well beyond their own region; however, we can apply SPAHM to match the local inducing points along with their response values and pass the result back to each of the states. Using the inducing points found by SPAHM, the local GPs gain the ability to generalize across the continent while maintaining a comparable fit on their own test data (i.e., test data sampled only from the corresponding state). We summarize the results across 10 experiment repetitions in Table 1. In addition, we note that Bauer et al. [1] previously showed that increasing the number of inducing inputs tends to improve performance. To ensure this is not the reason for the strong performance of SPAHM, we also compare to local SGPs trained with 300 inducing points each.

Hidden Markov Models. Next, we consider the problem of discovering common structure in collections of related MoCap sequences from the CMU MoCap database (http://mocap.cs.cmu.edu). We used a curated subset [14] of the data from two different subjects, each providing three sequences. This subset comes with human-annotated labels, which allow quantitative comparisons. We performed our experiments on this annotated subset. For each subject, we trained an independent 'sticky' HDP-HMM [13] with Normal-Wishart likelihoods. We used memoized variational inference [24] with random restarts and merge moves to alleviate local optima issues (see the supplement for details about parameter settings and data pre-processing). The trained models discovered nine states for the first subject and thirteen for the second. We then used SPAHM to match local HDP-HMM states across subjects and recovered fourteen global states.
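While SPAHM's actual matching objective is Bayesian nonparametric, the core idea of identifying correspondences between local state parameters can be sketched with a plain Hungarian assignment [25] over posterior mean vectors, padded with a fixed "stay unmatched" cost so that activities unique to one subject remain singletons. The state means and the cost value below are synthetic stand-ins, not the paper's fitted parameters:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_states(means_a, means_b, unmatched_cost=10.0):
    """Match rows of means_a to rows of means_b by squared Euclidean cost.

    The cost matrix is padded with dummy rows/columns at a fixed cost so
    that a poorly matching state can stay a singleton instead of being
    forced into a bad pairing."""
    ka, kb = len(means_a), len(means_b)
    cost = ((means_a[:, None, :] - means_b[None, :, :]) ** 2).sum(-1)
    padded = np.full((ka + kb, ka + kb), unmatched_cost)
    padded[:ka, :kb] = cost
    padded[ka:, kb:] = 0.0   # dummy-to-dummy assignments are free
    rows, cols = linear_sum_assignment(padded)
    # keep only real-to-real pairings
    return [(r, c) for r, c in zip(rows, cols) if r < ka and c < kb]

rng = np.random.default_rng(0)
a = rng.normal(size=(9, 3))                               # subject 1: 9 state means
b = np.vstack([a[:5] + 0.01 * rng.normal(size=(5, 3)),    # 5 shared activities
               rng.normal(10.0, 1.0, size=(8, 3))])       # 8 unrelated activities
pairs = match_states(a, b)                                # only the 5 shared states match
```

In this toy setup the five near-duplicate states are paired up while the distant states are left unmatched, mirroring how "up-downs" performed by only one subject becomes a singleton global state.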
The matching was done on the basis of the posterior means of the local states.
The matched states are visualized in Figure 4 (right), and additional visualizations are available in the supplement. We find that SPAHM correctly recognizes similar activities across subjects. It also creates singleton states when there are no good matches. For instance, "up-downs", an activity characterized by distinct motion patterns, is performed only by the second subject. We correctly do not match it with any of the activities performed by the first subject. The figure also illustrates a limitation of our procedure, wherein poor local states can lead to erroneous matches. Global states five and six are a combination of "toe-touches" and "twists". State five combines exaggerated motions to the right, while state six is a combination of states with motions to the left. Although the toe-touch

[Figure 3 content: top words of Matched Topic 34 (army, military, forces, armed, allied, command, ...) and of the five local topics SPAHM fused into it, drawn from Gutenberg books 10409, 30047, 793, 22523, and 26879, shown alongside Gaussian LDA Topic 16.]

Figure 4: BBP discovers coherent global structure from MoCap sequences. We analyze three MoCap sequences each from two subjects performing a series of exercises. Some exercises are shared between subjects while others are not. Two HDP-HMMs were fit to explain the sequences belonging to each subject independently.
Left: We show the fraction of the full HDP-HMM wall-clock time taken by the various competing algorithms. The area of each square is proportional to 1 − normalized Hamming distance (top plot) and to the adjusted Rand index (bottom plot). The actual values are listed above each square. Larger squares indicate closer matches to ground truth. At less than half the compute, SPAHM produces segmentations similar to the full HDP-HMM while improving significantly on the local models and the k-means based matching. Right: Typical motions associated with the matched states from the two models are visualized in the red and blue boxes. Skeletons are visualized from contiguous segments of at least 0.5 seconds of data, as segmented by the MAP trajectories.

activities exhibit similar motions, the local HDP-HMM splits them into different local states. Our matching procedure only allows local states to be matched across subjects, not within a subject. As a result, they get matched to oversegmented "twists" with similar motions.
We also quantified the quality of the matched solutions using the normalized Hamming distance [14] and the adjusted Rand index [28]. We compare against the local HDP-HMMs as well as two strong competitors: an identical sticky HDP-HMM model trained on all six sequences jointly, and an alternate matching scheme based on k-means clustering that clusters together the states discovered by the individual HDP-HMMs. For k-means, we set k to the ground-truth number of activities, twelve. The results are shown in Figure 4 (left). Quantitatively, the results show that SPAHM does nearly as well as the full HDP-HMM at less than half the compute time. Note that SPAHM can be applied to much larger collections of sequences, while the full HDP-HMM is limited to small data sizes.
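Both segmentation metrics can be computed off the shelf. A minimal sketch, using sklearn for the adjusted Rand index and a Hungarian relabeling of predicted states for the normalized Hamming distance (the toy label sequences are stand-ins for real MAP state trajectories):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix

def normalized_hamming(true_labels, pred_labels):
    """Fraction of time steps mislabeled under the best one-to-one
    relabeling of predicted states (found via the Hungarian method)."""
    cm = confusion_matrix(true_labels, pred_labels)
    rows, cols = linear_sum_assignment(-cm)   # maximize total agreement
    matched = cm[rows, cols].sum()
    return 1.0 - matched / len(true_labels)

true_z = np.array([0, 0, 1, 1, 2, 2])
pred_z = np.array([2, 2, 0, 0, 1, 1])     # same segmentation, states renamed
ham = normalized_hamming(true_z, pred_z)  # 0.0: perfect after relabeling
ari = adjusted_rand_score(true_z, pred_z) # 1.0: ARI is permutation-invariant
```

Both metrics are invariant to how discovered states are numbered, which is essential when comparing segmentations produced by independently trained models.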
We also outperform the k-means scheme, despite biasing the comparison in its favor by providing it with the true number of labels.

6 Conclusion

This work presents a statistical model aggregation framework for combining heterogeneous local models of varying complexity trained on federated, private data sources. Our proposed framework is largely model-agnostic, requiring only an appropriately chosen base measure, and our construction assumes only that the base measure belongs to the exponential family. As a result, our work can be applied to a wide range of practical domains with minimal adaptation. An interesting direction for future work is to consider situations where local parameters are learned across datasets carrying a time stamp in addition to the grouping structure.

References

[1] Bauer, M., van der Wilk, M., and Rasmussen, C. E. (2016). Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems, pages 1533–1541.

[2] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2018). Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, pages 1–43.

[3] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

[4] Blomstedt, P., Mesquita, D., Lintusaari, J., Sivula, T., Corander, J., and Kaski, S. (2019). Meta-analysis of Bayesian analyses. arXiv preprint arXiv:1904.04484.

[5] Campbell, T. and How, J. P. (2014). Approximate decentralized Bayesian inference. In Uncertainty in Artificial Intelligence (UAI).

[6] Campbell, T., Straub, J., Fisher III, J. W., and How, J. P. (2015). Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Information Processing Systems, pages 280–288.

[7] Das, R., Zaheer, M., and Dyer, C. (2015).
Gaussian LDA for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 795–804.

[8] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. (2012). Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231.

[9] Deisenroth, M. P. and Ng, J. W. (2015). Distributed Gaussian processes. In Proc. ICML, pages 1481–1490.

[10] Dutta, R., Blomstedt, P., and Kaski, S. (2016). Bayesian inference in hierarchical models by combining independent posteriors. arXiv preprint arXiv:1603.09272.

[11] EU (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union, L119, 1–88.

[12] Fox, E., Jordan, M. I., Sudderth, E. B., and Willsky, A. S. (2009). Sharing features among dynamical systems with Beta processes. In Advances in Neural Information Processing Systems, pages 549–557.

[13] Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. (2008). An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, pages 312–319. ACM.

[14] Fox, E. B., Hughes, M. C., Sudderth, E. B., and Jordan, M. I. (2014). Joint modeling of multiple time series via the beta process with application to motion capture segmentation. The Annals of Applied Statistics, pages 1281–1313.

[15] Gal, Y., van der Wilk, M., and Rasmussen, C. (2014).
Distributed variational inference in sparse Gaussian process regression and latent variable models. In Proc. NIPS.

[16] Gelman, A., Stern, H. S., Carlin, J. B., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.

[17] Ghahramani, Z. and Griffiths, T. L. (2005). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.

[18] Hasenclever, L., Webb, S., Lienart, T., Vollmer, S., Lakshminarayanan, B., Blundell, C., and Teh, Y. W. (2017). Distributed Bayesian learning with stochastic natural gradient expectation propagation and the posterior server. The Journal of Machine Learning Research, 18(1), 3744–3780.

[19] Heikkilä, M., Lagerspetz, E., Kaski, S., Shimizu, K., Tarkoma, S., and Honkela, A. (2017). Differentially private Bayesian learning on distributed data. In Advances in Neural Information Processing Systems, pages 3226–3235.

[20] Hoang, Q. M., Hoang, T. N., Low, K. H., and Kingsford, C. (2019a). Collective model fusion for multiple black-box experts. In Proc. ICML, pages 2742–2750.

[21] Hoang, T. N., Hoang, Q. M., and Low, K. H. (2016). A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, pages 382–391.

[22] Hoang, T. N., Hoang, Q. M., Ruofei, O., and Low, K. H. (2018). Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI.

[23] Hoang, T. N., Hoang, Q. M., Low, K. H., and How, J. P. (2019b). Collective online learning of Gaussian processes in massive multi-agent systems. In Proc. AAAI.

[24] Hughes, M. C., Stephenson, W. T., and Sudderth, E. (2015). Scalable adaptation of state complexity for nonparametric hidden Markov models. In Advances in Neural Information Processing Systems, pages 1198–1206.

[25] Kuhn, H. W. (1955).
The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2), 83–97.

[26] McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282.

[27] Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108. Association for Computational Linguistics.

[28] Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.

[29] Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pages 1385–1392.

[30] Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In Artificial Intelligence and Statistics, pages 556–563.

[31] Thibaux, R. and Jordan, M. I. (2007). Hierarchical Beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, pages 564–571.

[32] Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. (2019a). Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pages 7252–7261.

[33] Yurochkin, M., Fan, Z., Guha, A., Koutris, P., and Nguyen, X. (2019b). Scalable inference of topic evolution via models for latent geometric structures.
In Advances in Neural Information Processing Systems, pages 5949–5959.