Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The same exponential family is chosen for all the local parameters, so I assume the method is intended for independent partitions of the data? When I first started reading the paper, I expected some sort of aggregation of different methods applied to the same data, in which case we might be dealing with completely different parameters and different distributions. The possibility of such extensions should be discussed. I think the idea of sharing global parameters in Q_j is quite restrictive. For example, if the Q_j are moderately high dimensional, it may be desirable to share only a few coordinates while keeping the other coordinates distinct, much in the way of local partition processes: https://academic.oup.com/biomet/article-abstract/96/2/249/249850?redirectedFrom=fulltext. In the simulation study, the authors may also consider comparing with Chinese restaurant process based clustering, which does not require the true value of L.
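To make the last suggestion concrete, here is a minimal sketch of drawing a partition from a Chinese restaurant process prior, where the number of clusters is an outcome of the draw rather than an input. The function name `sample_crp` and the concentration parameter `alpha` are illustrative assumptions and do not come from the paper; a full clustering baseline would also combine this prior with a likelihood for the data.

```python
import random


def sample_crp(n_items, alpha, seed=0):
    """Draw one partition of n_items from a CRP prior (illustrative sketch).

    alpha (> 0) is a hypothetical concentration parameter; larger values
    tend to produce more clusters. The number of clusters is not fixed.
    """
    rng = random.Random(seed)
    assignments = []  # cluster index assigned to each item
    counts = []       # current size of each cluster
    for _ in range(n_items):
        # Join an existing cluster with weight equal to its size,
        # or open a new cluster with weight alpha.
        weights = counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = sample_crp(n_items=50, alpha=1.0)
    print("number of clusters:", len(counts))
    print("cluster sizes:", counts)
```

The number of clusters grows with the data and with `alpha`, which is why such a baseline would not need the true value of L.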
I think the idea is interesting and the motivation for this problem is important. However, there are other meta-model Bayesian methods that I think the authors should consider discussing in their paper: 1) Bayesian inference in hierarchical models by combining independent posteriors (Dutta, Blomstedt, Kaski, 2016); 2) Meta-analysis of Bayesian analyses (Blomstedt et al., 2019); 3) Differentially Private Bayesian Learning on Distributed Data (Heikkila et al., 2017). Also, a useful baseline might be a simple one-level hierarchical extension that shares features, which should be possible since the features are assumed Gaussian in each of the experiments; its results could then be compared against your method. This could better illustrate why your method is necessary in terms of computation time, privacy preservation, etc. I think it is quite a nice idea, considering it is designed for structures like mixture models, which I think is quite novel for this type of work.
Originality: The beta-Bernoulli interpretation for model aggregation is new to me. To briefly summarize, the paper uses a subset of global model parameters for each local group, and the subset selection process is modeled as beta-Bernoulli, followed by a matching step. This method is novel to me, but I still have a few questions. 1) The model requires performing alternating updates for each group; is it possible to parallelize the algorithm? 2) It would be non-trivial to adaptively tune the cardinality of C_j. How do you do that in your experiments? If |C_j| differs substantially across groups, can your method accurately estimate |C_j| for each data group j? Overall, the paper makes sense to me, but several details need to be verified. Quality: This paper is well-written, and I have not found any technical errors in it. Significance: The experiments in this paper make sense to me, but they are not quite sufficient, since the paper does not compare the method with provably convergent methods such as online learning methods (e.g. stochastic variational inference). So it is hard for me to judge the overall quality of this novel learning strategy. Clarity: This paper is well-written and clearly explained.
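The beta-Bernoulli subset selection summarized under Originality can be illustrated with a minimal generative sketch. It is not the authors' model or inference procedure: the function name `sample_subsets` and the hyperparameters `a` and `b` are hypothetical, and the matching step is omitted. It only shows how each group j ends up with its own subset C_j of the global parameters, and how |C_j| can vary across groups.

```python
import random


def sample_subsets(n_global, n_groups, a=1.0, b=1.0, seed=0):
    """Sample which global parameters each group uses (illustrative sketch).

    Each global parameter l gets an inclusion probability p_l ~ Beta(a, b);
    each group then includes parameter l independently with probability p_l.
    a and b are hypothetical hyperparameters, not values from the paper.
    """
    rng = random.Random(seed)
    probs = [rng.betavariate(a, b) for _ in range(n_global)]
    subsets = []
    for _ in range(n_groups):
        # C_j: indices of the global parameters selected by this group
        c_j = [l for l, p in enumerate(probs) if rng.random() < p]
        subsets.append(c_j)
    return probs, subsets


if __name__ == "__main__":
    probs, subsets = sample_subsets(n_global=10, n_groups=4)
    for j, c_j in enumerate(subsets):
        print(f"group {j}: |C_j| = {len(c_j)}, C_j = {c_j}")
```

Under this sketch the cardinalities |C_j| are random and generally differ between groups, which is the situation question 2) asks about.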