Paper ID: 527
Title: Efficient Online Inference for Bayesian Nonparametric Relational Models

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
The paper introduces a probabilistic model for networks which assigns each node in the network to multiple, overlapping latent communities. Inference is done using a stochastic variational method and the experimental evaluations are performed on very large networks.

The first thing I note is that you do not cite Morup et al. (2010), "Infinite multiple membership relational modelling for complex networks", which was in fact the first work to perform inference for a latent feature relational model on large datasets -- in effect rendering your statement on lines 067-068, "... these innovations allow the first ...", incorrect. This is a rather serious oversight, because their paper not only performs large-scale inference, but their method is also an MCMC method, which is generally expected to produce more accurate results than variational methods.

I believe the strongest contribution of this paper is the application of a stochastic variational inference method to a relational data model. This aspect is likely to have the highest impact. The paper provides a new option in the toolbox of probabilistic inference for relational data models, which is always highly welcome, especially since the inference is shown to be viable on massive datasets. It is primarily for this reason that I recommend that the paper be accepted.

I also think the experiments are very well done. It is clear that great care was taken to execute the large experimental setups well, and that the authors thought through the various engineering choices needed to get the inference to work. I don't think these engineering choices are all that elegant, but having these procedures well outlined in a paper will certainly be of service to the network modelling community.

Also, I don't see how this inference method is "online" in any sense. The pruning moves are your solution to automatically inferring the number of communities/features K, which other nonparametric models for relational data already do. Perhaps I am missing something, but I think the online aspect should be removed from the paper. Indeed, I do not see where the paper discusses any aspects that make the method particularly "online".
Q2: Please summarize your review in 1-2 sentences
In summary, the paper by Morup et al. (2010) should have been cited. The main contribution of the paper is the addition of a stochastic variational method to the inference toolbox for large-scale relational data.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
The paper addresses an important issue: inference for Bayesian nonparametric relational models on large networks. The authors use an assortative HDPR model to recover latent communities in social networks previously examined with the mixed-membership stochastic block model, and demonstrate improvements in perplexity scores and link prediction accuracy. They also use their community structure to visualize business and governmental relationships extracted from an existing database. Their pruning moves seem appealing and productive in large networks. They also provide a logical approach for determining candidates for pruning, which I like. I have only a minor concern about the selection of t*: will the results be sensitive to this choice? Perhaps the authors could briefly extend the discussion along these lines.
Q2: Please summarize your review in 1-2 sentences
The authors present an interesting, novel approach to inference for Bayesian nonparametric relational models. The approach is very useful for large networks.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
This paper presents a method for learning a nonparametric, mixed-membership model of relational data. The model is a straightforward extension of the aMMSB using an HDP. They combine recent advances in stochastic variational inference with novel pruning strategies geared for this particular application.

The authors describe some of the care in creating an efficient implementation for this particular model (ensuring linear time and space complexity).

Lingering questions:
- Why not compute aMMSB for a wide range of K and compare their aHDPR to aMMSB's best performance? It would be nice to confirm that aHDPR is obtaining a very good result with far less computational effort than fitting aMMSB for a wide range of K.
- It would be nice to see the role of epsilon in the predictive accuracy of the model, though the authors mention that they leave this for future work. For that matter, discussion of robustness to other hyperparameters was quite limited as well.
- Why only analyze a single connected component for each network? Both models should be able to quickly identify the non-overlapping nature of these groups.
- Where is the model making mistakes? Is it for nodes with low degree? or for nodes that are part of groups that get (accidentally) pruned?
- How does this predictive performance compare to latent feature relational models?
- I found their strategy for choosing samples to use in their pruning rule insightful and interesting. Can these pruning techniques be extended to other nonparametric relational models (e.g., the LFRM)? Why or why not?
- I also found their use of node-specific learning rates intriguing. Given how much this appears to matter in the results, the accompanying discussion and insight are quite limited.
- It would be nice to see aMMSB fit to a wide range of K and make sure that aMMSB at least *can* achieve the same perplexity as the proposed model.
- The visualization does not really emphasize the mixed-membership nature of the model.
Q2: Please summarize your review in 1-2 sentences
The paper provides a nonparametric extension to a mixed-membership model for relational data, and presents a stochastic variational inference algorithm paired with pruning moves. Several techniques are presented that may be applicable for extending related models.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their helpful comments. We emphasize that there are a number of novel contributions in this paper, including: the proposal of an HDP relational model; the adaptation of nested variational truncations (developed for HDP topic models) to this case; the design of linear-time-and-space structured variational updates for assortative likelihoods; the development of effective online methods for pruning extraneous communities; and careful experimental validation showing major improvements relative to parametric mixed membership models.

We thank R4 for pointing out a missed citation to Morup et al., "Infinite Multiple Membership Relational Modeling for Complex Networks". That paper develops a Bayesian nonparametric mixed-membership latent feature model via the Indian Buffet Process (IBP) that scales to networks similar in size to those in our paper (via MCMC inference and likelihood approximations). We will add a citation, and remove the claim that our approach is the first scalable Bayesian nonparametric relational model. We nevertheless note a number of substantial differences from this work: we propose a novel HDP (rather than IBP) relational model, we use stochastic variational inference rather than MCMC, and our experiments show scalability with hundreds of learned communities.

R6 asks if our pruning moves can be extended to other nonparametric, latent feature relational models. We believe this may be possible: while we look for communities where the variational Dirichlet distributions have low total mass, binary IBP models could identify features whose variational beta distributions have low total mass. Of course, details would need to be worked out, and this is not the focus of our paper.
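As a rough illustration of the low-total-mass rule described above (the function name, the (N, K) parameter layout, and the threshold are our own assumptions for this sketch, not the paper's implementation), a community whose variational Dirichlet parameters have accumulated little mass across all nodes could be flagged as a pruning candidate:

```python
import numpy as np

def prune_candidates(dirichlet_params, threshold):
    """Flag communities whose total variational Dirichlet mass is small.

    dirichlet_params: (N, K) array of per-node variational Dirichlet
    parameters over K communities (hypothetical layout, for illustration).
    threshold: mass level below which a community is a prune candidate.
    """
    # Total mass assigned to each community, summed over all nodes.
    community_mass = dirichlet_params.sum(axis=0)
    return np.flatnonzero(community_mass < threshold)

# Toy example: community 2 has received almost no mass anywhere.
params = np.array([[5.0, 3.0, 0.01],
                   [4.0, 2.0, 0.02]])
print(prune_candidates(params, threshold=0.5))  # -> [2]
```

The analogous rule for binary IBP models mentioned above would simply swap the Dirichlet parameters for the parameters of the variational beta distributions over feature usage.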

We agree that more discussion of node-specific learning rates should be included. While the correctness of this approach is established by the arguments in the submission, it is less obvious why performance should improve. We believe this is related to the variance of different components of the (stochastic) gradient vector, which will be large for nodes that are poorly represented in the batch. A more formal analysis of this phenomenon is ongoing work.
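To make the intuition concrete, here is a minimal sketch of node-specific learning rates (the class name and the (tau, kappa) schedule are illustrative assumptions, not the paper's exact choices): each node keeps its own update counter, so its step size decays according to how often that node has actually been updated rather than with the global iteration count.

```python
import numpy as np

class NodeStepSizes:
    """Per-node Robbins-Monro style learning rates (illustrative sketch).

    kappa in (0.5, 1] satisfies the usual Robbins-Monro conditions.
    """
    def __init__(self, num_nodes, tau=1.0, kappa=0.6):
        self.counts = np.zeros(num_nodes)
        self.tau, self.kappa = tau, kappa

    def rate(self, node):
        # Decay is driven by this node's own update count.
        self.counts[node] += 1
        return (self.tau + self.counts[node]) ** (-self.kappa)

steps = NodeStepSizes(num_nodes=3)
r1 = steps.rate(0)  # first update of node 0
r2 = steps.rate(0)  # second update of node 0: smaller step
r3 = steps.rate(1)  # node 1 is still at its first, larger step size
```

Under this scheme, nodes that rarely appear in a mini-batch retain large step sizes, which matches the point above: gradient components for poorly represented nodes are high-variance, and a single global rate would shrink their updates prematurely.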

Regarding the choice of the number of communities K for the aMMSB, note that we already include a fairly extensive comparison in Figure 2. For the larger-scale experiments in Figure 3, computational constraints forced us to focus on a narrower range of options. Note that even with a "good" K, the aMMSB may (and often does) have worse predictive performance than the aHDPR, because the models have non-equivalent hierarchical community membership priors.

Our decision to analyze the largest connected component of each network allows us to focus our quantitative comparisons on the challenging problem of identifying mixed community memberships for "active" nodes with many neighbors. For the small disconnected cliques found in real-world networks, all models easily assign them to unique communities but have little information to use for prediction.

R5 asks about sensitivity to the pruning threshold t*, and R6 about sensitivity to the between-community threshold parameter epsilon. Currently, both parameters are set to be "very small" (and in the case of epsilon, consistent with prior work on the aMMSB); performance for a wide range of other "small" values is similar. It could potentially be useful to increase t* to allow more rapid pruning and greater speed, or increase epsilon if less assortativity were expected. But given time and space constraints, we focused on other issues in our experiments.

Our usage of the term "online" was motivated by prior work in this area, for example Hoffman et al.'s "Online learning for latent Dirichlet allocation". We agree the word "stochastic" is also relevant, and will consider potential changes.