NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID 5716: Semi-Implicit Graph Variational Auto-Encoders

### Reviewer 1

This paper proposes a Semi-Implicit VI extension of the GraphVAE model. SIVI assumes a prior distribution over the posterior parameters, enabling more flexible modeling of latent variables. In this paper, SIVI is straightforwardly incorporated into the Graph VAE framework. The formulation is simple but possibly new in the graph analysis literature. The main idea is easy to understand. The proposed model shows good results in the link prediction experiments.

Fig. 3 is not reader-friendly in several respects: (i) the panels are simply too small; (ii) we can observe that the posterior distributions learned by SIG-VAE are multi-modal, but readers do not know whether the posteriors of these five nodes "should be" multi-modal. In other words, is SIG-VAE's variational posterior closer to the true distribution than that of VGAE? Is there any analysis that could answer this question more directly?

I cannot fully understand the meaning of the graph generation experiments. What is the main message we can read from this result?

I have a few questions concerning the graph node classification experiment. (i) What kind of data splitting is employed (train/validation/test)? The data split has a huge impact on the final score. Is the split the same as the standard split used in the Kipf-Welling GCN paper? (ii) The performance of the proposed SIG-VAE is not so impressive compared to the naive and simple GCN. Why is that? (iii) I think GCN is a strong baseline but not the best one to support a SOTA claim. In general, [GAT] works better in many cases, including the Cora and Citeseer datasets; please see the experiments in [Wu19].

[GAT] Velickovic+, "Graph Attention Networks", in Proc. ICLR 2018
[Wu19] Wu+, "Simplifying Graph Convolutional Networks", in Proc. ICML 2019

+ A combination of semi-implicit VI and graph VAE is new
+ The formulation is concise and easy to understand
- Some unclear issues in Fig. 3 and the graph generation experiments
- The node classification result is not SOTA (too-strong claim)

### after author-feedback ###

The authors provided satisfactory answers to some of my concerns. Considering the other reviewers' points of view at the same time, I raised the score.
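To make the SIVI construction summarized above concrete, here is a minimal NumPy sketch of the hierarchical sampling scheme: an implicit mixing distribution over the posterior parameters $\psi$ (a pushforward of noise through a network), followed by an explicit conditional $q(z|\psi)$. The toy "network" and all shapes here are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sivi_posterior(x, n_samples=1000):
    """Hierarchical sampling in semi-implicit VI (illustrative only):
    psi ~ q_phi(psi)  -- implicit: noise pushed through a network,
    z   ~ q(z | psi)  -- explicit: here a Gaussian with psi-dependent mean."""
    # Implicit mixing layer: Gaussian noise through a toy nonlinearity.
    eps = rng.standard_normal((n_samples, 2))
    psi_mu = np.tanh(eps @ np.array([[0.8, -0.3], [0.2, 0.9]])) + x
    # Explicit conditional layer: Gaussian centered at psi_mu.
    z = psi_mu + 0.1 * rng.standard_normal(psi_mu.shape)
    return z

z = sample_sivi_posterior(np.array([1.0, -1.0]))
# The marginal q(z) = E_psi[q(z|psi)] is a continuous mixture of Gaussians,
# which is why it can be multi-modal even though each q(z|psi) is unimodal.
```

This mixture structure is what produces the multi-modal posteriors that Fig. 3 tries to show.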

### Reviewer 2

Originality: The paper is a combination of a number of ideas in the literature, where a careful combination of existing techniques leads to really good representation learning for graphs. In that sense the work is original and interesting.

-----POST REBUTTAL----- I thank the authors for addressing my concerns and questions around a VampPrior version of VGAE as well as questions around Eqn. 5. In general the rebuttal includes a lot of relevant experiments addressing the concerns raised at the review stage, and based on this evidence I am happy to keep my original score for the paper.

Clarity: The paper is generally clear and has clear mathematical formulations written down for all the methods considered.

Quality: The paper has a number of thorough experiments and generally seems to be of high quality in its empirical evaluation. It also has a clear intuition for why the proposed method is better and extensively demonstrates and validates it.

Significance: The paper seems like a significant contribution to the graph representation learning literature.

Weaknesses:
- It would be good to better justify and understand the Bernoulli-Poisson link. Why is the number of layers used in the Poisson part of the link? The motivation in the original paper [40] seems to be that one can capture communities: the sum in the exponential is over r_k coefficients, where each coefficient corresponds to a community. In this work the sum is over layers. How do the intuitions from that work transfer here? In what way do communities correspond to layers in the encoder? It would be nice to better understand this.

Missing Baselines:
- It would be instructive to vary the number of layers of processing for the representation during inference and analyze how that affects the representations and performance on downstream tasks.
- Can we run VGAE with a VampPrior to more accurately match the doubly stochastic construction in this work? That would help determine whether the benefits come from a better generative model or from better inference due to doubly semi-implicit variational inference.

Minor Points:
- Figure 3: It might be nice to keep the generative model fixed and then optimize only the inference part of the model, parameterizing it as either SIG-VAE or VGAE, to compare the representations. It's impossible to compare representations when the underlying generative models are also potentially different.
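For reference, the Bernoulli-Poisson link questioned above can be sketched in a few lines. In [40] each coefficient r_k weights one latent factor (a community); in the paper under review the sum is instead over encoder layers, which is exactly the disconnect this review raises. The vectors and coefficients below are made-up toy values.

```python
import numpy as np

def bernoulli_poisson_edge_prob(z_i, z_j, r):
    """Bernoulli-Poisson link:
    p(A_ij = 1) = 1 - exp(-sum_k r_k * z_ik * z_jk).
    The edge indicator is 1{N_ij >= 1} for N_ij ~ Poisson(rate), so the
    link probability is the probability of a nonzero Poisson count."""
    rate = np.sum(r * z_i * z_j)   # Poisson rate from factor interactions
    return 1.0 - np.exp(-rate)     # P(count >= 1) under Poisson(rate)

# Toy example: three latent factors shared by nodes i and j.
z_i = np.array([0.5, 1.0, 0.0])
z_j = np.array([1.0, 0.5, 2.0])
r   = np.array([1.0, 1.0, 1.0])
p = bernoulli_poisson_edge_prob(z_i, z_j, r)   # rate = 1.0, p = 1 - e^{-1}
```

Because the rate is near zero unless nodes share active factors, the link naturally produces sparse graphs, which is the stated motivation for using it here.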

### Reviewer 3

The paper is incremental work compared to the Semi-Implicit VAE [38]. The general idea of SIVAE is to model the parameters of the VAE's variational posterior ($\psi$) as a random variable that one can sample from but that does not necessarily have an explicit form, which results in a mixture-like behavior for the VAE. In this work, the authors propose to use that framework for graph data. The idea is to sample from $\psi$ and concatenate the sample with each layer of the graph VAE. The rest follows [38]. They also propose another variant based on normalizing flows, which reads a bit out of sync (an afterthought/add-on) with the rest of the paper.

### Reviewer 4

### edit after author response ###

I read the feedback and found the additional results impressive. I am still uncertain about \psi: l116 says "implicit prior distribution parametrized by \psi" (I assumed this means q(Z|\psi)) and "reparametrizable q_\phi(\psi|X,A)", while the feedback states that \psi is \mu and \Sigma. I think the Gaussian random variable Z is reparametrizable and q(Z|\psi) is explicit. Since the major strength of this paper is its empirical performance, the clarity of the method/model description and of the experimental setups (mentioned by other reviewers) is very important. I hope the reviews are helpful for improving the presentation.

###########################

This paper presents SIG-VAE for learning representations of nodes in a given graph. The method extends VGAE in two ways: the semi-implicit variational distribution enriches the complexity of the variational distribution produced by the encoder, and the decoder uses the Bernoulli-Poisson likelihood to better fit the sparse link structure. Experiments in Section 4 compare SIG-VAE with many modern methods on various tasks and datasets. My questions and suggestions are as follows.

* The reviewer failed to comprehend the connection between the semi-implicit distribution in Eq. (2) and the following part, lines 122--142. Specifically, the absence of \psi in Eqs. (3,4) is confusing. Is \epsilon_u ~ q_u(\epsilon) interpreted as \psi ~ q_\phi of Eq. (2)? If so, conditioning \psi on X, A is misleading.

* Figure 1 is not informative either. The caption says it 'diffuses the distributions of the neighboring nodes', but VGAE already achieves this: the latent representation is propagated to adjacent nodes, and distributions are inferred. What does SIG-VAE attain on top of that? The illustration is also obscure. Does it tell us that the distribution of each node's latent representation can be multimodal and that the neighboring distributions influence the distributions of certain nodes?
* The CONCAT operators are unclear in Eqs. (3,4). In particular, I want to know the sizes of \epsilon_u and h_{u-1} to see what information is added to X. After checking the code, I assumed \epsilon_u is concatenated to the node features for each node and layer, but I am not convinced due to unfamiliarity with TensorFlow. Why is X fed into every layer? For u>1, h_u already seems to carry some information on X.

* Dots in the middle and right panels of Figure 2. Supposing the latent representation is inferred as distributions, what are the dots in the figure? Mean values? As visualized in Figure 3, the distributions may take multiple modes. I am curious whether it is okay to shrink such distributions to a single point, though the full visualization is challenging.

* Evaluation on the graph generation task. Density and average clustering coefficient are adopted as metrics for the graph generation task in Section 4.3. While these scores indicate the efficacy of the Bernoulli-Poisson likelihood for sparse edges, they may not fully characterize graph structures. Using the following metrics may further strengthen the result: the MMD-based score in Section 4.3 of [a], and the KL divergence between degree distributions [b].

[a] J. You et al., "GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models", ICML 2018.
[b] Y. Li et al., "Learning Deep Generative Models of Graphs", https://arxiv.org/abs/1803.03324
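To illustrate the CONCAT question raised above, here is a NumPy sketch of one plausible reading of Eqs. (3,4): at every layer, fresh per-node noise \epsilon_u is concatenated with the node features X and the previous hidden state, then mixed through the (normalized) adjacency. All shapes, the identity stand-in for the adjacency, and the tanh nonlinearity are hypothetical illustration, not the paper's actual encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_layer_with_noise(X, h_prev, A_hat, W, noise_dim=4):
    """One encoder layer under one reading of Eqs. (3,4):
    h_u = tanh(A_hat @ CONCAT(X, h_{u-1}, eps_u) @ W_u).
    eps_u is fresh per-node noise; X is re-fed at every layer."""
    n = X.shape[0]
    eps = rng.standard_normal((n, noise_dim))       # eps_u ~ q_u(eps)
    inp = np.concatenate([X, h_prev, eps], axis=1)  # CONCAT along the feature axis
    return np.tanh(A_hat @ inp @ W)                 # graph propagation + mixing

# Toy run: 5 nodes, 3 input features, 8 hidden units, 2 layers.
n, d_x, d_h, d_eps = 5, 3, 8, 4
X = rng.standard_normal((n, d_x))
A_hat = np.eye(n)                                   # stand-in for a normalized adjacency
h = np.zeros((n, d_h))
for _ in range(2):
    W = 0.1 * rng.standard_normal((d_x + d_h + d_eps, d_h))
    h = gnn_layer_with_noise(X, h, A_hat, W)
# The final h would parameterize (mu, Sigma) of q(Z | psi), so psi is the
# pushforward of the injected noise -- the semi-implicit mixing layer.
```

Under this reading, the per-layer input width is d_x + d_h + d_eps, which is exactly the size information the bullet above asks the authors to state explicitly.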