NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4699
Title: Graph Agreement Models for Semi-Supervised Learning

Reviewer 1

The paper is clearly written and easy to understand. It addresses an important problem in semi-supervised learning: the edges and weights used by graph-based methods may come from noisy sources. The proposed GAM method predicts the probability that a pair of nodes shares the same label, and the output of this agreement model is used to regularize the classification problem. To account for the limited labeled data, the classification and agreement models are trained jointly. Because the method can learn edges between datapoints, it also handles the common cases where no graph is provided or where the provided graph is incomplete, in the sense that some nodes are disconnected from it. The experiments are thorough and explore both settings (a given partial graph, and having to learn the entire graph) across a variety of datasets. The key advantage of this method in constructing graphs is that it predicts label agreement better than a distance metric between the nodes does. Since the GAM uses node features that are generated by an encoder, the quality of the encoder is crucial to the performance of the method. It would be worth exploring how the quality of the encoder and noise in the features affect the quality of the GAM and of the overall jointly trained model.
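
As a reading aid, the regularization idea summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the squared-distance penalty, and the `weight` coefficient are hypothetical, and the agreement scores are passed in as plain numbers rather than produced by a trained model.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under predicted class probabilities."""
    return -math.log(max(probs[label], 1e-12))

def squared_distance(p, q):
    """Squared Euclidean distance between two prediction vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def agreement_regularized_loss(preds, labels, edges, agreement, weight=1.0):
    """Supervised loss plus an agreement-weighted smoothness penalty.

    preds:     dict node -> list of class probabilities (classifier output)
    labels:    dict node -> int label, for labeled nodes only
    edges:     list of (i, j) node pairs
    agreement: dict (i, j) -> probability in [0, 1] that i and j share a label
               (stand-in for the agreement model's output; assumed here)
    weight:    hypothetical trade-off coefficient
    """
    supervised = sum(cross_entropy(preds[n], y) for n, y in labels.items())
    # Edges with a high agreement score pull the two predictions together;
    # edges with a score near zero contribute almost nothing.
    penalty = sum(
        agreement[(i, j)] * squared_distance(preds[i], preds[j])
        for i, j in edges
    )
    return supervised + weight * penalty
```

With an agreement score of 0 on an edge, the two endpoints are free to receive different predictions, which is how a noisy edge can be ignored rather than enforced.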

Reviewer 2

-------Originality
The paper claims to propose a novel SSL method that learns the graph instead of using a fixed graph. However, closely related work sharing the same idea has been explored in [1, 4] and is, unfortunately, not mentioned in the paper. SNTG [1] is recent work on graph-based SSL that also uses an auxiliary model to predict whether a pair of nodes is similar or not; the difference lies in the co-training part. [4] proposes a method based on dynamic infection processes to propagate labels. Please include [1, 4] in the related work and add more discussion.

-------Clarity
The writing is really good. The paper is clear and easy to follow.

-------Methodology
1. The introduction of the agreement model adds network parameters. Is there any comparison with the baselines in terms of the number of network parameters? What is the performance of the baselines with the same number of parameters as GAM? This seems to be a fairer comparison.
2. The convergence of the algorithm is not guaranteed. Since the agreement model can make mistakes, even among its most confident predictions, it may amplify errors and propagate them into the classification model. How can improvement be guaranteed at each iteration of the co-training interaction? Figure 5 in Appendix 5 also supports this concern. E.g., in Fig. 5(b) the test accuracy peaks at around iteration 15 and no longer improves after that (it oscillates up and down). In an extreme case, the amplified errors may lead to divergence if the agreement model performs poorly at the beginning and learns slowly. See the sudden drop at about iteration 12 in Fig. 5(a): accuracy may not increase at the next iteration but instead get worse.

-------Experiments
1. What is the result of GCN_1024+VAT? It is not listed in Table 1. I noticed that [2] reports better results for GCN+VAT than those in this paper. I was curious why VATENT failed, as stated in line 291. Is it due to your implementation or to the method itself?
2. In Sec. 4.2, a simple CNN is used for image classification, but this architecture is not commonly used in the SSL literature. The paper only compares to VAT, while several important baselines in SSL are missing. E.g., SNTG [1] and Fast-SWA [3], which achieve much better performance than VAT, should be included in the related work and in Table 3. I recommend that the authors use the standard, stronger 13-layer CNN from the SSL literature and report the results, rather than omitting the closely related work [1, 3], especially SNTG, which shares the same idea of learning a graph to improve the classification model.
3. How do you choose M, the number of most confident predictions?

-----References---
[1] Smooth Neighbors on Teacher Graphs for Semi-supervised Learning, CVPR 2018.
[2] Batch Virtual Adversarial Training for Graph Convolutional Networks, arXiv:1902.09192.
[3] There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average, ICLR 2019.
[4] Semi-Supervised Learning with Competitive Infection Models, AISTATS 2018.
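
To make the question about M concrete, the selection step it controls can be sketched as follows. This is only an illustration of promoting the M most confident predictions on unlabeled nodes; the function name, the use of the maximum class probability as the confidence score, and the tie-breaking rule are assumptions, not details taken from the paper.

```python
def select_top_m_confident(preds, labeled, m):
    """Return up to m unlabeled nodes with their predicted labels, most confident first.

    preds:   dict node -> list of class probabilities
    labeled: set of nodes that already have labels
    m:       number of most confident predictions to promote per iteration
    """
    candidates = []
    for node, probs in preds.items():
        if node in labeled:
            continue
        confidence = max(probs)  # assumed confidence score: top class probability
        candidates.append((confidence, node, probs.index(confidence)))
    # Sort by confidence, highest first; ties are broken by node id for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [(node, label) for _, node, label in candidates[:m]]
```

A small M adds few, relatively reliable pseudo-labels per iteration, while a large M grows the labeled set faster but admits lower-confidence predictions, which is the error-propagation risk raised under Methodology.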

Reviewer 3

This paper proposes a new label-propagation method based on deep learning. Instead of propagating labels directly along the graph, it learns a second function that determines whether an edge connects instances with the same label. Labels are then propagated along the edges predicted to connect same-label instances. Classifier training and label propagation alternate iteratively.

Originality
The novelty of the paper is limited and lies mainly in the algorithm. First, it proposes to learn a second function that determines whether an edge should connect two instances of the same label. Second, it uses deep neural networks as base models for both tasks. The idea of using neural networks for semi-supervised learning has been explored for many years, so the paper takes a small step toward deep-learning-based label propagation. The semi-supervised community has long discussed how to label unlabeled data safely, rather than labeling them directly with classical semi-supervised methods. The paper can be counted as part of this trend: it proposes a solution that simply learns whether two nodes should be joined or not. Besides, the paper studies a well-defined semi-supervised problem, without offering new insights or a new problem setting.

Quality
One contribution of the paper is a training algorithm that resembles co-training. However, the proposed method is not co-training. In co-training, both classifiers should produce data to feed to each other. In this proposal, only one classifier generates data, and the other is used only to update the graph. This is closer to self-training than to co-training. The other parts are technically sound and the experimental results are good.

Clarity
The paper is clearly written.

Significance
The paper may be of practical use, since the empirical results shown are good. However, it only adds a small modification to existing algorithms and replaces all the classifiers with deep learning ones. As a result, the paper does not give much new information (new problems, new insights, new assumptions) to the community and has only limited significance.

=====================================================
After the rebuttal, I will increase my score.