NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
The paper targets the application of network compression using a cloud platform. Instead of uploading all the training data onto the platform, the paper suggests uploading a small portion of data as positive (P) data and using larger datasets already on the platform as unlabeled (U) data. After training a PU classifier, the classifier is used to select more P data from the U data. The selected data, together with the original data, are then used in a knowledge distillation framework to compress the original network (a rough sketch of this pipeline is given at the end of this review). The experimental results show that the compressed network's performance is close to that of the original deep neural network trained on all data, on three widely used datasets.

Originality: The paper is a smart application of PU learning. The idea of using PU learning for data augmentation is clever and does make sense. However, the motivating problem of network compression on the cloud may be problematic. The users have both the well-trained network and the data used to train it, so compressing the network is easier on the users' side. If a cloud platform is used, then the whole large network can be executed on the cloud, and thus no compression is required. In this way, the application does not make sense, and other applications that can motivate the paper should be sought. For the technical part, the paper proposes some heuristics for feature extraction and knowledge distillation by slightly modifying the original attention and knowledge distillation methods. These modifications are targeted at the current problem, but the novelty is limited.

Quality: Although the compressed network achieves comparable performance, why the method works is still not clear, and more comparisons are required. For example, the proposed method may work not only because of the PU learning part but also because of the attention-based feature extraction part; is there any comparison between using the attention-based feature extraction and not using it? The proposed method may also work because of the robust knowledge distillation method, so another possible comparison is to use part of, or all of, the unlabeled dataset with knowledge distillation to compress the network. With these comparisons, the contribution of each component of the proposed method would be clear.

Clarity: The paper is clearly written. There are two minor problems. One is that referring to an equation is usually written as Eq. (x) instead of Fcn. (x). The other is that some of the references are not actually cited in the text, for example [12, 18, 21, 23, 25, 27, 28].

Significance: The paper is a nice attempt at applying PU learning to data augmentation. However, its motivating application may not be practical enough, and some necessary comparisons are missing.

------------------------------------------------

I am satisfied after reading the rebuttal and would like to increase my score. However, the application motivation should be strengthened in a revised version.
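For concreteness, below is a minimal sketch of the two-step pipeline as I understand it. This is my own illustrative PyTorch-style code, not the authors' implementation; the names (select_positive, kd_loss, distill, pu_classifier) are hypothetical, and I use the standard soft-label distillation loss rather than the paper's robust variant.

```python
# Reviewer's illustrative sketch, not the authors' code.
# Step 1: score the unlabeled pool with a trained PU classifier and keep likely positives.
# Step 2: distill the pre-trained teacher into a compact student on the selected data.
import torch
import torch.nn.functional as F

def select_positive(pu_classifier, unlabeled_loader, num_select, device="cpu"):
    """Score unlabeled samples with the PU classifier and keep the top-scoring ones."""
    scores, samples = [], []
    pu_classifier.eval()
    with torch.no_grad():
        for x, _ in unlabeled_loader:
            s = torch.sigmoid(pu_classifier(x.to(device))).squeeze(1)
            scores.append(s.cpu())
            samples.append(x)
    scores = torch.cat(scores)
    samples = torch.cat(samples)
    top = torch.topk(scores, k=min(num_select, len(scores))).indices
    return samples[top]

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard soft-label distillation loss; T is the temperature."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def distill(teacher, student, data, epochs=10, batch_size=128, lr=1e-3, device="cpu"):
    """Train the student to mimic the teacher on the selected (pseudo-positive) data."""
    teacher.eval()
    student.train()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(data),
                                         batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for (x,) in loader:
            x = x.to(device)
            with torch.no_grad():
                t_logits = teacher(x)
            loss = kd_loss(student(x), t_logits)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

The ablations I ask for above would correspond to swapping out Step 1 (select data randomly, or use the whole unlabeled pool) while keeping Step 2 fixed, and vice versa.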
Reviewer 2
Generally speaking, this is a high-quality paper which solves the compression problem in a different but effective way. By using the proposed algorithm, end users do not need to spend hours uploading their dataset to the cloud, which is quite friendly and attractive for them. The paper leverages the strength of PU learning for augmenting data and the strength of the KD method for compression. The experimental results show that the network can be compressed effectively with only 8% of the original data. The paper is well written and easy to understand.
Reviewer 3
This paper focuses on solving the model compression problem with limited labeled data. Taking a pre-prepared large-scale dataset as unlabeled data and a handful of samples from the original training set as positive data, selecting data for the subsequent compression task with the help of a PU classifier is an interesting approach. A robust KD method is used to deal with the noise and the data imbalance problem (a sketch of what such a weighted distillation loss could look like is given after this review), and a multi-feature network with an attention structure alleviates the dimensionality gap between datasets. This two-step method is clear and efficient, and the experimental results are very impressive.

-----------------------

I am satisfied with the rebuttal.
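To make the imbalance point concrete, below is one plausible form of a class-weighted distillation loss, written as my own PyTorch-style sketch. The authors' actual robust KD formulation may differ; the name weighted_kd_loss and the use of inverse-class-frequency weights are my assumptions.

```python
# Reviewer's illustration only: a distillation loss that down-weights over-represented
# classes in the selected pseudo-positive data. Not the authors' exact formulation.
import torch
import torch.nn.functional as F

def weighted_kd_loss(student_logits, teacher_logits, class_counts, T=4.0):
    """Soft-label KD where each sample is re-weighted by the inverse frequency of the
    teacher's predicted class, so rare classes are not drowned out by common ones."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)          # (N, C) soft targets
    log_p_student = F.log_softmax(student_logits / T, dim=1)  # (N, C)
    per_sample = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)  # (N,)
    pred = p_teacher.argmax(dim=1)                            # teacher's hard label per sample
    weights = 1.0 / class_counts.clamp(min=1).float()         # inverse class frequency
    w = weights[pred]
    w = w / w.sum()                                           # normalize weights to sum to 1
    return (w * per_sample).sum() * (T * T)
```

Here class_counts would be the per-class counts of the teacher's predictions over the selected data; samples from over-represented classes then contribute less to the loss.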