NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5082
Title: Zero-shot Knowledge Transfer via Adversarial Belief Matching

Reviewer 1

[UPDATE] I am satisfied with the authors' reply about larger-scale experiments, a concern raised by 2 out of 3 reviewers. While I am only guessing that performance may degrade as a function of dataset scale, it is not hard to imagine advances in GANs that could make that degradation smaller and hence make the proposed method more useful. Further, even in an adversarial setting, it may be possible to guess what kind of inputs are relevant, or to extend the method to a few-shot or hybrid approach. I am positively surprised that the student's features have transferability comparable to the teacher's. I was concerned that some sort of overfitting to the teacher's decision boundary was possible, but this does not seem to be the case.

While I agree with the authors that, in most cases, those releasing research models will not go out of their way to vaccinate them against zero-shot distillation, the proposed method could be used to (somewhat) copy and repurpose information stored in a hardware model. Take, for example, Tesla's autopilot, which uses several neural networks and is trained on tens of billions of images that are not available to the world. In fact, that data is Tesla's primary competitive advantage in the long run, and acquiring it costs Tesla millions of dollars. Since the autopilot needs to drive without an internet connection and inference needs to be extremely low latency, inference is surely done on device and is therefore hackable. Your method could be used to train a good feature extractor comparable to the one on their device. Further, it would not need to be zero-shot, since it is very easy to get a little bit of data.

Overall I am impressed with the work and satisfied with the authors' response. Hence, I am increasing my score to 8.

[OLD REVIEW]
Originality: Highly original. Zero-shot distillation using some form of adversarial loss is imaginable, but the actual performance level is quite unexpected, imho.
Quality: Good deltas over previous work, and the error bars make the differences more convincing.
Clarity: The writing is clear enough, although more detail is required for readers who are only distantly familiar with the GAN literature. The supplementary materials are also on the light side.
Significance: High. I believe this paper opens new avenues for investigation, basically offering a new attack vector against embedded inference devices.

Reviewer 2

The paper proposes an algorithm for distilling a teacher into a student without any data or metadata. The main idea is to train a generator to feed “hard examples” (where teacher and student disagree) to the student. The dynamics of this algorithm in Fig. 1 lead to an analysis in §4.4 that probes the difference between the predictions of the two networks near the decision boundaries. Experiments are done on SVHN and CIFAR-10.

1. The paper is clear, concise, and generally well-written. One thing that could be improved is §4.1: the title does not reflect all of its content. Also, can the baselines be highlighted, especially KD(+AT) (L172), which is used many times? It would also be nice to remind the reader of the baselines from time to time when discussing the results.
2. The paper is well-motivated (L27-39).
3. The approach is sound and intuitive. Possible conceptual concerns are explicitly mentioned (§3.4). I also appreciate that the authors mention the loss terms that were tried but did not work (L112-117).
4. The paper gives the right context on the relevant work, helping the reader to quantify the significance of the proposed approach and experimental results. For instance, it makes sure to point out the concurrent work of Nayak et al. 2019 in L81-84 and then highlights it again in L218-221.
5. Solid experiments and analysis. The main results are shown in Fig. 2 and Table 1. Overall, I like that the experiments not only attempt to showcase the effectiveness of the proposed approach but also provide further insights, including toy experiments (§3.3 and Fig. 1), the transition error in §4.4, and qualitative results (§4.3 and Fig. 3).

### Updated Reviews ###
While the authors' statement regarding large-scale experiments seems fair, the rebuttal does not attempt to change my opinion regarding the paper's level of significance. I am happy to remain at a 7. Obviously, I would still recommend acceptance.

Reviewer 3

[Update after author feedback] I did not have any significant concerns with the paper originally, and the author feedback addresses the other reviewers' concerns about scaling to large datasets. I continue to recommend acceptance.

This paper provides a method called zero-shot KT for performing distillation from a teacher model to a smaller student model in the zero-shot setting (i.e. the student has no access to the teacher's training data). An adversarial generator is trained alongside the student and seeks to generate pseudo-examples that maximize the KL-divergence between the class predictive distributions of the teacher and the student. The student seeks to minimize this same KL-divergence. The proposed model is shown to be competitive with, or to outperform, distillation methods that have access to significant amounts of training data. In addition, the paper proposes a metric to quantify the mismatch between teacher and student as images from the test dataset are adversarially attacked.

As far as I can tell, this work is original. Rather than relying on auxiliary information about classes, the proposed zero-shot KT approach relies only on the teacher model itself. Related work is detailed enough to clearly understand the novelty of the work and how previous approaches are related.

The quality of this work is high. Figure 1, demonstrating results on a toy problem, and Figure 3, showing adversarially generated pseudo-examples, were helpful inclusions for understanding the method. In Table 1, the authors show the circumstances under which zero-shot KT is outperformed by KD+AT, albeit with KD+AT having access to more examples. Results were compared over a range of different architectures and over 3 initial seeds. The authors also addressed potential concerns head-on, such as issues with the adversary and choosing hyperparameters.

Clarity is excellent. The writing was clear and insightful. The method and hyperparameters were discussed in sufficient detail to reproduce the results.

Significance is high as well. As the authors point out in the introduction, this method is useful for distilling and making accessible pre-trained models for which the dataset may not be released due to intellectual property, privacy, size, or other concerns. Such scenarios will only become more likely in the future. A potential concern is the choice of hyperparameters, but the authors describe how they chose the hyperparameters using another task and report that the hyperparameters do not have a large effect on performance.

The teacher transition curves in Figure 4 were a point of confusion for me. I was under the impression that the teacher network was the one being attacked, and so I was surprised that the performance of the teacher differed so drastically between the left and right columns. Is there any particular reason why the student model was the one attacked and not the teacher?
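
For readers of this review, the opposing objectives summarized above (the generator maximizes the teacher-student KL-divergence, the student minimizes it) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example logits, and the direction of the divergence (teacher relative to student) are assumptions made for illustration, and the actual networks and optimizers are omitted.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def belief_mismatch(teacher_logits, student_logits):
    """Assumed direction: KL(teacher || student) on one pseudo-example."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


def student_loss(teacher_logits, student_logits):
    """The student is trained to minimize the mismatch."""
    return belief_mismatch(teacher_logits, student_logits)


def generator_loss(teacher_logits, student_logits):
    """The generator is trained to maximize the mismatch, i.e. minimize its negative."""
    return -belief_mismatch(teacher_logits, student_logits)


# Hypothetical logits for a 3-class problem on a single pseudo-example.
teacher = [2.0, 0.5, -1.0]
agreeing_student = [2.0, 0.5, -1.0]
disagreeing_student = [-1.0, 0.5, 2.0]

# The mismatch vanishes when the two models agree and grows when they disagree,
# so a generator rewarded for high mismatch is pushed toward "hard" pseudo-examples.
print(student_loss(teacher, agreeing_student))
print(student_loss(teacher, disagreeing_student))
```

In a full training loop the two losses would be applied in alternation: the generator's parameters are updated with `generator_loss` and the student's with `student_loss`, both evaluated on pseudo-examples produced by the generator.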