NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2738
Title: One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers

Reviewer 1

Update after author reply: The author response was okay; I won't update my score. I agree with the other reviewers:

1. Rescaling the weights during initialization should be part of the experimental setup, since it might change these results. It should be a central part of these experiments, and the authors should update the paper with such experiments.
2. As is common in lottery ticket work, there are insufficient comparisons against other (non-lottery-ticket) approaches. Eventually lottery ticket work will have to compare against other relevant work; for now, I don't think this means we should reject.

My original review is below.

================================================

The experimental results are convincing, and the experimental setup is not bad. This paper is original, and answers a relevant and timely question for lottery ticket research. It's clearly written, and while it does leave a few questions about exact implementation details, the authors do a better job than most at being clear. This feels like a complete result, with thorough experiments.

One problem throughout research on lottery tickets is the lack of comparisons against other pruning methods. It's understandable that this work only analyzes lottery tickets, as this already involves a lot of experimentation. However, it's not clear to me whether the properties found here simply rediscover generalization properties that have been known for years for other pruning methods (e.g. L1 or L0 feature selection).

Reviewer 2

***POST-REBUTTAL***

Thank you for the time you spent writing the rebuttal! I think that the finding that LTs can generalise (I use the word "can" because this does not seem to hold consistently) is an interesting one, and with some changes, this paper would deserve publication at a top venue like NeurIPS. However, I think we still see things differently on two points.

Firstly, I do not believe that comparison to existing algorithms is orthogonal to the topic of this paper. You claim that "... we may be able to generate new initialization schemes which can substantially improve training of neural networks from scratch" and I agree, but my point is that there are other ways of obtaining a better initialisation (e.g., unsupervised pretraining and/or layer-wise pretraining) which are known to improve performance and speed up convergence, some of them using less computation than is required to generate a lottery ticket. I view your algorithm as yet another way of generating a good init using some data, which yields good performance, potentially with other benefits like compression, after some amount of fine-tuning (the fact that an LT is trained from scratch and thus requires more fine-tuning than trained weights would seems like a drawback, not an advantage, from this viewpoint). Like R1, I thus have the feeling that your paper is just rediscovering phenomena which are already known, just not for LTs (e.g., using weights pretrained on ImageNet as an init for other datasets is pretty standard practice by now because it often leads to better results and faster convergence). This could still be interesting if LTs provided better performance or some other advantage (like compression) compared to other relevant algorithms (unsupervised pretraining, pruning, etc.). Unfortunately, I cannot assess whether such an advantage exists without a comparison to other existing algorithms.
Furthermore, since you are using standard datasets (ImageNet, CIFAR-10, etc.), it seems that you could provide a comparison to other algorithms (e.g., in terms of accuracy and fraction of pruned weights) without any additional compute.

Secondly, I still have some doubts about the results you report for random initialisation. In particular, you say: "... the relevant comparison here is between winning tickets and random tickets neither of which is rescaled", and that "We therefore consider it unlikely that rescaling would change our core results since we have no reason to expect that rescaling would preferentially benefit winning or random tickets." I am not sure about the winning tickets, but for the random tickets the scale does matter (especially when over 90% of the weights are pruned). Please see He et al., Glorot and Bengio (Xavier initialisation), and other well-cited papers about initialisation of deep neural networks, which show that, depending on the scale of the init, the final performance can vary a lot. I suspect that the randomly initialised networks are underperforming (at least partly) because their scale is way off the commonly used 1 / sqrt(no. of inputs) or 1 / sqrt(no. of inputs + no. of outputs), and it may be that this does not have such a strong effect on LTs because they are picked by magnitude-based pruning, and thus naturally have higher magnitude than a random init. I may be wrong, but without additional experiments, I am unable to accept the "core results" in your paper at face value.

***ORIGINAL REVIEW***

This paper studies whether lottery tickets generalise between datasets and optimisers.
Since generation of well-performing lottery tickets, at least for large datasets, is very computationally expensive (the models are retrained up to 20 or 30 times), finding a method that allows us to go through this process only once and then transfer to other datasets might extend the applicability of lottery tickets for network pruning (although it must be said that the presented algorithm does not allow transfer between different architectures).

Two major concerns remained in my mind after reading this paper. Firstly, since the end use of your algorithm is sparsification, I am really missing a comparison with alternative algorithms that speed up computation at prediction time (you cite some alternative sparsification approaches in the related work section; other alternatives include the line of work exploiting low-precision computation, like “Binarized Neural Networks”, “XNOR-Nets”, etc.). Secondly, even though the original lottery ticket hypothesis is certainly interesting, it seems that “late resetting” turns the algorithm for obtaining the “lottery ticket” into more of a pruning technique, since it essentially makes the initialisation dependent on the data and on the “large network” (because the initial value depends on the first few training iterations of the large network). Hence, viewing your algorithm as a pruning technique, it seems like it should be possible to save some computation and potentially obtain better results by doing transfer learning on top of the **already trained** pruned architecture. Of course, I might be completely wrong, but I would have liked to see the comparison, or at least some discussion of this alternative in relation to your algorithm. I am thus leaning towards recommending rejection of this paper at this time.

Major comments:
- Can you please clarify lines 150-155? In particular, I do not understand whether and how you use the “target” dataset when generating the lottery ticket on the “source” dataset.
- Are the weights in any way rescaled after each pruning so that the scale of the outputs is approximately preserved?
- You explain how the differing number of outputs has been handled in the transfer experiments, but I have missed an explanation of how differences in the input dimension are handled.
- How exactly is “global pruning” executed? Specifically, at initialisation the weights in each layer have scale 1 / sqrt(no. of inputs to the layer), and since there is a growing body of evidence showing that this property is approximately preserved throughout training, magnitude-based pruning will preferentially prune weights in the large layers. This is in line with your observations at the bottom of page 3, but I am not entirely sure this is desirable behaviour. Have you tested what would happen if you pruned based on the rescaled magnitude (e.g., if the initial weight value is w = sqrt(2 / no. of inputs) * eps, then you could prune based only on the value of eps instead of w)?

Minor comments:
- In several places, you cite [7] when [8] should be cited and vice versa (e.g., on p.3, you cite [7] for late resetting).
- I would push back a little on the generality claims: “... suggesting that [winning tickets] are not overfit to a particular optimizer or dataset”. Clearly, the tickets from smaller datasets fared similarly to random on several datasets, and the ResNet example (Fig. 2) and Fig. 3b show that the effects are not universal.
- Random masks (p.4): To facilitate comparison, please follow the previous papers and include the results for models with random reinitialisation of the already pruned architecture (i.e. the ‘preserved mask’ curve from the appendix).
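
The global-pruning question above (ranking by w versus by the rescaled value eps) can be illustrated with a minimal sketch. It assumes He-style initialisation, w = sqrt(2 / fan_in) * eps with eps drawn from a standard normal, and applies one global threshold to freshly initialised, untrained weights. This is a simplification for illustration and not the authors' procedure; the names `he_init` and `global_prune_fractions`, the two layer sizes, and the 90% sparsity level are hypothetical choices.

```python
import math
import random


def he_init(fan_in, fan_out, rng):
    """Sample fan_in * fan_out weights as w = sqrt(2 / fan_in) * eps, eps ~ N(0, 1)."""
    scale = math.sqrt(2.0 / fan_in)
    return [scale * rng.gauss(0.0, 1.0) for _ in range(fan_in * fan_out)]


def global_prune_fractions(layers, fan_ins, sparsity, rescale):
    """Return the fraction of weights pruned in each layer under one global threshold.

    layers:  list of flat weight lists, one per layer
    fan_ins: number of inputs of each layer
    rescale: False ranks weights by |w|;
             True ranks them by |eps| = |w| / sqrt(2 / fan_in)
    """
    scores = []
    for weights, fan_in in zip(layers, fan_ins):
        scale = math.sqrt(2.0 / fan_in) if rescale else 1.0
        scores.append([abs(w) / scale for w in weights])

    # One threshold shared by all layers: the k smallest scores overall are pruned.
    flat = sorted(s for layer in scores for s in layer)
    k = int(sparsity * len(flat))
    threshold = flat[k]
    return [sum(s < threshold for s in layer) / len(layer) for layer in scores]


rng = random.Random(0)
fan_ins = [16, 1024]  # hypothetical: one small and one large layer
layers = [he_init(16, 64, rng), he_init(1024, 64, rng)]

raw = global_prune_fractions(layers, fan_ins, 0.9, rescale=False)
scaled = global_prune_fractions(layers, fan_ins, 0.9, rescale=True)
print("pruned per layer, ranking by |w|:  ", raw)
print("pruned per layer, ranking by |eps|:", scaled)
```

Under these assumptions, ranking by |w| removes a much larger share of the high-fan-in layer than of the small one, whereas ranking by |eps| prunes both layers at roughly the target rate. Whether the same holds for trained weights depends on how well the initial scale is preserved during training, which is the empirical question put to the authors.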

Reviewer 3

# Originality
To my knowledge, this is the first study of the transfer properties of tickets found in one dataset to a different dataset.

# Quality
The experiments were clear and provided error bars across six random seeds for the core experiments.

# Clarity
I found the writing to be clear and it was generally easy to follow the methodology and results. One complaint I have is that the colors in Figure 4 are inconsistent between the subplots. This made it more difficult for me to follow the patterns that were being pointed out.

# Significance
I think that the most significant aspect of this is the study of the transfer properties of tickets found in one dataset to a different dataset. Transfer learning is increasingly important as the cost of training large models grows, and finding ways to sparsify the model could help open the way for faster research.