NeurIPS 2020

Winning the Lottery with Continuous Sparsification

Meta Review

The paper focuses on the recently popular Lottery Ticket Hypothesis (LTH). It provides a novel algorithm, Continuous Sparsification (CS), that finds sparse "matching subnetworks" that can be trained in isolation to full accuracy starting from an early stage of training. Currently, the only existing method for this task is Iterative Magnitude Pruning (IMP). CS finds matching subnetworks that are sparser than those found by IMP, and CS runs much faster (provided certain parallel resources are available). The method is based on a clear and easy-to-understand idea: it introduces one new parameter per weight of the network, which is used to eliminate that weight.

This is a significant and important contribution to the field of LTH. As pointed out by Rev#2, "it makes it possible for us to efficiently scale up the lottery ticket observations". Quoting the same reviewer, "The paper has a clear goal and conducts thorough experiments to demonstrate that the proposed technique meets that goal". Finally, the paper "follows all of the empirical best practices". In light of this, I recommend acceptance of the paper.

However, as agreed in the rebuttal (and based on later post-rebuttal discussions with the reviewers), the authors should fully address the following action items in the final revision. Below I summarize these action items; I ask the authors to refer to the detailed updated reviews and address the exact suggestions described there.

*** 1. Writing quality. *** Improve the writing quality in Sections 1 and 2. The authors have promised "a subsection introducing the reader to the nomenclature and precise definitions."

*** 2. On the relationship between CS and L0 regularization method. *** Considering that all four reviewers were confused on this point, the authors need to update the camera-ready version to feature the discussion from the rebuttal in place of the brief, hand-wavy comparisons that are currently in the paper.

*** 3. Comparison to standard pruning methods. This is IMPORTANT. *** Here, by pruning we mean methods that start from a trained dense network and prune it. If "ticket search is a strictly more general and harder task than pruning," then there is no excuse for the paper not to feature comparisons to standard pruning methods. The authors should therefore prominently feature comparisons to these pruning methods, including the existing comparisons to STR and DNW, as well as an additional comparison to standard magnitude pruning (Gupta, 2017). See the post-rebuttal notes of Rev#1 and Rev#2 for exact suggestions. I would like to stress once again that this point is important. The authors seem to disagree slightly in the rebuttal; however, all the reviewers are convinced that this point should be addressed properly.

*** 4. Comparison to rigged lottery, 5x training. *** The authors compare their method to the Rigged Lottery (RigL) method. RigL does end-to-end sparse training, i.e. it starts with a random sparse network and *never* needs dense resources. Therefore, comparing the resource-demanding CS with RigL is not exactly fair. The authors should either remove the current misleading comparison to RigL or, if they wish to compare fairly, include the 5x (5 times longer) training results for RigL. Since training in RigL is sparse, the total training FLOPs needed for 5x training are usually still lower than those of 1x training with pruning.

* * * * * * * * * * * *

All the reviewers think (and I agree) that these items can be comfortably addressed before the camera-ready deadline. I also believe that addressing them will make the submission significantly stronger and more reader-friendly. In summary, I recommend acceptance and trust the authors to address the (minor but extremely important) action items above.