Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper proposes an interesting technique for training neural networks while simultaneously learning part of the architecture by splitting neurons in a principled way. More precisely, the algorithm defines a notion of steepest descent in the space of distributions over neuron weights, equipped with the L-infinity Wasserstein distance. The corresponding steepest descent algorithm recovers the usual descent direction as long as a "usual" descent direction exists; but if a local minimum is reached and further local progress could be made by duplicating (or creating several copies of) neurons and decoupling them, then the algorithm finds a locally optimal split.

The paper is well written and novel, with clear, simple theory. The idea is simple and elegant. The paper features a significant amount of compelling experiments with different neural network architectures, on both synthetic and real data. In the discussion, the reviewers again mentioned the above qualities of the paper.

In terms of points that could potentially be improved, in addition to what is mentioned in the reviews:

1) The reviewers expressed concern that the paper does not compare the efficacy of the proposed algorithm in terms of computational cost (say, when plenty of memory is available so that memory is not a constraint): is the proposed algorithm competitive in running time with baselines that consist in training a much larger network and pruning it?

2) The reviewers mentioned that, for networks with multiple hidden layers, it would be interesting to have some information on the architectures learned by the proposed algorithm; in particular, whether or not they are balanced.

3) It would be interesting to compare with recent, more efficient pruning techniques such as: Frankle, J., & Carbin, M. (2018). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR.

Please see also the updated reviews of the paper. I can only encourage the authors to take these comments into account when preparing the final version of the manuscript.
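To make the splitting idea discussed above concrete, here is a minimal numpy sketch of duplicating a hidden neuron into two decoupled copies while approximately preserving the network function. This is only an illustration of the general principle, not the paper's actual splitting-steepest-descent procedure; the names `split_neuron`, `v` (the split direction), and `eps` are hypothetical.

```python
import numpy as np

def forward(x, W_in, W_out, act=np.tanh):
    """Two-layer net: x -> act(W_in @ x) -> W_out @ hidden."""
    return W_out @ act(W_in @ x)

def split_neuron(W_in, W_out, idx, v, eps=1e-2):
    """Replace hidden neuron `idx` by two copies whose incoming weights are
    perturbed along +/- eps * v, each inheriting half of the original outgoing
    weights. For eps -> 0 the network function is unchanged (the error is
    O(eps^2)), so the split only changes the parameterization, after which the
    two copies can be trained independently."""
    w = W_in[idx]                                  # incoming weights of the neuron
    W_in_new = np.vstack([W_in, w + eps * v, w - eps * v])
    W_in_new = np.delete(W_in_new, idx, axis=0)    # drop the original neuron
    a = W_out[:, idx:idx + 1]                      # its outgoing weights
    W_out_new = np.hstack([W_out, a / 2, a / 2])   # each copy carries half
    W_out_new = np.delete(W_out_new, idx, axis=1)
    return W_in_new, W_out_new
```

The halving of the outgoing weights is what makes the split function-preserving at eps = 0, which is why splitting can escape a local minimum without an immediate jump in the loss.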