Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Summary: The authors analyse the energy landscape associated with the training of deep neural networks and introduce the concept of Asymmetric Valleys (AV), local minima that cannot be classified as either sharp or flat. AV are characterized by the presence of asymmetric directions, along which the loss increases abruptly on one side and is almost flat on the other. The presence of AV in commonly used architectures is shown empirically: asymmetric directions can be found with "decent probability". The authors explain why SGD with averaged updates behaves well (in terms of the generalization properties of the trained model) in the proximity of AV. Strengths: The study of the energy landscape of neural networks is a recent and important topic. Existing analyses are often based on the flat-sharp classification of local minima, and showing the presence of stationary points that escape such a basic distinction is an important contribution. Moreover, the paper explains clearly and justifies theoretically how the flatness and sharpness of minima affect the generalization properties of the trained model. The introduction contains a nice and accessible review of recent works on the topic. Weaknesses: It is hard to assess the contribution of the paper from a practical point of view, as SGD automatically avoids possible training problems related to AV. On the theoretical side, the paper does not investigate deeply the relationship between AV and the usual flat or sharp minima. For example, how are AV connected to each other, and which properties of sharp/flat minima generalize to AV? Questions: - Under what conditions (on the network architecture) do AV appear? Is there an intuitive interpretation of why they can be found in the loss function associated with many "modern" neural networks? - Is the presence of AV restricted to the over-parameterized case? If yes, what happens in the under-parameterized situation? Which of the given theoretical properties extend to that case (as local minima cannot be expected to be equivalent in the under-parameterized case)? - Does the structure of AV depend on the type of objective function used for training? What happens if an L2 penalty term on the weights is added to the loss function? - Would it be possible to build a 2-dimensional analytical example of AV? - Averaged SGD also performs well in the case of a convex loss. Is its good behaviour around an AV related to this? - In Section 3.2, it is claimed that AV can be found with "decent probability". What is the order of magnitude of such a probability? Does it depend on the complexity of the model? Does this mean that most of the minima are AV?
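The questions above about an analytical example and about averaged SGD can be made concrete with a toy sketch. It is not taken from the paper and is only one-dimensional rather than the 2-dimensional example asked for: the loss is assumed to be 0.5*steep*w^2 for w > 0 and 0.5*flat*w^2 for w <= 0, so the minimum sits at w = 0 with a steep side and a flat side. The function names, curvature values, step size, and noise level are all hypothetical choices made for illustration.

```python
import random


def grad(w, steep, flat):
    """Gradient of the assumed toy loss:
    0.5 * steep * w**2 for w > 0, 0.5 * flat * w**2 for w <= 0."""
    return steep * w if w > 0 else flat * w


def averaged_sgd(steep, flat, lr=0.05, noise=1.0, steps=200_000, seed=0):
    """Run constant-step SGD with additive Gaussian gradient noise,
    starting at the minimum w = 0, and return the average of the iterates."""
    rng = random.Random(seed)
    w = 0.0
    total = 0.0
    for _ in range(steps):
        noisy_grad = grad(w, steep, flat) + noise * rng.gauss(0.0, 1.0)
        w -= lr * noisy_grad
        total += w
    return total / steps


if __name__ == "__main__":
    # Asymmetric valley: steep for w > 0, flat for w < 0 (hypothetical values).
    print("asymmetric valley, averaged iterate:", averaged_sgd(10.0, 0.5))
    # Symmetric valley with the same curvature on both sides, for comparison.
    print("symmetric valley, averaged iterate:", averaged_sgd(10.0, 10.0))
```

In this toy setting the averaged iterate should land on the flat (negative) side of the minimum in the asymmetric case and stay close to zero in the symmetric case. Whether this carries over to the high-dimensional losses studied in the paper is exactly what the questions above ask the authors to clarify.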
As you described in the paper, there are many studies on the generalization analysis of DNNs, including those based on flatness/sharpness. Among these studies, this paper focuses on a novel concept in the loss landscape of DNNs. This work is potentially useful, but the current version provides little explicit motivation for the proposed concept. What is the main contribution of analyzing asymmetric valleys compared with other concepts? What problems does this concept solve that cannot be solved with other theories? For example, flatness typically seems to include asymmetric valleys.
The paper provides interesting insights about the existence of asymmetric valleys in deep neural networks and claims that asymmetric valleys can lead to better generalization. The authors impose slightly non-standard, strong assumptions, but demonstrate empirically that these assumptions hold in practice and are not difficult to satisfy. The paper is extremely well written and easy to read. However, there are a few issues which concern me: 1) The authors state that the biased solution at an asymmetric valley is better than the unbiased solution at an asymmetric valley. How does it compare to an unbiased solution at a symmetric valley? It is not clear how often these algorithms end up in asymmetric valleys, or whether using a biased solution is a good idea when we land in a symmetric valley. 2) Do the authors optimize the learning rate for SGD? Additionally, what decay schedules have the authors experimented with? 3) Why do the authors run SGD from the SWA solutions? While this provides evidence that, if we are at an asymmetric valley, SWA performs better than SGD, it is unclear how often SGD ends up in the same neighborhood (valley) as the SWA solution, and what the relative generalization guarantees are. Post-rebuttal: Having read the authors' rebuttal, I would like to change my review from weak accept to accept.