Review for NeurIPS paper: Optimizing Mode Connectivity via Neuron Alignment

NeurIPS 2020

Review 1

Summary and Contributions: This paper proposes Neuron Alignment, a heuristic method for finding a lower-loss curve between local minima on the loss landscape of neural networks by taking the symmetry of neurons into account. Specifically, Neuron Alignment searches for a permutation of the neurons of one network so that the corresponding weights of the two models are more correlated in parameter space, and then connects one minimum to the permuted version of the other using existing curve-finding methods. The authors prove that their method decreases an upper bound on the expected loss along the curve, and verify empirically on various models and datasets that it finds lower-loss curves than the baseline. The method is also shown empirically to be locally optimal, and it reduces the robust loss barrier on the curve between adversarially robust models.
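To make the alignment step concrete, here is a minimal sketch, assuming that alignment means choosing the permutation that maximizes the summed Pearson correlation between matched neurons' activations on a shared batch. The function names and the toy activations are hypothetical, and the brute-force search over permutations is a stand-in chosen for brevity; a practical implementation would presumably use a linear assignment solver instead.

```python
from itertools import permutations
from statistics import mean, pstdev

def correlation(a, b):
    """Pearson correlation between two equal-length activation sequences."""
    ma, mb = mean(a), mean(b)
    cov = mean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    return cov / (pstdev(a) * pstdev(b))

def align_neurons(acts_1, acts_2):
    """Return perm where neuron perm[i] of model 2 is matched to neuron i of model 1.

    acts_k[i] holds the activations of neuron i of model k over a shared batch.
    Brute force over all permutations: only viable for a handful of neurons.
    """
    n = len(acts_1)
    corr = [[correlation(acts_1[i], acts_2[j]) for j in range(n)] for i in range(n)]
    return max(permutations(range(n)),
               key=lambda p: sum(corr[i][p[i]] for i in range(n)))

# Hypothetical activations of 3 neurons over 4 inputs. Model 2 is a shuffled,
# slightly perturbed copy of model 1 (its rows correspond to rows 2, 0, 1).
acts_1 = [[0.0, 1.0, 2.0, 3.0],
          [1.0, 0.0, 1.0, 0.0],
          [3.0, 1.0, 0.0, 2.0]]
acts_2 = [[3.1, 1.0, 0.1, 2.0],
          [0.0, 1.1, 2.0, 2.9],
          [1.0, 0.1, 0.9, 0.0]]
perm = align_neurons(acts_1, acts_2)
print(perm)
```

Reordering model 2's neurons by `perm` puts each one in the slot of its most correlated counterpart in model 1, which is the state from which curve finding would then start.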

Strengths: -The idea of "Neuron Alignment via Assignment", which takes neuron symmetry into account, is novel and interesting. Finding the permutation of neurons that maximizes the correlations is hard because the problem is discrete, but the authors address it with heuristics and show empirically that their solution is locally optimal. -The authors ran a variety of experiments showing that Neuron Alignment helps to find a better path connecting the two local minima, improving the min/avg accuracy along the path by up to 3-4 percent. This holds consistently across different models (TinyTen/ResNet32/GoogLeNet) and datasets (CIFAR-10/100, ImageNet).

Weaknesses: -The theoretical insight (Theorem 3.1) behind Neuron Alignment is a bit weak. The authors only prove that an upper bound on the averaged loss decreases, and do not discuss how tight this bound is. Besides, there is a small gap between theory and experiments: the theory uses pre-activations for neuron alignment and assumes that the loss and activation functions are Lipschitz-continuous, while the experiments use post-activations for neuron alignment and ReLU activations. -The Neuron Alignment algorithm depends heavily on the structure of the neural network, e.g., the neuron symmetries differ between convolutional layers, fully-connected layers, and residual blocks. This may somewhat limit the applicability of the algorithm.

Correctness: The theoretical claims and empirical methodology appear to be correct.

Clarity: This paper is generally well-written and well-structured, with some minor problems: -In equation 2, the authors refer to the correlations of activations without defining correlation. Since this is an important concept in the paper, it would be better to state the definition (or at least give some explanation) in the paper itself instead of referring readers to another paper. -When the authors mention tables and figures that appear only in the appendix, they should say so, so that readers do not waste time looking for them in the main paper. Examples include "Table 2" at line 223, "Figure 7" at line 265, and "Figure 8" at line 270. -In Figure 2 (top left), the pictures are not clear enough and text such as "\theta_1" is hard to read, so perhaps the authors could find a better way of plotting this. -Typo: Line 265, "in in" -> "in"
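For reference, the kind of definition requested for equation 2 could be as short as the standard Pearson cross-correlation between the activation of neuron i in the first network and neuron j in the second, taken over the input distribution. The notation below (X for activations, \mu and \sigma for their means and standard deviations, superscripts for the network index) is an assumption made for illustration and is not taken from the paper:

```latex
\operatorname{corr}\bigl(X^{(1)}_i, X^{(2)}_j\bigr)
  = \frac{\mathbb{E}\bigl[\bigl(X^{(1)}_i - \mu^{(1)}_i\bigr)\bigl(X^{(2)}_j - \mu^{(2)}_j\bigr)\bigr]}
         {\sigma^{(1)}_i \, \sigma^{(2)}_j}
```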

Relation to Prior Work: This paper is related to previous works about mode connectivity and neuron symmetry/network similarity, all of which are discussed in this paper. The authors also clearly stated that their work bridges mode connectivity and neuron symmetry by introducing a heuristic method, i.e., Neuron Alignment, to find better curves for mode connectivity.

Reproducibility: Yes

Additional Feedback: I have read the other reviews and the authors' feedback, and I have decided to maintain my score. The authors addressed some of my concerns, but I am still worried about the tightness of the bound in Theorem 3.1, which is my main concern. The detailed reasons for keeping my score are listed below: -I thank the authors for providing such detailed feedback! I have gained some more intuition about the proof of Theorem 3.1. However, the proof still relies on many inequalities, e.g., triangle inequalities, matrix norm inequalities, and Lipschitz continuity. Some of these bounds can be tight for specific network structures, but I still find it hard to believe that the upper bound in this theorem is tight enough. I still think that an empirical computation of these bounds would be much more convincing than the current theoretical explanation given by the authors. -The authors did not address my concerns about the clarity of this paper. I think it is very important for a paper to be clearly written, and the current version has some clarity issues. -There is another paper, "Low-loss connection of weight vectors: distribution-based approaches", which I think is closely related to this paper but was published after this paper was submitted to NeurIPS. It would be better if the authors could compare with that paper in the camera-ready version.

Below is the original review.

--------------------------------------------------

-For the upper bounds provided in Theorem 3.1, I wonder whether it is possible to compute or approximate them empirically (perhaps for some small networks with small datasets) so that one can get a sense of whether the inequalities are tight. It would also be better if the authors could provide more intuition about this upper bound, e.g., how it is constructed. If the upper bounds are very loose, then the intuition provided by this theorem will be quite limited. -In this paper, the authors permute the order of the neurons only once. However, if we view the weights of a neural network as a distribution, their order could be permuted many times along the optimal path. In other words, after traveling some distance from one local minimum toward the other, there may be another permutation that makes the current weights more correlated with the destination. An interesting direction for future work would therefore be to investigate "multi-stage" neuron alignment.
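The kind of empirical tightness check suggested above could look like the following sketch. It does not compute the bound of Theorem 3.1, which is not reproduced in this review; it only illustrates, on a hypothetical two-layer ReLU network with made-up weights, how a bound built from Lipschitz and matrix-norm inequalities can be compared against the quantity it bounds.

```python
import math

def frobenius(W):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(w * w for row in W for w in row))

def forward(W1, W2, x):
    """Two-layer network without biases: W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical weights and inputs, chosen only for illustration.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.5]]
W2 = [[1.0, -0.5, 0.25]]
x, y = [1.0, 0.5], [0.8, 0.7]

actual = l2(forward(W1, W2, x), forward(W1, W2, y))
# ReLU is 1-Lipschitz and the Frobenius norm upper-bounds the spectral norm,
# so ||f(x) - f(y)|| <= ||W2||_F * ||W1||_F * ||x - y||.
bound = frobenius(W2) * frobenius(W1) * l2(x, y)
tightness = actual / bound  # 1.0 would mean the bound is attained
print(actual, bound, tightness)
```

Reporting a ratio like `tightness` for the paper's actual bound, averaged over data and points on the curve, would show directly how much slack the chained inequalities introduce.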

Review 2

Summary and Contributions: This paper focuses on the problem of curve finding, i.e. finding a curve connecting two modes (e.g. global minima) in a neural network loss landscape such that the average loss along the curve is as low as possible. The main contributions of this paper are: -- It proposes neuron alignment (a procedure that permutes a network in the parameter space but does not change it in the function space) as an addition to curve finding algorithms. Neuron alignment can either work as a modular pre-processing step for any curve finding algorithm, or be jointly optimized along with the curve. -- It shows experimentally that aligning the networks can improve the performance of common curve finding algorithms. The same observation holds when connecting adversarially robust models. -- It provides some theoretical justification for why aligned networks may be better connected than unaligned networks.
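The claim that a permutation moves a network in parameter space while leaving it unchanged in function space can be checked on a toy example. The sketch below assumes a hypothetical two-layer fully-connected ReLU network with made-up weights; convolutional layers and residual blocks would need their own permutation rules.

```python
def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W1, b1, W2, x):
    """Two-layer network: W2 @ relu(W1 @ x + b1)."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(W1, x), b1)]
    return matvec(W2, hidden)

def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1 and entries of b1, plus the matching columns of W2."""
    W1_p = [W1[j] for j in perm]
    b1_p = [b1[j] for j in perm]
    W2_p = [[row[j] for j in perm] for row in W2]
    return W1_p, b1_p, W2_p

# Hypothetical weights, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[2.0, -1.0, 0.5]]
x = [1.0, 2.0]

original = forward(W1, b1, W2, x)
W1_p, b1_p, W2_p = permute_hidden(W1, b1, W2, [2, 0, 1])
permuted = forward(W1_p, b1_p, W2_p, x)
print(original, permuted)
```

The two outputs agree up to floating-point rounding even though `W1_p` differs from `W1` as a point in weight space, which is exactly the freedom that alignment exploits.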

Strengths: -- The idea of using neuron alignment (as a reparametrization) to enhance mode connectivity is pretty neat, and novel (as far as I’m aware). It makes a lot of sense that permuting the neurons does not change a network, but can improve connectivity in the parameter space when used to align two networks. -- The experiments seem comprehensive (3 networks of different sizes on 3 image classification tasks) and convincing enough to me. It’s also good to see that the same results hold for adversarially robust models. It’s a bit concerning that baseline curve finding already works reasonably well in terms of average loss (and thus there is little room for improvement from alignment, cf. Table 1). I’m curious whether the authors have tried vanilla *linear* mode connectivity between aligned vs. unaligned networks. Because the curve is linear and not learnable, performance will degrade for both aligned and unaligned networks, but I wonder whether neuron alignment still helps there. -- At a high level, the experimental results provide a bit of new knowledge about mode connectivity and the loss landscape of neural networks. -- I liked the proposal and discussion of joint optimization of the permutation and the curve (PAM), especially the observation that PAM does not help when initialized from a pre-learned alignment, but still helps when initialized from the identity permutation (i.e. no permutation). This suggests that PAM works to some extent, but the simpler two-stage algorithm (alignment, then curve learning) performs just as well, so there may be no need to run the more sophisticated PAM.
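The linear-interpolation question can be illustrated with a deliberately extreme toy case, sketched below under strong assumptions: the second network is an exact hidden-unit permutation of the first, so perfect alignment exists, and the "loss" is the mean squared error against the first network's outputs on three made-up inputs. Real networks trained from different seeds would only be approximately alignable.

```python
def forward(params, x):
    """Two-layer ReLU network with scalar output; params = (W1, w2)."""
    W1, w2 = params
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * h for w, h in zip(w2, hidden))

def interpolate(p, q, t):
    """Pointwise (1 - t) * p + t * q over matching weights."""
    W1 = [[(1 - t) * a + t * b for a, b in zip(ra, rb)] for ra, rb in zip(p[0], q[0])]
    w2 = [(1 - t) * a + t * b for a, b in zip(p[1], q[1])]
    return W1, w2

def permute_hidden(params, perm):
    """Reorder hidden units consistently in both layers."""
    W1, w2 = params
    return [W1[j] for j in perm], [w2[j] for j in perm]

def worst_loss_on_segment(p, q, data, steps=10):
    """Largest mean squared error over evenly spaced points on the segment."""
    losses = []
    for k in range(steps + 1):
        mid = interpolate(p, q, k / steps)
        losses.append(sum((forward(mid, x) - y) ** 2 for x, y in data) / len(data))
    return max(losses)

model_a = ([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0])
model_b = permute_hidden(model_a, [1, 0])    # same function, different weights
inputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
data = [(x, forward(model_a, x)) for x in inputs]

unaligned = worst_loss_on_segment(model_a, model_b, data)
aligned = worst_loss_on_segment(model_a, permute_hidden(model_b, [1, 0]), data)
print(unaligned, aligned)
```

In this toy case the unaligned segment passes through a midpoint whose output weights cancel, so the loss there is large, while the aligned segment stays at essentially zero loss. Whether a comparable gap survives for independently trained networks is the empirical question raised above.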

Weaknesses: -- I’m concerned about the strength / implications of the theoretical result (Theorem 3.1). Most importantly, the theorem compares unaligned and aligned networks in terms of curve finding, but the comparison is between *upper bounds*, not actual values. In this case, it would be good to have more explicit expressions for the two upper bounds (e.g. how they depend on various problem parameters) and a discussion of how tight the bounds are, both of which are missing in the current paper. Without these details, it could a priori be the case that both bounds are very loose, in which case it does not make sense to draw any conclusion about the two algorithms from them. -- Algorithmic novelty. At its core, the proposed algorithm seems to be a combination of existing techniques (learning permutations and curve finding). Because I see this as an understanding paper, though, I am giving this point less weight.

Correctness: The theoretical results and the experimental methodology are correct as far as I could verify.

Clarity: The paper is generally quite well presented and I didn’t have a hard time understanding it.

Relation to Prior Work: The paper discusses related work on mode connectivity and curve finding, weight symmetry, and network similarity. These all seem relevant enough and position the present paper well in terms of its delta with prior work. I’m not very familiar with prior work in the specific directions of curve finding / mode connectivity, though, so I may not be the best judge of whether the related work is comprehensive enough.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper explores the loss landscape of trained neural nets. It focuses on the problem of mode connectivity: given two optimized instances of the same neural net architecture, one seeks a curve on the loss landscape, parametrized by the network weights, that connects the two. The authors continue the recent line of work that learns curves with minimal impact on the loss. The main observation is that several points in weight space correspond to the same network due to weight symmetry. The authors go on to develop an algorithm that finds a locally suitable permutation using neuron alignment. Notably, the authors find robust models with a lower loss on the curve between two robust models. Update: I thank the authors for their explanation. It would be good to add the subset size explanation to the final version. The score stays as it is.
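As a rough picture of what "a curve connecting the two" means, the sketch below uses a quadratic Bézier curve, a parametrization commonly used in the mode-connectivity literature; whether the paper uses exactly this form is not stated in this review. The two-dimensional ring-shaped loss and the hand-picked control point are hypothetical; in curve finding the control point would be learned by minimizing the average loss along the curve.

```python
def bezier_point(theta_1, theta_2, bend, t):
    """Quadratic Bezier curve from theta_1 (t = 0) to theta_2 (t = 1) with control point bend."""
    return [(1 - t) ** 2 * a + 2 * t * (1 - t) * c + t ** 2 * b
            for a, b, c in zip(theta_1, theta_2, bend)]

def average_loss_on_curve(loss, theta_1, theta_2, bend, samples=21):
    """Average of the loss over evenly spaced values of t in [0, 1]."""
    ts = [i / (samples - 1) for i in range(samples)]
    return sum(loss(bezier_point(theta_1, theta_2, bend, t)) for t in ts) / samples

def ring_loss(theta):
    """Toy landscape whose minima form the unit circle, with a bump at the origin."""
    return (theta[0] ** 2 + theta[1] ** 2 - 1.0) ** 2

theta_1, theta_2 = [-1.0, 0.0], [1.0, 0.0]   # two minima on opposite sides

# A control point at the midpoint of the endpoints reduces the curve to the
# straight segment, which crosses the bump at the origin.
straight = average_loss_on_curve(ring_loss, theta_1, theta_2, [0.0, 0.0])
# A hand-picked control point bends the curve around the bump instead.
curved = average_loss_on_curve(ring_loss, theta_1, theta_2, [0.0, 2.0])
print(straight, curved)
```

The bent curve has a much lower average loss than the straight segment between the same two endpoints. Alignment changes which endpoint representation the curve has to reach, which is a separate lever from bending the curve itself.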

Strengths: The main contribution to the NeurIPS community is an algorithm that finds a robust, optimized model with minimal cost. Another contribution is insight into how important exploiting symmetries is for mode connectivity. The authors test their curve loss empirically and verify with Proximal Alternating Minimization that it is locally optimal.

Weaknesses: Theorem 3.1, which shows tighter loss bounds, is weak in that it doesn't connect the aligned and unaligned losses. The discussion of Algorithm 1 should cover how the choice of a specific subset of X affects the algorithm's performance. The authors do not compare the resulting loss of neuron alignment to curves found by previous methods, e.g., Garipov et al.

Correctness: The methods and claims appear to be fully supported and correct: symmetry is theoretically and empirically shown to be an essential factor in curve optimization.

Clarity: The paper is well written. However, some parts of the text are a little long and vague, e.g. the "Theory for using Neuron Alignment" paragraph, and would benefit from being shortened.

Relation to Prior Work: Yes. The authors discuss the importance of weight symmetry and compare their work to many recent works on curve optimization in the related work section.

Reproducibility: Yes

Additional Feedback: