Review for NeurIPS paper: MMA Regularization: Decorrelating Weights of Neural Networks by Maximizing the Minimal Angles

NeurIPS 2020

Review 1

Summary and Contributions: In naively trained neural networks, the neurons in each layer can be strongly correlated, i.e., their input weight vectors are closely aligned with one another. This paper claims that this strong correlation can hurt test performance. To address the issue, the authors propose maximizing the minimal pairwise angle (MMA) between the input weight vectors of each hidden layer.
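To make the summarized idea concrete, here is a minimal sketch of such a penalty in plain Python. It is one plausible reading of "maximize the minimal pairwise angle", not necessarily the paper's exact formulation: the function names `min_pairwise_angles` and `mma_penalty` are hypothetical, and averaging the per-neuron minimal angles is an assumption.

```python
import math

def min_pairwise_angles(weights):
    """For each weight vector, return the angle (radians) to its closest other vector."""
    # Normalize every row so that dot products equal cosines.
    units = []
    for w in weights:
        norm = math.sqrt(sum(x * x for x in w))
        units.append([x / norm for x in w])

    angles = []
    for i, u in enumerate(units):
        best = math.pi  # largest possible angle between two vectors
        for j, v in enumerate(units):
            if i == j:
                continue  # skip the vector's angle with itself
            cos = sum(a * b for a, b in zip(u, v))
            cos = max(-1.0, min(1.0, cos))  # guard against rounding error
            best = min(best, math.acos(cos))
        angles.append(best)
    return angles

def mma_penalty(weights):
    """Negative mean of the per-vector minimal angles (assumed form).

    Minimizing this value pushes each vector away from its nearest neighbour.
    """
    angles = min_pairwise_angles(weights)
    return -sum(angles) / len(angles)
```

In a training loop this quantity would be scaled by a regularization coefficient and added to the task loss for each regularized layer; an autodiff framework would be needed to obtain gradients, which this sketch does not provide.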

Strengths: The experiments show consistent improvements over previous diversity-enhancing regularizations. In addition, the extra computational overhead of the MMA regularization is negligible. It seems promising to apply this regularization to other problems.

Weaknesses: The contribution of this paper seems incremental. The neuron diversity issue was already raised and investigated in previous studies. Moreover, as far as I can tell, the improvement in test performance over the other methods is not that significant. Please correct me if I have misunderstood. In addition, it is not clear whether the new regularization can be combined with other popular regularization methods, such as weight decay, data augmentation, etc. The authors may need to compare with SOTA or consider other datasets, on which either the improvement becomes significant or MMA can be combined with other regularization techniques to further improve test performance.

Correctness: The claim that a lack of neuron diversity can hurt generalization performance is indeed a good motivation for proposing the new method. However, it is not supported by any theoretical or numerical evidence. It is doubtful that the improvement in accuracy actually comes from the increase in neuron diversity. After all, the new regularization might introduce other effects that we are not aware of. More ablation studies are needed to identify the real reason for the improvement.

Clarity: The paper is well written and organized. I can easily follow the authors' idea.

Relation to Prior Work: Yes

Reproducibility: No

Additional Feedback: The authors may need to provide the details of the training procedure for the experiments in Section 5. It is well known that the batch size, learning rate, and optimizer can significantly affect the test performance for these problems. == After rebuttal == The authors' response addressed my concern about the significance of the improvements. I raised my score to 6, considering that this regularizer can be easily and efficiently applied to different problems.

Review 2

Summary and Contributions: This paper proposes a numerical method for the Tammes problem. By maximizing the minimal pairwise angles, the method can achieve near-optimal solutions under different settings. The method can then be easily applied in deep neural networks as an extra regularization loss to promote angular diversity between pairwise weight vectors. Experimental results show that the proposed method can improve the performance in both image classification and face recognition tasks.

Strengths: 1. This paper proposes a regularization term called MMA to promote angular diversity between pairwise weight vectors in DNNs. The method is easy to implement and improves performances in classification and recognition tasks. It can be easily applied in other tasks. 2. The paper gives detailed analyses of the gradient norm, thus the reason to use pairwise angles instead of cosine similarities is clearly explained. 3. The experiments demonstrate the effectiveness of MMA in both decorrelating weight vectors and improving network performances.
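The gradient-norm argument mentioned in point 2 can be illustrated with a short calculation. This is a standard calculus observation offered as a sketch, not a reproduction of the paper's derivation; the symbols $w_i$, $w_j$ and $\theta$ are assumed notation for two weight vectors and the angle between them.

```latex
% Assumed notation: \theta \in (0, \pi) is the angle between w_i and w_j,
% so that \cos\theta = \frac{w_i^\top w_j}{\|w_i\|\,\|w_j\|}.
\[
\left\| \frac{\partial \cos\theta}{\partial w_i} \right\|
  = \frac{\sin\theta}{\|w_i\|},
\qquad
\left\| \frac{\partial \theta}{\partial w_i} \right\|
  = \frac{1}{\sin\theta} \left\| \frac{\partial \cos\theta}{\partial w_i} \right\|
  = \frac{1}{\|w_i\|}.
\]
% As \theta \to 0 (two nearly parallel vectors), the cosine gradient vanishes,
% while the gradient of the angle keeps a norm that does not depend on \theta.
```

Under these assumptions, a cosine-based penalty provides the weakest signal exactly when two weight vectors are most strongly correlated, whereas an angle-based penalty does not.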

Weaknesses: 1. The regularization coefficient is important. It can be seen from Figure 3 that the influence of different coefficients is unstable within a single experiment. To better clarify this influence, the authors should report results averaged over multiple runs under the same setting. 2. In Section 5, it seems that the proposed MMA can effectively improve image classification performance. I wonder whether the convergence speed and stability change after introducing the MMA regularization into the training process. It would therefore be better to provide the training loss curves with and without MMA regularization. 3. The proposed method aims to decorrelate the network weights. Some works, such as [1] [2] [3], that analyze and reduce the redundancy of convolutional filters are related to this direction. They should be discussed in Section 2. [1] Haase D, Amthor M. Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 14600-14609. [2] Guo J, Li Y, Lin W, et al. Network decoupling: From regular to depthwise separable convolutions[J]. arXiv preprint arXiv:1808.05517, 2018. [3] Jaderberg M, Vedaldi A, Zisserman A. Speeding up convolutional neural networks with low rank expansions[J]. arXiv preprint arXiv:1405.3866, 2014.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Clearly

Reproducibility: Yes

Additional Feedback: The authors' response addressed my concern, so I maintain my rating (should be accepted).

Review 3

Summary and Contributions: ===UPDATE===== I have read the rebuttal as well as the other reviews. My concerns have been addressed, so I keep my score. =============== The paper introduces the maximal minimal angle (MMA) regularizer as a way to encourage diversity among the weight vectors in the layers of a DNN. The authors analyze the optimization issues, suggest reasonable modifications, and compare against baselines. Importantly, the proposed method and the baselines are compared on several instances of the Tammes problem (that is, placing a given number of points on a unit sphere so as to maximize the distance between the closest neighbors), for which the optimal answers are known, and it can be clearly seen that MMA outperforms its analogues. Next, the authors use MMA as a regularizer for DNNs and show (1) improvements when the regularizer is added with an optimal hyperparameter; (2) a correlation between the minimal angle in the DNN layers and test accuracy. Overall, I think this method is of value to the community since it reveals that additional decorrelation really matters for improving generalization.
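As a small illustration of how a candidate Tammes configuration can be scored, the sketch below computes the smallest pairwise angle of a point set and evaluates it on two configurations whose optimality is classical: the regular tetrahedron (4 points) and the regular octahedron (6 points) in three dimensions. The helper name `minimal_angle` is hypothetical, and this is only an evaluation routine, not the optimization method discussed in the paper.

```python
import math
from itertools import combinations

def minimal_angle(points):
    """Smallest pairwise angle (radians) among the given nonzero vectors."""
    best = math.pi
    for p, q in combinations(points, 2):
        dot = sum(a * b for a, b in zip(p, q))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_q = math.sqrt(sum(b * b for b in q))
        # Clamp to [-1, 1] so rounding error cannot break acos.
        cos = max(-1.0, min(1.0, dot / (norm_p * norm_q)))
        best = min(best, math.acos(cos))
    return best

# Vertices of a regular tetrahedron (not normalized; minimal_angle normalizes).
tetrahedron = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
# Vertices of a regular octahedron.
octahedron = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

print(math.degrees(minimal_angle(tetrahedron)))  # arccos(-1/3), about 109.47 degrees
print(math.degrees(minimal_angle(octahedron)))   # 90 degrees between adjacent vertices
```

A numerical solver can be judged by how close the minimal angle of its output comes to such known optima.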

Strengths: The paper is methodologically solid, with motivation, analysis of existing tools, careful comparison with baselines, an ablation study, and final experiments on a wide range of architectures. The proposed method is computationally efficient and can be easily integrated into existing learning tools. I think it is relevant for the community.

Weaknesses: There are some questions/concerns, however. 1. Have you tried setting the hyperparameters of the baseline models via cross-validation (i.e., the same method you used for your own model)? Setting them to their default values (even if taken from other papers) risks an unfair comparison against your method. I do not think this is the case, but I would recommend that the authors carry out the corresponding experiments. 2. It is unclear to me why the performance of DNN+MMA becomes worse than that of the vanilla DNN when lambda becomes small. See Fig. 3-4. I would expect it to approach the vanilla method from above, not from below.

Correctness: I think the claims and method are correct. The results seem to be convincing.

Clarity: The paper is clearly written and easy to follow.

Relation to Prior Work: The paper discusses different alternatives for encouraging weight diversity. The closest analogues are described in more detail, and the proposed method is fairly compared against them.

Reproducibility: Yes

Additional Feedback: Compare against baselines whose hyperparameters were set via cross-validation.

Review 4

Summary and Contributions: In this paper, the authors propose a diversity regularization on the filters of a neural network, in which the normalized weight vectors of neurons are encouraged to be distributed uniformly on a hypersphere. Performance improvements are observed on CIFAR-100 and TinyImageNet.

Strengths: + Removing strong correlations among filters seems to me a good idea. + The authors use the Tammes problem as the inspiration. + The analytical part seems sound.

Weaknesses: The authors could provide more insights and illustrations. For example, - On the one hand, I agree that removing filter correlation is a good idea. On the other hand, more insight would help to show that uniformly distributed filters are actually the way to go. - Why is MMA superior to the orthogonal regularization in (14), which seems to be a very similar objective? - Orthogonality or uniform distribution is more often enforced over (inter-class) features. From this perspective, what is the direct effect of MMA on features? I notice some discussion around line 195; is there any additional illustration or support? For example, in Table 6, will MMA alone match ArcFace performance?

Correctness: The method and empirical methodology look sound in general.

Clarity: The paper is clearly written.

Relation to Prior Work: The difference from previous contributions is clearly discussed.

Reproducibility: Yes

Additional Feedback: