Reviews: Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks

Reviewer 1

Originality. This work is a follow-up to Anil et al. 2019 and extends their approach to the case of convolutional layers. As such, this is incremental work, although some hurdles have to be overcome in order to extend the results to this new setting. The BCOP parametrization of gradient-norm-preserving convolutional layers seems to be a novel application of previous work (Xiao et al. 2018). The disconnectedness result for 1-D convolutions adds to the understanding of the difficulty of the problem.

Quality. The proposed algorithms are technically and theoretically sound. However, I find some parts lacking. (1) There is no mention or discussion of how the method could adapt to metrics other than the L2 metric, which makes the method somewhat limited. (2) There are some claims that are either incorrect or have no reference, e.g. (2a) Section 2.2, second paragraph: "one can show that the norm of the gradient backpropagating through a 1-Lipschitz...". This statement is not clear at all and, moreover, does not point to any reference. (2b) Section 3.1.1, second paragraph: "this is a valid projection under the matrix 2-norm but not the Frobenius norm". This is a reference to a paper by Gouk et al. 2018, where a matrix is scaled by its matrix 2-norm. However, in my understanding this is NOT the projection onto the 1-ball in matrix 2-norm; rather, one has to do singular value clipping, and that would be the projection w.r.t. both the 2-norm and the Frobenius norm. (Edit: I stand corrected, as the rescaling is indeed a valid projection under the 2 operator norm.) (2c) Same paragraph: "... is not guaranteed to converge to the correct solution". Why? Is there a reference? (2d) Next paragraph: "... permits Euclidean steepest descent". I might be missing something, but I am not sure what is meant by this. (2e) Section 3.1.1, first paragraph: "The Lipschitz constant of the convolution operator is bounded by *a constant factor* of the spectral norm of the kernel reshaped into a matrix". Perhaps I am missing something, but can the authors comment on where this "constant factor" comes from? I would assume that the spectral norm of the kernel reshaped into a matrix is exactly the Lipschitz constant of the convolution, given that the convolution operator is equivalent to the matrix form (they are the same function). (2f) Section 4.4, first paragraph: "ensuring gradient norm preservation is critical for obtaining tighter lower bounds on the Wasserstein distance". As I understand it, this is not critical, in the sense that there might be better methods to approximate 1-Lipschitz functions that do not rely on gradient norm preservation. If there is some negative result stating that this is not possible without gradient norm preservation, there should be a clear reference. (3) The results in Table 1 show marginal improvements over existing methods, for example 51.47 vs. 50.00 or 49.37 vs. 48.07. Without any sort of confidence intervals or repeated trials, it is too difficult to assess how much of an improvement this is. The same goes for the other tables.

Clarity. The general idea of the paper can be grasped on one reading; however, there are many details that are confusing and need improvement. (1) Throughout the paper there are many references to the term "Lipschitz networks". What do you mean by this? As long as the activations are Lipschitz continuous (most of them are), any network is Lipschitz continuous. Maybe you want to say "networks with a small Lipschitz constant...". Again, in the second paragraph you say "... ensure that each piece of the computation is Lipschitz". What do you mean by piece? I guess you mean layers or something similar; in this case, again, any linear layer with a Lipschitz activation is Lipschitz. Perhaps you mean to say "ensure that each layer has a small Lipschitz constant". This makes the claims confusing. (2) I feel there are some functions that are not properly introduced, and one has to make an effort to understand them. The notation Bjorck(R) could be introduced better, and the notation SymmetricProjector(M) is not introduced at all before BCOP is presented. (3) When the experimental section starts, it is not clear how the networks were trained. I think this is only stated explicitly later, on the last page: "we implicitly enforce the Lipschitz constant of the network to be exactly 1". This should be made clear before the experiments section.

Significance. I think the extension to the convolutional case is important, as it is a widely used type of layer and the results of the previous work by Anil et al. 2019 seem promising. However, the experimental part needs some "technical" work to better assess the improvements that one can see in practice.
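Points (2b) and (2e) above can be checked numerically on toy cases. The following is a minimal sketch, not taken from the paper: it assumes a single-channel 1-D circular convolution with the hypothetical kernel [1, 1] and a hypothetical 2x2 diagonal weight matrix, and the helper name `circular_conv_matrix` is invented for illustration. Under these assumptions, the Lipschitz constant of the convolution exceeds the spectral norm of the reshaped kernel, and rescaling by the spectral norm attains the same operator-norm distance as singular value clipping while being farther away in Frobenius norm.

```python
import numpy as np

def circular_conv_matrix(kernel, n):
    """Dense matrix of a single-channel 1-D circular convolution on length-n signals.
    (Hypothetical helper for this toy check only.)"""
    mat = np.zeros((n, n))
    for i in range(n):
        for j, weight in enumerate(kernel):
            mat[i, (i + j) % n] += weight
    return mat

# (2e) Lipschitz constant of the convolution vs. spectral norm of the reshaped kernel.
kernel = np.array([1.0, 1.0])
conv_lipschitz = np.linalg.norm(circular_conv_matrix(kernel, 8), 2)  # 2.0 up to rounding
reshaped_norm = np.linalg.norm(kernel.reshape(1, -1), 2)             # sqrt(2)
print("conv operator norm:", conv_lipschitz)
print("reshaped kernel norm:", reshaped_norm)

# (2b) Rescaling by the spectral norm vs. singular value clipping.
W = np.diag([3.0, 0.5])                      # singular values 3 and 0.5
rescaled = W / np.linalg.norm(W, 2)          # singular values 1 and 1/6
U, s, Vt = np.linalg.svd(W)
clipped = U @ np.diag(np.minimum(s, 1.0)) @ Vt  # singular values 1 and 0.5

# Distance from W to each candidate, in operator norm and in Frobenius norm.
op_rescaled = np.linalg.norm(W - rescaled, 2)
op_clipped = np.linalg.norm(W - clipped, 2)
fro_rescaled = np.linalg.norm(W - rescaled, "fro")
fro_clipped = np.linalg.norm(W - clipped, "fro")
print("operator-norm distances:", op_rescaled, op_clipped)    # equal
print("Frobenius distances:", fro_rescaled, fro_clipped)      # clipping is smaller
```

In this toy convolution the gap is a factor of sqrt(2), the square root of the kernel size, which is consistent with a kernel-size-dependent "constant factor"; the multi-channel 2-D case is not checked here. The second part agrees with the reviewer's edit: both candidates are at operator-norm distance 2 from W, but only clipping minimizes the Frobenius distance.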

Reviewer 2

This paper studies the block convolutional orthogonal parametrization (BCOP) for convolutional neural networks in the presence of Lipschitz constraints. During training of the network, gradient norm preservation is utilized to combat gradient attenuation. This simple combination of the two main ideas outlined above is shown to be useful in two settings, namely adversarial training and computing the Wasserstein distance using Kantorovich duality. Admittedly, I might have missed important pieces of the paper, but I regret to say that the paper is limited in terms of its novelty. As is, the paper feels like a combination of two existing ideas from Anil et al. and Xiao et al. I would argue that a broader set of experiments could have made up for the limited novelty of the paper. For example, the Wasserstein computation is a crucial step in training a Wasserstein GAN, so I was really curious to see whether the benefits ultimately transfer to better generative modeling. On the same note, another application could have been learning stochastic models of the world in reinforcement learning, where it is important to compute models with low Wasserstein errors.
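The Kantorovich dual mentioned above lower-bounds the Wasserstein-1 distance by E_P[f] - E_Q[f] for any 1-Lipschitz critic f. The following is a minimal 1-D sketch under assumptions of my own (Gaussian samples, Q equal to P shifted by 2, linear critics, hypothetical helper names `w1_exact` and `dual_bound`); it is not the paper's experiment. It only illustrates that a critic whose slope has magnitude below 1 yields a looser bound than one with slope of magnitude exactly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal(1000)
q = p + 2.0  # Q is P shifted by 2, so W1 between the two samples is 2

def w1_exact(a, b):
    """Exact W1 between two equal-size 1-D empirical distributions (sorted matching)."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def dual_bound(f, a, b):
    """Kantorovich dual objective E_a[f] - E_b[f] for a given critic f."""
    return np.mean(f(a)) - np.mean(f(b))

exact = w1_exact(p, q)
tight = dual_bound(lambda x: -x, p, q)        # 1-Lipschitz, gradient norm exactly 1
loose = dual_bound(lambda x: -0.5 * x, p, q)  # 1-Lipschitz, gradient norm 0.5

print("exact W1:", exact)
print("bound with |slope| = 1:", tight)
print("bound with |slope| = 0.5:", loose)
```

Here the unit-slope critic recovers the distance and the half-slope critic recovers half of it. This toy case says nothing about whether gradient norm preservation is necessary in general, which is the question raised in point (2f) of the first review.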

Reviewer 3

[Originality] The authors point out that, although there exist papers on Lipschitz convolutional networks, those networks are not expressive enough to perform well. Analyzing the Lipschitz property of CNNs is also harder than analyzing that of fully-connected nets. Therefore, I consider this paper a novel and original extension of previous work on Lipschitz networks (on fully-connected networks). [Quality] The proposed method (BCOP) is technically sound. The visualizations of the singular values of the weights of trained network layers clearly indicate that BCOP works as predicted in theory. The authors show empirical results on both adversarial robustness and estimating the Wasserstein distance between the data distribution and the generated image distribution of a GAN. I believe these results justify the effectiveness of BCOP. [Clarity] The paper is well written, and enough implementation details are provided for readers to implement BCOP. Moreover, the authors explain the derivation in depth in the appendix. [Significance] As mentioned in the originality section, this is the first paper that constrains the Lipschitz constant of CNNs without placing tight constraints on their expressiveness, while keeping gradients from vanishing. As Lipschitz networks have been shown to be important in many tasks, such as GANs, adversarial robustness and invertible nets, this paper definitely makes a major contribution to the community. In addition, this paper provides theoretical insights for others working on Lipschitz properties of neural nets.
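The singular-value argument above can be illustrated with a dense matrix. The following is a minimal sketch, not the BCOP parametrization itself: it assumes the first-order Björck iteration A <- A(I + (I - A^T A)/2) applied to a hypothetical random 4x4 matrix after rescaling by its spectral norm, and the function name `bjorck_orthonormalize` and the iteration count are my own choices. It checks that the result has all singular values close to 1 and that backpropagating a vector through it preserves the vector's norm.

```python
import numpy as np

def bjorck_orthonormalize(a, iters=60):
    """First-order Bjorck iteration (assumed variant), after scaling the
    spectral norm to 1 so that the iteration converges."""
    a = a / np.linalg.norm(a, 2)
    eye = np.eye(a.shape[1])
    for _ in range(iters):
        a = a @ (eye + 0.5 * (eye - a.T @ a))
    return a

rng = np.random.default_rng(0)
w = bjorck_orthonormalize(rng.standard_normal((4, 4)))

# All singular values of an orthogonal matrix equal 1.
singular_values = np.linalg.svd(w, compute_uv=False)

# Backpropagation through y = w @ x multiplies the incoming gradient by w.T.
grad_out = rng.standard_normal(4)
grad_in = w.T @ grad_out

print("singular values:", singular_values)
print("gradient norms (out, in):", np.linalg.norm(grad_out), np.linalg.norm(grad_in))
```

A layer whose singular values were all well below 1 would instead shrink the gradient norm at every layer, which is the attenuation effect named in the paper's title.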

Paper ID:	8869
Title:	Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks
