Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper proposes to control the level sets of neural networks. To do so, it first proposes ways to find and manipulate the level sets. The approach can improve both generalization and robustness against adversarial examples by adding one more layer to the model. Using level sets in this way is quite an interesting idea. The results are quite solid, in particular for defending against adversarial examples. I have a few detailed comments: 1. How well does the model converge? Is optimizing (3) guaranteed to find the level sets? 2. What is the practical training speed, given that the level sets have to be computed at every iteration? 3. Could you provide ImageNet results for the adversarial examples part? I would be willing to adjust my score if these questions are resolved during the rebuttal period.
This paper addresses the important task of controlling the level sets that comprise the decision boundaries of a neural network. I think the proposed method is quite reasonable, well described, and convincingly demonstrated to be useful across a number of tasks.
Re level set sampling:
- Doesn't the ReLU activation imply that D_x F(p; theta) = 0 at many points p? How do you get around this issue when optimizing p toward S(theta)? It seems this optimization might often get stuck in regions where D_x F(p; theta) = 0 yet p lies far away from S(theta).
- The choice of sampling distribution for the initial points p_i should be moved from the Appendix to the main text, as this seems critical. Furthermore, the particular choice used by the authors should be better motivated in the text, as the rationale is not clear to me.
- It seems one can sample from the level 0 set by instead just optimizing min_x ||F(x; theta)|| via gradient descent in x. Did the authors try this procedure? It would be good to comment on why the authors' proposed method is superior (an actual example where it is truly superior would be even better!).
- In general, since the proposed S(theta) sampling procedure does not come with theoretical guarantees, I recommend the authors empirically evaluate their sampling procedure by plotting ||F(p^*; theta)||, where p^* is the solution found after 10-20 Newton iterations. It seems important to verify that the samples actually come from near the level 0 set, which is not done in the paper (otherwise the overall approach might be working for different, mysterious reasons). It would also be nice to visually show some of the samples from S(theta) when the data are images.
- In equations (1)-(2): the authors should write f_j(x; theta), f_i(x; theta), and F(x; theta) for clarity. Also, equation (3) seems redundant given (2).
Overall, I recommend the authors initially define everything in terms of level C sets instead of level 0 sets, as 0 doesn't really seem to have special significance and might confuse readers.
- In Figure 1: I disagree with the authors' statement that panel (c) with L2 margin depicts a decision boundary that better explains the training examples compared to (a). First of all, "better explains" does not seem to be the appropriate terminology here; perhaps "better models" is more appropriate? Secondly, (c) shows an arbitrary red triangle on the left-hand side, in a region that only contains blue training examples, which seems to be a weird artifact of empirical risk minimization; furthermore, the depicted decision boundary is very jagged compared with (a), which makes it seem brittle. Thus, I would say only (b) and (d) look better than (a) to me.
- The sample network G should be described in a bit greater detail, such as from what space to what space G maps, how it is implemented via a simple fixed linear layer added to F(x; theta), etc.
- Overall, it would be nice if the authors could provide some intuition on why they believe their proposed methodology works better than other strategies for large-margin deep learning.
Update after reading author response:
- I am pleased to see the authors practicing thorough science and toning down their robustness claims in the face of these additional experiments. I am not personally so concerned about the potentially diminished adversarial robustness results, as level-set control of a model certainly has other important applications.
- Re optimizing min_x ||F(x; theta)|| via alternative optimization methods like gradient descent or L-BFGS: I encourage the authors to provide a wall-clock time comparison to help readers understand the practical gain of the proposed methodology. In nonconvex settings, it is certainly not always the case that Newton steps are practically superior.
- I like the idea of comparing with Elsayed et al. (2018) and think the conceptual sentence from your rebuttal that establishes the connection between their methodology and yours should also be included in the final paper.
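The residual check and the gradient-descent baseline raised in this review can be sketched on a toy problem. This is a minimal illustration only, not the authors' implementation. It assumes the projection onto the level set is a generalized Newton step of the form p <- p - F(p) D_x F(p) / ||D_x F(p)||^2 for scalar F, uses a hypothetical stand-in F(x) = ||x||^2 - 1 (zero level set: the unit circle) in place of a trained network, and takes 0.5 * F(x)^2 as the gradient-descent objective. All function names and hyperparameters below are made up for illustration.

```python
import numpy as np


def F(p):
    """Toy scalar field (assumption: stands in for the network output)."""
    return np.sum(p * p, axis=1) - 1.0


def grad_F(p):
    """Gradient of the toy field with respect to the input points."""
    return 2.0 * p


def newton_project(F, grad_F, p, n_iters=15, eps=1e-12):
    """Assumed generalized Newton step: p <- p - F(p) * g / ||g||^2."""
    p = p.copy()
    for _ in range(n_iters):
        g = grad_F(p)
        sq_norm = np.sum(g * g, axis=1, keepdims=True)
        # eps guards against division by zero where the gradient vanishes.
        p = p - F(p)[:, None] * g / (sq_norm + eps)
    return p


def gradient_descent_project(F, grad_F, p, n_iters=15, lr=0.05):
    """Baseline: gradient descent on 0.5 * F(p)^2 with a fixed step size."""
    p = p.copy()
    for _ in range(n_iters):
        p = p - lr * F(p)[:, None] * grad_F(p)
    return p


def residual(F, p):
    """|F(p)| per sample; small values mean p is near the zero level set."""
    return np.abs(F(p))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p0 = rng.uniform(-2.0, 2.0, size=(256, 2))  # hypothetical initial samples
    for name, project in [("newton", newton_project),
                          ("gradient descent", gradient_descent_project)]:
        r = residual(F, project(F, grad_F, p0))
        print(f"{name}: max residual = {r.max():.3e}, mean = {r.mean():.3e}")
```

Reporting the maximum and mean residual for each method gives a direct check of how close the samples are to the level set. The toy field is smooth with a gradient that vanishes only at the origin, so it says nothing about ReLU networks where D_x F can be zero over whole regions; a wall-clock comparison on the actual model would still be needed.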
The paper has a few conceptual shortcomings: There is no guarantee that the iteration in Eqn 4 will successfully sample a point on the level set. A good distribution of points on the level set should also account for local geometry, e.g., curvature, which is not addressed in the proposed method. A sparse set of samples may not provide adequate control over the behavior of the entire level set. Some of the theoretical results, e.g., Lemma 1 and Theorem 1, do not strike me as particularly surprising. Experimentally, the geometric SVM loss appears to lose its edge as soon as a little bit of data is collected (in small-sample regimes one would need to resort to stronger priors or go, e.g., the semi-supervised route in any case). Still, the empirical results in the robust learning setting and surface reconstruction seem promising. Line 144: S(theta_0) should be S(theta)?