{"title": "Controlling Neural Level Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 2034, "page_last": 2043, "abstract": "The level sets of neural networks represent fundamental properties such as decision boundaries of classifiers and are used to model non-linear manifold data such as curves and surfaces. Thus, methods for controlling the neural level sets could find many applications in machine learning.\n\nIn this paper we present a simple and scalable approach to directly control level sets of a deep neural network. Our method consists of two parts: (i) sampling of the neural level sets, and (ii) relating the samples' positions to the network parameters. The latter is achieved by a sample network that is constructed by adding a single fixed linear layer to the original network. In turn, the sample network can be used to incorporate the level set samples into a loss function of interest.\n \nWe have tested our method on three different learning tasks: improving generalization to unseen data, training networks robust to adversarial attacks, and curve and surface reconstruction from point clouds. For surface reconstruction, we produce high fidelity surfaces directly from raw 3D point clouds. When training small to medium networks to be robust to adversarial attacks we obtain robust accuracy comparable to state-of-the-art methods.", "full_text": "Controlling Neural Level Sets\n\nMatan Atzmon, Niv Haim, Lior Yariv, Ofer Israelov, Haggai Maron, Yaron Lipman\n\nWeizmann Institute of Science\n\nRehovot, Israel\n\nAbstract\n\nThe level sets of neural networks represent fundamental properties such as decision\nboundaries of classi\ufb01ers and are used to model non-linear manifold data such as\ncurves and surfaces. Thus, methods for controlling the neural level sets could \ufb01nd\nmany applications in machine learning.\nIn this paper we present a simple and scalable approach to directly control level\nsets of a deep neural network. 
Our method consists of two parts: (i) sampling of the neural level sets, and (ii) relating the samples' positions to the network parameters. The latter is achieved by a sample network that is constructed by adding a single fixed linear layer to the original network. In turn, the sample network can be used to incorporate the level set samples into a loss function of interest.
We have tested our method on three different learning tasks: improving generalization to unseen data, training networks robust to adversarial attacks, and curve and surface reconstruction from point clouds. For surface reconstruction, we produce high fidelity surfaces directly from raw 3D point clouds. When training small to medium networks to be robust to adversarial attacks we obtain robust accuracy comparable to state-of-the-art methods.

1 Introduction

The level sets of a Deep Neural Network (DNN) are known to capture important characteristics and properties of the network. A popular example is when the network F(x; θ) : R^d × R^m → R^l represents a classifier, θ are its learnable parameters, f_i(x; θ) are its logits (the outputs of the final linear layer), and the level set

S(θ) = { x ∈ R^d | f_j − max_{i≠j} f_i = 0 }   (1)

represents the decision boundary of the j-th class. Another recent example is modeling a manifold (e.g., a curve or a surface in R^3) using a level set of a neural network (e.g., [24]). That is,

S(θ) = { x ∈ R^d | F = 0 }   (2)

represents (generically) a manifold of dimension d − l in R^d.
The goal of this work is to provide practical means to directly control and manipulate neural level sets S(θ), as exemplified in Equations 1, 2. The main challenge is how to incorporate S(θ) in a differentiable loss function.
Our key observation is that given a sample p ∈ S(θ), its position can be associated to the network parameters: p = p(θ), in a differentiable and scalable way. In fact, p(θ) is itself a neural network that is obtained by an addition of a single linear layer to F(x; θ); we call these networks sample networks. Sample networks, together with an efficient mechanism for sampling the level set, {p_j(θ)} ⊂ S(θ), can be incorporated in general loss functions as a proxy for the level set S(θ).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Our method applied to binary classification in 2D. Blue and red dots represent positive and negative examples respectively. (a) standard cross-entropy loss baseline; (b) using our method to move the decision boundary at least ε away from the training set in L∞ norm; (c) same for L2 norm; (d) a geometrical adaptation of the SVM soft-margin loss to neural networks using our method; the +1, −1 level sets are marked in light red and blue, respectively. Note that in (b), (c), (d) we achieve decision boundaries that seem to better explain the training examples compared to (a).

We apply our method of controlling the neural level sets to two seemingly different learning tasks: controlling decision boundaries of classifiers (Equation 1) and reconstructing curves and surfaces from point cloud data (Equation 2).
At first, we use our method to improve the generalization and adversarial robustness in classification tasks. In these tasks, the distance between the training examples and the decision boundary is an important quantity called the margin. Margins are traditionally desired to be as large as possible to improve generalization [7, 10] and adversarial robustness [10].
Usually, margins are only controlled indirectly by loss functions that measure network output values of training examples (e.g., the cross entropy loss). Recently, researchers have been working on optimizations with more direct control of the margin using linearization techniques [13, 21, 10], regularization [26], output-layer margin [27], or margin gradients [9]. We suggest controlling the margin by sampling the decision boundary, constructing the sample network, and measuring distances between samples and training examples directly. By applying this technique to train small to medium-size networks against adversarial perturbations we achieved robust accuracy comparable to state-of-the-art methods.
To improve generalization when learning from small amounts of data, we devise a novel geometric formulation of the soft margin SVM loss for neural networks. This loss aims at directly increasing the input space margin, in contrast to standard deep network hinge losses that deal with the output space margin [28, 27]. The authors of [10] also suggest increasing the input space margin to improve generalization. Figure 1 shows 2D examples of training our adversarial robustness and geometric SVM losses for networks.
In a different application, we use our method for the reconstruction of manifolds such as curves and surfaces in R^3 from point cloud data. The use of neural networks for modeling surfaces has recently become popular [12, 29, 6, 2, 24, 22]. There are two main approaches: parametric and implicit. The parametric approach uses networks as parameterization functions of the manifolds [12, 29]. The implicit approach represents the manifold as a level set of a neural network [24, 6, 22]. So far, implicit representations were learned using regression, given a signed distance function or occupancy function computed directly from a ground truth surface.
Unfortunately, for raw point clouds in R^3, computing the signed distance function or an occupancy function is a notoriously difficult task [4]. In this paper we show that by using our sample network to control the neural level sets we can reconstruct curves and surfaces directly from point clouds in R^3.
Lastly, to theoretically justify neural level sets for modeling manifolds or arbitrary decision boundaries, we prove a geometric version of the universality property of MLPs [8, 14]: any piecewise linear hyper-surface in R^d (i.e., a (d − 1)-manifold built out of a finite number of linear pieces, not necessarily bounded) can be precisely represented as a neural level set of a suitable MLP.

2 Sample Network

Given a neural network F(x; θ) : R^d × R^m → R^l, its 0 ∈ R^l level set is defined by

S(θ) := { x | F(x; θ) = 0 }.   (3)

We denote by D_x F(p; θ) ∈ R^{l×d} the matrix of partial derivatives of F with respect to x. Assuming that θ is fixed, F(p; θ) = 0 and that D_x F(p; θ) is of full rank l (l ≪ d), a corollary of the Implicit Function Theorem [15] implies that S(θ) is a (d − l)-dimensional manifold in the vicinity of p ∈ S(θ).
Our goal is to incorporate the neural level set S(θ) in a differentiable loss. We accomplish that by performing the following procedure at each training iteration: (i) sample n points on the level set: p_i ∈ S(θ), i ∈ [n]; (ii) build the sample network p_i(θ), i ∈ [n], by adding a fixed linear layer to the network F(x; θ); and (iii) incorporate the sample network in a loss function as a proxy for S(θ).

2.1 Level set sampling

To sample S(θ) at some θ = θ_0, we start with a set of n points p_i, i ∈ [n], sampled from some probability measure in R^d.
Next, we perform generalized Newton iterations [3] to move each point p towards S(θ_0):

p_next = p − D_x F(p; θ_0)^† F(p; θ_0),   (4)

where D_x F(p; θ_0)^† is the Moore-Penrose pseudo-inverse of D_x F(p; θ_0). The generalized Newton step solves the under-determined (l ≪ d) linear system F(p; θ_0) + D_x F(p; θ_0)(p_next − p) = 0. To motivate this particular solution choice we show that the generalized Newton step applied to a linear function is an orthogonal projection onto its zero level set (see proof in the supplementary material):

Lemma 1. Let ℓ(x) = Ax + b, A ∈ R^{l×d}, b ∈ R^l, l < d, and A of full rank l. Then Equation 4 applied to F(x) = ℓ(x) is an orthogonal projection on the zero level set of ℓ, namely, on {x | ℓ(x) = 0}.

For l > 1 the computation of D_x F(p; θ_0)^† requires inverting an l × l matrix; in this paper l ∈ {1, 2}. The directions D_x F(p_i; θ_0) can be computed efficiently using back-propagation, where the points p_i, i ∈ [n], are grouped into batches. We performed 10-20 iterations of Equation 4 for each p_i, i ∈ [n].

Scalar networks. For scalar networks, i.e., l = 1, D_x F ∈ R^{1×d}, a direct computation shows that

D_x F(p; θ_0)^† = D_x F(p; θ_0)^T / ‖D_x F(p; θ_0)‖²,   (5)

That is, the point p moves towards the level set S(θ_0) by going in the direction of the steepest descent (or ascent) of F.
It is worth mentioning that the projection-on-level-sets formula in the case of l = 1 has already been developed in [23] and was used to find adversarial perturbations; our result generalizes to the intersection of several level sets and shows that this procedure is an instantiation of the generalized Newton algorithm.
The generalized Newton method (similarly to the Newton method) is not guaranteed to find a point on the zero level set.
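As a minimal, self-contained sketch (not the paper's code), the scalar-network case of Equations 4-5 can be illustrated on a toy function F(x) = ‖x‖² − 1, whose zero level set is the unit sphere; here the gradient is known in closed form, whereas a real network would obtain it via back-propagation. All names below (`newton_project`, etc.) are our own:

```python
import numpy as np

def F(x):
    # Toy scalar "network": its zero level set is the unit sphere in R^d.
    return np.dot(x, x) - 1.0

def grad_F(x):
    # Closed-form gradient of the toy function (a real network would
    # compute D_x F via back-propagation).
    return 2.0 * x

def newton_project(p, n_iters=20):
    # Generalized Newton iterations (Equation 4); for a scalar function
    # the pseudo-inverse reduces to grad^T / ||grad||^2 (Equation 5).
    for _ in range(n_iters):
        g = grad_F(p)
        p = p - g * F(p) / np.dot(g, g)
    return p

rng = np.random.default_rng(0)
p0 = rng.normal(size=3)        # random initial point in R^3
p = newton_project(p0)
print(abs(F(p)))               # residual level set value, close to 0
```

On this toy function the iteration converges rapidly; on a real network the final residual c = F(p; θ_0) need not be exactly zero, which is why the method below keeps track of the attained level set value.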
We denote by c_i = F(p_i; θ_0) the level set values of the final points p_i; in the following, we also use points that failed to be projected, with their level set value c_i. Furthermore, we found that the generalized Newton method usually does find zeros of neural networks, but these can be far from the initial projection point. Other, less efficient but sometimes more robust ways to project on neural level sets could be using gradient descent on ‖F(x; θ)‖ or zero finding in the direction of a successful PGD attack [17, 20] as done in [9]. In our robust networks application (Section 3.2) we have used a similar approach with the false position method.

Relation to Elsayed et al. [10]. The authors of [10] replace the margin distance with the distance to the linearized network, while our approach is to directly sample the actual network's level set and move it explicitly. Specifically, for the L2 norm (p = 2), Elsayed's method is similar to our method using a single generalized Newton iteration, Equation 4.

2.2 Differentiable sample position

Next, we would like to relate each sample p, belonging to some level set F(p; θ_0) = c, to the network parameters θ. Namely, p = p(θ). The functional dependence of a sample p on θ is defined by p(θ_0) = p and F(p(θ); θ) = c, for θ in some neighborhood of θ_0. The latter condition ensures that p stays on the c level set as the network parameters θ change. As only first derivatives are used in the optimization of neural networks, it is enough to replace this condition with its first-order version.
We get the following two equations:

p(θ_0) = p ;   (∂/∂θ) F(p(θ); θ) |_{θ=θ_0} = 0.   (6)

Using the chain rule, the second condition in Equation 6 reads:

D_x F(p, θ_0) D_θ p(θ_0) + D_θ F(p, θ_0) = 0.   (7)

This is a system of linear equations with d × m unknowns (the components of D_θ p(θ_0)) and l × m equations. When d > l, this linear system is under-determined. Similarly to what is used in the generalized Newton method, a minimal norm solution is given by the Moore-Penrose inverse:

D_θ p(θ_0) = −D_x F(p, θ_0)^† D_θ F(p, θ_0).   (8)

The columns of the matrix D_θ p(θ_0) ∈ R^{d×m} describe the velocity of p(θ) w.r.t. each of the parameters in θ. The pseudo-inverse selects the minimal norm solution, which can be shown to represent, in this case, a movement in the direction orthogonal to the level set (see supplementary material for a proof). We reiterate that for scalar networks, where l = 1, D_x F(p_i; θ_0)^† has a simple closed-form expression, as shown in Equation 5.

The sample network. A possible simple solution to Equation 6 would be to use the linear function p(θ) = p + D_θ p(θ_0)(θ − θ_0), with D_θ p(θ_0) as defined in Equation 8. Unfortunately, this would require storing D_θ p(θ_0), using at least O(m) space (i.e., the number of network parameters) for every projection point p. (We assume the number of output channels l of F is constant, which is the case in this paper.) A much more efficient solution is

p(θ) = p − D_x F(p; θ_0)^† [F(p; θ) − c],   (9)

which requires storing only D_x F(p; θ_0)^†, using O(d) space, where d is the input space dimension, for every projection point p.
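A minimal numeric sketch of Equations 8-9 (an illustration under our own toy setup, not the paper's implementation) uses a one-parameter function F(x; θ) = ‖x‖² − θ, for which D_x F and D_θ F are known in closed form; a finite-difference derivative of the sample position then matches the minimal-norm solution of Equation 8:

```python
import numpy as np

def F(x, theta):
    # Toy scalar "network" with a single parameter theta: its zero
    # level set is the sphere of radius sqrt(theta).
    return np.dot(x, x) - theta

theta0 = 2.0
p = np.array([1.0, 1.0])       # lies on the zero level set: F(p, theta0) = 0
c = F(p, theta0)               # attained level set value (here 0)

g = 2.0 * p                    # D_x F(p; theta0), known in closed form here
pinv = g / np.dot(g, g)        # Moore-Penrose pseudo-inverse (scalar case, Eq. 5)

def sample_network(theta):
    # Equation 9: only F is re-evaluated at the new parameters; the
    # pseudo-inverse stays frozen at theta0.
    return p - pinv * (F(p, theta) - c)

# Finite-difference derivative of the sample position w.r.t. theta ...
eps = 1e-6
dp_fd = (sample_network(theta0 + eps) - sample_network(theta0 - eps)) / (2 * eps)

# ... agrees with Equation 8: D_theta p = -D_x F^+ D_theta F,
# where D_theta F = -1 for this toy function.
dp_closed = -pinv * (-1.0)
print(np.allclose(dp_fd, dp_closed))   # prints: True
```

Note that the sketch only stores the O(d) vector `pinv` per sample, mirroring the space argument made for Equation 9.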
Furthermore, Equation 9 allows an efficient implementation with a single network

G(p, D_x F(p; θ_0)^†; θ) := p(θ).

We call G the sample network. Note that a collection of samples p_i, i ∈ [n], can be treated as a batch input to G.

3 Incorporation of samples in loss functions

Once we have the sample network p_i(θ), i ∈ [n], we can incorporate it in a loss function to control the neural level set S(θ) in a desired way. We give three examples in this paper.

3.1 Geometric SVM

The support-vector machine (SVM) is a model which aims to train a linear binary classifier that would generalize well to new data by combining the hinge loss and a large margin term. It can be interpreted as encouraging large distances between training examples and the decision boundary. Specifically, the soft SVM loss takes the form [7]:

loss(w, b) = (1/N) Σ_{j=1}^N max{0, 1 − y_j ℓ(x_j; w, b)} + λ‖w‖²,   (10)

where (x_j, y_j) ∈ R^d × {−1, 1}, j ∈ [N], is the binary classification training data, and ℓ(x; w, b) = w^T x + b, w ∈ R^d, b ∈ R, is the linear classifier. We would like to generalize Equation 10 to a deep network F : R^d × R^m → R towards the goal of increasing the network's input space margin, which is defined as the minimal distance of training examples to the decision boundary S(θ) = {x ∈ R^d | F(x; θ) = 0}. Note that this is in strong contrast to the standard deep network hinge loss that works with the output space margin [28, 27], namely, measuring differences of output logits when evaluating the network on training examples. For that reason, this type of loss function does not penalize small input space margin, so long as it doesn't damage the output-level classification performance on the training data.
Using the input margin instead of the output margin may also provide robustness to perturbations [10].
We now describe a new, geometric formulation of Equation 10, and use it to define the soft-SVM loss for general neural networks. In the linear case, the following quantity serves as the margin:

‖w‖₂⁻¹ = d(S(θ), S_1(θ)) = d(S(θ), S_{−1}(θ)),

where S_t(θ) = {x ∈ R^d | F(x; θ) = t}, and d(S(θ), S_t(θ)) is the distance between the level sets, which are two parallel hyper-planes. In the general case, however, level sets are arbitrary hyper-surfaces which are not necessarily equidistant (i.e., the distance when traveling from S to S_t does not have to be constant across S). Hence, for each data sample x, we define the following margin function:

Δ(x; θ) = min{ d(p(θ), S_1(θ)), d(p(θ), S_{−1}(θ)) },

where p(θ) is the sample network of the projection of x onto S(θ_0). Additionally, note that in the linear case |w^T x + b| = d(x, S(θ))/Δ(x; θ). With these definitions in mind, Equation 10 can be given the geometric generalized form:

loss(θ) = (1/N) Σ_{j=1}^N max{0, 1 − sign(y_j F(x_j; θ)) d(x_j, p_j)/Δ(x_j; θ)} + (λ/N) Σ_{j=1}^N Δ(x_j; θ)^α,   (11)

where F(x; θ) is a general classifier (such as a neural network, in our applications). Note that in the case where F(x; θ) is affine, α = −2 and d = L2, Equation 11 reduces back to the regular SVM loss, Equation 10.
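To make Equation 11 concrete, the following is a hedged numpy sketch (our own naming, not the paper's code) that evaluates the geometric SVM loss given level set projections that are assumed to have already been computed by the sampling procedure of Section 2.1; in the actual method the projections are outputs of the sample network, so gradients flow through them:

```python
import numpy as np

def geometric_svm_loss(x, y, f_x, p0, p_plus, p_minus,
                       lam=0.1, alpha=-1, norm_ord=np.inf):
    # x:               (N, d) training points
    # y:               (N,)   labels in {-1, +1}
    # f_x:             (N,)   network values F(x_j; theta)
    # p0:              (N, d) projections of x_j onto the 0 level set
    # p_plus, p_minus: (N, d) projections of p0 onto the +1 / -1 level sets
    d_x = np.linalg.norm(x - p0, ord=norm_ord, axis=1)      # d(x_j, p_j)
    delta = np.minimum(                                     # Delta(x_j; theta)
        np.linalg.norm(p0 - p_plus, ord=norm_ord, axis=1),
        np.linalg.norm(p0 - p_minus, ord=norm_ord, axis=1))
    # Hinge term with the input-space margin, plus the Delta^alpha
    # regularizer that penalizes small margins (alpha = -1 as in Sec. 5.1).
    hinge = np.maximum(0.0, 1.0 - np.sign(y * f_x) * d_x / delta)
    return hinge.mean() + lam * np.mean(delta ** alpha)
```

For example, a correctly classified point whose distance to the decision boundary already exceeds the local margin Δ contributes only through the regularization term.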
Figure 1d depicts the result of optimizing this loss in a 2D case, i.e., x_j ∈ R^2; the light blue and red curves represent S_{−1} and S_1.

3.2 Robustness to adversarial perturbations

The goal of robust training is to prevent a change in a model's classification result when small perturbations are applied to the input. Following [20], the attack model is specified by some set S ⊂ R^d of allowed perturbations; in this paper we focus on the popular choice of L∞ perturbations, that is, S is taken to be the ε-radius L∞ ball, {x | ‖x‖∞ ≤ ε}. Let (x_j, y_j) ∈ R^d × L denote training examples and labels, and let S^j(θ) = {x ∈ R^d | F^j(x; θ) = 0}, where F^j(x; θ) = f_j − max_{i≠j} f_i, denote the decision boundary of label j. We define the loss

loss(θ) = Σ_{j=1}^N λ_j max{0, ε_j − sign(F^{y_j}(x_j; θ)) d(x_j, S^{y_j}(θ))},   (12)

where d(x, S^j) is some notion of a distance between x and S^j, e.g., min_{y∈S^j} ‖x − y‖_p or ∫_{S^j(θ)} ‖x − y‖_p dμ(y), where μ is some probability measure on S^j(θ). The parameter λ_j controls the weighting between correctly (i.e., F^{y_j}(x_j; θ) > 0) and incorrectly (i.e., F^{y_j}(x_j; θ) < 0) classified samples. We fix λ_j = 1 for incorrectly classified samples and set λ_j to be the same for all correctly classified samples. The parameter ε_j controls the desired target distances. Similarly to [10], the idea of this loss is: (i) if x_j is incorrectly classified, pull the decision boundary S^{y_j} toward x_j; (ii) if x_j is classified correctly, push the decision boundary S^{y_j} to be within a distance of at least ε_j from x_j.
In our implementation we have used d(x, S^j) = ρ(x, p(θ)), where p = p(θ_0) ∈ S^j is a sample of this level set, ρ(x, p) = |x_{i*} − p_{i*}|, and x_{i*}, p_{i*} denote the i*-th coordinates of x, p (resp.), with i* = arg max_{i∈[d]} |D_x F(p; θ_0)|_i. This loss encourages p(θ) to move in the direction of the axis (i.e., i*) that corresponds to the largest component in the gradient D_x F(p; θ_0). Intuitively, as ρ(x, p) ≤ ‖x − p‖∞, the loss pushes the level set S^j to leave the ε_j-radius L∞-ball in the direction of the axis that corresponds to the maximal speed of p(θ). The inset depicts an example where this distance measure is more effective than the standard L∞ distance: D_x F(p; θ) is shown as a black arrow and the selected axis i* as a dashed line.
In the case where the distance function min_{y∈S(θ)} ‖x − y‖_p is used, the computation of its gradient using Equation 8 coincides with the gradient derivation of [9] up to a sign difference. Still, our derivation allows working with general level set points (i.e., not just the closest) on the decision boundary S^j, and our sample network offers an efficient implementation of these samples in a loss function. Furthermore, we use the same loss for both correctly and incorrectly classified examples.

3.3 Manifold reconstruction

Surface reconstruction. Given a point cloud X = {x_j}_{j=1}^N ⊂ R^d that samples, possibly with noise, some surface M ⊂ R^3, our goal is to find parameters θ of a network F : R^3 × R^m → R, so that the neural level set S(θ) approximates M. Even more desirable is to have F approximate the signed distance function to the unknown surface sampled by X. To that end, we would like the neural level set S_t(θ), t ∈ T, to be of distance |t| to X, where T ⊂ R is some collection of desired level set values. Let d(x, X) = min_{j∈[N]} ‖x − x_j‖₂ be the distance between x and X. We consider the reconstruction loss

loss(θ) = Σ_{t∈T} [ ∫_{S_t(θ)} | d(x, X) − |t| |^p dv(x) ]^{1/p} + (λ/N) Σ_{j=1}^N |F(x_j; θ)|,   (13)

where dv(x) is the normalized volume element on S_t(θ) and λ > 0 is a parameter. The first part of the loss encourages the t level set of F to be of distance |t| to X; note that for t = 0 this reconstruction error was used in level set surface reconstruction methods [33]. The second part of the loss penalizes samples X outside the zero level set S(θ).

Curve reconstruction. In case of approximating a manifold M ⊂ R^d with co-dimension greater than 1, e.g., a curve in R^3, one cannot expect F to approximate the signed distance function, as no such function exists. Instead, we model the manifold via the level set of a vector-valued network F : R^d × R^m → R^l whose zero level set is an intersection of l hyper-surfaces. As explained in Section 2, this generically defines a (d − l)-manifold. In that case we used the loss in Equation 13 with T = {0}, namely, only encouraging the zero level set to be as close as possible to the samples X.

4 Universality

To theoretically support the usage of neural level sets for modeling manifolds or controlling decision boundaries, we provide a geometric universality result for multilayer perceptrons (MLPs) with ReLU activations. That is, the level sets of MLPs can represent any watertight piecewise linear hyper-surface (i.e., manifolds of co-dimension 1 in R^d that are boundaries of d-dimensional polytopes). More specifically, we prove:

Theorem 1.
Any watertight, not necessarily bounded, piecewise linear hypersurface M ⊂ R^d can be exactly represented as the neural level set S of a multilayer perceptron with ReLU activations, F : R^d → R.

The proof of this theorem is given in the supplementary material. Note that this theorem is a geometrical version of Theorem 2.1 in [1], asserting that MLPs with ReLU activations can represent any piecewise linear continuous function.

5 Experiments

5.1 Classification generalization

In this experiment, we show that when training on small amounts of data, our geometric SVM loss (see Equation 11) generalizes better than the cross entropy loss and the hinge loss. Experiments were done on three datasets: MNIST [18], Fashion-MNIST [31] and CIFAR10 [16]. For all datasets we arbitrarily merged the labels into two classes, resulting in a binary classification problem. We randomly sampled a fraction of the original training examples and evaluated on the original test set.

Figure 2: Generalization from small fractions of the data. [Each panel plots test accuracy (%) against the fraction of training data (%) for the hinge loss, cross entropy (Xent), and our loss; (a) MNIST, (b) Fashion-MNIST, (c) CIFAR10.]

| Method | Dataset | Arch. | Attack | ε_train | Test Acc. | Rob. Acc. Xent | Rob. Acc. Margin |
|---|---|---|---|---|---|---|---|
| Standard | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | - | 99.34% | 13.59% | 0.00% |
| Madry et al. [20] | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.3 | 99.35% | 96.04% | 96.11% |
| Madry et al. [20] | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.4 | 99.16% | 96.54% | 96.53% |
| TRADES [32] | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.3 | 98.97% | 96.75% | 96.74% |
| TRADES [32] | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.4 | 98.62% | 96.78% | 96.76% |
| Ours | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.4 | 99.35% | 99.23% | 97.35% |
| Standard | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | - | 83.67% | 0.00% | 0.00% |
| Madry et al. [20] | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.031 | 71.86% | 39.84% | 38.18% |
| Madry et al. [20] | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.045 | 63.66% | 41.53% | 39.13% |
| TRADES [32] | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.031 | 71.24% | 41.89% | 38.4% |
| TRADES [32] | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.045 | 68.24% | 42.04% | 38.18% |
| Ours | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.045 | 71.96% | 38.45% | 38.54% |
| Standard | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | - | 93.18% | 0.00% | 0.00% |
| Madry et al. [20] | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.031 | 81.0% | 47.29% | 46.58% |
| Madry et al. [20] | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.045 | 74.97% | 49.84% | 48.02% |
| TRADES [32] | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.031 | 83.04% | 53.31% | 51.36% |
| TRADES [32] | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.045 | 79.52% | 53.49% | 51.22% |
| Ours | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.045 | 81.3% | 79.74% | 43.17% |

Table 1: Results of different L∞-bounded attacks on models trained using our method (described in Section 3.2) compared to other methods.

Due to the variability in the results, we reran the experiment 100 times for MNIST and 20 times for Fashion-MNIST and CIFAR10. We report the mean accuracy along with the standard deviation. Figure 2 shows the test accuracy of our loss compared to the cross-entropy loss and hinge loss over different training set sizes for MNIST (a), Fashion-MNIST (b) and CIFAR10 (c). Our loss function outperforms the standard methods.
For the implementation of Equation 11 we used α = −1, d = L∞, and approximated d(x, S_t) ≈ ‖x − p_t‖∞, where p_t denotes the projection of p on the level set S_t, t ∈ {−1, 0, 1} (see Section 2.1).
The approximation of the term Δ(x; θ), where x = x_j is a training example, is therefore min{‖p_0 − p_{−1}‖∞, ‖p_0 − p_1‖∞}. See supplementary material for further implementation details.

5.2 Robustness to adversarial examples

In this experiment we used our method with the loss in Equation 12 to train robust models on the MNIST [18] and CIFAR10 [16] datasets. For MNIST we used the ConvNet-4a architecture (312K params) used in [32], and for CIFAR10 we used two architectures: ConvNet-4b (2.5M params) from [30] and ResNet-18 (11.2M params) from [32]. We report results using the loss in Equation 12 with ε_j fixed to ε_train as in Table 1, λ_j set to 1 and 11 for MNIST and CIFAR10 (resp.), and d = ρ as explained in Section 3.2. We evaluated our trained networks on L∞-bounded attacks with radius ε_attack using Projected Gradient Descent (PGD) [17, 20], and compared to networks with the same architectures trained using the methods of Madry et al. [20] and TRADES [32]. We found our models to be robust to PGD attacks based on the Xent loss; during the revision of this paper we discovered a weakness of our trained models to PGD attacks based on the margin loss, i.e., min{F^j}, where F^j = f_j − max_{i≠j} f_i; we attribute this fact to the large gradients created at the level set. Consequently, we added margin loss attacks to our evaluation. The results are summarized in Table 1. Note that although superior in robustness to Xent attacks, we are comparable to baseline methods for margin attacks.
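The robust models above are trained with the axis-aligned distance d = ρ from Section 3.2; as a small illustrative sketch (our own naming, not the paper's code), it amounts to measuring a single coordinate chosen by the gradient at the level set sample:

```python
import numpy as np

def rho(x, p, grad_at_p):
    # Axis-aligned distance between a training point x and a level set
    # sample p: only the coordinate i* where |D_x F(p; theta_0)| is
    # largest (i.e., where p(theta) moves fastest) is measured.
    i_star = int(np.argmax(np.abs(grad_at_p)))
    return abs(x[i_star] - p[i_star])

x = np.array([1.0, 2.0])
p = np.array([0.0, 0.0])
g = np.array([3.0, 0.1])   # toy gradient D_x F(p; theta_0)
print(rho(x, p, g))        # prints: 1.0  (and rho <= ||x - p||_inf = 2.0)
```

As noted in Section 3.2, ρ(x, p) ≤ ‖x − p‖∞ always holds, so pushing ρ above ε_j pushes the level set out of the ε_j-radius L∞ ball along the chosen axis.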
Furthermore, we believe the relatively low robust accuracy of our model when using the ResNet-18 architecture is due to the fact that we didn't specifically adapt our method to Batch-Norm layers.
In the supplementary material we provide tables summarizing the robustness of our trained models (MNIST ConvNet-4a and CIFAR10 ConvNet-4b) to black-box attacks; we log black-box attacks of our and baseline methods [20, 32] in an all-versus-all fashion. In general, we found that all black-box attacks are less effective than the relevant white-box attacks, that our method performs better when using standard model black-box attacks, and that all three methods compared are in general similar in their black-box robustness.

Figure 3: Point cloud reconstruction. Surface: (a) input point cloud; (b) AtlasNet [12] reconstruction; (c) our result; (d) blow-ups; (e) more examples of our reconstruction. Curve: (f) bottom image shows a curve reconstructed (black line) from a point cloud (green points) as an intersection of two scalar level sets (top image); (g) bottom shows curve reconstruction from a point cloud with large noise, where the top image demonstrates the graceful completion of an open curve point cloud.

| | Chamfer L1 | Chamfer L2 |
|---|---|---|
| AtlasNet-1 sphere | 23.56 ± 2.91 | 17.69 ± 2.45 |
| AtlasNet-1 patch | 18.67 ± 3.45 | 13.38 ± 2.66 |
| AtlasNet-25 patches | 11.54 ± 0.53 | 7.89 ± 0.42 |
| Ours | 10.71 ± 0.63 | 7.32 ± 0.46 |

Table 2: Surface reconstruction results.

5.3 Surface and curve reconstruction

In this experiment we used our method to reconstruct curves and surfaces in R^3 using only incomplete point cloud data X ⊂ R^3, which is an important task in 3D shape acquisition and processing. Each point cloud is processed independently using the loss function described in Equation 13, which encourages the zero level set of the network to pass through the point cloud.
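A Monte-Carlo sketch of the reconstruction loss (Equation 13) is given below. This is illustrative only and uses our own naming: in the paper the level set samples are outputs of the sample network of Section 2, so the loss is differentiable in θ, whereas here they are plain arrays:

```python
import numpy as np

def reconstruction_loss(level_set_samples, F_at_data, X, lam=1.0, p=2):
    # level_set_samples: dict t -> (n_t, d) array of samples on the t level
    #                    set (produced, in the paper, by the projection
    #                    procedure of Section 2.1)
    # F_at_data:         (N,) network values F(x_j; theta) at the point cloud
    # X:                 (N, d) input point cloud
    total = 0.0
    for t, pts in level_set_samples.items():
        # d(x, X): distance from each level set sample to the point cloud
        d = np.linalg.norm(pts[:, None, :] - X[None, :, :], axis=2).min(axis=1)
        # Monte-Carlo estimate of the integral term of Equation 13:
        # the t level set should be of distance |t| to X
        total += np.mean(np.abs(d - abs(t)) ** p) ** (1.0 / p)
    # Second term: penalize point cloud samples off the zero level set
    return total + lam * np.mean(np.abs(F_at_data))
```

With T = {0}, only the first term with t = 0 remains, which is the setting used for curve reconstruction below.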
For surfaces, it also moves the other level sets to be at the correct distance from the point cloud.
For surface reconstruction, we trained on 10 human raw scans from the FAUST dataset [5], where each scan consists of ∼170K points in R³. The scans include partial connectivity information, which we do not use. After convergence, we reconstruct the mesh using the marching cubes algorithm [19] sampled at a resolution of 100³. Table 2 compares our method with the recent method of [12], which also works directly with point clouds. Evaluation is done using the Chamfer distance [11], with both L1 and L2 norms, computed between 30K points uniformly sampled from the reconstructed surfaces (ours and those of [12]) and the ground-truth registrations provided by the dataset. Numbers in the table are multiplied by 10³. We can see that our method outperforms its competitor; Figures 3b-3e show examples of surfaces reconstructed from a point cloud (a batch of 10K points is shown in 3a) using our method (in 3c, 3d-right, 3e) and the method of [12] (in 3b, 3d-left). Importantly, we note that there are recent methods for implicit surface representation using deep neural networks [6, 24, 22]. These methods use signed distance information and/or the occupancy function of the ground-truth surfaces and perform regression on these values. Our formulation, in contrast, allows working directly on the more common raw input of point clouds.
For curve reconstruction, we took a noisy sample of parametric curves in R³ and used a network similar to the surface case, except that its output layer consists of two values. We trained the network with the loss in Equation 13, where T = {0}, using settings similar to the surface case.
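The symmetric Chamfer distance used in the evaluation above can be sketched as follows (a brute-force numpy version over two sampled point sets; illustrative only, not the evaluation code used for Table 2):

```python
import numpy as np

def chamfer(A, B, p=2):
    # symmetric Chamfer distance between point sets A (n x d) and B (m x d):
    # mean nearest-neighbour distance from A to B plus from B to A;
    # p selects the L1 or L2 norm of the per-pair difference
    diff = A[:, None, :] - B[None, :, :]      # pairwise differences, (n, m, d)
    D = np.linalg.norm(diff, ord=p, axis=-1)  # pairwise L_p distances
    return D.min(axis=1).mean() + D.min(axis=0).mean()
```

For large samples (e.g., 30K points per surface) the quadratic pairwise matrix is usually replaced by a spatial data structure such as a k-d tree, but the quantity computed is the same.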
Figure 3f shows an example of the input point cloud (in green) and the reconstructed curve (in black) (see bottom image), as well as the two hyper-surfaces of the trained network, whose intersection defines the final reconstructed curve (see top image); Figure 3g shows two more examples: reconstruction of a curve from higher-noise samples (see bottom image), and reconstruction of a curve from partial curve data (see top image); note how the network gracefully completes the curve.

6 Conclusions
We have introduced a simple and scalable method to incorporate level sets of neural networks into a general family of loss functions. Testing this method on a wide range of learning tasks, we found it particularly easy to use and applicable in different settings. Current limitations and interesting avenues for future work include: applying our method with the batch normalization layer (which requires a generalization from points to batches); investigating control of intermediate layers' level sets; developing sampling conditions that ensure coverage of the neural level sets; and employing additional geometric regularization of the neural level sets (e.g., penalizing curvature).

Acknowledgments

This research was supported in part by the European Research Council (ERC Consolidator Grant, "LiftMatch" 771136) and the Israel Science Foundation (Grant No. 1830/17).

References

[1] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

[2] H. Ben-Hamu, H. Maron, I. Kezurer, G. Avineri, and Y. Lipman. Multi-chart generative surface modeling. In SIGGRAPH Asia 2018 Technical Papers, page 215. ACM, 2018.

[3] A. Ben-Israel. A Newton-Raphson method for the solution of systems of equations. Journal of Mathematical Analysis and Applications, 15(2):243-252, 1966.

[4] M. Berger, A. Tagliasacchi, L. M. Seversky, P. Alliez, G.
Guennebaud, J. A. Levine, A. Sharf, and C. T. Silva. A survey of surface reconstruction from point clouds. In Computer Graphics Forum, volume 36, pages 301-329. Wiley Online Library, 2017.

[5] F. Bogo, J. Romero, M. Loper, and M. J. Black. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Piscataway, NJ, USA, June 2014. IEEE.

[6] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. arXiv preprint arXiv:1812.02822, 2018.

[7] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

[8] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.

[9] G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang. Max-margin adversarial (MMA) training: Direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637, 2018.

[10] G. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio. Large margin deep networks for classification. In Advances in Neural Information Processing Systems, pages 842-852, 2018.

[11] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605-613, 2017.

[12] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. A papier-mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 216-224, 2018.

[13] M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pages 2266-2276, 2017.

[14] K. Hornik, M. Stinchcombe, and H. White.
Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

[15] S. G. Krantz and H. R. Parks. The Implicit Function Theorem: History, Theory, and Applications. Springer Science & Business Media, 2012.

[16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[17] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.

[18] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[19] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In ACM SIGGRAPH Computer Graphics, volume 21, pages 163-169. ACM, 1987.

[20] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[21] A. Matyasko and L.-P. Chau. Margin maximization for robust classification using deep learning. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 300-307. IEEE, 2017.

[22] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3D reconstruction in function space. arXiv preprint arXiv:1812.03828, 2018.

[23] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574-2582, 2016.

[24] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019.

[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.

[26] J.
Sokolić, R. Giryes, G. Sapiro, and M. R. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265-4280, 2017.

[27] S. Sun, W. Chen, L. Wang, and T.-Y. Liu. Large margin deep neural networks: Theory and algorithms. arXiv preprint arXiv:1506.05232, 2015.

[28] Y. Tang. Deep learning using support vector machines. CoRR, abs/1306.0239, 2013.

[29] F. Williams, T. Schneider, C. Silva, D. Zorin, J. Bruna, and D. Panozzo. Deep geometric prior for surface reconstruction. arXiv preprint arXiv:1811.10943, 2018.

[30] E. Wong, F. Schmidt, J. H. Metzen, and J. Z. Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, pages 8400-8409, 2018.

[31] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[32] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. El Ghaoui, and M. I. Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.

[33] H.-K. Zhao, S. Osher, and R. Fedkiw. Fast surface reconstruction using the level set method. In Proceedings IEEE Workshop on Variational and Level Set Methods in Computer Vision, pages 194-201. IEEE, 2001.