{"title": "You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle", "book": "Advances in Neural Information Processing Systems", "page_first": 227, "page_last": 238, "abstract": "Deep learning achieves state-of-the-art results in many tasks in computer vision and natural language processing. However, recent works have shown that deep networks can be vulnerable to adversarial perturbations which raised a serious robustness issue of deep networks. Adversarial training, typically formulated as a robust optimization problem, is an effective way of improving the robustness of deep networks. A major drawback of existing adversarial training algorithms is the computational overhead of the generation of adversarial examples, typically far greater than that of the network training. This leads to unbearable overall computational cost of adversarial training. In this paper, we show that adversarial training can be cast as a discrete time differential game. Through analyzing the Pontryagin\u2019s Maximum Principle (PMP) of the problem, we observe that the adversary update is only coupled with the parameters of the first layer of the network. This inspires us to restrict most of the forward and back propagation within the first layer of the network during adversary updates. This effectively reduces the total number of full forward and backward propagation to only one for each group of adversary updates. Therefore, we refer to this algorithm YOPO (\\textbf{Y}ou \\textbf{O}nly \\textbf{P}ropagate  \\textbf{O}nce). Numerical experiments demonstrate that YOPO can achieve comparable defense accuracy with \\textbf{approximately 1/5 $\\sim$ 1/4 GPU time} of the projected gradient descent (PGD) algorithm~\\cite{kurakin2016adversarial}.", "full_text": "You Only Propagate Once: Accelerating Adversarial\n\nTraining via Maximal Principle\n\nDinghuai Zhang\u2217, Tianyuan Zhang\u2217\n\nPeking University\n\n{zhangdinghuai, 1600012888}@pku.edu.cn\n\nYiping Lu\u2217\n\nStanford University\nyplu@stanford.edu\n\nZhanxing Zhu\u2020\n\nSchool of Mathematical Sciences, Peking University\n\nCenter for Data Science, Peking University\n\nBeijing Institute of Big Data Research\n\nzhanxing.zhu@pku.edu.cn\n\nBeijing International Center for Mathematical Research, Peking University\n\nBin Dong\u2020\n\nCenter for Data Science, Peking University\n\nBeijing Institute of Big Data Research\n\ndongbin@math.pku.edu.cn\n\nAbstract\n\nDeep learning achieves state-of-the-art results in many tasks in computer vision\nand natural language processing. However, recent works have shown that deep\nnetworks can be vulnerable to adversarial perturbations, which raised a serious\nrobustness issue of deep networks. Adversarial training, typically formulated as a\nrobust optimization problem, is an effective way of improving the robustness of\ndeep networks. A major drawback of existing adversarial training algorithms is\nthe computational overhead of the generation of adversarial examples, typically\nfar greater than that of the network training. This leads to the unbearable overall\ncomputational cost of adversarial training. In this paper, we show that adversarial\ntraining can be cast as a discrete time differential game. Through analyzing the\nPontryagin\u2019s Maximum Principle (PMP) of the problem, we observe that the\nadversary update is only coupled with the parameters of the \ufb01rst layer of the\nnetwork. 
This inspires us to restrict most of the forward and back propagation within the first layer of the network during adversary updates. This effectively reduces the total number of full forward and backward propagations to only one for each group of adversary updates. Therefore, we refer to this algorithm as YOPO (You Only Propagate Once). Numerical experiments demonstrate that YOPO can achieve comparable defense accuracy with approximately 1/5 $\sim$ 1/4 the GPU time of the projected gradient descent (PGD) algorithm [15].

* Equal Contribution.
† Corresponding Authors.
Our code is available at https://github.com/a1600012888/YOPO-You-Only-Propagate-Once

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Deep neural networks achieve state-of-the-art performance on many tasks [4,8,16,21,25,44]. However, recent works show that deep networks are often sensitive to adversarial perturbations [27,35,49], i.e., changing the input in a way imperceptible to humans can cause the neural network to output an incorrect prediction. This poses significant concerns when applying deep neural networks to safety-critical problems such as autonomous driving and medical domains. To effectively defend against adversarial attacks, [26] proposed adversarial training, which can be formulated as a robust optimization problem [38]:

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \max_{\|\eta\|\le\epsilon} \ell(\theta; x+\eta, y), \qquad (1)$$

where $\theta$ is the network parameter, $\eta$ is the adversarial perturbation, and $(x, y)$ is a pair of data and label drawn from a certain distribution $\mathcal{D}$. The magnitude of the adversarial perturbation $\eta$ is restricted by $\epsilon > 0$. For a given pair $(x, y)$, we refer to the value of the inner maximization of (1), i.e. $\max_{\|\eta\|\le\epsilon} \ell(\theta; x+\eta, y)$, as the adversarial loss, which depends on $(x, y)$.

A major issue of current adversarial training methods is their significantly high computational cost. In adversarial training, we need to solve the inner loop, which is to obtain the "optimal" adversarial attack on the input, in every iteration. Such an "optimal" adversary is usually obtained using multi-step gradient descent, and thus the total time for learning a model using standard adversarial training is much more than that of standard training. For example, applying 40 inner iterations of projected gradient descent (PGD [15]) to obtain the adversarial examples makes the computational cost of solving problem (1) about 40 times that of regular training.
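As a point of reference, the following is a minimal PyTorch-style sketch of one outer step of PGD-$r$ adversarial training under the $\ell_\infty$ constraint of (1). The `model`, `loss_fn`, radius, and step sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def pgd_adversarial_step(model, loss_fn, x, y, eps=8/255, alpha=2/255, r=40):
    """One outer step of standard PGD-r adversarial training (sketch).

    Each of the r inner iterations requires a FULL forward and backward
    pass through the network, which is why PGD-r training costs roughly
    r times as much as regular training.
    """
    eta = torch.zeros_like(x).uniform_(-eps, eps)
    for _ in range(r):
        eta.requires_grad_(True)
        loss = loss_fn(model(x + eta), y)
        grad = torch.autograd.grad(loss, eta)[0]            # full back-propagation
        eta = (eta.detach() + alpha * grad.sign()).clamp(-eps, eps)
    # One more full pass to get weight gradients on the adversarial examples.
    loss = loss_fn(model(x + eta), y)
    loss.backward()
    return loss
```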
The main objective of this paper is to reduce the computational burden of adversarial training by limiting the number of forward and backward propagations without hurting the performance of the trained network. To this end, we exploit the structure that arises when the min-max objective of (1) is coupled with a deep neural network: we formulate the adversarial training problem (1) as a differential game, and then derive the Pontryagin's Maximum Principle (PMP) of the problem. From the PMP, we discover the key fact that the adversarial perturbation is only coupled with the weights of the first layer. This motivates us to propose a novel adversarial training strategy that decouples the adversary update from the training of the network parameters. This effectively reduces the total number of full forward and backward propagations to only one for each group of adversary updates, significantly lowering the overall computation cost without hampering the performance of the trained network. We name this new adversarial training algorithm YOPO (You Only Propagate Once). Our numerical experiments show that YOPO achieves approximately 4$\sim$5 times speedup over the original PGD adversarial training with comparable accuracy on MNIST/CIFAR10. Furthermore, we apply our algorithm to the recently proposed min-max optimization objective TRADES [46] and achieve better clean and robust accuracy within half of the time TRADES needs.

Figure 1: Our proposed YOPO exploits the structure of the neural network. To alleviate the heavy computation cost, YOPO focuses the calculation of the adversary on the first layer.

1.1 Related Works

Adversarial Defense. To improve the robustness of neural networks to adversarial examples, many defense strategies and models have been proposed, such as adversarial training [26], orthogonal regularization [6,22], Bayesian methods [45], TRADES [46], rejecting adversarial examples [43], Jacobian regularization [14,29], generative model based defense [12,33], pixel defense [24,31], the ordinary differential equation (ODE) viewpoint [47], ensembles via an intriguing stochastic differential equation perspective [39], and feature denoising [34,42], etc. Among all these approaches, adversarial training and its variants tend to be most effective, since they largely avoid the obfuscated gradient problem [2]. Therefore, in this paper, we choose adversarial training to achieve model robustness.

Neural ODEs. Recent works have built up the relationship between ordinary differential equations and neural networks [5,10,23,32,37,40,48]. Observing that each residual block of ResNet can be written as $u_{n+1} = u_n + \Delta t\, f(u_n)$, i.e. one step of the forward Euler method approximating the ODE $u_t = f(u)$, [19,41] proposed an optimal control framework for deep learning, and [5,19,20] utilize the adjoint equation and the maximum principle to train neural networks.

Decoupled Training. Training neural networks requires forward and backward propagation in a sequential manner. Different ways have been proposed to decouple this sequential process by parallelization, including ADMM [36], synthetic gradients [13], delayed gradients [11], and lifted machines [1,9,18]. Our work can also be understood as a decoupling method based on a splitting technique. However, we do not attempt to decouple the gradient w.r.t. the network parameters, but the adversary update instead.

1.2 Contribution

• To the best of our knowledge, this is the first attempt to design a neural-network-specific algorithm for adversarial defense. To achieve this, we recast the adversarial training problem as a discrete time differential game. From optimal control theory, we derive an optimality condition, i.e. the Pontryagin's Maximum Principle, for the differential game.

• Through the PMP, we observe that the adversarial perturbation is only coupled with the first layer of the neural network. The PMP motivates a new adversarial training algorithm, YOPO. We split the adversary computation and the weight updates, and the adversary computation is focused on the first layer.
Relations between YOPO and the original PGD are also discussed.

• We finally achieve about a 4$\sim$5 times speedup over the original PGD training with comparable results on MNIST/CIFAR10. Combining YOPO with TRADES [46], we achieve both higher clean and robust accuracy within less than half of the time TRADES needs.

1.3 Organization

This paper is organized as follows. In Section 2, we formulate the robust optimization for neural network adversarial training as a differential game and propose the gradient based YOPO. In Section 3, we derive the PMP of the differential game, study the relationship between the PMP and back-propagation based gradient descent methods, and propose a general version of YOPO. Finally, all the experimental details and results are given in Section 4.

2 Differential Game Formulation and Gradient Based YOPO

2.1 The Optimal Control Perspective and Differential Game

Inspired by the link between deep learning and optimal control [20], we formulate the robust optimization (1) as a differential game [7]. A two-player, zero-sum differential game is a game where each player controls a dynamics, and one tries to maximize, the other to minimize, a payoff functional. In the context of adversarial training, one player is the neural network, which controls the weights of the network to fit the labels, while the other is the adversary that is dedicated to producing a false prediction by modifying the input.

The robust optimization problem (1) can be written as a differential game as follows:

$$\min_{\theta} \max_{\|\eta_i\|_\infty \le \epsilon} J(\theta, \eta) := \frac{1}{N}\sum_{i=1}^{N} \ell_i(x_{i,T}) + \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} R_t(x_{i,t}; \theta_t)$$
$$\text{subject to } x_{i,1} = f_0(x_{i,0} + \eta_i, \theta_0), \quad i = 1, 2, \cdots, N,$$
$$x_{i,t+1} = f_t(x_{i,t}, \theta_t), \quad t = 1, 2, \cdots, T-1. \qquad (2)$$

Here, the dynamics $\{f_t(x_t, \theta_t),\ t = 0, 1, \ldots, T-1\}$ represent a deep neural network, $T$ denotes the number of layers, $\theta_t \in \Theta_t$ denotes the parameters in layer $t$ (denote $\theta = \{\theta_t\}_t \in \Theta$), and the function $f_t: \mathbb{R}^{d_t} \times \Theta_t \to \mathbb{R}^{d_{t+1}}$ is the nonlinear transformation of one layer of the neural network, where $d_t$ is the dimension of the $t$-th feature map and $\{x_{i,0},\ i = 1, \ldots, N\}$ is the training dataset. The variable $\eta = (\eta_1, \cdots, \eta_N)$ is the adversarial perturbation, and we constrain it in an $\infty$-norm ball. The function $\ell_i$ is a data fitting loss and $R_t$ is a regularization on the weights $\theta_t$, such as the $L_2$-norm. By casting the problem of adversarial training as the differential game (2), we regard $\theta$ and $\eta$ as two competing players, each trying to minimize/maximize the loss function $J(\theta, \eta)$, respectively.
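To fix ideas, here is a hedged sketch of a network viewed as the discrete dynamical system in (2). The three-layer CIFAR-sized architecture in `layers` is a toy assumption used only for illustration (and reused by later sketches); it is not an architecture from the paper.

```python
import torch
import torch.nn as nn

# A toy network viewed as the discrete dynamical system in (2):
# x_{t+1} = f_t(x_t, theta_t).  Sizes are illustrative assumptions.
layers = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),   # f_0: the only layer
    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU()),  # the adversary will
    nn.Sequential(nn.Flatten(), nn.Linear(16 * 32 * 32, 10)),   # need to touch is f_0
])

def forward_dynamics(x0, eta):
    """Run the state equation of game (2) for one mini-batch."""
    x = layers[0](x0 + eta)          # x_1 = f_0(x_0 + eta, theta_0)
    for f_t in layers[1:]:
        x = f_t(x)                   # x_{t+1} = f_t(x_t, theta_t)
    return x                         # x_T, fed into the terminal loss l_i
```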
2.2 Gradient Based YOPO

The Pontryagin's Maximum Principle (PMP) is a fundamental tool in optimal control that characterizes optimal solutions of the corresponding control problem [7]. PMP is a rather general framework that inspires a variety of optimization algorithms. In this paper, we will derive the PMP of the differential game (2), which motivates the proposed YOPO in its most general form. However, to better illustrate the essential idea of YOPO and to better address its relations with existing methods such as PGD, we present a special case of YOPO in this section based on gradient descent/ascent. We postpone the introduction of the PMP and the general version of YOPO to Section 3.

Let us first rewrite the original robust optimization problem (1) (in a mini-batch form) as

$$\min_{\theta} \max_{\|\eta_i\|\le\epsilon} \sum_{i=1}^{B} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i, \theta_0)), y_i\big),$$

where $f_0$ denotes the first layer, $g_{\tilde\theta} = f^{\theta_{T-1}}_{T-1} \circ f^{\theta_{T-2}}_{T-2} \circ \cdots \circ f^{\theta_1}_{1}$ denotes the network without the first layer, and $B$ is the batch size. Here $\tilde\theta$ is defined as $\{\theta_1, \cdots, \theta_{T-1}\}$. For simplicity we omit the regularization term $R_t$.

The simplest way to solve the problem is to perform gradient ascent on the input data and gradient descent on the weights of the neural network, as shown below. Such an alternating optimization algorithm is essentially the popular PGD adversarial training [26]. We summarize PGD-$r$ (for each update of $\theta$) as follows, i.e. performing $r$ iterations of gradient ascent for the inner maximization.

• For $s = 0, 1, \ldots, r-1$, perform

$$\eta_i^{s+1} = \eta_i^{s} + \alpha_1 \nabla_{\eta_i} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0)), y_i\big), \quad i = 1, \cdots, B,$$

where by the chain rule,

$$\nabla_{\eta_i}\ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0)), y_i\big) = \nabla_{g_{\tilde\theta}}\big(\ell(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0)), y_i)\big) \cdot \nabla_{f_0}\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0))\big) \cdot \nabla_{\eta_i} f_0(x_i + \eta_i^{s}, \theta_0).$$

• Perform the SGD weight update (momentum SGD can also be used here)

$$\theta \leftarrow \theta - \alpha_2 \nabla_\theta \Big(\sum_{i=1}^{B} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{r}, \theta_0)), y_i\big)\Big).$$

Note that this method conducts $r$ sweeps of forward and backward propagation for each update of $\theta$. This is the main reason why adversarial training using PGD-type algorithms can be very slow.

To reduce the total number of forward and backward propagations, we introduce a slack variable

$$p = \nabla_{g_{\tilde\theta}}\big(\ell(g_{\tilde\theta}(f_0(x_i + \eta_i, \theta_0)), y_i)\big) \cdot \nabla_{f_0}\big(g_{\tilde\theta}(f_0(x_i + \eta_i, \theta_0))\big)$$

and freeze it as a constant within the inner loop of the adversary update.
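In an autograd framework, $p$ can be extracted from a single full backward pass by differentiating the loss with respect to the output of the first layer. A minimal sketch, assuming the toy `layers` list from the previous snippet:

```python
import torch

def compute_slack_p(x, eta, y, loss_fn):
    """p = d loss / d f_0(x + eta, theta_0), via ONE full forward/backward pass.

    p summarizes everything the adversary needs from layers 1..T-1, so it
    can be frozen and reused for several cheap updates of eta.
    """
    z0 = layers[0](x + eta).detach().requires_grad_(True)  # cut the graph at f_0
    out = z0
    for f_t in layers[1:]:
        out = f_t(out)                                     # g_tilde(f_0(...))
    p = torch.autograd.grad(loss_fn(out, y), z0)[0]
    return p                                               # frozen in the inner loop
```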
The modified algorithm is given below; we shall refer to it as YOPO-$m$-$n$.

• Initialize $\{\eta_i^{1,0}\}$ for each input $x_i$. For $j = 1, 2, \cdots, m$:

  – Calculate the slack variable
    $$p = \nabla_{g_{\tilde\theta}}\big(\ell(g_{\tilde\theta}(f_0(x_i + \eta_i^{j,0}, \theta_0)), y_i)\big) \cdot \nabla_{f_0}\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{j,0}, \theta_0))\big).$$

  – Update the adversary for $s = 0, 1, \ldots, n-1$ with $p$ fixed:
    $$\eta_i^{j,s+1} = \eta_i^{j,s} + \alpha_1\, p \cdot \nabla_{\eta_i} f_0(x_i + \eta_i^{j,s}, \theta_0), \quad i = 1, \cdots, B.$$

  – Let $\eta_i^{j+1,0} = \eta_i^{j,n}$.

• Calculate the weight update
  $$U = \sum_{j=1}^{m} \nabla_\theta \Big(\sum_{i=1}^{B} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{j,n}, \theta_0)), y_i\big)\Big)$$
  and update the weights $\theta \leftarrow \theta - \alpha_2 U$. (Momentum SGD can also be used here.)

Intuitively, YOPO freezes the values of the derivatives of the network at layers $1, 2, \ldots, T-1$ during the $s$-loop of the adversary updates. Figure 2 shows the conceptual comparison between YOPO and PGD. YOPO-$m$-$n$ accesses the data $m \times n$ times while only requiring $m$ full forward and backward propagations. PGD-$r$, on the other hand, propagates the data $r$ times for $r$ full forward and backward propagations. As one can see, YOPO-$m$-$n$ has the flexibility of increasing $n$ and reducing $m$ to achieve approximately the same level of attack but with much less computational cost. For example, suppose one applies PGD-10 (i.e. 10 steps of gradient ascent for solving the inner maximization) to calculate the adversary. An alternative approach is to use YOPO-5-2, which also accesses the data 10 times, but only requires 5 full forward propagations. Empirically, YOPO-$m$-$n$ achieves comparable results while only requiring $m \times n$ to be set a little larger than $r$.

Figure 2: Pipeline of YOPO-$m$-$n$ described in Algorithm 1. The yellow and olive blocks represent feature maps, while the orange blocks represent the gradients of the loss w.r.t. the feature maps of each layer.

Another benefit of YOPO is that we take full advantage of every forward and backward propagation to update the weights: the intermediate perturbations $\eta_i^{j},\ j = 1, \cdots, m-1$ are not wasted as in PGD-$r$. This allows us to perform multiple weight updates per iteration, which potentially drives YOPO to converge faster in terms of the number of epochs. Combining these two factors, YOPO can significantly accelerate the standard PGD adversarial training.

We would like to point out a concurrent paper [30] that is related to YOPO. Their proposed method, called "Free-m", can also significantly speed up adversarial training. In fact, Free-m is essentially YOPO-$m$-1, except that YOPO-$m$-1 delays the weight update until the whole mini-batch is processed in order to make proper use of momentum: momentum should be accumulated between mini-batches rather than between different adversarial examples from one mini-batch, otherwise overfitting becomes a serious problem.
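Putting the pieces together, a hedged PyTorch-style sketch of one YOPO-$m$-$n$ outer step might look as follows; the step sizes, the $\ell_\infty$ projection, and the reuse of the toy `layers` list are assumptions, not the official implementation.

```python
import torch

def yopo_step(x, y, loss_fn, optimizer, m=5, n=3, alpha1=0.01, eps=8/255):
    """One outer step of YOPO-m-n (sketch; hyperparameters assumed).

    Each j-iteration costs ONE full forward/backward pass, which both
    accumulates weight gradients and yields the frozen slack variable p.
    """
    eta = torch.zeros_like(x).uniform_(-eps, eps)
    optimizer.zero_grad()
    for _ in range(m):                                # m full propagations
        z0 = layers[0](x + eta.detach())              # f_0(x + eta, theta_0)
        z0.retain_grad()
        out = z0
        for f_t in layers[1:]:
            out = f_t(out)
        loss_fn(out, y).backward()                    # full pass: weight grads + p
        p = z0.grad.detach()                          # slack variable, now frozen
        for _ in range(n):                            # n cheap first-layer updates
            eta = eta.detach().requires_grad_(True)
            h = (p * layers[0](x + eta)).sum()        # p . f_0(x + eta, theta_0)
            g = torch.autograd.grad(h, eta)[0]
            eta = (eta + alpha1 * g).clamp(-eps, eps) # ascend the loss, project
    optimizer.step()                                  # theta <- theta - alpha2 * U
```

Note how the inner `n`-loop touches nothing beyond `layers[0]`, which is exactly where the speedup over PGD comes from.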
3 The Pontryagin's Maximum Principle for Adversarial Training

In this section, we present the PMP of the discrete time differential game (2). From the PMP, we can observe that the adversary update and its associated back-propagation process can be decoupled. Furthermore, back-propagation based gradient descent can be understood as an iterative algorithm solving the PMP, and with that, the version of YOPO presented in the previous section can be viewed as an algorithm solving the PMP. However, the PMP facilitates a much wider class of algorithms than gradient descent algorithms [19]. Therefore, we will present a general version of YOPO based on the PMP for the discrete differential game.

3.1 PMP

The Pontryagin type of maximum principle [3,28] provides necessary conditions for optimality with a layer-wise maximization requirement on the Hamiltonian function. For each layer $t \in [T] := \{0, 1, \ldots, T-1\}$, we define the Hamiltonian function $H_t: \mathbb{R}^{d_t} \times \mathbb{R}^{d_{t+1}} \times \Theta_t \to \mathbb{R}$ as

$$H_t(x, p, \theta_t) = p \cdot f_t(x, \theta_t) - \frac{1}{B} R_t(x, \theta_t).$$

The PMP for continuous time differential games has been well studied in the literature [7]. Here, we present the PMP for our discrete time differential game (2).

Theorem 1 (PMP for adversarial training). Assume $\ell_i$ is twice continuously differentiable; $f_t(\cdot, \theta)$ and $R_t(\cdot, \theta)$ are twice continuously differentiable with respect to $x$; $f_t(\cdot, \theta), R_t(\cdot, \theta)$ together with their $x$ partial derivatives are uniformly bounded in $t$ and $\theta$; and the sets $\{f_t(x, \theta) : \theta \in \Theta_t\}$ and $\{R_t(x, \theta) : \theta \in \Theta_t\}$ are convex for every $t$ and $x \in \mathbb{R}^{d_t}$. Denote $\theta^*$ as the solution of problem (2). Then there exist co-state processes $p_i^* := \{p_{i,t}^* : t \in [T]\}$ such that the following holds for all $t \in [T]$ and $i \in [B]$:

$$x_{i,t+1}^* = \nabla_p H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \qquad x_{i,0}^* = x_{i,0} + \eta_i^*, \qquad (3)$$
$$p_{i,t}^* = \nabla_x H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \qquad p_{i,T}^* = -\frac{1}{B}\nabla \ell_i(x_{i,T}^*). \qquad (4)$$

At the same time, the parameters of the first layer $\theta_0^* \in \Theta_0$ and the optimal adversarial perturbation $\eta_i^*$ satisfy

$$\sum_{i=1}^{B} H_0(x_{i,0} + \eta_i, p_{i,1}^*, \theta_0^*) \ge \sum_{i=1}^{B} H_0(x_{i,0} + \eta_i^*, p_{i,1}^*, \theta_0^*), \quad \forall\, \|\eta_i\|_\infty \le \epsilon, \qquad (5)$$
$$\sum_{i=1}^{B} H_0(x_{i,0} + \eta_i^*, p_{i,1}^*, \theta_0^*) \ge \sum_{i=1}^{B} H_0(x_{i,0} + \eta_i^*, p_{i,1}^*, \theta_0), \quad \forall\, \theta_0 \in \Theta_0, \qquad (6)$$

and the parameters of the other layers $\theta_t^* \in \Theta_t,\ t \in [T]$ maximize the Hamiltonian functions

$$\sum_{i=1}^{B} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*) \ge \sum_{i=1}^{B} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t), \quad \forall\, \theta_t \in \Theta_t. \qquad (7)$$

Proof. The proof is in the supplementary materials.

From the theorem, we can observe that the adversary $\eta$ is only coupled with the parameters of the first layer $\theta_0$. This key observation inspires the design of YOPO.
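In code, the layer-wise Hamiltonian is little more than an inner product. The sketch below (an illustration, with $R_t$ dropped as in Section 2.2) shows $H_0$ and the adversary step of Algorithm 1 realized as $n$ gradient-descent steps on $H_0$, where `p1` is the co-state produced by the backward dynamics (4):

```python
import torch

def hamiltonian_0(x0, eta, p1):
    """H_0(x_0 + eta, p_1, theta_0) = p_1 . f_0(x_0 + eta, theta_0), R_t dropped."""
    return (p1 * layers[0](x0 + eta)).sum()

def update_adversary(x0, eta, p1, alpha1=0.01, eps=8/255, n=3):
    """Approximate eta = argmin_eta H_0 with n gradient steps (sketch).

    Only f_0 is evaluated and differentiated here. Since p_1 carries the
    factor -grad l / B from (4), descending H_0 ascends the loss.
    """
    for _ in range(n):
        eta = eta.detach().requires_grad_(True)
        g = torch.autograd.grad(hamiltonian_0(x0, eta, p1), eta)[0]
        eta = (eta - alpha1 * g).clamp(-eps, eps)   # descend H_0, project
    return eta.detach()
```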
3.2 PMP and Back-Propagation Based Gradient Descent

The classical back-propagation based gradient descent algorithm [17] can be viewed as an algorithm attempting to solve the PMP. Without loss of generality, we can let the regularization term $R = 0$, since we can simply add an extra dynamic $w_t$ to evaluate the regularization term $R$, i.e.

$$w_{t+1} = w_t + R_t(x_t, \theta_t), \quad w_0 = 0.$$

We append $w$ to $x$ to study the dynamics of a new $(d_t + 1)$-dimensional vector, and change $f_t(x, \theta_t)$ to $(f_t(x, \theta_t),\, w + R_t(x, \theta_t))$. The relationship between the PMP and the back-propagation based gradient descent method was first observed by Li et al. [19]. They showed that the forward dynamical system Eq.(3) is the same as the neural network forward propagation. The backward dynamical system Eq.(4) is the back-propagation, which is formally described by the following lemma.

Lemma 1.
$$p_t^* = \nabla_x H_t(x_t^*, p_{t+1}^*, \theta_t^*) = \nabla_x f_t(x_t^*, \theta_t^*)^T p_{t+1}^* = (\nabla_{x_t} x_{t+1}^*)^T \cdot \big(-\nabla_{x_{t+1}} \ell(x_T)\big) = -\nabla_{x_t} \ell(x_T).$$

To solve the maximization of the Hamiltonian, a simple way is gradient ascent:

$$\theta_t^1 = \theta_t^0 + \alpha \cdot \nabla_\theta \sum_{i=1}^{B} H_t\big(x_{i,t}^{\theta^0}, p_{i,t+1}^{\theta^0}, \theta_t^0\big). \qquad (8)$$

Theorem 2. The update (8) is equivalent to the gradient descent method for training networks [19,20].

3.3 YOPO from the PMP's Viewpoint

Based on the relationship between back-propagation and the Pontryagin's Maximum Principle, in this section we provide a new understanding of YOPO: it solves the PMP of the differential game. Observe that, in the PMP, the adversary $\eta$ is only coupled with the weights of the first layer $\theta_0$. Thus we can update the adversary via minimizing the Hamiltonian function instead of directly attacking the loss function, as described in Algorithm 1.

For YOPO-$m$-$n$, to approximate the exact minimization of the Hamiltonian, we perform $n$ gradient descent steps to update the adversary. Furthermore, in order to make the calculation of the adversary more accurate, we iteratively pass each data point $m$ times. Besides, the network weights are optimized by performing gradient ascent on the Hamiltonian, resulting in the gradient-based YOPO proposed in Section 2.2.
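Lemma 1 can also be checked numerically: running the backward dynamics (4) layer by layer is exactly a sequence of vector-Jacobian products, i.e. back-propagation. A sketch under the same toy setup (the $1/B$ factor is omitted for readability):

```python
import torch

def forward_with_states(x0):
    """Store the trajectory x_0, ..., x_T of the forward dynamics Eq.(3)."""
    xs = [x0.requires_grad_(True)]
    for f_t in layers:
        xs.append(f_t(xs[-1]))
    return xs

def costate_backward(xs, loss):
    """Backward dynamics Eq.(4) with R = 0 (1/B factor omitted).

    Each step computes p_t = (d f_t / d x_t)^T p_{t+1}, a vector-Jacobian
    product -- exactly what back-propagation evaluates at this layer.
    """
    p = -torch.autograd.grad(loss, xs[-1], retain_graph=True)[0]   # p_T = -grad l
    ps = [p]
    for x_t, x_next in zip(reversed(xs[:-1]), reversed(xs[1:])):
        p = torch.autograd.grad(x_next, x_t, grad_outputs=p,
                                retain_graph=True)[0]
        ps.append(p)
    return list(reversed(ps))                                      # p_0, ..., p_T
```

By Lemma 1, each `ps[t]` should agree with $-\nabla_{x_t}\ell(x_T)$ as computed directly by autograd.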
4 Experiments

4.1 YOPO for Adversarial Training

To demonstrate the effectiveness of YOPO, we conduct experiments on MNIST and CIFAR10. We find that models trained with YOPO have performance comparable to PGD adversarial training, but at a much lower computational cost. We also compare our method with the concurrent method "For Free" [30], and the results show that our algorithm can achieve comparable performance with around 2/3 of the GPU time of their official implementation.

MNIST. We achieve results comparable with the best in [5] within 250 seconds, while it takes PGD-40 more than 1250 seconds to reach the same level. The accuracy-time curve is shown in Figure 3(a). Quantitative results can be found in the supplementary materials. Naively reducing the number of backprops from PGD-40 to PGD-10 harms the robustness, as can also be seen in the supplementary materials.

CIFAR10. [26] performs 7-step PGD to generate the adversary during training. As a comparison, we test YOPO-3-5 and YOPO-5-3 with a step size of 2/255. Quantitative results can be found in Table 1 and the supplementary materials. Under PreAct-Res18, YOPO-5-3 achieves robust accuracy comparable to [26] with around half the computation per epoch. The accuracy-time curve is shown in Figure 3(b).

Algorithm 1 YOPO (You Only Propagate Once)
  Randomly initialize the network parameters, or use a pre-trained network.
  repeat
    Randomly select a mini-batch B = {(x_1, y_1), ..., (x_B, y_B)} from the training set.
    Initialize eta_i, i = 1, 2, ..., B, by sampling from a uniform distribution on [-epsilon, epsilon].
    for j = 1 to m do
      x_{i,0} = x_i + eta_i^j, i = 1, 2, ..., B
      for t = 0 to T-1 do
        x_{i,t+1} = grad_p H_t(x_{i,t}, p_{i,t+1}, theta_t), i = 1, 2, ..., B
      end for
      p_{i,T} = -(1/B) grad l(x_{i,T}), i = 1, 2, ..., B
      for t = T-1 to 0 do
        p_{i,t} = grad_x H_t(x_{i,t}, p_{i,t+1}, theta_t), i = 1, 2, ..., B
      end for
      eta_i^j = argmin_{eta_i} H_0(x_{i,0} + eta_i, p_{i,1}, theta_0), i = 1, 2, ..., B
    end for
    for t = T-1 to 1 do
      theta_t = argmax_{theta_t} sum_{i=1}^{B} H_t(x_{i,t}, p_{i,t+1}, theta_t)
    end for
    theta_0 = argmax_{theta_0} (1/m) sum_{j=1}^{m} sum_{i=1}^{B} H_0(x_{i,0} + eta_i^j, p_{i,1}, theta_0)
  until convergence

Figure 3: Performance w.r.t. training time. (a) "Small CNN" [46] results on MNIST; (b) PreAct-Res18 results on CIFAR10.

As for Wide ResNet34, YOPO-5-3 still achieves a similar acceleration over PGD-10, as shown in Table 1. We also test PGD-3/5 to show that naively reducing the number of backward passes for this min-max problem [26] cannot produce comparable results within the same computation time as YOPO. Meanwhile, YOPO-3-5 achieves a more aggressive speedup with only a slight drop in robustness.

Table 1: Results of Wide ResNet34 for CIFAR10.
  Training Method  | Clean Data | PGD-20 Attack | Training Time (mins)
  Natural train    | 95.03%     | 0.00%         | 233
  PGD-3 [26]       | 90.07%     | 39.18%        | 1134
  PGD-5 [26]       | 89.65%     | 43.85%        | 1574
  PGD-10 [26]      | 87.30%     | 47.04%        | 2713
  Free-8 [30]*     | 86.29%     | 47.00%        | 667
  YOPO-3-5 (Ours)  | 87.27%     | 43.04%        | 299
  YOPO-5-3 (Ours)  | 86.70%     | 47.98%        | 476
  * Code from https://github.com/ashafahi/free_adv_train.

4.2 YOPO for TRADES

TRADES [46] formulated a new min-max objective function for adversarial defense and achieves state-of-the-art adversarial defense results. The experimental details are in the supplementary materials, and quantitative results are shown in Table 2.

Table 2: Results of training PreAct-Res18 for CIFAR10 with the TRADES objective.
  Training Method          | Clean Data | PGD-20 Attack | CW Attack | Training Time (mins)
  TRADES-10 [46]           | 86.14%     | 44.50%        | 58.40%    | 633
  TRADES-YOPO-3-4 (Ours)   | 87.82%     | 46.13%        | 59.48%    | 259
  TRADES-YOPO-2-5 (Ours)   | 88.15%     | 42.48%        | 59.25%    | 218
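For completeness, here is a hedged sketch of the PGD-20 evaluation protocol behind the robust-accuracy columns above; the radius and step size are common CIFAR10 choices assumed for illustration, not quoted from the paper.

```python
import torch

def pgd20_accuracy(model, loader, loss_fn, eps=8/255, alpha=2/255, steps=20):
    """Robust accuracy under a white-box PGD-20 attack (illustrative)."""
    correct = total = 0
    for x, y in loader:
        eta = torch.zeros_like(x).uniform_(-eps, eps)
        for _ in range(steps):
            eta.requires_grad_(True)
            grad = torch.autograd.grad(loss_fn(model(x + eta), y), eta)[0]
            eta = (eta.detach() + alpha * grad.sign()).clamp(-eps, eps)
            eta = (x + eta).clamp(0, 1) - x        # keep images in [0, 1]
        with torch.no_grad():
            correct += (model(x + eta).argmax(1) == y).sum().item()
        total += y.numel()
    return correct / total
```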
5 Conclusion

In this work, we have developed an efficient strategy for accelerating adversarial training. We recast the adversarial training of deep neural networks as a discrete time differential game and derive a Pontryagin's Maximum Principle (PMP) for it. Based on this maximum principle, we discover that the adversary is only coupled with the weights of the first layer. This motivates us to split the adversary updates from the back-propagation gradient calculation. The proposed algorithm, called YOPO, avoids computing full forward and backward propagations too many times, thus effectively reducing the computational time, as supported by our experiments.

Acknowledgement

We thank Di He and Long Chen for beneficial discussions. Zhanxing Zhu is supported in part by the National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), and the Beijing Academy of Artificial Intelligence (BAAI). Bin Dong is supported in part by Beijing Natural Science Foundation (No. Z180001) and the Beijing Academy of Artificial Intelligence (BAAI). Dinghuai Zhang is supported by the Elite Undergraduate Training Program of Applied Math of the School of Mathematical Sciences at Peking University.

References

[1] Armin Askari, Geoffrey Negiar, Rajiv Sambharya, and Laurent El Ghaoui. Lifted neural networks. arXiv preprint arXiv:1805.01532, 2018.

[2] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

[3] Vladimir Grigor'evich Boltyanskii, Revaz Valer'yanovich Gamkrelidze, and Lev Semenovich Pontryagin. The theory of optimal processes. I. The maximum principle. Technical report, TRW Space Technology Labs, Los Angeles, CA, 1960.

[4] Peng Cao, Yilun Xu, Yuqing Kong, and Yizhou Wang. Max-MIG: An information theoretic approach for joint learning from crowds. arXiv preprint arXiv:1905.13436, 2019.

[5] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6572–6583, 2018.

[6] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, pages 854–863. JMLR.org, 2017.

[7] Lawrence C Evans. An introduction to mathematical optimal control theory. Lecture Notes, University of California, Department of Mathematics, Berkeley, 2005.

[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[9] Fangda Gu, Armin Askari, and Laurent El Ghaoui. Fenchel lifted networks: A Lagrange relaxation of neural network training. arXiv preprint arXiv:1811.08039, 2018.

[10] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

[11] Zhouyuan Huo, Bin Gu, Qian Yang, and Heng Huang. Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574, 2018.

[12] Andrew Ilyas, Ajil Jalal, Eirini Asteri, Constantinos Daskalakis, and Alexandros G Dimakis. The robust manifold defense: Adversarial training using generative models. arXiv preprint arXiv:1712.09196, 2017.

[13] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. In Proceedings of the 34th International Conference on Machine Learning, pages 1627–1635. JMLR.org, 2017.

[14] Daniel Jakubovitz and Raja Giryes.
Improving DNN robustness to adversarial attacks using Jacobian regularization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 514–529, 2018.

[15] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.

[16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[17] Yann LeCun, D Touresky, G Hinton, and T Sejnowski. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, volume 1, pages 21–28. CMU, Pittsburgh, PA: Morgan Kaufmann, 1988.

[18] Jia Li, Cong Fang, and Zhouchen Lin. Lifted proximal operator machines. arXiv preprint arXiv:1811.01501, 2018.

[19] Qianxiao Li, Long Chen, Cheng Tai, and E Weinan. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1):5998–6026, 2017.

[20] Qianxiao Li and Shuji Hao. An optimal control approach to deep learning and applications to discrete-weight neural networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2985–2994, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[21] Xuechen Li, Denny Wu, Lester Mackey, and Murat A Erdogdu. Stochastic Runge-Kutta accelerates Langevin Monte Carlo and beyond. arXiv preprint arXiv:1906.07868, 2019.

[22] Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness. In International Conference on Learning Representations, 2019.

[23] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.

[24] Tiange Luo, Tianle Cai, Mengxiao Zhang, Siyu Chen, and Liwei Wang. RANDOM MASK: Towards robust convolutional neural networks, 2019.

[25] Pingchuan Ma, Yunsheng Tian, Zherong Pan, Bo Ren, and Dinesh Manocha. Fluid directed rigid body control using deep reinforcement learning. ACM Transactions on Graphics (TOG), 37(4):96, 2018.

[26] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

[27] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.

[28] Lev Semenovich Pontryagin. Mathematical Theory of Optimal Processes. CRC, 1987.

[29] Haifeng Qian and Mark N Wegman. L2-nonexpansive neural networks. arXiv preprint arXiv:1802.07896, 2018.

[30] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Xu Zeng, John Dickerson, Christoph Studer, Larry S. Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.

[31] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.

[32] Sho Sonoda and Noboru Murata. Transport analysis of infinitely deep neural network.
The Journal of Machine Learning Research, 20(1):31–82, 2019.

[33] Ke Sun, Zhanxing Zhu, and Zhouchen Lin. Enhancing the robustness of deep neural networks by boundary conditional GAN. arXiv preprint arXiv:1902.11029, 2019.

[34] Jan Svoboda, Jonathan Masci, Federico Monti, Michael Bronstein, and Leonidas Guibas. PeerNets: Exploiting peer wisdom against adversarial attacks. In International Conference on Learning Representations, 2019.

[35] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[36] Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: A scalable ADMM approach. In International Conference on Machine Learning, pages 2722–2731, 2016.

[37] Matthew Thorpe and Yves van Gennip. Deep limits of residual neural networks. arXiv preprint arXiv:1810.11741, 2018.

[38] Abraham Wald. Contributions to the theory of statistical estimation and testing hypotheses. The Annals of Mathematical Statistics, 10(4):299–326, 1939.

[39] Bao Wang, Binjie Yuan, Zuoqiang Shi, and Stanley J Osher. EnResNet: ResNet ensemble via the Feynman-Kac formalism. arXiv preprint arXiv:1811.10745, 2018.

[40] E Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

[41] E Weinan, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):10, 2019.

[42] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.

[43] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.

[44] Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. L_DMI: An information-theoretic noise-robust loss function. arXiv preprint arXiv:1909.03388, 2019.

[45] Nanyang Ye and Zhanxing Zhu. Bayesian adversarial learning. In Advances in Neural Information Processing Systems, pages 6892–6901, 2018.

[46] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.

[47] Jingfeng Zhang, Bo Han, Laura Wynter, Kian Hsiang Low, and Mohan Kankanhalli. Towards robust ResNet: A small step but a giant leap. arXiv preprint arXiv:1902.10887, 2019.

[48] Xiaoshuai Zhang, Yiping Lu, Jiaying Liu, and Bin Dong. Dynamically unfolding recurrent restorer: A moving endpoint control method for image restoration. In International Conference on Learning Representations, 2019.

[49] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2847–2856.
ACM, 2018.
", "award": [], "sourceid": 95, "authors": [{"given_name": "Dinghuai", "family_name": "Zhang", "institution": "Peking University"}, {"given_name": "Tianyuan", "family_name": "Zhang", "institution": "Peking University"}, {"given_name": "Yiping", "family_name": "Lu", "institution": "Peking University"}, {"given_name": "Zhanxing", "family_name": "Zhu", "institution": "Peking University"}, {"given_name": "Bin", "family_name": "Dong", "institution": "Peking University"}]}