{"title": "Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon", "book": "Advances in Neural Information Processing Systems", "page_first": 4857, "page_last": 4867, "abstract": "How to develop slim and accurate deep neural networks has become crucial for real- world applications, especially for those employed in embedded systems. Though previous work along this research line has shown some promising results, most existing methods either fail to significantly compress a well-trained deep network or require a heavy retraining process for the pruned deep network to re-boost its prediction performance. In this paper, we propose a new layer-wise pruning method for deep neural networks. In our proposed method, parameters of each individual layer are pruned independently based on second order derivatives of a layer-wise error function with respect to the corresponding parameters. We prove that the final prediction performance drop after pruning is bounded by a linear combination of the reconstructed errors caused at each layer. By controlling layer-wise errors properly, one only needs to perform a light retraining process on the pruned network to resume its original prediction performance. We conduct extensive experiments on benchmark datasets to demonstrate the effectiveness of our pruning method compared with several state-of-the-art baseline methods. Codes of our work are released at: https://github.com/csyhhu/L-OBS.", "full_text": "Learning to Prune Deep Neural Networks via\n\nLayer-wise Optimal Brain Surgeon\n\nXin Dong\n\nShangyu Chen\n\nNanyang Technological University, Singapore\n\nNanyang Technological University, Singapore\n\nn1503521a@e.ntu.edu.sg\n\nschen025@e.ntu.edu.sg\n\nSinno Jialin Pan\n\nNanyang Technological University, Singapore\n\nsinnopan@ntu.edu.sg\n\nAbstract\n\nHow to develop slim and accurate deep neural networks has become crucial for real-\nworld applications, especially for those employed in embedded systems. 
Though\nprevious work along this research line has shown some promising results, most\nexisting methods either fail to signi\ufb01cantly compress a well-trained deep network\nor require a heavy retraining process for the pruned deep network to re-boost its\nprediction performance. In this paper, we propose a new layer-wise pruning method\nfor deep neural networks. In our proposed method, parameters of each individual\nlayer are pruned independently based on second order derivatives of a layer-wise\nerror function with respect to the corresponding parameters. We prove that the\n\ufb01nal prediction performance drop after pruning is bounded by a linear combination\nof the reconstructed errors caused at each layer. By controlling layer-wise errors\nproperly, one only needs to perform a light retraining process on the pruned network\nto resume its original prediction performance. We conduct extensive experiments\non benchmark datasets to demonstrate the effectiveness of our pruning method\ncompared with several state-of-the-art baseline methods. Codes of our work are\nreleased at: https://github.com/csyhhu/L-OBS.\n\n1\n\nIntroduction\n\nIntuitively, deep neural networks [1] can approximate predictive functions of arbitrary complexity\nwell when they are of a huge amount of parameters, i.e., a lot of layers and neurons. In practice, the\nsize of deep neural networks has been being tremendously increased, from LeNet-5 with less than\n1M parameters [2] to VGG-16 with 133M parameters [3]. Such a large number of parameters not\nonly make deep models memory intensive and computationally expensive, but also urge researchers\nto dig into redundancy of deep neural networks. On one hand, in neuroscience, recent studies point\nout that there are signi\ufb01cant redundant neurons in human brain, and memory may have relation with\nvanishment of speci\ufb01c synapses [4]. 
On the other hand, in machine learning, both theoretical analysis and empirical experiments have shown evidence of redundancy in several deep models [5, 6]. Therefore, it is possible to compress deep neural networks with little or no loss in prediction by pruning parameters with carefully designed criteria.\nHowever, finding an optimal pruning solution is NP-hard because the search space for pruning is exponential in the number of parameters. Recent work mainly focuses on developing efficient algorithms to obtain a near-optimal pruning solution [7, 8, 9, 10, 11]. A common idea behind most existing approaches is to select parameters for pruning based on certain criteria, such as the increase in training error, the magnitude of the parameter values, etc. As most of the existing pruning criteria are designed heuristically, there is no guarantee that the prediction performance of a deep neural network can be preserved after pruning. Therefore, a time-consuming retraining process is usually needed to boost the performance of the trimmed neural network.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nInstead of spending effort on the whole deep network at once, a layer-wise pruning method, Net-Trim, was proposed to learn sparse parameters by minimizing the reconstructed error for each individual layer [6]. A theoretical analysis is provided showing that the overall performance drop of the deep network is bounded by the sum of the reconstructed errors of each layer. In this way, the pruned deep network has a theoretical guarantee on its error. 
However, as Net-Trim adopts the ℓ1-norm to induce sparsity for pruning, it fails to obtain a high compression ratio compared with other methods [9, 11].\nIn this paper, we propose a new layer-wise pruning method for deep neural networks, aiming to achieve the following three goals: 1) For each layer, parameters can be highly compressed after pruning, while the reconstructed error is small. 2) There is a theoretical guarantee on the overall prediction performance of the pruned deep neural network in terms of the reconstructed errors of each layer. 3) After the deep network is pruned, only a light retraining process is required to resume its original prediction performance.\nTo achieve our first goal, we borrow an idea from classic pruning approaches for shallow neural networks, such as optimal brain damage (OBD) [12] and optimal brain surgeon (OBS) [13]. These classic methods approximate the change in the error function via a functional Taylor series, and identify unimportant weights based on second-order derivatives. Though these approaches have proven to be effective for shallow neural networks, it remains challenging to extend them to deep neural networks because of the high computational cost of computing second-order derivatives, i.e., the inverse of the Hessian matrix over all the parameters. In this work, as we restrict the computation of second-order derivatives to the parameters of each individual layer only, i.e., the Hessian matrix is only over the parameters of a specific layer, the computation becomes tractable. 
Moreover, we utilize\ncharacteristics of back-propagation for fully-connected layers in well-trained deep networks to further\nreduce computational complexity of the inverse operation of the Hessian matrix.\nTo achieve our second goal, based on the theoretical results in [6], we provide a proof on the bound\nof performance drop before and after pruning in terms of the reconstructed errors for each layer.\nWith such a layer-wise pruning framework using second-order derivatives for trimming parameters\nfor each layer, we empirically show that after signi\ufb01cantly pruning parameters, there is only a little\ndrop of prediction performance compared with that before pruning. Therefore, only a light retraining\nprocess is needed to resume the performance, which achieves our third goal.\nThe contributions of this paper are summarized as follows. 1) We propose a new layer-wise pruning\nmethod for deep neural networks, which is able to signi\ufb01cantly trim networks and preserve the\nprediction performance of networks after pruning with a theoretical guarantee. In addition, with the\nproposed method, a time-consuming retraining process for re-boosting the performance of the pruned\nnetwork is waived. 2) We conduct extensive experiments to verify the effectiveness of our proposed\nmethod compared with several state-of-the-art approaches.\n\n2 Related Works and Preliminary\n\nPruning methods have been widely used for model compression in early neural networks [7] and\nmodern deep neural networks [6, 8, 9, 10, 11]. In the past, with relatively small size of training data,\npruning is crucial to avoid over\ufb01tting. Classical methods include OBD and OBS. These methods\naim to prune parameters with the least increase of error approximated by second order derivatives.\nHowever, computation of the Hessian inverse over all the parameters is expensive. In OBD, the\nHessian matrix is restricted to be a diagonal matrix to make it computationally tractable. 
However, this approach implicitly assumes that the parameters have no interactions, which may hurt the pruning performance. Different from OBD, OBS makes use of the full Hessian matrix for pruning. It obtains better performance but is much more computationally expensive, even when using the Woodbury matrix identity [14], which is an iterative method to compute the Hessian inverse. For example, using OBS on VGG-16 requires computing the inverse of a Hessian matrix of size 133M × 133M.\nRegarding pruning for modern deep models, Han et al. [9] proposed to delete unimportant parameters based on the magnitude of their absolute values, and to retrain the remaining ones to recover the original prediction performance. This method achieves a considerable compression ratio in practice. However, as pointed out by pioneering research work [12, 13], parameters with small absolute values can be necessary for achieving a low error. Therefore, magnitude-based approaches may eliminate the wrong parameters, resulting in a big prediction performance drop right after pruning, and poor robustness before retraining [15]. Though some variants have tried to find better magnitude-based criteria [16, 17], the significant drop of prediction performance after pruning still remains. To avoid pruning the wrong parameters, Guo et al. [11] introduced a mask matrix to indicate the state of the network connections for dynamic pruning after each gradient descent step. Jin et al. [18] proposed an iterative hard thresholding approach to re-activate the pruned parameters after each pruning phase.\nBesides Net-Trim, the layer-wise pruning method discussed in the previous section, there is some other work proposed to induce sparsity or low-rank approximation on certain layers for pruning [19, 20]. 
However, as the ℓ0-norm or the ℓ1-norm sparsity-inducing regularization term increases the difficulty of optimization, the pruned deep neural networks using these methods either obtain a much smaller compression ratio [6] compared with direct pruning methods or require retraining of the whole network to prevent accumulation of errors [10].\nOptimal Brain Surgeon As our proposed layer-wise pruning method is an extension of OBS to deep neural networks, we briefly review the basics of OBS here. Consider a network in terms of parameters w trained to a local minimum in error. The functional Taylor series of the error w.r.t. w is: δE = (∂E/∂w)ᵀδw + ½δwᵀHδw + O(‖δw‖³), where δ denotes a perturbation of the corresponding variable, H ≡ ∂²E/∂w² ∈ ℝ^(m×m) is the Hessian matrix, where m is the number of parameters, and O(‖δw‖³) contains the third- and all higher-order terms. For a network trained to a local minimum in error, the first term vanishes, and the term O(‖δw‖³) can be ignored. In OBS, the goal in each pruning iteration is to set one of the parameters, denoted by wq (a scalar), to zero so as to minimize δE. The resultant optimization problem is written as follows,\n\nmin_q ½δwᵀHδw, s.t. eqᵀδw + wq = 0,   (1)\n\nwhere eq is the unit selecting vector whose q-th element is 1 and all other elements are 0. As shown in [21], the optimization problem (1) can be solved by the Lagrange multipliers method. 
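The Lagrange-multiplier solution referenced here is short enough to spell out (a standard derivation, reproduced for completeness; it yields the closed-form update that reappears as (6) in Section 3.2):

```latex
% Lagrangian of problem (1): prune w_q with minimal increase in error
L(\delta w, \lambda) = \tfrac{1}{2}\,\delta w^{\top} H\,\delta w
                     + \lambda \left( e_q^{\top} \delta w + w_q \right)
% Stationarity in \delta w:  H \delta w + \lambda e_q = 0
%   =>  \delta w = -\lambda H^{-1} e_q
% Enforcing the constraint e_q^{\top} \delta w + w_q = 0:
%   =>  \lambda = w_q / [H^{-1}]_{qq}
% Optimal perturbation and resulting error increase (sensitivity):
\delta w = -\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q,
\qquad
L_q = \tfrac{1}{2}\,\delta w^{\top} H\,\delta w
    = \frac{w_q^{2}}{2\,[H^{-1}]_{qq}}
```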
Note that a computational bottleneck of OBS is to calculate and store the non-diagonal Hessian matrix and its inverse, which makes it impractical for pruning deep models that usually have a huge number of parameters.\n\n3 Layer-wise Optimal Brain Surgeon\n\n3.1 Problem Statement\n\nGiven a training set of n instances, {(xj, yj)}_(j=1..n), and a well-trained deep neural network of L layers (excluding the input layer)¹, denote the input and the output of the whole deep neural network by X = [x1, ..., xn] ∈ ℝ^(d×n) and Y ∈ ℝ^(n×1), respectively. For a layer l, we denote the input and output of the layer by Y^(l−1) = [y^(l−1)_1, ..., y^(l−1)_n] ∈ ℝ^(m_(l−1)×n) and Y^l = [y^l_1, ..., y^l_n] ∈ ℝ^(m_l×n), respectively, where y^l_i can be considered as a representation of xi in layer l, and Y^0 = X, Y^L = Y, and m0 = d. Using one forward-pass step, we have Y^l = σ(Z^l), where Z^l = W_lᵀY^(l−1), with W_l ∈ ℝ^(m_(l−1)×m_l) being the matrix of parameters for layer l, and σ(·) is the activation function. For convenience in presentation and proof, we define the activation function σ(·) as the rectified linear unit (ReLU) [22]. We further denote by Θ_l ∈ ℝ^(m_(l−1)m_l×1) the vectorization of W_l. For a well-trained neural network, Y^l, Z^l and Θ*_l are all fixed matrices and contain most of the information of the neural network. The goal of pruning is to set the values of some elements in Θ_l to zero.\n\n3.2 Layer-Wise Error\n\nDuring layer-wise pruning in layer l, the input Y^(l−1) is fixed to be the same as in the well-trained network. Suppose we set the q-th element of Θ_l, denoted by Θ_l[q], to zero, and get a new parameter vector, denoted by Θ̂_l. With Y^(l−1), we obtain a new output for layer l, denoted by Ŷ_l. 
Consider the root of mean square error between Ŷ_l and Y_l over the whole training data as the layer-wise error:\n\nε_l = √( (1/n) Σ_(j=1..n) (ŷ^l_j − y^l_j)ᵀ(ŷ^l_j − y^l_j) ) = (1/√n)‖Ŷ_l − Y_l‖_F,   (2)\n\nwhere ‖·‖_F is the Frobenius norm.\n\n¹For simplicity in presentation, we suppose the neural network is a feed-forward (fully-connected) network. In Section 3.4, we will show how to extend our method to filter layers in Convolutional Neural Networks.\n\nNote that for any single parameter pruning, one can compute its error ε^l_q, where 1 ≤ q ≤ m_(l−1)m_l, and use it as a pruning criterion. This idea has been adopted by some existing methods [15]. However, in this way, for each parameter at each layer, one has to pass over the whole training data once to compute its error measure, which is very computationally expensive. A more efficient approach is to make use of the second-order derivatives of the error function to help identify the importance of each parameter.\nWe first define an error function E(·) as\n\nE_l = E(Ẑ_l) = (1/n)‖Ẑ_l − Z_l‖²_F,   (3)\n\nwhere Z_l is the outcome of the weighted-sum operation right before performing the activation function σ(·) at layer l of the well-trained neural network, and Ẑ_l is the outcome of the weighted-sum operation after pruning at layer l. Note that Z_l is considered as the desired output of layer l before activation. The following lemma shows that the layer-wise error is bounded by the error defined in (3).\nLemma 3.1. 
With the error function (3) and Y_l = σ(Z_l), the following holds: ε_l ≤ √(E(Ẑ_l)).\nTherefore, finding the parameters whose deletion (setting to zero) minimizes (2) can be translated into finding the parameters whose deletion minimizes the error function (3). Following [12, 13], the error function can be approximated by a functional Taylor series as follows,\n\nE(Ẑ_l) − E(Z_l) = δE_l = (∂E_l/∂Θ_l)ᵀδΘ_l + ½δΘ_lᵀH_lδΘ_l + O(‖δΘ_l‖³),   (4)\n\nwhere δ denotes a perturbation of the corresponding variable, H_l ≡ ∂²E_l/∂Θ_l² is the Hessian matrix w.r.t. Θ_l, and O(‖δΘ_l‖³) contains the third- and all higher-order terms. It can be proven that with the error function defined in (3), the first (linear) term (∂E_l/∂Θ_l)ᵀδΘ_l evaluated at Θ_l = Θ*_l, as well as O(‖δΘ_l‖³), is equal to 0.\nSuppose every time one aims to find a parameter Θ_l[q] to set to zero such that the change δE_l is minimal. Similar to OBS, we can formulate this as the following optimization problem:\n\nmin_q ½δΘ_lᵀH_lδΘ_l, s.t. e_qᵀδΘ_l + Θ_l[q] = 0,   (5)\n\nwhere e_q is the unit selecting vector whose q-th element is 1 and all other elements are 0. By using the Lagrange multipliers method as suggested in [21], we obtain the closed-form solution of the optimal parameter change and the resultant minimal change in the error function as follows,\n\nδΘ_l = −(Θ_l[q]/[H_l⁻¹]_qq) H_l⁻¹e_q, and L_q = δE_l = (Θ_l[q])² / (2[H_l⁻¹]_qq).   (6)\n\nHere L_q is referred to as the sensitivity of parameter Θ_l[q]. 
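As a concrete sketch, one pruning iteration based on (6) can be written in a few lines of NumPy. This is our simplified illustration rather than the authors' released implementation; `H_inv` is assumed to be the precomputed (pseudo-)inverse Hessian of the layer, obtained as described in Section 3.4.

```python
import numpy as np

def lobs_prune_step(theta, H_inv):
    """One layer-wise OBS iteration: pick the parameter with the smallest
    sensitivity L_q = theta[q]^2 / (2 * [H^-1]_qq), and compute the
    compensating update (6) for all remaining weights."""
    sensitivity = theta ** 2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(sensitivity))
    # Closed-form solution of (5): delta = -theta[q] / [H^-1]_qq * H^-1 e_q
    delta = -theta[q] / H_inv[q, q] * H_inv[:, q]
    theta_new = theta + delta          # theta_new[q] is (numerically) zero
    return theta_new, q, sensitivity[q]

# Toy layer: Hessian built from random layer inputs, H = Y Y^T / n
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 50))           # m_{l-1} x n matrix of layer inputs
H = Y @ Y.T / 50
H_inv = np.linalg.inv(H)
theta = rng.normal(size=6)
theta_new, q, Lq = lobs_prune_step(theta, H_inv)
```

In practice this step would be repeated, pruning parameters in order of increasing sensitivity until the tolerable error threshold of Section 3.4 is reached.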
Then we select parameters to prune based on their sensitivity scores instead of their magnitudes. As mentioned in Section 2, magnitude-based criteria, which merely consider the numerator in (6), are a poor estimate of the sensitivity of parameters. Moreover, as the inverse Hessian matrix over the training data is involved in (6), it is able to capture the data distribution when measuring the sensitivities of parameters.\nAfter pruning the parameter Θ_l[q] with the smallest sensitivity, the parameter vector is updated via Θ̂_l = Θ_l + δΘ_l. With Lemma 3.1 and (6), we have that the layer-wise error for layer l is bounded by\n\nε^l_q ≤ √(E(Ẑ_l)) = √(E(Ẑ_l) − E(Z_l)) = √(δE_l) = |Θ_l[q]| / √(2[H_l⁻¹]_qq).   (7)\n\nNote that the first equality is obtained from the fact that E(Z_l) = 0. It is worth mentioning that though we merely focus on layer l, the Hessian matrix is still a square matrix of size m_(l−1)m_l × m_(l−1)m_l. However, we will show how to significantly reduce the computation of H_l⁻¹ for each layer in Section 3.4.\n\n3.3 Layer-Wise Error Propagation and Accumulation\n\nSo far, we have shown how to prune parameters for each layer and estimate their introduced errors independently. However, our aim is to control the consistency of the network's final output Y_L before and after pruning. To do this, in the following, we show how the layer-wise errors propagate to the final output layer, and that the accumulated error over multiple layers does not explode.\nTheorem 3.2. 
Given a deep network pruned by the layer-wise pruning introduced in Section 3.2, where each layer has its own layer-wise error ε_l for 1 ≤ l ≤ L, the accumulated error of the ultimate network output, ε̃_L = (1/√n)‖Ỹ_L − Y_L‖_F, obeys:\n\nε̃_L ≤ Σ_(k=1..L−1) ( Π_(l=k+1..L) ‖Θ̂_l‖_F ) √(δE_k) + √(δE_L),   (8)\n\nwhere Ỹ_l = σ(Ŵ_lᵀỸ_(l−1)), for 2 ≤ l ≤ L, denotes the 'accumulated pruned output' of layer l, and Ỹ_1 = σ(Ŵ_1ᵀX).\nTheorem 3.2 shows that: 1) The layer-wise error of a layer l is scaled by the continued product of the parameters' Frobenius norms over the following layers when it propagates to the final output, i.e., over the L−l layers after the l-th layer; 2) The final error of the ultimate network output is bounded by a weighted sum of the layer-wise errors. The proof of Theorem 3.2 can be found in the Appendix.\nConsider a general case with (6) and (8): the parameter Θ_l[q] that has the smallest sensitivity in layer l is pruned by the i-th pruning operation, and this finally adds (Π_(k=l+1..L) ‖Θ̂_k‖_F)√(δE_l) to the ultimate network output error. It is worth mentioning that although the layer-wise error seems to be scaled by a quite large product factor, S_l = Π_(k=l+1..L) ‖Θ̂_k‖_F, when it propagates to the final layer, this scaling is still tractable in practice because the ultimate network output is scaled by the same kind of product factor compared with the output of layer l. For example, we can easily estimate the norm of the ultimate network output via ‖Y_L‖_F ≈ S_1‖Y_1‖_F. 
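The bound (8) is easy to sanity-check numerically on a small random ReLU network. The sketch below is our own illustration (random Gaussian weights and crude random masking in place of actual sensitivity-based pruning), not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

n, dims, L = 40, [8, 6, 5, 4], 3                 # 3 layers (excluding input)
X = rng.normal(size=(dims[0], n))
W = [rng.normal(size=(dims[l], dims[l + 1])) for l in range(L)]
W_hat = [w * (rng.random(w.shape) > 0.2) for w in W]   # crude "pruning"

# Original outputs Y_l, accumulated pruned outputs Y~_l, and sqrt(dE_l)
Y, Y_tilde, sqrt_dE = [X], [X], []
for l in range(L):
    Y.append(relu(W[l].T @ Y[l]))
    Y_tilde.append(relu(W_hat[l].T @ Y_tilde[l]))
    # dE_l is measured with the *original* layer input Y_{l-1} (Section 3.2)
    sqrt_dE.append(np.linalg.norm((W_hat[l] - W[l]).T @ Y[l], 'fro') / np.sqrt(n))

eps_actual = np.linalg.norm(Y_tilde[-1] - Y[-1], 'fro') / np.sqrt(n)
# Right-hand side of (8): weighted sum of layer-wise errors
bound = sum(np.prod([np.linalg.norm(W_hat[m], 'fro') for m in range(k + 1, L)])
            * sqrt_dE[k] for k in range(L - 1)) + sqrt_dE[-1]
```

Since ReLU is 1-Lipschitz and ‖AB‖_F ≤ ‖A‖_F‖B‖_F, `eps_actual` never exceeds `bound`, whatever the random draw.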
If one pruning operation in the 1st layer causes the layer-wise error √(δE_1), then the relative ultimate output error is\n\nξ^L_r = ‖Ỹ_L − Y_L‖_F / ‖Y_L‖_F ≈ √(δE_1) / ‖(1/√n)Y_1‖_F.\n\nThus, we can see that even though S_1 may be quite large, the relative ultimate output error would still be about √(δE_1)/‖(1/√n)Y_1‖_F, which is controllable in practice, especially as most modern deep networks adopt a maxout layer [23] as the ultimate output. Actually, S_0 is called the network gain, representing the ratio of the magnitude of the network output to the magnitude of the network input.\n\n3.4 The Proposed Algorithm\n\n3.4.1 Pruning on Fully-Connected Layers\n\nTo selectively prune parameters, our approach needs to compute the inverse Hessian matrix at each layer to measure the sensitivities of the parameters of that layer, which is still computationally expensive though tractable. In this section, we present an efficient algorithm that can reduce the size of the Hessian matrix and thus speed up the computation of its inverse.\nFor each layer l, according to the definition of the error function used in Lemma 3.1, the first derivative of the error function with respect to Θ̂_l is ∂E_l/∂Θ̂_l = −(1/n) Σ_(j=1..n) (∂ẑ^l_j/∂Θ̂_l)ᵀ(z^l_j − ẑ^l_j), where ẑ^l_j and z^l_j are the j-th columns of the matrices Ẑ_l and Z_l, respectively, and the Hessian matrix is defined as: H_l ≡ ∂²E_l/∂Θ_l² = (1/n) Σ_(j=1..n) [ (∂ẑ^l_j/∂Θ_l)ᵀ(∂ẑ^l_j/∂Θ_l) − (∂²ẑ^l_j/∂Θ_l²)(ẑ^l_j − z^l_j) ]. Note that in most cases ẑ^l_j is quite close to z^l_j, so we simply ignore the term containing ẑ^l_j − z^l_j. Even in the late stage of pruning when this difference is not small, we can still ignore the corresponding term [13]. For a layer l that has m_l output units, z^l_j = [z^l_(1j), ..., z^l_(m_l j)]ᵀ, the Hessian matrix can be calculated via\n\nH_l = (1/n) Σ_(j=1..n) H^j_l = (1/n) Σ_(j=1..n) Σ_(i=1..m_l) (∂z^l_(ij)/∂Θ_l)ᵀ(∂z^l_(ij)/∂Θ_l),   (9)\n\nwhere the Hessian matrix for a single instance j at layer l, H^j_l, is a block diagonal square matrix with m_l diagonal blocks, each of size m_(l−1) × m_(l−1). Specifically, the gradient of the first output unit z^l_(1j) w.r.t. Θ_l is ∂z^l_(1j)/∂Θ_l = [∂z^l_(1j)/∂w_1, ..., ∂z^l_(1j)/∂w_(m_l)], where w_i is the i-th column of W_l.\n\nFigure 1: Illustration of the shape of the Hessian. For feed-forward neural networks, unit z1 gets its activation via forward propagation: z = Wᵀy, where W ∈ ℝ^(4×3), y = [y1, y2, y3, y4]ᵀ ∈ ℝ^(4×1), and z = [z1, z2, z3]ᵀ ∈ ℝ^(3×1). Then the Hessian matrix of z1 w.r.t. all parameters is denoted by H[z1] ∈ ℝ^(12×12). As illustrated in the figure, H[z1]'s elements are zero except for those corresponding to W_*1 (the 1st column of W), which form the block H11 ∈ ℝ^(4×4). H[z2] and H[z3] are similar. More importantly, H⁻¹ = diag(H11⁻¹, H22⁻¹, H33⁻¹), and H11 = H22 = H33. As a result, one only needs to compute H11⁻¹ to obtain H⁻¹, which significantly reduces the computational complexity.\n\n
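The block-diagonal structure described in Figure 1 can be verified numerically. The following sketch (our illustration) assembles H_l of (9) from explicit per-sample Jacobians and checks it against the Kronecker-product shortcut I ⊗ Ψ with Ψ = Y Yᵀ/n:

```python
import numpy as np

rng = np.random.default_rng(2)
m_in, m_out, n = 4, 3, 30              # as in Figure 1: W in R^{4x3}
Y = rng.normal(size=(m_in, n))         # columns are layer inputs y^{l-1}_j

# Explicit per-sample Jacobians dz_j/dTheta, Theta = vec(W) column by column
H = np.zeros((m_in * m_out, m_in * m_out))
for j in range(n):
    J = np.zeros((m_out, m_in * m_out))
    for i in range(m_out):
        J[i, i * m_in:(i + 1) * m_in] = Y[:, j]   # dz_i/dw_k = y if k == i
    H += J.T @ J / n                   # accumulates (9)

# Block-diagonal shortcut: H = I_{m_out} (x) Psi with Psi = Y Y^T / n
Psi = Y @ Y.T / n
H_fast = np.kron(np.eye(m_out), Psi)
```

Because every Jacobian row is supported on a single block, the two constructions agree exactly, which is why only the m_(l−1) × m_(l−1) block Ψ ever needs to be inverted.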
As z^l_(1j) is the layer output before the activation function, its gradient is simple to calculate, and more importantly all output units' gradients are equal to the layer input: ∂z^l_(ij)/∂w_k = y^(l−1)_j if k = i, and otherwise ∂z^l_(ij)/∂w_k = 0. An illustrated example is shown in Figure 1, where we omit the scripts j and l for simplicity of presentation.\nIt can be shown that the diagonal blocks H^j_(l,ii) ∈ ℝ^(m_(l−1)×m_(l−1)) of the block diagonal square matrix H^j_l, where 1 ≤ i ≤ m_l, are all equal to ψ^j_l = y^(l−1)_j (y^(l−1)_j)ᵀ, and that the inverse Hessian matrix H_l⁻¹ is also a block diagonal square matrix with its diagonal blocks being (Ψ_l)⁻¹, where Ψ_l = (1/n) Σ_(j=1..n) ψ^j_l. In addition, Ψ_l is normally degenerate, and its pseudo-inverse can be calculated recursively via the Woodbury matrix identity [13]:\n\n(Ψ^l_(j+1))⁻¹ = (Ψ^l_j)⁻¹ − [ (Ψ^l_j)⁻¹ y^(l−1)_(j+1) (y^(l−1)_(j+1))ᵀ (Ψ^l_j)⁻¹ ] / [ n + (y^(l−1)_(j+1))ᵀ (Ψ^l_j)⁻¹ y^(l−1)_(j+1) ],\n\nwhere Ψ^l_t = (1/n) Σ_(j=1..t) ψ^j_l, with (Ψ^l_0)⁻¹ = αI, α ∈ [10⁴, 10⁸], and (Ψ^l)⁻¹ = (Ψ^l_n)⁻¹. The size of Ψ^l is thus reduced to m_(l−1), and the computational complexity of calculating H_l⁻¹ is O(n m²_(l−1)).\nTo make the estimated minimal change of the error function in (6) optimal, the layer-wise Hessian matrices need to be exact. Since the layer-wise Hessian matrices only depend on the corresponding layer inputs, they are always exact, even after several pruning operations. The only parameter we need to control is the layer-wise error ε_l. Note that there may be a "pruning inflection point" after which the layer-wise error would drop dramatically. 
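A minimal sketch of this recursive inverse (our illustration; one Sherman–Morrison rank-one update per sample, with the damped initialization (Ψ_0)⁻¹ = αI). Algebraically, the recursion inverts exactly (1/α)I + Y Yᵀ/n, which is what the check at the end compares against:

```python
import numpy as np

def recursive_hessian_inverse(Y, alpha=1e4):
    """Recursively build (Psi)^-1 for Psi ~ (1/n) sum_j y_j y_j^T,
    one rank-one Woodbury/Sherman-Morrison update per training sample."""
    m, n = Y.shape
    Psi_inv = alpha * np.eye(m)            # (Psi_0)^-1 = alpha * I
    for j in range(n):
        y = Y[:, j:j + 1]                  # column vector y_j
        Psi_inv = Psi_inv - (Psi_inv @ y @ y.T @ Psi_inv) / (n + y.T @ Psi_inv @ y)
    return Psi_inv

rng = np.random.default_rng(3)
m, n = 4, 60
Y = rng.normal(size=(m, n))
Psi_inv = recursive_hessian_inverse(Y, alpha=1e4)
direct = np.linalg.inv(np.eye(m) / 1e4 + Y @ Y.T / n)
```

For large α the damping term vanishes and the result approaches the pseudo-inverse of the (possibly degenerate) sample matrix Ψ, while the recursion only ever stores an m_(l−1) × m_(l−1) matrix.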
In practice, the user can incrementally increase the number of pruned parameters based on the sensitivity L_q, and make a trade-off between the pruning ratio and the performance drop to set a proper tolerable error threshold or pruning ratio.\nThe procedure of our pruning algorithm for a fully-connected layer l is summarized as follows.\nStep 1: Get the layer input y^(l−1) from a well-trained deep network.\nStep 2: Calculate the Hessian matrix H_(l,ii), for i = 1, ..., m_l, and its pseudo-inverse over the dataset, and get the whole pseudo-inverse of the Hessian matrix.\nStep 3: Compute the optimal parameter change δΘ_l and the sensitivity L_q for each parameter at layer l. Set the tolerable error threshold ϵ.\nStep 4: Pick up the parameters Θ_l[q] with the smallest sensitivity scores.\nStep 5: If √(L_q) ≤ ϵ, prune the parameter Θ_l[q] and get new parameter values via Θ̂_l = Θ_l + δΘ_l, then repeat Step 4; otherwise stop pruning.\n\n3.4.2 Pruning on Convolutional Layers\n\nIt is straightforward to generalize our method to a convolutional layer and its variants if we vectorize the filters of each channel and consider them as special fully-connected layers that have multiple inputs (patches) from a single instance. Consider a vectorized filter w_i of channel i, 1 ≤ i ≤ m_l; it acts similarly to parameters that are connected to the same output unit in a fully-connected layer. However, the difference is that, for a single input instance j, every filter step of a sliding window across it will extract a patch C_(jn) from the input volume. Similarly, each pixel z^l_(ijn) in the 2-dimensional activation map that gives the response to each patch corresponds to one output unit in a fully-connected layer. 
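Under this view, the "layer inputs" of a convolutional channel are the im2col patches. A minimal sketch (our illustration, assuming a single input channel, stride 1 and no padding; averaging over all patches of all instances is our normalization choice):

```python
import numpy as np

def im2col_patches(img, k):
    """Extract all k x k sliding-window patches of a 2-D input as columns."""
    H, W = img.shape
    cols = [img[r:r + k, c:c + k].reshape(-1)
            for r in range(H - k + 1) for c in range(W - k + 1)]
    return np.stack(cols, axis=1)          # (k*k, num_patches)

rng = np.random.default_rng(4)
imgs = rng.normal(size=(10, 8, 8))         # n = 10 single-channel inputs
k = 3                                      # 3 x 3 filter

# Each patch plays the role of a layer input y^{l-1}; the shared per-channel
# Hessian block is the average outer product over all patches of all images
P = np.concatenate([im2col_patches(im, k) for im in imgs], axis=1)
Psi = P @ P.T / P.shape[1]                 # (k*k, k*k) Hessian block
```

As in the fully-connected case, only this single k²×k² block needs to be (pseudo-)inverted, since the diagonal blocks for all channels of the layer are identical.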
Hence, for convolutional layers, (9) is generalized as\n\nH_l = (1/n) Σ_(j=1..n) Σ_(i=1..m_l) Σ_(jn) (∂z^l_(ijn)/∂[w_1, ..., w_(m_l)])ᵀ (∂z^l_(ijn)/∂[w_1, ..., w_(m_l)]),\n\nwhere H_l is a block diagonal square matrix whose diagonal blocks are all the same. Then, we can slightly revise the computation of the Hessian matrix and extend the algorithm for fully-connected layers to convolutional layers.\nNote that the accumulated error of the ultimate network output can be linearly bounded by the layer-wise errors as long as the model is feed-forward. Thus, L-OBS is a general pruning method that works with most feed-forward neural networks whose layer-wise Hessians can be computed expediently with slight modifications. However, for models with sizable layers like ResNet-101, L-OBS may not be economical because of the computational cost of the Hessian, which will be studied in our future work.\n\n4 Experiments\n\nIn this section, we verify the effectiveness of our proposed Layer-wise OBS (L-OBS) using various architectures of deep neural networks in terms of compression ratio (CR), error rate before retraining, and the number of iterations required for retraining to resume satisfactory performance. CR is defined as the ratio of the number of preserved parameters to that of the original parameters; lower is better. We compare L-OBS with the following pruning approaches: 1) random pruning, 2) OBD [12], 3) LWC [9], 4) DNS [11], and 5) Net-Trim [6]. The deep architectures used in the experiments are: LeNet-300-100 [2] and LeNet-5 [2] on the MNIST dataset, CIFAR-Net² [24] on the CIFAR-10 dataset, and AlexNet [25] and VGG-16 [3] on the ImageNet ILSVRC-2012 dataset. In the experiments, we first well-train the networks, and then apply the various pruning approaches to the networks to evaluate their performance. The retraining batch size, crop method and other hyper-parameters are under the same settings as used in LWC. 
Note that to make the comparisons fair, we do not adopt any other pruning-related techniques such as Dropout or sparse regularizers on MNIST. In practice, L-OBS works well alongside these techniques, as shown on CIFAR-10 and ImageNet.

4.1 Overall Comparison Results

The overall comparison results are shown in Table 1. In the first set of experiments, we prune each layer of the well-trained LeNet-300-100 with compression ratios 6.7%, 20% and 65%, achieving a slightly better overall compression ratio (7%) than LWC (8%). Under a comparable compression ratio, L-OBS suffers a much smaller performance drop (before retraining) and needs much lighter retraining than LWC, whose performance is almost ruined by pruning. The classic pruning approach OBD is also compared, although we observe that the Hessian matrices of most modern deep models are strongly non-diagonal in practice. Besides the relatively heavy cost of obtaining the second derivatives via the chain rule, OBD suffers a drastic drop of performance when directly applied to modern deep models.

To properly prune each layer of LeNet-5, we increase the tolerable error threshold ε from a relatively small initial value to incrementally prune more parameters, monitor the model performance, and stop pruning and fix ε once we encounter the "pruning inflection point" mentioned in Section 3.4. In practice, we prune each layer of LeNet-5 with compression ratios 54%, 43%, 6% and 25%, and retrain the pruned model with much fewer iterations than the other methods (around 1:1000). As DNS retrains the pruned network after every pruning operation, we are not able to report its error rate before retraining. However, as can be seen, similar to LWC, the total number of iterations used by DNS for rebooting the network is very large compared with L-OBS.

²A revised AlexNet for CIFAR-10 containing three convolutional layers and two fully connected layers.

Table 1: Overall comparison results. (For iterative L-OBS, err. after pruning refers to the last pruning stage.)

Network                      | Method            | Original error | CR   | Err. after pruning | Re-Error       | #Re-Iters.
LeNet-300-100                | Random            | 1.76%          | 8%   | 85.72%             | 2.25%          | 3.50 × 10^5
LeNet-300-100                | OBD               | 1.76%          | 8%   | 86.72%             | 1.96%          | 8.10 × 10^4
LeNet-300-100                | LWC               | 1.76%          | 8%   | 81.32%             | 1.95%          | 1.40 × 10^5
LeNet-300-100                | DNS               | 1.76%          | 1.8% | -                  | 1.99%          | 3.40 × 10^4
LeNet-300-100                | L-OBS             | 1.76%          | 7%   | 3.10%              | 1.82%          | 510
LeNet-300-100                | L-OBS (iterative) | 1.76%          | 1.5% | 2.43%              | 1.96%          | 643
LeNet-5                      | OBD               | 1.27%          | 8%   | 86.72%             | 2.65%          | 2.90 × 10^5
LeNet-5                      | LWC               | 1.27%          | 8%   | 89.55%             | 1.36%          | 9.60 × 10^4
LeNet-5                      | DNS               | 1.27%          | 0.9% | -                  | 1.36%          | 4.70 × 10^4
LeNet-5                      | L-OBS             | 1.27%          | 7%   | 3.21%              | 1.27%          | 740
LeNet-5                      | L-OBS (iterative) | 1.27%          | 0.9% | 2.04%              | 1.66%          | 841
CIFAR-Net                    | LWC               | 18.57%         | 9%   | 87.65%             | 19.36%         | 1.62 × 10^5
CIFAR-Net                    | L-OBS             | 18.57%         | 9%   | 21.32%             | 18.76%         | 1020
AlexNet (Top-1 / Top-5 err.) | DNS               | 43.30 / 20.08% | 5.7% | -                  | 43.91 / 20.72% | 7.30 × 10^5
AlexNet (Top-1 / Top-5 err.) | LWC               | 43.30 / 20.08% | 11%  | 76.14 / 57.68%     | 44.06 / 20.64% | 5.04 × 10^6
AlexNet (Top-1 / Top-5 err.) | L-OBS             | 43.30 / 20.08% | 11%  | 50.04 / 26.87%     | 43.11 / 20.01% | 1.81 × 10^4
VGG-16 (Top-1 / Top-5 err.)  | DNS               | 31.66 / 10.12% | 7.5% | -                  | 63.38 / 38.69% | 1.07 × 10^6
VGG-16 (Top-1 / Top-5 err.)  | LWC               | 31.66 / 10.12% | 7.5% | 73.61 / 52.64%     | 32.43 / 11.12% | 2.35 × 10^7
VGG-16 (Top-1 / Top-5 err.)  | L-OBS (iterative) | 31.66 / 10.12% | 7.5% | 37.32 / 14.82%     | 32.02 / 10.97% | 8.63 × 10^4
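The pruning rule that produces the "Err. after pruning" and CR columns of Table 1 can be sketched as follows (a toy NumPy illustration of the classic OBS step that L-OBS applies layer by layer, assuming the layer's inverse Hessian H^-1 is already available; the helper names `obs_prune_step` and `compression_ratio` are ours, not from the released code):

```python
import numpy as np

def obs_prune_step(w, H_inv):
    """One pruning step in the OBS framework: score each weight by its
    saliency L_q = w_q^2 / (2 [H^-1]_qq), remove the least salient one,
    and update the surviving weights by
    delta_w = -(w_q / [H^-1]_qq) * H^-1 e_q to compensate."""
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(saliency))
    w = w - (w[q] / H_inv[q, q]) * H_inv[:, q]  # w[q] becomes exactly 0
    return w, q

def compression_ratio(w, tol=1e-12):
    """CR as used in Table 1: preserved / original parameters (lower is better)."""
    return np.count_nonzero(np.abs(w) > tol) / w.size

w = np.array([3.0, 0.1, -2.0])
w, q = obs_prune_step(w, np.eye(3))  # with H^-1 = I, the smallest |w_q| is pruned first
# q == 1; w is now [3.0, 0.0, -2.0]; compression_ratio(w) == 2/3
```

Repeating the step until the layer-wise error budget is exhausted yields the per-layer compression ratios reported above; with a non-diagonal H^-1, the compensation term also adjusts the weights that remain, which is what keeps the error after pruning low before any retraining.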
The retraining-iteration results of DNS are reported from [11]; the other experiments are implemented in TensorFlow [26]. In addition, when a high pruning ratio is required, L-OBS can be flexibly adapted into an iterative version, which performs pruning and light retraining alternately to reach a higher pruning ratio at a relatively higher pruning cost. With two iterations of pruning and retraining, L-OBS achieves the same pruning ratio as DNS with much lighter total retraining: 643 iterations on LeNet-300-100 and 841 iterations on LeNet-5.

Regarding the comparison experiments on CIFAR-Net, we first well-train it to a test error of 18.57% with Dropout and Batch Normalization. We then prune the well-trained network with LWC and L-OBS, and obtain results similar to those on the other network architectures. We also observe that LWC and other retraining-required methods always need a much smaller learning rate during retraining. This is because pruning damages the representation capability of the pruned network, which has far fewer parameters, the number of parameters being an important factor in representation capability. L-OBS, in contrast, can retrain the pruned network with the original learning rate. In this sense, L-OBS not only ensures a warm start for retraining, but also identifies the truly important connections (parameters) and preserves the representation capability of the pruned network instead of ruining the model by pruning.

Regarding AlexNet, L-OBS achieves an overall compression ratio of 11% without loss of accuracy, taking 2.9 hours on 48 Intel Xeon(R) E5-1650 CPUs to compute the Hessians and 3.1 hours on an NVIDIA Titan X GPU to retrain the pruned model (i.e., 18.1K iterations). The computational cost of the Hessian inverses in L-OBS is negligible compared with the heavy retraining required by other methods. This claim can also be supported by an analysis of time complexity.
As mentioned in Section 3.4, the time complexity of calculating H_l^{-1} is O(n m_{l-1}^2). Assume that the networks are retrained via SGD; then the approximate time complexity of retraining is O(IdM), where d is the size of the mini-batch, and M and I are the total numbers of parameters and iterations, respectively. Considering that M ≈ Σ_{l=1}^{L} m_{l-1} m_l, and that retraining in other methods always requires millions of iterations (Id ≫ n) as shown in the experiments, the complexity of calculating the Hessians (and their inverses) in L-OBS is quite economical. More interestingly, there is a trade-off between compression ratio and pruning (including retraining) cost. Compared with other methods, L-OBS provides fast compression: it prunes AlexNet to 16% of its original size without substantively impacting accuracy (pruned top-5 error 20.98%) even without any retraining. We further apply L-OBS to VGG-16, which has 138M parameters. To achieve a more promising compression ratio, we perform pruning and retraining alternately twice. As can be seen from the table, L-OBS achieves an overall compression ratio of 7.5% without loss of accuracy, taking 10.2 hours in total on 48 Intel Xeon(R) E5-1650 CPUs to compute the Hessian inverses and 86.3K iterations to retrain the pruned model.

Figure 2: (a) Top-5 test accuracy of L-OBS on ResNet-50 under different compression ratios. (b) Memory comparison between L-OBS and Net-Trim on MNIST.

Table 2: Comparison of Net-Trim and Layer-wise OBS on the second layer of LeNet-300-100.

Method   | ξ_r^2 | Pruned Error | CR
Net-Trim | 0.13  | 13.24%       | 19%
L-OBS    | 0.70  | 11.34%       | 3.4%
L-OBS    | 0.71  | 10.83%       | 3.8%
Net-Trim | 0.62  | 28.45%       | 7.4%
L-OBS    | 0.37  | 4.56%        | 7.4%
Net-Trim | 0.71  | 47.69%       | 4.2%

We also apply L-OBS to ResNet-50 [27].
To the best of our knowledge, this is the first work to perform pruning on ResNet. We prune all the layers, with every layer sharing the same compression ratio, and vary this compression ratio across experiments. The results are shown in Figure 2(a). As can be seen, L-OBS is able to maintain ResNet's top-5 accuracy (above 85%) when the compression ratio is larger than or equal to 45%.

4.2 Comparison between L-OBS and Net-Trim

As our proposed L-OBS is inspired by Net-Trim, which adopts an ℓ1-norm to induce sparsity, we conduct comparison experiments between the two methods. In Net-Trim, networks are pruned by formulating layer-wise pruning as an optimization problem: min_{W_l} ||W_l||_1 s.t. ||σ(W_l^T Y_{l-1}) − Y_l||_F ≤ ξ_l, where ξ_l corresponds to ξ_r^l ||Y_l||_F in L-OBS. Due to the memory limitations of Net-Trim, we only prune the middle layer of LeNet-300-100 with L-OBS and Net-Trim under the same setting. As shown in Table 2, under the same pruned error rate, the CR of L-OBS exceeds that of Net-Trim by about six times. In addition, Net-Trim encounters an explosion of memory and time on large-scale datasets and large-size parameters. Specifically, the space complexity of the positive semidefinite matrix Q in the quadratic constraints used in Net-Trim's optimization is O(2n m_l^2 m_{l-1}). For example, Q requires about 65.7 GB for 1,000 samples on MNIST, as illustrated in Figure 2(b). Moreover, Net-Trim is designed for multi-layer perceptrons, and it is not clear how to deploy it on convolutional layers.

5 Conclusion

We have proposed a novel L-OBS pruning framework that prunes parameters based on second-order derivative information of a layer-wise error function, and provided a theoretical guarantee on the overall error in terms of the reconstructed errors of each layer.
Our proposed L-OBS can prune a considerable number of parameters with only a tiny drop of performance, and can reduce or even omit retraining. More importantly, compared with previous methods, it identifies and preserves the truly important parts of a network when pruning, which may help in understanding the nature of neural networks.

Acknowledgements

This work is supported by NTU Singapore Nanyang Assistant Professorship (NAP) grant M4081532.020, Singapore MOE AcRF Tier-2 grant MOE2016-T2-2-060, and Singapore MOE AcRF Tier-1 grant 2016-T1-001-159.

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[2] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[3] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[4] Luisa de Vivo, Michele Bellesi, William Marshall, Eric A. Bushong, Mark H. Ellisman, Giulio Tononi, and Chiara Cirelli. Ultrastructural evidence for synaptic scaling across the wake/sleep cycle. Science, 355(6324):507–510, 2017.

[5] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.

[6] Alireza Aghasi, Nam Nguyen, and Justin Romberg. Net-trim: A layer-wise convex pruning of deep neural networks. Journal of Machine Learning Research, 2016.

[7] Russell Reed. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740–747, 1993.

[8] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev.
Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

[9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[10] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Sparsifying neural network connections for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4856–4864, 2016.

[11] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379–1387, 2016.

[12] Yann LeCun, John S. Denker, Sara A. Solla, Richard E. Howard, and Lawrence D. Jackel. Optimal brain damage. In NIPS, volume 2, pages 598–605, 1989.

[13] Babak Hassibi, David G. Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 164–171, 1993.

[14] Thomas Kailath. Linear Systems, volume 156. Prentice-Hall, Englewood Cliffs, NJ, 1980.

[15] Nikolas Wolfe, Aditya Sharma, Lukas Drude, and Bhiksha Raj. The incredible shrinking neural network: New perspectives on learning representations through the lens of pruning. arXiv preprint arXiv:1701.04465, 2017.

[16] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.

[17] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[18] Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. Training skinny deep neural networks with iterative hard thresholding methods.
arXiv preprint arXiv:1607.05423, 2016.

[19] Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.

[20] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.

[21] R. Tyrrell Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics, 1997.

[22] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15, page 275, 2011.

[23] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. Maxout networks. ICML (3), 28:1319–1327, 2013.

[24] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[26] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.