{"title": "AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters", "book": "Advances in Neural Information Processing Systems", "page_first": 13681, "page_last": 13691, "abstract": "Reducing the model redundancy is an important task to deploy complex deep learning models to resource-limited or time-sensitive devices. Directly regularizing or modifying weight values makes pruning procedure less robust and sensitive to the choice of hyperparameters, and it also requires prior knowledge to tune different hyperparameters for different models. To build a better generalized and easy-to-use pruning method, we propose AutoPrune, which prunes the network through optimizing a set of trainable auxiliary parameters instead of original weights. The instability and noise during training on auxiliary parameters will not directly affect weight values, which makes pruning process more robust to noise and less sensitive to hyperparameters. Moreover, we design gradient update rules for auxiliary parameters to keep them consistent with pruning tasks. Our method can automatically eliminate network redundancy with recoverability, relieving the complicated prior knowledge required to design thresholding functions, and reducing the time for trial and error. We evaluate our method with LeNet and VGG-like on MNIST and CIFAR-10 datasets, and with AlexNet, ResNet and MobileNet on ImageNet to establish the scalability of our work. Results show that our model achieves state-of-the-art sparsity, e.g. 
7%, 23% FLOPs and 310x, 75x compression ratio for LeNet5 and VGG-like structure without accuracy drop, and 200M and 100M FLOPs for MobileNet V2 with accuracy 73.32% and 66.83% respectively.", "full_text": "AutoPrune: Automatic Network Pruning by\n\nRegularizing Auxiliary Parameters\n\nXia Xiao, Zigeng Wang, Sanguthevar Rajasekaran\u2217\nDepartment of Computer Science and Engineering\n\nUniversity of Connecticut\nStorrs, CT, USA, 06269\n\n{xia.xiao, zigeng.wang, sanguthevar.rajasekaran}@uconn.edu\n\nAbstract\n\nReducing the model redundancy is an important task to deploy complex deep\nlearning models to resource-limited or time-sensitive devices. Directly regularizing\nor modifying weight values makes pruning procedure less robust and sensitive\nto the choice of hyperparameters, and it also requires prior knowledge to tune\ndifferent hyperparameters for different models. To build a better generalized and\neasy-to-use pruning method, we propose AutoPrune, which prunes the network\nthrough optimizing a set of trainable auxiliary parameters instead of original\nweights. The instability and noise during training on auxiliary parameters will not\ndirectly affect weight values, which makes pruning process more robust to noise\nand less sensitive to hyperparameters. Moreover, we design gradient update rules\nfor auxiliary parameters to keep them consistent with pruning tasks. Our method\ncan automatically eliminate network redundancy with recoverability, relieving\nthe complicated prior knowledge required to design thresholding functions, and\nreducing the time for trial and error. We evaluate our method with LeNet and VGG-\nlike on MNIST and CIFAR-10 datasets, and with AlexNet, ResNet and MobileNet\non ImageNet to establish the scalability of our work. Results show that our model\nachieves state-of-the-art sparsity, e.g. 
7%, 23% FLOPs and 310x, 75x compression\nratio for LeNet5 and VGG-like structure without accuracy drop, and 200M and\n100M FLOPs for MobileNet V2 with accuracy 73.32% and 66.83% respectively.\n\n1\n\nIntroduction\n\nDeep neural networks (DNNs) have achieved a signi\ufb01cant success in many applications, ranging from\nimage classi\ufb01cation He et al. [2016] and object detection Ren et al. [2015] to self driving Maqueda et\nal. [2018] and machine translation Sutskever et al. [2014]. However, the computationally expensive\nand memory intensive properties of DNNs prevent their direct deployment to devices such as mobile\nphones and auto-driving cars. To overcome these challenges, learning compressed light-weight DNNs\nhas attracted growing research attention Han et al. [2015]; Dong et al. [2017]; Zhuang et al. [2018].\nFor recent pruning methods, prior knowledge plays an important role in improving the performance\nand reducing the training time, in which a large number of hyperparameters need to be individually\ndesigned for different architectures and datasets. In magnitude-based pruning, where weights lower\nthan thresholds will be removed, the chosen thresholds majorly affect the pruning performance Han\net al. [2015]; Guo et al. [2016]. Moreover, for the layer-wise pruning Dong et al. [2017]; Aghasi\net al. [2017], the searching space for layer-wise threshold combinations can be exponential in the\nnumber of layers. As another branch of pruning, sensitivity-based method Tartaglione et al. [2018]\n\u2217Corresponding author. This work has been supported in part by the following NSF grants: 1447711,\n\n1514357, 1743418, and 1843025.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fremoves the less sensitive weights from the network, while further hyperparameter/function design is\nrequired to avoid undesired weight shrinkage or updates.\nRecently research on pruning Liu et al. 
[2019b] implies that the pruning process is actually finding the right network structure, thus bridging the gap between pruning and neural architecture search (NAS). However, state-of-the-art NAS methods cannot be directly applied to the pruning task. For example, the gradient-based search algorithm DARTS (Liu et al. [2019a]) introduces auxiliary parameters that act as indicators to select the appropriate network structure, optimized through a gradient-descent procedure. However, a discrepancy between the continuous over-parameterized graph and the discretized sub-graph is unavoidable during model evaluation, and the zero operation is eliminated from the search space. Our method is similar to DARTS in that we employ a smooth, approximate, gradient-based search for the pruning task, but the discrepancy is reduced by iteratively evaluating the recoverable sub-graph during the pruning procedure.
The advantage of introducing auxiliary parameters to the pruning task is hyperparameter insensitivity. Instead of directly regularizing weights, our method regularizes auxiliary parameters, which absorb gradient perturbations such as batch noise, dead neurons, or dropout during pruning. In this way, temporarily incorrect pruning induced by instability and non-optimal hyperparameters can be recovered, which greatly improves pruning performance and efficiency. Unlike Srinivas et al. [2017], which updates auxiliary parameters with an unstable vanilla linear coarse gradient, we analyze and decouple the gradient between weight parameters and auxiliary parameters in order to stabilize the pruning procedure. In contrast to Louizos et al. [2018], our method avoids inefficient, high-variance single-step Monte-Carlo sampling and places no assumptions on the prior distribution. In comparison with Carreira-Perpiñán and Idelbayev [2018], we add no constraints on model parameters, maintaining the flexibility and capacity of the model. 
In addition, we design a sparse regularizer that works with the original loss function and weight decay. To evaluate the proposed method, we conduct extensive experiments on different datasets and models, and the results show that our method achieves state-of-the-art performance.
The contributions and novelty of our work are: 1) we offer a gradient-based automatic network pruning model; 2) we propose a novel, weakly coupled update rule for auxiliary parameters that stabilizes the pruning procedure; 3) we reduce the sub-graph discrepancy by iteratively evaluating the recoverable sub-graph; 4) we evaluate different smooth approximations of the derivative of the rectifier; 5) we obtain state-of-the-art results on both structure and weight pruning, and our method scales to modern models and datasets.

2 Related Work

Neural network pruning can be mainly classified into two categories: unstructured pruning and structured pruning. Unstructured pruning compresses neural networks by dropping redundant or less meaningful weights, while structured pruning does so by dropping whole neurons. Both approaches shrink the storage space of the targeted neural network, but, comparatively speaking, structured pruning has a more direct benefit in reducing the computational cost of DNNs.
LeCun et al. [1990] pioneers neural network pruning and proposes the optimal brain damage method for unstructured pruning of shallow neural networks. For DNNs, Han et al. [2015] presents global magnitude-based weight pruning and Guo et al. [2016] introduces recoverability into global pruning. Similar ideas have since been applied to structured pruning. Hu et al. [2016] removes neurons with a high average zero-output ratio, and Li et al. 
[2017] prunes neurons with low absolute sums of incoming weights; both rely on predefined thresholds.
To further improve the compression rate, different layer-wise pruning methods have been proposed, either weighting connections based on a layer-wise loss function (Dong et al. [2017]) or solving a specially designed convex optimization program (Aghasi et al. [2017]). These layer-wise schemes provide theoretical error bounds for specific activation functions but leave many hyperparameters to be carefully designed. To address this issue, Li et al. [2018] presents a relatively efficient comprehensive optimization algorithm for tuning layer-wise hyperparameters.
Besides layer-wise schemes, Gordon et al. [2018] scales efficient structured pruning to large networks by applying resource-weighted sparsifying regularizers on activations. Zhu et al. [2018] improves neural network sparsity by explicitly forcing the network to learn a set of less correlated filters via decorrelation regularization. Zhuang et al. [2018] designs a discrimination-aware channel pruning method to locate the most discriminative channels. But after ranking the filters or channels, we still have to pinpoint their optimal combinations for each layer, which relies heavily on expertise. Gomez et al. [2019] proposes to keep neurons with high magnitude and prune neurons with smaller magnitude in a stochastic way; accuracy is maintained by reducing the dependency of important neurons on unimportant ones.
Liu et al. [2019b] conducts comprehensive experiments showing that training from scratch on the right sparse architecture yields better results than pruning from pre-trained models; searching for the sparse architecture matters more than the weight values. Liu et al. [2019a] employs continuous indicator parameters to relax the non-differentiable architecture search problem. 
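The DARTS-style relaxation described above can be sketched in a few lines; this is an illustrative toy (all operation choices and names are assumptions for illustration, not from the paper): a discrete choice among candidate operations on an edge is replaced by a softmax-weighted mixture over trainable indicator parameters, which is differentiable, and the mixture is later discretized by keeping the strongest candidate.

```python
import math

# Toy sketch of a DARTS-style continuous relaxation: the discrete choice of
# one operation per edge becomes a softmax mixture over trainable indicator
# parameters alpha. All candidate ops here are illustrative assumptions.

def softmax(alpha):
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    s = sum(exps)
    return [e / s for e in exps]

# Three hypothetical candidate operations: identity, halving, and "zero".
ops = [lambda x: x, lambda x: 0.5 * x, lambda x: 0.0]

def mixed_op(x, alpha):
    """Continuous relaxation: softmax-weighted sum over all candidate ops."""
    return sum(w * op(x) for w, op in zip(softmax(alpha), ops))

def discretize(alpha):
    """After search, keep only the op with the largest indicator parameter."""
    return max(range(len(alpha)), key=lambda i: alpha[i])

alpha = [2.0, 0.1, -1.0]
y = mixed_op(1.0, alpha)     # differentiable surrogate output
k = discretize(alpha)        # identity op (index 0) wins
```

The gap between the mixed output `y` and the hard selection `k` is exactly the continuous-to-discrete discrepancy the surrounding text criticizes.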
The relaxation is then removed by dropping weak connections and selecting, among the k options, the single choice with the highest weight. However, the gap between the continuous solution and the discretized architecture remains unknown. More importantly, zero operations are omitted during the derivation process, making it unsuitable for network pruning. Yu and Huang [2019a] implements a greedy search over the width multipliers of a slimmable network (Yu et al. [2018]) to reduce the number of kernels; multiple batch normalization layers are trained under different channel settings. However, a significant accuracy drop is observed in extremely sparse cases.

3 Methods

In this section, we first formulate the problem and discuss the indicator function and auxiliary parameters. Then, we introduce the update rule for auxiliary parameters that enables stable and efficient network pruning. Without loss of generality, our method is formulated for weight pruning, but it can be directly extended to neuron pruning.

3.1 Problem Formulation

Let f_W : R^{m×n} → R^d be a continuous and differentiable neural network parametrized by W, mapping input X ∈ R^{m×n} to target Y ∈ R^d. The pruning problem can be formulated as:

argmin_w (1/N) Σ_{i=1}^{N} L(f(x_i, W), y_i) + μ ||W||_0,    (1)

where ||W||_0 denotes the zero norm, i.e., the number of non-zero weights. The goal is to find the sparse architecture with the minimum subset w ⊆ W that preserves the model accuracy. However, the second term is non-differentiable, making the problem unsolvable by gradient descent, and direct regularization on w_ij leads to sensitivity to the hyperparameter μ and instability under batched training. We relax this
We relax this\nproblem by introducing a indicator function de\ufb01ned as:\n\n(cid:40)\n\nhij =\n\nif wij is pruned;\n\n0,\n1, otherwise.\n\n(2)\n\n(cid:19)\n\n(cid:18) N(cid:88)\n\nInstead of designing an indicator function for each wij manually, we propose to parameterized a uni-\nversal indicator function by a set of trainable auxiliary parameters M. Due to the non-differentiable\nproperty of the indicator function, we will discuss how to update auxiliary parameters in subsec-\ntions 3.2 and 3.3. Then the network sparsi\ufb01cation problem can be re-formulated as an optimization\nproblem:\n\n1\nN\n\nL(f (xi, W (cid:12) h(M )), yi)\n\n+ \u03bbR(W ) + \u00b5R(h(M )),\n\ni=1\n\nw,m\n\nargmin\n\n(3)\nwhere R(\u00b7) denotes a regularization function. We also denote the element-wise product T =\nW (cid:12) h(M ) as the weight matrix after pruning. The advantage of regularizing on auxiliary parameters\ninstead of original weights is that any change in mij does not directly in\ufb02uence the gradient update\nof wij, leading to a less sensitive pruning process with respect to hyperparameter \u00b5.\nAs done by Han et al. [2015] and Carreira-Perpin\u00e1n and Idelbayev [2018], in order to enhance the\nstability and performance, we also implement a multi-step training through iteratively training the\nsparsity structure and retraining the original weights. More speci\ufb01cally, we employ the bi-level\noptimization used in Liu et al. [2019a] for the optimization problem. 
The training set will be split into Xtrain and Xval, and we re-formulate the problem from minimizing a single loss function to iteratively minimizing the following two loss functions:

L1 = min_w Σ_{i=1}^{N} L(f(x_i, W ⊙ h(M)), y_i) + λ R(W),  x_i ∈ Xtrain,    (4)

L2 = min_m Σ_{i=1}^{N} L(f(x_i, W ⊙ h(M)), y_i) + μ R(h(M)),  x_i ∈ Xval.    (5)

The first term in both loss functions is the regular accuracy loss for neural network training. Note that the regularization of W is not strictly required, but we add the term to show that our method is consistent with traditional regularizers.

3.2 Coarse Gradient for Indicator Function

The indicator function h_ij takes only the values zero and one and is thus non-smooth and non-differentiable. Inspired by Hubara et al. [2016], where binary weights are represented using step functions and trained with a hard sigmoid straight-through estimator (STE), we use a simple step function for the indicator function h_ij with trainable parameter m_ij.
Binarized neural networks (BNNs) with a proper STE have been demonstrated to be quite effective in finding optimal binary parameters and can achieve promising results in complex tasks. Vanilla BNNs are optimized by updating the continuous variables m_ij:

∂L/∂m_ij = (∂L/∂σ(m_ij)) · (∂σ(m_ij)/∂m_ij), where σ(m_ij) = max(0, min(1, (m_ij + 1)/2)).    (6)

The output of each weight is the output of the hard sigmoid binary function. Note that the gradient ∂σ(m_ij)/∂m_ij can be estimated in multiple ways.

Figure 1: Coarse Gradients for STEs

Srinivas et al. [2017] discuss using BNNs to learn sparse networks; however, the authors suggest using a linear STE to quickly estimate the gradient of the Heaviside function. A recent result of Yin et al. 
[2019] shows that ReLU or clipped-ReLU STEs yield better convergence, while the linear STE is unstable at minima. Unfortunately, as shown in Fig. 1, the gradient of ReLU is zero if the input m is smaller than zero. In other words, if we apply auxiliary parameters directly to any weight without any regularization, the weight will permanently die once it has been pruned. Considering pruning recoverability, we suggest using Leaky ReLU or Softplus instead of ReLU.

3.3 Updating Auxiliary Parameters

Instead of directly applying the gradient update described in Eq. 6, we propose a modified update rule for the auxiliary parameters that is consistent with (1) the magnitude of the weights; (2) the change of the weights; and (3) the directions of the BNN gradients. The update rule for m_ij is defined as:

m_ij := m_ij − η (∂Lacc/∂t_ij) sgn(w_ij) (∂h(m_ij)/∂m_ij) − μ (∂h(m_ij)/∂m_ij),    (7)

where Lacc denotes L(f(x_i, W ⊙ h(M)), y_i), η is the learning rate of m_ij, and t_ij = w_ij h(m_ij). The second term can be considered as a modified gradient of m_ij, with ∂t_ij/∂m_ij decoupled from the weight magnitude, and the third term comes from the sparse regularizer. The proposed update rule is motivated by three advantages:
Sensitivity Consistency: The gradient of a vanilla BNN is correlated with w_ij, i.e., ∂Lacc/∂m_ij ∝ f(|w_ij|), which means that m_ij is more sensitive if the magnitude of the corresponding w_ij is large. Such a correlation is counter-intuitive, since a large w_ij would then be more likely to be pruned by a small perturbation, which reduces the robustness of the pruning. In the proposed update rule, we decouple this correlation to increase the stability of the pruning procedure. Practically, in order to boost the sensitivity of m_ij associated with smaller weight magnitude (i.e. 
sensitivity consistency), we use a multiplier w_ij in Eq. 7.
Correlation Consistency: The second advantage of the update rule is that, ignoring the regularizers, the direction of the gradient of an arbitrary auxiliary parameter m_ij is the same as the direction of the gradient of its corresponding |w_ij|, i.e., sgn(∂L2/∂m_ij) = sgn(∂L1/∂|w_ij|).
Proof. We can expand the gradients for w_ij and m_ij as follows:

∂L1/∂w_ij = (∂Lacc/∂t_ij)(∂t_ij/∂w_ij) + λ ∂R(w_ij)/∂w_ij = (∂Lacc/∂t_ij) h(m_ij) + λ ∂R(w_ij)/∂w_ij,    (8)

∂L2/∂m_ij = (∂Lacc/∂t_ij)(∂t_ij/∂m_ij) + μ ∂R(h(m_ij))/∂m_ij = (∂Lacc/∂t_ij) w_ij (∂h(m_ij)/∂m_ij) + μ (∂h(m_ij)/∂m_ij).    (9)

If we consider the direction of the first term of both gradients while ignoring the regularizers:

sgn(∂L1/∂w_ij) = sgn(∂Lacc/∂t_ij) sgn(h(m_ij)),
sgn(∂L2/∂m_ij) = sgn(∂Lacc/∂t_ij) sgn(w_ij) sgn(∂h(m_ij)/∂m_ij).    (10)

Given that h(m_ij) ≥ 0 and ∂h(m_ij)/∂m_ij ≥ 0, we can conclude that

sgn(∂L2/∂m_ij) = sgn(∂L1/∂|w_ij|).    (11)

In other words, the auxiliary parameter m_ij tracks the change in the magnitude of w_ij. For the pruning task, when the absolute value of a weight/neuron keeps moving towards zero, we should accelerate the pruning of that weight/neuron.
Direction Consistency: The third advantage of the update rule is that the inner product between the expected coarse gradient and the population gradient with respect to m is greater than zero, i.e., they form an acute angle, so updating in this way actually reduces the loss of the vanilla BNN. We refer to Eq. 
5, Lemma 4, and Lemma 10 of Yin et al. [2019], where the ReLU and linear STEs are shown to form an acute angle with the population gradient: ⟨g_σ, g⟩ = σ′ q(w, w*), where q(w, w*) is a deterministic function in both cases and σ denotes the STE function. Since σ′_ReLU ≤ σ′_LeakyReLU ≤ σ′_Linear, we then obtain 0 ≤ ⟨g_ReLU, g⟩ ≤ ⟨g_LeakyReLU, g⟩ ≤ ⟨g_Linear, g⟩.

3.4 Recoverable Pruning

Pruning with recoverability is important for reducing the gap between the original network graph and the sparse network graph, which helps to achieve better sparsity. We design the pruning step following the idea of Dynamic Network Surgery (Guo et al. [2016]): once some important weights are pruned and a large discrepancy occurs, the incorrectly pruned weights are recovered to compensate for the increase in loss. Unlike previous works with hard thresholding, the opportunity for a specific weight/neuron to be pruned is determined automatically during optimization. The pruning step in our model is soft: a pruned weight holds its value and is ready to be spliced back into the network if a large discrepancy is observed.
Based on the multi-step training framework, after m_ij is updated by Eq. 7, the unpruned network parameters w_ij are updated based on the newly learned structure. If no regularization is applied to w_ij, the corresponding m_ij can be recovered by the accuracy loss; a weight is recovered if the damage caused by pruning it cannot be repaired by updating the other unpruned weights. If weight decay is applied, any pruned weight gradually loses recoverability at a fixed rate: the weight decay decreases the magnitude of w_ij and provides a negative gradient to m_ij, which reduces the recoverability. 
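The mechanics above can be sketched in a few lines of plain Python; this is a toy illustration under stated assumptions (scalar weights, squared loss, made-up names), not the paper's implementation: a hard 0/1 gate h(m) masks each weight, a softplus STE stands in for the zero derivative of h so that pruned weights stay recoverable, and m follows the decoupled rule of Eq. 7, which uses sgn(w) rather than w itself.

```python
import math

# Toy sketch of auxiliary-parameter pruning with soft, recoverable gates.
# h(m) is the hard indicator; its true derivative is zero, so a softplus
# surrogate (STE) supplies dh/dm. A ReLU STE would instead give zero
# gradient for m < 0, permanently killing pruned weights.

def h(m):                         # indicator: 1 keeps the weight, 0 prunes it
    return 1.0 if m > 0 else 0.0

def ste_grad(m):                  # softplus derivative as surrogate dh/dm
    return 1.0 / (1.0 + math.exp(-m))

def sgn(x):
    return (x > 0) - (x < 0)

def update_m(m, w, dL_dt, eta=0.5, mu=0.05):
    """One decoupled step (Eq. 7): accuracy pull via sgn(w), sparsity push mu."""
    return m - eta * dL_dt * sgn(w) * ste_grad(m) - mu * ste_grad(m)

w = [1.0, 0.3]                    # connection 0 is needed, connection 1 is noise
y = [1.0, 0.0]                    # per-connection targets, loss 0.5*sum((t-y)^2)
m = [-0.5, -0.5]                  # both gates start pruned
reopened = False

for _ in range(100):
    t = [wi * h(mi) for wi, mi in zip(w, m)]       # masked weights t = w*h(m)
    dL_dt = [ti - yi for ti, yi in zip(t, y)]      # accuracy gradient w.r.t. t
    m = [update_m(mi, wi, gi) for mi, wi, gi in zip(m, w, dL_dt)]
    if h(m[0]) == 1.0:
        reopened = True           # the useful weight was spliced back in
```

Because pruning connection 0 hurts the loss, its gate is pulled back open (recoverability), while the noise connection keeps drifting down under the sparsity push and stays pruned.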
Whether a weight will be recovered under weight decay depends on (1) the absolute value of w_ij, and (2) the damage caused by removing it from the network. More specifically, recovering a weight w_ij requires the gradient of m_ij to move in the positive direction. With L1 regularization, a weight is permanently pruned when its absolute value drops to zero.

Algorithm 1 AutoPrune
Input: Data set X and iter
Parameter: W, M, λ and μ
Output: Auxiliary parameters M and W
1: Randomly split X into Xtrain and Xval.
2: if Pre-trained then
3:    Initialize M based on pre-trained W;
4: else
5:    Initialize M ∼ Gaussian(μ, σ²);
6: end if
7: while iter != 0 do
8:    Sample a mini batch from Xval;
9:    Compute gradw with L1;
10:   Compute gradm and gradmr by Eq. 7;
11:   Update M with gradm and gradmr;
12:   Sample a mini batch from Xtrain;
13:   Compute gradw with L1;
14:   Update W with gradw;
15:   Update iter, λ, μ (if scheduling);
16: end while
17: return solution;

3.5 Acceleration by Regularizers

3.5.1 Sparse Regularizer

Without any regularizer, our model gradually converges to a sparse model, but relatively slowly, especially when the weights are close to optimal and the gradients with respect to T = W ⊙ h(M) are almost zero. In order to accelerate the pruning process, we bring in regularizers that force the mask values toward zero. The sparse regularizer is defined as:

R(h(M)) = Σ_{i,j} |h(m_ij)| = count(h(M)).    (12)

Note that the L1 regularizer applied to h(M) directly counts the number of open gates, which is equivalent to applying an L0 regularizer to h(M). With this regularizer, M is pushed towards zero, since the gradient with respect to m_ij is the positive STE gradient. 
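A minimal sketch of this regularizer (illustrative names, flat list instead of a matrix): because h(m) is 0/1, the L1 penalty on h(M) literally counts open gates, i.e. it acts as an L0 penalty on the mask, and its STE surrogate gradient is a strictly positive push μ·∂h/∂m on every auxiliary parameter.

```python
import math

# Sketch of the sparse regularizer of Eq. 12. h is the hard 0/1 gate; the
# surrogate gradient uses a softplus STE for dh/dm, which is everywhere
# positive, so the penalty always presses m toward the pruned side.

def h(m):
    return 1.0 if m > 0 else 0.0

def sparse_reg(M):
    """R(h(M)) = sum |h(m)| = number of unpruned connections."""
    return sum(abs(h(m)) for m in M)

def reg_grad(M, mu=5e-2):
    """Surrogate gradient of mu * R(h(M)) w.r.t. each m (softplus STE)."""
    return [mu / (1.0 + math.exp(-m)) for m in M]

M = [1.5, -0.3, 0.2, -2.0]
open_gates = sparse_reg(M)     # two positive entries -> 2 open gates
g = reg_grad(M)                # strictly positive: always pushes M downward
```

Shifting every m far enough negative closes all gates and drives the penalty to zero, which is exactly the count-of-open-gates behavior described above.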
Another benefit of this regularizer is that it filters out noise when updating W with SGD or dropout, i.e., μ ∂L2/∂m_ij > 0 when Δ|w| < δ, so m_ij still decreases when w_ij increases by only a small amount.

3.5.2 Working with Weight Decay Regularizer

Our model can also work with general 1-norm or 2-norm regularizers on the weights W. Since the auxiliary parameters M follow |W|, any weight decay regularizer helps to increase the sparsification speed. An important side effect of a weight decay regularizer is that, after a certain weight is pruned, the only source that can change |w_ij| ∈ |W| s.t. h(m_ij) = 0 is the weight decay regularizer itself. A large weight decay hyperparameter shrinks pruned weights quickly and hampers the recoverability discussed in the previous subsection.

3.6 Hyperparameter Sensitivity and Robustness

By proposing auxiliary parameters and an indicator function, we introduce two new hyperparameters: the learning rate η and the regularization hyperparameter μ. However, the pruning procedure is not sensitive to these hyperparameters, for the following reasons: 1) we do not directly regularize W, so the bias of the STE and the hyperparameter do not directly influence the weights; 2) the indicator function is tolerant to turbulence in the auxiliary parameters m_ij; and 3) pruning is recoverable when an incorrect pruning happens and damage is done. Practically, as shown in the experimental section, the learning rate η is scheduled to be the same as for learning the original weights W, and the regularization hyperparameter μ is set to the same value in all test cases. To conclude, our method reduces a set of hyperparameters to one single, non-sensitive hyperparameter.

3.7 Convergence Discussion

Similar to Gordon et al. [2018], our framework does not guarantee convergence when optimized with regularizers. 
But since the sparsification procedure is empirically fast and a good structure can be obtained within a few epochs, we do not always need to wait until convergence. Still, in order to give guidance for hyperparameter tuning, we briefly discuss the necessary condition for convergence.
At convergence, if no regularization is applied, ∂Lacc/∂t_ij = 0. We can further conclude:

(∂Lacc/∂t_ij) h(m_ij) = (∂Lacc/∂t_ij) sgn(w_ij) (∂h(m_ij)/∂m_ij) = 0.    (13)

If both weight decay and sparse regularizers are applied, we need ∂L1/∂w_ij = ∂L2/∂m_ij = 0. Assuming that pruned weights are sufficiently small and contribute to neither gradient, we only consider the gradients w.r.t. m_ij ∈ M s.t. m_ij > 0, so that h(m_ij) = 1. Taking into account the learning rate compensation, we have:

0 = (∂Lacc/∂t_ij) h(m_ij) + λ ∂R(w_ij)/∂w_ij = (∂Lacc/∂t_ij) sgn(w_ij) (∂h(m_ij)/∂m_ij) + μ (∂h(m_ij)/∂m_ij).    (14)

If L2 regularization is applied, we have the necessary condition 2λ|w_ij| = cμ, where c is the non-linear factor of the particular STE. If L1 regularization is applied, the necessary condition is λ = cμ. In both cases, λ and μ should be reduced to the same level at convergence.

4 Experiments

In this section, we introduce our experimental settings and compare neuron pruning and weight pruning performance with existing approaches.

4.1 Settings

To ensure a fair comparison, we use the same backend packages as described in the other papers. Except for the LeNets, all pre-trained parameters are downloaded from commonly available sources, and the auxiliary parameters are initialized either randomly or from pre-trained weights. All accuracy results are averages over 10 runs, and the sparse structure is picked from the most accurate model. 
Our models are implemented by Tensor\ufb02ow and run on Ubuntu Linux 16.04 with 32G memory\nand a single NVIDIA Titan Xp GPU. To show the insensitivity of the introduced hyperparameter, we\nset the learning rate of auxiliary parameters to 1.5e-2 and \u00b5 to 5e-2 for all test cases.\n\nTable 1: Comparison of Different Neuron Pruning Techniques\nMethods\n\nNeurons per Layer\n\nBase Error\n\nEpochs\n\nModel\n\nLeNet-300-100\n784-300-100\n\nLeNet5\n(MNIST)\n20-50-\n800-500\n\nVGG-like\n(CIFAR-10)\n64x2-128x2-\n256x3-512x7\n\nLouizos et al. [2017]\nLouizos et al. [2018]\nLouizos et al. [2018]\n\nOur method\n\nWen et al. [2016]\n\nNeklyudov et al. [2017]\nLouizos et al. [2017]\nLouizos et al. [2018]\nLouizos et al. [2018]\n\nOur method\n\nLi et al. [2017]\n\nNeklyudov et al. [2017]\nNeklyudov et al. [2017]\n\nOur method\n\n-\n-\n\n-\n-\n\n-\n-\n\n1.60%\n\n1.60%\n\n0.90%\n\n0.78%\n6.75%\n7.20%\n7.20%\n7.60%\n\nError\n1.80%\n-\n1.40% 200\n200\n1.80%\n100\n1.82%\n-\n1.00%\n-\n0.86%\n-\n1.00%\n200\n0.90%\n1.00%\n200\n0.80% 100\n40\n6.60%\n-\n7.50%\n-\n9.00%\n8.50%\n150\n\n278-98-13\n219-214-100\n266-88-33\n244-85-37\n\n3-12-800-500\n2-18-284-283\n5-10-76-16\n20-25-45-462\n9-18-65-25\n4-16-86-87\n\n32-64-128-128-256-256-256-256-256-256-256-256-256-512\n\n64-62-128-126-234-155-31-79-73-9-59-73-56-27\n44-54-92-115-234-155-31-76-55-9-34-35-21-280\n37-41-91-89-156-140-74-81-54-51-44-46-48-52\n\nNCR FLOPs\n11%\n3.04\n26%\n2.22\n10%\n3.06\n3.23\n9%\n25%\n1.04\n9%\n2.33\n7%\n12.8\n50%\n2.48\n11.71\n17%\n7%\n9.86\n66%\n1.49\n43%\n4.03\n32%\n3.83\n4.72\n23%\n\n4.2 LeNet-300-100 and LeNet5 on MNIST Database\n\nWe \ufb01rst use MNIST dataset to evaluate the performance. Layer structure of LeNet-300-100 is [784,\n300, 100, 10] and of LeNet5 is two [20,50] convolution layers, followed by two FC layers. 
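The parameter counts used here can be sanity-checked by hand; the short computation below counts weights plus biases (the 5×5 kernel size for LeNet5's convolutions is an assumption consistent with the 20-50-800-500 layout listed above):

```python
# Sanity check of the parameter counts of the two reference models
# (weights + biases; 5x5 conv kernels assumed for LeNet5).

def dense_params(n_in, n_out):
    return n_in * n_out + n_out            # weight matrix + bias vector

def conv_params(c_in, c_out, k=5):
    return k * k * c_in * c_out + c_out    # kernels + one bias per filter

# LeNet-300-100: 784 -> 300 -> 100 -> 10, all dense layers.
lenet300 = (dense_params(784, 300) + dense_params(300, 100)
            + dense_params(100, 10))

# LeNet5: conv 1->20, conv 20->50, flatten to 800, dense 800->500->10.
lenet5 = (conv_params(1, 20) + conv_params(20, 50)
          + dense_params(800, 500) + dense_params(500, 10))

print(lenet300)   # 266610, i.e. ~267K
print(lenet5)     # 431080, i.e. ~431K
```

Both totals match the 267K and 431K figures quoted in the text.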
The total\nnumber of trainable parameters of LeNet-300-100 and LeNet5 are 267K and 431K, respectively.\nSimilar to previous works, we train reference models with standard training method with SGD\noptimizer, achieving accuracy of 1.72% and 0.78% respectively. In the pruning process, we use the\nsoftplus STE. The learning rate for L1 is scheduled from 1e-2 to 1e-3. During the training procedure,\nwe observe that the \ufb01nal result is not sensitive to \u03bb and \u00b5 but the sparsi\ufb01cation speed relies on \u00b5.\nFor neuron pruning, from Table 1, we can achieve the highest neuron compression rate(NCR) as 3.23\nand the lowest FLOP usage percentage 9% comparing to original LeNet-300-100. For LeNet5, we\nare taking the lead in both the model accuracy 99.20% and the FLOP reduction rate 93%. For weight\npruning, as we show in Table 4, our method applied to the LeNet-300-100 structure achieves the best\ncompression rate of up to 80x while a 0.06% error increase. Note that all the other methods with\ncompression rates greater than 60 have a minor accuracy drop while our method reaches the best\naccuracy. For LeNet5 model, we compare existing works with two reference models. For the \ufb01rst\nmodel with 0.78% error, we achieve 260x compression rate and 0.8% error. For the second model\nwith 0.91% error, our method obtains a 310x compression rate with no accuracy drop.\n\n7\n\n\fTable 2: VGG-like CIFAR-10 Neuron Pruning\n\n57%\n\nLayer Conv1 Conv2 Conv3 Conv4\nSparsity 57.03% 17.36% 20.95% 16.06%\nFLOP\n49%\nConv5 Conv6 Conv7 Conv8 Conv9\n10.76% 4.67% 5.30% 1.52% 0.39%\n15% 4.50% 1.60%\n\n37%\n\n45%\n\n42%\n\n33%\n\nMethods\n\nSandler et al. [2018]\n\nTable 3: MobileNetV2(Top 1 Accuracy)\nFLOPs\nFLOPs Accuracy\n97M 65.40%\n97M 64.40%\n97M 65.10%\n102M 66.83%\n209M 69.80%\n216M 71.5%\nYu and Huang [2019b] 209M 69.60%\n\nYu and Huang [2019b]\n\nSandler et al. [2018]\n\nTan et al. [2019]\n\nYu et al. [2018]\n\nOur method\n\n100M\n\n200M\n\nWu et al. 
[2019]\n\n246M\nYu and Huang [2019a] 207M\n\nConv10 Conv11 Conv12 Conv13\n0.35% 0.28% 0.27% 0.33%\n0.85% 0.77% 0.84%\n\n1%\n\n300M\n\nOur method\n\nSandler et al. [2018]\n\nTan et al. [2019]\n\n73%\n73%\n209M 73.32%\n300M 69.80%\n317M\n\n74%\n\nYu and Huang [2019a] 305M 74.20%\n305M 74.0%\n\nOur method\n\n4.3 VGG-like on CIFAR-10\n\nFor VGG-like model, we use CIFAR-10 dataset to evaluate the performance. VGG-like is a standard\nconvolution neural network with 13 convolutional layers followed by 2 FC layers (512 and 10\nrespectively). The total number of trainable parameters is 15M. Similar to previous works, we use\nthe reference VGG-like model pre-trained with SGD with testing error 7.60%.\nIn this structure, we use L2-norm and L1-norm for L1 with hyperparameters 5e-5 and 1e-6, respec-\ntively. We evaluate both Leaky ReLU and Softplus STEs. Leaky ReLU gives a fast sparsi\ufb01cation\nspeed while Softplus shows a smooth convergence with approximately 1.5x running time. We suggest\nselecting the proper STE based on the time constraint.\nFor neuron pruning task, as shown in Table 1, our method reaches 23% FLOPs within 150 epochs.\nIn Table 2, we show the layer-wise percentage FLOPs of VGG-16 structure. Our model achieves a\nhigher sparsity at any layer compared to Li et al. [2017]. For weight pruning, our model reaches the\nhighest 75x compression rate, with only moderate accuracy drop within 150 epochs of training.\n\n4.4 AlexNet, ResNet-50 and MobileNet on ImageNet\n\nThree models with ILSVRC12 dataset are also tested with our pruning method including 1M training\nimages and 0.5M validation and testing images. AlexNet can be considered as deep since it contains\n5 convolution layers and 3 FC layers. ResNet-50 consists of 16 convolution blocks with structure\ncfg=[3,4,6,3], plus one input and one output layer, and in total 25M parameters. 
For MobileNet, we use the conventional MobileNet V2 (224×224) model with 310M FLOPs. The size of the dataset and the complexity of these models demonstrate the scalability of our method.

Table 4: Comparison of Different Weight Pruning Techniques

Model          Methods                  Error           CR
LeNet-300-100  Dong et al. [2017]       1.76%→2.43%     66.7
(MNIST)        Ullrich et al. [2017]    1.89%→1.94%     64
               Molchanov et al. [2017]  1.64%→1.92%     68
               Our method               1.72%→1.78%     80
LeNet5         Guo et al. [2016]        0.91%→0.91%     108
(MNIST)        Ullrich et al. [2017]    0.88%→0.97%     162
               Molchanov et al. [2017]  0.80%→0.75%     280
               Li et al. [2018]         0.91%→0.91%     298
               Our method               0.78%→0.80%     260
               Our method               0.91%→0.91%     310
VGG-like       Zhuang et al. [2018]     6.01%→5.43%     15.58
(CIFAR-10)     Zhu et al. [2018]        6.42%→6.69%     8.5
               Molchanov et al. [2017]  7.55%→7.55%     65
               Our method               7.60%→7.82%     75
AlexNet        Guo et al. [2016]        43.42%→43.09%   17.7
(ILSVRC12)     Srinivas et al. [2017]   42.80%→43.04%   10.3
               Dong et al. [2017]       43.30%→50.04%   9.1
               Our method               43.26%→44.10%   18.5
ResNet50       Zhuang et al. [2018]     23.99%→25.05%   2.06
(ILSVRC12)     Our method               25.10%→25.50%   2.2

ResNet-50 is trained with a learning rate scheduled from 1e-5 to 1e-6. Only the L2 norm is applied, with λ = 1e-5. Note that the identity connections alleviate the need for layer-wise learning rates, since the gradient reaching the first several layers is enough to pull the auxiliary parameters. The learning rate for AlexNet is 1e-3 and for MobileNet V2 is 1e-5. We split the training data 1:1 between weight updates and auxiliary parameter updates.
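The 1:1 data split with alternating updates can be sketched as follows. This is a minimal illustration under our own naming, not the authors' implementation: `split_batches`, `train_epoch`, `step_w`, and `step_m` are hypothetical helpers standing in for the actual weight and auxiliary-parameter optimizers.

```python
import random

def split_batches(batches, seed=0):
    """Shuffle and split training batches 1:1 into two disjoint streams:
    one for updating the weights w, one for the auxiliary parameters m."""
    rng = random.Random(seed)
    shuffled = list(batches)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def train_epoch(w_batches, m_batches, step_w, step_m):
    """One epoch: weight steps consume only the first stream,
    auxiliary-parameter steps consume only the second."""
    for batch in w_batches:
        step_w(batch)  # gradient step on weights, gates held fixed
    for batch in m_batches:
        step_m(batch)  # gradient step on auxiliary parameters, weights fixed
```

Each parameter group thus sees a disjoint half of the data per epoch; once the target FLOPs level is reached, the split is no longer needed.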
Once the desired number of FLOPs is reached, we use all training data to fine-tune the model.
For neuron pruning, we evaluate our method on the compact MobileNet V2, which has less redundancy, and compare against state-of-the-art methods at different FLOPs levels in Table 3. Our method achieves similar error at the 300M level and outperforms the others at the extreme levels (200M and 100M). For ResNet at 600M FLOPs, the top-1 error is 27.6%. For weight pruning, the results in Table 4 show that our method achieves an 18.5x compression rate on AlexNet with a 0.84% accuracy drop. For ResNet-50, we obtain a 2.2x compression rate with only a 0.4% accuracy drop.

4.5 Ablation Study

We show that sparsity and accuracy are not sensitive to hyperparameters, taking weight pruning with VGG-like on CIFAR-10 as an example. In Fig. 2(a), we set the learning rate of the auxiliary parameters to 1e-2, 1e-1 and 5e-1. We observe that all three settings converge to a similar compression ratio at different sparsification speeds. In Fig. 2(b), the accuracy with a higher learning rate drops faster, but the final gap is less than 0.1%. In Fig. 2(c), we plot compression ratio versus accuracy for the proposed update in Eq. 7 and the regular BNN update. The regular BNN update becomes unstable beyond 30x CR, and accuracy drops sharply afterward. With the proposed update rule, accuracy is more stable and has lower variance up to 80x. We also compare different STE functions and learning rates for the VGG-like model on CIFAR-10 in Fig. 2(d). The Softplus STE achieves the best result but converges more slowly than the Leaky ReLU STE, which achieves a slightly lower CR.
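The hard gate and the three STE surrogates compared here can be sketched in NumPy as below. This is an illustrative sketch, not the paper's exact Eq. 7 update rule: the zero threshold, the Leaky ReLU slope `alpha`, and the function names are our assumptions.

```python
import numpy as np

def gate(m):
    """Forward pass: hard binary gate on the auxiliary parameter m
    (1 keeps the corresponding weight, 0 prunes it)."""
    return (m > 0).astype(np.float64)

# Backward surrogates: the hard gate has zero gradient almost everywhere,
# so an STE substitutes the derivative of a smooth proxy function.
def linear_ste_grad(m):
    return np.ones_like(m)                  # identity pass-through

def leaky_relu_ste_grad(m, alpha=0.01):
    return np.where(m > 0, 1.0, alpha)      # derivative of Leaky ReLU

def softplus_ste_grad(m):
    return 1.0 / (1.0 + np.exp(-m))         # derivative of Softplus (sigmoid)
```

The Softplus surrogate decays smoothly as m goes negative, which is consistent with the smoother but slower convergence observed above; the Leaky ReLU surrogate keeps a small constant gradient on pruned entries, allowing faster sparsification.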
The linear STE, however, yields the worst CR and the slowest convergence.

Figure 2: Illustration of Hyperparameter Sensitivity. (a) CR vs. iteration; (b) CR vs. accuracy; (c) training CR vs. accuracy; (d) comparison of different STEs and learning rates.

4.6 Training From Scratch

Apart from sparsifying pre-trained models, our method supports training a sparse network from scratch. We evaluate this by training LeNet5 from scratch. All weights are randomly initialized as usual, while the auxiliary parameters are initialized as mij ∼ Gaussian(0.1, 0.05). The initial learning rate is set to 1e-3 and gradually decreased to 1e-5. The final model has an error of 0.95% with a 168x compression rate.

5 Conclusions

In this paper, we propose to automatically prune deep neural networks by regularizing auxiliary parameters instead of the original weight values. The auxiliary parameters are not sensitive to hyperparameters and are more robust to noise during training. We also design a gradient-based update rule for the auxiliary parameters and analyze its benefits. In addition, we combine sparse regularizers and weight regularization to accelerate the sparsification process. Extensive experiments show that our method achieves state-of-the-art sparsity in both weight pruning and neuron pruning compared with existing approaches. Moreover, our model also supports training from scratch and reaches comparable sparsity.

References

Alireza Aghasi, Afshin Abdi, Nam Nguyen, and Justin Romberg.
Net-trim: Convex pruning of deep neural networks with performance guarantee. In Advances in Neural Information Processing Systems, pages 3177–3186, 2017.

Miguel A Carreira-Perpiñán and Yerlan Idelbayev. "Learning-compression" algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8532–8541, 2018.

Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 4857–4867, 2017.

Aidan N Gomez, Ivan Zhang, Kevin Swersky, Yarin Gal, and Geoffrey E Hinton. Learning sparse networks using targeted dropout. arXiv preprint arXiv:1905.13678, 2019.

Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1586–1595, 2018.

Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379–1387, 2016.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks.
In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.

Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient ConvNets. In International Conference on Learning Representations, 2017.

Guiying Li, Chao Qian, Chunhui Jiang, Xiaofen Lu, and Ke Tang. Optimization based layer-wise magnitude-based pruning for DNN compression. In IJCAI, pages 2383–2389, 2018.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019.

Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.

Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.

Ana I Maqueda, Antonio Loquercio, Guillermo Gallego, Narciso García, and Davide Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5419–5427, 2018.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2498–2507, 2017.

Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P Vetrov. Structured Bayesian pruning via log-normal multiplicative noise.
In Advances in Neural Information Processing Systems, pages 6775–6784, 2017.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 138–145, 2017.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.

Enzo Tartaglione, Skjalg Lepsøy, Attilio Fiandrotti, and Gianluca Francini. Learning sparse neural networks via sensitivity-driven regularization. In Advances in Neural Information Processing Systems, pages 3878–3888, 2018.

Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. In International Conference on Learning Representations, 2017.

Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer.
FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.

Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley J. Osher, Yingyong Qi, and Jack Xin. Understanding straight-through estimator in training activation quantized neural nets. In International Conference on Learning Representations, 2019.

Jiahui Yu and Thomas Huang. Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728, 2019.

Jiahui Yu and Thomas Huang. Universally slimmable networks and improved training techniques. arXiv preprint arXiv:1903.05134, 2019.

Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. arXiv preprint arXiv:1812.08928, 2018.

Xiaotian Zhu, Wengang Zhou, and Houqiang Li. Improving deep neural network sparsity through decorrelation regularization. In IJCAI, pages 3264–3270, 2018.

Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 883–894, 2018.