{"title": "Moonshine: Distilling with Cheap Convolutions", "book": "Advances in Neural Information Processing Systems", "page_first": 2888, "page_last": 2898, "abstract": "Many engineers wish to deploy modern neural networks in memory-limited settings; but the development of flexible methods for reducing memory use is in its infancy, and there is little knowledge of the resulting cost-benefit. We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data.", "full_text": "Moonshine: Distilling with Cheap Convolutions\n\nElliot J. Crowley\n\nSchool of Informatics\nUniversity of Edinburgh\n\nelliot.j.crowley@ed.ac.uk\n\nGavin Gray\n\nSchool of Informatics\nUniversity of Edinburgh\ng.d.b.gray@ed.ac.uk\n\nAmos Storkey\n\nSchool of Informatics\nUniversity of Edinburgh\na.storkey@ed.ac.uk\n\nAbstract\n\nMany engineers wish to deploy modern neural networks in memory-limited settings;\nbut the development of \ufb02exible methods for reducing memory use is in its infancy,\nand there is little knowledge of the resulting cost-bene\ufb01t. We propose structural\nmodel distillation for memory reduction using a strategy that produces a student\narchitecture that is a simple transformation of the teacher architecture: no redesign\nis needed, and the same hyperparameters can be used. 
Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data.\n\n1 Introduction\n\nDespite advances in deep learning for a variety of tasks (LeCun et al., 2015), deployment of deep learning into embedded devices (e.g. wearable devices, digital cameras, vehicle navigation systems) has been relatively slow due to the resource constraints under which these devices operate. Big, memory-intensive neural networks do not fit on these devices, but do these networks have to be big and expensive? The dominant run-time memory cost of neural networks is the number of parameters that need to be stored. Can we have networks with substantially fewer parameters, without the commensurate loss of performance?\nIt is possible to take a large pre-trained teacher network, and use its outputs to aid in the training of a smaller student network (Ba & Caruana, 2014) through some distillation process. By doing this the student network is more powerful than if it was trained solely on the training data, and is closer in performance to the larger teacher. The lower-parameter student network typically has an architecture that is more shallow, or thinner \u2014 by which we mean its filters have fewer channels (Romero et al., 2015) \u2014 than the teacher. While it is not possible to arbitrarily approximate any network with another (Urban et al., 2017), the limit in neural network performance is at least in part due to the training algorithm, rather than its representational power.\nIn this paper, we take an alternative approach in designing our student networks. 
Instead of making networks thinner, or more shallow, we take the standard convolutional block such networks possess and replace it with a cheaper convolutional block, keeping the original architecture. For example, in a Residual Network (ResNet) (He et al., 2016a) this standard block is a pair of sequential 3\u00d73 convolutions. We show that for a comparable number of parameters, student networks that retain the architecture of their teacher but with cheaper convolutional blocks outperform student networks with the original blocks and smaller architectures.\nThe cheap convolutional blocks we suggest are described in Section 3, along with an overview of the methods we employ for distillation. In Section 4 we evaluate student networks with these blocks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). Finally, in Section 5 we examine the efficacy of such networks for the tasks of ImageNet (Russakovsky et al., 2015) classification, and semantic segmentation on the Cityscapes dataset (Cordts et al., 2016). Our claims are as follows:\n\n\u2022 Greater model compression by distillation is possible by replacing convolutional blocks than by shrinking the architecture.\n\u2022 Grouped convolutional blocks, with or without a bottleneck contraction, are an effective replacement block.\n\u2022 This replacement is cheap in design time (substitution), and cheap in training complexity; it uses the same optimiser and hyperparameters as those used during the original training.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Related Work\n\nThe parameters in deep networks have a great deal of redundancy; it has been shown that many of them can be predicted from a subset of parameters (Denil et al., 2013). 
However the challenge\nremains to \ufb01nd good ways to exploit this redundancy without losing substantial model accuracy.\nThis observation, along with a desire for ef\ufb01ciency improvements has driven the development of\nsmaller, and less computationally-intensive convolutions. One of the most prominent examples is the\ndepthwise separable convolution (Sifre, 2014) which applies a separate convolutional kernel to each\nchannel, followed by a pointwise convolution over all channels; depthwise separable convolutions\nhave been used in several architectures (Ioffe & Szegedy, 2015; Chollet, 2016; Xie et al., 2017), and\nwere explicitly adapted to mobile devices in Howard et al. (2017).\nThe depthwise part of this convolution is a speci\ufb01c case of a grouped convolution where there are\nas many groups as channels; grouped convolutions were used with a grouping of 2 (i.e. half the\nchannels belong to each group) in the original AlexNet (Krizhevsky et al., 2012) due to GPU memory\nconstraints. In the work of Ioannou et al. (2017) the authors examine networks trained from scratch\nfor different groupings, motivated by ef\ufb01ciency. They found that these networks actually generalised\nbetter than an ungrouped alternative. However, separating the spatial and channel-wise elements\nis not the only way to simplify a convolution. In Jin et al. (2015) the authors propose breaking up\nthe general 3D convolution into a set of 3 pointwise convolutions along different axes. Wang et al.\n(2016) start with separable convolutions and add topological subdivisioning, a way to treat sections of\ntensors separately, and a bottleneck of the spatial dimensions. Both of these methods produce models\nthat are several times smaller than the original model while maintaining accuracy.\nIn a separable convolution, the most expensive part is the pointwise convolution, so it has been\nproposed that this operation could also be grouped over sets of channels. 
However, to maintain some connections between channels, it is helpful to add an operation mixing the channels together (Zhang et al., 2018). More simply, a squared reduction can be achieved by applying a bottleneck on the channels before the spatial convolution (Iandola et al., 2016; Xie et al., 2017). In this paper we examine the potency of a separable bottleneck structure.\nThe work discussed thus far involves learning a compressed network from scratch. There are alternatives to this such as retraining after reducing the number of parameters (Han et al., 2016; Li et al., 2017). We are interested in learning our smaller network as a student through distillation (Bucilu\u01CE et al., 2006; Ba & Caruana, 2014) in conjunction with a pre-trained teacher network.\nHow small can our student network be? The complex function of a large, deep teacher network can, theoretically, be approximated by a network with a single hidden layer with enough units (Cybenko, 1989). The difficulty in practice is learning that function. Knowledge distillation (Ba & Caruana, 2014; Hinton et al., 2015) proposes to use the information in the logits of a learnt network to train the smaller student network. In early experiments, this was shown to be effective; networks much smaller than the original could be trained with small increases in error. However, modern deep architectures prove harder to compress. For example, a deep convolutional network cannot be trivially replaced by a feedforward architecture (Urban et al., 2017). Two methods have been proposed to deal with this. First, in Romero et al. (2015) the authors use a linear map between activations at intermediate points to produce an extra loss function. Second, in Zagoruyko & Komodakis (2017), the authors choose instead to match the activations after taking the mean over the channels and call this method attention transfer. 
In the context of this paper, we found attention transfer to be effective in our experiments, as described in Section 4.\n\n3 Compression with Cheap Convolutions\n\nGiven a large, deep network that performs well on a given task, we are interested in compressing that network so that it uses fewer parameters. A flexible and widely applicable way to reduce the number of parameters in a model is to replace all its convolutional layers with a cheaper alternative. Doing this replacement invariably impairs performance when the reduced network is trained directly on the data. Fortunately, we are able to demonstrate that modern distillation methods enable the cheaper model to have performance closer to the original large network.\n\n3.1 Distillation\n\nFor this paper, we utilise and compare two different distillation methods for learning a smaller student network from a large, pre-trained teacher network: knowledge distillation (Ba & Caruana, 2014; Hinton et al., 2015) and attention transfer (Zagoruyko & Komodakis, 2017).\n\nKnowledge Distillation Let us denote the cross entropy of two probability vectors p and q as L_CE(p, q) = \u2212\u03a3_k p_k log q_k. Assume we have a dataset of elements, with one such element denoted x, where each element has a corresponding one-hot class label: denote the one-hot vector corresponding to x by y. Given x, we have a trained teacher network that outputs the corresponding logits t = teacher(x); likewise we have a student network that outputs logits s = student(x). To perform knowledge distillation we train the student network to minimise the following loss function (averaged across all data items):\n\nL_KD = (1 \u2212 \u03b1) L_CE(y, \u03c3(s)) + 2\u03b1T\u00b2 L_CE(\u03c3(t/T), \u03c3(s/T)),   (1)\n\nwhere \u03c3(\u00b7) 
is the softmax function, T is a temperature parameter and \u03b1 is a parameter controlling the ratio of the two terms. The first term is a standard cross entropy loss penalising the student network for incorrect classifications. The second term is minimised if the student network produces outputs similar to those of the teacher network; the idea is that the outputs of the teacher network contain additional, beneficial information beyond just a class prediction.\n\nAttention Transfer Consider some choice of layers i = 1, 2, ..., N_L in a teacher network, and the corresponding layers in the student network. At each chosen layer i of the teacher network, collect the spatial map of the activations for channel j into the vector a^t_ij, and let A^t_i collect a^t_ij for all j. Likewise, for the student network, collect a^s_ij into A^s_i. Now given some choice of mapping f(A_i) that maps each collection of the form A_i into a vector, attention transfer involves learning the student network by minimising:\n\nL_AT = L_CE(y, \u03c3(s)) + \u03b2 \u03a3_{i=1}^{N_L} \u2016 f(A^t_i)/\u2016f(A^t_i)\u2016\u2082 \u2212 f(A^s_i)/\u2016f(A^s_i)\u2016\u2082 \u2016\u2082,   (2)\n\nwhere \u03b2 is a hyperparameter. Zagoruyko & Komodakis (2017) recommended using f(A_i) = (1/N_{A_i}) \u03a3_{j=1}^{N_{A_i}} (a_ij)\u00b2, where N_{A_i} is the number of channels at layer i. In other words, the loss targets the difference in the spatial map of average squared activations, where each spatial map is normalised by the overall activation norm.\nLet us examine the loss (2) further. The first term is again a standard cross entropy loss. 
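Both distillation losses, (1) and (2), can be sketched in a few lines. The following is an illustrative NumPy version for a single data item; the helper names (kd_loss, at_loss, attention_map) are ours, not taken from the paper's released code:

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax sigma(z / T)
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    # L_CE(p, q) = -sum_k p_k log q_k
    return -float(np.sum(np.asarray(p) * np.log(np.asarray(q) + eps)))

def kd_loss(y, s, t, alpha=0.9, T=4.0):
    # Eq. (1): hard-label term plus temperature-softened teacher term
    return ((1 - alpha) * cross_entropy(y, softmax(s))
            + 2 * alpha * T ** 2 * cross_entropy(softmax(t, T), softmax(s, T)))

def attention_map(A):
    # f(A_i): mean over channels of the squared activations, flattened,
    # then normalised by its own l2 norm
    f = (np.asarray(A, dtype=float) ** 2).mean(axis=0).ravel()
    return f / np.linalg.norm(f)

def at_loss(y, s, teacher_acts, student_acts, beta=1000.0):
    # Eq. (2): cross entropy plus l2 distances between normalised attention
    # maps at each of the chosen layers i = 1..N_L
    ce = cross_entropy(y, softmax(s))
    at = sum(np.linalg.norm(attention_map(At) - attention_map(As))
             for At, As in zip(teacher_acts, student_acts))
    return ce + beta * at
```

With alpha = 0 the knowledge-distillation loss reduces to the ordinary cross entropy, and the attention term vanishes whenever the student and teacher maps coincide, which is a quick sanity check on any implementation.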
The second term, however, ensures the spatial distributions of the student and teacher activations are similar at selected layers in the network, the explanation being that both networks are then paying attention to the same things at those layers.\n\n3.2 Cheap Convolutions\n\nAs large fully-connected layers are no longer commonplace, convolutions make up almost all of the parameters in modern networks.\u00b9 It is therefore desirable to make them smaller. Here, we present several convolutional blocks that may be introduced in place of a standard block in a network to substantially reduce its parameter cost.\n\n\u00b9The parameters introduced by batch normalisation are negligible compared to those in the convolutions. However, they are included for completeness in Table 1.\n\nFigure 1: In (a) a grouped convolution operates by passing independent filters over the tensor after it is separated into g groups over the channel dimension; as each of the g filters needs only to operate over N/g channels this reduces the parameter cost of the layer by a factor of g. These can be composed into the blocks illustrated in (b). The Grouped + Pointwise (G(g)) block substitutes a k \u00d7 k convolution with a grouped convolution followed by a pointwise (1 \u00d7 1) convolution, repeating this twice. To reduce parameters further, a pointwise Bottleneck can be used before the Grouped + Pointwise convolution (BG(b, g)).\n\nFirst, let us consider a standard two-dimensional convolutional layer that contains Nout filters, each of size Nin \u00d7 k \u00d7 k. Nout is the number of channels of the layer output, Nin is the number of channels of the input, and k \u00d7 k is the kernel size of each convolution. In modern networks it is almost always the case that Nin \u2264 Nout. Let N = max(Nin, Nout). 
Then the parameter cost of this layer is Nin Nout k\u00b2, and is bounded by N\u00b2k\u00b2. In a typical residual network, a block contains two such convolutions. We will refer to this as a Standard block S, and it is outlined in Table 1.\nAn alternative approach is to separate each convolution into g groups, as shown in Figure 1a. By restricting the convolutions to only mix channels within each group, we obtain a substantial reduction in the number of parameters for a grouped computation: for example, for Nin = Nout = N the cost changes from N\u00b2k\u00b2 for a standard layer to g groups of (N/g)\u00b2k\u00b2-parameter convolutions, hence reducing the parameter cost by a factor of g. We can then provide some cross-group mixing by following each grouped convolution with a pointwise convolution, with an N\u00b2 parameter cost (when Nin \u2260 Nout the change in channel size occurs across this pointwise convolution). We refer to this substitution operator as G(g) (grouped convolution with g groups) and illustrate it in Figure 1b.\nIn the original ResNet paper (He et al., 2016a) the authors introduced a bottleneck block which we have parameterised, and denoted as B(b) in Table 1: the input first has its channels decreased by a factor of b via a pointwise convolution, before a full convolution is carried out. Finally, another pointwise convolution brings the representation back up to the desired Nout. We can reduce the parameter cost of this block even further by replacing the full convolution with a grouped one; the Bottleneck Grouped + Pointwise block is referred to as BG(b, g) and is illustrated in Figure 1b.\nThese substitute blocks are compared in Table 1 and their computational costs are given. 
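The per-block costs in Table 1 can be checked mechanically. Below is a small, self-contained sketch of the accounting (the function names are our own); it sums the cost of each convolution in a block and reproduces the closed forms for S, G(g), B(b) and BG(b, g):

```python
def conv_params(n_in, n_out, k=3, groups=1):
    # A k x k convolution split into `groups` channel groups:
    # each output channel sees only n_in / groups input channels.
    assert n_in % groups == 0
    return (n_in // groups) * n_out * k * k

def block_params(block, N, k=3, g=1, b=1):
    # Convolutional parameter cost of the blocks in Table 1
    # (N input/output channels; batch-norm parameters ignored).
    if block == "S":        # Conv, Conv
        convs = [conv_params(N, N, k)] * 2
    elif block == "G":      # (GConv(g), Conv1x1) twice
        convs = [conv_params(N, N, k, g), conv_params(N, N, 1)] * 2
    elif block == "B":      # Conv1x1 down, full Conv, Conv1x1 up
        convs = [conv_params(N, N // b, 1),
                 conv_params(N // b, N // b, k),
                 conv_params(N // b, N, 1)]
    elif block == "BG":     # as B, but with a grouped k x k convolution
        convs = [conv_params(N, N // b, 1),
                 conv_params(N // b, N // b, k, g),
                 conv_params(N // b, N, 1)]
    else:
        raise ValueError(block)
    return sum(convs)
```

For N = 64 and k = 3 this gives 73,728 parameters for S, 26,624 for G(4), 13,312 for B(2) and 6,400 for BG(2, 4), matching the closed forms 2N²k², 2N²(k²/g + 1), N²(k²/b² + 2/b) and N²(k²/(gb²) + 2/b) respectively.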
In practice, by varying the bottleneck size and the number of groups, network parameter numbers may vary over two orders of magnitude; enumerated examples are given in Table 2.\nGrouped convolutions and bottlenecks are common methods for parameter reduction when designing a network architecture. Both are easy to implement in any deep learning framework. Sparsity inducing methods (Han et al., 2016), or approximate layers (Yang et al., 2015), may also provide advantages, but these are complementary to the approaches here. More structured reductions such as grouped convolutions and bottlenecks can be advantageous over sparsity methods in that the sparsity structure does not need to be stored. In the following sections, we demonstrate that using these proposed blocks with effective model distillation allows for substantial compression with minimal reduction in performance.\n\nTable 1: Convolutional Blocks used in this paper: a standard block S, a grouped + pointwise block G, a bottleneck block B, and a bottleneck grouped + pointwise block BG. Conv refers to a k \u00d7 k convolution. GConv is a grouped k \u00d7 k convolution and Conv1x1 is a pointwise convolution. Blocks use pre-activations (He et al., 2016b): all convolutions are preceded by a batch-norm layer + a ReLU activation. We assume that the input and output to each block has N channels and that channel size does not change over a particular convolution unless written out explicitly as (x \u2192 y). Where applicable, g is the number of groups in a grouped convolution and b is the bottleneck contraction. We give the cost of the convolutions in each block in terms of these parameters. The batch-norm cost at test time is also given, but is markedly smaller.\n\nBlock        S        G(g)             B(b)                 BG(b, g)\nStructure    Conv     GConv(g)         Conv1x1(N \u2192 N/b)     Conv1x1(N \u2192 N/b)\n             Conv     Conv1x1          Conv                 GConv(g)\n                      GConv(g)         Conv1x1(N/b \u2192 N)     Conv1x1(N/b \u2192 N)\n                      Conv1x1\nConv Params  2N\u00b2k\u00b2    2N\u00b2(k\u00b2/g + 1)    N\u00b2(k\u00b2/b\u00b2 + 2/b)      N\u00b2(k\u00b2/(gb\u00b2) + 2/b)\nBN Params    4N       8N               N(2 + 4/b)           N(2 + 4/b)\n\n4 CIFAR Experiments\n\nIn this section we train and evaluate a number of student networks, each distilled from the same large teacher network. Experiments are conducted for both the CIFAR-10 and CIFAR-100 datasets. We distil with (i) knowledge distillation and (ii) attention transfer. We also train the networks without any form of distillation (i.e. from scratch) to observe whether the distillation process is necessary to obtain good performance. In this way we demonstrate that the high performance comes from the distillation, and cannot be achieved by directly training the student networks using the data.\nFor comparison we also study student networks with smaller architectures (i.e. fewer layers/filters) than the teacher. This enables us to test whether the block transformations we propose are key, or whether it is simply a matter of distilling networks with smaller numbers of parameters. We compare the smaller student architectures with student architectures implementing cheap, substitute convolutional blocks, but with the same architecture as the teacher. The different convolutional blocks are summarised in Table 1 and the student networks are described in detail in Section 4.1. Results are given in Table 2 and Figure 2. These results are discussed in detail in Section 4.2.\n\n4.1 Network Descriptions\n\nFor our experiments we utilise the Wide Residual Network (WRN) architecture (Zagoruyko & Komodakis, 2016); the bulk of the network lies in its {conv2, conv3, conv4} groups and the network depth d determines the number of convolutional blocks n in these groups as n = (d \u2212 4)/6. 
The network width, denoted by k, affects the channel size of the filters in these blocks. Note that when we employ attention transfer the student and teacher outputs of groups {conv2, conv3, conv4} are used as {A1, A2, A3} in the second term of Equation (2), with N_L = 3.\nFor our teacher network we use WRN-40-2 (a WRN with depth 40 and width multiplier 2) with standard (S) blocks. 3 \u00d7 3 kernels are used for all non-pointwise convolutions in our student and teacher networks unless stated otherwise. For our student networks we use:\n\n\u2022 WRN-40-1, 16-2, and 16-1 with S blocks. These are student networks that are thinner and/or more shallow than the teacher and represent typical student networks used.\n\u2022 WRN-40-2 with S blocks where the 3 \u00d7 3 kernels have been replaced with 2 \u00d7 2 dilated kernels (as described in Yu & Koltun (2016)). This allows us to see if it is possible to naively reduce parameters by effectively zeroing out elements of a standard kernel.\n\u2022 WRN-40-2 using a bottleneck block B with 2\u00d7 and 4\u00d7 channel contraction (b).\n\u2022 WRN-40-2 using a grouped + pointwise block G for group sizes (g) {2, 4, 8, 16, N/16, N/8, N/4, N/2, N} where N is the number of channels in a given block. This allows us to explore the spectrum between full convolutions (g = 1) and fully separable convolutions (g = N).\n\u2022 WRN-40-2 with a bottleneck grouped + pointwise block BG. We use b = 2 with group sizes of {2, 4, 8, 16, M/16, M/8, M/4, M/2, M} where M = N/b is the number of channels after the bottleneck. We use this notation so that g = M represents fully separable convolutions and we can easily denote divisions thereof. BG(4, M) is also used to observe the effect of extreme compression.\n\nFigure 2: Test Error vs. (a) No. parameters and (b) Mult-adds for student networks learnt with attention transfer on CIFAR-10. Note that the x-axes are log-scaled. 
Points on the red curve\ncorrespond to networks with S convolutional blocks and reduced architectures. All other networks\nhave the same WRN-40-2 architecture as the teacher but with cheap convolutional blocks: G (green),\nB (blue), and BG (cyan). The blocks are described in Table 1. Notice that the student networks\nwith cheap blocks outperform those with smaller architectures and standard convolutions for a given\nparameter budget or mult-add budget.\n\nImplementation Details For training we used minibatches of size 128. Before each minibatch, the\nimages were padded by 4 \u00d7 4 zeros, and then a random 32 \u00d7 32 crop was taken. Each image was\nleft-right \ufb02ipped with a probability of a half. Networks were trained for 200 epochs using SGD with\nmomentum \ufb01xed at 0.9 with an initial learning rate of 0.1. The learning rate was reduced by a factor\nof 0.2 at the start of epochs 60, 120, and 160. For knowledge distillation we set \u03b1 to 0.9 and used a\ntemperature of 4. For attention transfer \u03b2 was set to 1000. The code to reproduce these experiments\nis available at https://github.com/BayesWatch/pytorch-moonshine.\n\n4.2 Analysis and Observations\n\nFigure 2a compares the parameter cost of each student network (on a log scale) against the test\nerror on CIFAR-10 obtained with attention transfer. On this plot, the ideal network would lie in the\nbottom-left corner (few parameters, low error). What is fascinating is that almost every network\nwith the same architecture as the teacher, but with cheap convolutional blocks (those on the blue,\ngreen, and cyan lines) performs better for a given parameter budget than the reduced architecture\nnetworks with standard blocks (the red line). BG(2, 2) outperforms 16-2 (5.57% vs. 5.66%) despite\nhaving considerably fewer parameters (287K vs. 692K). 
Several of the networks with BG blocks both significantly outperform 16-1 and use fewer parameters.\nIt is encouraging that significant compression is possible with only small losses; several networks perform almost as well as the teacher with considerably fewer parameters \u2013 G(N/8) has an error of 5.06%, close to that of the teacher, but has just over a fifth of the parameters. BG(2, M/8) has less than a tenth of the parameters of the teacher, for a cost of a 1.15% increase in error. Even simply switching all convolutions to smaller, dilated equivalents (S-2x2) allows one to use half the parameters for a similar performance.\nAn important lesson can be learnt regarding grouped + pointwise convolutions. They are often used in their depthwise-separable (Chollet, 2016) form (g = N). However, the networks with half or a quarter that number of groups perform substantially better for a modest increase in parameters. G(N/4) has 363K parameters compared to the 294K of G(N) but has an error that is 1.26% lower. The number of groups is an easy parameter to tune to trade some performance for a smaller network. Grouped + pointwise convolutions also work well in conjunction with a bottleneck of size 2, although for large bottlenecks the error increases significantly, as can be seen for BG(4, M). 
Despite this, it is\nstill of comparable performance to 16-1 with half the parameters. Similar trends are observed for\nCIFAR-100 in Table 2b.\nWe also observe that training a student with attention transfer is substantially better than using\nknowledge distillation, or simply training from scratch. Consider Table 2, which shows the attention\ntransfer errors of Figure 2 (the AT column) alongside those of networks trained with knowledge\ndistillation (KD), and no distillation i.e. from scratch (Scr) for CIFAR-10 and CIFAR-100. In all cases,\nthe student network trained with attention transfer is better than the student network trained by itself\n\u2013 the distillation process appears to be necessary. Some performances are particularly impressive;\non CIFAR-10, for G(2) blocks the error is only 0.08% higher than the teacher despite the network\nhaving 60% of the parameters.\nThese results support our claim that greater model compression through distillation is possible by\nsubstituting the convolutional blocks in a network, rather than by shrinking its architecture. We have\nalso demonstrated that the blocks outlined in Table 1 are suitable substitutes. By observing Figure 2b\nwe can also see that our networks with cheap, substitute blocks utilise fewer mult-add operations\nthan their standard equivalents, which roughly corresponds to a faster runtime. However, it is worth\nnoting that actual runtime on a given platform or device is dependent on speci\ufb01cs (memory paging,\nchoice of libraries etc.), so mult-adds are not always fully indicative of runtime, but are a decent\napproximation in a platform/implementation-agnostic setting.\n\n5 Additional Experiments\n\nSection 4 demonstrates the effectiveness of cheapening convolutions for CIFAR classi\ufb01cation. In this\nsection, we apply this method to two further problems. 
Firstly, in Section 5.1 we examine whether the benefits observed hold for large-scale image classification on ImageNet (Russakovsky et al., 2015), where there are far more classes (1000), and the images are significantly larger. Secondly, in Section 5.2 we cheapen the convolutions of a network trained for semantic segmentation.\n\n5.1 ImageNet\n\nOur experiments use a pre-trained ResNet-34 (He et al., 2016a) (21.8M parameters) as a teacher and we train several networks using attention transfer (AT). We compare student networks that have the architecture of ResNet-34, with cheaper convolutions, to those that have reduced architectures, and full convolutions. Note that the bulk of the parameters in a ResNet are contained in four groups, as opposed to the three of a WideResNet. We train the following student networks: (i) ResNet-18, (ii) ResNet-18 with the channel widths of the last three groups halved (Res18-0.5), and ResNet-34 with each convolutional block replaced by (iii) a G(N) block\u00b2 and (iv) a G(4) block. Validation errors for these networks are available in Table 3.\nConsider Res34-G(N) and Res18-0.5, which both have roughly the same parameter cost (\u223c3M). After distillation, the former has a significantly lower top-5 error (10.66% vs. 15.02%). This again supports our claim that it is preferable to cheapen convolutions, rather than shrink the network architecture. Res34-G(N) trained from scratch has a noticeably higher top-5 error (12.26%); it clearly benefits from distillation. Conversely, distillation makes Res18-0.5 slightly worse, suggesting that it has no further representational capacity.\nRes34-G(4) similarly outperforms Res18 (these are roughly similar in cost at 8.1M and 11.7M parameters respectively), although in this case the latter does benefit from distillation. 
It is intriguing that Res34-G(4) trained from scratch is actually on par with the original teacher (having a 0.12% lower top-1 error, and a 0.05% higher top-5 error) despite having 13 million fewer parameters; this generalisation capability of grouped convolutions in networks has been observed previously by Ioannou et al. (2017). Distillation is able to push its performance slightly further, to the point that its top-5 error surpasses that of the teacher (8.43% vs. 8.57%).\n\n\u00b2As the convolutional blocks in the teacher do not use pre-activations, the G blocks used here are modified accordingly (BN + ReLU now come after each convolution). This also applies to the networks in Section 5.2.\n\nTable 2: Student Network test error on CIFAR-10/100. Each network is a WideResNet with its depth-width (D-W) given in the first column, and with its block type (corresponding to Table 1) in the second. N refers to the channel width of each block, and M refers to the channel width after the bottleneck where applicable. The total parameter cost of the networks for CIFAR-10 is given, as well as the number of mult-add operations they use. Note that CIFAR-100 networks use an extra 11.6K parameters and mult-adds over their CIFAR-10 equivalents as they have a larger linear classification layer. Errors are reported for (i) learning with no distillation i.e. from scratch (Scr), (ii) knowledge distillation with a teacher (KD), and (iii) attention transfer with a teacher (AT). The same teacher is used for training, and is given in the first row. 
This table shows that (i) through attention transfer it is possible to cut the number of parameters of a network, but retain high performance, and (ii) for a similar number of parameters, students with cheap convolutional blocks outperform those with expensive convolutions and smaller architectures.\n\n                                               CIFAR-10               CIFAR-100\nD-W     Block        Params (K)  MAdds (M)   Scr    KD     AT       Scr    KD     AT\nT 40-2  S              2243.5     328.3      4.79   \u2013      \u2013        23.85  \u2013      \u2013\n16-2    S               691.7     101.4      6.53   6.03   5.66     27.63  27.97  27.24\n40-1    S               563.9      83.6      6.48   6.39   5.50     29.64  30.21  28.24\n16-1    S               175.1      26.8      8.81   8.75   7.72     34.00  37.28  33.74\n40-2    S-2x2          1007.1     147.4      5.89   6.03   5.09     27.20  26.98  26.09\n40-2    G(2)           1359.0     198.1      5.30   5.37   4.87     25.94  24.92  24.45\n40-2    G(4)            814.7     118.5      5.50   5.81   5.00     26.20  25.48  25.30\n40-2    G(8)            542.5      78.7      5.92   5.72   5.05     26.49  26.64  25.71\n40-2    G(16)           406.4      58.8      6.65   6.38   5.13     28.85  27.10  26.34\n40-2    G(N/16)         641.3     133.9      5.72   5.72   5.12     27.08  26.11  25.78\n40-2    G(N/8)          455.8      86.4      6.07   5.61   5.06     27.85  27.05  26.15\n40-2    G(N/4)          363.1      62.6      6.93   6.45   5.31     28.91  27.93  26.85\n40-2    G(N/2)          316.7      50.8      7.12   6.83   5.98     30.24  28.89  28.54\n40-2    G(N)            293.5      44.8      8.51   8.01   6.57     31.84  29.99  30.06\n40-2    B(2)            431.8      64.5      6.36   6.28   5.37     28.27  28.08  26.68\n40-2    B(4)            150.9      22.8      7.94   7.83   6.93     31.63  33.63  30.56\n40-2    BG(2,2)         286.7      43.3      6.12   6.25   5.57     28.51  28.82  28.28\n40-2    BG(2,4)         214.1      32.7      6.75   6.75   6.05     29.39  29.25  28.54\n40-2    BG(2,8)         177.8      27.3      6.94   6.98   6.09     30.21  29.34  28.89\n40-2    BG(2,16)        159.7      24.7      6.77   6.97   6.19     30.57  30.54  29.46\n40-2    BG(2,M/16)      238.3      46.8      6.26   6.50   6.02     29.69  28.69  29.05\n40-2    BG(2,M/8)       189.9      34.4      6.75   6.49   5.94     29.09  29.13  28.16\n40-2    BG(2,M/4)       165.7      28.2      7.06   7.15   6.03     30.42  30.28  28.60\n40-2    BG(2,M/2)       153.6      25.1      7.45   7.47   6.17     30.44  30.66  29.51\n40-2    BG(2,M)         147.6      23.6      7.95   7.99   6.67     30.90  31.18  30.03\n40-2    BG(4,M)          81.4      13.0      9.04   8.61   7.87     33.64  37.34  32.89\n\nImplementation Details Models were trained for 100 epochs using SGD with an initial learning rate of 0.1, momentum of 0.9 and weight decay of 10\u207b\u2074. The learning rate was reduced by a factor of 10 every 30 epochs. Minibatches of size 256 were used across 4 GPUs. When trained with a teacher, an additional AT loss was used with the outputs of the four groups of each ResNet. \u03b2 was set to 750 so that the total contribution of the AT loss was the same as in Section 4.\n\n5.2 Semantic Segmentation\n\nWe have shown that cheapening the convolutions of a network, coupled with a good distillation process, has allowed for a substantial reduction in the number of network parameters in return for a small drop in performance. However, the networks trained thus far have all had the same task \u2013 image classification. Here, we take an existing network, trained for the task of semantic segmentation, and apply our method to distil it.\nFor our teacher network we use an ERFNet (Romera et al., 2017a,b) that has been trained from scratch on the Cityscapes dataset (Cordts et al., 2016) \u2013 a collection of images of urban street scenes, in which each pixel has been labelled as one of 19 classes. The bulk of an ERFNet is made up of standard residual blocks where each full convolution has been replaced by a pair of 1D alternatives:\n\nTable 3: Top 1 and Top 5 classification errors (%) on the validation set of ImageNet for models (i) trained from scratch, and (ii) those trained with attention transfer with ResNet-34 (Res34) as a teacher. Res18 refers to a ResNet-18, and Res18-0.5 is a ResNet-18 where the channel width in the last three groups is halved. Res34-G(x) is a ResNet-34 with each convolutional block replaced by a G(x) block. 
We can observe that for a particular parameter budget (3M or ∼10M) the networks with cheap replacement blocks outperform those with reduced architectures. The same trends hold for mult-adds. Note that the Res34 and Res18 scratch results were obtained from pre-trained PyTorch models.

Model       Params  Mult-Adds   Scratch Top 1 / Top 5   AT Top 1 / Top 5
Res34 (T)   21.8M     3.669G        26.73 /  8.57            –  /  –
Res18       11.7M     1.818G        30.36 / 11.02        29.18 / 10.05
Res34-G(4)   8.1M     1.395G        26.61 /  8.62        26.58 /  8.43
Res18-0.5    3.2M      909M         36.96 / 15.01        37.20 / 15.02
Res34-G(N)   3.1M      559M         32.98 / 12.26        30.16 / 10.66

Table 4: IoU accuracy (%) on the validation set of Cityscapes for (i) ERFNet, and (ii) ERFNet with replacement blocks (ERFNet-G(N)). For ERFNet-G(N) we give the accuracy when trained from scratch (Scratch IoU) and when trained as a student with the original ERFNet as a teacher (AT IoU).

Model        Params  Mult-Adds  Scratch IoU  AT IoU
ERFNet       2.06M     3.73G       70.59       –
ERFNet-G(N)  0.49M     1.19G       65.29      68.11

a 3 × 1 convolution followed by a 1 × 3 convolution. The second such pair in each block is often dilated. To cheapen this network for use as a student, we replace each block with a G(N) block, maintaining the dilations where appropriate.
We use the same optimiser and training schedule as for the original ERFNet. When training the student, the only difference is the addition of an attention transfer term (see Equation 2) between several of the feature maps in the final loss. The models are evaluated using class Intersection-over-Union (IoU) accuracy on the validation set, and the results can be found in Table 4.
In Romera et al. (2017b) the authors detail how ERFNet is designed with efficiency in mind.
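To see where such savings come from, the parameter arithmetic can be sketched in a few lines. This is a rough illustration rather than the paper's code: the helper names are ours, biases are ignored, and we assume a G(g) block stands in for a full k × k convolution with a grouped k × k convolution (g groups) followed by a pointwise 1 × 1.

```python
def conv_params(c_in, c_out, k=3, groups=1):
    """Weight count of a k-by-k convolution with `groups` channel groups (no bias)."""
    assert c_in % groups == 0, "input channels must divide evenly into groups"
    return k * k * (c_in // groups) * c_out

def g_block_params(c_in, c_out, g):
    """A G(g)-style substitute: grouped 3x3 convolution followed by a pointwise 1x1."""
    return conv_params(c_in, c_out, 3, groups=g) + conv_params(c_out, c_out, 1)

standard = conv_params(128, 128)             # 9 * 128 * 128 = 147456 weights
cheap = g_block_params(128, 128, 128)        # G(N): 1152 + 16384 = 17536 weights
print(standard, cheap)                       # the grouped block is ~8.4x smaller
```

In a full network the first and last layers, pointwise convolutions, and batch-norm parameters are untouched by the substitution, so the whole-network saving is smaller than the per-block saving — consistent with the roughly 4× parameter reduction for ERFNet-G(N) in Table 4.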
With only one training run and no tuning, we are able to reduce the number of parameters to one quarter of the original for a modest drop in performance.

Implementation Details  Models were trained using the same optimiser, schedule, and image scaling and augmentation as in the ERFNet paper (Romera et al., 2017b), with the attention transfer loss added when ERFNet-G(N) is trained as a student. For encoder training, the outputs of layers 7, 12, and 16 were used for attention transfer with β = 1000. For decoder training, the outputs of layers 19 and 22 were also used and β was dropped to 600 (so that the contribution of this term remains the same).

6 Conclusion

After training a large, deep model it may be prohibitively time-consuming to design a model compression strategy in order to deploy it. On many problems, it may also be more difficult to achieve the desired performance with a smaller model. We have demonstrated a model compression strategy that is fast to apply and requires no additional engineering, for both image classification and semantic segmentation. Furthermore, the optimisation algorithm of the larger model is sufficient to train the cheaper student model.
The cheap convolutions used in this paper were chosen for their ease of implementation. Future work could investigate more complicated approximate operations, such as those described in Moczulski et al. (2016), which could make a difference for the 1 × 1 convolutions in the final layers of a network. One could also make use of custom blocks generated through large-scale black-box optimisation as in Zoph et al. (2018). Equally, there are many methods for low-rank approximation that could be applicable (Sainath et al., 2013; Jaderberg et al., 2014; Garipov et al., 2016). We hope that this work encourages others to consider cheapening their convolutions as a compression strategy.

Acknowledgements.
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 732204 (Bonseyes). This work is supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 16.0159. The opinions expressed and arguments employed herein do not necessarily reflect the official views of these funding bodies. The authors are grateful to Sam Albanie, Luke Darlow, Jack Turner, and the anonymous reviewers for their helpful suggestions.

References

Ba, L. J. and Caruana, R. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, 2014.

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. Model compression. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

Chollet, F. Xception: Deep learning with depthwise separable convolutions. arXiv:1610.02357, 2016.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

Denil, M., Shakibi, B., Dinh, L., Ranzato, M., and de Freitas, N. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, 2013.

Garipov, T., Podoprikhin, D., Novikov, A., and Vetrov, D. P. Ultimate tensorization: compressing convolutional and FC layers alike. arXiv:1611.03214, 2016.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016b.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.

Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., and Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and < 1MB model size. arXiv:1602.07360, 2016.

Ioannou, Y., Robertson, D., Cipolla, R., and Criminisi, A. Deep roots: Improving CNN efficiency with hierarchical filter groups. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.

Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference, 2014.

Jin, J., Dundar, A., and Culurciello, E. Flattened convolutional neural networks for feedforward acceleration. In International Conference on Learning Representations, 2015.

Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.

Li, Z., Wang, X., Lv, X., and Yang, T. SEP-Nets: Small and effective pattern networks. arXiv:1706.03912, 2017.

Moczulski, M., Denil, M., Appleyard, J., and de Freitas, N. ACDC: A structured efficient linear layer. In International Conference on Learning Representations, 2016.

Romera, E., Álvarez, J. M., Bergasa, L. M., and Arroyo, R. Efficient convnet for real-time semantic segmentation. IEEE Intelligent Vehicles Symposium, 2017a.

Romera, E., Álvarez, J. M., Bergasa, L. M., and Arroyo, R. ERFNet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 2017b.

Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. FitNets: Hints for thin deep nets. In International Conference on Learning Representations, 2015.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

Sainath, T. N., Kingsbury, B., Sindhwani, V., Arisoy, E., and Ramabhadran, B. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.

Sifre, L. Rigid-Motion Scattering for Image Classification. PhD thesis, École Polytechnique, 2014.

Urban, G., Geras, K. J., Kahou, S. E., Aslan, O., Wang, S., Caruana, R., Mohamed, A., Philipose, M., and Richardson, M. Do deep convolutional nets really need to be deep and convolutional? In International Conference on Learning Representations, 2017.

Wang, M., Liu, B., and Foroosh, H. Design of efficient convolutional layers using single intra-channel convolution, topological subdivisioning and spatial bottleneck structure. arXiv:1608.04337, 2016.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2016.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In British Machine Vision Conference, 2016.

Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations, 2017.

Zhang, X., Zhou, X., Lin, M., and Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.