{"title": "Learning Versatile Filters for Efficient Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1608, "page_last": 1618, "abstract": "This paper introduces versatile filters to construct efficient convolutional neural network. Considering the demands of efficient deep learning techniques running on cost-effective hardware, a number of methods have been developed to learn compact neural networks. Most of these works aim to slim down filters in different ways, e.g., investigating small, sparse or binarized filters. In contrast, we treat filters from an additive perspective. A series of secondary filters can be derived from a primary filter. These secondary filters all inherit in the primary filter without occupying more storage, but once been unfolded in computation they could significantly enhance the capability of the filter by integrating information extracted from different receptive fields. Besides spatial versatile filters, we additionally investigate versatile filters from the channel perspective. The new techniques are general to upgrade filters in existing CNNs. 
Experimental results on benchmark datasets and neural networks demonstrate that CNNs constructed with our versatile filters are able to achieve comparable accuracy to that of original filters, but require less memory and fewer FLOPs.", "full_text": "Learning Versatile Filters for Efficient Convolutional Neural Networks

Yunhe Wang1, Chang Xu2, Chunjing Xu1, Chao Xu3, Dacheng Tao2
1 Huawei Noah's Ark Lab
2 UBTECH Sydney AI Centre, SIT, FEIT, University of Sydney, Australia
3 Key Lab of Machine Perception (MOE), Cooperative Medianet Innovation Center, School of EECS, Peking University, Beijing, China
yunhe.wang@huawei.com, c.xu@sydney.edu.au, xuchunjing@huawei.com, xuchao@cis.pku.edu.cn, dacheng.tao@sydney.edu.au

Abstract

This paper introduces versatile filters to construct efficient convolutional neural networks. Considering the demand for efficient deep learning techniques running on cost-effective hardware, a number of methods have been developed to learn compact neural networks. Most of these works aim to slim down filters in different ways, e.g., investigating small, sparse or binarized filters. In contrast, we treat filters from an additive perspective. A series of secondary filters can be derived from a primary filter. These secondary filters all inherit from the primary filter without occupying more storage, but once unfolded in computation they can significantly enhance the capability of the filter by integrating information extracted from different receptive fields. Besides spatial versatile filters, we additionally investigate versatile filters from the channel perspective. The new techniques are general and can upgrade filters in existing CNNs. 
Experimental results on benchmark datasets and neural networks demonstrate that CNNs constructed with our versatile filters are able to achieve comparable accuracy to that of original filters, but require less memory and fewer FLOPs.

1 Introduction

Numerous computer vision applications (e.g., image classification [19], object detection [15], subspace clustering [27], and image segmentation [13]) have made remarkable progress with the help of convolutional neural networks (CNNs) in the last decade. Table 1 summarizes profiles of benchmark CNNs on the ILSVRC 2012 dataset [17]. From the pioneering AlexNet [11] to the recent ResNeXt-50 [25], the storage of networks has been somewhat reduced, while the classification accuracy has been continuously improved. This performance improvement comes from carefully designed modules introduced in these networks, e.g., residual modules in ResNet [7] and inception modules in GoogleNet [20]. These networks are widely used in scenarios with abundant computation and storage resources, but they cannot easily be adapted to mobile platforms such as smartphones and cameras. Taking ResNet-50 [7] with 54 convolutional layers as an example, about 97MB of memory is required to store all its filters, and over 4.0 × 10^9 floating-point multiplications have to be performed for a single image.

Over the years, different techniques have been proposed to tackle the contradiction between the resource supply of low-performance devices and the demands of heavy neural networks. One common approach is to explore and eliminate redundancy in pre-trained CNNs. For example, Han et al. [6] discarded subtle weights in convolution filters, Wang et al. [23] investigated redundancy between weights, Figurnov et al. [5] removed redundant connections between input data and filters, Wang et al. 
[22] explored compact feature maps for deep neural networks, and Wen et al. [24] investigated sparsity from several aspects. There are also some methods that approximate the original neural networks by employing more compact structures, e.g., quantization and binarization [1, 14, 3], matrix decomposition [4], and the teacher-student learning paradigm [8, 16]. Instead of patching pre-trained CNNs, some highly efficient network architectures have been designed for applications on mobile devices. For example, ResNeXt [25] aggregated a set of transformations with the same topology, Xception [2] and MobileNet [9] used separable convolutions with 1 × 1 filters, and ShuffleNet [26] encouraged pointwise group convolutions and channel shuffle operations.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Table 1: Properties of benchmark CNN models learned on the ILSVRC 2012 dataset.
Model | Weights | Memory | FLOPs | Top1-err | Top5-err
AlexNet [11] | 6.1 × 10^7 | 232.5MB | 0.7 × 10^9 | 42.9% | 19.8%
VGGNet-16 [19] | 13.8 × 10^7 | 526.4MB | 15.4 × 10^9 | 28.5% | 9.9%
GoogleNet [20] | 0.7 × 10^7 | 26.3MB | 1.5 × 10^9 | 34.2% | 12.9%
ResNet-50 [7] | 2.6 × 10^7 | 97.2MB | 4.1 × 10^9 | 24.7% | 7.8%
ResNeXt-50 [25] | 2.5 × 10^7 | 95.3MB | 4.2 × 10^9 | 22.6% | 6.5%

Most of these existing works learn efficient CNNs by slimming down filters, e.g., making heavy use of smaller filters (e.g., 1 × 1 filters) and developing various (e.g., sparse and low-rank) approximations of filters. Given such lightweight filters, the network performance struggles to keep up, due to the limited capacity of 1 × 1 filters or the approximation error of filters. Rather than subtracting (i.e., slimming down filters), another thing to consider is adding. 
We must ask whether the value of a normal filter has already been maximally explored, and whether a normal filter can take on more roles than usual. In this paper, we propose versatile filters for efficient convolutional neural networks. We produce a series of smaller secondary filters from a primary filter based on some pre-defined rules. These secondary filters inherit weights from the primary filter, but they have different receptive fields and extract features at different scales in the spatial dimension. The neural network is composed of primary filters, while the strength of the network is fully disclosed through the secondary filters in computation. Specifically, we develop versatile filters in both the spatial and channel dimensions. We provide the detailed feed-forward and back-propagation of the proposed versatile filters. Experiments on benchmarks demonstrate that equipping CNNs with our versatile filters can lead to lower memory usage and fewer FLOPs, with comparable network accuracy.

2 Approach

In this section, we illustrate the design of versatile filters, which can be applied to any filter with height and width greater than one. Besides spatial versatile filters, we additionally investigate versatile filters from the channel perspective.

2.1 Spatial Versatile Filters

Consider the input data x ∈ R^{H×W×c}, where H and W are the height and width of the input data, respectively, and c is the channel number, i.e., the number of feature maps generated in the previous layer. A convolution filter is denoted as f ∈ R^{d×d×c}, where d × d is the size of the convolution filter. We focus on square filters, e.g., 5 × 5 and 3 × 3, which are the most widely used in modern CNNs such as ResNet [7], VGGNet [19], ResNeXt [25], and ShuffleNet [26]. 
The conventional convolution can be formulated as

y = f * x, (1)

where * is the convolution operation, y ∈ R^{H′×W′} is the output feature map of x, and H′ and W′ are its height and width, respectively.

Compared with traditional fully connected neural networks, one of the most important advantages of CNNs is that the size (d × d) of filters in a convolutional layer can be much smaller than that (H × W) of the input. For example, 7 × 7 filters in the first layer of ResNet-50 [7] are used to process the 224 × 224 input. Fixing the output size, the complexity of floating-point multiplications of a filter in the fully-connected layer is O(cHW H′W′), while the complexity of a convolution filter is only O(cd²H′W′). In addition, convolution operations extract features from small regions, which is beneficial for subsequent tasks such as recognition and detection.

Figure 1: An illustration of the proposed spatial versatile convolution filter. Given the input data (a), there are four sub-regions (b) covered by a 5 × 5 convolution filter with stride 2, and their convolution results are stacked into a feature map (c). In contrast, a spatial versatile filter is applied three times on each sub-region with different secondary filters, i.e., 5 × 5 blue, 3 × 3 green, and 1 × 1 red in (b), to generate three feature maps (d).

The receptive field is an important concept introduced by convolutions. A larger receptive field allows neurons to detect changes over a wider area, but results in a less precise perception. On the other hand, a smaller receptive field enables neurons to detect fine details. It is therefore reasonable to integrate neurons with larger and smaller receptive fields to extract comprehensive and accurate features. 
For example, inception modules [20] introduce parallel paths with different receptive field sizes by making use of multiple filters of different sizes, e.g., 3 × 3 and 5 × 5 convolutions. Explicitly bringing in filters of different sizes is a straightforward approach to process the input information at different scales, but the significant increase in the storage of these filters could be a new challenge. Most importantly, though filters of different sizes in the same layer have different receptive fields, their receptive fields overlap, which indicates prospective connections between the corresponding filters.

Taking f ∈ R^{d×d} as a primary filter, we propose to derive a series of secondary filters {f_1, f_2, ..., f_s} from f, where s = ⌈d/2⌉. To maximally explore the potential of the primary filter f, each secondary filter f_i is directly inherited from f through a mask M_i,

M_i(p, q, c) = 1, if i ≤ p, q ≤ d + 1 − i; 0, otherwise, (2)

and f_i is calculated as f_i = M_i ∘ f, where ∘ is the element-wise multiplication. More specifically, f_1 is the filter f itself, f_2 discards the outermost circle of parameters in f, and f_s is the innermost circle of parameters in f (i.e., f_s is a 1 × 1 filter given an odd d). Example secondary filters for a 5 × 5 filter can be seen in Figure 1 (b).

By concatenating the convolution responses of these secondary filters, we obtain the feature map

y = [(M_1 ∘ f) * x + b_1, ..., (M_s ∘ f) * x + b_s], s.t. s = ⌈d/2⌉, M_i ∈ {0, 1}^{d×d×c}, (3)

where b_1, ..., b_s are bias parameters.

By embedding Fcn. 3 into conventional CNNs, we can obtain convolution responses simultaneously from s secondary filters with different receptive fields. 
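As a concrete illustration (not the authors' released code), the masks of Fcn. 2 and the concatenation of Fcn. 3 can be sketched in NumPy; the toy sizes, the plain valid-mode convolution, and the function names are our own illustrative choices:

```python
import numpy as np

def masks(d, c):
    """Binary masks M_i of Fcn. 2: M_i keeps entries with i <= p, q <= d + 1 - i."""
    s = (d + 1) // 2                                  # s = ceil(d / 2)
    M = np.zeros((s, d, d, c))
    for i in range(1, s + 1):
        M[i - 1, i - 1:d + 1 - i, i - 1:d + 1 - i, :] = 1.0
    return M

def conv2d_valid(x, f):
    """Plain valid-mode convolution of x (H, W, c) with a single filter f (d, d, c)."""
    H, W, _ = x.shape
    d = f.shape[0]
    out = np.empty((H - d + 1, W - d + 1))
    for r in range(H - d + 1):
        for t in range(W - d + 1):
            out[r, t] = np.sum(x[r:r + d, t:t + d, :] * f)
    return out

def spatial_versatile(x, f, b):
    """Fcn. 3: stack the responses of all secondary filters f_i = M_i * f."""
    d, _, c = f.shape
    return np.stack([conv2d_valid(x, Mi * f) + bi
                     for Mi, bi in zip(masks(d, c), b)], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))                    # toy input
f = rng.standard_normal((5, 5, 3))                    # one 5x5 primary filter
y = spatial_versatile(x, f, np.zeros(3))              # s = 3 feature maps from one filter
```

Every secondary filter here is convolved over the full d × d support with the same stride, so the s response maps align spatially and can be stacked along the channel axis.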
The number of output channels of the proposed versatile filter is s times that of the original filter, and the feature maps of a convolutional layer using the proposed versatile filters contain features at different scales at the same time.

Note that the convolution operations (*) in Fcn. 3 share the same stride and padding parameters for the following two reasons: 1) the dimensionalities of feature maps generated by secondary filters with different receptive fields have to be consistent for the subsequent calculation; 2) the centers of these secondary filters are the same, and the s-dimensional feature is thus a multi-scale representation of a specific pixel of x. The schematic of the proposed versatile filters is shown in Fig. 1, and the detailed back-propagation procedure of the proposed spatial versatile convolution filters can be found in the supplementary materials.

Discussion: Besides the proposed method in Fcn. 3, a naïve approach to aggregate features from multiple secondary filters is

y = Σ_{i=1}^{s} (M_i ∘ f) * x + b, s.t. s = ⌈d/2⌉, M_i ∈ {0, 1}^{d×d×c}, (4)

which calculates the resulting feature map as a linear combination of features from different receptive fields. Since the convolution * is exactly a linear operation, the sum of different convolution responses on the same input can be rewritten as the response of a single combined convolution filter applied to this data, i.e.,

y = Σ_{i=1}^{s} (M_i ∘ f) * x + b = [(Σ_{i=1}^{s} M_i) ∘ f] * x + b. (5)

Therefore, Fcn. 4 is equivalent to adding a fixed weight mask to conventional convolution filters, which cannot produce more meaningful calculations and informative features in practice. 
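The collapse in Fcn. 5 is easy to verify numerically. A minimal sketch (the sizes are our own illustrative choices; linearity of convolution is the only property used):

```python
import numpy as np

rng = np.random.default_rng(1)
d, c = 5, 3
s = (d + 1) // 2                                      # s = ceil(d / 2) = 3
x = rng.standard_normal((10, 10, c))
f = rng.standard_normal((d, d, c))
b = 0.7

# Masks M_i of Fcn. 2: nested 5x5, 3x3 and 1x1 windows of ones.
M = np.zeros((s, d, d, c))
for i in range(1, s + 1):
    M[i - 1, i - 1:d + 1 - i, i - 1:d + 1 - i, :] = 1.0

def conv(x, f):
    """Valid-mode convolution with a single d x d x c filter."""
    H, W, _ = x.shape
    out = np.empty((H - d + 1, W - d + 1))
    for r in range(H - d + 1):
        for t in range(W - d + 1):
            out[r, t] = np.sum(x[r:r + d, t:t + d, :] * f)
    return out

lhs = sum(conv(x, M[i] * f) for i in range(s)) + b    # Fcn. 4: sum of responses
rhs = conv(x, M.sum(axis=0) * f) + b                  # Fcn. 5: one reweighted filter
assert np.allclose(lhs, rhs)
```

Summing the masks merely reweights the concentric rings of f (with weights 1, 2, 3 here), and training can absorb such a fixed reweighting into f itself, which is why this naïve variant brings no gain.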
We will compare the performance of this naïve approach in the experiments.

2.2 Analysis of Spatial Versatile Filters

Compared with original convolution filters, the proposed spatial versatile filters can provide more feature maps without increasing the number of filters. Therefore, we further analyze the memory usage and computation cost of neural networks using the proposed spatial versatile filters.

The proposed spatial versatile convolution operation in Fcn. 3 can generate multiple feature maps using a fixed number of convolution filters. Thus the computational complexity and memory usage of CNNs for extracting the same amount of features can be reduced significantly, as analyzed in Proposition 1.

Proposition 1. Given a convolutional layer for extracting feature maps y ∈ R^{H′×W′×n} using the proposed spatial versatile filters (Fcn. 3), the space complexity of the d × d filters with c channels is O(d²cn/s) and the computational complexity is O(Σ_{i=1}^{s} (d − 2i + 2)²cH′W′n/s).

Proof. Consider the desired feature map y ∈ R^{H′×W′×n}, where H′ and W′ are the height and width of y, respectively. Commonly, we need n convolution filters {f_i}_{i=1}^{n} of size d × d × c. The space complexity of storing these filters is O(d²cn), and the computational complexity of generating y is O(d²cH′W′n). In contrast, the proposed spatial versatile convolution operation can extract s = ⌈d/2⌉ sets of feature maps simultaneously. 
Thus, for generating n feature maps, the space complexity of storing the proposed spatial versatile convolution filters is

O(d²cn/s). (6)

The computational complexity of generating feature maps with the proposed spatial versatile filters varies across scales, as it is affected by the size of the convolution filters, i.e., the number of non-zero elements in each M_i. The number of non-zero elements in M_i is (d − 2i + 2)², as shown in Fcn. 2, so the computational complexity of the i-th scale is O((d − 2i + 2)²cH′W′n/s). Therefore, the computational complexity of the entire layer is

O(Σ_{i=1}^{s} (d − 2i + 2)²cH′W′n/s), (7)

which is definitely smaller than the O(d²cH′W′n) of the traditional convolution operation when s > 2.

Figure 2: An illustration of the proposed channel versatile filters. The original filter can generate only one feature map for the given input data, while the proposed method can provide multiple feature maps simultaneously according to the channel stride parameters. Each color represents a secondary filter and its corresponding feature map.

2.3 Channel Versatile Filters

A spatial versatile filter was proposed in Fcn. 3, which generates a series of secondary convolution filters by adjusting the height and width of a given convolution filter. However, there is still obvious redundancy in these secondary filters, i.e., the number of channels of each convolution filter is much larger than its height and width. In addition, given 1 × 1 primary filters, Fcn. 3 reduces to the conventional convolution operation of Fcn. 1. 
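Returning to Proposition 1, the promised savings can be sanity-checked by direct counting. A small sketch with illustrative layer sizes (assuming, for simplicity, that n is divisible by s):

```python
import math

def layer_costs(d, c, n, H_out, W_out):
    """Parameters and multiplications for producing n feature maps of size
    H_out x W_out: standard d x d x c filters vs. spatial versatile filters."""
    s = math.ceil(d / 2)
    std_params = d * d * c * n
    std_mults = d * d * c * H_out * W_out * n
    # n / s primary filters suffice; the i-th secondary filter has (d - 2i + 2)^2 taps.
    ver_params = d * d * c * (n // s)
    taps = sum((d - 2 * i + 2) ** 2 for i in range(1, s + 1))
    ver_mults = taps * c * H_out * W_out * (n // s)
    return std_params, ver_params, std_mults, ver_mults

# Illustrative 5x5 layer: 64 input channels, 96 output maps of size 28 x 28.
p_std, p_ver, m_std, m_ver = layer_costs(d=5, c=64, n=96, H_out=28, W_out=28)
```

For d = 5 (s = 3), storage drops by 3× and multiplications by 25/((25 + 9 + 1)/3) ≈ 2.1×, matching Fcns. 6 and 7.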
Considering the wide use of 1 × 1 filters in modern CNN architectures such as ShuffleNet [26] and ResNeXt [25], we proceed to develop versatile filters from the channel perspective.

The most important property of convolution filters is that their weights are shared across the input data. A convolution filter typically has the same depth as the input data and slides along the width and height of the input with some stride parameters. If the depth of the input is 512, a 1 × 1 × 512 filter has to perform a large number of floating-point multiplications to weight different channels and integrate the information across different input channels. However, this coarse information summarization over all channels makes it difficult to highlight the characteristics of individual channels, especially when there is a very large number of channels. Hence, we define secondary filters for original convolution filters with the help of a channel stride, i.e.,

y = [f_1 * x + b_1, f_2 * x + b_2, ..., f_n * x + b_n], s.t. ∀ i, f_i ∈ R^{d×d×c}, n = (c − ĉ)/g + 1, (8)

where g is the channel stride parameter and ĉ < c is the number of non-zero channels of the secondary filters. f_i is the i-th unduplicated copy of the primary filter f given the length ĉ and the stride g. Therefore, a filter is used n times simultaneously to generate more feature maps through Fcn. 8. Example secondary filters using the proposed channel stride approach are given in Figure 2. 
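A sketch of the secondary filters of Fcn. 8 in NumPy, following our reading of Figure 2 (the assumption that each secondary filter keeps a contiguous window of ĉ channels shifted by the stride g, and the helper name, are ours):

```python
import numpy as np

def channel_secondary_filters(f, c_hat, g):
    """Fcn. 8: n = (c - c_hat) / g + 1 copies of the primary filter f (d, d, c),
    each keeping only a window of c_hat channels shifted by the stride g."""
    d, _, c = f.shape
    n = (c - c_hat) // g + 1
    fs = np.zeros((n, d, d, c))
    for i in range(n):
        lo = i * g
        fs[i, :, :, lo:lo + c_hat] = f[:, :, lo:lo + c_hat]
    return fs

rng = np.random.default_rng(2)
f = rng.standard_normal((1, 1, 8))                 # a 1x1 primary filter with c = 8
fs = channel_secondary_filters(f, c_hat=7, g=1)    # c - c_hat = 1, g = 1 -> n = 2
```

Each of the n secondary filters is convolved with x as usual, so one stored filter yields n feature maps; with c − ĉ = 1 and g = 1, the filter count of a layer can be roughly halved while keeping a similar number of output channels.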
In addition, the proposed channel versatile filters can also significantly reduce the memory usage and computational complexity of CNNs, which can be derived similarly to Proposition 1.

3 Experiments

In this section, we conduct experiments to validate the effectiveness of the proposed multi-scale convolution filters on several benchmark image datasets, including MNIST [12] and ImageNet (ILSVRC 2012 [17]). Experimental results are analyzed to further understand the benefits of the proposed approach.

3.1 Experiments on MNIST

The MNIST dataset consists of 70,000 images drawn from ten categories, split into 60,000 training and 10,000 testing images. Each sample in this dataset is a 28 × 28 gray-scale digit (from 0 to 9) image. In addition, the last 10,000 images of the training set are selected as the validation set for determining the final model.

Spatial versatile filters: We first tested the performance of the proposed spatial versatile filters in Fcn. 3 using a LeNet for classifying the MNIST dataset, implemented on MatConvNet [21]. The baseline model has four convolutional layers of size 5 × 5 × 1 × 20, 5 × 5 × 20 × 50, 4 × 4 × 50 × 500, and 1 × 1 × 500 × 10, respectively, which accounts for about 1.6MB (filters are stored as 32-bit floating-point values), and its accuracy is 99.20%. Then, several models with different architectures and strategies were trained, and their results are shown in Table 2, where the memory usage of convolution filters and the floating-point multiplications (FLOPs) of each model are also reported.

Versatile-Model 1 is the network using the proposed versatile filters (Fcn. 4) with the same architecture as that of the baseline model. 
Since it does not change the size of the output data, its memory usage and multiplications are also the same as those of the baseline model. Not surprisingly, there is no performance enhancement from this approach, since the network can adjust the parameters in its convolution filters according to the weight mask (Σ_{i=1}^{s} M_i).

Table 2: The performance of the proposed spatial versatile filters on MNIST.
Model | Weights | Memory | FLOPs | Accuracy
Baseline | 4.3 × 10^5 | 1681.6KB | 22.93 × 10^5 | 99.20%
Versatile-Model 1 | 4.3 × 10^5 | 1681.6KB | 22.93 × 10^5 | 99.20%
Versatile-Model 2 | 2.2 × 10^5 | 852.0KB | 12.00 × 10^5 | 99.15%
Versatile-Model 3 | 2.2 × 10^5 | 852.0KB | 12.00 × 10^5 | 99.22%

Versatile-Model 2 and Versatile-Model 3 adopted the proposed versatile convolution operation in Fcn. 3. There are multiple bias terms b_1, ..., b_s in Fcn. 3 for controlling the features generated by the different secondary filters of a versatile convolution filter. The difference between Model 2 and Model 3 is that the bias terms of the convolution filters in Model 3 are shared, i.e., b_1 = ... = b_s, and the gradients of b are also averaged.

The proposed method can generate multiple feature maps using any convolution filter whose size is larger than 2 × 2 (i.e., s = ⌈d/2⌉ > 1), which would increase the number of channels in the next layer and make the convolutional neural network enormous. Therefore, we reduce the number of convolution filters in each layer to make the amount of feature maps in Versatile-Model 2 and Versatile-Model 3 similar to that in the original network, as shown in Table 2. For example, the numbers of filters in the first convolutional layer of the baseline model and Versatile-Model 3 are 20 and 7, respectively. 
However, their output channels are 20 and 21, respectively, since a spatial versatile filter of size 5 × 5 produces three output channels simultaneously.

As can be found in Table 2, Model 3 with shared bias terms obtained a higher result (99.22%), slightly above the baseline model, while Model 2 with independent bias terms did not. The reason is that if the differences between the bias terms are extremely large, the gradients of the secondary convolution filters become fundamentally different, which makes training the entire convolution filter difficult. Model 3 thus slightly outperforms the baseline model while having significantly lower memory usage and FLOPs, which demonstrates the effectiveness of the proposed versatile convolution filters. In addition, the detailed structure of Versatile-Model 3 in Table 2 and the corresponding demo code for verifying the proposed method can be found in our supplementary materials.

Filter Visualization: Convolution filters are used for extracting intrinsic information from natural images, and thus often present specific structures, such as lines, blobs, etc. However, the proposed versatile convolution filters adopt a more complex approach to capture useful information from input images, i.e., a large filter consists of a series of smaller filters, and each of them is applied to the input image to generate feature maps. Therefore, it is necessary to visualize and compare the filters of the original CNN and of the network using the proposed versatile convolution filters for an explicit illustration.

Fig. 3 illustrates the convolution filters in the first layer of the Baseline and of Model 3 in Table 2, respectively. Since the proposed approach is fundamentally different from original convolution filters, filters trained with Fcn. 3 present more complex structures. Specifically, each 3 × 3 area in Fig. 
3 (b) can still be seen as an independent convolution filter with a complex structure and obvious magnitude changes. In contrast, some 3 × 3 areas in Fig. 3 (a) are extremely smooth and cannot provide distinctive information.

(a) Original convolution filters. (b) Versatile convolution filters.
Figure 3: Visualization of example filters learned on MNIST.

Channel versatile filters: After investigating the effectiveness of the proposed spatial versatile convolution operation, we further test the performance of the proposed channel versatile filters described in Fcn. 8, namely versatile v2. Note that we do not apply the channel stride approach to the first and last layers of the networks, since the input channel number of the first layer is usually very small and the output channel number of the last layer is exactly the number of ground-truth labels.

There are two important parameters in Fcn. 8, i.e., the number of non-zero channels ĉ of the convolution filter and the stride g. We then established three models using the proposed versatile filters with different ĉ and g, and trained them on the MNIST dataset as detailed in Table 
3.

Table 3: The performance of the proposed channel versatile filters on MNIST.
Model | c − ĉ | g | Weights | Memory | FLOPs | Accuracy
Baseline | - | - | 4.3 × 10^5 | 1681.6KB | 22.93 × 10^5 | 99.20%
Versatile v2-Model 1 | 1 | 1 | 1.18 × 10^5 | 460.5KB | 12.17 × 10^5 | 99.18%
Versatile v2-Model 2 | 1 | 2 | 1.18 × 10^5 | 460.5KB | 11.17 × 10^5 | 99.15%
Versatile v2-Model 3 | 2 | 1 | 0.79 × 10^5 | 309.1KB | 12.12 × 10^5 | 99.07%

As mentioned above, the channel versatile filters can reduce the number of convolution filters by a factor of n = (c − ĉ)/g + 1; therefore, when we set c − ĉ = 1 and g = 1, we can reduce the number of convolution filters by about half while maintaining a similar amount of feature maps. For example, the size of the second layer's convolution filter bank in Versatile-Model 3 is 5 × 5 × 21 × 17, while that in Versatile v2-Model 1 is 5 × 5 × 21 × 9. As a result, Versatile v2-Model 1 achieved a 99.18% accuracy, slightly lower than that of the baseline model, but its memory usage and FLOPs are reduced significantly.

Similarly, when we set c − ĉ = 1 and g = 2 (i.e., Versatile v2-Model 2), the network obtained results similar to those of Versatile v2-Model 1. Furthermore, with c − ĉ = 2 and g = 1 in Versatile v2-Model 3, the number of convolution filters is further reduced. However, since the number of filters becomes very small, the representation ability of this network is also lower. Versatile v2-Model 3, with the smallest memory usage and FLOPs, obtained a 99.07% classification accuracy. Therefore, we set c − ĉ = 1 and g = 1 in the following experiments for the best trade-off.

3.2 Large Scale Visual Recognition Experiments

The experiments in the above section show that the proposed spatial versatile filters in Fcn. 
3 and the channel versatile filters are able to replace the traditional convolution operation on the MNIST dataset. We next employed the proposed method on an extremely large image dataset, namely the ImageNet ILSVRC 2012 dataset [17], which contains over 1.2M training images and 50k validation images. Three baseline architectures, AlexNet [11], ResNet-50 [7] and ResNeXt-50 [25], were selected for the following experiments. Note that all training settings, such as weight decay and learning rate, used the default values to ensure fair comparisons.

AlexNet: AlexNet is one of the most classical deep CNN models for large scale visual recognition, with over 230MB of parameters and an 80.2% top-5 accuracy on the ImageNet dataset with 1000 different categories. This network has 8 layers with learnable weights, and the sizes of the filters in the first six layers are larger than 1 × 1, i.e., 11 × 11 × 3 × 96, 5 × 5 × 48 × 256, 3 × 3 × 256 × 384, 3 × 3 × 192 × 384, 3 × 3 × 192 × 256, and 6 × 6 × 256 × 4096.

Since the sizes of the convolution filters used in this network are much larger than those in other networks, the resources required by this network can be reduced significantly by exploiting the proposed versatile convolution operation. For example, the versatile parameter of the first convolutional layer is s_1 = ⌈11/2⌉ = 6, so this layer with versatile convolution filters needs only 96/6 = 16 filters, i.e., 11 × 11 × 3 × 16 parameters. In this manner, we established a new network (Versatile-AlexNet in Table 4) and reduced the number of filters in each convolutional layer according to its versatile parameter s. 
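The per-layer reduction follows from simple arithmetic; a short sketch (layer sizes taken from the text above; rounding the filter count up to keep at least the original number of output channels is our assumption):

```python
import math

# (filter size d, original filter count n) for the six large-filter layers of AlexNet.
layers = [(11, 96), (5, 256), (3, 384), (3, 384), (3, 256), (6, 4096)]

reduced = []
for d, n in layers:
    s = math.ceil(d / 2)               # feature maps produced per versatile filter
    reduced.append(math.ceil(n / s))   # filters needed to keep about n output channels
```

This reproduces the widths 16, 86, 192, 192, 128 and 1366 listed for Versatile-AlexNet.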
Specifically, the sizes of the convolution filters in its first six layers are 11 × 11 × 3 × 16, 5 × 5 × 48 × 86, 3 × 3 × 258 × 192, 3 × 3 × 192 × 192, 3 × 3 × 192 × 128, and 6 × 6 × 256 × 1366, respectively. After training the network on the ImageNet dataset, Versatile-AlexNet using Fcn. 3 obtained a 19.5% top5-err and a 42.1% top1-err, which are better than those of the baseline model. The memory usage of the filters was reduced by a factor of 1.76×, and the FLOPs of Versatile-AlexNet are 1.95× fewer than those of the baseline model.

Furthermore, we applied the channel versatile filters (Fcn. 8) to the Versatile-AlexNet model with c − ĉ = 1 and g = 1, namely Versatile v2-AlexNet. In this manner, the number of convolution filters in each layer is reduced by half. As a result, this network achieved a 20.7% top5-err, which is slightly higher than that of the baseline model. 
However, the memory usage of the entire network is only 73.7MB, about 30% of that of the baseline model.

Table 4: Statistics of versatile filters on the ImageNet 2012 dataset.
Model | Weights | Memory | FLOPs | Top1-err | Top5-err
AlexNet [11] | 6.1 × 10^7 | 232.5MB | 0.7 × 10^9 | 42.9% | 19.8%
Versatile-AlexNet | 3.5 × 10^7 | 131.8MB | 0.4 × 10^9 | 42.1% | 19.5%
Versatile v2-AlexNet | 1.9 × 10^7 | 73.7MB | 0.4 × 10^9 | 44.1% | 20.7%
ResNet-50 [7] | 2.6 × 10^7 | 97.2MB | 4.1 × 10^9 | 24.7% | 7.8%
Versatile-ResNet-50 | 1.9 × 10^7 | 75.6MB | 3.2 × 10^9 | 24.5% | 7.6%
Versatile v2-ResNet-50 | 1.1 × 10^7 | 41.7MB | 3.0 × 10^9 | 25.5% | 8.2%
ResNeXt-50 [25] | 2.5 × 10^7 | 95.3MB | 4.2 × 10^9 | 22.6% | 6.5%
Versatile v2-ResNeXt-50 | 1.3 × 10^7 | 50.0MB | 4.0 × 10^9 | 23.8% | 7.0%

ResNets: To further illustrate the superiority of the proposed scheme, we then employed it on the ResNet-50 model. Although there are many layers with 1 × 1 filters in this network, it also has a number of convolutional layers with larger filters, e.g., 3 × 3 and 7 × 7, which account for about half of the memory usage of the entire network. In addition, ResNets introduce shortcut operations, which also provide considerably versatile features, since the receptive fields of neurons in different layers vary, as discussed in [18]. Therefore, it is meaningful to investigate the functionality of the versatile convolution filters on this network.

Similarly, we replaced the original convolutional layers with the proposed versatile convolution filters. For instance, a convolutional layer of size 3 × 3 × 64 × 128 is converted into a new layer of size 3 × 3 × 64 × 32 using the proposed versatile convolution filters. 
The performance of the original ResNet-50 and the networks using versatile filters is detailed in Table 4.
As mentioned above, there are still a considerable number of filters in ResNet whose sizes are larger than 1 × 1. Thus, its memory usage and FLOPs were reduced noticeably by exploiting the proposed versatile convolution filters. Versatile-ResNet-50, with the same number of feature maps, achieved a 7.6% top-5 error rate, slightly lower than that of the baseline model, with only 75.6MB of memory and 3.2 × 10^9 FLOPs.
In addition, Versatile v2-ResNet-50 with the same number of feature maps achieved an 8.2% top-5 error rate, which is slightly higher than that of the baseline model. Its memory usage is only about 41.7MB, roughly half that of the original network. Therefore, Versatile v2-ResNet-50 is a more portable alternative to the original ResNet-50 model.
Moreover, we attempted to replace the original convolution filters with the proposed versatile convolution filters in ResNeXt-50. This network is an enhanced version of ResNet-50, which divides convolutional layers into several smaller groups, achieves higher performance, and avoids large convolution filters. Since more than 90% of the convolution filters in this network are 1 × 1 filters, Fcn. 3 cannot obtain an obvious enhancement. However, the proposed channel versatile scheme in Fcn. 8 can effectively reduce the number of these massive 1 × 1 convolution filters. Thus, we directly applied the versatile v2 convolution filters with the channel stride approach to it. After applying the proposed versatile convolution filters to the ResNeXt-50 model, we obtained a 7.0% top-5 classification error rate, which is slightly higher than that of its baseline model, with only about half the memory usage.
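The channel-stride idea for 1 × 1 filters can be illustrated with a short sketch (our reading of Fcn. 8, not the authors' code): a stored filter spans fewer channels than the input, and sliding it along the channel axis with stride g yields several feature maps per stored filter, which is why the 1 × 1-heavy ResNeXt-50 benefits from it. Names and shapes here are illustrative assumptions.

```python
import numpy as np

def channel_versatile_1x1(x, w, g=1):
    """x: input of shape (C_in, H, W); w: weights of a single 1 x 1 filter
    spanning only C <= C_in channels. Slide w along the channel axis with
    stride g; each offset yields one feature map, so one stored filter
    produces (C_in - C) // g + 1 feature maps instead of just one."""
    c_in, c = x.shape[0], w.shape[0]
    n = (c_in - c) // g + 1
    # Each offset is a per-pixel dot product over a window of c channels.
    return [np.tensordot(w, x[i * g:i * g + c], axes=1) for i in range(n)]
```

With ĉ − c = 1 and g = 1, the setting used for the Versatile v2 models above, each stored filter yields two feature maps, which is where the roughly 2× reduction in 1 × 1 filter count comes from.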
The detailed statistics of Versatile v2-ResNeXt-50 using the proposed versatile filters are also shown in Table 4. Comparing Versatile v2-ResNet-50 and Versatile v2-ResNeXt-50, we find that the memory usage of Versatile v2-ResNet-50 is lower than that of Versatile v2-ResNeXt-50. This is because the proposed versatile filters can effectively reduce the memory and FLOPs of filters whose sizes are larger than 1 × 1, which provides a more flexible way to design CNNs with high performance and portable architectures.

3.3 Comparing with Portable Architectures

Besides sophisticated CNNs with heavy architectures such as AlexNet and ResNet-50, a variety of recent works attempt to design neural networks with portable architectures and comparable performance. MobileNet [9] utilized separable convolutions to reduce the memory usage and computational cost of massive large convolution filters. ShuffleNet [26] further proposed a shuffle operation to mix features in different groups and achieved better results.

Table 5: An overall comparison of state-of-the-art portable CNNs on the ILSVRC2012 dataset.

Model                               | Weights    | Memory | FLOPs      | Top-1 err
1.0 MobileNet-224 [9]               | 0.4 × 10^7 | 16.0MB | 0.5 × 10^9 | 29.4%
ShuffleNet 2× (g = 3) [26]          | 0.7 × 10^7 | 20.6MB | 0.5 × 10^9 | 26.3%
Versatile v2-ShuffleNet 2× (g = 3)  | 0.4 × 10^7 | 14.0MB | 0.5 × 10^9 | 27.6%

Table 5 summarizes state-of-the-art portable CNN architectures, including their memory usage, FLOPs, and recognition results on the ILSVRC 2012 dataset. MobileNet has the smallest weight count and FLOPs, but its classification accuracy is lower than those of the other networks. ShuffleNet, with FLOPs similar to those of MobileNet, achieves higher accuracy at a slightly higher memory usage.
By exploiting the proposed versatile convolution filters in ShuffleNet 2× (g = 3), we reduced the number of convolution filter weights by more than 30% and achieved the smallest model size with comparable accuracy, yielding a more portable convolutional neural network.
In addition, to investigate the generalization ability of the proposed versatile convolution filters, we further employed them in a single-image super-resolution experiment. We selected VDSR (Very Deep CNN for Image Super-resolution [10]) as the baseline model, and Versatile-VDSR, with the same number of feature maps but lower memory usage and fewer FLOPs, achieved higher performance. Detailed experiments and analysis can be found in the supplementary materials.

4 Conclusions and Discussions

Exploring convolutional neural networks with low memory usage and computational complexity is essential for deploying these models on mobile devices. In fact, the main waste in a conventional neural network is that a convolution filter with massive parameters produces only one feature map for a given input. To make full use of convolution filters, this paper proposes versatile convolution filters from both spatial and channel perspectives, so that fewer parameters generate the same number of useful features at a lower computational complexity. Experiments conducted on benchmark image datasets and models show that the proposed method can not only reduce the storage and computational requirements, but also enhance the performance of CNNs, which makes it very effective for establishing portable CNNs with high accuracy.
In addition, the proposed method can be easily implemented using existing convolution components, and we will further embed it into other applications such as object detection and image segmentation.
Acknowledgments: This work was supported in part by the ARC DE-180101438, FL-170100117, DP-180103424, and NSFC under Grants 61876007 and 61872012. We also thank Huawei Hisilicon for their technical support.

References

[1] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In ICML, 2014.
[2] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[3] Matthieu Courbariaux and Yoshua Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[4] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
[5] Michael Figurnov, Dmitry Vetrov, and Pushmeet Kohli. Perforatedcnns: Acceleration through elimination of redundant convolutions. In NIPS, 2016.
[6] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[8] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[9] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[10] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee.
Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[14] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016.
[15] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[16] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[18] Pierre Sermanet and Yann LeCun. Traffic sign recognition with multi-scale convolutional networks. In IJCNN, 2011.
[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[21] Andrea Vedaldi and Karel Lenc. Matconvnet: Convolutional neural networks for matlab.
In Proceedings of the 23rd ACM International Conference on Multimedia, 2015.
[22] Yunhe Wang, Chang Xu, Dacheng Tao, and Chao Xu. Beyond filters: Compact feature map for portable deep model. In ICML, 2017.
[23] Yunhe Wang, Chang Xu, Shan You, Dacheng Tao, and Chao Xu. Cnnpack: Packing convolutional neural networks in the frequency domain. In NIPS, 2016.
[24] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[25] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[26] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
[27] Pan Zhou, Yunqing Hou, and Jiashi Feng. Deep adversarial subspace clustering. In CVPR, 2018.