{"title": "ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions", "book": "Advances in Neural Information Processing Systems", "page_first": 5197, "page_last": 5205, "abstract": "Convolutional neural networks (CNNs) have shown great capability of solving various artificial intelligence tasks. However, the increasing model size has raised challenges in employing them in resource-limited applications. In this work, we propose to compress deep models by using channel-wise convolutions, which replace dense connections among feature maps with sparse ones in CNNs. Based on this novel operation, we build light-weight CNNs known as ChannelNets. ChannelNets use three instances of channel-wise convolutions; namely group channel-wise convolutions, depth-wise separable channel-wise convolutions, and the convolutional classification layer. Compared to prior CNNs designed for mobile devices, ChannelNets achieve a significant reduction in terms of the number of parameters and computational cost without loss in accuracy. Notably, our work represents the first attempt to compress the fully-connected classification layer, which usually accounts for about 25% of total parameters in compact CNNs. Experimental results on the ImageNet dataset demonstrate that ChannelNets achieve consistently better performance compared to prior methods.", "full_text": "ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions

Hongyang Gao
Texas A&M University
College Station, TX
hongyang.gao@tamu.edu

Zhengyang Wang
Texas A&M University
College Station, TX
zhengyang.wang@tamu.edu

Shuiwang Ji
Texas A&M University
College Station, TX
sji@tamu.edu

Abstract

Convolutional neural networks (CNNs) have shown great capability of solving various artificial intelligence tasks. However, the increasing model size has raised challenges in employing them in resource-limited applications. 
In this work, we propose to compress deep models by using channel-wise convolutions, which replace dense connections among feature maps with sparse ones in CNNs. Based on this novel operation, we build light-weight CNNs known as ChannelNets. ChannelNets use three instances of channel-wise convolutions; namely group channel-wise convolutions, depth-wise separable channel-wise convolutions, and the convolutional classification layer. Compared to prior CNNs designed for mobile devices, ChannelNets achieve a significant reduction in terms of the number of parameters and computational cost without loss in accuracy. Notably, our work represents the first attempt to compress the fully-connected classification layer, which usually accounts for about 25% of total parameters in compact CNNs. Experimental results on the ImageNet dataset demonstrate that ChannelNets achieve consistently better performance compared to prior methods.

1 Introduction

Convolutional neural networks (CNNs) have demonstrated great capability of solving visual recognition tasks. Since AlexNet [11] achieved remarkable success on the ImageNet Challenge [3], various deeper and more complicated networks [19, 21, 5] have been proposed to set the performance records. However, the higher accuracy usually comes with an increasing amount of parameters and computational cost. For example, the VGG16 [19] has 128 million parameters and requires 15,300 million floating point operations (FLOPs) to classify an image. In many real-world applications, predictions need to be performed on resource-limited platforms such as sensors and mobile phones, thereby requiring compact models with higher speed. 
Model compression aims at exploring a tradeoff between accuracy and efficiency.

Recently, significant progress has been made in the field of model compression [7, 15, 23, 6, 24]. The strategies for building compact and efficient CNNs can be divided into two categories; those are, compressing pre-trained networks or designing new compact architectures that are trained from scratch. Studies in the former category were mostly based on traditional compression techniques such as product quantization [23], pruning [17], hashing [1], Huffman coding [4], and factorization [12, 9].

The second category has already been explored before model compression. Inspired by the Network-In-Network architecture [14], GoogLeNet [21] included the Inception module to build deeper networks without increasing model sizes and computational cost. Through factorizing convolutions, the Inception module was further improved by [22]. The depth-wise separable convolution, proposed in [18], generalized the factorization idea and decomposed the convolution into a depth-wise convolution and a 1 × 1 convolution. The operation has been shown to be able to achieve competitive results with fewer parameters. In terms of model compression, MobileNets [6] and ShuffleNets [24] designed CNNs for mobile devices by employing depth-wise separable convolutions.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Illustrations of different compact convolutions. Part (a) shows the depth-wise separable convolution, which is composed of a depth-wise convolution and a 1 × 1 convolution. Part (b) shows the case where the 1 × 1 convolution is replaced by a 1 × 1 group convolution. Part (c) illustrates the use of the proposed group channel-wise convolution for information fusion. Part (d) shows the proposed depth-wise separable channel-wise convolution, which consists of a depth-wise convolution and a channel-wise convolution. For channel-wise convolutions in (c) and (d), the same color represents shared weights.

In this work, we focus on the second category and build a new family of light-weight CNNs known as ChannelNets. By observing that the fully-connected pattern accounts for most parameters in CNNs, we propose channel-wise convolutions, which are used to replace dense connections among feature maps with sparse ones. Early work like LeNet-5 [13] has shown that sparsely-connected networks work well when resources are limited. To apply channel-wise convolutions in model compression, we develop group channel-wise convolutions, depth-wise separable channel-wise convolutions, and the convolutional classification layer. They are used to compress different parts of CNNs, leading to our ChannelNets. ChannelNets achieve a better trade-off between efficiency and accuracy than prior compact CNNs, as demonstrated by experimental results on the ImageNet ILSVRC 2012 dataset. It is worth noting that ChannelNets are the first models that attempt to compress the fully-connected classification layer, which accounts for about 25% of total parameters in compact CNNs.

2 Background and Motivations

The trainable layers of CNNs are commonly composed of convolutional layers and fully-connected layers. Most prior studies, such as MobileNets [6] and ShuffleNets [24], focused on compressing convolutional layers, where most parameters and computation lie. To make the discussion concrete, suppose a 2-D convolutional operation takes m feature maps with a spatial size of df × df as inputs, and outputs n feature maps of the same spatial size with appropriate padding. m and n are also known as the number of input and output channels, respectively. The convolutional kernel size is dk × dk and the stride is set to 1. 
Here, without loss of generality, we use square feature maps and convolutional kernels for simplicity. We further assume that there is no bias term in the convolutional operation, as modern CNNs employ batch normalization [8] with a bias after the convolution. In this case, the number of parameters in the convolution is dk × dk × m × n and the computational cost in terms of FLOPs is dk × dk × m × n × df × df. Since the convolutional kernel is shared for each spatial location, for any pair of input and output feature maps, the connections are sparse and weighted by dk × dk shared parameters. However, the connections among channels follow a fully-connected pattern, i.e., all m input channels are connected to all n output channels, which results in the m × n term. For deep convolutional layers, m and n are usually large numbers like 512 and 1024, thus m × n is usually very large.

Based on the above insights, one way to reduce the size and cost of convolutions is to circumvent the multiplication between dk × dk and m × n. MobileNets [6] applied this approach to explore compact deep models for mobile devices. The core operation employed in MobileNets is the depth-wise separable convolution [2], which consists of a depth-wise convolution and a 1 × 1 convolution, as illustrated in Figure 1(a). The depth-wise convolution applies a single convolutional kernel independently for each input feature map, thus generating the same number of output channels. The following 1 × 1 convolution is used to fuse the information of all output channels using a linear combination. The depth-wise separable convolution actually decomposes the regular convolution into a depth-wise convolution step and a channel-wise fuse step. 
Through this decomposition, the number of parameters becomes

dk × dk × m + m × n, (1)

and the computational cost becomes

dk × dk × m × df × df + m × n × df × df. (2)

In both equations, the first term corresponds to the depth-wise convolution and the second term corresponds to the 1 × 1 convolution. By decoupling dk × dk and m × n, the amounts of parameters and computations are reduced.

While MobileNets successfully employed depth-wise separable convolutions to perform model compression and achieve competitive results, it is noted that the m × n term still dominates the number of parameters in the models. As pointed out in [6], 1 × 1 convolutions, which lead to the m × n term, account for 74.59% of total parameters in MobileNets. The analysis of regular convolutions reveals that m × n comes from the fully-connected pattern, which is also the case in 1 × 1 convolutions. To understand this, first consider the special case where df = 1. Now the inputs are m units as each feature map has only one unit. As the convolutional kernel size is 1 × 1, which does not change the spatial size of feature maps, the outputs are also n units. It is clear that the operation between the m input units and the n output units is a fully-connected operation with m × n parameters. When df > 1, the fully-connected operation is shared for each spatial location, leading to the 1 × 1 convolution. Hence, the 1 × 1 convolution actually outputs a linear combination of input feature maps. More importantly, in terms of connections between input and output channels, both the regular convolution and the depth-wise separable convolution follow the fully-connected pattern. As a result, a better strategy to compress convolutions is to change the dense connection pattern between input and output channels. 
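The per-layer arithmetic above can be checked with a short script (a minimal sketch in plain Python; the sizes dk = 3, m = 512, n = 1024, df = 14 are illustrative and not taken from any specific model):

```python
# Back-of-the-envelope check of Eqs. 1 and 2 with hypothetical layer sizes.
dk, m, n, df = 3, 512, 1024, 14

# Regular convolution: dk * dk is multiplied by the fully-connected m * n term.
regular_params = dk * dk * m * n
regular_flops = dk * dk * m * n * df * df

# Depth-wise separable convolution: dk * dk and m * n are decoupled.
separable_params = dk * dk * m + m * n                 # Eq. 1
separable_flops = (dk * dk * m + m * n) * df * df      # Eq. 2

print(regular_params, separable_params)  # 4718592 vs. 528896, roughly a 9x saving
print(m * n / separable_params)          # ~0.99: the 1 x 1 term still dominates
```

With these sizes the m × n term contributes over 99% of the separable layer's parameters, consistent with the observation that 1 × 1 convolutions dominate MobileNets.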
Based on the depth-wise separable convolution, it is equivalent to circumventing the 1 × 1 convolution. A simple method, previously used in AlexNet [11], is the group convolution. Specifically, the m input channels are divided into g mutually exclusive groups. Each group goes through a 1 × 1 convolution independently and produces n/g output feature maps. It follows that there are still n output channels in total. For simplicity, suppose both m and n are divisible by g. As the 1 × 1 convolution for each group requires 1/g^2 parameters and FLOPs, the total amount after grouping is only 1/g as compared to the original 1 × 1 convolution. Figure 1(b) describes a 1 × 1 group convolution where the number of groups is 2.

However, the grouping operation usually compromises performance because there is no interaction among groups. As a result, information of feature maps in different groups is not combined, as opposed to the original 1 × 1 convolution that combines information of all input channels. To address this limitation, ShuffleNet [24] was proposed, where a shuffling layer was employed after the 1 × 1 group convolution. Through random permutation, the shuffling layer partly achieves interactions among groups. But any output group accesses only m/g input feature maps and thus collects partial information. Due to this reason, ShuffleNet had to employ a deeper architecture than MobileNets to achieve competitive results.

3 Channel-Wise Convolutions and ChannelNets

In this work, we propose channel-wise convolutions in Section 3.1, based on which we build our ChannelNets. In Section 3.2, we apply group channel-wise convolutions to address the information inconsistency problem caused by grouping. Afterwards, we generalize our method in Section 3.3, which leads to a direct replacement of depth-wise separable convolutions in deeper layers. 
Through analysis of the generalized method, we propose a convolutional classification layer to replace the fully-connected output layer in Section 3.4, which further reduces the amounts of parameters and computations. Finally, Section 3.5 introduces the architecture of our ChannelNets.

3.1 Channel-Wise Convolutions

We begin with the definition of channel-wise convolutions in general. As discussed above, the 1 × 1 convolution is equivalent to using a shared fully-connected operation to scan all df × df locations of input feature maps. A channel-wise convolution employs a shared 1-D convolutional operation, instead of the fully-connected operation. Consequently, the connection pattern between input and output channels becomes sparse, where each output feature map is connected to a part of input feature maps. To be specific, we again start with the special case where df = 1. The m input units (feature maps) can be considered as a 1-D feature map of size m. Similarly, the output becomes a 1-D feature map of size n. Note that both the input and output have only 1 channel. The channel-wise convolution performs a 1-D convolution with appropriate padding to map the m units to the n units. In the cases where df > 1, the same 1-D convolution is computed for every spatial location. As a result, the number of parameters in a channel-wise convolution with a kernel size of dc is simply dc and the computational cost is dc × n × df × df. By employing sparse connections, we avoid the m × n term. Therefore, channel-wise convolutions consume a negligible amount of computations and can be performed efficiently.

3.2 Group Channel-Wise Convolutions

We apply channel-wise convolutions to develop a solution to the information inconsistency problem incurred by grouping. After the 1 × 1 group convolution, the outputs are g groups, each of which includes n/g feature maps. 
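Before turning to the fusion layer, the basic channel-wise convolution of Section 3.1 can be sketched for the df = 1 case in a few lines of plain Python (a minimal illustration; padding is omitted and the kernel values are arbitrary):

```python
def channel_wise_conv(x, kernel, stride=1):
    """Shared 1-D convolution over the channel dimension (df = 1 case).

    x: m input units (one per feature map); kernel: dc shared weights.
    Without padding, the output has (m - dc) // stride + 1 units.
    """
    dc = len(kernel)
    return [sum(w * v for w, v in zip(kernel, x[i:i + dc]))
            for i in range(0, len(x) - dc + 1, stride)]

m = 8
x = list(range(m))  # m input units
out = channel_wise_conv(x, [0.25, 0.5, 0.25])
print(len(out))  # 6 outputs, using only dc = 3 parameters instead of m * n
```

The parameter count is dc regardless of how many input and output channels are involved, which is the source of the savings discussed above.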
As illustrated in Figure 1(b), the g groups are computed independently from completely separate groups of input feature maps. To enable interactions among groups, an efficient information fusion layer is needed after the 1 × 1 group convolution. The fusion layer is expected to retain the grouping for following group convolutions while allowing each group to collect information from all the groups. Concretely, both inputs and outputs of this layer should be n feature maps that are divided into g groups. Meanwhile, the n/g output channels in any group should be computed from all the n input channels. More importantly, the layer must be compact and efficient; otherwise the advantage of grouping will be compromised.

Based on channel-wise convolutions, we propose the group channel-wise convolution, which serves elegantly as the fusion layer. Given n input feature maps that are divided into g groups, this operation performs g independent channel-wise convolutions. Each channel-wise convolution uses a stride of g and outputs n/g feature maps with appropriate padding. Note that, in order to ensure all n input channels are involved in the computation of any output group of channels, the kernel size of channel-wise convolutions needs to satisfy dc ≥ g. The desired outputs of the fusion layer are obtained by concatenating the outputs of these channel-wise convolutions. Figure 1(c) provides an example of using the group channel-wise convolution after the 1 × 1 group convolution, which replaces the original 1 × 1 convolution.

To see the efficiency of this approach, the number of parameters of the 1 × 1 group convolution followed by the group channel-wise convolution is (m/g) × (n/g) × g + dc × g, and the computational cost is (m/g) × (n/g) × df × df × g + dc × (n/g) × df × df × g. Since in most cases we have dc ≪ m, our approach requires approximately 1/g training parameters and FLOPs, as compared to the second terms in Eqs. 1 and 2.

3.3 Depth-Wise Separable Channel-Wise Convolutions

Based on the above descriptions, it is worth noting that there is a special case where the number of groups and the number of input and output channels are equal, i.e., g = m = n. A similar scenario resulted in the development of depth-wise convolutions [6, 2]. In this case, there is only one feature map in each group. The 1 × 1 group convolution simply scales the convolutional kernels in the depth-wise convolution. As the batch normalization [8] in each layer already involves a scaling term, the 1 × 1 group convolution becomes redundant and can be removed. Meanwhile, instead of using m independent channel-wise convolutions with a stride of m as the fusion layer, we apply a single channel-wise convolution with a stride of 1. Due to the removal of the 1 × 1 group convolution, the channel-wise convolution directly follows the depth-wise convolution, resulting in the depth-wise separable channel-wise convolution, as illustrated in Figure 1(d).

In essence, the depth-wise separable channel-wise convolution replaces the 1 × 1 convolution in the depth-wise separable convolution with the channel-wise convolution. The connections among channels are changed directly from a dense pattern to a sparse one. As a result, the number of parameters is dk × dk × m + dc, and the cost is dk × dk × m × df × df + dc × n × df × df, which saves dramatic amounts of parameters and computations. 
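As a sanity check on these counts, the short sketch below (plain Python) compares the two operations; the sizes dk = 3, m = 512, n = 1024, df = 14 are illustrative, while dc = 64 matches the kernel size used in Section 4.2:

```python
dk, m, n, df, dc = 3, 512, 1024, 14, 64

# Depth-wise separable convolution (Eqs. 1 and 2): the m * n term dominates.
dws_params = dk * dk * m + m * n
dws_flops = dk * dk * m * df * df + m * n * df * df

# Depth-wise separable channel-wise convolution: m * n is replaced by dc.
dwscw_params = dk * dk * m + dc
dwscw_flops = dk * dk * m * df * df + dc * n * df * df

print(dws_params, dwscw_params)  # 528896 vs. 4672: over 100x fewer parameters
```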
This layer can be used to directly replace the depth-wise separable convolution.

3.4 Convolutional Classification Layer

Figure 2: An illustration of the convolutional classification layer. The left part describes the original output layers, i.e., a global average pooling layer and a fully-connected classification layer. The global pooling layer reduces the spatial size df × df to 1 × 1 while keeping the number of channels. Then the fully-connected classification layer changes the number of channels from m to n, where n is the number of classes. The right part illustrates the proposed convolutional classification layer, which performs a single 3-D convolution with a kernel size of df × df × (m − n + 1) and no padding. The convolutional classification layer saves a significant amount of parameters and computation.

Most prior model compression methods pay little attention to the very last layer of CNNs, which is a fully-connected layer used to generate classification results. Taking MobileNets on the ImageNet dataset as an example, this layer uses a 1,024-component feature vector as inputs and produces 1,000 logits corresponding to 1,000 classes. Therefore, the number of parameters is 1,024 × 1,000 ≈ 1 million, which accounts for 24.33% of total parameters as reported in [6]. In this section, we explore a special application of the depth-wise separable channel-wise convolution, proposed in Section 3.3, to reduce the large amount of parameters in the classification layer.

We note that the second-to-the-last layer is usually a global average pooling layer, which reduces the spatial size of feature maps to 1. 
For example, in MobileNets, the global average pooling layer transforms 1,024 input feature maps of size 7 × 7 into 1,024 output feature maps of size 1 × 1, corresponding to the 1,024-component feature vector fed into the classification layer. In general, suppose the spatial size of input feature maps is df × df. The global average pooling layer is equivalent to a special depth-wise convolution with a kernel size of df × df, where the weights in the kernel are fixed to 1/df^2. Meanwhile, the following fully-connected layer can be considered as a 1 × 1 convolution as the input feature vector can be viewed as 1 × 1 feature maps. Thus, the global average pooling layer followed by the fully-connected classification layer is a special depth-wise convolution followed by a 1 × 1 convolution, resulting in a special depth-wise separable convolution.

As the proposed depth-wise separable channel-wise convolution can directly replace the depth-wise separable convolution, we attempt to apply the replacement here. Specifically, the same special depth-wise convolution is employed, but is followed by a channel-wise convolution with a kernel size of dc whose number of output channels is equal to the number of classes. However, we observe that such an operation can be further combined using a regular 3-D convolution [10].

In particular, the m input feature maps of size df × df can be viewed as a single 3-D feature map with a size of df × df × m. The special depth-wise convolution, or equivalently the global average pooling layer, is essentially a 3-D convolution with a kernel size of df × df × 1, where the weights in the kernel are fixed to 1/df^2. Moreover, in this view, the channel-wise convolution is a 3-D convolution with a kernel size of 1 × 1 × dc. 
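A minimal numerical sketch of this two-step view (plain Python with random inputs; the sizes m = 1024, n = 1000, df = 7 follow the MobileNet example above, and the weights are arbitrary):

```python
import random

m, n, df = 1024, 1000, 7
feature_maps = [[[random.random() for _ in range(df)] for _ in range(df)]
                for _ in range(m)]

# Step 1: global average pooling, i.e. a df x df x 1 convolution whose
# weights are fixed to 1 / df**2, producing one value per channel.
pooled = [sum(sum(row) for row in fm) / df ** 2 for fm in feature_maps]

# Step 2: a 1 x 1 x dc channel-wise convolution with no padding maps the
# m pooled values to m - dc + 1 logits.
dc = 25
w = [random.random() for _ in range(dc)]
logits = [sum(w[j] * pooled[i + j] for j in range(dc))
          for i in range(m - dc + 1)]
print(len(logits))  # 1000: one logit per class
```

The valid 1-D convolution emits m − dc + 1 values, so with m = 1024 the choice dc = 25 yields exactly 1,000 outputs here.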
These two consecutive 3-D convolutions follow a factorized pattern. As proposed in [22], a dk × dk convolution can be factorized into two consecutive convolutions with kernel sizes of dk × 1 and 1 × dk, respectively. Based on this factorization, we combine the two 3-D convolutions into a single one with a kernel size of df × df × dc. Suppose there are n classes. To ensure that the number of output channels equals the number of classes, dc is set to (m − n + 1) with no padding on the input. This 3-D convolution is used to replace the global average pooling layer followed by the fully-connected layer, serving as a convolutional classification layer.

While the convolutional classification layer dramatically reduces the number of parameters, there is a concern that it may cause a significant loss in performance. In the fully-connected classification layer, each prediction is based on the entire feature vector by taking all features into consideration. In contrast, in the convolutional classification layer, the prediction of each class uses only (m − n + 1) features. However, our experiments show that the weight matrix of the fully-connected classification layer is very sparse, indicating that only a small number of features contribute to the prediction of a class. Meanwhile, our ChannelNets with the convolutional classification layer achieve much better results than other models with similar amounts of parameters.

3.5 ChannelNets

Figure 3: Illustrations of the group module (GM) and the group channel-wise module (GCWM). Part (a) shows GM, which has two depth-wise separable convolutional layers. Note that 1 × 1 convolutions are replaced by 1 × 1 group convolutions to save computations. A skip connection is added to facilitate model training. GCWM is described in part (b). Compared to GM, it has a group channel-wise convolution to fuse information from different groups.

With the proposed group channel-wise convolutions, the depth-wise separable channel-wise convolutions, and the convolutional classification layer, we build our ChannelNets. We follow the basic architecture of MobileNets to allow fair comparison and design three ChannelNets with different compression levels. Notably, our proposed methods are orthogonal to the work of MobileNetV2 [16]. Similar to MobileNets, we can apply our methods to MobileNetV2 to further reduce the parameters and computational cost. The details of network architectures are shown in Table 4 in the supplementary material.

ChannelNet-v1: To employ the group channel-wise convolutions, we design two basic modules; those are, the group module (GM) and the group channel-wise module (GCWM). They are illustrated in Figure 3. GM simply applies 1 × 1 group convolution instead of 1 × 1 convolution and adds a residual connection [5]. As analyzed above, GM saves computations but suffers from the information inconsistency problem. GCWM addresses this limitation by inserting a group channel-wise convolution after the second 1 × 1 group convolution to achieve information fusion. Either module can be used to replace two consecutive depth-wise separable convolutional layers in MobileNets. In our ChannelNet-v1, we choose to replace depth-wise separable convolutions with larger numbers of input and output channels. Specifically, six consecutive depth-wise separable convolutional layers with 512 input and output channels are replaced by two GCWMs followed by one GM. 
In these modules, we set the number of groups to 2. The total number of parameters in ChannelNet-v1 is about 3.7 million.

ChannelNet-v2: We apply the depth-wise separable channel-wise convolutions on ChannelNet-v1 to further compress the network. The last depth-wise separable convolutional layer has 512 input channels and 1,024 output channels. We use the depth-wise separable channel-wise convolution to replace this layer, leading to ChannelNet-v2. The number of parameters reduced by this replacement of a single layer is 1 million, which accounts for about 25% of total parameters in ChannelNet-v1.

ChannelNet-v3: We employ the convolutional classification layer on ChannelNet-v2 to obtain ChannelNet-v3. For the ImageNet image classification task, the number of classes is 1,000, which means the number of parameters in the fully-connected classification layer is 1024 × 1000 ≈ 1 million. Since the number of parameters for the convolutional classification layer is only 7 × 7 × 25 ≈ 1 thousand, ChannelNet-v3 reduces the parameter count by approximately 1 million.

4 Experimental Studies

In this section, we evaluate the proposed ChannelNets on the ImageNet ILSVRC 2012 image classification dataset [3], which has served as the benchmark for model compression. We compare different versions of ChannelNets with other compact CNNs. Ablation studies are also conducted to show the effect of group channel-wise convolutions. In addition, we perform an experiment to demonstrate the sparsity of weights in the fully-connected classification layer.

4.1 Dataset

The ImageNet ILSVRC 2012 dataset contains 1.2 million training images and 50 thousand validation images. Each image is labeled by one of 1,000 classes. We follow the same data augmentation process in [5]. Images are scaled to 256 × 256. Randomly cropped patches with a size of 224 × 224 are used for training. 
During inference, 224 × 224 center crops are fed into the networks. To compare with other compact CNNs [6, 24], we train our models using training images and report accuracies computed on the validation set, since the labels of test images are not publicly available.

4.2 Experimental Setup

We train our ChannelNets using the same settings as those for MobileNets except for a minor change. For depth-wise separable convolutions, we remove the batch normalization and activation function between the depth-wise convolution and the 1 × 1 convolution. We observe that it has no influence on the performance while accelerating the training speed. For the proposed GCWMs, the kernel size of group channel-wise convolutions is set to 8. In depth-wise separable channel-wise convolutions, we set the kernel size to 64. In the convolutional classification layer, the kernel size of the 3-D convolution is 7 × 7 × 25. All models are trained using the stochastic gradient descent optimizer with a momentum of 0.9 for 80 epochs. The learning rate starts at 0.1 and decays by 0.1 at the 45th, 60th, 65th, 70th, and 75th epoch. Dropout [20] with a rate of 0.0001 is applied after 1 × 1 convolutions. We use 4 TITAN Xp GPUs and a batch size of 512 for training, which takes about 3 days.

4.3 Comparison of ChannelNet-v1 with Other Models

We compare ChannelNet-v1 with other CNNs, including regular networks and compact ones, in terms of the top-1 accuracy, the number of parameters and the computational cost in terms of FLOPs. The results are reported in Table 1.

Table 1: Comparison between ChannelNet-v1 and other CNNs in terms of the top-1 accuracy on the ImageNet validation set, the number of total parameters, and FLOPs needed for classifying an image.

Models | Top-1 | Params | FLOPs
GoogleNet | 0.698 | 6.8m | 1550m
VGG16 | 0.715 | 128m | 15300m
AlexNet | 0.572 | 60m | 720m
SqueezeNet | 0.575 | 1.3m | 833m
1.0 MobileNet | 0.706 | 4.2m | 569m
ShuffleNet 2x | 0.709 | 5.3m | 524m
ChannelNet-v1 | 0.705 | 3.7m | 407m

We can see that ChannelNet-v1 is the most compact and efficient network, as it achieves the best trade-off between efficiency and accuracy. Among these models, SqueezeNet [7] has the smallest size. However, its speed is even slower than AlexNet and its accuracy is not competitive with other compact CNNs. By replacing depth-wise separable convolutions with GMs and GCWMs, ChannelNet-v1 achieves nearly the same performance as 1.0 MobileNet with an 11.9% reduction in parameters and a 28.5% reduction in FLOPs. Here, the 1.0 represents the width multiplier in MobileNets, which is used to control the width of the networks. MobileNets with different width multipliers are compared with ChannelNets under similar compression levels in Section 4.4. ShuffleNet 2x can obtain a slightly better performance. However, it employs a much deeper network architecture, resulting in even more parameters and FLOPs than MobileNets. This is because more layers are required when using shuffling layers to address the information inconsistency problem in 1 × 1 group convolutions. Thus, the advantage of using group convolutions is compromised. In contrast, our group channel-wise convolutions can overcome the problem without more layers, as shown by experiments in Section 4.5.

Table 2: Comparison between ChannelNets and other compact CNNs with width multipliers in terms of the top-1 accuracy on the ImageNet validation set, and the number of total parameters. 
The numbers before the model\nnames represent width multipliers.\n\n4.4 Comparison of ChannelNets with Models Using Width Multipliers\nThe width multiplier is proposed in [6] to make the\nnetwork architecture thinner by reducing the number\nof input and output channels in each layer, thereby in-\ncreasing the compression level. This approach simply\ncompresses each layer by the same factor. Note that\nmost of parameters lie in deep layers of the model.\nHence, reducing widths in shallow layers does not\nlead to signi\ufb01cant compression, but hinders model\nperformance, since it is important to maintain the\nnumber of channels in the shallow part of deep mod-\nels. Our ChannelNets explore a different way to\nachieve higher compression levels by replacing the\ndeepest layers in CNNs. Remarkably, ChannelNet-v3\nis the \ufb01rst compact network that attempts to compress\nthe last layer, i.e., the fully-connected classi\ufb01cation layer.\n\nModels\n0.75 MobileNet\n0.75 ChannelNet-v1\nChannelNet-v2\n0.5 MobileNet\n0.5 ChannelNet-v1\nChannelNet-v3\n\nTop-1 Params\n2.6m\n0.684\n2.3m\n0.678\n0.695\n2.7m\n0.637\n1.3m\n1.2m\n0.627\n0.667\n1.7m\n\n7\n\n\fWe perform experiments to compare ChannelNet-v2 and ChannelNet-v3 with compact CNNs using\nwidth multipliers. The results are shown in Table 2. We apply width multipliers {0.75, 0.5} on both\nMobileNet and ChannelNet-v1 to illustrate the impact of applying width multipliers. In order to\nmake the comparison fair, compact networks with similar compression levels are compared together.\nSpeci\ufb01cally, we compare ChannelNet-v2 with 0.75 MobileNet and 0.75 ChannelNet-v1, since the\nnumbers of total parameters are in the same 2.x million level. 
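The compression behavior of a width multiplier can be made concrete with a quick parameter count. The sketch below is our own illustration, not code from the paper: it counts the weights of a single depth-wise separable convolution when both channel counts are scaled by a multiplier alpha, as in MobileNets [6].

```python
def depthwise_separable_params(c_in, c_out, k=3, alpha=1.0):
    """Weight count of one depth-wise separable convolution,
    ignoring biases and batch-normalization parameters.

    A width multiplier alpha scales both the input and output
    channel counts, as in MobileNets [6].
    """
    c_in = int(alpha * c_in)
    c_out = int(alpha * c_out)
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise

# The dominant 1 x 1 term shrinks roughly by alpha^2:
print(depthwise_separable_params(512, 512, alpha=1.0))  # 266752
print(depthwise_separable_params(512, 512, alpha=0.5))  # 67840
```

Since the 1×1 term dominates, a 0.5 multiplier removes about three quarters of the weights in every layer, shallow and deep alike, which is why uniformly thinning shallow layers buys little compression while still costing accuracy.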
For ChannelNet-v3, 0.5 MobileNet and 0.5 ChannelNet-v1 are used for comparison, as all of them contain 1.x million parameters.

We can observe from the results that ChannelNet-v2 outperforms 0.75 MobileNet with an absolute 1.1% gain in accuracy, which demonstrates the effectiveness of our depth-wise separable channel-wise convolutions. In addition, note that replacing depth-wise separable convolutions with depth-wise separable channel-wise convolutions is more flexible than applying width multipliers, as it affects only a single layer rather than every layer in the network. ChannelNet-v3 outperforms 0.5 MobileNet significantly, by 3% in accuracy, showing that our convolutional classification layer largely retains accuracy while increasing the compression level. The results also show that applying width multipliers to ChannelNet-v1 leads to poor performance.

4.5 Ablation Study on Group Channel-Wise Convolutions

To demonstrate the effect of our group channel-wise convolutions, we conduct an ablation study on ChannelNet-v1. Based on ChannelNet-v1, we replace the two GCWMs with GMs, thereby removing all group channel-wise convolutions. The resulting model, denoted as ChannelNet-v1(-), follows exactly the same experimental setup as ChannelNet-v1 to ensure fairness. Table 3 compares ChannelNet-v1(-) with ChannelNet-v1. ChannelNet-v1 outperforms ChannelNet-v1(-) by 0.8%, which is significant given that ChannelNet-v1 has only 32 more parameters due to the group channel-wise convolutions. Therefore, group channel-wise convolutions are extremely efficient and effective information fusion layers for solving the problem incurred by group convolutions.

Table 3: Comparison between ChannelNet-v1 and ChannelNet-v1 without group channel-wise convolutions, denoted as ChannelNet-v1(-), in terms of the top-1 accuracy on the ImageNet validation set and the number of total parameters.

Models            Top-1   Params
ChannelNet-v1(-)  0.697   3.7m
ChannelNet-v1     0.705   3.7m

4.6 Sparsity of Weights in Fully-Connected Classification Layers

In ChannelNet-v3, we replace the fully-connected classification layer with our convolutional classification layer. Each prediction is then based on only (m - n + 1) features instead of all m features, which raises a concern about a potential loss in performance. To investigate this, we analyze the weight matrix of the fully-connected classification layer, as shown in Figure 4 in the supplementary material. We take the fully-connected classification layer of ChannelNet-v1 as an example. The analysis shows that the weights are sparsely distributed in the weight matrix, indicating that each prediction makes use of only a small number of features, even with a fully-connected classification layer. Based on this insight, we propose the convolutional classification layer and ChannelNet-v3. As shown in Section 4.4, ChannelNet-v3 is highly compact and efficient with promising performance.

5 Conclusion and Future Work

In this work, we propose channel-wise convolutions, which perform model compression by replacing dense connections in deep networks with sparse ones. By using three instances of channel-wise convolutions, namely group channel-wise convolutions, depth-wise separable channel-wise convolutions, and the convolutional classification layer, we build a new family of compact and efficient CNNs known as ChannelNets. Group channel-wise convolutions are used together with 1×1 group convolutions as information fusion layers. Depth-wise separable channel-wise convolutions can directly replace depth-wise separable convolutions.
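All three instances build on the same primitive: a convolution that slides along the channel dimension, connecting each output to a small window of consecutive input channels. A minimal sketch in plain Python (our own illustration, not the authors' code):

```python
def channel_wise_conv(features, kernel):
    """1-D convolution along the channel axis.

    features: list of m per-channel responses (e.g., after pooling).
    kernel:   list of k shared weights; each output depends on only
              k consecutive input channels.
    Returns m - k + 1 outputs, so choosing k = m - n + 1 yields
    exactly n outputs, the connectivity of the convolutional
    classification layer.
    """
    m, k = len(features), len(kernel)
    return [sum(f * w for f, w in zip(features[i:i + k], kernel))
            for i in range(m - k + 1)]

# Toy example: m = 8 input channels, n = 6 outputs -> k = m - n + 1 = 3.
print(channel_wise_conv(list(range(8)), [1, 1, 1]))  # [3, 6, 9, 12, 15, 18]
```

At the scale used in the paper (m = 1024 features, n = 1000 classes), the shared kernel needs only m - n + 1 = 25 weights along the channel axis (7×7×25 in the 3-D version), in place of the roughly one million weights of a fully-connected classification layer.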
The convolutional classification layer is the first attempt in the field of model compression to compress the fully-connected classification layer. Compared to prior methods, ChannelNets achieve a better trade-off between efficiency and accuracy. The current study evaluates the proposed methods on image classification tasks, but the methods can be applied to other tasks, such as detection and segmentation. We plan to explore these applications in the future.

Acknowledgments

This work was supported in part by National Science Foundation grants IIS-1633359 and DBI-1641223.

References

[1] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.

[2] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, 2016.

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[4] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations, 2015.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[6] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[7] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[9] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.

[10] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[12] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.

[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[14] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[15] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[16] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.

[17] Abigail See, Minh-Thang Luong, and Christopher D Manning. Compression of neural machine translation models via pruning. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 291–301, 2016.

[18] Laurent Sifre and PS Mallat. Rigid-motion scattering for image classification. PhD thesis, Citeseer, 2014.

[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations, 2015.

[20] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[21] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[22] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[23] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.

[24] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.