{"title": "Doubly Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1082, "page_last": 1090, "abstract": "Building large models with parameter sharing accounts for most of the success of deep convolutional neural networks (CNNs). In this paper, we propose doubly convolutional neural networks (DCNNs), which significantly improve the performance of CNNs by further exploring this idea. Instead of allocating a set of convolutional filters that are independently learned, a DCNN maintains groups of filters where filters within each group are translated versions of each other. Practically, a DCNN can be easily implemented by a two-step convolution procedure, which is supported by most modern deep learning libraries. We perform extensive experiments on three image classification benchmarks: CIFAR-10, CIFAR-100 and ImageNet, and show that DCNNs consistently outperform other competing architectures. We have also verified that replacing a convolutional layer with a doubly convolutional layer at any depth of a CNN can improve its performance. Moreover, various design choices of DCNNs are demonstrated, which shows that DCNN can serve the dual purpose of building more accurate models and/or reducing the memory footprint without sacrificing the accuracy.", "full_text": "Doubly Convolutional Neural Networks\n\nShuangfei Zhai\n\nBinghamton University\nVestal, NY 13902, USA\n\nszhai2@binghamton.edu\n\nYu Cheng\n\nIBM T.J. Watson Research Center\nYorktown Heights, NY 10598, USA\n\nchengyu@us.ibm.com\n\nWeining Lu\n\nTsinghua University\nBeijing 10084, China\n\nluwn14@mails.tsinghua.edu.cn\n\nZhongfei (Mark) Zhang\nBinghamton University\nVestal, NY 13902, USA\n\nzhongfei@cs.binghamton.edu\n\nAbstract\n\nBuilding large models with parameter sharing accounts for most of the success of\ndeep convolutional neural networks (CNNs). 
In this paper, we propose doubly convolutional\nneural networks (DCNNs), which significantly improve the performance\nof CNNs by further exploring this idea. Instead of allocating a set of convolutional\nfilters that are independently learned, a DCNN maintains groups of filters where\nfilters within each group are translated versions of each other. Practically, a DCNN\ncan be easily implemented by a two-step convolution procedure, which is supported\nby most modern deep learning libraries. We perform extensive experiments on\nthree image classification benchmarks: CIFAR-10, CIFAR-100 and ImageNet, and\nshow that DCNNs consistently outperform other competing architectures. We have\nalso verified that replacing a convolutional layer with a doubly convolutional layer\nat any depth of a CNN can improve its performance. Moreover, various design\nchoices of DCNNs are demonstrated, which shows that DCNN can serve the dual\npurpose of building more accurate models and/or reducing the memory footprint\nwithout sacrificing the accuracy.\n\n1 Introduction\n\nIn recent years, convolutional neural networks (CNNs) have achieved great success in solving many\nproblems in machine learning and computer vision. CNNs are extremely parameter efficient due\nto exploiting the translation invariant property of images, which is the key to training very deep\nmodels without severe overfitting. While considerable progress has been achieved by aggressively\nexploring deeper architectures [1, 2, 3, 4] or novel regularization techniques [5, 6] with the standard\n\"convolution + pooling\" recipe, we contribute from a different view by providing an alternative to the\ndefault convolution module, which can lead to models with even better generalization abilities and/or\nparameter efficiency.\nOur intuition originates from observing well trained CNNs where many of the learned filters are\nslightly translated versions of each other. 
To quantify this in a more formal fashion, we define the\nk-translation correlation between two convolutional filters $W_i$, $W_j$ within the same layer as:\n\n$$\\rho_k(W_i, W_j) = \\max_{x, y \\in \\{-k, \\dots, k\\},\\ (x, y) \\neq (0, 0)} \\frac{\\langle W_i, T(W_j, x, y) \\rangle_f}{\\|W_i\\|_2 \\|W_j\\|_2}, \\tag{1}$$\n\nwhere $T(\\cdot, x, y)$ denotes the translation of the first operand by (x, y) along its spatial dimensions,\nwith proper zero padding at borders to maintain the shape; $\\langle \\cdot, \\cdot \\rangle_f$ denotes the flattened inner\nproduct, where the two operands are flattened into column vectors before taking the standard inner\nproduct; $\\|\\cdot\\|_2$ denotes the $\\ell_2$ norm of its flattened operand. In other words, the k-translation\ncorrelation between a pair of filters indicates the maximum correlation achieved by translating one\nfilter up to k steps along any spatial dimension. As a concrete example, Figure 1 demonstrates\nthe 3-translation correlation of the first layer filters learned by AlexNet [1], with the weights\nobtained from the Caffe model zoo [7]. In each column, we show a filter in the first row and its three\nmost 3-translation-correlated filters (that is, filters with the highest 3-translation correlations) in the\nsecond to fourth rows. Only the first 32 filters are shown for brevity. It is interesting to see that, for most\nfilters, there exist several filters that are roughly its translated versions.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Visualization of the 11 \u00d7 11 sized first layer filters learned by AlexNet [1]. Each column\nshows a filter in the first row along with its three most 3-translation-correlated filters. Only the first\n32 filters are shown for brevity.\n\nFigure 2: Illustration of the averaged maximum 1-translation correlation, together with the standard\ndeviation, of each convolutional layer for AlexNet [1] (left), and the 19-layer VGGNet [2] (right),\nrespectively. For comparison, for each convolutional layer in each network, we generate a filter set\nwith the same shape from the standard Gaussian distribution (the blue bars). For both networks, all\nthe convolutional layers have averaged maximum 1-translation correlations that are significantly\nlarger than their random counterparts.\n\nIn addition to the convenient visualization of the first layers, we further study this property at\nhigher layers and/or in deeper models. To this end, we define the averaged maximum k-translation\ncorrelation of a layer $W$ as $\\bar{\\rho}_k(W) = \\frac{1}{N} \\sum_{i=1}^{N} \\max_{j=1, j \\neq i}^{N} \\rho_k(W_i, W_j)$, where N is the number\nof filters. Intuitively, the $\\bar{\\rho}_k$ of a convolutional layer characterizes the average level of translation\ncorrelation among the filters within it. We then load the weights of all the convolutional layers of\nAlexNet as well as the 19-layer VGGNet [2] from the Caffe model zoo, and report the averaged\nmaximum 1-translation correlation of each layer in Figure 2. In each graph, the height of the red\nbars indicates the $\\bar{\\rho}_1$ calculated with the weights of the corresponding layer. As a comparison,\nfor each layer we have also generated a filter bank with the same shape but filled with standard\nGaussian samples, whose $\\bar{\\rho}_1$ are shown as the blue bars. We clearly see that all the layers in both\nmodels demonstrate averaged maximum translation correlations that are significantly higher than\ntheir random counterparts. 
In addition, it appears that lower convolutional layers generally have\nhigher translation correlations, although this does not strictly hold (e.g., conv3_4 in VGGNet).\n\nFigure 3: The architecture of a convolutional layer (left) and a doubly convolutional layer (right). A\ndoubly convolutional layer maintains meta filters whose spatial size z' \u00d7 z' is larger than the effective\nfilter size z \u00d7 z. By pooling and flattening the convolution output, a doubly convolutional layer\nproduces $\\left(\\frac{z' - z + 1}{s}\\right)^2$ times more channels for the output image, with s \u00d7 s being the pooling size.\n\nMotivated by the evidence shown above, we propose the doubly convolutional layer (with the double\nconvolution operation), which can be plugged in place of a convolutional layer in CNNs, yielding\nthe doubly convolutional neural networks (DCNNs). The idea of double convolution is to learn\ngroups of filters where filters within each group are translated versions of each other. To achieve this, a\ndoubly convolutional layer allocates a set of meta filters whose filter sizes are larger than the\neffective filter size. Effective filters can then be extracted from each meta filter, which corresponds to\nconvolving the meta filters with an identity kernel. All the extracted filters are then concatenated, and\nconvolved with the input. Optionally, one can also choose to pool along activations produced by filters\nfrom the same meta filter, in a similar spirit to the maxout networks [8]. We also show that double\nconvolution can be easily implemented with available deep learning libraries by utilizing the efficient\nconvolutional kernel. 
In our experiments, we show that the additional level of parameter sharing by\ndouble convolution allows one to build DCNNs that yield excellent performance on several popular\nimage classification benchmarks, consistently outperforming all the competing architectures by a\nmargin. We have also confirmed that replacing a convolutional layer with a doubly convolutional\nlayer consistently improves the performance, regardless of the depth of the layer. Last but not least,\nwe show that one is able to balance the trade-off between performance and parameter efficiency by\nleveraging the architecture of a DCNN.\n\n2 Model\n\n2.1 Convolution\nWe define an image $I \\in \\mathbb{R}^{c \\times w \\times h}$ as a real-valued 3D tensor, where c is the number of channels;\nw, h are the width and height, respectively. We define the convolution operation, denoted by\n$I^{\\ell+1} = I^{\\ell} \\ast W^{\\ell}$, as follows:\n\n$$I^{\\ell+1}_{k,i,j} = \\sum_{c' \\in [1, c],\\ i' \\in [1, z],\\ j' \\in [1, z]} W^{\\ell}_{k,c',i',j'} \\, I^{\\ell}_{c',\\,i+i'-1,\\,j+j'-1}, \\quad k \\in [1, c^{\\ell+1}],\\ i \\in [1, w^{\\ell+1}],\\ j \\in [1, h^{\\ell+1}]. \\tag{2}$$\n\nHere $I^{\\ell} \\in \\mathbb{R}^{c^{\\ell} \\times w^{\\ell} \\times h^{\\ell}}$ is the input image; $W^{\\ell} \\in \\mathbb{R}^{c^{\\ell+1} \\times c^{\\ell} \\times z \\times z}$ is a set of $c^{\\ell+1}$ filters, with each\nfilter of shape $c^{\\ell} \\times z \\times z$; $I^{\\ell+1} \\in \\mathbb{R}^{c^{\\ell+1} \\times w^{\\ell+1} \\times h^{\\ell+1}}$ is the output image. The spatial dimensions\nof the output image $w^{\\ell+1}$, $h^{\\ell+1}$ are by default $w^{\\ell} - z + 1$ and $h^{\\ell} - z + 1$, respectively (aka, valid\nconvolution), but one can also pad a number of zeros at the borders of $I^{\\ell}$ to achieve different output\nspatial dimensions (e.g., keeping the spatial dimensions unchanged). 
In this paper, we use a loose\nnotation by freely allowing both the LHS and RHS of \u2217 to be either a single image (filter) or a set of\nimages (filters), with proper convolution along the non-spatial dimensions.\nA convolutional layer can thus be implemented with a convolution operation followed by a nonlinearity\nfunction such as ReLU, and a convolutional neural network (CNN) is constructed by interweaving\nseveral convolutional and spatial pooling layers.\n\n2.2 Double convolution\nWe next introduce and define the double convolution operation, denoted by $I^{\\ell+1} = I^{\\ell} \\otimes W^{\\ell}$, as\nfollows:\n\n$$O^{\\ell+1}_{i,j,k} = W^{\\ell}_{k} \\ast I^{\\ell}_{:,\\,i:(i+z-1),\\,j:(j+z-1)},$$\n$$I^{\\ell+1}_{(nk+1):n(k+1),\\,i,\\,j} = \\mathrm{pool}_s(O^{\\ell+1}_{i,j,k}), \\quad n = \\left(\\frac{z' - z + 1}{s}\\right)^2, \\tag{3}$$\n$$k \\in [1, c^{\\ell+1}],\\ i \\in [1, w^{\\ell+1}],\\ j \\in [1, h^{\\ell+1}].$$\n\nHere $I^{\\ell} \\in \\mathbb{R}^{c^{\\ell} \\times w^{\\ell} \\times h^{\\ell}}$ and $I^{\\ell+1} \\in \\mathbb{R}^{n c^{\\ell+1} \\times w^{\\ell+1} \\times h^{\\ell+1}}$ are the input and output image, respectively.\n$W^{\\ell} \\in \\mathbb{R}^{c^{\\ell+1} \\times c^{\\ell} \\times z' \\times z'}$ are a set of $c^{\\ell+1}$ meta filters, with filter size z' \u00d7 z', z' > z; $O^{\\ell+1}_{i,j,k} \\in \\mathbb{R}^{(z'-z+1) \\times (z'-z+1)}$\nis the intermediate output of double convolution; $\\mathrm{pool}_s(\\cdot)$ defines a spatial\npooling function with pooling size s \u00d7 s (and optionally reshaping the output to a column vector,\ninferred from the context); \u2217 is the convolution operator defined previously in Equation 2.\nIn words, a double convolution applies a set of $c^{\\ell+1}$ meta filters with spatial dimensions z' \u00d7 z',\nwhich are larger than the effective filter size z \u00d7 z. 
Image patches of size z \u00d7 z at each location\n(i, j) of the input image, denoted by $I^{\\ell}_{:,\\,i:(i+z-1),\\,j:(j+z-1)}$, are then convolved with each meta filter,\nresulting in an output of size (z' \u2212 z + 1) \u00d7 (z' \u2212 z + 1) for each (i, j). A spatial pooling of size s \u00d7 s is\nthen applied along this resulting output map, whose output is flattened into a column vector. This\nproduces an output feature map with $n c^{\\ell+1}$ channels. The above procedure can be viewed as a two\nstep convolution, where image patches are first convolved with meta filters, and the meta filters then\nslide across and convolve with the image, hence the name double convolution.\nA doubly convolutional layer is by analogy defined as a double convolution followed by a nonlinearity;\nand substituting the convolutional layers in a CNN with doubly convolutional layers yields a doubly\nconvolutional neural network (DCNN). In Figure 3 we have illustrated the difference between a\nconvolutional layer and a doubly convolutional layer. It is possible to vary the combination of z, z', s\nfor each doubly convolutional layer of a DCNN to yield different variants, among which three extreme\ncases are:\n(1) CNN: Setting z' = z recovers the standard CNN; hence, DCNN is a generalization of CNN.\n(2) ConcatDCNN: Setting s = 1 produces a DCNN variant that is maximally parameter efficient.\nThis corresponds to extracting all sub-regions of size z \u00d7 z from a z' \u00d7 z' sized meta filter, which\nare then stacked to form a set of $(z' - z + 1)^2$ filters with size z \u00d7 z. With the same amount of\nparameters, this produces $\\frac{(z' - z + 1)^2 z^2}{(z')^2}$ times more channels for a single layer.\n(3) MaxoutDCNN: Setting s = z' \u2212 z + 1, i.e., applying global pooling on $O^{\\ell+1}$, produces a DCNN\nvariant where the output image channel size is equal to the number of the meta filters. Interestingly,\nthis yields a parameter efficient implementation of the maxout network [8]. To be concrete, the\nmaxout units in a maxout network are equivalent to pooling along the channel (feature) dimension,\nwhere each channel corresponds to a distinct filter. MaxoutDCNN, on the other hand, pools along\nchannels which are produced by the filters that are translated versions of each other. Besides the\nobvious advantage of reducing the number of parameters required, this also acts as an effective\nregularizer, which is verified later in the experiments in Section 4.\nDouble convolution is also readily supported by most mainstream GPU-compatible\ndeep learning libraries (e.g., Theano, which is used in our experiments); we summarize the implementation\nin Algorithm 1. In particular, we are able to perform double convolution by two steps of convolution,\ncorresponding to line 4 and line 6, together with proper reshaping and pooling operations. The first\nconvolution extracts overlapping patches of size z \u00d7 z from the meta filters, which are then convolved\nwith the input image. Although it is possible to further reduce the time complexity by designing a\nspecialized double convolution module, we find that Algorithm 1 scales well to deep DCNNs, and\nlarge datasets such as ImageNet.\n\n3 Related work\n\nThe spirit of DCNNs is to further push the idea of parameter sharing of the convolutional layers,\nwhich is shared by several recent efforts. [9] explores the rotation symmetry of certain classes of\nimages, and hence proposes to rotate each filter (or alternatively, the input) by a multiple of\n90\u00b0, which produces four times as many filters with the same amount of parameters for a single layer. [10]\nobserves that filters learned by ReLU CNNs often contain pairs with opposite phases in the lower\nlayers. 
The authors accordingly propose the concatenated ReLU where the linear activations are\nconcatenated with their negations and then passed to ReLU, which effectively doubles the number of\nfilters. [11] proposes the dilated convolutions, where additional filters with larger sizes are generated\nby dilating the base convolutional filters, which is shown to be effective in dense prediction tasks\nsuch as image segmentation. [12] proposes a multi-bias activation scheme where k, k \u2265 1, bias\nterms are learned for each filter, which produces a k times channel size for the convolution output.\nAdditionally, [13, 14] have investigated the combination of more than one transformation of filters,\nsuch as rotation, flipping and distortion. Note that all the aforementioned approaches are orthogonal\nto DCNNs and can theoretically be combined in a single model. The need for correlated filters in\nCNNs is also studied in [15], where similar filters are explicitly learned and grouped with a group\nsparsity penalty.\n\nAlgorithm 1: Implementation of double convolution with convolution.\nInput: Input image $I^{\\ell} \\in \\mathbb{R}^{c^{\\ell} \\times w^{\\ell} \\times h^{\\ell}}$, meta filters $W^{\\ell} \\in \\mathbb{R}^{c^{\\ell+1} \\times c^{\\ell} \\times z' \\times z'}$, effective filter size z \u00d7 z, pooling size s \u00d7 s.\nOutput: Output image $I^{\\ell+1} \\in \\mathbb{R}^{n c^{\\ell+1} \\times w^{\\ell+1} \\times h^{\\ell+1}}$, with $n = \\frac{(z' - z + 1)^2}{s^2}$.\n1 begin\n2   $\\mathbf{I}^{\\ell} \\leftarrow$ IdentityMatrix($c^{\\ell} z^2$);\n3   Reorganize $\\mathbf{I}^{\\ell}$ to shape $c^{\\ell} z^2 \\times c^{\\ell} \\times z \\times z$;\n4   $\\tilde{W}^{\\ell} \\leftarrow W^{\\ell} \\ast \\mathbf{I}^{\\ell}$; /* output shape: $c^{\\ell+1} \\times c^{\\ell} z^2 \\times (z' - z + 1) \\times (z' - z + 1)$ */\n5   Reorganize $\\tilde{W}^{\\ell}$ to shape $c^{\\ell+1} (z' - z + 1)^2 \\times c^{\\ell} \\times z \\times z$;\n6   $O^{\\ell+1} \\leftarrow I^{\\ell} \\ast \\tilde{W}^{\\ell}$; /* output shape: $c^{\\ell+1} (z' - z + 1)^2 \\times w^{\\ell+1} \\times h^{\\ell+1}$ */\n7   Reorganize $O^{\\ell+1}$ to shape $c^{\\ell+1} w^{\\ell+1} h^{\\ell+1} \\times (z' - z + 1) \\times (z' - z + 1)$;\n8   $I^{\\ell+1} \\leftarrow \\mathrm{pool}_s(O^{\\ell+1})$; /* output shape: $c^{\\ell+1} w^{\\ell+1} h^{\\ell+1} \\times \\frac{z' - z + 1}{s} \\times \\frac{z' - z + 1}{s}$ */\n9   Reorganize $I^{\\ell+1}$ to shape $c^{\\ell+1} \\left(\\frac{z' - z + 1}{s}\\right)^2 \\times w^{\\ell+1} \\times h^{\\ell+1}$;\n\nWhile DCNNs are designed with better performance and generalization ability in mind, they are\nalso closely related to the thread of work on parameter reduction in deep neural networks. The\nwork of Sindhwani et al. [16] addresses the problem of compressing deep networks by applying\nstructured transforms. [17] exploits the redundancy in the parametrization of deep architectures by\nimposing a circulant structure on the projection matrix, while allowing the use of FFT for faster\ncomputations. [18] compresses the fully connected layers of an AlexNet-type\nnetwork with the Fastfood method. Novikov et al. [19] use a multi-linear transform (Tensor-Train\ndecomposition) to reduce the number of parameters in the linear layers of CNNs. These\nworks differ from DCNNs in that they mostly focus on the fully connected layers, which often\naccount for most of the memory consumption. DCNNs, on the other hand, apply directly to the\nconvolutional layers, which provides a complementary view to the same problem.\n\n4 Experiments\n\n4.1 Datasets\n\nWe conduct several sets of experiments with DCNN on three image classification benchmarks:\nCIFAR-10, CIFAR-100, and ImageNet. 
CIFAR-10 and CIFAR-100 both contain 50,000 training\nand 10,000 testing 32 \u00d7 32 sized RGB images, evenly drawn from 10 and 100 classes, respectively.\nImageNet is the dataset used in the ILSVRC-2012 challenge, which consists of about 1.2 million\nimages for training and 50,000 images for validation, sampled from 1,000 classes.\n\n4.2 Is DCNN an effective architecture?\n\n4.2.1 Model specifications\n\nIn the first set of experiments, we study the effectiveness of DCNN compared with two different\nCNN designs. The three types of architectures subject to evaluation are:\n(1) CNN: This corresponds to models using the standard convolutional layers. A convolutional layer\nis denoted as C-\u27e8c\u27e9-\u27e8z\u27e9, where c, z are the number of filters and the filter size, respectively.\n(2) MaxoutCNN: This corresponds to the maxout convolutional networks [8], which use the maxout\nunit to pool along the channel (feature) dimensions with a stride k. A maxout convolutional layer is\ndenoted as MC-\u27e8c\u27e9-\u27e8z\u27e9-\u27e8k\u27e9, where c, z, k are the number of filters, the filter size, and the feature\npooling stride, respectively.\n(3) DCNN: This corresponds to using the doubly convolutional layers. We denote a doubly convolutional\nlayer with c filters as DC-\u27e8c\u27e9-\u27e8z'\u27e9-\u27e8z\u27e9-\u27e8s\u27e9, where z', z, s are the meta filter size, effective\nfilter size and pooling size, respectively, as in Equation 3. In this set of experiments, we use the\nMaxoutDCNN variant, whose layers are readily represented as DC-\u27e8c\u27e9-\u27e8z'\u27e9-\u27e8z\u27e9-\u27e8z'\u2212z+1\u27e9.\n\nTable 1: The configurations of the models used in Section 4.2. The architectures on the CIFAR-10\nand CIFAR-100 datasets are the same, except for the top softmax layer (left). The architectures on the\nImageNet dataset are variants of the 16-layer VGGNet [2] (right). See the details about the naming\nconvention in Section 4.2.1.\n\nCIFAR-10 and CIFAR-100 (left):\n\nCNN        DCNN            MaxoutCNN\nC-128-3    DC-128-4-3-2    MC-512-3-4\nC-128-3    DC-128-4-3-2    MC-512-3-4\nP-2\nC-128-3    DC-128-4-3-2    MC-512-3-4\nC-128-3    DC-128-4-3-2    MC-512-3-4\nP-2\nC-128-3    DC-128-4-3-2    MC-512-3-4\nC-128-3    DC-128-4-3-2    MC-512-3-4\nP-2\nC-128-3    DC-128-4-3-2    MC-512-3-4\nC-128-3    DC-128-4-3-2    MC-512-3-4\nP-2\nGlobal Average Pooling\nSoftmax\n\nImageNet (right):\n\nCNN        DCNN            MaxoutCNN\nC-64-3     DC-64-4-3-2     MC-256-3-4\nC-64-3     DC-64-4-3-2     MC-256-3-4\nP-2\nC-128-3    DC-128-4-3-2    MC-512-3-4\nC-128-3    DC-128-4-3-2    MC-512-3-4\nP-2\nC-256-3    DC-256-4-3-2    MC-1024-3-4\nC-256-3    DC-256-4-3-2    MC-1024-3-4\nC-256-3    DC-256-4-3-2    MC-1024-3-4\nP-2\nC-512-3    DC-512-4-3-2    MC-2048-3-4\nC-512-3    DC-512-4-3-2    MC-2048-3-4\nC-512-3    DC-512-4-3-2    MC-2048-3-4\nP-2\nC-512-3    DC-512-4-3-2    MC-2048-3-4\nC-512-3    DC-512-4-3-2    MC-2048-3-4\nC-512-3    DC-512-4-3-2    MC-2048-3-4\nP-2\nGlobal Average Pooling\nSoftmax\n\nWe denote a spatial max pooling layer as P-\u27e8s\u27e9 with s as the pooling size. For all the models, we\napply batch normalization [6] immediately after each convolution layer, after which ReLU is used as\nthe nonlinearity (including MaxoutCNN, which makes our implementation slightly different from\n[8]). Our model design is similar to VGGNet [2] where 3 \u00d7 3 filter sizes are used, as well as Network\nin Network [20] where fully connected layers are completely eliminated. 
Zero padding is used before\neach convolutional layer to maintain the spatial dimensions unchanged after convolution. Dropout is\napplied after each pooling layer. Global average pooling is applied on top of the last convolutional\nlayer, which is fed to a Softmax layer with a proper number of outputs.\nAll the three models on each dataset are of the same architecture w.r.t. the number of layers and the\nnumber of units per layer. The only difference thus resides in the choice of the convolutional layers.\nNote that the architecture we have used on the ImageNet dataset resembles the 16-layer VGGNet [2],\nbut without the fully connected layers. The full specification of the model architectures is shown in\nTable 1.\n\n4.2.2 Training protocols\n\nWe preprocess all the datasets by subtracting the mean for each pixel and each channel, calculated on\nthe training sets. All the models are trained with Adadelta [21] on NVIDIA K40 GPUs. Batch size is\nset as 200 for CIFAR-10 and CIFAR-100, and 128 for ImageNet.\nData augmentation has also been explored. On CIFAR-10 and CIFAR-100, we follow the simple\ndata augmentation as in [2]. For training, 4 pixels are padded on each side of the images, from which\n32 \u00d7 32 crops are sampled with random horizontal flipping. For testing, only the original 32 \u00d7 32\nimages are used. On ImageNet, 224 \u00d7 224 crops are sampled with random horizontal flipping; the\nstandard color augmentation and the 10-crop testing are also applied as in AlexNet [1].\n\n4.2.3 Results\n\nThe test errors are summarized in Table 2 and Table 3, where the relative # parameters of DCNN and\nMaxoutCNN compared with the standard CNN are also shown. On the moderately-sized datasets\nCIFAR-10 and CIFAR-100, DCNN achieves the best results of the three control experiments, with\nand without data augmentation. Notably, DCNN consistently improves over the standard CNN by a\nmargin. 
More remarkably, DCNN also consistently outperforms MaxoutCNN, with 2.25 times fewer\nparameters. This on the one hand shows that the doubly convolutional layers greatly improve the\nmodel capacity, and on the other hand verifies our hypothesis that the parameter sharing introduced\nby double convolution indeed acts as a very effective regularizer. The results achieved by DCNN on\nthe two datasets are also among the best published results compared with [20, 22, 23, 24].\nBesides, we also note that DCNN does not have difficulty scaling up to a large dataset such as\nImageNet, where consistent performance gains over the other baseline architectures are again observed.\nCompared with the results of the 16-layer VGGNet in [2] with multiscale evaluation, our DCNN\nimplementation achieves comparable results, with significantly fewer parameters.\n\nTable 2: Test errors on CIFAR-10 and CIFAR-100 with and without data augmentation, together with\nthe relative # parameters compared with the standard CNN.\n\nModel        # Parameters    CIFAR-10 (w/o aug.)    CIFAR-100 (w/o aug.)    CIFAR-10 (with aug.)    CIFAR-100 (with aug.)\nCNN          1.              9.85%                  34.26%                  9.59%                   33.04%\nMaxoutCNN    4.              9.56%                  33.52%                  9.23%                   32.37%\nDCNN         1.78            8.58%                  30.35%                  7.24%                   26.53%\nNIN [20]     0.92            10.41%                 35.68%                  8.81%                   -\nDSN [22]     -               9.78%                  34.57%                  8.22%                   -\nAPL [23]     -               9.59%                  34.40%                  7.51%                   30.83%\nELU [24]     -               -                      -                       6.55%                   24.28%\n\n4.3 Does double convolution contribute to every layer?\n\nIn the next set of experiments, we study the effect of applying double convolution to layers at\nvarious depths. To this end, we replace the convolutional layers at each level of the standard CNN\ndefined in Section 4.2.1 with a doubly convolutional layer counterpart (e.g., replacing a C-128-3 layer with a\nDC-128-4-3-2 layer). 
We hence define DCNN[i-j] as the network resulting from replacing the i-th through j-th\nconvolutional layers of a CNN with their doubly convolutional layer counterparts, and train {DCNN[1-2],\nDCNN[3-4], DCNN[5-6], DCNN[7-8]} on CIFAR-10 and CIFAR-100 following the same protocol\nas that in Section 4.2.2. The results are shown in Table 4. Interestingly, the doubly convolutional\nlayer is able to consistently improve the performance over that of the standard CNN regardless of the\ndepth at which it is plugged in. Also, it seems that applying double convolution at lower layers\ncontributes more to the performance, which is consistent with the trend of translation correlation\nobserved in Figure 2.\n\nTable 3: Test errors on ImageNet, evaluated on the validation set, together with the relative #\nparameters compared with the standard CNN.\n\nModel             Top-5 Error    Top-1 Error    # Parameters\nCNN               10.59%         29.42%         1.\nMaxoutCNN         9.82%          28.4%          4.\nDCNN              8.23%          26.27%         1.78\nVGG-16 [2]        7.5%           24.8%          9.3\nResNet-152 [4]    5.71%          21.43%         4.1\nGoogLeNet [3]     7.9%           -              0.47\n\nTable 4: Inserting the doubly convolutional layer at different depths of the network.\n\nModel        CIFAR-10    CIFAR-100\nCNN          9.85%       34.26%\nDCNN[1-2]    9.12%       32.91%\nDCNN[3-4]    9.23%       33.27%\nDCNN[5-6]    9.45%       33.58%\nDCNN[7-8]    9.57%       33.72%\nDCNN[1-8]    8.58%       30.35%\n\n4.4 Performance vs. parameter efficiency\n\nIn the last set of experiments, we study the behavior of DCNNs under various combinations of their\nhyper-parameters, z', z, s. To this end, we train three more DCNNs on CIFAR-10 and CIFAR-100,\nnamely {DCNN-32-6-3-2, DCNN-16-6-3-1, DCNN-4-10-3-1}. Here we have overloaded the notation\nfor a doubly convolutional layer to denote a DCNN which contains correspondingly shaped doubly\nconvolutional layers (the DCNN in Table 1 thus corresponds to DCNN-128-4-3-2). 
In particular,\nDCNN-32-6-3-2 produces a DCNN with exactly the same shape and number of parameters as\nthe reference CNN; DCNN-16-6-3-1, DCNN-4-10-3-1 are two ConcatDCNN instances from Section\n2.2, which produce larger sized models with the same or fewer parameters. The results, together\nwith the effective layer size and the relative number of parameters, are listed in Table 5. We see that\nall the variants of DCNN consistently outperform the standard CNN, even when fewer parameters\nare used (DCNN-4-10-3-1). This verifies that DCNN is a flexible framework which allows one to\neither maximize the performance with a fixed memory budget, or on the other hand, minimize the\nmemory footprint without sacrificing the accuracy. One can choose the most suitable architecture of a\nDCNN by balancing the trade-off between performance and the memory footprint.\n\nTable 5: Different architecture configurations of DCNNs.\n\nModel              CIFAR-10    CIFAR-100    Layer size    # Parameters\nCNN                9.85%       34.26%       128           1.\nDCNN-32-6-3-2      9.05%       32.28%       128           1.\nDCNN-16-6-3-1      9.16%       32.54%       256           1.\nDCNN-4-10-3-1      9.65%       33.57%       256           0.69\nDCNN-128-4-3-2     8.58%       30.35%       128           1.78\n\n5 Conclusion\n\nWe have proposed the doubly convolutional neural networks (DCNNs), which utilize a novel double\nconvolution operation to provide an additional level of parameter sharing over CNNs. We show that\nDCNNs generalize standard CNNs, and relate to several recent proposals that explore parameter\nredundancy in CNNs. A DCNN can be easily implemented by modern deep learning libraries\nby reusing the efficient convolution module. DCNNs can be used to serve the dual purpose of 1)\nimproving the classification accuracy as a regularized version of maxout networks, and 2) being\nparameter efficient by flexibly varying their architectures. 
In the extensive experiments on CIFAR-10,\nCIFAR-100, and ImageNet datasets, we have shown that DCNNs significantly improve over other\narchitecture counterparts. In addition, we have shown that introducing the doubly convolutional\nlayer to any layer of a CNN improves its performance. We have also experimented with various\nconfigurations of DCNNs, all of which are able to outperform the CNN counterpart with the same or\nfewer parameters.\n\nReferences\n[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097\u20131105, 2012.\n\n[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1\u20139, 2015.\n\n[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.\n\n[5] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929\u20131958, 2014.\n\n[6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[7] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 
Caffe: Convolutional architecture for fast feature embedding.\nProceedings of the ACM International Conference on Multimedia, pages 675\u2013678. ACM, 2014.\n\n[8] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout\n\nnetworks. arXiv preprint arXiv:1302.4389, 2013.\n\n[9] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional\n\nneural networks. arXiv preprint arXiv:1602.02660, 2016.\n\n[10] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving con-\nvolutional neural networks via concatenated recti\ufb01ed linear units. arXiv preprint arXiv:1603.05201,\n2016.\n\n[11] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint\n\narXiv:1511.07122, 2015.\n\n[12] Hongyang Li, Wanli Ouyang, and Xiaogang Wang. Multi-bias non-linear activation in deep neural networks.\n\narXiv preprint arXiv:1604.00676, 2016.\n\n[13] Robert Gens and Pedro M Domingos. Deep symmetry networks. In Advances in neural information\n\nprocessing systems, pages 2537\u20132545, 2014.\n\n[14] Taco S Cohen and Max Welling. Group equivariant convolutional networks.\n\narXiv:1602.07576, 2016.\n\narXiv preprint\n\n[15] Koray Kavukcuoglu, Rob Fergus, Yann LeCun, et al. Learning invariant features through topographic\n\ufb01lter maps. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages\n1605\u20131612. IEEE, 2009.\n\n[16] Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning.\nIn C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural\nInformation Processing Systems 28, pages 3088\u20133096. Curran Associates, Inc., 2015.\n\n[17] Yu Cheng, Felix X. Yu, Rogerio Feris, Sanjiv Kumar, and Shih-Fu Chang. An exploration of parameter\nredundancy in deep networks with circulant projections. 
In International Conference on Computer Vision (ICCV), 2015.\n\n[18] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. In International Conference on Computer Vision (ICCV), 2015.\n\n[19] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.\n\n[20] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.\n\n[21] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.\n\n[22] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.\n\n[23] Forest Agostinelli, Matthew Hoffman, Peter J. Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. CoRR, abs/1412.6830, 2014.\n\n[24] Djork-Arn\u00e9 Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.\n", "award": [], "sourceid": 620, "authors": [{"given_name": "Shuangfei", "family_name": "Zhai", "institution": "Binghamton University"}, {"given_name": "Yu", "family_name": "Cheng", "institution": "IBM Research"}, {"given_name": "Zhongfei (Mark)", "family_name": "Zhang", "institution": "Binghamton University"}, {"given_name": "Weining", "family_name": "Lu", "institution": "Tsinghua University"}]}