{"title": "Convolutional Neural Fabrics", "book": "Advances in Neural Information Processing Systems", "page_first": 4053, "page_last": 4061, "abstract": "Despite the success of CNNs, selecting the optimal architecture for a given task remains an open problem. Instead of aiming to select a single optimal architecture, we propose a ``fabric'' that embeds an exponentially large number of architectures. The fabric consists of a 3D trellis that connects response maps at different layers, scales, and channels with a sparse homogeneous local connectivity pattern. The only hyper-parameters of a fabric are the number of channels and layers. While individual architectures can be recovered as paths, the fabric can in addition ensemble all embedded architectures together, sharing their weights where their paths overlap. Parameters can be learned using standard methods based on back-propagation, at a cost that scales linearly in the fabric size. We present benchmark results competitive with the state of the art for image classification on MNIST and CIFAR10, and for semantic segmentation on the Part Labels dataset.", "full_text": "Convolutional Neural Fabrics\n\nShreyas Saxena\n\nJakob Verbeek\n\nINRIA Grenoble \u2013 Laboratoire Jean Kuntzmann\n\nAbstract\n\nDespite the success of CNNs, selecting the optimal architecture for a given task\nremains an open problem. Instead of aiming to select a single optimal architecture,\nwe propose a \u201cfabric\u201d that embeds an exponentially large number of architectures.\nThe fabric consists of a 3D trellis that connects response maps at different layers,\nscales, and channels with a sparse homogeneous local connectivity pattern. The\nonly hyper-parameters of a fabric are the number of channels and layers. While\nindividual architectures can be recovered as paths, the fabric can in addition\nensemble all embedded architectures together, sharing their weights where their\npaths overlap. 
Parameters can be learned using standard methods based on back-\npropagation, at a cost that scales linearly in the fabric size. We present benchmark\nresults competitive with the state of the art for image classi\ufb01cation on MNIST and\nCIFAR10, and for semantic segmentation on the Part Labels dataset.\n\n1\n\nIntroduction\n\nConvolutional neural networks (CNNs) [15] have proven extremely successful for a wide range\nof computer vision problems and other applications. In particular, the results of Krizhevsky et\nal. [13] have caused a major paradigm shift in computer vision from models relying in part on\nhand-crafted features, to end-to-end trainable systems from the pixels upwards. One of the main\nproblems that holds back further progress using CNNs, as well as deconvolutional variants [24, 26]\nused for semantic segmentation, is the lack of ef\ufb01cient systematic ways to explore the discrete and\nexponentially large architecture space. To appreciate the number of possible architectures, consider a\nstandard chain-structured CNN architecture for image classi\ufb01cation. The architecture is determined\nby the following hyper-parameters: (i) number of layers, (ii) number of channels per layer, (iii) \ufb01lter\nsize per layer, (iv) stride per layer, (v) number of pooling vs. convolutional layers, (vi) type of pooling\noperator per layer, (vii) size of the pooling regions, (viii) ordering of pooling and convolutional layers,\n(ix) channel connectivity pattern between layers, and (x) type of activation, e.g. ReLU or MaxOut, per\nlayer. The number of resulting architectures clearly does not allow for (near) exhaustive exploration.\nWe show that all network architectures that can be obtained for various choices of the above ten\nhyper-parameters are embedded in a \u201cfabric\u201d of convolution and pooling operators. 
Concretely, the fabric is a three-dimensional trellis of response maps of various resolutions, with only local connections across neighboring layers, scales, and channels. See Figure 1 for a schematic illustration of how fabrics embed different architectures. Each activation in a fabric is computed as a linear function followed by a non-linearity from a multi-dimensional neighborhood (spatial/temporal input dimensions, a scale dimension, and a channel dimension) in the previous layer. Setting the only two hyper-parameters, the number of layers and channels, is not critical as long as they are large enough. We also consider two variants, one in which the channels are fully connected instead of sparsely, and another in which the number of channels doubles as we move to a coarser scale. The latter allows for one to two orders of magnitude more channels, while increasing memory requirements by only 50%. All chain-structured network architectures embedded in the fabric can be recovered by appropriately setting certain connections to zero, so that only a single processing path is active between input and output. General, non-path, weight settings correspond to ensembling many architectures together,\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Fabrics embedding two seven-layer CNNs (red, green) and a ten-layer deconvolutional network (blue). Layers are laid out horizontally and scales vertically; feature map sizes of the CNN layers are given by height. Fabric nodes receiving input and producing output are encircled. All edges are oriented to the right, down in the first layer, and towards the output in the last layer. The channel dimension of the 3D fabric is omitted for clarity.\n\nwhich share parameters where the paths overlap. The acyclic trellis structure allows for learning using standard error back-propagation methods. 
Learning can thus efficiently configure the fabric to implement each one of exponentially many embedded architectures, as well as ensembles of them. Experimental results competitive with the state of the art validate the effectiveness of our approach.\nThe contributions of our work are: (1) Fabrics by and large sidestep the CNN model architecture selection problem, avoiding the explicit training and evaluation of individual architectures using, e.g., local-search strategies [2]. (2) While scaling linearly in terms of computation and memory requirements, our approach leverages exponentially many chain-structured architectures in parallel by massively sharing weights among them. (3) Since our fabric is multi-scale by construction, it can naturally generate output at multiple resolutions, e.g. for image classification and semantic segmentation or multi-scale object detection, within a single non-branching network structure.\n\n2 Related work\n\nSeveral chain-structured CNN architectures, including Alex-net [13] and the VGG-16 and VGG-19 networks [27], are widely used for image classification and related tasks. Although very effective, it is not clear that these architectures are the best ones given their computational and memory requirements. Their widespread adoption is in large part due to the lack of more effective methods to find good architectures than trying them one-by-one, possibly initializing parameters from related ones [2].\nCNN architectures for semantic segmentation, as well as for other structured prediction tasks such as human pose estimation [25], are often derived from ones developed for image classification, see e.g. [20, 24, 31, 33]. Up-sampling operators are used to increase the resolution of the output, compensating for pooling operators used in earlier layers of the network [24]. 
Ronneberger et al. [26] present a network with additional links that couple layers with the same resolution near the input and output. Other architectures, see e.g. [3, 7], process the input in parallel across several resolutions, and then fuse all streams by re-sampling to the output resolution. Such architectures induce networks with multiple parallel paths from input to output. We will show that nearly all such networks are embedded in our fabrics, either as paths or as other simple sub-graphs.\nWhile multi-dimensional networks have been proposed in the past, e.g. to process non-sequential data with recurrent nets [5, 11], to the best of our knowledge they have not been explored as a \u201cbasis\u201d to span large classes of convolutional neural networks. Misra et al. [23] propose related cross-stitch networks that exchange information across corresponding layers of two copies of the same architecture that produce two different outputs. Their approach is based on Alex-net [13], and does not address the network architecture selection problem. In related work Zhou et al. [34] interlink CNNs that take input from re-scaled versions of the input image. The structure of their network is related to our fabric, but lacks a sparse connectivity pattern across channels. They consider their networks for semantic segmentation, set the filter sizes per node manually, and use strided max-pooling for down-sampling and nearest neighbor interpolation for up-sampling. The contribution of our work is to show that a similar network structure suffices to span a vast class of network architectures for both dense prediction and classification tasks.\nSpringenberg et al. [29] experimentally observed that the use of max-pooling in CNN architectures is not always beneficial as opposed to using strided convolutions. 
In our work we go one step further and show that ReLU units and strided convolutions suffice to implement max-pooling operators in our fabrics. Their work, similar to ours, also strives to simplify architecture design. Our results, however, reach much further than only removing pooling operators from the architectural elements.\nLee et al. [17] generalize the max and average pooling operators by computing both max and average pooling, and then fusing the result in a possibly data-driven manner. Our fabrics also generalize max and average pooling, but instead of adding elementary operators, we show that setting weights in a network with fewer elementary operators is enough for this generalization.\nKulkarni et al. [14] use \u21131 regularization to automatically select the number of units in \u201cfully-connected\u201d layers of CNN architectures for classification. Although their approach does not directly extend to determine more general architectural design choices, it might be possible to use such regularization techniques to select the number of channels and/or layers of our fabrics.\nDropout [30] and swapout [28] are stochastic training methods related to our work. They can be understood as approximately averaging over an exponential number of variations of a given architecture. Our approach, on the other hand, allows us to leverage an exponentially large class of architectures (ordering of pooling and convolutional layers, type of pooling operator, etc.) by means of continuous optimization. Note that these approaches are orthogonal and can be applied to fabrics.\n\n3 The fabric of convolutional neural networks\n\nIn this section we give a precise definition of convolutional neural fabrics, and show in Section 3.2 that most architectural network design choices become irrelevant for sufficiently large fabrics. 
Finally, we analyze the number of response maps, parameters, and activations of fabrics in Section 3.3.\n\n3.1 Weaving the convolutional neural fabric\n\nEach node in the fabric represents one response map with the same dimension D as the input signal (D = 1 for audio, D = 2 for images, D = 3 for video). The fabric over the nodes is spanned by three axes. A layer axis along which all edges advance, which rules out any cycles, and which is analogous to the depth axis of a CNN. A scale axis along which response maps of different resolutions are organized from fine to coarse; neighboring resolutions are separated by a factor of two. A channel axis along which different response maps of the same scale and layer are organized. We use S = 1 + log2 N scales when we process inputs of size N^D, e.g. for 32\u00d732 images we use six scales, so as to obtain a scale pyramid from the full input resolution to the coarsest 1\u00d71 response maps.\nWe now define a sparse and homogeneous edge structure. Each node is connected to a 3\u00d73 scale\u2013channel neighborhood in the previous layer, i.e. activations at channel c, scale s, and layer l are computed as a(s, c, l) = \u2211_{i,j \u2208 {\u22121,0,1}} conv(a(c + i, s + j, l \u2212 1), w_{ij}^{scl}). Input from a finer scale is obtained via strided convolution, and input from a coarser scale by convolution after upsampling by padding zeros around the activations at the coarser level. All convolutions use kernel size 3. Activations are thus a linear function over multi-dimensional neighborhoods, i.e. a four-dimensional 3\u00d73\u00d73\u00d73 neighborhood when processing 2D images. The propagation is, however, only convolutional across the input dimensions, and not across the scale and layer axes. The \u201cfully connected\u201d layers of a CNN correspond to nodes along the coarsest 1\u00d71 scale of the fabric. Rectified linear units (ReLUs) are used at all nodes. 
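The local update rule above can be sketched in a few lines. The following is a minimal toy version for D = 1 (audio-like signals), assuming response maps held as nested lists, and ignoring biases and the special first/last-layer edges; the names fabric_layer, down, and up are ours, not from the paper's Caffe-based implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv3(x, w):
    # 1D convolution with kernel size 3, zero padding, stride 1
    xp = np.pad(x, 1)
    return w[0] * xp[:-2] + w[1] * xp[1:-1] + w[2] * xp[2:]

def down(x, w):
    # fine-to-coarse edge: stride-2 convolution
    return conv3(x, w)[::2]

def up(x, w):
    # coarse-to-fine edge: upsample by zero padding, then convolve
    z = np.zeros(2 * len(x))
    z[::2] = x
    return conv3(z, w)

def fabric_layer(prev, weights):
    # a(s, c, l) = ReLU of the sum over the 3x3 scale-channel
    # neighborhood of layer l-1; prev[s][c] is one response map,
    # weights[s][c][(i, j)] the kernel for input (c + i, s + j).
    S, C = len(prev), len(prev[0])
    out = []
    for s in range(S):
        row = []
        for c in range(C):
            acc = np.zeros_like(prev[s][0])
            for i in (-1, 0, 1):          # channel neighborhood
                for j in (-1, 0, 1):      # scale neighborhood
                    ci, sj = c + i, s + j
                    if not (0 <= ci < C and 0 <= sj < S):
                        continue          # border nodes have fewer inputs
                    w = weights[s][c][(i, j)]
                    if j == -1:           # input from the finer scale
                        acc += down(prev[sj][ci], w)
                    elif j == 1:          # input from the coarser scale
                        acc += up(prev[sj][ci], w)
                    else:                 # same scale: plain convolution
                        acc += conv3(prev[sj][ci], w)
            row.append(relu(acc))
        out.append(row)
    return out
```

Each output map keeps the resolution of its scale, since finer inputs are strided down and coarser inputs are upsampled; stacking this layer L times yields the full trellis.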
Figure 1 illustrates the connectivity pattern in 2D, omitting the channel dimension for clarity. The supplementary material contains an illustration of the 3D fabric structure.\nAll channels in the first layer at the input resolution are connected to all channels of the input signal. The first layer contains additional edges to distribute the signal across coarser scales, see the vertical edges in Figure 1. More precisely, within the first layer, channel c at scale s receives input from channels c + {\u22121, 0, 1} from scale s \u2212 1. Similarly, edges within the last layer collect the signal towards the output. Note that these additional edges do not create any cycles, and that the edge-structure within the first and last layer is reminiscent of the 2D trellis in Figure 1.\n\n3.2 Stitching convolutional neural networks on the fabric\n\nWe now demonstrate how various architectural choices can be \u201cimplemented\u201d in fabrics, demonstrating that they subsume an exponentially large class of network architectures. Learning will configure a fabric to behave as one architecture or another, but more generally as an ensemble of many of them. For all but the last of the following paragraphs, it is sufficient to consider a 2D trellis, as in Figure 1, where each node contains the response maps of C channels with dense connectivity among channels.\nRe-sampling operators. A variety of re-sampling operators is available in fabrics; here we discuss ones with small receptive fields, since larger ones are obtained by repetition. Stride-two convolutions are used in fabrics on fine-to-coarse edges, and larger strides are obtained by repetition. Average pooling is obtained in fabrics by striding a uniform filter. Coarse-to-fine edges in fabrics up-sample by padding zeros around the coarse activations and then applying convolution. 
For factor-2 bilinear\ninterpolation we use a \ufb01lter that has 1 in the center, 1/4 on corners, and 1/2 elsewhere. Nearest\nneighbor interpolation is obtained using a \ufb01lter that is 1 in the four top-left entries and zero elsewhere.\nFor max-pooling over a 2 \u00d7 2 region, let a and b represent the values of two vertically neighboring\npixels. Use one layer and three channels to compute (a + b)/2, (a \u2212 b)/2, and (b \u2212 a)/2. After\nReLU, a second layer can compute the sum of the three terms, which equals max(a, b). Each pixel\nnow contains the maximum of its value and that of its vertical neighbor. Repeating the same in the\nhorizontal direction, and sub-sampling by a factor two, gives the output of 2\u00d72 max-pooling. The\nsame process can also be used to show that a network of MaxOut units [4] can be implemented in a\nnetwork of ReLU units. Although ReLU and MaxOut are thus equivalent in terms of the functions\nthey can implement, for training ef\ufb01ciency it may be more advantageous to use MaxOut networks.\nFilter sizes. To implement a 5 \u00d7 5 \ufb01lter we \ufb01rst compute nine intermediate channels to obtain a\nvectorized version of the 3\u00d73 neighborhood at each pixel, using \ufb01lters that contain a single 1, and are\nzero elsewhere. A second 3\u00d73 convolution can then aggregate values across the original 5\u00d75 patch,\nand output the desired convolution. Any 5\u00d75 \ufb01lter can be implemented exactly in this way, not only\napproximated by factorization, c.f. [27]. Repetition allows to obtain \ufb01lters of any desired size.\nOrdering convolution and re-sampling. As shown in Figure 1, chain-structured networks corre-\nspond to paths in our fabrics. If weights on edges outside a path are set to zero, a chain-structured\nnetwork with a particular sequencing of convolutions and re-sampling operators is obtained. 
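The 2\u00d72 max-pooling construction described above is easy to verify numerically: for non-negative a and b (i.e. ReLU outputs), the three channels (a+b)/2, (a\u2212b)/2, and (b\u2212a)/2 sum, after ReLU, to (a+b)/2 + |a\u2212b|/2 = max(a, b). A small sketch under that assumption, where pairwise_max and maxpool2x2 are our names:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def pairwise_max(a, b):
    # max(a, b) for non-negative a, b, built from ReLU channels:
    # layer 1 computes (a+b)/2, (a-b)/2, (b-a)/2; layer 2 sums after ReLU.
    return relu((a + b) / 2) + relu((a - b) / 2) + relu((b - a) / 2)

def maxpool2x2(x):
    # 2x2 max-pooling assembled from the pairwise construction:
    # first over vertical neighbors, then over horizontal neighbors,
    # sub-sampling by a factor two in each pass.
    v = pairwise_max(x[0::2, :], x[1::2, :])
    return pairwise_max(v[:, 0::2], v[:, 1::2])
```

Note that the first ReLU acts as the identity on (a+b)/2 because a and b are themselves non-negative, which is what makes the two-layer construction exact.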
A trellis that spans S + 1 scales and L + 1 layers contains more than (L choose S) chain-structured CNNs, since this corresponds to the number of ways to spread S sub-sampling operators across the L steps to go from the first to the last layer. More CNNs are embedded, e.g. by exploiting edges within the first and last layer, or by including intermediate up-sampling operators. Networks beyond chain-structured ones, see e.g. [3, 20, 26], are also embedded in the trellis, by activating a larger subset of edges than a single path, e.g. a tree structure for the multi-scale net of [3].\nChannel connectivity pattern. Although most networks in the literature use dense connectivity across channels between successive layers, this is not a necessity. Krizhevsky et al. [13], for example, use a network that is partially split across two independent processing streams.\nIn Figure 2 we demonstrate that a fabric that is sparsely connected along the channel axis suffices to emulate densely connected convolutional layers. This is achieved by copying channels, convolving them, and then locally aggregating them. Both the copy and sum process are based on local channel interactions and convolutions with filters that are either entirely zero, or identity filters which are all zero except for a single 1 in the center. While more efficient constructions exist to represent the densely connected layer in our trellis, the one presented here is simple to understand and suffices to demonstrate feasibility. Note that in practice learning automatically configures the trellis.\nBoth the copy and sum process generally require more than one layer to execute. In the copying process, intermediate ReLUs do not affect the result since the copied values themselves are non-negative outputs of ReLUs. 
In the convolve-and-sum process care has to be taken since one convolution might give negative outputs, even if the sum of convolutions is positive. To handle this correctly, it suffices to shift the activations by subtracting from the bias of every convolution i the minimum possible corresponding output a_i^min (which always exists for a bounded input domain). Using the adjusted bias, the output of the convolution is now guaranteed to be non-negative, and to propagate properly in the copy and sum process. In the last step of summing the convolved channels, we can add back \u2211_i a_i^min to shift the activations back to recover the desired sum of convolved channels.\n\nFigure 2: Representation of a dense-channel-connect layer in a fabric with sparse channel connections using copy and swap operations. The five input channels a, . . . , e are first copied; more copies are generated by repetition. Channels are then convolved and locally aggregated in the last two layers to compute the desired output. Channels in rows, layers in columns; scales are ignored for simplicity.\n\nTable 1: Analysis of fabrics with L layers, S scales, C channels. Number of activations given for D = 2 dim. inputs of size N\u00d7N pixels. Channel doubling across scales used in the bottom row.\n\n# chan. / scale | # resp. maps | # parameters (sparse) | # parameters (dense) | # activations\nconstant | C \u00b7 L \u00b7 S | C \u00b7 L \u00b7 3^(D+1) \u00b7 3 \u00b7 S | C \u00b7 L \u00b7 3^(D+1) \u00b7 C \u00b7 S | C \u00b7 L \u00b7 N^2 \u00b7 4/3\ndoubling | C \u00b7 L \u00b7 2^S | C \u00b7 L \u00b7 3^(D+1) \u00b7 3 \u00b7 2^S | C \u00b7 L \u00b7 3^(D+1) \u00b7 C \u00b7 4^S \u00b7 7/18 | C \u00b7 L \u00b7 N^2 \u00b7 2\n\n3.3 Analysis of the number of parameters and activations\n\nFor our analysis we ignore border effects, and consider every node to be an internal one. In the top row of Table 1 we state the total number of response maps throughout the fabric, and the number of parameters when channels are sparsely or densely connected. We also state the number of activations, which determines the memory usage of back-propagation during learning.\nWhile embedding an exponential number of architectures in the number of layers L and channels C, the number of activations and thus the memory cost during learning grows only linearly in C and L. Since each scale reduces the number of elements by a factor 2^D, the total number of elements across scales is bounded by 2^D/(2^D \u2212 1) times the number of elements N^D at the input resolution.\nThe number of parameters is linear in the number of layers L, and in the number of scales S. For sparsely connected channels, the number of parameters also grows linearly with the number of channels C, while it grows quadratically with C in case of dense connectivity.\nAs an example, the largest models we trained for 32\u00d732 input have L = 16 layers and C = 256 channels, resulting in 2M parameters (170M for dense), and 6M activations. For 256\u00d7256 input we used up to L = 16 layers and C = 64 channels, resulting in 0.7M parameters (16M for dense), and 89M activations. For reference, the VGG-19 model has 144M parameters and 14M activations.\nChannel-doubling fabrics. Doubling the number of channels when moving to coarser scales is used in many well-known architectures, see e.g. 
[26, 27]. In the second row of Table 1 we analyze fabrics with channel-doubling instead of a constant number of channels per scale. This results in C \u00b7 2^S channels throughout the scale pyramid in each layer, instead of C \u00b7 S when using a constant number of channels per scale, where we use C to denote the number of \u201cbase channels\u201d at the finest resolution. For 32\u00d732 input images the total number of channels is roughly 11\u00d7 larger, while for 256\u00d7256 images we get roughly 57\u00d7 more channels. The last column of Table 1 shows that the number of activations, however, grows only by 50% due to the coarsening of the maps.\nWith dense channel connections and 2D data, the amount of computation per node is constant, as at a coarser resolution there are 4\u00d7 fewer activations, but interactions among 2\u00d72 more channels. Therefore, in such fabrics the amount of computation grows linearly in the number of scales as compared to a single embedded CNN. For sparse channel connections, we adapt the local connectivity pattern between nodes to accommodate the varying number of channels per scale, see Figure 3 for an illustration. Each node still connects to nine other nodes at the previous layer: two inputs from scale s \u2212 1, three from scale s, and four from scale s + 1. The computational cost thus also grows only\n\nFigure 3: Diagram of sparse channel connectivity from one layer to another in a channel-doubling fabric. Channels are laid out horizontally and scales vertically. Each internal node, i.e. response map, is connected to nine nodes at the previous layer: four channels at a coarser resolution, two at a finer resolution, and to itself and neighboring channels at the same resolution.\n\nby 50% as compared to using a constant number of channels per scale. In this case, the number of parameters grows by the same factor 2^S/S as the number of channels. 
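The counts in Table 1 and the growth factors above can be sanity-checked against the example model sizes quoted in Section 3.3. The formulas below are taken directly from the table's top row (constant channels per scale, D = 2); fabric_counts is our helper name:

```python
from math import log2

def fabric_counts(L, C, N, D=2):
    # Table 1, top row: constant number of channels per scale,
    # for an N x N input (D = 2).
    S = 1 + int(log2(N))                   # scales: N x N down to 1 x 1
    resp_maps = C * L * S
    params_sparse = C * L * 3 ** (D + 1) * 3 * S
    params_dense = C * L * 3 ** (D + 1) * C * S
    activations = C * L * N ** 2 * 4 // 3  # scale pyramid sums to ~4/3 N^2
    return S, resp_maps, params_sparse, params_dense, activations

# Largest model trained on 32x32 input: L = 16, C = 256.
S, _, p_sparse, p_dense, acts = fabric_counts(L=16, C=256, N=32)
print(S, p_sparse, p_dense, acts)   # ~2M sparse / ~170M dense params, ~6M acts

# Channel-doubling growth factors, relative to constant channels per scale:
print(round(2 ** 6 / 6, 1))         # ~10.7x more channels for 32x32 (S = 6)
print(round(2 ** 9 / 9, 1))         # ~56.9x more channels for 256x256 (S = 9)
print(round(7 / 18 * 4 ** 6 / 6))   # dense-parameter blow-up factor, ~265
```

The printed values match the "roughly 11x / 57x" channel growth and the factor-265 dense-parameter blow-up quoted in the text.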
In case of dense connections, however, the number of parameters explodes with a factor (7/18) \u00b7 4^S/S. That is, roughly a factor 265 for 32\u00d732 input, and 11,327 for 256\u00d7256 input. Therefore, channel-doubling fabrics appear most useful with sparse channel connectivity. Experiments with channel-doubling fabrics are left for future work.\n\n4 Experimental evaluation results\n\nIn this section we first present the datasets used in our experiments, followed by evaluation results.\n\n4.1 Datasets and experimental protocol\n\nPart Labels dataset. This dataset [10] consists of 2,927 face images from the LFW dataset [8], with pixel-level annotations into the classes hair, skin, and background. We use the standard evaluation protocol which specifies training, validation and test sets of 1,500, 500 and 927 images, respectively. We report accuracy at pixel-level and superpixel-level. For the superpixel-level measure we average the class probabilities over the contained pixels. We used horizontal flipping for data augmentation.\nMNIST. This dataset [16] consists of 28\u00d728 pixel images of the handwritten digits 0, . . . , 9. We use the standard split of the dataset into 50k training samples, 10k validation samples and 10k test samples. Pixel values are normalized to [0, 1] by dividing them by 255. We augment the train data by randomly positioning the original image on a 32\u00d732 pixel canvas.\nCIFAR10. The CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar.html) consists of 50k 32\u00d732 training images and 10k testing images in 10 classes. We hold out 5k training images as a validation set, and use the remaining 45k as the training set. To augment the data, we follow common practice, see e.g. [4, 18], and pad the images with zeros to a 40\u00d740 image and then take a random 32\u00d732 crop; in addition we add horizontally flipped versions of these images.\nTraining. We train our fabrics using SGD with momentum of 0.9. 
After each node in the trellis we apply batch normalization [9], and regularize the model with weight decay of 10^\u22124, but we did not apply dropout [30]. We use the validation set to determine the optimal number of training epochs, and then train a final model from the train and validation data and report performance on the test set. We release our Caffe-based implementation at http://thoth.inrialpes.fr/~verbeek/fabrics.\n\n4.2 Experimental results\n\nFor all three datasets we trained sparse and dense fabrics with various numbers of channels and layers. In all cases we used a constant number of channels per scale. The results across all these settings can be found in the supplementary material; here we report only the best results among them. On all three datasets, larger trellises perform comparably to or better than smaller ones. So in practice the choice of the only two hyper-parameters of our model is not critical, as long as a large enough trellis is used.\nPart Labels. On this dataset we obtained a super-pixel accuracy of 95.6% using both sparse and dense trellises. In Figure 4 we show two examples of predicted segmentation maps. Table 2 compares our results with the state of the art, both in terms of accuracy and the number of parameters. Our results are slightly worse than those of [31, 33], but the latter are based on the VGG-16 network. That network has roughly 4,000\u00d7 more parameters than our sparse trellis, and has been trained from over 1M ImageNet images. We trained our model from scratch using only 2,000 images. Moreover, [10, 19, 31] also include CRF and/or RBM models to encode spatial shape priors. 
In contrast, our results with convolutional neural fabrics (CNF) are obtained by predicting all pixels independently.\n\nFigure 4: Examples from the Part Labels test set: input image (left), ground-truth labels (middle), and superpixel-level labels from our sparse CNF model with 8 layers and 16 channels (right).\n\nTable 2: Comparison of our results with the state of the art on Part Labels.\n\nMethod | Year | # Params. | SP Accur. | P Accur.\nTsogkas et al. [31] | 2015 | >414M | 96.97 | \u2014\nZheng et al. [33] | 2015 | >138M | 96.59 | \u2014\nLiu et al. [19] | 2015 | >33M | \u2014 | 95.24\nKae et al. [10] | 2013 | 0.7M | 94.95 | \u2014\nOurs: CNF-sparse (L = 8, C = 16) | \u2014 | 0.1M | 95.58 | 94.60\nOurs: CNF-dense (L = 8, C = 64) | \u2014 | 8.0M | 95.63 | 94.82\n\nMNIST. We obtain error rates of 0.48% and 0.33% with sparse and dense fabrics respectively. In Table 3 we compare our results to a selection of recent state-of-the-art work. We excluded several more accurate results reported in the literature, since they are based on significantly more elaborate data augmentation methods. Our result with a densely connected fabric is comparable to those of [32], which use similar data augmentation. Our sparse model, which has 20\u00d7 fewer parameters than the dense variant, yields a slightly higher error of 0.48%.\nCIFAR10. In Table 4 we compare our results to the state of the art. Our error rate of 7.43% with a dense fabric is comparable to that reported with MaxOut networks [4]. On this dataset the error of the sparse model, 18.89%, is significantly worse than that of the dense model. This is either due to a lack of capacity in the sparse model, or due to difficulties in optimization. The best error of 5.84% [22] was obtained using residual connections; without residual connections they report an error of 6.06%.\nVisualization. In Figure 5 we visualize the connection strengths of learned fabrics with dense channel connectivity. 
We observe qualitative differences between learned fabrics. The semantic segmentation model (left) immediately distributes the signal across the scale pyramid (first layer/column), and then progressively aggregates the multi-scale signal towards the output. In the CIFAR10 classification model the signal is instead progressively downsampled, exploiting multiple scales in each layer. The figure also shows the result of heuristically pruning the weakest connections (by thresholding) to find a smaller sub-network with good performance. We pruned 67% of the connections while increasing the error only from 7.4% to 8.1% after fine-tuning the fabric with the remaining connections. Notice that all up-sampling connections are deactivated after pruning.

Table 3: Comparison of our results with the state of the art on MNIST. Data augmentation with translation and flipping is denoted by T and F respectively; N denotes no data augmentation.

                                       Year   Augmentation   # Params.   Error (%)
Chang et al. [1]                       2015   N              447 K       0.24
Lee et al. [17]                        2015   N              —           0.31
Wan et al. (Dropconnect) [32]          2013   T              379 K       0.32
CKN [21]                               2014   N              43 K        0.39
Goodfellow et al. (MaxOut) [4]         2013   N              420 K       0.45
Lin et al. (Network in Network) [18]   2013   N              —           0.47
Ours: CNF-sparse (L = 16, C = 32)             T              249 K       0.48
Ours: CNF-dense (L = 8, C = 64)               T              5.3 M       0.33

Table 4: Comparison of our results with the state of the art on CIFAR10. Data augmentation with translation, flipping, scaling and rotation is denoted by T, F, S and R respectively.

                                                   Year   Augmentation   # Params.   Error (%)
Mishkin & Matas [22]                               2016   T+F            2.5 M       5.84
Lee et al. [17]                                    2015   T+F            1.8 M       6.05
Chang et al. [1]                                   2015   T+F            1.6 M       6.75
Springenberg et al. (All Convolutional Net) [29]   2015   T+F            1.3 M       7.25
Lin et al. (Network in Network) [18]               2013   T+F            1 M         8.81
Wan et al. (Dropconnect) [32]                      2013   T+F+S+R        19 M        9.32
Goodfellow et al. (MaxOut) [4]                     2013   T+F            >6 M        9.38
Ours: CNF-sparse (L = 16, C = 64)                         T+F            2 M         18.89
Ours: CNF-dense (L = 8, C = 128)                          T+F            21.2 M      7.43

Figure 5: Visualization of mean-squared filter weights in fabrics learned for Part Labels (left) and CIFAR10 (right, pruned network connections). Layers are laid out horizontally, and scales vertically.

5 Conclusion

We presented convolutional neural fabrics: homogeneous and locally connected trellises over response maps. Fabrics subsume a large class of convolutional networks, and make it possible to sidestep the tedious process of specifying, training, and testing individual network architectures in order to find the best ones. While fabrics use more parameters, memory, and computation than needed for each of the individual architectures embedded in them, this is far less costly than the resources required to test all embedded architectures one by one. Fabrics have only two main hyper-parameters: the number of layers and the number of channels. In practice their setting is not critical: we just need a large enough fabric with sufficient capacity. We propose variants with dense channel connectivity, and with channel-doubling over scales; the latter strikes a very attractive capacity/memory trade-off.

In our experiments we study the performance of fabrics for image classification on MNIST and CIFAR10, and for semantic segmentation on Part Labels. We obtain excellent results that are close to the best reported in the literature on all three datasets. These results suggest that fabrics are competitive with the best hand-crafted CNN architectures, albeit using more parameters in some cases (but far fewer on Part Labels). 
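The pruning heuristic used in the CIFAR10 experiment — ranking connections by a saliency such as their mean-squared filter weight and thresholding away the weakest — can be sketched as follows. This is a minimal illustration, not the authors' implementation; `prune_connections` and the toy strength values are hypothetical.

```python
def prune_connections(strengths, prune_frac=0.67):
    """Deactivate the weakest fraction of fabric connections.

    strengths:  per-connection saliency, e.g. the mean-squared filter
                weights of each trellis edge (hypothetical input here)
    prune_frac: fraction of connections to remove, weakest first
    Returns a parallel list of booleans: True = connection is kept.
    """
    n_prune = int(round(prune_frac * len(strengths)))
    if n_prune == 0:
        return [True] * len(strengths)
    # The threshold is the largest saliency among the pruned connections.
    threshold = sorted(strengths)[n_prune - 1]
    return [s > threshold for s in strengths]

# Toy example: 6 connections, drop the weakest half.
strengths = [0.9, 0.05, 0.4, 0.01, 0.7, 0.02]
print(prune_connections(strengths, prune_frac=0.5))
# → [True, False, True, False, True, False]
```

After thresholding, the fabric would be fine-tuned with only the surviving connections, as in the experiment above (67% pruned, with the error rising from 7.4% to 8.1%).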
We expect that results can be further improved by using better optimization schemes such as Adam [12], using dropout [30] or DropConnect [32] regularization, and using MaxOut units [4] or residual units [6] to facilitate the training of deep fabrics with many channels. In ongoing work we experiment with channel-doubling fabrics, and with fabrics for joint image classification, object detection, and segmentation. We also explore channel connectivity patterns in between the sparse and dense options used here. Finally, we work on variants that are convolutional along the scale axis, so as to obtain scale-invariant processing that generalizes better across scales.

Acknowledgment. We would like to thank NVIDIA for the donation of GPUs used in this research. This work has been partially supported by the LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01).

References
[1] J.-R. Chang and Y.-S. Chen. Batch-normalized maxout network in network. arXiv preprint, 2015.
[2] T. Chen, I. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. In ICLR, 2016.
[3] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 35(8):1915–1929, 2013.
[4] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[5] A. Graves, S. Fernández, and J. Schmidhuber. Multi-dimensional recurrent neural networks. In ICANN, 2007.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[7] S. Honari, J. Yosinski, P. Vincent, and C. Pal. Recombinator networks: Learning coarse-to-fine feature aggregation. In CVPR, 2016.
[8] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[9] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[10] A. Kae, K. Sohn, H. Lee, and E. Learned-Miller. Augmenting CRFs with Boltzmann machine shape priors for image labeling. In CVPR, 2013.
[11] N. Kalchbrenner, I. Danihelka, and A. Graves. Grid long short-term memory. In ICLR, 2016.
[12] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[13] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] P. Kulkarni, J. Zepeda, F. Jurie, P. Pérez, and L. Chevallier. Learning the structure of deep architectures using ℓ1 regularization. In BMVC, 2015.
[15] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1989.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324, 1998.
[17] C.-Y. Lee, P. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In AISTATS, 2016.
[18] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[19] S. Liu, J. Yang, C. Huang, and M.-H. Yang. Multi-objective convolutional learning for face labeling. In CVPR, 2015.
[20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[21] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In NIPS, 2014.
[22] D. Mishkin and J. Matas. All you need is a good init. In ICLR, 2016.
[23] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.
[24] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[25] T. Pfister, J. Charles, and A. Zisserman. Flowing ConvNets for human pose estimation in videos. In CVPR, 2015.
[26] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[28] S. Singh, D. Hoiem, and D. Forsyth. Swapout: Learning an ensemble of deep architectures. In NIPS, 2016.
[29] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In ICLR, 2015.
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
[31] S. Tsogkas, I. Kokkinos, G. Papandreou, and A. Vedaldi. Deep learning for semantic part segmentation with high-level guidance. arXiv preprint, 2015.
[32] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using DropConnect. In ICML, 2013.
[33] H. Zheng, Y. Liu, M. Ji, F. Wu, and L. Fang. Learning high-level prior with convolutional neural networks for semantic segmentation. arXiv preprint, 2015.
[34] Y. Zhou, X. Hu, and B. Zhang. Interlinked convolutional neural networks for face parsing. In International Symposium on Neural Networks, 2015.