{"title": "Convolutional Neural Networks with Intra-Layer Recurrent Connections for Scene Labeling", "book": "Advances in Neural Information Processing Systems", "page_first": 937, "page_last": 945, "abstract": "Scene labeling is a challenging computer vision task. It requires the use of both local discriminative features and global context information. We adopt a deep recurrent convolutional neural network (RCNN) for this task, which is originally proposed for object recognition. Different from traditional convolutional neural networks (CNN), this model has intra-layer recurrent connections in the convolutional layers. Therefore each convolutional layer becomes a two-dimensional recurrent neural network. The units receive constant feed-forward inputs from the previous layer and recurrent inputs from their neighborhoods. While recurrent iterations proceed, the region of context captured by each unit expands. In this way, feature extraction and context modulation are seamlessly integrated, which is different from typical methods that entail separate modules for the two steps. To further utilize the context, a multi-scale RCNN is proposed. Over two benchmark datasets, Standford Background and Sift Flow, the model outperforms many state-of-the-art models in accuracy and efficiency.", "full_text": "Convolutional Neural Networks with Intra-layer\n\nRecurrent Connections for Scene Labeling\n\nMing Liang\n\nXiaolin Hu\n\nBo Zhang\n\nTsinghua National Laboratory for Information Science and Technology (TNList)\n\nDepartment of Computer Science and Technology\n\nCenter for Brain-Inspired Computing Research (CBICR)\n\nliangm07@mails.tsinghua.edu.cn, {xlhu,dcszb}@tsinghua.edu.cn\n\nTsinghua University, Beijing 100084, China\n\nAbstract\n\nScene labeling is a challenging computer vision task. It requires the use of both\nlocal discriminative features and global context information. 
We adopt a deep\nrecurrent convolutional neural network (RCNN) for this task, which was originally\nproposed for object recognition. Unlike traditional convolutional neural\nnetworks (CNN), this model has intra-layer recurrent connections in the convolutional\nlayers. Therefore each convolutional layer becomes a two-dimensional\nrecurrent neural network. The units receive constant feed-forward inputs from the\nprevious layer and recurrent inputs from their neighborhoods. As the recurrent\niterations proceed, the region of context captured by each unit expands. In this\nway, feature extraction and context modulation are seamlessly integrated, which\nis different from typical methods that entail separate modules for the two steps.\nTo further utilize the context, a multi-scale RCNN is proposed. Over two benchmark\ndatasets, Stanford Background and Sift Flow, the model outperforms many\nstate-of-the-art models in accuracy and efficiency.\n\n1 Introduction\n\nScene labeling (or scene parsing) is an important step towards high-level image interpretation. It\naims at fully parsing the input image by labeling the semantic category of each pixel. Compared\nwith image classification, scene labeling is more challenging as it simultaneously solves both segmentation\nand recognition. The typical approach for scene labeling consists of two steps. First,\nextract local handcrafted features [6, 15, 26, 23, 27]. Second, integrate context information using\nprobabilistic graphical models [6, 5, 18] or other techniques [24, 21]. In recent years, motivated by\nthe success of deep neural networks in learning visual representations, CNN [12] has been incorporated\ninto this framework for feature extraction.\n
However, since CNN does not have an explicit mechanism\nto modulate its features with context, to achieve better results, other methods such as conditional\nrandom field (CRF) [5] and recursive parsing tree [21] are still needed to integrate the context information.\nIt would be interesting to have a neural network capable of performing scene labeling in an\nend-to-end manner.\n\nFigure 1: Training and testing processes of multi-scale RCNN for scene labeling. Solid lines denote\nfeed-forward connections and dotted lines denote recurrent connections.\n\nA natural way to incorporate context modulation in neural networks is to introduce recurrent connections.\nThis has been extensively studied in sequence learning tasks such as online handwriting\nrecognition [8], speech recognition [9] and machine translation [25]. The sequential data has strong\ncorrelations along the time axis. Recurrent neural networks (RNN) are suitable for these tasks because\nthe long-range context information can be captured by a fixed number of recurrent weights.\nTreating scene labeling as a two-dimensional variant of sequence learning, RNN can also be applied,\nbut the studies are relatively scarce. Recently, a recurrent CNN (RCNN) in which the output of the\ntop layer of a CNN is integrated with the input in the bottom is successfully applied to scene labeling\n[19]. Without the aid of extra preprocessing or post-processing techniques, it achieves competitive\nresults. This type of recurrent connections captures both local and global information for labeling\na pixel, but it achieves this goal indirectly as it does not model the relationship between pixels (or\nthe corresponding units in the hidden layers of CNN) in the 2D space explicitly. To achieve the goal\ndirectly, recurrent connections are required to be between units within layers. This type of RCNN\nhas been proposed in [14], but there it is used for object recognition.\n
It is unknown if it is useful for\nscene labeling, a more challenging task. This motivates the present work.\nA prominent structural property of RCNN is that feed-forward and recurrent connections co-exist\nin multiple layers. This property enables the seamless integration of feature extraction and context\nmodulation in multiple levels of representation. In other words, an RCNN can be seen as a deep\nRNN which is able to encode the multi-level context dependency. Therefore we expect RCNN to be\ncompetent for scene labeling.\nMulti-scale is another technique for capturing both local and global information for scene labeling\n[5]. Therefore we adopt a multi-scale RCNN [14]. An RCNN is used for each scale. See Figure 1 for\nits overall architecture. The networks in different scales have exactly the same structure and weights.\nThe outputs of all networks are concatenated and input to a softmax layer. The model operates in an\nend-to-end fashion, and does not need any preprocessing or post-processing techniques.\n\n2 Related Work\n\nMany models, either non-parametric [15, 27, 3, 23, 26] or parametric [6, 13, 18], have been proposed\nfor scene labeling. A comprehensive review is beyond the scope of this paper. Below we brie\ufb02y\nreview the neural network models for scene labeling.\nIn [5], a multi-scale CNN is used to extract local features for scene labeling. The weights are shared\namong the CNNs for all scales to keep the number of parameters small. However, the multi-scale\nscheme alone has no explicit mechanism to ensure the consistency of neighboring pixels\u2019 labels.\nSome post-processing techniques, such as superpixels and CRF, are shown to signi\ufb01cantly improve\nthe performance of multi-scale CNN. In [1], CNN features are combined with a fully connected\nCRF for more accurate segmentations. In both models [5, 1] CNN and CRF are trained in separated\nstages. 
In [29] CRF is reformulated and implemented as an RNN, which can be jointly trained with\nCNN by the back-propagation (BP) algorithm.\nIn [24], a recursive neural network is used to learn a mapping from visual features to the semantic\nspace, which is then used to determine the labels of pixels. In [21], a recursive context propagation\nnetwork (rCPN) is proposed to better make use of the global context information. The rCPN is fed a\nsuperpixel representation of CNN features. Through a parsing tree, the rCPN recursively aggregates\ncontext information from all superpixels and then disseminates it to each superpixel. Although\nrecursive neural networks are related to RNNs as they both use weight sharing between different layers,\nthey have a significant structural difference. The former has a single path from the input layer to the\noutput layer while the latter has multiple paths [14]. As will be shown in Section 4, this difference\nhas a great influence on the performance in scene labeling.\nTo the best of our knowledge, the first end-to-end neural network model for scene labeling is\nthe deep CNN proposed in [7]. The model is trained by a supervised greedy learning strategy.\nIn [19], another end-to-end model is proposed. Top-down recurrent connections are incorporated\ninto a CNN to capture context information. In the first recurrent iteration, the CNN receives a raw\npatch and outputs a predicted label map (downsampled due to pooling). In other iterations, the CNN\nreceives both a downsampled patch and the label map predicted in the previous iteration and then\noutputs a new predicted label map.\n
Compared with the models in [5, 21], this approach is simple\nand elegant but its performance is not the best on some benchmark datasets. Note that the\nmodels in [14] and [19] are both called RCNN. For convenience, in what follows, unless otherwise specified, RCNN\nrefers to the model in [14].\n\n3 Model\n\n3.1 RCNN\n\nThe key module of the RCNN is the recurrent convolutional layer (RCL). A generic RNN with feed-forward input u(t), internal\nstate x(t) and parameters \u03b8 can be described by:\n\n\\[ x(t) = F(u(t), x(t-1), \\theta) \\tag{1} \\]\n\nwhere F is the function describing the dynamic behavior of RNN.\nThe RCL introduces recurrent connections into a convolutional layer (see Figure 2A for an illustration).\nIt can be regarded as a special two-dimensional RNN, whose feed-forward and recurrent\ncomputations both take the form of convolution:\n\n\\[ x_{ijk}(t) = \\sigma\\left( (w_k^f)^\\top u^{(i,j)}(t) + (w_k^r)^\\top x^{(i,j)}(t-1) + b_k \\right) \\tag{2} \\]\n\nwhere u^{(i,j)} and x^{(i,j)} are vectorized square patches centered at (i, j) of the feature maps of the\nprevious layer and the current layer, w_k^f and w_k^r are the weights of feed-forward and recurrent\nconnections for the kth feature map, and b_k is the kth element of the bias. The function \u03c3 used in this paper\nis composed of two functions, \u03c3(z_{ijk}) = h(g(z_{ijk})), where g is the widely used rectified linear\nfunction g(z_{ijk}) = max(z_{ijk}, 0), and h is the local response normalization (LRN) [11]:\n\n\\[ h(g(z_{ijk})) = \\frac{g(z_{ijk})}{\\left( 1 + \\frac{\\alpha}{L} \\sum_{k'=\\max(0,\\,k-L/2)}^{\\min(K,\\,k+L/2)} \\left( g(z_{ijk'}) \\right)^2 \\right)^{\\beta}} \\tag{3} \\]\n\nwhere K is the number of feature maps, and \u03b1 and \u03b2 are constants controlling the amplitude of normalization.\nThe LRN forces the units in the same location to compete for high activities, which mimics\nthe lateral inhibition in the cortex. In our experiments, LRN is found to consistently improve the\naccuracy, though slightly.\n
Following [11], \u03b1 and \u03b2 are set to 0.001 and 0.75, respectively. L is set\nto K/8 + 1.\nDuring the training or testing phase, an RCL is unfolded for T time steps into a multi-layer subnetwork.\nT is a predetermined hyper-parameter. See Figure 2B for an example with T = 3. The\nreceptive field (RF) of each unit expands with larger T, so that more context information is captured.\nThe depth of the subnetwork also increases with larger T. In the meantime, the number of\nparameters is kept constant due to weight sharing.\nLet u_0 denote the static input (e.g., an image). The input to the RCL, denoted by u(t), can take this\nconstant u_0 for all t. But here we adopt a more general form:\n\n\\[ u(t) = \\gamma u_0 \\tag{4} \\]\n\nwhere \u03b3 \u2208 [0, 1] is a discount factor, which determines the tradeoff between the feed-forward component\nand the recurrent component. When \u03b3 = 0, the feed-forward component is totally discarded\nafter the first iteration. In this case the network behaves like the so-called recursive convolutional\nnetwork [4], in which several convolutional layers have tied weights. There is only one path from\ninput to output. When \u03b3 > 0, the network is a typical RNN. There are multiple paths from input to\noutput (see Figure 2B).\n\nFigure 2: Illustration of the RCL and RCNN used in this paper. Solid arrows denote feed-forward\nconnections and dotted arrows denote recurrent connections.\n\nRCNN is composed of a stack of RCLs. Between neighboring RCLs there are only feed-forward\nconnections. Max pooling layers are optionally interleaved between RCLs. The total number of\nrecurrent iterations is set to T for all N RCLs. There are two approaches to unfold an RCNN.\nFirst, unfold the RCLs one by one, and each RCL is unfolded for T time steps before feeding to\nthe next RCL (see Figure 2C). This unfolding approach multiplicatively increases the depth of the\nnetwork. The largest depth of the network is proportional to NT.\n
In the second approach, at each\ntime step the states of all RCLs are updated successively (see Figure 2D). The unfolded network\nhas a two-dimensional structure whose x axis is the time step and y axis is the level of layer. This\nunfolding approach additively increases the depth of the network. The largest depth of the network\nis proportional to N + T .\nWe adopt the \ufb01rst unfolding approach due to the following advantages. First, it leads to larger\neffective RF and depth, which are important for the performance of the model. Second, the second\napproach is more computationally intensive since the feed-forward inputs need to be updated at each\ntime step. However, in the \ufb01rst approach the feed-forward input of each RCL needs to be computed\nfor only once.\n\n3.2 Multi-scale RCNN\n\nIn natural scenes objects appear in various sizes. To capture this variability, the model should be\nscale invariant. In [5], a multi-scale CNN is proposed to extract features for scene labeling, in which\nseveral CNNs with shared weights are used to process images of different scales. This approach is\nadopted to construct the multi-scale RCNN (see Figure 1). The original image corresponds to the\n\ufb01nest scale. Images of coarser scales are obtained simply by max pooling the original image. The\noutputs of all RCNNs are concatenated to form the \ufb01nal representation. 
For pixel p, its probability\nof falling into the cth semantic category is given by a softmax layer:\n\n\\[ y_c^p = \\frac{\\exp\\left( w_c^\\top f^p \\right)}{\\sum_{c'} \\exp\\left( w_{c'}^\\top f^p \\right)} \\quad (c = 1, 2, \\ldots, C) \\tag{5} \\]\n\nwhere f^p denotes the concatenated feature vector of pixel p, and w_c denotes the weight for the cth\ncategory.\nThe loss function is the cross entropy between the predicted probability y_c^p and the true hard label \u0177_c^p:\n\n\\[ L = -\\sum_p \\sum_c \\hat{y}_c^p \\log y_c^p \\tag{6} \\]\n\nwhere \u0177_c^p = 1 if pixel p is labeled as c and \u0177_c^p = 0 otherwise. The model is trained by backpropagation\nthrough time (BPTT) [28], that is, unfolding all the RCNNs to feed-forward networks and applying the\nBP algorithm.\n\n3.3 Patch-wise Training and Image-wise Testing\n\nMost neural network models for scene labeling [5, 19, 21] are trained by the patch-wise approach.\nThe training samples are randomly cropped image patches whose labels correspond to the categories\nof their center pixels. Valid convolutions are used in both feed-forward and recurrent computation.\nThe patch is set to a proper size so that the last feature map has exactly the size of 1 \u00d7 1.\nIn image-wise training, an image is input to the model and the output has exactly the same size as the\nimage. The loss is the average of the losses over all pixels. We have conducted experiments with both training\nmethods, and found that image-wise training seriously suffered from over-fitting. A possible reason\nis that the pixels in an image are too strongly correlated. So patch-wise training is used in all our\nexperiments. In [16], it is suggested that image-wise and patch-wise training are equally effective\nand the former is faster to converge.\n
But their model is obtained by finetuning the VGG [22] model\npretrained on ImageNet [2]. This conclusion may not hold for models trained from scratch.\nIn the testing phase, the patch-wise approach is time consuming because the patches corresponding\nto all pixels need to be processed. We therefore use image-wise testing. There are two image-wise\ntesting approaches to obtain dense label maps. The first is the shift-and-stitch approach [20, 19].\nWhen the predicted label map is downsampled by a factor of s, the original image is shifted\nand processed s^2 times. Each time, the image is shifted by (x, y) pixels to the right and\ndown. Both x and y take their values from {0, 1, 2, ..., s \u2212 1}, and the shifted image is padded\nwith zeros at its left and top borders. The outputs for all shifted images are interleaved so that\neach pixel has a corresponding prediction. The shift-and-stitch approach needs to process the image\ns^2 times, although it produces exactly the same prediction as patch-wise testing. The second approach\ninputs the entire image to the network to obtain a downsampled label map, then simply upsamples\nthe map to the same resolution as the input image, using bilinear or other interpolation methods (see\nFigure 1, bottom). This approach may suffer from a loss of accuracy, but is very efficient. The\ndeconvolutional layer proposed in [16], which is the backpropagation\ncounterpart of the convolutional layer, is adopted for upsampling. The deconvolutional weights are set to simulate bilinear\ninterpolation. Both image-wise testing methods are used in our experiments.\n\n4 Experiments\n\n4.1 Experimental Settings\n\nExperiments are performed over two benchmark datasets for scene labeling, Sift Flow [15] and\nStanford Background [6]. The Sift Flow dataset contains 2688 color images, all of which have the\nsize of 256 \u00d7 256 pixels.\n
Among them 2488 images are training data, and the remaining 200 images\nare testing data. There are 33 semantic categories, and the class frequency is highly unbalanced.\nThe Stanford background dataset contains 715 color images, most of them have the size of 320 \u00d7\n240 pixels. Following [6] 5-fold cross validation is used over this dataset. In each fold there are\n572 training images and 143 testing images. The pixels have 8 semantic categories and the class\nfrequency is more balanced than the Sift Flow dataset.\nIn most of our experiments, RCNN has three parameterized layers (Figure 2E). The \ufb01rst parame-\nterized layer is a convolutional layer followed by a 2 \u00d7 2 non-overlapping max pooling layer. This\nis to reduce the size of feature maps and thus save the computing cost and memory. The other two\nparameterized layers are RCLs. Another 2 \u00d7 2 max pooling layer is placed between the two RCLs.\nThe numbers of feature maps in these layers are 32, 64 and 128. The \ufb01lter size in the \ufb01rst convolu-\ntional layer is 7 \u00d7 7, and the feed-forward and recurrent \ufb01lters in RCLs are all 3 \u00d7 3. Three scales\nof images are used and neighboring scales differed by a factor of 2 in each side of the image.\nThe models are implemented using Caffe [10]. They are trained using stochastic gradient descent\nalgorithm. For the Sift Flow dataset, the hyper-parameters are determined on a separate validation\nset. The same set of hyper-parameters is then used for the Stanford Background dataset. Dropout\nand weight decay are used to prevent over-\ufb01tting. Two dropout layers are used, one after the second\npooling layer and the other before the concatenation of different scales. The dropout ratio is 0.5 and\nweight decay coef\ufb01cient is 0.0001. The base learning rate is 0.001, which is reduced to 0.0001 when\nthe training error enters a plateau. 
Overall, about ten million patches have been input to the model\nduring training.\nData augmentation is used in many models [5, 21] for scene labeling to prevent over-fitting. It is a\ntechnique to distort the training data with a set of transformations, so that additional data is generated\nto improve the generalization ability of the models. This technique is only used in Section 4.3 for\nthe sake of fairness in comparison with other models. Augmentation includes horizontal reflection\nand resizing.\n\n4.2 Model Analysis\n\nWe empirically analyze the performance of RCNN models for scene labeling on the Sift Flow\ndataset. The results are shown in Table 1. Two metrics, the per-pixel accuracy (PA) and the average\nper-class accuracy (CA), are used. PA is the ratio of correctly classified pixels to the total\npixels in testing images. CA is the average of all category-wise accuracies. The following results\nare obtained using the shift-and-stitch testing and without any data augmentation. Note that all\nmodels have a multi-scale architecture.\n\nModel | Patch size | No. Param. | PA (%) | CA (%)\nRCNN, \u03b3 = 1, T = 3 | 232 | 0.28M | 80.3 | 31.9\nRCNN, \u03b3 = 1, T = 4 | 256 | 0.28M | 81.6 | 33.2\nRCNN, \u03b3 = 1, T = 5 | 256 | 0.28M | 82.3 | 34.3\nRCNN-large, \u03b3 = 1, T = 3 | 256 | 0.65M | 83.4 | 38.9\nRCNN, \u03b3 = 0, T = 3 | 232 | 0.28M | 80.5 | 34.2\nRCNN, \u03b3 = 0, T = 4 | 256 | 0.28M | 79.9 | 31.4\nRCNN, \u03b3 = 0, T = 5 | 256 | 0.28M | 80.4 | 31.7\nRCNN-large, \u03b3 = 0, T = 3 | 256 | 0.65M | 78.1 | 29.4\nRCNN, \u03b3 = 0.25, T = 5 | 256 | 0.28M | 82.4 | 35.4\nRCNN, \u03b3 = 0.5, T = 5 | 256 | 0.28M | 81.8 | 34.7\nRCNN, \u03b3 = 0.75, T = 5 | 256 | 0.28M | 82.8 | 35.8\nRCNN, no share, \u03b3 = 1, T = 5 | 256 | 0.28M | 81.3 | 33.3\nCNN1 | 88 | 0.33M | 74.9 | 24.1\nCNN2 | 136 | 0.28M | 78.5 | 28.8\n\nTable 1: Model analysis over the Sift Flow dataset. We limit the maximum size of input patch to\n256, which is the size of the image in the Sift Flow dataset.\n
This is achieved by replacing the first\nfew valid convolutions by same convolutions.\n\nFirst, the influence of \u03b3 in (4) is investigated. The patch sizes of images for different models are set\nsuch that the size of the last feature map is 1 \u00d7 1. We mainly investigate two specific values \u03b3 = 1\nand \u03b3 = 0 with different iteration numbers T. Several other values of \u03b3 are tested with T = 5. See\nTable 1 for details. For RCNN with \u03b3 = 1, the performance monotonically increases with more time\nsteps. This is not the case for RCNN with \u03b3 = 0, with which the network tends to over-fit\nwith more iterations. To further investigate this issue, a larger model denoted as RCNN-large is\ntested. It has four RCLs, and has more parameters and larger depth. With \u03b3 = 1 it achieves a better\nperformance than RCNN. However, the RCNN-large with \u03b3 = 0 obtains worse performance than\nRCNN. When \u03b3 is set to other values, 0.25, 0.5 or 0.75, the performance seems better than with \u03b3 = 1\nbut the difference is small.\nSecond, the influence of weight sharing in recurrent connections is investigated. Another RCNN\nwith \u03b3 = 1 and T = 5 is tested. Its recurrent weights in different iterations are not shared anymore,\nwhich leads to more parameters than shared ones. But this setting leads to worse accuracy both for\nPA and CA. A possible reason is that more parameters make the model more prone to over-fitting.\nThird, two feed-forward CNNs are constructed for comparison. CNN1 is constructed by removing\nall recurrent connections from RCNN, and then increasing the numbers of feature maps in each\nlayer from 32, 64 and 128 to 60, 120 and 240, respectively. CNN2 is constructed by removing\nthe recurrent connections and adding two extra convolutional layers.\n
CNN2 has five convolutional\nlayers and the corresponding numbers of feature maps are 32, 64, 64, 128 and 128, respectively.\nWith these settings, the two models have approximately the same number of parameters as RCNN,\nwhich is for the sake of fair comparison. The two CNNs are outperformed by the RCNNs by a\nsignificant margin. Compared with the RCNN, the topmost units in these two CNNs cover much\nsmaller regions (see the patch size column in Table 1). Note that all convolutions in these models\nare performed in \u201cvalid\u201d mode. This mode decreases the size of feature maps and as a consequence\n(together with max pooling) increases the RF size of the top units. Since the CNNs have fewer\nconvolutional layers than the time-unfolded RCNNs, their RF sizes of the top units are smaller.\n\nFigure 3: Examples of scene labeling results from the Stanford Background dataset. \u201cmntn\u201d denotes\nmountains, and \u201cobject\u201d denotes foreground objects.\n\nModel | No. Param. | PA (%) | CA (%) | Time (s)\nLiu et al. [15] | NA | 76.7 | NA | 31 (CPU)\nTighe and Lazebnik [27] | NA | 77.0 | 30.1 | 8.4 (CPU)\nEigen and Fergus [3] | NA | 77.1 | 32.5 | 16.6 (CPU)\nSingh and Kosecka [23] | NA | 79.2 | 33.8 | 20 (CPU)\nTighe and Lazebnik [26] | NA | 78.6 | 39.2 | \u2265 8.4 (CPU)\nMulti-scale CNN + cover [5] | 0.43 M | 78.5 | 29.6 | NA\nMulti-scale CNN + cover (balanced) [5] | 0.43 M | 72.3 | 50.8 | NA\nTop-down RCNN [19] | 0.09 M | 77.7 | 29.8 | NA\nMulti-scale CNN + rCPN [21] | 0.80 M | 79.6 | 33.6 | 0.37 (GPU)\nMulti-scale CNN + rCPN (balanced) [21] | 0.80 M | 75.5 | 48.0 | 0.37 (GPU)\nRCNN | 0.28 M | 83.5 | 35.8 | 0.03 (GPU)\nRCNN (balanced) | 0.28 M | 79.3 | 57.1 | 0.03 (GPU)\nRCNN-small | 0.07 M | 81.7 | 32.6 | 0.02 (GPU)\nRCNN-large | 0.65 M | 84.3 | 41.0 | 0.04 (GPU)\nFCNN [16] (*finetuned from VGG model [22]) | 134 M | 85.1 | 51.7 | \u223c 0.33 (GPU)\n\nTable 2: Comparison with the state-of-the-art models over the Sift Flow dataset.\n\n4.3 Comparison with the State-of-the-art Models\n\nNext, we compare the results of RCNN and the state-of-the-art models. The RCNN with \u03b3 = 1\nand T = 5 is used for comparison. The results are obtained using the upsampling testing approach\nfor efficiency. Data augmentation is employed in training because it is used by many other models\n[5, 21]. The images are only preprocessed by removing the average RGB values computed over\ntraining images.\n\nModel | No. Param. | PA (%) | CA (%) | Time (s)\nGould et al. [6] | NA | 76.4 | NA | 30 to 60 (CPU)\nTighe and Lazebnik [27] | NA | 77.5 | NA | 12 (CPU)\nSocher et al. [24] | NA | 78.1 | NA | NA\nEigen and Fergus [3] | NA | 75.3 | 66.5 | 16.6 (CPU)\nSingh and Kosecka [23] | NA | 74.1 | 62.2 | 20 (CPU)\nLempitsky et al. [13] | NA | 81.9 | 72.4 | \u2265 60 (CPU)\nMultiscale CNN + CRF [5] | 0.43 M | 81.4 | 76.0 | 60.5 (CPU)\nTop-down RCNN [19] | 0.09 M | 80.2 | 69.9 | 10.6 (CPU)\nSingle-scale CNN + rCPN [21] | 0.80 M | 81.9 | 73.6 | 0.5 (GPU)\nMultiscale CNN + rCPN [21] | 0.80 M | 81.0 | 78.8 | 0.37 (GPU)\nZoom-out [17] | 0.23 M | 82.1 | 77.3 | NA\nRCNN | 0.28 M | 83.1 | 74.8 | 0.03 (GPU)\n\nTable 3: Comparison with the state-of-the-art models over the Stanford Background dataset.\n\nThe results over the Sift Flow dataset are shown in Table 2. Besides the PA and CA, the time for\nprocessing an image is also presented. For neural network models, the number of parameters is also\nshown.\n
When extra training data from other datasets is not used, the RCNN outperforms all other\nmodels in terms of the PA metric by a significant margin.\nThe RCNN has fewer parameters than most of the other neural network models except the top-down\nRCNN [19]. A small RCNN (RCNN-small) is then constructed by reducing the numbers of feature\nmaps in RCNN to 16, 32 and 64, respectively, so that its total number of parameters is 0.07 million.\nThe PA and CA of the small RCNN are 81.7% and 32.6%, respectively, significantly higher than\nthose of the top-down RCNN.\nNote that a better result over this dataset has been achieved by the fully convolutional network (FCN)\n[16]. However, FCN is finetuned from the VGG [22] net trained over the 1.2 million images of\nImageNet, and has approximately 134 million parameters. Being trained over only 2488 images, RCNN\nis outperformed by only 1.6 percentage points on PA. This gap can be further reduced by using larger RCNN\nmodels. For example, the RCNN-large model of Table 1 achieves a PA of 84.3% with data augmentation.\nThe class distribution in the Sift Flow dataset is highly unbalanced, which is harmful to the CA\nperformance. In [5], frequency balance is used so that patches in different classes appear with the\nsame frequency. This operation greatly enhances the CA value. For better comparison, we also test\nan RCNN with weighted sampling (balanced) so that the rarer classes appear more frequently. In\nthis case, the RCNN achieves a much higher CA than other methods including FCN, while still\nkeeping a good PA.\nThe results over the Stanford Background dataset are shown in Table 3. The set of hyper-parameters\nused for the Sift Flow dataset is adopted without further tuning. Frequency balance is not used. The\nRCNN again achieves the best PA score, although CA is not the best.\n
Some typical results of RCNN\nare shown in Figure 3.\nOn a GTX Titan black GPU, it takes about 0.03 second for the RCNN and 0.02 second for the\nRCNN-small to process an image. Compared with other models, the ef\ufb01ciency of RCNN is mainly\nattributed to its end-to-end property. For example, the rCPN model takes much time in obtaining the\nsuperpixels.\n\n5 Conclusion\n\nA multi-scale recurrent convolutional neural network is used for scene labeling. The model is able to\nperform local feature extraction and context integration simultaneously in each parameterized layer,\ntherefore particularly \ufb01ts this application because both local and global information are critical for\ndetermining the label of a pixel in an image. This is an end-to-end approach and can be simply\ntrained by the BPTT algorithm. Experimental results over two benchmark datasets demonstrate the\neffectiveness and ef\ufb01ciency of the model.\n\nAcknowledgements\n\nWe are grateful to the anonymous reviewers for their valuable comments. This work was supported\nin part by the National Basic Research Program (973 Program) of China under Grant 2012CB316301\nand Grant 2013CB329403, in part by the National Natural Science Foundation of China under Grant\n61273023, Grant 91420201, and Grant 61332007, in part by the Natural Science Foundation of\nBeijing under Grant 4132046.\n\nReferences\n[1] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation\n\nwith deep convolutional nets and fully connected crfs. In ICLR, 2015.\n\n[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image\n\ndatabase. In CVPR, pages 248\u2013255, 2009.\n\n[3] D. Eigen and R. Fergus. Nonparametric image parsing using adaptive neighbor sets. In CVPR, pages\n\n2799\u20132806, 2012.\n\n[4] D. Eigen, J. Rolfe, R. Fergus, and Y. LeCun. Understanding deep architectures using a recursive convo-\n\nlutional network. 
In ICLR, 2014.\n\n8\n\n\f[5] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE\n\nTransactions on Pattern Analysis and Machine Intelligence (PAMI), 35(8):1915\u20131929, 2013.\n\n[6] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent\n\nregions. In ICCV, pages 1\u20138, 2009.\n\n[7] D. Grangier, L. Bottou, and R. Collobert. Deep convolutional networks for scene parsing. In ICML Deep\n\nLearning Workshop, volume 3, 2009.\n\n[8] A. Graves, M. Liwicki, S. Fern\u00b4andez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist\nsystem for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine\nIntelligence (PAMI), 31(5):855\u2013868, 2009.\n\n[9] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In\n\nICASSP, pages 6645\u20136649, 2013.\n\n[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell.\nCaffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International\nConference on Multimedia, pages 675\u2013678, 2014.\n\n[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classi\ufb01cation with deep convolutional neural\n\nnetworks. In NIPS, pages 1097\u20131105, 2012.\n\n[12] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backprop-\n\nagation applied to handwritten zip code recognition. Neural Computation, 1(4):541\u2013551, 1989.\n\n[13] V. Lempitsky, A. Vedaldi, and A. Zisserman. A pylon model for semantic segmentation. In NIPS, pages\n\n1485\u20131493, 2011.\n\n[14] M. Liang and X. Hu. Recurrent convolutional neural network for object recognition. In CVPR, pages\n\n3367\u20133375, 2015.\n\n[15] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(12):2368\u20132382, 2011.\n\n[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.\n\n[17] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.\n\n[18] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, pages 891\u2013898, 2014.\n\n[19] P. H. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene parsing. In ICML, 2014.\n\n[20] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.\n\n[21] A. Sharma, O. Tuzel, and M.-Y. Liu. Recursive context propagation network for semantic scene labeling. In NIPS, pages 2447\u20132455, 2014.\n\n[22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.\n\n[23] G. Singh and J. Kosecka. Nonparametric scene parsing with adaptive feature relevance and semantic context. In CVPR, pages 3151\u20133157, 2013.\n\n[24] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In ICML, pages 129\u2013136, 2011.\n\n[25] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104\u20133112, 2014.\n\n[26] J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In CVPR, pages 3001\u20133008, 2013.\n\n[27] J. Tighe and S. Lazebnik. Superparsing: Scalable nonparametric image parsing with superpixels. International Journal of Computer Vision (IJCV), 101(2):329\u2013349, 2013.\n\n[28] P. J. Werbos. 
Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550\u20131560, 1990.\n\n[29] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.", "award": [], "sourceid": 604, "authors": [{"given_name": "Ming", "family_name": "Liang", "institution": "Tsinghua University"}, {"given_name": "Xiaolin", "family_name": "Hu", "institution": "Tsinghua University"}, {"given_name": "Bo", "family_name": "Zhang", "institution": "Tsinghua University"}]}