{"title": "R-FCN: Object Detection via Region-based Fully Convolutional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 379, "page_last": 387, "abstract": "We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets), for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20 times faster than the Faster R-CNN counterpart. Code is made publicly available at: https://github.com/daijifeng001/r-fcn.", "full_text": "R-FCN: Object Detection via\n\nRegion-based Fully Convolutional Networks\n\nJifeng Dai\n\nMicrosoft Research\n\nYi Li\u2217\n\nTsinghua University\n\nKaiming He\n\nMicrosoft Research\n\nJian Sun\n\nMicrosoft Research\n\nAbstract\n\nWe present region-based, fully convolutional networks for accurate and ef\ufb01cient\nobject detection. In contrast to previous region-based detectors such as Fast/Faster\nR-CNN [7, 19] that apply a costly per-region subnetwork hundreds of times, our\nregion-based detector is fully convolutional with almost all computation shared on\nthe entire image. To achieve this goal, we propose position-sensitive score maps\nto address a dilemma between translation-invariance in image classi\ufb01cation and\ntranslation-variance in object detection. 
Our method can thus naturally adopt fully\nconvolutional image classi\ufb01er backbones, such as the latest Residual Networks\n(ResNets) [10], for object detection. We show competitive results on the PASCAL\nVOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet.\nMeanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20\u00d7\nfaster than the Faster R-CNN counterpart. Code is made publicly available at:\nhttps://github.com/daijifeng001/r-fcn.\n\n1\n\nIntroduction\n\nA prevalent family [9, 7, 19] of deep networks for object detection can be divided into two subnetworks\nby the Region-of-Interest (RoI) pooling layer [7]: (i) a shared, \u201cfully convolutional\u201d subnetwork\nindependent of RoIs, and (ii) an RoI-wise subnetwork that does not share computation. This\ndecomposition [9] was historically resulted from the pioneering classi\ufb01cation architectures, such\nas AlexNet [11] and VGG Nets [24], that consist of two subnetworks by design \u2014 a convolutional\nsubnetwork ending with a spatial pooling layer, followed by several fully-connected (fc) layers. Thus\nthe (last) spatial pooling layer in image classi\ufb01cation networks is naturally turned into the RoI pooling\nlayer in object detection networks [9, 7, 19].\nBut recent state-of-the-art image classi\ufb01cation networks such as Residual Nets (ResNets) [10] and\nGoogLeNets [25, 27] are by design fully convolutional2. By analogy, it appears natural to use\nall convolutional layers to construct the shared, convolutional subnetwork in the object detection\narchitecture, leaving the RoI-wise subnetwork no hidden layer. However, as empirically investigated\nin this work, this na\u00efve solution turns out to have considerably inferior detection accuracy that does\nnot match the network\u2019s superior classi\ufb01cation accuracy. 
To remedy this issue, in the ResNet paper\n[10] the RoI pooling layer of the Faster R-CNN detector [19] is unnaturally inserted between two\nsets of convolutional layers \u2014 this creates a deeper RoI-wise subnetwork that improves accuracy, at\nthe cost of lower speed due to the unshared per-RoI computation.\n\n\u2217 This work was done when Yi Li was an intern at Microsoft Research.\n2 Only the last layer is fully-connected, which is removed and replaced when fine-tuning for object detection.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Key idea of R-FCN for object detection. In this illustration, there are k \u00d7 k = 3 \u00d7 3\nposition-sensitive score maps generated by a fully convolutional network. For each of the k \u00d7 k bins\nin an RoI, pooling is only performed on one of the $k^2$ maps (marked by different colors).\n\nTable 1: Methodologies of region-based detectors using ResNet-101 [10].\n\n| | R-CNN [8] | Faster R-CNN [20, 10] | R-FCN [ours] |\n| depth of shared convolutional subnetwork | 0 | 91 | 101 |\n| depth of RoI-wise subnetwork | 101 | 10 | 0 |\n\nWe argue that the aforementioned unnatural design is caused by a dilemma of increasing translation\ninvariance for image classification vs. respecting translation variance for object detection. On one\nhand, the image-level classification task favors translation invariance \u2014 shift of an object inside an\nimage should be indiscriminative. Thus, deep (fully) convolutional architectures that are as translation-invariant\nas possible are preferable as evidenced by the leading results on ImageNet classification\n[10, 25, 27]. On the other hand, the object detection task needs localization representations that are\ntranslation-variant to an extent. For example, translation of an object inside a candidate box should\nproduce meaningful responses for describing how well the candidate box overlaps the object. 
We\nhypothesize that deeper convolutional layers in an image classi\ufb01cation network are less sensitive\nto translation. To address this dilemma, the ResNet paper\u2019s detection pipeline [10] inserts the RoI\npooling layer into convolutions \u2014 this region-speci\ufb01c operation breaks down translation invariance,\nand the post-RoI convolutional layers are no longer translation-invariant when evaluated across\ndifferent regions. However, this design sacri\ufb01ces training and testing ef\ufb01ciency since it introduces a\nconsiderable number of region-wise layers (Table 1).\nIn this paper, we develop a framework called Region-based Fully Convolutional Network (R-FCN)\nfor object detection. Our network consists of shared, fully convolutional architectures as is the case of\nFCN [16]. To incorporate translation variance into FCN, we construct a set of position-sensitive score\nmaps by using a bank of specialized convolutional layers as the FCN output. Each of these score\nmaps encodes the position information with respect to a relative spatial position (e.g., \u201cto the left of\nan object\u201d). On top of this FCN, we append a position-sensitive RoI pooling layer that shepherds\ninformation from these score maps, with no weight (convolutional/fc) layers following. The entire\narchitecture is learned end-to-end. All learnable layers are convolutional and shared on the entire\nimage, yet encode spatial information required for object detection. Figure 1 illustrates the key idea\nand Table 1 compares the methodologies among region-based detectors.\nUsing the 101-layer Residual Net (ResNet-101) [10] as the backbone, our R-FCN yields competitive\nresults of 83.6% mAP on the PASCAL VOC 2007 set and 82.0% the 2012 set. Meanwhile, our results\nare achieved at a test-time speed of 170ms per image using ResNet-101, which is 2.5\u00d7 to 20\u00d7 faster\nthan the Faster R-CNN + ResNet-101 counterpart in [10]. 
These experiments demonstrate that our\nmethod manages to address the dilemma between invariance/variance on translation, and fully convolutional\nimage-level classifiers such as ResNets can be effectively converted to fully convolutional object\ndetectors. Code is made publicly available at: https://github.com/daijifeng001/r-fcn.\n\n2 Our approach\n\nOverview. Following R-CNN [8], we adopt the popular two-stage object detection strategy [8, 9, 6,\n7, 19, 1, 23] that consists of: (i) region proposal, and (ii) region classification. Although methods that\ndo not rely on region proposal do exist (e.g., [18, 15]), region-based systems still possess leading\naccuracy on several benchmarks [5, 14, 21]. We extract candidate regions by the Region Proposal\nNetwork (RPN) [19], which is a fully convolutional architecture in itself. Following [19], we share\nthe features between RPN and R-FCN. Figure 2 shows an overview of the system.\n\nFigure 2: Overall architecture of R-FCN. A Region Proposal Network (RPN) [19] proposes candidate\nRoIs, which are then applied on the score maps. All learnable weight layers are convolutional and are\ncomputed on the entire image; the per-RoI computational cost is negligible.\n\nGiven the proposal regions (RoIs), the R-FCN architecture is designed to classify the RoIs into object\ncategories and background. In R-FCN, all learnable weight layers are convolutional and are computed\non the entire image. The last convolutional layer produces a bank of $k^2$ position-sensitive score\nmaps for each category, and thus has a $k^2(C + 1)$-channel output layer with C object categories (+1\nfor background). The bank of $k^2$ score maps corresponds to a k \u00d7 k spatial grid describing relative\npositions. 
For example, with k \u00d7 k = 3 \u00d7 3, the 9 score maps encode the cases of {top-left, top-center,\ntop-right, ..., bottom-right} of an object category.\nR-FCN ends with a position-sensitive RoI pooling layer. This layer aggregates the outputs of the\nlast convolutional layer and generates scores for each RoI. Unlike [9, 7], our position-sensitive RoI\nlayer conducts selective pooling, and each of the k \u00d7 k bins aggregates responses from only one score\nmap out of the bank of k \u00d7 k score maps. With end-to-end training, this RoI layer shepherds the last\nconvolutional layer to learn specialized position-sensitive score maps. Figure 1 illustrates this idea.\nFigures 3 and 4 visualize an example. The details are introduced as follows.\n\nBackbone architecture. The incarnation of R-FCN in this paper is based on ResNet-101 [10],\nthough other networks [11, 24] are applicable. ResNet-101 has 100 convolutional layers followed by\nglobal average pooling and a 1000-class fc layer. We remove the average pooling layer and the fc\nlayer and only use the convolutional layers to compute feature maps. We use the ResNet-101 released\nby the authors of [10], pre-trained on ImageNet [21]. The last convolutional block in ResNet-101 is\n2048-d, and we attach a randomly initialized 1024-d 1\u00d71 convolutional layer for reducing dimension\n(to be precise, this increases the depth in Table 1 by 1). Then we apply the $k^2(C + 1)$-channel\nconvolutional layer to generate score maps, as introduced next.\n\nPosition-sensitive score maps & Position-sensitive RoI pooling. To explicitly encode position\ninformation into each RoI, we divide each RoI rectangle into k \u00d7 k bins by a regular grid. For an RoI\nrectangle of a size w \u00d7 h, a bin is of a size $\\approx \\frac{w}{k} \\times \\frac{h}{k}$ [9, 7]. In our method, the last convolutional layer\nis constructed to produce $k^2$ score maps for each category. 
Inside the (i, j)-th bin (0 \u2264 i, j \u2264 k \u2212 1),\nwe define a position-sensitive RoI pooling operation that pools only over the (i, j)-th score map:\n\n$$r_c(i, j \\mid \\Theta) = \\sum_{(x,y) \\in \\text{bin}(i,j)} z_{i,j,c}(x + x_0, y + y_0 \\mid \\Theta) / n. \\qquad (1)$$\n\nHere $r_c(i, j)$ is the pooled response in the (i, j)-th bin for the c-th category, $z_{i,j,c}$ is one score map\nout of the $k^2(C + 1)$ score maps, $(x_0, y_0)$ denotes the top-left corner of an RoI, n is the number\nof pixels in the bin, and \u0398 denotes all learnable parameters of the network. The (i, j)-th bin spans\n$\\lfloor i \\frac{w}{k} \\rfloor \\le x < \\lceil (i + 1) \\frac{w}{k} \\rceil$ and $\\lfloor j \\frac{h}{k} \\rfloor \\le y < \\lceil (j + 1) \\frac{h}{k} \\rceil$. The operation of Eqn.(1) is illustrated in\nFigure 1, where a color represents a pair of (i, j). Eqn.(1) performs average pooling (as we use\nthroughout this paper), but max pooling can be conducted as well.\n\nThe $k^2$ position-sensitive scores then vote on the RoI. In this paper we simply vote by averaging the\nscores, producing a (C + 1)-dimensional vector for each RoI: $r_c(\\Theta) = \\sum_{i,j} r_c(i, j \\mid \\Theta)$. Then we\ncompute the softmax responses across categories: $s_c(\\Theta) = e^{r_c(\\Theta)} / \\sum_{c'=0}^{C} e^{r_{c'}(\\Theta)}$. They are used for\nevaluating the cross-entropy loss during training and for ranking the RoIs during inference.\nWe further address bounding box regression [8, 7] in a similar way. Aside from the above $k^2(C + 1)$-d\nconvolutional layer, we append a sibling $4k^2$-d convolutional layer for bounding box regression. The\nposition-sensitive RoI pooling is performed on this bank of $4k^2$ maps, producing a $4k^2$-d vector for\neach RoI. Then it is aggregated into a 4-d vector by average voting. This 4-d vector parameterizes a\nbounding box as $t = (t_x, t_y, t_w, t_h)$ following the parameterization in [7]. 
We note that we perform\nclass-agnostic bounding box regression for simplicity, but the class-specific counterpart (i.e., with a\n$4k^2C$-d output layer) is applicable.\nThe concept of position-sensitive score maps is partially inspired by [3] that develops FCNs for\ninstance-level semantic segmentation. We further introduce the position-sensitive RoI pooling layer\nthat shepherds learning of the score maps for object detection. There is no learnable layer after\nthe RoI layer, enabling nearly cost-free region-wise computation and speeding up both training and\ninference.\n\nTraining. With pre-computed region proposals, it is easy to end-to-end train the R-FCN architecture.\nFollowing [7], our loss function defined on each RoI is the summation of the cross-entropy loss and\nthe box regression loss: $L(s, t_{x,y,w,h}) = L_{cls}(s_{c^*}) + \\lambda [c^* > 0] L_{reg}(t, t^*)$. Here $c^*$ is the RoI\u2019s\nground-truth label ($c^* = 0$ means background). $L_{cls}(s_{c^*}) = -\\log(s_{c^*})$ is the cross-entropy loss\nfor classification, $L_{reg}$ is the bounding box regression loss as defined in [7], and $t^*$ represents the\nground truth box. $[c^* > 0]$ is an indicator which equals 1 if the argument is true and 0 otherwise.\nWe set the balance weight \u03bb = 1 as in [7]. We define positive examples as the RoIs that have\nintersection-over-union (IoU) overlap with a ground-truth box of at least 0.5, and negative otherwise.\nIt is easy for our method to adopt online hard example mining (OHEM) [23] during training. Our\nnegligible per-RoI computation enables nearly cost-free example mining. Assuming N proposals per\nimage, in the forward pass, we evaluate the loss of all N proposals. Then we sort all RoIs (positive\nand negative) by loss and select B RoIs that have the highest loss. Backpropagation [12] is performed\nbased on the selected examples. 
Because our per-RoI computation is negligible, the forward time is\nnearly not affected by N, in contrast to OHEM Fast R-CNN in [23] that may double training time.\nWe provide comprehensive timing statistics in Table 3 in the next section.\nWe use a weight decay of 0.0005 and a momentum of 0.9. By default we use single-scale training:\nimages are resized such that the scale (shorter side of image) is 600 pixels [7, 19]. Each GPU holds 1\nimage and selects B = 128 RoIs for backprop. We train the model with 8 GPUs (so the effective\nmini-batch size is 8\u00d7). We \ufb01ne-tune R-FCN using a learning rate of 0.001 for 20k mini-batches and\n0.0001 for 10k mini-batches on VOC. To have R-FCN share features with RPN (Figure 2), we adopt\nthe 4-step alternating training3 in [19], alternating between training RPN and training R-FCN.\n\nInference. As illustrated in Figure 2, the feature maps shared between RPN and R-FCN are computed\n(on an image with a single scale of 600). Then the RPN part proposes RoIs, on which the R-FCN\npart evaluates category-wise scores and regresses bounding boxes. During inference we evaluate 300\nRoIs as in [19] for fair comparisons. The results are post-processed by non-maximum suppression\n(NMS) using a threshold of 0.3 IoU [8], as standard practice.\n\n\u00c0 trous and stride. Our fully convolutional architecture enjoys the bene\ufb01ts of the network modi-\n\ufb01cations that are widely used by FCNs for semantic segmentation [16, 2]. Particularly, we reduce\nResNet-101\u2019s effective stride from 32 pixels to 16 pixels, increasing the score map resolution. All\nlayers before and on the conv4 stage [10] (stride=16) are unchanged; the stride=2 operations in the\n\ufb01rst conv5 block is modi\ufb01ed to have stride=1, and all convolutional \ufb01lters on the conv5 stage are\nmodi\ufb01ed by the \u201chole algorithm\u201d [16, 2] (\u201cAlgorithme \u00e0 trous\u201d [17]) to compensate for the reduced\nstride. 
For fair comparisons, the RPN is computed on top of the conv4 stage (which is shared with\nR-FCN), as is the case in [10] with Faster R-CNN, so the RPN is not affected by the \u00e0 trous trick.\nThe following table shows the ablation results of R-FCN (k \u00d7 k = 7 \u00d7 7, no hard example mining).\nThe \u00e0 trous trick improves mAP by 2.6 points.\n\n| R-FCN with ResNet-101 on: | conv4, stride=16 | conv5, stride=32 | conv5, \u00e0 trous, stride=16 |\n| mAP (%) on VOC 07 test | 72.5 | 74.0 | 76.6 |\n\n3 Although joint training [19] is applicable, it is not straightforward to perform example mining jointly.\n\nFigure 3: Visualization of R-FCN (k \u00d7 k = 3 \u00d7 3) for the person category.\n\nFigure 4: Visualization when an RoI does not correctly overlap the object.\n\nVisualization. In Figures 3 and 4 we visualize the position-sensitive score maps learned by R-FCN\nwhen k \u00d7 k = 3 \u00d7 3. These specialized maps are expected to be strongly activated at a specific\nrelative position of an object. For example, the \u201ctop-center-sensitive\u201d score map exhibits high scores\nroughly near the top-center position of an object. If a candidate box precisely overlaps with a true\nobject (Figure 3), most of the $k^2$ bins in the RoI are strongly activated, and their voting leads to a high\nscore. On the contrary, if a candidate box does not correctly overlap with a true object (Figure 4),\nsome of the $k^2$ bins in the RoI are not activated, and the voting score is low.\n\n3 Related Work\n\nR-CNN [8] has demonstrated the effectiveness of using region proposals [28, 29] with deep networks.\nR-CNN evaluates convolutional networks on cropped and warped regions, and computation is not\nshared among regions (Table 1). 
SPPnet [9], Fast R-CNN [7], and Faster R-CNN [19] are \u201csemi-convolutional\u201d,\nin which a convolutional subnetwork performs shared computation on the entire\nimage and another subnetwork evaluates individual regions.\nThere have been object detectors that can be thought of as \u201cfully convolutional\u201d models. OverFeat [22]\ndetects objects by sliding multi-scale windows on the shared convolutional feature maps; similarly, in\nFast R-CNN [7] and [13], sliding windows that replace region proposals are investigated. In these\ncases, one can recast a sliding window of a single scale as a single convolutional layer. The RPN\ncomponent in Faster R-CNN [19] is a fully convolutional detector that predicts bounding boxes with\nrespect to reference boxes (anchors) of multiple sizes. The original RPN is class-agnostic in [19], but\nits class-specific counterpart is applicable (see also [15]) as we evaluate in the following.\n\nTable 2: Comparisons among fully convolutional (or \u201calmost\u201d fully convolutional) strategies using\nResNet-101. All competitors in this table use the \u00e0 trous trick. Hard example mining is not conducted.\n\n| method | RoI output size (k \u00d7 k) | mAP on VOC 07 (%) |\n| na\u00efve Faster R-CNN | 1 \u00d7 1 | 61.7 |\n| na\u00efve Faster R-CNN | 7 \u00d7 7 | 68.9 |\n| class-specific RPN | - | 67.6 |\n| R-FCN (w/o position-sensitivity) | 1 \u00d7 1 | fail |\n| R-FCN | 3 \u00d7 3 | 75.5 |\n| R-FCN | 7 \u00d7 7 | 76.6 |\n\nAnother family of object detectors resorts to fully-connected (fc) layers for generating holistic object\ndetection results on an entire image, such as [26, 4, 18].\n\n4 Experiments\n\n4.1 Experiments on PASCAL VOC\n\nWe perform experiments on PASCAL VOC [5] that has 20 object categories. 
We train the models on\nthe union set of VOC 2007 trainval and VOC 2012 trainval (\u201c07+12\u201d) following [7], and evaluate on\nVOC 2007 test set. Object detection accuracy is measured by mean Average Precision (mAP).\n\nComparisons with Other Fully Convolutional Strategies\nThough fully convolutional detectors are available, experiments show that it is nontrivial for them to\nachieve good accuracy. We investigate the following fully convolutional strategies (or \u201calmost\u201d fully\nconvolutional strategies that have only one classi\ufb01er fc layer per RoI), using ResNet-101:\nNa\u00efve Faster R-CNN. As discussed in the introduction, one may use all convolutional layers in\nResNet-101 to compute the shared feature maps, and adopt RoI pooling after the last convolutional\nlayer (after conv5). An inexpensive 21-class fc layer is evaluated on each RoI (so this variant is\n\u201calmost\u201d fully convolutional). The \u00e0 trous trick is used for fair comparisons.\nClass-speci\ufb01c RPN. This RPN is trained following [19], except that the 2-class (object or not)\nconvolutional classi\ufb01er layer is replaced with a 21-class convolutional classi\ufb01er layer. For fair\ncomparisons, for this class-speci\ufb01c RPN we use ResNet-101\u2019s conv5 layers with the \u00e0 trous trick.\nR-FCN without position-sensitivity. By setting k = 1 we remove the position-sensitivity of the\nR-FCN. This is equivalent to global pooling within each RoI.\n\nAnalysis. Table 2 shows the results. We note that the standard (not na\u00efve) Faster R-CNN in the ResNet\npaper [10] achieves 76.4% mAP with ResNet-101 (see also Table 3), which inserts the RoI pooling\nlayer between conv4 and conv5 [10]. As a comparison, the na\u00efve Faster R-CNN (that applies RoI\npooling after conv5) has a drastically lower mAP of 68.9% (Table 2). 
This comparison empirically\njusti\ufb01es the importance of respecting spatial information by inserting RoI pooling between layers for\nthe Faster R-CNN system. Similar observations are reported in [20].\nThe class-speci\ufb01c RPN has an mAP of 67.6% (Table 2), about 9 points lower than the standard\nFaster R-CNN\u2019s 76.4%. This comparison is in line with the observations in [7, 13] \u2014 in fact, the\nclass-speci\ufb01c RPN is similar to a special form of Fast R-CNN [7] that uses dense sliding windows as\nproposals, which shows inferior results as reported in [7, 13].\nOn the other hand, our R-FCN system has signi\ufb01cantly better accuracy (Table 2). Its mAP (76.6%) is\non par with the standard Faster R-CNN\u2019s (76.4%, Table 3). These results indicate that our position-\nsensitive strategy manages to encode useful spatial information for locating objects, without using\nany learnable layer after RoI pooling.\nThe importance of position-sensitivity is further demonstrated by setting k = 1, for which R-FCN is\nunable to converge. In this degraded case, no spatial information can be explicitly captured within\nan RoI. Moreover, we report that na\u00efve Faster R-CNN is able to converge if its RoI pooling output\nresolution is 1 \u00d7 1, but the mAP further drops by a large margin to 61.7% (Table 2).\n\n6\n\n\fTable 3: Comparisons between Faster R-CNN and R-FCN using ResNet-101. Timing is evaluated on\na single Nvidia K40 GPU. With OHEM, N RoIs per image are computed in the forward pass, and\n128 samples are selected for backpropagation. 
300 RoIs are used for testing following [19].\n\n| method | depth of per-RoI subnetwork | training w/ OHEM? | train time (sec/img) | test time (sec/img) | mAP (%) on VOC07 |\n| Faster R-CNN | 10 | | 1.2 | 0.42 | 76.4 |\n| R-FCN | 0 | | 0.45 | 0.17 | 76.6 |\n| Faster R-CNN | 10 | \u2713 (300 RoIs) | 1.5 | 0.42 | 79.3 |\n| R-FCN | 0 | \u2713 (300 RoIs) | 0.45 | 0.17 | 79.5 |\n| Faster R-CNN | 10 | \u2713 (2000 RoIs) | 2.9 | 0.42 | N/A |\n| R-FCN | 0 | \u2713 (2000 RoIs) | 0.46 | 0.17 | 79.3 |\n\nTable 4: Comparisons on PASCAL VOC 2007 test set using ResNet-101. \u201cFaster R-CNN +++\u201d [10]\nuses iterative box regression, context, and multi-scale testing.\n\n| method | training data | mAP (%) | test time (sec/img) |\n| Faster R-CNN [10] | 07+12 | 76.4 | 0.42 |\n| Faster R-CNN +++ [10] | 07+12+COCO | 85.6 | 3.36 |\n| R-FCN | 07+12 | 79.5 | 0.17 |\n| R-FCN multi-sc train | 07+12 | 80.5 | 0.17 |\n| R-FCN multi-sc train | 07+12+COCO | 83.6 | 0.17 |\n\nTable 5: Comparisons on PASCAL VOC 2012 test set using ResNet-101. \u201c07++12\u201d [7] denotes the\nunion set of 07 trainval+test and 12 trainval. \u2020: http://host.robots.ox.ac.uk:8080/anonymous/44L5HI.html \u2021:\nhttp://host.robots.ox.ac.uk:8080/anonymous/MVCM2L.html\n\n| method | training data | mAP (%) | test time (sec/img) |\n| Faster R-CNN [10] | 07++12 | 73.8 | 0.42 |\n| Faster R-CNN +++ [10] | 07++12+COCO | 83.8 | 3.36 |\n| R-FCN multi-sc train | 07++12 | 77.6\u2020 | 0.17 |\n| R-FCN multi-sc train | 07++12+COCO | 82.0\u2021 | 0.17 |\n\nComparisons with Faster R-CNN Using ResNet-101\nNext we compare with standard \u201cFaster R-CNN + ResNet-101\u201d [10] which is the strongest competitor\nand the top-performer on the PASCAL VOC, MS COCO, and ImageNet benchmarks. We use\nk \u00d7 k = 7 \u00d7 7 in the following. Table 3 shows the comparisons. Faster R-CNN evaluates a 10-layer\nsubnetwork for each region to achieve good accuracy, but R-FCN has negligible per-region cost. 
With\n300 RoIs at test time, Faster R-CNN takes 0.42s per image, 2.5\u00d7 slower than our R-FCN that takes\n0.17s per image (on a K40 GPU; this number is 0.11s on a Titan X GPU). R-FCN also trains faster\nthan Faster R-CNN. Moreover, hard example mining [23] adds no cost to R-FCN training (Table 3).\nIt is feasible to train R-FCN when mining from 2000 RoIs, in which case Faster R-CNN is 6\u00d7 slower\n(2.9s vs. 0.46s). But experiments show that mining from a larger set of candidates (e.g., 2000) has no\nbenefit (Table 3). So we use 300 RoIs for both training and inference in other parts of this paper.\nTable 4 shows more comparisons. Following the multi-scale training in [9], we resize the image in\neach training iteration such that the scale is randomly sampled from {400,500,600,700,800} pixels. We\nstill test at a single scale of 600 pixels, so this adds no test-time cost. The mAP is 80.5%. In addition, we\ntrain our model on the MS COCO [14] trainval set and then fine-tune it on the PASCAL VOC set.\nR-FCN achieves 83.6% mAP (Table 4), close to the \u201cFaster R-CNN +++\u201d system in [10] that uses\nResNet-101 as well. We note that our competitive result is obtained at a test speed of 0.17 seconds per\nimage, 20\u00d7 faster than Faster R-CNN +++ that takes 3.36 seconds as it further incorporates iterative\nbox regression, context, and multi-scale testing [10]. These comparisons are also observed on the\nPASCAL VOC 2012 test set (Table 5).\n\nOn the Impact of Depth\nThe following table shows the R-FCN results using ResNets of different depth [10], as well as the\nVGG-16 model [24]. For the VGG-16 model, the fc layers (fc6, fc7) are turned into sliding convolutional\nlayers, and a 1 \u00d7 1 convolutional layer is applied on top to generate the position-sensitive score\nmaps. R-FCN with VGG-16 achieves slightly lower accuracy than R-FCN with ResNet-50. 
Our detection accuracy\nincreases when the depth is increased from 50 to 101 in ResNet, but gets saturated with a depth of\n152.\n\n| method | training data | test data | VGG-16 | ResNet-50 | ResNet-101 | ResNet-152 |\n| R-FCN | 07+12 | 07 | 75.6 | 77.0 | 79.5 | 79.6 |\n| R-FCN multi-sc train | 07+12 | 07 | 76.5 | 78.7 | 80.5 | 80.4 |\n\nOn the Impact of Region Proposals\nR-FCN can be easily applied with other region proposal methods, such as Selective Search (SS) [28]\nand Edge Boxes (EB) [29]. The following table shows the results (using ResNet-101) with different\nproposals. R-FCN performs competitively using SS or EB, showing the generality of our method.\n\n| method | training data | test data | RPN [19] | SS [28] | EB [29] |\n| R-FCN | 07+12 | 07 | 79.5 | 77.2 | 77.8 |\n\n4.2 Experiments on MS COCO\n\nNext we evaluate on the MS COCO dataset [14] that has 80 object categories. Our experiments\ninvolve the 80k train set, 40k val set, and 20k test-dev set. We set the learning rate as 0.001 for 90k\niterations and 0.0001 for the next 30k iterations, with an effective mini-batch size of 8. We extend the\nalternating training [19] from 4-step to 5-step (i.e., stopping after one more RPN training step), which\nslightly improves accuracy on this dataset when the features are shared; we also report that 2-step\ntraining is sufficient to achieve comparably good accuracy but the features are not shared.\nThe results are in Table 6. Our single-scale trained R-FCN baseline has a val result of 48.9%/27.6%.\nThis is comparable to the Faster R-CNN baseline (48.4%/27.2%), but ours is 2.5\u00d7 faster at test time.\nIt is noteworthy that our method performs better on objects of small sizes (defined by [14]). Our\nmulti-scale trained (yet single-scale tested) R-FCN has a result of 49.1%/27.8% on the val set and\n51.5%/29.2% on the test-dev set. 
Considering COCO\u2019s wide range of object scales, we further\nevaluate a multi-scale testing variant following [10], and use testing scales of {200,400,600,800,1000}.\nThe mAP is 53.2%/31.5%. This result is close to the 1st-place result (Faster R-CNN +++ with\nResNet-101, 55.7%/34.9%) in the MS COCO 2015 competition. Nevertheless, our method is simpler\nand adds no bells and whistles such as context or iterative box regression that were used by [10], and\nis faster for both training and testing.\n\nTable 6: Comparisons on MS COCO dataset using ResNet-101. The COCO-style AP is evaluated @\nIoU \u2208 [0.5, 0.95]. AP@0.5 is the PASCAL-style AP evaluated @ IoU = 0.5.\n\n| method | training data | test data | AP@0.5 | AP | AP small | AP medium | AP large | test time (sec/img) |\n| Faster R-CNN [10] | train | val | 48.4 | 27.2 | 6.6 | 28.6 | 45.0 | 0.42 |\n| R-FCN | train | val | 48.9 | 27.6 | 8.9 | 30.5 | 42.0 | 0.17 |\n| R-FCN multi-sc train | train | val | 49.1 | 27.8 | 8.8 | 30.8 | 42.2 | 0.17 |\n| Faster R-CNN +++ [10] | trainval | test-dev | 55.7 | 34.9 | 15.6 | 38.7 | 50.9 | 3.36 |\n| R-FCN | trainval | test-dev | 51.5 | 29.2 | 10.3 | 32.4 | 43.3 | 0.17 |\n| R-FCN multi-sc train | trainval | test-dev | 51.9 | 29.9 | 10.8 | 32.8 | 45.0 | 0.17 |\n| R-FCN multi-sc train, test | trainval | test-dev | 53.2 | 31.5 | 14.3 | 35.5 | 44.2 | 1.00 |\n\n5 Conclusion and Future Work\nWe presented Region-based Fully Convolutional Networks, a simple but accurate and efficient\nframework for object detection. Our system naturally adopts the state-of-the-art image classification\nbackbones, such as ResNets, that are by design fully convolutional. Our method achieves accuracy\ncompetitive with the Faster R-CNN counterpart, but is much faster during both training and inference.\nWe intentionally keep the R-FCN system presented in the paper simple. 
There have been a series\nof orthogonal extensions of FCNs that were developed for semantic segmentation (e.g., see [2]), as\nwell as extensions of region-based methods for object detection (e.g., see [10, 1, 23]). We expect our\nsystem will easily enjoy the bene\ufb01ts of the progress in the \ufb01eld.\n\n8\n\n\fReferences\n[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip\n\npooling and recurrent neural networks. In CVPR, 2016.\n\n[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with\n\ndeep convolutional nets and fully connected crfs. In ICLR, 2015.\n\n[3] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. arXiv:1603.08678,\n\n2016.\n\n[4] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks.\n\nIn CVPR, 2014.\n\n[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object\n\nClasses (VOC) Challenge. IJCV, 2010.\n\n[6] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware cnn\n\nmodel. In ICCV, 2015.\n\n[7] R. Girshick. Fast R-CNN. In ICCV, 2015.\n[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection\n\nand semantic segmentation. In CVPR, 2014.\n\n[9] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual\n\nrecognition. In ECCV. 2014.\n\n[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.\n[11] A. Krizhevsky, I. Sutskever, and G. Hinton.\n\nImagenet classi\ufb01cation with deep convolutional neural\n\nnetworks. In NIPS, 2012.\n\n[12] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropa-\n\ngation applied to handwritten zip code recognition. Neural computation, 1989.\n\n[13] K. 
Lenc and A. Vedaldi. R-CNN minus R. In BMVC, 2015.\n[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.\n[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. arXiv:1512.02325v2, 2015.\n[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.\n[17] S. Mallat. A wavelet tour of signal processing. Academic press, 1999.\n[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.\n[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.\n[20] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. arXiv:1504.06066, 2015.\n[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.\n[22] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.\n[23] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.\n[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.\n[26] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, 2013.\n[27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 
Rethinking the inception architecture for computer vision. In CVPR, 2016.\n[28] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.\n[29] C. L. Zitnick and P. Doll\u00e1r. Edge boxes: Locating object proposals from edges. In ECCV, 2014.\n", "award": [], "sourceid": 235, "authors": [{"given_name": "Jifeng", "family_name": "Dai", "institution": "Microsoft"}, {"given_name": "Yi", "family_name": "Li", "institution": "Tsinghua University"}, {"given_name": "Kaiming", "family_name": "He", "institution": "Microsoft"}, {"given_name": "Jian", "family_name": "Sun", "institution": "Microsoft"}]}