{"title": "Cascade RPN: Delving into High-Quality Region Proposal Network with Adaptive Convolution", "book": "Advances in Neural Information Processing Systems", "page_first": 1432, "page_last": 1442, "abstract": "This paper considers an architecture referred to as Cascade Region Proposal Network (Cascade RPN) for improving the region-proposal quality and detection performance by systematically addressing the limitation of the conventional RPN that heuristically defines the anchors and aligns the features to the anchors. First, instead of using multiple anchors with predefined scales and aspect ratios, Cascade RPN relies on a single anchor per location and performs multi-stage refinement. Each stage is progressively more stringent in defining positive samples by starting out with an anchor-free metric followed by anchor-based metrics in the ensuing stages. Second, to attain alignment between the features and the anchors throughout the stages, adaptive convolution is proposed that takes the anchors in addition to the image features as its input and learns the sampled features guided by the anchors. A simple implementation of a two-stage Cascade RPN achieves an AR 13.4 points higher than that of the conventional RPN, surpassing any existing region proposal methods. When adopted in Fast R-CNN and Faster R-CNN, Cascade RPN can improve the detection mAP by 3.1 and 3.5 points, respectively. The code will be made publicly available at https://github.com/thangvubk/Cascade-RPN.", "full_text": "Cascade RPN: Delving into High-Quality Region\n\nProposal Network with Adaptive Convolution\n\nThang Vu, Hyunjun Jang, Trung X. Pham, Chang D. 
Yoo\n\nDepartment of Electrical Engineering\n\nKorea Advanced Institute of Science and Technology\n\n{thangvubk,wiseholi,trungpx,cd_yoo}@kaist.ac.kr\n\nAbstract\n\nThis paper considers an architecture referred to as Cascade Region Proposal Net-\nwork (Cascade RPN) for improving the region-proposal quality and detection\nperformance by systematically addressing the limitation of the conventional RPN\nthat heuristically de\ufb01nes the anchors and aligns the features to the anchors. First,\ninstead of using multiple anchors with prede\ufb01ned scales and aspect ratios, Cascade\nRPN relies on a single anchor per location and performs multi-stage re\ufb01nement.\nEach stage is progressively more stringent in de\ufb01ning positive samples by starting\nout with an anchor-free metric followed by anchor-based metrics in the ensuing\nstages. Second, to attain alignment between the features and the anchors throughout\nthe stages, adaptive convolution is proposed that takes the anchors in addition to the\nimage features as its input and learns the sampled features guided by the anchors.\nA simple implementation of a two-stage Cascade RPN achieves AR 13.4 points\nhigher than that of the conventional RPN, surpassing any existing region proposal\nmethods. When adopted in Fast R-CNN and Faster R-CNN, Cascade RPN can\nimprove the detection mAP by 3.1 and 3.5 points, respectively. The code is made\npublicly available at https://github.com/thangvubk/Cascade-RPN.\n\n1\n\nIntroduction\n\nObject detection has received considerable attention in recent years for its applications in autonomous\ndriving [13, 17], robotics [3, 11] and surveillance [9, 23]. Given an image, object detectors aim to\ndetect known object instances, each of which is assigned to a bounding box and a class label. Recent\nhigh-performing object detectors, such as Faster R-CNN [34], formulate the detection problem as\na two-stage pipeline. 
At the \ufb01rst stage, a region proposal network (RPN) produces a sparse set of\nproposal boxes by re\ufb01ning and pruning a set of anchors, and at the second stage, a region-wise\nCNN detector (R-CNN) re\ufb01nes and classi\ufb01es the proposals produced by RPN. Compared to R-CNN,\nRPN has received relatively less attention for improving its performance. This paper will focus on\nimproving RPN by addressing its limitations that arise from heuristically de\ufb01ning the anchors and\nheuristically aligning the features to the anchors.\nAn anchor is de\ufb01ned by its scale and aspect ratio, and a set of anchors with different scales and aspect\nratios are required to obtain a suf\ufb01cient number of positive samples that have high overlap with the\ntarget objects. Setting appropriate scales and aspect ratios is important in achieving high detection\nperformance, and it requires a fair amount of tuning [25, 34].\nAn alignment rule is \u201cimplicitly\u201d de\ufb01ned to set up a correspondence between the image features\nand the reference boxes. The input features of RPN and R-CNN should be well-aligned with the\nbounding boxes that are to be regressed. The alignment is guaranteed in R-CNN by the RoIPool [34]\nor RoIAlign [18] layer. The alignment in RPN is heuristically guaranteed: the anchor boxes are\nuniformly initialized, leveraging the observation that the convolutional kernel of the RPN uniformly\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 1: Iterative RPN shows limitations in improving RPN performance. (a) The target regression\ndistribution to be learned at stages 1 and 2. The stage 2 distribution represents the error after the stage\n1 distribution is learned. (b) Iterative RPN fails in learning the stage-2 distribution as the average recall\n(AR) improvement is marginal compared to that of RPN. 
(c) In Iterative RPN, the anchor at stage 2,\nwhich is regressed in stage 1, breaks the alignment rule in detection.\n\nstrides over the feature maps. Such a heuristic introduces limitations for further improving detection\nperformance as described below.\nA number of studies have attempted to improve RPN by iterative re\ufb01nement [14, 41]. Henceforth,\nthis paper will refer to it as Iterative RPN. The motivation behind this idea is illustrated in Figure 1a.\nAnchor boxes which are references for regression are uniformly initialized, and the target ground\ntruth boxes are arbitrarily located. Thus, RPN needs to learn a regression distribution of high variance,\nas shown in Figure 1a. If this regression distribution is perfectly learned, the regression distribution\nat stage 2 should be close to a Dirac Delta distribution. However, such a high-variance distribution at\nstage 1 is dif\ufb01cult to learn, requiring stage 2 regression. Stage 2 distribution has a lower variance\ncompared to that of stage 1, and thus should be easier to learn but fails with Iterative RPN. The\nfailure is implied by the observation in which the performance improvement of Iterative RPN is\nnegligible compared to that of RPN, as shown in Figure 1b. It is explained intuitively in Figure 1c.\nHere, after stage 1, the anchor is regressed to be closer to the ground truth box; however, this breaks\nthe alignment rule in detection.\nThis paper proposes an architecture referred to as Cascade RPN to systematically address the\naforementioned problem arising from heuristically de\ufb01ning the anchors and aligning the features to\nthe anchors. First, instead of using multiple anchors with different scales and aspect ratios, Cascade\nRPN relies on a single anchor and incorporates both anchor-based and anchor-free criteria in de\ufb01ning\npositive boxes to achieve high performance. 
Second, to bene\ufb01t from multi-stage re\ufb01nement while\nmaintaining the alignment between anchor boxes and features, Cascade RPN relies on the proposed\nadaptive convolution that adapts to the re\ufb01ned anchors after each stage. Adaptive convolution serves\nas an extremely light-weight RoIAlign layer [18] to learn the features sampled within the anchors.\nCascade RPN is conceptually simple and easy to implement. Without bells and whistles, a simple\ntwo-stage Cascade RPN achieves a 13.4-point AR improvement over the RPN baseline on the\nCOCO dataset [26], surpassing any existing region proposal methods by a large margin. Cascade\nRPN can also be integrated into two-stage detectors to improve detection performance. In particular,\nintegrating Cascade RPN into Fast R-CNN and Faster R-CNN achieves 3.1 and 3.5 points of mAP\nimprovement, respectively.\n\n2 Related Work\n\nObject Detection. Object detection can be roughly categorized into two main streams: one-stage\nand two-stage detection. Here, one-stage detectors are proposed to enhance computational ef\ufb01ciency.\nExamples falling in this stream are SSD [27], YOLO [31, 32, 33], RetinaNet [25], and CornerNet\n[21]. Meanwhile, two-stage detectors aim to produce accurate bounding boxes, where the \ufb01rst stage\ngenerates region proposals followed by region-wise re\ufb01nement and classi\ufb01cation at the second stage,\ne.g., R-CNN [15], Fast R-CNN [16], Faster R-CNN [34], Cascade R-CNN [4], and HTC [7].\n\nRegion Proposals. Region proposals have become the de-facto paradigm for high-quality object\ndetectors [6, 19, 20]. 
Region proposals serve as the attention mechanism that enables the detector\nto produce accurate bounding boxes while maintaining computational tractability. Early methods\nare based on grouping super-pixels (e.g., Selective Search [36], CPMC [5], MCG [2]) and window\nscoring (e.g., objectness in windows [1], EdgeBoxes [43]). Although these methods dominated the\n\ufb01eld of object detection in classical computer vision, they exhibit limitations as they are external\nmodules independent of the detector and not computationally friendly. To overcome these limitations,\nRen et al. [34] propose the Region Proposal Network (RPN) that shares full-image convolutional\nfeatures with the detection network, enabling nearly cost-free region proposals.\n\nMulti-Stage RPN. There have been a number of studies attempting to improve the performance of\nRPN [14, 37, 38, 41]. The general trend is to perform multi-stage re\ufb01nement that takes the output of\na stage as the input of the next stage and repeats until accurate localization is obtained, as presented\nin [14]. However, this approach ignores the problem that the regressed boxes are misaligned to the\nimage features, breaking the alignment rule required for object detection. To alleviate this problem,\nrecent advanced methods [12, 37] rely on deformable convolution [10] to perform feature spatial\ntransformations and expect the learned transformations to align to the changes of anchor geometry.\nHowever, as there is no explicit supervision to learn the feature transformation, it is dif\ufb01cult to\ndetermine whether the improvement originates from conforming to the alignment rule or from the\nbene\ufb01ts of deformable convolution, thus making it less interpretable.\n\nAnchor-based vs. Anchor-free Criterion for Sample Discrimination. As a bounding box usu-\nally includes an object with some amount of background, it is dif\ufb01cult to determine if the box is a\npositive or a negative sample. 
This problem is usually addressed by comparing the Intersection over\nUnion (IoU) between an anchor and a ground truth box to a prede\ufb01ned threshold; thus, it is referred\nto as the anchor-based criterion. However, as the anchor is uniformly initialized, multiple anchors\nwith different scales and aspect ratios are required at each location to ensure that there are enough\npositive samples [34]. The hyperparameters, such as scales and aspect ratios, are usually heuristically\ntuned and have a large impact on the \ufb01nal accuracy [25, 34]. Rather than relying on anchors, there\nhave been studies that de\ufb01ne positive samples by the distance between the prediction points and the\ncenter region of objects, referred to as anchor-free [35, 40, 42]. This method is simple and requires\nfewer hyperparameters but usually exhibits limitations in dealing with complex scenes.\n\n3 Region Proposal Network and Variants\n\n3.1 Region Proposal Network\nGiven an image I of size W \u00d7 H, a set of anchor boxes A = {aij | 0 < (i + 1/2)s \u2264 W, 0 < (j + 1/2)s \u2264 H} is uniformly initialized over the image, with stride s. Unless otherwise speci\ufb01ed, i\nand j are omitted to simplify the notation. Each anchor box a is represented by a 4-tuple in the form\nof a = (ax, ay, aw, ah), where (ax, ay) is the center location of the anchor with the dimension of\n(aw, ah). The regression branch aims to predict the transformation \u03b4 from the anchor a to the target\nground truth box t represented as follows:\n\n\u03b4x = (tx \u2212 ax)/aw,\n\u03b4y = (ty \u2212 ay)/ah,\n\u03b4w = log(tw/aw),\n\u03b4h = log(th/ah).\n\n(1)\n\nHere, the regressor f takes as input the image feature x to output a prediction \u02c6\u03b4 = f (x) that\nminimizes the bounding box loss:\n\nL(\u02c6\u03b4, \u03b4) = \u03a3k\u2208{x,y,w,h} smoothL1(\u02c6\u03b4k \u2212 \u03b4k),\n\n(2)\n\nwhere smoothL1(\u00b7) is the robust L1 loss de\ufb01ned in [16]. The regressed anchor is simply inferred\nbased on the inverse transformation of (1) as follows:\n\na\u2032x = \u02c6\u03b4xaw + ax,\na\u2032y = \u02c6\u03b4yah + ay,\na\u2032w = aw exp(\u02c6\u03b4w),\na\u2032h = ah exp(\u02c6\u03b4h).\n\n(3)\n\n(a) RPN\n\n(b) Iterative RPN\n\n(c) Iterative RPN+\n\n(d) GA-RPN\n\n(e) Cascade RPN\n\nFigure 2: The architectures of different networks. \u201cI\u201d, \u201cH\u201d, \u201cC\u201d, and \u201cA\u201d denote input image, network\nhead, classi\ufb01er, and anchor regressor, respectively. \u201cConv\u201d, \u201cDefConv\u201d, \u201cDilConv\u201d and \u201cAdaConv\u201d\nindicate conventional convolution, deformable convolution [10], dilated convolution [39] and the\nproposed adaptive convolution layers, respectively.\n\nThen the set of regressed anchors A\u2032 = {a\u2032} is \ufb01ltered by non-maximum suppression (NMS) to\nproduce a sparse set of proposal boxes P:\n\nP = NMS(A\u2032, S),\n\n(4)\n\nwhere S is the set of objectness scores learned by the classi\ufb01cation branch.\n\n3.2 Iterative RPN and Variants\n\nSome previous studies [14, 41] have proposed iterative re\ufb01nement which is referred to as Iterative\nRPN, as shown in Figure 2b. Iterative RPN iteratively re\ufb01nes the anchors by treating A\u2032 as the new\ninitial anchor set for the next stage and repeats Eqs. (1) to (3) until obtaining accurate localization.\nHowever, this approach exhibits a mismatch between anchors and their represented features as the\nanchor positions and shapes change after each iteration.\nTo alleviate this problem, recent advanced methods [12, 37] use deformable convolution [10] to\nperform spatial transformations on the features as shown in Figures 2c and 2d and expect transformed\nfeatures to align to the change in anchor geometry. 
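As an illustration of Eqs. (1)–(3), the box transformation, its inverse, and the smooth L1 loss of [16] can be sketched in a few lines of NumPy. This is a minimal sketch; the helper names are ours, not from the paper's released code:

```python
import numpy as np

def encode(anchor, target):
    """Eq. (1): regression target delta from anchor (ax, ay, aw, ah) to box t."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = target
    return np.array([(tx - ax) / aw, (ty - ay) / ah,
                     np.log(tw / aw), np.log(th / ah)])

def decode(anchor, delta):
    """Eq. (3): inverse transformation, recovering the regressed anchor a'."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = delta
    return np.array([dx * aw + ax, dy * ah + ay,
                     aw * np.exp(dw), ah * np.exp(dh)])

def smooth_l1(x):
    """Robust L1 loss of [16]: quadratic for |x| < 1, linear beyond."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

anchor = np.array([16.0, 16.0, 32.0, 32.0])   # (center x, center y, w, h)
target = np.array([20.0, 12.0, 48.0, 24.0])   # ground-truth box, same form
delta = encode(anchor, target)
assert np.allclose(decode(anchor, delta), target)   # decode inverts encode
```

A perfect prediction drives the loss of Eq. (2) to zero, since smoothL1 is applied to each element-wise difference between the predicted and the target deltas.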
However, this idea ignores the problem that there\nis no constraint to enforce the features to align with the changes in anchors: it is dif\ufb01cult to determine\nwhether the deformable convolution produces a feature transformation leading to alignment. Instead,\nthe proposed Cascade RPN systematically ensures the alignment rule by using the proposed adaptive\nconvolution.\n\n4 Cascade RPN\n\n4.1 Adaptive Convolution\n\nGiven a feature map x, in the standard 2D convolution, the feature map is \ufb01rst sampled using a\nregular grid R = {(rx, ry)}, and the samples are summed up with the weight w. Here, the grid R\nis de\ufb01ned by the kernel size and dilation. For example, R = {(\u22121,\u22121), (\u22121, 0), . . . , (0, 1), (1, 1)}\ncorresponds to kernel size 3 \u00d7 3 and dilation 1. For each location p on the output feature y, we have:\n\ny[p] = \u03a3r\u2208R w[r] \u00b7 x[p + r].\n\n(5)\n\nIn adaptive convolution, the regular grid R is replaced by the offset \ufb01eld O that is directly inferred\nfrom the input anchor:\n\ny[p] = \u03a3o\u2208O w[o] \u00b7 x[p + o].\n\n(6)\n\nFigure 3: Illustrations of the sampling locations in different convolutional layers with 3 \u00d7 3 kernel\n(from left to right: conventional, dilated, deformable, and adaptive convolution).\n\nLet \u0101 denote the projection of anchor a onto the feature map. The offset o can be decoupled into a\ncenter offset and a shape offset (shown in Figure 2e):\n\no = octr + oshp,\n\n(7)\n\nwhere octr = (\u0101x \u2212 px, \u0101y \u2212 py) and oshp is de\ufb01ned by the anchor shape and kernel size. For example,\nif the kernel size is 3 \u00d7 3, then oshp \u2208 {(\u2212\u0101w/2, \u2212\u0101h/2), (\u2212\u0101w/2, 0), . . . , (0, \u0101h/2), (\u0101w/2, \u0101h/2)}.\nAs the offsets are typically fractional, sampling is performed with bilinear interpolation analogous to [10].\n\nRelation to other Convolutions. The illustrations of sampling locations in adaptive and other\nrelated convolutions are shown in Figure 3. Conventional convolution samples the features at\ncontiguous locations with a dilation factor of 1. The dilated convolution [39] increases the dilation\nfactor, aiming to enhance the semantic scope with unchanged computational cost. The deformable\nconvolution [10] augments the spatial sampling locations by learning the offsets. Meanwhile, the\nproposed adaptive convolution performs sampling within the anchors to ensure alignment between\nthe anchors and features. Adaptive convolution is closely related to the others. Adaptive convolution\nbecomes dilated convolution if the center offsets are zeros. Deformable convolution becomes adaptive\nconvolution if the offsets are deterministically derived from the anchors.\n\n4.2 Sample Discrimination Metrics\n\nInstead of using multiple anchors with prede\ufb01ned scales and aspect ratios, Cascade RPN relies on a\nsingle anchor per location and performs multi-stage re\ufb01nement. However, this reliance creates a new\nchallenge in determining whether a training sample is positive or negative, as the use of an anchor-free\nor anchor-based metric is highly adversarial. The anchor-free metric establishes a loose requirement for\npositive samples in the second stage, and the anchor-based metric results in an insuf\ufb01cient number of\npositive training examples at the \ufb01rst stage. To overcome this challenge, Cascade RPN progressively\nstrengthens the requirements through the stages by starting out with an anchor-free metric followed\nby anchor-based metrics in the ensuing stages. 
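To make the offset computation of Eq. (7) in the previous subsection concrete, the sampling offsets for one output location can be sketched as follows. This is a NumPy illustration with our own helper name; the actual implementation operates on whole feature maps inside the convolution:

```python
import numpy as np

def adaptive_offsets(anchor, p, kernel=3):
    """Offsets o = o_ctr + o_shp of Eq. (7) for one output location p.

    anchor: projected anchor (ax, ay, aw, ah) on the feature map.
    Returns kernel*kernel fractional (x, y) offsets relative to p, so that
    sampling at p + o covers a regular grid spanning the anchor box.
    """
    ax, ay, aw, ah = anchor
    px, py = p
    o_ctr = np.array([ax - px, ay - py])        # center offset
    lin = np.linspace(-0.5, 0.5, kernel)        # kernel grid over [-1/2, 1/2]
    xs, ys = np.meshgrid(lin * aw, lin * ah)    # shape offsets o_shp
    o_shp = np.stack([xs, ys], axis=-1).reshape(-1, 2)
    return o_ctr + o_shp                        # fractional -> bilinear sampling

offs = adaptive_offsets(anchor=(10.0, 8.0, 4.0, 6.0), p=(7.0, 7.0))
# the middle sample lands on the anchor center (offset (3, 1) from p);
# the corner samples lie at (+-aw/2, +-ah/2) around it
```

With zero center offsets this reduces to a dilated-convolution grid, matching the relation between adaptive and dilated convolution noted in the text.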
In particular, at the \ufb01rst stage, an anchor is a positive\nsample if its center is inside the center region of an object. In the following stages, an anchor is a\npositive sample if its IoU with an object is greater than the IoU threshold.\n\n4.3 Cascade RPN\n\nThe architecture of a two-stage Cascade RPN is illustrated in Figure 2e. Here, Cascade RPN relies on\nadaptive convolution to systematically align the features to the anchors. In the \ufb01rst stage, the adaptive\nconvolution is set to perform dilated convolution since anchor center offsets are zeros. The features of\nthe \ufb01rst stage are \u201cbridged\u201d to the next stages since the spatial order of the features is maintained by\nthe dilated convolution. The pipeline of the proposed Cascade RPN can be described mathematically\nin Algorithm 1. The anchor set at the \ufb01rst stage A1 is uniformly initialized over the image. At stage\n\u03c4, the anchor offset o\u03c4 is computed and fed into the regressor f \u03c4 to produce the regression prediction\n\u02c6\u03b4\u03c4. The prediction \u02c6\u03b4\u03c4 is used to produce regressed anchors a\u03c4+1. At the \ufb01nal stage, the objectness\nscores are derived from the classi\ufb01er, followed by NMS to produce the region proposals.\n\nAlgorithm 1. 
Cascade RPN\n\n1 Input: sequence of regressors f \u03c4 , classi\ufb01er g, feature x of image I.\n2 Output: proposal set P.\n3 Uniformly initialize anchor set A1 = {a1} over image I.\n4 for \u03c4 \u2190 1 to T do\n5   Compute offset o\u03c4 of input anchor a\u03c4 on feature map using (7).\n6   Compute regression prediction \u02c6\u03b4\u03c4 = f \u03c4 (x, o\u03c4 ).\n7   Compute regressed anchor a\u03c4+1 from \u02c6\u03b4\u03c4 using (3).\n8 end\n9 Compute objectness score s = g(x, oT ).\n10 Derive proposals P from A\u03c4+1 = {a\u03c4+1} and S = {s} using NMS (4).\n\n4.4 Learning\n\nCascade RPN can be trained in an end-to-end manner using a multi-task loss as follows:\n\nL = \u03bb \u03a3\u03c4=1..T \u03b1\u03c4 L\u03c4reg + Lcls.\n\n(8)\n\nHere, L\u03c4reg is the regression loss at stage \u03c4 with the weight of \u03b1\u03c4 , and Lcls is the classi\ufb01cation loss.\nThe two loss terms are balanced by \u03bb. In the implementation, binary cross entropy loss and IoU loss\n[40] are used as the classi\ufb01cation loss and regression loss, respectively.\n\n5 Experiments\n\n5.1 Experimental Setting\n\nThe experiments are performed on the COCO 2017 detection dataset [26]. All the models are trained\non the train split (115k images). The region proposal performance and ablation analysis are reported\non the val split (5k images), and the benchmarking detection performance is reported on the test-dev split\n(20k images).\nUnless otherwise speci\ufb01ed, the default model of the experiment is as follows. The model consists\nof two stages, with ResNet50-FPN [24] being its backbone. The use of two stages is to balance\naccuracy and computational ef\ufb01ciency. 
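The refinement loop of Algorithm 1 can be condensed into a plain-Python sketch. The regressor, classifier, and NMS below are toy stand-ins of our own, kept only to show the control flow:

```python
import numpy as np

def decode(anchors, deltas):
    """Eq. (3) applied to an (N, 4) array of (x, y, w, h) anchors."""
    out = anchors.copy()
    out[:, 0] += deltas[:, 0] * anchors[:, 2]
    out[:, 1] += deltas[:, 1] * anchors[:, 3]
    out[:, 2:] *= np.exp(deltas[:, 2:])
    return out

def cascade_rpn(anchors, regressors, classifier, nms):
    """Algorithm 1, schematically: T regression stages, then scoring and NMS."""
    for f_tau in regressors:          # stages tau = 1..T
        deltas = f_tau(anchors)       # prediction from anchor-aligned features
        anchors = decode(anchors, deltas)
    scores = classifier(anchors)      # objectness from the final stage
    return nms(anchors, scores)

# toy regressor: predicts deltas that move each anchor halfway to a fixed target
target = np.array([50.0, 40.0, 30.0, 20.0])
def halfway(anchors):
    return np.stack([(target[0] - anchors[:, 0]) / anchors[:, 2],
                     (target[1] - anchors[:, 1]) / anchors[:, 3],
                     np.log(target[2] / anchors[:, 2]),
                     np.log(target[3] / anchors[:, 3])], axis=1) * 0.5

anchors = np.array([[32.0, 32.0, 32.0, 32.0], [64.0, 64.0, 32.0, 32.0]])
best = cascade_rpn(anchors, [halfway, halfway],
                   classifier=lambda a: -np.abs(a[:, 0] - target[0]),
                   nms=lambda a, s: a[int(np.argmax(s))])
# after two stages, each anchor center has moved 75% of the way to the target
```

The actual heads additionally consume the offsets of Eq. (7) so that the features stay aligned with the refined anchors; that coupling is omitted here for brevity.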
A single anchor per location is used with sizes of 32\u00b2, 64\u00b2,\n128\u00b2, 256\u00b2, and 512\u00b2 corresponding to the feature levels C2, C3, C4, C5, and C6, respectively [24].\nThe \ufb01rst stage uses the anchor-free metric for sample discrimination with the center-region\nthreshold \u03c3ctr and the ignore-region threshold \u03c3ign, adopted from [40, 37], set to 0.2 and 0.5. The\nsecond stage uses the anchor-based metric with the IoU threshold of 0.7. The multi-task loss is set\nwith the stage-wise weight \u03b11 = \u03b12 = 1 and the balance term \u03bb = 10. The NMS threshold is set\nto 0.8. In all experiments, the long edge and the short edge of the images are resized to 1333 and\n800 respectively without changing the aspect ratio. No data augmentation is used except for standard\nhorizontal image \ufb02ipping. The models are implemented with PyTorch [29] and mmdetection [8]. The\nmodels are trained with 8 GPUs with a batch size of 16 (two images per GPU) for 12 epochs using\nthe SGD optimizer. The learning rate is initialized to 0.02 and divided by 10 after 8 and 11 epochs. It\ntakes about 12 hours for the models to converge on 8 Tesla V100 GPUs.\nThe quality of region proposals is measured with Average Recall (AR), which is the average of recalls\nacross IoU thresholds from 0.5 to 0.95 with a step of 0.05. The AR for 100, 300, and 1000 proposals\nper image are denoted as AR100, AR300, and AR1000. The AR for small, medium, and large objects\ncomputed at 100 proposals are denoted as ARS, ARM, and ARL, respectively. 
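The AR metric described above can be sketched directly from its definition. This is our own minimal implementation; COCO's official evaluation additionally handles crowd regions and per-image proposal caps:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def average_recall(proposals, gts):
    """Recall averaged over IoU thresholds 0.5:0.95 with step 0.05."""
    best = [max(iou(g, p) for p in proposals) for g in gts]  # best IoU per GT
    thresholds = np.linspace(0.5, 0.95, 10)
    return float(np.mean([np.mean([b >= t for b in best]) for t in thresholds]))

gts = [np.array([0.0, 0.0, 10.0, 10.0])]
proposals = [np.array([1.0, 1.0, 11.0, 11.0])]   # IoU with the GT is ~0.68
print(average_recall(proposals, gts))            # recovered at 4 of 10 thresholds -> 0.4
```

AR100, AR300, and AR1000 are obtained by capping the proposal list at the corresponding budget before this computation.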
Detection results are\nevaluated with the standard COCO-style Average Precision (AP) measured at IoUs from 0.5 to 0.95.\nThe runtime is measured on a single Tesla V100 GPU.\n\nTable 1: Region proposal results on COCO 2017 val.\n\nMethod | Backbone | AR100 | AR300 | AR1000 | ARS | ARM | ARL | Time (s)\nSharpMask [30] | ResNet-50 | 36.4 | - | 48.2 | - | - | - | 0.76\nGCN-NS [28] | VGG-16 (Sync BN) | 31.6 | - | 60.7 | - | - | - | 0.10\nAttractioNet [14] | VGG-16 | 53.3 | - | 66.2 | 31.5 | 62.2 | 77.7 | 4.00\nZIP [22] | BN-inception | 53.9 | - | 67.0 | 31.9 | 63.0 | 78.5 | 1.13\nRPN [34] | ResNet-50-FPN | 44.6 | 52.9 | 58.3 | 29.5 | 51.7 | 61.4 | 0.04\nIterative RPN | ResNet-50-FPN | 48.5 | 55.4 | 58.8 | 32.1 | 56.9 | 65.4 | 0.05\nIterative RPN+ | ResNet-50-FPN | 54.0 | 60.4 | 63.0 | 35.6 | 62.7 | 73.9 | 0.06\nGA-RPN [37] | ResNet-50-FPN | 59.1 | 65.1 | 68.5 | 40.7 | 68.2 | 78.4 | 0.06\nCascade RPN | ResNet-50-FPN | 61.1 | 67.6 | 71.7 | 42.1 | 69.3 | 82.8 | 0.06\n\nTable 2: Detection results on COCO 2017 test-dev.\n\nMethod | Proposal method | # proposals | AP | AP50 | AP75 | APS | APM | APL\nFast R-CNN | RPN | 1000 | 37.0 | 59.5 | 39.4 | 21.1 | 39.9 | 47.0\nFast R-CNN | Cascade RPN | 1000 | 40.1 | 59.5 | 42.4 | 22.8 | 43.7 | 50.9\nFast R-CNN | RPN | 300 | 36.6 | 58.6 | 39.1 | 20.3 | 39.5 | 47.0\nFast R-CNN | Iterative RPN+ | 300 | 38.6 | 58.8 | 41.5 | 21.1 | 42.2 | 50.0\nFast R-CNN | GA-RPN | 300 | 39.5 | 59.3 | 42.0 | 21.8 | 43.2 | 50.7\nFast R-CNN | Cascade RPN | 300 | 40.1 | 59.4 | 42.4 | 22.1 | 43.8 | 51.6\nFaster R-CNN | RPN | 1000 | 37.1 | 59.3 | 39.8 | 21.4 | 40.1 | 46.5\nFaster R-CNN | Cascade RPN | 1000 | 40.5 | 59.3 | 42.9 | 22.6 | 44.2 | 51.5\nFaster R-CNN | RPN | 300 | 36.9 | 58.9 | 39.6 | 21.1 | 39.9 | 46.5\nFaster R-CNN | Iterative RPN+ | 300 | 39.2 | 58.2 | 42.0 | 21.5 | 43.0 | 50.4\nFaster R-CNN | GA-RPN | 300 | 39.9 | 59.4 | 42.6 | 22.0 | 43.6 | 50.9\nFaster R-CNN | Cascade RPN | 300 | 40.6 | 58.9 | 42.8 | 22.0 | 44.5 | 52.6\n\n5.2 Benchmarking Results\n\nRegion Proposal Performance. The performance of Cascade RPN is compared to those of recent\nstate-of-the-art region proposal methods, including RPN [34], SharpMask [30], GCN-NS [28],\nAttractioNet [14], ZIP [22], and GA-RPN [37]. 
In addition, Iterative RPN and Iterative RPN+,\nwhich are referred to in Figure 2, are also benchmarked. The results of SharpMask, GCN-NS,\nAttractioNet, and ZIP are cited from their respective papers. The results of the remaining methods are reproduced\nusing mmdetection [8]. Table 1 summarizes the benchmarking results. In particular, Cascade\nRPN achieves AR 13.4 points higher than that of the conventional RPN. Cascade RPN consistently\noutperforms the other methods in terms of AR under different settings of proposal thresholds\nand object scales. The alignment rule is typically missing or loosely conformed to in the other\nmethods; thus, their performance improvements are limited. The alignment rule in Cascade RPN is\nsystematically ensured such that the performance gain is greater and more reliable.\n\nDetection Performance. To investigate the bene\ufb01t of high-quality proposals, Cascade RPN and\nthe baselines are integrated into common two-stage object detectors, including Fast R-CNN and\nFaster R-CNN. Here, Fast R-CNN is trained on precomputed region proposals while Faster R-CNN\nis trained in an end-to-end manner. As studied in [37], despite high-quality region proposals, training\na good detector is still a non-trivial problem, and simply replacing RPN by Cascade RPN without\nchanges in the settings only brings limited gain. Following [37], the IoU threshold in R-CNN is\nincreased and the number of proposals is decreased. In particular, the IoU threshold and the number\nof proposals are set to 0.65 and 300, respectively. 
The experimental results are reported in Table 2.\nHere, integrating RPN into Fast R-CNN and Faster R-CNN yields 37.0 and 37.1 mAP, respectively.\nFrom the results, the recall improvement is correlated with improvements in detection performance.\nAs it has the highest recall, Cascade RPN boosts the performance of Fast R-CNN and Faster R-CNN\nto 40.1 and 40.6 mAP, respectively.\n\nFigure 4: Examples of region proposal results at stage 1 (\ufb01rst row) and stage 2 (second row) of\nCascade RPN.\n\nTable 3: Ablation analysis of Cascade RPN. Here, Align., AFAB, and Stats. denote the use of\nalignment, anchor-free and anchor-based metrics, and regression statistics, respectively.\n\nBaseline | 1 anchor | Cascade | Align. | AFAB | Stats. | IoU loss | AR100 | AR300 | AR1000\n\u2713 | | | | | | | 44.6 | 52.9 | 58.3\n| \u2713 | | | | | | 44.7 | 51.2 | 55.8\n| \u2713 | \u2713 | | | | | 48.2 | 54.4 | 58.0\n| \u2713 | \u2713 | \u2713 | | | | 57.4 | 63.7 | 67.8\n| \u2713 | \u2713 | \u2713 | \u2713 | | | 57.3 | 64.2 | 68.6\n| \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | | 60.8 | 67.3 | 71.5\n| \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | 61.1 | 67.6 | 71.7\nOverall Improvement | | | | | | | +16.5 | +14.7 | +13.4\n\nTable 4: The effects of alignment.\n\nCenter | Shape | AR100 | AR300 | AR1000\n| | 48.2 | 54.4 | 58.0\n\u2713 | | 52.5 | 59.4 | 64.1\n\u2713 | \u2713 | 57.4 | 63.7 | 67.8\n\nTable 5: The effects of sample metrics.\n\nAF | AB | AR100 | AR300 | AR1000\n\u2713 | | 55.2 | 61.8 | 66.4\n| \u2713 | 57.4 | 63.7 | 67.8\n\u2713 | \u2713 | 57.3 | 64.2 | 68.6\n\n5.3 Ablation Study\n\nComponent-wise Analysis. To demonstrate the effectiveness of Cascade RPN, a comprehensive\ncomponent-wise analysis is performed in which different components are omitted. The results are\nreported in Table 3. 
Here, the baseline is RPN with 3 anchors per location yielding AR1000 of 58.3.\nWhen the number of anchors per location is reduced to 1, the AR1000 drops to 55.8, implying that\nthe number of positive samples dramatically decreases. Even when the multi-stage cascade is added,\nthe performance is 58.0, which is still lower than that of the baseline. However, when adaptive\nconvolution is applied to ensure alignment, the performance surges to 67.8, showing the importance\nof alignment in multi-stage re\ufb01nement. The incorporation of anchor-free and anchor-based metrics\nfor sample discrimination incrementally improves AR1000 to 68.6. The use of regression statistics\n(shown in Figure 1a) increases the performance to 71.5. Finally, applying IoU loss yields a slight\nimprovement of 0.2 points. Overall, Cascade RPN achieves 16.5, 14.7, and 13.4 points improvement\nin terms of AR100, AR300, and AR1000 respectively, compared to the conventional RPN.\n\nAcquisition of Alignment.\nTo demonstrate the effectiveness of the proposed adaptive convolution,\nthe center and shape alignments, represented by the offsets in Eq. (7), are progressively applied. Here,\nthe center and shape offsets maintain the alignments in position and semantic scope, respectively.\nTable 4 shows that the AR1000 improves from 58.0 to 64.1 using only the center alignment. 
When\nboth the center and shape alignments are ensured, the performance increases to 67.8.\n\nFigure 5: More examples of region proposal results of Cascade RPN.\n\nTable 6: Ablation study on # stages.\n\n# stages | AR100 | AR300 | AR1000 | Time (s)\n1 | 56.0 | 62.2 | 66.3 | 0.04\n2 | 61.1 | 67.6 | 71.7 | 0.06\n3 | 60.9 | 67.9 | 72.2 | 0.08\n\nTable 7: Detection results of Cascade R-CNN with\nRPN and Cascade RPN (denoted by CRPN).\n\nMethod | AP | AP50 | AP75 | APS | APM | APL\nRPN | 40.8 | 59.3 | 44.3 | 22.0 | 44.2 | 54.2\nCRPN | 41.6 | 59.0 | 45.5 | 23.0 | 45.0 | 55.2\n\nSample Discrimination Metrics. The experimental results with different combinations of sample\ndiscrimination metrics are shown in Table 5. Here, AF and AB denote that the anchor-free and\nanchor-based metrics are applied for all stages, respectively. Meanwhile, AFAB indicates that the\nanchor-free metric is applied at stage 1 followed by the anchor-based metric at stage 2. Here, AF and\nAB yield AR1000 of 66.4 and 67.8 respectively, both of which are signi\ufb01cantly less than that of\nAFAB. It is noted that the thresholds for each metric are already adapted through the stages. The results\nimply that applying only one of either the anchor-free or the anchor-based metric is highly adversarial;\nboth metrics should be incorporated to achieve the best results.\n\nQualitative Evaluation. The examples of region proposal results at the \ufb01rst and second stages are\nillustrated in the \ufb01rst and second rows of Figure 4, respectively. The results show that the output\nproposals at the second stage are more accurate and cover a larger number of objects.\n\nNumber of Stages. Table 6 shows the proposal performance for different numbers of stages. In the\n3-stage Cascade RPN, an IoU threshold of 0.75 is used for the third stage. The 2-stage Cascade RPN\nachieves the best trade-off between AR1000 and inference time.\n\nExtension with Cascade R-CNN. 
Table 7 reports the detection results of Cascade R-CNN [4] with different proposal methods. Cascade RPN improves AP by 0.8 points compared to RPN. The improvement comes mainly from AP75, where a high IoU with the ground truth is required.

6 Conclusion

This paper introduces Cascade RPN, a simple yet effective network architecture for improving region proposal quality and object detection performance. Cascade RPN systematically addresses the limitation of the conventional RPN, which heuristically defines the anchors and aligns the features to the anchors. A simple implementation of a two-stage Cascade RPN achieves AR 13.4 points higher than the baseline, surpassing any existing region proposal method. When adopted in Fast R-CNN and Faster R-CNN, Cascade RPN improves the detection mAP by 3.1 and 3.5 points, respectively.

Acknowledgment. This work was supported by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2017-0-01780, The technology development for event recognition/relational reasoning and learning knowledge-based system for video understanding; and No. 2019-0-01396, Development of framework for analyzing, detecting, mitigating of bias in AI model and training data).

References

[1] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. Measuring the objectness of image windows. TPAMI, 2012.

[2] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014.

[3] Carlos Astua, Ramon Barber, Jonathan Crespo, and Alberto Jardon. Object detection techniques applied on mobile robot semantic navigation. Sensors, 2014.

[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.

[5] Joao Carreira and Cristian Sminchisescu.
CPMC: Automatic object segmentation using constrained parametric min-cuts. TPAMI, 2011.

[6] Neelima Chavali, Harsh Agrawal, Aroma Mahendru, and Dhruv Batra. Object-proposal evaluation protocol is 'gameable'. In CVPR, 2016.

[7] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.

[8] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv:1906.07155, 2019.

[9] Donatello Conte, Pasquale Foggia, Michele Petretta, Francesco Tufano, and Mario Vento. Meeting the application requirements of intelligent video surveillance systems in moving object detection. In International Conference on Pattern Recognition and Image Analysis, 2005.

[10] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.

[11] Michael Danielczuk, Matthew Matl, Saurabh Gupta, Andrew Li, Andrew Lee, Jeffrey Mahler, and Ken Goldberg. Segmenting unknown 3D objects from real depth images using Mask R-CNN trained on synthetic point clouds. arXiv:1809.05825, 2018.

[12] Heng Fan and Haibin Ling. Siamese cascaded region proposal networks for real-time visual tracking. In CVPR, 2019.

[13] Paul Furgale, Ulrich Schwesinger, Martin Rufli, Wojciech Derendarz, Hugo Grimmett, Peter Mühlfellner, Stefan Wonneberger, Julian Timpner, Stephan Rottmann, Bo Li, et al. Toward automated driving in cities using close-to-market sensors: An overview of the V-Charge project.
In IEEE Intelligent Vehicles Symposium (IV), 2013.

[14] Spyros Gidaris and Nikos Komodakis. Attend refine repeat: Active box proposal generation via in-out localization. arXiv:1606.04446, 2016.

[15] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[16] Ross B. Girshick. Fast R-CNN. In ICCV, 2015.

[17] Christian Häne, Torsten Sattler, and Marc Pollefeys. Obstacle detection for self-driving cars using only monocular cameras and wheel odometry. In IROS, 2015.

[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.

[19] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? TPAMI, 2016.

[20] Jan Hosang, Rodrigo Benenson, and Bernt Schiele. How good are detection proposals, really? arXiv:1406.6962, 2014.

[21] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.

[22] Hongyang Li, Yu Liu, Wanli Ouyang, and Xiaogang Wang. Zoom out-and-in network with map attention decision for region proposal and object detection. IJCV, 2019.

[23] Wentong Liao, Chun Yang, Michael Ying Yang, and Bodo Rosenhahn. Security event recognition for visual surveillance. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2017.

[24] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.

[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.

[28] Hsueh-Fu Lu, Xiaofei Du, and Ping-Lin Chang. Toward scale-invariance and position-sensitive region proposal networks. In ECCV, 2018.

[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[30] Pedro O. Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In ECCV, 2016.

[31] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.

[32] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.

[33] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv:1804.02767, 2018.

[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.

[35] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.

[36] Jasper R. R. Uijlings, Koen E. A. Van De Sande, Theo Gevers, and Arnold W. M. Smeulders. Selective search for object recognition. IJCV, 2013.

[37] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In CVPR, 2019.

[38] Bin Yang, Junjie Yan, Zhen Lei, and Stan Z. Li. Craft objects from images. In CVPR, 2016.

[39] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.

[40] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. UnitBox: An advanced object detection network.
In 24th ACM International Conference on Multimedia, 2016.

[41] Qiaoyong Zhong, Chao Li, Yingying Zhang, Di Xie, Shicai Yang, and Shiliang Pu. Cascade region proposal and global context for deep object detection. arXiv:1710.10749, 2017.

[42] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In CVPR, 2019.

[43] C. Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.