{"title": "Sequential Context Encoding for Duplicate Removal", "book": "Advances in Neural Information Processing Systems", "page_first": 2049, "page_last": 2058, "abstract": "Duplicate removal is a critical step to accomplish a reasonable amount of predictions in prevalent proposal-based object detection frameworks. Albeit simple and effective, most previous algorithms utilized a greedy process without making sufficient use of properties of input data. In this work, we design a new two-stage framework to effectively select the appropriate proposal candidate for each object. The first stage suppresses most of easy negative object proposals, while the second stage selects true positives in the reduced proposal set. These two stages share the same network structure, an encoder and a decoder formed as recurrent neural networks (RNN) with global attention and context gate. The encoder scans proposal candidates in a sequential manner to capture the global context information, which is then fed to the decoder to extract optimal proposals. In our extensive experiments, the proposed method outperforms other alternatives by a large margin.", "full_text": "Sequential Context Encoding for Duplicate Removal\n\nShu Liu1,3\n1The Chinese University of Hong Kong\n\nLu Qi1\n\nJianping Shi2\n2SenseTime Research\n\nJiaya Jia1,3\n\n{luqi, sliu, leojia}@cse.cuhk.edu.hk\n\nshijianping@sensetime.com\n\n3 YouTu Lab, Tencent\n\nAbstract\n\nDuplicate removal is a critical step to accomplish a reasonable amount of pre-\ndictions in prevalent proposal-based object detection frameworks. Albeit simple\nand effective, most previous algorithms utilize a greedy process without making\nsuf\ufb01cient use of properties of input data. In this work, we design a new two-stage\nframework to effectively select the appropriate proposal candidate for each object.\nThe \ufb01rst stage suppresses most of easy negative object proposals, while the second\nstage selects true positives in the reduced proposal set. These two stages share the\nsame network structure, i.e., an encoder and a decoder formed as recurrent neural\nnetworks (RNN) with global attention and context gate. The encoder scans pro-\nposal candidates in a sequential manner to capture the global context information,\nwhich is then fed to the decoder to extract optimal proposals. In our extensive\nexperiments, the proposed method outperforms other alternatives by a large margin.\n\n1\n\nIntroduction\n\nObject detection is a fundamentally important task in computer vision and has been intensively\nstudied. With convolutional neural networks (CNNs) [15], most high-performing object detection\nsystems [15, 20, 16, 32, 7, 23, 19, 27] follow the proposal-base object detection framework, which\n\ufb01rst gathers a lot of object proposals and then conducts classi\ufb01cation and regression to infer the\nlabel and location of objects in the given image. The \ufb01nal inevitable step is duplicate removal that\neliminates highly overlapped detection results and only retains the most accurate bounding box for\neach object.\nState-of-the-Art: Most research on object detection focuses on the \ufb01rst two steps to generate\naccurate object proposals and corresponding class labels. In contrast, research of duplicate removal\nis left far behind. NMS [12], which iteratively selects proposals according to the prediction score\nand suppresses overlapped proposals, is still a popular and default solution. 
Soft-NMS [3] extends it\nby decreasing scores of highly-overlapped proposals instead of deleting them. Box voting [14, 26]\nimproves NMS by grouping highly-overlapped proposals for generating new prediction. In [4], it\nshows that to learn the functionality of NMS automatically with a spatial memory is possible. Most\nrecently, relation network [22] models the relation between object proposals with the same prediction\nclass label.\nMotivation: Optimal duplicate removal is to choose the only correct proposal for each object. The\ndif\ufb01culty is that during inference we actually do not know what is the object. In the detection network,\nwe already obtain the feature of region of interest (RoI) for classi\ufb01cation and regression. But this\npiece of information is seldom considered in the \ufb01nal duplicate removal when the score and location\nof each proposal candidate are available. It may be because the feature data is relatively heavy and\npeople think it is already abstracted in the candidate scores. If this is true, using it again in the \ufb01nal\nstep may cause information redundancy and waste computation.\nSo the \ufb01rst contribution of our method is to better utilize different kinds of information into duplicate\nestimation. We surprisingly found that it is very useful to improve candidate selection. The features\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Illustration of our two-stage sequential context encoding framework.\n\nform the global view of understanding objects rather than only considering a single category or\nindependent proposals based on scores.\nThe second major contribution is the way to process candidate data. We take the large number of\nproposal candidates as a sequence data including its unique structure, and adopt recurrent neural\nnetworks (RNN) to extract vital information.\nIt is based on our thoughtful design to generate\nprediction from a more global perspective in the entire image and make full use of the bi-directional\nhidden states.\nOur \ufb01nal RNN-based duplicate removal model is therefore by nature different from previous solutions\n[3, 26, 22, 14, 21, 4]. It sequentially scans all object proposals to capture global context and makes\n\ufb01nal prediction based on the extra helpful information. Due to the enormous difference between\nproposal candidate and ground truth object numbers, our model is divided into two stages and\nperforms in a way like boosting [13]. The \ufb01rst stage suppresses many easy negatives and the second\nperforms \ufb01ner selection. The two stages are with the same network structure, including encoder and\ndecoder as RNNs, along with context gate and global attention modules.\nOur method achieves consistent improvement on COCO [25] data in different object detection\nframeworks [23, 27, 8, 19]. The new way to utilize RNN for duplicate removal makes the solution\nrobust, general and fundamentally different from other corresponding methods, which will be detailed\nmore later in this paper. Our code and models are made publicly available.\n\n2 Related Work\n\nObject Detection DPM [12] is representative before utilizing CNN, which considers sliding\nwindows in image pyramids to detect objects. R-CNN [15] makes use of object proposals and CNN,\nand achieves remarkable improvement. SPPNet [20] and Fast R-CNN [16] yield faster speed by\nextracting global feature maps. Computation is shared by object proposals. 
Faster R-CNN [32]\nfurther enhances performance and speed by designing the region proposal network, which generates\nhigh-quality object proposals with neural networks. Other more recent methods [23, 19, 7, 27, 8, 27]\nimprove object detection by modifying network structures.\nAnother line of research followed the single-stage pipeline. YOLO [31], SSD [28] and RentinaNet\n[24] regress objects directly based on a set of pre-de\ufb01ned anchor boxes, achieving faster inference\nspeed. Although these frameworks differ in their operation aspects, the duplicate-removal step is\nneeded by all of them to \ufb01nally achieve decent performance.\nDuplicate Removal Duplicate removal is an indispensable step in object detection frameworks.\nThe most widely used algorithm is non-maximum suppression (NMS). This simple method does not\nconsider any context information and property of input data \u2013 many overlapped proposals are rejected\ndirectly. Soft-NMS [3] instead keeps all object proposals while decreasing their prediction scores\nbased on overlap with the highest-score proposal. The limitation is that many proposals may still be\nkept in \ufb01nal prediction. Box voting [14, 26] makes use of information of other proposals by grouping\nhighly-overlapped ones. With more information used, better localization quality can be achieved.\nDesai et al.[10] explicitly encoded the class label of each candidate and their relation with respect to\nlocation. Final prediction was selected by optimizing the loss function considering all candidates\nand their relation. Class occurrence was considered to re-score object proposals in DPM [12] to\nslightly improve performance. More recently, GossipNet [21] processed a set of objects as a whole for\nduplicate removal with a relatively complex network structure with higher computation complexity.\nSpatial memory network [4] improved NMS by utilizing both semantic and location information.\nRelation network [22] models the relation between different proposals with the attention mechanism,\ntaking both appearance and location of proposals into consideration.\n\n2\n\nStage \u2160DetectionNetworkStage \u2161\fDifferent from all these methods, we utilize an encode-decoder structure with RNN to capture and\nutilize the global context. With only simple operations, consistently better performance is achieved\non all detection frameworks we experimented with.\nSequence Model RNN has been successfully applied to many sequence tasks like speech recognition\n[18], music generation [17], machine translation [2] and video activity recognition [11, 9, 1]. In\nneural machine translation (NMT), the concept of attention becomes common in training networks\n[29, 5, 30, 2, 33], allowing models to learn alignment between different modalities. In [2], parts\nof a source sentence were automatically searched that are relevant to predicting a target word. All\nsource words were attended and only a subset of source words were considered at a time [29].\nIntuitively, generation of content and functional words should rely much on the source and target\ncontext respectively. In [33], context gates dynamically control the ratios, at which source and target\ncontext contributes to the generation of target words.\n\n3 Motivation\n\nWe \ufb01rst analyze the necessity and potential of duplicate removal. We take the three representative\nobject detection systems as baselines, which include FPN [23], Mask R-CNN [19] and PANet [27]\nwith DCN [8]. FPN can yield high-quality object detection results. 
Mask R-CNN is designed for instance segmentation and is suitable for multi-task training. PANet with DCN achieves state-of-the-art performance on both instance segmentation and object detection in recent challenges, and is thus a very strong baseline.

Model | No Removal | NMS | Score (Oracle) | IoU (Oracle)
FPN [23] | 10.3 | 37.1 | 47.3 | 65.2
Mask R-CNN [19] | 12.0 | 38.9 | 49.3 | 63.4
PANet [27] with DCN [8] | 10.6 | 43.7 | 53.9 | 68.9
Table 1: Performance by modifying the duplicate removal step on COCO data [25].

In terms of the importance of duplicate removal, we explore the performance drop of different detection methods without the final candidate selection step, i.e., simply taking as final predictions all proposals whose class scores are higher than a threshold. As shown in the "No Removal" column of Table 1, the three frameworks only achieve around 11 points in terms of mAP, a decrease of more than 20 points. This experiment manifests the necessity of duplicate removal.

Then we evaluate the potential for improving final results when the duplicate removal step gets better. It is done by exploring the tight upper bound of performance given ground-truth objects during testing. For each ground-truth object, like NMS, we only select the proposal candidate with the largest score that also satisfies the overlap threshold. With these optimal choices, as shown in the "Score (Oracle)" column, the performance of all three baseline methods is enhanced by more than 10 points. This experiment shows there is in fact much room for improvement at the duplicate removal step.

Beyond this potential improvement, we also conduct experiments to evaluate the influence of the inevitable proposal score errors introduced during proposal generation. They inevitably influence duplicate removal since the scores are the most prominent indication of proposal quality and are utilized by methods like NMS, Soft-NMS and box voting to select proposals. In this experiment, we select the proposal candidate with the largest overlap with its corresponding ground truth. The results shown in the "IoU (Oracle)" column manifest that traditional NMS methods are likely to be influenced by the quality of prediction scores. Unlike NMS, which only considers scores, our method has the ability to suppress proposals with high prediction scores but low localization quality.

4 Our Approach

The key challenge for duplicate removal is the extreme imbalance between the numbers of proposal candidates and ground-truth objects. For example, a detection network can generate 1,000+ proposal candidates for each class compared with 10 or 20 ground-truth objects, making it hard for the network to capture the properties of the entire image. To balance the positive and negative sample distributions, our framework cascades two stages to gradually remove duplicates, in a way analogous to boosting. This is because, within any single image, an overwhelming majority of proposal candidates are negative; as such, the cascade attempts to reject as many negatives as possible at the early stage [35].

Figure 2: Details of our network components including feature embedding, encoder-decoder, global attention, context gate and final decision.

Stage I suppresses easy negatives, which occupy a large portion of the input object proposals in Fig. 1. In stage II, we focus on eliminating the remaining hard negative proposals; the inference-time cascade is sketched below.
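The sketch below is a schematic outline only, not the released implementation: stage_one and stage_two stand for the two trained copies of the network described next, the 0.01 threshold is the one used in our experiments, and the way the stage-I score is thresholded is our reading of Sections 4 and 5.

def remove_duplicates(candidates, stage_one, stage_two, thresh=0.01):
    # candidates: per-class proposals from the detection network, each carrying
    # an appearance feature, a box and a detector score s0.
    kept = [c for c in candidates if c.s0 >= thresh]                 # input to stage I
    s1 = stage_one(kept)                                             # suppress easy negatives
    survivors = [c for c, s in zip(kept, s1) if c.s0 * s >= thresh]  # reduced proposal set
    s1 = stage_two(survivors)                                        # select true positives
    return [(c, c.s0 * s) for c, s in zip(survivors, s1)]            # final score is s0 * s1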
These stages share the same network structure, including feature embedding, encoder, decoder, global attention, context gate and final decision. These components are deliberately designed and evaluated to help our model make a comprehensive decision with multi-dimensional information.

Briefly speaking, we first transform the primitive features extracted from the object detection network for each proposal into low-grade features through feature embedding. Then the encoder RNN extracts middle-grade features to capture the global context information of all proposals, stored in the final hidden state of the encoder. The decoder inherits the global-context hidden state and re-scans the proposal candidates to produce high-grade features. Global attention seeks the relation of each proposal candidate by combining the middle- and high-grade features. In case lower-layer information is missing at the top of the network, the context gate is employed to selectively enhance it. The refined feature vector of each proposal then determines whether the candidate should be kept or not. The overall network structure is shown in Fig. 2.

4.1 Feature Embedding

Features output from the object detection network are semantically informative. We extract an appearance feature fA, a geometric feature fG and a score feature fS for each proposal, where fS is the 1D prediction class score, fA is the 1,024D feature from the last fully-connected (fc) layer of the proposal subnet in detection, and fG contains the 4D prediction coordinates. Since fG and fS are abstract representations of fA in the detection network, a 'smooth' operation on fA is needed before fusing fG, fS and fA. To this end, the appearance feature fA is non-linearly transformed into a dl-D vector. Meanwhile, to maintain a scale-invariant representation for each bounding box, we denote fG as (log(x1/w + 0.5), log(y1/h + 0.5), log(x2/w + 0.5), log(y2/h + 0.5)), where (x1, y1, x2, y2) are the top-left and bottom-right coordinates of the proposal and (w, h) are the image width and height.

Intuitively, the closer proposal candidates are, the more similar their scores and appearance features are. To make our network better capture the quality information from the detection network, we rank proposal candidates in descending order of their class scores, so each proposal candidate receives a rank in [1, N]. The scalar rank is then embedded into a higher-dimensional feature fR using positional encoding [34], which computes cosine and sine functions with different wavelengths to guarantee orthogonality between ranks. The feature dimension dr after embedding is typically 32.

To balance the importance of features, the geometric feature fG and score feature fS are both tiled to dr dimensions; this embedding step is sketched below.
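A minimal sketch of the rank embedding and tiling, assuming PyTorch and a Transformer-style sinusoidal encoding; the function names and the exact frequency schedule are ours, not taken from the released code:

import math
import torch

def rank_embedding(ranks, dr=32):
    # ranks: (N,) tensor of ranks obtained by sorting proposals by class score.
    # Sinusoidal positional encoding [34]: sine/cosine at different wavelengths.
    positions = ranks.float().unsqueeze(1)                  # (N, 1)
    dims = torch.arange(0, dr, 2).float()                   # (dr/2,)
    freqs = torch.exp(-math.log(10000.0) * dims / dr)
    angles = positions * freqs                              # (N, dr/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)  # (N, dr) -> f_R

def tile_feature(f, dr=32):
    # Repeat a low-dimensional feature (the 1D score f_S or 4D box f_G) to dr dims
    # so all inputs carry comparable weight before the concatenation below.
    reps = -(-dr // f.shape[1])          # ceiling division
    return f.repeat(1, reps)[:, :dr]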
The transformed fA, tiled fG, tiled fS and fR are then concatenated and transformed into the smoother low-grade feature fL as

fL = Max{0, WL × Concat [Max (0, WA × fA) , fS, fR, fG]} .    (1)

4.2 Encoder-decoder Structure

It is hard for an RNN to capture the appropriate information if the sequence data is in a random order. To alleviate this issue, we sort proposals in descending order of their class scores, so proposal candidates with higher class scores are fed to the encoder or decoder earlier. Moreover, each proposal needs the context found in the other proposals to encode global information. To make good use of it, we choose a bi-directional gated recurrent unit (GRU) network as our basic RNN model. Compared with LSTM, GRU has fewer parameters and performs better on small data [6]. Its bi-directionality helps our model capture global information from both scan orders.

For each stage, the encoder takes fL as input and outputs the middle-grade feature fM. Unlike the encoder, whose hidden state is zero-initialized, the decoder receives the hidden state of the encoder at the final time step, which carries the context information of all proposals and serves as the basis for the decoder to re-scan fL and produce the high-grade feature fH. The size of the hidden state in the GRU is the same as the input feature dimension. Given the imbalance of class distributions, similar to traditional NMS and relation network [22], our method is applied to each class independently.

4.3 Global Attention

Even though we pass the hidden state at the final time step of the encoder to the decoder, it is still hard for a single hidden state to embed all global information. As a remedy, we enable the decoder to access the representation of each proposal in the encoder, leading to better utilization of all proposals.

Since the input data and structures of our encoder and decoder are identical except for their initial hidden states, their output vectors tend to be similar, making it difficult for vanilla attention approaches [29] to capture their underlying relation. To address this issue, we apply a mechanism similar to Bahdanau attention [2] that first transforms the outputs of the encoder and decoder into two different feature spaces and then learns their relation. Concretely, we calculate a set of attention weights Sa for the middle-grade feature fM as

Sa = µ{WS × Tanh [Tile (WM × fM) + Tile (WH × fH)]} , µ = Softmax,    (2)

where fM and fH are both linearly transformed and tiled. The tile operation expands singleton dimensions of a feature to the size of fM or fH, for example tiling a tensor from N × dm to N × N × dm, where N denotes the number of proposal candidates; this attention module is sketched below.
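A minimal sketch of this attention module, covering Eq. (2) and the combination in Eq. (3) below, assuming PyTorch; tensor shapes and module names are our reading of the paper rather than the released code:

import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    # Bahdanau-style attention between encoder outputs f_M and decoder outputs f_H.
    def __init__(self, dm):
        super().__init__()
        self.W_M = nn.Linear(dm, dm, bias=False)
        self.W_H = nn.Linear(dm, dm, bias=False)
        self.W_S = nn.Linear(dm, 1, bias=False)
        self.W_G = nn.Linear(2 * dm, dm)

    def forward(self, f_M, f_H):
        # f_M, f_H: (N, dm) middle- and high-grade features of the N proposals.
        # Broadcasting replaces the Tile operation: every proposal scores every proposal.
        e = torch.tanh(self.W_M(f_M).unsqueeze(0) + self.W_H(f_H).unsqueeze(1))   # (N, N, dm)
        S_a = torch.softmax(self.W_S(e).squeeze(-1), dim=-1)                       # (N, N), Eq. (2)
        attended = S_a @ f_M                                                       # (N, dm)
        f_global = torch.relu(self.W_G(torch.cat([attended, f_H], dim=-1)))        # Eq. (3)
        return f_global

Here S_a @ f_M aggregates the encoder representations for each proposal before they are fused with f_H, which is the role the BMM step plays in Fig. 2.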
By mapping fM and fH to different feature spaces, our attention can focus more on the other proposal candidates. Finally, we obtain the global feature fG by combining and smoothing fM and fH as

fG = Max (0, WG × Concat [Sa × fM, fH]) .    (3)

4.4 Context Gate

In neural machine translation, generation of a target word depends on both source and target context. The source context has a direct impact on the adequacy of the translation, while the target context affects fluency [33]. Similarly, to compensate for missing information, we design a context gate to combine the low-grade feature fL, high-grade feature fH and global feature fG. The benefit of the context gate is twofold. First, like a residual module, it shortens the path from low to high layers. Second, it can dynamically control the ratio of contributions from the low- and high-grade context. A combined sketch of the context gate and the final decision is given at the end of this section.

We calculate the gate feature fZ from the combination of fL, fH and fG through

fZ = σ [W_C^2 × Concat (fL, fH, fG)] , σ = Sigmoid.    (4)

Then the source feature fV and target feature fT are obtained as the linear transformation of fG and the linear transformation of the concatenation of fL and fH, respectively:

fV = W_C^3 × fG ,  fT = W_C^1 × Concat (fL, fH) .    (5)

To control the amount of memory used, we only let fZ affect the source feature fV, essentially like the reset gate in a GRU that decides what information to forget from the previous state before passing information to the activation layer. The difference is that here the "reset" gate resets the source features rather than the previous state, i.e.,

fC = δ (fT + fZ · fV) , δ = Tanh,    (6)

where · denotes element-wise multiplication. In the final decision, we obtain the score s1 for each proposal candidate as

s1 = σ (WD × fC) .    (7)

4.5 Training Strategy

The binary cross entropy (BCE) loss is used in our model for both stages. The loss is averaged over all detection boxes of all object categories. Different from [22], we use L = −log(1 − s1) instead of L = −log(1 − s0 · s1), where s1 denotes the output score of our model and s0 denotes the prediction score of the proposal candidate from the detection network. Training with s0 may prevent our model from making the right prediction for proposal candidates mis-classified by the detection network, so s0 is not used in our training phase. s0 · s1 is used as the final prediction score at inference to make use of information from both the detection network and our model.

Our first stage takes the output of NMS as the ground truth to learn, and the second stage takes the output from stage I and learns to select the appropriate proposals according to the actual ground-truth objects. Specifically, proposals kept by NMS are assigned positive labels in stage I. In stage II, for each object, we first select proposals with intersection-over-union (IoU) higher than a threshold η; the proposal with the highest score in this set is assigned a positive label and the others are negatives. By default, we use η = 0.5 for most of our experiments. Considering the COCO evaluation criterion (mAP@0.5−0.95), we also adopt multiple thresholds simultaneously [22], i.e., η ∈ {0.5, 0.6, 0.7, 0.8, 0.9}. The classifier WD in Eq. 7 thus outputs multiple probabilities corresponding to the different IoU thresholds, resulting in multiple binary classification heads.
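A minimal sketch combining the context gate (Eqs. (4)-(6)) and the multi-head final decision (Eq. (7)), again assuming PyTorch; the module and dimension names are ours, and the five heads correspond to the IoU thresholds listed above:

import torch
import torch.nn as nn

class ContextGateDecision(nn.Module):
    # Context gate over (f_L, f_H, f_G) followed by per-threshold sigmoid heads.
    def __init__(self, dl, dm, num_heads=5):
        super().__init__()
        self.W_C1 = nn.Linear(dl + dm, dm)       # f_T from [f_L, f_H]
        self.W_C2 = nn.Linear(dl + 2 * dm, dm)   # gate f_Z from [f_L, f_H, f_G]
        self.W_C3 = nn.Linear(dm, dm)            # f_V from f_G
        self.W_D = nn.Linear(dm, num_heads)      # one binary head per IoU threshold

    def forward(self, f_L, f_H, f_G):
        f_Z = torch.sigmoid(self.W_C2(torch.cat([f_L, f_H, f_G], dim=-1)))  # Eq. (4)
        f_T = self.W_C1(torch.cat([f_L, f_H], dim=-1))                      # Eq. (5)
        f_V = self.W_C3(f_G)                                                # Eq. (5)
        f_C = torch.tanh(f_T + f_Z * f_V)                                   # Eq. (6)
        return torch.sigmoid(self.W_D(f_C))      # Eq. (7): (N, num_heads) scores s1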
During\ninference, the multiple probabilities are simply averaged as a single output.\nThere are two ways to train our two-stage framework. The \ufb01rst is sequential to train stages I and II\nconsecutively. The second method is to jointly update the weight of stage I during training stage\nII. Performance of our method in these two ways is comparable. We thus use sequential training\ngenerally.\n\n5 Experiments\n\nAll experiments are performed on challenging COCO detection datasets with 80 object categories [25].\n115k images are used for training [23, 22]. Ablation studies are conducted on the 5k validation images,\nfollowing common practice. We also report the performance on test-dev subset for comparison with\nother methods. The default evaluation metric \u2013 AP averaged on IoU thresholds from 0.5 to 0.95 on\nCOCO \u2013 is used.\nAs described in Section 3, we take FPN [23], Mask R-CNN [19] and PANet [27] with DCN [8] as\nthe baselines to show the generality of our method. These baselines are implemented by us with\ncomparable performance reported in respective papers.\nFor both stages in the framework, we adopt synchronized SGD as the optimizer and train our model\non a Titan X Maxwell GPU, with weight decay 0.0001 and momentum 0.9. The learning rate is 0.01\nin the \ufb01rst ten epochs and 0.001 in the last two epochs. dl and dm are by default 128 and 256.\nIn each training iteration, our network is with 0.45 million parameters. This overhead is small, about\n1% in terms of both model size and computation compared to 43.07 million parameters in FPN with\nResNet-50. It takes about 0.019s for the whole inference process with a single GPU, compared with\n0.07s by FPN with ResNet-50. Also, our computation cost is consistent even on larger backbone\nnetworks for object detection.\n\n6\n\n\f5.1 Stage I Performance\nIn stage I, we take the proposal candidates satisfying s0 \u2265 0.01 as input. The ground-truth labels\nare generated by NMS with IoU threshold 0.6, which produce decent results with NMS. To reduce\nimbalance between positive and negative samples, the weight of positive samples in our BCE loss is\nset to 4.\n\nModel\nFPN [23]\n\nMask R-CNN [19]\n\nPANet with DCN [27, 8]\n\nNMS RNN + Global Attention + Context Gate + Both\n37.2\n37.1\n39.1\n38.9\n43.7\n43.8\n\n35.1\n36.5\n41.5\n\n36.7\n38.6\n43.2\n\n34.3\n35.9\n40.6\n\nTable 2: Ablation study of network structures (+ indicates adding the module to the basic RNN).\n\nWe show the performance of our entire model and ablation study in Table 2. NMS is the ground-truth\nlabel for our network and RNN means using basic RNN module for encoder and decoder, which\nis the baseline. It is noticeable that using RNN module cannot produce reasonable results because\nsummarizing all proposal candidates only according to hidden states is dif\ufb01cult. With global attention\nand context gate, the performance ameliorates. The reason that our \ufb01nal model performs best is that\nglobal attention can capture the relation for all proposal candidates. Context gate makes our model\nmemorize the low-grade feature in high layers while the loss function is only based on the output\nscore of our network rather than the origin score from detection network.\n\n5.2 Stage II Evaluation\n\nWe take proposals selected by stage I with prediction score higher than 0.01 as input. 
Weight of\npositive samples for our BCE loss is set to 2.\n\nModel\nFPN [23]\n\nMask R-CNN [19]\n\nPANet with DCN [27, 8]\n\nNMS\n\n37.1\n38.9\n43.7\n\nBox\nVoting\n37.5\n39.3\n44.2\n\nSoft\nNMS\n37.8\n39.6\n44.3\n\nStage I\n\n37.2\n39.1\n43.8\n\nStage II\n(joint)\n38.1\n40.0\n44.4\n\nStage II\n\n(step-by-step)\n\n38.3\n40.2\n44.6\n\nTable 3: Comparison of our approach and other alternatives. For NMS and Soft-NMS, we both use\nthe best parameter 0.6. We include global attention and context gate in each stage of our approach.\nTwo training strategies are adopted respectively for comparison.\n\nThe performance of our model and prior solutions are compared in Table 3. With our full structure, the\nproposed method outperforms other popular duplicate removal solutions, including NMS, Soft-NMS\nand box voting.\nFor FPN and Mask R-CNN, our model trained with single head corresponding to IoU threshold (0.5)\nincreases more than one point and 0.9 point even for the strong baseline, PANet with DCN, which\ngenerates more discriminative proposals.\n\nNMS\n\n37.1\n\nours\nall\n38.3\n\nsequence order\n\nnone\n33.8\n\nrank fR\nnone\n37.6\n\nappearance fA\n\nnone\n35.4\n\nbox fG\nnone\n37.5\n\norigin score fS\n\nnone\n36.9\n\nTable 4: Ablation study of input features for our model (none indicates no such feature or out of order\nfor the sequence, all means all input features in a descending order are used).\n\nAblation studies on the source of features are performed. The results are shown in Table 4. The order\nof sequence and appearance feature fA play important roles. Rank feature fR, geometric feature fG\nand score feature fS help our model make prediction from more global view compared with NMS.\nWe analyze the importance of sample distribution. As shown in Table 5, we train stage II directly with\noutput from detection network. Compared with our full framework, the performance drops severely.\nThis manifests the necessity of conducting stage I to suppress easy negatives. We also take the result\nof NMS as input to stage II, however the mAP is slightly lower than using output of stage I. This\ncomparison also shows that our structure is compatible with the box voting method.\n\n7\n\n\fDetection Network\n\nModel\n\nNMS\nStage I\n\nFPN [23] Mask R-CNN [19]\n\nPANet [27] with DCN [8]\n\n33.8\n38.1\n38.3\n\n35.5\n40.0\n40.2\n\n40.1\n44.4\n44.6\n\nTable 5: Ablation study of the in\ufb02uence of input distribution on stage II. We directly take the output\nof detection network, NMS or stage I as the input to our second stage respectively.\n\nModel\n\nStage II (0.5)\nStage II (0.75)\n\nStage II (0.5 \u2212 0.1 \u2212 0.9)\nStage II (0.5 \u2212 0.05 \u2212 0.9)\n\nFPN [23] Mask R-CNN [19]\n\nPANet [27] with DCN [8]\n\n38.3\n38.4\n38.6\n38.6\n\n40.2\n40.3\n40.5\n40.6\n\n44.6\n44.5\n44.8\n44.8\n\nTable 6: Comparison of using different IoU thresholds in the second stage. Last two rows use multiple\nthresholds with different intervals such as 0.1 or 0.05.\n\nTable 6 compares the performance of utilizing different IoU thresholds when assigning the ground-\ntruth labels at stage II. With multi-heads trained on samples assigned with multiple thresholds, our\nmodel further improves the performance by 0.3, accomplishing a new state-of-the-art result.\nWe summarize our approach on val and test-dev subsets for different detection backbones trained\nwith multiple thresholds in Table 7. We achieve nearly 1.5 point improvement based on output from\nFPN and Mask R-CNN. 
With stronger baseline PANet with DCN, we also surpass the traditional\nNMS by 1.1 points. It is noted that our model get larger improvement in mAP75 than that in mAP50,\nmanifesting that our model makes good use of quality of proposals. The improvement statistics on\nCOCO test-dev is similar.\n\nbackbone\nFPN [23]\n\nMask R-CNN [19]\n\nPANet with DCN[27, 8]\n\ntest set\n\nval\n\ntestdev\n\ntestdev\n\nval\n\nval\n\ntestdev\n\nmAP\n\n37.1\u219237.2\u219238.6\n36.9\u219237.0\u219238.4\n38.9\u219239.1\u219240.6\n39.1\u219239.2\u219240.6\n43.7\u219243.8\u219244.8\n43.4\u219243.4\u219244.4\n\nmAP50\n\n59.0\u219258.6\u219259.6\n58.4\u219258.1\u219259.0\n59.7\u219259.4\u219260.6\n59.6\u219259.3\u219260.1\n63.4\u219262.8\u219263.4\n63.0\u219262.1\u219262.4\n\nmAP75\n\n39.8\u219240.3\u219242.3\n39.8\u219240.3\u219242.3\n42.4\u219243.1\u219244.9\n42.7\u219243.2\u219245.1\n47.9\u219248.5\u219249.6\n47.7\u219248.1\u219249.5\n\nTable 7: Improvement from NMS to stage I and II (connected by \u2192 from left to right) based on\ndifferent stage-of-the-art object detection systems on COCO2017 val and test-dev.\n\nFig. 3 shows that our approach reduces proposal candidates and increases performance at the same\ntime. With output from FPN, Soft NMS keeps about 320.98 proposals in one image on average,\nwhile our approach only produces 84.02 proposals compared with 145.76 from NMS using the same\nscore threshold 0.01.\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 3: Visualization of ground truth (a), NMS (b), stages I (c) and II (d) of our approach.\n\n6 Conclusion\n\nWe have presented a new approach for duplicate removal that is important in object detection.\nWe applied RNN with global attention and context gate structure to sequentially encode context\ninformation existing in all object proposals. The decoder selects appropriate proposals as \ufb01nal output.\nExtensive experiments and ablation studies were conducted and the consistent improvement manifests\nthe effectiveness of our approach. We plan to connect our framework to object detection networks to\nenable joint training for even better performance in future work.\n\n8\n\n\fReferences\n\n[1] T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, and S. Savarese. Social scene understanding:\nEnd-to-end multi-person action localization and collective activity recognition. In CVPR, 2017.\n[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align\n\n[3] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-nms - improving object detection with\n\n[4] X. Chen and A. Gupta. Spatial memory for context reasoning in object detection. In ICCV,\n\nand translate. arXiv:1409.0473, 2014.\n\none line of code. In CVPR, 2017.\n\n2017.\n\n[5] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio. End-to-end continuous speech recognition\n\nusing attention-based recurrent nn: First results. arXiv:1412.1602, 2014.\n\n[6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural\n\nnetworks on sequence modeling. arXiv:1412.3555, 2016.\n\n[7] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: object detection via region-based fully convolutional\n\n[8] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks.\n\nnetworks. In NIPS, 2016.\n\nIn ICCV, 2017.\n\n[9] Z. Deng, A. Vahdat, H. Hu, and G. Mori. Structure inference machines: Recurrent neural\n\nnetworks for analyzing relations in group activity recognition. In CVPR, 2016.\n\n[10] C. Desai, D. Ramanan, and C. C. Fowlkes. 
Discriminative models for multi-class object layout. International Journal of Computer Vision, 2011.

[11] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.

[12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.

[13] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.

[14] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In CVPR, 2015.

[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[16] R. B. Girshick. Fast R-CNN. In ICCV, 2015.

[17] K. Goel, R. Vohra, and J. Sahoo. Polyphonic music generation by modeling temporal dependencies using a rnn-dbn. In ICANN, 2014.

[18] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.

[19] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. In ICCV, 2017.

[20] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.

[21] J. H. Hosang, R. Benenson, and B. Schiele. Learning non-maximum suppression. In CVPR, 2017.

[22] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In CVPR, 2018.

[23] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

[24] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.

[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[26] S. Liu, C. Lu, and J. Jia. Box aggregation for proposal decimation: Last mile of object detection. In CVPR, 2015.

[27] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018.

[28] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.

[29] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. arXiv:1508.04025, 2015.

[30] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, 2014.

[31] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.

[32] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, 2015.

[33] Z. Tu, Y. Liu, Z. Lu, X. Liu, and H. Li. Context gates for neural machine translation. arXiv:1608.06043, 2016.

[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.

[35] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features.
In CVPR, 2001.", "award": [], "sourceid": 1023, "authors": [{"given_name": "Lu", "family_name": "Qi", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Shu", "family_name": "Liu", "institution": "Chinese University of Hong Kong"}, {"given_name": "Jianping", "family_name": "Shi", "institution": "Sensetime Group Limited"}, {"given_name": "Jiaya", "family_name": "Jia", "institution": "CUHK"}]}