{"title": "MetaAnchor: Learning to Detect Objects with Customized Anchors", "book": "Advances in Neural Information Processing Systems", "page_first": 320, "page_last": 330, "abstract": "We propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks. Unlike many previous detectors, which model anchors in a predefined manner, in MetaAnchor anchor functions can be dynamically generated from arbitrary customized prior boxes. Taking advantage of weight prediction, MetaAnchor is able to work with most anchor-based object detection systems such as RetinaNet. Compared with the predefined anchor scheme, we empirically find that MetaAnchor is more robust to anchor settings and bounding box distributions; in addition, it also shows potential on transfer tasks. Our experiment on the COCO detection task shows that MetaAnchor consistently outperforms the counterparts in various scenarios.", "full_text": "MetaAnchor: Learning to Detect Objects with Customized Anchors\n\nTong Yang\u2217\u2020\n\nXiangyu Zhang\u2217\n\nZeming Li\u2217 Wenqiang Zhang\u2020\n\nJian Sun\u2217\n\n{yangtong,zhangxiangyu,lizeming,sunjian}@megvii.com\n\n\u2217Megvii Inc (Face++)\n\n\u2020Fudan University\n\nwqzhang@fudan.edu.cn\n\nAbstract\n\nWe propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks. Unlike many previous detectors, which model anchors in a predefined manner, in MetaAnchor anchor functions can be dynamically generated from arbitrary customized prior boxes. Taking advantage of weight prediction, MetaAnchor is able to work with most anchor-based object detection systems such as RetinaNet. Compared with the predefined anchor scheme, we empirically find that MetaAnchor is more robust to anchor settings and bounding box distributions; in addition, it also shows potential on transfer tasks. 
Our experiment on the COCO detection task shows that MetaAnchor consistently outperforms the counterparts in various scenarios.\n\n1 Introduction\n\nThe last few years have seen the success of deep neural networks in the object detection task [5, 39, 9, 12, 8, 32, 16, 2]. In practice, object detection often requires generating a set of bounding boxes along with their classification labels associated with each object in the given image. However, it is nontrivial for convolutional neural networks (CNNs) to directly predict an orderless set of arbitrary cardinality1. One widely-used workaround is to introduce anchors, which employ the idea of divide-and-conquer and have been successfully demonstrated in state-of-the-art detection frameworks [39, 32, 25, 30, 31, 11, 22, 23, 2]. In short, the anchor method suggests dividing the box space (including position, size, class, etc.) into discrete bins (not necessarily disjoint) and generating each object box via the anchor function defined in the corresponding bin. Denote x as the feature extracted from the input image; then the anchor function for the i-th bin can be formulated as follows:\n\nFbi(x; \u03b8i) = (F^cls_bi(x; \u03b8^cls_i), F^reg_bi(x; \u03b8^reg_i))    (1)\n\nwhere bi \u2208 B is the prior (also named anchor box in [32]), which describes the common properties of object boxes associated with the i-th bin (e.g. averaged position/size and classification label); F^cls_bi(\u00b7) discriminates whether there exists an object box associated with the i-th bin, and F^reg_bi(\u00b7) regresses the relative location of the object box (if any) to the prior bi; \u03b8i represents the parameters of the anchor function.\n\nTo model anchors with deep neural networks, one straightforward strategy is via enumeration, which is adopted by most of the previous work [32, 39, 25, 30, 31, 23, 11, 22]. 
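As a concrete reading of Equ. 1, each bin's anchor function pairs a classification score with a box regression. Below is a toy numpy sketch; the linear heads, the sigmoid, and the feature size are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def anchor_function(x, theta_cls, theta_reg):
    """Toy version of Equ. 1: F_bi(x; theta_i) = (F_cls(x), F_reg(x)).

    For one anchor bin b_i, F_cls scores whether an object box is
    associated with the bin, and F_reg predicts the box location
    relative to the prior b_i. Linear heads stand in for the
    convolutional layers of a real detector."""
    cls_score = 1.0 / (1.0 + np.exp(-(theta_cls @ x)))  # objectness probability
    reg_delta = theta_reg @ x                           # 4 relative box offsets
    return cls_score, reg_delta

rng = np.random.default_rng(0)
x = rng.standard_normal(256)          # a 256-d feature vector (illustrative)
cls, reg = anchor_function(x,
                           rng.standard_normal(256),
                           rng.standard_normal((4, 256)))
```

In a predefined-anchor detector, one such pair of heads exists per bin in B; MetaAnchor instead generates their parameters on demand.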
First, a number of predefined priors (or anchor boxes) B is chosen by hand [32] or by statistical methods like clustering [39, 31]. Then for each bi \u2208 B the anchor function Fbi is usually implemented by one or a few neural network layers, and the weights of different anchor functions are independent or partially shared. Obviously, in this framework the anchor strategies (i.e. the anchor box choices and the definitions of the corresponding anchor functions) are fixed in both training and inference. In addition, the number of available anchors is limited by the predefined B.\n\n1There are a few recent studies on the topic, such as [33, 37].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper, we propose a flexible alternative to model anchors: instead of enumerating every possible bounding box prior bi and modeling the corresponding anchor functions respectively, in our framework anchor functions are dynamically generated from bi. This is done by introducing a novel MetaAnchor module, which is defined as follows:\n\nFbi = G(bi; w)    (2)\n\nwhere G(\u00b7) is called the anchor function generator, which maps any bounding box prior bi to the corresponding anchor function Fbi; and w represents its parameters. Note that in MetaAnchor the prior set B is not necessarily predefined; instead, it works in a customized manner \u2013 during inference, users can specify any anchor boxes, generate the corresponding anchor functions and use the latter to predict object boxes. In Sec. 3, we show that with the weight prediction mechanism [10, 18] the anchor function generator can be elegantly implemented and embedded into existing object detection frameworks for joint optimization.\n\nIn conclusion, compared with traditional predefined anchor strategies, we find our proposed MetaAnchor has the following potential benefits (detailed experiments are presented in Sec. 
4):\n\n\u2022 MetaAnchor is more robust to anchor settings and bounding box distributions. In traditional approaches, the predefined anchor box set B often needs careful design \u2013 too few anchors may be insufficient to cover rare boxes or may result in coarse predictions, while more anchors usually imply more parameters, which may suffer from overfitting. In addition, many traditional strategies use independent weights to model different anchor functions, so anchors associated with few ground truth object boxes in training are very likely to produce poor results. In contrast, in MetaAnchor anchor boxes of any shape can be randomly sampled during training so as to cover different kinds of object boxes, while the number of parameters stays constant. Furthermore, according to Equ. 2 different anchor functions are generated from the same weights w, so all the training data contribute to all the model parameters, which implies more robustness to the distribution of the training boxes.\n\n\u2022 MetaAnchor helps to bridge the bounding box distribution gap between datasets. In the traditional framework, anchor boxes B are predefined and kept unchanged for both training and test, which could be suboptimal for either dataset if their bounding box distributions differ. In MetaAnchor, anchors can be flexibly customized to adapt to the target dataset (for example, via grid search) without retraining the whole detector.\n\n2 Related Work\n\nAnchor methodology in object detection. Anchors (sometimes under other names, e.g. \u201cdefault boxes\u201d in [25], \u201cpriors\u201d in [39] or \u201cgrid cells\u201d in [30]) are employed in most state-of-the-art detection systems [39, 32, 22, 23, 25, 7, 11, 2, 31, 21, 35, 15]. The essentials of an anchor include position, size, class label and others. Currently most detectors model anchors via enumeration, i.e. 
predefining a number of anchor boxes with all kinds of positions, sizes and class labels, which leads to the following issues. First, anchor boxes need careful design, e.g. via clustering [31], which is especially critical in specific detection tasks such as anchor-based face [40, 45, 28, 36, 43] and pedestrian [41, 3, 44, 26] detection. In particular, some papers suggest multi-scale anchors [25, 22, 23] to handle different sizes of objects. Second, predefined anchor functions may cause too many parameters. A lot of work addresses this issue by weight sharing. For example, in contrast to earlier work like [5, 30], detectors like [32, 25, 31] and their follow-ups [7, 22, 2, 11, 23] employ translation-invariant anchors produced by fully-convolutional networks, which share parameters across different positions. Two-stage frameworks such as [32, 2] share weights across various classes. And [23] shares weights among multiple detection heads. In comparison, our approach is free of these issues, as anchor functions are customized and generated dynamically.\n\nWeight prediction. Weight prediction refers to a mechanism in neural networks in which weights are predicted by another structure rather than directly learned; it is mainly used in the fields of learning to learn [10, 1, 42], few/zero-shot learning [4, 42] and transfer learning [27]. For object detection there are a few related works; for example, [15] proposes to predict mask weights from box weights. 
There are mainly two differences from ours: first, in MetaAnchor the purpose of weight prediction is to generate anchor functions, while in [15] it is used for domain adaptation (from object box to segmentation mask); second, in our work weights are generated almost \u201cfrom scratch\u201d, while in [15] the source is the learned box weights.\n\n3 Approach\n\n3.1 Anchor Function Generator\n\nIn the MetaAnchor framework, the anchor function is dynamically generated from the customized box prior (or anchor box) bi rather than being a fixed function associated with a predefined anchor box. So the anchor function generator G(\u00b7) (see Equ. 2), which maps bi to the corresponding anchor function Fbi, plays a key role in the framework. In order to model G(\u00b7) with a neural network, inspired by [15, 10], we first assume that for different bi the anchor functions Fbi share the same formulation F(\u00b7) but have different parameters, which means:\n\nFbi(x; \u03b8i) = F(x; \u03b8bi)    (3)\n\nThen, since each anchor function is distinguished only by its parameters \u03b8bi, the anchor function generator can be formulated to predict \u03b8bi as follows:\n\n\u03b8bi = G(bi; w) = \u03b8\u2217 + R(bi; w)    (4)\n\nwhere \u03b8\u2217 stands for the shared parameters (independent of bi and also learnable), and the residual term R(bi; w) depends on the anchor box bi. In the paper we implement R(\u00b7) with a simple two-layer network:\n\nR(bi; w) = W2\u03c3(W1bi)    (5)\n\nHere, W1 and W2 are the learnable parameters and \u03c3(\u00b7) is the activation function (ReLU in our work). Denote the number of hidden neurons by m. In practice m is usually much smaller than the dimension of \u03b8bi, which causes the weights predicted by R(\u00b7) to lie in a significantly low-rank subspace. That is why we formulate G(\u00b7) in the residual form of Equ. 4 rather than using R(\u00b7) directly. 
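A minimal numpy sketch of this data-independent generator (Equ. 4 and 5) follows. The dimensions, the random initialization, and the `make_generator` name are illustrative assumptions; in the paper the whole module is trained jointly with the detector:

```python
import numpy as np

def make_generator(dim_b=2, m=128, dim_theta=9216, seed=0):
    """Build G(b; w) = theta* + W2 @ relu(W1 @ b)   (Equ. 4 and 5).

    dim_b:     size of the anchor box prior b_i, e.g. the 2-d encoding
               (log(ah/AH), log(aw/AW)) of Equ. 7.
    m:         number of hidden neurons (128 in the paper).
    dim_theta: number of parameters of one anchor function, e.g. the
               flattened weights of a small conv head (illustrative)."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((m, dim_b)) * 0.01          # prior -> hidden
    W2 = rng.standard_normal((dim_theta, m)) * 0.01      # hidden -> residual weights
    theta_star = rng.standard_normal(dim_theta) * 0.01   # shared parameters theta*

    def G(b):
        # Residual term R(b; w) = W2 @ relu(W1 @ b); since m << dim_theta,
        # R alone is low-rank, hence the residual form around theta*.
        return theta_star + W2 @ np.maximum(W1 @ b, 0.0)

    return G

G = make_generator()
theta = G(np.array([0.3, -0.2]))   # weights for one customized anchor box
```

Any customized prior thus yields a full set of anchor-function weights from the same w, which is what lets all training boxes update all parameters.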
We also surveyed more complex designs for G(\u00b7); however, they result in comparable benchmarking results. In addition, we introduce a data-dependent variant of the anchor function generator, which takes the input feature x into the formulation:\n\n\u03b8bi = G(bi; x, w) = \u03b8\u2217 + W2\u03c3(W11bi + W12r(x))    (6)\n\nwhere r(\u00b7) is used to reduce the dimension of the feature x; we empirically find that for a convolutional feature x, using a global average pooling [13, 38] operation for r(\u00b7) usually produces good results.\n\n3.2 Architecture Details\n\nTheoretically MetaAnchor could work with most existing anchor-based object detection frameworks [32, 25, 30, 31, 23, 11, 22, 19, 20, 2]. Among them, the two-stage detectors [32, 2, 22, 11, 19] usually use anchors to model \u201cobjectness\u201d and generate box proposals, while fine results are predicted by RCNN-like modules [9, 8] in the second stage. We tried MetaAnchor in these frameworks and observed some improvements on the box proposals (e.g. improved recalls); however, it brings little benefit to the final predictions, whose quality we believe is mainly determined by the second stage. Therefore, in this paper we mainly study the case of single-stage detectors [30, 25, 31, 23].\n\nWe choose the state-of-the-art single-stage detector RetinaNet [23] as an instance to apply MetaAnchor. Note that our methodology is also applicable to other single-stage frameworks such as [31, 25, 7, 35]. Fig 1(a) gives an overview of RetinaNet. In short, 5 levels of features {Pl | l \u2208 {3, 4, 5, 6, 7}} are extracted from a \u201cU-shaped\u201d backbone network, where P3 stands for the finest feature map (i.e. with the largest resolution) and P7 the coarsest. For each level of features, a subnet named \u201cdetection head\u201d in Fig 1 is attached to generate detection results. 
Figure 1: Illustration of applying MetaAnchor to RetinaNet [23]. (a) RetinaNet overview. (b) Detection heads in RetinaNet equipped with MetaAnchor. Fcls(\u00b7) and Freg(\u00b7) compose the anchor function (defined in Equ. 1), each implemented by a convolutional layer here. G(\u00b7, wcls) and G(\u00b7, wreg) are the anchor function generators defined in Equ. 4 (or Equ. 6). bi is the customized box prior (also called anchor box); \u201ccls\u201d and \u201creg\u201d represent the prediction results associated with bi.\n\nAnchor functions are defined at the tail of each detection head. Referring to the settings in [23], anchor functions are implemented by a 3 \u00d7 3 convolutional layer; and for each detection head, 3 \u00d7 3 \u00d7 80 types of anchor boxes (3 scales, 3 aspect ratios and 80 classes) are predefined. Thus for each anchor function, there are 720 filters for the classification term and 36 filters for the regression term (3 \u00d7 3 \u00d7 4, as the regression term is class-agnostic).\n\nIn order to apply MetaAnchor, we need to redesign the original anchor functions so that their parameters are generated from the customized anchor box bi. First of all, we consider how to encode bi. According to the definition in Sec. 1, bi should be a vector which includes information such as position, size and class label. In RetinaNet, thanks to the fully-convolutional structure, position can be naturally encoded by the coordinates of the feature maps and thus need not be involved in bi. As for the class label, there are two alternatives: A) directly encode it in bi, or B) let G(\u00b7) predict weights for each class respectively. We empirically find that Option B is easier to optimize and usually results in better performance than Option A. So, in our experiment bi is mainly related to anchor size. 
Motivated by the bounding box encoding method introduced in [9, 32], bi is represented as follows:\n\nbi = (log(ahi/AH), log(awi/AW))    (7)\n\nwhere ahi and awi are the height and width of the corresponding anchor box, and (AH, AW) is the size of a \u201cstandard anchor box\u201d, which is used as a normalization term. We also surveyed a few other alternatives, for example, using scale and aspect ratio to represent the size of anchor boxes, which yields results comparable to those of Equ. 7.\n\nFig 1(b) illustrates the usage of MetaAnchor in each detection head of RetinaNet. In the original design [23], the classification and box regression parts of the anchor functions are attached to separate feature maps (xcls and xreg); so in MetaAnchor, we also use two independent anchor function generators, G(\u00b7, wcls) and G(\u00b7, wreg), to predict their weights respectively. The design of G(\u00b7) follows Equ. 4 (data-independent variant) or Equ. 6 (data-dependent variant), in which the number of hidden neurons m is set to 128. In addition, recall that in MetaAnchor anchor functions are dynamically derived from bi rather than predefined by enumeration; so the number of filters for Fcls(\u00b7) reduces to 80 (for 80 classes) and to 4 for Freg(\u00b7).\n\nIt is also worth noting that in RetinaNet [23] corresponding layers in all levels of detection heads share the same weights, even including the last layers which stand for anchor functions. However, the definitions of anchors differ from level to level: for example, suppose in the l-th level an anchor function is associated with an anchor box of size (ah, aw); then in the (l + 1)-th level (with 50% smaller resolution), the same anchor function should detect with a 2x larger anchor box, i.e. (2ah, 2aw). 
So, in order to keep consistent with the original design, in MetaAnchor we use the same anchor function generators G(\u00b7, wcls) and G(\u00b7, wreg) for every level of detection head, while the \u201cstandard boxes\u201d (AH, AW) in Equ. 7 differ between levels: suppose the standard box size in the l-th level is (AHl, AWl); then for the (l + 1)-th level we set (AHl+1, AWl+1) = (2AHl, 2AWl). In our experiment, the size of the standard box in the lowest level (i.e. P3, which has the largest resolution) is set to the average of all the anchor box sizes (shown in the last column of Table 1).\n\nTable 1: Anchor box configurations\n\n# of Anchors | Aspect Ratios | Scales2 | (AH, AW)\n3 \u00d7 3 | {1/2, 1, 2} | {2^{k/3} | k < 3} | (44, 44)\n5 \u00d7 5 | {1/3, 1/2, 1, 2, 3} | {2^{k/5} | k < 5} | (45, 47)\n7 \u00d7 7 | {1/4, 1/3, 1/2, 1, 2, 3, 4} | {2^{k/7} | k < 7} | (48, 50)\n9 \u00d7 9 | {1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5} | {2^{k/9} | k < 9} | (53, 53)\n\n4 Experiment\n\nIn this section we mainly evaluate our proposed MetaAnchor on the COCO object detection task [24]. The basic detection framework is RetinaNet [23] as introduced in Sec. 3.2; the backbone feature extractor is ResNet-50 [13] pretrained on the ImageNet classification dataset [34]. For MetaAnchor, we use the data-independent variant of the anchor function generator (Equ. 4) unless specially mentioned. MetaAnchor subnets are jointly optimized with the backbone detector during training. We do not use Batch Normalization [17] in MetaAnchor.\n\nDataset. Following the common practice [23] in the COCO detection task, for training we use two different dataset splits: COCO-all and COCO-mini; for test, all results are evaluated on the minival set, which contains 5000 images. 
COCO-all includes all the images in the original training and validation sets excluding minival images, while COCO-mini is a subset of around 20000 images. Results are mainly evaluated with COCO standard metrics such as mmAP.\n\nTraining and evaluation configurations. For fair comparison, we follow most of the settings in [23] (image size, learning rate, etc.) for all the experiments, except for a few differences as follows. In [23], 3 \u00d7 3 anchor boxes (i.e. 3 scales and 3 aspect ratios) are predefined for each level of detection head. In this paper, more anchor boxes are employed in some experiments. Table 1 lists the anchor box configurations for feature level P3, where the 3 \u00d7 3 case is identical to that in [23]. Settings for other feature levels can also be derived (see Sec. 3.2). As for MetaAnchor, since predefined anchors are not needed, we suggest the following strategy. In training, we first select an anchor box configuration from Table 1 (e.g. 5 \u00d7 5) and generate the 25 bis according to Equ. 7; in each iteration, we randomly augment each bi within \u00b10.5, calculate the corresponding ground truth, and use it to optimize. We call this methodology \u201ctraining with 5 \u00d7 5 anchors\u201d. In test, the bis are also set by a certain anchor box configuration without augmentation (not necessarily the same as used in training). We argue that with this training/inference scheme, it is possible to make direct comparisons between MetaAnchor and the counterpart baselines.\n\nIn the following subsections, first we study the performance of MetaAnchor in a series of controlled experiments on COCO-mini. 
Then we report the fully-equipped results on the COCO-full dataset.\n\n4.1 Ablation Study\n\n4.1.1 Comparison with RetinaNet baselines\n\nTable 2 compares the performance of MetaAnchor and the RetinaNet baseline on the COCO-mini dataset. Here we use the same anchor box settings for training and test. In the column \u201cThreshold\u201d, t1/t2 denotes the intersection-over-union (IoU) thresholds for positive/negative anchor boxes in training (the detailed definitions are introduced in [32, 23]).\n\nTable 2: Comparison of RetinaNets with/without MetaAnchor.\n\nThreshold | # of Anchors | Baseline mmAP/AP50/AP75 (%) | MetaAnchor mmAP/AP50/AP75 (%)\n0.5/0.4 | 3 \u00d7 3 | 26.5 / 43.1 / 27.6 | 26.9 / 44.2 / 28.2\n0.5/0.4 | 5 \u00d7 5 | 26.9 / 43.7 / 28.1 | 27.1 / 44.5 / 28.1\n0.5/0.4 | 7 \u00d7 7 | 26.4 / 43.0 / 27.7 | 27.2 / 44.4 / 28.5\n0.5/0.4 | 9 \u00d7 9 | 26.3 / 42.8 / 27.5 | 27.1 / 44.3 / 28.2\n0.6/0.5 | 3 \u00d7 3 | 25.7 / 41.1 / 27.3 | 26.0 / 42.0 / 27.2\n0.6/0.5 | 5 \u00d7 5 | 26.1 / 41.4 / 27.8 | 27.3 / 44.2 / 28.8\n0.6/0.5 | 7 \u00d7 7 | 26.2 / 41.3 / 27.9 | 27.0 / 43.1 / 28.3\n0.6/0.5 | 9 \u00d7 9 | 26.1 / 41.0 / 27.9 | 27.4 / 43.7 / 29.2\n\nTable 3: Comparison of various anchors in inference (mmAP, %)\n\nTraining \\ Inference | 3 \u00d7 3 | 5 \u00d7 5 | 7 \u00d7 7 | 9 \u00d7 9 | search\n3 \u00d7 3 | 26.0 | 26.6 | 26.8 | 26.7 | 27.0\n5 \u00d7 5 | 26.7 | 27.3 | 27.5 | 27.5 | 27.7\n7 \u00d7 7 | 26.1 | 26.9 | 27.0 | 27.1 | 27.3\n9 \u00d7 9 | 26.3 | 27.2 | 27.4 | 27.4 | 27.6\n\n2Here we follow the same definition of scale and aspect ratio as in [23].\n\nTo analyze, first we compare the rows with the threshold of 0.5/0.4. It is clear that MetaAnchor outperforms the counterpart baselines on every anchor configuration and evaluation metric, for instance, a 0.2 \u223c 0.8% increase in mmAP and 0.8 \u223c 1.5% in AP50. We suppose the improvements may come from two aspects: first, in MetaAnchor the sizes of anchor boxes could be augmented so as to make the anchor functions 
generate a wider range of predictions, which may enhance the model capability (especially important for the cases with smaller numbers of anchors, e.g. 3 \u00d7 3); second, rather than predefining anchor functions with independent parameters, MetaAnchor allows all the training boxes to contribute to the shared generators, which seems beneficial to robustness over different configurations or object box distributions.\n\nFor further investigation, we tried a stricter IoU threshold (0.6/0.5) for training to encourage more precise anchor box association; however, statistically there are then fewer chances for each anchor to be assigned a positive ground truth. Results are also presented in Table 2. We find the results of all the baseline models suffer from significant drops, especially on AP50, which implies the degradation of the anchor functions; furthermore, simply increasing the number of anchors does little for performance. For MetaAnchor, in contrast, 3 out of 4 configurations are less affected (for the case of 9 \u00d7 9 anchors a 0.3% mmAP improvement is even obtained). The only exception is the 3 \u00d7 3 case; however, according to Table 3 we believe the degradation is mainly because of too few anchor boxes for inference rather than poor training. So, the comparison supports our hypothesis: MetaAnchor helps to use training samples in a more efficient and robust way.\n\n4.1.2 Comparison of various anchor configurations in inference\n\nUnlike the traditional fixed or predefined anchor strategy, one of the major benefits of MetaAnchor is the ability to use flexible anchor schemes at inference time. Table 3 compares a variety of anchor box configurations for inference (refer to Table 1; note that the normalization coefficient (AH, AW) should be consistent with that used in training) along with their scores on COCO-mini. For each experiment, the IoU threshold in training is set to 0.6/0.5. 
From the results we find that more anchor boxes in inference usually produce higher performance; for instance, results with 9 \u00d7 9 inference anchors are 0.7 \u223c 1.1% better than those with 3 \u00d7 3 for a variety of training configurations.\n\nTable 3 also implies that the improvements quickly saturate as anchor boxes increase, e.g. \u2265 7 \u00d7 7 anchors only bring minor improvements, which is also observed in Table 2. We revisit the anchor configurations in Table 1 and find the 7 \u00d7 7 and 9 \u00d7 9 cases tend to involve too \u201cdense\u201d anchor boxes, thus predicting highly overlapped results which might contribute little to the final performance. Inspired by this phenomenon, we come up with an inference approach via greedy search: at each step we randomly select one anchor box bi, generate the predictions and evaluate them combined with the results of the previous step (performed on a subset of training data); if the score improves, we update the current predictions with the combined results, otherwise we discard the predictions of the current step. The final anchor configuration is obtained after a few steps. 
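The greedy procedure above can be sketched as follows. `evaluate` stands in for scoring the combined predictions on a subset of training data, and the toy scoring function at the bottom is purely illustrative:

```python
import random

def greedy_anchor_search(candidates, evaluate, steps=20, seed=0):
    """Greedily grow an anchor-box configuration (sketch of Sec. 4.1.2).

    candidates: pool of (height, width) anchor priors b_i to sample from.
    evaluate:   callback mapping a list of anchors to a detection score
                (in the paper, evaluated on a subset of training data).
    Each step tentatively adds one random anchor and keeps it only if
    the combined score improves; otherwise the anchor is discarded."""
    rng = random.Random(seed)
    chosen, best = [], float("-inf")
    for _ in range(steps):
        trial = chosen + [rng.choice(candidates)]
        score = evaluate(trial)
        if score > best:
            chosen, best = trial, score
    return chosen, best

# Toy usage: reward covering target aspect ratios, penalize extra anchors.
TARGET_RATIOS = (0.5, 1.0, 2.0)

def toy_eval(anchors):
    ratios = {round(h / w, 2) for h, w in anchors}
    return sum(t in ratios for t in TARGET_RATIOS) - 0.01 * len(anchors)

pool = [(32, 64), (32, 32), (64, 32), (48, 48)]
config, score = greedy_anchor_search(pool, toy_eval)
```

Because MetaAnchor generates the anchor function for any candidate on the fly, this search needs no retraining of the detector.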
Improved results are shown in the last column (named \u201csearch\u201d) of Table 3.\n\nTable 4: Comparison in the scenarios of different training/test distributions (mmAP, %)\n\n# of Anchors | Baseline (all) | MetaAnchor (all) | Baseline (drop) | MetaAnchor (drop)\n3 \u00d7 3 | 26.5 | 26.9 | 21.2 | 22.2\n5 \u00d7 5 | 26.9 | 27.1 | 20.8 | 23.0\n7 \u00d7 7 | 26.4 | 27.2 | 21.8 | 22.8\n9 \u00d7 9 | 26.3 | 27.1 | 20.8 | 22.8\n\nTable 5: Transfer evaluation on VOC 2007 test set from COCO-full dataset\n\nMethod | Baseline | MetaAnchor | Search\nmAP@0.5 (%) | 82.5 | 83.1 | 83.3\n\n4.1.3 Cross evaluation between datasets of different distributions\n\nThough domain adaptation or transfer learning [29] is outside the design purpose of MetaAnchor, the technique of weight prediction [10], which is also employed in this paper, has recently been successfully applied to those tasks [15, 14]. So it is interesting to evaluate whether MetaAnchor is able to bridge the distribution gap between two datasets. More specifically, what is the performance if the detection model is trained on another dataset which has the same class labels but a different distribution of object box sizes?\n\nWe perform the experiment on COCO-mini, in which we \u201cdrop\u201d some boxes in the training set. Since it is nontrivial to directly erase the objects in an image, instead, during training, whenever we use a ground truth box which falls in a certain range (in our experiment the range is {(h, w) | 50 < \u221a(hw) < 100, \u22121 < log(w/h) < 1}, around 1/6 of all boxes), we manually assign the corresponding loss to 0. For test, we use all the data in the validation set. Therefore, the distributions of the boxes used in training and test are very different. Table 4 shows the evaluation results. 
Obviously, after some ground truth boxes are erased, all the scores drop significantly; however, compared with the RetinaNet baseline, MetaAnchor suffers smaller degradations and generates much better predictions, which shows its potential on transfer tasks.\n\nIn addition, we train models only on the COCO-full dataset and evaluate the transfer performance on the VOC 2007 test set [6]. We use two models, Baseline (RetinaNet) and MetaAnchor, each achieving the best performance on the COCO-full dataset for its architecture. In this experiment, MetaAnchor with greedy search achieves 83.3% mAP on the VOC 2007 test set, a 0.8% improvement over the Baseline and 0.2% better than MetaAnchor without search, as shown in Table 5. Therefore, MetaAnchor shows a better transfer ability than the RetinaNet baseline on this task. Note that the result is evaluated without the sofa class, because there is no sofa annotation in COCO.\n\n4.1.4 Data-independent vs. data-dependent anchor function generators\n\nIn Sec. 3.2 we introduced two variants of anchor function generators: data-independent (Equ. 4) and data-dependent (Equ. 6). In the above subsections we mainly evaluated the data-independent one. Table 6 compares the performance of the two alternatives. For simplicity, we use the same training and test anchor configurations; the IoU threshold is 0.6/0.5. The results show that in most cases the data-dependent variant is slightly better; however, the difference is small. We also report the scores after anchor configuration search (described in Sec. 4.1.2).\n\n4.2 Results on COCO Object Detection\n\nFinally, we compare our fully-equipped MetaAnchor models with the RetinaNet [23] baselines on the COCO-full dataset (also called trainval35k in [23]). As mentioned at the beginning of Sec. 4, we follow the same evaluation protocol as [23]. The input resolution is 600\u00d7 in both training and test. The backbone feature extractor is ResNet-50 [13]. 
Performances are benchmarked with the COCO standard mmAP on the minival dataset.\n\nTable 6: Comparison of anchor function generators (mmAP, %)\n\n# of Anchors | Data-independent | Data-dependent\n3 \u00d7 3 | 26.0 | 26.5\n5 \u00d7 5 | 27.3 | 27.3\n7 \u00d7 7 | 27.0 | 27.4\n9 \u00d7 9 | 27.4 | 27.3\nsearch3 | 27.6 | 28.0\n\n3Based on the models with the 7 \u00d7 7 anchor configuration in training.\n\nTable 7: Results of YOLOv2 on COCO minival (%)\n\nMethod | Baseline | MetaAnchor | Search\nmmAP | 18.9 | 21.2 | 21.2\nmAP@0.5 | 35.2 | 39.4 | 39.5\n\nTable 8 lists the results. Interestingly, our reimplemented RetinaNet model is 1.8% better than the counterpart reported in [23]. For better understanding, we further investigated many anchor box configurations (including those in Table 1) and retrained the baseline model; the best of them is named \u201cRetinaNet\u2217\u201d and marked with \u201csearch\u201d in Table 8. In comparison, our MetaAnchor model achieves 37.5% mmAP on COCO minival, which is 1.7% better than the original RetinaNet (our implementation) and 0.6% better than the best searched entry of RetinaNet. Our data-dependent variant (Equ. 6) further boosts the performance by 0.4%. In addition, we argue that for MetaAnchor the configuration for inference can be easily obtained by the greedy search introduced in Sec. 4.1.2 without retraining. Specifically, the scales and aspect ratios of the greedy search anchors are {2^{k/5} | \u22122 < k < 6} and {1/3, 1/t, 1, t, 3 | t = 1.1, 1.2, ..., 2} respectively. Fig 2 visualizes some detection results predicted by MetaAnchor. It is clear that the shapes of the detected boxes vary according to the customized anchor box bi.\n\nWe also evaluate our method on PASCAL VOC 2007 and get preliminary results that MetaAnchor achieves \u223c 0.3% more mAP than the RetinaNet baseline (80.3 -> 80.6% mAP@0.5). 
The gain is less significant than on COCO, as we find the distribution of boxes on PASCAL VOC is much simpler than that on COCO.\n\nTo validate our method further, we implement MetaAnchor on YOLOv2 [31], where we also use a two-layer network to predict detector parameters. For the YOLOv2 baseline, we use the anchors provided in the open-source project4 to detect objects. In MetaAnchor, the \u201cstandard box\u201d (AH, AW) is (4.18, 4.69). For training, we follow the strategy used in [31] and use the COCO-full dataset. For the results, we report mmAP and mAP@0.5 on COCO minival. Table 7 shows the results. Clearly, MetaAnchor is better than the YOLOv2 baseline and further boosts the performance with the greedy search method.\n\n5 Conclusion\n\nWe propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks, in which anchor functions can be dynamically generated from arbitrary customized prior boxes. Thanks to weight prediction, MetaAnchor is able to work with most anchor-based object detection systems such as RetinaNet. Compared with the predefined anchor scheme, we empirically find that MetaAnchor is more robust to anchor settings and bounding box distributions; in addition, it also shows potential on transfer tasks. Our experiment on the COCO detection task shows that MetaAnchor consistently outperforms the counterparts in various scenarios.\n\nAcknowledgments This work is supported by National Key R&D Program No. 2017YFA0700800, China.\n\n4https://github.com/pjreddie/darknet\n\nFigure 2: Detection results at a variety of customized anchor boxes. From (a) to (e) the anchor box sizes (scale, ratio) are: (2^0, 1/3), (2^0, 1/2), (2^0, 1), (2^0, 2) and (2^0, 3) respectively. 
Note that for each picture we aggregate the predictions of all 5 levels of detection heads, so the differences among boxes mainly lie in aspect ratios.

Table 8: Results on COCO minival

Model                             | # of Anchors (Training) | # of Anchors (Inference) | mmAP (%)
RetinaNet [23]                    | 3 × 3                   | 3 × 3                    | 34.0
RetinaNet (our impl.)             | 3 × 3                   | 3 × 3                    | 35.8
RetinaNet* (our impl.)            | search                  | search                   | 36.9
MetaAnchor (ours)                 | 3 × 3                   | 3 × 3                    | 36.8
MetaAnchor (ours)                 | 9 × 9                   | search                   | 37.5
MetaAnchor (ours, data-dependent) | 9 × 9                   | search                   | 37.9

References

[1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

[2] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.

[3] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743–761, 2012.

[4] M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2584–2591. IEEE, 2013.

[5] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2154, 2014.

[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.

[7] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector.
arXiv preprint arXiv:1701.06659, 2017.

[8] R. Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.

[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

[10] D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.

[12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[14] J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. Lsda: Large scale detection through adaptation. In Advances in Neural Information Processing Systems, pages 3536–3544, 2014.

[15] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. arXiv preprint arXiv:1711.10370, 2017.

[16] L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.

[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[18] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.

[19] Z. Li, C. Peng, G. Yu, X. Zhang, Y.
Deng, and J. Sun. Light-head r-cnn: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.

[20] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Detnet: A backbone network for object detection. arXiv preprint arXiv:1804.06215, 2018.

[21] Z. Li and F. Zhou. Fssd: Feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960, 2017.

[22] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.

[23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.

[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[26] J. Mao, T. Xiao, Y. Jiang, and Z. Cao. What can help pedestrian detection? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.

[27] I. Misra, A. Gupta, and M. Hebert. From red wine to red tomato: Composition with context. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.

[28] M. Najibi, P. Samangouei, R. Chellappa, and L. Davis. Ssh: Single stage headless face detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4875–4884, 2017.

[29] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[30] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi.
You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[31] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. arXiv preprint, 2017.

[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[33] S. H. Rezatofighi, R. Kaskman, F. T. Motlagh, Q. Shi, D. Cremers, L. Leal-Taixé, and I. Reid. Deep perm-set net: Learn to predict sets with unknown permutation and cardinality using deep neural networks. arXiv preprint arXiv:1805.00613, 2018.

[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[35] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue. Dsod: Learning deeply supervised object detectors from scratch. In The IEEE International Conference on Computer Vision (ICCV), volume 3, page 7, 2017.

[36] G. Song, Y. Liu, M. Jiang, Y. Wang, J. Yan, and B. Leng. Beyond trade-off: Accelerate fcn-based face detector with higher accuracy. arXiv preprint arXiv:1804.05197, 2018.

[37] R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2325–2333, 2016.

[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. CVPR, 2015.

[39] C. Szegedy, S. Reed, D. Erhan, D. Anguelov, and S. Ioffe. Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441, 2014.

[40] J. Wang, Y. Yuan, G. Yu, and S. Jian.
Sface: An efficient network for face detection in large scale variations. arXiv preprint arXiv:1804.06559, 2018.

[41] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting pedestrians in a crowd. arXiv preprint arXiv:1711.07752, 2017.

[42] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In European Conference on Computer Vision, pages 616–634. Springer, 2016.

[43] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.

[44] L. Zhang, L. Lin, X. Liang, and K. He. Is faster r-cnn doing well for pedestrian detection? In European Conference on Computer Vision, pages 443–457. Springer, 2016.

[45] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3fd: Single shot scale-invariant face detector. arXiv preprint arXiv:1708.05237, 2017.