{"title": "Multiple-Instance Pruning For Learning Efficient Cascade Detectors", "book": "Advances in Neural Information Processing Systems", "page_first": 1681, "page_last": 1688, "abstract": "Cascade detectors have been shown to operate extremely rapidly, with high accuracy, and have important applications such as face detection. Driven by this success, cascade learning has been an area of active research in recent years. Nevertheless, there are still challenging technical problems during the training process of cascade detectors. In particular, determining the optimal target detection rate for each stage of the cascade remains an unsolved issue. In this paper, we propose the multiple instance pruning (MIP) algorithm for soft cascades. This algorithm computes a set of thresholds which aggressively terminate computation with no reduction in detection rate or increase in false positive rate on the training dataset. The algorithm is based on two key insights: i) examples that are destined to be rejected by the complete classifier can be safely pruned early; ii) face detection is a multiple instance learning problem. The MIP process is fully automatic and requires no assumptions of probability distributions, statistical independence, or ad hoc intermediate rejection targets. Experimental results on the MIT+CMU dataset demonstrate significant performance advantages.", "full_text": "Multiple-Instance Pruning For Learning Ef\ufb01cient Cascade Detectors\n\nCha Zhang and Paul Viola\nMicrosoft Research\nOne Microsoft Way, Redmond, WA 98052\n{chazhang,viola}@microsoft.com\n\nAbstract\n\nCascade detectors have been shown to operate extremely rapidly, with high accuracy, and have important applications such as face detection. Driven by this success, cascade learning has been an area of active research in recent years. Nevertheless, there are still challenging technical problems during the training process of cascade detectors. 
In particular, determining the optimal target detection rate for each stage of the cascade remains an unsolved issue. In this paper, we propose the multiple instance pruning (MIP) algorithm for soft cascades. This algorithm computes a set of thresholds which aggressively terminate computation with no reduction in detection rate or increase in false positive rate on the training dataset. The algorithm is based on two key insights: i) examples that are destined to be rejected by the complete classi\ufb01er can be safely pruned early; ii) face detection is a multiple instance learning problem. The MIP process is fully automatic and requires no assumptions of probability distributions, statistical independence, or ad hoc intermediate rejection targets. Experimental results on the MIT+CMU dataset demonstrate signi\ufb01cant performance advantages.\n\n1 Introduction\n\nThe state of the art in real-time face detection has progressed rapidly in recent years. One very successful approach was initiated by Viola and Jones [11]. While some components of their work are quite simple, such as the so-called \u201cintegral image\u201d, or the use of AdaBoost, a great deal of complexity lies in the training of the cascaded detector. There are many required parameters: the number and shapes of rectangle \ufb01lters, the number of stages, the number of weak classi\ufb01ers in each stage, and the target detection rate for each cascade stage. These parameters conspire to determine not only the ROC curve for the resulting system but also its computational complexity. Since the Viola-Jones training process requires CPU days to train and evaluate, it is dif\ufb01cult, if not impossible, to pick these parameters optimally.\nThe conceptual and computational complexity of the training process has led to many papers proposing improvements and re\ufb01nements [1, 2, 4, 5, 9, 14, 15]. 
Among them, three are closely related to this paper: Xiao, Zhu and Zhang [15], Sochman and Matas [9], and Bourdev and Brandt [1]. In each paper, the original cascade structure of distinct and separate stages is relaxed so that earlier computation of weak classi\ufb01er scores can be combined with later weak classi\ufb01ers. Bourdev and Brandt coined the term \u201csoft-cascade\u201d, where the entire detector is trained as a single strong classi\ufb01er without stages (with 100\u2019s or 1000\u2019s of weak classi\ufb01ers sometimes called \u201cfeatures\u201d). The score assigned to a detection window by the soft cascade is simply a weighted sum of the weak classi\ufb01ers: sk(T ) = \u03a3j\u2264T \u03b1jhj(xk), where T is the total number of weak classi\ufb01ers; hj(xk) is the jth feature computed on example xk; \u03b1j is the vote on weak classi\ufb01er j. Computation of the sum is terminated early whenever the partial sum falls below a rejection threshold: sk(t) < \u03b8(t). Note the soft cascade is similar to, but simpler than, both the boosting chain approach of Xiao, Zhu, and Zhang and the WaldBoost approach of Sochman and Matas.\nThe rejection thresholds \u03b8(t), t \u2208 {1,\u00b7\u00b7\u00b7 , T \u2212 1} are critical to the performance and speed of the complete classi\ufb01er. However, it is dif\ufb01cult to set them optimally in practice. One possibility is to set the rejection thresholds so that no positive example is lost; this leads to very conservative thresholds and a very slow detector. Since the complete classi\ufb01er will not achieve 100% detection (given practical considerations, the \ufb01nal threshold of the complete classi\ufb01er is set to reject some positive examples because they are dif\ufb01cult to detect; reducing the \ufb01nal threshold further would admit too many false positives), it seems justi\ufb01ed to reject positive examples early in return for fast detection speed. 
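The early-termination rule just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the weak-classifier outputs `weak_scores[j]` (i.e., each \u03b1jhj(x) for one window) are assumed to be precomputed:

```python
def soft_cascade_score(weak_scores, thresholds):
    """Evaluate a soft cascade on a single window.

    weak_scores[j] is the weighted vote alpha_j * h_j(x) of weak
    classifier j; thresholds[t] is the rejection threshold theta(t).
    Returns (score, n_evaluated): the window is rejected as soon as
    its partial sum falls below the current rejection threshold.
    """
    s = 0.0
    for t, v in enumerate(weak_scores):
        s += v
        if s < thresholds[t]:
            return s, t + 1        # early rejection after t+1 features
    return s, len(weak_scores)     # survived all T weak classifiers
```

With conservative thresholds the window is carried through all features; with tighter thresholds most negative windows exit after only a few weak classifiers, which is the source of the cascade's speed.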
The main question is which positive examples can be rejected and when.\nA key criticism of all previous cascade learning approaches is that none has a scheme to determine which examples are best to reject. Viola-Jones attempted to reject zero positive examples until this became impossible and then reluctantly gave up on one positive example at a time. Bourdev and Brandt proposed a method for setting rejection thresholds based on an ad hoc detection rate target called a \u201crejection distribution vector\u201d, which is a parameterized exponential curve. Like the original Viola-Jones proposal, the soft-cascade gradually gives up on a number of positive examples in an effort to aggressively reduce the number of negatives passing through the cascade. Perhaps a particular family of curves is more palatable, but it is still arbitrary and non-optimal. Sochman-Matas used a ratio test to determine the rejection thresholds. While this has statistical validity, distributions must be estimated, which introduces empirical risk. This is a particular problem for the \ufb01rst few rejection thresholds, and can lead to low detection rates on test data.\nThis paper proposes a new mechanism for setting the rejection thresholds of any soft-cascade which is conceptually simple, has no tunable parameters beyond the \ufb01nal detection rate target, yet yields a cascade which is both highly accurate and very fast. Training data is used to set all reject thresholds after the \ufb01nal classi\ufb01er is learned. 
There are no assumptions about probability distributions, statistical independence, or ad hoc intermediate targets for detection rate (or false positive rate). The approach is based on two key insights that constitute the major contributions of this paper: 1) positive examples that are rejected by the complete classi\ufb01er can be safely rejected earlier during pruning; 2) each ground-truth face requires no more than one matched detection window to maintain the classi\ufb01er\u2019s detection rate. We propose a novel algorithm, multiple instance pruning (MIP), to set the reject thresholds automatically, which results in a very ef\ufb01cient cascade detector with superior performance.\nThe rest of the paper is organized as follows. Section 2 describes an algorithm which makes use of the \ufb01nal classi\ufb01cation results to perform pruning. Multiple instance pruning is presented in Section 3. Experimental results and conclusions are given in Sections 4 and 5, respectively.\n\n2 Pruning Using the Final Classi\ufb01cation\n\nWe propose a scheme which is simultaneously simpler and more effective than earlier techniques. Our key insight is quite simple: the reject thresholds are set so that they give up on precisely those positive examples which are rejected by the complete classi\ufb01er. Note that the score of each example, sk(t), can be considered a trajectory through time. The full classi\ufb01er rejects a positive example if its \ufb01nal score sk(T ) falls below the \ufb01nal threshold \u03b8(T ). In the simplest version of our threshold setting algorithm, all trajectories from positive windows which fall below the \ufb01nal threshold are removed. Each rejection threshold is then simply:\n\n\u03b8(t) = min_{k | sk(T ) > \u03b8(T ), yk = 1} sk(t)\n\nwhere {xk, yk} is the training set in which yk = 1 indicates positive windows and yk = \u22121 indicates negative windows. 
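As a concrete illustration, the threshold rule above can be written in a few lines. This is a hedged sketch, not the paper's code; `pos_scores` is an assumed precomputed matrix of partial-score trajectories for the positive windows:

```python
import numpy as np

def dbp_thresholds(pos_scores, final_threshold):
    """Direct backward pruning (DBP): a minimal sketch.

    pos_scores is a (K, T) array where pos_scores[k, t] is the partial
    score s_k(t) of positive window k.  Windows whose final score falls
    at or below the final threshold play no role; theta(t) is the
    minimum partial score of the retained (finally accepted) windows.
    """
    retained = pos_scores[pos_scores[:, -1] > final_threshold]
    return retained.min(axis=0)
```

Because the minimum is taken only over trajectories that the complete classifier accepts anyway, every pruned window is one the final classifier would have rejected.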
These thresholds produce a reasonably fast classi\ufb01er which is guaranteed to produce no more errors than the complete classi\ufb01er (on the training dataset). We call this pruning algorithm direct backward pruning (DBP).\nOne might question whether the minimum of all retained trajectories is robust to mislabeled or noisy examples in the training set. Note that the \ufb01nal threshold of the complete classi\ufb01er will often reject mislabeled or noisy examples (though they will be considered false negatives). These rejected examples play no role in setting the rejection thresholds. We have found this procedure very robust to the types of noise present in real training sets.\n\nFigure 1: Traces of cumulative scores of different windows in an image of a face. See text.\n\nIn past approaches, thresholds are set to reject the largest number of negative examples and only a small percentage of positive examples. These approaches justify these thresholds in different ways, but they all struggle to determine the correct percentage accurately and effectively. In the new approach, the \ufb01nal threshold of the complete soft-cascade is set to achieve the required detection rate. Rejection thresholds are then set to reject the largest number of negative examples and retain all positive examples which are retained by the complete classi\ufb01er. The important difference is that the particular positive examples which are rejected are those which are destined to be rejected by the \ufb01nal classi\ufb01er. This yields a fast classi\ufb01er which labels all positive examples in exactly the same way as the complete classi\ufb01er. In fact, it yields the fastest possible soft-cascade with this property (provided the weak classi\ufb01ers are not re-ordered). Note, some negative examples that eventually pass the complete classi\ufb01er threshold may be pruned by earlier rejection thresholds. 
This has the satisfactory side bene\ufb01t of reducing the false positive rate as well. In contrast, although the detection rate on the training set can also be guaranteed in Bourdev-Brandt\u2019s algorithm, there is no guarantee that the false positive rate will not increase.\nBourdev-Brandt propose reordering the weak classi\ufb01ers based on the separation between the mean score of the positive examples and the mean score of the negative examples. Our approach is equally applicable to a reordered soft-cascade.\nFigure 1 shows 293 trajectories from a single image whose \ufb01nal score is above -15. While the rejection thresholds are learned using a large set of training examples, this one image demonstrates the basic concepts. The red trajectories are negative windows. The single physical face is consistent with a set of positive detection windows that are within an acceptable range of positions and scales. Typically there are tens of acceptable windows for each face. The blue and magenta trajectories correspond to acceptable windows which fall above the \ufb01nal detection threshold. The cyan trajectories are potentially positive windows which fall below the \ufb01nal threshold. Since the cyan trajectories are rejected by the \ufb01nal classi\ufb01er, rejection thresholds need only retain the blue and magenta trajectories.\nIn a sense the complete classi\ufb01er, along with a threshold which sets the operating point, provides labels on examples which are more valuable than the ground-truth labels. There will always be a set of \u201cpositive\u201d examples which are extremely dif\ufb01cult to detect, or worse, which are mistakenly labeled positive. In practice the \ufb01nal threshold of the complete classi\ufb01er will be set so that these particular examples are rejected. In our new approach these particular examples can be rejected early in the computation of the cascade. 
Compared with existing approaches, which set the reject thresholds in a heuristic manner, our approach is data-driven and hence more principled.\n\n3 Multiple Instance Pruning\n\nThe notion of an \u201cacceptable detection window\u201d plays a critical role in an improved process for setting rejection thresholds. It is dif\ufb01cult to de\ufb01ne the correct position and scale of a face in an image. For a purely upright and frontal face, one might propose the smallest rectangle which includes the chin, forehead, and the inner edges of the ears. But, as we include a range of non-upright and non-frontal faces, these rectangles can vary quite a bit. Should the correct window be a function of apparent head size? Or is eye position and interocular distance more reliable? Even given clear instructions, one \ufb01nds that two subjects will differ signi\ufb01cantly in their \u201cground-truth\u201d labels.\nRecall that the detection process scans the image generating a large, but \ufb01nite, collection of overlapping windows at various scales and locations. Even in the absence of ambiguity, some slop is required to ensure that at least one of the generated windows is considered a successful detection for each face. Experiments typically declare that any window which is within 50% in size and within a distance of 50% (of size) be considered a true positive. Using typical scanning parameters this can lead to tens of windows which are all equally valid positive detections. If any of these windows is classi\ufb01ed positive then this face is considered detected.\nEven though all face detection algorithms must address the \u201cmultiple window\u201d issue, few papers have discussed it. 
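The 50% matching rule above can be sketched as a simple predicate. This is our illustrative reading of the criterion; the square-window parameterization `(x, y, size)` is an assumption for the example, not something the paper specifies:

```python
def is_acceptable(window, truth):
    """Hedged sketch of the 50% matching rule.

    A candidate window (x, y, size) counts as a true detection of a
    ground-truth face (x, y, size) if its size is within 50% of the true
    size and its center lies within 50% of the true size.  Exact
    constants and window parameterizations vary between papers.
    """
    wx, wy, ws = window
    tx, ty, ts = truth
    size_ok = abs(ws - ts) <= 0.5 * ts
    dist_ok = ((wx - tx) ** 2 + (wy - ty) ** 2) ** 0.5 <= 0.5 * ts
    return size_ok and dist_ok
```

Under a typical scanning grid, many neighboring windows satisfy this predicate for the same face, which is exactly why the positive windows naturally form bags.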
Two papers which have fundamentally integrated this observation into the training process are Nowlan and Platt [6] and, more recently, Viola, Platt, and Zhang [12]. These papers proposed a multiple instance learning (MIL) framework where the positive examples are collected into \u201cbags\u201d. The learning algorithm is then given the freedom to select at least one, and perhaps more examples, in each bag as the true positive examples. In this paper, we do not directly address soft-cascade learning, though we will incorporate the \u201cmultiple window\u201d observation into the determination of the rejection thresholds.\nOne need only retain one \u201cacceptable\u201d window for each face which is detected by the \ufb01nal classi\ufb01er. A more aggressive threshold is de\ufb01ned as:\n\n\u03b8(t) = min_{i \u2208 P} [ max_{k \u2208 Fi \u2229 Ri, yk = 1} sk(t) ]\n\nwhere i is the index of ground-truth faces; Fi is the set of acceptable windows associated with ground-truth face i and Ri is the set of windows which are \u201cretained\u201d (see below). P is the set of ground-truth faces that have at least one acceptable window above the \ufb01nal threshold:\n\nP = { i | max_{k \u2208 Fi} sk(T ) > \u03b8(T ) }\n\nIn this new procedure the acceptable windows come in bags, only one of which must be classi\ufb01ed positive in order to ensure that each face is successfully detected. This new criterion for success is more \ufb02exible and therefore more aggressive. We call this pruning method multiple instance pruning (MIP).\nReturning to Figure 1 we can see that the blue, cyan, and magenta trajectories actually form a \u201cbag\u201d. Both in this algorithm, and in the simpler previous algorithm, the cyan trajectories are rejected before the computation of the thresholds. 
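Under these definitions, the greedy MIP computation can be sketched as follows. This is a minimal illustration under our own data layout: `scores[i]` is assumed to hold the precomputed trajectories of the acceptable windows of detected face i (only faces in P are included), and `flags[i]` marks which windows of the bag are still retained:

```python
import numpy as np

def mip_thresholds(scores, flags, eps=1e-6):
    """Multiple instance pruning (MIP): a minimal sketch.

    scores[i] is an (M_i, T) array of partial scores s(i, j, t) for the
    acceptable windows of detected face i; flags[i] is a boolean array
    of length M_i, initially all True.  For each weak classifier t,
    theta(t) is the minimum over faces of the best surviving window;
    windows that fall below theta(t) are retired from their bag.
    """
    T = scores[0].shape[1]
    thresholds = np.empty(T)
    for t in range(T):
        # best surviving window of each bag at node t
        best = [s[f, t].max() for s, f in zip(scores, flags)]
        thresholds[t] = min(best) - eps
        # retire windows that fall below the new threshold
        for s, f in zip(scores, flags):
            f &= s[:, t] >= thresholds[t]
    return thresholds
```

Because the threshold sits just below the weakest bag's best window, at least one window per bag always survives, so every detected face remains detected while weaker siblings in each bag are pruned.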
The bene\ufb01t of this new algorithm is that the blue trajectories can be rejected as well.\nThe de\ufb01nition of \u201cretained\u201d examples in the computation above is a bit more complex than before. Initially the trajectories from the positive bags which fall above the \ufb01nal threshold are retained. The set of retained examples is further reduced as the earlier thresholds are set. This is in contrast to the simpler DBP algorithm, where the thresholds are set to preserve all retained positive examples. In the new algorithm the partial score of an example can fall below the current threshold (because it is in a bag with a better example). Each such example is removed from the retained set Ri and not used to set subsequent thresholds.\nThe pseudo code of the MIP algorithm is shown in Figure 2. It guarantees the same face detection rate on the training dataset as the complete classi\ufb01er. Note that the algorithm is greedy, setting earlier thresholds \ufb01rst so that all positive bags are retained and the fewest number of negative examples pass. Theoretically it is possible that delaying the rejection of a particular example may result in a better threshold at a later stage. Searching for the optimal MIP pruned detector, however, may be quite expensive. The MIP algorithm is nonetheless guaranteed to generate a soft-cascade that is at least as fast as DBP, since the criterion for setting the thresholds is less restrictive.\n\nInput\n\n\u2022 A cascade detector.\n\u2022 Threshold \u03b8(T ) at the \ufb01nal stage of the detector.\n\u2022 A large training set (the whole training set used to learn the cascade detector can be reused here).\n\nInitialize\n\n\u2022 Run the detector on all rectangles that match with any ground-truth faces. Collect all windows that are above the \ufb01nal threshold \u03b8(T ). 
Record all intermediate scores as s(i, j, t), where i = 1,\u00b7\u00b7\u00b7 , N is the face index; j = 1,\u00b7\u00b7\u00b7 , Mi is the index of windows that match with face i; t = 1,\u00b7\u00b7\u00b7 , T is the index of the feature node.\n\u2022 Initialize \ufb02ags f (i, j) as true.\n\nMIP\nFor t = 1,\u00b7\u00b7\u00b7 , T :\n\n1. For i = 1,\u00b7\u00b7\u00b7 , N : \ufb01nd \u02c6s(i, t) = max{j|f (i,j)=true} s(i, j, t).\n2. Set \u03b8(t) = mini \u02c6s(i, t) \u2212 \u03b5 as the rejection threshold of node t, \u03b5 = 10^-6.\n3. For i = 1,\u00b7\u00b7\u00b7 , N , j = 1,\u00b7\u00b7\u00b7 , Mi: set f (i, j) as false if s(i, j, t) < \u03b8(t).\n\nOutput\nRejection thresholds \u03b8(t), t = 1,\u00b7\u00b7\u00b7 , T .\n\nFigure 2: The MIP algorithm.\n\nFigure 3: (a) Performance comparison with existing works on the MIT+CMU frontal face dataset. (b) ROC curves of the detector after MIP pruning using the original training set. No performance degradation is found on the MIT+CMU testing dataset.\n\n4 Experimental Results\n\nMore than 20,000 images were collected from the web, containing roughly 10,000 faces. Over 2 billion negative examples are generated from the same image set. A soft cascade classi\ufb01er is learned through a new framework based on weight trimming and bootstrapping (see Appendix). The training process was conducted on a dual core AMD Opteron 2.2 GHz processor with 16 GB of RAM. It takes less than 2 days to train a classi\ufb01er with 700 weak classi\ufb01ers based on the Haar features [11]. The testing set is the standard MIT+CMU frontal face database [10, 7], which consists of 125 grayscale images containing 483 labeled frontal faces. A detected rectangle is considered to be a true detection if it has less than 50% variation in shift and scale from the ground-truth.\nIt is dif\ufb01cult to compare the performance of various detectors, since every detector is trained on a different dataset. 
Nevertheless, we show the ROC curves of a number of existing detectors and ours in Figure 3(a). Note there are two curves plotted for the soft cascade. The \ufb01rst curve has very good performance, at the cost of slow speed (average 37.1 features per window). The classi\ufb01cation accuracy dropped signi\ufb01cantly in the second curve, which is faster (average 25 features per window).\n\nFigure 4: (a) Pruning performance of DBP and MIP. The bottom two rows indicate the average number of features visited per window on the MIT+CMU dataset. (b) Results of existing work.\n\nFigure 4(a) compares DBP and MIP with different \ufb01nal thresholds of the strong classi\ufb01er. The original data set for learning the soft cascade is reused for pruning the detector. Since MIP is a more aggressive pruning method, the average number of features evaluated is much lower than DBP.\nNote both DBP and MIP guarantee that no positive example from the training set is lost. There is no similar guarantee for test data, though. Figure 3(b) shows that there is no practical loss in classi\ufb01cation accuracy on the MIT+CMU test dataset for various applications of the MIP algorithm (note that the MIT+CMU data is not used by the training process in any way).\nSpeed comparisons with other algorithms are subtle (Figure 4(b)). The \ufb01rst observation is that higher detection rates almost always require the evaluation of additional features. This is certainly true in our experiments, but it is also true in past papers (e.g., the two curves of Bourdev-Brandt soft cascade in Figure 3(a)). The fastest algorithms often cannot achieve very high detection rates. One explanation is that in order to achieve higher detection rates one must retain windows which are \u201cambiguous\u201d and may contain faces. The proposed MIP-based detector yields a much lower false positive rate than the 25-feature Bourdev-Brandt soft cascade and nearly 35% improvement in detection speed. 
While the WaldBoost algorithm is quite fast, detection rates are measurably lower. Detectors such as Viola-Jones, boosting chain, FloatBoost, and Wu et al. all require manual tuning. We can only guess how much trial and error went into getting a fast detector that yields good results.\nThe expected computation time of the DBP soft-cascade varies monotonically with detection rate. This is guaranteed by the algorithm. In experiments with MIP we found a surprising quirk in the expected computation times. One would expect that if the required detection rate is higher, it would be more dif\ufb01cult to prune. In MIP, when the detection rate increases, there are two con\ufb02icting factors involved. First, the number of detected faces increases, which increases the dif\ufb01culty of pruning. Second, for each face the number of retained and acceptable windows increases. Since we are computing the maximum of this larger set, MIP can in some cases be more aggressive. The second factor explains the increase of speed when the \ufb01nal threshold changes from -1.5 to -2.0.\nThe direct performance comparison between MIP and Bourdev-Brandt (B-B) was performed using the same soft-cascade and the same data. In order to better measure performance differences we created a larger test set containing 3,859 images with 3,652 faces collected from the web. Both algorithms prune the strong classi\ufb01er for a target detection rate of 97.2% on the training set, which corresponds to having a \ufb01nal threshold of \u22122.5 in Figure 4(a). We use the same exponential function family as [1] for B-B, and adjust the control parameter \u03b1 in the range between \u221216 and 4. The results are shown in Figure 5. It can be seen that the MIP pruned detector has the best detection performance. When a positive \u03b1 is used (e.g., \u03b1 = 4), the B-B pruned detector is still worse than the MIP pruned detector, and its speed is 5 times slower (56.83 vs. 11.25). 
On the other hand, when \u03b1 is negative, the speed of B-B pruned detectors improves and can be faster than MIP (e.g., when \u03b1 = \u221216). Note, adjusting \u03b1 leads to changes both in detection time and false positive rate.\nIn practice, both MIP and B-B can be useful. MIP is fully automated and guarantees detection rate with no increase in false positive rate on the training set. The MIP pruned strong classi\ufb01er is usually fast enough for most real-world applications. On the other hand, if speed is the dominant factor, one can specify a target detection rate and target execution time and use B-B to \ufb01nd a solution. Note such a solution is not guaranteed, and the false positive rate may be unacceptably high. (The performance degradation of B-B heavily depends on the given soft-cascade. While with our detector the performance of B-B is acceptable even when \u03b1 = \u221216, the performance of the detector in [1] drops signi\ufb01cantly from 37 features to 25 features, as shown in Fig. 3(a).)\n\nFigure 4 data, (a) pruning performance of DBP and MIP:\nFinal threshold: -3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0\nDetection rate: 95.2%, 94.6%, 93.2%, 92.5%, 91.7%, 90.3%, 88.8%\n# of false positives: 95, 51, 32, 20, 8, 7, 5\nDBP (avg. features/window): 36.13, 35.78, 35.76, 34.93, 29.22, 28.91, 26.72\nMIP (avg. features/window): 16.11, 16.06, 16.80, 18.60, 16.96, 15.53, 14.59\n\nFigure 4 data, (b) results of existing work:\nApproach: Viola-Jones, Boosting chain, FloatBoost, WaldBoost, Wu et al., Soft cascade\nTotal # of features: 6061, 700, 2546, 600, 756, 4943\nSlowness: 10, 18.1, 18.9, 13.9, N/A, 37.1 (25)\n\nFigure 5: The detector performance comparison after applying MIP and Bourdev-Brandt\u2019s method [1]. Note, this test was done using a much larger, and more dif\ufb01cult, test set than MIT+CMU. In the legend, symbol #f represents the average number of weak classi\ufb01ers visited per window.\n\n5 Conclusions\n\nWe have presented a simple yet effective way to set the rejection thresholds of a given soft-cascade, called multiple instance pruning (MIP). The algorithm begins with a conventional strong classi\ufb01er and an associated \ufb01nal threshold. 
MIP then adds a set of rejection thresholds to construct a cascade detector. The rejection thresholds are determined so that every face which was detected by the original strong classi\ufb01er is guaranteed to be detected by the soft cascade. The algorithm also guarantees that the false positive rate on the training set will not increase. There is only one parameter used throughout the cascade training process, the target detection rate for the \ufb01nal system. Moreover, there are no required assumptions about probability distributions, statistical independence, or ad hoc intermediate targets for detection rate or false positive rate.\n\nAppendix: Learning a Soft Cascade with Weight Trimming and Bootstrapping\n\nWe present an algorithm for learning a strong classi\ufb01er from a very large set of training examples. In order to deal with the many millions of examples, the learning algorithm uses both weight trimming and bootstrapping. Weight trimming was proposed by Friedman, Hastie and Tibshirani [3]. At each round of boosting it ignores training examples with the smallest weights, up to a percentage of the total weight which can be between 1% and 10%. Since the weights are typically very skewed toward a small number of hard examples, this can eliminate a very large number of examples. It was shown that weight trimming can dramatically reduce computation for boosted methods without sacri\ufb01cing accuracy. In weight trimming no example is discarded permanently, therefore it is ideal for learning a soft cascade.\nThe algorithm is described in Figure 6. In step 4, a set A is prede\ufb01ned to reduce the number of weight updates on the whole training set. One can in theory update the scores of the whole training set after each feature is learned if computationally affordable, though the gain in detector performance may not be visible. Note, a set of thresholds is also returned by this process (making the result a soft-cascade). 
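The trimming step described above can be sketched as follows. This is our illustration of the idea from [3], not the paper's code; the weight vector and the 10% fraction of total weight are example inputs:

```python
import numpy as np

def trim_weights(weights, trim_frac=0.10):
    """Weight trimming (Friedman et al. [3]): a minimal sketch.

    Returns a boolean mask keeping the heaviest examples.  Examples in
    the lightest `trim_frac` fraction of the total weight are ignored
    for this round only; no example is discarded permanently.
    """
    order = np.argsort(weights)            # lightest examples first
    cum = np.cumsum(weights[order])
    cut = trim_frac * weights.sum()
    keep = np.ones(len(weights), dtype=bool)
    keep[order[cum <= cut]] = False        # drop the lightest mass
    return keep
```

Because boosting weights concentrate on a few hard examples, dropping even 10% of the weight mass typically removes a large fraction of the examples, which is what makes each round affordable on very large sets.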
These preliminary rejection thresholds are extremely conservative, retaining all positive examples in the training set. They result in a very slow detector \u2013 the average number of features visited per window is on the order of hundreds. These thresholds will be replaced with the ones derived by the MIP algorithm. We set the preliminary thresholds only to moderately speed up the computation of ROC curves before MIP.\n\nInput\n\n\u2022 Training examples (x1, y1),\u00b7\u00b7\u00b7 , (xK , yK ), where yk = \u22121, 1 for negative and positive examples. K is on the order of billions.\n\u2022 T is the total number of weak classi\ufb01ers, which can be set through cross-validation.\n\nInitialize\n\n\u2022 Take all positive examples and randomly sample negative examples to form a subset of Q examples. Q = 4 \u00d7 10^6 in the current implementation.\n\u2022 Initialize weights \u03c91,i to guarantee weight balance between positive and negative examples on the sampled dataset.\n\u2022 De\ufb01ne A as the set {2, 4, 8, 16, 32, 64, 128, 192, 256,\u00b7\u00b7\u00b7}.\n\nAdaboost Learning\nFor t = 1,\u00b7\u00b7\u00b7 , T :\n\n1. For each rectangle \ufb01lter in the pool, construct a weak classi\ufb01er that minimizes the Z score [8] under the current set of weights \u03c9t,i, i \u2208 Q.\n2. Select the best classi\ufb01er ht with the minimum Z score, \ufb01nd the associated con\ufb01dences \u03b1t.\n3. Update weights of all Q sampled examples.\n4. 
If t \u2208 A,\n\u2022 Update weights of the whole training set using the previously selected classi\ufb01ers h1,\u00b7\u00b7\u00b7 , ht.\n\u2022 Perform weight trimming [3] to trim 10% of the negative weights.\n\u2022 Take all positive examples and randomly sample negative examples from the trimmed training set to form a new subset of Q examples.\n5. Set the preliminary rejection threshold \u03b8(t) of \u03a3j\u2264t \u03b1jhj as the minimum score of all positive examples at stage t.\n\nOutput\nWeak classi\ufb01ers ht, t = 1,\u00b7\u00b7\u00b7 , T , the associated con\ufb01dences \u03b1t and preliminary rejection thresholds \u03b8(t).\n\nFigure 6: Adaboost learning with weight trimming and bootstrapping.\n\nReferences\n[1] L. Bourdev and J. Brandt. Robust object detection via soft cascade. In Proc. of CVPR, 2005.\n[2] S. C. Brubaker, M. D. Mullin, and J. M. Rehg. Towards optimal training of cascaded detectors. In Proc. of ECCV, 2006.\n[3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Technical report, Dept. of Statistics, Stanford University, 1998.\n[4] S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum. Statistical learning of multi-view face detection. In Proc. of ECCV, 2002.\n[5] H. Luo. Optimization design of cascaded classi\ufb01ers. In Proc. of CVPR, 2005.\n[6] S. J. Nowlan and J. C. Platt. A convolutional neural network hand tracker. In Proc. of NIPS, volume 7, 1995.\n[7] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Trans. on PAMI, 20:23\u201338, 1998.\n[8] R. E. Schapire and Y. Singer. Improved boosting algorithms using con\ufb01dence-rated predictions. Machine Learning, 37:297\u2013336, 1999.\n[9] J. Sochman and J. Matas. Waldboost - learning for time constrained sequential detection. In Proc. of CVPR, 2005.\n[10] K. Sung and T. Poggio. Example-based learning for view-based face detection. IEEE Trans. 
on PAMI, 20:39\u201351, 1998.\n[11] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. of CVPR, 2001.\n[12] P. Viola, J. C. Platt, and C. Zhang. Multiple instance boosting for object detection. In Proc. of NIPS, volume 18, 2006.\n[13] B. Wu, H. Ai, C. Huang, and S. Lao. Fast rotation invariant multi-view face detection based on real adaboost. In Proc. of IEEE Automatic Face and Gesture Recognition, 2004.\n[14] J. Wu, J. M. Rehg, and M. D. Mullin. Learning a rare event detection cascade by direct feature selection. In Proc. of NIPS, volume 16, 2004.\n[15] R. Xiao, L. Zhu, and H. Zhang. Boosting chain learning for object detection. In Proc. of ICCV, 2003.\n", "award": [], "sourceid": 575, "authors": [{"given_name": "Cha", "family_name": "Zhang", "institution": null}, {"given_name": "Paul", "family_name": "Viola", "institution": null}]}