{"title": "Boosting Classifier Cascades", "book": "Advances in Neural Information Processing Systems", "page_first": 2047, "page_last": 2055, "abstract": "The problem of optimal and automatic design of a detector cascade is considered. A novel mathematical model is introduced for a cascaded detector. This model is analytically tractable, leads to recursive computation, and accounts for both classification and complexity. A boosting algorithm, FCBoost, is proposed for fully automated cascade design. It exploits the new cascade model, minimizes a Lagrangian cost that accounts for both classification risk and complexity. It searches the space of cascade configurations to automatically determine the optimal number of stages and their predictors, and is compatible with bootstrapping of negative examples and cost sensitive learning. Experiments show that the resulting cascades have state-of-the-art performance in various computer vision problems.", "full_text": "Boosting Classi\ufb01er Cascades\n\nMohammad J. Saberian\n\nStatistical Visual Computing Laboratory,\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92039\n\nsaberian@ucsd.edu\n\nAbstract\n\nNuno Vasconcelos\n\nStatistical Visual Computing Laboratory,\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92039\nnuno@ucsd.edu\n\nThe problem of optimal and automatic design of a detector cascade is considered.\nA novel mathematical model is introduced for a cascaded detector. This model is\nanalytically tractable, leads to recursive computation, and accounts for both clas-\nsi\ufb01cation and complexity. A boosting algorithm, FCBoost, is proposed for fully\nautomated cascade design. It exploits the new cascade model, minimizes a La-\ngrangian cost that accounts for both classi\ufb01cation risk and complexity. 
It searches the space of cascade configurations to automatically determine the optimal number of stages and their predictors, and is compatible with bootstrapping of negative examples and cost sensitive learning. Experiments show that the resulting cascades have state-of-the-art performance in various computer vision problems.

1 Introduction

There are many applications where a classifier must be designed under computational constraints. One problem where such constraints are extreme is that of object detection in computer vision. To accomplish tasks such as face detection, the classifier must process thousands of examples per image, extracted from all possible image locations and scales, at a rate of several images per second. This problem has been the focus of substantial attention since the introduction of the detector cascade architecture by Viola and Jones (VJ) in [13]. This architecture was used to design the first real-time face detector with state-of-the-art classification accuracy. The detector has since been deployed in many practical applications of broad interest, e.g. face detection on low-complexity platforms such as cameras or cell phones. The outstanding performance of the VJ detector is the result of 1) a cascade of simple-to-complex classifiers that reject most non-faces with a few machine operations, 2) learning with a combination of boosting and Haar features of extremely low complexity, and 3) use of bootstrapping to efficiently deal with the extremely large class of non-face examples.

While the resulting detector is fast and accurate, the process of designing a cascade is not. In particular, VJ did not address the problem of how to automatically determine the optimal cascade configuration, e.g. the numbers of cascade stages and weak learners per stage, or even how to design individual stages so as to guarantee optimality of the cascade as a whole. 
As a result, extensive manual supervision is required to design cascades with a good speed/accuracy trade-off. This includes trial-and-error tuning of the false positive/detection rate of each stage, and of the cascade configuration. In practice, the design of a good cascade can take several weeks. This has motivated a number of enhancements to the VJ training procedure, which can be organized into three main areas: 1) enhancement of the boosting algorithms used in cascade design, e.g. cost-sensitive variations of boosting [12, 4, 8], FloatBoost [5] or KLBoost [6], 2) post-processing of a learned cascade, by adjusting stage thresholds, to improve performance [7], and 3) specialized cascade architectures which simplify the learning process, e.g. the embedded cascade (ChainBoost) of [15], where each stage contains all weak learners of previous stages. These enhancements do not address the fundamental limitations of the VJ design, namely how to guarantee overall cascade optimality.

Figure 1: Plots of RL (left) and L (right) for detectors designed with AdaBoost and ChainBoost.

More recently, various works have attempted to address this problem [9, 8, 1, 14, 10]. However, the proposed algorithms still rely on sequential learning of cascade stages, which is suboptimal, sometimes require manual supervision, do not search over cascade configurations, and frequently lack a precise mathematical model for the cascade. In this work, we address these problems, through two main contributions. The first is a mathematical model for a detector cascade, which is analytically tractable, accounts for both classification and complexity, and is amenable to recursive computation. 
The second is a boosting algorithm, FCBoost, that exploits this model to solve the cascade learning problem. FCBoost solves a Lagrangian optimization problem, where the classification risk is minimized under complexity constraints. The risk is that of the entire cascade, which is learned holistically, rather than through sequential stage design, and FCBoost determines the optimal cascade configuration automatically. It is also compatible with bootstrapping and cost sensitive boosting extensions, enabling efficient sampling of negative examples and explicit control of the false positive/detection rate trade-off. An extensive experimental evaluation, covering the problems of face, car, and pedestrian detection, demonstrates its superiority over previous approaches.

2 Problem Definition

A binary classifier h(x) maps an example x into a class label y ∈ {−1, 1} according to h(x) = sign[f(x)], where f(x) is a continuous-valued predictor. Optimal classifiers minimize a risk

RL(f) = EX,Y{L[y, f(x)]} ≃ (1/|St|) Σi L[yi, f(xi)]    (1)

where St = {(x1, y1), . . . , (xn, yn)} is a set of training examples, yi ∈ {1, −1} the class label of example xi, and L[y, f(x)] a loss function. Commonly used losses are upper bounds on the zero-one loss, whose risk is the probability of classification error. Hence, RL is a measure of classification accuracy. For applications with computational constraints, optimal classifier design must also take into consideration the classification complexity. This is achieved by defining a computational risk

RC(f) = EX,Y{LC[y, C(f(x))]} ≃ (1/|St|) Σi LC[yi, C(f(xi))]    (2)

where C(f(x)) is the complexity of evaluating f(x), and LC[y, C(f(x))] a loss function that encodes the cost of this operation. In most detection problems, targets are rare events and contribute little to the overall complexity. 
In this case, which we assume throughout this work, LC[1, C(f(x))] = 0 and LC[−1, C(f(x))] is denoted LC[C(f(x))]. The computational risk is thus

RC(f) ≈ (1/|St−|) Σ_{xi∈St−} LC[C(f(xi))]    (3)

where St− contains the negative examples of St. Usually, more accurate classifiers are more complex. For example in boosting, where the decision rule is a combination of weak rules, a finer approximation of the classification boundary (smaller error) requires more weak learners and computation.

Optimal classifier design under complexity constraints is a problem of constrained optimization, which can be solved with Lagrangian methods. These minimize a Lagrangian

L(f; St) = (1/|St|) Σ_{xi∈St} L[yi, f(xi)] + (η/|St−|) Σ_{xi∈St−} LC[C(f(xi))]    (4)

where η is a Lagrange multiplier, which controls the trade-off between error rate and complexity. Figure 1 illustrates this trade-off, by plotting the evolution of RL and L as a function of the boosting iteration, for the AdaBoost algorithm [2]. While the risk always decreases with the addition of weak learners, this is not true for the Lagrangian. After a small number of iterations, the gain in accuracy does not justify the increase in classifier complexity. The design of classifiers under complexity constraints has been addressed through the introduction of detector cascades. A detector cascade H(x) implements a sequence of binary decisions hi(x), i = 1 . . . m. An example x is declared a target (y = 1) if and only if it is declared a target by all stages of H, i.e. hi(x) = 1, ∀i. Otherwise, the example is rejected. For applications where the majority of examples can be rejected after a small number of cascade stages, the average classification time is very small. However, the problem of designing an optimal detector cascade is still poorly understood. 
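The Lagrangian of (4) is easy to sketch numerically. The snippet below is a minimal illustration, not the paper's implementation: it uses the exponential loss for L and, as an assumption of ours, the identity loss for LC, with hypothetical per-example evaluation costs.

```python
import numpy as np

def lagrangian_risk(scores, labels, costs, eta):
    """Empirical Lagrangian of Eq. (4): classification risk over all
    examples plus eta times the computational risk over the negatives.

    scores: predictor outputs f(x_i); labels: y_i in {-1, +1};
    costs:  per-example evaluation costs C(f(x_i)) (hypothetical here).
    """
    # Classification risk R_L with the exponential loss e^{-y f(x)}.
    r_l = np.mean(np.exp(-labels * scores))
    # Computational risk R_C, averaged over negative examples only,
    # using the identity complexity loss L_C[c] = c.
    neg = labels == -1
    r_c = np.mean(costs[neg])
    return r_l + eta * r_c

# Toy example: two positives evaluated fully, two negatives rejected cheaply.
scores = np.array([2.0, 1.5, -1.0, -3.0])
labels = np.array([1, 1, -1, -1])
costs  = np.array([50.0, 50.0, 3.0, 1.0])  # weak-learner evaluations
print(lagrangian_risk(scores, labels, costs, eta=0.02))  # ≈ 0.234
```

Larger eta penalizes complexity more heavily, reproducing the trade-off plotted in Figure 1.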
A popular approach, known as ChainBoost or embedded cascade [15], is to 1) use standard boosting algorithms to design a detector, and 2) insert a rejection point after each weak learner. This is simple to implement, and creates a cascade with as many stages as weak learners. However, the introduction of the intermediate rejection points, a posteriori of detector design, sacrifices the risk-optimality of the detector. This is illustrated in Figure 1, where the evolution of RL and L are also plotted for ChainBoost. In this example, L is monotonically decreasing, i.e. the addition of weak learners no longer carries a large complexity penalty. This is due to the fact that most negative examples are rejected in the earliest cascade stages. On the other hand, the classification risk is more than double that of the original boosted detector. It is not known how close ChainBoost is to optimal, in the sense of (4).

3 Classifier cascades

In this work, we seek the design of cascades that are provably optimal under (4). We start by introducing a mathematical model for a detector cascade.

3.1 Cascade predictor

Let H(x) = {h1(x), . . . , hm(x)} be a cascade of m detectors hi(x) = sgn[fi(x)]. To develop some intuition, we start with a two-stage cascade, m = 2. The cascade implements the decision rule

H(F)(x) = sgn[F(x)]    (5)

where

F(x) = F(f1, f2)(x) = { f1(x) if f1(x) < 0;  f2(x) if f1(x) ≥ 0 }    (6)
     = f1 u(−f1) + u(f1) f2    (7)

is denoted the cascade predictor, u(.) is the step function and we omit the dependence on x for notational simplicity. This equation can be extended to a cascade of m stages, by replacing the predictor of the second stage, when m = 2, with the predictor of the remaining cascade, when m is larger. Letting Fj = F(fj, . . .
, fm) be the cascade predictor for the cascade composed of stages j to m,

F = F1 = f1 u(−f1) + u(f1) F2.    (8)

More generally, the following recursion holds

Fk = fk u(−fk) + u(fk) Fk+1    (9)

with initial condition Fm = fm. In Appendix A, it is shown that combining (8) and (9) recursively leads to

F = T1,m + T2,m fm    (10)
  = T1,k + T2,k fk u(−fk) + T2,k Fk+1 u(fk), k < m,    (11)

with initial conditions T1,0 = 0, T2,0 = 1 and

T1,k+1 = T1,k + fk u(−fk) T2,k,    T2,k+1 = T2,k u(fk).    (12)

Since T1,k, T2,k, and Fk+1 do not depend on fk, (10) and (11) make explicit the dependence of the cascade predictor, F, on the predictor of the kth stage.

3.2 Differentiable approximation

Letting F(fk + εg) = F(f1, .., fk + εg, .., fm), the design of boosting algorithms requires the evaluation of both F(fk + εg), and the functional derivative of F with respect to each fk, along any direction g

< δF(fk), g > = d/dε F(fk + εg)|ε=0.

These are straightforward for the last stage since, from (10),

F(fm + εg) = am + ε bm g,    < δF(fm), g > = bm g,    (13)

where

am = T1,m + T2,m fm = F(fm),    bm = T2,m.    (14)

In general, however, the right-hand side of (11) is non-differentiable, due to the u(.) functions. A differentiable approximation is possible by adopting the classic sigmoidal approximation u(x) ≈ [tanh(σx) + 1]/2, where σ is a relaxation parameter. 
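The sigmoidal relaxation of the step function can be sketched directly; a small illustration (function names are ours) of the approximation and of the derivative term that reappears in the b_k coefficients of the next subsection:

```python
import numpy as np

def smooth_step(x, sigma):
    """Differentiable approximation of the step u(x) from Section 3.2:
    u(x) ~ (tanh(sigma * x) + 1) / 2, tightening as sigma grows."""
    return (np.tanh(sigma * x) + 1.0) / 2.0

def smooth_step_grad(x, sigma):
    """Its derivative, sigma * (1 - tanh(sigma x)^2) / 2, the kind of
    term that shows up when differentiating the relaxed cascade."""
    return sigma * (1.0 - np.tanh(sigma * x) ** 2) / 2.0

x = np.array([-2.0, -0.1, 0.1, 2.0])
for sigma in (1.0, 10.0, 100.0):
    print(sigma, smooth_step(x, sigma))
# As sigma increases, the values approach the hard step [0, 0, 1, 1].
```

The relaxation parameter trades smoothness of the gradient against fidelity to the hard rejection rule.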
Using this approximation in (11),

F = F(fk) = T1,k + T2,k fk (1 − u(fk)) + T2,k Fk+1 u(fk)    (15)
  ≈ T1,k + T2,k fk + (1/2) T2,k [Fk+1 − fk][tanh(σfk) + 1].    (16)

It follows that

< δF(fk), g > = bk g    (17)
bk = (1/2) T2,k {[1 − tanh(σfk)] + σ[Fk+1 − fk][1 − tanh²(σfk)]}.    (18)

F(fk + εg) can also be simplified by resorting to a first-order Taylor series expansion around fk

F(fk + εg) ≈ ak + ε bk g    (19)
ak = F(fk) = T1,k + T2,k {fk + (1/2)[Fk+1 − fk][tanh(σfk) + 1]}.    (20)

3.3 Cascade complexity

In Appendix B, a similar analysis is performed for the computational complexity. Denoting by C(fk) the complexity of evaluating fk, it is shown that

C(F) = P1,k + P2,k C(fk) + P2,k u(fk) C(Fk+1)    (21)

with initial conditions C(Fm+1) = 0, P1,1 = 0, P2,1 = 1 and

P1,k+1 = P1,k + C(fk) P2,k,    P2,k+1 = P2,k u(fk).    (22)

This makes explicit the dependence of the cascade complexity on the complexity of the kth stage.

In practice, fk = Σl cl gl for gl ∈ U, where U is a set of functions of approximately identical complexity. For example, the set of projections into Haar features, in which C(fk) is proportional to the number of features gl. In general, fk has three components. The first is a predictor that is also used in a previous cascade stage, e.g. fk(x) = fk−1(x) + c g(x) for an embedded cascade. In this case, fk−1(x) has already been evaluated in stage k − 1 and is available with no computational cost. The second is the set O(fk) of features that have been used in some stage j ≤ k. These features are also available and require minimal computation (multiplication by the weight cl and addition to the running sum). The third is the set N(fk) of features that have not been used in any stage j ≤ k. 
The overall computation is

C(fk) = |N(fk)| + λ|O(fk)|,    (23)

where λ < 1 is the ratio of computation required to evaluate a used vs. new feature. For Haar wavelets, λ ≈ 1/20. It follows that updating the predictor of the kth stage increases its complexity to

C(fk + εg) = { C(fk) + λ if g ∈ O(fk);  C(fk) + 1 if g ∈ N(fk) }    (24)

and the complexity of the cascade to

C(F(fk + εg)) = P1,k + P2,k C(fk + εg) + P2,k u(fk + εg) C(Fk+1)    (25)
             = αk + γk C(fk + εg) + βk u(fk + εg)    (26)

with

αk = P1,k,    γk = P2,k,    βk = P2,k C(Fk+1).    (27)

3.4 Neutral predictors

The models of (10), (11) and (21) will be used for the design of optimal cascades. Another observation that we will exploit is that

H[F(f1, . . . , fm, fm)] = H[F(f1, . . . , fm)].

This implies that repeating the last stage of a cascade does not change its decision rule. For this reason n(x) = fm(x) is referred to as the neutral predictor of a cascade of m stages.

4 Boosting classifier cascades

In this section, we introduce a boosting algorithm for cascade design.

4.1 Boosting

Boosting algorithms combine weak learners to produce a complex decision boundary. Boosting iterations are gradient descent steps towards the predictor f(x) of minimum risk for the loss L[y, f(x)] = e^{−yf(x)} [3]. Given a set U of weak learners, the functional derivative of RL along the direction of weak learner g is

< δRL(f), g > = (1/|St|) Σi [d/dε e^{−yi(f(xi)+εg(xi))}]ε=0 = −(1/|St|) Σi yi wi g(xi),    (28)

where wi = e^{−yi f(xi)} is the weight of xi. 
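This weight-and-select step of standard boosting can be sketched numerically. Below is a toy single-feature version with threshold stumps; the function name and data are ours, purely for illustration.

```python
import numpy as np

def best_stump(X, labels, f):
    """One boosting gradient step in the sense of Eq. (28): pick the
    threshold stump g maximizing (1/|S_t|) * sum_i y_i w_i g(x_i),
    with weights w_i = exp(-y_i f(x_i)). Single-feature toy version."""
    w = np.exp(-labels * f)                     # boosting weights
    best, best_score = None, -np.inf
    for t in X:                                 # candidate thresholds
        for sign in (1, -1):
            g = sign * np.where(X >= t, 1, -1)  # stump g(x)
            score = np.mean(labels * w * g)     # <-delta R_L, g>
            if score > best_score:
                best, best_score = (t, sign), score
    return best, best_score

X = np.array([0.1, 0.4, 0.6, 0.9])
labels = np.array([-1, -1, 1, 1])
f = np.zeros(4)                                 # current predictor
print(best_stump(X, labels, f))                 # splits at threshold 0.6
```

Examples that the current predictor classifies poorly receive exponentially larger weights, steering the next weak learner toward them.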
Hence, the best update is

g*(x) = arg max_{g∈U} < −δRL(f), g >.    (29)

Letting I(x) be the indicator function, the optimal step size along the selected direction, g*(x), is

c* = arg min_{c∈R} Σi e^{−yi(f(xi)+c g*(xi))} = (1/2) log [Σi wi I(yi = g*(xi)) / Σi wi I(yi ≠ g*(xi))].    (30)

The predictor is updated into f(x) = f(x) + c* g*(x) and the procedure iterated.

4.2 Cascade risk minimization

To derive a boosting algorithm for the design of detector cascades, we adopt the loss L[y, F(f1, . . . , fm)(x)] = e^{−yF(f1,...,fm)(x)}, and minimize the cascade risk

RL(F) = EX,Y{e^{−yF(f1,...,fm)}} ≈ (1/|St|) Σi e^{−yi F(f1,...,fm)(xi)}.

Using (13) and (19),

< δRL(F(fk)), g > = (1/|St|) Σi [d/dε e^{−yi[ak(xi)+ε bk(xi) g(xi)]}]ε=0 = −(1/|St|) Σi yi w^k_i b^k_i g(xi)    (31)

where w^k_i = e^{−yi ak(xi)}, b^k_i = bk(xi) and ak, bk are given by (14), (18), and (20). The optimal descent direction and step size for the kth stage are then

g*_k = arg max_{g∈U} < −δRL(F(fk)), g >    (32)
c*_k = arg min_{c∈R} Σi w^k_i e^{−yi b^k_i c g*_k(xi)}.    (33)

In general, because the b^k_i are not constant, there is no closed form for c*_k, and a line search must be used. Note that, since ak(xi) = F(fk)(xi), the weighting mechanism is identical to that of boosting, i.e. points are reweighed according to how well they are classified by the current cascade. Given the optimal c*, g* for all stages, the impact of each update in the overall cascade risk, RL, is evaluated and the stage of largest impact is updated.

4.3 Adding a new stage

Searching for the optimal cascade configuration requires support for the addition of new stages, whenever necessary. 
This is accomplished by including a neutral predictor as the last stage of the cascade. If adding a weak learner to the neutral stage reduces the risk further than the corresponding addition to any other stage, a new stage (containing the neutral predictor plus the weak learner) is created. Since this new stage includes the last stage of the previous cascade, the process mimics the design of an embedded cascade. However, there are no restrictions that a new stage should be added at each boosting iteration, or consist of a single weak learner.

4.4 Incorporating complexity constraints

Joint optimization of speed and accuracy requires the minimization of the Lagrangian of (4). This requires the computation of the functional derivatives

< δRC(F(fk)), g > = (1/|St−|) Σi y^s_i {d/dε LC[C(F(fk + εg))(xi)]}ε=0    (34)

where y^s_i = I(yi = −1). Similarly to boosting, which upper bounds the zero-one loss u(−yf) by the exponential loss e^{−yf}, we rely on a loss that upper-bounds the true complexity. This upper-bound is a combination of a boosting-style bound u(f + εg) ≤ e^{f+εg}, and the bound C(f + εg) ≤ C(f) + 1, which follows from (24). Using (26),

LC[C(F(fk + εg))(xi)] = LC[αk + γk C(fk + εg) + βk u(fk + εg)]    (35)
                      = αk + γk (C(fk) + 1) + βk e^{fk+εg}    (36)

and, since {d/dε LC[C(F(fk + εg))]}ε=0 = βk e^{fk} g,

< δRC(F(fk)), g > = (1/|St−|) Σi y^s_i ψ^k_i β^k_i g(xi)    (37)

with β^k_i = βk(xi) and ψ^k_i = e^{fk(xi)}. 
The derivative of (4) with respect to the kth stage predictor is then

< δL(F(fk)), g > = < δRL(F(fk)), g > + η < δRC(F(fk)), g >    (38)
                 = Σi (−yi w^k_i b^k_i / |St| + η y^s_i ψ^k_i β^k_i / |St−|) g(xi)    (39)

with w^k_i = e^{−yi ak(xi)} and ak and bk given by (14), (18), and (20). Given a set of weak learners U, the optimal descent direction and step size for the kth stage are then

g*_k = arg max_{g∈U} < −δL(F(fk)), g >    (40)
c*_k = arg min_{c∈R} {(1/|St|) Σi w^k_i e^{−yi b^k_i c g*_k(xi)} + (η/|St−|) Σi y^s_i ψ^k_i β^k_i e^{c g*_k(xi)}}.    (41)

A pair (g*_{k,1}, c*_{k,1}) is found among the set O(fk) and another among the set U − O(fk). The one that most reduces (4) is selected as the best update for the kth stage and the stage with the largest impact is updated. This gradient descent procedure is denoted Fast Cascade Boosting (FCBoost).

5 Extensions

FCBoost supports a number of extensions that we briefly discuss in this section.

5.1 Cost Sensitive Boosting

As is the case for AdaBoost, it is possible to use cost sensitive risks in FCBoost. For example, the risk of CS-AdaBoost: RL(f) = EX,Y{yc e^{−yf(x)}} [12] or Asym-AdaBoost: RL(f) = EX,Y{e^{−yc yf(x)}} [8], where yc = C I(y = −1) + (1 − C) I(y = 1) and C is a cost factor.

Data Set     Train pos   Train neg   Test pos   Test neg
Face         9,000       9,000       832        832
Car          1,000       10,000      100        2,000
Pedestrian   1,000       10,000      200        2,000

Figure 2: Left: data set characteristics. 
Right: Trade-off between the error (RL) and complexity (RC) components of the risk as η changes in (4).

Table 1: Performance of various classifiers on the face, car, and pedestrian test sets.

Method               Face RL/RC/L         Car RL/RC/L          Pedestrian RL/RC/L
AdaBoost             0.20 / 50 / 1.20     0.22 / 50 / 1.22     0.35 / 50 / 1.35
ChainBoost           0.45 / 2.65 / 0.50   0.65 / 2.40 / 0.70   0.52 / 3.34 / 0.59
FCBoost (η = 0.02)   0.30 / 4.93 / 0.40   0.44 / 5.38 / 0.55   0.46 / 4.23 / 0.54

5.2 Bootstrapping

Bootstrapping is a procedure to augment the training set, by using false positives of the current classifier as the training set for the following [11]. This improves performance, but is feasible only when the bootstrapping procedure does not affect previously rejected examples. Otherwise, the classifier will forget the previous negatives while learning from the new ones. Since FCBoost learns all cascade stages simultaneously, and any stage can change after bootstrapping, this condition is violated. To overcome the problem, rather than replacing all negative examples with false positives, only a random subset is replaced. The negatives that remain in the training set prevent the classifier from forgetting about the previous iterations. This method is used to update the training set whenever the false positive rate of the cascade being learned reaches 50%.

6 Evaluation

Several experiments were performed to evaluate the performance of FCBoost, using face, car, and pedestrian recognition data sets from computer vision. In all cases, Haar wavelet features were used as weak learners. Figure 2 summarizes the data sets.

Effect of η: We started by measuring the impact of η, see (4), on the accuracy and complexity of FCBoost cascades. Figure 2 plots the accuracy component of the risk, RL, as a function of the complexity component, RC, on the face data set, for cascades trained with different η. 
The leftmost point corresponds to η = 0.05, and the rightmost to η = 0. As expected, as η decreases the cascade has lower error and higher complexity. In the remaining experiments we used η = 0.02.

Cascade comparison: Figure 3 (a) repeats the plots of the Lagrangian of the risk shown in Figure 1, for classifiers trained with 50 boosting iterations, on the face data. In addition to AdaBoost and ChainBoost, it presents the curves of FCBoost with (η = 0.02) and without (η = 0) complexity constraints. Note that, in the latter case, performance is in between those of AdaBoost and ChainBoost. This reflects the fact that FCBoost (η = 0) does produce a cascade, but this cascade has a worse accuracy/complexity trade-off than that of ChainBoost. On the other hand, the inclusion of complexity constraints, FCBoost (η = 0.02), produces a cascade with the best trade-off. These results are confirmed by Table 1, which compares classifiers trained on all data sets. In all cases, AdaBoost detectors have the lowest error, but at a tremendous computational cost. On the other hand, ChainBoost cascades are always the fastest, at the cost of the highest classification error. Finally, FCBoost (η = 0.02) achieves the best accuracy/complexity trade-off: its cascade has the lowest risk Lagrangian L. It is close to ten times faster than the AdaBoost detector, and has half of the increase in classification error (with respect to AdaBoost) of the ChainBoost cascade. Based on these results,
Based on these results,\nFCBoost (\u03b7 = 0.02) was used in the last experiment.\n\n7\n\n\f0.8\n\n0.7\n\nL\n\n0.6\n\n0.5\n\n0.4\n\n0\n\nFCBoost h =0\nFCBoost h =0.02\nAdaBoost\nChainBoost\n\n94\n\n90\n\n85\n\ne\nt\na\nR\n \nn\no\ni\nt\nc\ne\nt\ne\nD\n\nViola & Jones\nChainBoost\nFloatBoost\nWaldBoost\nFCBoost\n\n10\n\n20\n\nIterations\n\n(a)\n\n30\n\n40\n\n50\n\n80\n\n0\n\n25\n\n50\n\n75\n\n100\n\n125\n\n150\n\nNumber of False Positives\n\n(b)\n\nFigure 3: a) Lagrangian of the risk for classi\ufb01ers trained with various boosting algorithms. b) ROC of various\ndetector cascades on the MIT-CMU data set.\n\nTable 2: Comparison of the speed of different detectors.\n\nMethod VJ [13]\nEvals\n\n8\n\nFloatBoost [5] ChainBoost [15] WaldBoost [9]\n\n18.9\n\n18.1\n\n10.84\n\n[8]\n15.45\n\nFCBoost\n\n7.2\n\nFace detection: We \ufb01nish with a face detector designed with FCBoost (\u03b7 = 0.02), bootstrapping,\nand 130K Haar features. To make the detector cost-sensitive, we used CS-AdaBoost with C = 0.99.\nFigure 3 b) compares the resulting ROC to those of VJ [13], ChainBoost [15], FloatBoost [5] and\nWaldBoost [9]. Table 2 presents a similar comparison for the detector speed (average number of\nfeatures evaluated per patch). Note the superior performance of the FCBoost cascade in terms of\nboth accuracy and speed. To the best of our knowledge, this is the fastest face detector reported to\ndate.\n\nA Recursive form of cascade predictor\n\nApplying (9) recursively to (8)\n\nF = f1u(\u2212f1) + u(f1)F2\n\n= f1u(\u2212f1) + u(f1) [f2u(\u2212f2) + u(f2)F3]\n= f1u(\u2212f1) + f2u(f1)u(\u2212f2) + u(f1)u(f2) [f3u(\u2212f3) + u(f3)F4]\n\nk\u22121\n\n(42)\n(43)\n(44)\n\n=\n\nfiu(\u2212fi)Yj