{"title": "Boosting with Maximum Adaptive Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 1332, "page_last": 1340, "abstract": "Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shading some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods.", "full_text": "Boosting with Maximum Adaptive Sampling\n\nCharles Dubout\n\nIdiap Research Institute\n\ncharles.dubout@idiap.ch\n\nFranc\u00b8ois Fleuret\n\nIdiap Research Institute\n\nfrancois.fleuret@idiap.ch\n\nAbstract\n\nClassical Boosting algorithms, such as AdaBoost, build a strong classi\ufb01er without\nconcern about the computational cost. Some applications, in particular in com-\nputer vision, may involve up to millions of training examples and features. In\nsuch contexts, the training time may become prohibitive. 
Several methods exist to accelerate training, typically by sampling either the features or the examples used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time.
This paper aims at shedding some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods.

1 Introduction

Boosting is a simple and efficient machine learning algorithm which provides state-of-the-art performance on many tasks. It consists of building a strong classifier as a linear combination of weak-learners, by adding them one after another in a greedy manner. However, while textbook AdaBoost repeatedly selects each of them using all the training examples and all the features for a predetermined number of rounds, one is not obligated to do so and can instead choose to look only at a subset of examples and features.
For the sake of simplicity, we identify the space of weak-learners with the feature space by considering all the thresholded versions of the latter. More sophisticated combinations of features can be envisioned in our framework by expanding the feature space.
The computational cost of one iteration of Boosting is roughly proportional to the product of the number of candidate weak-learners Q and the number of samples T considered, and the performance increases with both. 
More samples allow a more accurate estimation of the weak-learners' performance, and more candidate weak-learners increase the performance of the best one. Therefore, one wants at the same time to look at a large number of candidate weak-learners, in order to find a good one, but also needs to look at a large number of training examples, to get an accurate estimate of the weak-learner performances. As Boosting progresses, the candidate weak-learners tend to behave more and more similarly, as their performance degrades. While a small number of samples is initially sufficient to identify the good weak-learners, it becomes more and more difficult, and the optimal values for a fixed product Q T move to larger T and smaller Q.
We focus in this paper on giving a clear mathematical formulation of the behavior described above. Our main analytical results are Equations (13) and (17) in § 3. They give exact expressions of the expected edge of the selected weak-learner – that is, the immediate loss reduction it provides in the considered Boosting iteration – as a function of the number T of samples and number Q of weak-learners used in the optimization process. From this result we derive several algorithms described in § 4, and estimate their performance compared to standard and state-of-the-art baselines in § 5.

2 Related works

The most computationally intensive operation performed in Boosting is the optimization of the weak-learners. In the simplest version of the procedure, it requires estimating for each candidate weak-learner a score dubbed the “edge”, which requires looping through every training example. Reducing this computational cost is crucial to cope with high-dimensional feature spaces or very large training sets. 
This can be achieved through two main strategies: sampling the training examples, or sampling the feature space, since there is a direct relation between features and weak-learners.
Sampling the training set was introduced historically to deal with weak-learners which cannot be trained with weighted samples. This procedure consists of sampling examples from the training set according to their Boosting weights, and of approximating a weighted average over the full set by a non-weighted average over the sampled subset. See § 3.1 for formal details. Such a procedure has been re-introduced recently for computational reasons [5, 8, 7], since the number of subsampled examples controls the trade-off between statistical accuracy and computational cost.
Sampling the feature space is the central idea behind LazyBoost [6], and consists simply of replacing the brute-force exhaustive search over the full feature set by an optimization over a subset produced by sampling uniformly a predefined number of features. The natural redundancy of most families of features makes such a procedure particularly efficient.
Recently developed methods rely on multi-armed bandit methods to properly balance the exploitation of features known to be informative, and the exploration of new features [3, 4]. The idea behind those methods is to associate a bandit arm to every feature, and to see the loss reduction as a reward. Maximizing the overall reduction is achieved with a standard bandit strategy such as UCB [1] or Exp3.P [2], see § 5.2 for details.
These techniques suffer from three important drawbacks. First, they make the assumption that the quality of a feature – the expected loss reduction of a weak-learner using it – is stationary. 
This goes against the underpinning of Boosting, which is that at any iteration the performance of the learners is relative to the sample weights, which evolve over the training (Exp3.P does not make such an assumption explicitly, but still relies only on the history of past rewards). Second, without additional knowledge about the feature space, the only structure they can exploit is the stationarity of individual features. Hence, improvement over random selection can only be achieved by sampling again the exact same features one has already seen in the past. We therefore only use those methods in a context where features come from multiple families. This allows us to model the quality, and to bias the sampling, at the level of families instead of individual features.
Those approaches exploit information about features to bias the sampling, hence making it more efficient, and reducing the number of weak-learners required to achieve the same loss reduction. However, they do not explicitly aim at controlling the computational cost. In particular, there is no notion of varying the number of samples used for the estimation of the weak-learners' performance.

3 Boosting with noisy maximization

We present in this section some analytical results to approximate a standard round of AdaBoost – or of most other Boosting algorithms – by sampling both the training examples and the features used to build the weak-learners. 
Our main goal is to devise a way of selecting the optimal numbers of weak-learners Q and samples T to look at, so that their product is upper-bounded by a given constant, and the expectation of the real performance of the selected weak-learner is maximal.
In § 3.1 we recall standard notation for Boosting, the concept of the edge of a weak-learner, and how it can be approximated by a sampling of the training examples. In § 3.2 we formalize the optimization of the learners and derive the expectation E[G*] of the true edge G* of the selected weak-learner, and we illustrate these results in the Gaussian case in § 3.3.

1{condition} is equal to 1 if the condition is true, 0 otherwise
N number of training examples
F number of weak-learners
K number of families of weak-learners
T number of examples subsampled from the full training set
Q number of weak-learners sampled in the case of a single family of features
Q1, . . . , QK numbers of weak-learners sampled from each one of the K families
(xn, yn) ∈ X × {−1, 1} training examples
ωn ∈ R weight of the nth training example in the considered Boosting iteration
Gq true edge of the qth weak-learner
G* true edge of the selected weak-learner
e(Q, T) value of E[G*], as a function of Q and T
e(Q1, . . . , QK, T) value of E[G*], in the case of K families of features
Hq approximated edge of the qth weak-learner, estimated from the T subsampled examples
∆q estimation error Hq − Gq in the approximated edge

Table 1: Notations

As stated in the introduction, we will ignore the feature space itself, and only consider in the following sections the set of weak-learners built from it. 
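As a concrete anchor for the notation of Table 1, the edge of a weak-learner (formalized in § 3.1 below) is a weighted correlation between its responses and the labels. A minimal sketch, with the weak-learner responses precomputed as a ±1 vector (an array-based simplification of ours, not the paper's implementation):

```python
import numpy as np

def edge(h_responses, y, w):
    # True edge G(h) = sum_n y_n * w_n * h(x_n), with the Boosting
    # weights w normalized to sum to one, as assumed in the paper.
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return float(np.sum(np.asarray(y) * w * np.asarray(h_responses)))

y = np.array([1, -1, 1, -1])
w = np.ones(4)
print(edge(y, y, w))   # a learner that always agrees with the labels -> 1.0
print(edge(-y, y, w))  # a learner that always disagrees -> -1.0
```

A perfect weak-learner thus has edge 1, a random one an edge close to 0.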
Also, note that both the Boosting procedure and our methods are presented in the context of binary classification, but can easily be extended to a multi-class context using for example AdaBoost.MH, which we used in all our experiments.

3.1 Edge estimation with weighting-by-sampling

Given a training set

(xn, yn) ∈ X × {−1, 1}, n = 1, . . . , N    (1)

and a set H of weak-learners, the standard Boosting procedure consists of building a strong classifier

f(x) = Σ_i α_i h_i(x)    (2)

by choosing the terms α_i ∈ R and h_i ∈ H in a greedy manner to minimize a loss estimated over the training samples.
At every iteration, choosing the optimal weak-learner boils down to finding the weak-learner with the largest edge, which is the derivative of the loss reduction w.r.t. the weak-learner weight. The higher this value, the more the loss can be reduced locally, and thus the better the weak-learner. The edge is a linear function of the responses of the weak-learner over the samples

G(h) = Σ_{n=1}^N y_n ω_n h(x_n),    (3)

where the ω_n depend on the current responses of f over the x_n. We consider without loss of generality that they have been normalized such that Σ_{n=1}^N ω_n = 1.
Given an arbitrary distribution η over the sample indexes, with a non-zero mass over every index, we can rewrite the edge as

G(h) = E_{N∼η}[ (y_N ω_N / η(N)) h(x_N) ]    (4)

which, for η(n) = ω_n, gives

G(h) = E_{N∼η}[ y_N h(x_N) ].    (5)

The idea of weighting-by-sampling consists of replacing the expectation in that expression with an approximation obtained by sampling. Let N_1, . . . , N_T be i.i.d. 
of distribution η, we define the approximated edge as

H(h) = (1/T) Σ_{t=1}^T y_{N_t} h(x_{N_t}),    (6)

which follows a binomial distribution centered on the true edge, with a variance decreasing with the number of samples T. It is accurately modeled with

H(h) ∼ N( G, (1 + G)(1 − G)/T ).    (7)

Figure 1: To each of the Q weak-learners corresponds a real edge Gq computed over all the training examples, and an approximated edge Hq computed from a subsampling of T training examples. The approximated edge fluctuates around the true value, with a binomial distribution. The Boosting algorithm selects the weak-learner with the highest approximated edge, which has a real edge G*. On this figure, the largest approximated edge is H1, hence the real edge G* of the selected weak-learner is equal to G1, which is less than G3.

3.2 Formalization of the noisy maximization

Let G1, . . . , GQ be a series of independent, real-valued random variables standing for the true edges of Q weak-learners sampled randomly. Let ∆1, . . . , ∆Q be a series of independent, real-valued random variables standing for the noise in the estimation of the edges due to the sampling of only T training examples, and finally, ∀q, let Hq = Gq + ∆q be the approximated edge.
We define G* as the true edge of the weak-learner which has the highest approximated edge

G* = G_{argmax_{1≤q≤Q} H_q}.    (8)

This quantity is random due to both the sampling of the weak-learners and the sampling of the training examples.
The quantity we want to optimize is e(Q, T) = E[G*], the expectation of the true edge of the selected learner, which increases with both Q and T. A higher Q increases the number of terms in the maximization of Equation (8), and a higher T reduces the variance of the ∆s, ensuring that G* is close to max_q Gq. 
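This trade-off can be simulated directly. A small Monte Carlo sketch (our own illustration, not from the paper): given true edges G_q, draw approximated edges H_q with the noise variance of Equation (7), select the largest, and average the true edge of the winner.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_argmax_edge(G, T, n_trials=10000):
    # Estimate e(Q, T) = E[G*]: the true edge of the learner whose
    # approximated edge H_q = G_q + noise is largest, with the noise
    # variance (1 - G_q^2) / T of Eq. (7) (Gaussian approximation).
    G = np.asarray(G, dtype=float)
    sigma = np.sqrt((1.0 - G**2) / T)
    H = G + rng.standard_normal((n_trials, G.size)) * sigma
    return float(G[np.argmax(H, axis=1)].mean())

G = [0.05, 0.10, 0.15]          # true edges of Q = 3 candidate weak-learners
for T in (10, 100, 10000):
    print(T, round(noisy_argmax_edge(G, T), 3))
# With few samples the selection is nearly random (E[G*] near the mean of G);
# with many samples it approaches max(G) = 0.15.
```

This reproduces the remark below: when the estimation noise dominates the spread of the true edges, looking at many weak-learners brings almost nothing.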
In practice, if the variance of the ∆s is of the order of, or higher than, the variance of the Gs, the maximization is close to a pure random selection, and looking at many weak-learners is useless.
We have:

e(Q, T) = E[G*]    (9)
= E[ G_{argmax_{1≤q≤Q} H_q} ]    (10)
= Σ_{q=1}^Q E[ G_q Π_{u≠q} 1{H_q > H_u} ]    (11)
= Σ_{q=1}^Q E[ E[ G_q Π_{u≠q} 1{H_q > H_u} | H_q ] ]    (12)
= Σ_{q=1}^Q E[ E[G_q | H_q] Π_{u≠q} E[ 1{H_q > H_u} | H_q ] ].    (13)

If the distributions of the Gqs and the ∆qs are Gaussians or mixtures of Gaussians, we can derive analytical expressions for both E[G_q | H_q] and E[ 1{H_q > H_u} | H_q ], and compute the value of e(Q, T) efficiently.
In the case of multiple families of weak-learners, it makes sense to model the distributions of the edges Gq separately for each family, as they often have a more homogeneous behavior inside a family than across families. We can easily adapt the framework developed in the previous sections to that case, and we define e(Q1, . . . , QK, T), the expected edge of the selected weak-learner when we sample T examples from the training set, and Qk weak-learners from the kth family.

3.3 Gaussian case

As an illustrative example, we consider here the case where the Gqs, the ∆qs, and hence also the Hqs all follow Gaussian distributions. 
We take Gq ∼ N(0, 1) and ∆q ∼ N(0, σ²), and obtain:

e(Q, T) = Q E[ E[G_1 | H_1] Π_{u≠1} E[ 1{H_1 > H_u} | H_1 ] ]    (14)
= Q E[ (H_1 / (σ² + 1)) Φ( H_1 / √(σ² + 1) )^{Q−1} ]    (15)
= (1 / √(σ² + 1)) E[ Q G_1 Φ(G_1)^{Q−1} ]    (16)
= (1 / √(σ² + 1)) E[ max_{1≤q≤Q} G_q ],    (17)

where Φ stands for the cumulative distribution function of the unit Gaussian, and σ depends on T. See Figure 2 for an illustration of the behavior of e(Q, T) for two different variances of the Gqs.
There is no reason to expect the distribution of the Gqs to be Gaussian, contrary to that of the ∆qs, as shown by Equation (7), but this is not a problem, as it can always be approximated by a mixture of Gaussians, for which we can still derive analytical expressions, even if the Gqs or the ∆qs have different distributions for different qs.

4 Adaptive Boosting Algorithms

We propose here several new algorithms to sample features and training examples adaptively at each Boosting step.
While all the formulation above deals with uniform sampling of weak-learners, we actually sample features, and optimize thresholds to build stumps. 
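The closed form (17) is easy to check numerically. The sketch below (our illustration) compares a direct simulation of the noisy selection with the closed form, under the section's assumptions Gq ∼ N(0, 1) and ∆q ∼ N(0, σ²):

```python
import numpy as np

rng = np.random.default_rng(1)

def e_direct(Q, sigma, n=200_000):
    # Simulate the selection itself: G_q ~ N(0,1), H_q = G_q + N(0, sigma^2),
    # and return the mean true edge of the learner with the largest H_q.
    G = rng.standard_normal((n, Q))
    H = G + sigma * rng.standard_normal((n, Q))
    return float(G[np.arange(n), H.argmax(axis=1)].mean())

def e_closed_form(Q, sigma, n=200_000):
    # Eq. (17): e(Q, T) = E[max_{1<=q<=Q} G_q] / sqrt(sigma^2 + 1).
    return float(rng.standard_normal((n, Q)).max(axis=1).mean()
                 / np.sqrt(sigma**2 + 1))

print(e_direct(10, 0.5), e_closed_form(10, 0.5))  # both approx. 1.38
```

As expected, a larger noise variance σ² (i.e. a smaller T) shrinks the expected edge by the factor 1/√(σ² + 1).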
We observed that after a small number of Boosting iterations, the Gaussian model of Equation (7) is sufficiently accurate.

4.1 Maximum Adaptive Sampling

At every Boosting step, our first algorithm, MAS Naive, models Gq with a Gaussian mixture model fitted on the edges estimated at the previous iteration, computes from that density model the pair (Q, T) maximizing e(Q, T), samples the corresponding number of examples and features, and keeps the weak-learner with the highest approximated edge.
The algorithm MAS 1.Q takes into account the decomposition of the feature set into K families of feature extractors. It models the distributions of the Gqs separately, estimating the distribution of each on a small number of features and examples sampled at the beginning of each iteration, chosen so as to account for 10% of the total cost. From these models, it optimizes Q, T and the index l of the family to sample from, to maximize e(Q 1{l=1}, . . . , Q 1{l=K}, T). Hence, in a given Boosting step, it does not mix weak-learners based on features from different families.
Finally, MAS Q.1 similarly models the distributions of the Gqs, but it optimizes Q1, . . . , QK, T greedily, starting from Q1 = 0, . . . , QK = 0, and iteratively incrementing one of the Ql so as to maximize e(Q1, . . . , QK, T).

Figure 2: Simulation of the expectation of G* in the case where both the Gqs and the ∆qs follow Gaussian distributions. Top: Gq ∼ N(0, 10^−2). Bottom: Gq ∼ N(0, 10^−4). In both simulations ∆q ∼ N(0, 1/T). Left: Expectation of G* vs. the number of sampled weak-learners Q and the number of samples T. Right: same value as a function of Q alone, for different fixed costs (product of the number of examples T and Q). As these graphs illustrate, the optimal value for Q is greater for larger variances of the Gq. 
In such a case the Gq are more spread out, and identifying the largest one can be done despite a large noise in the estimations, hence with a limited number of samples.

4.2 Laminating

The fourth algorithm we have developed tries to reduce the requirement for a density model of the Gq. At every Boosting step it iteratively reduces the number of considered weak-learners and increases the number of samples.
More precisely: given a fixed Q0 and T0, at every Boosting iteration, the Laminating first samples Q0 weak-learners and T0 training examples. Then, it computes the approximated edges and keeps the Q0/2 best weak-learners. If more than one remains, it samples 2 T0 examples, and re-iterates. The cost of each iteration is constant, equal to Q0 T0, and there are at most log2(Q0) of them, leading to an overall cost of O(log2(Q0) Q0 T0). In the experiments, we equalize the computational cost with the MAS approaches parametrized by T, Q by forcing log2(Q0) Q0 T0 = T Q.

5 Experiments

We demonstrate the validity of our approach for pattern recognition on two standard data-sets, using multiple types of image features. We compare our algorithms both to different flavors of uniform sampling and to state-of-the-art bandit-based methods, all tuned to deal properly with multiple families of features.

5.1 Datasets and features

For the first set of experiments we use the well-known MNIST handwritten digits database [10], containing respectively 60,000/10,000 train/test grayscale images of size 28 × 28 pixels, divided in ten classes. We use features computed by multiple image descriptors, leading to a total of 16,451 features. Those descriptors can be broadly divided in two categories. (1) Image transforms: Identity, Gradient image, Fourier and Haar transforms, Local Binary Patterns (LBP/iLBP). 
(2) Histograms: sums of the intensities in random image patches, histograms of (oriented and non-oriented) gradients at different locations, Haar-like features.
For the second set of experiments we use the challenging CIFAR-10 data-set [9], a subset of the 80 million tiny images data-set. It contains respectively 50,000/10,000 train/test color images of size 32 × 32 pixels, also divided in 10 classes. We call it challenging as state-of-the-art results without using additional training data barely reach 65% accuracy. We use directly as features the same image descriptors as described above for MNIST, plus additional versions of some of them making use of color information.

5.2 Baselines

We first define three baselines extending LazyBoost in the context of multiple feature families. The most naive strategy one could think of, which we call Uniform Naive, simply ignores the families and picks features uniformly at random. This strategy does not properly distribute the sampling among the families: if one of them had a far greater cardinality than the others, all features would come from it. 
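A single round of this Uniform Naive baseline reduces to the weighting-by-sampling selection of § 3.1. A minimal sketch (our simplification: the (N, F) matrix of precomputed ±1 weak-learner responses stands in for the stump optimization the paper performs):

```python
import numpy as np

rng = np.random.default_rng(2)

def uniform_naive_round(features, y, w, Q, T):
    # One selection round: sample Q feature columns uniformly, sample T
    # examples according to the Boosting weights (weighting-by-sampling),
    # and keep the feature with the largest approximated edge, Eq. (6).
    N, F = features.shape
    q_idx = rng.choice(F, size=Q, replace=False)
    t_idx = rng.choice(N, size=T, p=w / w.sum())
    H = (y[t_idx, None] * features[t_idx][:, q_idx]).mean(axis=0)
    return int(q_idx[np.argmax(H)])
```

On a toy response matrix whose first column equals the labels, the round reliably returns that column, since its approximated edge is exactly 1.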
We define Uniform 1.Q to pick one of the feature families at random and then sample the Q features from that single family, and Uniform Q.1 to pick uniformly at random Q families of features and then pick one feature uniformly in each family.
The second family of baselines we have tested bias their sampling at every Boosting iteration according to the observed edges in the previous iterations, and balance the exploitation of families of features known to perform well with the exploration of new families by using bandit algorithms [3, 4]. We use three such baselines (UCB, Exp3.P, ε-greedy), which differ only by the underlying bandit algorithm used.
We tune the meta-parameters of these techniques – namely the scale of the reward and the exploration-exploitation trade-off – by training them multiple times over a large range of parameters and keeping only the results of the run with the smallest final Boosting loss. Hence, the computational cost is around one order of magnitude higher than for our methods in the experiments.

Nb. of stumps | Uniform Naive | Uniform 1.Q | Uniform Q.1 | UCB* | Exp3.P* | ε-greedy* | MAS Naive | MAS 1.Q | MAS Q.1 | Laminating

MNIST
10 | -0.34 (0.01) | -0.33 (0.02) | -0.35 (0.02) | -0.33 (0.01) | -0.32 (0.01) | -0.34 (0.02) | -0.51 (0.02) | -0.50 (0.02) | -0.52 (0.01) | -0.43 (0.00)
100 | -0.80 (0.01) | -0.73 (0.03) | -0.81 (0.01) | -0.73 (0.01) | -0.73 (0.02) | -0.73 (0.03) | -1.00 (0.01) | -1.00 (0.01) | -1.03 (0.01) | -1.01 (0.01)
1,000 | -1.70 (0.01) | -1.45 (0.02) | -1.68 (0.01) | -1.64 (0.01) | -1.52 (0.02) | -1.60 (0.04) | -1.83 (0.01) | -1.80 (0.01) | -1.86 (0.00) | -1.99 (0.01)
10,000 | -5.32 (0.01) | -3.80 (0.02) | -5.04 (0.01) | -5.26 (0.01) | -5.35 (0.04) | -5.38 (0.09) | -5.35 (0.01) | -5.05 (0.02) | -5.30 (0.00) | -6.14 (0.01)

CIFAR-10
10 | -0.26 (0.00) | -0.25 (0.01) | -0.26 (0.00) | -0.25 (0.01) | -0.25 (0.01) | -0.26 (0.00) | -0.28 (0.00) | -0.28 (0.00) | -0.28 (0.01) | -0.28 (0.00)
100 | -0.33 (0.00) | -0.33 (0.01) | -0.34 (0.00) | -0.33 (0.00) | -0.33 (0.00) | -0.33 (0.00) | -0.35 (0.00) | -0.35 (0.00) | -0.37 (0.01) | -0.37 (0.00)
1,000 | -0.47 (0.00) | -0.46 (0.00) | -0.48 (0.00) | -0.48 (0.00) | -0.47 (0.00) | -0.48 (0.00) | -0.48 (0.00) | -0.48 (0.00) | -0.49 (0.01) | -0.50 (0.00)
10,000 | -0.93 (0.00) | -0.85 (0.00) | -0.91 (0.00) | -0.90 (0.00) | -0.91 (0.00) | -0.91 (0.00) | -0.93 (0.00) | -0.88 (0.00) | -0.89 (0.01) | -0.90 (0.00)

Table 2: Mean and standard deviation of the Boosting loss (log10) on the two data-sets and for each method, estimated on ten randomized runs. Methods highlighted with a * require the tuning of meta-parameters, which have been optimized by training fully multiple times.

5.3 Results and analysis

We report the results of the proposed algorithms against the baselines introduced in § 5.2 on the two data-sets of § 5.1, using the standard train/test cuts, in Tables 2 and 3. We ran each configuration ten times and report the mean and standard deviation of each. 
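For reference, one Boosting round of the Laminating scheme compared here (§ 4.2) can be sketched as follows (our illustration, reusing the ±1 response-matrix simplification; the paper optimizes stump thresholds instead):

```python
import numpy as np

rng = np.random.default_rng(3)

def laminating_round(features, y, w, Q0, T0):
    # One Boosting round of Laminating: start from Q0 sampled weak-learners
    # and T0 sampled examples; at each step keep the half with the best
    # approximated edges and double the number of examples, until a single
    # weak-learner remains. Each step costs about Q0 * T0.
    N, F = features.shape
    q_idx = rng.choice(F, size=Q0, replace=False)
    p = w / w.sum()
    T = T0
    while len(q_idx) > 1:
        t_idx = rng.choice(N, size=min(T, N), p=p)   # weighting-by-sampling
        H = (y[t_idx, None] * features[t_idx][:, q_idx]).mean(axis=0)
        q_idx = q_idx[np.argsort(H)[-(len(q_idx) // 2):]]  # keep the best half
        T *= 2
    return int(q_idx[0])
```

The successive halvings devote the sampling budget to ever-finer comparisons between the surviving candidates, which is why no density model of the edges is needed.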
We set the maximum cost of all the algorithms to 10 N, setting Q = 10 and T = N for the baselines, as this configuration leads to the best results after 10,000 Boosting rounds of AdaBoost.MH.
These results illustrate the efficiency of the proposed methods. For 10, 100 and 1,000 weak-learners, both the MAS and the Laminating algorithms perform far better than the baselines. Performance tends to become similar at 10,000 stumps, which is an unusually large number.
As stated in § 5.2, the meta-parameters of the bandit methods have been optimized by running the training fully ten times, with the corresponding computational effort.

Nb. of stumps | Uniform Naive | Uniform 1.Q | Uniform Q.1 | UCB* | Exp3.P* | ε-greedy* | MAS Naive | MAS 1.Q | MAS Q.1 | Laminating

MNIST
10 | 51.18 (4.22) | 54.37 (7.93) | 48.15 (3.66) | 52.86 (4.75) | 53.80 (4.53) | 51.37 (6.35) | 25.91 (2.04) | 25.94 (2.57) | 25.73 (1.33) | 35.70 (2.35)
100 | 8.95 (0.41) | 11.64 (1.06) | 8.69 (0.48) | 11.39 (0.53) | 11.58 (0.93) | 11.59 (1.12) | 4.87 (0.29) | 4.78 (0.16) | 4.54 (0.21) | 4.85 (0.16)
1,000 | 1.75 (0.06) | 2.37 (0.12) | 1.76 (0.08) | 1.80 (0.08) | 2.18 (0.14) | 1.83 (0.16) | 1.50 (0.06) | 1.59 (0.08) | 1.45 (0.04) | 1.34 (0.08)
10,000 | 0.94 (0.06) | 1.13 (0.03) | 0.94 (0.04) | 0.90 (0.05) | 0.84 (0.02) | 0.85 (0.07) | 0.92 (0.03) | 0.97 (0.05) | 0.94 (0.04) | 0.85 (0.04)

CIFAR-10
10 | 76.27 (0.97) | 78.57 (1.94) | 76.00 (1.60) | 77.04 (1.65) | 77.51 (1.50) | 77.13 (1.15) | 71.54 (0.69) | 71.13 (0.49) | 70.63 (0.34) | 71.54 (1.06)
100 | 56.94 (1.01) | 58.33 (1.30) | 54.48 (0.64) | 57.49 (0.46) | 58.47 (0.81) | 58.19 (0.83) | 53.94 (0.55) | 52.79 (0.09) | 50.15 (0.64) | 50.44 (0.68)
1,000 | 39.13 (0.61) | 39.97 (0.37) | 37.70 (0.38) | 38.13 (0.30) | 39.23 (0.31) | 38.36 (0.72) | 38.79 (0.28) | 38.31 (0.27) | 36.95 (0.25) | 36.39 (0.58)
10,000 | 31.83 (0.29) | 31.16 (0.29) | 30.56 (0.30) | 30.55 (0.24) | 30.39 (0.22) | 29.96 (0.45) | 32.07 (0.27) | 31.36 (0.13) | 32.51 (0.38) | 31.17 (0.22)

Table 3: Mean and standard deviation of the test error (in percent) on the two data-sets and for each method, estimated on ten randomized runs. Methods highlighted with a * require the tuning of meta-parameters, which have been optimized by training fully multiple times.

On the MNIST data-set, when adding 10 or 100 weak-learners, our methods roughly divide the error rate by two, and still improve it by about 30% with 1,000 stumps. The loss reduction follows the same pattern.
The CIFAR data-set is a very difficult pattern recognition problem. Still, our algorithms perform substantially better than the baselines for 10 and 100 weak-learners, gaining more than 10% in the test error rates, and behave similarly to the baselines for larger numbers of stumps.
As stated in § 1, the optimal values for a fixed product Q T move to larger T and smaller Q. For instance, on the MNIST data-set with MAS Naive, averaging over ten randomized runs, for respectively 10, 100, 1,000 and 10,000 stumps, T = 1,580, 13,030, 37,100, 43,600, and Q = 388, 73, 27, 19. We obtain similar and consistent results across settings.
The overhead of the MAS algorithms compared to the Uniform ones is small: in our experiments, taking into account the time spent computing features, it is approximately 0.2% for MAS Naive, 2% for MAS 1.Q and 8% for MAS Q.1. The Laminating algorithm has no overhead.
The poor behavior of the bandit methods for small numbers of stumps may be related to the large variations of the sample weights during the first iterations of Boosting, which go against the underlying assumption of stationarity of the loss reduction.

6 Conclusion

We have improved Boosting by modeling the statistical behavior of the weak-learners' edges. This allowed us to maximize the loss reduction under strict control of the computational cost. Experiments demonstrate that the algorithms perform well on real-world pattern recognition tasks.
Extensions of the proposed methods could be investigated along two axes. 
The first one is to blur the boundary between the MAS procedures and the Laminating, by deriving an analytical model of the loss reduction for generalized sampling procedures: instead of doubling the number of samples and halving the number of weak-learners, we could adapt both set sizes optimally. The second is to add a bandit-like component to our methods by adding a variance term related to the lack of samples, and their obsolescence in the Boosting process. This would account for the degrading density estimation when weak-learner families have not been sampled for a while, and induce an exploratory sampling which may be missing in the current algorithms.

Acknowledgments

This work was supported by the European Community's 7th Framework Programme under grant agreement 247022 – MASH, and by the Swiss National Science Foundation under grant 200021-124822 – VELASH. We also would like to thank Dr. Robert B. Israel, Associate Professor Emeritus at the University of British Columbia, for his help on the derivation of the expectation of the true edge of the weak-learner with the highest approximated edge (Equations (9) to (13)).

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003.
[3] R. Busa-Fekete and B. Kégl. Accelerating AdaBoost using UCB. JMLR W&CP, Jan 2009.
[4] R. Busa-Fekete and B. Kégl. Fast Boosting using adversarial bandits. In ICML, 2010.
[5] N. Duffield, C. Lund, and M. Thorup. Priority sampling for estimation of arbitrary subset sums. J. ACM, 54, December 2007.
[6] G. Escudero, L. Màrquez, and G. Rigau. Boosting applied to word sense disambiguation. Machine Learning: ECML 2000, pages 129–141, 2000.
[7] F. Fleuret and D. 
Geman. Stationary features and cat detection. Journal of Machine Learning Research (JMLR), 9:2549–2578, 2008.
[8] Z. Kalal, J. Matas, and K. Mikolajczyk. Weighted sampling for large-scale Boosting. British Machine Vision Conference, 2008.
[9] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, 2009. http://www.cs.toronto.edu/~kriz/cifar.html.
[10] Y. Lecun and C. Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
", "award": [], "sourceid": 776, "authors": [{"given_name": "Charles", "family_name": "Dubout", "institution": null}, {"given_name": "Francois", "family_name": "Fleuret", "institution": null}]}