{"title": "Minimal Variance Sampling in Stochastic Gradient Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 15087, "page_last": 15097, "abstract": "Stochastic Gradient Boosting (SGB) is a widely used approach to regularization of boosting models based on decision trees. It was shown that, in many cases, random sampling at each iteration can lead to better generalization performance of the model and can also decrease the learning time. Different sampling approaches were proposed, where probabilities are not uniform, and it is not currently clear which approach is the most effective. In this paper, we formulate the problem of randomization in SGB in terms of optimization of sampling probabilities to maximize the estimation accuracy of split scoring used to train decision trees.This optimization problem has a closed-form nearly optimal solution, and it leads to a new sampling technique, which we call Minimal Variance Sampling (MVS).The method both decreases the number of examples needed for each iteration of boosting and increases the quality of the model significantly as compared to the state-of-the art sampling methods. The superiority of the algorithm was confirmed by introducing MVS as a new default option for subsampling in CatBoost, a gradient boosting library achieving state-of-the-art quality on various machine learning tasks.", "full_text": "Minimal Variance Sampling in Stochastic Gradient\n\nBoosting\n\nBulat Ibragimov\n\nYandex, Moscow, Russia\n\nMoscow Institute of Physics and Technology\n\nGleb Gusev\n\nSberbank\u2217, Moscow, Russia\ngusev.g.g@sberbank.ru\n\nibrbulat@yandex.ru\n\nAbstract\n\nStochastic Gradient Boosting (SGB) is a widely used approach to regularization\nof boosting models based on decision trees. It was shown that, in many cases, ran-\ndom sampling at each iteration can lead to better generalization performance of\nthe model and can also decrease the learning time. 
Different sampling approaches were proposed, where probabilities are not uniform, and it is not currently clear which approach is the most effective. In this paper, we formulate the problem of randomization in SGB in terms of optimization of sampling probabilities to maximize the estimation accuracy of split scoring used to train decision trees. This optimization problem has a closed-form nearly optimal solution, and it leads to a new sampling technique, which we call Minimal Variance Sampling (MVS). The method both decreases the number of examples needed for each iteration of boosting and increases the quality of the model significantly as compared to the state-of-the-art sampling methods. The superiority of the algorithm was confirmed by introducing MVS as a new default option for subsampling in CatBoost, a gradient boosting library achieving state-of-the-art quality on various machine learning tasks.\n\n1 Introduction\n\nGradient boosted decision trees (GBDT) [16] is one of the most popular machine learning algorithms, as it provides high-quality models in a large number of machine learning problems containing heterogeneous features, noisy data, and complex dependencies [31]. There are many fields where gradient boosting achieves state-of-the-art results, e.g., search engines [34, 3], recommendation systems [30], and other applications [36, 4].\n\nOne problem of GBDT is the computational cost of the learning process. GBDT may be described as an iterative process of constructing decision tree models, each of which estimates negative gradients of examples' errors. At each step, GBDT greedily builds a tree. GBDT scores every possible feature split and chooses the best one, which requires computational time proportional to the number of data instances. 
Since most GBDT models consist of an ensemble of many trees, as the number of examples grows, more learning time is required, which imposes restrictions on using GBDT models for large industry datasets.\n\nAnother problem is the trade-off between the capacity of the GBDT model and its generalization ability. One of the most critical parameters that influence the capacity of boosting is the number of iterations, or the size of the ensemble. The more components are used in the algorithm, the more complex dependencies can be modeled. However, an increase in the number of models in the ensemble does not always lead to an increase in accuracy and, moreover, can decrease its generalization ability [28]. Therefore, boosting algorithms are usually provided with regularization methods, which are needed to prevent overfitting.\n\n\u2217The study was done while working at Yandex\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nA common approach to handle both of the problems described above is to use random subsampling [17] (or bootstrap) at every iteration of the algorithm. Before fitting the next tree, we select a subset of training objects, which is smaller than the original training dataset, and run the learning algorithm on the chosen subsample. The fraction of chosen objects is called the sample rate. Studies in this area show that subsampling demonstrates excellent performance in terms of learning time and quality [17]. It helps to speed up the learning process for each decision tree model, as it uses less data. Also, the accuracy can increase because, despite the fact that the variance of each component of the ensemble goes up, the pairwise correlation between trees of the ensemble decreases, which can lead to a reduction in the total variance of the model.\n\nIn this paper, we propose a new approach to theoretical analysis of random sampling in GBDT. 
In GBDT, random subsamples are used for evaluation of candidate splits when each next decision tree is constructed. Random sampling decreases the size of the active training dataset, and the training procedure becomes more noisy, which can entail a decrease in quality. Therefore, the sampling algorithm should select the most informative training examples, given a constrained number of instances the algorithm is allowed to choose. We propose a mathematical formulation of this optimization problem in SGB, where the accuracy of estimated scores of candidate splits is maximized. For every fixed sample rate (ratio of sampled objects), we propose a solution to this sampling problem and provide a novel algorithm, Minimal Variance Sampling (MVS). MVS relies on the distribution of loss derivatives and assigns the probabilities and weights with which the sampling should be done. That makes the procedure adaptive to any data distribution and allows it to significantly outperform the state-of-the-art SGB methods while operating with a far smaller number of data instances.\n\n2 Background\n\nIn this section, we introduce necessary definitions and notation. We start from the GBDT algorithm, and then we describe its two popular modifications that use data subsampling: Stochastic Gradient Boosting [17] and Gradient-Based One-Side Sampling (GOSS) [24].\n\n2.1 Gradient Boosting\n\nConsider a dataset $\{\vec{x}_i, y_i\}_{i=1}^N$ sampled from some unknown distribution $p(\vec{x}, y)$. Here $\vec{x}_i \in X$ is a vector from the d-dimensional vector space. 
Value $y_i \in \mathbb{R}$ is the response to the input $\vec{x}_i$ (or target). Given a loss function $L : \mathbb{R}^2 \to \mathbb{R}_+$, the supervised learning problem is to find a function $F : X \to \mathbb{R}$ which minimizes the empirical risk:\n\n$\hat{L}(F) = \sum_{i=1}^{N} L(F(\vec{x}_i), y_i)$\n\nGradient boosting (GB) [16] is a method of constructing the desired function F in the form\n\n$F(\vec{x}) = \sum_{k=1}^{n} \alpha f_k(\vec{x}),$\n\nwhere n is the number of iterations, i.e., the number of base functions $f_k$ chosen from a simple parametric family $\mathcal{F}$, such as linear models or decision trees with small depth. The learning rate, or step size in functional space, is denoted by $\alpha$. Base learners $\{f_k\}$ are learned sequentially in the following way. Given a function $F_{m-1} = \sum_{k=1}^{m-1} \alpha f_k$, the goal is to construct the next member $f_m$ of the sequence $f_1, \ldots, f_{m-1}$ such that:\n\n$f_m = \arg\min_{f \in \mathcal{F}} \hat{L}(F_{m-1} + f) = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{N} L(F_{m-1}(\vec{x}_i) + f(\vec{x}_i), y_i) \quad (1)$\n\nGradient boosting constructs a solution of Equation 1 by calculating first-order derivatives (gradients) $g_i^m(\vec{x}_i, y_i) = \frac{\partial L(\hat{y}_i, y_i)}{\partial \hat{y}_i}\big|_{\hat{y}_i = F_{m-1}(\vec{x}_i)}$ of $\hat{L}(F)$ at point $F_{m-1}$ and performing a negative gradient step in the functional space of examples $\{\vec{x}_i, y_i\}_{i=1}^N$. The latter means that $f_m$ is learned using $\{-g_i^m(\vec{x}_i, y_i)\}_{i=1}^N$ as targets and is fitted by the least-squares approximation:\n\n$f_m = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{N} \left(f(\vec{x}_i) - (-g_i^m(\vec{x}_i, y_i))\right)^2 \quad (2)$\n\nIf a subfamily of decision tree functions is taken as the set of base functions $\mathcal{F}$ (e.g., all decision trees of depth 5), the algorithm is called Gradient Boosted Decision Trees (GBDT) [16]. A decision tree divides the original feature space $\mathbb{R}^d$ into disjoint areas, also called leaves, with a constant value in each region. In other words, the result of decision tree learning is a disjoint union of subsets $\{X_1, X_2, \ldots, X_q : \bigsqcup_{i=1}^{q} X_i = X\}$ and a piecewise constant function $f(\vec{x}) = \sum_{i=1}^{q} \mathbb{I}\{\vec{x} \in X_i\} c_i$. The learning procedure is recursive. It starts from the whole set $\mathbb{R}^d$ as the only region. For each of the already built regions, the algorithm looks through all split candidates by one feature and sets a score for each split. The score is usually a measure of accuracy gain based on target distributions in the regions before and after splitting. 
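Returning to the boosting iteration of Equation 2: as a toy illustration (our own sketch, assuming squared loss, for which the negative gradient is simply the residual, and simple one-feature depth-1 "stump" base learners; this is not the paper's implementation):

```python
def fit_stump(x, residuals):
    """Least-squares depth-1 tree on a single feature: pick the split
    threshold minimizing the squared error of a piecewise-constant fit."""
    best = None
    order = sorted(range(len(x)), key=lambda i: x[i])
    for k in range(1, len(x)):
        left = [residuals[order[i]] for i in range(k)]
        right = [residuals[order[i]] for i in range(k, len(x))]
        cl, cr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - cl) ** 2 for r in left) + sum((r - cr) ** 2 for r in right)
        thr = x[order[k]]
        if best is None or sse < best[0]:
            best = (sse, thr, cl, cr)
    _, thr, cl, cr = best
    return lambda v: cl if v < thr else cr

def boost(x, y, n_iters=20, lr=0.5):
    """Gradient boosting with squared loss: each stump is fitted to the
    residuals, i.e., the negative gradients of Equation 2."""
    preds = [0.0] * len(x)
    stumps = []
    for _ in range(n_iters):
        resid = [yi - pi for yi, pi in zip(y, preds)]  # -g_i for squared loss
        f = fit_stump(x, resid)
        stumps.append(f)
        preds = [p + lr * f(xi) for p, xi in zip(preds, x)]
    return lambda v: sum(lr * f(v) for f in stumps)
```

On a one-dimensional step function, this loop drives the training error toward zero as iterations accumulate.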
The process continues until a stopping criterion is reached.\n\nBesides the classical gradient descent approach to GBDT defined by Equation 2, we also consider a second-order method based on calculating the diagonal elements of the hessian of the empirical risk, $h_i^m(\vec{x}_i, y_i) = \frac{\partial^2 L(\hat{y}_i, y_i)}{\partial \hat{y}_i^2}\big|_{\hat{y}_i = F_{m-1}(\vec{x}_i)}$. The rule for choosing the next base function in this method [8] is:\n\n$f_m = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{N} h_i^m(\vec{x}_i, y_i) \left(f(\vec{x}_i) - \left(-\frac{g_i^m(\vec{x}_i, y_i)}{h_i^m(\vec{x}_i, y_i)}\right)\right)^2. \quad (3)$\n\n2.2 Stochastic Gradient Boosting\n\nStochastic Gradient Boosting is a randomized version of the standard Gradient Boosting algorithm. Motivated by Breiman's work about adaptive bagging [2], Friedman [17] came to the idea of adding randomness into the tree building procedure by using a subsampling of the full dataset. For each iteration of the boosting process, the sampling algorithm of SGB selects $s \cdot N$ objects at random, uniformly and without replacement. It effectively reduces the complexity of each iteration down to the factor of s. It is also proved by experiments [17] that, in some cases, the quality of the learned model can be improved by using SGB.\n\n2.3 GOSS\n\nThe SGB algorithm makes all objects equally likely to be selected. However, different objects have different impacts on the learning process. Gradient-based one-side sampling (GOSS) implements the idea that objects $\vec{x}_i$ with a larger absolute value of the gradient $|g_i|$ are more important than the ones that have smaller gradients. A large gradient value indicates that the model can be improved significantly with respect to the object, and it should be sampled with higher probability compared to well-trained instances with small gradients. 
So, GOSS takes the most important objects with\nprobability 1 and chooses a random sample of other objects. To avoid distribution bias, GOSS re-\nweighs selected samples by setting higher weights to the examples with smaller gradients. More\nformally, the training sample consists of top rate \u00d7 N instances with largest |gi| with weight equal\nto 1 and of other rate \u00d7 N instances from the rest of the data with weights equal to 1\u2212top rate\nother rate .\n\n3 Related work\n\nA common approach to randomize the learning of GBDT model is to use some kind of SGB, where\ninstances are sampled equally likely or uniformly. This idea was implemented in different ways.\nOriginally, Friedman proposed to sample a subset of objects of a \ufb01xed size [17] without replacement.\nHowever, in today\u2019s practice, other similar techniques are applied, where the size of the subset can\nbe stochastic. For example, the objects can be sampled independently using a Bernoulli process [24],\n\n3\n\n\for a bootstrap procedure can be applied [14]. To the best of our knowledge, GOSS proposed in [24]\nis the only weighted (non-uniform) sampling approach applied to GBDT. It is based on intuitive\nideas, but its choice is empirical. Therefore our theoretically grounded method MVS outperforms\nGOSS in experiments.\nAlthough, there is a surprising lack of non-uniform sampling for GBDT, there are [13] adaptive\nweighted approaches proposed for AdaBoost, another popular boosting algorithm. These methods\nmostly rely on weights of instances de\ufb01ned in the loss function at each iteration of boosting [15, 35,\n9, 26, 20]. These papers are mostly focused on the accurate estimation of the loss function, while\nsubsamples in GBDT are used to estimate the scores of candidate splits, and therefore, sampling\nmethods of both our paper and GOSS are based on the values of gradients. 
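As a concrete illustration of such gradient-based sampling, the GOSS rule described in Section 2.3 can be sketched as follows (our own minimal rendering; the function and variable names are illustrative and are not LightGBM's API):

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """GOSS: keep the top_rate fraction with the largest |g| at weight 1,
    then sample an other_rate fraction of the remainder, re-weighted by
    (1 - top_rate) / other_rate so the selection stays unbiased."""
    rng = random.Random(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:n_top], order[n_top:]
    sampled_rest = rng.sample(rest, n_other)
    weights = {i: 1.0 for i in top}
    # compensate for under-sampling the small-gradient region
    w = (1.0 - top_rate) / other_rate
    weights.update({i: w for i in sampled_rest})
    return weights  # index -> weight of each selected instance
```

Note that the weights of the selected instances sum to N, so the sampled dataset has the same total mass as the full one.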
GBDT algorithms do not apply adaptive weighting of training instances, and methods proposed for AdaBoost cannot be directly applied to GBDT.\n\nOne of the most popular sampling methods based on the target distribution is Importance Sampling [37], widely used in deep learning [21]. The idea is to choose the objects with larger loss gradients with higher probability than those with smaller ones. This leads to a variance reduction of the mini-batch estimated gradient and has a positive effect on model performance. Unfortunately, Importance Sampling performs poorly for the task of building decision trees in GBDT, because the score of a split is a ratio function which depends on the sums of gradients and the sample sizes in the leaves, and the variances of their estimates all affect the accuracy of the GBDT algorithm. The following part of this paper is devoted to a theoretically grounded method which overcomes these limitations.\n\n4 Minimal Variance Sampling\n\n4.1 Problem setting\n\nAs was mentioned in Section 2.1, training a decision tree is a recursive process of selecting the best data partition (or split), which is based on a value of some feature. So, given a subset A of the original feature space $X$, a split is a pair of a feature f and its value v such that the data is partitioned into two sets: $A_1 = \{\vec{x} \in A : x^f < v\}$, $A_2 = \{\vec{x} \in A : x^f \geq v\}$. Every split is evaluated by some score, which is used to select the best one among them.\n\nThere are various scoring metrics, e.g., the Gini index and the entropy criterion [32] for classification tasks, mean squared error (MSE) and mean absolute error (MAE) for regression trees. Most GB implementations (e.g., [8]) consider the hessian while learning the next tree (second-order approximation). The solution to Equation 3 in a leaf l is the constant equal to the ratio of the sum of gradients to the sum of hessian diagonal elements. 
The score $S(f, v)$ of a split $(f, v)$ is calculated as\n\n$S(f, v) := \sum_{l \in L} \frac{\left(\sum_{i \in l} g_i\right)^2}{\sum_{i \in l} h_i}, \quad (4)$\n\nwhere L is the set of obtained leaves, and leaf l consists of objects that belong to this leaf. This score is, up to a common constant, the opposite to the value of the functional minimized in Equation 3 when we add this split to the tree. For classical GB based on the first-order gradient steps, according to Equation 2, score $S(f, v)$ should be calculated by setting $h_i = 1$ in Equation 4.\n\nTo formulate the problem, we first describe the general sampling procedure, which generalizes SGB and GOSS. Sampling from a set of size N may be described as a sequence of random variables $(\xi_1, \xi_2, \ldots, \xi_N)$, where $\xi_i \sim \text{Bernoulli}(p_i)$, and $\xi_i = 1$ indicates that the i-th example was sampled and should be used to estimate scores $S(f, v)$ of different candidates $(f, v)$. Let $n_{sampled}$ be the number of selected instances. 
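As a side illustration of the score in Equation 4 for a single candidate split (our own sketch, not part of the paper's algorithms; setting all hessians to 1 recovers the first-order case):

```python
def split_score(feature, grads, hessians, threshold):
    """Score of Equation 4: sum over the two resulting leaves of
    (sum of gradients)^2 / (sum of hessians)."""
    score = 0.0
    for leaf in (
        [i for i, v in enumerate(feature) if v < threshold],
        [i for i, v in enumerate(feature) if v >= threshold],
    ):
        if not leaf:
            continue
        g = sum(grads[i] for i in leaf)
        h = sum(hessians[i] for i in leaf)
        score += g * g / h
    return score
```

The best split is the one maximizing this score; a split separating instances with opposite-sign gradients scores higher than one mixing them.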
By sampling with sampling ratio s, we denote any sequence $(\xi_1, \xi_2, \ldots, \xi_N)$ which samples s × 100% of the data on average:\n\n$E(n_{sampled}) = E \sum_{i=1}^{N} \xi_i = \sum_{i=1}^{N} p_i = N \cdot s. \quad (5)$\n\nTo make all key statistics (the sum of gradients and the sum of hessians in a leaf) unbiased, we perform inverse probability weighting estimation (IPWE) [18], which assigns weight $w_i = \frac{1}{p_i}$ to instance i. In GB with sampling, score $S(f, v)$ is approximated by\n\n$\hat{S}(f, v) := \sum_{l \in L} \frac{\left(\sum_{i \in l} \frac{1}{p_i} \xi_i g_i\right)^2}{\sum_{i \in l} \frac{1}{p_i} \xi_i h_i}, \quad (6)$\n\nwhere the numerator and denominator are estimators of $\left(\sum_{i \in l} g_i\right)^2$ and $\sum_{i \in l} h_i$ correspondingly.\n\nWe aim at minimization of the squared deviation $\Delta^2 = \left(\hat{S}(f, v) - S(f, v)\right)^2$. Deviation $\Delta$ is a random variable due to the randomness of the sampling procedure (the randomness of $\xi_i$). Therefore, we consider the minimization of the expectation $E\Delta^2$.\n\nTheorem 1. The expected squared deviation $E\Delta^2$ can be approximated by\n\n$E\Delta^2 \approx \sum_{l \in L} c_l^2 \left(4 Var(x_l) - 4 c_l Cov(x_l, y_l) + c_l^2 Var(y_l)\right), \quad (7)$\n\nwhere $x_l := \sum_{i \in l} \frac{1}{p_i} \xi_i g_i$, $y_l := \sum_{i \in l} \frac{1}{p_i} \xi_i h_i$, and $c_l := \frac{\mu_{x_l}}{\mu_{y_l}} = \frac{\sum_{i \in l} g_i}{\sum_{i \in l} h_i}$ is the value in the leaf l that would be assigned if l would be a terminal node of the tree.\n\nThe proof of this theorem is available in the Supplementary Materials.\n\nThe term $-4 c_l Cov(x_l, y_l)$ in Equation 7 has an upper bound of $\left(4 Var(x_l) + c_l^2 Var(y_l)\right)$. Using Theorem 1, we come to an upper bound minimization problem\n\n$\sum_{l \in L} c_l^2 \left(4 Var(x_l) + c_l^2 Var(y_l)\right) \to \min. \quad (8)$\n\nNote that we do not have the values of $c_l$ for all possible leaves of all possible candidate splits in advance, when we perform the sampling procedure. A possible approach to Problem 8 is to substitute all $c_l^2$ by a universal constant value, which is a parameter of the sampling algorithm. Also, note that $Var(x_l)$ is $\sum_{i \in l} \frac{1}{p_i} g_i^2$ and $Var(y_l)$ is $\sum_{i \in l} \frac{1}{p_i} h_i^2$ up to constants that do not depend on the sampling procedure. In this way, we come to the following form of Problem 8:\n\n$\sum_{i=1}^{N} \frac{1}{p_i} g_i^2 + \lambda \sum_{i=1}^{N} \frac{1}{p_i} h_i^2 \to \min_{p_i}, \quad \text{w.r.t.} \quad \sum_{i=1}^{N} p_i = N \cdot s \quad \text{and} \quad \forall i \in \{1, \ldots, N\}\ p_i \in [0, 1]. \quad (9)$\n\n4.2 Theoretical analysis\n\nHere we show that Problem 9 has a simple solution and leads to an effective sampling algorithm. First, we discuss its meaning in the case of first-order optimization, where we have $h_i = 1$.\n\nThe first term of the minimized expression is responsible for the gradient distribution over the leaves of the decision tree, while the second one is responsible for the distribution of sample sizes. The coefficient $\lambda$ controls the magnitude of each of the components. It can be seen as a trade-off between the variance of a single model and the variance of the ensemble. The variance of the ensemble consists of the individual variances of every single algorithm and the pairwise correlations between models. On the one hand, it is crucial to reduce the individual variances of each model; on the other hand, the more dissimilar the subsamples are, the less the total variance of the ensemble is. 
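As a small numeric check of the objective in Problem 9 (our own sketch; the probabilities, gradients, and helper name are illustrative), probabilities proportional to gradient magnitudes already yield a smaller objective than uniform ones at the same expected sample size:

```python
def mvs_objective(probs, grads, hessians, lam):
    """Objective of Problem 9: sum_i (g_i^2 + lam * h_i^2) / p_i."""
    return sum((g * g + lam * h * h) / p
               for p, g, h in zip(probs, grads, hessians))

# with lam = 0 the objective only sees gradients (Importance Sampling regime)
grads, hessians = [1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]
uniform = [0.5, 0.5, 0.5, 0.5]           # expected sample size 2
proportional = [0.2, 0.4, 0.6, 0.8]      # p_i proportional to |g_i|, also 2
```

Both probability vectors have the same expected sample size, but the gradient-proportional one attains a strictly lower variance objective.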
This is reflected in the dependence of accuracy on the number of sampled examples: a slight reduction of this number usually leads to an increase in quality, as the variance of each model is not corrupted much; but, when the sample size goes down to smaller numbers, the sum of variances prevails over the loss in correlations, and the accuracy dramatically decreases.\n\nIt is easy to derive that setting $\lambda$ to 0 implies the procedure of Importance Sampling. As was mentioned before, the applicability of this procedure in GBDT is constrained, since it is still important to estimate the number of instances in each node of the tree accurately. Besides, Importance Sampling suffers from numerical instability when dealing with small gradients close to zero, which usually happens at the later gradient boosting iterations. In this case, the second part of the expression may be interpreted as a regularization term prohibiting enormous weights.\n\nSetting $\lambda$ to $\infty$ implies the SGB algorithm.\n\nFor arbitrary $\lambda$, the general solution is given by the following theorem (we leave the proof to the Supplementary Materials):\n\nTheorem 2. There exists a value $\mu$ such that $p_i = \min\left(1, \frac{\sqrt{g_i^2 + \lambda h_i^2}}{\mu}\right)$ is a solution to Problem 9.\n\nFor abbreviation, everywhere below we refer to the expression $\hat{g}_i = \sqrt{g_i^2 + \lambda h_i^2}$ as the regularized absolute value. The number $\mu$ defined above is a threshold for the decision whether to pick an example deterministically or by coin flipping. From the solution we see that, for any data instance, the weight is always bounded by some number, so the estimator is more computationally stable than IPWE usually is.\n\nFrom Theorem 2, we conclude that the optimal sampling scheme in terms of Equation 9 is described by Algorithm 1:\n\nAlgorithm 1 MVS Algorithm\nInput: X, y, Loss, maxIter, sampleRate, $\lambda$\nensemble = []\nensemble.append(InitialGuess(X, y))\nfor i from 1 to maxIter do\n  predictions = ensemble.predict(X)\n  gradients, hessians = CalculateDerivatives(Loss, y, predictions)\n  regGradients[i] = sqrt(gradients[i]^2 + $\lambda$ * hessians[i]^2)\n  $\mu$ = CalculateThreshold(regGradients, sampleRate)\n  probs[i] = Min(regGradients[i] / $\mu$, 1)\n  weights[i] = 1 / probs[i]\n  idxs = Select(probs)\n  nextTree = TrainTree(X[idxs], -gradients[idxs], hessians[idxs], weights[idxs])\n  ensemble.append(nextTree)\nend for\n\n4.3 Algorithm\n\nNow we are ready to derive the MVS algorithm from Theorem 2, which can be directly applied to the general scheme of Stochastic Gradient Boosting. First, for a given sample rate s, MVS finds the threshold $\mu$ to decide which gradients are considered to be large. Example i with regularized absolute value $\sqrt{g_i^2 + \lambda h_i^2}$ higher than the chosen $\mu$ is sampled with probability equal to 1. Every object with a small gradient is sampled independently with probability $p_i = \frac{\sqrt{g_i^2 + \lambda h_i^2}}{\mu}$ and is assigned weight $w_i = \frac{1}{p_i}$. Still, it is not apparent how to find such a threshold $\mu^*$ that will give the required sampling ratio $s = s^*$.\n\nA brute-force algorithm relies on the fact that the sampling ratio has an inverse dependence on the threshold: the higher the threshold, the lower the fraction of sampled instances. First, we sort the data by regularized absolute value in descending order. 
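For a given threshold $\mu$, the per-iteration sampling step of Algorithm 1 can be sketched as follows (our own illustration, not CatBoost's implementation):

```python
import random

def mvs_step(grads, hessians, mu, lam=0.1, seed=0):
    """One MVS sampling step for a given threshold mu (cf. Theorem 2):
    p_i = min(1, sqrt(g_i^2 + lam * h_i^2) / mu), Bernoulli selection,
    and inverse-probability weights 1/p_i for the selected instances."""
    rng = random.Random(seed)
    sample = []
    for i, (g, h) in enumerate(zip(grads, hessians)):
        ghat = (g * g + lam * h * h) ** 0.5   # regularized absolute value
        p = min(1.0, ghat / mu)
        if rng.random() < p:
            sample.append((i, 1.0 / p))       # (index, IPW weight)
    return sample
```

Instances whose regularized absolute value exceeds $\mu$ are always kept with weight 1; the rest are coin-flipped and, if kept, up-weighted.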
Note that now, given a threshold $\mu$, the total sampling ratio s can be calculated as $s(\mu) = \frac{1}{\mu} \sum_{i=k+1}^{N} \sqrt{g_i^2 + \lambda h_i^2} + k$, where k + 1 is the index of the first element in the sorted sequence that is less than $\mu$. Then binary search is applied to find a threshold $\mu^*$ with the desired property $s(\mu^*) = s^*$. To speed up this algorithm, precalculation of the cumulative sums $\sum_{i=k+1}^{N} \sqrt{g_i^2 + \lambda h_i^2}$ for every k is performed, so the calculation of the sampling ratio at each step of the binary search has O(1) time complexity. The total complexity of this procedure is O(N log N), due to the sorting at the beginning. To compare, the SGB and GOSS algorithms have O(N) complexity for sampling.\n\nWe propose a more efficient algorithm, which is similar to the quick select algorithm [27]. In the beginning, the algorithm randomly selects a gradient, which is a candidate to be the threshold. The data is partitioned in such a way that all the instances with smaller gradients and larger gradients are on opposite sides of the candidate. To calculate the current sample rate, it is sufficient to calculate the number of examples on the larger side and the sum of regularized absolute values on the other side. Then, the estimated sample rate is used to determine the side on which to continue the search for the desired threshold. If the current sample rate is higher, then the algorithm searches for the threshold on the side with smaller gradients, otherwise on the side with greater ones. The calculated statistics for each side may be reused in further steps of the algorithm, so the number of operations at each step is reduced by the number of rejected examples. 
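The brute-force variant just described can be sketched as follows (our own illustration; for brevity, an exact linear scan over the sorted order replaces the binary search, which leaves the O(N log N) total unchanged since sorting dominates):

```python
def find_threshold(ghat, sample_rate):
    """Threshold search of Section 4.3 (brute-force variant): sort the
    regularized absolute values in descending order, precompute suffix
    sums, and find mu such that k + suffix_sum(k) / mu equals the desired
    expected sample count, where the k largest values get p = 1."""
    n = len(ghat)
    target = sample_rate * n
    srt = sorted(ghat, reverse=True)
    suffix = [0.0] * (n + 1)              # suffix[k] = srt[k] + ... + srt[n-1]
    for k in range(n - 1, -1, -1):
        suffix[k] = suffix[k + 1] + srt[k]
    for k in range(n):                    # k = number of deterministic picks
        if target - k <= 0:
            break
        mu = suffix[k] / (target - k)     # solves k + suffix[k] / mu = target
        # consistent iff exactly the k largest values reach probability 1
        if mu >= srt[k] and (k == 0 or mu <= srt[k - 1]):
            return mu
    return srt[-1]                        # defensive fallback (sample_rate ~ 1)
```

The probabilities $p_i = \min(1, \hat{g}_i / \mu)$ computed with the returned $\mu$ then sum to the desired expected sample size.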
The time complexity analysis can be carried out by analogy with the quick select algorithm, which results in O(N) complexity.\n\nAlgorithm 2 Calculate Threshold\nInput: candidatesArray, sumSmall, nLarge, sampleRate (initial call: all regularized gradients, sumSmall = 0, nLarge = 0)\nlength = Length(candidatesArray)\ncandidateThreshold = RandomSelect(candidatesArray)\nmid = Partition(candidatesArray, candidateThreshold)\nsmallArray = candidatesArray[1, ..., mid - 1]\nlargeArray = candidatesArray[mid + 1, ..., length]\ncurSampleRate = (Sum(smallArray) + sumSmall) / candidateThreshold + Length(largeArray) + nLarge + 1\nif Length(smallArray) == 0 and curSampleRate < sampleRate then\n  return sumSmall / (sampleRate - nLarge - Length(largeArray) - 1)\nelse if Length(largeArray) == 0 and curSampleRate > sampleRate then\n  return (sumSmall + Sum(smallArray) + candidateThreshold) / (sampleRate - nLarge)\nelse if curSampleRate > sampleRate then\n  sumSmall = sumSmall + Sum(smallArray) + candidateThreshold\n  return CalculateThreshold(largeArray, sumSmall, nLarge, sampleRate)\nelse\n  nLarge = nLarge + Length(largeArray) + 1\n  return CalculateThreshold(smallArray, sumSmall, nLarge, sampleRate)\nend if\n\n5 Experiments\n\nHere we provide experimental results of the MVS algorithm on two popular open-source implementations of gradient boosting: CatBoost and LightGBM.\n\nCatBoost. The default setting of CatBoost is known to achieve state-of-the-art quality on various machine learning tasks [29]. We implemented MVS in CatBoost and performed a benchmark comparison of MVS with a sampling ratio of 80% against default CatBoost with no sampling on 153 publicly available and proprietary binary classification datasets of different sizes, up to 45 million instances. The algorithms were compared by the ROC-AUC metric, and we calculated the number of wins for each algorithm. 
The results show a significant improvement over the existing default: 97 wins of MVS versus 55 wins of the default setting, and a +0.12% mean ROC-AUC improvement.\n\nThe source code of MVS is publicly available [6] and ready to be used as a default option of the CatBoost algorithm. The latter means that MVS is already acknowledged as a new benchmark in SGB implementations.\n\nLightGBM. To perform a fair comparison with previous sampling techniques (GOSS and SGB), MVS was also implemented in LightGBM, as it is a popular open-source library with GOSS inside. The MVS source code for LightGBM may be found at [19].\n\nDescriptions of the datasets used in this section are given in Table 1. All the datasets are publicly available and were preprocessed according to [5].\n\nDataset | # Examples | # Features\nKDD Internet [1] | 10108 | 69\nAdult [25] | 48842 | 15\nAmazon [23] | 32769 | 10\nKDD Upselling [11] | 50000 | 231\nKick prediction [22] | 72983 | 36\nKDD Churn [10] | 50000 | 231\nClick prediction [12] | 399482 | 12\nTable 1: Datasets description\n\nWe used the tuned parameters and train-test splitting for each dataset from [5] as baselines, presetting the sampling ratio to 1. For tuning the sampling parameters of each algorithm (sample rate and the $\lambda$ coefficient for MVS, large gradients fraction and small gradients fraction for GOSS, sample rate for SGB), we use 5-fold cross-validation on the train subset of the data. Then the tuned models are evaluated on the test subsets (each of which is 20% of the original size of the data). Here we use the 1 - ROC-AUC score as the error measure (lower is better). To make the results more statistically significant, the evaluation part is run 10 times with different seeds. The final result is defined as the mean over these 10 runs. Here we also introduce a hyperparameter-free modification of the MVS algorithm. 
Since $\lambda$ (see Equation 9) is an approximation of an upper bound on the squared mean leaf value, we replace it with the squared mean value of the initial leaf. As will be shown, this achieves near-optimal results and dramatically reduces the time spent on parameter tuning. Since it sets $\lambda$ adaptively at each iteration, we refer to this method as MVS Adaptive.\n\nQuality comparison. The first experiments are devoted to testing MVS as a regularization method. We state the following question: how much does the quality change when using different sampling techniques? To answer this question, we tuned the sampling parameters of the algorithms to get the best quality. These quality scores, compared to the baseline quality, are presented in Table 2. From these results, we can see that MVS demonstrates the best generalization ability among the given sampling approaches. The best parameter $\lambda$ for MVS is about $10^{-1}$; it shows good performance on most of the datasets. For GOSS, the best ratio of large and small gradients varies a lot, from the predominance of large to the predominance of small.\n\nFigure 1: 1 - ROC-AUC error versus the fraction of sampled examples per one iteration\n\nKDD Internet Adult\n\nKDD Churn Click\n\nBaseline\nSGB\nGOSS\nMVS\nMVS Adaptive\n\n0.0408\n-1.13%\n-0.64%\n-3.03%\n-2.79%\n\nAmazon KDD Upselling Kick\n0.0688\n0.1517\n+0.81% -1.14%\n-0.11% -1.23%\n-0.24% -1.78%\n-0.13% -1.57%\n\n0.1345\n+0.03%\n+0.07%\n-0.07%\n-0.28%\n\n0.2265\n-0.14%\n-0.10%\n-0.19%\n-0.19%\n\n0.2532\n+0.14%\n+0.16%\n+0.17%\n+0.07%\n\nTable 2: Baseline scores / relative error change\n\nAverage\n0.2655\n-0.0%\n-0.14% -0.22%\n-0.09%\n-0.28%\n-0.74%\n-0.04%\n-0.03%\n-0.70%\n\nSample rate | 0.02 | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 | 0.5\nSGB | +19.92% | +11.35% | +6.83% | +4.99% | +3.84% | +3.03% | +2.17% | +1.57% | +1.10% | +0.42%\nGOSS | +22.37% | +12.75% | +8.00% | +5.32% | +3.39% | +2.25% | +1.41% | +0.75% | +0.23% | -0.16%\nMVS | +13.93% | +7.76% | +3.69% | +1.91% | +0.74% | +0.14% | -0.21% | -0.43% | -0.41% | -0.45%\nMVS Adaptive | +13.72% | +7.47% | +3.71% | +1.70% | +0.55% | -0.03% | -0.07% | -0.28% | -0.32% | -0.51%\nTable 3: Relative error change, average over datasets\n\nThe next research question is whether MVS is capable of reducing the sample size per iteration needed to achieve acceptable quality and, furthermore, whether MVS is harmful to accuracy when using small subsamples. For this experiment, we tuned the parameters so that the algorithms achieve the baseline score (or their best score, if that is not possible) using the least number of instances. Figure 1 shows the dependence of the error on the sample size for two datasets, together with its ±σ confidence interval. Table 3 demonstrates the average relative error change with respect to the baseline over all datasets used in this paper. From these results, we can conclude that MVS reaches the goal of reducing the variance of the models, and a decrease in sample size affects the accuracy much less than it does for the other algorithms.\n\nLearning time comparison. To compare the speed-up ability of MVS, GOSS and SGB, we used runs from the previous experiment setting, i.e., parameters were chosen in order to have the smallest sample rate with no quality loss. Among them, we chose the ones which have the least training time (if it is impossible to beat the baseline, the best score point is chosen). The summary is shown in Table 4, which demonstrates the average learning time gain relative to the baseline learning time (using all examples). One can see that the usage of MVS has an advantage in training time over the other methods of about 10% for the datasets presented in this paper. Also, it is important to mention that tuning the hyperparameters is a main part of training a model. There is one common hyperparameter for all sampling algorithms - the sample rate. GOSS has one additional hyperparameter - the ratio of large and small gradients in the subsample, and MVS has a hyperparameter $\lambda$. 
So tuning GOSS and MVS may potentially take more time than SGB. But introducing the MVS Adaptive algorithm dramatically reduces the tuning time due to its hyperparameter-free sampling procedure, and we can conclude from Tables 2 and 3 that it achieves approximately optimal results on the test data.
Large datasets. Experiments with CatBoost show that the regularization effect of MVS is efficient for any size of the data. But for large datasets, it is more crucial to reduce the learning time of the model. To prove that MVS is efficient in accelerating the training, we use the Higgs dataset [33] (11,000,000 instances and 28 features) and the Recsys dataset [7] (16,549,802 instances and 31 features). The setup of the experiment remains the same as in the previous paragraph. For the Higgs dataset, SGB is not able to achieve the baseline quality with less than a 100% sample size, while GOSS and MVS managed to do this with 80% of the samples, and MVS was faster than GOSS (-17.7% versus -8.5%) as it converges earlier. For the Recsys dataset, the relative learning time differences are -50.3% for SGB (sample rate 20%), -39.9% for GOSS (sample rate 20%), and -61.5% for MVS (sample rate 10%).

                 SGB      GOSS     MVS
time difference  -20.7%   -20.4%   -27.7%

Table 4: Relative learning time change

6 Conclusion

In this paper, we addressed a surprisingly understudied problem of weighted sampling in GBDT. We proposed a novel technique, which directly maximizes the accuracy of split scoring, a core step of the tree construction procedure. We rigorously formulated this goal as an optimization problem and derived a near-optimal closed-form solution. This solution led to a novel sampling technique, MVS. We provided our work with the necessary theoretical statements and empirical observations that show the superiority of MVS over the well-known state-of-the-art approaches to data sampling in SGB. MVS is implemented and used by default in the CatBoost open-source library.
Also, one can find an MVS implementation in the LightGBM package [19], and its source code is publicly available for further research.

Acknowledgements

We are deeply indebted to Liudmila Prokhorenkova for her valuable contribution to the content and helpful advice about the presentation. We are also grateful to Aleksandr Vorobev for sharing ideas and support, and to Anna Veronika Dorogush and Nikita Dmitriev for assistance with the experiments.

References

[1] UCI KDD Archive. 1998. KDD Internet dataset. https://kdd.ics.uci.edu/databases/internet_usage/internet_usage.html.

[2] Leo Breiman. 1999. Using adaptive bagging to debias regressions. Technical Report 547, Statistics Dept., UCB.

[3] Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning 11, 23-581 (2010), 81.

[4] Rich Caruana and Alexandru Niculescu-Mizil. 2006. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 161–168.

[5] CatBoost. 2018. Data preprocessing. https://github.com/catboost/catboost/tree/master/catboost/benchmarks/quality_benchmarks.

[6] CatBoost. 2019. CatBoost github. https://github.com/catboost/catboost.

[7] Recsys Challenge. 2015. Recsys dataset. https://2015.recsyschallenge.com/challenge.html.

[8] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.

[9] Mark Culp, George Michailidis, and Kjell Johnson. 2011. On adaptive regularization methods in boosting. Journal of Computational and Graphical Statistics 20, 4 (2011), 937–955.

[10] KDD Cup. 2009. KDD Churn dataset. https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data.

[11] KDD Cup. 2009. KDD Upselling dataset. http://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data.

[12] KDD Cup. 2012. Click prediction dataset. http://www.kdd.org/kdd-cup/view/kdd-cup-2012-track-2.

[13] C. Daning, X. Fen, L. Shigang, and Z. Yunquan. 2018. Asynchronous parallel sampling gradient boosting decision tree. arXiv preprint arXiv:1804.04659.

[14] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2017. CatBoost: gradient boosting with categorical features support. Workshop on ML Systems at NIPS (2017).

[15] F. Fleuret and D. Geman. 2008. Stationary features and cat detection. Journal of Machine Learning Research, 2549–2578.

[16] Jerome H. Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189–1232.

[17] Jerome H. Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 4 (2002), 367–378.

[18] Daniel G. Horvitz and Donovan J. Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 260 (1952), 663–685.

[19] Bulat Ibragimov. 2019. MVS implementation. https://github.com/ibr11/LightGBM.

[20] J. Alafate and Y. Freund. 2019. Faster boosting with smaller memory. arXiv preprint arXiv:1901.09047.

[21] Tyler B. Johnson and Carlos Guestrin. 2018. Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems. 7265–7275.

[22] Kaggle. 2012. Kick prediction dataset. https://www.kaggle.com/c/DontGetKicked.

[23] Kaggle. 2013. Amazon dataset. https://www.kaggle.com/c/amazon-employee-access-challenge.

[24] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3149–3157.

[25] Ronny Kohavi and Barry Becker. [n. d.]. Adult dataset. https://archive.ics.uci.edu/ml/datasets/Adult.

[26] Gábor Lugosi, Nicolas Vayatis, et al. 2004. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics 32, 1 (2004), 30–55.

[27] Hosam M. Mahmoud, Reza Modarres, and Robert T. Smythe. 1995. Analysis of quickselect: An algorithm for order statistics. RAIRO-Theoretical Informatics and Applications-Informatique Théorique et Applications 29, 4 (1995), 255–276.

[28] David Mease and Abraham Wyner. 2008. Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research 9, Feb (2008), 131–156.

[29] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems. 6638–6648.

[30] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web. ACM, 521–530.

[31] Byron P. Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor. 2005. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 543, 2 (2005), 577–584.

[32] Y-S. Shih. 1999. Families of splitting criteria for classification trees. Statistics and Computing 9, 4 (1999), 309–315.

[33] Daniel Whiteson. 2014. Higgs dataset. https://archive.ics.uci.edu/ml/datasets/HIGGS.

[34] Qiang Wu, Christopher J. C. Burges, Krysta M. Svore, and Jianfeng Gao. 2010. Adapting boosting for information retrieval measures. Information Retrieval 13, 3 (2010), 254–270.

[35] Z. Kalal, J. Matas, and K. Mikolajczyk. 2008. Weighted sampling for large-scale boosting. In BMVC. 1–10.

[36] Yanru Zhang and Ali Haghani. 2015. A gradient boosting method to improve travel time prediction. Transportation Research Part C: Emerging Technologies 58 (2015), 308–324.

[37] Peilin Zhao and Tong Zhang. 2015. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning. 1–9.