{"title": "LightGBM: A Highly Efficient Gradient Boosting Decision Tree", "book": "Advances in Neural Information Processing Systems", "page_first": 3146, "page_last": 3154, "abstract": "Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: \\emph{Gradient-based One-Side Sampling} (GOSS) and \\emph{Exclusive Feature Bundling} (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB \\emph{LightGBM}. 
Our experiments on multiple public datasets show that, LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.", "full_text": "LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Guolin Ke1, Qi Meng2, Thomas Finley3, Taifeng Wang1, Wei Chen1, Weidong Ma1, Qiwei Ye1, Tie-Yan Liu1
1 Microsoft Research   2 Peking University   3 Microsoft Redmond
1 {guolin.ke, taifengw, wche, weima, qiwye, tie-yan.liu}@microsoft.com   2 qimeng13@pku.edu.cn   3 tfinely@microsoft.com

Abstract

Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and the data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite an accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously) to reduce the number of features.
We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite a good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.

1 Introduction

Gradient boosting decision tree (GBDT) [1] is a widely-used machine learning algorithm, due to its efficiency, accuracy, and interpretability. GBDT achieves state-of-the-art performance in many machine learning tasks, such as multi-class classification [2], click prediction [3], and learning to rank [4]. In recent years, with the emergence of big data (in terms of both the number of features and the number of instances), GBDT is facing new challenges, especially in the tradeoff between accuracy and efficiency. Conventional implementations of GBDT need to, for every feature, scan all the data instances to estimate the information gain of all the possible split points. Therefore, their computational complexities are proportional to both the number of features and the number of instances. This makes these implementations very time consuming when handling big data.

To tackle this challenge, a straightforward idea is to reduce the number of data instances and the number of features. However, this turns out to be highly non-trivial. For example, it is unclear how to perform data sampling for GBDT. While there are some works that sample data according to their weights to speed up the training process of boosting [5, 6, 7], they cannot be directly applied to GBDT since there is no sample weight in GBDT at all.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
In this paper, we propose two novel techniques towards this goal, as elaborated below.

Gradient-based One-Side Sampling (GOSS). While there is no native weight for data instances in GBDT, we notice that data instances with different gradients play different roles in the computation of information gain. In particular, according to the definition of information gain, those instances with larger gradients1 (i.e., under-trained instances) will contribute more to the information gain. Therefore, when downsampling the data instances, in order to retain the accuracy of information gain estimation, we should keep those instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles), and only randomly drop those instances with small gradients. We prove that such a treatment can lead to a more accurate gain estimation than uniformly random sampling, with the same target sampling rate, especially when the value of information gain has a large range.

Exclusive Feature Bundling (EFB). Usually in real applications, although there are a large number of features, the feature space is quite sparse, which provides us with the possibility of designing a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include the one-hot features (e.g., one-hot word representation in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (by taking features as vertices and adding edges for every two features if they are not mutually exclusive), and solving it by a greedy algorithm with a constant approximation ratio.

We call the new GBDT algorithm with GOSS and EFB LightGBM2.
Our experiments on multiple public datasets show that LightGBM can accelerate the training process by up to over 20 times while achieving almost the same accuracy.

The remainder of this paper is organized as follows. First, we review GBDT algorithms and related work in Sec. 2. Then, we introduce the details of GOSS in Sec. 3 and EFB in Sec. 4. Our experiments for LightGBM on public datasets are presented in Sec. 5. Finally, we conclude the paper in Sec. 6.

2 Preliminaries

2.1 GBDT and Its Complexity Analysis

GBDT is an ensemble model of decision trees, which are trained in sequence [1]. In each iteration, GBDT learns the decision trees by fitting the negative gradients (also known as residual errors). The main cost in GBDT lies in learning the decision trees, and the most time-consuming part of learning a decision tree is finding the best split points. One of the most popular algorithms for finding split points is the pre-sorted algorithm [8, 9], which enumerates all possible split points on the pre-sorted feature values. This algorithm is simple and can find the optimal split points; however, it is inefficient in both training speed and memory consumption. Another popular algorithm is the histogram-based algorithm [10, 11, 12], as shown in Alg. 1. Instead of finding the split points on the sorted feature values, the histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Since the histogram-based algorithm is more efficient in both memory consumption and training speed, we develop our work on its basis.

As shown in Alg. 1, the histogram-based algorithm finds the best split points based on the feature histograms. It costs O(#data × #feature) for histogram building and O(#bin × #feature) for split point finding.
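To make these two costs concrete, the per-feature work of Alg. 1 can be sketched as follows. This is an illustrative NumPy sketch under our own naming, not LightGBM's actual implementation; the (Σg)²/n split score anticipates the variance gain defined in Sec. 3.2.

```python
import numpy as np

def build_histogram(bin_idx, grad, n_bins):
    """Accumulate gradient sums and counts per bin for one feature.

    bin_idx: pre-bucketed feature values (ints in [0, n_bins)).
    grad:    per-instance gradient statistics.
    This loop touches every instance once, i.e. O(#data) per feature.
    """
    hist_g = np.zeros(n_bins)
    hist_n = np.zeros(n_bins, dtype=np.int64)
    np.add.at(hist_g, bin_idx, grad)   # H[bin].y += g  (unbuffered accumulation)
    np.add.at(hist_n, bin_idx, 1)      # H[bin].n += 1
    return hist_g, hist_n

def best_split(hist_g, hist_n):
    """Scan bin boundaries and score each split by (sum g)^2 / n on each
    side, a common GBDT split criterion. Only O(#bin) work per feature."""
    total_g, total_n = hist_g.sum(), hist_n.sum()
    best_gain, best_bin = -np.inf, -1
    left_g, left_n = 0.0, 0
    for b in range(len(hist_g) - 1):
        left_g += hist_g[b]
        left_n += hist_n[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g**2 / left_n + right_g**2 / right_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

Here `build_histogram` accounts for the O(#data) term, while `best_split` only scans bin boundaries, the O(#bin) term.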
Since #bin is usually much smaller than #data, histogram building dominates the computational complexity. If we can reduce #data or #feature, we will be able to substantially speed up the training of GBDT.

1 When we say larger or smaller gradients in this paper, we refer to their absolute values.
2 The code is available at GitHub: https://github.com/Microsoft/LightGBM.
3 Due to space restrictions, high-level pseudocode is used. The details can be found in our open-source code.
4 There are some other works that speed up GBDT training via GPU [17, 18] or parallel training [19]. However, they are out of the scope of this paper.

2.2 Related Work

There have been quite a few implementations of GBDT in the literature, including XGBoost [13], pGBRT [14], scikit-learn [15], and gbm in R [16]4. Scikit-learn and gbm in R implement the pre-sorted algorithm, and pGBRT implements the histogram-based algorithm. XGBoost supports both the pre-sorted algorithm and the histogram-based algorithm. As shown in [13], XGBoost outperforms the other tools. So, we use XGBoost as our baseline in the experiment section.

To reduce the size of the training data, a common approach is to downsample the data instances. For example, in [5], data instances are filtered if their weights are smaller than a fixed threshold. SGB [20] uses a random subset to train the weak learners in every iteration. In [6], the sampling ratio is dynamically adjusted during training. However, all these works except SGB [20] are based on AdaBoost [21], and cannot be directly applied to GBDT since there are no native weights for data instances in GBDT. Though SGB can be applied to GBDT, it usually hurts accuracy and thus is not a desirable choice.

Similarly, to reduce the number of features, it is natural to filter weak features [22, 23, 7, 24]. This is usually done by principal component analysis or projection pursuit.
However, these approaches highly rely on the assumption that features contain significant redundancy, which might not always be true in practice (features are usually designed with their own unique contributions, and removing any of them may affect the training accuracy to some degree).

The large-scale datasets used in real applications are usually quite sparse. GBDT with the pre-sorted algorithm can reduce the training cost by ignoring the features with zero values [13]. However, GBDT with the histogram-based algorithm does not have efficient sparse optimization solutions. The reason is that the histogram-based algorithm needs to retrieve feature bin values (refer to Alg. 1) for each data instance regardless of whether the feature value is zero or not. It would be highly preferable for GBDT with the histogram-based algorithm to effectively leverage such a sparse property.

To address the limitations of previous works, we propose two novel techniques called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
More details will be introduced in the next sections.

Algorithm 1: Histogram-based Algorithm
Input: I: training data, d: max depth
Input: m: feature dimension
nodeSet ← {0}                ▷ tree nodes in current level
rowSet ← {{0, 1, 2, ...}}    ▷ data indices in tree nodes
for i = 1 to d do
    for node in nodeSet do
        usedRows ← rowSet[node]
        for k = 1 to m do
            H ← new Histogram()    ▷ build histogram
            for j in usedRows do
                bin ← I.f[k][j].bin
                H[bin].y ← H[bin].y + I.y[j]
                H[bin].n ← H[bin].n + 1
            Find the best split on histogram H.
    Update rowSet and nodeSet according to the best split points.

Algorithm 2: Gradient-based One-Side Sampling
Input: I: training data, d: iterations
Input: a: sampling ratio of large gradient data
Input: b: sampling ratio of small gradient data
Input: loss: loss function, L: weak learner
models ← {}, fact ← (1 − a)/b
topN ← a × len(I), randN ← b × len(I)
for i = 1 to d do
    preds ← models.predict(I)
    g ← loss(I, preds), w ← {1, 1, ...}
    sorted ← GetSortedIndices(abs(g))
    topSet ← sorted[1:topN]
    randSet ← RandomPick(sorted[topN:len(I)], randN)
    usedSet ← topSet + randSet
    w[randSet] ×= fact    ▷ assign weight fact to the small-gradient data
    newModel ← L(I[usedSet], −g[usedSet], w[usedSet])
    models.append(newModel)

3 Gradient-based One-Side Sampling

In this section, we propose a novel sampling method for GBDT that can achieve a good balance between reducing the number of data instances and keeping the accuracy of the learned decision trees.

3.1 Algorithm Description

In AdaBoost, the sample weight serves as a good indicator of the importance of a data instance. However, in GBDT, there are no native sample weights, and thus the sampling methods proposed for AdaBoost cannot be directly applied.
Fortunately, we notice that the gradient of each data instance in GBDT provides useful information for data sampling. That is, if an instance is associated with a small gradient, the training error for this instance is small and it is already well-trained. A straightforward idea is to discard those data instances with small gradients. However, doing so would change the data distribution, which would hurt the accuracy of the learned model. To avoid this problem, we propose a new method called Gradient-based One-Side Sampling (GOSS).

GOSS keeps all the instances with large gradients and performs random sampling on the instances with small gradients. In order to compensate for the influence on the data distribution, when computing the information gain, GOSS introduces a constant multiplier for the data instances with small gradients (see Alg. 2). Specifically, GOSS first sorts the data instances according to the absolute value of their gradients and selects the top a × 100% instances. Then it randomly samples b × 100% instances from the rest of the data. After that, GOSS amplifies the sampled data with small gradients by a constant (1 − a)/b when calculating the information gain. By doing so, we put more focus on the under-trained instances without changing the original data distribution by much.

3.2 Theoretical Analysis

GBDT uses decision trees to learn a function from the input space X^s to the gradient space G [1]. Suppose that we have a training set with n i.i.d. instances {x_1, ..., x_n}, where each x_i is a vector with dimension s in space X^s. In each iteration of gradient boosting, the negative gradients of the loss function with respect to the output of the model are denoted as {g_1, ..., g_n}. The decision tree model splits each node at the most informative feature (with the largest information gain).
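Before turning to the analysis, the sampling step of Alg. 2 can be sketched as follows. This is a rough NumPy illustration under our own naming, not LightGBM's actual code.

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, seed=0):
    """Gradient-based One-Side Sampling (sketch).

    Keeps the top a*100% of instances by |gradient|, randomly samples
    b*100% of the remainder, and up-weights the sampled small-gradient
    instances by (1 - a) / b so that gradient sums over the sample
    remain approximately unbiased when computing split gains.
    """
    rng = np.random.default_rng(seed)
    n = len(grad)
    top_n = int(a * n)
    rand_n = int(b * n)
    order = np.argsort(-np.abs(grad))                 # descending |g|
    top_set = order[:top_n]                           # kept deterministically
    rand_set = rng.choice(order[top_n:], size=rand_n, replace=False)
    used = np.concatenate([top_set, rand_set])
    weights = np.ones(n)
    weights[rand_set] *= (1.0 - a) / b                # amplify small-gradient part
    return used, weights
```

For example, with n = 1000, a = 0.2 and b = 0.1, the weak learner is fit on 300 instances, and each sampled small-gradient instance carries weight (1 − 0.2)/0.1 = 8.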
For GBDT, the information gain is usually measured by the variance after splitting, which is defined as below.

Definition 3.1 Let O be the training dataset on a fixed node of the decision tree. The variance gain of splitting feature j at point d for this node is defined as

$$V_{j|O}(d) = \frac{1}{n_O}\left(\frac{\big(\sum_{\{x_i \in O:\, x_{ij} \le d\}} g_i\big)^2}{n^j_{l|O}(d)} + \frac{\big(\sum_{\{x_i \in O:\, x_{ij} > d\}} g_i\big)^2}{n^j_{r|O}(d)}\right),$$

where $n_O = \sum I[x_i \in O]$, $n^j_{l|O}(d) = \sum I[x_i \in O: x_{ij} \le d]$ and $n^j_{r|O}(d) = \sum I[x_i \in O: x_{ij} > d]$.

For feature j, the decision tree algorithm selects $d^*_j = \mathrm{argmax}_d V_j(d)$ and calculates the largest gain $V_j(d^*_j)$.5 Then, the data are split according to feature $j^*$ at point $d_{j^*}$ into the left and right child nodes.

In our proposed GOSS method, first, we rank the training instances according to the absolute values of their gradients in descending order; second, we keep the top a × 100% instances with the larger gradients as an instance subset A; then, for the remaining set $A^c$ consisting of the (1 − a) × 100% instances with smaller gradients, we further randomly sample a subset B with size b × |A^c|; finally, we split the instances according to the estimated variance gain $\tilde{V}_j(d)$ over the subset A ∪ B, i.e.,

$$\tilde{V}_j(d) = \frac{1}{n}\left(\frac{\big(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\big)^2}{n^j_l(d)} + \frac{\big(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\big)^2}{n^j_r(d)}\right), \qquad (1)$$

where $A_l = \{x_i \in A: x_{ij} \le d\}$, $A_r = \{x_i \in A: x_{ij} > d\}$, $B_l = \{x_i \in B: x_{ij} \le d\}$, $B_r = \{x_i \in B: x_{ij} > d\}$, and the coefficient $\frac{1-a}{b}$ is used to normalize the sum of the gradients over B back to the size of $A^c$.

Thus, in GOSS, we use the estimated $\tilde{V}_j(d)$ over a smaller instance subset, instead of the
accurate $V_j(d)$ over all the instances to determine the split point, and the computation cost can be largely reduced. More importantly, the following theorem indicates that GOSS will not lose much training accuracy and will outperform random sampling. Due to space restrictions, we leave the proof of the theorem to the supplementary materials.

Theorem 3.2 We denote the approximation error in GOSS as $\mathcal{E}(d) = |\tilde{V}_j(d) - V_j(d)|$ and $\bar{g}^j_l(d) = \frac{\sum_{x_i \in (A \cup A^c)_l} |g_i|}{n^j_l(d)}$, $\bar{g}^j_r(d) = \frac{\sum_{x_i \in (A \cup A^c)_r} |g_i|}{n^j_r(d)}$. With probability at least $1 - \delta$, we have

$$\mathcal{E}(d) \le C^2_{a,b} \ln(1/\delta) \cdot \max\left\{\frac{1}{n^j_l(d)}, \frac{1}{n^j_r(d)}\right\} + 2DC_{a,b}\sqrt{\frac{\ln(1/\delta)}{n}}, \qquad (2)$$

where $C_{a,b} = \frac{1-a}{\sqrt{b}} \max_{x_i \in A^c} |g_i|$, and $D = \max(\bar{g}^j_l(d), \bar{g}^j_r(d))$.

According to the theorem, we have the following discussions: (1) The asymptotic approximation ratio of GOSS is $O\big(\frac{1}{n^j_l(d)} + \frac{1}{n^j_r(d)} + \frac{1}{\sqrt{n}}\big)$. If the split is not too unbalanced (i.e., $n^j_l(d) \ge O(\sqrt{n})$ and $n^j_r(d) \ge O(\sqrt{n})$), the approximation error will be dominated by the second term of Ineq. (2), which decreases to 0 in $O(\sqrt{n})$ with $n \to \infty$. That means when the number of data instances is large, the approximation is quite accurate. (2) Random sampling is a special case of GOSS with a = 0. In many cases, GOSS can outperform random sampling, under the condition $C_{0,\beta} > C_{a,\beta-a}$, which is equivalent to $\frac{\alpha_a}{\sqrt{\beta}} > \frac{1-a}{\sqrt{\beta-a}}$ with $\alpha_a = \max_{x_i \in A \cup A^c} |g_i| / \max_{x_i \in A^c} |g_i|$.

5 Our following analysis holds for an arbitrary node. For simplicity and without confusion, we omit the sub-index O in all the notations.

Next, we analyze the generalization performance in GOSS.
We consider the generalization error in GOSS, $\mathcal{E}^{GOSS}_{gen}(d) = |\tilde{V}_j(d) - V_*(d)|$, which is the gap between the variance gain calculated by the sampled training instances in GOSS and the true variance gain for the underlying distribution. We have $\mathcal{E}^{GOSS}_{gen}(d) \le |\tilde{V}_j(d) - V_j(d)| + |V_j(d) - V_*(d)| \triangleq \mathcal{E}_{GOSS}(d) + \mathcal{E}_{gen}(d)$. Thus, the generalization error with GOSS will be close to that calculated using the full data instances if the GOSS approximation is accurate. On the other hand, sampling will increase the diversity of the base learners, which potentially helps to improve the generalization performance [24].

4 Exclusive Feature Bundling

In this section, we propose a novel method to effectively reduce the number of features.

Algorithm 3: Greedy Bundling
Input: F: features, K: max conflict count
Construct graph G
searchOrder ← G.sortByDegree()
bundles ← {}, bundlesConflict ← {}
for i in searchOrder do
    needNew ← True
    for j = 1 to len(bundles) do
        cnt ← ConflictCnt(bundles[j], F[i])
        if cnt + bundlesConflict[i] ≤ K then
            bundles[j].add(F[i]), needNew ← False
            break
    if needNew then
        Add F[i] as a new bundle to bundles
Output: bundles

Algorithm 4: Merge Exclusive Features
Input: numData: number of data
Input: F: one bundle of exclusive features
binRanges ← {0}, totalBin ← 0
for f in F do
    totalBin += f.numBin
    binRanges.append(totalBin)
newBin ← new Bin(numData)
for i = 1 to numData do
    newBin[i] ← 0
    for j = 1 to len(F) do
        if F[j].bin[i] ≠ 0 then
            newBin[i] ← F[j].bin[i] + binRanges[j]
Output: newBin, binRanges

High-dimensional data are usually very sparse. The sparsity of the feature space provides us with the possibility of designing a nearly lossless approach to reduce the number of features.
Specifically, in a sparse feature space, many features are mutually exclusive, i.e., they never take nonzero values simultaneously. We can safely bundle exclusive features into a single feature (which we call an exclusive feature bundle). With a carefully designed feature scanning algorithm, we can build the same feature histograms from the feature bundles as those from individual features. In this way, the complexity of histogram building changes from O(#data × #feature) to O(#data × #bundle), where #bundle << #feature. Then we can significantly speed up the training of GBDT without hurting the accuracy. In the following, we show how to achieve this in detail.

There are two issues to be addressed. The first is to determine which features should be bundled together. The second is how to construct the bundle.

Theorem 4.1 The problem of partitioning features into the smallest number of exclusive bundles is NP-hard.

Proof: We reduce the graph coloring problem [25] to our problem. Since the graph coloring problem is NP-hard, we can then deduce our conclusion.

Given any instance G = (V, E) of the graph coloring problem, we construct an instance of our problem as follows. Take each row of the incidence matrix of G as a feature, and get an instance of our problem with |V| features. It is easy to see that an exclusive bundle of features in our problem corresponds to a set of vertices with the same color, and vice versa. □

For the first issue, we prove in Theorem 4.1 that it is NP-hard to find the optimal bundling strategy, which indicates that it is impossible to find an exact solution within polynomial time.
In order to find a good approximation algorithm, we first reduce the optimal bundling problem to the graph coloring problem, by taking features as vertices and adding edges between every two features that are not mutually exclusive; we then use a greedy algorithm, which can produce reasonably good results (with a constant approximation ratio) for graph coloring, to produce the bundles. Furthermore, we notice that there are usually quite a few features that, although not 100% mutually exclusive, also rarely take nonzero values simultaneously. If our algorithm allows a small fraction of conflicts, we can get an even smaller number of feature bundles and further improve the computational efficiency. By a simple calculation, randomly polluting a small fraction of feature values will affect the training accuracy by at most O([(1 − γ)n]^(−2/3)) (see Proposition 2.1 in the supplementary materials), where γ is the maximal conflict rate in each bundle. So, if we choose a relatively small γ, we will be able to achieve a good balance between accuracy and efficiency.

Based on the above discussion, we design an algorithm for exclusive feature bundling as shown in Alg. 3. First, we construct a graph with weighted edges, whose weights correspond to the total conflicts between features. Second, we sort the features by their degrees in the graph in descending order. Finally, we check each feature in the ordered list, and either assign it to an existing bundle with a small conflict (controlled by γ), or create a new bundle. The time complexity of Alg. 3 is O(#feature²) and it is processed only once before training. This complexity is acceptable when the number of features is not very large, but may still suffer if there are millions of features.
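A minimal sketch of this greedy assignment follows. It is our own simplification, not LightGBM's implementation: conflicts are counted directly on a dense matrix of feature columns, and features are ordered by nonzero count (the graph-free ordering also discussed in this section) rather than by graph degree.

```python
import numpy as np

def greedy_bundle(X, max_conflict=0):
    """Greedy Exclusive Feature Bundling (sketch of Alg. 3).

    X: (n_samples, n_features) array; a "conflict" between a feature and
    a bundle is a row where both are nonzero. Features are scanned in
    descending order of nonzero count and placed into the first bundle
    whose accumulated conflicts stay <= max_conflict (the K of Alg. 3).
    """
    nz = X != 0
    order = np.argsort(-nz.sum(axis=0))        # most nonzeros first
    bundles, conflicts = [], []
    for f in order:
        placed = False
        for i, bundle in enumerate(bundles):
            # rows where feature f collides with anything already bundled
            cnt = int(np.logical_and(nz[:, f], nz[:, bundle].any(axis=1)).sum())
            if conflicts[i] + cnt <= max_conflict:
                bundle.append(int(f))
                conflicts[i] += cnt
                placed = True
                break
        if not placed:
            bundles.append([int(f)])
            conflicts.append(0)
    return bundles
```

With `max_conflict = 0` this recovers exact exclusive bundles; raising it corresponds to allowing the small conflict rate γ above.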
To further improve the efficiency, we propose a more efficient ordering strategy that does not require building the graph: ordering by the count of nonzero values, which is similar to ordering by degrees since more nonzero values usually lead to a higher probability of conflicts. Since we only alter the ordering strategy in Alg. 3, the details of the new algorithm are omitted to avoid duplication.

For the second issue, we need a good way of merging the features in the same bundle in order to reduce the corresponding training complexity. The key is to ensure that the values of the original features can be identified from the feature bundles. Since the histogram-based algorithm stores discrete bins instead of continuous values of the features, we can construct a feature bundle by letting exclusive features reside in different bins. This can be done by adding offsets to the original values of the features. For example, suppose we have two features in a feature bundle. Originally, feature A takes values from [0, 10) and feature B takes values from [0, 20). We then add an offset of 10 to the values of feature B so that the refined feature takes values from [10, 30). After that, it is safe to merge features A and B, and use a feature bundle with range [0, 30) to replace the original features A and B. The detailed algorithm is shown in Alg. 4.

The EFB algorithm can bundle many exclusive features into much fewer dense features, which can effectively avoid unnecessary computation for zero feature values. Actually, we can also optimize the basic histogram-based algorithm to ignore zero feature values by using a table for each feature to record the data with nonzero values. By scanning the data in this table, the cost of histogram building for a feature changes from O(#data) to O(#non_zero_data). However, this method needs additional memory and computation cost to maintain these per-feature tables during the whole tree growth process.
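The offset-based merge of Alg. 4, applied to bundles like the feature A/B example above, might be sketched as follows. This is illustrative only: we approximate each feature's bin count by its maximum bin index plus one, and treat bin value 0 as "feature inactive", both of which are our assumptions rather than LightGBM's exact bookkeeping.

```python
import numpy as np

def merge_bundle(bin_columns):
    """Merge one bundle of exclusive features into a single bin column
    (sketch of Alg. 4). Each input column holds per-instance bin indices,
    with 0 meaning the feature is not active. Offsets keep the original
    features separable inside the merged histogram.
    """
    num_data = len(bin_columns[0])
    offsets, total = [0], 0
    for col in bin_columns:
        total += int(col.max()) + 1      # this feature's bin range (assumed)
        offsets.append(total)
    merged = np.zeros(num_data, dtype=np.int64)
    for j, col in enumerate(bin_columns):
        active = col != 0
        merged[active] = col[active] + offsets[j]   # shift into own sub-range
    return merged, offsets
```

Each original value remains recoverable as `merged - offsets[j]` for the feature j whose offset sub-range contains the merged bin, which is what makes the merge nearly lossless.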
We implement this optimization in LightGBM as a basic function. Note that this optimization does not conflict with EFB, since we can still use it when the bundles are sparse.

5 Experiments

In this section, we report the experimental results for our proposed LightGBM algorithm. We use five different datasets, all publicly available. The details of these datasets are listed in Table 1. Among them, the Microsoft Learning to Rank (LETOR) [26] dataset contains 30K web search queries. The features used in this dataset are mostly dense numerical features. The Allstate Insurance Claim [27] and the Flight Delay [28] datasets both contain a lot of one-hot coded features. The last two datasets are from KDD CUP 2010 and KDD CUP 2012. We directly use the features prepared by the winning solution from NTU [29, 30, 31], which contain both dense and sparse features, and these two datasets are very large. These datasets are large, include both sparse and dense features, and cover many real-world tasks, so we can use them to test our algorithm thoroughly.

Our experimental environment is a Linux server with two E5-2670 v3 CPUs (24 cores in total) and 256 GB of memory. All experiments run with multi-threading, and the number of threads is fixed to 16.

5.1 Overall Comparison

We present the overall comparisons in this subsection. XGBoost [13] and LightGBM without GOSS and EFB (called lgb_baseline) are used as baselines. For XGBoost, we used two versions, xgb_exa (pre-sorted algorithm) and xgb_his (histogram-based algorithm). For xgb_his, lgb_baseline, and LightGBM, we used the leaf-wise tree growth strategy [32].
For xgb_exa, since it only supports the layer-wise growth strategy, we tuned its parameters to let it grow similar trees to the other methods. We also tuned the parameters for all datasets towards a better balance between speed and accuracy. We set a = 0.05, b = 0.05 for Allstate, KDD10 and KDD12, and a = 0.1, b = 0.1 for Flight Delay and LETOR. We set γ = 0 in EFB.

Table 1: Datasets used in the experiments.

Name          #data   #feature   Description   Task                    Metric
Allstate      12M     4228       Sparse        Binary classification   AUC
Flight Delay  10M     700        Sparse        Binary classification   AUC
LETOR         2M      136        Dense         Ranking                 NDCG [4]
KDD10         19M     29M        Sparse        Binary classification   AUC
KDD12         119M    54M        Sparse        Binary classification   AUC

Table 2: Overall training time cost comparison. LightGBM is lgb_baseline with GOSS and EFB. EFB_only is lgb_baseline with EFB. The values in the table are the average time cost (seconds) for training one iteration.

              xgb_exa   xgb_his   lgb_baseline   EFB_only   LightGBM
Allstate      10.85     2.63      6.07           0.71       0.28
Flight Delay  5.94      1.05      1.39           0.27       0.22
LETOR         5.55      0.63      0.49           0.46       0.31
KDD10         108.27    OOM       39.85          6.33       2.85
KDD12         191.99    OOM       168.26         20.23      12.67

Table 3: Overall accuracy comparison on test datasets. AUC is used for the classification tasks and NDCG@10 for the ranking task. SGB is lgb_baseline with Stochastic Gradient Boosting, and its sampling ratio is the same as LightGBM's.

              xgb_exa   xgb_his   lgb_baseline   SGB             LightGBM
Allstate      0.6070    0.6089    0.6093         0.6064 ± 7e-4   0.6093 ± 9e-5
Flight Delay  0.7601    0.7840    0.7847         0.7780 ± 8e-4   0.7846 ± 4e-5
LETOR         0.4977    0.4982    0.5277         0.5239 ± 6e-4   0.5275 ± 5e-4
KDD10         0.7796    OOM       0.78735        0.7759 ± 3e-4   0.78732 ± 1e-4
KDD12         0.7029    OOM       0.7049         0.6989 ± 8e-4   0.7051 ± 5e-5
All algorithms are run for \ufb01xed iterations, and\nwe get the accuracy results from the iteration with the best score.6\n\nFigure 1: Time-AUC curve on Flight Delay.\n\nFigure 2: Time-NDCG curve on LETOR.\n\nThe training time and test accuracy are summarized in Table 2 and Table 3 respectively. From these\nresults, we can see that LightGBM is the fastest while maintaining almost the same accuracy as\nbaselines. The xgb_exa is based on the pre-sorted algorithm, which is quite slow comparing with\nhistogram-base algorithms. By comparing with lgb_baseline, LightGBM speed up 21x, 6x, 1.6x,\n14x and 13x respectively on the Allstate, Flight Delay, LETOR, KDD10 and KDD12 datasets. Since\nxgb_his is quite memory consuming, it cannot run successfully on KDD10 and KDD12 datasets\ndue to out-of-memory. On the remaining datasets, LightGBM are all faster, up to 9x speed-up is\nachieved on the Allstate dataset. The speed-up is calculated based on training time per iteration since\nall algorithms converge after similar number of iterations. To demonstrate the overall training process,\nwe also show the training curves based on wall clock time on Flight Delay and LETOR in the Fig. 1\n\n6Due to space restrictions, we leave the details of parameter settings to the supplementary material.\n\n7\n\n02004006008001000Time(s)0.730.740.750.760.770.780.79AUCLightGBMlgb_baselinexgb_hisxgb_exa050100150200250300350400Time(s)0.400.420.440.460.480.500.52NDCG@10LightGBMlgb_baselinexgb_hisxgb_exa\fTable 4: Accuracy comparison on LETOR dataset for GOSS and SGB under different sampling ratios.\nWe ensure all experiments reach the convergence points by using large iterations with early stopping.\nThe standard deviations on different settings are small. 
The settings of a and b for GOSS can be found in the supplementary materials.

Sampling ratio   0.1      0.15     0.2      0.25     0.3      0.35     0.4
SGB              0.5182   0.5216   0.5239   0.5249   0.5252   0.5263   0.5267
GOSS             0.5224   0.5256   0.5275   0.5284   0.5289   0.5293   0.5296

and Fig. 2, respectively. To save space, we put the remaining training curves of the other datasets in the supplementary material.
On all datasets, LightGBM achieves almost the same test accuracy as the baselines. This indicates that neither GOSS nor EFB hurts accuracy while bringing significant speed-ups, which is consistent with our theoretical analysis in Sec. 3.2 and Sec. 4.
LightGBM achieves quite different speed-up ratios on these datasets. Since the overall speed-up comes from the combination of GOSS and EFB, we break down the contribution and discuss the effectiveness of GOSS and EFB separately in the next sections.
5.2 Analysis on GOSS
First, we study the speed-up ability of GOSS. From the comparison of LightGBM and EFB_only (LightGBM without GOSS) in Table 2, we can see that GOSS brings a nearly 2x speed-up on its own while using only 10% - 20% of the data. GOSS learns the trees using only the sampled data. However, it retains some computations on the full dataset, such as conducting predictions and computing gradients. Thus, the overall speed-up is not linearly correlated with the percentage of sampled data. Nevertheless, the speed-up brought by GOSS is still very significant, and the technique is universally applicable to different datasets.
Second, we evaluate the accuracy of GOSS by comparing it with Stochastic Gradient Boosting (SGB) [20]. Without loss of generality, we use the LETOR dataset for this test. We tune the sampling ratio by choosing different a and b in GOSS, and use the same overall sampling ratio for SGB. We run these settings until convergence using early stopping. The results are shown in Table 4.
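The qualitative difference between the two sampling schemes can be illustrated on synthetic gradients. The gradient distribution and sample sizes below are toy assumptions chosen for illustration, not data from the paper's experiments; the sketch compares how much of the total |gradient| mass each scheme retains at the same overall sampling ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy gradients: most instances are nearly fit (small gradients),
# a few are still under-trained (large gradients).
g = np.concatenate([rng.normal(0, 0.05, 9_500), rng.normal(0, 2.0, 500)])

n = len(g)
ratio = 0.2        # same overall sampling ratio for both schemes
a, b = 0.1, 0.1    # GOSS: top-a fraction by |g| plus a random b of the rest

# SGB-style uniform sampling.
sgb_idx = rng.choice(n, size=int(ratio * n), replace=False)

# GOSS: keep all large-gradient instances, subsample the rest.
order = np.argsort(-np.abs(g))
goss_idx = np.concatenate([
    order[:int(a * n)],
    rng.choice(order[int(a * n):], size=int(b * n), replace=False),
])

# Fraction of the total |gradient| mass each sample retains.
mass = np.abs(g).sum()
sgb_mass = np.abs(g[sgb_idx]).sum() / mass
goss_mass = np.abs(g[goss_idx]).sum() / mass
```

Under this toy distribution, uniform sampling keeps roughly the sampling ratio of the gradient mass, while GOSS keeps the bulk of it, which is the intuition behind its more accurate gain estimates at equal sampling ratios.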
We can see that the accuracy of GOSS is always better than that of SGB when using the same sampling ratio. These results are consistent with our discussions in Sec. 3.2. All the experiments demonstrate that GOSS is a more effective sampling method than stochastic sampling.
5.3 Analysis on EFB
We check the contribution of EFB to the speed-up by comparing lgb_baseline with EFB_only. The results are shown in Table 2. Here we do not allow conflicts in the bundle-finding process (i.e., γ = 0).7 We find that EFB helps achieve a significant speed-up on the large-scale datasets.
Please note that lgb_baseline has already been optimized for sparse features, yet EFB can still speed up training by a large factor. This is because EFB merges many sparse features (both the one-hot encoded features and the implicitly exclusive features) into far fewer features. The basic sparse-feature optimization is subsumed by the bundling process; however, EFB does not incur the additional cost of maintaining a nonzero-data table for each feature during tree learning. Moreover, since many previously isolated features are bundled together, EFB increases spatial locality and improves the cache hit rate significantly. Therefore, the overall improvement in efficiency is dramatic. Based on the above analysis, EFB is a very effective way to leverage sparsity in the histogram-based algorithm, and it brings a significant speed-up to the GBDT training process.
6 Conclusion
In this paper, we have proposed a novel GBDT algorithm called LightGBM, which contains two novel techniques: Gradient-based One-Side Sampling and Exclusive Feature Bundling, to deal with the large number of data instances and the large number of features, respectively. We have performed both theoretical analysis and experimental studies on these two techniques.
The experimental results are consistent with the theory and show that, with the help of GOSS and EFB, LightGBM can significantly outperform XGBoost and SGB in terms of computational speed and memory consumption. For future work, we will study the optimal selection of a and b in Gradient-based One-Side Sampling, and continue improving the performance of Exclusive Feature Bundling to deal with a large number of features, whether they are sparse or not.

7We put our detailed study on γ tuning in the supplementary materials.

References
[1] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189-1232, 2001.
[2] Ping Li. Robust logitboost and adaptive base class (abc) logitboost. arXiv preprint arXiv:1203.3491, 2012.
[3] Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521-530. ACM, 2007.
[4] Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11(23-581):81, 2010.
[5] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337-407, 2000.
[6] Charles Dubout and François Fleuret. Boosting with maximum adaptive sampling. In Advances in Neural Information Processing Systems, pages 1332-1340, 2011.
[7] Ron Appel, Thomas J Fuchs, Piotr Dollár, and Pietro Perona. Quickly boosting decision trees - pruning underachieving features early. In ICML (3), pages 594-602, 2013.
[8] Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. Sliq: A fast scalable classifier for data mining. In International Conference on Extending Database Technology, pages 18-32.
Springer, 1996.
[9] John Shafer, Rakesh Agrawal, and Manish Mehta. Sprint: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555. Citeseer, 1996.
[10] Sanjay Ranka and V Singh. Clouds: A decision tree classifier for large datasets. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference, pages 2-8, 1998.
[11] Ruoming Jin and Gagan Agrawal. Communication and memory efficient parallel decision tree construction. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 119-129. SIAM, 2003.
[12] Ping Li, Christopher JC Burges, Qiang Wu, JC Platt, D Koller, Y Singer, and S Roweis. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, volume 7, pages 845-852, 2007.
[13] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.
[14] Stephen Tyree, Kilian Q Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387-396. ACM, 2011.
[15] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[16] Greg Ridgeway. Generalized boosted models: A guide to the gbm package. Update, 1(1):2007, 2007.
[17] Huan Zhang, Si Si, and Cho-Jui Hsieh. Gpu-acceleration for large-scale tree boosting. arXiv preprint arXiv:1706.08359, 2017.
[18] Rory Mitchell and Eibe Frank. Accelerating the xgboost algorithm using gpu computing.
PeerJ Preprints, 5:e2911v1, 2017.
[19] Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, and Tie-Yan Liu. A communication-efficient parallel algorithm for decision tree. In Advances in Neural Information Processing Systems, pages 1271-1279, 2016.
[20] Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.
[21] Michael Collins, Robert E Schapire, and Yoram Singer. Logistic regression, adaboost and bregman distances. Machine Learning, 48(1-3):253-285, 2002.
[22] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.
[23] Luis O Jimenez and David A Landgrebe. Hyperspectral data analysis and supervised feature reduction via projection pursuit. IEEE Transactions on Geoscience and Remote Sensing, 37(6):2653-2667, 1999.
[24] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC Press, 2012.
[25] Tommy R Jensen and Bjarne Toft. Graph coloring problems, volume 39. John Wiley & Sons, 2011.
[26] Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597, 2013.
[27] Allstate claim data, https://www.kaggle.com/c/ClaimPredictionChallenge.
[28] Flight delay data, https://github.com/szilard/benchm-ml#data.
[29] Hsiang-Fu Yu, Hung-Yi Lo, Hsun-Ping Hsieh, Jing-Kai Lou, Todd G McKenzie, Jung-Wei Chou, Po-Han Chung, Chia-Hua Ho, Chun-Fu Chang, Yin-Hsuan Wei, et al. Feature engineering and classifier ensemble for kdd cup 2010. In KDD Cup, 2010.
[30] Kuan-Wei Wu, Chun-Sung Ferng, Chia-Hua Ho, An-Chun Liang, Chun-Heng Huang, Wei-Yuan Shen, Jyun-Yu Jiang, Ming-Hao Yang, Ting-Wei Lin, Ching-Pei Lee, et al. A two-stage ensemble of diverse models for advertisement ranking in kdd cup 2012. In KDD Cup, 2012.
[31] Libsvm binary classification data, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
[32] Haijian Shi. Best-first decision tree learning.
PhD thesis, The University of Waikato, 2007.