{"title": "A Communication-Efficient Parallel Algorithm for Decision Tree", "book": "Advances in Neural Information Processing Systems", "page_first": 1279, "page_last": 1287, "abstract": "Decision tree (and its extensions such as Gradient Boosting Decision Trees and Random Forest) is a widely used machine learning algorithm, due to its practical effectiveness and model interpretability. With the emergence of big data, there is an increasing need to parallelize the training process of decision tree. However, most existing attempts along this line suffer from high communication costs. In this paper, we propose a new algorithm, called \\emph{Parallel Voting Decision Tree (PV-Tree)}, to tackle this challenge. After partitioning the training data onto a number of (e.g., $M$) machines, this algorithm performs both local voting and global voting in each iteration. For local voting, the top-$k$ attributes are selected from each machine according to its local data. Then, the indices of these top attributes are aggregated by a server, and the globally top-$2k$ attributes are determined by a majority voting among these local candidates. Finally, the full-grained histograms of the globally top-$2k$ attributes are collected from local machines in order to identify the best (most informative) attribute and its split point. PV-Tree can achieve a very low communication cost (independent of the total number of attributes) and thus can scale out very well. Furthermore, theoretical analysis shows that this algorithm can learn a near optimal decision tree, since it can find the best attribute with a large probability. 
Our experiments on real-world datasets show that PV-Tree significantly outperforms the existing parallel decision tree algorithms in the tradeoff between accuracy and efficiency.", "full_text": "A Communication-Ef\ufb01cient Parallel Algorithm for\n\nDecision Tree\n\nQi Meng1,\u2217, Guolin Ke2,\u2217, Taifeng Wang2, Wei Chen2, Qiwei Ye2,\n\nZhi-Ming Ma3, Tie-Yan Liu2\n\n1Peking University\n\n2Microsoft Research\n\n1qimeng13@pku.edu.cn; 2{Guolin.Ke, taifengw, wche, qiwye, tie-yan.liu}@microsoft.com;\n\n3Chinese Academy of Mathematics and Systems Science\n\n3mazm@amt.ac.cn\n\nAbstract\n\nDecision tree (and its extensions such as Gradient Boosting Decision Trees and\nRandom Forest) is a widely used machine learning algorithm, due to its practical\neffectiveness and model interpretability. With the emergence of big data, there is\nan increasing need to parallelize the training process of decision tree. However,\nmost existing attempts along this line suffer from high communication costs. In\nthis paper, we propose a new algorithm, called Parallel Voting Decision Tree\n(PV-Tree), to tackle this challenge. After partitioning the training data onto a\nnumber of (e.g., M) machines, this algorithm performs both local voting and\nglobal voting in each iteration. For local voting, the top-k attributes are selected\nfrom each machine according to its local data. Then, globally top-2k attributes\nare determined by a majority voting among these local candidates. Finally, the\nfull-grained histograms of the globally top-2k attributes are collected from local\nmachines in order to identify the best (most informative) attribute and its split point.\nPV-Tree can achieve a very low communication cost (independent of the total\nnumber of attributes) and thus can scale out very well. Furthermore, theoretical\nanalysis shows that this algorithm can learn a near optimal decision tree, since it\ncan \ufb01nd the best attribute with a large probability. 
Our experiments on real-world datasets show that PV-Tree significantly outperforms the existing parallel decision tree algorithms in the trade-off between accuracy and efficiency.\n\n1 Introduction\n\nDecision tree [16] is a widely used machine learning algorithm, since it is practically effective and the rules it learns are simple and interpretable. Based on decision tree, people have developed other algorithms such as Random Forest (RF) [3] and Gradient Boosting Decision Trees (GBDT) [7], which have demonstrated very promising performance in various learning tasks [5].\nIn recent years, with the emergence of very big training data (which cannot be held on one single machine), there has been an increasing need to parallelize the training process of decision tree. To this end, there have been two major categories of attempts:2\n\n∗Denotes equal contribution. This work was done when the first author was visiting Microsoft Research Asia.\n2There is another category of works that parallelize the tasks of sub-tree training once a node is split [15], which require the training data to be moved from machine to machine many times and are thus inefficient. Moreover, there are also some other works accelerating decision tree construction by using pre-sorting [13] [19] [11] and binning [17] [8] [10], or employing a shared-memory-processors approach [12] [1]. However, they are out of our scope.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nAttribute-parallel: Training data are vertically partitioned according to the attributes and allocated to different machines; then, in each iteration, the machines work on non-overlapping sets of attributes in parallel in order to find the best attribute and its split point (suppose this best attribute is located on the i-th machine) [19] [11] [20]. This process is very efficient in terms of communication. 
However, after that, the re-partitioning of the data on machines other than the i-th machine will induce very high communication costs (proportional to the number of data samples). This is because those machines have no information about the best attribute at all, and in order to fulfill the re-partitioning, they must retrieve the partition information of every data sample from the i-th machine. Furthermore, as each worker still holds the full sample set, the partition process is not parallelized, which slows down the algorithm.\nData-parallel: Training data are horizontally partitioned according to the samples and allocated to different machines. Then the machines communicate with each other the local histograms of all attributes (according to their own data samples) in order to obtain the global attribute distributions and identify the best attribute and split point [12] [14]. It is clear that the corresponding communication cost is very high and proportional to the total number of attributes and the histogram size. To reduce the cost, in [2] and [21] [10], it was proposed to exchange quantized histograms between machines when estimating the global attribute distributions. However, this does not really solve the problem – the communication cost is still proportional to the total number of attributes, not to mention that the quantization may hurt the accuracy.\nIn this paper, we propose a new data-parallel algorithm for decision tree, called Parallel Voting Decision Tree (PV-Tree), which can achieve a much better balance between communication efficiency and accuracy. The key difference between the conventional data-parallel decision tree algorithm and PV-Tree is that the former trusts only the globally aggregated histogram information, while the latter leverages the local statistical information contained in each machine through a two-stage voting process, and thus can significantly reduce the communication cost. 
Specifically, PV-Tree contains the following steps in each iteration. 1) Local voting. On each machine, we select the top-k attributes based on its local data according to the informativeness scores (e.g., risk reduction for regression, and information gain for classification). 2) Global voting. We determine the globally top-2k attributes by a majority voting among the local candidates selected in the previous step. That is, we rank the attributes according to the number of local machines that select them, and choose the top-2k attributes from the ranked list. 3) Best attribute identification. We collect the full-grained histograms of the globally top-2k attributes from local machines in order to compute their global distributions. Then we identify the best attribute and its split point according to the informativeness scores calculated from the global distributions.\nIt is easy to see that the PV-Tree algorithm has a very low communication cost. It does not need to communicate the information of all attributes; instead, it communicates only the indices of the locally top-k attributes per machine and the histograms of the globally top-2k attributes. In other words, its communication cost is independent of the total number of attributes. This makes PV-Tree highly scalable. On the other hand, it can be proven that PV-Tree can find the best attribute with a large probability, and the probability will approach 1 regardless of k when the training data become sufficiently large. In contrast, the data-parallel algorithm based on quantized histograms could fail in finding the best attribute, since the bias introduced by histogram quantization cannot be reduced to zero even if the training data are sufficiently large.\nWe have conducted experiments on real-world datasets to evaluate the performance of PV-Tree. The experimental results show that PV-Tree has consistently higher accuracy and training speed than all the baselines we implemented. 
We further conducted experiments to evaluate the performance of PV-Tree in different settings (e.g., with different numbers of machines, different values of k). The experimental results are in accordance with our theoretical analysis.\n\n2 Decision Tree\n\nSuppose the training data set $D_n = \{(x_{i,j}, y_i); i = 1,\cdots,n, j = 1,\cdots,d\}$ are independently sampled from $\prod_{j=1}^{d} \mathcal{X}_j \times \mathcal{Y}$ according to $(\prod_{j=1}^{d} P_{X_j}) P_{Y|X}$. The goal is to learn a regression or classification model $f \in \mathcal{F}: \prod_{j=1}^{d} \mathcal{X}_j \rightarrow \mathcal{Y}$ by minimizing loss functions on the training data, which hopefully achieves accurate predictions for the unseen test data.\n\nAlgorithm 1 BuildTree\n\nDecision tree [16, 18] is a widely used model for both regression [4] and classification [18]. A typical decision tree algorithm is described in Alg 1. As can be seen, the tree growth procedure is recursive, and the nodes will not stop growing until they reach the stopping criteria. There are two important functions in the algorithm: FindBestSplit returns the best split point {attribute, threshold} of a node, and Split splits the training data according to the best split point. The details of FindBestSplit are given in Alg 2: first, histograms of the attributes are constructed (for continuous attributes, one usually converts their numerical values into finite bins for ease of computation) by going over all training data on the current node; then all bins (split points) are traversed from left to right, and leftSum and rightSum are used to accumulate the sums of the left and right parts of the split point, respectively. When selecting the best split point, an informativeness measure is adopted. 
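The left-to-right scan just described, scored with an entropy-based information gain, can be sketched in a few lines of Python (a minimal illustration, not the paper's implementation; each histogram bin is assumed to hold per-class label counts):

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy of a label-count mapping; an empty side contributes 0."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def find_best_split(histogram):
    """histogram: list of Counters, one per bin (label -> count).
    Scans bins left to right, accumulating left/right sums as in Alg 2."""
    total = Counter()
    for bin_counts in histogram:
        total.update(bin_counts)
    n = sum(total.values())
    base = entropy(total)
    left = Counter()
    best_gain, best_bin = -1.0, None
    # Splitting after the last bin would leave an empty right side, so skip it.
    for i, bin_counts in enumerate(histogram[:-1]):
        left.update(bin_counts)
        right = total - left
        nl, nr = sum(left.values()), sum(right.values())
        gain = base - (nl / n) * entropy(left) - (nr / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_bin = gain, i
    return best_bin, best_gain
```

Splitting between the two bins whose label distributions differ most yields the largest gain; for `[Counter(a=10), Counter(a=1, b=9), Counter(b=10)]` the scan picks the boundary after the first (pure) bin.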
The widely used informative measures are information gain and variance gain for classification and regression, respectively.\n\nInput: Node N, Dataset D\nif StoppingCriteria(D) then\n  N.output = Prediction(D)\nelse\n  bestSplit = FindBestSplit(D)\n  (DL, DR) = Split(D, N, bestSplit)\n  BuildTree(N.leftChild, DL)\n  BuildTree(N.rightChild, DR)\nend if\n\nDefinition 2.1 [6][16] In classification, the information gain (IG) for attribute $X_j \in [w_1, w_2]$ at node O is defined as the entropy reduction of the output Y after splitting node O by attribute $X_j$ at w, i.e.,\n\n$IG_j(w; O) = H_j - (H^l_j(w) + H^r_j(w)) = P(w_1 \le X_j \le w_2) H(Y \mid w_1 \le X_j \le w_2) - P(w_1 \le X_j < w) H(Y \mid w_1 \le X_j < w) - P(w \le X_j \le w_2) H(Y \mid w \le X_j \le w_2),$\n\nwhere $H(\cdot \mid \cdot)$ denotes the conditional entropy.\nIn regression, the variance gain (VG) for attribute $X_j \in [w_1, w_2]$ at node O is defined as the variance reduction of the output Y after splitting node O by attribute $X_j$ at w, i.e.,\n\n$VG_j(w; O) = \sigma_j - (\sigma^l_j(w) + \sigma^r_j(w)) = P(w_1 \le X_j \le w_2) Var[Y \mid w_1 \le X_j \le w_2] - P(w_1 \le X_j < w) Var[Y \mid w_1 \le X_j < w] - P(w \le X_j \le w_2) Var[Y \mid w \le X_j \le w_2],$\n\nwhere $Var[\cdot \mid \cdot]$ denotes the conditional variance.\n\n3 PV-Tree\n\nIn this section, we describe our proposed PV-Tree algorithm for parallel decision tree learning, which has a very low communication cost, and can achieve a good trade-off between communication efficiency and learning accuracy.\nPV-Tree is a data-parallel algorithm, which also partitions the training data onto M machines, just like in [2] [21]. However, its design principle is very different. In [2][21], one does not trust the local information about the attributes in each machine, and decides the best attribute and split point only based on the aggregated global histograms of the attributes. 
In contrast, in PV-Tree, we leverage the meaningful statistical information about the attributes contained in each local machine, and make decisions through a two-stage (local and then global) voting process. In this way, we can significantly reduce the communication cost, since we do not need to communicate the histogram information of all the attributes across machines; instead, we communicate only the histograms of those attributes that survive the voting process.\nThe flow of the PV-Tree algorithm is very similar to that of the standard decision tree, except for the function FindBestSplit. So we only give the new implementation of this function in Alg 3, which contains the following three steps:\nLocal Voting: We select the top-k attributes for each machine based on its local data set (according to the informativeness scores, e.g., information gain for classification and variance reduction for regression), and then exchange the indices of the selected attributes among machines. Please note that the communication cost for this step is very low, because only the indices of a small number of (i.e., k × M) attributes need to be communicated.\nGlobal Voting: We determine the globally top-2k attributes by a majority voting among all locally selected attributes in the previous step. That is, we rank the attributes according to the number of local machines that select them, and choose the top-2k attributes from the ranked list. It can be proven that when the local data are big enough to be statistically representative, there is a very high probability that the top-2k attributes obtained by this majority voting will contain the globally best attribute. Please note that this step does not induce any communication cost.\nBest Attribute Identification: We collect the full-grained histograms of the globally top-2k attributes from local machines in order to compute their global distributions. 
Then we identify the best attribute and its split point according to the informativeness scores calculated from the global distributions. Please note that the communication cost for this step is also low, because we only need to communicate the histograms of 2k pre-selected attributes (but not all attributes).3 As a result, the PV-Tree algorithm can scale very well, since its communication cost is independent of both the total number of attributes and the total number of samples in the dataset.\nIn the next section, we will provide a theoretical analysis of the accuracy guarantee of the PV-Tree algorithm.\n\nAlgorithm 2 FindBestSplit\nInput: Dataset D\nfor all X in D.Attribute do\n  // Construct histogram\n  H = new Histogram()\n  for all x in X do\n    H.binAt(x.bin).Put(x.label)\n  end for\n  // Find best split\n  leftSum = new HistogramSum()\n  for all bin in H do\n    leftSum = leftSum + H.binAt(bin)\n    rightSum = H.AllSum - leftSum\n    split.gain = CalSplitGain(leftSum, rightSum)\n    bestSplit = ChoiceBetterOne(split, bestSplit)\n  end for\nend for\nreturn bestSplit\n\nAlgorithm 3 PV-Tree_FindBestSplit\nInput: Dataset D\nlocalHistograms = ConstructHistograms(D)\n// Local voting\nsplits = []\nfor all H in localHistograms do\n  splits.Push(H.FindBestSplit())\nend for\nlocalTop = splits.TopKByGain(K)\n// Gather all candidates\nallCandidates = AllGather(localTop)\n// Global voting\nglobalTop = allCandidates.TopKByMajority(2*K)\n// Merge global histograms\nglobalHistograms = Gather(globalTop, localHistograms)\nbestSplit = globalHistograms.FindBestSplit()\nreturn bestSplit\n\n4 Theoretical Analysis\n\nIn this section, we conduct a theoretical analysis of the proposed PV-Tree algorithm. Specifically, we prove that PV-Tree can select the best (most informative) attribute with a large probability, for both classification and regression. 
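Before the analysis, the two-stage voting of Algorithm 3 can be sketched concretely as follows (a minimal single-process illustration; the machine callables and the `gain_on_full_data` scorer are hypothetical stand-ins for the per-machine histogram machinery and the merged-histogram gain computation):

```python
from collections import Counter

def global_voting(local_top_k_lists, k):
    """Step 2 of PV-Tree: rank attributes by how many machines voted
    for them, and keep the globally top-2k."""
    votes = Counter()
    for top_k in local_top_k_lists:
        votes.update(top_k)
    return [attr for attr, _ in votes.most_common(2 * k)]

def pv_tree_find_best_split(machines, k, gain_on_full_data):
    """machines: callables returning each machine's local top-k attribute
    indices; gain_on_full_data(attr) plays the role of the gain computed
    from the merged full-grained histograms of the surviving attributes."""
    local_tops = [m() for m in machines]           # local voting (parallel in practice)
    candidates = global_voting(local_tops, k)      # global voting, indices only
    return max(candidates, key=gain_on_full_data)  # best attribute identification
```

With M = 3 machines voting [3], [3], [7] and k = 1, attributes 3 and 7 survive the global vote, and the winner is decided on the merged statistics; only 2k histograms ever cross the network.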
In order to better present the theorem, we first introduce some notations.4 In classification, we denote $IG_j = \max_w IG_j(w)$, and rank $\{IG_j; j \in [d]\}$ from large to small as $\{IG_{(1)}, ..., IG_{(d)}\}$. We call attribute (1) the most informative attribute. Then, we denote $l_{(j)}(k) = \frac{|IG_{(1)} - IG_{(j)}|}{2}, \forall j \ge k+1$, to indicate the distance between the largest and the j-th largest IG. In regression, $l_{(j)}(k)$ is defined in the same way, except replacing IG with VG.\n\nTheorem 4.1 Suppose we have M local machines, and each one has n training data. PV-Tree at an arbitrary tree node with local voting size k and global majority voting size 2k will select the most informative attribute with a probability at least\n\n$\sum_{m=[M/2+1]}^{M} C_M^m \left(1 - \sum_{j=k+1}^{d} \delta_{(j)}(n,k)\right)^m \left(\sum_{j=k+1}^{d} \delta_{(j)}(n,k)\right)^{M-m},$\n\nwhere $\delta_{(j)}(n,k) = \alpha_{(j)}(n) + 4e^{-c_{(j)} n (l_{(j)}(k))^2}$, with $\lim_{n \to \infty} \alpha_{(j)}(n) = 0$ and $c_{(j)}$ a constant.\n\nDue to space restrictions, we briefly illustrate the proof idea here and leave the detailed proof to the supplementary materials. Our proof contains two parts. (1) For local voting, we find a sufficient condition to guarantee a similar rank of attributes ordered by information gain computed based on local data and on full data. Then, we derive a lower bound on the probability that the sufficient condition holds by using concentration inequalities. (2) For global voting, we select the top-2k attributes. It is easy to prove that we can select the most informative attribute as long as no less than [M/2 + 1] of all machines select it.5 Therefore, we can calculate the probability in the theorem using the binomial distribution.\n\n3As indicated by our theoretical analysis and empirical study (see the next sections), a very small k already leads to good performance in the PV-Tree algorithm.\n4Since all analyses are for one arbitrarily fixed node O, we omit the notation O here.\n\nRegarding Theorem 4.1, we have the following discussions on the factors that impact the lower bound on the probability of selecting the best attribute.\n1. Size of local training data n: Since $\delta_{(j)}(n,k)$ decreases with n, with more and more local training data, the lower bound will increase. That means, if we have sufficiently large data, PV-Tree will select the best attribute with probability almost 1.\n2. Input dimension d: It is clear that for fixed local voting size k and global voting size 2k, with d increasing, the lower bound is decreasing. Consider the case that the number of attributes becomes 100 times larger. Then the terms in the summation (from $\sum_{j=k+1}^{d}$ to $\sum_{j=k+1}^{100d}$) are roughly 100 times larger for a relatively small k. But there must be many attributes far away from attribute (1), for which $l_{(j)}(k)$ is a large number, which results in a small $\delta_{(j)}(n,k)$. Thus we can say that the bound in the theorem is not sensitive to d.\n3. Number of machines M: We assume the whole training data size N is fixed and the local data size is $n = N/M$. Then, on the one hand, as M increases, n decreases, and therefore the lower bound will decrease due to a larger $\delta_{(j)}(n,k)$. On the other hand, because the function $\sum_{m=[M/2+1]}^{M} C_M^m p^m (1-p)^{M-m}$ will approach 1 as M increases when p > 0.5 [23], the lower bound will increase. In other words, the number of machines M has a dual effect on the lower bound: with more machines, the local data size becomes smaller, which reduces the accuracy of local voting; however, it also leads to more copies of local votes and thus increases the reliability of global voting. Therefore, in terms of accuracy, there should be an optimal number of machines given fixed-size training data.6\n4. 
Local/Global voting size k/2k: The local/global voting sizes k/2k influence $l_{(j)}(k)$ and the terms in the summation in the lower bound. As k increases, $l_{(j)}(k)$ increases and the terms in the summation decrease, so the lower bound increases. But increasing k will bring more communication and computation time. Therefore, it is better to select a moderate k. For some distributions, especially distributions over a high-dimensional space, $l_{(j)}(k)$ is less sensitive to k, and we can then choose a relatively small k to save communication time.\nAs a comparison, we also prove a theorem for the data-parallel algorithm based on quantized histograms as follows (please refer to the supplementary material for its proof). The theorem basically tells us that the bias introduced by histogram quantization cannot be reduced to zero even if the training data are sufficiently large, and as a result the corresponding algorithm could fail in finding the best attribute.7 This could be the critical weakness of this algorithm in the big data scenario.\n\nTheorem 4.2 We denote the quantized histogram with b bins of the underlying distribution P as $P^b$, and that of the empirical distribution $P_n$ as $P^b_n$; the information gain of $X_j$ calculated under the distributions $P^b$ and $P^b_n$ as $IG^b_j$ and $IG^b_{n,j}$, respectively; and $f_j(b) \triangleq |IG_j - IG^b_j|$. Then, for $\epsilon \le \min_{j=1,\cdots,d} f_j(b)$, with probability at least $\delta_j(n, f_j(b) - \epsilon)$, we have $|IG^b_{n,j} - IG_j| > \epsilon$.\n\n5 Experiments\n\nIn this section, we report the experimental comparisons between PV-Tree and the baseline algorithms. We used two data sets, one for learning to rank (LTR) and the other for ad click prediction (CTR)8 (see Table 1 for details). For LTR, we extracted about 1200 numerical attributes per data sample, and used NDCG [5] as the evaluation measure. 
For CTR, we extracted about 800 numerical attributes [9], and used AUC as the evaluation measure.\n\n5In fact, the global voting size can be $\beta k$ with $\beta > 1$. Then the sufficient condition becomes that no less than $[M/\beta + 1]$ of all machines select the most informative attribute.\n6Please note that using more machines will reduce local computing time; thus the optimal number of machines may be larger in terms of speed-up.\n7The theorem for regression holds in the same way, replacing IG with VG.\n8We use private data in the LTR experiments and data from KDD Cup 2012 track 2 in the CTR experiments.\n\nTable 1: Datasets\nTask | #Train | #Test | #Attribute | Source\nLTR | 11M | 1M | 1200 | Private\nCTR | 235M | 31M | 800 | KDD Cup\n\nTable 2: Convergence time (seconds)\nTask | Sequential | Data-Parallel | Attribute-Parallel | PV-Tree\nLTR | 28690 | 32260 | 14660 | 5825\nCTR | 154112 | 9209 | 26928 | 5349\n\nAccording to recent industrial practices, a single decision tree might not be strong enough to learn an effective model for complicated tasks like ranking and click prediction. Therefore, people usually use decision tree based boosting algorithms (e.g., GBDT) to perform such tasks. In this paper, we also use GBDT as a platform to examine the efficiency and effectiveness of decision tree parallelization. That is, we used PV-Tree or other baseline algorithms to parallelize the decision tree construction process in each iteration of GBDT, and compared their performance. 
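The harness just described, GBDT with a pluggable tree learner, can be sketched as follows (an illustrative skeleton, not the actual experimental code; `build_tree` stands in for PV-Tree or any baseline, and squared loss is assumed, so the per-iteration targets are simple residuals):

```python
def gbdt_fit(X, y, build_tree, n_trees=10, learning_rate=0.1):
    """Generic GBDT loop: the tree learner is a pluggable routine, so the
    same harness compares PV-Tree against baseline parallelizations.
    For squared loss, the negative gradient is simply the residual."""
    pred = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = build_tree(X, residuals)   # the parallelized step under study
        update = tree(X)                  # the fitted tree returns its predictions
        pred = [p + learning_rate * u for p, u in zip(pred, update)]
        trees.append(tree)
    return trees, pred
```

Because only `build_tree` changes between runs, any accuracy or timing difference is attributable to the tree-parallelization strategy alone, which is the point of the comparison.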
Our experimental environment is a cluster of servers (each with 12 CPU cores and 32 GB RAM) inter-connected with 1 Gbps Ethernet. For the experiments on LTR, we used 8 machines for parallel training; for the experiments on CTR, we used 32 machines, since the dataset is much larger.\n\n5.1 Comparison with Other Parallel Decision Trees\n\nFor comparison with PV-Tree, we implemented an attribute-parallel algorithm, in which a binary vector is used to indicate the split information and is exchanged across machines. In addition, we implemented a data-parallel algorithm according to [2, 21], which can communicate both full-grained histograms and quantized histograms. All parallel algorithms and the sequential (single-machine) version are compared together.\nThe experimental results can be found in Figures 1a and 1b. From these figures, we have the following observations:\nFor LTR, since the number of data samples is relatively small, the communication of the split information about the samples does not take too much time. As a result, the attribute-parallel algorithm appears to be efficient. Since most attributes take numerical values in this dataset, the full-grained histogram has quite a lot of bins. Therefore, the data-parallel algorithm which communicates full-grained histograms is quite slow, even slower than the sequential algorithm. When reducing the bins in the histogram to 10%, the data-parallel algorithm becomes much more efficient; however, its convergence point is not good (consistent with our theory – the bias in quantized histograms leads to an accuracy drop).\nFor CTR, the attribute-parallel algorithm becomes very slow, since the number of data samples is very large. In contrast, many attributes in CTR take binary or discrete values, which makes the full-grained histogram have a limited number of bins. As a result, the data-parallel algorithm with full-grained histograms is faster than the sequential algorithm. 
The data-parallel algorithm with quantized histograms is even faster; however, its convergence point is once again not very good.\nPV-Tree reaches the best point achieved by the sequential algorithm within the shortest time in both the LTR and CTR tasks. For a more quantitative comparison of efficiency, we list the time for each algorithm (8 machines for LTR and 32 machines for CTR) to reach the convergent accuracy of the sequential algorithm in Table 2. From the table, we can see that, for LTR, it took PV-Tree 5825 seconds, while the data-parallel algorithm (with full-grained histograms9) and the attribute-parallel algorithm took 32260 and 14660 seconds, respectively. As compared with the sequential algorithm (which took 28690 seconds to converge), PV-Tree achieves a 4.9x speed-up on 8 machines. For CTR, it took PV-Tree 5349 seconds, while the data-parallel algorithm (with full-grained histograms) and the attribute-parallel algorithm took 9209 and 26928 seconds, respectively. As compared with the sequential algorithm (which took 154112 seconds to converge), PV-Tree achieves a 28.8x speed-up on 32 machines.\nWe also conducted independent experiments to get a clear comparison of the communication cost of the different parallel algorithms given some typical big-data workload settings. The results are listed in Table 3. We find that the cost of the attribute-parallel algorithm is proportional to the size of the training data N, and the cost of the data-parallel algorithm is proportional to the number of attributes d. 
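A back-of-the-envelope cost model reproduces these scaling behaviors (the byte sizes and bin counts below are illustrative assumptions, not the exact settings behind Table 3):

```python
def attribute_parallel_cost(n_samples, bits_per_sample=1):
    """Re-partitioning broadcasts roughly one bit of split side per sample."""
    return n_samples * bits_per_sample / 8  # bytes; grows with N

def data_parallel_cost(n_attributes, n_bins=255, bytes_per_bin=8, n_machines=8):
    """Every machine ships full histograms of every attribute."""
    return n_attributes * n_bins * bytes_per_bin * n_machines  # grows with d

def pv_tree_cost(k, n_bins=255, bytes_per_bin=8, n_machines=8):
    """Indices of the k local winners plus histograms of the 2k global winners."""
    indices = k * n_machines * 4  # 4-byte attribute indices
    histograms = 2 * k * n_bins * bytes_per_bin * n_machines
    return indices + histograms  # independent of both N and d
```

Only the PV-Tree term is free of both N and d, which is why its column in Table 3 stays flat while the other two columns track the data size and the attribute count.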
In contrast, the cost of PV-Tree is constant.\n\n9The data-parallel algorithm with 10% bins could not achieve the same accuracy as the sequential algorithm and thus we did not put it in the table.\n\nTable 3: Comparison of communication cost, training one tree with depth=6.\nData size | Attribute-Parallel | Data-Parallel | PV-Tree (k=15)\nN=1B, d=1200 | 750MB | 424MB | 10MB\nN=100M, d=1200 | 75MB | 424MB | 10MB\nN=1B, d=200 | 750MB | 70MB | 10MB\nN=100M, d=200 | 75MB | 70MB | 10MB\n\nTable 4: Convergence time and accuracy w.r.t. global voting parameter k for PV-Tree.\n\nk=1\n\nk=10\n9065/\n\nk=5\n9906/\n\nk=20\nLTR\n11256/\n8323/\nM=4\n0.7905 0.7909 0.7909 0.7909\nLTR\n8211/\nM=16\n0.7882 0.7893 0.7897 0.7906\nCTR\n9131/\nM=16\n0.7535 0.7538 0.7538 0.7538\nCTR\n1806/\n2133/\nM=128 0.7533 0.7536 0.7537 0.7537\n\nk=40\n9529/\n0.7909\n10320/ 12529/\n0.7909\n10309/ 10877/\n0.7538\n2564/\n0.7538\n\n2077/\n\n8131/\n\n8496/\n\n9947/\n\n9912/\n\n1745/\n\n(a) LTR, 8 machines\n\n(b) CTR, 32 machines\n\nFigure 1: Performances of different algorithms\n\n5.2 Tradeoff between Speed-up and Accuracy in PV-Tree\n\nIn the previous subsection, we have shown that PV-Tree is more efficient than the other algorithms. Here we take a closer look at PV-Tree to see how its key parameters affect the trade-off between efficiency and accuracy. According to Theorem 4.1, the following two parameters are critical to PV-Tree: the number of machines M and the voting size k.\n\n5.2.1 On Different Numbers of Machines\n\nWhen more machines join the distributed training process, the data throughput will grow larger, but the amortized training data on each machine will get smaller. When the data size on each machine becomes too small, there will be no guarantee on the accuracy of the voting procedure, according to our theorem. 
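The dual effect of M described under Theorem 4.1 can be illustrated by evaluating its majority-voting factor directly; here p plays the role of the probability that one machine's local top-k contains the best attribute (an assumed stand-in for 1 minus the summed delta terms):

```python
from math import comb

def majority_voting_success(M, p):
    """P(at least floor(M/2)+1 of M machines locally select the best
    attribute): the binomial factor in the Theorem 4.1 lower bound."""
    return sum(comb(M, m) * p**m * (1 - p)**(M - m)
               for m in range(M // 2 + 1, M + 1))
```

For p > 0.5 the factor grows toward 1 as M increases, while for p < 0.5 it shrinks; combined with the fact that p itself drops as the per-machine data get smaller, this is the trade-off that makes an intermediate M optimal.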
So it is important to set the number of machines appropriately.\nTo gain more insight into this, we conducted some additional experiments, whose results are shown in Figures 2a and 2b. From these figures, we can see that for LTR, when the number of machines grows from 2 to 8, the training process is significantly accelerated. However, when the number goes up to 16, the convergence speed is even lower than that of using 8 machines. Similar results can be observed for CTR. These observations are consistent with our theoretical findings. Please note that PV-Tree is designed for the big data scenario. Only when the entire training data are huge (and thus the distribution of the training data on each local machine can be similar to that of the entire training data) can the full power of PV-Tree be realized. Otherwise, we need to have a reasonable expectation of the speed-up, and should choose to use a smaller number of machines to parallelize the training.\n\n5.2.2 On Different Sizes of Voting\n\nIn PV-Tree, we have a parameter k, which controls the number of top attributes selected during local and global voting. Intuitively, a larger k will increase the probability of finding the globally best attribute among the local candidates; however, it also means a higher communication cost. According to our theorem, the choice of k should depend on the size of the local training data. If the size of the local training data is large, the locally best attributes will be similar to the globally best one. In this case, one can safely choose a small value of k. Otherwise, we should choose a relatively larger k. To gain more insight into this, we conducted some experiments, whose results are shown in Table 4, where M refers to the number of machines. From the table, we have the following observations. First, for both cases, in order to achieve good accuracy, one does not need to choose a large k. 
When k \u2264 40, the accuracy has been very good. Second, we find that for the cases using a small number of machines, k can be set to an even smaller value, e.g., k = 5. This is because, given fixed-size training data, when using fewer machines, the size of the training data per machine becomes larger, and thus a smaller k can already guarantee the approximation accuracy.\n\n(a) LTR (b) CTR\nFigure 2: PV-Tree on different numbers of machines\n\n(a) LTR, 8 machines (b) CTR, 32 machines\nFigure 3: Comparison with parallel boosting algorithms\n\n5.3 Comparison with Other Parallel GBDT Algorithms\n\nWhile we mainly focused on how to parallelize the decision tree construction process inside GBDT in the previous subsections, one could also parallelize GBDT in other ways. For example, in [22, 20], each machine learns its own decision tree separately, without communication. After that, these decision trees are aggregated by means of winner-takes-all or output ensemble. Although these works are not the focus of our paper, it is still interesting to compare with them.\nFor this purpose, we implemented both the algorithm proposed in [22] and that in [20]. For ease of reference, we denote them as Svore and Yu, respectively. Their performances are shown in Figures 3a and 3b. From the figures, we can see that PV-Tree outperforms both Svore and Yu: although these two algorithms converge at a similar speed to PV-Tree, they have much worse convergence points. To our understanding, these two algorithms lack solid theoretical guarantees. Since the candidate decision trees are trained separately and independently, without the necessary information exchange, they may have non-negligible bias, which leads to an accuracy drop in the end. 
In contrast, we can clearly characterize the theoretical properties of PV-Tree and use it in an appropriate setting so as to avoid an observable accuracy drop.
To sum up the experiments, with appropriately set parameters, PV-Tree achieves a very good trade-off between efficiency and accuracy, and outperforms both the other parallel decision tree algorithms and the algorithms designed specifically for GBDT parallelization.

6 Conclusions

In this paper, we proposed a novel parallel algorithm for decision tree, called Parallel Voting Decision Tree (PV-Tree), which can achieve high accuracy at a very low communication cost. Experiments on both ranking and ad click prediction indicate that PV-Tree has advantages over a number of baseline algorithms. As for future work, we plan to generalize the idea of PV-Tree to parallelize other machine learning algorithms. Furthermore, we will open-source the PV-Tree algorithm to benefit more researchers and practitioners.

References
[1] Rakesh Agrawal, Ching-Tien Ho, and Mohammed J Zaki. Parallel classification for data mining in a shared-memory multiprocessor system, 2001. US Patent 6,230,151.

[2] Yael Ben-Haim and Elad Tom-Tov. A streaming parallel decision tree algorithm. Journal of Machine Learning Research, 11:849-872, 2010.

[3] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[4] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and Regression Trees. CRC Press, 1984.

[5] Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Technical report, Microsoft Research, 2010.

[6] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics. Springer, Berlin, 2001.

[7] Jerome H Friedman. Greedy function approximation: A gradient boosting machine.
Annals of Statistics, 29(5):1189-1232, 2001.

[8] Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh. BOAT: Optimistic decision tree construction. In ACM SIGMOD Record, volume 28, pages 169-180. ACM, 1999.

[9] Michael Jahrer, A Toscher, JY Lee, J Deng, H Zhang, and J Spoelstra. Ensemble of collaborative filtering and feature engineered models for click through rate prediction. In KDDCup Workshop, 2012.

[10] Ruoming Jin and Gagan Agrawal. Communication and memory efficient parallel decision tree construction. In SDM, pages 119-129. SIAM, 2003.

[11] Mahesh V Joshi, George Karypis, and Vipin Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Parallel Processing Symposium, IPPS/SPDP 1998, pages 573-579. IEEE, 1998.

[12] Richard Kufrin. Decision trees on parallel processors. In Machine Intelligence and Pattern Recognition, volume 20, pages 279-306. Elsevier, 1997.

[13] Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. SLIQ: A fast scalable classifier for data mining. In Advances in Database Technology - EDBT'96, pages 18-32. Springer, 1996.

[14] Biswanath Panda, Joshua S Herbach, Sugato Basu, and Roberto J Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Proceedings of the VLDB Endowment, 2(2):1426-1437, 2009.

[15] Robert Allan Pearson. A coarse grained parallel induction heuristic. University College, University of New South Wales, Department of Computer Science, Australian Defence Force Academy, 1993.

[16] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[17] Sanjay Ranka and V Singh. CLOUDS: A decision tree classifier for large datasets.
In Knowledge Discovery and Data Mining, pages 2-8, 1998.

[18] S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660-674, 1991.

[19] John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier for data mining. In Proceedings of the 1996 International Conference on Very Large Data Bases, pages 544-555, 1996.

[20] Krysta M Svore and CJ Burges. Large-scale learning to rank using boosted decision trees. In Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.

[21] Stephen Tyree, Kilian Q Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387-396. ACM, 2011.

[22] C Yu and DB Skillicorn. Parallelizing boosting and bagging. Technical report, Queen's University, Kingston, Canada, 2001.

[23] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. CRC Press, 2012.