{"title": "Multi-Layered Gradient Boosting Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 3551, "page_last": 3561, "abstract": "Multi-layered distributed representation is believed to be the key ingredient of deep neural networks especially in cognitive tasks like computer vision. While non-differentiable models such as gradient boosting decision trees (GBDTs) are still the dominant methods for modeling discrete or tabular data, they are hard to incorporate with such representation learning ability. In this work, we propose the multi-layered GBDT forest (mGBDTs), with an explicit emphasis on exploring the ability to learn hierarchical distributed representations by stacking several layers of regression GBDTs as its building block. The model can be jointly trained by a variant of target propagation across layers, without the need to derive backpropagation nor differentiability. Experiments confirmed the effectiveness of the model in terms of performance and representation learning ability.", "full_text": "Multi-Layered Gradient Boosting Decision Trees\n\nJi Feng†,‡, Yang Yu†, Zhi-Hua Zhou†\n\n†National Key Lab for Novel Software Technology, Nanjing University, China\n‡Sinovation Ventures AI Institute\n†{fengj, yuy, zhouzh}@lamda.nju.edu.cn, ‡fengji@chuangxin.com\n\nAbstract\n\nMulti-layered distributed representation is believed to be the key ingredient of deep neural networks especially in cognitive tasks like computer vision. While non-differentiable models such as gradient boosting decision trees (GBDTs) are still the dominant methods for modeling discrete or tabular data, they are hard to incorporate with such representation learning ability. 
In this work, we propose the\nmulti-layered GBDT forest (mGBDTs), with an explicit emphasis on exploring the\nability to learn hierarchical distributed representations by stacking several layers\nof regression GBDTs as its building block. The model can be jointly trained by\na variant of target propagation across layers, without the need to derive back-\npropagation nor differentiability. Experiments con\ufb01rmed the effectiveness of the\nmodel in terms of performance and representation learning ability.\n\n1\n\nIntroduction\n\nThe development of deep neural networks has achieved remarkable advancement in the \ufb01eld of\nmachine learning during the past decade. By constructing a hierarchical or \"deep\" structure, the\nmodel is able to learn good representations from raw data in both supervised and unsupervised\nsettings which is believed to be its key ingredient. Successful application areas include computer\nvision, speech recognition, natural language processing and more [Goodfellow et al., 2016].\nCurrently, almost all the deep neural networks use back-propagation [Werbos, 1974; Rumelhart et al.,\n1986] with stochastic gradient descent as the workhorse behind the scene for updating parameters\nduring training. Indeed, when the model is composed of differentiable components (e.g., weighted\nsum with non-linear activation functions), it appears that back-prop is still currently the best choice.\nSome other methods such as target propagation [Bengio, 2014] has been proposed as an alternative\nfor training, the effectiveness and popularity are however still in a premature stage. For instance,\nthe work in Lee et al. [2015] proved that target propagation can be at most as good as back-prop,\nand in practice an additional back-propagation for \ufb01ne-tuning is often needed. 
In other words, the\ngood-old back-propagation is still the most effective way to train a differentiable learning system\nsuch as neural networks.\nOn the other hand, the need to explore the possibility to build a multi-layered or deep model using\nnon-differentiable modules is not only of academic interest but also with important application\npotentials. For instance, tree-based ensembles such as Random Forest [Breiman, 2001] or gradient\nboosting decision trees (GBDTs) [Friedman, 2000] are still the dominant way of modeling discrete or\ntabular data in a variety of areas, it thus would be of great interest to obtain a hierarchical distributed\nrepresentation learned by tree ensembles on such data. In such cases, there is no chance to use chain\nrule to propagate errors, thus back-propagation is no longer possible. This yields to two fundamental\nquestions: First, can we construct a multi-layered model with non-differentiable components, such\nthat the outputs in the intermediate layers are distributed representations? Second, if so, how to jointly\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\ftrain such models without the help of back-propagation? The goal of this paper is to provide such an\nattempt.\nRecently Zhou and Feng [2017; 2018] proposed the Deep Forest framework, which is the \ufb01rst attempt\nto constructing a multi-layered model using tree ensembles. Concretely, by introducing \ufb01ne-grained\nscanning and cascading operations, the model is able to construct a multi-layered structure with\nadaptive model complexity and achieved competitive performance across a board range of tasks.\nThe gcForest model proposed in [2018] utilized all strategies for diversity enhancement of ensemble\nlearning, however, the current approach is only suitable in a supervised learning setting. 
Meanwhile,\nit is still not clear how to construct a multi-layered model by forest that explicitly examine its\nrepresentation learning ability. Such explorations for representation learning should be made since\nmany previous researches have suggested that, a multi-layered distributed representations [Hinton et\nal., 1986] may be the key reason for the success of deep neural networks [Bengio et al., 2013a].\nIn this work, we aim to take the best parts of both worlds:\nthe excellent performance of tree\nensembles and the expressive power of hierarchical distributed representations (which has been\nmainly explored in neural networks). Concretely, we propose the \ufb01rst multi-layered structure\nusing gradient boosting decision trees as building blocks per layer with an explicit emphasis on its\nrepresentation learning ability and the training procedure can be jointly optimized via a variant of\ntarget propagation. The model can be trained in both supervised and unsupervised settings. This is\nthe \ufb01rst demonstration that we can indeed obtain hierarchical and distributed representations using\ntrees which was commonly believed only possible for neural networks or differentiable systems in\ngeneral. Theoretical justi\ufb01cations as well as experimental results showed the effectiveness of this\napproach.\nThe rest of the paper is organized as follows: \ufb01rst, some more related works are discussed; second,\nthe proposed method with theoretical justi\ufb01cations are presented; \ufb01nally, empirical experiments and\nconclusions are illustrated and discussed.\n\n2 Related Works\n\nThere is still no universal theory in explaining why a deep model works better than a shallow one.\nMany of the current attempts [Bengio et al., 2013b,c] for this question are based on the conjecture\nthat it is the hierarchical distributed representations learned from data are the driven forces behind the\neffectiveness of deep models. 
Similar works such as [Bengio et al., 2013b] conjectured that better representations can be exploited to produce faster-mixing Markov chains; therefore, a deeper model always helps. Tishby and Zaslavsky [2015] treated the hidden layers as a successive refinement of relevant information, and a deeper structure helps to speed up this process exponentially. Nevertheless, it seems that for a deep model to work well, it is critical to obtain a better feature re-representation from the intermediate layers.\nFor a multi-layered deep model with differentiable components, back-propagation is still the dominant way of training. In recent years, some alternatives have been proposed. For instance, target propagation [Bengio, 2014] and difference target propagation [Lee et al., 2015] propagate the targets instead of the errors via inverse mappings. By doing so, they help to solve the vanishing gradient problem, and the authors claim it is a more biologically plausible training procedure. Similar approaches such as feedback alignment [Lillicrap et al., 2016] used asymmetric feedback connections during training, and direct feedback alignment [Nøkland, 2016] showed that training is possible even when the feedback path is disconnected from the forward path. Currently, all these alternatives stay in the differentiable regime, and their theoretical justifications depend heavily on calculating the Jacobians of the activation functions.\nEnsemble learning [Zhou, 2012] is a powerful learning paradigm which often uses decision trees as its base learners. Bagging [Breiman, 1996] and boosting [Freund and Schapire, 1999], for instance, are the driving forces behind Random Forest [Breiman, 2001] and gradient boosting decision trees [Friedman, 2000], respectively. 
In addition, some ef\ufb01cient implementations for GBDTs such as XGBoost [Chen\nand Guestrin, 2016] and LightGBM [Ke et al., 2017] has become the best choice for many industrial\napplications and data science projects, ranging from predicting clicks on Ads [He et al., 2014], to\ndiscovering Higgs Boson [Chen and He, 2015] and numerous data science competitions in Kaggle1\n\n1www.kaggle.com\n\n2\n\n\fand beyond. Some more recent works such as eForest [Feng and Zhou, 2018] showed the possibility\nto recover the input pattern with almost perfect reconstruction accuracy by forest. Due to the unique\nproperty of decision trees, such models are naturally suitable for modeling discrete data or data\nsets with mixed-types of attributes. There are some works tries to combine the routing structure\nof trees with neural networks [Kontschieder et al., 2015; Frosst and Hinton, 2017], however, these\napproaches require heavily on the differential property for the system and thus are quite different\nwith our purpose and motivation.\n\n3 The Proposed Method\n\nFigure 1: Illustration of training mGBDTs\n\nConsider a multi-layered feed-forward structure with M \u2212 1 intermediate layers and one \ufb01nal output\nlayer. Denote oi where i \u2208 {0, 1, 2, . . . , M} as the output for each layer including the input layer and\nthe output layer oM . For a particular input data x, the corresponding output at each layer is in Rdi,\nwhere i \u2208 {0, 1, 2, . . . , M}. The learning task is therefore to learn the mappings Fi : Rdi\u22121 \u2192 Rdi\nfor each layer i > 0, such that the \ufb01nal output oM minimize the empirical loss L on training set.\nMean squared errors or cross-entropy with extra regularization terms are some common choices for\nthe loss L. 
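As a concrete reading of this setup, the layered model is just a composition of per-layer mappings. The short sketch below is illustrative code of our own (not from the paper): the layers are stubbed with random linear maps purely to show how the per-layer outputs o_i are produced from o_0 = x.

```python
import numpy as np

def forward(layers, x):
    """Compute o_0 = x and o_i = F_i(o_{i-1}) for every layer mapping F_i."""
    outputs = [x]
    for F in layers:
        outputs.append(F(outputs[-1]))
    return outputs  # [o_0, o_1, ..., o_M]

# Toy stand-ins for F_1: R^2 -> R^5 and F_2: R^5 -> R^3 (fixed random linear maps).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 5))
W2 = rng.normal(size=(5, 3))
F1 = lambda o: o @ W1
F2 = lambda o: o @ W2

os_ = forward([F1, F2], rng.normal(size=(10, 2)))
print([o.shape for o in os_])  # [(10, 2), (10, 5), (10, 3)]
```

In the actual model each F_i is a GBDT rather than a linear map; the composition structure is the same.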
In an unsupervised setting, the desired output Y can be the training data itself, which leads to an autoencoder, and the loss function is the reconstruction error between the output and the original input.\nWhen each Fi is parametric and differentiable, such a learning task can be solved efficiently using back-propagation. The basic routine is to calculate the gradients of the loss function with respect to each parameter at each layer using the chain rule, and then perform gradient descent for parameter updates. Once the training is done, the outputs of the intermediate layers can be regarded as the new representation learned by the model. Such hierarchical dense representation can be interpreted as a multi-layered abstraction of the original input and is believed to be critical for the success of deep models.\nHowever, when Fi is non-differentiable or even non-parametric, back-prop is no longer applicable since calculating the derivative of the loss function with respect to its parameters is impossible. The rest of this section will focus on solving this problem when the Fi are gradient boosting decision trees.\nFirst, at iteration t, assume the F_i^{t-1} obtained from the previous iteration are given; we need to obtain a \"pseudo-inverse\" mapping G_i^t paired with each F_i^{t-1}, such that G_i^t(F_i^{t-1}(o_{i-1})) ≈ o_{i-1}. This can be achieved by minimizing the expected value of the reconstruction loss: Ĝ_i^t = argmin_{G_i^t} E_x[L^inverse(o_{i-1}, G_i^t(F_i^{t-1}(o_{i-1})))], where the loss L^inverse can be the reconstruction loss. As with an autoencoder, random noise injection is often suggested; that is, instead of using a pure reconstruction error measure, it is good practice to set L^inverse as: L^inverse = ||G_i(F_i(o_{i-1} + ε)) − (o_{i-1} + ε)||, ε ~ N(0, diag(σ²)). By doing so, the model is more robust in the sense that the inverse mapping is forced to learn how to map the neighboring training data onto the right manifold. 
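A minimal sketch of one inverse-mapping fit under this noise-injected reconstruction loss, assuming scikit-learn's GradientBoostingRegressor as the base GBDT (the paper does not prescribe this library; MultiOutputRegressor handles the multi-dimensional output, and all names here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)

def fit_inverse(F, o_prev, sigma=0.3, n_trees=5, lr=0.1):
    """Fit G so that G(F(o + eps)) ~= o + eps, eps ~ N(0, sigma^2 I)."""
    noisy = o_prev + rng.normal(scale=sigma, size=o_prev.shape)
    X = F(noisy)                     # forward images of the noisy inputs
    G = MultiOutputRegressor(GradientBoostingRegressor(
        n_estimators=n_trees, learning_rate=lr, max_depth=5))
    G.fit(X, noisy)                  # regression targets are the noisy inputs
    return G

# Toy forward mapping F: R^2 -> R^3 (a fixed random linear map).
W = rng.normal(size=(2, 3))
F = lambda o: o @ W
o_prev = rng.normal(size=(200, 2))
G = fit_inverse(F, o_prev)
recon = G.predict(F(o_prev))
print(recon.shape)  # (200, 2)
```

In the full algorithm this fit is warm-started from the previous iteration's G and advanced by only K1 additional trees, rather than refit from scratch as above.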
In addition, such randomness injection also helps in designing a generative model, by treating the inverse mapping direction as a generative path; this can be considered as future work for exploration.\nSecond, once we have updated G_i^t, we can take it as given and update the forward mapping of the layer below. The key here is to assign a pseudo-label z_{i-1}^t for each F_{i-1}, where i ∈ {2, ..., M}, and each layer's pseudo-label is defined to be z_{i-1}^t = G_i^t(z_i^t). That is, at iteration t, the pseudo-labels for all the intermediate layers can be \"aligned\" and propagated from the output layer to the input layer. Then, once the pseudo-label for each layer is computed, each F_i^{t-1} can take a gradient step by fitting towards the pseudo-residuals −∂L(F_i^{t-1}(o_{i-1}), z_i^t)/∂F_i^{t-1}(o_{i-1}), just like a typical regression GBDT.\nThe only thing that remains is to set the pseudo-label z_M^t for the final layer to make the whole structure ready for the update. This turns out to be easy, since at layer M one can always use the real labels y when defining the output layer's pseudo-label. For instance, it is natural to define the pseudo-label of the output layer as z_M^t = o_M − α ∂L(o_M, y)/∂o_M. Then, F_M^t is set to fit towards the pseudo-residuals −∂L(F_M^{t-1}(o_{M-1}), z_M^t)/∂F_M^{t-1}(o_{M-1}). In other words, at iteration t, the output layer F_M computes its pseudo-label z_M^t and then produces the pseudo-labels for all the other layers via the inverse mappings; each F_i can then be updated accordingly. Once all the F_i are updated, the procedure moves on to the next iteration to update the G_i. 
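The top-down pseudo-label computation just described can be sketched as follows — a simplified numpy illustration of our own with squared-error loss L(o_M, y) = ||o_M − y||², where the G_j are assumed already fitted (the toy F_2 below is exactly invertible, so its inverse is exact):

```python
import numpy as np

def pseudo_labels(F, G, x, y, alpha=0.1):
    """Top-down pseudo-label computation for one mGBDT iteration.

    F[j]: forward mapping of layer j (j = 1..M);
    G[j]: fitted inverse of F[j] (j = 2..M);
    loss: squared error L(o_M, y) = ||o_M - y||^2.
    """
    M = max(F)
    o = {0: x}
    for j in range(1, M + 1):                 # forward pass: o_j = F_j(o_{j-1})
        o[j] = F[j](o[j - 1])
    z = {M: o[M] - alpha * 2.0 * (o[M] - y)}  # z_M = o_M - alpha * dL/do_M
    for j in range(M, 1, -1):                 # z_{j-1} = G_j(z_j)
        z[j - 1] = G[j](z[j])
    return o, z

# Toy 2-layer example; F_2 is an invertible scaling, so G_2 is its exact inverse.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))
F = {1: lambda o: o @ W1, 2: lambda o: 2.0 * o}
G = {2: lambda z: z / 2.0}
x, y = rng.normal(size=(4, 2)), rng.normal(size=(4, 3))
o, z = pseudo_labels(F, G, x, y)
# Each F_j would now take K2 boosting steps fitting towards its pseudo-label z[j].
```

In the real model, each G_j is itself a GBDT learned from the reconstruction loss, so the propagated pseudo-labels are only approximate inverses.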
In practice, a bottom-up update is suggested (update Fi before Fj for i < j), and each Fi can go through several rounds of additive boosting steps towards its current pseudo-label.\nWhen training a neural network, initialization can be achieved by assigning random Gaussian noise to each parameter, after which the procedure moves on to the parameter-update stage. For the tree-structured model described here, it is not a trivial task to draw a random tree structure from the distribution of all possible tree configurations; therefore, instead of initializing the tree structures at random, we produce some Gaussian noise as the output of the intermediate layers and train some very tiny trees to obtain F_i^0, where index 0 denotes the tree structures obtained in this initialization stage. The training procedure can then move on to iteratively updating the forward mappings and the inverse mappings. The whole procedure is summarized in Algorithm 1 and illustrated in Figure 1.\n\nAlgorithm 1: Training multi-layered GBDT (mGBDT) Forest\nInput: Number of layers M, layer dimensions d_i, training data X, Y, final loss function L, learning rates α and γ, boosting rounds K1 and K2, epochs E, noise injection σ²\nOutput: A trained mGBDT\nF_{1:M}^0 ← Initialize(); G_{2:M}^0 ← Initialize(); o_0 ← X; o_j ← F_j^0(o_{j-1}) for j = 1, 2, ..., M\nfor t = 1 to E do\n    // Calculate the pseudo-label for the final layer\n    z_M^t ← o_M − α ∂L(o_M, Y)/∂o_M\n    for j = M down to 2 do\n        G_j^t ← G_j^{t-1}\n        o_{j-1}^noise ← o_{j-1} + ε, ε ~ N(0, diag(σ²))\n        for k = 1 to K1 do\n            L_j^inv ← ||G_j^t(F_j^{t-1}(o_{j-1}^noise)) − o_{j-1}^noise||\n            r_k ← −[∂L_j^inv / ∂G_j^t(F_j^{t-1}(o_{j-1}^noise))]\n            Fit regression tree h_k to r_k, i.e. using the training set (F_j^{t-1}(o_{j-1}^noise), r_k)\n            G_j^t ← G_j^t + γ h_k\n        end\n        z_{j-1}^t ← G_j^t(z_j^t) // Calculate the pseudo-label for layer j − 1\n    end\n    for j = 1 to M do\n        F_j^t ← F_j^{t-1}\n        // Update F_j^t using pseudo-label z_j^t for K2 rounds\n        for k = 1 to K2 do\n            L_j ← ||F_j^t(o_{j-1}) − z_j^t||\n            r_k ← −[∂L_j / ∂F_j^t(o_{j-1})]\n            Fit regression tree h_k to r_k, i.e. using the training set (o_{j-1}, r_k)\n            F_j^t ← F_j^t + γ h_k\n        end\n        o_j ← F_j^t(o_{j-1})\n    end\nend\nreturn F_{1:M}^T, G_{2:M}^T\n\nIt is worth noting that the work of Rory and Eibe [2017] utilized GPUs to speed up GBDT training, and Korlakai and Ran [2015] showed an efficient way of applying drop-out techniques to GBDTs, which can give a further performance boost. For a multi-dimensional output problem, the naive approaches using GBDTs would be memory-inefficient; Si et al. [2017] proposed an efficient way of solving such problems which can reduce the memory footprint by an order of magnitude in practice.\nIn classification tasks, one could set the forward mapping at the output layer to be a linear classifier. There are two main reasons for doing this: First, the lower layers will then be forced to learn a feature re-representation that is as linearly separable as possible, which is a useful property to have. Second, the difference in dimensionality between the output layer and the layer below is often large; as a result, an accurate inverse mapping may be hard to learn. 
When using a linear classifier as the forward mapping at the output layer, there is no need to calculate that particular inverse mapping, since the pseudo-label for the layer below can be calculated by taking the gradient of the global loss with respect to the output of the last hidden layer.\nA similar procedure, target propagation [Bengio, 2014], has been proposed which uses inter-layer feedback mappings to train a neural network. The authors proved that, under certain conditions, the angle between the update directions of the forward mappings' parameters and the update directions obtained with back-propagation is less than 90 degrees. However, the proof relies heavily on computing the Jacobians of Fi and Gi; therefore, their results are only applicable to neural networks.\nThe following theorem proves that, under certain conditions, an update in an intermediate layer towards its pseudo-label helps to reduce the loss of the layer above, and thus helps to reduce the global loss. The proof here does not rely on the differentiability of Fi and Gi.\nTheorem 1. Suppose an update of f_{i-1}^old to f_{i-1}^new moves its output from h_i to h'_i, where h_i and h'_i are in R^{d_i}, and denote the input of f_{i-1} as h_{i-1}, which is in R^{d_{i-1}}. Assume each f_i is t-Lipschitz continuous on R^{d_i} and g_i = f_i^{-1} is 1/t-Lipschitz continuous (see footnote 2). Now suppose such an update for f_{i-1} reduced its local loss, that is, ||f_{i-1}^new(h_{i-1}) − Target_{i-1}|| ≤ ||f_{i-1}^old(h_{i-1}) − Target_{i-1}||. Then it helps to reduce the loss of the layer above, that is, the following holds:\n\n||f_i(h'_i) − Target_i|| ≤ ||f_i(h_i) − Target_i||    (1)\n\nProof. 
By assumption, it is easy to show that ||f_i^{-1}(f_i(x)) − f_i^{-1}(f_i(y))|| ≤ ||f_i(x) − f_i(y)||/t and ||g_i^{-1}(g_i(x)) − g_i^{-1}(g_i(y))|| ≤ t||g_i(x) − g_i(y)||. Then we have the following:\n\n||f_i(h'_i) − Target_i|| ≤ t||g_i(f_i(h'_i)) − g_i(Target_i)||\n= t||h'_i − Target_{i-1}||\n= t||f_{i-1}^new(h_{i-1}) − Target_{i-1}||\n≤ t||f_{i-1}^old(h_{i-1}) − Target_{i-1}||\n≤ t||f_i(f_{i-1}^old(h_{i-1})) − f_i(Target_{i-1})||/t\n= ||f_i(f_{i-1}^old(h_{i-1})) − f_i(Target_{i-1})||\n= ||f_i(h_i) − Target_i||\n\nTo conclude this section, we discuss several reasons for the need to explore non-differentiable components when designing multi-layered models. Firstly, current adversarial attacks [Nguyen et al., 2015; Huang et al., 2017] are all based on calculating the derivative of the final loss with respect to the input. That is, regardless of the training procedure, one can always attack the system as long as the chain rule is applicable. Non-differentiable modules such as trees can naturally block such calculations; therefore, it would be more difficult to perform malicious attacks. Secondly, there are still numerous data sets of interest that are best suited to be modeled by trees. 
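Returning briefly to Theorem 1, the inequality can be checked numerically on a toy instance. The sketch below uses the invertible map f_i(x) = 2x (so t = 2 and g_i is its exact inverse) with illustrative values of our own choosing:

```python
import numpy as np

t = 2.0
f_i = lambda h: t * h          # a t-Lipschitz forward mapping for layer i
g_i = lambda o: o / t          # its inverse, which is 1/t-Lipschitz

target_i = np.array([4.0, -2.0])
target_prev = g_i(target_i)    # Target_{i-1} = g_i(Target_i)

h_old = np.array([3.0, 1.0])         # old output of layer i-1
h_new = 0.5 * (h_old + target_prev)  # an update that halves the local loss

# The local loss of layer i-1 decreased ...
assert np.linalg.norm(h_new - target_prev) <= np.linalg.norm(h_old - target_prev)
# ... and so did the loss of the layer above, as inequality (1) predicts.
assert np.linalg.norm(f_i(h_new) - target_i) <= np.linalg.norm(f_i(h_old) - target_i)
print("inequality (1) holds")
```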
It would be of great interest, and of practical potential, to come up with algorithms that can blend the performance of tree ensembles with the benefits of having a multi-layered representation.\n\n[Footnote 2] The Clarke inverse function theorem [Clarke, 1976] proved the existence of such g_i under mild conditions on generalized derivatives of f_i, without loss of generality.\n\n4 Experiments\n\nThe experiments in this section are mainly designed to empirically examine whether it is feasible to jointly train the multi-layered structure proposed in this work. That is, we make no claim that the current structure can outperform CNNs in computer vision tasks. More specifically, we aim to examine the following questions: (Q1) Does the training procedure empirically converge? (Q2) What do the learned features look like? (Q3) Does depth help to learn a better representation? (Q4) Given the same structure, what is the performance compared with neural networks trained by either back-propagation or target-propagation? With the above questions in mind, we conducted 3 sets of experiments with both synthetic data and real-world applications, whose results are presented below.\n\n4.1 Synthetic Data\n\nAs a sanity check, here we train two small multi-layered GBDTs on synthetic datasets.\n\n(a) Original (b) Transformed\nFigure 2: Supervised classification\n\n(a) Input (b) Reconstructed\nFigure 3: Unsupervised mGBDT autoencoder\n\n(a) Dimension 1 and 2 (b) Dimension 1 and 5 (c) Dimension 4 and 5 (d) Dimension 3 and 5\nFigure 4: Visualizations in the 5D encoding space of unsupervised mGBDT autoencoder\n\nWe generated 15,000 points with 2 classes (70% for training and 30% for testing) on R^2 as illustrated in Figure 2a. The structure we used for training is (input − 5 − 3 − output), where the input points are in R^2 and the output is a 0/1 classification prediction. 
The mGBDT used in both forward and\ninverse mappings have a maximum depth of 5 per tree with learning rate of 0.1. The output of the\nlast hidden layer (which is in R3) is visualized in Figure 2b. Clearly, the model is able to transform\nthe data points that is easier to separate.\nWe also conducted an unsupervised learning task for autoencoding. 10, 000 points in R3 with shape\nS were generated, as shown in Figure 3a. Then we built an autoencoder using mGBDTs with\nstructure (input \u2212 5 \u2212 output) with MSE as its reconstruction loss. The hyper-parameters for tree\ncon\ufb01gurations are the same as the 2-class classi\ufb01cation task. In other words, the model is forced to\nlearn a mapping from R3 to R5, then maps it back to the original space with low reconstruction error\nas objective. The reconstructed output is presented in Figure 3b. The 5D encodings for the input 3D\npoints are impossible to visualize directly, here we use a common strategy to visualize some pairs of\ndimensions for the 5D encodings in 2D as illustrated in Figure 4. The 5D representation for the 3D\npoints is indeed a distributed representation [Hinton et al., 1986] as some of the dimension captures\nthe curvature whereas others preserve the relative distance among points.\n\n4.2\n\nIncome Prediction\n\nThe income prediction dataset [Lichman, 2013] consists of 48, 842 samples (32, 561 for training\nand 16, 281 for testing) of tabular data with both categorical and continuous attributes. Each sample\n\n6\n\n\f(a) Original representation\n\n(b) 1st layer representation\n\n(c) 2nd layer representation\n\nFigure 5: Feature visualization for income dataset\n\nconsists of a person\u2019s social background such as race, sex, work-class, etc. The task here is to predict\nwhether this person makes over 50K a year. One-hot encoding for the categorical attributes make each\ntraining data in R113. 
The multi-layered GBDT structure we used is (input \u2212 128 \u2212 128 \u2212 output).\nGaussian noise with zero mean and standard deviation of 0.3 is injected in Linverse. To avoid training\nthe inverse mapping on the output layer, we set the \ufb01nal output layer to be a linear with cross-entropy\nloss, other layers all use GBDTs for for forward/inverse mappings with the same hyper-parameters in\nsection 4.1. The learning rate \u03b1 at output layer is determined by cross-validation. The output for\neach intermediate layers are visualized via T-SNE [van der Maaten and Hinton, 2008] in Figure 5.\nWe wish to highlight that all the mGBDTs used exactly the same hyper-parameters across all the\nexperiments: 5 additive trees per epoch (K1 = K2 = 5), the maximum depth is \ufb01xed to be 5. Such\nrule-of-thumb setting is purposely made in order to avoid a \ufb01ne-tuned performance report.\n\n(a) Training loss\n\n(b) Training accuracy\n\n(c) Testing loss\n\n(d) Testing accuracy\n\nFigure 6: Learning curves of income dataset\n\nFor a comparison, we also trained the exact same structure (input \u2212 128 \u2212 128 \u2212 output) on neural\nnetworks using the target propagation N N T argetP rop and standard back-propagation N N BackP rop,\nrespectively. Since the goal here is to compare the predictive accuracy given the same representational\ndimensions therefore other NN architectures are not reported in details. (Actually smaller NNs won\u2019t\nhelp, for instance, (input \u2212 32 \u2212 32 \u2212 output) achieved 85.20% and (input \u2212 16 \u2212 16 \u2212 output)\nachieved 84.67%.) Adam [Kingma and Ba, 2014] with a learning rate of 0.001 and ReLU activation\nare used for both cases. Dropout rate of 0.25 is used for back-prop. A vanilla XGBoost via cross-\nvalidation search for hyper-parameters with 100 additive trees with a maximum depth of 7 per tree is\nalso trained for comparison, the optimal learning rate found for XGBoost is 0.3. 
Finally, we stacked 3 XGBoost models with the same configuration as the vanilla XGBoost and used one additional XGBoost as the second stage of stacking via 3-fold validation. More stacking levels produced severe over-fitting and are not included here.\nExperimental results are summarized in Figure 6 and Table 1. First, the multi-layered GBDT forest (mGBDT) achieved the highest accuracy compared to the DNN approaches trained by either back-prop or target-prop, given the same model structure. It also performs better than single GBDTs or stacked ones in terms of accuracy. Second, NN_TargetProp does not converge as well as NN_BackProp, as expected (a result consistent with Lee et al. [2015]), whereas the same structure using GBDT layers can achieve a lower training loss without over-fitting.\n\nTable 1: Classification accuracy comparison. For the protein dataset, accuracy is measured by 10-fold cross-validation and shown as mean ± std.\n\nModel | Income Dataset | Protein Dataset\nXGBoost | .8719 | .5937 ± .0324\nXGBoost Stacking | .8697 | .5592 ± .0400\nNN_TargetProp | .8491 | .5756 ± .0465\nNN_BackProp | .8534 | .5907 ± .0268\nMulti-layered GBDT | .8742 | .5948 ± .0268\n\n4.3 Protein Localization\n\n(a) Original representation (b) 1st layer representation (c) 2nd layer representation\nFigure 7: Feature visualization for protein dataset\n\nThe protein dataset [Lichman, 2013] is a 10-class classification task consisting of only 1484 training samples, where each of the 8 input attributes is one measurement of the protein sequence; the goal is to predict the protein localization site among 10 possible choices. 10-fold cross-validation is used for model evaluation since there is no test set provided. We trained a multi-layered GBDT with structure (input − 16 − 16 − output). Due to the robustness of tree ensembles, all the training hyper-parameters are the same as used in the previous section. 
Likewise, we trained two neural\nnetworks N N T argetP rop and N N BackP rop with the same structure, and the training parameters\nwere determined by cross-validation for a fair comparison. Experimental results are presented in\nTable 1. Again mGBDT achieved best performance among all. XGBoost Stacking had a worse\naccuracy than using a single XGBoost, this is mainly because over-\ufb01tting has occurred. We also\nvisualized the output for each mGBDT layer using T-SNE in Figure 7. It can be shown that the quality\nof the representation does get improved with model depth.\nThe training and testing curves for 10-fold cross-validation are plotted with mean value in Figure 8.\nThe multi-layered GBDT (mGBDT) approach converges much faster than NN approaches in terms of\nnumber of epochs, as illustrated in Figure 8a. Only 50 epoch is needed for mGBDT whereas NNs\nrequire 200 epochs for both back-prop and target-prop scenarios. When measured by the wall-clock\ntime, mGBDT runs close to NN (only slower by a factor of 1.2) with backprops in our experiments\nand mGBDT has a training speed very close to NN with target-prop. Nevertheless, comparing\nwall-clock time is less meaningful since mGBDT and NNs use different devices (CPU v.s. GPU)\nand different implementation optimizations. In addition, N N T argetP rop is still sub-optimal than\nN N BackP rop and mGBDT achieved highest accuracy among all. We also examined the effect when\nwe vary the number of intermediate layers on protein datasets. To make the experiments manageable,\nthe dimension for each intermediate layer is \ufb01xed to be 16. The results are summarized in Table 2.\nIt can be shown that mGBDT is more robust compared with N N T argetP rop as we increase the\nintermediate layers. 
Indeed, the performance dropped from .5964 to .3654 when using target-prop for the neural networks, whereas mGBDT still performs well when extra layers are added.\n\n(a) Training loss (b) Training accuracy (c) Testing loss (d) Testing accuracy\nFigure 8: Learning curves of protein dataset\n\nTable 2: Test accuracy with different model structures. Accuracy is measured by 10-fold cross-validation and shown as mean ± std. N/A stands for not applicable.\n\nModel Structure | NN_BackProp | NN_TargetProp | mGBDT\n8->10 | .5873 ± .0396 | N/A | .5937 ± .0324\n8->16->10 | .5803 ± .0316 | .5964 ± .0343 | .6160 ± .0323\n8->16->16->10 | .5907 ± .0268 | .5756 ± .0465 | .5948 ± .0268\n8->16->16->16->10 | .5901 ± .0270 | .4759 ± .0429 | .5897 ± .0312\n8->16->16->16->16->10 | .5768 ± .0286 | .3654 ± .0452 | .5782 ± .0229\n\n5 Conclusion and Future Explorations\n\nIn this paper, we present a novel multi-layered GBDT forest (mGBDT) with explicit representation learning ability that can be jointly trained with a variant of target propagation. Due to the excellent performance of tree ensembles, this approach has great potential in many application areas where neural networks are not the best fit. The work also showed that obtaining multi-layered distributed representations is not tied to differentiable systems. Theoretical justifications as well as experimental results confirmed the effectiveness of this approach. Here we list some aspects for future exploration.\nDeep Forest Integration. One important feature of the deep forest model proposed in [Zhou and Feng, 2018] is that the model complexity can be adaptively determined according to the input data. 
Therefore, it is interesting to integrate several mGBDT layers as feature extractors into the deep forest structure, making the system not only capable of learning representations but also able to automatically determine its model complexity.
Structural Variants and Hybrid DNN. A recurrent or even convolutional structure using mGBDT layers as building blocks is now possible, since the training method does not impose restrictions on such structural priors. More radical designs are also possible. For instance, one can embed the mGBDT forest as one or several layers into any complex differentiable system and use the mGBDT layers to handle tasks that are best suited for trees. The whole system can then be jointly trained with a mixture of different training methods across different layers. There is plenty of room for future explorations.
Acknowledgments This research was supported by NSFC (61751306), National Key R&D Program of China (2018YFB1004300) and Collaborative Innovation Center of Novel Software Technology and Industrialization.

References

Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai. Better mixing via deep representations. In ICML, pages 552–560, 2013.
Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising auto-encoders as generative models. In NIPS, pages 899–907, 2013.
Y. Bengio. How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv:1407.7906, 2014.
L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
T.-Q. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In KDD, pages 785–794, 2016.
T.-Q. Chen and T. He. Higgs boson discovery with boosted trees.
In NIPS Workshop, pages 69–80, 2015.
F. H. Clarke. On the inverse function theorem. Pacific Journal of Mathematics, 64(1):97–102, 1976.
J. Feng and Z.-H. Zhou. Autoencoder by forest. In AAAI, 2018.
Y. Freund and R. E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999.
J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.
N. Frosst and G. E. Hinton. Distilling a neural network into a soft decision tree. arXiv:1711.09784, 2017.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, MA, 2016.
X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Quiñonero. Practical lessons from predicting clicks on ads at Facebook. In ADKDD, pages 5:1–5:9, 2014.
G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 77–109. 1986.
S. H. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel. Adversarial attacks on neural network policies. arXiv:1702.02284, 2017.
G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly efficient gradient boosting decision tree. In NIPS, pages 3149–3157, 2017.
D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò. Deep neural decision forests. In ICCV, pages 1467–1475, 2015.
V. R. Korlakai and G. B. Ran. DART: Dropouts meet multiple additive regression trees. In AISTATS, pages 489–497, 2015.
D.-H. Lee, S. Zhang, A. Fischer, and Y. Bengio. Difference target propagation. In ECML PKDD, pages 498–515, 2015.
M. Lichman. UCI machine learning repository, 2013.
T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276, 2016.
A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. Pages 427–436, 2015.
A. Nøkland. Direct feedback alignment provides learning in deep neural networks. In NIPS, pages 1037–1045, 2016.
M. Rory and F. Eibe. Accelerating the XGBoost algorithm using GPU computing. PeerJ Computer Science, 3:127, 2017.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
S. Si, H. Zhang, S. S. Keerthi, D. Mahajan, I. S. Dhillon, and C.-J. Hsieh. Gradient boosted decision trees for high dimensional sparse output. In ICML, pages 3182–3190, 2017.
N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. arXiv:1503.02406, 2015.
L. J. P. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, 1974.
Z.-H. Zhou and J. Feng. Deep forest: Towards an alternative to deep neural networks. In IJCAI, pages 3553–3559, 2017.
Z.-H. Zhou and J. Feng. Deep forest. National Science Review, 2018.
Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. CRC, Boca Raton, FL, 2012.