{"title": "Efficient Non-greedy Optimization of Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 1729, "page_last": 1737, "abstract": "Decision trees and randomized forests are widely used in computer vision and machine learning. Standard algorithms for decision tree induction optimize the split functions one node at a time according to some splitting criteria. This greedy procedure often leads to suboptimal trees. In this paper, we present an algorithm for optimizing the split functions at all levels of the tree jointly with the leaf parameters, based on a global objective. We show that the problem of finding optimal linear-combination (oblique) splits for decision trees is related to structured prediction with latent variables, and we formulate a convex-concave upper bound on the tree's empirical loss. Computing the gradient of the proposed surrogate objective with respect to each training exemplar is O(d^2), where d is the tree depth, and thus training deep trees is feasible. The use of stochastic gradient descent for optimization enables effective training with large datasets. Experiments on several classification benchmarks demonstrate that the resulting non-greedy decision trees outperform greedy decision tree baselines.", "full_text": "Ef\ufb01cient Non-greedy Optimization of Decision Trees\n\nMohammad Norouzi1\u2217\n\nMaxwell D. Collins2 \u2217\n\nDavid J. Fleet4\n\nPushmeet Kohli5\n\nMatthew Johnson3\n\n1,4 Department of Computer Science, University of Toronto\n\n2 Department of Computer Science, University of Wisconsin-Madison\n\n3,5 Microsoft Research\n\nAbstract\n\nDecision trees and randomized forests are widely used in computer vision and ma-\nchine learning. Standard algorithms for decision tree induction optimize the split\nfunctions one node at a time according to some splitting criteria. This greedy pro-\ncedure often leads to suboptimal trees. 
In this paper, we present an algorithm for optimizing the split functions at all levels of the tree jointly with the leaf parameters, based on a global objective. We show that the problem of finding optimal linear-combination (oblique) splits for decision trees is related to structured prediction with latent variables, and we formulate a convex-concave upper bound on the tree's empirical loss. Computing the gradient of the proposed surrogate objective with respect to each training exemplar is O(d^2), where d is the tree depth, and thus training deep trees is feasible. The use of stochastic gradient descent for optimization enables effective training with large datasets. Experiments on several classification benchmarks demonstrate that the resulting non-greedy decision trees outperform greedy decision tree baselines.

1 Introduction

Decision trees and forests [5, 21, 4] have a long and rich history in machine learning [10, 7]. Recent years have seen an increase in their popularity, owing to their computational efficiency and applicability to large-scale classification and regression tasks. A case in point is Microsoft Kinect, where decision trees are trained on millions of exemplars to enable real-time human pose estimation from depth images [22].

Conventional algorithms for decision tree induction are greedy. They grow a tree one node at a time following procedures laid out decades ago by frameworks such as ID3 [21] and CART [5]. While recent work has proposed new objective functions to guide greedy algorithms [20, 12], it continues to be the case that decision tree applications (e.g., [9, 14]) utilize the same dated methods of tree induction.
Greedy decision tree induction builds a binary tree via a recursive procedure as follows: beginning with a single node, indexed by i, a split function s_i is optimized based on a corresponding subset D_i of the training data, such that D_i is split into two subsets, which in turn define the training data for the two children of node i. The intrinsic limitation of this procedure is that the optimization of s_i is solely conditioned on D_i, i.e., there is no ability to fine-tune the split function s_i based on the results of training at lower levels of the tree. This paper proposes a general framework for non-greedy learning of the split parameters for tree-based methods that addresses this limitation. We focus on binary trees, although the extension to N-ary trees is possible. We show that our joint optimization of the split functions at different levels of the tree under a global objective not only promotes cooperation between the split nodes to create more compact trees, but also leads to better generalization performance.

* Part of this work was done while M. Norouzi and M. D. Collins were at Microsoft Research, Cambridge.

One of the key contributions of this work is establishing a link between the decision tree optimization problem and the problem of structured prediction with latent variables [25]. We present a novel formulation of decision tree learning that associates a binary latent decision variable with each split node in the tree and uses these latent variables to formulate the tree's empirical loss. Inspired by advances in structured prediction [23, 24, 25], we propose a convex-concave upper bound on the empirical loss. This bound acts as a surrogate objective that is optimized using stochastic gradient descent (SGD) to find a locally optimal configuration of the split functions.
One complication introduced by this particular formulation is that the number of latent decision variables grows exponentially with the tree depth d. As a consequence, each gradient update has a complexity of O(2^d p) for p-dimensional inputs. One of our technical contributions is showing how this complexity can be reduced to O(d^2 p) by modifying the surrogate objective, thereby enabling efficient learning of deep trees.

2 Related work

Finding optimal split functions at different levels of a decision tree according to some global objective, such as a regularized empirical risk, is NP-complete [11] due to the discrete and sequential nature of the decisions in a tree. Thus, finding an efficient alternative to the greedy approach has remained a difficult goal despite many prior attempts.

Bennett [1] proposes a non-greedy multi-linear programming based approach to global tree optimization and shows that the method produces trees with higher classification accuracy than standard greedy trees. However, the method is limited to binary classification with 0-1 loss and has a high computational complexity, making it applicable only to trees with few nodes.

The work in [15] proposes a means of training decision forests in an online setting by incrementally extending the trees as new data points are added. As opposed to naive incremental growing of the trees, this work models the decision trees with Mondrian processes.

The Hierarchical Mixture of Experts model [13] uses soft splits rather than hard binary decisions to capture situations where the transition from low to high response is gradual. The use of soft splits at internal nodes of the tree yields a probabilistic model in which the log-likelihood is a smooth function of the unknown parameters. Hence, training based on log-likelihood is amenable to numerical optimization via methods such as expectation maximization (EM).
That said, the soft splits necessitate the evaluation of all or most of the experts for each data point, so much of the computational advantage of decision trees is lost.

Murthy and Salzberg [17] argue that non-greedy tree learning methods that work by looking ahead are unnecessary and sometimes harmful. This is understandable, since their methods minimize the empirical loss without any regularization, which is prone to overfitting. To avoid this problem, it is common practice (see Breiman [4] or Criminisi and Shotton [7] for an overview) to limit the tree depth, to introduce limits on the number of training instances below which a tree branch is not extended, or to force a diverse ensemble of trees (i.e., a decision forest) through the use of bagging. Bennett and Blue [2] describe a different way to overcome overfitting, using a max-margin framework and Support Vector Machines (SVMs) at the split nodes of the tree. Subsequently, Bennett et al. [3] show how enlarging the margin of decision tree classifiers results in better generalization performance.

Our formulation for decision tree induction improves on prior art in a number of ways. Not only does our latent variable formulation of decision trees enable efficient learning, it can handle any general loss function while not sacrificing the O(dp) complexity of inference imparted by the tree structure. Further, our surrogate objective provides a natural way to regularize the joint optimization of tree parameters to discourage overfitting.

3 Problem formulation

For ease of exposition, this paper focuses on binary classification trees with m internal (split) nodes and m + 1 leaf (terminal) nodes. Note that in a binary tree the number of leaves is always one more than the number of internal (non-leaf) nodes. An input, x ∈ R^p, is directed from the root of the tree down through internal nodes to a leaf node.
Each leaf node specifies a distribution over k class labels. Each internal node, indexed by i ∈ {1, ..., m}, performs a binary test by evaluating a node-specific split function s_i(x) : R^p → {−1, +1}. If s_i(x) evaluates to −1, then x is directed to the left child of node i; otherwise, x is directed to the right child, and so on down the tree. Each split function s_i(·), parameterized by a weight vector w_i, is assumed to be a linear threshold function, i.e., s_i(x) = sgn(w_i^T x). We incorporate an offset parameter to obtain split functions of the form sgn(w_i^T x − b_i) by appending a constant "−1" to the input feature vector.

Figure 1: The binary split decisions in a decision tree with m = 3 internal nodes can be thought of as a binary vector h = [h1, h2, h3]^T. Tree navigation to reach a leaf can be expressed in terms of a function f(h), e.g., f([+1, −1, +1]^T) = [0, 0, 0, 1]^T = 1_4 and f([−1, +1, +1]^T) = [0, 1, 0, 0]^T = 1_2. The selected leaf parameters can be expressed as θ = Θ^T f(h), i.e., θ_4 and θ_2 respectively.

Each leaf node, indexed by j ∈ {1, ..., m + 1}, specifies a conditional probability distribution over class labels, l ∈ {1, ..., k}, denoted p(y = l | j). Leaf distributions are parametrized with a vector of unnormalized predictive log-probabilities, denoted θ_j ∈ R^k, and a softmax function, i.e.,

    p(y = l | j) = exp(θ_j[l]) / Σ_{α=1}^k exp(θ_j[α]) ,    (1)

where θ_j[α] denotes the αth element of vector θ_j.

The parameters of the tree comprise the m internal weight vectors, {w_i}_{i=1}^m, and the m + 1 vectors of unnormalized log-probabilities, one for each leaf node, {θ_j}_{j=1}^{m+1}. We pack these parameters into two matrices W ∈ R^{m×p} and Θ ∈ R^{(m+1)×k} whose rows comprise weight vectors and leaf parameters, i.e., W ≡ [w_1, ..., w_m]^T and Θ ≡ [θ_1, ..., θ_{m+1}]^T. Given a dataset of input-output pairs, D ≡ {x_z, y_z}_{z=1}^n, where y_z ∈ {1, ..., k} is the ground truth class label associated with input x_z ∈ R^p, we wish to find a joint configuration of oblique splits W and leaf parameters Θ that minimizes some measure of misclassification loss on the training dataset. Joint optimization of the split functions and leaf parameters according to a global objective is known to be extremely challenging [11] due to the discrete and sequential nature of the splitting decisions within the tree.

One can evaluate all of the split functions, for every internal node of the tree, on an input x by computing sgn(W x), where sgn(·) is the element-wise sign function. One key idea that helps link decision tree learning to latent structured prediction is to think of an m-bit vector of potential split decisions, e.g., h = sgn(W x) ∈ {−1, +1}^m, as a latent variable. Such a latent variable determines the leaf to which a data point is directed, and hence the leaf parameters with which it is classified.
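As a concrete illustration, consider the following Python sketch (hypothetical code, not the authors' implementation; it assumes the m internal nodes are stored in heap order, with node 1 as the root and children 2i and 2i+1, which is consistent with Fig. 1). It evaluates all split decisions at once via sgn(Wx), navigates to a leaf, and applies the softmax of (1):

```python
import numpy as np

def navigate(h, m):
    """Tree navigation function f: follow the m split decisions in h
    (heap-ordered; -1 = go left, +1 = go right) from the root to a leaf,
    and return the 1-of-(m+1) indicator vector of the selected leaf."""
    i = 1                              # 1-based heap index of the root
    while i <= m:                      # internal nodes are 1..m
        i = 2 * i + int(h[i - 1] > 0)  # left child 2i, right child 2i+1
    e = np.zeros(m + 1)
    e[i - m - 1] = 1.0                 # leaves are nodes m+1..2m+1
    return e

def predict(x, W, Theta):
    """p(y | x): evaluate all split functions via sgn(Wx), select the
    leaf parameters theta = Theta^T f(h), and apply the softmax of (1)."""
    m = W.shape[0]
    theta = Theta.T @ navigate(np.sign(W @ x), m)
    z = np.exp(theta - theta.max())    # subtract max for numerical stability
    return z / z.sum()
```

For the tree of Fig. 1 (m = 3), navigate([+1, −1, +1], 3) returns [0, 0, 0, 1], selecting θ_4, and navigate([−1, +1, +1], 3) returns [0, 1, 0, 0], selecting θ_2.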
To formulate the loss for (x, y), we introduce a tree navigation function f : H^m → I^{m+1} that maps an m-bit sequence of split decisions (H^m ≡ {−1, +1}^m) to an indicator vector that specifies a 1-of-(m + 1) encoding. Such an indicator vector is only non-zero at the index of the selected leaf. Fig. 1 illustrates the tree navigation function for a tree with 3 internal nodes.

Using the notation developed above, θ = Θ^T f(sgn(W x)) represents the parameters corresponding to the leaf to which x is directed by the split functions in W. A generic loss function of the form ℓ(θ, y) measures the discrepancy between the model prediction based on θ and an output y. For the softmax model given by (1), a natural loss is the negative log probability of the correct label, referred to as log loss,

    ℓ(θ, y) = ℓ_log(θ, y) = −θ[y] + log( Σ_{β=1}^k exp(θ[β]) ) .    (2)

For regression tasks, when y ∈ R^q and the value of θ ∈ R^q is directly emitted as the model prediction, a natural choice of ℓ is squared loss,

    ℓ(θ, y) = ℓ_sqr(θ, y) = ‖θ − y‖^2 .    (3)

One can adopt other forms of loss within our decision tree learning framework as well. The goal of learning is to find W and Θ that minimize the empirical loss on a given training set D, that is,

    L(W, Θ; D) = Σ_{(x,y)∈D} ℓ(Θ^T f(sgn(W x)), y) .    (4)

Direct global optimization of the empirical loss L(W, Θ; D) with respect to W is challenging. It is a discontinuous and piecewise-constant function of W. Furthermore, given an input x, the navigation function f(·) yields a leaf parameter vector based on a sequence of binary tests, where the results of the initial tests determine which subsequent tests are performed.
It is not clear how this dependence between binary tests should be formulated.

4 Decision trees and structured prediction

To overcome the intractability in the optimization of L, we develop a piecewise smooth upper bound on the empirical loss. Our upper bound is inspired by the formulation of structured prediction with latent variables [25]. A key observation that links decision tree learning to structured prediction is that one can re-express sgn(W x) in terms of a latent variable h, that is,

    sgn(W x) = argmax_{h∈H^m} h^T W x .    (5)

In this form, the decision tree's split functions implicitly map an input x to a binary vector h by maximizing a score function h^T W x, the inner product of h and W x. One can re-express the score function in a more familiar form involving a joint feature space on h and x, as w^T φ(h, x), where φ(h, x) = vec(h x^T) and w = vec(W). Previously, Norouzi and Fleet [19] used the same reformulation (5) of linear threshold functions to learn binary similarity preserving hash functions. Given (5), we re-express the empirical loss as

    L(W, Θ; D) = Σ_{(x,y)∈D} ℓ(Θ^T f(ĥ(x)), y) ,  where  ĥ(x) = argmax_{h∈H^m} h^T W x .    (6)

This objective resembles the objective functions used in structured prediction, and since we do not have a priori access to the ground truth split decisions, ĥ(x), this problem is a form of structured prediction with latent variables.

5 Upper bound on empirical loss

We develop an upper bound on the loss for an input-output pair (x, y), which takes the form

    ℓ(Θ^T f(sgn(W x)), y) ≤ max_{g∈H^m} ( g^T W x + ℓ(Θ^T f(g), y) ) − max_{h∈H^m} h^T W x .    (7)

To validate the bound, first note that the second term on the RHS is maximized by h = ĥ(x) = sgn(W x). Second, when g = ĥ(x), it is clear that the LHS equals the RHS. Finally, for all other values of g, the RHS can only get larger than when g = ĥ(x) because of the max operator. Hence, the inequality holds. An algebraic proof of (7) is presented in the supplementary material.

In the context of structured prediction, the first term of the upper bound, i.e., the maximization over g, is called loss-augmented inference, as it augments the inference problem, i.e., the maximization over h, with a loss term. Fortunately, the loss-augmented inference for our decision tree learning formulation can be solved exactly, as discussed below.

It is also notable that the loss term on the LHS of (7) is invariant to the scale of W, but the upper bound on the RHS of (7) is not. As a consequence, as with binary SVMs and margin-rescaling formulations of structural SVMs [24], we introduce a regularizer on the norm of W when optimizing the bound. To justify the regularizer, we discuss the effect of the scale of W on the bound.

Proposition 1. The upper bound on the loss becomes tighter as a constant multiple of W increases, i.e., for a > b > 0:

    max_{g∈H^m} ( a g^T W x + ℓ(Θ^T f(g), y) ) − max_{h∈H^m} a h^T W x  ≤  max_{g∈H^m} ( b g^T W x + ℓ(Θ^T f(g), y) ) − max_{h∈H^m} b h^T W x .    (8)

Proof. Please refer to the supplementary material for the proof.

In the limit, as the scale of W approaches +∞, the loss term ℓ(Θ^T f(g), y) becomes negligible compared to the score term g^T W x. The solutions to the loss-augmented inference and inference problems then become almost identical, except when an element of W x is very close to 0. Thus, even though a larger ‖W‖ yields a tighter bound, it makes the bound approach the loss itself, which is nearly piecewise-constant and hence hard to optimize.
Based on Proposition 1, one easy way to decrease the upper bound is to increase the norm of W, which does not affect the loss. Our experiments indicate that a lower value of the loss can be achieved when the norm of W is regularized. We therefore constrain the norm of W to obtain an objective with better generalization. Since each row of W acts independently in the split functions of a decision tree, it is reasonable to constrain the norm of each row independently. Summing the bounds over the training pairs and constraining the norms of the rows of W, we obtain the following optimization problem, called the surrogate objective:

    minimize  L'(W, Θ; D) = Σ_{(x,y)∈D} ( max_{g∈H^m} ( g^T W x + ℓ(Θ^T f(g), y) ) − max_{h∈H^m} h^T W x )
    s.t.  ‖w_i‖^2 ≤ ν  for all i ∈ {1, ..., m} ,    (9)

where ν ∈ R^+ is a regularization parameter and w_i is the ith row of W. For all values of ν, we have L(W, Θ; D) ≤ L'(W, Θ; D). Instead of using the typical Lagrangian form of regularization, we employ hard constraints to enable sparse gradient updates of the rows of W, since the gradients for most rows of W are zero at each training step.

6 Optimizing the surrogate objective

Even though minimizing the surrogate objective of (9) entails a non-convex optimization, L'(W, Θ; D) is much better behaved than the empirical loss in (4): L'(W, Θ; D) is piecewise linear and convex-concave in W, and the constraints on W define a convex set.
Loss-augmented inference.
To evaluate and use the surrogate objective in (9) for optimization, we must solve a loss-augmented inference problem to find the binary code that maximizes the sum of the score and loss terms:

    ĝ(x) = argmax_{g∈H^m} ( g^T W x + ℓ(Θ^T f(g), y) ) .    (10)

An observation that makes this optimization tractable is that f(g) can only take on m+1 distinct values, which correspond to terminating at one of the m+1 leaves of the tree and selecting a leaf parameter vector from {θ_j}_{j=1}^{m+1}. Fortunately, for any leaf index j ∈ {1, ..., m+1}, we can solve

    argmax_{g∈H^m} ( g^T W x + ℓ(θ_j, y) )  s.t.  f(g) = 1_j    (11)

efficiently. Note that if f(g) = 1_j, then Θ^T f(g) equals the jth row of Θ, i.e., θ_j. To solve (11) we need to set all of the bits in g corresponding to the path from the root to leaf j to be consistent with the direction of that path. However, bits of g that do not appear on this path have no effect on the output of f(g), and all such bits should be set to g[i] = sgn(w_i^T x) to maximize g^T W x.
Accordingly, we can essentially ignore the off-the-path bits by subtracting sgn(W x)^T W x from (11) to obtain

    argmax_{g∈H^m} ( g^T W x + ℓ(θ_j, y) ) = argmax_{g∈H^m} ( (g − sgn(W x))^T W x + ℓ(θ_j, y) ) .    (12)

Algorithm 1 Stochastic gradient descent (SGD) algorithm for non-greedy decision tree learning.
1: Initialize W^(0) and Θ^(0) using the greedy procedure
2: for t = 0 to τ do
3:   Sample a pair (x, y) uniformly at random from D
4:   ĥ ← sgn(W^(t) x)
5:   ĝ ← argmax_{g∈H^m} { g^T W^(t) x + ℓ(Θ^T f(g), y) }
6:   W^(tmp) ← W^(t) − η ĝ x^T + η ĥ x^T
7:   for i = 1 to m do
8:     W^(t+1)_{i,·} ← min{ 1, √ν / ‖W^(tmp)_{i,·}‖_2 } W^(tmp)_{i,·}
9:   end for
10:  Θ^(t+1) ← Θ^(t) − η ∂/∂Θ ℓ(Θ^T f(ĝ), y) |_{Θ=Θ^(t)}
11: end for

Note that sgn(W x)^T W x is constant in g, and this subtraction zeros out all bits in g that are not on the path to leaf j. So, to solve (12), we only need to consider the bits on the path to leaf j for which sgn(w_i^T x) is not consistent with the path direction. Using a single depth-first search on the decision tree, we can solve (11) for every j and, among those solutions, pick the one that maximizes (11).

The algorithm described above is O(mp) ⊆ O(2^d p), where d is the tree depth, and a multiple of p is required for computing the inner product w_i^T x at each internal node i. This algorithm is not efficient for deep trees, especially as loss-augmented inference must be performed once for every stochastic gradient computation. In what follows, we develop an alternative, more efficient formulation and algorithm with time complexity O(d^2 p).
Fast loss-augmented inference.
To motivate the fast loss-augmented inference algorithm, we formulate a slightly different upper bound on the loss, i.e.,

    ℓ(Θ^T f(sgn(W x)), y) ≤ max_{g∈B_1(sgn(W x))} ( g^T W x + ℓ(Θ^T f(g), y) ) − max_{h∈H^m} h^T W x ,    (13)

where B_1(sgn(W x)) denotes the Hamming ball of radius 1 around sgn(W x), i.e., B_1(sgn(W x)) ≡ {g ∈ H^m | ‖g − sgn(W x)‖_H ≤ 1}; hence g ∈ B_1(sgn(W x)) implies that g and sgn(W x) differ in at most one bit. The proof of (13) is identical to the proof of (7). The key benefit of this new formulation is that loss-augmented inference with the new bound is computationally efficient. Since ĝ and sgn(W x) differ in at most one bit, f(ĝ) can only take d + 1 distinct values. Thus we need to evaluate (12) for at most d + 1 values of j, requiring a running time of O(d^2 p).

Stochastic gradient descent (SGD). One reasonable approach to minimizing (9) uses stochastic gradient descent (SGD), the steps of which are outlined in Alg. 1. Here, η denotes the learning rate and τ is the number of optimization steps. Line 6 corresponds to a gradient update in W, which is supported by the fact that ∂/∂W h^T W x = h x^T. Line 8 projects back onto the feasible region of W, and Line 10 updates Θ based on the gradient of the loss. Our implementation modifies Alg. 1 by adopting common SGD tricks, including the use of momentum and mini-batches.

Stable SGD (SSGD). Even though Alg. 1 achieves good training and test accuracy relatively quickly, we observe that after several gradient updates some of the leaves may end up not being assigned to any data points, so the full capacity of the tree is not exploited. We call such leaves inactive, as opposed to active leaves, which are assigned at least one training data point.
An inactive leaf may become active again, but this rarely happens given the form of the gradient updates. To discourage abrupt changes in the number of inactive leaves, we introduce a variant of SGD in which the assignments of data points to leaves are fixed for a number of gradient updates. Thus, the bound is optimized with respect to a set of data-point-to-leaf assignment constraints. When the improvement in the bound becomes negligible, the leaf assignment variables are updated, followed by another round of optimization of the bound. We call this algorithm Stable SGD (SSGD) because it changes the assignment of data points to leaves more conservatively than SGD. Let a(x) denote the 1-of-(m + 1) encoding of the leaf to which a data point x should be assigned. Then, each iteration of SSGD with fast loss-augmented inference relies on the following upper bound on the loss:

    ℓ(Θ^T f(sgn(W x)), y) ≤ max_{g∈B_1(sgn(W x))} ( g^T W x + ℓ(Θ^T f(g), y) ) − max_{h∈H^m : f(h)=a(x)} h^T W x .    (14)

One can easily verify that the RHS of (14) is larger than the RHS of (13), hence the inequality.

Figure 2: Test and training accuracy of a single tree as a function of tree depth for different methods (panels: SensIT, Connect4, Protein, MNIST). Non-greedy trees achieve better test accuracy throughout different depths and exhibit less vulnerability to overfitting.

Computational complexity. To analyze the computational complexity of each SGD step, we note that the Hamming distance between ĝ (defined in (10)) and ĥ = sgn(W x) is bounded above by the depth of the tree d. This is because only those elements of ĝ corresponding to the path to a selected leaf can differ from sgn(W x). Thus, for SGD the expression (ĝ − ĥ) x^T needed for Line 6 of Alg. 1 can be computed in O(dp), if we know which bits of ĥ and ĝ differ.
Accordingly, Lines 6 and 7 can be performed in O(dp). The computational bottleneck is the loss-augmented inference in Line 5. When fast loss-augmented inference is performed in O(d^2 p) time, the total time complexity of a gradient update for both SGD and SSGD becomes O(d^2 p + k), where k is the number of labels.

7 Experiments

Experiments are conducted on several benchmark datasets from LibSVM [6] for multi-class classification, namely SensIT, Connect4, Protein, and MNIST. We use the provided train/validation/test sets when available. If such splits are not provided, we use a random 80%/20% split of the training data for train/validation and a random 64%/16%/20% split for train/validation/test sets.

We compare our method for non-greedy learning of oblique trees with several greedy baselines, including conventional axis-aligned trees based on information gain, OC1 oblique trees [17] that use coordinate descent for optimization of the splits, and random oblique trees that select the best split function from a set of randomly generated hyperplanes based on information gain. We also compare with the results of CO2 [18], which is a special case of our upper bound approach applied greedily to trees of depth 1, one node at a time. Any base algorithm for learning decision trees can be augmented by post-training pruning [16], or by building ensembles with bagging [4] or boosting [8]. However, the key differences between non-greedy trees and the greedy baselines become most apparent when analyzing individual trees. For a single tree the major determinant of accuracy is the size of the tree, which we control by changing the maximum tree depth.

Fig. 2 depicts test and training accuracy for non-greedy trees and four other baselines as a function of tree depth. We evaluate trees of depth 6 up to 18 at depth intervals of 2. The hyper-parameters for each method are tuned for each depth independently.
While the absolute accuracy of our non-greedy trees varies between datasets, a few key observations hold for all cases. First, we observe that non-greedy trees achieve the best test performance across tree depths on multiple datasets. Second, trees trained using our non-greedy approach seem to be less susceptible to overfitting and achieve better generalization performance at various tree depths. As described below, we think that the norm regularization provides a principled way to tune the tightness of the tree's fit to the training data. Finally, the comparison between non-greedy and CO2 [18] trees concentrates on the effect of non-greediness, as it compares our method with its simpler variant, which is applied greedily one node at a time. We find that in most cases the non-greedy optimization helps by improving upon the results of CO2.

Figure 3: The effect of ν on the structure of the trees trained on MNIST, for tree depths d = 10, 13, and 16 (y-axis: number of active leaves; x-axis: regularization parameter ν, log scale). A small value of ν prunes the tree to use far fewer leaves than the axis-aligned baseline used for initialization (dotted line).

A key hyper-parameter of our method is the regularization constant ν in (9), which controls the tightness of the upper bound.
With a small ν, the norm constraints force the method to choose a W with a large margin at each internal node. The choice of ν is therefore closely related to the generalization of the learned trees. As shown in Fig. 3, ν also implicitly controls the degree of pruning of the leaves of the tree during training. We train multiple trees for different values of ν ∈ {0.1, 1, 4, 10, 43, 100}, and we pick the value of ν that produces the tree with minimum validation error. We also tune the choice of the SGD learning rate, η, in this step. These values of ν and η are then used to build a tree on the union of the training and validation sets, which is evaluated on the test set.

To build non-greedy trees, we initially build an axis-aligned tree with split functions that threshold a single feature, optimized using conventional procedures that maximize information gain. The axis-aligned split is used to initialize a greedy variant of the tree training procedure called CO2 [18]. This provides initial values for W and Θ for the non-greedy procedure.

Fig. 4 shows an empirical comparison of training time for SGD with loss-augmented inference and with fast loss-augmented inference. As expected, the run-time of loss-augmented inference exhibits exponential growth for deep trees, whereas its fast variant is much more scalable. We expect to see much larger speedup factors for larger datasets; Connect4 has only 55,000 training points.

Figure 4: Total time to execute 1000 epochs of SGD on the Connect4 dataset using loss-augmented inference and its fast variant.

8 Conclusion

We present a non-greedy method for learning decision trees, using stochastic gradient descent to optimize an upper bound on the empirical loss of the tree's predictions on the training set. Our model poses the global training of decision trees in a well-characterized optimization framework.
This makes it simpler to pose extensions that could be considered in future work. Efficiency gains could be achieved by learning sparse split functions via sparsity-inducing regularization on W. Further, the core optimization problem permits applying the kernel trick to the linear split parameters W, making our overall model applicable to learning higher-order split functions or training decision trees on examples in arbitrary Reproducing Kernel Hilbert Spaces.

Acknowledgment. MN was financially supported in part by a Google fellowship. DF was financially supported in part by NSERC Canada and the NCAP program of CIFAR.

References

[1] K. P. Bennett. Global tree optimization: A non-greedy decision tree algorithm. Computing Science and Statistics, pages 156–156, 1994.
[2] K. P. Bennett and J. A. Blue. A support vector machine approach to decision trees. In Department of Mathematical Sciences Math Report No. 97-100, Rensselaer Polytechnic Institute, pages 2396–2401, 1997.
[3] K. P. Bennett, N. Cristianini, J. Shawe-Taylor, and D. Wu. Enlarging the margins in perceptron decision trees. Machine Learning, 41(3):295–313, 2000.
[4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[5] L. Breiman, J. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees. Chapman & Hall/CRC, 1984.
[6] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines, 2001.
[7] A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer, 2013.
[8] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[9] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. IEEE Trans. PAMI, 33(11):2188–2202, 2011.
[10] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning (Ed. 2). Springer, 2009.
[11] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.
[12] J. Jancsary, S. Nowozin, and C. Rother. Loss-specific training of non-parametric image restoration models: A new state of the art. ECCV, 2012.
[13] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[14] E. Konukoglu, B. Glocker, D. Zikic, and A. Criminisi. Neighbourhood approximation forests. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2012, pages 75–82. Springer, 2012.
[15] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems, pages 3140–3148, 2014.
[16] J. Mingers. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4(2):227–243, 1989.
[17] S. K. Murthy and S. L. Salzberg. On growing better decision trees from data. PhD thesis, Johns Hopkins University, 1995.
[18] M. Norouzi, M. D. Collins, D. J. Fleet, and P. Kohli. CO2 forest: Improved random forest by continuous optimization of oblique splits. arXiv:1506.06155, 2015.
[19] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. ICML, 2011.
[20] S. Nowozin. Improved information gain estimates for decision tree induction. ICML, 2012.
[21] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[22] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, et al. Efficient human pose estimation from single depth images. IEEE Trans. PAMI, 2013.
[23] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. NIPS, 2003.
[24] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. ICML, 2004.
[25] C. N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. ICML, 2009.