Alternating Optimization of Decision Trees, with Application to Learning Sparse Oblique Trees

Advances in Neural Information Processing Systems (NeurIPS 2018), pages 1211-1221.

Miguel Á. Carreira-Perpiñán (Dept. EECS, University of California, Merced; mcarreira-perpinan@ucmerced.edu)
Pooya Tavallali (Dept. EECS, University of California, Merced; ptavallali@ucmerced.edu)

Abstract

Learning a decision tree from data is a difficult optimization problem.
The most widespread algorithm in practice, dating to the 1980s, is based on a greedy growth of the tree structure by recursively splitting nodes, and possibly pruning back the final tree. The parameters (decision function) of an internal node are approximately estimated by minimizing an impurity measure. We give an algorithm that, given an input tree (its structure and the parameter values at its nodes), produces a new tree with the same or smaller structure but new parameter values that provably lower or leave unchanged the misclassification error. This can be applied to both axis-aligned and oblique trees, and our experiments show it consistently outperforms various other algorithms while being highly scalable to large datasets and trees. Further, the same algorithm can handle a sparsity penalty, so it can learn sparse oblique trees, having a structure that is a subset of the original tree and few nonzero parameters. This combines the best of axis-aligned and oblique trees: flexibility to model correlated data, low generalization error, fast inference and interpretable nodes that involve only a few features in their decision.

1 Introduction

Decision trees are among the most widely used statistical models in practice. They are routinely at the top of the list in the KDnuggets.com annual polls of top machine learning algorithms and other top-10 lists [36]. Many statistical or mathematical packages such as SAS or Matlab implement them. Apart from being able to model nonlinear data well in the first place, they enjoy several unique advantages. The prediction made by the tree is a path from the root to a leaf consisting of a sequence of decisions, each involving a question of the type "x_j > b_i" (is feature j bigger than threshold b_i?) for axis-aligned trees, or "w_i^T x > b_i" for oblique trees.
This makes inference very fast, and the tree may not even need to use all input features to make a prediction (with axis-aligned trees). The path can be understood as a sequence of IF-THEN rules, which is intuitive to humans, and indeed one can equivalently turn the tree into a database of rules. This can make decision trees preferable over more accurate models such as neural nets in some applications, such as medical diagnosis or legal analysis.

However, trees pose one crucial problem that is only partly solved to date: learning the tree from data is a very difficult optimization problem, involving a search over a complex, large set of tree structures, and over the parameters at each node. For simplicity, in this paper we focus on classification trees with a binary tree (having a binary split at each node) where the bipartition in each node is either an axis-aligned hyperplane or an arbitrary hyperplane (oblique trees). However, many of our arguments carry over beyond this case.

Ideally, the objective function we would like to optimize has the usual form of a regularized loss:

    E(T) = L(T) + α C(T),   α > 0                                          (1)

where L is the misclassification error on the training set, given below in eq. (2), and C is the complexity of the tree, e.g. its depth or number of nodes. This is necessary to avoid large trees that finely split the space so the dataset is perfectly classified, but would likely overfit. Finding an optimal decision tree is NP-hard [22] even if we fix its number of splits [26].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

How does one learn a tree in practice (also called "tree induction")? After many decades of research, the algorithms that have stood the test of time are, in spite of their obvious suboptimality, (variations of) greedy growing and pruning, such as CART [8] or C4.5 [31].
First, a tree is grown by recursively splitting each node into two children, using an impurity measure. One can stop growing and return the tree when the impurity of each leaf falls below a set threshold. Even better trees are produced by growing a large tree and pruning it back one node at a time. At each growing step, the parameters at the node are learned by minimizing an impurity measure such as the Gini index, cross-entropy or misclassification error. The goal is to find a bipartition where each class is as pure (single-class) as possible. Gini or cross-entropy are preferable to misclassification error because the former are more sensitive to changes in the node probabilities than the misclassification rate [20, p. 309]. Minimizing the impurity over the parameters at the node depends on the node type:

• Axis-aligned trees: the exact solution can be found by enumeration over all (feature, threshold) combinations. For a given feature (= axis), the possible thresholds are the midpoints between consecutive training point values along that axis. For a node i containing N_i training points in R^D, an efficient implementation can do this in O(D N_i) time.

• Oblique trees: minimizing the impurity is much harder because it is a non-differentiable function of the weights. As these change continuously, points change side of the hyperplane discontinuously, and so does the impurity. The standard approach is coordinate descent over the weights at the node, but this tends to get stuck in poor local optima [8, 28].

The optimization over the node parameters (exact for axis-aligned trees, approximate for oblique trees) assumes the rest of the tree (structure and parameters) is fixed. The greedy nature of the algorithm means that once a node is optimized, it is fixed forever.

Note that it is only in the leaves where an actual predictive model is fit.
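The axis-aligned enumeration can be sketched as follows. This is a minimal, naive illustration in Python (the function names are ours), not the efficient O(D N_i) implementation the text refers to:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_axis_aligned_split(X, y):
    """Enumerate all (feature, threshold) combinations; candidate thresholds
    are the midpoints between consecutive values along each axis. Return the
    split minimizing the weighted Gini impurity of the two children."""
    n, d = len(X), len(X[0])
    best = (None, None, float("inf"))  # (feature, threshold, impurity)
    for j in range(d):
        vals = sorted(set(x[j] for x in X))
        for a, b in zip(vals, vals[1:]):
            t = (a + b) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            imp = (len(left) * gini(left) + len(right) * gini(right)) / n
            if imp < best[2]:
                best = (j, t, imp)
    return best

# Toy set, separable along feature 0: the best split is a pure bipartition.
X = [[0.0, 5.0], [1.0, 4.0], [2.0, 3.0], [3.0, 2.0]]
y = [0, 0, 1, 1]
j, t, imp = best_axis_aligned_split(X, y)
```

An efficient version would instead sort each feature once and update the class counts incrementally while sweeping the thresholds.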
The internal nodes do not do prediction; they partition the space ever more finely into boxes (axis-aligned trees) or polyhedra (oblique trees). Each internal node bipartitions its region. Each leaf fits a local model to its region (for classification, the model is often the majority label of the training points in its region). Hence, the tree is really a partition of the space into disjoint regions, with a local predictor at each region and fast access to the region for a given input point (by propagating it through the tree).

The majority of trees used in practice are axis-aligned, not oblique. Two possible reasons for this are: 1) an oblique tree is slower at inference and less interpretable because each node involves all features; and 2) as noted above and confirmed in our experiments, coordinate descent for oblique trees does not work as well, and often an axis-aligned tree will outperform an oblique tree of similar size in test error. This is disappointing because an axis-aligned tree imposes an arbitrary region geometry that is unsuitable for many problems and results in larger trees than would be needed.

In this paper we address both of these problems with oblique trees. We focus on a restricted setting: we assume a given tree structure, given by an initial tree (CART or random). We propose an optimization algorithm for the tree parameters that considerably decreases its misclassification error. Further, this allows us to introduce a new type of trees that we call sparse oblique trees, where each node is a hyperplane involving only a small subset of features, and whose structure is a pruned version of the original tree. Our algorithm is based on directly optimizing the quantity of interest, the misclassification error, using alternating optimization over separable subsets of nodes.
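To make the model class concrete, here is a minimal sketch of inference in an oblique tree (the node and leaf classes and their names are illustrative, not the paper's code):

```python
class Leaf:
    def __init__(self, label):
        self.label = label  # the class predicted in this leaf's region

class Node:
    """Internal node with decision hyperplane w^T x >= b."""
    def __init__(self, w, b, left, right):
        self.w, self.b, self.left, self.right = w, b, left, right

def predict(tree, x):
    """Follow the root-leaf path: one dot product per node on the path.
    For an axis-aligned node, w has a single nonzero entry, so the
    decision reduces to thresholding one feature."""
    while isinstance(tree, Node):
        s = sum(wi * xi for wi, xi in zip(tree.w, x))
        tree = tree.right if s >= tree.b else tree.left
    return tree.label

# A depth-1 oblique tree that splits on x1 + x2 >= 1.
tree = Node(w=[1.0, 1.0], b=1.0, left=Leaf(0), right=Leaf(1))
```

A sparse w (few nonzeros) makes each decision both cheap and readable, which is the point of the sparse oblique trees introduced above.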
After a section 2 on related work, we describe our algorithm in section 3 and evaluate it in sections 4 and 5.

2 Related work

The CART book [8] is a good summary of work on decision trees up to the 80s, including the basic algorithms to learn the tree structure (greedy growing and pruning), and to optimize the impurity measure at each node (by enumeration for the axis-aligned case and coordinate descent over the weights for the oblique case). OC1 [28] is a minor variation of the coordinate descent algorithm of CART for oblique trees that uses multiple restarts and random perturbations after convergence to try to find a better local optimum, but its practical improvement is marginal. See [31, 32] for reviews of more recent work, including tree induction and applications. There is also a large literature on constructing ensembles of trees, such as random forests [7, 13] or boosting [33], but we focus here on learning a single tree.

Much research has focused on optimizing the parameters of a tree given an initial tree (obtained with greedy growing and pruning) whose structure is kept fixed. Bennett [2, 3] casts the problem of optimizing a fixed tree as a linear programming problem, in which the global optimum can be found. However, the linear program is so large that the procedure is only practical for very small trees (4 internal nodes in her experiments); also, it applies only to binary classification. Norouzi et al. [29, 30] introduce a framework based on optimizing an upper bound on the tree loss using stochastic gradient descent (initialized from an already induced tree).
Their method is scalable to large datasets; however, it is not guaranteed to decrease the real loss function of a decision tree and may even marginally worsen an already induced tree.

Bertsimas and Dunn [4] formulate the optimization over tree structures (limited to a given depth) and node parameters as a mixed-integer optimization (MIO) by introducing auxiliary binary variables that encode the tree structure. Then, one can apply state-of-the-art MIO solvers (based on branch-and-bound) that are guaranteed to find the globally optimal tree (unlike the classical, greedy approach). However, this has a worst-case exponential cost and again is not practical unless the tree is very small (depth 2 to 4 in their paper).

Methods such as T2 [1], T3 [34] and T3C [35] introduce a family of efficient enumeration approaches constructing optimal non-binary decision trees of depths up to 3. These trees may not be as interpretable as binary ones and do not outperform existing approaches to inducing trees [34, 35].

Finally, soft decision trees assign a probability to every root-leaf path of a fixed tree structure, such as the hierarchical mixtures of experts [23]. The parameters can be learned by maximum likelihood with an EM or gradient-based algorithm. However, this loses the fast inference and interpretability advantages of regular decision trees, since now an instance must follow each root-leaf path.

3 Alternating optimization over node sets

Problem definition We want to optimize eq. (1) but assuming a given, fixed tree structure (obtained e.g. from the CART algorithm, i.e., greedy growing and pruning for axis-aligned or oblique trees with impurity minimization at each node).
Equivalently, since the tree complexity is fixed, we minimize the misclassification cost jointly over the parameters Θ = {θ_i} of all nodes i of the tree:

    L(Θ) = Σ_{n=1}^{N} L(y_n, T(x_n; Θ))                                   (2)

where {(x_n, y_n)}_{n=1}^{N} ⊂ R^D × {1, ..., K} is a training set of D-dimensional real-valued instances and their labels (in K classes), L(·, ·) is the 0/1 loss, and T(x; Θ): R^D → {1, ..., K} is the predictive function of the tree. This is obtained by propagating x along a path from the root down to a leaf, computing a binary decision f_i(x; θ_i): R^D → {left, right} at each internal node i along the path, and outputting the leaf's label. Hence, the parameters θ_i at a node i are:

• If i is a leaf, θ_i = {y_i} ⊂ {1, ..., K} contains the label at that leaf.

• If i is an internal (decision) node, θ_i = {w_i, b_i}, where w_i ∈ R^D is the weight vector and b_i ∈ R the threshold (or bias) for the decision hyperplane "w_i^T x ≥ b_i". For axis-aligned trees, the decision hyperplane is "x_{k(i)} ≥ b_i" where k(i) ∈ {1, ..., D}, i.e., we threshold feature k(i), hence θ_i = {k(i), b_i}.

Separability condition The key to designing a good optimization algorithm for (2) is the following separability condition. Assume the parameters are not shared across nodes (i ≠ j ⇒ θ_i ∩ θ_j = ∅).

Theorem 3.1 (separability condition). Let T(x; Θ) be the predictive function of a rooted directed binary decision tree. If nodes i and j (internal or leaves) are not descendants of each other, then, as a function of θ_i and θ_j (i.e., fixing all other parameters Θ \ {θ_i, θ_j}), the function L(Θ) of eq. (2) can be written as L(θ_i, θ_j) = L_i(θ_i) + L_j(θ_j) + constant, where the "constant" does not depend on θ_i or θ_j.

Proof.
Each training point x_n for n ∈ {1, ..., N} follows a unique path from the root to one leaf of the tree. Hence, a node i receives a subset {(x_n, y_n): n ∈ S_i} of the training set {(x_n, y_n)}_{n=1}^{N}, on which its bisector (with parameters θ_i) will operate. If i and j are not descendants of each other, then their subsets are disjoint. Since L(Θ) is a separable sum over the N points, the theorem follows, with L_i(θ_i) summing over the training points in S_i, L_j(θ_j) summing over those in S_j, and the remaining points being summed in the "constant" term. That is, the respective terms are:

    L(Θ) = Σ_{n∈S_i} L(y_n, T(x_n; Θ)) + Σ_{n∈S_j} L(y_n, T(x_n; Θ)) + Σ_{n∈{1,...,N}\(S_i∪S_j)} L(y_n, T(x_n; Θ)),

where the first sum is L_i(θ_i), the second is L_j(θ_j) and the third is the "constant".

Note that L_i depends on the parameters θ_k of other nodes k besides i, but it does not depend on θ_j. Likewise, L_j depends on other nodes' parameters besides j's, but it does not depend on θ_i; and the "constant" term depends on other nodes' parameters, but it does not depend on θ_i or θ_j.

The separability condition allows us to optimize separately (and in parallel) over the parameters of any set of nodes that are not descendants of each other (fixing the parameters of the remaining nodes). This has two advantages. First, we expect a deeper decrease of the loss, because we optimize over a large set of parameters exactly. This is because optimizing over each node can be done exactly (see some caveats later) and the nodes separate. Second, the computation is fast: the joint problem over the set becomes a collection of smaller independent problems over the nodes that can be solved in parallel (if so desired).
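One concrete choice of such a node set is "all nodes at the same depth". A single pass under this choice can be sketched as follows (a schematic illustration only: the tree is stored as a flat node list with precomputed depths, and `optimize_node` stands in for the single-node problems discussed later):

```python
def tao_pass(nodes, optimize_node):
    """One alternating-optimization pass: visit depth levels bottom-up.
    Nodes at the same depth are not descendants of each other, so by the
    separability condition their subproblems are independent and could
    be solved in parallel."""
    max_depth = max(n["depth"] for n in nodes)
    for depth in range(max_depth, -1, -1):  # reverse breadth-first order
        for node in (n for n in nodes if n["depth"] == depth):
            optimize_node(node)  # exact or surrogate single-node problem

# Tiny example: record the visiting order on a depth-2 tree.
nodes = [{"id": 0, "depth": 0}, {"id": 1, "depth": 1}, {"id": 2, "depth": 1},
         {"id": 3, "depth": 2}, {"id": 4, "depth": 2}]
order = []
tao_pass(nodes, lambda n: order.append(n["id"]))
```

The leaves (ids 3 and 4) are visited first and the root (id 0) last, matching the bottom-top sweep described next.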
There are many possible choices of such node sets, and it is of interest to make those sets as big as possible, so that we make large, fast moves in search space. One example of such a set is "all nodes at the same depth" (distance from the root), and we will focus on it.

TAO: alternating optimization over depth levels of the tree We cycle over depth levels from the bottom (leaves) to the top (root) and iterate bottom-top, bottom-top, etc. (i.e., reverse breadth-first order). We experimented with other orders (top-bottom, or alternating bottom-top and top-bottom) but found little difference in the results for both axis-aligned and oblique trees. At a given depth level, we optimize in parallel over all the nodes. We call this algorithm Tree Alternating Optimization (TAO). The optimization over each node is described below. Before that, we make some observations.

As TAO iterates, the root-leaf path followed by each training point changes, and so does the set of points that reach a particular node. This can cause dead branches and pure subtrees, which we remove after convergence. Dead branches arise if, after optimizing over a node, some of its subtrees (a child or other descendants) become empty because they receive no training points from their parent (which sends all its points to the other child). The subtree of a node with one empty child can be replaced with the non-empty child's subtree. We do not do this as soon as they become empty, in case they might become non-empty again. Pure subtrees arise if, after optimizing over a node, some of its subtrees become pure (i.e., all their points have the same label). A pure subtree can be replaced with a leaf but, as with dead branches, we do this after convergence, in case they become impure again. During each pass, only non-empty, impure nodes are processed, so dead branches and pure subtrees are ignored, which accelerates the algorithm.
This means that TAO can actually modify the tree structure by reducing the size of the tree; we call this indirect pruning, and it is very significant with sparse oblique trees (described later). It is a good thing because we achieve (while always decreasing the training loss) a smaller tree that is faster, takes less space and, having fewer parameters, probably generalizes better. TAO pseudocode appears in the supplementary material.

Optimizing the misclassification error at a single node As we show below, optimizing eq. (2), the K-class misclassification error of the tree, over a node's parameters (decision function at an internal node or predictor model at a leaf) reduces to optimizing the misclassification error of a simpler classifier. In some important special cases this "reduced problem" can be solved exactly, but in general it is an NP-hard problem [17, 21]. In the latter case, we can approximate it by a surrogate loss (e.g. the logistic or hinge loss, as in a support vector machine). We consider each type of node next (leaf or internal).

Optimizing (2) over a leaf is equivalent to minimizing the K-class misclassification error over the subset of training points that reach the leaf. If the classifier at the leaf is a constant label, this is solved exactly by majority vote (i.e., setting the leaf label to the most frequent label in the leaf's subset of points). If the classifier at the leaf is some other model, we can train it using a surrogate of the misclassification error.

Optimizing (2) over an internal node i is exactly equivalent to a reduced problem: a binary misclassification loss for a certain subset C_i (defined below) of the training points over the parameters θ_i of i's decision function f_i(x; θ_i). The argument is subtle; we show it step by step.
Firstly, optimizing the misclassification error over θ_i in (2), where the loss is summed over the whole training set, is equivalent to optimizing it over the subset of training points S_i that reach node i. Next, the fate of a point x_n ∈ S_i (i.e., the label the tree will predict for it) depends only on which of i's children it follows, because the decision functions and leaf predictor models in the subtree rooted at i are fixed (in other words, the subtree of each child of i is a fixed decision tree that classifies whatever is inputted to it). Hence, call z_j ∈ {1, ..., K} the label predicted for x_n if following child j (where j is left or right). Now, comparing the ground-truth label y_n of x_n in the training set with the predicted label z_j for child j, they can either be equal (y_n = z_j, correct classification) or not (y_n ≠ z_j, incorrect classification). Hence, if x_n is correctly classified for all children j ∈ {left, right}, or incorrectly classified for all children j ∈ {left, right}, the fate of this point cannot be altered by changing the decision function at i, and we call such a point "don't-care". It can be removed from the loss, since it contributes an additive constant. Therefore, the only points ("care" points) whose fate does depend on the parameters of i's decision function are those for which z_left = y_n ≠ z_right or z_right = y_n ≠ z_left. Then, we can define a new, binary classification problem over the parameters θ_i of the decision function f_i(x; θ_i) on the "care" points C_i ⊆ S_i, where x_n ∈ C_i has a label y_n ∈ {left, right} according to which child classifies it correctly. The misclassification error in this problem equals the misclassification error in eq. (2) for each "care" point. In summary, we have proven the following theorem.

Theorem 3.2 (reduced problem).
Let T(x; Θ) be the predictive function of a rooted directed binary decision tree and i be any internal node in the tree with decision function f_i(x; θ_i). The tree's K-class misclassification error (2) can be written as:

    L(Θ) = Σ_{n=1}^{N} L(y_n, T(x_n; Θ)) = Σ_{n∈C_i} L(y_n, f_i(x_n; θ_i)) + constant      (3)

where the "constant" does not depend on θ_i, C_i ⊂ {1, ..., N} is the set of "care" training points for i defined above, and y_n ∈ {left, right} is the child that leads to a correct classification for x_n under i's current subtree.

All that is left now is how to solve this binary misclassification loss problem:

    Reduced problem:    min_{θ_i} Σ_{n∈C_i} L(y_n, f_i(x_n; θ_i)).                         (4)

For axis-aligned trees, it can be solved exactly by enumeration over features and splits, just as in the CART algorithm to minimize the impurity. For oblique trees, we approximate it by a surrogate loss.

The reduced problem is, of course, much easier to solve than the original loss over the tree. In particular for the oblique case (where the node decision function is a hyperplane, for which enumeration is intractable), the reduced problem introduces a crucial advantage over the traditional impurity minimization in CART. The latter is a non-differentiable, unsupervised problem, which is solved rather inaccurately via coordinate descent over the hyperplane weights. The reduced problem is also non-differentiable, but it is supervised and can be conveniently approximated by a surrogate binary classification loss, which is much easier to solve accurately.
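The construction of the "care" points can be illustrated directly: given the fixed subtrees of node i's two children, viewed as classifiers, keep only the points correctly classified under exactly one child and label each with that child. A sketch (the function name is ours):

```python
def reduced_problem_data(points, labels, predict_left, predict_right):
    """Build the binary problem of eq. (4): keep only the 'care' points,
    i.e. those correctly classified under exactly one child, and label
    each with the child ('left' or 'right') that classifies it correctly.
    'Don't-care' points (both children right, or both wrong) drop out."""
    care_x, care_y = [], []
    for x, y in zip(points, labels):
        ok_left = predict_left(x) == y    # z_left == y_n ?
        ok_right = predict_right(x) == y  # z_right == y_n ?
        if ok_left != ok_right:           # exactly one child is correct
            care_x.append(x)
            care_y.append("left" if ok_left else "right")
    return care_x, care_y

# Toy example: the left subtree always predicts class 0, the right class 1.
X = [[0.0], [1.0], [2.0]]
y = [0, 1, 0]
cx, cy = reduced_problem_data(X, y, lambda x: 0, lambda x: 1)
```

Fitting any binary classifier to `(cx, cy)` then gives the node's new decision function; for oblique nodes this is where the surrogate (e.g. an SVM) comes in.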
This improved optimization translates into much better trees using TAO, as seen in our experiments.

Sparse oblique trees The equivalence of optimizing (2) over one oblique node to the reduced problem (4) makes it computationally easy to introduce constraints over the weight vector, and hence learn more flexible types of oblique trees than was possible before. In this paper we propose to learn oblique nodes involving few features. We can do this by adding an ℓ1 penalty "λ Σ_{nodes i} ‖w_i‖_1" to the misclassification error (2), where λ ≥ 0 controls the sparsity: from no sparsity for λ = 0 to all weight vectors in all nodes being zero if λ is large enough. Since this penalty separates over nodes, the only change in TAO is that the optimization over node i in eq. (4) has a penalty "λ‖w_i‖_1". This can be easily combined with the formulation above, resulting in an ℓ1-regularized linear SVM or logistic regression [19, sections 3.2 and 3.6], a convex problem for which well-developed code exists, such as LIBLINEAR [15]. Alternatively, one can use an ℓ0 penalty or constraint on the weights, for which good optimization algorithms also exist [14].

Convergence and computational complexity Optimizing the misclassification loss L is NP-hard in general [17, 21, 26] and we have no approximation guarantees for TAO at present. TAO does converge to a local optimum in the sense of alternating optimization (as in k-means), i.e., when no more progress can be made by optimizing one subset of nodes given the rest. For oblique trees, the complexity of one TAO iteration (a pass over all nodes) is upper bounded by the tree depth times the cost of solving an SVM on the whole training set, and is typically considerably smaller than that.
For axis-aligned trees, one TAO iteration is comparable to running CART to grow a tree of the same size, since in both cases the nodes run an enumeration procedure over features and thresholds. See details in the supplementary material.

4 Experiments: sparse oblique trees for MNIST digits

The supplementary material gives additional experiments using TAO to optimize axis-aligned and oblique trees on over 10 datasets, comparing with other methods for optimizing trees. In a nutshell, TAO produces trees that significantly improve over the CART baseline for axis-aligned and especially oblique trees, and also consistently beat the other methods. But where TAO is truly remarkable is with sparse oblique trees, and we explore this here on the MNIST benchmark [27].

We induce an initial tree for TAO using the CART algorithm¹ (greedy growing and pruning), either for axis-aligned trees (enumeration over features/thresholds) or oblique trees (coordinate descent over weights, picking the best of several random restarts [28]). The node optimization uses an ℓ1-regularized linear SVM with slack hyperparameter C ≥ 0 (so the TAO sparsity hyperparameter of section 3 is λ = 1/C), implemented with LIBLINEAR [15]. The rest of our code is in Matlab. We stop TAO when the training misclassification loss decreases by less than 0.5%, or when the number of iterations (passes over all nodes) reaches 14 (in practice TAO stops after around 7 iterations).

Sparse oblique trees behave like the LASSO [20]: we have a path of trees as a function of the sparsity hyperparameter C, from ∞ (no ℓ1 penalty) to 0 (all parameters zero). We estimate this path by inducing an initial CART tree and running TAO for a sequence of decreasing C values, where the tree at the current C value is used to initialize TAO for the next, smaller C value.
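This warm-started sweep over C can be sketched generically (a sketch only: `fit_tao` is a placeholder for running TAO at a given C from a given initial tree; the dummy below merely records the warm-start chaining):

```python
def tree_path(initial_tree, C_values, fit_tao):
    """LASSO-style regularization path: run TAO for decreasing C,
    warm-starting each run from the tree obtained at the previous
    (larger) C value."""
    path = []
    tree = initial_tree
    for C in sorted(C_values, reverse=True):  # weak to strong sparsity
        tree = fit_tao(tree, C)
        path.append((C, tree))
    return path

# Dummy fit_tao that tags the tree with the C it was "trained" at,
# making the chaining visible.
path = tree_path("cart_init", [0.01, 1.0, 100.0],
                 lambda tree, C: (tree, C))
```

Each entry of `path` is a candidate tree; the validation set then selects the best C along the path.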
We constructed paths using initial CART trees of depths 4 to 12 (both axis-aligned and oblique) on the MNIST dataset, splitting its training set of 60k points into 48k training and 12k validation points (to determine an optimal C or depth), and reporting generalization error on its 10k test points. The training time for each C value is roughly between 1 minute (for depth 4) and 4 minutes (for depth 12).

Path of trees The resulting paths are best viewed in our supplementary animations. Fig. 1 shows three representative trees from the path for depth 12: the initial CART tree (which was oblique), the tree with optimal validation error and an oversparse tree. It also plots various measures as a function of C. Several observations are apparent, as follows.

As soon as TAO runs on the initial CART tree (for the largest C value, which imposes little sparsity), the improvement is drastic: from a training/test error of 1.95%/11.03% down to 0.09%/5.66%. The tree is pruned from 410 to 230 internal nodes (the number of leaves in a binary tree equals the number of internal nodes plus 1). The TAO tree is more balanced: the samples are distributed more evenly over the tree branches and the training subset that a node receives is more pure. This can be seen explicitly from the node histograms (tree-as-binary-heap animations in the supplementary material) or indirectly from the sample mean of the node.

Further decreasing C imposes more sparsity, and this leads to progressive pruning of the tree and ever sparser weight vectors at the internal nodes. The large changes in topology are caused by postprocessing the tree to remove dead branches. The number of nonzero weights and the number of nodes (internal and leaves) decrease monotonically with C. The training error slowly increases but the test error remains about constant.
In general, depending on the initial tree size, we find that the validation error (not shown) and the test error are minimal for some range of C; trees there will generalize best. Further decreasing C (beyond 0.01 in the figure) increases both training and test error and produces smaller trees with sparser weight vectors that underfit.

¹During the review period we found out that TAO performs about as well on random initial trees (having random parameters at the nodes) as on trees induced by CART. This would make TAO a stand-alone learning algorithm rather than a postprocessing step over a CART tree. We will report on this in a separate publication.

Inference runtime For inference (prediction), each point follows a different root-leaf path. We report the mean/min/max path length (number of nodes) and runtime (number of operations, here scalar multiplications) over the training set, for each C. These decrease mostly monotonically with C. The inference time is extremely fast because a root-leaf path involves a handful of nodes and each node's decision function looks at a few pixels of the image. This is orders of magnitude faster than kernel machines, deep nets, random forests or nearest-neighbor classifiers, and is a major advantage of trees. The same can be said of the model storage required. This is especially important at present given the need to deploy classifiers on resource-constrained IoT devices [9-11, 18, 24].

Classification accuracy The best test error for the TAO trees we obtained (having initial depth up to 12) is around 5%. To put this in perspective, we plot the error reported for MNIST for a range of models [27] vs. the number of parameters in fig. 1. The tree error is much better than that of linear models (≈12%) and boosted stumps (7.7%), and is comparable to that of a 2-layer neural net and a 3-nearest-neighbor classifier.
Better errors can of course be achieved by many-parameter, complex models such as kernel SVMs or convolutional nets (not shown), or by using image-specific transformations and features. Our trees operate directly on the image pixels with no special transformation and are astoundingly small and fast given their accuracy. For example, our tree achieves about the same test error as a 3-nearest-neighbor classifier, but the tree compares the input image with at most ≈6 sparse "images" (weight vectors), rather than with the 60 000 dense training images.

Interpretable trees By using a high sparsity penalty, TAO allows us to obtain trees of suboptimal but reasonable test error that have a really small number of nodes and nonzero weights and are eminently interpretable. Fig. 1 shows an example for C = 0.01 (test error 10.19%, 0.22% nonzeros, 17 leaves). Examining the nodes' weight vectors shows that the few weights that are not zero are strategically located to discriminate between specific classes or groups of classes, and essentially detect patterns characterized by the presence or absence of strokes in certain locations. Nodes use minimal features to separate large groups of digits, and leaf parents often separate very similar digits that differ on just one or two strokes. We mention some examples (referring to the nodes by their index in the figure). Node 53 separates 3s from 5s by detecting the small vertical stroke that precisely differentiates these digits (blue = negative, red = positive). Node 5 separates 4s from 9s by detecting the presence of a top horizontal stroke. Node 2 separates 7s from {4, 9} by detecting ink in the left-middle of the image. Node 6 separates 0s from {1, 2, 3, 5, 8} by detecting the presence of ink in the center of the image but not in its sides. Node 1 (the root) separates {4, 7, 9} from the remaining digits.
Also, once the tree is sparse enough, several of these weight vectors (such as nodes 2 and 5) tend to appear in the tree regardless of the initial tree and depth (see supplementary animations).

In a sense, each decision node pays attention to a simple but high-level concept, so the tree classifies digits by asking a few conceptual questions about the relative spatial presence of strokes or ink. A root-leaf path can then be seen as a sequence of conceptual questions that "define" a class. This is very different from the way convolutional neural networks operate, by constructing a large number of "features" that start very simple (e.g. edge detectors) and are combined into progressively more abstract features. While deep neural nets get very accurate predictions (so they are able to classify correctly even unusual digit shapes), this is achieved by very complex models that are not easy to interpret. Our trees do not reach such high accuracy, but arguably they are able to learn the more high-level, conceptual structure of each digit class.

5 Experiments: comparison with forest-based nearest-neighbor classifiers

As requested by a reviewer, we compared CART+TAO with fast, forest-based algorithms that approximate a nearest-neighbor classifier (see [25] and references therein). Roughly speaking, these algorithms construct a tree that can approximate a nearest-neighbor search and have a controllable tradeoff between approximation error and search speed. Thus, they can be used to approximate a nearest-neighbor classifier quickly. On top of that, they can be ensembled into a forest.
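As a plain illustration of the idea (a toy sketch of our own, much simpler than cover trees, randomized kd-trees or boundary forests), a kd-tree searched greedily without backtracking answers a nearest-neighbor query in O(tree depth) distance computations instead of O(N), at the price of sometimes returning only an approximate nearest neighbor:

```python
# Toy approximate nearest-neighbor search with a kd-tree (illustrative only;
# the cited methods are more sophisticated). Greedy, no-backtracking descent
# trades exactness for speed.

def build(pts, depth=0, leaf_size=2):
    """pts: list of (vector, label) pairs. Split on one coordinate per level."""
    if len(pts) <= leaf_size:
        return ("leaf", pts)
    axis = depth % len(pts[0][0])
    pts = sorted(pts, key=lambda p: p[0][axis])
    mid = len(pts) // 2
    return ("split", axis, pts[mid][0][axis],
            build(pts[:mid], depth + 1, leaf_size),
            build(pts[mid:], depth + 1, leaf_size))

def query(node, x):
    """Greedy descent to one leaf; return the nearest (vector, label) there."""
    while node[0] == "split":
        _, axis, t, left, right = node
        node = right if x[axis] >= t else left
    return min(node[1], key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))

# 1-NN classification = label of the (approximately) nearest stored point.
data = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
tree = build(data)
```

Because the descent never backtracks, the true nearest neighbor can be missed (it may lie just across a splitting plane), which is exactly the approximation-error/speed tradeoff mentioned above; the methods compared below control that tradeoff and ensemble several trees into a forest.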
Although the comparison is not apples-to-apples (since the latter classifiers are not of the decision-tree type, and also are forests rather than a single tree), it is still very interesting.

We followed the protocol of [25], which lists results for cover trees (CT) [5], random kd-trees (forest of 4 trees) [16] and boundary forest (BF) (50 trees) [25]. Table 1 shows our results for TAO on oblique trees (we initialized TAO from both axis-aligned and oblique CART trees and picked the

[Figure 1 appears here: tree diagrams for the initial oblique CART tree (train err 1.95%, test err 11.03%, 6.65% nonzeros, 410 splits) and for the sparse oblique trees at C = 0.090 (train err 4.27%, test err 5.69%, 0.95% nonzeros, 37 splits) and C = 0.010 (train err 10.61%, test err 10.19%, 0.22% nonzeros, 16 splits); plus curves of training/test error, %nonzero weights, number of splits, path length and inference time vs. C, and a plot of test error vs. number of parameters.]

Figure 1: Sparse oblique trees for MNIST. Left plots: initial CART tree and sparse oblique trees for C = 0.09 and 0.01. For each internal node, we show its index and bias value and plot its weight vector (red = positive, blue = negative, white = zero); you may need to zoom into the image. For each leaf, we plot the mean of its training points and show something like "4(4658,7)" where 4 is its index, 4658 is the number of training points it receives, and 7 is its digit class. Right plots: several measures of the tree as a function of C ≥ 0: training/test error; proportion of nonzero weights and number of internal nodes; and length of root-leaf path and inference time (in scalar multiplications) for an input sample. The bottom plot shows test error vs. number of parameters for sparse oblique trees of different depths (color-coded), initialized from a CART tree that is either axis-aligned (dotted line) or oblique (solid line).
The markers correspond to the initial CART trees (◦, +) or to models from [27], numbered as follows: 1) linear classifiers, 2) one-vs-all linear classifiers, 3) 2-layer neural net with 300 hidden units, 4) 2-layer neural net with 1 000 hidden units, 5) 3-nearest-neighbor classifier, 6) one-vs-all classifiers where each classifier consists of 50 000 boosted decision stumps (each operating over a feature and threshold), 7) 3-layer neural net with 500+100 hidden units. Values outside the axes limits are projected on the boundary of the plots.

Table 1: Comparison with forest-based algorithms that approximate a nearest-neighbor classifier.

                                      Test error (%)      Inference time on entire     TAO
                                                          test set (seconds)
dataset (N × D, K)           TAO     BF   R-kd     CT     TAO     BF   R-kd      CT      C
MNIST (60 000×784, 10)      5.69   2.24   3.08   2.99    0.18  23.90  89.20  417.60   0.09
letter (10 500×16, 26)      7.94   5.40   5.50   5.60    0.05   1.16   1.67    0.91   9.11
pendigits (7 494×16, 10)    3.14   2.62   2.26   2.80    0.01   0.34   0.75    0.02   0.03
protein (17 766×357, 3)    31.70  44.20  53.60  52.00    0.05  35.47  11.50   51.40   0.14
seismic (78 823×50, 3)     27.81  40.60  30.80  38.90    0.09  16.20  65.70  172.5    3.28

best result), ran on a laptop with 2 core i5 CPUs and 12GB RAM (pretty similar to the system of [25]). TAO's test error is somewhat bigger (first 3 datasets) or considerably smaller (last 2 datasets) than that of the other forest classifiers, but it always has faster inference time by at least one order of magnitude. We reiterate that TAO produces a single tree with sparse decision nodes.

6 Discussion

The way TAO works is very simple: TAO repeatedly trains a simple classifier (binary at the decision nodes, K-class at the leaves) while monotonically decreasing the objective function.
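This loop can be sketched on a toy case (our own illustration, with a depth-1 axis-aligned tree and exhaustive enumeration for the decision-node step; the paper handles deep oblique trees and a sparsity penalty):

```python
# Toy sketch of the alternating optimization on a depth-1 axis-aligned tree
# "x[j] > t ? right leaf : left leaf" (illustration only; names are ours).
# Each pass retrains one node at a time on the points that reach it, and the
# tree's 0-1 training error never increases.

def tree_error(X, y, j, t, left_label, right_label):
    """0-1 training error of the tree 'x[j] > t ? right_label : left_label'."""
    return sum((right_label if x[j] > t else left_label) != c
               for x, c in zip(X, y))

def tao_depth1(X, y, j, t, left_label, right_label, iters=10):
    for _ in range(iters):
        # Leaf step: each leaf takes the majority class of the points it receives.
        for side in (False, True):
            reached = [c for x, c in zip(X, y) if (x[j] > t) == side]
            if reached:
                label = max(set(reached), key=reached.count)
                if side:
                    right_label = label
                else:
                    left_label = label
        # Decision-node step: with the leaves fixed, exactly minimize the tree's
        # misclassification error over all (feature, threshold) splits.
        best = (tree_error(X, y, j, t, left_label, right_label), j, t)
        for jj in range(len(X[0])):
            for tt in sorted(set(x[jj] for x in X)):
                e = tree_error(X, y, jj, tt, left_label, right_label)
                if e < best[0]:
                    best = (e, jj, tt)
        _, j, t = best
    return j, t, left_label, right_label
```

On a toy set whose class depends only on the first feature, starting from a split on the wrong feature, one pass already recovers the correct split. Since each step replaces a node's parameters only when the tree's error does not worsen, the objective decreases monotonically, which is the guarantee the discussion refers to.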
The only thing that changes over iterations is the subset of training instances on which each classifier is trained. In order to optimize the misclassification error, TAO fundamentally relies on alternating optimization. This is most effective when two circumstances apply: 1) some separability into blocks exists in the problem, as e.g. in matrix factorization, or is created via auxiliary variables, as e.g. with consensus problems [6] or nested functions [12]; and 2) the step over each block is easy and ideally exact. All this applies here thanks to the separability condition and the reduced problem.

Two important remarks are in order. First, note TAO is very different from coordinate descent in CART [8, 28]. The latter optimizes the impurity of a single node; each step updates a single weight of its hyperplane. TAO optimizes the misclassification error of the entire tree; each step updates one entire set of nodes (i.e., all the weights of all the hyperplanes in those nodes). Second, what we really want to minimize is the misclassification error on the data, not the impurity in each node. The latter, while useful to construct a good tree structure and initial node parameters, is only indirectly related to the classification accuracy.

The quality of the TAO result naturally depends on the initial tree it is run on. A good strategy appears to be to grow a large tree with CART that overfits the data (or a large tree with random parameters) and let TAO prune it, particularly if using a sparsity penalty with oblique trees. TAO also depends on the choice of surrogate loss in the node (decision or leaf) optimization.
In our experience with the logistic or hinge loss, the TAO trees considerably improve over the initial CART or random tree.

7 Conclusion

We have presented Tree Alternating Optimization (TAO), a scalable algorithm that can find a local optimum of oblique trees given a fixed structure, in the sense of repeatedly decreasing the misclassification loss until no more progress can be made. A critical difference with the standard tree induction algorithm is that we do not optimize a proxy measure (the impurity) greedily one node at a time, but the misclassification error itself, jointly and iteratively over all nodes. We suggest using TAO as postprocessing after the usual greedy tree induction in CART, or running TAO directly on a random initial tree.

TAO could make oblique trees widespread in practice and replace to some extent the considerably less flexible axis-aligned trees. Even more interesting are the sparse oblique trees we propose. These can strike a good compromise between flexible modeling of features (involving complex local correlations, as with image data) and using few features in each node, hence producing a relatively accurate tree that is very small, fast and interpretable. For MNIST, we believe this is the first time that a single decision tree achieves such high accuracy, comparable to that of much larger models.

Our work opens up important extensions, among others: to other types of trees, such as regression trees; to other types of nodes beyond linear bisectors or constant-class leaves; to ensembles of trees; to using TAO with a search over tree structures; and to combining trees with other models.

Acknowledgements

Work funded in part by NSF award IIS-1423515.

References

[1] P. Auer, R. C. Holte, and W. Maass. Theory and applications of agnostic PAC-learning with small decision trees. In A. Prieditis and S. J. Russell, editors, Proc. of the 12th Int. Conf. Machine Learning (ICML'95), pages 21–29, Tahoe City, CA, July 9–12 1995.

[2] K. P. Bennett. Decision tree construction via linear programming. In Proc. 4th Midwest Artificial Intelligence and Cognitive Science Society Conference, pages 97–101, 1992.

[3] K. P. Bennett. Global tree optimization: A non-greedy decision tree algorithm. Computing Science and Statistics, 26:156–160, 1994.

[4] D. Bertsimas and J. Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, July 2017.

[5] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In W. W. Cohen and A. Moore, editors, Proc. of the 23rd Int. Conf. Machine Learning (ICML'06), pages 97–104, Pittsburgh, PA, June 25–29 2006.

[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[7] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.

[8] L. J. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, Calif., 1984.

[9] M. Á. Carreira-Perpiñán. Model compression as constrained optimization, with application to neural nets. Part I: General framework. arXiv:1707.01209, July 5 2017.

[10] M. Á. Carreira-Perpiñán and Y. Idelbayev. Model compression as constrained optimization, with application to neural nets. Part II: Quantization. arXiv:1707.04319, July 13 2017.

[11] M. Á. Carreira-Perpiñán and Y. Idelbayev. "Learning-compression" algorithms for neural net pruning. In Proc. of the 2018 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR'18), pages 8532–8541, Salt Lake City, UT, June 18–22 2018.

[12] M. Á. Carreira-Perpiñán and W. Wang. Distributed optimization of deeply nested systems. In S. Kaski and J. Corander, editors, Proc. of the 17th Int. Conf. Artificial Intelligence and Statistics (AISTATS 2014), pages 10–19, Reykjavik, Iceland, Apr. 22–25 2014.

[13] A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Advances in Computer Vision and Pattern Recognition. Springer-Verlag, 2013.

[14] Y. C. Eldar and G. Kutyniok, editors. Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.

[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Machine Learning Research, 9:1871–1874, Aug. 2008.

[16] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Mathematical Software, 3(3):209–226, 1977.

[17] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM J. Comp., 39(2):742–765, 2009.

[18] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), volume 28, pages 1135–1143. MIT Press, Cambridge, MA, 2015.

[19] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, 2015.

[20] T. J. Hastie, R. J. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning—Data Mining, Inference and Prediction. Springer Series in Statistics.
Springer-Verlag, second edition, 2009.

[21] K.-U. Hoffgen, H. U. Simon, and K. S. Vanhorn. Robust trainability of single neurons. J. Computer and System Sciences, 50(1):114–125, Feb. 1995.

[22] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1975.

[23] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, Mar. 1994.

[24] A. Kumar, S. Goyal, and M. Varma. Resource-efficient machine learning in 2 KB RAM for the internet of things. In D. Precup and Y. W. Teh, editors, Proc. of the 34th Int. Conf. Machine Learning (ICML 2017), pages 1935–1944, Sydney, Australia, Aug. 6–11 2017.

[25] C. Mathy, N. Derbinsky, J. Bento, J. Rosenthal, and J. Yedidia. The boundary forest algorithm for online supervised and unsupervised learning. In Proc. of the 29th National Conference on Artificial Intelligence (AAAI 2015), pages 2864–2870, Austin, TX, Jan. 25–29 2015.

[26] N. Megiddo. On the complexity of polyhedral separability. Discrete & Computational Geometry, 3(4):325–337, Dec. 1988.

[27] MNIST. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist.

[28] S. K. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision trees. J. Artificial Intelligence Research, 2:1–32, 1994.

[29] M. Norouzi, M. Collins, D. J. Fleet, and P. Kohli. CO2 forest: Improved random forest by continuous optimization of oblique splits. arXiv:1506.06155, June 24 2015.

[30] M. Norouzi, M. Collins, M. A. Johnson, D. J. Fleet, and P. Kohli. Efficient non-greedy optimization of decision trees. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), volume 28, pages 1720–1728.
MIT Press, Cambridge, MA, 2015.

[31] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[32] L. Rokach and O. Maimon. Data Mining With Decision Trees. Theory and Applications. Number 69 in Series in Machine Perception and Artificial Intelligence. World Scientific, 2007.

[33] R. E. Schapire and Y. Freund. Boosting. Foundations and Algorithms. Adaptive Computation and Machine Learning Series. MIT Press, 2012.

[34] C. Tjortjis and J. Keane. T3: A classification algorithm for data mining. In H. Yin, N. Allinson, R. Freeman, J. Keane, and S. Hubbard, editors, Proc. of the 6th Int. Conf. Intelligent Data Engineering and Automated Learning (IDEAL'02), pages 50–55, Manchester, UK, Aug. 12–14 2002.

[35] P. Tzirakis and C. Tjortjis. T3C: Improving a decision tree classification algorithm's interval splits on continuous attributes. Advances in Data Analysis and Classification, 11(2):353–370, June 2017.

[36] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, Jan. 2008.