{"title": "Near-Minimax Optimal Classification with Dyadic Classification Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 1117, "page_last": 1124, "abstract": "", "full_text": "Near-Minimax Optimal Classification with Dyadic Classification Trees\n\nClayton Scott\nElectrical and Computer Engineering\nRice University\nHouston, TX 77005\ncscott@rice.edu\n\nRobert Nowak\nElectrical and Computer Engineering\nUniversity of Wisconsin\nMadison, WI 53706\nnowak@engr.wisc.edu\n\nAbstract\n\nThis paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n^{-1/2}, the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed.\n\n1 Introduction\n\nWe previously studied dyadic classification trees, equipped with simple binary decision rules at each leaf, in [1]. 
There we applied standard structural risk minimization to derive a pruning rule that minimizes the empirical error plus a complexity penalty proportional to the square root of the size of the tree. Our main result concerned the rate of convergence of the expected error probability of our pruned dyadic classification tree to the Bayes error for a certain class of problems. This class, which essentially requires the Bayes decision boundary to be locally Lipschitz, had previously been studied by Mammen and Tsybakov [2]. They showed the minimax rate of convergence for this class to be n^{-1/d}, where n is the number of labeled training samples and d is the dimension of each sample. They also demonstrated a classification rule achieving this rate, but the rule requires minimization of the empirical error over the entire class of decision boundaries, an infeasible task in practice. In contrast, DCTs are computationally efficient, but converge at a slower rate of n^{-1/(d+1)}.\n\nIn this paper we exhibit a new pruning strategy that is both computationally efficient and realizes the minimax rate to within a log factor. Our approach is motivated by recent results of Kearns and Mansour [3] and Mansour and McAllester [4]. Those works develop a theory of local uniform convergence, which allows the error to be decomposed in a spatially adaptive way (unlike conventional structural risk minimization). In essence, the associated pruning rules allow a more refined partition in a region where the classification problem is harder (i.e., near the decision boundary). Heuristic arguments and anecdotal evidence in both [3] and [4] suggest that spatially adaptive penalties lead to improved performance compared to “global” penalties. 
In this work, we give theoretical support to this claim (for a specific kind of classification tree, the DCT) by showing a superior rate of convergence for DCTs pruned according to spatially adaptive penalties.\n\nWe go on to study DCTs equipped with polynomial classification rules at each leaf. This provides more flexible classifiers that can take advantage of additional smoothness in the Bayes decision boundary. We call such a classifier a polynomial-decorated DCT (PDCT). PDCTs can be practically implemented by employing polynomial kernel SVMs at each leaf node of a pruned DCT. For any distribution whose Bayes decision boundary behaves locally like a Hölder-γ smooth function, we show that the PDCT error converges to the Bayes error at a rate no slower than O((log n/n)^{γ/(d+2γ−2)}). As γ → ∞ the rate tends to within a log factor of the parametric rate, n^{-1/2}.\n\nPerceptron trees, tree classifiers having linear splits at each node, have been investigated by many authors; in particular we point to the works [5, 6]. Those works consider optimization methods and generalization errors associated with perceptron trees, but do not address rates of approximation and convergence. A key aspect of PDCTs is their spatial adaptivity, which enables local (rather than global) polynomial fitting of the decision boundary. Traditional polynomial kernel-based methods are not capable of achieving such rates of convergence due to their lack of spatial adaptivity, and it is unlikely that other kernels can solve this problem for the same reason. Consider approximating a Hölder-γ smooth function on a bounded domain with a single polynomial. Then the error in approximation is O(1), a constant, which is the best one could hope for in learning a Hölder smooth boundary with a traditional polynomial kernel scheme. 
On the other hand, if we partition the domain into hypercubes of side length O(1/m) and fit an individual polynomial on each hypercube, then the approximation error decays like O(m^{−γ}). Letting m grow with the sample size n guarantees that the approximation error will tend to zero. On the other hand, pruning back the partition helps to avoid overfitting. This is precisely the idea behind the PDCT.\n\n2 Dyadic Classification Trees\n\nIn this section we review our earlier results on dyadic classification trees. Let X be a d-dimensional observation, and Y ∈ {0, 1} its class label. Assume X ∈ [0, 1]^d. This is a realistic assumption for real-world data, provided appropriate translation and scaling has been applied. DCTs are based on the concept of a cyclic dyadic partition (CDP). Let P = {R_1, . . . , R_k} be a tree-structured partition of the input space, where each R_i is a hyperrectangle with sides parallel to the coordinate axes. Given an integer ℓ, let [ℓ]_d denote the element of {1, . . . , d} that is congruent to ℓ modulo d. If R_i ∈ P is a cell at depth j in the tree, let R_i^{(1)} and R_i^{(2)} be the rectangles formed by splitting R_i at its midpoint along coordinate [j + 1]_d. A CDP is a partition P constructed according to the rules: (i) the trivial partition P = {[0, 1]^d} is a CDP; (ii) if {R_1, . . . , R_k} is a CDP, then so is {R_1, . . . , R_{i−1}, R_i^{(1)}, R_i^{(2)}, R_{i+1}, . . . , R_k}, where 1 ≤ i ≤ k. The term “cyclic” refers to how the splits cycle through the coordinates of the input space as one traverses a path down the tree. We define a dyadic classification tree (DCT) to be a cyclic dyadic partition with\n\nFigure 1: Example of a dyadic classification tree when d = 2. (a) Training samples from two classes, and Bayes decision boundary. (b) Initial dyadic partition. (c) Pruned dyadic classification tree. 
Polynomial-decorated DCTs, discussed in Section 4, are similar in structure, but a polynomial decision rule is employed at each leaf of the pruned tree, instead of a simple binary label.\n\na class label (0 or 1) assigned to each node in the tree. We use the notation T to denote a DCT. Figure 1 (c) shows an example of a DCT in the two-dimensional case.\n\nPreviously we presented a rule for pruning DCTs with consistency and rate of convergence properties. In this section we review those results, setting the stage for our main result in the next section. Let m = 2^J be a dyadic integer, and define T_0 to be the DCT that has every leaf node at depth dJ. Then each leaf of T_0 corresponds to a cube of side length 1/m, and T_0 has m^d total leaf nodes. Assume a training sample of size n is given, and each node of T_0 is labeled according to a majority vote with respect to the training data reaching that node. A subtree T of T_0 is referred to as a pruned subtree, denoted T ≤ T_0, if T includes the root of T_0, if every internal node of T has both its children in T, and if the nodes of T inherit their labels from T_0. The size of a tree T, denoted |T|, is the number of leaf nodes. We defined the complexity penalized dyadic classification tree T′_n to be the solution of\n\nT′_n = arg min_{T ≤ T_0} ε̂(T) + α_n √|T|,   (1)\n\nwhere α_n = √(32 log(en)/n), and ε̂(T) is the empirical error, i.e., the fraction of training data misclassified by T. (The solution to this pruning problem can be computed efficiently [7].) We showed that if X ∈ [0, 1]^d with probability one, and m^d = o(n/ log n), then E{ε(T′_n)} → ε* with probability one (i.e., T′_n is consistent). Here, ε(T) = P{T(X) ≠ Y} is the true error probability for T, and ε* is the Bayes error, i.e., the minimum error probability over all classifiers (not just trees). 
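The two ingredients reviewed above, the cyclic midpoint splits of a CDP and the complexity weight α_n of rule (1), are simple to state in code. The following is a minimal sketch (the function names and the cell representation as a list of per-coordinate intervals are our own, not the paper's):

```python
import math

def cdp_split(cell, depth):
    """Split a hyperrectangle at the midpoint of coordinate (depth mod d).

    `cell` is a list of (low, high) intervals, one per coordinate.
    Returns the two child cells of the cyclic dyadic partition.
    """
    d = len(cell)
    j = depth % d                  # splits cycle through the coordinates
    low, high = cell[j]
    mid = (low + high) / 2.0
    left, right = list(cell), list(cell)
    left[j] = (low, mid)
    right[j] = (mid, high)
    return left, right

def alpha_n(n):
    """Complexity weight of pruning rule (1): alpha_n = sqrt(32 log(en)/n)."""
    return math.sqrt(32 * math.log(math.e * n) / n)

# Growing the initial tree T0 to depth d*J yields m^d cubes of side 1/m, m = 2^J.
root = [(0.0, 1.0), (0.0, 1.0)]    # [0,1]^2, i.e. d = 2
left, right = cdp_split(root, 0)   # the root split bisects coordinate 0
```

Descending dJ levels applies `cdp_split` with depths 0, 1, . . . , dJ − 1, so every coordinate is halved exactly J times and each leaf cell is a cube of side 1/m.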
We also demonstrated a rate of convergence result for T′_n, under certain assumptions on the distribution of (X, Y). Let us recall the definition of this class of distributions. Again, let X ∈ [0, 1]^d with probability one.\n\nDefinition 1 Let c_1, c_2 > 0, and let m_0 be a dyadic integer. Define F = F(c_1, c_2, m_0) to be the collection of all distributions on (X, Y) such that\n\nA1 (Bounded density): For any measurable set A, P{X ∈ A} ≤ c_1 λ(A), where λ denotes the Lebesgue measure.\n\nA2 (Regularity): For all dyadic integers m ≥ m_0, if we subdivide the unit cube into cubes of side length 1/m, the Bayes decision boundary passes through at most c_2 m^{d−1} of the resulting m^d cubes.\n\nThese assumptions are satisfied when the density of X is essentially bounded with respect to Lebesgue measure, and when the Bayes decision boundary for the distribution on (X, Y) behaves locally like a Lipschitz function. See, for example, the boundary fragment class of [2] with γ = 1 therein.\n\nIn [1], we showed that if the distribution of (X, Y) belongs to F, and m ∼ (n/ log n)^{1/(d+1)}, then E{ε(T′_n)} − ε* = O((log n/n)^{1/(d+1)}). However, this upper bound on the rate of convergence is not tight. The results of Mammen and Tsybakov [2] show that the minimax rate of convergence, inf_{φ_n} sup_F E{ε(φ_n)} − ε*, is on the order of n^{−1/d} (here φ_n ranges over all possible discrimination rules). In the next section, we introduce a new strategy for pruning DCTs, which leads to an improved rate of convergence of (log n/n)^{1/d} (i.e., within a logarithmic factor of the minimax rate). We are not aware of other practically implementable classifiers that can achieve this rate.\n\n3 Improved Tree Pruning with Spatially Adaptive Penalties\n\nAn improved rate of convergence is achieved by pruning the initial tree T_0 using a new complexity penalty. 
Given a node v in a tree T, let T_v denote the subtree of T rooted at v. Let S denote the training data, and let n_v denote the number of training samples reaching node v. Let R denote a pruned subtree of T. In the language of [4], R is called a root fragment. Let L(R) denote the set of leaf nodes of R. Consider the pruning rule that selects\n\nT_n = arg min_{T ≤ T_0} ( ε̂(T) + min_{R ≤ T} Δ(T, S, R) ),   (2)\n\nwhere\n\nΔ(T, S, R) = Σ_{v ∈ L(R)} (1/n) [ √(48 n_v |T_v| log(2n)) + √(48 n_v d log(m)) ].\n\nObserve that the penalty is data-dependent (since n_v depends on S) and spatially adaptive (choosing R ≤ T to minimize Δ). The penalty can be interpreted as follows. The first term in the penalty is written Σ_{v ∈ L(R)} p̂_v √(48 |T_v| log(2n)/n_v), where p̂_v = n_v/n. This can be viewed as an empirical average of the complexity penalties for each of the subtrees T_v, which depend on the local data associated with each subtree. The second term can be interpreted as the “cost” of spatially decomposing the bound on the generalization error.\n\nThe penalty has the following property. Consider pruning one of two subtrees, both with the same size, and assume that both options result in the same increase in the empirical error. Then the subtree with more data is selected for pruning. Since deeper nodes typically have less data, this shows that the penalty favors unbalanced trees, which may promote higher resolution (deeper leaf nodes) in the vicinity of the decision boundary. In contrast, the pruning rule (1) penalizes balanced and unbalanced trees (of the same size) equally.\n\nThe following theorem bounds the expected error of T_n. This kind of bound is known as an index of resolvability result [3, 8]. Recall that m specifies the depth of the initial tree T_0.\n\nTheorem 1 If m ∼ (n/ log n)^{1/d}, then\n\nE{ε(T_n) − ε*} ≤ min_{T ≤ T_0} { (ε(T) − ε*) + E[ min_{R ≤ T} Δ(T, S, R) ] } + O(√(log n/n)).\n\nThe first term in braces on the right is the approximation error. The remaining terms on the right-hand side bound the estimation error. Since the bound holds for all T, one feature of the pruning rule (2) is that T_n performs at least as well as the subtree T ≤ T_0 that minimizes the bound. This theorem may be applied to give us our desired rate of convergence result.\n\nTheorem 2 Assume the distribution of (X, Y) belongs to F. If m ∼ (n/ log n)^{1/d}, then\n\nE{ε(T_n)} − ε* = O((log n/n)^{1/d}).\n\nIn other words, the pruning rule (2) comes within a log factor of the minimax rate. These theorems are proved in the last section.\n\n4 Faster Rates for Smoother Boundaries\n\nIn this section we extend Theorem 2 to the case of smoother decision boundaries. Define G = G(γ, c_1, c_2, m_0) ⊂ F(c_1, c_2, m_0) to be those distributions on (X, Y) satisfying the following additional assumption. Here γ ≥ 1 is fixed.\n\nA3 (γ-regularity): Subdivide [0, 1]^d into cubes of side length 1/m, m ≥ m_0. Within each cube the Bayes decision boundary is described by a function (one coordinate is a function of the others) with Hölder regularity γ.\n\nThus the collection G contains all distributions whose Bayes decision boundaries behave locally like the graph of a function with Hölder regularity γ. The “boundary fragments” class of Mammen and Tsybakov is a special case of boundaries satisfying A1 and A3.\n\nWe propose a classifier, called a polynomial-decorated dyadic classification tree (PDCT), that achieves fast rates of convergence for distributions satisfying A3. 
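Before moving on, note that the spatially adaptive penalty Δ(T, S, R) of rule (2) in the previous section is cheap to evaluate once each leaf v of a candidate root fragment R is annotated with n_v and |T_v|. A minimal Python sketch (the data layout, a list of (n_v, |T_v|) pairs, is our own assumption, not the paper's):

```python
import math

def delta_penalty(fragment_leaves, n, d, m):
    """Spatially adaptive penalty Delta(T, S, R) of pruning rule (2).

    `fragment_leaves` holds one (n_v, size_Tv) pair per leaf v of the
    root fragment R: the number of training samples reaching v, and the
    number of leaves of the subtree T_v hanging below v.
    """
    total = 0.0
    for n_v, size_Tv in fragment_leaves:
        local = math.sqrt(48 * n_v * size_Tv * math.log(2 * n))  # local complexity term
        cost = math.sqrt(48 * n_v * d * math.log(m))             # cost of the spatial decomposition
        total += (local + cost) / n
    return total
```

All else being equal, a leaf reached by more data contributes a larger penalty, which is exactly the property discussed above: between two same-size subtrees giving the same empirical-error increase, the one with more data is the one pruned.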
Given a positive integer r, a PDCT of degree r is a DCT with class labels at each leaf node assigned by a degree r polynomial classifier. Consider the pruning rule that selects\n\nT_{n,r} = arg min_{T ≤ T_0} ( ε̂(T) + min_{R ≤ T} Δ_r(T, S, R) ),   (3)\n\nwhere\n\nΔ_r(T, S, R) = Σ_{v ∈ L(R)} (1/n) [ √(48 n_v V_{d,r} |T_v| log(2n)) + √(48 n_v (d + γ) log(m)) ].\n\nHere V_{d,r} = C(d+r, r) is the VC dimension of the collection of degree r polynomial classifiers in d dimensions. Also, the notation T ≤ T_0 in (3) is rough. We actually consider a search over all pruned subtrees of T_0, and with all possible configurations of degree r polynomial classifiers at the leaf nodes.\n\nAn index of resolvability result analogous to Theorem 1 for T_{n,r} can be derived. Moreover, if r = ⌈γ⌉ − 1, then a decision boundary with Hölder regularity γ is well approximated by a PDCT of degree r. In this case, T_{n,r} converges to the Bayes risk at rates bounded by the next theorem.\n\nTheorem 3 Assume the distribution of (X, Y) belongs to G and that r = ⌈γ⌉ − 1. If m ∼ (n/ log n)^{1/(d+2γ−2)}, then\n\nE{ε(T_{n,r})} − ε* = O((log n/n)^{γ/(d+2γ−2)}).\n\nNote that in the case γ = 1 this result coincides with the near-minimax rate in Theorem 2. Also notice that as γ → ∞, the rate of convergence comes within a logarithmic factor of the parametric rate n^{-1/2}. The proof is discussed in the final section.\n\n5 Efficient Algorithms\n\nThe optimally pruned subtree T_n of rule (2) can be computed exactly in O(|T_0|^2) operations. 
This follows from a simple bottom-up dynamic programming algorithm, which we describe below, and uses a method for “square-root” pruning studied in [7]. In the context of Theorem 2, we have |T_0| = m^d ∼ n, so the algorithm runs in time O(n^2).\n\nNote that an algorithm for finding the optimal R ≤ T was provided in [4]. We now describe an algorithm for finding both the optimal T ≤ T_0 and R ≤ T solving (2). Given a node v ∈ T_0, let T*_v be the subtree of T_0 rooted at v that minimizes the objective function of (2), and let R*_v be the associated root fragment that minimizes Δ(T*_v, S, R). The problem is solved by finding T*_root and R*_root using a bottom-up procedure.\n\nIf v is a leaf node of T_0, then clearly T*_v = R*_v = {v}. If v is an internal node, denote the children of v by u and w. There are three cases for T*_v and R*_v: (i) |T*_v| = 1, in which case T*_v = R*_v = {v}; (ii) |T*_v| ≥ |R*_v| > 1, in which case T*_v and R*_v can be computed by merging T*_u with T*_w and R*_u with R*_w, respectively; (iii) |T*_v| > 1 and |R*_v| = 1, in which case R*_v = {v}, and T*_v is determined by solving a square-root pruning problem, just like the one in (1). At each node, these three candidates are determined, and T*_v and R*_v are the candidates minimizing the objective function (empirical error plus penalty) at that node. Using the first algorithm in [7], the overall pruning procedure may be accomplished in O(|T_0|^2) operations.\n\nDetermining the optimally pruned degree r PDCT is more challenging. The problem requires the construction, at each node of T_0, of a polynomial classifier of degree r having minimum empirical error. Unfortunately, this task is computationally infeasible for large sample sizes. As an alternative, we recommend the use of polynomial support vector machines. 
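The bottom-up recursion just described is easiest to see in a simplified setting. The square-root penalties of rules (1)–(2) require the method of [7], so the sketch below is a structural illustration only, our own simplification rather than the paper's algorithm: each node either collapses to a leaf or keeps its two children, whichever minimizes empirical error plus a fixed cost λ per leaf.

```python
class Node:
    """A node of the initial tree; err_as_leaf is the number of training
    points misclassified if this node is collapsed to a single leaf."""
    def __init__(self, err_as_leaf, left=None, right=None):
        self.err_as_leaf = err_as_leaf
        self.left, self.right = left, right

def prune(node, lam):
    """Return (cost, leaves) of the optimal pruned subtree rooted at node,
    minimizing total misclassifications + lam * (number of leaves)."""
    leaf_cost = node.err_as_leaf + lam
    if node.left is None:            # bottom of the tree: must be a leaf
        return leaf_cost, 1
    lc, ln = prune(node.left, lam)   # solve the children first (bottom-up)
    rc, rn = prune(node.right, lam)
    if lc + rc < leaf_cost:          # keeping the split is strictly better
        return lc + rc, ln + rn
    return leaf_cost, 1              # otherwise collapse to a leaf
```

With an additive per-leaf penalty each node is visited once, so this simplified pruning runs in O(|T_0|) time; the square-root penalties of (1)–(2) couple the leaves globally, which is what drives the O(|T_0|^2) cost quoted above.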
SVMs are well known for their good generalization ability in practical problems. Moreover, linear SVMs in perceptron trees have been shown to work well [6].\n\n6 Conclusions\n\nA key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion that promotes unbalanced trees focused on the decision boundary. For distributions on (X, Y) whose Bayes decision boundary behaves locally like a Hölder-γ smooth function, we show that the PDCT error converges to the Bayes error at a rate no slower than O((log n/n)^{γ/(d+2γ−2)}). Polynomial kernel methods are not capable of achieving such rates due to their lack of spatial adaptivity. When γ = 1, the DCT convergence rate is within a logarithmic factor of the minimax optimal rate. As γ → ∞ the rate tends to within a log factor of n^{-1/2}, the parametric rate. However, the rates for γ > 1 are not within a logarithmic factor of the minimax rate [2]. It may be possible to tighten the bounds further. On the other hand, near-minimax rates might not be achievable using rectangular partitions, and more flexible partitioning schemes, such as adaptive triangulations, may be required.\n\n7 Proof Sketches\n\nThe key to proving Theorem 1 is the following result, which is a modified version of a theorem of Mansour and McAllester [4].\n\nLemma 1 Let δ ∈ (0, 1). 
With probability at least 1 − δ, every T ≤ T_0 satisfies\n\nε(T) ≤ ε̂(T) + min_{R ≤ T} f(T, S, R, δ),\n\nwhere\n\nf(T, S, R, δ) = Σ_{v ∈ L(R)} (1/n) [ √(48 n_v |T_v| log(2n)) + √(24 n_v [d log(m) + log(3/δ)]) + 2[d log(m) + log(3/δ)] ].\n\nOur primary modification to the lemma is to replace one local uniform deviation inequality (which holds for countable collections of classifiers [4, Lemma 4]) with another (which holds for infinite collections of classifiers [3, Lemma 2]). This eases our extension to polynomial-decorated DCTs in Section 4, by allowing us to avoid tedious quantization arguments.\n\nTo prove Theorem 1, define the event Ω_m to be the collection of all training samples S such that for all T ≤ T_0, the bound of Lemma 1 holds, with δ = 3/m^d. By that lemma, P(Ω_m) ≥ 1 − 3/m^d. Let T ≤ T_0 be arbitrary. We have\n\nE{ε(T_n) − ε(T)} = P(Ω_m) E{ε(T_n) − ε(T) | Ω_m} + P(Ω_m^c) E{ε(T_n) − ε(T) | Ω_m^c} ≤ E{ε(T_n) − ε(T) | Ω_m} + 3/m^d.\n\nGiven S ∈ Ω_m, we know\n\nε(T_n) ≤ ε̂(T_n) + min_{R ≤ T_n} f(T_n, S, R, 3m^{−d}) = ε̂(T_n) + min_{R ≤ T_n} Δ(T_n, S, R) + 4d log(m)/n ≤ ε̂(T) + min_{R ≤ T} Δ(T, S, R) + 4d log(m)/n,\n\nwhere the last inequality comes from the definition of T_n. From Chernoff's inequality, we know P{ε̂(T) ≥ ε(T) + t} ≤ e^{−2nt²}. By applying this bound, and the fact E{Z} ≤ ∫_0^∞ P{Z > t} dt, the theorem is proved.\n\n7.1 Proof of Theorem 2\n\nBy Theorem 1, it suffices to find a tree T* ≤ T_0 such that\n\nE[ min_{R ≤ T*} Δ(T*, S, R) ] + (ε(T*) − ε*) = O((log n/n)^{1/d}).\n\nDefine T* to be the tree obtained by pruning back T_0 at every node (thought of as a region of space) that does not intersect the Bayes decision boundary. It can be shown without much difficulty that ε(T*) − ε* = O((log n/n)^{1/d}) [9, Lemma 1]. It remains to bound the estimation error.\n\nRecall that T_0 (and hence T*) has depth Jd, where J = log_2(m). Define R* to be the pruned subtree of T* consisting of all nodes in T* up to depth j_0 d, where j_0 = J − (1/d) log_2(J) (truncated if necessary). Let Ω_v be the set of all training samples such that √n_v ≤ 2√(n p_v). Let Ω be the set of all training samples S such that S ∈ Ω_v for all v ∈ L(R*). Now\n\nE[ min_{R ≤ T*} Δ(T*, S, R) ] ≤ P(Ω) E[ min_{R ≤ T*} Δ(T*, S, R) | Ω ] + P(Ω^c) E[ min_{R ≤ T*} Δ(T*, S, R) | Ω^c ].\n\nIt can be shown, by applying the union bound, A2, and a theorem of Okamoto [10], that P(Ω^c) = O((log n/n)^{1/d}). Moreover, the second expectation on the right is easily seen to be O(1) by considering the root fragment consisting of only the root node. Hence it remains to bound the first term on the right-hand side. We use P(Ω) ≤ 1, and focus on bounding the expectation. It can be shown, assuming S ∈ Ω, that Δ(T*, S, R*) = O((log n/n)^{1/d}). It suffices to bound the first term of Δ(T*, S, R*), which clearly dominates the second term. The first term, consisting of a sum of terms over the leaf nodes of R*, is dominated by the sum of those terms over the leaf nodes of R* at depth j_0 d. The number of such nodes may be bounded by assumption A2. 
The remaining expression is bounded using assumptions A1 and A2, as well as the definitions of T*, R*, and Ω.\n\n7.2 Proof of Theorem 3\n\nThe estimation error is increased by a constant factor proportional to √V_{d,r}, so its asymptotic analysis remains unchanged. The only significant change is in the analysis of the approximation error. The tree T* is defined as in the previous proof. Recall that the leaf nodes of T* at maximum depth are cells of side length 1/m. By a simple Taylor series argument, the approximation error ε(T*) − ε* behaves like m^{−γ}. The remainder of the proof is essentially the same as the proof of Theorem 2.\n\nAcknowledgments\n\nThis work was partially supported by the National Science Foundation, grant nos. MIP-9701692 and ANI-0099148, the Army Research Office, grant no. DAAD19-99-1-0349, and the Office of Naval Research, grant no. N00014-00-1-0390.\n\nReferences\n\n[1] C. Scott and R. Nowak, “Dyadic classification trees via structural risk minimization,” in Advances in Neural Information Processing Systems 14, S. Becker, S. Thrun, and K. Obermayer, Eds., Cambridge, MA, 2002, MIT Press.\n\n[2] E. Mammen and A. B. Tsybakov, “Smooth discrimination analysis,” Annals of Statistics, vol. 27, pp. 1808–1829, 1999.\n\n[3] M. Kearns and Y. Mansour, “A fast, bottom-up decision tree pruning algorithm with near-optimal generalization,” in International Conference on Machine Learning, 1998, pp. 269–277.\n\n[4] Y. Mansour and D. McAllester, “Generalization bounds for decision trees,” in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, Palo Alto, California, Nicolò Cesa-Bianchi and Sally A. Goldman, Eds., 2000, pp. 69–74.\n\n[5] K. Bennett and J. 
Blue, “A support vector machine approach to decision trees,” in Proceedings of the IEEE International Joint Conference on Neural Networks, Anchorage, Alaska, 1998, vol. 41, pp. 2396–2401.\n\n[6] K. Bennett, N. Cristianini, J. Shawe-Taylor, and D. Wu, “Enlarging the margins in perceptron decision trees,” Machine Learning, vol. 41, pp. 295–313, 2000.\n\n[7] C. Scott, “Tree pruning using a non-additive penalty,” Tech. Rep. TREE 0301, Rice University, 2003, available at http://www.dsp.rice.edu/~cscott/pubs.html.\n\n[8] A. Barron, “Complexity regularization with application to artificial neural networks,” in Nonparametric Functional Estimation and Related Topics, G. Roussas, Ed., pp. 561–576, NATO ASI Series, Kluwer Academic Publishers, Dordrecht, 1991.\n\n[9] C. Scott and R. Nowak, “Complexity-regularized dyadic classification trees: Efficient pruning and rates of convergence,” Tech. Rep. TREE0201, Rice University, 2002, available at http://www.dsp.rice.edu/~cscott/pubs.html.\n\n[10] M. Okamoto, “Some inequalities relating to the partial sum of binomial probabilities,” Annals of the Institute of Statistical Mathematics, vol. 10, pp. 29–35, 1958.\n", "award": [], "sourceid": 2364, "authors": [{"given_name": "Clayton", "family_name": "Scott", "institution": null}, {"given_name": "Robert", "family_name": "Nowak", "institution": null}]}