{"title": "Bumptrees for Efficient Function, Constraint and Classification Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 693, "page_last": 699, "abstract": null, "full_text": "Bumptrees for Efficient Function, Constraint, and \n\nClassification Learning \n\nStephen M. Omohundro \n\nInternational Computer Science Institute \n\n1947 Center Street. Suite 600 \nBerkeley. California 94704 \n\nAbstract \n\nA new class of data structures called \"bumptrees\" is described. These \nstructures are useful for efficiently implementing a number of neural \nnetwork related operations. An empirical comparison with radial basis \nfunctions is presented on a robot ann mapping learning task. Applica(cid:173)\ntions to density estimation. classification. and constraint representation \nand learning are also outlined. \n\n1 WHAT IS A BUMPTREE? \n\nA bumptree is a new geometric data structure which is useful for efficiently learning. rep(cid:173)\nresenting. and evaluating geometric relationships in a variety of contexts. They are a natural \ngeneralization of several hierarchical geometric data structures including oct-trees. k-d \ntrees. balltrees and boxtrees. They are useful for many geometric learning tasks including \napproximating functions. constraint surfaces. classification regions. and probability densi(cid:173)\nties from samples. In the function approximation case. the approach is related to radial basis \nfunction neural networks, but supports faster construction, faster access, and more flexible \nmodification. We provide empirical data comparing bumptrees with radial basis functions \nin section 2. \nA bumptree is used to provide efficient access to a collection of functions on a Euclidean \nspace of interest. 
It is a complete binary tree in which a leaf corresponds to each function of interest. There are also functions associated with each internal node, and the defining constraint is that each interior node's function must be everywhere larger than each of the functions associated with the leaves beneath it. In many cases the leaf functions will be peaked in localized regions, which is the origin of the name. A simple kind of bump function is spherically symmetric about a center and vanishes outside of a specified ball. Figure 1 shows the structure of a two-dimensional bumptree in this setting. \n\nFigure 1: A two-dimensional bumptree (ball supported bumps: 2-d leaf functions a-f, tree structure, and tree functions). \n\nA particularly important special case of bumptrees is used to access collections of Gaussian functions on multi-dimensional spaces. Such collections are used, for example, in representing smooth probability distribution functions as a Gaussian mixture, and arise in many adaptive kernel estimation schemes. It is convenient to represent the quadratic exponents of the Gaussians in the tree rather than the Gaussians themselves. The simplest approach is to use quadratic functions for the internal nodes as well as the leaves, as shown in Figure 2, though other classes of internal node functions can sometimes provide faster access. \n\nFigure 2: A bumptree for holding Gaussians. \n\nMany of the other hierarchical geometric data structures may be seen as special cases of bumptrees by choosing appropriate internal node functions as shown in Figure 3. Regions may be represented by functions which take the value 1 inside the region and which vanish outside of it. 
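The defining constraint directly supports pruning: a subtree can be skipped whenever its root function falls below a query threshold at the query point. A minimal sketch in Python (an illustration, not the paper's code; the `BumpNode` interface is hypothetical):

```python
# Illustrative sketch of a bumptree: a binary tree whose interior node
# functions are everywhere larger than the leaf functions beneath them.
# The class and function names here are hypothetical, not from the paper.

class BumpNode:
    def __init__(self, func, left=None, right=None):
        self.func = func      # callable mapping a point to a value
        self.left = left      # None for leaf nodes
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None

def query_above(node, x, threshold, out=None):
    """Return all leaf functions whose value at x exceeds threshold.

    Because an interior node's function dominates every leaf beneath it,
    any subtree whose root function is at most the threshold at x can be
    pruned without being descended.
    """
    if out is None:
        out = []
    if node is None or node.func(x) <= threshold:
        return out
    if node.is_leaf():
        out.append(node.func)
        return out
    query_above(node.left, x, threshold, out)
    query_above(node.right, x, threshold, out)
    return out
```

For two ball-supported bumps with an interior function taken as their pointwise maximum (which trivially satisfies the constraint), a query descends only the branches whose bound exceeds the threshold.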
The function shown in Figure 3D is aligned along a coordinate axis and is constant on one side of a specified value and decreases quadratically on the other side. It is represented by specifying the coordinate which is cut, the cut location, the constant value (0 in some situations), and the coefficient of quadratic decrease. Such a function may be evaluated extremely efficiently on a data point and so is useful for fast pruning operations. Such evaluations are effectively what is used in (Sproull, 1990) to implement fast nearest neighbor computation. The bumptree structure generalizes this kind of query to allow for different scales for different points and directions. The empirical results presented in the next section are based on bumptrees with this kind of internal node function. \n\nFigure 3: Internal bump functions for A) oct-trees, k-d trees, boxtrees (Omohundro, 1987), B) and C) for balltrees (Omohundro, 1989), and D) for Sproull's higher performance k-d tree (Sproull, 1990). \n\nThere are several approaches to choosing a tree structure to build over given leaf data. Each of the algorithms studied for balltree construction in (Omohundro, 1989) may be applied to the more general task of bumptree construction. The fastest approach is analogous to the basic k-d tree construction technique (Friedman et al., 1977): it works top down, recursively splitting the functions into two sets of almost the same size. This is what is used in the simulations described in the next section. The slowest but most effective approach builds the tree bottom up, greedily deciding on the best pair of functions to join under a single parent node. 
Intermediate in speed and quality are incremental approaches which allow one to dynamically insert and delete leaf functions. \nBumptrees may be used to efficiently support many important queries. The simplest kind of query presents a point in the space and asks for all leaf functions which have a value at that point which is larger than a specified value. The bumptree allows a search from the root to prune any subtrees whose root function is smaller than the specified value at the point. More interesting queries are based on branch and bound and generalize the nearest neighbor queries that k-d trees support. A typical example in the case of a collection of Gaussians is to request all Gaussians in the set whose value at a specified point is within a specified factor (say .001) of the Gaussian whose value is largest at that point. The search proceeds down the most promising branches first, continually maintains the largest value found at any point, and prunes away subtrees which are not within the given factor of the current largest function value. \n\n2 THE ROBOT MAPPING LEARNING TASK \n\nFigure 4: Robot arm mapping task (the system maps kinematic space R3 to visual space R6). \n\nFigure 4 shows the setup which defines the mapping learning task we used to study the effectiveness of the bumptree data structure. This setup was investigated extensively by (Mel, 1990) and involves a camera looking at a robot arm. The kinematic state of the arm is defined by three angle control coordinates and the visual state by six visual coordinates of highlighted spots on the arm. The mapping from kinematic to visual space is a nonlinear map from three dimensions to six. The system attempts to learn this mapping by flailing the arm around and observing the visual state for a variety of randomly chosen kinematic states. 
From such a set of random input/output pairs, the system must generalize the mapping to inputs it has not seen before. This mapping task was chosen as fairly representative of typical problems arising in vision and robotics. \nThe radial basis function approach to mapping learning is to represent a function as a linear combination of functions which are spherically symmetric around chosen centers: f(x) = Σ_j w_j g_j(x - x_j). In the simplest form, which we use here, the basis functions are centered on the input points. More recent variations have fewer basis functions than sample points and choose centers by clustering. The timing results given here would be in terms of the number of basis functions rather than the number of sample points for a variation of this type. Many forms for the basis functions themselves have been suggested. In our study both Gaussian and linearly increasing functions gave similar results. The coefficients of the radial basis functions are chosen so that the sum forms a least squares best fit to the data. Such fits require a time proportional to the cube of the number of parameters in general. The experiments reported here were done using the singular value decomposition to compute the best fit coefficients. \nThe approach to mapping learning based on bumptrees builds local models of the mapping in each region of the space using data associated with only the training samples which are nearest that region. These local models are combined in a convex way according to \"influence\" functions which are associated with each model. Each influence function is peaked in the region for which it is most salient. The bumptree structure organizes the local models so that only the few models which have a great influence on a query sample need to be evaluated. If the influence functions vanish outside of a compact region, 
then the tree is used to prune the branches which have no influence. If a model's influence merely dies off with distance, then the branch and bound technique is used to determine contributions that are greater than a specified error bound. \nIf a set of bump functions sum to one at each point in a region of interest, they are called a \"partition of unity\". We form influence bumps by dividing a set of smooth bumps (either Gaussians or smooth bumps that vanish outside a sphere) by their sum to form an easily computed partition of unity. Our local models are affine functions determined by a least squares fit to local samples. When these are combined according to the partition of unity, the value at each point is a convex combination of the local model values. The error of the full model is therefore bounded by the errors of the local models and yet the full approximation is as smooth as the local bump functions. These results may be used to give precise bounds on the average number of samples needed to achieve a given approximation error for functions with a bounded second derivative. In this approach, linear fits are only done on a small set of local samples, avoiding the computationally expensive fits over the whole data set required by radial basis functions. This locality also allows us to easily update the model online as new data arrives. \nIf b_i(x) are bump functions such as Gaussians, then n_i(x) = b_i(x) / Σ_j b_j(x) forms a partition of unity. If m_i(x) are the local affine models, then the final smoothly interpolated approximating function is f(x) = Σ_i n_i(x) m_i(x). The influence bumps are centered on the sample points with a width determined by the sample density. 
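The interpolation scheme just described can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: it assumes one-dimensional Gaussian influence bumps and precomputed affine coefficients (a_i, c_i) for each local model m_i(x) = a_i x + c_i.

```python
import math

# Illustrative sketch (not the paper's code): local affine models blended
# by a partition of unity built from Gaussian influence bumps b_i centered
# on the sample points.

def interpolate(x, centers, widths, affine_models):
    """Evaluate f(x) = sum_i n_i(x) * m_i(x), where
    n_i(x) = b_i(x) / sum_j b_j(x) is a partition of unity."""
    bumps = [math.exp(-((x - c) / w) ** 2) for c, w in zip(centers, widths)]
    total = sum(bumps)
    return sum((b / total) * (a * x + c0)
               for b, (a, c0) in zip(bumps, affine_models))
```

Because the weights form a convex combination, the blended value always lies between the smallest and largest local model values, and if every local model is the same affine function the blend reproduces it exactly; this is the error-bounding property noted above.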
The affine model associated with each influence bump is determined by a weighted least squares fit of the sample points nearest the bump center in which the weight decreases with distance. \nBecause it performs a global fit, for a given number of sample points, the radial basis function approach achieves a smaller error than the approach based on bumptrees. In terms of construction time to achieve a given error, however, bumptrees are the clear winner. Figure 5 shows how the mean square error for the robot arm mapping task decreases as a function of the time to construct the mapping. \n\nFigure 5: Mean square error (0.000-0.010) as a function of learning time (0-160 seconds). \n\nPerhaps even more important for applications than learning time is retrieval time. Retrieval using radial basis functions requires that the value of each basis function be computed on each query input and that these results be combined according to the best fit weight matrix. This time increases linearly as a function of the number of basis functions in the representation. In the bumptree approach, only those influence bumps and affine models which are not pruned away by the bumptree retrieval need perform any computation on an input. Figure 6 shows the retrieval time as a function of the number of training samples for the robot mapping task. The retrieval time for radial basis functions crosses that for bumptrees at about 100 samples and increases linearly off the graph. The bumptree algorithm has a retrieval time which empirically grows very slowly and doesn't require much more time even when 10,000 samples are represented. \nWhile not shown here, the representation may be improved in both size and generalization capacity by a best first merging technique. 
The idea is to consider merging two local models and their influence bumps into a single model. The pair which increases the error the least is merged first and the process is repeated until no pair is left whose merger wouldn't exceed an error criterion. This algorithm does a good job of discovering and representing linear parts of a map with a single model and putting many higher resolution models in areas with strong nonlinearities. \n\nFigure 6: Retrieval time in seconds (0.000-0.030) for Gaussian RBF and bumptree as a function of the number of training samples (0-10K). \n\n3 EXTENSIONS TO OTHER TASKS \n\nThe bumptree structure is useful for implementing efficient versions of a variety of other geometric learning tasks (Omohundro, 1990). Perhaps the most fundamental such task is density estimation which attempts to model a probability distribution on a space on the basis of samples drawn from that distribution. One powerful technique is adaptive kernel estimation (Devroye and Gyorfi, 1985). The estimated distribution is represented as a Gaussian mixture in which a spherically symmetric Gaussian is centered on each data point and the widths are chosen according to the local density of samples. A best-first merging technique may often be used to produce mixtures consisting of many fewer non-symmetric Gaussians. A bumptree may be used to find and organize such Gaussians. Possible internal node functions include both quadratics and the faster to evaluate functions shown in Figure 3D. \nIt is possible to efficiently perform many operations on probability densities represented in this way. The most basic query is to return the density at a given location. The bumptree may be used with branch and bound to achieve retrieval in logarithmic expected time. 
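A branch-and-bound query of the kind used here can be sketched as follows. This is an illustrative reconstruction with a hypothetical node interface (`func`, `left`, `right`, `is_leaf`), not the paper's code: it returns every leaf Gaussian whose value at the query point is within a given factor of the largest leaf value there, exploring the most promising branches first.

```python
import heapq
import itertools

# Illustrative sketch (hypothetical API, not from the paper): best-first
# branch and bound over a bumptree. Interior node functions upper-bound
# every leaf beneath them, so a subtree whose bound is already below
# ratio * best can be pruned.

def within_factor(root, x, ratio):
    best = 0.0                # largest leaf value found so far
    survivors = []            # (value, node) pairs that passed the bound test
    tie = itertools.count()   # tiebreaker so the heap never compares nodes
    heap = [(-root.func(x), next(tie), root)]
    while heap:
        neg_bound, _, node = heapq.heappop(heap)
        if -neg_bound < ratio * best:
            continue          # entire subtree too small: prune it
        if node.is_leaf():
            value = -neg_bound            # a leaf's bound is its actual value
            best = max(best, value)
            survivors.append((value, node))
        else:
            heapq.heappush(heap, (-node.left.func(x), next(tie), node.left))
            heapq.heappush(heap, (-node.right.func(x), next(tie), node.right))
    # best may have grown after early leaves were kept; filter once more
    return [node for value, node in survivors if value >= ratio * best]
```

Pruning against the current best is safe because best only grows: any subtree discarded early would also fail against the final best, and the closing filter removes survivors overtaken later.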
It is also possible to quickly find marginal probabilities by integrating along certain dimensions. The tree is used to quickly identify the Gaussians which contribute. Conditional distributions may also be represented in this form and bumptrees may be used to compose two such distributions. \nAbove we discussed mapping learning and evaluation. In many situations there are not the natural input and output variables required for a mapping. If a probability distribution is peaked on a lower dimensional surface, it may be thought of as a constraint. Networks of constraints which may be imposed in any order among variables are natural for describing many problems. Bumptrees open up several possibilities for efficiently representing and propagating smooth constraints on continuous variables. The most basic query is to specify known external constraints on certain variables and allow the network to further impose whatever constraints it can. Multi-dimensional product Gaussians can be used to represent joint ranges in a set of variables. The operation of imposing a constraint surface may be thought of as multiplying an external constraint Gaussian by the function representing the constraint distribution. Because the product of two Gaussians is a Gaussian, this operation always produces Gaussian mixtures and bumptrees may be used to facilitate the operation. \nA representation of constraints which is more like that used above for mappings constructs surfaces from local affine patches weighted by influence functions. We have developed a local analog of principal components analysis which builds up surfaces from random samples drawn from them. As with the mapping structures, a best-first merging operation may be used to discover affine structure in a constraint surface. 
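The Gaussian product identity invoked above can be made concrete in one dimension (an illustrative sketch, not from the paper): in precision form (tau = 1/sigma^2) the precisions add, and the new mean is a precision-weighted average of the two means.

```python
import math

# One-dimensional illustration of the fact used above: the unnormalized
# product of two Gaussian bumps is again a Gaussian bump (up to a constant
# factor). Precisions add; means combine as a precision-weighted average.

def product_of_gaussians(mu1, sigma1, mu2, sigma2):
    tau1, tau2 = 1.0 / sigma1 ** 2, 1.0 / sigma2 ** 2
    tau = tau1 + tau2
    mu = (tau1 * mu1 + tau2 * mu2) / tau
    return mu, math.sqrt(1.0 / tau)
```

Multiplying the two bumps directly and dividing by the combined Gaussian gives a constant ratio in x, which is why imposing a Gaussian constraint on a Gaussian mixture yields another Gaussian mixture.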
\nFinally, bumptrees may be used to enhance the performance of classifiers. One approach is to directly implement Bayes classifiers using the adaptive kernel density estimator described above for each class's distribution function. A separate bumptree may be used for each class or, with a more sophisticated branch and bound, a single tree may be used for the whole set of classes. \nIn summary, bumptrees are a natural generalization of several hierarchical geometric access structures and may be used to enhance the performance of many neural network-like algorithms. While we compared radial basis functions against a different mapping learning technique, bumptrees may be used to boost the retrieval performance of radial basis functions directly when the basis functions decay away from their centers. Many other neural network approaches in which much of the network does not perform useful work for every query are also susceptible to sometimes dramatic speedups through the use of this kind of access structure. \n\nReferences \n\nL. Devroye and L. Gyorfi. (1985) Nonparametric Density Estimation: The L1 View. New York: Wiley. \nJ. H. Friedman, J. L. Bentley and R. A. Finkel. (1977) An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3:209-226. \nB. Mel. (1990) Connectionist Robot Motion Planning: A Neurally-Inspired Approach to Visually-Guided Reaching. San Diego, CA: Academic Press. \nS. M. Omohundro. (1987) Efficient algorithms with neural network behavior. Complex Systems 1:273-347. \nS. M. Omohundro. (1989) Five balltree construction algorithms. International Computer Science Institute Technical Report TR-89-063. \nS. M. Omohundro. (1990) Geometric learning algorithms. Physica D 42:307-321. \nR. F. Sproull. (1990) Refinements to Nearest-Neighbor Searching in k-d Trees. Sutherland, Sproull and Associates Technical Report SSAPP #184c, 
to appear in Algorithmica.", "award": [], "sourceid": 312, "authors": [{"given_name": "Stephen", "family_name": "Omohundro", "institution": null}]}