{"title": "Memory-Based Methods for Regression and Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1165, "page_last": 1166, "abstract": null, "full_text": "Memory-Based Methods for Regression and Classification\n\nThomas G. Dietterich and Dietrich Wettschereck\nDepartment of Computer Science\nOregon State University\nCorvallis, OR 97331-3202\n\nChris G. Atkeson\nMIT AI Lab\n545 Technology Square\nCambridge, MA 02139\n\nAndrew W. Moore\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213\n\nMemory-based learning methods operate by storing all (or most) of the training data and deferring analysis of that data until \"run time\" (i.e., when a query is presented and a decision or prediction must be made). When a query is received, these methods generally answer the query by retrieving and analyzing a small subset of the training data, namely, data in the immediate neighborhood of the query point. In short, memory-based methods are \"lazy\" (they wait until the query) and \"local\" (they use only a local neighborhood). The purpose of this workshop was to review the state of the art in memory-based methods and to understand their relationship to \"eager\" and \"global\" learning algorithms such as batch backpropagation.\n\nThere are two essential components to any memory-based algorithm: the method for defining the \"local neighborhood\" and the learning method that is applied to the training examples in the local neighborhood.\n\nWe heard several talks on issues related to defining the \"local neighborhood\". Federico Girosi and Trevor Hastie reviewed \"kernel\" methods in classification and regression. A kernel function K(d) maps the distance d from the query point to a training example into a real value. 
In the well-known Parzen window approach, the kernel is a fixed-width Gaussian, and a new example is classified by taking a weighted vote of the classes of all training examples, where the weights are determined by the Gaussian kernel. Because of the \"local\" shape of the Gaussian, distant training examples have essentially no influence on the classification decision. In regression problems, a common approach is to construct a linear regression fit to the data, where the squared error from each data point is weighted by the kernel. Hastie described the kernel used in the LOESS method: K(d) = (1 - d^3)^3 for 0 <= d <= 1, and K(d) = 0 otherwise. To adapt to the local density of training examples, this kernel is scaled to cover the kth nearest neighbor. Many other kernels have been explored, with particular attention to bias and variance at the extremes of the training data. Methods have been developed for computing the effective number of parameters used by these kernel methods.\n\nGirosi pointed out that some \"global\" learning algorithms (e.g., splines) are equivalent to kernel methods. The kernels often have informative shapes. If a kernel places most weight near the query point, then we can say that the learning algorithm is local, even if the algorithm performs a global analysis of the training data at learning time. An open problem is to determine whether multi-layer sigmoidal networks have equivalent kernels and, if so, what their shapes are.\n\nDavid Lowe described a classification algorithm based on Gaussian kernels. The kernel is scaled by the mean distance to the k nearest neighbors. His Variable-kernel Similarity Metric (VSM) algorithm learns the weights of a weighted Euclidean distance in order to maximize the leave-one-out accuracy of the algorithm. 
Excellent results have been obtained on benchmark tasks (e.g., NETtalk).\n\nPatrice Simard described the tangent distance method. In optical character recognition, the features describing a character change as that character is rotated, translated, or scaled. Hence, each character actually corresponds to a manifold of points in feature space. The tangent distance is a planar approximation to the distance between two manifolds (for two characters). Using tangent distance with the nearest neighbor rule gives excellent results in a zip code recognition task.\n\nLeon Bottou also employed a sophisticated distance metric by using the Euclidean distance between the hidden unit activations of the final hidden layer in the Bell Labs \"LeNet\" character recognizer. A simple linear classifier (with weight decay) was constructed to classify each query. Bottou also showed that there is a tradeoff between the quality of the distance metric and the locality of the learning algorithm. The tangent distance is a near-perfect metric, and it can use the highly local first-nearest-neighbor rule. The hidden layer of LeNet gives a somewhat better metric, but it requires approximately 200 \"local\" examples. With the raw features, LeNet itself requires all of the training examples.\n\nWe heard several talks on methods that are local but not lazy. John Platt described his RAN (Resource-Allocating Network), which learns a linear combination of radial basis functions by iterative training on the data. Bernd Fritzke described his improvements to RAN. Stephen Omohundro explained model merging, which initially learns local patches and, when the data justifies it, combines primitive patches into larger high-order patches. Dietrich Wettschereck presented BNGE, which learns a set of local axis-parallel rectangular patches. 
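The lazy, local scheme reviewed above, a linear fit whose squared errors are weighted by the LOESS tricube kernel scaled to cover the kth nearest neighbor, can be sketched in a few lines. This is a minimal one-dimensional illustration under our own naming and simplifications, not code from any system discussed at the workshop:

```python
def tricube(d):
    # LOESS tricube kernel: K(d) = (1 - d^3)^3 for 0 <= d <= 1, else 0.
    d = abs(d)
    return (1.0 - d ** 3) ** 3 if d <= 1.0 else 0.0

def loess_predict(xs, ys, xq, k):
    """Lazy, local 1-D prediction: fit a weighted linear regression around
    the query xq, with tricube weights scaled by the kth nearest neighbor
    (assumes that distance is nonzero)."""
    h = sorted(abs(x - xq) for x in xs)[k - 1]   # bandwidth = kth nearest distance
    w = [tricube(abs(x - xq) / h) for x in xs]
    # Weighted least-squares fit of y = a + b*x via the 2x2 normal equations.
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = sw * swxx - swx * swx
    a = (swy * swxx - swx * swxy) / det
    b = (sw * swxy - swx * swy) / det
    return a + b * xq
```

Because the kth distance itself is the bandwidth, the kth neighbor receives weight zero and all farther points are ignored, so only the local neighborhood influences the fit; on exactly linear data the weighted fit reproduces the line.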
\n\nFinally, Andrew Moore, Chris Atkeson, and Stefan Schaal described integrated memory-based learning systems for control applications. Moore's system applies huge amounts of cross-validation to select distance metrics, kernels, kernel widths, and so on. Atkeson advocated radical localism: all algorithm parameters should be determined by lazy, local methods. He described algorithms for obtaining confidence intervals on the outputs of local regression, as well as techniques for outlier removal. One method seeks to minimize the width of the confidence intervals.\n\nSome of the questions left unanswered by the workshop include these: Are there inherent computational penalties that lazy methods must pay (but eager methods can avoid)? How about the reverse? For what problems are local methods appropriate?\n", "award": [], "sourceid": 807, "authors": [{"given_name": "Thomas", "family_name": "Dietterich", "institution": null}, {"given_name": "Dietrich", "family_name": "Wettschereck", "institution": null}, {"given_name": "Chris", "family_name": "Atkeson", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}