Part of Advances in Neural Information Processing Systems 7 (NIPS 1994)

*Kah Sung, Partha Niyogi*

We develop a principled strategy to sample a function optimally for function approximation tasks within a Bayesian framework. Using ideas from optimal experiment design, we introduce an objective function (incorporating both bias and variance) to measure the de(cid:173) gree of approximation, and the potential utility of the data points towards optimizing this objective. We show how the general strat(cid:173) egy can be used to derive precise algorithms to select data for two cases: learning unit step functions and polynomial functions. In particular, we investigate whether such active algorithms can learn the target with fewer examples. We obtain theoretical and empir(cid:173) ical results to suggest that this is the case.

INTRODUCTION AND MOTIVATION

1 Learning from examples is a common supervised learning paradigm that hypothe(cid:173) sizes a target concept given a stream of training examples that describes the concept. In function approximation, example-based learning can be formulated as synthesiz(cid:173) ing an approximation function for data sampled from an unknown target function (Poggio and Girosi, 1990).

Active learning describes a class of example-based learning paradigms that seeks out new training examples from specific regions of the input space, instead of passively accepting examples from some data generating source. By judiciously selecting ex-

594

Kah Kay Sung, Parlha Niyogi

amples instead of allowing for possible random sampling, active learning techniques can conceivably have faster learning rates and better approximation results than passive learning methods.

This paper presents a Bayesian formulation for active learning within the function approximation framework. Specifically, here is the problem we want to address: Let Dn = {(Xi, Yi)li = 1, ... , n} be a set of n data points sampled from an unknown target function g, possibly in the presence of noise. Given an approximation func(cid:173) tion concept class, :F, where each f E :F has prior probability P;:-[J], one can use regularization techniques to approximate 9 from Dn (in the Bayes optimal sense) by means of a function 9 E:F. We want a strategy to determine at what input location one should sample the next data point, (XN+l, YN+d, in order to obtain the "best" possible Bayes optimal approximation of the unknown target function 9 with our concept class :F.

The data sampling problem consists of two parts:

1) Defining what we mean by the "best" possible Bayes optimal ap(cid:173) proximation of an unknown target function. In this paper, we propose an optimality criterion for evaluating the "goodness" of a solution with respect to an unknown target function.

2) Formalizing precisely the task of determining where in input space to sample the next data point. We express the above mentioned optimality criterion as a cost function to be minimized, and the task of choosing the next sample as one of minimizing the cost function with respect to the input space location of the next sample point.

Earlier work (Cohn, 1991; MacKay, 1992) have tried to use similar optimal experi(cid:173) ment design (Fedorov, 1972) techniques to collect data that would provide maximum information about the target function. Our work differs from theirs in several re(cid:173) spects. First, we use a different, and perhaps more general, optimality criterion for evaluating solutions to an unknown target function, based on a measure of function uncertainty that incorporates both bias and variance components of the total output generalization error. In contrast, MacKay and Cohn use only variance components in model parameter space. Second, we address the important sample complexity question, i.e., does the active strategy require fewer examples to learn the target to the same degree of uncertainty? Our results are stated in PAC-style (Valiant, 1984). After completion of this work, we learnt that Sollich (1994) had also recently developed a similar formulation to ours. His analysis is conducted in a statistical physics framework.

The rest of the paper is organized as follows: Section 2, develops our active sampling paradigm. In Sections 3 and 4, we consider two classes offunctions for which active strategies are obtained, and investigate their performance both theoretically and empirically.

2 THE MATHEMATICAL FRAMEWORK In order to optimally select examples for a learning task, one should first have a clear notion of what an "ideal" learning goal is for the task. We can then measure an example's utility in terms of how well the example helps the learner achieve the

Active Learning for Function Approximation

595

goal, and devise an active sampling strategy that selects examples with maximum potential utility. In this section, we propose one such learning goal - to find an approximation function g E :F that "best" estimates the unknown target function g. We then derive an example utility cost function for the goal and finally present a general procedure for selecting examples.

2.1 EVALUATING A SOLUTION TO AN UNKNOWN TARGET - THE EXPECTED INTEGRATED SQUARED DIFFERENCE

Let 9 be the target function that we want to estimate by means of an approximation function 9 E :F. If the target function 9 were known, then one natural measure of how well (or badly) g approximates 9 would be the Integrated Squared Difference (ISD) of the two functions:

Do not remove: This comment is monitored to verify that the site is working properly