Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)

*Ron Meir, Tong Zhang*

We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting.

1 Introduction and Motivation

The standard approach to Computational Learning Theory is usually formulated within the so-called frequentist approach to Statistics. Within this paradigm one is interested in constructing an estimator, based on a finite sample, which possesses a small loss (generalization error). While many algorithms have been constructed and analyzed within this context, it is not clear how these approaches relate to standard optimality criteria within the frequentist framework. Two classic optimality criteria within the latter approach are the minimax and admissibility criteria, which characterize optimality of estimators in a rigorous and precise fashion [9]. Except in some special cases [12], it is not known whether any of the approaches used within the Learning community lead to optimality in either of the above senses of the word. On the other hand, it is known that under certain regularity conditions, Bayesian estimators lead to either minimax or admissible estimators, and thus to well-defined optimality in the classical (frequentist) sense. In fact, it can be shown that Bayes estimators are essentially the only estimators which can achieve optimality in the above senses [9]. This optimality feature provides strong motivation for the study of Bayesian approaches in a frequentist setting.

While Bayesian approaches have been widely studied, generally applicable bounds within the frequentist framework have been lacking. Recently, several approaches have attempted to address this problem. In this paper we establish finite-sample data-dependent bounds for Bayesian mixture methods, which together with the above optimality properties suggest that these approaches should become more widely used.

Consider the problem of supervised learning, where we attempt to construct an estimator based on a finite sample of pairs of examples S = {(x₁, y₁), …, (xₙ, yₙ)}, each drawn independently according to an unknown distribution μ(x, y). Let A be a learning algorithm which, based on the sample S, constructs a hypothesis (estimator) h from some set of hypotheses H. Denoting by ℓ(y, h(x)) the instantaneous loss of the hypothesis h, we wish to assess the true loss L(h) = Eμ ℓ(y, h(x)), where the expectation is taken with respect to μ. In particular, the objective is to provide data-dependent bounds of the following form. For any h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ,

L(h) ≤ Λ(h, S) + Δ(h, S, δ),    (1)

where Λ(h, S) is some empirical assessment of the true loss, and Δ(h, S, δ) is a complexity term. For example, in the classic Vapnik-Chervonenkis framework, Λ(h, S) is the empirical error (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, h(xᵢ)) and Δ(h, S, δ) depends on the VC-dimension of H but is independent of both the hypothesis h and the sample S. By algorithm and data-dependent bounds we mean bounds where the complexity term depends on both the hypothesis (chosen by the algorithm A) and the sample S.

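As an illustration of the two terms in (1), the sketch below evaluates the empirical error Λ(h, S) and a schematic VC-type complexity term Δ for a toy threshold classifier. The constants in the complexity term follow one common textbook form and are illustrative only; note that, unlike the bounds developed in this paper, this Δ ignores both h and S (and is vacuous for so small a sample).

```python
import math

def empirical_error(h, sample, loss):
    """Empirical assessment Lambda(h, S): average loss of h on the sample."""
    return sum(loss(y, h(x)) for x, y in sample) / len(sample)

def vc_complexity_term(n, vc_dim, delta):
    """Schematic VC-type complexity term Delta(S, delta): depends on the
    VC-dimension of H and the confidence delta, but not on h or on S itself.
    The constants are one common textbook choice, shown for illustration."""
    return math.sqrt((vc_dim * (math.log(2 * n / vc_dim) + 1)
                      + math.log(4 / delta)) / n)

# Toy example: a fixed threshold classifier on 1-d data with 0/1 loss.
sample = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
h = lambda x: int(x > 0.5)
zero_one = lambda y, p: int(y != p)

# Right-hand side of a bound of the form (1).
bound = empirical_error(h, sample, zero_one) + vc_complexity_term(
    n=len(sample), vc_dim=1, delta=0.05)
```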
2 A Decision Theoretic Bayesian Framework

Consider a decision theoretic setting where we define the sample-dependent loss of an algorithm A by R(μ, A, S) = Eμ ℓ(y, A(x, S)). Let fμ be the optimal predictor for y, namely the function minimizing Eμ{ℓ(y, φ(x))} over φ. It is clear that the best algorithm A (the Bayes algorithm) is the one that always returns fμ, assuming μ is known. We are interested in the expected loss of an algorithm averaged over samples S:

R(μ, A) = E_S R(μ, A, S) = ∫ R(μ, A, S) dμ(S),

where the expectation is taken with respect to the sample S drawn i.i.d. from the probability measure μ. If we consider a family of measures μ, which possesses some underlying prior distribution π(μ), then we can construct the averaged risk function with respect to the prior as

r(π, A) = E_π R(μ, A) = ∫ dμ(S) dπ(μ) ∫ R(μ, A, S) dπ(μ|S),

where dπ(μ|S) = dμ(S) dπ(μ) / ∫_μ dμ(S) dπ(μ) is the posterior distribution on the μ family, which induces a posterior distribution on the sample space as π_S = E_{π(μ|S)} μ. An algorithm minimizing the Bayes risk r(π, A) is referred to as a Bayes algorithm. In fact, for a given prior and a given sample S, the optimal algorithm should return the Bayes optimal predictor with respect to the posterior measure π_S.
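
To make the posterior construction concrete, the following sketch computes dπ(μ|S) ∝ dμ(S) dπ(μ) (likelihood times prior) for a small finite family of candidate measures, and the induced posterior probability that π_S assigns to a label. The Bernoulli family, the uniform prior, and the suppression of the x-part are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# A finite family of candidate measures mu for a binary label y (the x-part
# is suppressed for simplicity), together with a uniform prior pi over them.
thetas = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # mu(y = 1) for each candidate
prior = np.full(len(thetas), 1.0 / len(thetas))

def posterior(sample_y, thetas, prior):
    """d_pi(mu|S) proportional to d_mu(S) d_pi(mu): likelihood times prior,
    normalized over the mu family."""
    k = sum(sample_y)
    n = len(sample_y)
    lik = np.array([t ** k * (1 - t) ** (n - k) for t in thetas])
    w = lik * prior
    return w / w.sum()

S_y = [1, 1, 0, 1]                 # observed labels in the sample S
post = posterior(S_y, thetas, prior)

# The induced posterior on the sample space, pi_S = E_{pi(mu|S)} mu,
# assigns y = 1 the posterior-averaged probability:
pi_S_y1 = float(post @ thetas)
```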

For many important practical problems, the optimal Bayes predictor is a linear functional of the underlying probability measure. For example, if the loss function is quadratic, namely ℓ(y, A(x)) = (y − A(x))², then the optimal Bayes predictor fμ(x) is the conditional mean of y, namely Eμ[y|x]. For binary classification problems, we can let the predictor be the conditional probability fμ(x) = μ(y = 1|x) (the optimal classification decision rule then corresponds to a test of whether fμ(x) > 0.5), which is also a linear functional of μ. Clearly, if the Bayes predictor is a linear functional of the probability measure, then the optimal Bayes algorithm with respect to the prior π is given by the posterior mixture predictor f_{π_S}(x) = E_{π(μ|S)} fμ(x).
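
The linearity argument can be checked numerically: under quadratic loss, the posterior-weighted average of the per-model conditional means minimizes the posterior expected loss over all candidate predictions. The discrete family and posterior weights below are illustrative assumptions.

```python
import numpy as np

# Per-model predictors f_mu(x) = E_mu[y|x] at a fixed point x, and a
# posterior pi(mu|S) over the three models (illustrative values).
cond_means = np.array([0.2, 0.5, 0.8])
post = np.array([0.1, 0.6, 0.3])

# The posterior mixture prediction: the posterior-weighted average of the
# per-model conditional means.
mixture_pred = float(post @ cond_means)

def posterior_risk(a):
    """Posterior expected squared loss of predicting a, for y in {0, 1}:
    E_mu[(y - a)^2 | x] = f_mu(x)(1-a)^2 + (1 - f_mu(x)) a^2, averaged
    over the posterior pi(mu|S)."""
    per_model = cond_means * (1 - a) ** 2 + (1 - cond_means) * a ** 2
    return float(post @ per_model)

# Numerical check: a grid search recovers the mixture prediction as the
# (approximate) minimizer of the posterior risk.
grid = np.linspace(0.0, 1.0, 1001)
best = grid[np.argmin([posterior_risk(a) for a in grid])]
```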
