Part of Advances in Neural Information Processing Systems 13 (NIPS 2000)

*Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano*

In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allow us to prove the margin type inequalities for the most general functional classes, the complexity of the class being measured via the so called Gaussian complexity func(cid:173) tions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boost(cid:173) ing. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of h -norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee.

1

Introduction and margin type inequalities for general functional classes

Let (X, Y) be a random couple, where X is an instance in a space Sand Y E {-I, I} is a label. Let 9 be a set of functions from S into JR. For 9 E g, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (Xl, Yl ), ... , (Xn, Yn) that consists ofn i.i.d. copies of (X, Y). The goal ofleaming is to find a predictor 9 E 9 (based on the training data) whose generalization (classification) error JP'{Yg(X) :::; O} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5].

Let (8, A, P) be a probability space and let F be a class of measurable functions from (8, A) into lR. Let {Xd be a sequence of i.i.d. random variables taking values in (8, A) with common distribution P. Let Pn be the empirical measure based on the sample (Xl,'" ,Xn), Pn := n- l E~=l c5x " where c5x denotes the probability distribution con(cid:173) centrated at the point x. We will denote P! := Is !dP, Pn! := Is !dPn, etc. In what follows, £OO(F) denotes the Banach space of uniformly bounded real valued functions on F with the norm IIYII.:F := sUPfE.:F 1Y(f)I, Y E £OO(F). Define n

n

where {gi} is a sequence of i.i.d. standard normal random variables, independent of {Xi}' We will call n t-+ Gn(F) the Gaussian complexity function of the class F. One can find in the literature (see, e.g. [11]) various upper bounds on such quantities as Gn (F) in terms of entropies, VC-dimensions, etc.

We give below a bound in terms of margin cost functions (compare to [6, 7]) and Gaussian complexities. Let

Theorem 1 For all t > 0,

lP'{ =3/ E F: P{! :-::; O} >

Let us consider a special family of cost functions. Assume that cP is a fixed non increasing Lipschitz function from IR into IR such that cp(x) 2: (1 + sgn( -x)) /2 for all x E lR. One can easily observe that L( cpU 15)) :-::; L( cP )15- 1 . Applying Theorem 1 to the class of Lipschitz functions

Theorem 2 For all t > 0,

lP'{3! E F: P{! :-::; O} >

inf [Pncp(L) + 2y'2irL(cp) Gn(F) aE[O,l]

- cogIOg~(2c5-l)r/2] + t:n2 }:-::; 2exp{-2t2}.

Do not remove: This comment is monitored to verify that the site is working properly