{"title": "A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split", "book": "Advances in Neural Information Processing Systems", "page_first": 183, "page_last": 189, "abstract": null, "full_text": "A Bound on the Error of Cross Validation Using \nthe Approximation and Estimation Rates, with \n\nConsequences for the Training-Test Split \n\nMichael Kearns \nAT&T Research \n\n1 INTRODUCTION \nWe analyze the performance of cross validation 1 in the context of model selection and \ncomplexity regularization. We work in a setting in which we must choose the right number \nof parameters for a hypothesis function in response to a finite training sample, with the goal \nof minimizing the resulting generalization error. There is a large and interesting literature \non cross validation methods, which often emphasizes asymptotic statistical properties, or \nthe exact calculation of the generalization error for simple models. Our approach here is \nsomewhat different, and is pri mari I y inspired by two sources. The first is the work of Barron \nand Cover [2], who introduced the idea of bounding the error of a model selection method \n(in their case, the Minimum Description Length Principle) in terms of a quantity known as \nthe index of resolvability. The second is the work of Vapnik [5], who provided extremely \npowerful and general tools for uniformly bounding the deviations between training and \ngeneralization errors. \nWe combine these methods to give a new and general analysis of cross validation perfor(cid:173)\nmance. In the first and more formal part of the paper, we give a rigorous bound on the error \nof cross validation in terms of two parameters of the underlying model selection problem: \nthe approximation rate and the estimation rate. 
In the second and more experimental part of the paper, we investigate the implications of our bound for choosing γ, the fraction of data withheld for testing in cross validation. The most interesting aspect of this analysis is the identification of several qualitative properties of the optimal γ that appear to be invariant over a wide class of model selection problems:

• When the target function complexity is small compared to the sample size, the performance of cross validation is relatively insensitive to the choice of γ.

• The importance of choosing γ optimally increases, and the optimal value for γ decreases, as the target function becomes more complex relative to the sample size.

• There is nevertheless a single fixed value for γ that works nearly optimally for a wide range of target function complexity.

2 THE FORMALISM
We consider model selection as a two-part problem: choosing the appropriate number of parameters for the hypothesis function, and tuning these parameters. The training sample is used in both steps of this process. In many settings, the tuning of the parameters is determined by a fixed learning algorithm such as backpropagation, and then model selection reduces to the problem of choosing the architecture. Here we adopt an idealized version of this division of labor. We assume a nested sequence of function classes H_1 ⊂ ... ⊂ H_d ..., called the structure [5], where H_d is a class of boolean functions of d parameters, each function being a mapping from some input space X into {0, 1}.

¹ Perhaps in conflict with accepted usage in statistics, here we use the term "cross validation" to mean the simple method of saving out an independent test set to perform model selection. Precise definitions will be stated shortly.
For simplicity, in this paper we assume that the Vapnik-Chervonenkis (VC) dimension [6, 5] of the class H_d is O(d). To remove this assumption, one simply replaces all occurrences of d in our bounds by the VC dimension of H_d. We assume that we have in our possession a learning algorithm L that on input any training sample S and any value d will output a hypothesis function h_d ∈ H_d that minimizes the training error over H_d, that is, ε_t(h_d) = min_{h ∈ H_d} {ε_t(h)}, where ε_t(h) is the fraction of the examples in S on which h disagrees with the given label. In many situations, training error minimization is known to be computationally intractable, leading researchers to investigate heuristics such as backpropagation. The extent to which the theory presented here applies to such heuristics will depend in part on the extent to which they approximate training error minimization for the problem under consideration.
Model selection is thus the problem of choosing the best value of d. More precisely, we assume an arbitrary target function f (which may or may not reside in one of the function classes in the structure H_1 ⊂ ... ⊂ H_d ...), and an input distribution P; f and P together define the generalization error function ε_g(h) = Pr_{x∈P}[h(x) ≠ f(x)]. We are given a training sample S of f, consisting of m random examples drawn according to P and labeled by f (with the labels possibly corrupted by a noise process that randomly complements each label independently with probability η < 1/2). The goal is to minimize the generalization error of the hypothesis selected.
In this paper, we will make the rather mild but very useful assumption that the structure has the property that for any sample size m, there is a value d_max(m) such that ε_t(h_{d_max(m)}) = 0 for any labeled sample S of m examples. We call the function d_max(m) the fitting number of the structure.
The fitting number formalizes the simple notion that with enough parameters, we can always fit the training data perfectly, a property held by most sufficiently powerful function classes (including multilayer neural networks). We typically expect the fitting number to be a linear function of m, or at worst a polynomial in m. The significance of the fitting number for us is that no reasonable model selection method should choose h_d for d ≥ d_max(m), since doing so simply adds complexity without reducing the training error.
In this paper we concentrate on the simplest version of cross validation. We choose a parameter γ ∈ [0, 1], which determines the split between training and test data. Given the input sample S of m examples, let S' be the subsample consisting of the first (1 − γ)m examples in S, and S'' the subsample consisting of the last γm examples. In cross validation, rather than giving the entire sample S to L, we give only the smaller sample S', resulting in the sequence h_1, ..., h_{d_max((1−γ)m)} of increasingly complex hypotheses. Each hypothesis is now obtained by training on only (1 − γ)m examples, which implies that we will only consider values of d smaller than the corresponding fitting number d_max((1 − γ)m); let us introduce the shorthand d^γ_max for d_max((1 − γ)m). Cross validation chooses the h_d minimizing the test error, that is, d = argmin_{i ∈ {1,...,d^γ_max}} {ε''_t(h_i)}, where ε''_t(h_i) is the error of h_i on the subsample S''. Notice that we are not considering multifold cross validation, or other variants that make more efficient use of the sample, because our analyses will require the independence of the test set. However, we believe that many of the themes that emerge here may apply to these more sophisticated variants as well.
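The hold-out procedure just described can be sketched in a few lines; here the `train` argument stands in for the idealized training error minimizer L, and all names are our own illustration:

```python
def cross_validate(sample, gamma, train, d_max):
    """Simple hold-out cross validation: train on the first (1 - gamma)*m
    examples, then choose the complexity d whose hypothesis has the
    lowest error on the last gamma*m examples."""
    m = len(sample)
    cut = int(round((1 - gamma) * m))
    s_train, s_test = sample[:cut], sample[cut:]

    def test_error(h):
        return sum(h(x) != y for x, y in s_test) / len(s_test)

    hypotheses = {d: train(s_train, d) for d in range(1, d_max + 1)}
    d_star = min(hypotheses, key=lambda d: test_error(hypotheses[d]))
    return d_star, hypotheses[d_star]
```

Note that the test set sees only d^γ_max candidate hypotheses, which is what makes the uniform convergence argument over the testing phase cheap relative to training.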
We use ε_cv(m) to denote the generalization error ε_g(h_d) of the hypothesis h_d chosen by cross validation when given as input a sample S of m random examples of the target function. Obviously, ε_cv(m) depends on S, the structure, f, P, and the noise rate. When bounding ε_cv(m), we will use the expression "with high probability" to mean with probability 1 − δ over the sample S, for some small fixed constant δ > 0. All of our results can also be stated with δ as a parameter at the cost of a log(1/δ) factor in the bounds, or in terms of the expected value of ε_cv(m).

3 THE APPROXIMATION RATE
It is apparent that any nontrivial bound on ε_cv(m) must take account of some measure of the "complexity" of the unknown target function f. The correct measure of this complexity is less obvious. Following the example of Barron and Cover's analysis of MDL performance in the context of density estimation [2], we propose the approximation rate as a natural measure of the complexity of f and P in relation to the chosen structure H_1 ⊂ ... ⊂ H_d .... Thus we define the approximation rate function ε_g(d) to be ε_g(d) = min_{h ∈ H_d} {ε_g(h)}. The function ε_g(d) tells us the best generalization error that can be achieved in the class H_d, and it is a nonincreasing function of d. If ε_g(s) = 0 for some sufficiently large s, this means that the target function f, at least with respect to the input distribution, is realizable in the class H_s, and thus s is a coarse measure of how complex f is. More generally, even if ε_g(d) > 0 for all d, the rate of decay of ε_g(d) still gives a nice indication of how much representational power we gain with respect to f and P by increasing the complexity of our models.
Still missing, of course, is some means of determining the extent to which this representational power can be realized by training on a finite sample of a given size, but this will be added shortly. First we give examples of the approximation rate that we will examine following the general bound on ε_cv(m).
The Intervals Problem. In this problem, the input space X is the real interval [0, 1], and the class H_d of the structure consists of all boolean step functions over [0, 1] of at most d steps; thus, each function partitions the interval [0, 1] into at most d disjoint segments (not necessarily of equal width), and assigns alternating positive and negative labels to these segments. The input space is one-dimensional, but the structure contains arbitrarily complex functions over [0, 1]. It is easily verified that our assumption that the VC dimension of H_d is O(d) holds here, and that the fitting number obeys d_max(m) ≤ m. Now suppose that the input density P is uniform, and suppose that the target function f is the function of s alternating segments of equal width 1/s, for some s (thus, f lies in the class H_s). We will refer to these settings as the intervals problem. Then the approximation rate is ε_g(d) = (1/2)(1 − d/s) for 1 ≤ d < s and ε_g(d) = 0 for d ≥ s (see Figure 1).
The Perceptron Problem. In this problem, the input space X is R^N for some large natural number N. The class H_d consists of all perceptrons over the N inputs in which at most d weights are nonzero. If the input density is spherically symmetric (for instance, the uniform density on the unit ball in R^N), and the target function is the function in H_s with all s nonzero weights equal to 1, then it can be shown that the approximation rate is ε_g(d) = (1/π) cos⁻¹(√(d/s)) for d < s [4], and of course ε_g(d) = 0 for d ≥ s (see Figure 1).
Power Law Decay.
In addition to the specific examples just given, we would also like to study reasonably natural parametric forms of ε_g(d), to determine the sensitivity of our theory to a plausible range of behaviors for the approximation rate. This is important, because in practice we do not expect to have precise knowledge of ε_g(d), since it depends on the target function and input distribution. Following the work of Barron [1], who shows a c/d bound on ε_g(d) for the case of neural networks with one hidden layer under a squared error generalization measure (where c is a measure of target function complexity in terms of a Fourier transform integrability condition)², we can consider approximation rates of the form ε_g(d) = (c/d)^α + ε_min, where ε_min ≥ 0 is a parameter representing the "degree of unrealizability" of f with respect to the structure, and c, α > 0 are parameters capturing the rate of decay to ε_min (see Figure 1).

4 THE ESTIMATION RATE
For a fixed f, P and H_1 ⊂ ... ⊂ H_d ..., we say that a function ρ(d, m) is an estimation rate bound if for all d and m, with high probability over the sample S we have |ε_t(h_d) − ε_g(h_d)| ≤ ρ(d, m), where as usual h_d is the result of training error minimization on S within H_d. Thus ρ(d, m) simply bounds the deviation between the training error and the generalization error of h_d. Note that the best such bound may depend in a complicated way on all of the elements of the problem: f, P and the structure. Indeed, much of the recent work on the statistical physics theory of learning curves has documented the wide variety of behaviors that such deviations may assume [4, 3].

² Since the bounds we will give have straightforward generalizations to real-valued function learning under squared error, examining behavior for ε_g(d) in this setting seems reasonable.

However, for many natural problems
it is both convenient and accurate to rely on a universal estimation rate bound provided by the powerful theory of uniform convergence: namely, for any f, P and any structure, the function ρ(d, m) = √((d/m) log(m/d)) is an estimation rate bound [5]. Depending upon the details of the problem, it is sometimes appropriate to omit the log(m/d) factor, and often appropriate to refine the √(d/m) behavior to a function that interpolates smoothly between d/m behavior for small ε_t and √(d/m) for large ε_t. Although such refinements are both interesting and important, many of the qualitative claims and predictions we will make are invariant to them as long as the deviation |ε_t(h_d) − ε_g(h_d)| is well-approximated by a power law (d/m)^α (α > 0); it will be more important to recognize and model the cases in which power law behavior is grossly violated.
Note that this universal estimation rate bound holds only under the assumption that the training sample is noise-free, but straightforward generalizations exist. For instance, if the training data is corrupted by random label noise at rate 0 ≤ η < 1/2, then ρ(d, m) = √((d/((1 − 2η)²m)) log(m/d)) is again a universal estimation rate bound.

5 THE BOUND
Theorem 1 Let H_1 ⊂ ... ⊂ H_d ... be any structure, where the VC dimension of H_d is O(d). Let f and P be any target function and input distribution, let ε_g(d) be the approximation rate function for the structure with respect to f and P, and let ρ(d, m) be an estimation rate bound for the structure with respect to f and P. Then for any m, with high probability

ε_cv(m) ≤ min_{1 ≤ d ≤ d^γ_max} {ε_g(d) + ρ(d, (1 − γ)m)} + O(√(log(d^γ_max)/(γm)))    (1)

where γ is the fraction of the training sample used for testing, and d^γ_max is the fitting number d_max((1 − γ)m). Using the universal estimation rate bound and the rather weak assumption that d_max(m) is polynomial in m, we obtain that with high probability

ε_cv(m) ≤ min_{1 ≤ d ≤ d^γ_max} {ε_g(d) + √((d/((1 − γ)m)) log((1 − γ)m/d))} + O(√(log((1 − γ)m)/(γm)))    (2)

Straightforward generalizations of these bounds for the case where the data is corrupted by classification noise can be obtained, using the modified estimation rate bound given in Section 4³.

We delay the proof of this theorem to the full paper due to space considerations. However, the central idea is to appeal twice to uniform convergence arguments: once within each class H_d to bound the generalization error of the resulting training error minimizer h_d ∈ H_d, and a second time to bound the generalization error of the h_d minimizing the error on the test set of γm examples.
In the bounds given by (1) and (2), the min{·} expression is analogous to Barron and Cover's index of resolvability [2]; the final term in the bounds represents the error introduced by the testing phase of cross validation. These bounds exhibit tradeoff behavior with respect to the parameter γ: as we let γ approach 0, we are devoting more of the sample to training the h_d, and the estimation rate bound term ρ(d, (1 − γ)m) is decreasing. However, the test error term O(√(log(d^γ_max)/(γm))) is increasing, since we have less data to accurately estimate the ε_g(h_d). The reverse phenomenon occurs as we let γ approach 1.
While we believe Theorem 1 to be enlightening and potentially useful in its own right, we would now like to take its interpretation a step further. More precisely, suppose we assume that the bound is an approximation to the actual behavior of ε_cv(m). Then in principle we can optimize the bound to obtain the best value for γ.

³ The main effect of classification noise at rate η is the replacement of occurrences in the bound of the sample size m by the smaller "effective" sample size (1 − 2η)²m.
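For reference, the universal estimation rate bound of Section 4, together with its label-noise variant, can be transcribed directly (the function name and the packaging of the noise rate as an optional parameter are our own):

```python
import math

def rho(d, m, eta=0.0):
    """Universal estimation rate bound sqrt((d/m) * log(m/d)); with label
    noise at rate 0 <= eta < 1/2, the sample size m is replaced by the
    effective sample size (1 - 2*eta)**2 * m."""
    m_eff = (1.0 - 2.0 * eta) ** 2 * m
    return math.sqrt((d / m_eff) * math.log(m / d))
```

As the noise rate approaches 1/2 the effective sample size shrinks toward zero, so the bound correctly degrades for nearly random labels.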
Of course, in addition to the assumptions involved (the main one being that ρ(d, m) is a good approximation to the training-generalization error deviations of the h_d), this analysis can only be carried out given information that we should not expect to have in practice (at least in exact form); in particular, the approximation rate function ε_g(d), which depends on f and P. However, we argue in the coming sections that several interesting qualitative phenomena regarding the choice of γ are largely invariant to a wide range of natural behaviors for ε_g(d).

6 A CASE STUDY: THE INTERVALS PROBLEM

We begin by performing the suggested optimization of γ for the intervals problem. Recall that the approximation rate here is ε_g(d) = (1/2)(1 − d/s) for d < s and ε_g(d) = 0 for d ≥ s, where s is the complexity of the target function. Here we analyze the behavior obtained by assuming that the estimation rate ρ(d, (1 − γ)m) actually behaves as √(d/((1 − γ)m)) (so we are omitting the log factor from the universal bound), and to simplify the formal analysis a bit (but without changing the qualitative behavior) we replace the term √(log((1 − γ)m)/(γm)) by the weaker √(log(m)/(γm)). Thus, if we define the function F(d, m, γ) = ε_g(d) + √(d/((1 − γ)m)) + √(log(m)/(γm)), then following Equation (1), we are approximating ε_cv(m) by ε_cv(m) ≈ min_{1 ≤ d ≤ d^γ_max} {F(d, m, γ)}.