{"title": "Neural Network Model Selection Using Asymptotic Jackknife Estimator and Cross-Validation Method", "book": "Advances in Neural Information Processing Systems", "page_first": 599, "page_last": 606, "abstract": null, "full_text": "Neural Network Model Selection Using Asymptotic Jackknife Estimator and Cross-Validation Method \n\nYong Liu \nDepartment of Physics and Institute for Brain and Neural Systems \nBox 1843, Brown University \nProvidence, RI 02912 \n\nAbstract \n\nTwo theorems and a lemma are presented about the use of the jackknife estimator and the cross-validation method for model selection. Theorem 1 gives the asymptotic form of the jackknife estimator. Combined with the model selection criterion, this asymptotic form can be used to obtain the fit of a model. The model selection criterion we use is the negative of the average predictive likelihood, a choice based on the idea of the cross-validation method. Lemma 1 provides a formula for further exploration of the asymptotics of the model selection criterion. Theorem 2 gives an asymptotic form of the model selection criterion for the regression case, when the parameter optimization criterion has a penalty term. Theorem 2 also proves the asymptotic equivalence of Moody's model selection criterion (Moody, 1992) and the cross-validation method, when the distance measure between the response y and the regression function takes the form of a squared difference. \n\n1 INTRODUCTION \n\nSelecting a model for a specified problem is the key to generalization based on the training data set. In the context of neural networks, this corresponds to selecting an architecture. There has been a substantial amount of work in model selection (Lindley, 1968; Mallows, 1973; Akaike, 1973; Stone, 1977; Atkinson, 1978; Schwarz, 1978; Zellner, 1984; MacKay, 1991; Moody, 1992; etc.). 
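As a concrete numerical illustration of criterion-based model selection (not part of the original paper: the synthetic data, the polynomial model family, and the helper names fit_poly, neg_log_lik and aic are all assumed for this sketch), candidate models can be ranked by an AIC-style score of the form negative log-likelihood plus number of parameters:

```python
import numpy as np

def fit_poly(x, y, degree):
    # least-squares polynomial fit; returns the coefficient vector
    return np.polyfit(x, y, degree)

def neg_log_lik(x, y, coef):
    # negative Gaussian log-likelihood with the ML variance estimate
    resid = y - np.polyval(coef, x)
    n = len(y)
    sigma2 = max(resid @ resid / n, 1e-12)
    return 0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic(x, y, degree):
    # AIC in the '-log-likelihood + number of parameters' form
    coef = fit_poly(x, y, degree)
    p = degree + 1
    return neg_log_lik(x, y, coef) + p

# synthetic data: quadratic truth plus small Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0.0, 0.1, size=x.size)

scores = {d: aic(x, y, d) for d in range(6)}
best = min(scores, key=scores.get)
```

Here the quadratic truth makes degrees 0 and 1 score far worse than degree 2, while higher degrees buy little extra likelihood for their extra parameters; the same ranking logic, with a predictive-likelihood criterion in place of AIC, is what the paper develops.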
In Moody's paper (Moody, 1992), the author generalized the Akaike Information Criterion (AIC) (Akaike, 1973) to the regression case and introduced the term effective number of parameters. It is thus of great interest to see what the link between this criterion and the cross-validation method (Stone, 1974) is and what we can gain from it, given the fact that AIC is asymptotically equivalent to the cross-validation method (Stone, 1977). \n\nIn the method of cross-validation (Stone, 1974), a data set, which has one data point deleted from the original training data set, is used to estimate the parameters of a model by optimizing a parameter optimization criterion. The optimal parameters thus obtained are called the jackknife estimator (Miller, 1974). The predictive likelihood of the deleted data point is then calculated, based on the estimated parameters. This is repeated for each data point in the original training data set. The fit of the model, or the model selection criterion, is chosen as the negative of the average of these predictive likelihoods. However, the computational cost of re-estimating the parameters for each data-point deletion is high. In section 2, we obtain an asymptotic formula (theorem 1) for the jackknife estimator based on optimizing a parameter optimization criterion with one data point deleted from the training data set. This somewhat relieves the computational cost mentioned above. This asymptotic formula can be used to obtain the model selection criterion by plugging it into the criterion. Furthermore, in section 3, we obtain the asymptotic form of the model selection criterion for the general case (lemma 1) and for the special case when the parameter optimization criterion has a penalty term (theorem 2). We also prove the equivalence of Moody's model selection criterion (Moody, 1992) and the cross-validation method (theorem 2). 
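The leave-one-out procedure just described can be made concrete with a small sketch (synthetic data; the Gaussian noise model with a fixed, assumed variance and the helper name loo_criterion are illustrations, not constructs from the paper): each point is deleted in turn, the model is refit to obtain the jackknife estimate, and the deleted point's predictive log-likelihood is accumulated.

```python
import numpy as np

def loo_criterion(x, y, degree, sigma2=0.01):
    # Negative average predictive log-likelihood under leave-one-out:
    # refit with point i deleted (the jackknife estimate), then score point i.
    n = len(y)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coef = np.polyfit(x[mask], y[mask], degree)  # jackknife estimate
        pred = np.polyval(coef, x[i])
        # Gaussian predictive log-likelihood of the held-out point
        log_f = -0.5 * np.log(2 * np.pi * sigma2) - (y[i] - pred) ** 2 / (2 * sigma2)
        total += log_f
    return -total / n

# synthetic data: linear truth plus small Gaussian noise
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 60)
y = 0.5 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

t_linear = loo_criterion(x, y, degree=1)   # matches the true model
t_degree9 = loo_criterion(x, y, degree=9)  # over-parameterized model
```

The expensive part is the n refits inside the loop; theorem 1 in section 2 replaces them with a single fit plus a correction term, which is the point of the asymptotic jackknife formula.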
Only sketchy proofs are given when these theorems and the lemma are introduced. The details of the proofs are given in section 4. \n\n2 APPROXIMATE JACKKNIFE ESTIMATOR \n\nLet the parameter optimization criterion, with data set $\\omega = \\{(x_i, y_i), i = 1, \\ldots, n\\}$ and parameters $\\theta$, be $C_\\omega(\\theta)$, and let $\\omega_{-i}$ denote the data set with the $i$th data point deleted from $\\omega$. If we denote $\\hat{\\theta}$ and $\\hat{\\theta}_{-i}$ as the optimal parameters for the criteria $C_\\omega(\\theta)$ and $C_{\\omega_{-i}}(\\theta)$, respectively, $\\nabla_\\theta$ as the derivative with respect to $\\theta$, and superscript $t$ as transpose, we have the following theorem about the relationship between $\\hat{\\theta}$ and $\\hat{\\theta}_{-i}$. \n\nTheorem 1 If the criterion function $C_\\omega(\\theta)$ is an infinite-order differentiable function and its derivatives are bounded around $\\hat{\\theta}$, the estimator $\\hat{\\theta}_{-i}$ (also called the jackknife estimator (Miller, 1974)) can be approximated as \n\n$$\\hat{\\theta}_{-i} - \\hat{\\theta} \\approx (\\nabla_\\theta \\nabla_\\theta^t C_\\omega(\\hat{\\theta}) - \\nabla_\\theta \\nabla_\\theta^t C_i(\\hat{\\theta}))^{-1} \\nabla_\\theta C_i(\\hat{\\theta}) \\qquad (1)$$ \n\nin which $C_i(\\theta) = C_\\omega(\\theta) - C_{\\omega_{-i}}(\\theta)$. \n\nProof. Use the Taylor expansion of the equation $\\nabla_\\theta C_{\\omega_{-i}}(\\hat{\\theta}_{-i}) = 0$ around $\\hat{\\theta}$, and ignore terms higher than the second order. \n\nExample 1: Using the generalized maximum likelihood method from Bayesian analysis^1 (Berger, 1985), if $\\pi(\\theta)$ is the prior on the parameters and the observations are mutually independent, with the distribution modeled as $y|x \\sim f(y|x, \\theta)$, the parameter optimization criterion is \n\n$$C_\\omega(\\theta) = \\log \\pi(\\theta) + \\sum_{(x_i, y_i) \\in \\omega} \\log f(y_i|x_i, \\theta) \\qquad (2)$$ \n\nThus $C_i(\\theta) = \\log f(y_i|x_i, \\theta)$. If we ignore the influence of the deleted data point in the denominator of equation 1, we have \n\n$$\\hat{\\theta}_{-i} - \\hat{\\theta} \\approx (\\nabla_\\theta \\nabla_\\theta^t C_\\omega(\\hat{\\theta}))^{-1} \\nabla_\\theta \\log f(y_i|x_i, \\hat{\\theta}) \\qquad (3)$$ \n\nExample 2: In the special case of example 1, with the noninformative prior $\\pi(\\theta) = 1$, the criterion is the ordinary log-likelihood function; thus \n\n$$\\hat{\\theta}_{-i} - \\hat{\\theta} \\approx \\Big[\\sum_{(x_j, y_j) \\in \\omega} \\nabla_\\theta \\nabla_\\theta^t \\log f(y_j|x_j, \\hat{\\theta})\\Big]^{-1} \\nabla_\\theta \\log f(y_i|x_i, \\hat{\\theta}) \\qquad (4)$$ 
\n\n3 CROSS-VALIDATION METHOD AND MODEL SELECTION CRITERION \n\nHereafter we use the negative of the average predictive likelihood, or \n\n$$T_m(\\omega) = -\\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\log f(y_i|x_i, \\hat{\\theta}_{-i}) \\qquad (5)$$ \n\nas the model selection criterion, in which $n$ is the size of the training data set $\\omega$, $m \\in \\mathcal{M}$ denotes a parametric probability model $f(y|x, \\theta)$, and $\\mathcal{M}$ is the set of all the models in consideration. It is well known that $T_m(\\omega)$ is an unbiased estimator of $r(\\theta_0, \\hat{\\theta}(\\cdot))$, the risk of using the model $m$ and the estimator $\\hat{\\theta}$ when the true parameters are $\\theta_0$ and the training data set is $\\omega$ (Stone, 1974; Efron and Gong, 1983; etc.), i.e., \n\n$$r(\\theta_0, \\hat{\\theta}(\\cdot)) = E\\{T_m(\\omega)\\} = E\\{-\\log f(y|x, \\hat{\\theta}(\\omega))\\} = E\\Big\\{-\\frac{1}{k} \\sum_{(x_j, y_j) \\in \\omega_n} \\log f(y_j|x_j, \\hat{\\theta}(\\omega))\\Big\\} \\qquad (6)$$ \n\nin which $\\omega_n = \\{(x_j, y_j), j = 1, \\ldots, k\\}$ is the test data set and $\\hat{\\theta}(\\cdot)$ is an implicit function of the training data set $\\omega$; it is the estimator we decide to use after we have observed the training data set $\\omega$. The expectation above is taken over the randomness of $\\omega$, $x$, $y$ and $\\omega_n$. The optimal model will be the one that minimizes this criterion. This procedure of using $\\hat{\\theta}_{-i}$ and $T_m(\\omega)$ to obtain an estimate of the risk is often called the cross-validation method (Stone, 1974; Efron and Gong, 1983). \n\nRemark: After we have obtained $\\hat{\\theta}$ for a model, we can use equation 1 to calculate $\\hat{\\theta}_{-i}$ for each $i$, and put the resulting $\\hat{\\theta}_{-i}$ into equation 5 to get the fit of the model; thus we will be able to compare different models $m \\in \\mathcal{M}$. \n\n^1 Strictly speaking, it is a method to find the posterior mode. \n\nLemma 1 If the probability model $f(y|x, \\theta)$, as a function of $\\theta$, is differentiable up to infinite order and its derivatives are bounded around $\\hat{\\theta}$, the approximation to the model selection criterion, equation 5, can be written as \n\n$$T_m(\\omega) \\approx -\\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\log f(y_i|x_i, \\hat{\\theta}) - \\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\nabla_\\theta^t \\log f(y_i|x_i, \\hat{\\theta}) (\\hat{\\theta}_{-i} - \\hat{\\theta}) \\qquad (7)$$ \n\nProof. 
Ignoring the terms higher than the second order in the Taylor expansion of $\\log f(y_j|x_j, \\hat{\\theta}_{-i})$ around $\\hat{\\theta}$ yields the result. \n\nExample 2 (continued): Using equation 4, we have, for the model selection criterion, \n\n$$T_m(\\omega) \\approx -\\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\log f(y_i|x_i, \\hat{\\theta}) - \\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\nabla_\\theta^t \\log f(y_i|x_i, \\hat{\\theta}) A^{-1} \\nabla_\\theta \\log f(y_i|x_i, \\hat{\\theta}) \\qquad (8)$$ \n\nin which $A = \\sum_{(x_j, y_j) \\in \\omega} \\nabla_\\theta \\nabla_\\theta^t \\log f(y_j|x_j, \\hat{\\theta})$. If the model $f(y|x, \\theta)$ is the true one, the second term is asymptotically equal to $p/n$, where $p$ is the number of parameters in the model. So the model selection criterion is, up to the factor $1/n$, \n\n$-$log-likelihood $+$ number of parameters of the model. \n\nThis is the well known Akaike Information Criterion (AIC) (Akaike, 1973). \n\nExample 1 (continued): Consider the probability model \n\n$$f(y|x, \\theta) = \\beta \\exp\\Big(-\\frac{1}{2\\sigma^2} \\epsilon(y, \\eta_\\theta(x))\\Big) \\qquad (9)$$ \n\nin which $\\beta$ is a normalization factor and $\\epsilon(y, \\eta_\\theta(x))$ is a distance measure between $y$ and the regression function $\\eta_\\theta(x)$. $\\epsilon(\\cdot)$ as a function of $\\theta$ is assumed differentiable. Denoting^2 $U(\\theta, \\lambda, \\omega) = \\sum_{(x_i, y_i) \\in \\omega} \\epsilon(y_i, \\eta_\\theta(x_i)) - 2\\sigma^2 \\log \\pi(\\theta|\\lambda)$, we have the following theorem. \n\nTheorem 2 For the model specified in equation 9 and the parameter optimization criterion specified in equation 2 (example 1), under regularity conditions, the unbiased estimator of \n\n$$E\\Big\\{\\frac{1}{k} \\sum_{(x_i, y_i) \\in \\omega_n} \\epsilon(y_i, \\eta_{\\hat{\\theta}}(x_i))\\Big\\} \\qquad (10)$$ \n\nasymptotically equals \n\n$$\\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\epsilon(y_i, \\eta_{\\hat{\\theta}}(x_i)) + \\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\nabla_\\theta^t \\epsilon(y_i, \\eta_{\\hat{\\theta}}(x_i)) \\{\\nabla_\\theta \\nabla_\\theta^t U(\\hat{\\theta}, \\lambda, \\omega)\\}^{-1} \\nabla_\\theta \\epsilon(y_i, \\eta_{\\hat{\\theta}}(x_i)) \\qquad (11)$$ \n\n^2 For example, $\\pi(\\theta|\\lambda) = N(0, \\sigma^2/\\lambda)$; this corresponds to $U(\\theta, \\lambda, \\omega) = \\sum_{(x_i, y_i) \\in \\omega} \\epsilon(y_i, \\eta_\\theta(x_i)) + \\lambda \\theta^2 + \\mathrm{const}(\\lambda, \\sigma^2)$. \n\nFor the case when $\\epsilon(y, \\eta_\\theta(x)) = (y - \\eta_\\theta(x))^2$, we get, for the asymptotic equivalent of equation 11, \n\n$$\\hat{\\epsilon}(\\hat{\\theta}, \\omega) + \\frac{\\sigma^2}{n} \\sum_{(x_i, y_i) \\in \\omega} \\nabla_\\theta^t \\frac{\\partial}{\\partial y_i} \\big(n \\hat{\\epsilon}(\\hat{\\theta}, \\omega)\\big) \\{\\nabla_\\theta \\nabla_\\theta^t U(\\hat{\\theta}, \\lambda, \\omega)\\}^{-1} \\nabla_\\theta \\frac{\\partial}{\\partial y_i} \\big(n \\hat{\\epsilon}(\\hat{\\theta}, \\omega)\\big) \\qquad (12)$$ \n\nin which $\\omega = \\{(x_i, y_i), i = 1, \\ldots, n\\}$ is the training data set, $\\omega_n = \\{(x_i, y_i), i = 1, \\ldots
, k\\}$ is the test data set, and $\\hat{\\epsilon}(\\theta, \\omega) = \\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\epsilon(y_i, \\eta_\\theta(x_i))$. \n\nProof. This result comes directly from theorem 1 and lemma 1. Some asymptotic technique has to be used. \n\nRemark: The result in equation 12 was first proposed by Moody (Moody, 1992). The effective number of parameters formulated in his paper corresponds to the summation in equation 12. Since the result in this theorem comes directly from the asymptotics of the cross-validation method and the jackknife estimator, it gives the equivalence proof between Moody's model selection criterion and the cross-validation method. The detailed proof of this theorem, presented in section 4, is in spirit the same as the one presented in Stone's paper about the proof of the asymptotic equivalence of AIC and the cross-validation method (Stone, 1977). \n\n4 DETAILED PROOF OF LEMMAS AND THEOREMS \n\nIn order to prove theorem 1, lemma 1 and theorem 2, we will first present three auxiliary lemmas. \n\nLemma 2 For random variable sequences $x_n$ and $y_n$, if $\\lim_{n \\to \\infty} x_n = z$ and $\\lim_{n \\to \\infty} y_n = z$, then $x_n$ and $y_n$ are asymptotically equivalent. \n\nProof. This comes from the definition of asymptotic equivalence, because asymptotically the two random variables will behave the same as the random variable $z$. \n\nLemma 3 Consider the summation $\\sum_i h(x_i, y_i) g(x_i, z)$. If $E(h(x, y)|x, z)$ is a constant $c$ independent of $x$, $y$, $z$, then the summation is asymptotically equivalent to $c \\sum_i g(x_i, z)$. \n\nProof. According to the law of large numbers, \n\n$$\\lim_{n \\to \\infty} \\frac{1}{n} \\sum_i h(x_i, y_i) g(x_i, z) = E(h(x, y) g(x, z)) = E(E(h(x, y)|x, z) g(x, z)) = c E(g(x, z))$$ \n\nwhich is the same as the limit of $\\frac{c}{n} \\sum_i g(x_i, z)$. Using lemma 2, we get the result of this lemma. \n\nLemma 4 If $\\eta_\\theta(\\cdot)$ and $g(\\theta, \\cdot)$ 
are differentiable up to the second order, and the model $y = \\eta_\\theta(x) + \\epsilon$ with $\\epsilon \\sim N(0, \\sigma^2)$ is the true model, the second derivative with respect to $\\theta$ of \n\n$$U(\\theta, \\lambda, \\omega) = \\sum_{i=1}^n \\epsilon(y_i, \\eta_\\theta(x_i)) + g(\\theta, \\lambda)$$ \n\nevaluated at the minimum of $U$, i.e., $\\hat{\\theta}$, is asymptotically independent of the random variables $\\{y_i, i = 1, \\ldots, n\\}$. \n\nProof. Explicit calculation of the second derivative of $U$ with respect to $\\theta$, evaluated at $\\hat{\\theta}$, gives \n\n$$\\nabla_\\theta \\nabla_\\theta^t U(\\hat{\\theta}, \\lambda, \\omega) = 2 \\sum_{i=1}^n \\nabla_\\theta \\eta_{\\hat{\\theta}}(x_i) \\nabla_\\theta^t \\eta_{\\hat{\\theta}}(x_i) - 2 \\sum_{i=1}^n (y_i - \\eta_{\\hat{\\theta}}(x_i)) \\nabla_\\theta \\nabla_\\theta^t \\eta_{\\hat{\\theta}}(x_i) + \\nabla_\\theta \\nabla_\\theta^t g(\\hat{\\theta}, \\lambda)$$ \n\nAs $n$ approaches infinity, the effect of the second term in $U$ vanishes, $\\hat{\\theta}$ approaches the mean squared error estimator with an infinite amount of data points, or the true parameters $\\theta_0$ of the model (consistency of the MSE estimator (Jennrich, 1969)), and $E(y - \\eta_{\\hat{\\theta}}(x))$ approaches $E(y - \\eta_{\\theta_0}(x))$, which is 0. According to lemma 2 and lemma 3, the second term of this second derivative vanishes asymptotically. So as $n$ approaches infinity, the second derivative of $U$ with respect to $\\theta$, evaluated at $\\hat{\\theta}$, approaches \n\n$$\\nabla_\\theta \\nabla_\\theta^t U(\\theta_0, \\lambda, \\omega) = 2 \\sum_{i=1}^n \\nabla_\\theta \\eta_{\\theta_0}(x_i) \\nabla_\\theta^t \\eta_{\\theta_0}(x_i) + \\nabla_\\theta \\nabla_\\theta^t g(\\theta_0, \\lambda)$$ \n\nwhich is independent of $\\{y_i, i = 1, \\ldots, n\\}$. According to lemma 2, the result of this lemma is readily obtained. \n\nNow we give the detailed proofs of theorem 1, lemma 1 and theorem 2. \n\nProof of Theorem 1. The jackknife estimator $\\hat{\\theta}_{-i}$ satisfies $\\nabla_\\theta C_{\\omega_{-i}}(\\hat{\\theta}_{-i}) = 0$. The Taylor expansion of the left side of this equation around $\\hat{\\theta}$ gives \n\n$$\\nabla_\\theta C_{\\omega_{-i}}(\\hat{\\theta}) + \\nabla_\\theta \\nabla_\\theta^t C_{\\omega_{-i}}(\\hat{\\theta}) (\\hat{\\theta}_{-i} - \\hat{\\theta}) + O(|\\hat{\\theta}_{-i} - \\hat{\\theta}|^2) = 0$$ \n\nAccording to the definitions of $\\hat{\\theta}$ and $\\hat{\\theta}_{-i}$, their difference is a small quantity. Also, because of the boundedness of the derivatives, we can ignore the higher order terms in the Taylor expansion and get the approximation \n\n$$\\hat{\\theta}_{-i} - \\hat{\\theta} \\approx -(\\nabla_\\theta \\nabla_\\theta^t C_{\\omega_{-i}}(\\hat{\\theta}))^{-1} \\nabla_\\theta C_{\\omega_{-i}}(\\hat{\\theta})$$ \n\nSince $\\hat{\\theta}$ satisfies $\\nabla_\\theta C_\\omega(\\hat{\\theta}) = 0$, we can rewrite this equation and obtain equation 1. \n\nProof of Lemma 1. 
The Taylor expansion of $\\log f(y_i|x_i, \\hat{\\theta}_{-i})$ around $\\hat{\\theta}$ is \n\n$$\\log f(y_i|x_i, \\hat{\\theta}_{-i}) = \\log f(y_i|x_i, \\hat{\\theta}) + \\nabla_\\theta^t \\log f(y_i|x_i, \\hat{\\theta}) (\\hat{\\theta}_{-i} - \\hat{\\theta}) + O(|\\hat{\\theta}_{-i} - \\hat{\\theta}|^2)$$ \n\nPutting this into equation 5 and ignoring the higher order terms, by the same argument as that presented in the proof of theorem 1, we readily get equation 7. \n\nProof of Theorem 2. Up to an additive constant dependent only on $\\lambda$ and $\\sigma^2$, the optimization criterion, or equation 2, can be rewritten as \n\n$$C_\\omega(\\theta) = -\\frac{1}{2\\sigma^2} U(\\theta, \\lambda, \\omega) \\qquad (13)$$ \n\nNow, putting equations 9 and 13 into equation 3, we get \n\n$$\\hat{\\theta}_{-i} - \\hat{\\theta} \\approx \\{\\nabla_\\theta \\nabla_\\theta^t U(\\hat{\\theta}, \\lambda, \\omega)\\}^{-1} \\nabla_\\theta \\epsilon(y_i, \\eta_{\\hat{\\theta}}(x_i)) \\qquad (14)$$ \n\nPutting equation 14 into equation 7, we get, for the model selection criterion, \n\n$$T_m(\\omega) \\approx \\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\frac{1}{2\\sigma^2} \\epsilon(y_i, \\eta_{\\hat{\\theta}}(x_i)) + \\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\frac{1}{2\\sigma^2} \\nabla_\\theta^t \\epsilon(y_i, \\eta_{\\hat{\\theta}}(x_i)) \\{\\nabla_\\theta \\nabla_\\theta^t U(\\hat{\\theta}, \\lambda, \\omega)\\}^{-1} \\nabla_\\theta \\epsilon(y_i, \\eta_{\\hat{\\theta}}(x_i)) \\qquad (15)$$ \n\nRecall the discussion associated with equation 6 and note that now, up to an additive constant, \n\n$$E\\Big\\{-\\frac{1}{k} \\sum_{(x_j, y_j) \\in \\omega_n} \\log f(y_j|x_j, \\hat{\\theta})\\Big\\} = E\\Big\\{\\frac{1}{k} \\sum_{(x_j, y_j) \\in \\omega_n} \\frac{1}{2\\sigma^2} \\epsilon(y_j, \\eta_{\\hat{\\theta}}(x_j))\\Big\\} \\qquad (16)$$ \n\nAfter some simple algebra, we can obtain the unbiased estimator of equation 10. The result is equation 15 multiplied by $2\\sigma^2$, or equation 11. Thus we prove the first part of the theorem. \n\nNow consider the case when \n\n$$\\epsilon(y, \\eta_\\theta(x)) = (y - \\eta_\\theta(x))^2 \\qquad (17)$$ \n\nThe second term of equation 11 now becomes \n\n$$\\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} 4 (y_i - \\eta_{\\hat{\\theta}}(x_i))^2 \\nabla_\\theta^t \\eta_{\\hat{\\theta}}(x_i) \\{\\nabla_\\theta \\nabla_\\theta^t U(\\hat{\\theta}, \\lambda, \\omega)\\}^{-1} \\nabla_\\theta \\eta_{\\hat{\\theta}}(x_i) \\qquad (18)$$ \n\nAs $n$ approaches infinity, $\\hat{\\theta}$ approaches the true parameters $\\theta_0$, $\\nabla_\\theta \\eta_{\\hat{\\theta}}(x_i)$ approaches $\\nabla_\\theta \\eta_{\\theta_0}(x_i)$, 
and $E((y - \\eta_{\\hat{\\theta}}(x))^2)$ asymptotically equals $\\sigma^2$. Using lemma 4 and lemma 3, we get, for the asymptotic equivalent of equation 18, \n\n$$\\frac{\\sigma^2}{n} \\sum_{(x_i, y_i) \\in \\omega} 2 \\nabla_\\theta^t \\eta_{\\hat{\\theta}}(x_i) \\{\\nabla_\\theta \\nabla_\\theta^t U(\\hat{\\theta}, \\lambda, \\omega)\\}^{-1} 2 \\nabla_\\theta \\eta_{\\hat{\\theta}}(x_i) \\qquad (19)$$ \n\nIf we use the notation $\\hat{\\epsilon}(\\theta, \\omega) = \\frac{1}{n} \\sum_{(x_i, y_i) \\in \\omega} \\epsilon(y_i, \\eta_\\theta(x_i))$, with $\\epsilon(y, \\eta_\\theta(x))$ of the form specified in equation 17, we can get \n\n$$\\nabla_\\theta \\frac{\\partial}{\\partial y_i} \\big(n \\hat{\\epsilon}(\\theta, \\omega)\\big) = -2 \\nabla_\\theta \\eta_\\theta(x_i) \\qquad (20)$$ \n\nCombining this with equation 19 and equation 11, we can readily obtain equation 12. \n\n5 SUMMARY \n\nIn this paper, we used asymptotics to obtain the jackknife estimator, which can be used to get the fit of a model by plugging it into the model selection criterion. Based on the idea of the cross-validation method, we used the negative of the average predictive likelihood as the model selection criterion. We also obtained the asymptotic form of the model selection criterion and proved that, when the parameter optimization criterion is the mean squared error plus a penalty term, this asymptotic form is the same as the form presented by Moody (Moody, 1992). This also serves to prove the asymptotic equivalence of his criterion to the method of cross-validation. \n\nAcknowledgements \n\nThe author thanks all the members of the Institute for Brain and Neural Systems, in particular Professor Leon N Cooper for reading the draft of this paper, and Dr. Nathan Intrator, Michael P. Perrone and Harel Shouval for helpful comments. This research was supported by grants from NSF, ONR and ARO. \n\nReferences \n\nAkaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov and Czaki, editors, Proceedings of the 2nd International Symposium on Information Theory, pages 267-281. \n\nAtkinson, A. C. (1978). Posterior probabilities for choosing a regression model. Biometrika, 65:39-48. \n\nBerger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. 
Springer-Verlag. \n\nEfron, B. and Gong, G. (1983). A leisurely look at the bootstrap, the jackknife and cross-validation. Amer. Stat., 37:36-48. \n\nJennrich, R. (1969). Asymptotic properties of nonlinear least squares estimators. Ann. Math. Stat., 40:633-643. \n\nLindley, D. V. (1968). The choice of variables in multiple regression (with discussion). J. Roy. Stat. Soc., Ser. B, 30:31-66. \n\nMacKay, D. (1991). Bayesian methods for adaptive models. PhD thesis, California Institute of Technology. \n\nMallows, C. L. (1973). Some comments on $C_p$. Technometrics, 15:661-675. \n\nMiller, R. G. (1974). The jackknife - a review. Biometrika, 61:1-15. \n\nMoody, J. E. (1992). The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann. \n\nSchwarz, G. (1978). Estimating the dimension of a model. Ann. Stat., 6:461-464. \n\nStone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Stat. Soc., Ser. B, 36:111-147. \n\nStone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. Roy. Stat. Soc., Ser. B, 39(1):44-47. \n\nZellner, A. (1984). Posterior odds ratios for regression hypotheses: General considerations and some specific results. In Zellner, A., editor, Basic Issues in Econometrics, pages 275-305. University of Chicago Press. \n", "award": [], "sourceid": 700, "authors": [{"given_name": "Yong", "family_name": "Liu", "institution": null}]}