{"title": "Structured Machine Learning for 'Soft' Classification with Smoothing Spline ANOVA and Stacked Tuning, Testing and Evaluation", "book": "Advances in Neural Information Processing Systems", "page_first": 415, "page_last": 422, "abstract": "", "full_text": "Structured Machine Learning For 'Soft' \nClassification with Smoothing Spline \nANOVA and Stacked Tuning, Testing \nand Evaluation \n\nGrace Wahba \nDept of Statistics \nUniversity of Wisconsin \nMadison, WI 53706 \n\nYuedong Wang \nDept of Statistics \nUniversity of Wisconsin \nMadison, WI 53706 \n\nChong Gu \nDept of Statistics \nPurdue University \nWest Lafayette, IN 47907 \n\nRonald Klein, MD \nDept of Ophthalmology \nUniversity of Wisconsin \nMadison, WI 53706 \n\nBarbara Klein, MD \nDept of Ophthalmology \nUniversity of Wisconsin \nMadison, WI 53706 \n\nAbstract \n\nWe describe the use of smoothing spline analysis of variance (SS-ANOVA) \nin the penalized log likelihood context, for learning \n(estimating) the probability p of a '1' outcome, given a training \nset with attribute vectors and outcomes. p is of the form \n$p(t) = e^{f(t)}/(1 + e^{f(t)})$, where, if t is a vector of attributes, f \nis learned as a sum of smooth functions of one attribute plus a \nsum of smooth functions of two attributes, etc. The smoothing \nparameters governing f are obtained by an iterative unbiased risk \nor iterative GCV method. Confidence intervals for these estimates \nare available. \n\n1. Introduction to 'soft' classification and the bias-variance tradeoff. \n\nIn medical risk factor analysis, records of attribute vectors and outcomes (0 or 1) \nfor each example (patient) for n examples are available as training data. Based on \nthe training data, it is desired to estimate the probability p of the 1 outcome for any \nnew examples in the future, given their attribute vectors. 
In 'soft' classification, the \nestimate $\\hat{p}$ of p is of particular interest, and might be used, say, by a physician to \ntell a patient that if he reduces his cholesterol from t to t', then he will reduce his \nrisk of a heart attack from p(t) to p(t'). We assume here that p varies 'smoothly' \nwith any continuous attribute (predictor variable). \n\nIt has long been known that smoothness penalties and Bayes estimates are intimately related \n(see e.g. Kimeldorf and Wahba(1970, 1971), Wahba(1990) and references \nthere). Our philosophy with regard to the use of priors in Bayes estimates is to \nuse them to generate families of reasonable estimates (or families of penalty functionals) \nindexed by those smoothing or regularization parameters which are most \nrelevant to controlling the generalization error. (See Wahba(1990) Chapter 3, also \nWahba(1992)). We then use cross-validation, generalized cross validation (GCV), unbiased \nrisk estimation or some other performance oriented method to choose these \nparameter(s) to minimize a computable proxy for the generalization error. A person \nwho believed the relevant prior might use maximum likelihood (ML) to choose the \nparameters, but ML may not be robust against an unrealistic prior (that is, ML \nmay not do very well from the generalization point of view if the prior is off), see \nWahba(1985). One could assign a hyperprior to these parameters. However, except \nin cases where real prior information is available, there is no reason to believe that \nthe use of hyperpriors will beat out a performance oriented criterion based on a good \nproxy for the generalization error, assuming, of course, that low generalization error \nis the true goal. \n\nO'Sullivan et al(1986) proposed a penalized log likelihood estimate of f; this work \nwas extended to the SS-ANOVA context in Wahba, Gu, Wang and Chappell(1993), \nwhere numerous other relevant references are cited. This paper is available by \nftp from ftp.
stat.wisc.edu, cd pub/wahba in the file soft-class.ps.Z. An \nextended bibliography is available in the same directory as ml-bib.ps. The SS-ANOVA \nallows a variety of interpretable structures for the possible relationships \nbetween the predictor variables and the outcome, and reduces to simple relations \nin some of the attributes, or even, to a two-layer neural net, when the data suggest \nthat such a representation is adequate. \n\n2. Soft classification and penalized log likelihood risk factor estimation \n\nTo describe our 'worldview', let t be a vector of attributes, $t \\in \\Omega \\subseteq T$, where $\\Omega$ is \nsome region of interest in attribute space T. Our 'world' consists of an arbitrarily \nlarge population of potential examples, whose attribute vectors are distributed in \nsome way over $\\Omega$ and, considering all members of this 'world' with attribute vectors \nin a small neighborhood about t, the fraction of them that are 1's is p(t). Our \ntraining set is assumed to be a random sample of n examples from this population, \nwhose outcomes are known, and our goal is to estimate p(t) for any $t \\in \\Omega$. In 'soft' \nclassification, we do not expect one outcome or the other to be a 'sure thing', that \nis we do not expect p(t) to be 0 or 1 for large portions of $\\Omega$. \n\nNext, we review penalized log likelihood risk estimates. Let the training data be \n$\\{y_i, t(i), i = 1, \\ldots, n\\}$ where $y_i$ has the value 1 or 0 according to the classification of \nexample i, and t(i) is the attribute vector for example i. If the n examples are a \nrandom sample from our 'world', then the likelihood function of this data, given \n$p(\\cdot)$, is \n\n$\\mathrm{likelihood}\\{y, p\\} = \\prod_{i=1}^{n} p(t(i))^{y_i} (1 - p(t(i)))^{1-y_i}$, \n\n(1) \n\nwhich is the product of n Bernoulli likelihoods. Define the logit f(t) by $f(t) = \\log[p(t)/(1 - p(t))]$, \nthen $p(t) = e^{f(t)}/(1 + e^{f(t)})$. 
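
As an illustrative sketch only (it is not part of the original analysis), the following Python fragment checks numerically that minus the log of the likelihood (1), rewritten in terms of the logit f, equals the sum over examples of log(1 + e^f) - y f. The function names and the toy outcomes and logit values are assumptions made for illustration.

```python
import math

def prob_from_logit(f):
    # p = e^f / (1 + e^f), written in the equivalent form 1 / (1 + e^(-f))
    return 1.0 / (1.0 + math.exp(-f))

def logit(p):
    # f = log[p / (1 - p)]
    return math.log(p / (1.0 - p))

def neg_log_lik_from_p(y, p):
    # minus the log of the product of Bernoulli likelihoods in (1)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                for yi, pi in zip(y, p))

def neg_log_lik_from_f(y, f):
    # the same quantity expressed through the logit values f(t(i))
    return sum(math.log(1.0 + math.exp(fi)) - yi * fi for yi, fi in zip(y, f))

# toy outcomes and logit values (hypothetical, for illustration only)
y = [1, 0, 1, 1, 0]
f = [0.3, -1.2, 2.0, -0.4, 0.1]
p = [prob_from_logit(fi) for fi in f]
print(neg_log_lik_from_p(y, p), neg_log_lik_from_f(y, f))
```

The two printed values should agree up to rounding, which is the identity used when the likelihood is re-expressed in f.
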
Substituting in f and taking logs \ngives \n\n$-\\log \\mathrm{likelihood}\\{y, f\\} = \\mathcal{L}(y, f) = \\sum_{i=1}^{n} [\\log(1 + e^{f(t(i))}) - y_i f(t(i))]$. \n\n(2) \n\nWe estimate f assuming that it is in some space $\\mathcal{H}$ of smooth functions. (Technically, \n$\\mathcal{H}$ is a reproducing kernel Hilbert space, see Wahba(1990), but you don't need to \nknow what this is to read on). The fact that f is assumed 'smooth' makes the \nmethods here very suitable for medical data analysis. The penalized log likelihood \nestimate $f_\\lambda$ of f will be obtained as the minimizer in $\\mathcal{H}$ of \n\n$\\mathcal{L}(y, f) + \\frac{n}{2} \\lambda J(f)$ \n\n(3) \n\nwhere J(f) is a suitable 'smoothness' penalty. A simple example is T = [0,1] and \n$J(f) = \\int_0^1 (f^{(m)}(t))^2 dt$, in which case $f_\\lambda$ is a polynomial spline of degree 2m - 1. If \n\n$J(f) = \\sum_{\\alpha_1 + \\cdots + \\alpha_d = m} \\frac{m!}{\\alpha_1! \\cdots \\alpha_d!} \\int \\cdots \\int ( \\frac{\\partial^m f}{\\partial t_1^{\\alpha_1} \\cdots \\partial t_d^{\\alpha_d}} )^2 \\prod_j dt_j$ \n\n(4) \n\nthen $f_\\lambda$ is a thin plate spline. The thin plate spline is a linear combination of \npolynomials of total degree less than m in d variables, and certain radial basis functions. \nFor more details and other penalty functionals which result in rbf's, see Wahba(1980, \n1990, 1992). \n\nThe likelihood will be maximized, equivalently $\\mathcal{L}(y, f)$ minimized, if p(t(i)) is 1 or 0 according as \n$y_i$ is 1 or 0. Thus, in the (full-rank) spline case, as $\\lambda \\to 0$, $f_\\lambda$ tends to $+\\infty$ or $-\\infty$ \nat the data points. Therefore, by letting $\\lambda$ be small, we can come close to fitting \nthe data points exactly, but unless the 1's and 0's are well separated in attribute \nspace, $f_\\lambda$ will be a very 'wiggly' function and the generalization error (not precisely \ndefined yet) may be large. \n\nThe choice of $\\lambda$ represents a tradeoff between overfitting and underfitting the data \n(bias-variance tradeoff). It is important in practice to find a good value of $\\lambda$. We now define \nwhat we mean by a good value of $\\lambda$. Given the family $p_\\lambda$, $\\lambda > 0$, we want to choose \n$\\lambda$ 
so that $p_\\lambda$ is close to the 'true' but unknown p so that, if new examples arrive with \nattribute vector in a neighborhood of t, $p_\\lambda(t)$ will be a good estimate of the fraction \nof them that are 1's. 'Closeness' can be defined in various reasonable ways. We use \nthe Kullback-Leibler (KL) distance (not a real distance!). The KL distance between \ntwo probability measures $(g, \\tilde{g})$ is defined as $KL(g, \\tilde{g}) = E_g[\\log(g/\\tilde{g})]$, where $E_g$ \nmeans expectation given g is the true distribution. If $\\nu(t)$ is some probability \nmeasure on T (say, a proxy for the distribution of the attributes in the population), \nthen define $KL_\\nu(p, p_\\lambda)$ (for Bernoulli random variables) with respect to $\\nu$ as \n\n$KL_\\nu(p, p_\\lambda) = \\int [ p(t) \\log ( \\frac{p(t)}{p_\\lambda(t)} ) + (1 - p(t)) \\log ( \\frac{1 - p(t)}{1 - p_\\lambda(t)} ) ] d\\nu(t)$. \n\n(5) \n\nSince $KL_\\nu$ is not computable from the data, it is necessary to develop a computable \nproxy for it. By a computable proxy is meant a function of $\\lambda$ that can be calculated \nfrom the training set which has the property that its minimizer is a good estimate \nof the minimizer of $KL_\\nu$. By letting $p_\\lambda(t) = e^{f_\\lambda(t)}/(1 + e^{f_\\lambda(t)})$ it is seen that to \nminimize $KL_\\nu$, it is only necessary to minimize \n\n$\\int [\\log(1 + e^{f_\\lambda(t)}) - p(t) f_\\lambda(t)] d\\nu(t)$ \n\n(6) \n\nover $\\lambda$, since (5) and (6) differ by something that does not depend on $\\lambda$. Leaving-out-half \ncross validation ($\\frac{1}{2}$CV) is one conceptually simple and generally defensible \n(albeit possibly wasteful) way of choosing $\\lambda$ to minimize a proxy for $KL_\\nu(p, p_\\lambda)$. \nThe n examples are randomly divided in half and the first n/2 examples are used \nto compute $p_\\lambda$ for a series of trial values of $\\lambda$. Then, the remaining n/2 examples \nare used to compute \n\n$KL_{\\frac{1}{2}CV}(\\lambda)$ 
$= \\frac{2}{n} \\sum_{i=n/2+1}^{n} [\\log(1 + e^{f_\\lambda(t(i))}) - y_i f_\\lambda(t(i))]$ \n\n(7) \n\nfor the trial values of $\\lambda$. Since the expected value of $y_i$ is p(t(i)), (7) is, for each $\\lambda$, an \nunbiased estimate of (6) with $d\\nu$ the sampling distribution of the $\\{t(1), \\ldots, t(n/2)\\}$. \n$\\lambda$ would then be chosen by minimizing (7) over the trial values. It is inappropriate to \njust evaluate (7) using the same data that was used to obtain $f_\\lambda$, as that would lead \nto overfitting the data. Variations on (7) are obtained by successively leaving out \ngroups of data. Leaving-out-one versions of (7) may be defined, but the computation \nmay be prohibitive. \n\n3. Newton-Raphson Iteration and the Unbiased Risk estimate of $\\lambda$. \n\nWe use the unbiased risk estimate given in Craven and Wahba(1979) for smoothing \nspline estimation with Gaussian errors, which has been adapted by Gu(1992a) for \nthe Bernoulli case. To describe the estimate we need to describe the Newton-Raphson \niteration for minimizing (3). Let $b(f) = \\log(1 + e^f)$, then $\\mathcal{L}(y, f) = \\sum_{i=1}^{n} [b(f(t(i))) - y_i f(t(i))]$. \nIt is easy to show that $E y_i = p(t(i)) = b'(f(t(i)))$ and \n$\\mathrm{var}\\, y_i = p(t(i))(1 - p(t(i))) = b''(f(t(i)))$. Represent f either exactly by using a \nbasis for the (known) n-space of functions containing the solution, or approximately \nby suitable approximating basis functions, to get \n\n$f \\simeq \\sum_{k=1}^{N} c_k B_k$. \n\n(8) \n\nThen we need to find $c = (c_1, \\ldots, c_N)'$ to minimize \n\n$I_\\lambda(c) = \\sum_{i=1}^{n} [ b(\\sum_{k=1}^{N} c_k B_k(t(i))) - y_i (\\sum_{k=1}^{N} c_k B_k(t(i))) ] + \\frac{n}{2} \\lambda c' \\Sigma c$, \n\n(9) \n\nwhere $\\Sigma$ is the necessarily non-negative definite matrix determined by \n$J(\\sum_k c_k B_k) = c' \\Sigma c$. The gradient $\\nabla I_\\lambda$ and the Hessian $\\nabla^2 I_\\lambda$ of $I_\\lambda$ are given by \n\n$\\nabla I_\\lambda = X'(p_c - y) + n \\lambda \\Sigma c$, \n\n(10) \n\n$\\nabla^2 I_\\lambda = X' W_c X + n \\lambda \\Sigma$, \n\n(11) \n\nwhere X is the matrix with ijth entry $B_j(t(i))$, $p_c$ is the vector with ith entry $p_c(t(i))$ \ngiven by $p_c(t(i)) = e^{f_c(t(i))}/(1 + e^{f_c(t(i))})$ where $f_c(\\cdot) = \\sum_{k=1}^{N} c_k B_k(\\cdot)$, and $W_c$ is the diagonal \nmatrix with iith entry $p_c(t(i))(1 - p_c(t(i)))$. Given the lth Newton-Raphson iterate \n$c^{(l)}$, $c^{(l+1)}$ is given by \n\n$c^{(l+1)} = c^{(l)} - (X' W_{c^{(l)}} X + n \\lambda \\Sigma)^{-1} (X'(p_{c^{(l)}} - y) + n \\lambda \\Sigma c^{(l)})$ \n\n(12) \n\nand $c^{(l+1)}$ is the minimizer of \n\n$I_\\lambda^{(l)}(c) = \\| z^{(l)} - W_{c^{(l)}}^{1/2} X c \\|^2 + n \\lambda c' \\Sigma c$, \n\n(13) \n\nwhere $z^{(l)}$, the so-called pseudo-data, is given by \n\n$z^{(l)} = W_{c^{(l)}}^{-1/2} (y - p_{c^{(l)}}) + W_{c^{(l)}}^{1/2} X c^{(l)}$. \n\n(14) \n\nThe 'predicted' value $\\hat{z}^{(l)} = W_{c^{(l)}}^{1/2} X c$, where c is the minimizer of (13), is related to \nthe pseudo-data $z^{(l)}$ by \n\n$\\hat{z}^{(l)} = A^{(l)}(\\lambda) z^{(l)}$, \n\n(15) \n\nwhere $A^{(l)}(\\lambda)$ is the smoother matrix given by \n\n$A^{(l)}(\\lambda) = W_{c^{(l)}}^{1/2} X (X' W_{c^{(l)}} X + n \\lambda \\Sigma)^{-1} X' W_{c^{(l)}}^{1/2}$. \n\n(16) \n\nIn Wahba(1990), Section 9.2 (footnote 1), it was proposed to obtain a GCV score for $\\lambda$ \nin (9) as follows: For fixed $\\lambda$, iterate (12) to convergence. Define $V^{(l)}(\\lambda) = \\frac{1}{n} \\| (I - A^{(l)}(\\lambda)) z^{(l)} \\|^2 / (\\frac{1}{n} \\mathrm{tr}(I - A^{(l)}(\\lambda)))^2$. \nLetting L be the converged value of l, \ncompute \n\n$V^{(L)}(\\lambda) = \\frac{\\frac{1}{n} \\| (I - A^{(L)}(\\lambda)) z^{(L)} \\|^2}{(\\frac{1}{n} \\mathrm{tr}(I - A^{(L)}(\\lambda)))^2} \\simeq \\frac{\\frac{1}{n} \\| W_{c^{(L)}}^{-1/2} (y - p_{c^{(L)}}) \\|^2}{(\\frac{1}{n} \\mathrm{tr}(I - A^{(L)}(\\lambda)))^2}$ \n\n(17) \n\nand minimize $V^{(L)}$ with respect to $\\lambda$. Gu(1992a) showed (since the variance \nis known once the mean is known here) that the unbiased risk estimate $U(\\lambda)$ in \nCraven and Wahba(1979) can also be adapted to this problem as \n\n$U^{(l)}(\\lambda) = \\frac{1}{n} \\| W_{c^{(l)}}^{-1/2} (y - p_{c^{(l)}}) \\|^2 + \\frac{2}{n} \\mathrm{tr} A^{(l)}(\\lambda)$. \n\n(18) \n\nHe also proposed an alternating iteration, different from that described in \nWahba(1990), namely, given $c^{(l)} = c^{(l)}(\\lambda^{(l)})$, find $\\lambda = \\lambda^{(l+1)}$ to minimize (18). \nGiven $\\lambda^{(l+1)}$
, do a Newton step to get $c^{(l+1)}$, get $\\lambda^{(l+2)}$ by minimizing (18), continue \nuntil convergence. He showed that the alternating iteration gave better estimates of \n$\\lambda$ using V than the iteration in Wahba(1990), as measured by the KL-distance. His \nresults (with the alternating iteration) suggested U had somewhat of an advantage \nover V, and that is what we are using in the present work. Zhao et al, this volume, \nhave used V successfully with the alternating iteration. \n\nFootnote 1: The definition of $\\lambda$ there differs from the definition here by a factor of n/2. Please \nnote the typographical error in (9.2.18) there where $\\lambda$ should be $2\\lambda$. \n\n4. Smoothing spline analysis of variance (SS-ANOVA) \n\nIn SS-ANOVA, $f(t) = f(t_1, \\ldots, t_d)$ is decomposed as \n\n$f(t) = \\mu + \\sum_{\\alpha} f_\\alpha(t_\\alpha) + \\sum_{\\alpha < \\beta} f_{\\alpha\\beta}(t_\\alpha, t_\\beta) + \\cdots$ \n\n(19) \n\nwhere the terms in the expansion are uniquely determined by side conditions which \ngeneralize the side conditions of the usual ANOVA decompositions. Let the logit f(t) \nbe of the form (19) where the terms are summed over $\\alpha \\in M$, $\\alpha, \\beta \\in M$, etc. where \nM indexes terms which are chosen to be retained in the model after a model selection \nprocedure. Then $f_{\\lambda,\\theta}$, an estimate of f, is obtained as the minimizer of \n\n$\\mathcal{L}(y, f) + \\lambda J_\\theta(f)$ \n\n(20) \n\nwhere \n\n$J_\\theta(f) = \\sum_{\\alpha \\in M} \\theta_\\alpha^{-1} J_\\alpha(f_\\alpha) + \\sum_{\\alpha,\\beta \\in M} \\theta_{\\alpha\\beta}^{-1} J_{\\alpha\\beta}(f_{\\alpha\\beta}) + \\cdots$ \n\n(21) \n\nThe $J_\\alpha$, $J_{\\alpha\\beta}$, ... are quadratic 'smoothness' penalty functionals, and the $\\theta$'s satisfy \na single constraint. For certain spline-like smoothness penalties, the minimizer of \n(20) is known to be in the span of a certain set of n functions, and the vector \nc of coefficients of these functions can (for fixed $(\\lambda, \\theta)$) be chosen by the Newton \nRaphson iteration. Both $\\lambda$ and the $\\theta$'s are estimated by the unbiased risk estimate \nof Gu using RKPACK (available from netlib@research.att.com) as a subroutine \nat each Newton iteration. 
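
The Newton-Raphson update and the unbiased risk score described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RKPACK-based procedure used in the paper: the basis is taken to be just B1(t) = 1 and B2(t) = t, the penalty matrix is assumed to be diag(0, 1) so that only the slope is penalized, the data are simulated, and the smoothing parameter is chosen from a fixed grid after iterating to convergence at each value, which is simpler than the alternating iteration of Gu(1992a). All function names are hypothetical.

```python
import math
import random

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def make_toy_data(n, seed):
    # hypothetical data: one attribute t in [-0.5, 0.5], true logit f(t) = 3 t
    rng = random.Random(seed)
    t = [i / (n - 1) - 0.5 for i in range(n)]
    y = [1 if rng.random() < sigmoid(3.0 * ti) else 0 for ti in t]
    return t, y

def weighted_sums(t, c0, c1):
    # p_c, the weights p_c (1 - p_c), and the entries of X' W_c X
    p = [sigmoid(c0 + c1 * ti) for ti in t]
    w = [pi * (1.0 - pi) for pi in p]
    m00 = sum(w)
    m01 = sum(wi * ti for wi, ti in zip(w, t))
    m11 = sum(wi * ti * ti for wi, ti in zip(w, t))
    return p, w, m00, m01, m11

def fit_newton(t, y, lam, n_iter=100, tol=1e-10):
    # Newton-Raphson update with the assumed penalty matrix diag(0, 1)
    n = len(y)
    c0, c1 = 0.0, 0.0
    for _ in range(n_iter):
        p, w, m00, m01, m11 = weighted_sums(t, c0, c1)
        # gradient: X'(p - y) + n * lam * Sigma c
        g0 = sum(pi - yi for pi, yi in zip(p, y))
        g1 = sum((pi - yi) * ti for pi, yi, ti in zip(p, y, t)) + n * lam * c1
        # Hessian: X' W X + n * lam * Sigma, solved here by the 2 x 2 inverse
        h11 = m11 + n * lam
        det = m00 * h11 - m01 * m01
        d0 = (h11 * g0 - m01 * g1) / det
        d1 = (m00 * g1 - m01 * g0) / det
        c0, c1 = c0 - d0, c1 - d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return c0, c1

def unbiased_risk(t, y, lam, c0, c1):
    # U = (1/n) ||W^(-1/2) (y - p)||^2 + (2/n) tr A
    n = len(y)
    p, w, m00, m01, m11 = weighted_sums(t, c0, c1)
    resid = sum((yi - pi) ** 2 / wi for yi, pi, wi in zip(y, p, w)) / n
    det_m = m00 * m11 - m01 * m01
    det_h = m00 * (m11 + n * lam) - m01 * m01
    # tr A = tr[(X'WX + n lam Sigma)^(-1) X'WX], which is 1 + det_m / det_h here
    trace_a = 1.0 + det_m / det_h
    return resid + 2.0 * trace_a / n

t, y = make_toy_data(200, 0)
grid = [10.0 ** k for k in range(-4, 2)]
scores = []
for lam in grid:
    c0, c1 = fit_newton(t, y, lam)
    scores.append((unbiased_risk(t, y, lam, c0, c1), lam))
best_u, best_lam = min(scores)
print(best_lam, best_u)
```

In the SS-ANOVA setting the same two steps apply with a larger basis and with several smoothing parameters, so the explicit 2 x 2 inverse would be replaced by a general linear solver.
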
Details of smoothing spline ANOVA decompositions may \nbe found in Wahba(1990) and in Gu and Wahba(1993) (also available by ftp to \nftp.stat.wisc.edu, cd to pub/wahba, in the file ssanova.ps.Z). In Wahba et \nal(1993) op cit, we estimate the risk of diabetes given some of the attributes in the \nPima-Indian data base. There M was chosen partly by a screening process using \nparametric GLIM models and partly by a leaving-out-approximately-1/3 procedure. \nContinuing work involves development of confidence intervals based on Gu(1992b), \ndevelopment of numerical methods suitable for very large data sets based on Girard's(1991) \nrandomized trace estimation, and further model selection issues. \n\nIn the Figures we provide some preliminary analyses of data from the Wisconsin \nEpidemiological Study of Diabetic Retinopathy (WESDR, Klein et al 1988). \nThe data used here is from people with early onset diabetes participating in the \nWESDR study. Figure 1(left) gives a plot of body mass index (bmi) (a measure \nof obesity) vs age (age) for 669 instances (subjects) in the WESDR study \nthat had no diabetic retinopathy or non proliferative retinopathy at the start of \nthe study. Those subjects whose retinopathy had progressed four years later are \nmarked as * and those with no progression are marked as \u00b7. The contours are \nlines of estimated posterior standard deviation of the estimate $\\hat{p}$ of the probability \nof progression. These contours are used to delineate a region in which $\\hat{p}$ \nis deemed to be reliable. Glycosylated hemoglobin (gly), a measure of blood \nsugar control, was also used in the estimation of p. 
A model of the form \n$p = e^f/(1 + e^f)$, $f(\\mathrm{age}, \\mathrm{gly}, \\mathrm{bmi}) = \\mu + f_1(\\mathrm{age}) + b \\cdot \\mathrm{gly} + f_3(\\mathrm{bmi}) + f_{13}(\\mathrm{age}, \\mathrm{bmi})$ \nwas selected using some of the screening procedures described in Wahba et al(1993), \nalong with an examination of the estimated multiple smoothing parameters, which \nindicated that the linear term in gly was sufficient to describe the (quite strong) \ndependence on gly. Figure 1(right) shows the estimated probability of progression \ngiven by this model. Figure 2(left) gives cross sections of the fitted model of Figure \n1(right), and Figure 2(right) gives another cross section, along with its confidence \ninterval. Interesting observations can be made; for example, persons in their late \n20's with higher gly and bmi are at greatest risk for progression of the disease. \n\nFigure 1: Left: Data and contours of constant posterior standard deviation at \nthe median gly, as a function of age and bmi. Right: Estimated probability of \nprogression at the median gly, as a function of age and bmi. \n\nFigure 2: Left: Eight cross sections of the right panel of Figure 1: estimated probability \nof progression as a function of age, at four levels of bmi by two of gly. \nq1, ... q4 are the quartiles at .125, .375, .625 and .875. 
Right: Cross section of the \nright panel of Figure 1 for bmi and gly at their medians, as a function of age, \nwith Bayesian 'confidence interval' (shaded) which generalizes Gu(1992b) to the \nmultivariate case. \n\nAcknowledgements \n\nSupported by NSF DMS-9121003 and DMS-9301511, and NEI-NIH EY09946 and \nEY03083. \n\nReferences \n\nCraven, P. & Wahba, G. (1979), 'Smoothing noisy data with spline functions: \nestimating the correct degree of smoothing by the method of generalized cross-validation', \nNumer. Math. 31, 377-403. \n\nGirard, D. (1991), 'Asymptotic optimality of the fast randomized versions of GCV \nand $C_L$ in ridge regression and regularization', Ann. Statist. 19, 1950-1963. \n\nGu, C. (1992a), 'Cross-validating non-Gaussian data', J. Comput. Graph. Stats. \n1, 169-179. \n\nGu, C. (1992b), 'Penalized likelihood regression: a Bayesian analysis', Statistica \nSinica 2, 255-264. \n\nGu, C. & Wahba, G. (1993), 'Smoothing spline ANOVA with component-wise \nBayesian \"confidence intervals\"', J. Computational and Graphical Statistics 2, 1-21. \n\nKimeldorf, G. & Wahba, G. (1970), 'A correspondence between Bayesian estimation \nof stochastic processes and smoothing by splines', Ann. Math. Statist. 41, 495-502. \n\nKlein, R., Klein, B., Moss, S., Davis, M., & DeMets, D. (1988), 'Glycosylated \nhemoglobin predicts the incidence and progression of diabetic retinopathy', JAMA \n260, 2864-2871. \n\nO'Sullivan, F., Yandell, B. & Raynor, W. (1986), 'Automatic smoothing of regression \nfunctions in generalized linear models', J. Am. Stat. Assoc. 81, 96-103. \n\nWahba, G. (1980), 'Spline bases, regularization, and generalized cross validation for \nsolving approximation problems with large quantities of noisy data', in W. Cheney, \ned., 'Approximation Theory III', Academic Press, pp. 905-912. \n\nWahba, G. 
(1985), 'A comparison of GCV and GML for choosing the smoothing \nparameter in the generalized spline smoothing problem', Ann. Statist. 13, 1378-1402. \n\nWahba, G. (1990), Spline Models for Observational Data, SIAM. CBMS-NSF Regional \nConference Series in Applied Mathematics, vol. 59. \n\nWahba, G. (1992), 'Multivariate function and operator estimation, based on smoothing \nsplines and reproducing kernels', in M. Casdagli & S. Eubank, eds, 'Nonlinear \nModeling and Forecasting, SFI Studies in the Sciences of Complexity, Proc. Vol \nXII', Addison-Wesley, pp. 95-112. \n\nWahba, G., Gu, C., Wang, Y. & Chappell, R. (1993), 'Soft classification, a.k.a. \nrisk estimation, via penalized log likelihood and smoothing spline analysis of \nvariance', to appear, Proc. Santa Fe Workshop on Supervised Machine Learning, D. \nWolpert and A. Lapedes, eds, and Proc. CLNL92, T. Petsche, ed, with permission \nof all eds. \n", "award": [], "sourceid": 849, "authors": [{"given_name": "Grace", "family_name": "Wahba", "institution": null}, {"given_name": "Yuedong", "family_name": "Wang", "institution": null}, {"given_name": "Chong", "family_name": "Gu", "institution": null}, {"given_name": "Ronald", "family_name": "Klein, MD", "institution": null}, {"given_name": "Barbara", "family_name": "Klein, MD", "institution": null}]}