{"title": "Smoothing Regularizers for Projective Basis Function Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 585, "page_last": 591, "abstract": null, "full_text": "Smoothing Regularizers for \n\nProjective Basis Function Networks \n\nJohn E. Moody and Thorsteinn S. Rognvaldsson * \nDepartment of Computer Science, Oregon Graduate Institute \n\nPO Box 91000, Portland, OR 97291 \n\nmoody@cse.ogi.edu \n\ndenni@cca.hh.se \n\nAbstract \n\nSmoothing regularizers for radial basis functions have been studied extensively, \nbut no general smoothing regularizers for projective basis junctions (PBFs), such \nas the widely-used sigmoidal PBFs, have heretofore been proposed. We de(cid:173)\nrive new classes of algebraically-simple mH'-order smoothing regularizers for \nnetworks of the form f(W, x) = L7=1 Ujg [x T Vj + Vjol + uo, with general \nprojective basis functions g[.]. These regularizers are: \n\nRa(W,m) = LU;lIvjIl2m-1 GlobalForm \n\nN \n\nj=1 \nN \n\nj=1 \n\nRdW,m) = LU;lIvjll2m \n\nLocal Form \n\nThese regularizers bound the corresponding m t\" -order smoothing integral \n\nwhere W denotes all the network weights {Uj, uo, vj, vo}, and O(x) is a weight(cid:173)\ning function on the D-dimensional input space. The global and local cases are \ndistinguished by different choices of O( x) . \nThe simple algebraic forms R(W, m) enable the direct enforcement of smooth(cid:173)\nness without the need for costly Monte-Carlo integrations of S (W, m). The new \nregularizers are shown to yield better generalization errors than weight decay \nwhen the implicit assumptions in the latter are wrong. Unlike weight decay, the \nnew regularizers distinguish between the roles of the input and output weights \nand capture the interactions between them. \n\n\u2022 Address as of September I. 1996: Centre for Computer Architecture. University of Halmstad, \n\nP.O.Box 823, S-301 18 Halmstad, Sweden \n\n\f586 \n\n1. E. Moody and T. S. 
Rognvaldsson \n\n1 Introduction: What are the right biases? \n\nRegularization is a technique for reducing prediction risk by balancing model bias and model variance. A regularizer R(W) imposes prior constraints on the network parameters W. Using squared error as the most common example, the objective functional that is minimized during training is \n\nE = (1/2M) sum_{i=1}^{M} [y^(i) - f(W, x^(i))]^2 + lambda R(W) ,   (1) \n\nwhere y^(i) are target values corresponding to the inputs x^(i), M is the number of training patterns, and the regularization parameter lambda controls the importance of the prior constraints relative to the fit to the data. Several approaches can be applied to estimate lambda (e.g. Eubank (1988) or Wahba (1990)). \n\nRegularization reduces model variance at the cost of some model bias. An important question arises: What are the right biases? (Geman, Bienenstock & Doursat 1992). A good choice of R(W) will result in lower expected prediction error than will a poor choice. \nWeight decay is often used effectively, but it is an ad hoc technique that controls weight values without regard to the function f(.). It is thus not necessarily optimal and not appropriate for arbitrary function parameterizations. It will give very different results, depending upon whether a function is parameterized, for example, as f(w, x) or as f(w^{-1}, x). \nSince many real world problems are intrinsically smooth, we propose that in many cases, an appropriate bias to impose is to favor solutions with low mth-order curvature. Direct penalization of curvature is a parametrization-independent approach. The desired regularizer is the standard D-dimensional curvature functional of order m: \n\nS(W, m) = integral d^D x Omega(x) ||d^m f(W, x) / dx^m||^2 .   (2) \n\nHere ||.|| denotes the ordinary Euclidean tensor norm and d^m/dx^m denotes the mth-order differential operator. 
The weighting function Omega(x) ensures that the integral converges and determines the region over which we require the function to be smooth. Omega(x) is not required to be equal to the input density p(x), and will most often be different. \nThe use of smoothing functionals like (2) has been extensively studied for smoothing splines (Eubank 1988, Hastie & Tibshirani 1990, Wahba 1990) and for radial basis function (RBF) networks (Powell 1987, Poggio & Girosi 1990, Girosi, Jones & Poggio 1995). However, no general class of smoothing regularizers that directly enforce smoothness S(W, m) for projective basis functions (PBFs), such as the widely used sigmoidal PBFs, has been previously proposed. \n\nSince explicit enforcement of smoothness using (2) requires costly, impractical Monte-Carlo integrations,^1 we derive algebraically-simple regularizers R(W, m) that tightly bound S(W, m). \n\n2 Derivation of Simple Regularizers from Smoothing Functionals \n\nWe consider single hidden layer networks with D input variables, N_h nonlinear hidden units, and N_o linear output units. For clarity, we set N_o = 1, and drop the subscript on N_h \n\n^1 Note that (2) is not just one integral, but actually O(D^m) integrals, since the norm of the operator d^m/dx^m has O(D^m) terms. This is extremely expensive to compute for large D or large m. \n\n\f587 \n\n(the derivation is trivially extended to the case N_o > 1). Thus, our network function is \n\nf(W, x) = sum_{j=1}^{N} u_j g[theta_j, x] + u_0 ,   (3) \n\nwhere g[.] are the nonlinear transfer functions of the internal hidden units, x in R^D is the input vector, theta_j are the parameters associated with internal unit j, and W denotes all parameters in the network. \n\nFor regularizers R(W), we will derive strict upper bounds for S(W, m). We desire the regularizers to be as general as possible so that they can easily be applied to different network models. 
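A direct Monte-Carlo estimate of the smoothing functional (2) illustrates why explicit enforcement is costly. The sketch below is our illustrative code, not the paper's; all function and parameter names are ours. It estimates S(W, 1) for a tanh PBF network under a wide zero-mean Gaussian weighting; already for m = 1 the gradient has D components, and for general m the derivative tensor has O(D^m) components:

```python
import numpy as np

def mc_smoothness_m1(u, V, v0, sigma=3.0, n_samples=50000, seed=0):
    # Monte-Carlo estimate of S(W, 1) = E_{x ~ Omega}[ ||df/dx||^2 ]
    # for f(x) = sum_j u_j tanh(x . v_j + v_j0) + u_0, where Omega is
    # a zero-mean Gaussian of width sigma (a wide, global weighting).
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, size=(n_samples, V.shape[1]))  # x ~ Omega
    z = x @ V.T + v0                # pre-activations z_j(x), shape (n, N)
    gprime = 1.0 - np.tanh(z) ** 2  # g'(z_j) for g = tanh
    grads = (u * gprime) @ V        # df/dx per sample, shape (n, D)
    return float(np.mean(np.sum(grads ** 2, axis=1)))
```

Each additional order m multiplies the number of tensor components by D, so the cost of such estimates grows as O(D^m); this is the motivation for seeking closed-form algebraic bounds instead.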
Without making any assumptions about Omega(x) or g(.), we have the upper bound \n\nS(W, m) <= N sum_{j=1}^{N} u_j^2 integral d^D x Omega(x) ||d^m g[theta_j, x] / dx^m||^2 ,   (4) \n\nwhich follows from the inequality (sum_{j=1}^{N} a_j)^2 <= N sum_{j=1}^{N} a_j^2. We consider two possible options for the weighting function Omega(x). One is to require global smoothness, in which case Omega(x) is a very wide function that covers all relevant parts of the input space (e.g. a very wide gaussian distribution or a constant distribution). The other option is to require local smoothness, in which case Omega(x) approaches zero outside small regions around some reference points (e.g. the training data). \n\n2.1 Projective Basis Representations \n\nProjective basis functions (PBFs) are of the form g[theta_j, x] = g[x^T v_j + v_j0], where theta_j = {v_j, v_j0}, v_j = (v_j1, v_j2, ..., v_jD) is the vector of weights connecting hidden unit j to the inputs, and v_j0 is the bias, offset, or threshold. For PBFs, expression (4) simplifies to \n\nS(W, m) <= N sum_{j=1}^{N} u_j^2 ||v_j||^{2m} I_j(W, m) ,   (5) \n\nwith \n\nI_j(W, m) = integral d^D x Omega(x) (d^m g[z_j(x)] / dz_j^m)^2 ,   (6) \n\nwhere z_j(x) = x^T v_j + v_j0. \n\nAlthough the most commonly used g[.]'s are sigmoids, our analysis applies to many other forms, for example flexible fourier units, polynomials, and rational functions.^3 The classes of PBF transfer functions g[.] that are applicable (as determined by Omega(x)) are those for which the integral (8) is finite and well-defined. \n\n2.2 Global weighting \n\nFor the global case, we select a gaussian form for the weighting function \n\nOmega_G(x) = (sqrt(2 pi) sigma)^{-D} exp[-||x||^2 / (2 sigma^2)]   (7) \n\n^2 Throughout, we use small-letter boldface to denote vector quantities. \n^3 See for example Moody & Yarvin (1992). \n\n\f588 \n\nand require sigma to be large. 
Integrating out all dimensions, except the one associated with the projection vector v_j, we are left with \n\nI_j(W, m) = (1 / (sqrt(2 pi) sigma ||v_j||)) integral_{-inf}^{+inf} dz exp[-(z - v_j0)^2 / (2 sigma^2 ||v_j||^2)] (d^m g[z] / dz^m)^2 .   (8) \n\nIf (d^m g[z]/dz^m)^2 is integrable and approaches zero outside a region that is small compared to sigma, we can bound (8) by setting the exponential equal to unity. This implies \n\nI_j(W, m) <= I(m) / ||v_j|| ,  with  I(m) = (1 / (sigma sqrt(2 pi))) integral_{-inf}^{+inf} dz (d^m g[z] / dz^m)^2 .   (9) \n\nThe bound of equation (5) then becomes \n\nS(W, m) <= N I(m) sum_{j=1}^{N} u_j^2 ||v_j||^{2m-1} = N I(m) R_G(W, m) ,   (10) \n\nwhere the subscript G stands for global. Since lambda absorbs all constant multiplicative factors, we need only weigh R_G(W, m) into the training objective function. \n\n2.3 Local weighting \n\nFor the local case, we consider weighting functions of the general form \n\nOmega_L(x) = (1/M) sum_{i=1}^{M} Omega(x^(i), sigma) ,   (11) \n\nwhere x^(i) are a set of points, and Omega(x^(i), sigma) is a function that decays rapidly for large ||x - x^(i)||. We require that lim_{sigma -> 0} Omega(x^(i), sigma) = delta(x - x^(i)). Thus, when the x^(i) are the training data points, the limiting distribution of (11) is the empirical distribution. \nIn the limit sigma -> 0, equation (5) becomes \n\nS(W, m) <= (N/M) sum_{i=1}^{M} ( sum_{j=1}^{N} u_j^2 ||v_j||^{2m} (d^m g[z_j(x^(i))] / dz^m)^2 ) .   (12) \n\nFor the empirical distribution, we could compute the expression within parentheses in (12) for each input pattern x^(i) during training and use it as our regularization cost. This is done by Bishop (1993) for the special case m = 2. However, this requires explicit design for each transfer function and becomes increasingly complicated as we go to higher m. To construct a simpler and more general form, we instead assume that the mth derivative of g[.] is bounded from above by C_L(m) == max_z (d^m g[z] / dz^m)^2. This gives the bound \n\nS(W, m) <= N C_L(m) sum_{j=1}^{N} u_j^2 ||v_j||^{2m} = N C_L(m) R_L(W, m)   (13) \n\nfor the maximum local curvature of the function (the subscript L denotes local limit). 
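Because R_G and R_L are simple algebraic functions of the weights, they can be evaluated (and differentiated) directly in O(N D) time. The following minimal numpy sketch of the two forms and of the penalized objective (1) is our illustrative code, not the paper's; the function names and default parameters are ours:

```python
import numpy as np

def smoothing_regularizers(u, V, m):
    # R_G(W, m) = sum_j u_j^2 ||v_j||^(2m-1)  (global form)
    # R_L(W, m) = sum_j u_j^2 ||v_j||^(2m)    (local form)
    # u: (N,) output weights; V: (N, D) input weight vectors.
    norms = np.linalg.norm(V, axis=1)
    R_global = np.sum(u ** 2 * norms ** (2 * m - 1))
    R_local = np.sum(u ** 2 * norms ** (2 * m))
    return R_global, R_local

def objective(y, y_hat, u, V, m=2, lam=1e-3, local=False):
    # Penalized squared-error objective of equation (1), with the
    # regularizer R(W) taken as either the global or the local form.
    mse = np.mean((y - y_hat) ** 2) / 2.0
    R_g, R_l = smoothing_regularizers(u, V, m)
    return mse + lam * (R_l if local else R_g)
```

The extra cost per training step is one pass over the hidden-unit weights, in contrast to the O(D^m) Monte-Carlo integrations that a direct penalty on S(W, m) would require.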
\n\n\f589 \n\n3 Empirical Example \n\nWe have done extensive simulation studies that demonstrate the efficacy of our new regularizers for PBF networks on a variety of problems. An account is given in Moody & Rognvaldsson (1996). Here, we demonstrate the value of using smoothing regularizers on a simple problem which illustrates a key difference between smoothing and quadratic weight decay, the two dimensional bilinear function \n\nf(x) = x_1 x_2 .   (14) \n\nThis example was used by Friedman & Stuetzle (1981) to demonstrate projection pursuit regression. It is the simplest function with interactions between input variables. \nWe fit this function with one hidden layer networks using the m = {1, 2, 3} smoothing regularizers, comparing the results with using weight decay. In a large set of experiments, we find that both the global and local smoothing regularizers with m = 2 and m = 3 outperform weight decay. An example is shown in figure 1. The local m = 1 case performs poorly, which is unsurprising, given that the target function is quadratic. Weight decay performs poorly because it lacks any form of interaction between the input layer and output layer weights v_j and u_j. \n\n[Plot residue removed; recoverable labels: panels '(a) Global smoothing regularizer' and '(b) Weight decay'; horizontal axes 'Monte Carlo estimated value of S(W,1) [E+3]' and 'Monte Carlo estimated value of S(W,2) [E+6]'.] \n\nFigure 2: Linear correlation between S(W, m) and the global R_G(W, m) for neural networks with 10 input units, 10 internal tanh[.] PBF units, and one linear output. The values of S(W, m) are computed through Monte Carlo integration. 
The left graph shows m = 1 and the right graph shows m = 2. Results are similar for the local form R_L(W, m). \n\n5 Summary \n\nOur regularizers R(W, m) are the first general class of mth-order smoothing regularizers to be proposed for projective basis function (PBF) networks. They apply to large classes of transfer functions g[.], including sigmoids. They differ fundamentally from quadratic weight decay in that they distinguish the roles of the input and output weights and capture the interactions between them. \n\nOur approach is quite different from that developed for smoothing splines and smoothing radial basis functions (RBFs), since we derive smoothing regularizers for given classes of units g[theta, x], rather than derive the forms of the units g[.] by requiring them to be Green's functions of the smoothing operator S(.). Our approach thus has the advantage that it can be applied to the types of networks most often used in practice, namely PBFs. \n\n\f591 \n\nIn Moody & Rognvaldsson (1996), we present further analysis and simulation results for PBFs. We have also extended our work to RBFs (Moody & Rognvaldsson 1997). \n\nAcknowledgements \n\nBoth authors thank Steve Rehfuss and Lizhong Wu for stimulating input. John Moody thanks Volker Tresp for a provocative discussion at a 1991 Neural Networks Workshop sponsored by the Deutsche Informatik Akademie. We gratefully acknowledge support for this work from ARPA and ONR (grant N00014-92-J-4062), NSF (grant CDA-9503968), the Swedish Institute, and the Swedish Research Council for Engineering Sciences (contract TFR-282-95-847). \n\nReferences \n\nBishop, C. (1993), 'Curvature-driven smoothing: A learning algorithm for feedforward networks', IEEE Trans. Neural Networks 4, 882-884. \nEubank, R. L. (1988), Spline Smoothing and Nonparametric Regression, Marcel Dekker, Inc. \nFriedman, J. H. & Stuetzle, W. 
(1981), 'Projection pursuit regression', J. Amer. Stat. Assoc. 76(376), 817-823. \nGeman, S., Bienenstock, E. & Doursat, R. (1992), 'Neural networks and the bias/variance dilemma', Neural Computation 4(1), 1-58. \nGirosi, F., Jones, M. & Poggio, T. (1995), 'Regularization theory and neural network architectures', Neural Computation 7, 219-269. \nHastie, T. J. & Tibshirani, R. J. (1990), Generalized Additive Models, Vol. 43 of Monographs on Statistics and Applied Probability, Chapman and Hall. \nMoody, J. E. & Yarvin, N. (1992), Networks with learned unit response functions, in J. E. Moody, S. J. Hanson & R. P. Lippmann, eds, 'Advances in Neural Information Processing Systems 4', Morgan Kaufmann Publishers, San Mateo, CA, pp. 1048-55. \nMoody, J. & Rognvaldsson, T. (1996), Smoothing regularizers for projective basis function networks. Submitted to Neural Computation. \nMoody, J. & Rognvaldsson, T. (1997), Smoothing regularizers for radial basis function networks. Manuscript in preparation. \nPoggio, T. & Girosi, F. (1990), 'Networks for approximation and learning', IEEE Proceedings 78(9). \nPowell, M. (1987), Radial basis functions for multivariable interpolation: a review, in J. Mason & M. Cox, eds, 'Algorithms for Approximation', Clarendon Press, Oxford. \nWahba, G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics. \n\f", "award": [], "sourceid": 1314, "authors": [{"given_name": "John", "family_name": "Moody", "institution": null}, {"given_name": "Thorsteinn", "family_name": "R\u00f6gnvaldsson", "institution": null}]}