{"title": "From Data Distributions to Regularization in Invariant Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 223, "page_last": 230, "abstract": null, "full_text": "From Data Distributions to Regularization in Invariant Learning \n\nTodd K. Leen \nDepartment of Computer Science and Engineering \nOregon Graduate Institute of Science and Technology \n20000 N.W. Walker Rd \nBeaverton, Oregon 97006 \ntleen@cse.ogi.edu \n\nAbstract \n\nIdeally pattern recognition machines provide constant output when the inputs are transformed under a group G of desired invariances. These invariances can be achieved by enhancing the training data to include examples of inputs transformed by elements of G, while leaving the corresponding targets unchanged. Alternatively the cost function for training can include a regularization term that penalizes changes in the output when the input is transformed under the group. \n\nThis paper relates the two approaches, showing precisely the sense in which the regularized cost function approximates the result of adding transformed (or distorted) examples to the training data. The cost function for the enhanced training set is equivalent to the sum of the original cost function plus a regularizer. For unbiased models, the regularizer reduces to the intuitively obvious choice - a term that penalizes changes in the output when the inputs are transformed under the group. For infinitesimal transformations, the coefficient of the regularization term reduces to the variance of the distortions introduced into the training data. This correspondence provides a simple bridge between the two approaches. \n\n1 Approaches to Invariant Learning \n\nIn machine learning one sometimes wants to incorporate invariances into the function learned. 
Our knowledge of the problem dictates that the machine outputs ought to remain constant when its inputs are transformed under a set of operations G¹. In character recognition, for example, we want the outputs to be invariant under shifts and small rotations of the input image. \n\nIn neural networks, there are several ways to achieve this invariance: \n\n1. The invariance can be hard-wired by weight sharing in the case of summation nodes (LeCun et al. 1990) or by constraints similar to weight sharing in higher-order nodes (Giles et al. 1988). \n\n2. One can enhance the training ensemble by adding examples of inputs transformed under the desired invariance group, while maintaining the same targets as for the raw data. \n\n3. One can add to the cost function a regularizer that penalizes changes in the output when the input is transformed by elements of the group (Simard et al. 1992). \n\nIntuitively one expects the approaches in 2 and 3 to be intimately linked. This paper develops that correspondence in detail. \n\n2 The Distortion-Enhanced Input Ensemble \n\nLet the input data x be distributed according to the density function p(x). The conditional distribution for the corresponding targets is denoted p(t|x). For simplicity of notation we take t ∈ R. The extension to vector targets is trivial. Let f(x; w) denote the network function, parameterized by weights w. The training procedure is assumed to minimize the expected squared error \n\nE(w) = ∫∫ dt dx p(t|x) p(x) [t - f(x; w)]²   (1) \n\nWe wish to consider the effects of adding new inputs that are related to the old by transformations that correspond to the desired invariances. These transformations, or distortions, of the inputs are carried out by group elements g ∈ G. 
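For concreteness, approach 2 above can be sketched in a few lines of code: a group element (here a planar rotation) is applied to each input with a small random parameter, and the target is carried over unchanged. This is a minimal illustrative sketch; the function names, the choice of rotation group, and the toy dataset are invented here, not taken from the paper.

```python
import math
import random

def rotate(x, theta):
    """Apply the group element g(x; theta): rotation of a 2-D input by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def augment(data, n_copies=5, sigma=0.1, seed=0):
    """Enhance the ensemble with rotated copies; the targets are left unchanged."""
    rng = random.Random(seed)
    out = list(data)
    for x, t in data:
        for _ in range(n_copies):
            theta = rng.gauss(0.0, sigma)  # small rotations, peaked about the identity
            out.append((rotate(x, theta), t))
    return out

# a toy two-point training set: (input pair, target)
data = [((1.0, 0.0), 1.0), ((0.0, 1.0), -1.0)]
enhanced = augment(data)
```

`augment` returns the original pairs plus `n_copies` rotated copies of each, i.e. a sample from the distortion-enhanced ensemble described below.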
For Lie groups², the transformations are analytic functions of parameters α ∈ R^k, \n\nx → x' = g(x; α)   (2) \n\nwith the identity transformation corresponding to parameter value zero, \n\ng(x; 0) = x   (3) \n\nIn image processing, for example, we may want our machine to exhibit invariance with respect to rotation, scaling, shearing and translations of the plane. These transformations form a six-parameter Lie group³. \n\n¹We assume that the set forms a group. \n²See for example (Sattinger and Weaver, 1986). \n\nBy adding distorted input examples we alter the original density p(x). To describe the new density, we introduce a probability density for the transformation parameters p(α). Using this density, the distribution for the distortion-enhanced input ensemble is \n\np(x') = ∫∫ dα dx p(x'|x, α) p(α) p(x) = ∫∫ dα dx δ(x' - g(x; α)) p(α) p(x) \n\nwhere δ(·) is the Dirac delta function⁴. \n\nFinally we impose that the targets remain unchanged when the inputs are transformed according to (2), i.e., p(t|x') = p(t|x). Substituting p(x') into (1) and using the invariance of the targets yields the cost function \n\nẼ = ∫∫∫ dt dx dα p(t|x) p(x) p(α) [t - f(g(x; α); w)]²   (4) \n\nEquation (4) gives the cost function for the distortion-enhanced input ensemble. \n\n³The parameters for rotations, scaling and shearing completely specify elements of GL₂, the four-parameter group of 2 × 2 invertible matrices. The translations carry an additional two degrees of freedom. \n\n⁴In general the density on α might vary through the input space, suggesting the conditional density p(α|x). This introduces rather minor changes in the discussion that will not be considered here. \n\n3 Regularization and Hints \n\nThe remainder of the paper makes precise the connection between adding transformed inputs, as embodied in (4), and various regularization procedures. It is straightforward to show that the cost function for the distortion-enhanced ensemble is equivalent to the cost function for the original data ensemble (1) plus a regularization term. 
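This equivalence rests on a sample-by-sample identity: adding and subtracting f(x; w) inside the squared error gives [t - f(g(x; α))]² = [t - f(x)]² + [f(x) - f(g(x; α))]² - 2 [t - f(x)] [f(g(x; α)) - f(x)]. The identity can be checked numerically with a toy model; the network f, the one-parameter scaling group g, and the sampled ensemble below are all invented for illustration.

```python
import math
import random

rng = random.Random(1)

def f(x, w):
    # toy "network" function, f(x; w) = tanh(w x)
    return math.tanh(w * x)

def g(x, a):
    # one-parameter distortion: a scaling of the input
    return x * math.exp(a)

w = 0.7
# samples (x, t, alpha) drawn from p(x) p(t|x) p(alpha)
samples = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 0.2))
           for _ in range(1000)]
n = len(samples)

E_orig = sum((t - f(x, w))**2 for x, t, _ in samples) / n
E_hint = sum((f(x, w) - f(g(x, a), w))**2 for x, _, a in samples) / n
E_corr = -2 * sum((t - f(x, w)) * (f(g(x, a), w) - f(x, w))
                  for x, t, a in samples) / n
E_dist = sum((t - f(g(x, a), w))**2 for x, t, a in samples) / n
```

Since the identity holds pointwise, the distortion-enhanced cost equals the original cost plus the two regularizer terms up to floating-point rounding, with no Monte Carlo error.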
Adding and subtracting f(x; w) in the term in square brackets in (4), and expanding the quadratic, leaves \n\nẼ = E + E_R   (5) \n\nwhere the regularizer is E_R = E_H + E_C, with \n\nE_H = ∫ dα p(α) ∫ dx p(x) [f(x; w) - f(g(x; α); w)]² \n\nE_C = -2 ∫∫∫ dt dx dα p(t|x) p(x) p(α) [t - f(x; w)] [f(g(x; α); w) - f(x; w)]   (6) \n\nTraining with the original data ensemble using the cost function (5) is equivalent to adding transformed inputs to the data ensemble. \n\nThe first term of the regularizer, E_H, penalizes the average squared difference between f(x; w) and f(g(x; α); w). This is exactly the form one would intuitively apply in order to ensure that the network output not change under the transformation x → g(x; α). Indeed this is similar to the form of the invariance \"hint\" proposed by Abu-Mostafa (1993). The difference here is that there is no arbitrary parameter multiplying the term. Instead the strength of the regularizer is governed by the average over the density p(α). The term E_H measures the error in satisfying the invariance hint. \n\nThe second term E_C measures the correlation between the error in fitting the data and the error in satisfying the hint. Only when these correlations vanish is the cost function for the enhanced ensemble equal to the original cost function plus the invariance hint penalty. \n\nThe correlation term vanishes trivially when either \n\n1. The invariance f(g(x; α); w) = f(x; w) is satisfied, or \n\n2. 
The network function equals the least squares regression on t, \n\nf(x; w) = ∫ dt p(t|x) t = E[t|x]   (7) \n\nThe lowest possible E occurs when f satisfies (7), at which point E becomes the variance in the targets averaged over p(x). By substituting (7) into E_C and carrying out the integration over dt p(t|x), the correlation term is seen to vanish. \n\nIf the minimum of Ẽ occurs at a weight for which the invariance is satisfied (condition 1 above), then minimizing Ẽ(w) is equivalent to minimizing E(w). If the minimum of Ẽ occurs at a weight for which the network function is the regression (condition 2), then minimizing Ẽ is equivalent to minimizing the cost function with the intuitive regularizer E_H⁵. \n\n3.1 Infinitesimal Transformations \n\nAbove we enumerated the conditions under which the correlation term in E_R vanishes exactly for unrestricted transformations. If the transformations are analytic in the parameters α, then by restricting ourselves to small transformations (those close to the identity) we can show how the correlation term approximately vanishes for unbiased models. To implement this, we assume that p(α) is sharply peaked about the origin so that large transformations are unlikely. \n\n⁵If the data are to be fit optimally, with enough freedom left over to satisfy the invariance hint, then there must be several weight values (perhaps a continuum of such values) for which the network function satisfies (7). That is, the problem must be under-specified. If this is the case, then the interesting part of weight space is just the subset on which (7) is satisfied. On this subset the correlation term in (6) vanishes and the regularizer assumes the intuitive form. 
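Condition 2 above can also be checked numerically: when f(x; w) equals the regression E[t|x], the residual t - f(x) is independent, zero-mean noise, so a Monte Carlo estimate of the correlation term is near zero while the hint term is not. A toy sketch, in which the quadratic regression, the scaling group, and all names are invented for illustration:

```python
import math
import random

rng = random.Random(2)

def f(x):
    # the least squares regression E[t|x] for targets t = x^2 + noise
    return x * x

def g(x, a):
    # one-parameter scaling group acting on the input
    return x * math.exp(a)

n = 50000
# samples of (x, noise, alpha); the targets are t = x^2 + noise
samples = [(rng.gauss(0, 1), rng.gauss(0, 0.5), rng.gauss(0, 0.3))
           for _ in range(n)]

# hint term: mean squared change of f under the distortion
E_hint = sum((f(x) - f(g(x, a)))**2 for x, _, a in samples) / n
# correlation term: residual (t - f(x)) is exactly the noise here
E_corr = -2 * sum(((x * x + eps) - f(x)) * (f(g(x, a)) - f(x))
                  for x, eps, a in samples) / n
```

The estimate of E_corr fluctuates about zero at the Monte Carlo scale, while E_hint stays of order one, so for this model the enhanced-ensemble cost reduces to the original cost plus the hint penalty.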
\n\nWe obtain an approximation to the cost function Ẽ by expanding the integrands in (6) in power series about α = 0 and retaining terms to second order. This leaves \n\nẼ = E + ∫∫ dx dα p(x) p(α) ( α_i ∂g^μ/∂α_i|_{α=0} ∂f/∂x^μ )² - 2 ∫∫∫ dt dx dα p(t|x) p(x) p(α) [t - f(x; w)] × (1/2) α_i α_j [ ∂²g^μ/∂α_i ∂α_j|_{α=0} ∂f/∂x^μ + ∂g^μ/∂α_i|_{α=0} ∂g^ν/∂α_j|_{α=0} ∂²f/∂x^μ ∂x^ν ]   (8) \n\nwhere x^μ and g^μ denote the μth components of x and g, α_i denotes the ith component of the transformation parameter vector, repeated Greek and Roman indices are summed over, and all derivatives are evaluated at α = 0. Note that we have used the fact that Lie group transformations are analytic in the parameter vector α to derive the expansion. \n\nFinally we introduce two assumptions on the distribution p(α). First, α is assumed to be zero mean. This corresponds, in the linear approximation, to a distribution of distortions whose mean is the identity transformation. Second, we assume that the components of α are uncorrelated so that the covariance matrix is diagonal with elements σ_i², i = 1, ..., k⁶. With these assumptions, the cost function for the distortion-enhanced ensemble simplifies to \n\nẼ = E + Σ_{i=1}^k σ_i² ∫ dx p(x) ( ∂g^μ/∂α_i|_{α=0} ∂f/∂x^μ )² - Σ_{i=1}^k σ_i² ∫∫ dx dt p(t|x) p(x) [t - f(x; w)] [ ∂²g^μ/∂α_i ∂α_i|_{α=0} ∂f/∂x^μ + ∂g^μ/∂α_i|_{α=0} ∂g^ν/∂α_i|_{α=0} ∂²f/∂x^μ ∂x^ν ]   (9) \n\nThis last expression provides a simple bridge between the methods of adding transformed examples to the data, and the alternative of adding a regularizer to the cost function: the coefficient of the regularization term in the latter approach is equal to the variance of the transformation parameters in the former approach. \n\n⁶Note that the transformed patterns may be correlated in parts of the pattern space. 
\nFor example the results of applying the shearing and rotation operations to an infinite vertical line are indistinguishable. In general, there may be regions of the pattern space for which the actions of several different group elements are indistinguishable; that is, x' = g(x; α) = g(x; β). However this does not imply that α and β are statistically correlated. \n\n3.1.1 Unbiased Models \n\nFor unbiased models the regularizer in Ẽ(w) assumes a particularly simple form. Suppose the network function is rich enough to form an unbiased estimate of the least squares regression on t for the undistorted data ensemble. That is, there exists a weight value w_0 such that \n\nf(x; w_0) = ∫ dt t p(t|x) ≡ E[t|x]   (10) \n\nThis is the global minimum for the original error E(w). \n\nThe arguments of section 3 apply here as well. However we can go further. Even if there is only a single, isolated weight value for which (10) is satisfied, then to O(σ²) the correlation term in the regularizer vanishes. To see this, note that by the implicit function theorem the modified cost function (9) has its global minimum at the new weight⁷ \n\nw̃_0 = w_0 + O(σ²)   (11) \n\nAt this weight, the network function is no longer the regression on t, but rather \n\nf(x; w̃_0) = E[t|x] + O(σ²)   (12) \n\nSubstituting (12) into (9), we find that the minimum of (9) is, to O(σ²), at the same weight as the minimum of \n\nẼ = E + Σ_{i=1}^k σ_i² ∫ dx p(x) [ ∂g^μ/∂α_i|_{α=0} ∂f(x; w)/∂x^μ ]²   (13) \n\nTo O(σ²), minimizing (13) is equivalent to minimizing (9). So we regard Ẽ as the effective cost function. \n\nThe regularization term in (13) is proportional to the average square of the gradient of the network function along the direction in the input space generated by the linear part of g. The quantity inside the square brackets is just the linear part of [f(g(x; α)) - f(x)] from (6). 
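This linearization can be checked numerically: for small σ, the average of [f(g(x; α)) - f(x)]² over α approaches σ² times the squared directional derivative of f along the tangent vector ∂g/∂α at α = 0. A one-dimensional toy sketch with a scaling group, whose tangent vector at the identity is x itself; all names are invented for illustration:

```python
import math
import random

def f(x, w=0.7):
    # toy network function
    return math.tanh(w * x)

def df(x, h=1e-5):
    # numerical derivative df/dx of the network function
    return (f(x + h) - f(x - h)) / (2 * h)

def g(x, a):
    # scaling group; the tangent vector dg/da at a = 0 is x itself
    return x * math.exp(a)

def hint_term(x, sigma, rng, n=50000):
    # Monte Carlo estimate of E_alpha[(f(g(x; alpha)) - f(x))^2], alpha ~ N(0, sigma^2)
    return sum((f(g(x, rng.gauss(0, sigma))) - f(x))**2 for _ in range(n)) / n

def tangent_penalty(x, sigma):
    # sigma^2 (dg/da * df/dx)^2: the integrand of the regularizer in (13)
    return sigma**2 * (x * df(x))**2

rng = random.Random(0)
x, sigma = 1.5, 0.01
ratio = hint_term(x, sigma, rng) / tangent_penalty(x, sigma)
```

For σ = 0.01 the two quantities agree to within Monte Carlo error of a few percent, as the O(σ²) analysis predicts.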
The magnitude of the regularization term is just the variance of the distribution of distortion parameters. \n\nThis is precisely the form of the regularizer given by Simard et al. in their tangent prop algorithm (Simard et al., 1992). This derivation shows the equivalence (to O(σ²)) between the tangent prop regularizer and the alternative of modifying the input distribution. Furthermore, we see that with this equivalence, the constant fixing the strength of the regularization term is simply the variance of the distortions introduced into the original training set. \n\nWe should stress that the equivalence between the regularizer and the distortion-enhanced ensemble in (13) only holds to O(σ²). If one allows the variance of the distortion parameters σ² to become arbitrarily large in an effort to mock up an arbitrarily large regularization term, then the equivalence expressed in (13) breaks down since terms of order O(σ⁴) can no longer be neglected. In addition, if the transformations are to be kept small so that the linearization holds (e.g. by restricting the density on α to have support on a small neighborhood of zero), then the variance will be bounded above. \n\n⁷We assume that the Hessian of E is nonsingular at w_0. \n\n3.1.2 Smoothing Regularizers \n\nIn the previous sections we showed the equivalence between modifying the input distribution and adding a regularizer to the cost function. We derived this equivalence to illuminate mechanisms for obtaining invariant pattern recognition. The technique for dealing with infinitesimal transformations in section 3.1 was used by Bishop (1994) to show the equivalence between added input noise and smoothing regularizers. Bishop's results, though they preceded our own, are a special case of the results presented here. 
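For the translation case treated next, the tangent vectors are just the coordinate directions, so the penalty of (13) reduces to σ² times the mean squared gradient norm of the network function. A toy numerical check of this reduction, with a two-input network; all names are invented for illustration:

```python
import math
import random

def f(x, y, w=(0.8, -0.5)):
    # toy network function on R^2
    return math.tanh(w[0] * x + w[1] * y)

def grad_f(x, y, h=1e-5):
    # numerical gradient of the network function
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

def noise_term(x, y, sigma, rng, n=50000):
    # E_alpha[(f(x + alpha) - f(x))^2] for spherical Gaussian translations
    total = 0.0
    for _ in range(n):
        ax, ay = rng.gauss(0, sigma), rng.gauss(0, sigma)
        total += (f(x + ax, y + ay) - f(x, y))**2
    return total / n

def tikhonov(x, y, sigma):
    # sigma^2 |grad f|^2: the smoothing penalty for the translation group
    gx, gy = grad_f(x, y)
    return sigma**2 * (gx * gx + gy * gy)

rng = random.Random(0)
ratio = noise_term(0.3, -0.2, 0.01, rng) / tikhonov(0.3, -0.2, 0.01)
```

For small σ the squared change of f under random translations matches the gradient penalty to within a few percent, which is the added-noise equivalence in miniature.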
Suppose the group G is restricted to translations by random vectors, g(x; α) = x + α, where α is spherically distributed with variance σ². Then the regularizer in (13) is \n\nE_R = σ² ∫ dx p(x) Σ_μ ( ∂f/∂x^μ )²   (14) \n\nThis regularizer penalizes large magnitude gradients in the network function and is, as pointed out by Bishop, one of the class of generalized Tikhonov regularizers. \n\n4 Summary \n\nWe have shown that enhancing the input ensemble by adding examples transformed under a group x → g(x; α), while maintaining the target values, is equivalent to adding a regularizer to the original cost function. For unbiased models the regularizer reduces to the intuitive form that penalizes the mean squared difference between the network output for transformed and untransformed inputs - i.e. the error in satisfying the desired invariance. In general the regularizer includes a term that measures correlations between the error in fitting the data and the error in satisfying the desired invariance. For infinitesimal transformations, the regularizer is equivalent (up to terms linear in the variance of the transformation parameters) to the tangent prop form given by Simard et al. (1992), with regularization coefficient equal to the variance of the transformation parameters. In the special case that the group transformations are limited to random translations of the input, the regularizer reduces to a standard smoothing regularizer. \n\nWe gave conditions under which enhancing the input ensemble and adding the intuitive regularizer E_H are equivalent. However this equivalence is only with regard to the optimal weight. We have not compared the training dynamics for the two approaches. In particular, it is quite possible that the full regularizer E_H + E_C exhibits different training dynamics from the intuitive form E_H. 
For the approach in which data are added to the input ensemble, one can easily construct datasets and distributions p(α) that either increase the condition number of the Hessian or decrease it. Finally, it may be that the intuitive regularizer can have either detrimental or positive effects on the Hessian as well. \n\nAcknowledgments \n\nI thank Lodewyk Wessels, Misha Pavel, Eric Wan, Steve Rehfuss, Genevieve Orr and Patrice Simard for stimulating and helpful discussions, and the reviewers for helpful comments. I am grateful to my father for what he gave to me in life, and for the presence of his spirit after his recent passing. \n\nThis work was supported by EPRI under grant RP8015-2, AFOSR under grant FF4962-93-1-0253, and ONR under grant N00014-91-J-1482. \n\nReferences \n\nYasar S. Abu-Mostafa. A method for learning from hints. In S. Hanson, J. Cowan, and C. Giles, editors, Advances in Neural Information Processing Systems, vol. 5, pages 73-80. Morgan Kaufmann, 1993. \n\nChris M. Bishop. Training with noise is equivalent to Tikhonov regularization. To appear in Neural Computation, 1994. \n\nC.L. Giles, R.D. Griffin, and T. Maxwell. Encoding geometric invariances in higher-order neural networks. In D. Z. Anderson, editor, Neural Information Processing Systems, pages 301-309. American Institute of Physics, 1988. \n\nY. Le Cun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, vol. 2, pages 396-404. Morgan Kaufmann Publishers, 1990. \n\nPatrice Simard, Bernard Victorri, Yann Le Cun, and John Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In John E. Moody, Steven J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 895-903. Morgan Kaufmann, 1992. \n\nD.H. 
Sattinger and O.L. Weaver. Lie Groups and Algebras with Applications to Physics, Geometry and Mechanics. Springer-Verlag, 1986. ", "award": [], "sourceid": 925, "authors": [{"given_name": "Todd", "family_name": "Leen", "institution": null}]}