{"title": "The Effect of Correlated Input Data on the Dynamics of Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 169, "page_last": 175, "abstract": null, "full_text": "The effect of correlated input data on the \n\ndynamics of learning \n\nS~ren Halkjrer and Ole Winther \n\nCONNECT, The Niels Bohr Institute \n\nBlegdamsvej 17 \n\n2100 Copenhagen, Denmark \n\nhalkjaer>winther~connect.nbi.dk \n\nAbstract \n\nThe convergence properties of the gradient descent algorithm in the \ncase of the linear perceptron may be obtained from the response \nfunction. We derive a general expression for the response function \nand apply it to the case of data with simple input correlations. It \nis found that correlations severely may slow down learning. This \nexplains the success of PCA as a method for reducing training time. \nMotivated by this finding we furthermore propose to transform the \ninput data by removing the mean across input variables as well as \nexamples to decrease correlations. Numerical findings for a medical \nclassification problem are in fine agreement with the theoretical \nresults. \n\n1 \n\nINTRODUCTION \n\nLearning and generalization are important areas of research within the field of neu(cid:173)\nral networks. Although good generalization is the ultimate goal in feed-forward \nnetworks (perceptrons), it is of practical importance to understand the mechanism \nwhich control the amount of time required for learning, i. e. the dynamics of learn(cid:173)\ning. This is of course particularly important in the case of a large data set. An exact \nanalysis of this mechanism is possible for the linear perceptron and as usual it is \nhoped that the results to some extend may be carried over to explain the behaviour \nof non-linear perceptrons. \nWe consider N dimensional input vectors x E n N and scalar output y. The linear \n\n\f170 \n\nS. Halkjrer and O. 
Winther \n\nperceptron is parametrized by the weight vector wE nN \n\n1 \n\ny(x) = ..JNwT x \n\n(1) \nLet the training set be {( xt' , yt'), J.l = 1, . .. ,p} and the training error be the usual \nsquared error, E(w) = ~ 2:t' (yt' - y(xt'))2. We will use the well-known gradient \ndescent aigorithm i w(k + 1) = w(k) - rl'\\1 E(w(k)) to estimate the minimum points \nw'\" of E. Here 7] denotes the learning parameter. Collecting the input examples \nin the N x p matrix X and the corresponding output in y, the error function is \nwritten E(w) = ~(wTRw-2qTw+c), where R == }, 2:t' xt'(xt')T, q = -;};;XY and \nc = yT y. As in (Le Cun et at., 1991) the convergence properties of the minimum \npoints w'\" are examined in the coordinate system where R is diagonal. Let U denote \nthe matrix whose columns are the eigenvectors of R and ~ = diag( AI, ... , AN) the \ndiagonal matrix containing the eigenvalues of R. The new coordinates then become \nv = U T (w - w\"') with corresponding error function 2 \n\n(2) \n\n1 \n\nE(v) = '2vT ~v + Eo = '2 ~Aivl + Eo \n\n1 \n\n\u2022 \n\nwhere Eo = E(w\"') . Gradient descent now leads to the decoupled equations \n\nvi(k + 1) = (1 - 7]Ai)Vi(k) = (1 - 7]Adkvi(O) \n\n(3) \nwith i = 1, ... , N. Clearly, v -+ 0 requires 11 - 7]Ai I < 1 for all i, so that 7] must \nbe chosen in the interval 0 < 7] < 2/ Amax. In the extreme case Ai = A we will \nhave convergence in one step for 7] = 1/ A. However, in the usual case of unequal Ai \nthe convergence for large k will be exponential vi(k) = exp(-7]Aik)Vi(O) . (7]Ai)-1 \ntherefore defines the time constant of the i'th equation giving a slowest time constant \n(7]Amin)-I. A popular choice for the learning parameter is 7] = 1/ Amax resulting in \na slowest time constant Amax / Amin called the learning time T in the following. The \nconvergence properties of the gradient descent algorithm is thus characterized by \nT. 
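The decoupled dynamics (3) are straightforward to verify numerically. The following is a minimal sketch (not from the paper; the eigenvalues, initial condition and step count are made up for illustration) of gradient descent on the diagonalized error (2), checking the closed form v_i(k) = (1 − ηλ_i)^k v_i(0) and the learning time T = λ_max/λ_min:

```python
import numpy as np

# Gradient descent on E(v) = 1/2 * sum_i lam_i * v_i^2 in the eigenbasis of R.
# With eta = 1/lam_max each mode decays as (1 - eta*lam_i)^k, so the mode with
# the smallest eigenvalue is the slowest and T = lam_max/lam_min.
lam = np.array([0.1, 1.0, 10.0])   # hypothetical eigenvalues of R
eta = 1.0 / lam.max()              # popular choice eta = 1/lam_max
k = 50
v = np.ones_like(lam)              # initial condition v(0)
for _ in range(k):
    v = v - eta * lam * v          # decoupled update v_i <- (1 - eta*lam_i) * v_i
closed_form = (1.0 - eta * lam) ** k
assert np.allclose(v, closed_form)
T = lam.max() / lam.min()          # learning time
```

The mode belonging to lam_max converges in a single step, while the mode belonging to lam_min still dominates after k = 50 iterations.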
In the case of a singular matrix R, one or more of the eigenvalues will be zero, and there will be no convergence along the corresponding eigendirections. This has, however, no influence on the error according to (2). Thus, λ_min will in the following denote the smallest non-zero eigenvalue. \n\nIn this article we calculate the eigenvalue spectrum of R in order to obtain the learning time of the gradient descent algorithm. This may be done by introducing the response function \n\nG_L ≡ G(L, H) = (1/N) Tr L (1 − RH)^{-1}    (4) \n\nwhere L, H are arbitrary N × N matrices. Using a standard representation of the Dirac δ-function (Krogh, 1992), we may write the eigenvalue spectrum of R as \n\nρ(λ) = (1/Nπ) Im Tr (λ − R)^{-1} = (1/πλ) Im G(1, 1/λ)    (5) \n\nwhere λ has an infinitesimal imaginary part which is set equal to zero at the end of the calculation. \n\n¹ The Newton-Raphson method, w(k+1) = w(k) − ∇E(w(k)) (∇²E(w(k)))^{-1}, is of course much more effective in the linear case, since it gives convergence in one step. This method however requires an inversion of the Hessian matrix. \n\n² Note that this analysis is valid for any part of an error surface in which a quadratic approximation is valid. In the general case R should be exchanged with the Hessian ∇∇E(w*). \n\nIn the 'thermodynamic limit' N → ∞, keeping α = p/N constant and finite, G (and thus the eigenvalue spectrum) is a self-averaging quantity (Sollich, 1996), i.e. G − Ḡ = O(N^{-1}), where Ḡ is defined as the response function averaged over the input distribution. Previously Ḡ has been calculated for independent input variables (Hertz et al., 1989; Sollich, 1996). In section 2 we derive an implicit equation for the averaged response function for arbitrary correlations using random matrix techniques (Brody et al., 1981). 
This equation is solved, showing that simple input correlations may slow down learning significantly. Based on this finding we propose in section 3 data transformations for improving the learning speed, and test the transformations numerically on a medical classification problem in section 4. We conclude in section 5 with a discussion of the results. \n\n2 THE RESPONSE FUNCTION \n\nThe method for deriving the averaged response function is based on the fact that the response function (4) may be written as a geometrical series, G_L = Σ_{r≥0} ⟨L(RH)^r⟩. We will assume that the input examples x^μ are drawn independently from a Gaussian distribution with means m_i and correlations ⟨x_i x_j⟩ − m_i m_j = C_ij, i.e. ⟨x^μ (x^ν)^T⟩ = δ_{μν} Z and R̄ = αZ, where Z ≡ C + mm^T. The Gaussian distribution has the property that the average of products of x's can be calculated by making all possible pair correlations, e.g. ⟨x_i x_j x_k x_l⟩ = Z_ij Z_kl + Z_ik Z_jl + Z_il Z_jk. To take the average of ⟨L(RH)^r⟩ we must therefore make all possible pairs of the x's and exchange each pair x_i x_j with Z_ij. This combinatorial problem is solved recursively: pairing the first x with each of the remaining ones and resumming the resulting series leads to an implicit equation for Ḡ_L. For the simple correlation model studied below, Z_ij = δ_ij v(1−c) + vc + m_i m_j, the solution yields the eigenvalue spectrum \n\nρ(λ) = (1 − α) θ(1 − α) δ(λ) + √((λ₊ − λ)(λ − λ₋)) / (2πλ v(1−c)) + (1/N) δ(λ − αa₁)    (13) \n\nwhere λ± = v(1−c)(1 ± √α)^2, a₁ is the large eigenvalue of Z (to leading order a₁ = N(m^2 + vc) for m_i = m), and θ(x) = 1 for x > 0 and 0 otherwise. The first term expresses the trivial fact that for p < N the whole input space is not spanned and R will have a fraction of 1 − α zero eigenvalues. The continuous spectrum (the root term) only contributes for λ₋ < λ < λ₊. Numerical simulations have been performed to test the validity of the spectrum (13) (Halkjær, 1996). They are in good agreement with the predicted results, indicating that finite size effects are unimportant. The continuous spectrum (13) has also been calculated using the replica method (Halkjær, 1996). 
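The isolated large eigenvalue in the spectrum (13) is easy to observe numerically. The sketch below (with made-up values of N, p, v, c and m; this is not the simulation study of (Halkjær, 1996)) draws correlated, biased Gaussian inputs and checks that R acquires one eigenvalue of order N above an O(1) bulk:

```python
import numpy as np

# Draw p Gaussian examples with Z_ij = delta_ij*v*(1-c) + v*c + m^2 and check
# that R = (1/N) * sum_mu x^mu (x^mu)^T has one eigenvalue of order N
# (approximately alpha*N*(m^2 + v*c)) on top of an O(1) continuous bulk.
rng = np.random.default_rng(0)
N, p = 200, 400                     # made-up sizes, alpha = p/N = 2
v, c, m = 1.0, 0.3, 0.5             # made-up variance, correlation and mean
C = v * ((1.0 - c) * np.eye(N) + c * np.ones((N, N)))
X = m + np.linalg.cholesky(C) @ rng.standard_normal((N, p))
R = X @ X.T / N
eig = np.sort(np.linalg.eigvalsh(R))
alpha = p / N
spike = alpha * N * (m * m + v * c)                       # predicted large eigenvalue
bulk_edge = v * (1.0 - c) * (1.0 + np.sqrt(alpha)) ** 2   # lambda_plus of the bulk
```

The top eigenvalue of R sits close to αN(m^2 + vc), two orders of magnitude above the bulk edge λ₊, which is what makes gradient descent on such data so slow.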
\n\nFrom the spectrum the learning time T may be read off directly, \n\nT = max(λ₊/λ₋, αa₁/λ₋) = max( ((1 + √α)/(1 − √α))^2, αa₁/(v(1−c)(1 − √α)^2) )    (14) \n\nTo illustrate how input correlations and bias may affect learning time, consider the simple correlations C_ij = δ_ij v(1−c) + vc and m_i = m. With this special choice \n\nT = αN(m^2 + vc) / (v(1−c)(1 − √α)^2) \n\nfor m^2 + cv > 0, i.e. for non-zero mean or positive correlations the convergence time will blow up by a factor proportional to N. The input bias effect has previously been observed by (Le Cun et al., 1991; Wendemuth et al., 1993). In the next section we will consider transformations to remove the large eigenvalue and thus to speed up learning. \n\n3 DATA TRANSFORMATIONS FOR INCREASING LEARNING SPEED \n\nIn this section we consider two data transformations for minimizing the learning time T of a data set, based on the results obtained in the previous sections. \n\nThe PCA transformation (Jackson, 1991) is a data transformation often used in data analysis. Let U be the matrix whose columns are the eigenvectors of the sample covariance matrix and let x^mean denote the sample average vector (see below). It is easy to check that the transformed data set \n\nx̃^μ = U^T (x^μ − x^mean)    (15) \n\nhas uncorrelated (zero-mean) variables. However, the new PCA variables will often have a large spread in variances, which might result in slow convergence. A simple rescaling of the new variables will remove this problem, such that according to (14) a PCA transformed data set with rescaled variables will have optimal convergence properties. \n\nThe other transformation, which we will call double centering, is based on the removal of the observation means and the variable means. 
However, whereas the PCA transformation doesn't care about the initial distribution, this transformation is optimal for a data set generated from the matrix Z_ij = δ_ij v(1−c) + vc + m_i m_j studied earlier. Define x_i^mean = (1/p) Σ_μ x_i^μ (mean of the i'th variable), x_μ^mean = (1/N) Σ_i x_i^μ (mean of the μ'th example) and x_grand^mean = (1/pN) Σ_{μ,i} x_i^μ (grand mean). Consider first the transformed data set \n\nx̃_i^μ = x_i^μ − x_i^mean − x_μ^mean + x_grand^mean \n\nThe new variables are readily seen to have zero mean, variance ṽ = v(1−c) − (v/N)(1−c) and correlation c̃ = −1/(N−1). Since ṽ(1−c̃) = v(1−c), we immediately get from (13) that the continuous eigenvalue spectrum is unchanged by this transformation. Furthermore the 'large' eigenvalue αa₁ is equal to zero and therefore uninteresting. Thus the learning time becomes T = (1 + √α)^2/(1 − √α)^2. This transformation however removes perhaps important information from the data set, namely the observation means. Motivated by these findings, we create a new data set {x̂^μ} where this information is added as an extra component \n\nx̂^μ = (x̃_1^μ, ..., x̃_N^μ, x_μ^mean − x_grand^mean)    (16) \n\nTable 1: Required number of iterations and corresponding learning times for different data transformations. 'Raw' is the original data set, 'Var. cent.' indicates the variable centered (m_i = 0) data set, 'Doub. cent.' denotes (16), while the two last columns concern the PCA transformed data set (15) without and with rescaled variables. \n\n           | Raw    | Var. cent. | Doub. cent. | PCA  | PCA (res.) \nIterations | ∞      | 300        | 50          | 630  | 7 \nT          | 161190 | 3330       | 237         | 3330 | 1 \n\nThe matrix R̂ resulting from this data set is identical to the above case except that a column and a row have been added. 
We therefore conclude that the eigenvalue spectrum of this data set consists of a continuous spectrum equal to the above and a single eigenvalue which is found to be λ = v(1−c)/N + cv. For c ≠ 0 we will therefore have a learning time T of order one, indicating fast convergence. For independent variables (c = 0) the transformation results in a learning time of order N, but in this case a simple removal of the variable means will be optimal. After training, when an (N+1)-dimensional parameter vector ŵ has been obtained, it is possible to transform back to the original data set using the parameter transformation w_i = ŵ_i + (1/N) ŵ_{N+1} − (1/N) Σ_{j=1}^N ŵ_j. \n\n4 NUMERICAL INVESTIGATIONS \n\nThe suggested transformations for improving the convergence properties have been tested on a medical classification problem. The data set consisted of 40 regional values of cerebral glucose metabolism from 85 patients, 48 HIV-negatives and 37 HIV-positives. A simple perceptron with sigmoidal tanh output was trained using gradient descent on the entropic error function to diagnose the 85 patients correctly. The choice of an entropic error function was due to its superior convergence properties compared to the quadratic error function considered in the analysis. The learning was stopped once the perceptron was able to diagnose all patients correctly. Table 1 shows the average number of required iterations for each of the transformed data sets (see legend) as well as the ratio T = λ_max/λ_min for the corresponding matrix R. The 'raw' data set could not be learned within the allowed 1000 iterations, which is indicated by ∞. Overall, there is fine agreement between the order of the calculated learning times and the corresponding order of the required number of iterations. Note especially the superiority of the PCA transformation with rescaled variables. 
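The qualitative ranking in Table 1 can be reproduced on synthetic data. The sketch below (made-up sizes and parameters, not the medical data set) applies variable centering, double centering with the extra component (16), and PCA with rescaled variables to correlated, biased inputs and compares the resulting ratios T = λ_max/λ_min:

```python
import numpy as np

def learning_time(X):
    # T = lambda_max / lambda_min over the non-zero eigenvalues of X X^T (scale-free)
    lam = np.linalg.eigvalsh(X @ X.T / X.shape[0])
    lam = lam[lam > 1e-10 * lam.max()]
    return lam.max() / lam.min()

rng = np.random.default_rng(0)
N, p = 40, 200                      # made-up sizes (variables x examples)
v, c, m = 1.0, 0.3, 0.5             # made-up variance, correlation and mean
C = v * ((1.0 - c) * np.eye(N) + c * np.ones((N, N)))
X = m + np.linalg.cholesky(C) @ rng.standard_normal((N, p))

# variable centering: remove the mean of each variable
X_var = X - X.mean(axis=1, keepdims=True)

# double centering (16): remove variable and example means, add back the grand
# mean, and append the example mean as an extra component
ex_mean = X.mean(axis=0, keepdims=True)
X_dc = X - X.mean(axis=1, keepdims=True) - ex_mean + X.mean()
X_dc = np.vstack([X_dc, ex_mean - X.mean()])

# PCA with rescaling: rotate to the eigenbasis of the sample covariance and
# normalize each new variable to unit variance
eigval, U = np.linalg.eigh(X_var @ X_var.T / p)
X_pca = (U.T @ X_var) / np.sqrt(eigval[:, None])

T_raw, T_var, T_dc, T_pca = (learning_time(Z) for Z in (X, X_var, X_dc, X_pca))
```

As in Table 1, the raw data give the largest T, variable centering helps but leaves the correlation-induced large eigenvalue, double centering brings T down to order one, and the rescaled PCA variables give T = 1 up to numerical precision.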
\n\n5 CONCLUSION \n\nFor linear networks the convergence properties of the gradient descent algorithm may be derived from the eigenvalue spectrum of the covariance matrix of the input data. The convergence time is controlled by the ratio between the largest and smallest (non-zero) eigenvalue. In this paper we have calculated the eigenvalue spectrum of a covariance matrix for correlated and biased inputs. It turns out that correlation and bias give rise to an eigenvalue of the order of the input dimension as well as a continuous spectrum of order one. This explains why a PCA transformation (with a variable rescaling) may increase learning speed significantly. We have proposed to center (set equal to zero) the empirical mean both for each variable and each observation in order to remove the large eigenvalue. We add an additional component containing the observation mean to the input vector in order to keep this information in the training set. At the end of training it is possible to transform the solution back to the original representation. Numerical investigations are in fine agreement with the theoretical analysis. \n\n6 ACKNOWLEDGMENTS \n\nWe would like to thank Sara A. Solla and Lars Kai Hansen for valuable comments and discussions. Furthermore we wish to thank Ido Kanter for providing us with notes on some of his previous work. This work has been supported by the Danish National Councils for the Natural and Technical Sciences through the Danish Computational Neural Network Center CONNECT. \n\nREFERENCES \n\nBrody, T. A., Flores, J., French, J. B., Mello, P. A., Pandey, A., & Wong, S. S. (1981) Random-matrix physics. Rev. Mod. Phys. 53:385. \n\nHalkjær, S. (1996) Dynamics of learning in neural networks: application to the diagnosis of HIV and Alzheimer patients. 
Master's thesis, University of Copenhagen. \n\nHertz, J. A., Krogh, A. & Thorbergsson, G. I. (1989) Phase transitions in simple learning. J. Phys. A 22:2133-2150. \n\nJackson, J. E. (1991) A User's Guide to Principal Components. John Wiley & Sons. \n\nKrogh, A. (1992) Learning with noise in a linear perceptron. J. Phys. A 25:1119-1133. \n\nLe Cun, Y., Kanter, I. & Solla, S. A. (1991) Second Order Properties of Error Surfaces: Learning Time and Generalization. NIPS, 3:918-924. \n\nSollich, P. (1996) Learning in large linear perceptrons and why the thermodynamic limit is relevant to the real world. NIPS, 7:207-214. \n\nWendemuth, A., Opper, M. & Kinzel, W. (1993) The effect of correlations in neural networks. J. Phys. A 26:3165. \n", "award": [], "sourceid": 1254, "authors": [{"given_name": "S\u00f8ren", "family_name": "Halkj\u00e6r", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}