{"title": "Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1181, "page_last": 1188, "abstract": "", "full_text": "Limiting form of the sample covariance eigenspectrum in PCA and kernel PCA

David C. Hoyle & Magnus Rattray
Department of Computer Science, University of Manchester, Manchester M13 9PL, UK.
david.c.hoyle@man.ac.uk, magnus@cs.man.ac.uk

Abstract

We derive the limiting form of the eigenvalue spectrum for sample covariance matrices produced from non-isotropic data. For the analysis of standard PCA we study the case where the data has increased variance along a small number of symmetry-breaking directions. The spectrum depends on the strength of the symmetry-breaking signals and on a parameter α which is the ratio of sample size to data dimension. Results are derived in the limit of large data dimension while keeping α fixed. As α increases there are transitions in which delta functions emerge from the upper end of the bulk spectrum, corresponding to the symmetry-breaking directions in the data, and we calculate the bias in the corresponding eigenvalues. For kernel PCA the covariance matrix in feature space may contain symmetry-breaking structure even when the data components are independently distributed with equal variance. We show examples of phase-transition behaviour analogous to the PCA results in this case.

1 Introduction

A number of data analysis methods are based on the spectral decomposition of large matrices. Examples include Principal Component Analysis (PCA), kernel PCA and spectral clustering methods. PCA in particular is a ubiquitous method of data analysis [1]. The principal components are eigenvectors of the sample covariance matrix, ordered according to the size of the corresponding eigenvalues.
In PCA the data is projected onto the subspace corresponding to the first n principal components, where n is chosen according to some model selection criterion. Most methods for model selection require only the eigenvalue spectrum of the sample covariance matrix. It is therefore useful to understand how the sample covariance spectrum behaves given a particular data distribution. Much is known about the asymptotic properties of the spectrum in the case where the data distribution is isotropic, e.g. for the Gaussian Orthogonal Ensemble (GOE), and this knowledge can be used to construct model selection methods (see e.g. [2] and references therein). However, it is also instructive to consider the limiting behaviour in the case where the data does contain some low-dimensional structure. This is interesting as it allows us to understand the limits of learnability, and previous studies have already shown phase-transition behaviour in PCA learning from data containing a single symmetry-breaking direction [3]. The analysis of data models which include a signal component is also useful if we are to correct for bias in the estimated eigenvalues corresponding to retained components.

PCA has limited applicability because it is a globally linear method. A promising non-linear alternative is kernel PCA [4], in which data is projected into a high-dimensional feature space and PCA is carried out in this feature space. The kernel trick allows all computations to be carried out efficiently, so that the method is practical even when the feature space has a very high, or even infinite, dimension. In this case we are interested in properties of the eigenvalue spectrum of the sample covariance matrix in feature space. The covariance of the features will typically be non-isotropic even when the data itself has independently distributed components with equal variance.
The sample covariance spectrum will therefore show quite rich behaviour even when the data itself has no structure. It is important to understand the expected behaviour in order to develop model selection methods for kernel PCA analogous to those used for standard PCA. Model selection methods based on data models with isotropic noise (e.g. [2, 5]) are certainly not suitable for kernel PCA.

In this paper we apply methods from statistical mechanics and random matrix theory to determine the limiting form of the eigenvalue spectrum for sample covariance matrices produced from data containing symmetry-breaking structure. We first show how the replica method can be used to derive the spectrum for Gaussian data with a finite number of symmetry-breaking directions. This result is confirmed and generalised by studying the Stieltjes transform of the eigenvalue spectrum, suggesting that it may be insensitive to details of the data distribution. We then show how the results can be used to derive the limiting form of the eigenvalue spectrum of the feature covariance matrix (or Gram matrix) in kernel PCA for the case of a polynomial kernel.

2 Statistical mechanics theory for Gaussian data

We first consider a data set of N-dimensional data vectors {x_μ}, μ = 1, ..., p, containing a signal and noise component. Initially we restrict ourselves to the case where x_μ is drawn from a Gaussian distribution whose covariance matrix C is isotropic except for a small number of orthogonal symmetry-breaking directions, i.e.,

    C = σ²I + σ² Σ_{m=1}^{S} A_m B_m B_mᵀ,  with  B_nᵀB_m = δ_nm,  A_m > 0.   (1)

We define the sample covariance Ĉ = p⁻¹ Σ_μ x_μ x_μᵀ and study its eigenvalue spectrum in the limit N → ∞ when the ratio α = p/N is held fixed and the number of symmetry-breaking directions S is finite.
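This data model is easy to simulate. The following sketch (our own illustration, not code from the paper; all parameter values are assumptions) draws p = αN Gaussian samples with the population covariance of equation (1) and inspects the sample covariance spectrum; the S = 2 signal eigenvalues separate from the noise bulk:

```python
import numpy as np

# Illustrative simulation of the data model in equation (1): population
# covariance C = sigma^2 I + sigma^2 sum_m A_m B_m B_m^T with orthonormal
# directions B_m. All parameter values here are our own choices.
rng = np.random.default_rng(0)
N, alpha, sigma2 = 400, 2.0, 1.0
p = int(alpha * N)                                     # alpha = p / N
A = np.array([5.0, 3.0])                               # signal strengths A_m
B, _ = np.linalg.qr(rng.standard_normal((N, A.size)))  # orthonormal B_m

C = sigma2 * (np.eye(N) + (B * A) @ B.T)               # equation (1)
X = rng.multivariate_normal(np.zeros(N), C, size=p)    # rows are samples x_mu
C_hat = X.T @ X / p                                    # sample covariance
eigvals = np.linalg.eigvalsh(C_hat)[::-1]              # descending order
print(eigvals[:4])  # two separated eigenvalues, then the bulk edge
```

For these values the theory developed below (equation (6)) predicts separated eigenvalues near σ²(1 + A_m)(1 + 1/(αA_m)) ≈ 6.6 and 4.7, well above the bulk edge near 2.9.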
We work with the trace of the resolvent G(λ) = (λI − Ĉ)⁻¹, from which the density of eigenvalues ρ(λ) can be calculated,

    ρ(λ) = lim_{ε→0⁺} (Nπ)⁻¹ Im tr G(λ − iε),  where  tr G(λ) = Σ_{i=1}^{N} 1/(λ − λ_i)   (2)

and the λ_i are eigenvalues of Ĉ. The trace of the resolvent can be represented as,

    tr G(λ) = (∂/∂λ) log det(λI − Ĉ) ≡ (∂/∂λ) log Z(λ).   (3)

Using the standard representation of the determinant of a matrix, [det A]^(−1/2) = (2π)^(−N/2) ∫ exp[−½ φᵀAφ] dφ, we have,

    log Z(λ) = N log 2π − 2 log ∫ exp[−(λ/2)‖φ‖² + (1/2p) Σ_μ (φ·x_μ)²] dφ.   (4)

We assume that the eigenvalue spectrum is self-averaging, so that the calculation for a specific realisation of the sample covariance can be replaced by an ensemble average for large N that can be performed using the replica method (see e.g. [6]). Details are presented elsewhere [7] and here we simply state the results. The calculation is similar to [3], where Reimann et al. study the performance of PCA on Gaussian data with a single symmetry-breaking direction, although there are also notable differences between the calculations.

We find the following asymptotic result for the spectral density,

    ρ(λ) = (1 − α) Θ(1 − α) δ(λ) + (1/N) Σ_{m=1}^{S} δ(λ − λ_u(A_m, σ²)) Θ(α − A_m⁻²)
           + [1 − (1/N) Σ_{m=1}^{S} Θ(α − A_m⁻²)] (α / 2πλσ²) √(Max(0, (λ − λ_min)(λ_max − λ))),   (5)

where Θ(·) is the step function and we have defined,

    λ_max,min = σ² α⁻¹ (1 ± √α)²,   λ_u(A, σ²) = σ²(1 + A)(1 + 1/(αA)).   (6)

The first term in equation (5) sets a proportion 1 − α of the eigenvalues to zero when the rank of Ĉ is less than N, i.e. when α < 1. The last term represents the bulk of the spectrum and is identical to the well-known Marčenko-Pastur law for isotropic data with variance σ² [8, 9]. In [7] we also give the O(1/N) corrections to this term, but here we are mainly interested in the leading order. The second term contains contributions due to the underlying structure in the data. The mth symmetry-breaking term in the data covariance C only contributes to the spectrum if α > A_m⁻². This transition point must be exceeded before signals of a given strength can be detected, i.e. the signal must be sufficiently strong or the data set sufficiently large. This corresponds to the same learning transition point observed in studies of PCA on Gaussian data with a single symmetry-breaking direction [3]. Above this transition the sample covariance eigenvalue over-estimates the true variance corresponding to this component by a factor 1 + 1/(αA_m), which indicates a significant bias when the data set is small or the signal is relatively weak. Our result provides a method of bias correction for the top eigenvalues in this case.

In figure 1 we show results for Gaussian data with three symmetry-breaking directions, each above the transition point.
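For the parameters quoted in figure 1(a) (σ² = 1, α = 0.5, A_m² = 20, 15, 10) the predictions of equation (6) can be evaluated directly; the following sketch is our own illustration, not the paper's code:

```python
# Evaluate the predicted position lambda_u(A, sigma^2) of each separated
# eigenvalue and the Marcenko-Pastur bulk edges from equation (6),
# using sigma^2 = 1 and alpha = 0.5 as quoted for figure 1(a).
def lambda_u(A, sigma2, alpha):
    return sigma2 * (1 + A) * (1 + 1 / (alpha * A))

sigma2, alpha = 1.0, 0.5
lam_min = sigma2 / alpha * (1 - alpha ** 0.5) ** 2
lam_max = sigma2 / alpha * (1 + alpha ** 0.5) ** 2
positions = []
for A_sq in (20, 15, 10):              # strengths A_m^2 from figure 1(a)
    assert alpha > 1 / A_sq            # above the transition alpha > A_m^-2
    positions.append(lambda_u(A_sq ** 0.5, sigma2, alpha))
print([round(x, 3) for x in positions], round(lam_max, 4))
```

This gives λ_u ≈ 7.92, 7.39 and 6.79, all above the upper bulk edge λ_max ≈ 5.83, so three delta functions are expected beyond the bulk.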
On the left we show how the top eigenvalues separate from the bulk, while the inset compares the density of the bulk with the theoretical result, showing excellent agreement. On the right we show convergence to the theoretical result for λ_u(A, σ²) in equation (6) as the data dimension N is increased for fixed α.

3 Analysis of the Stieltjes transform

The statistical mechanics approach is useful because it allows the derivation of results from first principles, and it is possible to use this method to determine other self-averaging quantities of interest, e.g. the overlap between the leading eigenvectors of the sample and population covariances [3]. However, the method as presented here is restricted to Gaussian data. A number of results from the statistics literature have been derived under much weaker and often more explicit assumptions about the data distribution. It is therefore interesting to ask whether equation (5) can also be derived from these results.

Marčenko and Pastur [8] studied the case of data with a general covariance matrix. The limiting distribution was shown to satisfy,

    ρ(λ) = lim_{ε→0⁺} π⁻¹ Im α m_ρ(λ + iε),  where  m_ρ(z) = α⁻¹ ∫ dλ ρ(λ)/(λ − z).   (7)

Figure 1: In (a) we show eigenvalues of the sample covariance matrix for Gaussian data with σ² = 1, N = 2000 and α = 0.5.
The data contains three symmetry-breaking directions with strengths A₁² = 20, A₂² = 15 and A₃² = 10, all above the transition point. The inset shows the distribution of all non-zero eigenvalues except for the largest three, with the solid line showing the theoretical result. In (b) we show the fractional difference between the three largest eigenvalues λ_i and the theoretical value λ_u(A_i, σ²) for i = 1, 2, 3. We set α = 0.2, averaged λ_i over 1000 samples to get ⟨λ_i⟩, set Δλ_i = |1 − ⟨λ_i⟩/λ_u(A_i, σ²)| and set other values as in (a).

Here, m_ρ(z) is the Stieltjes transform of α⁻¹ρ(λ) and is equal to −p⁻¹ tr G(z). The above equation is therefore exactly equivalent to equation (2), and we see that this approach starts from the same point as the statistical mechanics theory. Marčenko and Pastur showed that the Stieltjes transform satisfies the following relationship,

    z(m_ρ) = −1/m_ρ + α⁻¹ ∫ dH(t)/(t⁻¹ + m_ρ).   (8)

The measure H(t) is defined such that N⁻¹ Σ_i d_i^k converges to ∫ t^k dH(t) for all k, where the d_i are the eigenvalues of C. An equivalent result is also derived by Wachter [10] and more recently by Sengupta and Mitra using the replica method [11] (for Gaussian data). Silverstein and Choi have shown that the support of ρ(λ) can be determined by the intervals between extrema of z(m_ρ) [12], and this has been used to determine the signal component of a spectrum when O(N) equal-strength symmetry-breaking directions are present [13]. Since C in equation (1) only contains a finite number of symmetry-breaking directions, in the limit N → ∞ these will have zero measure as defined by H. Thus, in this limit the eigenvalue density would appear to be identical to the isotropic case.
However, it is the behaviour of the largest eigenvalues that we are most interested in, even though these may have vanishing measure. For the case of a single symmetry-breaking direction (S = 1, A₁ = A) we take dH(t) = (1 − ε)δ(t − σ²)dt + εδ(t − σ²(1 + A))dt, with ε ≈ 1/N. This gives,

    z(m_ρ) = −1/m_ρ + (1 − ε)α⁻¹/(σ⁻² + m_ρ) + εα⁻¹/(σ⁻²(1 + A)⁻¹ + m_ρ),   (9)

and stationary points satisfy,

    0 = 1/m_ρ² − (1 − ε)α⁻¹/(σ⁻² + m_ρ)² − εα⁻¹/(σ⁻²(1 + A)⁻¹ + m_ρ)².   (10)

Since ε ≪ 1 we do not expect the behaviour of z(m_ρ) to be modified substantially in the interval [λ_min, λ_max]. Therefore we look for additional stationary points close to the singularity at m_ρ = −σ⁻²(1 + A)⁻¹. Setting m_ρ = −σ⁻²(1 + A)⁻¹ + δ and expanding (10) yields δ = ε^(1/2)/(σ²(1 + A)√(α − A⁻²)) + O(ε). Substituting this into (9) gives z(−σ⁻²(1 + A)⁻¹ + δ) = σ²(1 + A)(1 + (αA)⁻¹) + O(ε^(1/2)). Thus, as N → ∞, if the stationary points at −σ⁻²(1 + A)⁻¹ + δ exist they will define a small interval of z centred on λ_u(A, σ²) and so define an approximate contribution of N⁻¹δ(λ − λ_u(A, σ²)) to the spectrum, in agreement with the previous calculations using replicas. We also see that for δ to be real requires α > A⁻², in agreement with our previous calculation for the learning transition point.
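This perturbative argument can be checked numerically: using the explicit z(m_ρ) of equation (9), bisection locates the extra stationary point of equation (10) just above the singularity, and z there reproduces λ_u(A, σ²). The sketch below is our own illustration; the values of σ², A, α and ε are assumptions, with ε playing the role of 1/N:

```python
# Numerical check of the perturbative analysis of equations (9) and (10)
# for a single symmetry-breaking direction. Parameter values are illustrative.
sigma2, A, alpha, eps = 1.0, 5.0, 0.5, 1e-6
t0 = 1.0 / (sigma2 * (1 + A))        # the singularity sits at m = -t0

def z(m):                            # equation (9)
    return -1/m + (1 - eps)/(alpha*(1/sigma2 + m)) + eps/(alpha*(t0 + m))

def dz(m):                           # stationarity condition, equation (10)
    return 1/m**2 - (1 - eps)/(alpha*(1/sigma2 + m)**2) - eps/(alpha*(t0 + m)**2)

# Bisect for the stationary point at m = -t0 + delta, where dz changes sign.
lo, hi = 1e-9, 1e-2
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if dz(-t0 + mid) < 0 else (lo, mid)
delta = 0.5 * (lo + hi)

delta_pred = eps**0.5 / (sigma2*(1 + A)*(alpha - A**-2)**0.5)  # leading order
lam_u = sigma2*(1 + A)*(1 + 1/(alpha*A))                       # equation (6)
print(delta, delta_pred, z(-t0 + delta), lam_u)
```

The bisection root agrees with the leading-order formula for δ to within O(ε) corrections, and z at the stationary point is close to λ_u = 8.4 for these values.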
A similar perturbative analysis when C contains more than one symmetry-breaking direction gives a set of contributions N⁻¹δ(λ − λ_u(A_m, σ²)), m = 1, ..., S, to ρ(λ). Again this is in agreement with our previous replica analysis of the resolvent.

The relationship in equation (8) can be obtained with only relatively weak conditions on the data distribution. One requirement is that the second moment of each element of Ĉ exists. Bai has considered the case of data vectors with non-Gaussian i.i.d. components (e.g. [14]), while Marčenko and Pastur show that the data vector components do not have to be independently distributed for the relation to hold, and they give sufficient conditions on the 4th-order cross-moments of the data vector components [8]. In [7] we study PCA on some examples of non-Gaussian data with symmetry-breaking structure (non-Gaussian signal and noise) and show that the separated eigenvalues behave similarly to figure 1.

4 Eigenvalue spectra for kernel PCA

Equation (8) holds under quite weak conditions on the data distribution. It is therefore hoped that we can apply these results to the feature space of kernel PCA [4]. In kernel PCA the data x is transformed into a feature vector φ(x) and standard PCA is carried out in the feature space. The method requires that we can define a kernel function k(x, y) = φ(x)·φ(y) that allows efficient computation of the dot-product in a high, or even infinite, dimensional space. The eigenvalues of the sample covariance in feature space are identical to the eigenvalues of the Gram matrix K_μν with entries k(x_μ, x_ν), and the eigenvalues can therefore be computed efficiently for arbitrary feature-space dimension as long as the number of samples p is not too large (NB. 
The Gram matrix first has to be centred [4] so that the data has zero mean in the feature space).

One common choice of kernel function is the polynomial kernel k(x, y) = (c + x·y)^d, in which case, for integer d, the features are all possible monomials up to order d involving components of x. We limit our attention here to the quadratic kernel (d = 2). We consider data vectors with components that are independently and symmetrically distributed with equal variance σ², and choose a set of features φ(x) = (√(2c) x, Vec[xxᵀ]), where Vec[xxᵀ]_{j+N(i−1)} = x_i x_j. The covariance in feature space is block diagonal,

    C = diag( 2c⟨xxᵀ⟩ , ⟨Vec[xxᵀ]Vec[xxᵀ]ᵀ⟩ − ⟨Vec[xxᵀ]⟩⟨Vec[xxᵀ]ᵀ⟩ ),

where angled brackets denote expectations over the data distribution. The non-zero eigenvalues of C, with their multiplicities, are

    d_i = 2cσ²  (number N),   d_i = 2σ⁴  (number N(N − 1)/2),   d_i = 2σ⁴ + κ₄ⁱ  (number N),

where κ₄ⁱ = ⟨x_i⁴⟩ − 3σ⁴ is the 4th cumulant of the ith component of x.
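The equivalence between the feature-space covariance and the Gram matrix is easy to see numerically. Below is a sketch (our illustration, with assumed sizes) for the quadratic kernel with c = 0 and i.i.d. Gaussian inputs; the centred Gram matrix divided by p has at most min(p − 1, N(N + 1)/2) non-zero eigenvalues, matching the effective feature dimension:

```python
import numpy as np

# Sketch of the Gram-matrix route for the quadratic kernel k(x, y) = (c + x.y)^2
# with c = 0 and i.i.d. Gaussian inputs. Sizes N and p are illustrative choices.
rng = np.random.default_rng(1)
N, p, c, sigma2 = 20, 400, 0.0, 1.0
X = rng.standard_normal((p, N)) * np.sqrt(sigma2)   # rows are samples x_mu

K = (c + X @ X.T) ** 2                              # Gram matrix K_{mu nu}
J = np.eye(p) - np.ones((p, p)) / p                 # centring projector
Kc = J @ K @ J                                      # zero mean in feature space
eigs = np.linalg.eigvalsh(Kc / p)[::-1]             # same non-zero eigenvalues
                                                    # as the feature covariance
d_feat = N * (N + 1) // 2                           # effective feature dimension
print(eigs[:3], int(np.sum(eigs > 1e-8)), d_feat)
```

For these sizes exactly N(N + 1)/2 = 210 eigenvalues are (generically) non-zero, and their sum concentrates near the trace of the population feature covariance, N(N + 1)σ⁴ = 420.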
We see that although each component of the data is independently distributed with equal variance, the covariance structure in feature space may be quite complex.

• Gaussian data, c = 0

For isotropic Gaussian data and c = 0 there is a single degenerate eigenvalue of C, and the asymptotic result for the spectrum is identical to the case of an isotropic distribution [8, 9] with variance 2σ⁴ and α defined as the ratio of the number of examples p to the effective dimension in the feature space N(N + 1)/2 (i.e. the degeneracy of the non-zero eigenvalue), so that α = 2p/N(N + 1) and p = O(N²) is the appropriate scaling.

Figure 2: On the left we show the Gram matrix eigenspectrum for a sample data set and compare it to the theoretical result. The kernel is purely quadratic (c = 0) and we use isotropic Gaussian data with 2σ⁴ = 1, N = 63 and p = 1000, so that α ≈ 0.5. On the right we show the averaged top eigenvalue against p for fixed α. Each point is averaged over 100 samples except for the right-most, which is averaged over 50. The dashed line shows the theoretical result λ₁ = 5.8284 and the inset is a log-log plot of the same data.

On the left of figure 2 we compare the spectrum for a single sample data set to the theory for p = 1000 and N = 63, which corresponds to α ≈ 0.50; the theoretical curve is almost identical to the one used in the inset to figure 1(a).
The finite size effects are much larger than would be observed for PCA with isotropic data, and on the right of figure 2 we show the average of the top eigenvalue for this value of α as p is increased, showing a very slow convergence to the asymptotic result.

• Gaussian data, c > 0

For isotropic Gaussian data and c > 0 there are two eigenvalues of C, with degeneracy N and N(N + 1)/2 respectively. For large N and c > σ² the top N eigenvalues play an analogous role to the top S eigenvalues in the PCA data model defined in section 2. A similar perturbative expansion to the one described in section 3 shows that when α < (c/σ² − 1)⁻² (where α ≈ 2p/N² is defined relative to the feature space) the distribution is identical to the c = 0 case. For α above this transition point the N top eigenvalues separate from the bulk. In the limit N → ∞ with p = O(N²) the spread of the upper N eigenvalues will tend to zero and they will become localised at λ_u(c/σ² − 1, 2σ⁴) as defined by equation (6). For finite N, and when the two components of the spectra are well separated, we can approximate the eigenvalue spectrum of the top N eigenvalues as though the data only contains these components, i.e. we model this cluster as isotropic data with α = p/N and variance 2cσ². We obtain an improved approximation by correcting the mean of the separated cluster by the value predicted for the mean in the large-N limit.

On the left of figure 3 we compare this approximation to the Gram matrix spectrum averaged over 300 data sets for large c, with the inset showing the separated cluster. The theory is shown by the solid line and provides a good qualitative fit to the data, although there are significant discrepancies.
For the bulk we believe these to be due to finite size effects, but the theory for the spread of the upper N eigenvalues is only approximate, since the spread of this cluster will vanish as N → ∞ for fixed c and p = O(N²). On the right of figure 3 we plot the average of the top N eigenvalues against c, showing good agreement with the theory. The top eigenvalue of the population covariance is shown by the line, and the theory accurately predicts the bias in the sample estimate.

Figure 3: On the left we show the Gram matrix eigenvalue spectrum averaged over 300 data sets and compare it to the theoretical result. The inset shows the density of the top N eigenvalues, which are separated from the bulk. The kernel is quadratic with c = σ²(1 + √500), with other parameters as in figure 2. On the right we show the average of the top N eigenvalues against the theoretical result as a function of c.

• Non-Gaussian data

For non-Gaussian data the 4th cumulants κ₄ⁱ act as symmetry-breaking signal strengths, with α again defined relative to the effective dimension of the feature space. For each component of the data with κ₄ⁱ > 2σ⁴/√α there will be a delta function in the spectrum at λ_u(κ₄ⁱ/2σ⁴, 2σ⁴), as defined by equation (6).

Figure 4: Gram matrix eigenvalue spectra compared with the theory (probability density on the left, eigenvalue against rank on the right).

In figure 4 we show the Gram matrix eigenvalues for a data set containing a single dimension having positive kurtosis. On the left we have κ₄ = 5, which is above the transition. We have indicated with arrows the theoretical prediction for the top two eigenvalues, and we see that there is a significant difference, although the separation is quite well described by the theory. We expect that these discrepancies are due to large finite size effects, and further simulations are required to verify this. On the right we have κ₄ = 1, which is below the transition, and the spectrum is very similar to the case for isotropic Gaussian data.

5 Conclusion

We studied the asymptotic form of the sample covariance eigenvalue spectrum from data with symmetry-breaking structure. For standard PCA the asymptotic results are very accurate even for moderate data dimension, but for kernel PCA with a quadratic kernel we found that convergence to the asymptotic result was slow. The limiting form of sample covariance spectra has previously been studied in the neural networks literature, where it can be used to determine the optimal batch learning rate for large linear perceptrons. Indeed, the results derived in section 2 for Gaussian data can also be derived by adapting an elegant method developed by Sollich [15], without recourse to the replica method. Halkjær & Winther used this approach to compute the spectral density for the case of a single symmetry-breaking direction and obtained a similar result to ours, except that the position of the separated eigenvalue was at σ²(1 + A), which differs from our result [16]. In fact they assumed a large signal in their derivation, which can easily be adapted to obtain an identical result to ours.
However this method, as well as the replica approach used here, is limited because it only applies to Gaussian data, while the Stieltjes transform relationship in equation (8) has been derived under much weaker conditions on the data distribution.

Our current work is focussed on extending the analysis to more general kernels, such as the radial basis function (RBF) kernel where the feature space dimension is infinite. In the general case we find that the Stieltjes transform can be derived by a variational mean field theory and therefore provides a principled approximation to the average spectral density.

Acknowledgments

DCH was supported by a MRC(UK) Special Training Fellowship in Bioinformatics. We would like to thank the anonymous reviewers for useful comments and for pointing out references [15] and [16].

References

[1] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[2] I.M. Johnstone. Ann. Stat., 29, 2001.
[3] P. Reimann, C. Van den Broeck, and G.J. Bex. J. Phys. A: Math. Gen., 29:3521, 1996.
[4] B. Schölkopf, A. Smola, and K.-R. Müller. Neural Computation, 10:1299–1319, 1998.
[5] T.P. Minka. Automatic choice of dimensionality for PCA. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, NIPS 13, pages 598–604. MIT Press, 2001.
[6] A. Engel and C. Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.
[7] D.C. Hoyle and M. Rattray. Phys. Rev. E, in press.
[8] V.A. Marčenko and L.A. Pastur. Math. USSR-Sb., 1:507, 1967.
[9] A. Edelman. SIAM J. Matrix Anal. Appl., 9:543, 1988.
[10] K.W. Wachter. Ann. Probab., 6:1, 1978.
[11] A.M. Sengupta and P.P. Mitra. Phys. Rev. E, 60:3389, 1999.
[12] J.W. Silverstein and S. Choi. J. Multivariate Analysis, 54:295, 1995.
[13] J.W. Silverstein and P.L. Combettes. IEEE Trans. Signal Processing, 40:2100, 1992.
[14] Z.D. Bai. Ann. Probab., 21:649, 1993.
[15] P. Sollich. J. Phys. 
A, 27:7771, 1994.
[16] S. Halkjær and O. Winther. In M. Mozer, M. Jordan, and T. Petsche, editors, NIPS 9, page 169. MIT Press, 1997.
", "award": [], "sourceid": 2501, "authors": [{"given_name": "David", "family_name": "Hoyle", "institution": null}, {"given_name": "Magnus", "family_name": "Rattray", "institution": null}]}