{"title": "On the Concentration of Spectral Properties", "book": "Advances in Neural Information Processing Systems", "page_first": 511, "page_last": 517, "abstract": null, "full_text": "On the Concentration of Spectral Properties \n\nJohn Shawe-Taylor \nRoyal Holloway, University of London \njohn@cs.rhul.ac.uk \n\nNello Cristianini \nBIOwulf Technologies \nnello@support-vector.net \n\nJaz Kandola \nRoyal Holloway, University of London \njaz@cs.rhul.ac.uk \n\nAbstract \n\nWe consider the problem of measuring the eigenvalues of a randomly drawn sample of points. We show that these values can be reliably estimated, as can the sum of the tail of eigenvalues. Furthermore, the residual when the data is projected into a subspace is shown to be reliably estimated on a random sample. Experiments are presented that confirm the theoretical results. \n\n1 Introduction \n\nA number of learning algorithms rely on estimating spectral data from a sample of training points and using this data as input to further analyses. For example, in Principal Component Analysis (PCA) the subspace spanned by the first k eigenvectors is used to give a k-dimensional model of the data with minimal residual, hence forming a low-dimensional representation of the data for analysis or clustering. Recently the approach has been applied in kernel-defined feature spaces in what has become known as kernel-PCA [5]. This representation has also been related to an Information Retrieval algorithm known as latent semantic indexing, again with kernel-defined feature spaces [2]. \nFurthermore, eigenvectors have been used in the HITS [3] and Google's PageRank [1] algorithms. In both cases the entries in the eigenvector corresponding to the maximal eigenvalue are interpreted as authority weightings for individual articles or web pages. 
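The dominant eigenvector mentioned above is typically approximated by power iteration. The paper gives no code; the following is a minimal illustrative sketch, in which the small link matrix, the function name, and the HITS-style authority scoring are our own assumptions:

```python
import numpy as np

# Hypothetical 4-page link structure (not from the paper):
# A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def dominant_eigenvector(M, iters=1000):
    """Power iteration: repeatedly apply M and renormalise.

    Converges (up to sign) to the eigenvector of the maximal
    eigenvalue when that eigenvalue is strictly dominant."""
    v = np.ones(M.shape[0]) / M.shape[0]
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

# HITS-style hub scores: principal eigenvector of A A'.
v = dominant_eigenvector(A @ A.T)
print("scores:", np.round(v, 3))
```

By Perron-Frobenius, the entries of this vector are non-negative for a non-negative link matrix, which is what allows them to be read as weightings.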
\nThe use of these techniques raises the question of how reliably these quantities can be estimated from a random sample of data, or, phrased differently, how much data is required to obtain an accurate empirical estimate with high confidence. Ng et al. [6] have undertaken a study of the sensitivity of the estimate of the first eigenvector to perturbations of the connection matrix. They have also highlighted the potential instability that can arise when two eigenvalues are very close in value, so that their eigenspaces become very difficult to distinguish empirically. \nThe aim of this paper is to study the error in estimation that can arise from the random sampling rather than from perturbations of the connectivity. We address this question using concentration inequalities. We will show that eigenvalues estimated from a sample of size m are indeed concentrated, and furthermore that the sum of the last m - k eigenvalues is subject to a similar concentration effect, both results of independent mathematical interest. The sum of the last m - k eigenvalues is related to the error in forming a k-dimensional PCA approximation, and hence will be shown to justify using empirical projection subspaces in such algorithms as kernel-PCA and latent semantic kernels. \nThe paper is organised as follows. In section 2 we give the background results and develop the basic techniques that are required to derive the main results in section 3. We provide experimental verification of the theoretical findings in section 4, before drawing our conclusions. \n\n2 Background and Techniques \n\nWe will make use of the following results due to McDiarmid. Note that \\mathbb{E}_S is the expectation operator under the selection of the sample. \n\nTheorem 1 (McDiarmid [4]) Let X_1, ..., X_n be independent random variables taking values in a set A, and assume that f : A^n \\to \\mathbb{R} and f_i : A^{n-1} \\to \\mathbb{R} satisfy for 1 \\le i \\le n \n\n\\sup_{x_1, ..., x_n} |f(x_1, ..., x_n) - f_i(x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)| \\le c_i. \n\nThen for all \\epsilon > 0, \n\nP\\{|f(X_1, ..., X_n) - \\mathbb{E} f(X_1, ..., X_n)| > \\epsilon\\} \\le 2 \\exp\\left( \\frac{-2\\epsilon^2}{\\sum_{i=1}^n c_i^2} \\right). \n\nTheorem 2 (McDiarmid [4]) Let X_1, ..., X_n be independent random variables taking values in a set A, and assume that f : A^n \\to \\mathbb{R} satisfies for 1 \\le i \\le n \n\n\\sup_{x_1, ..., x_n, \\hat{x}_i} |f(x_1, ..., x_n) - f(x_1, ..., x_{i-1}, \\hat{x}_i, x_{i+1}, ..., x_n)| \\le c_i. \n\nThen for all \\epsilon > 0, \n\nP\\{|f(X_1, ..., X_n) - \\mathbb{E} f(X_1, ..., X_n)| > \\epsilon\\} \\le 2 \\exp\\left( \\frac{-2\\epsilon^2}{\\sum_{i=1}^n c_i^2} \\right). \n\nWe will also make use of the following theorem characterising the eigenvectors of a symmetric matrix. \n\nTheorem 3 (Courant-Fischer Minimax Theorem) If M \\in \\mathbb{R}^{m \\times m} is symmetric, then for k = 1, ..., m, \n\n\\lambda_k(M) = \\max_{\\dim(T) = k} \\min_{0 \\ne v \\in T} \\frac{v'Mv}{v'v} = \\min_{\\dim(T) = m - k + 1} \\max_{0 \\ne v \\in T} \\frac{v'Mv}{v'v}, \n\nwith the extrema achieved by the corresponding eigenvector. \n\nThe approach adopted in the proofs of the next section is to view the eigenvalues as sums of squares of residuals. This is applicable when the matrix is positive semi-definite and hence can be written as an inner product matrix M = X'X, where X' is the transpose of the matrix X containing the m vectors x_1, ..., x_m as columns. This is the finite-dimensional version of Mercer's theorem, and follows immediately if we take X = \\sqrt{\\Lambda} V', where M = V \\Lambda V' is the eigenvalue decomposition of M. There may be more succinct ways of representing X, but we will assume for simplicity (but without loss of generality) that X is a square matrix with the same dimensions as M. To set the scene, we now present a short description of the residuals viewpoint. \nThe starting point is the singular value decomposition of X = U \\Sigma V', where U and V are orthonormal matrices and \\Sigma is a diagonal matrix containing the singular values (in descending order). We can now reconstruct the eigenvalue decomposition of M = X'X = V \\Sigma U' U \\Sigma V' = V \\Lambda V', where \\Lambda = \\Sigma^2. 
But equally we can construct a matrix N = XX' = U \\Sigma V' V \\Sigma U' = U \\Lambda U', with the same eigenvalues as M. \nAs a simple example consider now the first eigenvalue, which by Theorem 3 and the above observations is given by \n\n\\lambda_1(M) = \\max_{0 \\ne v \\in \\mathbb{R}^m} \\frac{v'Nv}{v'v} = \\max_{0 \\ne v \\in \\mathbb{R}^m} \\frac{v'XX'v}{v'v} = \\max_{0 \\ne v \\in \\mathbb{R}^m} \\sum_{j=1}^m \\|P_v(x_j)\\|^2 = \\sum_{j=1}^m \\|x_j\\|^2 - \\min_{0 \\ne v \\in \\mathbb{R}^m} \\sum_{j=1}^m \\|P_v^\\perp(x_j)\\|^2, \n\nwhere P_v(x) (P_v^\\perp(x)) is the projection of x onto the space spanned by v (space perpendicular to v), since \\|x\\|^2 = \\|P_v(x)\\|^2 + \\|P_v^\\perp(x)\\|^2. It follows that the first eigenvector is characterised as the direction for which the sum of the squares of the residuals is minimal. \nApplying the same line of reasoning to the first equality of Theorem 3 delivers the following equality: \n\n\\lambda_k = \\max_{\\dim(V) = k} \\min_{0 \\ne v \\in V} \\sum_{j=1}^m \\|P_v(x_j)\\|^2. (1) \n\nNotice that this characterisation implies that if v^k is the k-th eigenvector of N, then \n\n\\lambda_k = \\sum_{j=1}^m \\|P_{v^k}(x_j)\\|^2, (2) \n\nwhich in turn implies that if V_k is the space spanned by the first k eigenvectors, then \n\n\\sum_{i=1}^k \\lambda_i = \\sum_{j=1}^m \\|P_{V_k}(x_j)\\|^2 = \\sum_{j=1}^m \\|x_j\\|^2 - \\sum_{j=1}^m \\|P_{V_k}^\\perp(x_j)\\|^2, (3) \n\nwhere P_V(x) (P_V^\\perp(x)) is the projection of x into the space V (space perpendicular to V). It readily follows by induction over the dimension of V that we can equally characterise the sum of the first k and last m - k eigenvalues by \n\n\\sum_{i=1}^k \\lambda_i = \\max_{\\dim(V) = k} \\sum_{j=1}^m \\|P_V(x_j)\\|^2 = \\sum_{j=1}^m \\|x_j\\|^2 - \\min_{\\dim(V) = k} \\sum_{j=1}^m \\|P_V^\\perp(x_j)\\|^2, \n\n\\sum_{j=1}^m \\|x_j\\|^2 - \\sum_{i=1}^k \\lambda_i = \\min_{\\dim(V) = k} \\sum_{j=1}^m \\|P_V^\\perp(x_j)\\|^2. (4) \n\nHence, as for the case when k = 1, the subspace spanned by the first k eigenvectors is characterised as that for which the sum of the squares of the residuals is minimal. 
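The identities above are straightforward to verify numerically. A minimal sketch, assuming synthetic Gaussian data (the variable names are our own), checks equations (2) and (4) for a random square X whose columns play the role of the x_j:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 8, 3
X = rng.standard_normal((m, m))   # columns are x_1, ..., x_m
N = X @ X.T                       # N = XX', same eigenvalues as M = X'X

# Eigenvectors of N, sorted by decreasing eigenvalue
# (np.linalg.eigh returns them in ascending order).
lam, U = np.linalg.eigh(N)
lam, U = lam[::-1], U[:, ::-1]

# Equation (2): lambda_k is the sum of squared projections onto v^k.
vk = U[:, k - 1]
proj_sq = np.sum((vk @ X) ** 2)   # sum_j ||P_{v^k}(x_j)||^2
assert np.isclose(proj_sq, lam[k - 1])

# Equation (4): the residual of the best k-dimensional subspace
# equals the sum of the last m - k eigenvalues.
Vk = U[:, :k]
residual = np.sum(X ** 2) - np.sum((Vk.T @ X) ** 2)
assert np.isclose(residual, np.sum(lam[k:]))
print("equations (2) and (4) verified")
```

The check of equation (2) uses the fact that the squared projection of x_j onto a unit vector v is (v'x_j)^2, so the sum over j is v'XX'v = v'Nv.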
\nFrequently, we consider all of the above as occurring in a kernel-defined feature space, so that wherever we have written x_j we should have put \\phi(x_j), where \\phi is the corresponding feature map. \n\n3 Concentration of eigenvalues \n\nThe previous section outlined the relatively well-known perspective that we now apply to obtain the concentration results for the eigenvalues of positive semi-definite matrices. The key to the results is the characterisation in terms of the sums of residuals given in equations (1) and (4). \n\nTheorem 4 Let K(x, z) be a positive semi-definite kernel function on a space X, and let \\mu be a distribution on X. Fix natural numbers m and 1 \\le k < m and let S = (x_1, ..., x_m) \\in X^m be a sample of m points drawn according to \\mu. Then for all \\epsilon > 0, \n\nP\\left\\{\\left|\\frac{1}{m}\\lambda_k(S) - \\mathbb{E}_S\\left[\\frac{1}{m}\\lambda_k(S)\\right]\\right| \\ge \\epsilon\\right\\} \\le 2 \\exp\\left( \\frac{-2\\epsilon^2 m}{R^4} \\right), \n\nwhere \\lambda_k(S) is the k-th eigenvalue of the matrix K(S) with entries K(S)_{ij} = K(x_i, x_j) and R^2 = \\max_{x \\in X} K(x, x). \n\nProof: The result follows from an application of Theorem 1 provided \n\n\\sup_S \\left|\\frac{1}{m}\\lambda_k(S) - \\frac{1}{m}\\lambda_k(S \\setminus \\{x_i\\})\\right| \\le R^2/m. \n\nLet \\hat{S} = S \\setminus \\{x_i\\} and let V (\\hat{V}) be the k-dimensional subspace spanned by the first k eigenvectors of K(S) (K(\\hat{S})). Using equation (1) we have \n\n\\lambda_k(S) \\ge \\min_{0 \\ne v \\in \\hat{V}} \\sum_{j=1}^m \\|P_v(\\phi(x_j))\\|^2 \\ge \\min_{0 \\ne v \\in \\hat{V}} \\sum_{j \\ne i} \\|P_v(\\phi(x_j))\\|^2 = \\lambda_k(\\hat{S}), \n\n\\lambda_k(\\hat{S}) \\ge \\min_{0 \\ne v \\in V} \\sum_{j \\ne i} \\|P_v(\\phi(x_j))\\|^2 \\ge \\lambda_k(S) - R^2, \n\nso that |\\lambda_k(S) - \\lambda_k(\\hat{S})| \\le R^2, as required. \\square \n\nSurprisingly, a very similar result holds when we consider the sum of the last m - k eigenvalues. \n\nTheorem 5 Let K(x, z) be a positive semi-definite kernel function on a space X, and let \\mu be a distribution on X. Fix natural numbers m and 1 \\le k < m and let S = (x_1, ..., x_m) \\in X^m be a sample of m points drawn according to \\mu. Then for all \\epsilon > 0, \n\nP\\left\\{\\left|\\frac{1}{m}\\lambda_{>k}(S) - \\mathbb{E}_S\\left[\\frac{1}{m}\\lambda_{>k}(S)\\right]\\right| \\ge \\epsilon\\right\\} \\le 2 \\exp\\left( \\frac{-2\\epsilon^2 m}{R^4} \\right), \n\nwhere \\lambda_{>k}(S) is the sum of all but the largest k eigenvalues of the matrix K(S) with entries K(S)_{ij} = K(x_i, x_j) and R^2 = \\max_{x \\in X} K(x, x). 
\nProof: The result follows from an application of Theorem 1 provided \n\n\\sup_S \\left|\\frac{1}{m}\\lambda_{>k}(S) - \\frac{1}{m}\\lambda_{>k}(S \\setminus \\{x_i\\})\\right| \\le R^2/m. \n\nLet \\hat{S} = S \\setminus \\{x_i\\} and let V (\\hat{V}) be the k-dimensional subspace spanned by the first k eigenvectors of K(S) (K(\\hat{S})). Using equation (4) we have \n\n\\lambda_{>k}(\\hat{S}) \\le \\sum_{j \\ne i} \\|P_V^\\perp(\\phi(x_j))\\|^2 \\le \\sum_{j=1}^m \\|P_V^\\perp(\\phi(x_j))\\|^2 = \\lambda_{>k}(S), \n\n\\lambda_{>k}(S) \\le \\sum_{j=1}^m \\|P_{\\hat{V}}^\\perp(\\phi(x_j))\\|^2 \\le \\lambda_{>k}(\\hat{S}) + R^2, \n\nso that |\\lambda_{>k}(S) - \\lambda_{>k}(\\hat{S})| \\le R^2, as required. \\square \n\nOur next result concerns the concentration of the residuals with respect to a fixed subspace. For a subspace V and training set S, we introduce the notation \n\nF_V(S) = \\frac{1}{m} \\sum_{i=1}^m \\|P_V^\\perp(\\phi(x_i))\\|^2. \n\nTheorem 6 Let \\mu be a distribution on X. Fix a natural number m and a subspace V, and let S = (x_1, ..., x_m) \\in X^m be a sample of m points drawn according to \\mu. Then for all \\epsilon > 0, \n\nP\\{|F_V(S) - \\mathbb{E}_S[F_V(S)]| \\ge \\epsilon\\} \\le 2 \\exp\\left( \\frac{-2\\epsilon^2 m}{R^4} \\right), \n\nwhere R^2 = \\max_{x \\in X} K(x, x) as in the previous theorems. \n\nProof: The result follows from an application of Theorem 2 provided \n\n\\sup_{S, \\hat{x}_i} |F_V(S) - F_V((S \\setminus \\{x_i\\}) \\cup \\{\\hat{x}_i\\})| \\le R^2/m. \n\nClearly the largest change will occur if one of the points x_i and \\hat{x}_i lies in the subspace V and the other does not. In this case the change will be at most R^2/m. \\square \n\n4 Experiments \n\nIn order to test the concentration results we performed experiments with the Breast Cancer data using a cubic polynomial kernel. The kernel was chosen to ensure that the spectrum did not decay too fast. \nWe randomly selected 50% of the data as a 'training' set and kept the remaining 50% as a 'test' set. We centred the whole data set so that the origin of the feature space is placed at the centre of gravity of the training set. We then performed an eigenvalue decomposition of the training set. The sum of the eigenvalues greater than the k-th gives the sum of the residual squared norms of the training points when we project onto the space spanned by the first k eigenvectors. 
Dividing this by the average of all the eigenvalues (which measures the average squared norm of the training points in the transformed space) gives the fraction of the residual not captured in the k-dimensional projection. This quantity was averaged over 5 random splits and plotted against dimension in Figure 1 as the continuous line. The error bars give one standard deviation. Figure 1a shows the full spectrum, while Figure 1b shows a zoomed-in subwindow. The very tight error bars show clearly the very tight concentration of the sums of the tail of eigenvalues, as predicted by Theorem 5. \nIn order to test the concentration results for subspaces we measured the residuals of the test points when they are projected into the subspace spanned by the first k eigenvectors generated above for the training set. The dashed lines in Figure 1 show the ratio of the average squares of these residuals to the average squared norm of the test points. We see the two curves tracking each other very closely, indicating that the subspace identified as optimal for the training set is indeed capturing almost the same amount of information in the test points. \n\n5 Conclusions \n\nThe paper has shown that the eigenvalues of a positive semi-definite matrix generated from a random sample are concentrated. Furthermore, the sum of the last m - k eigenvalues is similarly concentrated, as is the residual when the data is projected into a fixed subspace. \n
\n[Figure 1 appeared here; its axes show Residual Error against Projection Dimensionality.] \n\nFigure 1: Plots of the fraction of the average squared norm captured in the subspace spanned by the first k eigenvectors for different values of k. The continuous line is the fraction for the training set, while the dashed line is for the test set. (a) shows the full spectrum, while (b) zooms in on an interesting portion. \n\nExperiments are presented that confirm the theoretical predictions on a real-world dataset. The results provide a basis for performing PCA or kernel-PCA from a randomly generated sample, as they confirm that the subspace identified by the sample will indeed 'generalise' in the sense that it will capture most of the information in a test sample. \nFurther research should look at the question of how the space identified by a subsample relates to the eigenspace of the underlying kernel operator. \n\nReferences \n\n[1] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International World Wide Web Conference, 1998. \n\n[2] Nello Cristianini, Huma Lodhi, and John Shawe-Taylor. Latent semantic kernels for feature selection. Technical Report NC-TR-00-080, NeuroCOLT Working Group, http://www.neurocolt.org, 2000. \n\n[3] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. \n\n[4] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148-188. Cambridge University Press, 1989. \n\n[5] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, 1998. \n\n[6] Andrew Y. Ng, Alice X. Zheng, and Michael I. Jordan. Link analysis, eigenvectors and stability. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01), 2001. \n", "award": [], "sourceid": 2127, "authors": [{"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "Jaz", "family_name": "Kandola", "institution": null}]}