{"title": "Generalizable Singular Value Decomposition for Ill-posed Datasets", "book": "Advances in Neural Information Processing Systems", "page_first": 549, "page_last": 555, "abstract": null, "full_text": "Generalizable Singular Value \n\nDecomposition for Ill-posed Datasets \n\nUlrik Kjerns \nLars K. Hansen \nDepartment of Mathematical Modelling \n\nTechnical University of Denmark \nDK-2800 Kgs. Lyngby, Denmark \n\nuk, lkhansen@imm. dtu. dk \n\nStephen C. Strother \nPET Imaging Service \n\nVA medical center \n\nMinneapolis \n\nsteve@pet. med. va. gov \n\nAbstract \n\nWe demonstrate that statistical analysis of ill-posed data sets is \nsubject to a bias, which can be observed when projecting indepen(cid:173)\ndent test set examples onto a basis defined by the training exam(cid:173)\nples. Because the training examples in an ill-posed data set do not \nfully span the signal space the observed training set variances in \neach basis vector will be too high compared to the average vari(cid:173)\nance of the test set projections onto the same basis vectors. On \nbasis of this understanding we introduce the Generalizable Singu(cid:173)\nlar Value Decomposition (GenSVD) as a means to reduce this bias \nby re-estimation of the singular values obtained in a conventional \nSingular Value Decomposition, allowing for a generalization perfor(cid:173)\nmance increase of a subsequent statistical model. We demonstrate \nthat the algorithm succesfully corrects bias in a data set from a \nfunctional PET activation study of the human brain. \n\n1 \n\nIll-posed Data Sets \n\nAn ill-posed data set has more dimensions in each example than there are examples. \nSuch data sets occur in many fields of research typically in connection with image \nmeasurements. The associated statistical problem is that of extracting structure \nfrom the observed high-dimensional vectors in the presence of noise. The statistical \nanalysis can be done either supervised (Le. 
modelling with target values: classification, regression) or unsupervised (modelling with no target values: clustering, PCA, ICA). In both types of analysis the ill-posedness may lead to immediate problems if one tries to apply conventional statistical methods of analysis; for example, the empirical covariance matrix is prohibitively large and will be rank-deficient.

A common approach is to use Singular Value Decomposition (SVD) or the analogous Principal Component Analysis (PCA) to reduce the dimensionality of the data. Let the N observed I-dimensional samples x_j, j = 1..N, be collected in the data matrix X = [x_1 ... x_N] of size I x N, I > N. The SVD theorem states that such a matrix can be decomposed as

    X = U Λ V^T,                                                        (1)

where U is a matrix of the same size as X with orthogonal basis vectors spanning the space of X, so that U^T U = I_{N×N}. The square matrix Λ contains the singular values in the diagonal, Λ = diag(λ_1, ..., λ_N), which are ordered and positive, λ_1 ≥ λ_2 ≥ ... ≥ λ_N ≥ 0, and V is N x N and orthogonal, V^T V = I_N. If there is a mean value significantly different from zero it may at times be advantageous to perform the above analysis on mean-subtracted data, i.e. X - X̄ = U Λ V^T, where the columns of X̄ all contain the mean vector x̄ = Σ_j x_j / N.

Each observation x_j can be expressed in coordinates in the basis defined by the vectors of U with no loss of information [Lautrup et al., 1995]. A change of basis is obtained by q_j = U^T x_j, i.e. the orthogonal basis rotation

    Q = [q_1 ... q_N] = U^T X = U^T U Λ V^T = Λ V^T.                    (2)

Since Q is only N x N and N ≪ I, Q is a compact representation of the data. Having now N examples of N dimensions we have reduced the problem to a marginally ill-posed one. To further reduce the dimensionality, it is common to retain only a subset of the coordinates, e.g.
the top P coordinates (P < N), and the supervised or unsupervised model can be formed in this smaller, now well-posed space.

So far we have considered the procedure for modelling from a training set. Our hope is that the statistical description generalizes well to new examples, proving that it is a good description of the generating process. The model should, in other words, be able to perform well on a new example x*, and in the above framework this would mean that predictions based on q* = U^T x* should generalize well. We will show in the following that, in general, the distribution of the test set projection q* is quite different from the statistics of the projections of the training examples q_j. It has been noted in previous work [Hansen and Larsen, 1996, Roweis, 1998, Hansen et al., 1999] that PCA/SVD of ill-posed data does not by itself represent a probabilistic model where we can assign a likelihood to a new test data point, and procedures have been proposed which make this possible. In [Bishop, 1999] PCA has been considered in a Bayesian framework, but this does not address the significant bias of the variance in training set projections in ill-posed data sets. In [Jackson, 1991] an asymptotic expression is given for the bias of eigenvalues in a sample covariance matrix, but this expression is valid only in the well-posed case and is not applicable for ill-posed data.

1.1 Example

Let the signal source be an I-dimensional multivariate Gaussian distribution N(0, Σ) with a covariance matrix where the first K eigenvalues equal σ² and the last I - K are zero, so that the covariance matrix has the decomposition

    Σ = σ² Y D Y^T,   D = diag(1, ..., 1, 0, ..., 0),   Y^T Y = I.      (3)

Our N samples of the distribution are collected in the matrix X = [x_ij] with the SVD

    X = U Λ V^T,   Λ = diag(λ_1, ..., λ_N),                             (4)

and the representation of the N examples in the N basis vector coordinates defined by U is Q = [q_ij] = U^T X = Λ V^T.
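In numpy terms, the decomposition and change of basis in Eqs. (1)-(2) amount to the following minimal sketch (array sizes are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
I, N = 1000, 20                      # I-dimensional samples, N examples, I > N
X = rng.standard_normal((I, N))

# Thin SVD: U is I x N, lam holds the N singular values, Vt is N x N
U, lam, Vt = np.linalg.svd(X, full_matrices=False)

# Change of basis (Eq. 2): Q = U^T X = Lambda V^T, an N x N representation
Q = U.T @ X
assert np.allclose(Q, np.diag(lam) @ Vt)

# No information is lost: X is exactly recovered from the compact Q
assert np.allclose(U @ Q, X)
```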
The total variance per training example is

    (1/N) Σ_{i,j} x_ij² = (1/N) Tr(X^T X) = (1/N) Tr(V Λ U^T U Λ V^T) = (1/N) Tr(V Λ² V^T)
                        = (1/N) Tr(V V^T Λ²) = (1/N) Tr(Λ²) = (1/N) Σ_i λ_i².     (5)

Note that this variance is the same in the U-basis coordinates:

    (1/N) Σ_{i,j} q_ij² = (1/N) Tr(Q^T Q) = (1/N) Tr(V Λ² V^T) = (1/N) Σ_i λ_i².  (6)

We can derive the expected value of this variance:

    ⟨(1/N) Σ_{i,j} x_ij²⟩ = ⟨Σ_i x_i1²⟩ = ⟨x_1^T x_1⟩ = Tr Σ = σ² K.              (7)

Now consider a test example x* ~ N(0, Σ) with the projection q* = U^T x*, which will have the average total variance

    ⟨Tr[(U^T x*)^T (U^T x*)]⟩ = Tr[⟨x* x*^T⟩ U U^T] = Tr[Σ U U^T]
                              = σ² Tr[Y D Y^T U U^T] = σ² min(N, K).               (8)

In summary, this means that the orthogonal basis U computed from the training set spans all the variance in the training set but fails to do so on the test examples when N < K, i.e. for ill-posed data. The training set variance is (K/N) σ² on average per coordinate, compared to σ² for the test examples. So which of the two variances is "correct"? From a modelling point of view, the variance of the test example tells us the true story, so the training set variance should be regarded as biased. This suggests that the training set singular values should be corrected for this bias, in the above example by re-estimating the training set projections as Q̃ = √(N/K) Q. In the more general case we do not know K, and the true covariance may have an arbitrary eigen-spectrum. The GenSVD algorithm below is a more general algorithm for correcting the training set bias.

2 The GenSVD Algorithm

The data matrix consists of N statistically independent samples X = [x_1 ... x_N], so X is of size I x N; each column of X is assumed multivariate Gaussian, x_j ~ N(0, Σ), and the problem is ill-posed with rank Σ > N.
With the SVD X = U_0 Λ_0 V_0^T, we now make the approximation that U_0 contains an actual subset of the true eigenvectors of Σ,

    Σ = U_0 Λ² U_0^T + U_⊥ Λ_⊥² U_⊥^T,                                  (9)

where we have collected the remaining eigenvectors and eigenvalues (unspanned by X) in U_⊥ and Λ_⊥, satisfying U_⊥^T U_⊥ = I and U_0^T U_⊥ = 0. The unknown 'true' eigenvalues corresponding to the observed eigenvectors are collected in Λ = diag(λ_1, ..., λ_N), which are the values we try to estimate in the following.

It should be noted that a direct estimation of Σ using Σ̂ = (1/N) X X^T yields Σ̂ = (1/N) U_0 Λ_0 V_0^T V_0 Λ_0 U_0^T = (1/N) U_0 Λ_0² U_0^T, i.e., the nonzero eigenvectors and eigenvalues of Σ̂ are U_0 and Λ_0.

The distribution of test samples x* inside the space spanned by U_0 is

    U_0^T x* ~ N(0, Λ²).                                                (10)

The problem is that U_0 and the examples x_j are not independent, so U_0^T x_j is biased; e.g. the SVD estimate (1/N) Λ_0² of Λ² assigns all variance to lie within U_0. The GenSVD algorithm bypasses this problem by, for each example, computing a basis on all other examples, estimating the variances in Λ² in a leave-one-out manner. Consider

    z_j = U_0^T B_{-j} B_{-j}^T x_j,                                    (11)

where we introduce the notation X_{-j} for the matrix of all examples except the j'th, and this matrix is decomposed as X_{-j} = B_{-j} Λ_{-j} C_{-j}^T. The operation B_{-j} B_{-j}^T x_j projects the example onto the basis defined by the remaining examples, and back again, so it 'strips' off the part of signal space which is special to x_j, which could be signal that does not generalize across examples.

Since B_{-j} and x_j are independent, B_{-j}^T x_j has the same distribution as the projection of a test example x*, B_{-j}^T x*. Thus, B_{-j} B_{-j}^T x_j and B_{-j} B_{-j}^T x* have the same distribution as well. Now, since span B_{-j} = span X_{-j} and span U_0 = span [X_{-j} x_j], we have that span B_{-j} ⊆ span U_0, so we see that z_j and U_0^T B_{-j} B_{-j}^T x* are identically distributed. This means that z_j has the covariance U_0^T B_{-j} B_{-j}^T Σ B_{-j} B_{-j}^T U_0, and using Eq.
(9) and the fact that U_⊥^T B_{-j} = 0 (since U_⊥^T U_0 = 0), we get

    z_j ~ N(0, (U_0^T B_{-j} B_{-j}^T U_0) Λ² (U_0^T B_{-j} B_{-j}^T U_0)).    (12)

We note that this distribution is degenerate, because the covariance is of rank N - 1. For a sample z_j from the above distribution we have that

    U_0^T B_{-j} B_{-j}^T U_0 z_j = U_0^T B_{-j} B_{-j}^T U_0 U_0^T B_{-j} B_{-j}^T x_j = U_0^T B_{-j} B_{-j}^T x_j = z_j,    (13)

since U_0 U_0^T B_{-j} = B_{-j} (span B_{-j} ⊆ span U_0). As a second approximation, assume that the observed z_j are independent, so that we can write the negative log-likelihood of Λ as

    -log p = Σ_j log [(2π)^{N/2} |(U_0^T B_{-j} B_{-j}^T U_0) Λ² (U_0^T B_{-j} B_{-j}^T U_0)|^{1/2}]
             + (1/2) Σ_j z_j^T (U_0^T B_{-j} B_{-j}^T U_0) Λ^{-2} (U_0^T B_{-j} B_{-j}^T U_0) z_j
           = C + (N/2) Σ_i log λ_i² + (1/2) Σ_j z_j^T Λ^{-2} z_j,              (14)

where we have used Eq. (13) and the determinant¹ is approximated by |Λ²|. The above expression is minimized when

    λ̂_i² = (1/N) Σ_j z_ij².                                                   (15)

The GenSVD of X is then X̂ = U_0 Λ̂ V̂^T, Λ̂ = diag(λ̂_1, ..., λ̂_N).

In practice, using Eq. (11) directly to compute an SVD of the matrix X_{-j} for each example is computationally demanding. It is possible to compute z_j in a more efficient two-level procedure with the following algorithm:

    Compute U_0 Λ_0 V_0^T = svd(X) and Q_0 = [q_j] = Λ_0 V_0^T
    foreach j = 1..N
        Compute B_{-j} Λ_{-j} V_{-j}^T = svd(Q_{-j})
        z_j = B_{-j} B_{-j}^T q_j
    λ̂_i² = (1/N) Σ_j z_ij²

¹Since z_j is degenerate, we define the likelihood over the space where z_j occurs, i.e. the determinant in Eq. (14) should be read as 'the product of non-zero eigenvalues'.

If the data has a mean value that we wish to remove prior to the SVD, it is important that this is done within the GenSVD algorithm. Consider a centered matrix X_c = X - X̄, where X̄ contains the mean x̄ replicated in all N columns. The signal space in X_c is now corrupted, because each centered example will contain a component of all examples, which means the 'stripping' of signal components not spanned by other examples no longer works: B_{-j}^T x_j is no longer distributed like B_{-j}^T x*.
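The two-level procedure above (without mean removal) can be sketched in numpy as follows; the function name and array handling are my assumptions, not the authors' code:

```python
import numpy as np

def gensvd(X):
    """Leave-one-out re-estimation of singular values: a sketch of the
    two-level GenSVD procedure (no mean removal)."""
    I, N = X.shape
    U0, lam0, V0t = np.linalg.svd(X, full_matrices=False)
    Q0 = np.diag(lam0) @ V0t                  # N x N compact representation
    Z = np.empty((N, N))
    for j in range(N):
        Q_minus = np.delete(Q0, j, axis=1)    # all examples except the j'th
        B = np.linalg.svd(Q_minus, full_matrices=False)[0]   # N x (N-1) basis
        Z[:, j] = B @ (B.T @ Q0[:, j])        # Eq. (11), in the Q coordinates
    lam_hat = np.sqrt(np.mean(Z**2, axis=1))  # Eq. (15)
    return U0, lam_hat
```

The inner SVDs are taken of N x (N-1) matrices rather than I x (N-1) matrices, which is the source of the efficiency gain when I ≫ N.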
This suggests the following alternative algorithm for data with removal of a mean component:

    Compute U_0 Λ_0 V_0^T = svd(X) and Q_0 = [q_j] = Λ_0 V_0^T
    foreach j = 1..N
        q̄_{-j} = 1/(N-1) Σ_{j'≠j} q_{j'}
        Compute B_{-j} Λ_{-j} V_{-j}^T = svd(Q_{-j} - Q̄_{-j})
        z_j = B_{-j} B_{-j}^T (q_j - q̄_{-j})
    λ̂_i² = 1/(N-1) Σ_j z_ij²

Here Q̄_{-j} contains q̄_{-j} replicated in all N - 1 columns. Finally, note that it is possible to leave out more than one example at a time if the data is independent only in blocks, i.e. Q_{-k} would be Q_0 with the k'th block left out.

Example With PET Scans

We compared the performance of GenSVD to conventional SVD on a functional [15O] water PET activation study of the human brain. The study consisted of 18 subjects, who were scanned four times while tracing a star-shaped maze with a joy-stick with visual feedback, in total 72 scans of dimension ~25000 spatial voxels. After the second scan, the visual feedback was mirrored, and the subject accommodated to and learned the new control environment during the last two scans. Scans were normalized by 1) dividing each scan by the average voxel value measured inside a brain mask, 2) for each scan subtracting the average scan for that subject, thereby removing subject effects, and 3) intra- and inter-subject normalization and transformation using rigid body reorientation and affine linear transformations, respectively. Voxels inside the aforementioned brain mask were arranged in the data matrix with one scan per column.

Figure 1 shows the results of an SVD decomposition compared to GenSVD. Each marker represents one scan, and the glyphs indicate scan number out of the four (circle-square-star-triangle). The ellipses indicate the means and covariances of the projections for each scan number. The 32 scans from eight subjects were used as a training set and 40 scans from the remaining 10 subjects for testing.
The training set projections are filled markers; test-set projections onto the basis defined by the training set are open markers (i.e. we plot the first two columns of U_0 Λ_0 for SVD and of U_0 Λ̂ for GenSVD). We see that there is a clear difference in variance between the training and test examples, which is corrected quite well by GenSVD. The lower plot in Figure 1 shows the singular values for the PET data set. We see that the GenSVD estimates are much closer to the actual test projection standard deviations than the SVD singular values.

3 Conclusion

We have demonstrated that projection of ill-posed data sets onto a basis defined by the same examples introduces a significant bias on the observed variance when comparing to projections of test examples onto the same basis. The GenSVD algorithm has been presented as a tool for correcting this bias using a leave-one-out re-estimation scheme, and a computationally efficient implementation has been proposed.

We have demonstrated that the method works well on an ill-posed real-world data set, where the distribution of the GenSVD-corrected training set projections matched the distribution of the observed test set projections far better than the uncorrected training examples. This allows a generalization performance increase of a subsequent statistical model, in the case of both supervised and unsupervised models.

Acknowledgments

This work was supported partly by the Human Brain Project grant P20 MH57180, the Danish Research Councils for the Natural and Technical Sciences through the Danish Computational Neural Network Center (CONNECT) and the Technology Center Through Highly Oriented Research (THOR).

References

[Bishop, 1999] Bishop, C. (1999). Bayesian PCA. In Kearns, M. S., Solla, S. A., and Cohn, D. A., editors, Advances in Neural Information Processing Systems, volume 11. The MIT Press.

[Hansen et al., 1999] Hansen, L.
, Larsen, J., Nielsen, F., Strother, S., Rostrup, E., Savoy, R., Lange, N., Sidtis, J., Svarer, C., and Paulson, O. (1999). Generalizable patterns in neuroimaging: How many principal components? NeuroImage, 9:534-544.

[Hansen and Larsen, 1996] Hansen, L. K. and Larsen, J. (1996). Unsupervised learning and generalization. In Proceedings of the IEEE International Conference on Neural Networks, pages 25-30.

[Jackson, 1991] Jackson, J. E. (1991). A User's Guide to Principal Components. Wiley Series on Probability and Statistics, John Wiley and Sons.

[Lautrup et al., 1995] Lautrup, B., Hansen, L. K., Law, I., Mørch, N., Svarer, C., and Strother, S. (1995). Massive weight sharing: A cure for extremely ill-posed problems. In Hermann, H. J., Wolf, D. E., and Poppel, E. P., editors, Proceedings of the Workshop on Supercomputing in Brain Research: From Tomography to Neural Networks, HLRZ, KFA Jülich, Germany, pages 137-148. World Scientific.

[Roweis, 1998] Roweis, S. (1998). EM algorithms for PCA and SPCA. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press.

Figure 1: Projections onto the first two components for conventional SVD (top) and GenSVD (bottom). Solid markers: training set; open markers: test set. Circles/squares: trace scans 1-2; stars/triangles: mirror scans 1-2. Ellipses show per-scan-number means and covariances.