Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)

EM Algorithms for PCA and SPCA

Sam Roweis
roweis@cns.caltech.edu; Computation & Neural Systems, California Institute of Technology
Abstract

I present an expectation-maximization (EM) algorithm for principal component analysis (PCA). The algorithm allows a few eigenvectors and eigenvalues to be extracted from large collections of high dimensional data. It is computationally very efficient in space and time. It also naturally accommodates missing information. I also introduce a new variant of PCA called sensible principal component analysis (SPCA) which defines a proper density model in the data space. Learning for SPCA is also done with an EM algorithm. I report results on synthetic and real data showing that these EM algorithms correctly and efficiently find the leading eigenvectors of the covariance of datasets in a few iterations using up to hundreds of thousands of datapoints in thousands of dimensions.
1 Why EM for PCA?

Principal component analysis (PCA) is a widely used dimensionality reduction technique in data analysis. Its popularity comes from three important properties. First, it is the optimal (in terms of mean squared error) linear scheme for compressing a set of high dimensional vectors into a set of lower dimensional vectors and then reconstructing. Second, the model parameters can be computed directly from the data - for example by diagonalizing the sample covariance. Third, compression and decompression are easy operations to perform given the model parameters - they require only matrix multiplications.
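To make these properties concrete, the following is a minimal NumPy sketch of the standard route just described (the function and variable names are mine, not the paper's): form and diagonalize the sample covariance, then compress and decompress with single matrix multiplications.

```python
import numpy as np

def pca_direct(Y, k):
    """Y: p x n array of n datapoints in p dimensions. Returns the top-k
    principal directions, their eigenvalues, the codes, and reconstructions."""
    mean = Y.mean(axis=1, keepdims=True)
    Y0 = Y - mean
    cov = (Y0 @ Y0.T) / Y0.shape[1]        # sample covariance: O(np^2) to form
    evals, evecs = np.linalg.eigh(cov)     # full diagonalization: O(p^3)
    order = np.argsort(evals)[::-1][:k]    # indices of the k largest eigenvalues
    W = evecs[:, order]                    # p x k matrix of principal directions
    codes = W.T @ Y0                       # compression: one matrix multiply
    recon = W @ codes + mean               # decompression: one matrix multiply
    return W, evals[order], codes, recon
```

Forming and diagonalizing the covariance here is exactly the O(np^2) and O(p^3) work that the next paragraph argues becomes prohibitive for large n and p.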
Despite these attractive features however, PCA models have several shortcomings. One is that naive methods for finding the principal component directions have trouble with high dimensional data or large numbers of datapoints. Consider attempting to diagonalize the sample covariance matrix of n vectors in a space of p dimensions when n and p are several hundred or several thousand. Difficulties can arise both in the form of computational complexity and also data scarcity.¹ Even computing the sample covariance itself is very costly, requiring O(np^2) operations. In general it is best to avoid altogether computing the sample covariance explicitly. Methods such as the snap-shot algorithm [7] do this by assuming that the eigenvectors being searched for are linear combinations of the datapoints; their complexity is O(n^3). In this note, I present a version of the expectation-maximization (EM) algorithm [1] for learning the principal components of a dataset. The algorithm does not require computing the sample covariance and has a complexity limited by O(knp) operations, where k is the number of leading eigenvectors to be learned.

¹On the data scarcity front, we often do not have enough data in high dimensions for the sample covariance to be of full rank and so we must be careful to employ techniques which do not require full rank matrices. On the complexity front, direct diagonalization of a symmetric matrix thousands of rows in size can be extremely costly since this operation is O(p^3) for p x p inputs. Fortunately, several techniques exist for efficient matrix diagonalization when only the first few leading eigenvectors and eigenvalues are required (for example the power method [10], which is only O(p^2)).
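Returning to the EM algorithm just introduced: as a rough illustration of what one iteration can look like, here is a NumPy sketch of an EM-style update for a k-dimensional principal subspace (the names C, X, and Y are mine, and this is not a verbatim transcription of the update equations derived later in the paper). Each iteration only multiplies k x p, k x n, and p x n matrices, so the cost stays O(knp) per iteration and the p x p covariance is never formed.

```python
import numpy as np

def em_pca(Y, k, n_iters=50, seed=0):
    """Y: p x n zero-mean data matrix. Returns a p x k matrix whose columns
    span an estimate of the k-dimensional principal subspace."""
    p, n = Y.shape
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((p, k))              # random initial subspace
    for _ in range(n_iters):
        # E-step: least-squares coordinates of every point in the current
        # subspace, X = (C'C)^{-1} C'Y
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: subspace that best reconstructs Y from those coordinates,
        # C = Y X'(X X')^{-1}
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    return C
```

After convergence the columns of C span the principal subspace but are not in general the orthonormal eigenvectors themselves; those can be recovered afterwards by orthonormalizing C and diagonalizing within the subspace.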
Another shortcoming of standard approaches to PCA is that it is not obvious how to deal properly with missing data. Most of the methods discussed above cannot accommodate missing values and so incomplete points must either be discarded or completed using a variety of ad-hoc interpolation methods. On the other hand, the EM algorithm for PCA enjoys all the benefits [4] of other EM algorithms in terms of estimating the maximum likelihood values for missing information directly at each iteration.
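As a hedged sketch of how missing values could be folded into such an iteration (NaN marks missing entries here; the conventions are mine and this is not the paper's exact procedure), each datapoint's subspace coordinates can be fit from its observed dimensions only, and its missing entries replaced by the model reconstruction:

```python
import numpy as np

def fill_missing(Y, C):
    """Y: p x n data with np.nan marking missing entries; C: current p x k
    estimate of the principal directions. Returns a copy of Y with missing
    entries replaced by their reconstructions under C."""
    Y_filled = Y.copy()
    for i in range(Y.shape[1]):
        obs = ~np.isnan(Y[:, i])
        # coordinates of point i fit using only its observed dimensions
        x_i, *_ = np.linalg.lstsq(C[obs], Y[obs, i], rcond=None)
        # fill the missing dimensions with their reconstruction
        Y_filled[~obs, i] = C[~obs] @ x_i
    return Y_filled
```

Alternating a fill-in of this kind with the subspace update is in the spirit of estimating values for the missing information at each iteration, as described above.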
Finally, the PCA model itself suffers from a critical flaw which is independent of the technique used to compute its parameters: it does not define a proper probability model in the space of inputs. This is because the density is not normalized within the principal subspace. In other words, if we perform PCA on some data and then ask how well new data are fit by the model, the only criterion used is the squared distance of the new data from their projections into the principal subspace. A datapoint far away from the training data but nonetheless near the principal subspace will be assigned a high "pseudo-likelihood" or low error. Similarly, it is not possible to generate "fantasy" data from a PCA model. In this note I introduce a new model called sensible principal component analysis (SPCA), an obvious modification of PCA, which does define a proper covariance structure in the data space. Its parameters can also be learned with an EM algorithm, given below.
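To illustrate what a proper density model buys, here is a small sketch assuming the isotropic-noise Gaussian form that SPCA leads to (covariance C C' + eps I, as given later in the paper; the function names and the use of SciPy are mine): new points receive true log-likelihoods rather than pseudo-likelihoods, and "fantasy" data can be generated.

```python
import numpy as np
from scipy.stats import multivariate_normal

def spca_loglik(y_new, mu, C, eps):
    """Log-likelihood of a new p-vector under N(mu, C C^T + eps*I)."""
    cov = C @ C.T + eps * np.eye(C.shape[0])
    return multivariate_normal.logpdf(y_new, mean=mu, cov=cov)

def spca_sample(mu, C, eps, rng=None):
    """Generate one 'fantasy' datapoint from the model."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(C.shape[1])                      # latent coordinates
    noise = np.sqrt(eps) * rng.standard_normal(C.shape[0])   # isotropic noise
    return mu + C @ x + noise
```

A point far from the training data but near the principal subspace now receives a low likelihood, because the Gaussian density is normalized and falls off in every direction, unlike the squared-distance criterion described above.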
In summary, the methods developed in this paper provide three advantages. They allow simple and efficient computation of a few eigenvectors and eigenvalues when working with many datapoints in high dimensions. They permit this computation even in the presence of missing data. On a real vision problem with missing information, I have computed the 10 leading eigenvectors and eigenvalues of 2^17 points in 2^12 dimensions in a few hours using MATLAB on a modest workstation. Finally, through a small variation, these methods allow the computation not only of the principal subspace but of a complete Gaussian probabilistic model which allows one to generate data and compute true likelihoods.
2 Whence EM for PCA?

Principal component analysis can be viewed as a limiting case of a particular class of linear-Gaussian models. The goal of such models is to capture the covariance structure of an observed p-dimensional variable y using fewer than the p(p+1)/2 free parameters required in a full covariance matrix. Linear-Gaussian models do this by assuming that y was produced as a linear transformation of some k-dimensional latent variable x plus additive Gaussian noise. Denoting the transformation by the p x k matrix C, and the (p-dimensional) noise by v (with covariance matrix R), the generative model can be written as