{"title": "Bayesian PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 382, "page_last": 388, "abstract": null, "full_text": "Bayesian PCA\n\nChristopher M. Bishop\n\nMicrosoft Research\n\nSt. George House, 1 Guildhall Street\n\nCambridge CB2 3NH, U.K.\n\ncmbishop@microsoft.com\n\nAbstract\n\nThe technique of principal component analysis (PCA) has recently been expressed as the maximum likelihood solution for a generative latent variable model. In this paper we use this probabilistic reformulation as the basis for a Bayesian treatment of PCA. Our key result is that the effective dimensionality of the latent space (equivalent to the number of retained principal components) can be determined automatically as part of the Bayesian inference procedure. An important application of this framework is to mixtures of probabilistic PCA models, in which each component can determine its own effective complexity.\n\n1  Introduction\n\nPrincipal component analysis (PCA) is a widely used technique for data analysis. Recently Tipping and Bishop (1997b) showed that a specific form of generative latent variable model has the property that its maximum likelihood solution extracts the principal sub-space of the observed data set. This probabilistic reformulation of PCA permits many extensions including a principled formulation of mixtures of principal component analyzers, as discussed by Tipping and Bishop (1997a).\n\nA central issue in maximum likelihood (as well as conventional) PCA is the choice of the number of principal components to be retained. This is particularly problematic in a mixture modelling context since ideally we would like the components to have potentially different dimensionalities. 
However, an exhaustive search over the choice of dimensionality for each of the components in a mixture distribution can quickly become computationally intractable. In this paper we develop a Bayesian treatment of PCA, and we show how this leads to an automatic selection of the appropriate model dimensionality. Our approach avoids a discrete model search, involving instead the use of continuous hyper-parameters to determine an effective number of principal components.\n\n2  Maximum Likelihood PCA\n\nConsider a data set $D$ of observed $d$-dimensional vectors $D = \\{t_n\\}$ where $n \\in \\{1, \\ldots, N\\}$. Conventional principal component analysis is obtained by first computing the sample covariance matrix given by\n\n$$S = \\frac{1}{N} \\sum_{n=1}^{N} (t_n - \\bar{t})(t_n - \\bar{t})^T \\tag{1}$$\n\nwhere $\\bar{t} = N^{-1} \\sum_n t_n$ is the sample mean. Next the eigenvectors $u_i$ and eigenvalues $\\lambda_i$ of $S$ are found, where $S u_i = \\lambda_i u_i$ and $i = 1, \\ldots, d$. The eigenvectors corresponding to the $q$ largest eigenvalues (where $q < d$) are retained, and a reduced-dimensionality representation of the data set is defined by $x_n = U_q^T (t_n - \\bar{t})$ where $U_q = (u_1, \\ldots, u_q)$. It is easily shown that PCA corresponds to the linear projection of a data set under which the retained variance is a maximum, or equivalently the linear projection for which the sum-of-squares reconstruction cost is minimized.\n\nA significant limitation of conventional PCA is that it does not define a probability distribution. Recently, however, Tipping and Bishop (1997b) showed how PCA can be reformulated as the maximum likelihood solution of a specific latent variable model, as follows. We first introduce a $q$-dimensional latent variable $x$ whose prior distribution is a zero-mean Gaussian $p(x) = \\mathcal{N}(0, I_q)$, where $I_q$ is the $q$-dimensional unit matrix. 
The observed variable $t$ is then defined as a linear transformation of $x$ with additive Gaussian noise, $t = W x + \\mu + \\epsilon$, where $W$ is a $d \\times q$ matrix, $\\mu$ is a $d$-dimensional vector and $\\epsilon$ is a zero-mean Gaussian-distributed vector with covariance $\\sigma^2 I_d$. Thus $p(t|x) = \\mathcal{N}(W x + \\mu, \\sigma^2 I_d)$. The marginal distribution of the observed variable is then given by the convolution of two Gaussians and is itself Gaussian\n\n$$p(t) = \\int p(t|x) p(x) \\, dx = \\mathcal{N}(\\mu, C) \\tag{2}$$\n\nwhere the covariance matrix $C = W W^T + \\sigma^2 I_d$. The model (2) represents a constrained Gaussian distribution governed by the parameters $\\mu$, $W$ and $\\sigma^2$.\n\nThe log probability of the parameters under the observed data set $D$ is then given by\n\n$$L(\\mu, W, \\sigma^2) = -\\frac{N}{2} \\left\\{ d \\ln(2\\pi) + \\ln|C| + \\mathrm{Tr}[C^{-1} S] \\right\\} \\tag{3}$$\n\nwhere $S$ is the sample covariance matrix given by (1). The maximum likelihood solution for $\\mu$ is easily seen to be $\\mu_{ML} = \\bar{t}$, the sample mean. It was shown by Tipping and Bishop (1997b) that the stationary points of the log likelihood with respect to $W$ satisfy\n\n$$W_{ML} = U_q (\\Lambda_q - \\sigma^2 I_q)^{1/2} \\tag{4}$$\n\nwhere the columns of $U_q$ are eigenvectors of $S$, with corresponding eigenvalues in the diagonal matrix $\\Lambda_q$. It was also shown that the maximum of the likelihood is achieved when the $q$ largest eigenvalues are chosen, so that the columns of $U_q$ correspond to the principal eigenvectors, with all other choices of eigenvalues corresponding to saddle points. The maximum likelihood solution for $\\sigma^2$ is then given by\n\n$$\\sigma^2_{ML} = \\frac{1}{d - q} \\sum_{i=q+1}^{d} \\lambda_i \\tag{5}$$\n\nwhich has a natural interpretation as the average variance lost per discarded dimension. The density model (2) thus represents a probabilistic formulation of PCA. It is easily verified that conventional PCA is recovered in the limit $\\sigma^2 \\to 0$.\n\nProbabilistic PCA has been successfully applied to problems in data compression, density estimation and data visualization, and has been extended to mixture and hierarchical mixture models. As with conventional PCA, however, the model itself provides no mechanism for determining the value of the latent-space dimensionality $q$. For $q = d - 1$ the model is equivalent to a full-covariance Gaussian distribution, while for $q < d - 1$ it represents a constrained Gaussian in which the variance in the remaining $d - q$ directions is modelled by the single parameter $\\sigma^2$. Thus the choice of $q$ corresponds to a problem in model complexity optimization. If data is plentiful, then cross-validation to compare all possible values of $q$ offers a possible approach. However, this can quickly become intractable for mixtures of probabilistic PCA models if we wish to allow each component to have its own $q$ value.\n\n3  Bayesian PCA\n\nThe issue of model complexity can be handled naturally within a Bayesian paradigm. Armed with the probabilistic reformulation of PCA defined in Section 2, a Bayesian treatment of PCA is obtained by first introducing a prior distribution $p(\\mu, W, \\sigma^2)$ over the parameters of the model. The corresponding posterior distribution $p(\\mu, W, \\sigma^2 | D)$ is then obtained by multiplying the prior by the likelihood function, whose logarithm is given by (3), and normalizing. Finally, the predictive density is obtained by marginalizing over the parameters, so that\n\n$$p(t|D) = \\iiint p(t | \\mu, W, \\sigma^2) \\, p(\\mu, W, \\sigma^2 | D) \\, d\\mu \\, dW \\, d\\sigma^2 \\tag{6}$$\n\nIn order to implement this framework we must address two issues: (i) the choice of prior distribution, and (ii) the formulation of a tractable algorithm. Our focus in this paper is on the specific issue of controlling the effective dimensionality of the latent space (corresponding to the number of retained principal components). 
Furthermore, we seek to avoid discrete model selection and instead use continuous hyper-parameters to determine automatically an appropriate effective dimensionality for the latent space as part of the process of Bayesian inference. This is achieved by introducing a hierarchical prior $p(W|\\alpha)$ over the matrix $W$, governed by a $q$-dimensional vector of hyper-parameters $\\alpha = \\{\\alpha_1, \\ldots, \\alpha_q\\}$. The dimensionality of the latent space is set to its maximum possible value $q = d - 1$, and each hyper-parameter controls one of the columns of the matrix $W$ through a conditional Gaussian distribution of the form\n\n$$p(W|\\alpha) = \\prod_{i=1}^{d-1} \\left( \\frac{\\alpha_i}{2\\pi} \\right)^{d/2} \\exp\\left\\{ -\\frac{1}{2} \\alpha_i \\|w_i\\|^2 \\right\\} \\tag{7}$$\n\nwhere $\\{w_i\\}$ are the columns of $W$. This form of prior is motivated by the framework of automatic relevance determination (ARD) introduced in the context of neural networks by Neal and MacKay (see MacKay, 1995). Each $\\alpha_i$ controls the inverse variance of the corresponding $w_i$, so that if a particular $\\alpha_i$ has a posterior distribution concentrated at large values, the corresponding $w_i$ will tend to be small, and that direction in latent space will be effectively 'switched off'. The probabilistic structure of the model is displayed graphically in Figure 1.\n\nIn order to make use of this model in practice we must be able to marginalize over the posterior distribution of $W$. Since this is analytically intractable we have developed three alternative approaches based on (i) type-II maximum likelihood using a local Gaussian approximation to a mode of the posterior distribution (MacKay, 1995), (ii) Markov chain Monte Carlo using Gibbs sampling, and (iii) variational inference using a factorized approximation to the posterior distribution. Here we describe the first of these in more detail. 
\n\nFigure 1: Representation of Bayesian PCA as a probabilistic graphical model showing the hierarchical prior over $W$ governed by the vector of hyper-parameters $\\alpha$. The box denotes a 'plate' comprising a data set of $N$ independent observations of the visible vector $t_n$ (shown shaded) together with the corresponding hidden variables $x_n$.\n\nThe location $W_{MP}$ of the mode can be found by maximizing the log posterior distribution given, from Bayes' theorem, by\n\n$$\\ln p(W|D) = L - \\frac{1}{2} \\sum_{i=1}^{d-1} \\alpha_i \\|w_i\\|^2 + \\mathrm{const.} \\tag{8}$$\n\nwhere $L$ is given by (3). For the purpose of controlling the effective dimensionality of the latent space, it is sufficient to treat $\\mu$, $\\sigma^2$ and $\\alpha$ as parameters whose values are to be estimated, rather than as random variables. In this case there is no need to introduce priors over these variables, and we can determine $\\mu$ and $\\sigma^2$ by maximum likelihood. To estimate $\\alpha$ we use type-II maximum likelihood, corresponding to maximizing the marginal likelihood $p(D|\\alpha)$ in which we have integrated over $W$ using the quadratic approximation. It is easily shown (Bishop, 1995) that this leads to a re-estimation formula for the hyper-parameters $\\alpha_i$ of the form\n\n$$\\alpha_i := \\frac{\\gamma_i}{\\|w_i\\|^2} \\tag{9}$$\n\nwhere $\\gamma_i \\equiv d - \\alpha_i \\mathrm{Tr}_i(H^{-1})$ is the effective number of parameters in $w_i$, $H$ is the Hessian matrix given by the second derivatives of $\\ln p(W|D)$ with respect to the elements of $W$ (evaluated at $W_{MP}$), and $\\mathrm{Tr}_i(\\cdot)$ denotes the trace of the sub-matrix corresponding to the vector $w_i$.\n\nFor the results presented in this paper, we make the further simplification of replacing $\\gamma_i$ in (9) by $d$, corresponding to the assumption that all model parameters are 'well-determined'. This significantly reduces the computational cost since it avoids evaluation and manipulation of the Hessian matrix. 
An additional consequence is that vectors $w_i$ for which there is insufficient support from the data will be driven to zero, with the corresponding $\\alpha_i \\to \\infty$, so that un-used dimensions are switched off completely. We define the effective dimensionality of the model to be the number of vectors $w_i$ whose values remain non-zero.\n\nThe solution for $W_{MP}$ can be found efficiently using the EM algorithm, in which the E-step involves evaluation of the expected sufficient statistics of the latent-space posterior distribution, given by\n\n$$\\langle x_n \\rangle = M^{-1} W^T (t_n - \\mu) \\tag{10}$$\n\n$$\\langle x_n x_n^T \\rangle = \\sigma^2 M^{-1} + \\langle x_n \\rangle \\langle x_n \\rangle^T \\tag{11}$$\n\nwhere $M = (W^T W + \\sigma^2 I_q)$. The M-step involves updating the model parameters using\n\n$$\\widetilde{W} = \\left[ \\sum_n (t_n - \\mu) \\langle x_n^T \\rangle \\right] \\left[ \\sum_n \\langle x_n x_n^T \\rangle + \\sigma^2 A \\right]^{-1} \\tag{12}$$\n\n$$\\widetilde{\\sigma}^2 = \\frac{1}{N d} \\sum_{n=1}^{N} \\left\\{ \\|t_n - \\mu\\|^2 - 2 \\langle x_n^T \\rangle \\widetilde{W}^T (t_n - \\mu) + \\mathrm{Tr}\\left[ \\langle x_n x_n^T \\rangle \\widetilde{W}^T \\widetilde{W} \\right] \\right\\} \\tag{13}$$\n\nwhere $A = \\mathrm{diag}(\\alpha_i)$. Optimization of $W$ and $\\sigma^2$ is alternated with re-estimation of $\\alpha$, using (9) with $\\gamma_i = d$, until all of the parameters satisfy a suitable convergence criterion.\n\nAs an illustration of the operation of this algorithm, we consider a data set consisting of 300 points in 10 dimensions, in which the data is drawn from a Gaussian distribution having standard deviation 1.0 in 3 directions and standard deviation 0.5 in the remaining 7 directions. The result of fitting both maximum likelihood and Bayesian PCA models is shown in Figure 2. In this case the Bayesian model has an effective dimensionality of $q_{\\mathrm{eff}} = 3$. 
\n\nFigure 2: Hinton diagrams of the matrix $W$ for a data set in 10 dimensions having $m = 3$ directions with larger variance than the remaining 7 directions. The left plot shows $W$ from maximum likelihood PCA while the right plot shows $W_{MP}$ from the Bayesian approach, showing how the model is able to discover the appropriate dimensionality by suppressing the 6 surplus degrees of freedom.\n\nThe effective dimensionality found by Bayesian PCA will be dependent on the number $N$ of points in the data set. For $N \\to \\infty$ we expect $q_{\\mathrm{eff}} \\to d - 1$, and in this limit the maximum likelihood framework and the Bayesian approach will give identical results. For finite data sets the effective dimensionality may be reduced, with degrees of freedom for which there is insufficient evidence in the data set being suppressed. The variance of the data in the remaining $d - q_{\\mathrm{eff}}$ directions is then accounted for by the single degree of freedom defined by $\\sigma^2$. This is illustrated by considering data in 10 dimensions generated from a Gaussian distribution with standard deviations given by $\\{1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1\\}$. In Figure 3 we plot $q_{\\mathrm{eff}}$ (averaged over 50 independent experiments) versus the number $N$ of points in the data set.\n\nThese results indicate that Bayesian PCA is able to determine automatically a suitable effective dimensionality $q_{\\mathrm{eff}}$ for the principal component subspace, and therefore offers a practical alternative to exhaustive comparison of dimensionalities using techniques such as cross-validation. 
As an illustration of the generalization capability of the resulting model we consider a data set of 20 points in 10 dimensions generated from a Gaussian distribution having standard deviations in 5 directions given by (1.0, 0.8, 0.6, 0.4, 0.2) and standard deviation 0.04 in the remaining 5 directions. We fit maximum likelihood PCA models to this data having $q$ values in the range 1-9 and compare their log likelihoods on both the training data and on an independent test set, with the results (averaged over 10 independent experiments) shown in Figure 4. Also shown are the corresponding results obtained from Bayesian PCA.\n\nFigure 3: Plot of the average effective dimensionality of the Bayesian PCA model versus the number $N$ of data points for data in a 10-dimensional space.\n\nFigure 4: Plot of the log likelihood for the training set (dashed curve) and the test set (solid curve) for maximum likelihood PCA models having $q$ values in the range 1-9, showing that the best generalization is achieved for $q = 5$ which corresponds to the number of directions of significant variance in the data set. Also shown are the training (circle) and test (cross) results from a Bayesian PCA model, plotted at the average effective $q$ value given by $q_{\\mathrm{eff}} = 5.2$. We see that the Bayesian PCA model automatically discovers the appropriate dimensionality for the principal component subspace, and furthermore that it has a generalization performance which is close to that of the optimal fixed $q$ model. 
\n\n4  Mixtures of Bayesian PCA Models\n\nGiven a probabilistic formulation of PCA it is straightforward to construct a mixture distribution comprising a linear superposition of principal component analyzers. In the case of maximum likelihood PCA we have to choose both the number $M$ of components and the latent space dimensionality $q$ for each component. For moderate numbers of components and data spaces of several dimensions it quickly becomes intractable to explore the exponentially large number of combinations of $q$ values for a given value of $M$. Here Bayesian PCA offers a significant advantage in allowing the effective dimensionalities of the models to be determined automatically.\n\nAs an illustration we consider a density estimation problem involving hand-written digits from the CEDAR database. The data set comprises $8 \\times 8$ scaled and smoothed gray-scale images of the digits '2', '3' and '4', partitioned randomly into 1500 training, 900 validation and 900 test points. For mixtures of maximum likelihood PCA the model parameters can be determined using the EM algorithm in which the M-step uses (4) and (5), with eigenvectors and eigenvalues obtained from the weighted covariance matrices in which the weighting coefficients are the posterior probabilities for the components determined in the E-step. Since, for maximum likelihood PCA, it is computationally impractical to explore independent $q$ values for each component we consider mixtures in which every component has the same dimensionality. We therefore train mixtures having $M \\in \\{2, 4, 6, 8, 10, 12, 14, 16, 18\\}$ for all values $q \\in \\{2, 4, 8, 12, 16, 20, 25, 30, 40, 50\\}$. In order to avoid singularities associated with the more complex models we omit any component from the mixture for which the value of $\\sigma^2$ goes to zero during the optimization. 
The highest log likelihood on the validation set (-295) is obtained for $M = 6$ and $q = 50$.\n\nFor mixtures of Bayesian PCA models we need only explore alternative values for $M$, which are taken from the same set as for the mixtures of maximum likelihood PCA. Again, the best performance on the validation set (-293) is obtained for $M = 6$. The values of the log likelihood for the test set were -295 (maximum likelihood PCA) and -293 (Bayesian PCA). The mean vectors $\\mu_i$ for each of the 6 components of the Bayesian PCA mixture model are shown in Figure 5.\n\n[Figure 5 images not reproduced; effective dimensionalities shown with the six components: 62, 54, 63, 60, 62, 59]\n\nFigure 5: The mean vectors for each of the 6 components in the Bayesian PCA mixture model, displayed as an $8 \\times 8$ image, together with the corresponding values of the effective dimensionality.\n\nThe Bayesian treatment of PCA discussed in this paper can be particularly advantageous for small data sets in high dimensions as it can avoid the singularities associated with maximum likelihood (or conventional) PCA by suppressing unwanted degrees of freedom in the model. This is especially helpful in a mixture modelling context, since the effective number of data points associated with specific 'clusters' can be small even when the total number of data points appears to be large.\n\nReferences\n\nBishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.\n\nMacKay, D. J. C. (1995). Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6 (3), 469-505.\n\nTipping, M. E. and C. M. Bishop (1997a). Mixtures of principal component analysers. In Proceedings IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K., July, pp. 13-18.\n\nTipping, M. E. and C. M. Bishop (1997b). 
Probabilistic principal component analysis. Accepted for publication in the Journal of the Royal Statistical Society, B.\n", "award": [], "sourceid": 1549, "authors": [{"given_name": "Christopher", "family_name": "Bishop", "institution": null}]}