{"title": "Generalized\u00b2 Linear\u00b2 Models", "book": "Advances in Neural Information Processing Systems", "page_first": 593, "page_last": 600, "abstract": "", "full_text": "Generalized\u00b2 Linear\u00b2 Models \n\nGeoffrey J. Gordon \nggordon@cs.cmu.edu \n\nAbstract \n\nWe introduce the Generalized\u00b2 Linear\u00b2 Model, a statistical estimator \nwhich combines features of nonlinear regression and factor analysis. \nA (GL)2M approximately decomposes a rectangular matrix \nX into a simpler representation f(g(A)h(B)). Here A and B are \nlow-rank matrices, while f, g, and h are link functions. (GL)2Ms \ninclude many useful models as special cases, including principal \ncomponents analysis, exponential-family PCA, the infomax formulation \nof independent components analysis, linear regression, and \ngeneralized linear models. They also include new and interesting \nspecial cases, one of which we describe below. We also present an \niterative procedure which optimizes the parameters of a (GL)2M. \nThis procedure reduces to well-known algorithms for some of the \nspecial cases listed above; for other special cases, it is new. \n\n1 Introduction \n\nLet the m \u00d7 n matrix X contain an independent sample from some unknown distribution. \nEach column of X represents a training example, and each row represents a \nmeasured feature of the examples. It is often reasonable to assume that some of the \nfeatures are redundant, that is, that there exists a reduced set of l features which \ncontains all or most of the information in X. \nIf the reduced features are linear functions of the original features and the distributions \nof the elements of X are Gaussian, redundancy means we can write X as \nthe product of two smaller matrices U and V with small sum of squared errors. 
 \nThis factorization is essentially a singular value decomposition: U must span the \nfirst l dimensions of the left principal subspace of X, while V^T must span the first \nl dimensions of the right principal subspace. (Since the above requirements do not \nuniquely determine U and V, the SVD traditionally imposes additional restrictions \nwhich we will ignore in this paper.) \n\nThe SVD has a long list of successes in machine learning, including information \nretrieval applications such as latent semantic analysis [1] and link analysis [2]; pattern \nrecognition applications such as \"eigenfaces\" [3]; structure from motion algorithms \n[4]; and data compression tools [5]. Unfortunately, the SVD makes two \nassumptions which can limit its accuracy as a learning tool. \n\nThe first assumption is the use of the sum of squared errors between X and UV as \na loss function. Squared error loss means that predicting 1000 when the answer is \n1010 is as bad as saying -7 when the answer is 3. The second assumption is that \nthe reduced features are linear functions of the original features. Instead, X might \nbe a nonlinear function of UV, and U and V might be nonlinear functions of some \nother matrices A and B. To address these shortcomings, we propose the model \n\nX\u0302 = f(g(A)h(B)) (1) \n\nfor the expected value of X. We also propose allowing non-quadratic loss functions \nfor the error (X - X\u0302) and the parameter matrices A and B. The fixed functions \nf, g, and h are called link functions. By analogy to generalized linear models [6], we call \nequation (1) a Generalized\u00b2 Linear\u00b2 Model: generalized\u00b2 because it uses link functions \nfor the parameters A and B as well as the prediction X\u0302, and linear\u00b2 because like \nthe SVD it is bilinear. 
 \n\nAs long as we choose link and loss functions that match each other (see below for the \ndefinition of matching link and loss), there will exist efficient algorithms for finding \nA and B given X, f, g, and h. Because (1) is a generalization of the SVD, (GL)2Ms \nare drop-in replacements for SVDs in all of the applications mentioned above, with \nbetter reconstruction performance when the SVD's error model is inaccurate. In \naddition, they open up new applications (see section 7 for one) where an SVD would \nhave been unable to provide a sufficiently accurate reconstruction. \n\n2 Matching link and loss functions \n\nWhenever we try to optimize the predictions of a nonlinear model, we need to worry \nabout getting stuck in local minima. One example of this problem is when we try \nto fit a single sigmoid unit with parameters \u03b8 \u2208 \u211d^d to training inputs x_i \u2208 \u211d^d and \ntarget outputs y_i \u2208 \u211d under squared error loss: \n\nL = \u2211_i (\u0177_i - y_i)^2 \n\u0177_i = logit(z_i) \nz_i = x_i \u00b7 \u03b8 \n\nEven for small training sets, the number of local minima of L can grow exponentially \nwith the dimension d [7]. On the other hand, if we optimize the same predictions \n\u0177_i under the logarithmic loss function -\u2211_i [y_i log \u0177_i + (1 - y_i) log(1 - \u0177_i)] instead of \nsquared error, our optimization problem is convex. Because the logistic link works \nwith the log loss to produce a convex optimization problem, we say they match each \nother [7]. Matching link-loss pairs are important because minimizing a convex loss \nfunction is usually far easier than minimizing a nonconvex one. \nWe can use any convex function F(z) to generate a matching pair of link and loss \nfunctions. The loss function which corresponds to F is \n\nD_F(z | y) = \u2211_i [F(z_i) + F*(y_i) - y_i z_i] (2) \n\nwhere F*(y) is defined so that min_z D_F(z | y) = 0. (F* is the convex dual of F [8], \nand D_F is the generalized Bregman divergence from z to y [9].) 
 \nExpression (2) is nonnegative, and it is globally convex in all of the z_i's (and therefore \nalso in \u03b8 since each z_i is a linear function of \u03b8). If we write f for the gradient of F, \nthe derivative of (2) with respect to z_i is f(z_i) - y_i. So, (2) will be zero if and only \nif y_i = f(z_i) for all i; in other words, using the loss (2) implies that \u0177_i = f(z_i) is \nour best prediction of y_i, and f is therefore our matching link function. \nWe will need two facts about convex duals below. The first is that F* is always \nconvex, and the second is that the gradient of F* is equal to f^{-1}. (Also, convex \nduality is defined even when F, G, and H aren't differentiable. If they are not, \nreplace derivatives by subgradients below.) \n\n3 Loss functions for (GL)2Ms \n\nIn (GL)2Ms, matching loss functions will be particularly important because we need \nto deal with three separate nonlinear link functions. We will usually not be able \nto avoid local minima entirely; instead, the overall loss function will be convex in \nsome groups of parameters if we hold the remaining parameters fixed. \n\nWe will specify a (GL)2M by picking three link functions and their matching loss \nfunctions. We can then combine these individual loss functions into an overall loss \nfunction as described in section 4; fitting a (GL)2M will then reduce to minimizing \nthe overall loss function with respect to our parameters. Each choice of links results \nin a different (GL)2M and therefore potentially a different decomposition of X. \n\nThe choice of link functions is where we should inject our domain knowledge about \nwhat sort of noise there is in X and what parameter matrices A and B are a priori \nmost likely. 
Useful link functions include f(x) = x (corresponding to squared error \nand Gaussian noise), f(x) = log x (unnormalized KL-divergence and Poisson noise), \nand f(x) = (1 + e^{-x})^{-1} (log-loss and Bernoulli noise). \nThe loss functions themselves are only necessary for the analysis; all of our algorithms \nneed only the link functions and (in some cases) their derivatives. So, we \ncan pick the loss functions and differentiate to get the matching link functions; or, \nwe can pick the link functions directly and not worry about the corresponding loss \nfunctions. In order for our analysis to apply, the link functions must be derivatives \nof some (possibly unknown) convex functions. \nOur loss functions are D_F, D_G, and D_H where \n\nF : \u211d^{m\u00d7n} \u2192 \u211d \nG : \u211d^{m\u00d7l} \u2192 \u211d \nH : \u211d^{l\u00d7n} \u2192 \u211d \n\nare convex functions. We will abuse notation and call F, G, and H loss functions as \nwell: F is the prediction loss, and its derivative f is the prediction link; it provides \nour model of the noise in X. G and H are the parameter losses, and their derivatives \ng and h are the parameter links; they tell us which values of A and B are a priori \nmost likely. By convention, since F takes an m \u00d7 n matrix argument, we will say \nthat the input and output to f are also m \u00d7 n matrices (similarly for g and h). \n\n4 The model and its fixed point equations \n\nWe will define a (GL)2M by specifying an overall loss function which relates the \nparameter matrices A and B to the data matrix X. If we write U = g(A) and \nV = h(B), the (GL)2M loss function is \n\nL(U, V) = F(UV) - X \u2218 UV + G*(U) + H*(V) (3) \n\nHere A \u2218 B is the \"matrix dot product,\" often written tr(A^T B). \n\nExpression (3) is a sum of three Bregman divergences, ignoring terms which don't \ndepend on U and V: it is D_F(UV | X) + D_G(0 | U) + D_H(0 | V). 
The F-divergence \ntends to pull UV towards X, while the G- and H-divergences favor small U and V. \n\nTo further justify (3), we can examine what happens when we compute its derivatives \nwith respect to U and V and set them to 0. The result is a set of fixed-point \nequations that the optimal parameter settings must satisfy: \n\nU^T (X - f(UV)) = B (4) \n(X - f(UV)) V^T = A (5) \n\nTo understand these equations, we can consider two special cases. First, if we let \nG* and H* go to zero (so there is no pressure to keep U and V small), (4) becomes \n\nU^T (X - f(UV)) = 0 (6) \n\nwhich tells us that each column of the error matrix must be orthogonal to each \ncolumn of U. Second, if we set the prediction link to be f(UV) = UV, (6) becomes \n\nU^T U V = U^T X \n\nwhich tells us that for a given U, we must choose V so that UV reconstructs X \nwith the smallest possible sum of squared errors. \n\n5 Algorithms for fitting (GL)2Ms \n\nWe could solve equations (4-5) with any of several different algorithms. For example, \nwe could use gradient descent on either U, V or A, B. Or, we could use the \ngeneralized gradient descent [9] update rule (with learning rate \u03b1): \n\nA \u2190_\u03b1 (X - f(UV)) V^T \nB \u2190_\u03b1 U^T (X - f(UV)) \n\nThe advantage of these algorithms is that they are simple to implement and don't \nrequire additional assumptions on F, G, and H. They can even work when F, G, \nand H are nondifferentiable by using subgradients. \nIn this paper, though, we will focus on a different algorithm. Our algorithm is based \non Newton's method, and it reduces to well-known algorithms for several special \ncases of (GL)2Ms. 
Of course, since the end goal is  solving  (4-5), this algorithm will \nnot always be the method of choice;  instead, any given implementation of a  (GL)2M \nshould  use  the simplest algorithm that works. \n\nFor our Newton algorithm we  will  need to place some restrictions on the prediction \nand parameter loss functions.  (These restrictions are only necessary for  the Newton \nalgorithm;  more general loss functions still give valid  (GL)2Ms,  but require different \nalgorithms.)  First, we will require (4-5) to be differentiable.  Second, we will restrict \n\nF(Z)  =  LFij (Zij) \n\nij \n\nH(B)  =  L  Hj(B. j ) \n\nj \n\nThese definitions fix most of the second derivatives of L(U, V) to be zero, simplifying \nand speeding up computation.  Write  f ij , gi,  and hj  for  the respective derivatives. \nWith  these  restrictions,  we  can  linearize  (4)  and  (5)  around  our  current  guess  at \nthe parameters, then solve  the resulting equations to find  updated parameters.  To \nsimplify notation, we can think of (4)  as  j  separate equations, one for  each column \nof V.  Linearizing with respect  to Vj  gives: \n\n(UT DjU + Hj)(Vr w  - Vj)  =  UT(X.j  - f.j(UV j ))  - B. j \n\nwhere the l  x l  matrix H j  is  the Hessian of Hi at V j '  or equivalently the inverse of \nthe  Hessian  of Hj  at  B.j;  and the m  x  m  diagonal matrix Dj  contains  the  second \nderivatives of F  with respect  to the jth column of its argument.  That is, \n\nNow, collecting terms involving Vjew yields: \n\n\fWe  can recognize  (7)  as a  weighted least squares problem with weights VJ5j,  prior \nprecision H j ,  prior mean Vj + H j1 B-j , and target outputs \n\nUV j  + Dj1(x.j -\n\nf(UV j )) \n\nSimilarly,  we  can linearize with respect  to rows  of U to find  the equation \n\nUreW(VDiVT + G i )  = ((Xi.  - j;.(Ui.V))Di1 + Ui.V)DiVT + Ui. G i - Ai. 
\n\n(8) \nwhere  G i  is  the  Hessian  of Gi  and  Di  contains  the  second  derivatives  of  F  with \nrespect  to the ith row  of its argument. \n\nWe  can  solve  one  copy of (7)  simultaneously for  each column of V,  then replace V \nby  vnew.  Next we  can solve one copy of (8)  simultaneously for  each row  of U,  then \nreplace U by  unew.  Alternating between these two updates will tend to reduce  (3).1 \n\n6  Related  models \n\nThere are many important special cases of (GL)2Ms.  We  derive  two in this section; \nothers  include  principal  components  analysis,  \"sensible\"  PCA,  linear  regression, \ngeneralized  linear  models,  and the  weighted  majority algorithm.  (Our  Newton  al(cid:173)\ngorithm turns into power iteration for  PCA and iteratively-reweighted least squares \nfor  GLMs.)  (GL)2Ms  are related  to generalized  bilinear  models;  the  latter include \nsome of the above special cases,  but not ICA,  weighted majority, or the example of \nsection 7.  There are natural generalizations of (GL)2Ms  to multilinear interactions. \nFinally,  some  models  such  as  non-negative  matrix  factorization  [10]  and  general(cid:173)\nized  low-rank approximation [11]  are cousins  of (GL)2Ms:  they  use  a  loss  function \nwhich  is  convex  in  either  factor  with  the  other fixed  but  which  is  not  a  Bregman \ndivergence. \n\n6.1 \n\nIndependent  components analysis \n\nIn  ICA,  we  assume  that  there is  a  hidden  matrix V  (the  same  size  as  X)  of inde(cid:173)\npendent random variables, and that X  was generated from  V  by applying a square \nmatrix  U.  We  seek  to  recover  the  mixing  matrix  U  and  the  sources  V;  in  other \nwords,  we  want  to  decompose  X  =  UV  so  that  the  elements  of V  are  as  nearly \nindependent as possible. \n\nThe info max algorithm for  ICA assumes that the elements of V  follow  distributions \nwith  heavy  tails  (i.e. ,  high  kurtosis).  
This assumption helps us find independent \ncomponents because the sum of two heavy-tailed random variables tends to have \nlighter tails, so we can search for U by trying to make the elements of V follow a \nheavy-tailed distribution. \n\nIn our notation, the fixed point of the infomax algorithm for ICA is \n\n-U^T = tanh(V) X^T (9) \n\n(see, e.g., equation (11) or (13) of [12]). To reproduce (9), we will let the prediction \nlink f be the identity, and we will let the duals of the parameter loss functions be \n\nG*(U) = -\u03b5 log det U \nH*(V) = \u03b5 \u2211_ij log cosh V_ij \n\nwhere \u03b5 is a small positive real number. Then equations (4) and (5) become \n\nU^T (X - UV) = \u03b5 tanh(V) (10) \n(X - UV) V^T = -\u03b5 U^{-T} (11) \n\nsince the derivative of log cosh v is tanh v and the derivative of log det U is U^{-T}. \nRight-multiplying (10) by (UV)^T and substituting in (11) yields \n\n-U^T = tanh(V) (UV)^T (12) \n\nNow since UV \u2192 X as \u03b5 \u2192 0, (12) is equivalent to (9) in the limit of vanishing \u03b5. \n\nFootnote 1: To guarantee convergence, we can check that (3) decreases and reduce our step size if \nwe encounter problems. (Since U^T D_j U, H_j, V D_i V^T, and G_i are all positive definite, the \nNewton update directions are descent directions; so, there always exists a small enough \nstep size.) We have not found this check necessary in practice. \n\n6.2 Exponential family PCA \n\nTo duplicate exponential family PCA [13], we can set the prediction link f arbitrarily \nand let the parameter links g and h be large multiples of the identity. 
Our \nNewton  algorithm is  applicable under the assumptions of [13],  and  (7)  becomes \n\n(13) \n\nEquation  (13)  along  with  the  corresponding  modification  of  (8)  should  provide  a \nmuch faster algorithm than the one proposed in  [13],  which  updates only part of U \nor V  at a  time  and keeps  updating the same part until  convergence before  moving \non to the next one. \n\n7  Example:  robot  belief states \n\nFigure  1 shows  a  map  of a  corridor in  the  CMU  CS  building.  A  robot  navigating \nin  this  corridor can  sense  both side  walls  and  compute an  accurate estimate of its \nlateral position.  Unless  it is  near an observable feature  such  the lab door near the \nmiddle  of the  corridor,  however,  it  can't  accurately  resolve  its  position  along  the \ncorridor and it  can't  tell  whether it  is  pointing left  or right. \n\nIn  order  to  plan  to  achieve  a  goal  in  this  environment,  the  robot  must  maintain \na  belief  state  (a  probability  distribution  representing  its  best  information  about \nthe  unobserved  state  variables).  The  map  shows  the  robot's  starting  belief state: \nit  is  at  one  end  of  the  corridor  facing  in,  but  it  doesn't  know  which  end.  We \ncollected a training set of 400 belief states by driving the robot along the corridor and \nfeeding its sensor readings to a belief tracker [14].  To simulate a larger environment \nwith  greater uncertainty,  we  artificially  reduced  sensor  range  and  increased error. \nFigure  1 shows  two of the collected  beliefs. \n\nPlanning is  difficult  because belief states are high-dimensional:  even in  this simple \nworld there are 550  states  (275  positions  at  lOcm  intervals  along the  corridor  x  2 \norientations), so a belief is  a vector in ]R550.  Fortunately, the robot never encounters \nmost  belief states.  
This regularity can make planning tractable: if we can identify \na few features which extract the important information from belief states, we can \nplan in low-dimensional feature space instead of high-dimensional belief space. \nWe factored the matrix of belief states using feature space ranks l = 3, 4, 5. For the \nprediction link f(Z) we used exp(Z) (componentwise); this link ensures that the \npredicted beliefs are positive, and treats errors in small probabilities as proportionally \nmore important than errors in large ones. (The matching loss for f is a Poisson \nlog-likelihood or unnormalized KL-divergence.) For the parameter link h we used \n10^{12} I, corresponding to H* = 10^{-12} ||V||^2 / 2 (a weak bias towards small V). \n\nFigure 1: Belief states. Left panel: overhead map of corridor with initial belief b_1; \nbelief state b_50 (just before robot finds out which direction it's pointing); belief b_90 \n(just after finding out). Right panel: reconstruction of b_50 with 3, 4, and 5 features. \n\nWe set G* = 10^{-12} ||U||^2 / 2 + \u0394(U), where \u0394 is 0 when the first column of U contains \nall 1s and \u221e otherwise. This loss function fixes the first column of U, representing \nour knowledge that one feature should be a normalizing constant so that each belief \nsums to 1. The subgradient of G* is 10^{-12} U + [k, 0], so equation (5) becomes \n\n(X - exp(UV)) V^T = 10^{-12} U + [k, 0] \n\nHere [k, 0] is a matrix with an arbitrary first column and all other elements 0; this \nmatrix has enough degrees of freedom to compensate for the constraints on U. \n\nOur Newton algorithm handles this modified fixed point equation without difficulty. 
\nSo,  this  (GL)2M  is  a  principled  and  efficient  way  to  decompose  a  matrix of prob(cid:173)\nability  distributions.  So  far  as  we  know  this  model  and  algorithm  have  not  been \ndescribed in the literature. \n\nFigure  1 shows  our reconstructions of a  representative belief state using  I  =  3, 4,5 \nfeatures  (one of which is a normalizing constant that can be discarded for planning). \nThe I =  5 reconstruction is  consistently  good across all  400  beliefs,  while  the I =  4 \nreconstruction  has  minor  artifacts  for  some  beliefs.  A  small  number  of restarts  is \nrequired to achieve good decompositions for  I  =  3  where the optimization problem \nis  most  constrained.  For  comparison, a  traditional  SVD  requires  a  matrix of rank \nabout  25  to  achieve  the  same  mean-squared  reconstruction  error  as  our  rank-3 \ndecomposition.  It  requires rank about  85  to match our rank-5  decomposition. \nExamination  of the  learned  U  matrix  (not  shown)  for  I  = 4  reveals  that  the  cor(cid:173)\nridor  is  mapped  into  two  smooth curves  in  feature  space, one  for  each orientation. \nCorresponding  states  with  opposite  orientations  are  mapped  into  similar  feature \nvectors for  one half of the corridor  (where the training beliefs  were sometimes con(cid:173)\nfused  about  orientation)  but  not  the  other  (where  there  were  no  training  beliefs \nthat indicated any connection between orientations).  Reconstruction artifacts occur \nwhen  a  curve nearly self-intersects and causes  confusion  between states.  This self(cid:173)\nintersection happens  because of local minima in  the loss  function;  with  more flexi(cid:173)\nbility (I  =  5)  the optimizer is able to untangle the curves and avoid self-intersection. \n\nOur success in  compressing the belief state translates directly  into success  in  plan(cid:173)\nning;  see  [15]  for  details.  
By  comparison,  traditional  SVD  on either  the  beliefs  or \nthe  log  beliefs  produces feature  sets  which  are  unusable for  planning because  they \ndo not  achieve sufficiently good reconstruction with few  enough features. \n\n\f8  Discussion \n\nWe  have introduced a  new  general class of nonlinear regression and factor  analysis \nmodel,  presenting a  derivation and algorithm,  reductions to previously-known spe(cid:173)\ncial  cases,  and  a  practical example.  The model  is  a  drop-in  replacement for  PCA, \nbut  can  provide  much  better  reconstruction  performance  in  cases  where  the  PCA \nerror model is  inaccurate.  Future research includes online algorithms for  parameter \nadjustment;  extensions for  missing data;  and exploration of new  link functions. \n\nAcknowledgments \n\nThanks  to  Nick  Roy  for  helpful  comments  and  for  providing  the  data  analyzed \nin  section  7.  This  work  was  supported  by  AFRL  contract  F30602-01-C-0219, \nDARPA's  MICA  program,  and  by  AFRL  contract  F30602- 98- 2- 0137,  DARPA's \nCoABS  program.  The opinions and conclusions are the author's and do  not  reflect \nthose of the US  government or its agencies. \n\nReferences \n\n[1]  T. K. Landauer, P. W . Foltz, and D.  Laham. Introduction to latent semantic analysis. \n\nDiscourse  Processes,  25:259- 284,  1998. \n\n[2]  Jon  M.  Kleinberg.  Authoritative  sources  in  a  hyperlinked environment.  Journal  of \n\nthe  ACM,  46(5) :604-632,  1999. \n\n[3]  M.  Turk  and  A.  Pentland.  Eigenfaces  for  recognition.  Journal  of Cognitive  Neuro(cid:173)\n\nscience,  3(1) :71-86,  1991. \n\n[4]  Carlo  Tomasi  and  Takeo  Kanade.  Shape  and  motion  from  image  streams  under \n\northography:  a  factorization  method.  Int.  J.  Computer  Vision , 9(2):137- 154,  1992. \n\n[5]  D.  P.  O'Leary and S.  Peleg.  Digital  image  compression  by outer product  expansion. \n\nIEEE  Trans .  Communications, 31:441-444,  1983. 
 \n\n[6] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, London, \n2nd edition, 1983. \n\n[7] Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local \nminima for single neurons. In NIPS, vol. 8. MIT Press, 1996. \n\n[8] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, New Jersey, 1970. \n\n[9] Geoffrey J. Gordon. Approximate Solutions to Markov Decision Processes. PhD \nthesis, Carnegie Mellon University, 1999. \n\n[10] Daniel Lee and H. Sebastian Seung. Algorithms for nonnegative matrix factorization. \nIn NIPS, vol. 13. MIT Press, 2001. \n\n[11] Nathan Srebro. Personal communication, 2002. \n\n[12] Anthony J. Bell and Terrence J. Sejnowski. The 'independent components' of natural \nscenes are edge filters. Vision Research, 37(23):3327-3338, 1997. \n\n[13] Michael Collins, Sanjoy Dasgupta, and Robert Schapire. A generalization of principal \ncomponent analysis to the exponential family. In NIPS, vol. 14. MIT Press, 2002. \n\n[14] D. Fox, W. Burgard, F. Dellaert, and S. Thrun. Monte Carlo localization: Efficient \nposition estimation for mobile robots. In AAAI, 1999. \n\n[15] Nicholas Roy and Geoffrey J. Gordon. Exponential family PCA for belief compression \nin POMDPs. In NIPS, vol. 15. MIT Press, 2003. \n\n[16] Sam Roweis. EM algorithms for PCA and SPCA. In NIPS, vol. 10. MIT Press, 1998. \n", "award": [], "sourceid": 2144, "authors": [{"given_name": "Geoffrey", "family_name": "Gordon", "institution": null}]}