{"title": "Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence", "book": "Advances in Neural Information Processing Systems", "page_first": 1507, "page_last": 1514, "abstract": "", "full_text": "Factorization with uncertainty and\nmissing data: exploiting temporal\n\ncoherence\n\nAmit Gruber and Yair Weiss\n\nSchool of Computer Science and Engineering\n\nThe Hebrew University of Jerusalem\n\n91904 Jerusalem, Israel\n\nfamitg,yweissg@cs.huji.ac.il\n\nAbstract\n\nThe problem of \\Structure From Motion\" is a central problem in\nvision: given the 2D locations of certain points we wish to recover\nthe camera motion and the 3D coordinates of the points. Un-\nder simpli\ufb02ed camera models, the problem reduces to factorizing\na measurement matrix into the product of two low rank matrices.\nEach element of the measurement matrix contains the position of\na point in a particular image. When all elements are observed, the\nproblem can be solved trivially using SVD, but in any realistic sit-\nuation many elements of the matrix are missing and the ones that\nare observed have a di\ufb01erent directional uncertainty. Under these\nconditions, most existing factorization algorithms fail while human\nperception is relatively unchanged.\nIn this paper we use the well known EM algorithm for factor analy-\nsis to perform factorization. This allows us to easily handle missing\ndata and measurement uncertainty and more importantly allows us\nto place a prior on the temporal trajectory of the latent variables\n(the camera position). We show that incorporating this prior gives\na signi\ufb02cant improvement in performance in challenging image se-\nquences.\n\n1\n\nIntroduction\n\nFigure 1 illustrates the classical structure from motion (SFM) displays introduced by\nUllman [13]. A transparent cylinder with painted dots rotates around its elongated\naxis. 
Even though no structure is apparent in any single frame, humans obtain a vivid percept of a cylinder¹.

SFM has been dealt with extensively in the computer vision literature. Typically a small number of feature points are tracked and a measurement matrix is formed in which each element corresponds to the image coordinates of a tracked point.

¹An online animation of this famous stimulus is available at: aris.ss.uci.edu/cogsci/personnel/hoffman/cylinderapplet.html

Figure 1: The classical structure from motion stimulus introduced by Ullman [13]. Humans continue to perceive the correct structure even when each dot appears only for a small number of frames, but most existing factorization algorithms fail in this case. Replotted from [1].

The goal is to recover the camera motion and the 3D location of these points. Under simplified camera models it can be shown that this problem reduces to a problem of matrix factorization. We wish to describe the measurement matrix as a product of two low rank matrices. Thus if all features are reliably tracked in all the frames, the problem can be solved trivially using SVD [11]. In particular, performing an SVD on the measurement matrix of the rotating cylinder stimulus recovers the correct structure even if the measurement matrix is contaminated with significant amounts of noise and if the number of frames is relatively small.

But in any realistic situation, the measurement matrix will have missing entries. This is either because certain feature points are occluded in some of the frames and hence their positions are unknown, or due to a failure in the tracking algorithm. This has led to the development of a number of algorithms for factorization with missing data [11, 6, 9, 2].

Factorization with missing data turns out to be much more difficult than the full data case.
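Before turning to the missing-data case, the full-data factorization is easy to reproduce in a few lines: a noiseless orthographic measurement matrix has rank 4, and a truncated SVD recovers a valid factorization. The NumPy sketch below is illustrative only, not the authors' code; the synthetic cylinder, the camera path, and all variable names are assumptions made for the example.

```python
import numpy as np

# Synthetic full measurement matrix: P points on a cylinder surface viewed
# by an orthographic camera rotating about the cylinder axis over F frames.
# (Illustrative data, not the paper's exact stimulus.)
rng = np.random.default_rng(0)
F, P = 20, 100
theta = rng.uniform(0, 2 * np.pi, P)      # angular position on the cylinder
height = rng.uniform(-1, 1, P)            # position along the elongated axis
S = np.vstack([np.cos(theta), height, np.sin(theta), np.ones(P)])   # 4 x P

W = np.zeros((2 * F, P))
for f in range(F):
    phi = 0.1 * f
    # Camera rows [m^T d; n^T e]: rotation about the vertical axis plus a
    # small translation so that M has full column rank.
    M_f = np.array([[np.cos(phi), 0.0, np.sin(phi), 0.05 * f],
                    [0.0,         1.0, 0.0,         0.02 * f]])
    W[2 * f:2 * f + 2] = M_f @ S

# Rank-4 truncated SVD: W = M_hat @ S_hat exactly on noiseless data,
# up to an invertible 4x4 ambiguity relative to the true M and S.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
M_hat = U[:, :4] * np.sqrt(s[:4])
S_hat = np.sqrt(s[:4])[:, None] * Vt[:4]
print(np.abs(W - M_hat @ S_hat).max())    # ~0 for noiseless data
```

The same truncation is what fails once entries are missing: the SVD has no way to down-weight unobserved elements, which is exactly the gap the factor-analysis formulation below fills.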
To illustrate the difficulty, consider the cylinder stimulus in figure 1. Humans still obtain a vivid percept of a cylinder even when each dot has a short "dot life". That is, each dot appears at a random starting frame, continues to appear for a small number of frames, and then disappears [12]. We applied the algorithms in [11, 6, 9, 2] to a sequence of 20 frames of a rotating cylinder in which the dot life was 10 frames. Thus the matrix was half full (or half empty). Surprisingly, none of the algorithms could recover the cylinder structure. They either failed to find any structure or they gave a structure that was drastically different from a cylinder. Presumably, humans are using additional prior knowledge that the algorithms are not.

In this paper we point out a source of information in image sequences that is usually neglected by factorization algorithms: temporal coherence. In a video sequence, the camera location at time t + 1 will probably be similar to its location at time t. In other words, if we randomly permute the temporal order of the frames, we will get a very unlikely image sequence. Yet nearly all existing factorization algorithms are invariant to this random permutation of the frames: they only seek a low rank approximation to a matrix, and permuting the rows of the matrix will not change the approximation.

In order to enable the use of temporal coherence, we formulate factorization in terms of maximum likelihood for a factor analysis model, where the latent variable corresponds to camera position. We use the familiar EM algorithm for factor analysis to perform factorization with missing data and uncertainty. We show how to add a temporal coherence prior to the model and derive the EM updates.
We show that incorporating this prior gives a significant improvement in performance in challenging image sequences.

2 Model

A set of P feature points in F images are tracked along an image sequence. Let (u_{fp}, v_{fp}) denote the image coordinates of feature point p in frame f. Let U = (u_{fp}), V = (v_{fp}) and W = (w_{ij}), where w_{2i-1,j} = u_{ij} and w_{2i,j} = v_{ij} for 1 ≤ i ≤ F, i.e. W is an interleaving of the rows of U and V.

In the orthographic camera model, points in the 3D world are projected in parallel onto the image plane. For example, if the camera's optical center is at the origin (w.r.t. the 3D coordinate system), and its x, y axes coincide with the X, Y axes of the 3D world, then taking a picture is a simple projection (in homogeneous coordinates):

(x, y)^T = [1 0 0 0; 0 1 0 0] (X, Y, Z, 1)^T

The depth, Z, has no influence on the image. In this model, a camera can undergo rotation, translation, or a combination of the two.

Under orthography, and in the absence of noise,

[W]_{2F×P} = [M]_{2F×4} [S]_{4×P}    (1)

where M = [M_1; ...; M_F] is the 2F×4 stack of per-frame camera matrices and S is the 4×P matrix whose p-th column is (X_p, Y_p, Z_p, 1)^T. M describes camera motion (rotation and translation), with [M_i]_{2×4} = [m_i^T d_i; n_i^T e_i]: m_i and n_i are 3×1 vectors that describe the rotation of the camera; d_i and e_i are scalars describing camera translation², and S describes the points' locations in 3D.

For noisy observations, the model becomes:

[W]_{2F×P} = [M]_{2F×4} [S]_{4×P} + [η]_{2F×P}    (2)

where η is Gaussian noise.

If the elements of the noise matrix η are uncorrelated and of equal variance then we seek a factorization that minimizes the mean squared error between W and MS. This can be solved trivially using the SVD of W. Missing data can be modeled using equation 2 by assuming some elements of the noise matrix η have infinite variance. Obviously the SVD is not the solution once we allow different elements of η to have different variances.

2.1 Factorization as factor analysis

It is well known that the SVD calculation can be formulated as a limiting case of maximum likelihood factor analysis [8]. In standard factor analysis we have a set of observations {y(t)} that are linear combinations of a latent variable x(t):

y(t) = A x(t) + η(t)    (3)

with x(t) ~ N(0, σ_x² I) and η(t) ~ N(0, Ψ_t). If Ψ_t is a diagonal matrix with constant elements Ψ_t = σ² I then in the limit σ/σ_x → 0 the ML estimate for A will give the same answer as the SVD. We now show how to rewrite the SFM problem in this form.

In equation 1 the horizontal and vertical coordinates of the same point appear in different rows. It can be rewritten as:

[U V]_{F×2P} = [M N]_{F×8} [S 0; 0 S]_{8×2P} + [η̃]_{F×2P}    (4)

²We do not subtract the mean of each row from it, since in the case of missing data the centroids of the points do not coincide.

Let y(t) be the vector of noisy observations (noisy image locations) at time t, i.e.
y(t) = [u(t) v(t)], that is y(t) = [u_1(t), ..., u_P(t), v_1(t), ..., v_P(t)]^T. Let x(t) be a vector of length 8 that denotes the camera position at time t, x(t) = [m(t)^T d(t) n(t)^T e(t)]^T, and let A = [S^T 0; 0 S^T]. Identifying y(t) with the t-th row of the matrix [U V] and x(t) with the t-th row of [M N], equation 4 is equivalent to equation 3.

We can now use the standard EM algorithm for factor analysis to find the ML estimate for S.

E step:

E(x(t)|y(t)) = (σ_x^{-2} I + A^T Ψ_t^{-1} A)^{-1} A^T Ψ_t^{-1} y(t)    (5)
V(x(t)|y(t)) = (σ_x^{-2} I + A^T Ψ_t^{-1} A)^{-1}    (6)
<x(t)> = E(x(t)|y(t))    (7)
<x(t)x(t)^T> = V(x(t)|y(t)) + <x(t)><x(t)>^T    (8)

M step: In the M step we solve the normal equations for the structure S. The exact form depends on the structure of Ψ_t. Denote by s_p a vector of length 3 that gives the 3D coordinates of point p; then for a diagonal noise covariance matrix Ψ_t the M step is:

s_p = B_p C_p^{-1}    (9)

where

B_p = Σ_t [Ψ_t^{-1}(p, p)(u_tp − <d_t>)<m(t)^T> + Ψ_t^{-1}(p+P, p+P)(v_tp − <e_t>)<n(t)>^T]    (10)
C_p = Σ_t [Ψ_t^{-1}(p, p)<m(t)m(t)^T> + Ψ_t^{-1}(p+P, p+P)<n(t)n(t)^T>]

where the expectations required in the M step are the appropriate subvectors and submatrices of <x(t)> and <x(t)x(t)^T>.

If we set Ψ_t^{-1}(p, p) = Ψ_t^{-1}(p+P, p+P) = 0 when point p is missing in frame t, then we obtain an EM algorithm for factorization with missing data. Note that the form of the updates means we can put any value we wish in the missing elements of y and it will be ignored by the algorithm.

Figure 2: a.
The graphical model assumed by most factorization algorithms for SFM. The camera location x(t) is assumed to be independent of the camera location at any other time step. b. The graphical model assumed by our approach. We model temporal coherence by assuming a Markovian structure on the camera location.

A more realistic noise model for real images is that Ψ_t is not diagonal but rather that the noise in the horizontal and vertical coordinates of the same point is correlated, with an arbitrary 2×2 inverse covariance matrix. This problem is usually called factorization with uncertainty [5, 7]. It is easy to derive the M step in this case as well. It is similar to equation 9 except that cross terms involving Ψ_t^{-1}(p, p+P) are also involved:

s_p = (B_p + B'_p)(C_p + C'_p)^{-1}    (11)

where

B'_p = Σ_t [Ψ_t^{-1}(p, p+P)(v_tp − <e_t>)<m(t)^T> + Ψ_t^{-1}(p+P, p)(u_tp − <d_t>)<n(t)>^T]    (12)
C'_p = Σ_t [Ψ_t^{-1}(p, p+P)<n(t)m(t)^T> + Ψ_t^{-1}(p+P, p)<m(t)n(t)^T>]

Regardless of uncertainty and missing data, the complexity of the EM algorithm grows linearly with the number of feature points and the number of frames. At every iteration, the most computationally intensive step is an inversion of an 8×8 matrix.

2.2 Adding temporal coherence

The factor analysis algorithm for factorization assumes that the latent variables x(t) are independent. In SFM this assumption means that the camera location in different frames is independent and hence permuting the order of the frames makes no difference to the factorization. As mentioned in the introduction, in almost any video sequence this assumption is wrong.
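Before adding the prior, the basic EM iteration of section 2.1 can be sketched generically. The NumPy function below is a rough illustration under stated assumptions, not the authors' implementation: it assumes isotropic per-element noise (a special case of the Ψ_t above), handles a missing entry by giving it zero precision so it drops out of both steps, and updates a generic loading matrix A rather than the structured S of equation 9.

```python
import numpy as np

def factor_analysis_em(Y, k, observed, n_iter=200, sigma_x=1.0, noise_prec=1.0):
    """EM for y(t) = A x(t) + noise with x(t) ~ N(0, sigma_x^2 I).

    Y: T x D observations; observed: T x D boolean mask. A missing entry
    gets zero precision, so any value stored there is ignored, exactly as
    noted in the text. Isotropic noise is an assumed simplification of the
    paper's per-frame covariance Psi_t.
    """
    T, D = Y.shape
    rng = np.random.default_rng(1)
    A = rng.standard_normal((D, k))
    Ex = np.zeros((T, k))
    Exx = np.zeros((T, k, k))
    for _ in range(n_iter):
        # E step (cf. eqs. 5-8): posterior mean and second moment of x(t).
        for t in range(T):
            prec = noise_prec * observed[t]              # zero for missing
            G = np.linalg.inv(np.eye(k) / sigma_x**2 + (A.T * prec) @ A)
            Ex[t] = G @ ((A.T * prec) @ Y[t])
            Exx[t] = G + np.outer(Ex[t], Ex[t])
        # M step (cf. eq. 9): normal equations for each row of A, summing
        # only over frames where that coordinate was observed.
        for d in range(D):
            ts = observed[:, d]
            A[d] = np.linalg.solve(Exx[ts].sum(axis=0), Ex[ts].T @ Y[ts, d])
    return A, Ex
```

On low-rank synthetic data with a random missing mask, the residual on the observed entries drops toward zero over the iterations, while the hidden entries never enter the updates.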
Typically camera location varies smoothly as a function of time.

Figure 2a shows the graphical model corresponding to most factorization algorithms: the independence of the camera location is represented by the fact that every time step is isolated from the other time steps in the graph. But it is easy to fix this assumption by adding edges between the latent variables, as shown in figure 2b.

Specifically, we use a second order approximation to the motion of the camera:

x(t) = x(t−1) + v(t−1) + (1/2) a(t−1) + ε_1    (13)

Figure 3: Comparison of factor analysis and Jacobs' algorithm on synthetic sequences (columns: Truth, factor analysis, Jacobs). All other existing algorithms performed worse than Jacobs. They all fail when there is noise and missing data, while factor analysis with temporal coherence succeeds.
Structure and motion are shown from a top view.

v(t) = v(t−1) + a(t−1) + ε_2    (14)
a(t) = a(t−1) + ε_3    (15)
y(t) = A x(t) + η(t)    (16)

Note that we do not assume that the 2D trajectory of each point is smooth. Rather, we assume the 3D trajectory of the camera is smooth.

It is straightforward to derive the EM iterations for a ML estimate of S using the model in equation 16. The M step is unchanged from classical factor analysis and is given by equation 9. The only change in the E step is that E(x(t)|y) and V(x(t)|y) need to be calculated using a Kalman smoother. We use a standard RTS smoother [4]. Note that the computation of the E step is still linear in the number of frames and datapoints.

Kalman filtering has been used extensively in the perspective SFM setting (e.g. [10]). However, with perspective projection the problem is no longer one of factorization. Thus even for Gaussian noise, the extended Kalman filter needs to be used, smoothing is not performed, and no guarantee of an increase in likelihood is obtained. Within the factorization framework, we can use the classical Kalman filter and obtain a simple algorithm that provably increases the likelihood at every iteration.

3 Experiments

In this section we describe the experimental performance of EM with temporal coherence compared to ground truth and to previous algorithms for structure from motion with missing data [11, 6, 9, 2]. For [11, 6, 9] we used the Matlab implementation made public by D. Jacobs.

The first input sequence is the sequence of the cylinder shown in figure 1. 100 points uniformly drawn from the cylinder surface are tracked over 20 frames.
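The second-order motion model of equations 13-15 can be folded into a single linear-Gaussian transition on a stacked state z(t) = [x(t); v(t); a(t)], which is the form a standard Kalman/RTS smoother expects. A minimal sketch (k = 8 is the camera-state dimension from the text; the block layout is an assumption consistent with the equations, not code from the paper):

```python
import numpy as np

k = 8                      # length of the camera-position vector x(t)
I, Z = np.eye(k), np.zeros((k, k))

# z(t) = F_trans @ z(t-1) + noise encodes eqs. (13)-(15):
#   x(t) = x(t-1) + v(t-1) + a(t-1)/2,  v(t) = v(t-1) + a(t-1),  a(t) = a(t-1)
F_trans = np.block([[I, I, 0.5 * I],
                    [Z, I, I],
                    [Z, Z, I]])

def observation_matrix(A):
    """Only the position block of z(t) is observed (eq. 16): y = A x + noise."""
    return np.hstack([A, np.zeros((A.shape[0], 2 * k))])
```

Running the smoother with F_trans and observation_matrix(A) changes only the E step; the M step of equation 9 is untouched, which is the point made above.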
Each of the points appears for 10 frames, starting at a random time, and then disappears. The observed image locations were corrupted by Gaussian noise with standard deviation σ = 0.1.

We checked the performance of the different algorithms in the cases of: (1) a full, noise-free observation matrix, (2) a noisy full observation matrix, (3) noiseless observations with missing data, and (4) noisy observations with missing data.

Figure 4: Graphs depicting the influence of noise and percentage of missing data on reconstruction results of factor analysis and [6] (curves: EM with temporal coherence, EM, Jacobs; axes: reconstruction squared error vs. noise level σ and percentage of missing data).

Figure 5: Results of scene reconstruction from a real sequence: a binder is placed on a rotating surface filmed with a static camera. Our algorithm succeeded in (approximately) obtaining the right structure and all other algorithms failed. Results are shown in top view.

All algorithms performed well and gave similar results for the full matrix noiseless sequence.

In the fully observed noisy case, factor analysis without temporal coherence gave comparable performance to Tomasi-Kanade, which minimizes ||MS − W||²_F. When temporal coherence was added, the reconstruction results were improved. The results of Shum's algorithm were similar to Tomasi-Kanade.
The algorithms of Jacobs and Brand turned out to be noise-sensitive.

In the case of noiseless missing data (figure 3, top), our algorithm and Jacobs' algorithm reconstruct the correct motion and structure. Tomasi-Kanade's algorithm and Shum's algorithm could not handle this pattern of missing data and failed to give any structure.

Once we add even very mild amounts of noise (figure 3, middle), all existing algorithms fail, while factor analysis with temporal coherence continues to extract the correct structure even for significant noise values.

Figure 5 shows results on a real sequence.

4 Discussion

Despite progress in algorithms for factorization with uncertainty, the best existing algorithms still fall far short of human performance, even for seemingly simple stimuli. Presumably, humans are using additional prior information. In this paper we have focused on one particular prior: the temporal smoothness of the camera motion. We showed how to formulate SFM as a factor analysis problem and how to add temporal coherence to the EM algorithm. Our experimental results show that this simple prior can give a significant improvement in performance in challenging sequences.

Temporal coherence is just one of many possible priors. It has been suggested that humans also use a smoothness prior on the 3D surface they are perceiving [12]. It would be interesting to extend our framework in this direction.

The most drastic simplification our model makes is the assumption of Gaussian noise. It would be interesting to extend the algorithm to non-Gaussian settings. This may require approximate inference algorithms in the E step, as used in [3].

References

[1] R.A. Andersen and D.C. Bradley. Perception of three-dimensional structure from motion. Trends in Cognitive Sciences, 2:222-228, 1998.

[2] M.E. Brand. Incremental singular value decomposition of uncertain data with missing values.
In ECCV, pages 707-720, May 2002.

[3] F. Dellaert, S.M. Seitz, C.E. Thorpe, and S. Thrun. Structure from motion without correspondence. In ICCV, pages 696-702, January 1999.

[4] Arthur Gelb, editor. Applied Optimal Estimation. MIT Press, 1974.

[5] M. Irani and P. Anandan. Factorization with uncertainty. In ECCV, pages 959-966, January 2000.

[6] D. Jacobs. Linear fitting with missing data: Applications to structure-from-motion and to characterizing intensity images. In CVPR, pages 206-212, 1997.

[7] D.D. Morris and T. Kanade. A unified factorization algorithm for points, line segments and planes with uncertain models. In ICCV, pages 696-702, January 1999.

[8] S. Roweis. EM algorithms for PCA and SPCA. In NIPS, pages 431-437, 1997.

[9] H.Y. Shum, K. Ikeuchi, and R. Reddy. Principal component analysis with missing data and its application to polyhedral object modeling. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 854-867, September 1995.

[10] S. Soatto and P. Perona. Reducing structure from motion: a general framework for dynamic vision. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 943-960, 1999.

[11] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization method. Int. J. of Computer Vision, 9(2):137-154, November 1992.

[12] S. Treue, M. Husain, and R. Andersen. Human perception of structure from motion. Vision Research, 31:59-75, 1991.

[13] S. Ullman. The Interpretation of Visual Motion. MIT Press, 1979.
", "award": [], "sourceid": 2493, "authors": [{"given_name": "Amit", "family_name": "Gruber", "institution": null}, {"given_name": "Yair", "family_name": "Weiss", "institution": null}]}