{"title": "A Pseudo-Bayesian Algorithm for Robust PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1390, "page_last": 1398, "abstract": "Commonly used in many applications, robust PCA represents an algorithmic attempt to reduce the sensitivity of classical PCA to outliers.  The basic idea is to learn a decomposition of some data matrix of interest into low rank and sparse components, the latter representing unwanted outliers.  Although the resulting problem is typically NP-hard, convex relaxations provide a computationally-expedient alternative with theoretical support.  However, in practical regimes performance guarantees break down and a variety of non-convex alternatives, including Bayesian-inspired models, have been proposed to boost estimation quality.  Unfortunately though, without additional a priori knowledge none of these methods can significantly expand the critical operational range such that exact principal subspace recovery is possible.  Into this mix we propose a novel pseudo-Bayesian algorithm that explicitly compensates for design weaknesses in many existing non-convex approaches leading to state-of-the-art performance with a sound analytical foundation.", "full_text": "A Pseudo-Bayesian Algorithm for Robust PCA\n\nTae-Hyun Oh1\n\nYasuyuki Matsushita2\n\nIn So Kweon1\n\n1Electrical Engineering, KAIST, Daejeon, South Korea\n\n2Multimedia Engineering, Osaka University, Osaka, Japan\n\n3Microsoft Research, Beijing, China\n\nDavid Wipf3\u2217\n\nthoh.kaist.ac.kr@gmail.com\n\nyasumat@ist.osaka-u.ac.jp\n\niskweon@kaist.ac.kr\n\ndavidwip@microsoft.com\n\nAbstract\n\nCommonly used in many applications, robust PCA represents an algorithmic at-\ntempt to reduce the sensitivity of classical PCA to outliers. The basic idea is to learn\na decomposition of some data matrix of interest into low rank and sparse compo-\nnents, the latter representing unwanted outliers. 
Although the resulting problem is typically NP-hard, convex relaxations provide a computationally-expedient alternative with theoretical support. However, in practical regimes performance guarantees break down and a variety of non-convex alternatives, including Bayesian-inspired models, have been proposed to boost estimation quality. Unfortunately though, without additional a priori knowledge none of these methods can significantly expand the critical operational range such that exact principal subspace recovery is possible. Into this mix we propose a novel pseudo-Bayesian algorithm that explicitly compensates for design weaknesses in many existing non-convex approaches leading to state-of-the-art performance with a sound analytical foundation.

1 Introduction

It is now well-established that principal component analysis (PCA) is quite sensitive to outliers, with even a single corrupted data element carrying the potential of grossly biasing the recovered principal subspace. This is particularly true in many relevant applications that rely heavily on low-dimensional representations [8, 13, 27, 33, 22]. Mathematically, such outliers can be described by the measurement model Y = Z + E, where Y ∈ R^{n×m} is an observed data matrix, Z = AB^⊤ is a low-rank component with principal subspace equal to span[A], and E is a matrix of unknown sparse corruptions with arbitrary amplitudes.

Ideally, we would like to remove the effects of E, which would then allow regular PCA to be applied to Z for obtaining principal components devoid of unwanted bias. For this purpose, robust PCA (RPCA) algorithms have recently been motivated by the optimization problem

  min_{Z,E} max(n, m) · rank[Z] + ‖E‖₀  s.t.  Y = Z + E,    (1)

where ‖·‖₀ denotes the ℓ₀ matrix norm (meaning the number of nonzero matrix elements) and the max(n, m) multiplier ensures that both rank and sparsity terms scale between 0 and nm, reflecting a priori agnosticism about their relative contributions to Y. The basic idea is that if {Z*, E*} minimizes (1), then Z* is likely to represent the original uncorrupted data.

As a point of reference, if we somehow knew a priori which elements of E were zero (i.e., no gross corruptions), then (1) could be effectively reduced to the much simpler matrix completion (MC) problem [5]

  min_Z rank[Z]  s.t.  y_ij = z_ij, ∀(i, j) ∈ Ω,    (2)

where Ω denotes the set of indices corresponding with zero-valued elements in E. A major challenge with RPCA is that an accurate estimate of the support set Ω can be elusive.

*This work was done while the first author was an intern at Microsoft Research, Beijing. The first and third authors were supported by the NRF of Korea grant funded by the Korea government, MSIP (No. 2010-0028680). The second author was partly supported by JSPS KAKENHI Grant Number JP16H01732.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Unfortunately, solving (1) is non-convex, discontinuous, and NP-hard in general. Therefore, the convex surrogate referred to as principal component pursuit (PCP)

  min_{Z,E} √(max(n, m)) · ‖Z‖* + ‖E‖₁  s.t.  Y = Z + E    (3)

is often adopted, where ‖·‖* denotes the nuclear norm and ‖·‖₁ is the ℓ₁ matrix norm. These represent the tightest convex relaxations of the rank and ℓ₀ norm functions respectively.
Several theoretical results quantify technical conditions whereby the solutions of (1) and (3) are actually equivalent [4, 6]. However, these conditions are highly restrictive and do not provably hold in practical situations of interest such as face clustering [10], motion segmentation [10], high dynamic range imaging [22] or background subtraction [4]. Moreover, both the nuclear and ℓ₁ norms are sensitive to data variances, often over-shrinking large singular values of Z or coefficients in E [11].

All of this motivates stronger approaches to approximating (1). In Section 2 we review existing alternatives, including both non-convex and probabilistic approaches; however, we argue that none of these can significantly outperform PCP in terms of principal subspace recovery in important, representative experimental settings devoid of prior knowledge (e.g., true signal distributions, outlier locations, rank, etc.). We then derive a new pseudo-Bayesian algorithm in Section 3 that has been tailored to conform with principled overarching design criteria. By 'pseudo', we mean an algorithm inspired by Bayesian modeling conventions, but with special modifications that deviate from the original probabilistic script for reasons related to estimation quality and computational efficiency. Next, Section 4 examines relevant theoretical properties, explicitly accounting for all approximations involved, while Section 5 provides empirical validations. Proofs and other technical details are deferred to [23]. Our high-level contributions can be summarized as follows:

- We derive a new pseudo-Bayesian RPCA algorithm with an efficient ADMM subroutine.
- While provable recovery guarantees are absent for non-convex RPCA algorithms, we nonetheless quantify how our pseudo-Bayesian design choices lead to a desirable energy landscape.
In particular, we show that although any outlier support pattern will represent an inescapable local minimum of (1) (or of a broad class of functions that mimic (1)), our proposal can simultaneously retain the correct global optimum while eradicating at least some of the suboptimal minima associated with incorrect outlier location estimates.
- We empirically demonstrate improved performance over state-of-the-art algorithms (including PCP) in terms of standard phase transition plots with a dramatically expanded success region. Quite surprisingly, our algorithm can even outperform convex matrix completion (MC) despite the fact that the latter is provided with perfect knowledge of which entries are not corrupted, suggesting that robust outlier support pattern estimation is indeed directly facilitated by our model.

2 Recent Work

The vast majority of algorithms for solving (1) either implicitly or explicitly attempt to solve a problem of the form

  min_{Z,E} f₁(Z) + Σ_{i,j} f₂(e_ij)  s.t.  Y = Z + E,    (4)

where f₁ and f₂ are penalty functions that favor minimal rank and sparsity respectively. When f₁ is the nuclear norm (scaled appropriately) and f₂(e) = |e|, then (4) reduces to (3). Methods differ however by replacing f₁ and f₂ with non-convex alternatives, such as generalized Huber functions [7] or Schatten ℓ_p quasi-norms with p < 1 [18, 19]. When applied to the singular values of Z and elements of E respectively, these selections enact stronger enforcement of minimal rank and sparsity. If prior knowledge of the true rank of Z is available, a truncated nuclear norm approach (TNN-RPCA) has also been proposed [24].
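To make the separable form (4) concrete, here is a minimal sketch of ours (not code from the paper) evaluating the objective with a Schatten-p quasi-norm for f₁ and an elementwise ℓ_p quasi-norm for f₂; p = 1 recovers the convex surrogates, while p < 1 gives the non-convex variants cited above:

```python
import numpy as np

def separable_objective(Z, E, p=1.0):
    """Evaluate the generic separable surrogate in (4):
    f1(Z) = sum of singular values of Z raised to the p-th power
    (Schatten-p quasi-norm), f2(e) = |e|^p applied elementwise to E.
    p = 1 gives the nuclear norm plus ell_1 norm of PCP-style objectives;
    p < 1 enforces rank and sparsity more aggressively but non-convexly."""
    sv = np.linalg.svd(Z, compute_uv=False)
    return np.sum(sv**p) + np.sum(np.abs(E)**p)
```

Minimizing such an objective subject to Y = Z + E, for any fixed choice of p, is exactly the template that Proposition 1 in Section 4 later argues is fundamentally limited.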
Further divergences follow from the spectrum of optimization schemes applied to different objectives, such as the alternating directions method of multipliers (ADMM) algorithm [3] or iteratively reweighted least squares (IRLS) [18].

With all of these methods, we may consider relaxing the strict equality constraint to the regularized form

  min_{Z,E} (1/λ) ‖Y − Z − E‖²_F + f₁(Z) + Σ_{i,j} f₂(e_ij),    (5)

where λ > 0 is a trade-off parameter. This has inspired a number of competing Bayesian formulations, which typically proceed as follows. Let

  p(Y|Z, E) ∝ exp[ −(1/2λ) ‖Y − Z − E‖²_F ]    (6)

define a likelihood function, where λ represents a non-negative variance parameter assumed to be known.² Hierarchical prior distributions are then assigned to Z and E to encourage minimal rank and strong sparsity, respectively. For the latter, the most common choice is the Gaussian scale-mixture (GSM) defined hierarchically by

  p(E|Γ) = Π_{i,j} p(e_ij|γ_ij),  p(e_ij|γ_ij) ∝ exp[ −e²_ij / (2γ_ij) ],  with hyperprior p(γ_ij⁻¹) ∝ γ_ij^{1−a} exp[ −b/γ_ij ],    (7)

where Γ is a matrix of non-negative variances and a, b ≥ 0 are fixed parameters. Note that when these values are small, the resulting distribution over each e_ij (obtained by marginalizing over the respective γ_ij) is heavy-tailed with a sharp peak at zero, the defining characteristics of sparse priors.

For the prior on Z, Bayesian methods have somewhat broader distinctions. In particular, a number of methods explicitly assume that Z = AB^⊤ and specify GSM priors on A and B [1, 9, 15, 30]. For example, variational Bayesian RPCA (VB-RPCA) [1] assumes p(A|θ) ∝ exp[ −tr(A diag[θ]⁻¹ A^⊤) ], where θ is a non-negative variance vector. An equivalent prior is used for p(B|θ) with a shared value of θ. This model also applies the prior p(θ) = Π_i p(θ_i) with p(θ_i) defined for consistency with p(γ_ij⁻¹) in (7). Low rank solutions are favored via the same mechanism as described above for sparsity, but only the sparse variance prior is applied to columns of A and B, effectively pruning them from the model if the associated θ_i is small. Given the above, the joint distribution is

  p(Y, A, B, E, Γ, θ) = p(Y|A, B, E) p(E|Γ) p(A|θ) p(B|θ) p(Γ) p(θ).    (8)

Full Bayesian inference with this is intractable, hence a common variational Bayesian (VB) mean-field approximation is applied [1, 2]. The basic idea is to obtain a tractable approximate factorial posterior distribution by solving

  min_{q(Φ)} KL[ q(Φ) ‖ p(A, B, E, Γ, θ|Y) ],    (9)

where q(Φ) ≜ q(A)q(B)q(E)q(Γ)q(θ), each q represents an arbitrary probability distribution, and KL[·‖·] denotes the Kullback-Leibler divergence between two distributions. This can be accomplished via coordinate descent minimization over each respective q distribution while holding the others fixed. Final estimates of Z and E are obtained by the means of q(A), q(B), and q(E) upon convergence. A related hierarchical model is used in [9, 30], but MCMC sampling techniques are used for full Bayesian inference RPCA (FB-RPCA) at the expense of considerable computational complexity and multiple tuning parameters.

An alternative empirical Bayesian algorithm (EB-RPCA) is described in [31].
In addition to the likelihood function (6) and prior from (7), this method assumes a direct Gaussian prior on Z given by

  p(Z|Ψ) ∝ exp[ −½ tr(Z^⊤ Ψ⁻¹ Z) ],    (10)

where Ψ is a symmetric and positive definite matrix.³ Inference is accomplished via an empirical Bayesian approach [20]. The basic idea is to marginalize out the unknown Z and E and solve

  max_{Ψ,Γ} ∫∫ p(Y|Z, E) p(Z|Ψ) p(E|Γ) dZ dE    (11)

using an EM-like algorithm. Once we have an optimal {Ψ*, Γ*}, we then compute the posterior mean of p(Z, E|Y, Ψ*, Γ*), which is available in closed-form.

Finally, a recent class of methods has been derived around the concept of approximate message passing, AMP-RPCA [26], which applies Gaussian priors to the factors A and B and infers posterior estimates by loopy belief propagation [21]. In our experiments (see [23]) we found AMP-RPCA to be quite sensitive to data deviating from these distributions.

3 A New Pseudo-Bayesian Algorithm

As it turns out, it is quite difficult to derive a fully Bayesian model, or some tight variational/empirical approximation, that leads to an efficient algorithm capable of consistently outperforming the original convex PCP, at least in the absence of additional, exploitable prior knowledge.
It is here that we adopt a pseudo-Bayesian approach, by which we mean that a Bayesian-inspired cost function will be altered using manipulations that, although not consistent with any original Bayesian model, nonetheless produce desirable attributes relevant to blindly solving (1). In some sense however, we view this as a strength, because the final model analysis presented later in Section 4 does not rely on any presumed validity of the underlying prior assumptions, but rather on explicit properties of the objective that emerges, including all assumptions and approximations involved.

²Actually many methods attempt to learn this parameter from data, but we avoid this consideration for simplicity. As well, for subtle reasons such learning is sometimes not even identifiable in the strict statistical sense.
³Note that in [31] this method is motivated from an entirely different variational perspective anchored in convex analysis; however, the cost function that ultimately emerges is equivalent to what follows with these priors.

Basic Model: We begin with the same likelihood function from (6), noting that in the limit as λ → 0 this will enforce the constraint set from (1). We also adopt the same prior on E given by (7) above and used in [1] and [31], but we need not assume any additional hyperprior on Γ.
In contrast, for the prior on Z our method diverges, and we define the Gaussian

  p(Z|Ψ_r, Ψ_c) ∝ exp[ −½ z⃗^⊤ (Ψ_r ⊗ I + I ⊗ Ψ_c)⁻¹ z⃗ ],    (12)

where z⃗ ≜ vec[Z] is the column-wise vectorization of Z, ⊗ denotes the Kronecker product, and Ψ_c ∈ R^{n×n} and Ψ_r ∈ R^{m×m} are positive semi-definite, symmetric matrices.⁴ Here Ψ_c can be viewed as applying a column-wise covariance factor, and Ψ_r a row-wise one. Note that if Ψ_r = 0, then this prior collapses to (10); however, by including Ψ_r we can retain symmetry in our model, or invariance to inference using either Y or Y^⊤. Related priors can also be used to improve the performance of affine rank minimization problems [34].

We apply the empirical Bayesian procedure from (11); the resulting convolution-of-Gaussians integral [2] can be computed in closed-form. After applying a −2 log[·] transformation, this is equivalent to minimizing

  L(Ψ_r, Ψ_c, Γ) = y⃗^⊤ Σ_y⁻¹ y⃗ + log|Σ_y|,  where  Σ_y ≜ Ψ_r ⊗ I + I ⊗ Ψ_c + Γ̄ + λI,    (13)

and Γ̄ ≜ diag[γ⃗]. Note that for even reasonably sized problems Σ_y ∈ R^{nm×nm} will be huge, and consequently we will require certain approximations to produce affordable update rules. Fortunately this can be accomplished while simultaneously retaining a principled objective function capable of outperforming existing methods.

Pseudo-Bayesian Objective: We first modify (13) to give

  L(Ψ_r, Ψ_c, Γ) = y⃗^⊤ Σ_y⁻¹ y⃗ + Σ_i log|Ψ_r + ½ Γ_i· + (λ/2) I| + Σ_j log|Ψ_c + ½ Γ_·j + (λ/2) I|,    (14)

where Γ_·j ≜ diag[γ_·j] and γ_·j represents the j-th column of Γ. Similarly we define Γ_i· ≜ diag[γ_i·] with γ_i· the i-th row of Γ. This new cost is nothing more than (13) but with the log|·| term split in half producing a lower bound by Jensen's inequality; the Kronecker product can naturally be dissolved under these conditions. Additionally, (14) represents a departure from our original Bayesian model in that there is no longer any direct empirical Bayesian or VB formulation that would lead to (14). Note that although this modification cannot be justified on strictly probabilistic terms, we will see shortly that it nonetheless still represents a viable cost function in the abstract sense, and lends itself to increased computational efficiency. The latter is an immediate effect of the drastically reduced dimensionality of the matrices inside the determinant. Henceforth (14) will represent the cost function that we seek to minimize; relevant properties will be handled in Section 4. We emphasize that all subsequent analysis is based directly upon (14), and therefore already accounts for the approximation step in advancing from (13). This is unlike other Bayesian model justifications that rely on the legitimacy of the original full model, and yet then adopt various approximations that may completely change the problem.

Update Rules: Common to many empirical Bayesian and VB approaches, our basic optimization strategy involves iteratively optimizing upper bounds on (14) in the spirit of majorization-minimization [12].
At a high level, our goal will be to apply bounds which separate Ψ_c, Ψ_r, and Γ into terms of the general form log|X| + tr[AX⁻¹], the reason being that this expression has a simple global minimum over X given by X = A. Therefore the strategy will be to update the bound (parameterized by some matrix A), and then update the parameters of interest X.

Using standard conjugate duality relationships and variational bounding techniques [14, Chapter 4], it follows after some linear algebra that

  y⃗^⊤ Σ_y⁻¹ y⃗ ≤ (1/λ) ‖Y − Z − E‖²_F + z⃗^⊤ (Ψ_r ⊗ I + I ⊗ Ψ_c)⁻¹ z⃗ + Σ_{i,j} e²_ij / γ_ij    (15)

for all Z and E. For fixed values of Ψ_r, Ψ_c, and Γ we optimize this quadratic bound to obtain revised estimates for Z and E, noting that exact equality in (15) is possible via the closed-form solution

  z⃗ = (Ψ_r ⊗ I + I ⊗ Ψ_c) Σ_y⁻¹ y⃗,   e⃗ = Γ̄ Σ_y⁻¹ y⃗.    (16)

⁴Technically the Kronecker sum Ψ_r ⊗ I + I ⊗ Ψ_c must be positive definite for the inverse in (12) to be defined. However, we can accommodate the semi-definite case using the following convention. Without loss of generality assume that Ψ_r ⊗ I + I ⊗ Ψ_c = RR^⊤ for some matrix R. We then qualify that p(Z|Ψ_r, Ψ_c) = 0 if z⃗ ∉ span[R], and p(Z|Ψ_r, Ψ_c) ∝ exp[ −½ z⃗^⊤ (R^⊤)† R† z⃗ ] otherwise.

In large practical problems, (16) may become expensive to compute directly because of the high dimensional inverse involved.
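As a small numerical sanity check on (16) (our own illustration, not from the paper): since z⃗ + e⃗ = (Σ_y − λI) Σ_y⁻¹ y⃗ = y⃗ − λ Σ_y⁻¹ y⃗, the closed-form estimates satisfy the constraint Y ≈ Z + E increasingly tightly as λ → 0, matching the Basic Model discussion above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 4, 3, 1e-6

def rand_psd(k):
    # Random symmetric positive semi-definite matrix
    R = rng.standard_normal((k, k))
    return R @ R.T

Psi_c, Psi_r = rand_psd(n), rand_psd(m)
Gamma = rng.random((n, m))                 # elementwise outlier variances
y = rng.standard_normal(n * m)             # vec[Y], column-wise

# Kronecker sum Psi_r (x) I + I (x) Psi_c, then Sigma_y from (13)
K = np.kron(Psi_r, np.eye(n)) + np.kron(np.eye(m), Psi_c)
Sigma_y = K + np.diag(Gamma.flatten('F')) + lam * np.eye(n * m)

w = np.linalg.solve(Sigma_y, y)
z = K @ w                                  # vec[Z] from (16)
e = Gamma.flatten('F') * w                 # vec[E] from (16)

# z + e = y - lam * Sigma_y^{-1} y, so Y ~= Z + E as lam -> 0
print(np.abs(z + e - y).max())
```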
However, we may still find the optimum efficiently by an ADMM procedure described in [23].

We can also further bound the right-hand side of (15) using Jensen's inequality as

  z⃗^⊤ (Ψ_r ⊗ I + I ⊗ Ψ_c)⁻¹ z⃗ ≤ tr[ Z Ψ_r⁻¹ Z^⊤ + Z Z^⊤ Ψ_c⁻¹ ].    (17)

Along with (15) this implies that for fixed values of Z and E we can obtain an upper bound which only depends on Ψ_r, Ψ_c, and Γ in a decoupled or separable fashion.

For the log|·| terms in (14), we also derive convenient upper bounds using determinant identities and a first-order approximation, the goal being to find a representation that plays well with the previous decoupled bound for optimization purposes. Again using conjugate duality relationships, we can form the bound

  log|Ψ_c + ½ Γ_·j + (λ/2) I| ≡ log|Ψ_c| + log|Γ_·j| + log|W(Ψ_c, Γ_·j)|
                              ≤ log|Ψ_c| + log|Γ_·j| + tr[ (∇^j)^⊤ Ψ_c⁻¹ ] + (∇^c_j)^⊤ γ_·j⁻¹ + C,    (18)

where the inverse γ_·j⁻¹ is understood to apply element-wise, and W(Ψ_c, Γ_·j) is defined as

  W(Ψ_c, Γ_·j) ≜ (1/2λ) [ 2I, √2 I ; √2 I, I ] + [ Ψ_c⁻¹, 0 ; 0, Γ_·j⁻¹ ].    (19)

Additionally, C is a standard constant, which accompanies the first-order approximation to guarantee that the upper bound is tangent to the underlying cost function; however, its exact value is irrelevant for optimization purposes.
Finally, the requisite gradients are defined as

  ∇^c_j ≜ ∂W(Ψ_c, Γ_·j)/∂Γ_·j⁻¹ = diag[ Γ_·j − ½ Γ_·j (S^j_c)⁻¹ Γ_·j ],   ∇^j ≜ ∂W(Ψ_c, Γ_·j)/∂Ψ_c⁻¹ = Ψ_c − Ψ_c (S^j_c)⁻¹ Ψ_c,    (20)

where S^j_c ≜ Ψ_c + ½ Γ_·j + (λ/2) I. Analogous bounds can be derived for the log|Ψ_r + ½ Γ_i· + (λ/2) I| terms in (14).

These bounds are principally useful because all Ψ_c, Ψ_r, Γ_·j, and Γ_i· factors have been decoupled. Consequently, with Z, E, and all the relevant gradients fixed, we can separately combine Ψ_c-, Ψ_r-, and Γ-dependent terms from the bounds and then optimize independently. For example, combining terms from (17) and (18) involving Ψ_c for all j, this requires solving

  min_{Ψ_c} m log|Ψ_c| + tr[ ( Σ_j (∇^j)^⊤ + Z Z^⊤ ) Ψ_c⁻¹ ].    (21)

Analogous cost functions emerge for Ψ_r and Γ. All three problems have closed-form optimal solutions given by

  Ψ_c = (1/m) [ Σ_j (∇^j)^⊤ + Z Z^⊤ ],   Ψ_r = (1/n) [ Σ_i (∇^i)^⊤ + Z^⊤ Z ],   γ⃗ = z⃗² + u⃗_c + u⃗_r,    (22)

where the squaring operator is applied element-wise to z⃗, u⃗_c ≜ [∇^c_1; . . . ; ∇^c_m], and analogously for u⃗_r. One interesting aspect of (22) is that it forces Ψ_c ⪰ (1/m) Z Z^⊤ and Ψ_r ⪰ (1/n) Z^⊤ Z, thus maintaining a balancing symmetry and preventing one or the other from possibly converging towards zero. This is another desirable consequence of using the bound in (17).

To finalize then, the proposed pipeline, which we henceforth refer to as pseudo-Bayesian RPCA (PB-RPCA), involves the steps shown under Algorithm 1 in [23]. These can be implemented in such a way that the complexity is linear in max(n, m) and cubic in min(n, m).

4 Analysis of the PB-RPCA Objective

On the surface it may appear that the PB-RPCA objective (14) represents a rather circuitous route to solving (1), with no obvious advantage over the convex PCP relaxation from (3), or any other approach for that matter. However quite surprisingly, we prove in [23] that by simply replacing the log|·| matrix operators in (14) with tr[·], the resulting function collapses exactly to convex PCP. So what at first appear as distant cousins are actually quite closely related objectives. Of course, work remains to explain why log|·|, and therefore the PB-RPCA objective by association, might display any particular advantage. This leads us to considerations of relative concavity, non-separability, and symmetry as described below in turn.

Relative Concavity: Although both log|·| and tr[·] are concave non-decreasing functions of the singular values of symmetric positive definite matrices, and hence favor both sparsity of Γ and minimal rank of Ψ_r or Ψ_c, the former is far more strongly concave (in the sense of relative concavity described in [25]). In this respect we may expect that log|·| is less likely to over-shrink large values [11].
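The over-shrinkage distinction can already be seen in the scalar case. The following sketch (our own illustration, not from the paper) numerically compares the proximal operators of a strongly concave log penalty and the convex ℓ₁ penalty; the log penalty leaves large inputs nearly untouched while ℓ₁ shrinks them by a constant amount:

```python
import numpy as np

def prox(y, penalty, lam=1.0, grid=None):
    """Numerical proximal operator: argmin_x 0.5*(x - y)^2 + lam*penalty(x),
    found by brute force over a fine grid."""
    if grid is None:
        grid = np.linspace(-2 * abs(y) - 5, 2 * abs(y) + 5, 200001)
    return grid[np.argmin(0.5 * (grid - y)**2 + lam * penalty(grid))]

eps = 0.01
log_pen = lambda x: np.log(x**2 + eps)    # strongly concave surrogate
l1_pen = lambda x: np.abs(x)              # convex (trace / ell_1) surrogate

for y in [5.0, 20.0]:
    # The l1 prox shrinks y by exactly lam; the log prox barely moves it
    print(y, prox(y, log_pen), prox(y, l1_pen))
```

The same qualitative behavior carries over to the singular values of Ψ_c and Ψ_r under the log|·| penalty versus the trace penalty that yields PCP.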
Moreover, applying a concave non-decreasing penalty to elements of \u0393 favors a sparse estimate,\nwhich in turn transfers this sparsity directly to E by virtue of the left multiplication by \u00af\u0393 in (16).\nLikewise for the singular values of \u03a8c and \u03a8r.\nNon-Separability: While potentially desirable, the relative concavity distinction described above\nis certainly not suf\ufb01cient to motivate why PB-RPCA might represent an effective RPCA approach,\nespecially given the breadth of non-convex alternatives already in the literature. However, a much\nstronger argument can be made by exposing a fundamental limitation of all RPCA methods (convex\nor otherwise) that rely on minimization of generic penalties in the separable or additive form of (4).\nFor this purpose, let \u2126 denote a set of indices that correspond with zero-valued elements in E, such\nthat E\u2126 = 0 while all other elements of E are arbitrary nonzeros (it can equally be viewed as the\ncomplement of the support of E). In the case of MC, \u2126 would also represent the set of observed\nmatrix elements. We then have the following:\nProposition 1. To guarantee that (4) has the same global optimum as (1) for all Y where a unique\nsolution exists, it follows that f1 and f2 must be non-convex and no feasible descent direction can\never remove an index from or decrease the cardinality of \u2126.\nIn [31] it has been shown that, under similar conditions, the gradient in a feasible direction at any\nzero-valued element of E must be in\ufb01nite to guarantee a matching global optimum, from which\nthis result naturally follows. The rami\ufb01cations of this proposition are profound if we ever wish to\nproduce a version of RPCA that can mimic the desirable behavior of much simpler MC problems\nwith known support, or at least radically improve upon PCP with unknown outlier support. 
In words, Proposition 1 implies that under the stated global-optimality preserving conditions, if any element of E converges to zero during optimization with an arbitrary descent algorithm, it will remain anchored at zero until the end. Consequently, if the algorithm prematurely errs in setting the wrong element to zero, meaning the wrong support pattern has been inferred at any time during an optimization trajectory, it is impossible to ever recover, a problem naturally side-stepped by MC where the support is effectively known. Therefore, the adoption of separable penalty functions can be quite constraining and they are unlikely to produce sufficiently reliable support recovery.

But how does this relate to PB-RPCA? Our algorithm maintains a decidedly non-separable penalty function on Ψ_c, Ψ_r, and Γ, which directly transfers to an implicit, non-separable regularizer over Z and E when viewed through the dual-space framework from [32].⁵ By this we mean a penalty f(Z, E) ≠ f₁(Z) + f₂(E) for any functions f₁ and f₂, and with Z fixed, we have f(Z, E) ≠ Σ_{i,j} f_ij(e_ij) for any set of functions {f_ij}.

We now examine the consequences. Let Ω now denote a set of indices that correspond with zero-valued elements in Γ, which translates into an equivalent support set for E via (16). This then leads to quantifiable benefits:

Proposition 2. The following properties hold w.r.t. the PB-RPCA objective (assuming n = m for simplicity):
• Assume that a unique global solution to (1) exists such that either rank[Z] + max_j ‖e_·j‖₀ < n or rank[Z] + max_i ‖e_i·‖₀ < n. Additionally, let {Ψ_c*, Ψ_r*, Γ*} denote a globally minimizing solution to (14) and {Z*, E*} the corresponding values of Z and E computed using (16).
Then in the limit λ → 0, Z* and E* globally minimize (1).
• Assume that Y has no entries identically equal to zero.⁶ Then for any arbitrary Ω, there will always exist a range of Ψ_c and Ψ_r values such that for any Γ consistent with Ω we are not at a locally minimizing solution to (14), meaning there exists a feasible descent direction whereby elements of Γ can escape from zero.

⁵Even though this penalty function is not available in closed-form, non-separability is nonetheless enforced via the linkage between Ψ_c, Ψ_r, and Γ in the log|·| operator.

Figure 1: Phase transition over outlier (y-axis) and rank (x-axis) ratio variations, comparing (a) CVX-PCP, (b) IRLS-RPCA, (c) VB-RPCA, (d) PB-RPCA w/o sym., (e) CVX-MC, (f) TNN-RPCA, (g) FB-RPCA, and (h) PB-RPCA (Proposed). Here CVX-MC and TNN-RPCA maintain advantages of exactly known outlier support pattern and true rank respectively.

A couple of important comments are worth stating regarding this result. First, the rank and row/column-sparsity requirements are extremely mild. In fact, any minimum of (1) will be such that rank[Z] + max_j ‖e_·j‖₀ ≤ n and rank[Z] + max_i ‖e_i·‖₀ ≤ m, regardless of Y. Secondly, unlike any separable penalty function (4) that retains the same global optimum as (1), Proposition 2 implies that (14) need not be locally minimized by every possible support pattern for outlier locations. Consequently, premature convergence to suboptimal supports need not disrupt trajectories towards the global solution to the extent that (4) may be obstructed. Moreover, beyond algorithms that explicitly adopt separable penalties (the vast majority), some existing Bayesian approaches may implicitly default to (4).
For example, as shown in [23], the mean-field factorizations adopted by VB-RPCA actually allow the underlying free energy objective to be expressed as (4) for some f1 and f2.
Symmetry: Without the introduction of symmetry via our pseudo-Bayesian proposal (meaning either Ψc or Ψr is forced to zero), PB-RPCA collapses to something like EB-RPCA, which depends heavily on whether Y or Y⊤ is provided as input and penalizes column- and row-spaces asymmetrically. In this regime it can be shown that the analogous requirement needed to replicate Proposition 2 becomes more stringent; namely, we must assume the asymmetric condition rank[Z] + max_j ‖e_·j‖_0 < n. Thus the symmetric cost of PB-RPCA allows us to relax this column-wise restriction provided a row-wise alternative holds (and vice versa), allowing the PB-RPCA objective (14) to match the global optimum of our original problem from (1) under broader conditions.
In closing this section, we reiterate that all of our analysis and conclusions are based on (14), after the stated approximations. Therefore we need not rely on either the plausibility of the original Bayesian starting point from Section 3 or the tightness of the subsequent approximations for justification; rather, (14) can be viewed as a principled stand-alone objective for RPCA regardless of its origins. Moreover, it represents the first approach satisfying the relative concavity, non-separability, and symmetry properties described above, which can loosely be viewed as necessary, but not sufficient, design criteria for an optimal RPCA objective.

5 Experiments
To examine significant factors that influence the ability to solve (1), we first evaluate the relative performance of PB-RPCA when estimating random simulated subspaces from corrupted measurements, the standard benchmark. Later we present subspace clustering results for motion segmentation as a practical application.
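As a concrete reference for the synthetic evaluation protocol used throughout this section, the data generation and success criterion can be sketched as follows. This is a minimal NumPy sketch with illustrative function names; the outlier density ρ, the U[−20, 20] outlier magnitudes, the iid N(0, 1) low-rank factors, and the NRMSE < 0.001 success threshold all follow the phase-transition protocol described with Figure 1.

```python
import numpy as np

def generate_rpca_benchmark(n=200, m=200, rank=10, rho=0.1, seed=0):
    """Generate Y = Z_gt + E_gt: Z_gt = A B^T with iid N(0,1) factors,
    and E_gt nonzero with probability rho, magnitudes iid U[-20, 20]."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, rank))
    B = rng.standard_normal((m, rank))
    Z_gt = A @ B.T
    support = rng.random((n, m)) < rho          # random outlier locations
    E_gt = support * rng.uniform(-20.0, 20.0, size=(n, m))
    return Z_gt + E_gt, Z_gt, E_gt

def recovery_success(Z_hat, Z_gt, tol=1e-3):
    """Success criterion: NRMSE = ||Z_hat - Z_gt||_F / ||Z_gt||_F < tol."""
    nrmse = np.linalg.norm(Z_hat - Z_gt) / np.linalg.norm(Z_gt)
    return nrmse < tol
```

A phase-transition plot is then built by sweeping (rho, rank) over a grid, running each RPCA algorithm on several independent draws, and recording the fraction of successful trials per cell.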
Additional experiments and a photometric stereo example are provided in [23].
Phase Transition Graphs: We compare our method against existing RPCA methods: PCP [16], TNN [24], IRLS [18], VB [1], and FB [9]. We also include results using PB-RPCA but with symmetry removed (which then defaults to something like EB-RPCA), allowing us to isolate the importance of this factor; this variant is called "PB-RPCA w/o sym.". For competing algorithms, we set parameters based on the values suggested by the original authors, with the exception of IRLS. Detailed settings and parameters can be found in [23].
6This assumption can be relaxed with some additional effort, but we avoid such considerations here for clarity of presentation.

Figure 2: Hard case comparison.

Figure 3: Motion segmentation errors on Hopkins155. Values are percentages (mean / median).

ρ     SSC           Robust SSC   PCP+SSC     PB+SSC (Ours)
      Without sub-sampling (large number of measurements)
0.1   19.0 / 14.9   5.3 / 0.3    3.0 / 0.0   2.4 / 0.0
0.2   28.2 / 28.3   6.4 / 0.4    3.0 / 0.0   2.4 / 0.0
0.3   33.2 / 34.7   7.2 / 0.5    3.6 / 0.2   2.8 / 0.0
0.4   36.5 / 39.0   8.5 / 0.6    4.7 / 0.2   3.1 / 0.0
      With sub-sampling (small number of measurements)
0.1   19.5 / 17.2   4.0 / 0.0    2.9 / 0.0   2.8 / 0.0
0.2   33.0 / 33.3   5.3 / 0.0    3.7 / 0.0   3.6 / 0.0
0.3   39.3 / 41.1   5.7 / 1.7    5.0 / 0.7   3.9 / 0.0
0.4   42.2 / 43.5   6.4 / 2.1    9.8 / 5.1   3.7 / 0.0

We construct phase transition plots as in [4, 9] that evaluate the recovery success of every pairing of outlier ratio and rank using data Y = Z_GT + E_GT, where Y ∈ R^{m×n} and m = n = 200. The ground truth outlier matrix E_GT is generated by selecting non-zero entries uniformly with probability ρ ∈ [0, 1], with magnitudes sampled iid from the uniform distribution U[−20, 20]. We generate the ground truth low-rank matrix as Z_GT = AB⊤, where the entries of A ∈ R^{n×r} and B ∈ R^{m×r} are drawn iid from N(0, 1).
Figure 1 shows comparisons among competing methods, as well as convex nuclear norm based matrix completion (CVX-MC) [5], the latter representing a far easier estimation task given that missing entry locations (analogous to corruptions) occur in known positions. The color of each cell encodes the percentage of successful trials (out of 10 total), where, following [4, 9], a trial is classified as a success if the normalized root-mean-squared error NRMSE = ‖Ẑ − Z_GT‖_F / ‖Z_GT‖_F of the recovered Ẑ is less than 0.001.
Notably, PB-RPCA displays a much broader recoverability region. This improvement is even maintained over TNN-RPCA and MC, which require prior knowledge of the true rank and the exact outlier locations, respectively. These forms of prior knowledge offer a substantial advantage, although in practical situations they are usually unavailable. PB-RPCA also outperforms PB-RPCA w/o sym.
(its closest relative) by a wide margin, suggesting that symmetry plays an important role. The poor performance of FB-RPCA is explained in [23].
Hard Case Comparison: Recovery of Gaussian iid low-rank components (the typical benchmark recovery problem in the literature) is somewhat ideal for existing algorithms like PCP, because the singular vectors of Z_GT will not resemble sparse unit vectors that could be mistaken for sparse components. However, a simple test reveals just how brittle PCP is to deviations from this theoretically optimal regime. We generate a rank-one Z_GT = σ a³(b³)⊤, where the cube operation is applied element-wise, a and b are vectors drawn uniformly from the unit sphere, and σ scales Z_GT to unit variance. E_GT has nonzero elements drawn iid from U[−1, 1]. Figure 2 shows the recovery results as the outlier ratio is increased. The hard case refers to the data just described, while the easy case follows the model used to make the phase transition plots. While PB-RPCA is quite stable, PCP completely fails on the hard data.
Outlier Removal for Motion Segmentation: Under an affine camera model, the stacked matrix consisting of the feature point trajectories of k rigidly moving objects forms a union of k affine subspaces of at most rank 4k [29]. But in practice, mismatches often occur due to occlusions or tracking algorithm limitations, and these introduce significant outliers into the feature motions such that the corresponding trajectory matrix may be at or near full rank. We adopt an experimental paradigm from [17] designed to test motion segmentation estimation in the presence of outliers. To mimic mismatches while retaining access to ground truth, we randomly corrupt the entries of the trajectory matrix formed from Hopkins155 data [28].
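For reference, the hard-case construction described above (a spiky rank-one component built from element-wise cubed unit-sphere vectors, plus U[−1, 1] outliers) can be sketched as follows. This is a minimal NumPy sketch with illustrative function names; "unit variance" is implemented here as scaling by the empirical standard deviation of Z_GT.

```python
import numpy as np

def generate_hard_case(n=200, m=200, rho=0.1, seed=0):
    """Hard-case data: rank-one Z_gt = sigma * a^3 (b^3)^T with a, b drawn
    uniformly from unit spheres and cubes applied element-wise (yielding
    spiky singular vectors that can be mistaken for sparse corruptions),
    sigma scaling Z_gt to unit variance, and E_gt iid U[-1, 1] on a random
    support of density rho."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n); a /= np.linalg.norm(a)   # uniform on sphere
    b = rng.standard_normal(m); b /= np.linalg.norm(b)
    Z_gt = np.outer(a**3, b**3)
    Z_gt = Z_gt / Z_gt.std()                             # scale to unit variance
    support = rng.random((n, m)) < rho
    E_gt = support * rng.uniform(-1.0, 1.0, size=(n, m))
    return Z_gt + E_gt, Z_gt, E_gt
```

Because the cubed vectors concentrate their energy on a few coordinates, Z_GT itself looks "sparse-like", which is exactly the ambiguity that defeats PCP in this regime.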
Specifically, following [17], we add noise drawn from N(0, 0.1κ) to randomly sampled points with outlier ratio ρ ∈ [0, 1], where κ is the maximum absolute value of the data. We may then attempt to recover a clean version from the corrupted measurements using RPCA as a preprocessing step; motion segmentation can then be applied using standard subspace clustering [29]. We use the SSC and robust SSC algorithms [10] as baselines, and compare with RPCA preprocessing computed via PCP (as suggested in [10]) and via PB-RPCA, each followed by SSC. Additionally, we sub-sample the trajectory matrix, which increases problem difficulty by reducing the number of available measurements. Segmentation accuracy is reported in Fig. 3, where we observe that PB shows the best performance across different outlier ratios, and the performance gap widens when measurements are scarce.

6 Conclusion
Since the introduction of convex RPCA algorithms, there has not been a significant algorithmic breakthrough in terms of dramatically enhancing the regime where success is possible, at least in the absence of any prior information (beyond the generic low-rank and sparsity assumptions). The likely explanation is that essentially all of these approaches solve either a problem in the form of (4), an asymmetric problem in the form of (11), or else require strong a priori knowledge. We provide a novel integration of three important design criteria, concavity, non-separability, and symmetry, that leads to state-of-the-art results by a wide margin without tuning parameters or prior knowledge.

References
[1] S. D. Babacan, M. Luessi, R. Molina, and A. K. Katsaggelos. Sparse Bayesian methods for low-rank matrix estimation. IEEE Trans. Signal Process., 2012.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011.
[4] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. of the ACM, 2011.
[5] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2009.
[6] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM J. on Optim., 2011.
[7] R. Chartrand. Nonconvex splitting for regularized low-rank + sparse decomposition. IEEE Trans. Signal Process., 2012.
[8] Y.-L. Chen and C.-T. Hsu. A generalized low-rank appearance model for spatio-temporally correlated rain streaks. In IEEE Int. Conf. Comput. Vis., 2013.
[9] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. IEEE Trans. Image Process., 2011.
[10] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. and Mach. Intell., 2013.
[11] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc., 2001.
[12] D. R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 2004.
[13] H. Ji, C. Liu, Z. Shen, and Y. Xu. Robust video denoising using low rank matrix completion. In IEEE Conf. Comput. Vis. and Pattern Recognit., 2010.
[14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Mach. Learn., 1999.
[15] B. Lakshminarayanan, G. Bouchard, and C. Archambeau. Robust Bayesian matrix factorisation. In AISTATS, 2011.
[16] Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv:1009.5055, 2010.
[17] G. Liu and S. Yan. Latent low-rank representation for subspace segmentation and feature extraction. In IEEE Int. Conf. Comput. Vis., 2011.
[18] C. Lu, Z. Lin, and S. Yan. Smoothed low rank and sparse matrix recovery by iteratively reweighted least squares minimization. IEEE Trans. Image Process., 2015.
[19] K. Mohan and M. Fazel. Iterative reweighted algorithms for matrix rank minimization. J. Mach. Learn. Res., 2012.
[20] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[21] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In UAI, 1999.
[22] T.-H. Oh, J.-Y. Lee, Y.-W. Tai, and I. S. Kweon. Robust high dynamic range imaging by rank minimization. IEEE Trans. Pattern Anal. and Mach. Intell., 2015.
[23] T.-H. Oh, Y. Matsushita, I. S. Kweon, and D. Wipf. Pseudo-Bayesian robust PCA: Algorithms and analyses. arXiv:1512.02188, 2015.
[24] T.-H. Oh, Y.-W. Tai, J.-C. Bazin, H. Kim, and I. S. Kweon. Partial sum minimization of singular values in robust PCA: Algorithm and applications. IEEE Trans. Pattern Anal. and Mach. Intell., 2016.
[25] J. A. Palmer. Relative convexity. ECE Dept., UCSD, Tech. Rep., 2003.
[26] J. T. Parker, P. Schniter, and V. Cevher. Bilinear generalized approximate message passing. arXiv:1310.2632, 2013.
[27] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE Trans. Pattern Anal. and Mach. Intell., 2012.
[28] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In IEEE Conf. Comput. Vis. and Pattern Recognit., 2007.
[29] R. Vidal. Subspace clustering. IEEE Signal Process. Mag., 2011.
[30] N. Wang and D.-Y. Yeung. Bayesian robust matrix factorization for image and video processing. In IEEE Int. Conf. Comput. Vis., 2013.
[31] D. Wipf. Non-convex rank minimization via an empirical Bayesian approach. In UAI, 2012.
[32] D. Wipf, B. D. Rao, and S. Nagarajan. Latent variable Bayesian models for promoting sparsity. IEEE Trans. on Information Theory, 2011.
[33] L. Wu, A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, and Y. Ma. Robust photometric stereo via low-rank matrix completion and recovery. In Asian Conf. Comput. Vis., 2010.
[34] B. Xin and D. Wipf. Pushing the limits of affine rank minimization by adapting probabilistic PCA. In Int. Conf. Mach. Learn., 2015.