{"title": "Perfect Dimensionality Recovery by Variational Bayesian PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 971, "page_last": 979, "abstract": "The variational Bayesian (VB) approach is one of the best tractable approximations to the Bayesian estimation, and it was demonstrated to perform well in many applications. However, its good performance was not fully understood theoretically. For example, VB sometimes produces a sparse solution, which is regarded as a practical advantage of VB, but such sparsity is hardly observed in the rigorous Bayesian estimation. In this paper, we focus on probabilistic PCA and give more theoretical insight into the empirical success of VB. More specifically, for the situation where the noise variance is unknown, we derive a sufficient condition for perfect recovery of the true PCA dimensionality in the large-scale limit when the size of an observed matrix goes to infinity. In our analysis, we obtain bounds for a noise variance estimator and simple closed-form solutions for other parameters, which themselves are actually very useful for better implementation of VB-PCA.", "full_text": "Perfect Dimensionality Recovery\n\nby Variational Bayesian PCA\n\nShinichi Nakajima\nNikon Corporation\n\nTokyo, 140-8601, Japan\n\nnakajima.s@nikon.co.jp\n\nRyota Tomioka\n\nThe University of Tokyo\nTokyo 113-8685, Japan\n\ntomioka@mist.i.u-tokyo.ac.jp\n\nMasashi Sugiyama\n\nTokyo Institute of Technology\n\nTokyo 152-8552, Japan\n\nsugi@cs.titech.ac.jp\n\nS. Derin Babacan\n\nUniversity of Illinois at Urbana-Champaign\n\nUrbana, IL 61801, USA\n\ndbabacan@illinois.edu\n\nAbstract\n\nThe variational Bayesian (VB) approach is one of the best tractable approxima-\ntions to the Bayesian estimation, and it was demonstrated to perform well in many\napplications. However, its good performance was not fully understood theoreti-\ncally. 
For example, VB sometimes produces a sparse solution, which is regarded\nas a practical advantage of VB, but such sparsity is hardly observed in the rigorous\nBayesian estimation. In this paper, we focus on probabilistic PCA and give more\ntheoretical insight into the empirical success of VB. More speci\ufb01cally, for the sit-\nuation where the noise variance is unknown, we derive a suf\ufb01cient condition for\nperfect recovery of the true PCA dimensionality in the large-scale limit when the\nsize of an observed matrix goes to in\ufb01nity. In our analysis, we obtain bounds for\na noise variance estimator and simple closed-form solutions for other parameters,\nwhich themselves are actually very useful for better implementation of VB-PCA.\n\n1 Introduction\n\nVariational Bayesian (VB) approximation [1] was proposed as a computationally ef\ufb01cient alternative\nto rigorous Bayesian estimation. The key idea is to force the posterior to be factorized, so that the\nintegration\u2014a typical intractable operation in Bayesian methods\u2014can be analytically performed\nover each parameter with the other parameters \ufb01xed. VB has been successfully applied to many\napplications [4, 7, 20, 11].\nTypically, VB solves a non-convex optimization problem with an iterative algorithm [3], which\nmakes theoretical analysis dif\ufb01cult. An important exceptional case is the matrix factorization (MF)\nmodel [11, 6, 19] with no missing entry in the observed matrix. Recently, the global analytic solution\nof VBMF has been derived and theoretical properties such as the mechanism of sparsity induction\nhave been revealed [15, 16]. These works also posed thought-provoking relations between VB\nand rigorous Bayesian estimation: The VB posterior is actually quite different from the true Bayes\nposterior (compare the left and the middle graphs in Fig. 
1), and VB induces sparsity in its solution but such sparsity is hardly observed in rigorous Bayesian estimation (see the right graph in Fig. 1).1 These facts might undermine any justification of VB that is based solely on the fact that it is one of the best tractable approximations to the Bayesian estimation.\n\n1Also in mixture models, inappropriate model pruning by VB approximation was discussed [12].\n\n[Figure 1 omitted: contour plots of the Bayes posterior (MAP estimators (A, B) ≈ (±1, ±1)) and the VB posterior (VB estimator (A, B) = (0, 0)) for V = 1, and the behavior of the FB, MAP, VB, EFB, EMAP, and EVB estimators of U as functions of V.]\n\nFigure 1: Dissimilarities between VB and the rigorous Bayesian estimation. (Left and Center) the Bayes posterior and the VB posterior of a 1 × 1 MF model, V = BA + E, when V = 1 is observed (E is a Gaussian noise). VB approximates the Bayes posterior, which has two modes, by an origin-centered Gaussian, which induces sparsity. (Right) Behavior of estimators of U = BA^T, given the observation V. The VB estimator (the magenta solid curve) is zero when V ≤ 1, which means exact sparsity. On the other hand, FB (fully-Bayesian or rigorous Bayes; blue crosses) shows no sign of sparsity. Further discussion will be provided in Section 5.2. 
All graphs are quoted from [15].\n\nSince the probabilistic PCA [21, 18, 3] is an instance of MF, the global analytic solution derived\nin [16] for MF can be utilized for analyzing the probabilistic PCA. Indeed, automatic dimensional-\nity selection of VB-PCA, which is an important practical advantage of VB-PCA, was theoretically\ninvestigated in [17]. However, the noise variance, which is usually unknown in many realistic appli-\ncations of PCA, was treated as a given constant in that analysis.2 In this paper, we consider a more\npractical and challenging situation where the noise variance is unknown, and theoretically analyze\nVB-PCA.\nIt was reported that noise variance estimation fails in some Bayesian approximation methods, if the\nformal rank is set to be full [17]. With such methods, an additional model selection procedure is\nrequired for dimensionality selection [14, 5]. On the other hand, we theoretically show in this paper\nthat VB-PCA can estimate the noise variance accurately, and therefore automatic dimensionality se-\nlection works well. More speci\ufb01cally, we establish a suf\ufb01cient condition that VB-PCA can perfectly\nrecover the true dimensionality in the large-scale limit when the size of an observed matrix goes to\nin\ufb01nity. An interesting \ufb01nding is that, although the objective function minimized for noise variance\nestimation is multimodal in general, only a local search algorithm is required for perfect recovery.\nOur results are based on the random matrix theory [2, 5, 13, 22], which elucidates the distribution\nof singular values in the large-scale limit.\nIn the development of the above theoretical analysis, we obtain bounds for the noise variance esti-\nmator and simple closed-form solutions for other parameters. 
We also show that they can be nicely utilized for better implementation of VB-PCA.\n\n2 Formulation\n\nIn this section, we introduce the variational Bayesian matrix factorization (VBMF).\n\n2.1 Bayesian Matrix Factorization\n\nAssume that we have an observation matrix V ∈ R^{L×M}, which is the sum of a target matrix U ∈ R^{L×M} and a noise matrix E ∈ R^{L×M}:\n\nV = U + E.\n\nIn the matrix factorization model, the target matrix is assumed to be low rank, which can be expressed as the following factorizability:\n\nU = BA^T,\n\nwhere A ∈ R^{M×H} and B ∈ R^{L×H}, and ^T denotes the transpose of a matrix or vector. Thus, the rank of U is upper-bounded by H ≤ min(L, M).\n\n2If the noise variance is known, we can actually show that dimensionality selection by VB-PCA is outperformed by a naive strategy (see Section 3.3). This means that VB-PCA is not very useful in this setting.\n\nIn this paper, we consider the probabilistic matrix factorization (MF) model [19], where the observation noise E and the priors of A and B are assumed to be Gaussian:\n\np(V|A, B) ∝ exp( -(1/(2σ²)) ||V - BA^T||²_Fro ),  (1)\n\np(A) ∝ exp( -(1/2) tr(A C_A^{-1} A^T) ),  p(B) ∝ exp( -(1/2) tr(B C_B^{-1} B^T) ).  (2)\n\nHere, we denote by ||·||_Fro the Frobenius norm, and by tr(·) the trace of a matrix. We assume that L ≤ M. If L > M, we may simply re-define the transpose V^T as V so that L ≤ M holds.3 Thus, this does not impose any restriction. We assume that the prior covariance matrices C_A and C_B are diagonal and positive definite, i.e.,\n\nC_A = diag(c²_{a1}, ..., c²_{aH}),  C_B = diag(c²_{b1}, ..., c²_{bH})\n\nfor c_{ah}, c_{bh} > 0, h = 1, ..., H. Without loss of generality, we assume that the diagonal entries of the product C_A C_B are arranged in non-increasing order, i.e., c_{ah} c_{bh} ≥ c_{ah'} c_{bh'} 
for any pair h < h'.\n\nThroughout the paper, we denote a column vector of a matrix by a bold lowercase letter, and a row vector by a bold lowercase letter with a tilde, namely,\n\nA = (a_1, ..., a_H) = (ã_1, ..., ã_M)^T ∈ R^{M×H},  B = (b_1, ..., b_H) = (b̃_1, ..., b̃_L)^T ∈ R^{L×H}.\n\n2.2 Variational Bayesian Approximation\n\nThe Bayes posterior is given by\n\np(A, B|V) = p(V|A, B) p(A) p(B) / p(V),  (3)\n\nwhere p(V) = <p(V|A, B)>_{p(A)p(B)}. Here, <·>_p denotes the expectation over the distribution p. Since this expectation is intractable, we need to approximate the Bayes posterior.\n\nLet r(A, B), or r for short, be a trial distribution. The following functional with respect to r is called the free energy:\n\nF(r) = <log( r(A, B) / (p(V|A, B) p(A) p(B)) )>_{r(A,B)} = <log( r(A, B) / p(A, B|V) )>_{r(A,B)} - log p(V).  (4)\n\nIn the last equation, the first term is the Kullback-Leibler (KL) divergence from the trial distribution to the Bayes posterior, and the second term is a constant. Therefore, minimizing the free energy (4) amounts to finding a distribution closest to the Bayes posterior in the sense of the KL divergence. A general approach to Bayesian approximate inference is to find the minimizer of the free energy (4) with respect to r in some restricted function space.\n\nIn the VB approximation, independence between the entangled parameter matrices A and B is assumed:\n\nr = r(A) r(B).  (5)\n\nUnder this constraint, an iterative algorithm for minimizing the free energy (4) was derived [3, 11]. Let r̂ be such a minimizer, and define the MF solution as the mean of the target matrix U under r̂, given in Eq.(6) below. In the context of PCA, where V is a data matrix, the solution is given as the subspace spanned by Û.\n\nThe MF model has hyperparameters (C_A, C_B) in the priors (2). By manually choosing them, we can control regularization and sparsity of the solution (e.g., the PCA dimensions). 
A popular way to set the hyperparameters in the Bayesian framework is again based on the minimization of the free energy (4). The MF solution is\n\nÛ = <BA^T>_{r̂(A,B)},  (6)\n\nand the hyperparameters are chosen as\n\n(Ĉ_A, Ĉ_B) = argmin_{C_A, C_B} ( min_r F(r; C_A, C_B | V) ).\n\nWe refer to this method as the empirical VB (EVB) method. When the noise variance σ² is unknown, it can also be estimated as\n\nσ̂² = argmin_{σ²} ( min_{r, C_A, C_B} F(r; C_A, C_B, σ² | V) ).\n\n3When the number of samples is larger (smaller) than the data dimensionality in the PCA setting, the observation matrix V should consist of the columns (rows), each of which corresponds to each sample.\n\n3 Simple Closed-Form Solutions of VBMF\n\nRecently, the global analytic solution of VBMF has been derived [16]. However, it is given as a solution of a quartic equation (Corollary 1 in [16]), and it is not easy to use for further analysis due to its complicated expression. In this section, we derive much simpler forms, which will be used for analyzing VB-PCA in the next section.\n\n3.1 VB Solution\n\nOur new analytic-form solution only involves linear and quadratic equations, which is summarized in the following theorem (the proof is omitted due to the space limitation):\n\nTheorem 1 Let\n\nV = Σ_{h=1}^{L} γ_h ω_{bh} ω_{ah}^T  (7)\n\nbe the singular value decomposition (SVD) of V with its singular values {γ_h} arranged in non-increasing order, and the associated right and left singular vectors {ω_{ah}, ω_{bh}}. 
Then, the VB solution can be written as a truncated shrinkage SVD as follows:\n\nÛ^VB = Σ_{h=1}^{H} γ̂^VB_h ω_{bh} ω_{ah}^T,  where γ̂^VB_h = γ̆^VB_h if γ_h ≥ γ^VB_h, and γ̂^VB_h = 0 otherwise.  (8)\n\nHere, the truncation threshold and the shrinkage estimator are, respectively, given by\n\nγ^VB_h = σ sqrt( (L+M)/2 + σ²/(2 c²_{ah} c²_{bh}) + sqrt( ( (L+M)/2 + σ²/(2 c²_{ah} c²_{bh}) )² - LM ) ),  (9)\n\nγ̆^VB_h = γ_h ( 1 - (σ²/(2γ²_h)) ( M + L + sqrt( (M - L)² + 4γ²_h/(c²_{ah} c²_{bh}) ) ) ).  (10)\n\nWe can also derive a simple closed-form expression of the VB posterior (omitted).\n\n3.2 EVB Solution\n\nCombining Theorem 1 with the global EVB solution (Corollary 2 in [16]), we have the following theorem (the proof is omitted):\n\nTheorem 2 Let\n\nα = L/M,  (11)\n\nand let κ = κ(α) (> 1) be the zero-cross point of the following decreasing function of κ:\n\nΦ(x) = log(x + 1)/x - 1/2,  (12)\n\nΞ(κ; α) = Φ(√α κ) + Φ(κ/√α).  (13)\n\nThen, the EVB solution can be written as a truncated shrinkage SVD as follows:\n\nÛ^EVB = Σ_{h=1}^{H} γ̂^EVB_h ω_{bh} ω_{ah}^T,  where γ̂^EVB_h = γ̆^EVB_h if γ_h ≥ γ^EVB, and γ̂^EVB_h = 0 otherwise.\n\nHere, the truncation threshold and the shrinkage estimator are, respectively, given by\n\nγ^EVB = σ sqrt( M + L + sqrt(LM) (κ + 1/κ) ),  (14)\n\nγ̆^EVB_h = (γ_h/2) ( 1 - (M+L)σ²/γ²_h + sqrt( ( 1 - (M+L)σ²/γ²_h )² - 4LMσ⁴/γ⁴_h ) ).  (15)\n\nThe EVB threshold (14) involves κ, which needs to be numerically computed. We can easily prepare a table of the values for 0 < α ≤ 1 beforehand, like the cumulative Gaussian probability used in statistical tests. Fig. 
2 shows the EVB threshold (14) by a red solid curve labeled as ‘EVB’.\n\n[Figures omitted. Figure 2 plots γ/sqrt(Mσ²) against α for the curves labeled EVB, MPUL, and VBFL. Figure 3 plots the density p(u) for α = 1 and α = 0.1, with the mean <u>_{p(u)} and the upper limits ū(0.1), ū(1) indicated. Figure 4 plots ψ0(x) and ψ(x) against x.]\n\nFigure 2: Thresholds.\n\nFigure 3: Marčenko-Pastur law.\n\nFigure 4: ψ0(x) and ψ(x).\n\n3.3 Large-Scale Limiting Behavior of EVB When Noise Variance Is Known\n\nHere, we first introduce a result from random matrix theory [13, 22], and then discuss the behavior of EVB when the noise variance is known.\n\nAssume that E ∈ R^{L×M} is a random matrix such that each element is independently drawn from a distribution with mean zero and variance σ²* (not necessarily Gaussian). Let u_1, u_2, ..., u_L be the eigenvalues of (1/(Mσ²*)) E E^T, and define the empirical distribution of the eigenvalues by\n\np(u) = (1/L) ( δ(u_1) + δ(u_2) + ... + δ(u_L) ),\n\nwhere δ(u) denotes the Dirac measure at u. Then the following proposition holds:\n\nProposition 1 (Marčenko-Pastur law) [13, 22] In the large-scale limit when L and M go to infinity with the ratio α = L/M fixed, the probability measure of the empirical distribution of the eigenvalue u of (1/(σ²* M)) E E^T converges almost surely to\n\np(u) du = ( sqrt( (u - u̲)(ū - u) ) / (2παu) ) θ(u̲ < u < ū) du,  (16)\n\nwhere u̲ = (1 - √α)², ū = (1 + √α)², and θ(·) denotes an indicator function such that θ(condition) = 1 if the condition is true and θ(condition) = 0 otherwise.\n\nFig. 
3 shows the Marčenko-Pastur (MP) distributions for α = 0.1, 1. The mean <u>_{p(u)} = 1 (which is constant for any 0 < α ≤ 1) and the upper limit ū = ū(α) for α = 0.1, 1 are indicated by arrows. Note that the MP distribution appears for a single sample matrix; different from standard “large-sample” theories, we do not need many sample matrices (this property is sometimes called self-averaging). This single-sample property of the MP distribution is highly useful in our analysis because we are working with a single observation matrix in the MF scenario.\n\nProposition 1 states that all singular values of the random matrix E are almost surely upper-bounded by\n\nγ_MPUL = sqrt( M σ²* ū ) = (√L + √M) σ*,  (17)\n\nwhich we call the Marčenko-Pastur upper-limit (MPUL). This property can be used for designing estimators robust against noise [10, 9]. Although EVB-PCA was proposed independently from the random matrix theory [3], its good performance can be proven with a property related to Proposition 1, as shown in Section 4.\n\nWhen the noise variance is known, we can set the parameter to σ = σ* in Eq.(1). We depicted MPUL (17) for this case in Fig. 2. We can see that MPUL lower-bounds the EVB threshold (14) (actually this is true regardless of the value of κ > 0). This implies a nice behavior of EVB, i.e., EVB eliminates all noise components in the large-scale limit. However, a simple optimal strategy, discarding the components with singular values smaller than γ_MPUL, outperforms EVB, because signals lying in the gap [γ_MPUL, γ^EVB) are discarded by EVB. Therefore, EVB is not very useful when σ²* is known. In Section 4, we investigate the behavior of EVB in a more practical and challenging situation where σ²* is unknown and is also estimated from observation.\n\nIn Fig. 
2, we also depicted the VB threshold (9) with the almost flat prior c_{ah}, c_{bh} → ∞ (labeled as ‘VBFL’) for comparison. Actually, this coincides with the mean of the MP distribution, i.e., lim_{c_{ah}, c_{bh} → ∞} (γ^VB_h)²/(Mσ²) = <u>_{p(u)} = 1. This implies that VBFL retains a lot of noise components, and does not perform well even when σ²* is known.\n\n4 Analysis of EVB When Noise Variance Is Unknown\n\nIn this section, we derive bounds of the VB-based noise variance estimator, and obtain a sufficient condition for perfect dimensionality recovery in the large-scale limit.\n\n4.1 Bounds of Noise Variance Estimator\n\nThe simple closed-form solution obtained in Section 3 is the global minimizer of the free energy (4), given σ². Using the solution, we can explicitly describe the free energy as a function of σ². We obtain the following theorem (the proof is omitted):\n\nTheorem 3 The noise variance estimator σ̂²_EVB is the global minimizer of\n\nΩ(σ⁻²) = Σ_{h=1}^{H} ψ( γ²_h/(Mσ²) ) + Σ_{h=H+1}^{L} ψ0( γ²_h/(Mσ²) ),  (18)\n\nwhere ψ(x) = ψ0(x) + θ(x > x̄) ψ1(x),\n\nx̄ = 1 + α + √α (κ + κ⁻¹),  (19)\n\nψ0(x) = x - log x,  ψ1(x) = log( √α κ(x) + 1 ) + α log( κ(x)/√α + 1 ) - √α κ(x),  (20)\n\nκ is a constant defined in Theorem 2, and κ(x) is a function of x (> x̄) defined by\n\nκ(x) = ( 1/(2√α) ) ( x - (1 + α) + sqrt( (x - (1 + α))² - 4α ) ).  (21)\n\nNote that x̄ and κ(γ²_h/(σ²M)) are rescaled versions of the squared EVB threshold (14) and the EVB shrinkage estimator (15), respectively, i.e., x̄ = (γ^EVB)²/(σ²M) and κ(γ²_h/(σ²M)) = γ_h γ̆^EVB_h / (σ² sqrt(ML)).\n\nThe functions ψ0(x) and ψ(x) are depicted in Fig. 4. 
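Theorem 3's objective can be evaluated numerically. The following sketch (illustrative Python, not the authors' implementation; the function names are ours) computes κ(α) by bisection on the decreasing function Ξ of Theorem 2, and then evaluates Ω for a given noise variance σ² and formal rank H:

```python
import numpy as np

def kappa_alpha(alpha, lo=1.0, hi=1e6, iters=200):
    """Zero-cross point of the decreasing function Xi(k; alpha) of Theorem 2."""
    Phi = lambda x: np.log(x + 1.0) / x - 0.5
    Xi = lambda k: Phi(np.sqrt(alpha) * k) + Phi(k / np.sqrt(alpha))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # Xi is decreasing, so a positive value means the root lies above mid.
        lo, hi = (mid, hi) if Xi(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def omega(sigma2, gammas, L, M, H):
    """Objective of Theorem 3 (Eq.(18)), evaluated at noise variance sigma2."""
    alpha = L / M
    kappa = kappa_alpha(alpha)
    xbar = 1 + alpha + np.sqrt(alpha) * (kappa + 1.0 / kappa)  # Eq.(19)
    def kappa_x(x):  # Eq.(21), defined for x > xbar
        d = x - (1 + alpha)
        return (d + np.sqrt(d * d - 4 * alpha)) / (2 * np.sqrt(alpha))
    def psi(x, with_signal_term):
        val = x - np.log(x)                      # psi0, Eq.(20)
        if with_signal_term and x > xbar:        # psi1 is added above the threshold
            k = kappa_x(x)
            val += (np.log(np.sqrt(alpha) * k + 1)
                    + alpha * np.log(k / np.sqrt(alpha) + 1)
                    - np.sqrt(alpha) * k)
        return val
    xs = gammas**2 / (M * sigma2)
    return (sum(psi(x, True) for x in xs[:H])
            + sum(psi(x, False) for x in xs[H:L]))
```

For α = 1, the bisection lands near κ ≈ 2.51, which is the value implied by the condition log(κ + 1)/κ = 1/2.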
We can prove the convexity of ψ0(x) and the quasi-convexity of ψ(x), which are useful properties in our theoretical analysis.\n\nLet Ĥ^EVB be the rank estimated by VB, which satisfies γ̂^EVB_h > 0 for h = 1, ..., Ĥ^EVB and γ̂^EVB_h = 0 for h = Ĥ^EVB + 1, ..., H. Then we have the following theorem:\n\nTheorem 4 Ĥ^EVB is upper-bounded as\n\nĤ^EVB ≤ H̄ = min( ⌈L/(1+α)⌉ - 1, H ),  (22)\n\nand the noise variance estimator σ̂²_EVB is bounded as follows:\n\nmax( σ²_{H̄+1}, ( Σ_{h=H̄+1}^{L} γ²_h ) / ( M(L - H̄(1+α)) ) ) < σ̂²_EVB ≤ (1/(LM)) Σ_{h=1}^{L} γ²_h,  (23)\n\nwhere σ²_h = ∞ for h = 0, σ²_h = γ²_h/(M x̄) for h = 1, ..., L, and σ²_h = 0 for h = L + 1.  (24)\n\n(Sketch of proof) First, we show that a global minimizer w.r.t. σ² exists in (γ²_L/M, γ²_1/M), and it is a stationary point. Given a hypothetic Ĥ, the derivative of Ω w.r.t. σ⁻² is written as\n\n(1/L) ∂Ω/∂σ⁻² = -σ² + Θ(σ⁻²),  where Θ(σ⁻²) ≡ ( Σ_{h=1}^{Ĥ} γ_h (γ_h - γ̆^EVB_h) + Σ_{h=Ĥ+1}^{L} γ²_h ) / (LM).  (25)\n\nEq.(15) implies the following bounds:\n\n(M + L)σ² < γ_h (γ_h - γ̆^EVB_h) < (√M + √L)² σ²  for γ_h > γ^EVB,  (26)\n\nwhich allows us to bound Θ by simple inequalities. Finding a condition prohibiting Θ from being zero proves the theorem.\n\nTheorem 4 states that EVB discards at least the L - H̄ ≥ 1 smallest components, regardless of the observed values {γ_h}. For example, more than half of the components are always discarded when the matrix is square (i.e., α = L/M = 1). 
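The two-sided bound (26) on γ_h(γ_h - γ̆^EVB_h), which drives the proof sketch, can be checked numerically. The snippet below (an illustrative check, not from the paper; the sizes L = M = 100 and σ² = 1 are our assumed values, for which the EVB threshold is roughly 22.2) plugs singular values above the threshold into the shrinkage estimator (15):

```python
import numpy as np

def evb_shrinkage(gamma, sigma2, L, M):
    """EVB shrinkage estimator, Eq.(15); valid for gamma above the EVB threshold."""
    t = (M + L) * sigma2 / gamma**2
    s = 4.0 * L * M * sigma2**2 / gamma**4
    return 0.5 * gamma * (1.0 - t + np.sqrt((1.0 - t)**2 - s))

L, M, sigma2 = 100, 100, 1.0
for gamma in [25.0, 30.0, 50.0, 100.0]:   # all above the EVB threshold here
    gap = gamma * (gamma - evb_shrinkage(gamma, sigma2, L, M))
    # Eq.(26): (M+L)*sigma2 < gap < (sqrt(M)+sqrt(L))^2 * sigma2
    assert (M + L) * sigma2 < gap < (np.sqrt(M) + np.sqrt(L))**2 * sigma2
```

The lower end (M + L)σ² is approached as γ grows, which is what makes Θ in Eq.(25) easy to squeeze between simple quantities.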
The smallest singular value γ_L is always discarded, and σ̂²_EVB > γ²_L/M always holds.\n\n[Figure omitted: panels (a) α = 1, (b) α = 0.5, (c) α = 0.1, each plotting the success rate against y for ξ = 0.0, 0.1, 0.2, 0.4, 0.6, 0.8.]\n\nFigure 5: Success rate of dimensionality recovery in numerical simulation for M = 200. The threshold for the guaranteed recovery (the second inequality in Eq.(28)) is depicted with a vertical bar with the same color and line style.\n\n4.2 Perfect Recovery Condition\n\nHere, we derive a sufficient condition for perfect recovery of the true PCA dimensionality in the large-scale limit.\n\nAssume that the observation matrix V is generated as\n\nV = U* + E,  (27)\n\nwhere U* is a true signal matrix with rank H* and singular values {γ*_h}, and each element of the noise matrix E is subject to a distribution with mean zero and variance σ²*. We rely on a result [2, 5] on the eigenvalue distribution of the spiked covariance model [8]. 
The following theorem guarantees the accuracy of VB-PCA:\n\nTheorem 5 Assume H ≥ H* (i.e., we set the rank of the MF model sufficiently large), and denote the relevant rank (dimensionality) ratio by\n\nξ = H*/L.\n\nIn the large-scale limit with finite α and H*, EVB implemented with a local search algorithm for the noise variance σ² estimation almost surely recovers the true rank, i.e., Ĥ^EVB = H*, if ξ = 0 or\n\nξ < 1/x̄  and  γ*²_{H*} > ( (x̄ - 1)/(1 - x̄ ξ) - α ) · Mσ²*,  (28)\n\nwhere x̄ is defined in Eq.(19).\n\n(Sketch of proof) We first show that, in the large-scale limit and when ξ = 0, Eq.(25) is equal to zero if and only if σ² = σ²*. This means perfect recovery in the no-signal case. Eq.(24) is actually the thresholding point of the estimated σ̂² for the h-th component to be discarded. Therefore, Ĥ^EVB = H* if and only if σ²_{H*+1} < σ̂² < σ²_{H*}, where σ²_h is defined in Eq.(24). Using Eq.(26), we can obtain a sufficient condition that a local minimum exists only in this range, which proves the theorem.\n\nNote that ξ → 0 in the large-scale limit. However, we treated ξ as a positive value in Theorem 5, hoping that the obtained result can approximately hold in a practical situation when L and M are large but finite. The obtained result explains well the dependency on ξ in the numerical simulation below.\n\nTheorem 5 guarantees that, if the true rank H* is small enough compared with L and the smallest signal γ*_{H*} is large enough compared with σ²*, VB-PCA works perfectly. It is important to note that, although the objective function (18) is non-convex and possibly multimodal in general, perfect recovery does not require a global search, but only a local search, of the objective function for noise variance estimation.\n\nFig. 
5 shows numerical results for M = 200 and α = 1, 0.5, 0.1. E was drawn from the Gaussian distribution with variance σ²* = 1, and signal singular values were drawn from the uniform distribution on [yMσ²*, 10M] for different y (the horizontal axis of the graphs indicates y). The vertical axis indicates the success rate of dimensionality recovery, i.e., Ĥ^EVB = H*, over 100 trials. If the condition for ξ (the first inequality in Eq.(28)) is violated, the corresponding line is depicted with markers. Otherwise, the threshold of y for the guaranteed recovery (the second inequality in Eq.(28)) is indicated by a vertical bar with the same color and line style. We can see that the guarantee by Theorem 5 approximately holds even at this small matrix size, although it is slightly conservative.\n\n5 Discussion\n\nHere, we discuss implementation of VB-PCA, and the origin of sparsity of VB.\n\n5.1 Implementation\n\nImplementation of VB-PCA (VB-MF) based on the result given in [16] involves a quartic equation. This means that we need to use a highly complicated analytic-form solution of a quartic equation, derived by, e.g., Ferrari's method, or rely on a numerical quartic solver, which is computationally less efficient. The theorems we gave in this paper can actually simplify the implementation greatly. A table of κ, defined in Theorem 2, should be prepared beforehand. Given an observed matrix V, we perform SVD and obtain the singular values. After that, in our new implementation, we first directly estimate the noise variance based on Theorem 3, using any 1-D local search algorithm (Theorem 4 helps restrict the search range). Then we obtain the noise variance estimator σ̂²_EVB. For a dimensionality reduction purpose, simply discarding all the components such that σ²_h < σ̂²_EVB gives the result (here σ²_h is defined by Eq.(24)). 
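The procedure just described can be condensed into a short sketch (illustrative Python, not the authors' code: a grid search stands in for the 1-D local search over σ², the formal rank H is set to the full rank L, and all names are our own):

```python
import numpy as np

def evb_pca(V):
    """Sketch of the EVB-PCA procedure of Section 5.1 (simplified, illustrative)."""
    L, M = V.shape
    assert L <= M  # transpose V beforehand otherwise
    alpha = L / M
    gammas = np.linalg.svd(V, compute_uv=False)  # non-increasing singular values

    # kappa(alpha): zero-cross of the decreasing function Xi (Theorem 2), by bisection.
    Phi = lambda x: np.log(x + 1.0) / x - 0.5
    Xi = lambda k: Phi(np.sqrt(alpha) * k) + Phi(k / np.sqrt(alpha))
    lo, hi = 1.0, 1e6
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if Xi(mid) > 0 else (lo, mid)
    kappa = 0.5 * (lo + hi)

    # Squared EVB threshold (14) as a function of sigma^2, and Theorem 3's objective.
    def threshold2(s2):
        return s2 * (M + L + np.sqrt(L * M) * (kappa + 1.0 / kappa))
    def omega(s2):
        x = gammas**2 / (M * s2)
        val = np.sum(x - np.log(x))              # psi0 for every component
        above = gammas**2 >= threshold2(s2)      # components keeping the psi1 term
        d = x[above] - (1 + alpha)
        k = (d + np.sqrt(d**2 - 4 * alpha)) / (2 * np.sqrt(alpha))  # Eq.(21)
        val += np.sum(np.log(np.sqrt(alpha) * k + 1)
                      + alpha * np.log(k / np.sqrt(alpha) + 1)
                      - np.sqrt(alpha) * k)
        return val

    # Search range motivated by Theorem 4's proof (between gamma_L^2/M and the
    # mean squared singular value); a local search could be used instead.
    grid = np.linspace(gammas[-1]**2 / M, np.sum(gammas**2) / (L * M), 1000)
    s2 = grid[np.argmin([omega(s) for s in grid])]

    rank = int(np.sum(gammas**2 >= threshold2(s2)))
    return rank, s2
```

On a synthetic low-rank-plus-noise matrix with a clear signal, this sketch returns the planted rank and a noise variance estimate near the true value, matching the behavior predicted by Theorems 3-5.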
When the estimator Û^EVB is needed, Theorem 2 gives the result for σ² = σ̂²_EVB. The VB posterior is also easily computed (its description is omitted). In this way, we can perform VB-PCA, whose performance is guaranteed, very easily.\n\n5.2 Origin of Exact Sparsity\n\nSparsity is regarded as a practical advantage of VB. Nevertheless, as discussed in Section 1, it is not necessarily a property inherent in the rigorous Bayesian estimation. Actually, at least in MF, the sparsity is induced by the independence assumption between A and B. Let us go back to Fig.1, where the Bayes posterior for V = 1 is shown in the left graph. In this scalar factorization model, the probability mass in the first and the third quadrants pulls the estimator Û = BA toward the positive direction, and the mass in the second and the fourth quadrants toward the negative direction. Since the Bayes posterior is skewed and more mass is put in the first and the third quadrants, it is natural that the Bayesian estimator γ̂ = <BA>_{p(A,B|V)} is positive. This is true even if V > 0 is very small.\n\nOn the other hand, the VB posterior (the middle graph) is prohibited from being skewed because of the independence assumption, and thus it has to wait until V grows large enough that one of the modes gains enough probability mass. This is the cause of sparsity in VBMF. The Bayes posterior (left graph) implies that, if we restrict the posterior to be Gaussian but allow correlation between A and B, exact sparsity will not be observed.\n\nIt is observed that the Bayesian estimation gives a sparse solution when the hyperparameters (C_A, C_B) are optimized. This estimator is also depicted as blue diamonds labeled as EFB (empirical fully-Bayesian) in the right graph of Fig.1. 
Note that, in this case, the independence between A and C_A^{-1/2} (as well as between B and C_B^{-1/2}), which are strongly correlated in the prior (2) and hence in the Bayes posterior, is forced: the point estimation of C_A (as well as C_B) breaks the correlation, because approximating by the delta function induces independence from all other parameters. Further investigation of the relation between the independence constraint and the sparsity induction is our future work.\n\n6 Conclusion\n\nIn this paper, we considered the variational Bayesian PCA (VB-PCA) when the noise variance is unknown. Analyzing the behavior of the noise variance estimator, we derived a sufficient condition for VB-PCA to perfectly recover the true dimensionality. Our result theoretically supports the usefulness of VB-PCA. In our theoretical analysis, we obtained bounds for a noise variance estimator and simple closed-form solutions for other parameters, which were shown to be useful for better implementation of VB-PCA.\n\nAcknowledgments\n\nSN, RT, and MS thank the support from MEXT Kakenhi 23120004, MEXT Kakenhi 22700138, and the FIRST program, respectively. SDB was supported by a Beckman Postdoctoral Fellowship.\n\nReferences\n\n[1] H. Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 21-30, San Francisco, CA, 1999. Morgan Kaufmann.\n\n[2] J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382-1408, 2006.\n\n[3] C. M. Bishop. Variational principal components. In Proc. of ICANN, volume 1, pages 509-514, 1999.\n\n[4] Z. Ghahramani and M. J. Beal. Graphical models and variational methods. In Advanced Mean Field Methods, pages 161-177. MIT Press, 2001.\n\n[5] D. C. 
Hoyle. Automatic PCA dimension selection for high dimensional data and small sample sizes. Journal of Machine Learning Research, 9:2733-2759, 2008.\n\n[6] A. Ilin and T. Raiko. Practical approaches to principal component analysis in the presence of missing values. JMLR, 11:1957-2000, 2010.\n\n[7] T. S. Jaakkola and M. I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25-37, 2000.\n\n[8] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29:295-327, 2001.\n\n[9] N. El Karoui. Spectrum estimation for large dimensional covariance matrices using random matrix theory. Annals of Statistics, 36(6):2757-2790, 2008.\n\n[10] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365-411, 2004.\n\n[11] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007.\n\n[12] D. J. C. MacKay. Local minima, symmetry-breaking, and model pruning in variational free energy minimization. Available from http://www.inference.phy.cam.ac.uk/mackay/minima.pdf. 2001.\n\n[13] V. A. Marcenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457-483, 1967.\n\n[14] T. P. Minka. Automatic choice of dimensionality for PCA. In Advances in NIPS, volume 13, pages 598-604. MIT Press, 2001.\n\n[15] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12:2579-2644, 2011.\n\n[16] S. Nakajima, M. Sugiyama, and S. D. Babacan. Global solution of fully-observed variational Bayesian matrix factorization is column-wise independent. In Advances in Neural Information Processing Systems 24, 2011.\n\n[17] S. Nakajima, M. Sugiyama, and S. D. 
Babacan. On Bayesian PCA: Automatic dimensionality selection and analytic solution. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, Jun. 28-Jul. 2, 2011.\n\n[18] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305-345, 1999.\n\n[19] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1257-1264, Cambridge, MA, 2008. MIT Press.\n\n[20] M. Sato, T. Yoshioka, S. Kajihara, K. Toyama, N. Goda, K. Doya, and M. Kawato. Hierarchical Bayesian estimation for MEG inverse problem. NeuroImage, 23:806-826, 2004.\n\n[21] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, 61:611-622, 1999.\n\n[22] K. W. Wachter. The strong limits of random matrix spectra for sample matrices of independent elements. Annals of Probability, 6:1-18, 1978.\n", "award": [], "sourceid": 470, "authors": [{"given_name": "Shinichi", "family_name": "Nakajima", "institution": null}, {"given_name": "Ryota", "family_name": "Tomioka", "institution": null}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": null}, {"given_name": "S.", "family_name": "Babacan", "institution": null}]}