{"title": "Global Analytic Solution for Variational Bayesian Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1768, "page_last": 1776, "abstract": "Bayesian methods of matrix factorization (MF) have been actively explored recently as promising alternatives to classical singular value decomposition. In this paper, we show that, despite the fact that the optimization problem is non-convex, the global optimal solution of variational Bayesian (VB) MF can be computed analytically by solving a quartic equation. This is highly advantageous over a popular VBMF algorithm based on iterated conditional modes since it can only find a local optimal solution after iterations. We further show that the global optimal solution of empirical VBMF (hyperparameters are also learned from data) can also be analytically computed. We illustrate the usefulness of our results through experiments.", "full_text": "Global Analytic Solution\n\nfor Variational Bayesian Matrix Factorization\n\nShinichi Nakajima\nNikon Corporation\n\nTokyo, 140-8601, Japan\n\nMasashi Sugiyama\n\nTokyo Institute of Technology\n\nTokyo 152-8552, Japan\n\nnakajima.s@nikon.co.jp\n\nsugi@cs.titech.ac.jp\n\nRyota Tomioka\n\nThe University of Tokyo\nTokyo 113-8685, Japan\n\ntomioka@mist.i.u-tokyo.ac.jp\n\nAbstract\n\nBayesian methods of matrix factorization (MF) have been actively explored re-\ncently as promising alternatives to classical singular value decomposition. In this\npaper, we show that, despite the fact that the optimization problem is non-convex,\nthe global optimal solution of variational Bayesian (VB) MF can be computed\nanalytically by solving a quartic equation. This is highly advantageous over a\npopular VBMF algorithm based on iterated conditional modes since it can only\n\ufb01nd a local optimal solution after iterations. 
We further show that the global opti-\nmal solution of empirical VBMF (hyperparameters are also learned from data) can\nalso be analytically computed. We illustrate the usefulness of our results through\nexperiments.\n\n1 Introduction\n\nThe problem of \ufb01nding a low-rank approximation of a target matrix through matrix factorization\n(MF) attracted considerable attention recently since it can be used for various purposes such as\nreduced rank regression [19], canonical correlation analysis [8], partial least-squares [27, 21],\nmulti-class classi\ufb01cation [1], and multi-task learning [7, 29].\nSingular value decomposition (SVD) is a classical method for MF, which gives the optimal low-\nrank approximation to the target matrix in terms of the squared error. Regularized variants of SVD\nhave been studied for the Frobenius-norm penalty (i.e., singular values are regularized by the \u21132-\npenalty) [17] or the trace-norm penalty (i.e., singular values are regularized by the \u21131-penalty) [23].\nSince the Frobenius-norm penalty does not automatically produce a low-rank solution, it should be\ncombined with an explicit low-rank constraint, which is non-convex. In contrast, the trace-norm\npenalty tends to produce sparse solutions, so a low-rank solution can be obtained without explicit\nrank constraints. This implies that the optimization problem of trace-norm MF is still convex, and\nthus the global optimal solution can be obtained. Recently, optimization techniques for trace-norm\nMF have been extensively studied [20, 6, 12, 25].\nBayesian approaches to MF have also been actively explored. A maximum a posteriori (MAP)\nestimation, which computes the mode of the posterior distributions, was shown [23] to correspond to\nthe \u21131-MF when Gaussian priors are imposed on factorized matrices [22]. The variational Bayesian\n(VB) method [3, 5], which approximates the posterior distributions by factorized distributions, has\nalso been applied to MF [13, 18]. 
The VB-based MF method (VBMF) was shown to perform well\nin experiments, and its theoretical properties have been investigated [15].\n\n1\n\n\fFigure 1: Matrix factorization model. H \u2264 L \u2264 M. A = (a1, . . . , aH) and B = (b1, . . . , bH).\n\nHowever, the optimization problem of VBMF is non-convex. In practice, the VBMF solution is\ncomputed by the iterated conditional modes (ICM) [4, 5], where the mean and the covariance of the\nposterior distributions are iteratively updated until convergence [13, 18]. One may obtain a local\noptimal solution by the ICM algorithm, but many restarts would be necessary to \ufb01nd a good local\noptimum.\nIn this paper, we \ufb01rst show that, although the optimization problem is non-convex, the global opti-\nmal solution of VBMF can be computed analytically by solving a quartic equation. This is highly\nadvantageous over the standard ICM algorithm since the global optimum can be found without any\niterations and restarts. We next consider an empirical VB (EVB) scenario where the hyperparam-\neters (prior variances) are also learned from data. Again, the optimization problem of EVBMF is\nnon-convex, but we still show that the global optimal solution of EVBMF can be computed analyti-\ncally. The usefulness of our results is demonstrated through experiments.\nRecently, the global optimal solution of VBMF when the target matrix is square has been obtained\nin [15]. Thus, our contribution to VBMF can be regarded as an extension of the previous result to\ngeneral rectangular matrices. On the other hand, for EVBMF, this is the \ufb01rst paper that gives the\nanalytic global solution, to the best of our knowledge. 
The global analytic solution for EVBMF is shown to be highly useful in experiments.\n\n2 Bayesian Matrix Factorization\n\nIn this section, we formulate the MF problem and review a variational Bayesian MF algorithm.\n\n2.1 Formulation\n\nThe goal of MF is to approximate an unknown target matrix U (\u2208 R^{L\u00d7M}) from its n observations\n\nV^n = {V^{(i)} \u2208 R^{L\u00d7M}}_{i=1}^n.\n\nWe assume that L \u2264 M. If L > M, we may simply re-define the transpose U^\u22a4 as U so that L \u2264 M holds. Thus this does not impose any restriction.\nA key assumption of MF is that U is a low-rank matrix. Let H (\u2264 L) be the rank of U. Then the matrix U can be decomposed into the product of A \u2208 R^{M\u00d7H} and B \u2208 R^{L\u00d7H} as follows (see Figure 1):\n\nU = BA^\u22a4.\n\nAssume that the observed matrix V is subject to the following additive-noise model:\n\nV = U + E,\n\nwhere E (\u2208 R^{L\u00d7M}) is a noise matrix. Each entry of E is assumed to independently follow the Gaussian distribution with mean zero and variance \u03c3^2. Then, the likelihood p(V^n|A, B) is given by\n\np(V^n|A, B) \u221d exp( \u2212(1/(2\u03c3^2)) \u2211_{i=1}^n \u2225V^{(i)} \u2212 BA^\u22a4\u2225^2_{Fro} ),\n\nwhere \u2225\u00b7\u2225_{Fro} denotes the Frobenius norm of a matrix.\n\n2.2 Variational Bayesian Matrix Factorization\n\nWe use the Gaussian priors on the parameters A = (a_1, . . . , a_H) and B = (b_1, . . . , b_H):\n\n\u03c6(U) = \u03c6_A(A)\u03c6_B(B), where \u03c6_A(A) \u221d exp( \u2212\u2211_{h=1}^H \u2225a_h\u2225^2/(2c^2_{a_h}) ) and \u03c6_B(B) \u221d exp( \u2212\u2211_{h=1}^H \u2225b_h\u2225^2/(2c^2_{b_h}) ).\n\nc^2_{a_h} and c^2_{b_h} are hyperparameters corresponding to the prior variance. 
Without loss of generality, we assume that the product c_{a_h}c_{b_h} is non-increasing with respect to h.\nLet r(A, B|V^n) be a trial distribution for A and B, and let F_{VB} be the variational Bayes (VB) free energy with respect to r(A, B|V^n):\n\nF_{VB}(r|V^n) = \u27e8 log( r(A, B|V^n) / p(V^n, A, B) ) \u27e9_{r(A,B|V^n)},\n\nwhere \u27e8\u00b7\u27e9_p denotes the expectation over p.\nThe VB approach minimizes the VB free energy F_{VB}(r|V^n) with respect to the trial distribution r(A, B|V^n), by restricting the search space of r(A, B|V^n) so that the minimization is computationally tractable. Typically, dissolution of probabilistic dependency between entangled parameters (A and B in the case of MF) makes the calculation feasible:1\n\nr(A, B|V^n) = \u220f_{h=1}^H r_{a_h}(a_h|V^n) r_{b_h}(b_h|V^n).   (1)\n\nThe resulting distribution is called the VB posterior. The VB solution \u00db^{VB} is given by the VB posterior mean:\n\n\u00db^{VB} = \u27e8BA^\u22a4\u27e9_{r(A,B|V^n)}.\n\nBy applying the variational method to the VB free energy, we see that the VB posterior can be expressed as follows:\n\nr(A, B|V^n) = \u220f_{h=1}^H N_M(a_h; \u00b5_{a_h}, \u03a3_{a_h}) N_L(b_h; \u00b5_{b_h}, \u03a3_{b_h}),\n\nwhere N_d(\u00b7; \u00b5, \u03a3) denotes the d-dimensional Gaussian density with mean \u00b5 and covariance matrix \u03a3. 
\u00b5_{a_h}, \u00b5_{b_h}, \u03a3_{a_h}, and \u03a3_{b_h} satisfy\n\n\u00b5_{a_h} = \u03a3_{a_h}\u039e_h^\u22a4\u00b5_{b_h},  \u00b5_{b_h} = \u03a3_{b_h}\u039e_h\u00b5_{a_h},  \u03a3_{a_h} = ( n\u03b2_h/\u03c3^2 + c_{a_h}^{-2} )^{-1} I_M,  \u03a3_{b_h} = ( n\u03b1_h/\u03c3^2 + c_{b_h}^{-2} )^{-1} I_L,   (2)\n\nwhere I_d denotes the d-dimensional identity matrix, and\n\n\u03b1_h = \u2225\u00b5_{a_h}\u2225^2 + tr(\u03a3_{a_h}),  \u03b2_h = \u2225\u00b5_{b_h}\u2225^2 + tr(\u03a3_{b_h}),  \u039e_h = (n/\u03c3^2)( V \u2212 \u2211_{h'\u2260h} \u00b5_{b_{h'}}\u00b5_{a_{h'}}^\u22a4 ),  V = (1/n)\u2211_{i=1}^n V^{(i)}.\n\nThe iterated conditional modes (ICM) algorithm [4, 5] for VBMF (VB-ICM) iteratively updates \u00b5_{a_h}, \u00b5_{b_h}, \u03a3_{a_h}, and \u03a3_{b_h} by Eq.(2) from some initial values until convergence [13, 18], allowing one to obtain a local optimal solution. Finally, an estimator of U is computed as\n\n\u00db^{VB-ICM} = \u2211_{h=1}^H \u00b5_{b_h}\u00b5_{a_h}^\u22a4.\n\nWhen the noise variance \u03c3^2 is unknown, it may be estimated by the following re-estimation formula:\n\n\u03c3^2 = (1/(nLM)) ( \u2211_{i=1}^n \u2225V^{(i)} \u2212 \u2211_{h=1}^H \u00b5_{b_h}\u00b5_{a_h}^\u22a4\u2225^2_{Fro} + n \u2211_{h=1}^H ( \u03b1_h\u03b2_h \u2212 \u2225\u00b5_{a_h}\u2225^2\u2225\u00b5_{b_h}\u2225^2 ) ),\n\nwhich corresponds to the derivative of the VB free energy with respect to \u03c3^2 set to zero (see Eq.(4) in Section 3). 
This can be incorporated in the ICM algorithm by updating \u03c3^2 from some initial value by the above formula in every iteration of the ICM algorithm.\n\n1Although a weaker constraint, r(A, B|V^n) = r_A(A|V^n)r_B(B|V^n), is sufficient to derive a tractable iterative algorithm [13], we assume the stronger one (1) used in [18], which makes our theoretical analysis tractable.\n\n2.3 Empirical Variational Bayesian Matrix Factorization\n\nIn the VB framework, hyperparameters (c^2_{a_h} and c^2_{b_h} in the current setup) can also be learned from data by minimizing the VB free energy, which is called the empirical VB (EVB) method [5]. By setting the derivatives of the VB free energy with respect to c^2_{a_h} and c^2_{b_h} to zero, the following optimality condition can be obtained (see also Eq.(4) in Section 3):\n\nc^2_{a_h} = \u03b1_h/M and c^2_{b_h} = \u03b2_h/L.   (3)\n\nThe ICM algorithm for EVBMF (EVB-ICM) iteratively updates c^2_{a_h} and c^2_{b_h} by Eq.(3), in addition to \u00b5_{a_h}, \u00b5_{b_h}, \u03a3_{a_h}, and \u03a3_{b_h} by Eq.(2). Again, one may obtain a local optimal solution by this algorithm.\n\n3 Analytic-form Expression of Global Optimal Solution of VBMF\n\nIn this section, we derive an analytic-form expression of the VBMF global solution. The VB free energy can be explicitly expressed as follows:\n\nF_{VB}(r|V^n) = (nLM/2) log \u03c3^2 + \u2211_{h=1}^H ( (M/2) log c^2_{a_h} + (L/2) log c^2_{b_h} \u2212 (1/2) log|\u03a3_{a_h}| \u2212 (1/2) log|\u03a3_{b_h}| + \u03b1_h/(2c^2_{a_h}) + \u03b2_h/(2c^2_{b_h}) ) + (1/(2\u03c3^2)) \u2211_{i=1}^n \u2225V^{(i)} \u2212 \u2211_{h=1}^H \u00b5_{b_h}\u00b5_{a_h}^\u22a4\u2225^2_{Fro} + (n/(2\u03c3^2)) \u2211_{h=1}^H ( \u03b1_h\u03b2_h \u2212 \u2225\u00b5_{a_h}\u2225^2\u2225\u00b5_{b_h}\u2225^2 ),   (4)\n\nwhere |\u00b7| denotes the determinant of a matrix. 
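As an illustration of how the \u03c3^2 re-estimation formula of Section 2.2 relates to the free energy (4), the sketch below (ours, not the authors'; the function names are hypothetical) evaluates only the \u03c3^2-dependent terms of (4) and the corresponding stationary point:

```python
import numpy as np

def sigma2_terms(sigma2, V_list, mu_a, mu_b, alpha, beta):
    """sigma^2-dependent part of the free energy (4):
    (nLM/2) log s2 + (1/2s2) sum_i ||V_i - sum_h mu_b mu_a^T||_Fro^2
                   + (n/2s2) sum_h (alpha_h beta_h - ||mu_a||^2 ||mu_b||^2)."""
    n = len(V_list)
    L, M = V_list[0].shape
    U = mu_b.T @ mu_a
    sq = sum(np.linalg.norm(Vi - U, 'fro') ** 2 for Vi in V_list)
    cross = np.sum(alpha * beta - np.sum(mu_a**2, axis=1) * np.sum(mu_b**2, axis=1))
    return 0.5 * n * L * M * np.log(sigma2) + (sq + n * cross) / (2 * sigma2)

def sigma2_update(V_list, mu_a, mu_b, alpha, beta):
    """Re-estimation formula: the unique stationary point of the terms above."""
    n = len(V_list)
    L, M = V_list[0].shape
    U = mu_b.T @ mu_a
    sq = sum(np.linalg.norm(Vi - U, 'fro') ** 2 for Vi in V_list)
    cross = np.sum(alpha * beta - np.sum(mu_a**2, axis=1) * np.sum(mu_b**2, axis=1))
    return (sq + n * cross) / (n * L * M)
```

Since these terms have the form A log \u03c3^2 + B/\u03c3^2 with A, B > 0, the update \u03c3^2 = B/A is their unique minimizer, matching the "derivative set to zero" statement in the text.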
We solve the following problem:\n\n+ n\n2\u03c32\n\n1\n2\u03c32\n\n\u00b5bh\n\n\u22a4\nah\n\nh=1\n\nh=1\n\ni=1\n\nFro\n\n\u00b5\n\n, c2\nbh\n\n(c2\nah\n\nGiven\nmin FVB({\u00b5ah\ns.t. \u00b5ah\n\n) \u2208 R2\n, \u00b5bh\n\u2208 RM , \u00b5bh\n\n++ (\u2200h = 1, . . . , H), \u03c32 \u2208 R++,\n, \u03a3ah, \u03a3bh; h = 1, . . . , H})\n\u2208 SL\n\u2208 RL, \u03a3ah\n++, \u03a3bh\n\n\u2208 SM\n\n++ (\u2200h = 1, . . . , H),\n\n++ denotes the set of d \u00d7 d symmetric positive-de\ufb01nite matrices. This is a non-convex\nwhere Sd\noptimization problem, but still we show that the global optimal solution can be analytically obtained.\nLet \u03b3h (\u2265 0) be the h-th largest singular value of V , and let \u03c9ah and \u03c9bh be the associated right\nand left singular vectors:2\nLetb\u03b3h be the second largest real solution of the following quartic equation with respect to t:\np\n\nfh(t) := t4 + \u03be3t3 + \u03be2t2 + \u03be1t + \u03be0 = 0,\n\nLX\n\n\u03b3h\u03c9bh\u03c9\n\n!\n\nV =\n\n\u22a4\nah\n\n(5)\n\nh=1\n\n.\n\n,\n\n\u03be1 = \u03be3\n\n(L2 + M 2)b\u03b72\n\u00b6(cid:181)\n\nLM\n\nh\n\n\u03be3\u03b3h +\n\n(cid:181)\n1 \u2212 \u03c32L\nn\u03b32\nh\n\nh =\n\n+\n\n2\u03c34\n\n\u00b6\n\nn2c2\nah\n\nc2\nbh\n\n1 \u2212 \u03c32M\nn\u03b32\nh\n\n\u03b32\nh.\n\nLet\n\n!2 \u2212 LM \u03c34\nThen we can analytically express the VBMF solution bU VB as in the following theorem.\n\n\u03c34\n2n2c2\nah\n\n\u03c34\n2n2c2\nah\n\n(L + M)\u03c32\n\nc2\nbh\n\nc2\nbh\n\n2n\n\n2n\n\nn2\n\n+\n\n+\n\nvuut\u02c6\n\n2In our analysis, we assume that V has no missing entry, and its singular value decomposition (SVD) is\n\neasily obtained. 
Therefore, our results cannot be directly applied to missing entry prediction.\n\n4\n\n,\n\nLM\n\n\u03be3 =\n\n\u02c6\nwhere the coef\ufb01cients are de\ufb01ned by\n\u03be2 = \u2212\nb\u03b72\n\n(L \u2212 M)2\u03b3h\n\n!2\n\n\u02c6b\u03b72\nvuuut(L + M)\u03c32\ne\u03b3h =\n\n\u03c34\nn2c2\nah\n\n\u03be0 =\n\nc2\nbh\n\n\u2212\n\n+\n\nh\n\n,\n\n\u03be0,\n\n.\n\n(6)\n\n\fTheorem 1 The global VB solution can be expressed as\n\nbU VB =\n\nHX\n\nh=1\n\nb\u03b3VB\n\nh \u03c9bh\u03c9\n\n, where b\u03b3VB\n\nh =\n\n\u22a4\nah\n\n\u2030b\u03b3h\n\n0\n\nif \u03b3h >e\u03b3h,\n\notherwise.\n\nSketch of proof: We \ufb01rst show that minimizing (4) amounts to a reweighed SVD and any minimizer\nis a stationary point. Then, by analyzing the stationary condition (2), we obtain an equation with\n\nrespect tob\u03b3h as a necessary and suf\ufb01cient condition to be a stationary point (note that its quadratic\nThe coef\ufb01cients of the quartic equation (5) are analytic, sob\u03b3h can also be obtained analytically3,\n\napproximation gives bounds of the solution [15]). Its rigorous evaluation results in the quartic equa-\ntion (5). Finally, we show that only the second largest solution of the quartic equation (5) lies within\nthe bounds, which completes the proof.\n\ne.g., by Ferrari\u2019s method [9] (we omit the details due to lack of space). Therefore, the global VB\nsolution can be analytically computed. 
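Footnote 3 suggests solving the quartic (5) numerically, e.g., by the `roots` function in MATLAB. A NumPy equivalent of that root-selection step (ours; the coefficients \u03be_3, \u03be_2, \u03be_1, \u03be_0 of Eq.(6) are taken as given inputs, not recomputed here) is:

```python
import numpy as np

def second_largest_real_root(xi3, xi2, xi1, xi0):
    """Second largest real solution of t^4 + xi3 t^3 + xi2 t^2 + xi1 t + xi0 = 0
    (Eq. (5)), found numerically via the companion matrix (np.roots)."""
    roots = np.roots([1.0, xi3, xi2, xi1, xi0])
    real = np.sort(roots[np.abs(roots.imag) < 1e-8].real)
    if len(real) < 2:
        raise ValueError("fewer than two real roots")
    return real[-2]
```

Ferrari's closed-form method [9] could be substituted for `np.roots`; either way the cost is constant per singular component, with no iterations or restarts.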
This is a strong advantage over the standard ICM algorithm\nsince many iterations and restarts would be necessary to \ufb01nd a good solution by ICM.\nBased on the above result, the complete VB posterior can also be obtained analytically as follows.\n\nCorollary 2 The VB posteriors are given by\n\nrA(A|V n) =\n\nwhere, forb\u03b3VB\n\nh\n\n, \u03a3bh),\n\nHY\n\nrB(B|V n) =\n\nNM (bh; \u00b5bh\n\nh\n\nh=1\n\n, \u03a3ah),\n\nNM (ah; \u00b5ah\n\nHY\nqb\u03b3VB\nb\u03b4h \u00b7 \u03c9ah, \u00b5bh\nbeing the solution given by Theorem 1,\n\u02c6\n\u2212\u00a1\nnb\u03b72\n= \u00b1\n\u02c6\n\u2212\u00a1\nnb\u03b72\nh + \u03c32(M \u2212 L)\nn(M \u2212 L)(\u03b3h \u2212b\u03b3VB\n(\nif \u03b3h >e\u03b3h,\n\n\u2212 \u03c32(M \u2212 L)\n\nh ) +\n\n2\u03c32M c\n\nh\n+\n\n\u22122\nah\n\nh\n\nh\n\nh\n\nh\n\nh=1\n\n\u22121\nh\n\nqb\u03b3VB\nb\u03b4\n!\np\n\u00a2\n\u2212 \u03c32(M \u2212 L))2 + 4M n\u03c32b\u03b72\n(nb\u03b72\n= \u00b1\n\u00b7 \u03c9bh,\nb\u03b4\n2nM(b\u03b3VB\n!\np\n\u00a2\n+\n(nb\u03b72\nh + \u03c32(M \u2212 L))2 + 4Ln\u03c32b\u03b72\nb\u03b4h + n\u22121\u03c32c\n2nL(b\u03b3VB\nq\nn2(M \u2212 L)2(\u03b3h \u2212b\u03b3VB\n\n\u22121\nh + n\u22121\u03c32c\n\n\u22122\nah )\n\n\u22122\nbh\n\n)\n\nh\n\nh\n\nh )2 + 4\u03c34LM\nc2\nbh\n\nc2\nah\n\n,\n\nIM ,\n\nIL,\n\n\u00b5ah\n\n\u03a3ah =\n\n\u03a3bh =\n\nb\u03b4h =\nb\u03b72\n\nh =\n\n\u03b72\nh\n\n\u03c32\n\nncah cbh\n\notherwise.\n\nWhen the noise variance \u03c32 is unknown, one may use the minimizer of the VB free energy with\nrespect to \u03c32 as its estimate. In practice, this single-parameter minimization may be carried out\nnumerically based on Eq.(4) and Corollary 2.\n\n4 Analytic-form Expression of Global Optimal Solution of Empirical VBMF\n\n, \u00b5bh\n\u2208 RM , \u00b5bh\n\nIn this section, we solve the following problem to obtain the EVBMF global solution:\nGiven \u03c32 \u2208 R++,\nmin FVB({\u00b5ah\ns.t. \u00b5ah\n\n++ (\u2200h = 1, . . . 
, H),\n\nwhere R^d_{++} denotes the set of the d-dimensional vectors with positive elements (that is, given \u03c3^2, F_{VB} is minimized over {\u00b5_{a_h} \u2208 R^M, \u00b5_{b_h} \u2208 R^L, \u03a3_{a_h} \u2208 S^M_{++}, \u03a3_{b_h} \u2208 S^L_{++}, (c^2_{a_h}, c^2_{b_h}) \u2208 R^2_{++}; h = 1, . . . , H}). We show that, although this is again a non-convex optimization problem, the global optimal solution can be obtained analytically. We can observe the invariance of the VB free energy (4) under the transform\n\n( \u00b5_{a_h}, \u00b5_{b_h}, \u03a3_{a_h}, \u03a3_{b_h}, c^2_{a_h}, c^2_{b_h} ) \u2192 ( s_h\u00b5_{a_h}, s_h^{-1}\u00b5_{b_h}, s_h^2\u03a3_{a_h}, s_h^{-2}\u03a3_{b_h}, s_h^2 c^2_{a_h}, s_h^{-2} c^2_{b_h} )\n\nfor any {s_h \u2260 0; h = 1, . . . , H}. 
Accordingly, we fix the ratios to c_{a_h}/c_{b_h} = S > 0, and refer to c_h := c_{a_h}c_{b_h} also as a hyperparameter.\nLet\n\nc\u0306_h^2 = (1/(2LM)) ( \u03b3_h^2 \u2212 (L + M)\u03c3^2/n + \u221a( (\u03b3_h^2 \u2212 (L + M)\u03c3^2/n)^2 \u2212 4LM\u03c3^4/n^2 ) ),   (7)\n\nand let \u03b3\u0332_h = (\u221aL + \u221aM)\u03c3/\u221an. Then, we have the following lemma:\n\nLemma 3 If \u03b3_h \u2265 \u03b3\u0332_h, the VB free energy function (4) can have two local minima, namely, c_h \u2192 0 and c_h = c\u0306_h. Otherwise, c_h \u2192 0 is the only local minimum of the VB free energy.\n\nSketch of proof: Analyzing the region where c_h is so small that the VB solution given c_h is \u03b3\u0302_h = 0, we find a local minimum c_h \u2192 0. Combining the stationary conditions (2) and (3), we derive a quadratic equation with respect to c_h^2 whose larger solution is given by Eq.(7). Showing that the smaller solution corresponds to saddle points completes the proof.\n\nFigure 2 shows the profiles of the VB free energy (4) when L = M = H = 1, n = 1, and \u03c3^2 = 1 for observations V = 1.5, 2.1, and 2.7. As illustrated, depending on the value of V, either c_h \u2192 0 or c_h = c\u0306_h is the global solution.\n\nFigure 2: Profiles of the VB free energy (4) when L = M = H = 1, n = 1, and \u03c3^2 = 1 for observations V = 1.5, 2.1, and 2.7. (a) When V = 1.5 < 2 = \u03b3\u0332_h, the VB free energy is monotone increasing and thus the global solution is given by c_h \u2192 0. (b) When V = 2.1 > 2 = \u03b3\u0332_h, a local minimum exists at c_h = c\u0306_h \u2248 1.37, but \u2206_h \u2248 0.12 > 0, so c_h \u2192 0 is still the global solution. (c) When V = 2.7 > 2 = \u03b3\u0332_h, \u2206_h \u2248 \u22120.74 \u2264 0 and thus the minimizer at c_h = c\u0306_h \u2248 2.26 is the global solution.\n\nLet\n\n\u2206_h := M log( (n\u03b3_h/(M\u03c3^2)) \u03b3\u0306^{VB}_h + 1 ) + L log( (n\u03b3_h/(L\u03c3^2)) \u03b3\u0306^{VB}_h + 1 ) + (n/\u03c3^2)( \u22122\u03b3_h\u03b3\u0306^{VB}_h + LM c\u0306_h^2 ),   (8)\n\nwhere \u03b3\u0306^{VB}_h is the VB solution for c_h = c\u0306_h. We can show that the sign of \u2206_h corresponds to that of the difference of the VB free energy at c_h = c\u0306_h and c_h \u2192 0. Then, we have the following theorem and corollary.\n\nTheorem 4 The hyperparameter c\u0302_h that globally minimizes the VB free energy function (4) is given by c\u0302_h = c\u0306_h if \u03b3_h > \u03b3\u0332_h and \u2206_h \u2264 0. 
Otherwise c\u0302_h \u2192 0.\n\nCorollary 5 The global EVB solution can be expressed as\n\n\u00db^{EVB} = \u2211_{h=1}^H \u03b3\u0302^{EVB}_h \u03c9_{b_h}\u03c9_{a_h}^\u22a4, where \u03b3\u0302^{EVB}_h := \u03b3\u0306^{VB}_h if \u03b3_h > \u03b3\u0332_h and \u2206_h \u2264 0, and \u03b3\u0302^{EVB}_h := 0 otherwise.\n\nSince the optimal hyperparameter value c\u0302_h can be expressed in a closed form, the global EVB solution can also be computed analytically using the result given in Section 3. This is again a strong advantage over the standard ICM algorithm since ICM would require many iterations and restarts to find a good solution.\n\n3In practice, one may solve the quartic equation numerically, e.g., by the 'roots' function in MATLAB\u00ae.\n\n5 Experiments\n\nIn this section, we experimentally evaluate the usefulness of our analytic-form solutions using artificial and benchmark datasets. The MATLAB\u00ae code will be available at [14].\n\n5.1 Artificial Dataset\n\nWe randomly created a true matrix V^* = \u2211_{h=1}^{H^*} b^*_h a^{*\u22a4}_h with L = 30, M = 100, and H^* = 10, where every element of {a^*_h, b^*_h} was drawn independently from the standard Gaussian distribution. We set n = 1, and an observation matrix V was created by adding independent Gaussian noise with variance \u03c3^2 = 1 to each element. We used the full-rank model, i.e., H = L = 30. The noise variance \u03c3^2 was assumed to be unknown, and estimated from data (see Section 2.2 and Section 3).\nWe first investigate the learning curve of the VB free energy over EVB-ICM iterations. We created the initial values of the EVB-ICM algorithm as follows: \u00b5_{a_h} and \u00b5_{b_h} were set to randomly created orthonormal vectors, and \u03a3_{a_h} and \u03a3_{b_h} were set to identity matrices multiplied by scalars \u03c3^2_{a_h} and \u03c3^2_{b_h}, respectively. \u03c3^2_{a_h} and \u03c3^2_{b_h}, as well as the noise variance \u03c3^2, were drawn from the \u03c7^2-distribution with degree-of-freedom one. 
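The artificial-data setup of Section 5.1 is easy to reproduce; the following sketch (ours; the function name and seed handling are our own) generates V^* and its noisy observation V as described in the text:

```python
import numpy as np

def make_artificial(L=30, M=100, H_star=10, sigma=1.0, seed=0):
    """Artificial dataset of Section 5.1:
    V* = sum_{h<=H*} b*_h a*_h^T with standard-Gaussian factors,
    observed as V = V* + Gaussian noise of variance sigma^2 (n = 1)."""
    rng = np.random.default_rng(seed)
    A_star = rng.standard_normal((M, H_star))  # columns a*_h
    B_star = rng.standard_normal((L, H_star))  # columns b*_h
    V_star = B_star @ A_star.T                 # true rank-H* matrix
    V = V_star + sigma * rng.standard_normal((L, M))
    return V, V_star
```

With these dimensions V^* has rank H^* = 10 almost surely, which is the structure the automatic relevance determination effect of EVB is shown to recover.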
10 learning curves of the VB free energy were plotted in Figure 3(a). The value of the VB free energy of the global solution computed by our analytic-form solution was also plotted in the graph by the dashed line. The graph shows that the EVB-ICM algorithm reduces the VB free energy reasonably well over iterations. However, for this artificial dataset, the convergence speed was quite slow once in 10 runs, which was actually trapped in a local minimum.\nNext, we compare the computation time. Figure 3(b) shows the computation time of EVB-ICM over iterations and of our analytic-form solution. The computation time of EVB-ICM grows almost linearly with respect to the number of iterations, and it took 86.6 [sec] for 100 iterations on average. On the other hand, the computation of our analytic-form solution took only 0.055 [sec] on average, including the single-parameter search for \u03c3^2. Thus, our method provides a reduction of computation time of four orders of magnitude, with better accuracy as a minimizer of the VB free energy.\nNext, we investigate the generalization error of the global analytic solutions of VB and EVB, measured by G = \u2225\u00db \u2212 V^*\u2225^2_{Fro}/(LM). Figure 3(c) shows the mean and error bars (min and max) over 10 runs for VB with various hyperparameter values and EVB. A single hyperparameter value was commonly used (i.e., c_1 = \u00b7\u00b7\u00b7 = c_H) in VB, while each hyperparameter c_h was separately optimized in EVB. The result shows that EVB gives slightly lower generalization errors than VB with the best common hyperparameter. Thus, automatic hyperparameter selection of EVB works quite well.\nFigure 3(d) shows the hyperparameter values chosen in EVB sorted in decreasing order. 
This shows that, for all 10 runs, c_h is positive for h \u2264 H^* (= 10) and zero for h > H^*. This implies that the effect of automatic relevance determination [16, 5] works excellently for this artificial dataset.\n\n5.2 Benchmark Dataset\n\nMF can be used for canonical correlation analysis (CCA) [8] and reduced rank regression (RRR) [19] with appropriately pre-whitened data. Here, we solve these tasks by VBMF and evaluate the performance using the concrete slump test dataset [28] available from the UCI repository [2].\nThe experimental results are depicted in Figure 4, which is in the same format as Figure 3. The results showed that trends similar to the artificial dataset can still be observed for the CCA task with the benchmark dataset (the RRR results are similar and thus omitted from the figure). Overall, the proposed global analytic solution is shown to be a useful alternative to the popular ICM algorithm.\n\nFigure 3: Experimental results for the artificial dataset ((a) VB free energy, (b) computation time, (c) generalization error, (d) hyperparameter value).\n\nFigure 4: Experimental results of CCA for the concrete slump test dataset ((a) VB free energy, (b) computation time, (c) generalization error, (d) hyperparameter value).\n\n6 Discussion and Conclusion\n\nOvercoming the non-convexity of VB methods has been one of the important challenges in the Bayesian machine learning community, since it sometimes prevented us from applying the VB methods to highly complex real-world problems. In this paper, we focused on the MF problem with no missing entry, and showed that this weakness could be overcome by computing the global optimal solution analytically. We further derived the global optimal solution analytically for the EVBMF method, where hyperparameters are also optimized based on data samples. 
Since no hand-tuning\nparameter remains in EVBMF, our analytic-form solution is practically useful and computationally\nhighly ef\ufb01cient. Numerical experiments showed that the proposed approach is promising.\n\u2192 \u221e, the priors get (almost) \ufb02at and the quartic equation (5) is factorized as\n\u201c1\u2212 \u03c32L\nWhen cahcbh\n\n\u2192\u221e fh(t) =\u201ct + M\n\nn\u03b32\ncah\nh\nTheorem 1 states that its second largest solution gives the VB estimator for \u03b3h > limcah cbh\n\n\u201d\u03b3h\u201d = 0.\n\u2192\u221ee\u03b3h =\n\n\u201d\u03b3h\u201d\u201ct \u2212 M\n\nlim\ncbh\n\n\u201c1\u2212 \u03c32L\n\nn\u03b32\nh\n\np\n\nL\n\nL\n\nM \u03c32/n. Thus we have\n\nn\u03b32\nh\n\n\u201d\u03b3h\u201d\u201ct +\u201c1\u2212 \u03c32M\n(cid:181)\n\u2192\u221eb\u03b3VB\n\nh = max\n\nn\u03b32\nh\n\n\u201d\u03b3h\u201d\u201ct \u2212\u201c1\u2212 \u03c32M\n\u00b6\u00b6\n(cid:181)\n1 \u2212 M \u03c32\nn\u03b32\nh\n\n0,\n\n\u03b3h.\n\nlim\ncah cbh\n\nThis is the positive-part James-Stein (PJS) shrinkage estimator [10], operated on each singular com-\nponent separately, and this coincides with the upper-bound derived in [15] for arbitrary cahcbh > 0.\nThe counter-intuitive fact\u2014a shrinkage is observed even in the limit of \ufb02at priors\u2014can be explained\nby strong non-uniformity of the volume element of the Fisher metric, i.e., the Jeffreys prior [11], in\nthe parameter space. We call this effect model-induced regularization (MIR), because it is induced\nnot by priors but by structure of model likelihood functions. MIR was shown to generally appear in\nBayesian estimation when the model is non-identi\ufb01able (i.e., the mapping between parameters and\ndistribution functions is not one-to-one) and the parameters are integrated out at least partially [26].\nThus, it never appears in MAP estimation [15]. The probabilistic PCA can be seen as an example\nof MF, where A and B correspond to latent variables and principal axes, respectively [24]. 
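The flat-prior limit stated above, a positive-part James-Stein shrinkage applied to each singular component, can be sketched as follows (the code and function name are ours, not the authors'; it assumes L \u2264 M as in the text):

```python
import numpy as np

def pjs_estimate(V, sigma2=1.0, n=1):
    """Flat-prior limit of the VB solution discussed in the text:
    each singular value gamma is replaced by
    max(0, 1 - M*sigma2/(n*gamma^2)) * gamma  (positive-part James-Stein)."""
    L, M = V.shape
    U, g, Vt = np.linalg.svd(V, full_matrices=False)
    g_safe = np.maximum(g, 1e-300)  # guard against exactly-zero singular values
    factor = np.maximum(0.0, 1.0 - M * sigma2 / (n * g_safe**2))
    return (U * (factor * g)) @ Vt  # scale each singular component separately
```

This makes the model-induced regularization effect concrete: even with (almost) flat priors, singular values below \u221a(M\u03c3^2/n) are truncated to zero and larger ones are shrunk.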
The MIR effect is observed in its analytic solution when A is integrated out and B is estimated to be the maximizer of the marginal likelihood.\nOur results fully made use of the assumptions that the likelihood and priors are both spherical Gaussian, the VB posterior is column-wise independent, and there exists no missing entry. They were necessary to solve the free energy minimization problem as a reweighted SVD. An important future work is to obtain the analytic global solution under milder assumptions. This will enable us to handle more challenging problems such as missing entry prediction [23, 20, 6, 13, 18, 22, 12, 25].\n\nAcknowledgments\n\nThe authors appreciate comments by anonymous reviewers, which helped improve our earlier manuscript and suggested promising directions for future work. MS thanks the support from the FIRST program. RT was partially supported by MEXT Kakenhi 22700138.\n\nReferences\n\n[1] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of International Conference on Machine Learning, pages 17\u201324, 2007.\n\n[2] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.\n\n[3] H. Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 21\u201330, San Francisco, CA, 1999. 
Morgan Kaufmann.\n\n[4] J. Besag. On the Statistical Analysis of Dirty Pictures. J. Royal Stat. Soc. B, 48:259\u2013302, 1986.\n[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA, 2006.\n[6] J.-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM\n\nJournal on Optimization, 20(4):1956\u20131982, 2008.\n\n[7] O. Chapelle and Z. Harchaoui. A Machine Learning Approach to Conjoint Analysis. In Advances in\n\nneural information processing systems, volume 17, pages 257\u2013264, 2005.\n\n[8] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor. Canonical correlation analysis: An overview\n\nwith application to learning methods. Neural Computation, 16(12):2639\u20132664, 2004.\n\n[9] M. Hazewinkel, editor. Encyclopaedia of Mathematics. Springer, 2002.\n[10] W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the 4th Berkeley Symposium on\nMathematical Statistics and Probability, volume 1, pages 361\u2013379. University of California Press, 1961.\n[11] H. Jeffreys. An Invariant Form for the Prior Probability in Estimation Problems. In Proceedings of the\n\nRoyal Society of London. Series A, volume 186, pages 453\u2013461, 1946.\n\n[12] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of Interna-\n\ntional Conference on Machine Learning, pages 457\u2013464, 2009.\n\n[13] Y. J. Lim and T. W. Teh. Variational Bayesian Approach to Movie Rating Prediction. In Proceedings of\n\nKDD Cup and Workshop, 2007.\n\n[14] S. Nakajima. Matlab Code for VBMF, http://sites.google.com/site/shinnkj23/, 2010.\n[15] S. Nakajima and M. Sugiyama. Implicit regularization in variational Bayesian matrix factorization. In\n\nProceedings of 27th International Conference on Machine Learning (ICML2010), 2010.\n\n[16] R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.\n[17] A. Paterek. 
Improving Regularized Singular Value Decomposition for Collaborative Filtering. In Pro-\n\nceedings of KDD Cup and Workshop, 2007.\n\n[18] T. Raiko, A. Ilin, and J. Karhunen. Principal Component Analysis for Large Sale Problems with Lots of\n\nMissing Values. In Proc. of ECML, volume 4701, pages 691\u2013698, 2007.\n\n[19] G. R. Reinsel and R. P. Velu. Multivariate reduced-rank Regression: Theory and Applications. Springer,\n\nNew York, 1998.\n\n[20] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction.\n\nIn Proceedings of the 22nd International Conference on Machine learning, pages 713\u2013719, 2005.\n\n[21] R. Rosipal and N. Kr\u00a8amer. Overview and recent advances in partial least squares. In Subspace, Latent\n\nStructure and Feature Selection Techniques, volume 3940, pages 34\u201351. Springer, 2006.\n\n[22] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In J.C. Platt, D. Koller, Y. Singer, and\n\nS. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1257\u20131264, 2008.\n\n[23] N. Srebro, J. Rennie, and T. Jaakkola. Maximum Margin Matrix Factorization. In Advances in NIPS,\n\nvolume 17, 2005.\n\n[24] M.E. Tipping and C.M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Sta-\n\ntistical Society: Series B, 61(3):611\u2013622, 1999.\n\n[25] R. Tomioka, T. Suzuki, M. Sugiyama, and H. Kashima. An ef\ufb01cient and general augmented Lagrangian\nalgorithm for learning low-rank matrices. In Proceedings of International Conference on Machine Learn-\ning, 2010.\n\n[26] S. Watanabe. Algebraic Geometry and Statistical Learning. Cambridge University Press, Cambridge,\n\nUK, 2009.\n\n[27] K. J. Worsley, J-B. Poline, K. J. Friston, and A. C. Evanss. Characterizing the Response of PET and fMRI\n\nData Using Multivariate Linear Models. NeuroImage, 6(4):305\u2013319, 1997.\n\n[28] I-Cheng Yeh. 
Modeling slump flow of concrete using second-order regressions and artificial neural networks. Cement and Concrete Composites, 29(6):474\u2013480, 2007.\n\n[29] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian Processes from Multiple Tasks. In Proc. of ICML, page 1019, 2005.", "award": [], "sourceid": 1082, "authors": [{"given_name": "Shinichi", "family_name": "Nakajima", "institution": null}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": null}, {"given_name": "Ryota", "family_name": "Tomioka", "institution": null}]}