{"title": "Online Optimization for Max-Norm Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1718, "page_last": 1726, "abstract": "The max-norm regularizer has been extensively studied over the last decade, as it promotes an effective low-rank estimate of the underlying data. However, max-norm regularized problems are typically formulated and solved in a batch manner, which prevents them from processing big data due to a possible memory bottleneck. In this paper, we propose an online algorithm for solving max-norm regularized problems that is scalable to large problems. In particular, we consider the matrix decomposition problem as an example, although our analysis can also be applied to other problems such as matrix completion. The key technique in our algorithm is to reformulate the max-norm into a matrix factorization form, consisting of a basis component and a coefficient component. In this way, we can solve for the optimal basis and coefficients alternately. We prove that the basis produced by our algorithm converges to a stationary point asymptotically. Experiments demonstrate encouraging results for the effectiveness and robustness of our algorithm. See the full paper at arXiv:1406.3190.", "full_text": "Online Optimization for Max-Norm Regularization

Jie Shen
Dept. of Computer Science
Rutgers University
Piscataway, NJ 08854
js2007@rutgers.edu

Huan Xu
Dept. of Mech. Engineering
National Univ. of Singapore
Singapore 117575
mpexuh@nus.edu.sg

Ping Li
Dept. of Statistics
Dept. of Computer Science
Rutgers University
pingli@stat.rutgers.edu

Abstract

The max-norm regularizer has been extensively studied over the last decade, as it promotes an effective low-rank estimate of the underlying data. 
However, max-norm regularized problems are typically formulated and solved in a batch manner, which prevents them from processing big data due to a possible memory bottleneck. In this paper, we propose an online algorithm for solving max-norm regularized problems that is scalable to large problems. In particular, we consider the matrix decomposition problem as an example, although our analysis can also be applied to other problems such as matrix completion. The key technique in our algorithm is to reformulate the max-norm into a matrix factorization form, consisting of a basis component and a coefficient component. In this way, we can solve for the optimal basis and coefficients alternately. We prove that the basis produced by our algorithm converges to a stationary point asymptotically. Experiments demonstrate encouraging results for the effectiveness and robustness of our algorithm. See the full paper at arXiv:1406.3190.

1 Introduction

Over the last decade, estimating low-rank matrices has attracted increasing attention in the machine learning community owing to its successful applications in a wide range of domains, including subspace clustering [13], collaborative filtering [9] and visual texture analysis [25], to name a few. Suppose that we are given an observed data matrix Z of size p × n, i.e., n observations in p ambient dimensions, with each observation i.i.d. sampled from some unknown distribution. We aim to learn a prediction matrix X with a low-rank structure to approximate Z. This problem, together with its many variants, typically involves minimizing a weighted combination of the residual error and a matrix rank regularization term.

Generally speaking, it is intractable to optimize the matrix rank [15]. To tackle this challenge, researchers have suggested alternative convex relaxations of the matrix rank. The two most widely used convex surrogates are the nuclear norm¹ [15] and the max-norm² [19]. 
In the work of [6], Candès et al. proved that under mild conditions, solving a convex optimization problem consisting of a nuclear norm regularization and a weighted ℓ1 norm penalty can exactly recover the low-rank component of the underlying data, even if a constant fraction of the entries are arbitrarily corrupted. In [20], Srebro and Shraibman studied collaborative filtering and proved that the max-norm regularization formulation enjoys a lower generalization error than the nuclear norm. Moreover, the max-norm has been shown to empirically outperform the nuclear norm in certain practical applications as well [11, 12].

To optimize a max-norm regularized problem, however, the algorithms proposed in prior work [12, 16, 19] require access to all the data. In a large-scale setting, the applicability of such batch optimization methods is hindered by the memory bottleneck. In this paper, by utilizing the matrix factorization form of the max-norm, we propose an online algorithm to solve max-norm regularized problems. The main advantage of online algorithms is that their memory cost is independent of the sample size, which makes them a good fit for the big data era [14, 18]. Specifically, we are interested in the max-norm regularized matrix decomposition (MRMD) problem. 

¹ Also known as the trace norm, the Ky-Fan n-norm and the Schatten 1-norm.
² Also known as the γ2-norm.

Assuming that the observed data matrix Z can be decomposed into a low-rank component X and a sparse component E, we aim to simultaneously and accurately estimate the two components by solving the following convex program:

min_{X,E}  (1/2)∥Z − X − E∥²_F + (λ₁/2)∥X∥²_max + λ₂∥E∥_{1,1}.    (1.1)

Here ∥·∥_F denotes the Frobenius norm, ∥·∥_max is the max-norm (which promotes low rank), ∥·∥_{1,1} is the ℓ1 norm of a matrix seen as a long vector, and λ₁ and λ₂ are two non-negative parameters.

Our main contributions are two-fold: 1) We develop an online method to solve the MRMD problem, making it scalable to big data. 2) We prove that the solutions produced by our algorithm converge to a stationary point asymptotically.

1.1 Connection to Matrix Completion

While we mainly focus on the matrix decomposition problem, our method can be extended to the matrix completion (MC) problem [4, 7] with max-norm regularization [5], which is another popular topic in machine learning and signal processing. The MC problem can be described as follows:

min_X  (1/2)∥P_Ω(Z − X)∥²_F + (λ/2)∥X∥²_max,

where Ω is the set of indices of observed entries in Z and P_Ω(M) is the orthogonal projector onto the span of matrices vanishing outside of Ω, so that the (i, j)-th entry of P_Ω(M) equals M_ij if (i, j) ∈ Ω and zero otherwise. Interestingly, the max-norm regularized MC problem can be cast into our framework. To see this, let us introduce an auxiliary matrix M, with M_ij = C > 0 if (i, j) ∈ Ω and M_ij = 1/C otherwise. 
The following reformulated MC problem,

min_{X,E}  (1/2)∥Z − X − E∥²_F + (λ/2)∥X∥²_max + ∥M ◦ E∥_{1,1},

where "◦" denotes the entry-wise product, is equivalent to our MRMD formulation (1.1). Furthermore, as C tends to infinity, the reformulated problem converges to the original MC problem.

1.2 Related Work

Here we discuss some relevant work in the literature. Most previous work on the max-norm focused on showing that the max-norm is empirically superior to the nuclear norm in a wide range of applications, such as collaborative filtering [19] and clustering [11]. Moreover, in [17], Salakhutdinov and Srebro studied the influence of the data distribution on max-norm regularization and observed good performance even when the data were sampled non-uniformly.

There are also studies investigating the connection between the max-norm and the nuclear norm. A comprehensive study of this problem, in the context of collaborative filtering, can be found in [20], which established and compared the generalization bounds for nuclear norm regularization and max-norm regularization, and showed that the generalization bound of the max-norm regularization scheme is superior. More recently, Foygel et al. [9] attempted to unify the nuclear norm and max-norm to gain further insight into these two important regularization schemes.

Few works have developed efficient algorithms for solving max-norm regularized problems, particularly large-scale ones. Rennie and Srebro [16] devised a gradient-based optimization method and empirically showed promising results on large collaborative filtering datasets. In [12], the authors presented large-scale optimization methods for max-norm constrained and max-norm regularized problems with a theoretical guarantee of convergence to a stationary point. 
Nevertheless, all those methods were formulated in a batch manner, which can be hindered by the memory bottleneck.

From a high level, the goal of this paper is similar to that of [8]. Motivated by the celebrated Robust Principal Component Analysis (RPCA) problem [6, 23, 24], the authors of [8] developed an online implementation of nuclear-norm regularized matrix decomposition. Yet, since the max-norm is a much more complicated mathematical entity (e.g., even the subgradient of the max-norm is, to the best of our knowledge, not completely characterized), new techniques and insights are needed in order to develop online methods for max-norm regularization. For example, after replacing the max-norm with its matrix factorization form, the data are still coupled, and we propose to convert the problem to a constrained one for stochastic optimization.

The main technical contribution of this paper is to convert max-norm regularization to an appropriate matrix factorization problem amenable to online implementation. Part of our proof ideas are inspired by [14], which also studied online matrix factorization. In contrast to [14], our formulation contains an additive sparse noise matrix, which enjoys the benefit of robustness to sparse contamination. Our proof techniques are also different. For example, to prove the convergence of the dictionary and to make their problem well defined, [14] needs to assume that the magnitude of the learned dictionary is constrained. In contrast, in our setup we prove that the optimal basis is uniformly bounded, and hence our problem is naturally well defined.

2 Problem Setup

We first introduce our notation. We use bold letters to denote vectors. The i-th row and j-th column of a matrix M are denoted by m(i) and m_j, respectively. The ℓ1 norm and ℓ2 norm of a vector v are denoted by ∥v∥₁ and ∥v∥₂, respectively. 
The ℓ2,∞ norm of a matrix is defined as the maximum ℓ2 norm of its rows. Finally, the trace of a square matrix M is denoted by Tr(M).

We are interested in developing an online algorithm for the MRMD Problem (1.1). By using the matrix factorization form of the max-norm [19],

∥X∥_max := min_{L,R} { ∥L∥_{2,∞} · ∥R∥_{2,∞} : X = LR⊤, L ∈ ℝ^{p×d}, R ∈ ℝ^{n×d} },    (2.1)

where d is the intrinsic dimension of the underlying data, we can rewrite Problem (1.1) in the following equivalent form:

min_{L,R,E}  (1/2)∥Z − LR⊤ − E∥²_F + (λ₁/2)∥L∥²_{2,∞}∥R∥²_{2,∞} + λ₂∥E∥_{1,1}.    (2.2)

Intuitively, the variable L corresponds to a basis and the variable R is a coefficient matrix, with each row holding the coefficients of one sample. At first sight, the problem can only be optimized in a batch manner, as the term ∥R∥²_{2,∞} couples all the samples. In other words, to compute the optimal coefficients of the i-th sample, we would need the subgradient of ∥R∥_{2,∞}, which requires access to all the data. Fortunately, we have the following proposition that removes the inter-dependency among samples.

Proposition 2.1. Problem (2.2) is equivalent to the following constrained program:

minimize_{L,R,E}  (1/2)∥Z − LR⊤ − E∥²_F + (λ₁/2)∥L∥²_{2,∞} + λ₂∥E∥_{1,1},
subject to  ∥R∥²_{2,∞} = 1.    (2.3)

Proposition 2.1 states that our primal MRMD problem can be transformed into an equivalent constrained one. In the new formulation (2.3), the coefficients of each individual sample (i.e., each row of the matrix R) are uniformly constrained. Thus, the samples are decoupled. 
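As an aside, the ℓ2,∞ norm and the rescaling argument behind Proposition 2.1 are easy to check numerically. The following minimal numpy sketch (the sizes and random data are our own arbitrary choices, not from the paper) moves the row scale of R into L, leaving the product LR⊤ unchanged while enforcing ∥R∥_{2,∞} = 1:

```python
import numpy as np

def l2_inf_norm(M):
    """The l2,inf norm of M: the maximum l2 norm over its rows."""
    return np.linalg.norm(M, axis=1).max()

# Arbitrary illustrative sizes: p ambient dims, n samples, intrinsic dim d.
rng = np.random.default_rng(0)
p, n, d = 40, 100, 5
L = rng.standard_normal((p, d))  # basis, p x d
R = rng.standard_normal((n, d))  # coefficients, n x d

# Move the row scale of R into L: the product L R^T (and hence the residual
# term) is unchanged, while the rescaled R satisfies ||R||_{2,inf} = 1,
# which is exactly the constraint of the program (2.3).
s = l2_inf_norm(R)
L_c, R_c = L * s, R / s

assert np.allclose(L @ R.T, L_c @ R_c.T)
assert np.isclose(l2_inf_norm(R_c), 1.0)
# The penalty l2inf(L)^2 * l2inf(R)^2 of (2.2) becomes l2inf(L_c)^2 of (2.3).
assert np.isclose(l2_inf_norm(L_c) ** 2, (l2_inf_norm(L) * l2_inf_norm(R)) ** 2)
```

Note that for any particular factorization X = LR⊤, the product ∥L∥_{2,∞}∥R∥_{2,∞} only upper-bounds ∥X∥_max, since the max-norm in (2.1) takes the minimum over all factorizations.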
Consequently, equipped with Proposition 2.1, we can rewrite the original problem in an online fashion, with each sample processed separately:

minimize_{L,R,E}  (1/2) Σ_{i=1}^n ∥z_i − L r_i − e_i∥²₂ + (λ₁/2)∥L∥²_{2,∞} + λ₂ Σ_{i=1}^n ∥e_i∥₁,
subject to  ∥r_i∥²₂ ≤ 1, ∀ i ∈ {1, 2, . . . , n},

where z_i is the i-th observed sample, r_i is its coefficient vector and e_i is the sparse error. Combining the first and third terms above, we have

minimize_{L,R,E}  Σ_{i=1}^n ℓ̃(z_i, L, r_i, e_i) + (λ₁/2)∥L∥²_{2,∞},  subject to ∥r_i∥²₂ ≤ 1, ∀ i ∈ {1, 2, . . . , n},    (2.4)

where

ℓ̃(z, L, r, e) := (1/2)∥z − L r − e∥²₂ + λ₂∥e∥₁.    (2.5)

This is indeed equivalent to optimizing (i.e., minimizing) the empirical loss function

f_n(L) := (1/n) Σ_{i=1}^n ℓ(z_i, L) + (λ₁/2n)∥L∥²_{2,∞},    (2.6)

where

ℓ(z, L) = min_{r, e : ∥r∥²₂ ≤ 1} ℓ̃(z, L, r, e).    (2.7)

When n goes to infinity, the empirical loss converges to the expected loss, defined as

f(L) = lim_{n→+∞} f_n(L) = E_z[ℓ(z, L)].    (2.8)

3 Algorithm

We now present our online implementation for solving the MRMD problem; the details are listed in Algorithm 1. We first briefly explain the underlying intuition: we optimize the coefficients r, the sparse noise e and the basis L in an alternating manner, which is known to be a successful strategy [8, 10, 14]. At the t-th iteration, given the basis L_{t−1}, we can optimize over r and e by examining the Karush-Kuhn-Tucker (KKT) conditions. 
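To make the alternating scheme concrete, here is a minimal numpy sketch of the loop. It is illustrative only: in place of the exact KKT-based solver for (r, e) it alternates a least-squares step in r with projection onto the unit ℓ2 ball and a soft-thresholding step in e, and in place of full block coordinate descent it takes a single gradient step on the smooth part of the surrogate (the λ₁ penalty term is omitted); all names, step sizes and iteration counts are our own assumptions, not the paper's implementation:

```python
import numpy as np

def soft_threshold(x, tau):
    """Closed-form minimizer of 0.5*(v - e)^2 + tau*|e|, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def solve_r_e(z, L, lam2, n_iter=50):
    """Approximate the per-sample problem
    min_{||r||_2 <= 1, e} 0.5*||z - L r - e||_2^2 + lam2*||e||_1
    by alternating minimization (a heuristic stand-in for the KKT solver)."""
    d = L.shape[1]
    r, e = np.zeros(d), np.zeros_like(z)
    G = L.T @ L + 1e-8 * np.eye(d)              # small ridge for stability
    for _ in range(n_iter):
        r = np.linalg.solve(G, L.T @ (z - e))   # least squares in r
        nrm = np.linalg.norm(r)
        if nrm > 1.0:
            r /= nrm                            # project onto the unit l2 ball
        e = soft_threshold(z - L @ r, lam2)     # exact update in e
    return r, e

def online_mrmd(Z, d, lam2, step=0.1, seed=0):
    """Sketch of the online loop: stream the samples, accumulate A_t and B_t,
    then take one gradient step on the smooth part of the surrogate, whose
    gradient is (1/t) * (L A_t - B_t)."""
    p, n = Z.shape
    rng = np.random.default_rng(seed)
    L = rng.standard_normal((p, d)) / np.sqrt(p)
    A, B = np.zeros((d, d)), np.zeros((p, d))
    for t in range(1, n + 1):
        z = Z[:, t - 1]
        r, e = solve_r_e(z, L, lam2)
        A += np.outer(r, r)                     # A_t = A_{t-1} + r_t r_t^T
        B += np.outer(z - e, r)                 # B_t = B_{t-1} + (z_t - e_t) r_t^T
        L -= step * (L @ A - B) / t
    return L
```

The memory footprint of the loop is O(pd + d²), independent of the number of samples n, which is the point of the online formulation.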
To update the basis L_t, we then optimize the following objective function:

g_t(L) := (1/t) Σ_{i=1}^t ℓ̃(z_i, L, r_i, e_i) + (λ₁/2t)∥L∥²_{2,∞},    (3.1)

where {r_i}_{i=1}^t and {e_i}_{i=1}^t have been computed in previous iterations. It is easy to verify that Eq. (3.1) is a surrogate function of the empirical cost function f_t(L) defined in Eq. (2.6). The basis L_t can be optimized by block coordinate descent, with L_{t−1} serving as a warm start for efficiency.

4 Main Theoretical Results and Proof Outline

In this section we present our main theoretical result regarding the validity of the proposed algorithm. We first discuss some necessary assumptions.

4.1 Assumptions

1. The observed data are i.i.d. samples from a distribution with compact support Z.
2. The surrogate functions g_t(L) in Eq. (3.1) are strongly convex. In particular, we assume that the smallest eigenvalue of the positive semi-definite matrix (1/t)A_t defined in Algorithm 1 is not smaller than some positive constant β₁. Note that we can easily enforce this assumption by adding a term (β₁/2)∥L∥²_F to g_t(L).
3. The minimizer of Problem (2.7) is unique. Notice that ℓ̃(z, L, r, e) is strongly convex w.r.t. e and convex w.r.t. r. Hence, we can easily enforce this assumption by adding a term γ∥r∥²₂, where γ is a small positive constant.

4.2 Main Theorem

The following theorem is the main theoretical result of this work. It states that as t tends to infinity, the basis L_t produced by Algorithm 1 converges to a stationary point.

Theorem 4.1 (Convergence to a stationary point of L_t). Assume 1, 2 and 3. 
Given that the intrinsic dimension of the underlying data is d, the optimal basis L_t produced by Algorithm 1 asymptotically converges to a stationary point of Problem (2.8) as t tends to infinity.

Algorithm 1 Online Max-Norm Regularized Matrix Decomposition
Input: Z ∈ ℝ^{p×n} (observed samples), parameters λ₁ and λ₂, L₀ ∈ ℝ^{p×d} (initial basis), zero matrices A₀ ∈ ℝ^{d×d} and B₀ ∈ ℝ^{p×d}
Output: optimal basis L_t
1: for t = 1 to n do
2:   Access the t-th sample z_t.
3:   Compute the coefficients and noise:
       {r_t, e_t} = arg min_{r, e : ∥r∥²₂ ≤ 1} ℓ̃(z_t, L_{t−1}, r, e).    (3.2)
4:   Compute the accumulation matrices A_t and B_t:
       A_t ← A_{t−1} + r_t r_t⊤,
       B_t ← B_{t−1} + (z_t − e_t) r_t⊤.
5:   Compute the basis L_t by optimizing the surrogate function (3.1):
       L_t = arg min_L (1/t) Σ_{i=1}^t ℓ̃(z_i, L, r_i, e_i) + (λ₁/2t)∥L∥²_{2,∞}
           = arg min_L (1/t) ( (1/2) Tr(L⊤ L A_t) − Tr(L⊤ B_t) ) + (λ₁/2t)∥L∥²_{2,∞}.    (3.3)
6: end for

4.3 Proof Outline for Theorem 4.1

The essential tools for our analysis come from stochastic approximation [3] and asymptotic statistics [21]. There are three main steps in our proof:
(I) We show that the positive stochastic process g_t(L_t) defined in Eq. (3.1) converges almost surely.
(II) Then we prove that the empirical loss function f_t(L_t) defined in Eq. (2.6) converges almost surely to the same limit as its surrogate g_t(L_t). By the central limit theorem, we can expect that f_t(L_t) also converges almost surely to the expected loss f(L_t) defined in Eq. (2.8), implying that g_t(L_t) and f(L_t) converge to the same limit.
(III) Finally, a simple Taylor expansion justifies that the gradient of f(L) taken at L_t vanishes as t tends to infinity, which concludes Theorem 4.1.

Theorem 4.2 (Convergence of the surrogate function g_t(L_t)). The surrogate function g_t(L_t) defined in Eq. (3.1) converges almost surely, where L_t is the solution produced by Algorithm 1.

To establish the convergence of g_t(L_t), we verify that g_t(L_t) is a quasi-martingale [3] that converges almost surely. To this end, we show that the expectation of the difference between g_{t+1}(L_{t+1}) and g_t(L_t) can be upper bounded by a family of functions ℓ(·, L) indexed by L ∈ 𝓛, where 𝓛 is a compact set. We then show that this family of functions satisfies the hypotheses of the corollary of the Donsker Theorem [21] and thus can be uniformly upper bounded. Therefore, we conclude that g_t(L_t) is a quasi-martingale and converges almost surely.

Now let us verify the hypotheses of the corollary of the Donsker Theorem. First we prove that the index set 𝓛 is uniformly bounded.

Proposition 4.3. Let r_t, e_t and L_t be the optimal solutions produced by Algorithm 1. Then,
1. The optimal solutions r_t and e_t are uniformly bounded.
2. The matrices (1/t)A_t and (1/t)B_t are uniformly bounded.
3. There exists a compact set 𝓛 such that L_t ∈ 𝓛 for all L_t produced by Algorithm 1. That is, there exists a positive constant L_max, uniform over t, such that ∥L_t∥ ≤ L_max for all t > 0.

To prove the third claim (which is required for our proof of the convergence of g_t(L_t)), we first prove that r_t, e_t, (1/t)A_t and (1/t)B_t are uniformly bounded for all t > 0, which can easily be verified. Then, by utilizing the first-order optimality condition of Problem (3.3), we can build an equation that connects L_t with the four items mentioned in the first and second claims. 
From Assumption 2, we know that the nuclear norm of (1/t)A_t is uniformly lower bounded. This property provides the means to show that L_t is uniformly upper bounded. Note that in [8, 14], both papers assumed that the dictionary (or basis) is uniformly bounded. In contrast, in the third claim of Proposition 4.3, we prove that this condition naturally holds for our problem.

Next, we show that the family of functions ℓ(z, ·) is uniformly Lipschitz w.r.t. L.

Proposition 4.4. Let L ∈ 𝓛 and denote the minimizer of the problem in (2.7) as:

{r*, e*} = arg min_{r, e : ∥r∥²₂ ≤ 1}  (1/2)∥z − Lr − e∥²₂ + λ₂∥e∥₁.

Then the function ℓ(z, L) defined in Problem (2.7) is continuously differentiable and

∇_L ℓ(z, L) = (Lr* + e* − z) r*⊤.

Furthermore, ℓ(z, ·) is uniformly Lipschitz and bounded.

By utilizing the corollary of Theorem 4.1 from [2], we can verify the differentiability of ℓ(z, L) and the form of its gradient. As all the terms in the gradient are uniformly bounded (Assumption 1 and Proposition 4.3), we can show that ℓ(z, L) is uniformly Lipschitz and bounded.

Based on Propositions 4.3 and 4.4, we verify that all the hypotheses of the corollary of the Donsker Theorem [21] are satisfied. This implies the convergence of g_t(L_t). We now move to step (II).

Theorem 4.5 (Convergence of f(L_t)). Let f(L_t) be the expected loss function defined in Eq. (2.8) and L_t the solution produced by Algorithm 1. Then,
1. g_t(L_t) − f_t(L_t) converges almost surely to 0.
2. f_t(L_t) defined in Eq. (2.6) converges almost surely.
3. f(L_t) converges almost surely to the same limit as f_t(L_t).

We apply Lemma 8 from [14] to prove the first claim. We denote the difference of g_t(L_t) and f_t(L_t) by b_t. First we show that b_t is uniformly Lipschitz. 
Then we show that the difference between L_{t+1} and L_t is O(1/t) (Proposition 4.6 below), making b_{t+1} − b_t uniformly upper bounded by O(1/t). Finally, we verify the convergence of the summation of the series {(1/t)b_t}_{t=1}^∞. Thus, Lemma 8 from [14] applies.

Proposition 4.6. Let {L_t} be the basis sequence produced by Algorithm 1. Then,

∥L_{t+1} − L_t∥_F = O(1/t).    (4.1)

Proposition 4.6 can be proved by combining the strong convexity of g_t(L) (Assumption 2 in Section 4.1) and the Lipschitz property of g_t(L); see the full paper for details.

Equipped with Proposition 4.6, we can verify that the difference of the sequence b_t = g_t(L_t) − f_t(L_t) can be upper bounded by O(1/t). The convergence of the summation of the series {(1/t)b_t}_{t=1}^∞ can be examined by the expectation convergence property of the quasi-martingale g_t(L_t), stated in [3]. Applying Lemma 8 from [14], we conclude that g_t(L_t) − f_t(L_t) converges to zero almost surely.

Once the first claim of Theorem 4.5 is proved, the second claim follows immediately, as g_t(L_t) converges a.s. (Theorem 4.2). By the central limit theorem, the third claim can be verified.

According to Theorem 4.5, g_t(L_t) and f(L_t) converge to the same limit a.s. As t tends to infinity, since L_t is uniformly bounded (Proposition 4.3), the term (λ₁/2t)∥L_t∥²_{2,∞} in g_t(L_t) vanishes, and g_t(L_t) becomes differentiable. On the other hand, we have the following proposition about the gradient of f(L).

Proposition 4.7 (Gradient of f(L)). Let f(L) be the expected loss function defined in Eq. (2.8). Then f(L) is continuously differentiable and ∇f(L) = E_z[∇_L ℓ(z, L)]. Moreover, ∇f(L) is uniformly Lipschitz on 𝓛.

Thus, taking a first-order Taylor expansion of f(L_t) and g_t(L_t), we can show that the gradient of f(L_t) equals that of g_t(L_t) when t tends to infinity. 
Since L_t is the minimizer of g_t(L), we know that the gradient of f(L_t) vanishes. Therefore, we have proved Theorem 4.1.

5 Experiments

In this section, we report simulation results on synthetic data that demonstrate the effectiveness and robustness of our online max-norm regularized matrix decomposition (OMRMD) algorithm.

Data Generation. The simulation data are generated following a procedure similar to that of [6]. The clean data matrix X is produced by X = UV⊤, where U ∈ ℝ^{p×d} and V ∈ ℝ^{n×d}. The entries of U and V are i.i.d. samples from the Gaussian distribution N(0, 1). We introduce a parameter ρ to control the sparsity of the corruption matrix E, i.e., a ρ-fraction of the entries are non-zero, each following an i.i.d. uniform distribution over [−1000, 1000]. Finally, the observation matrix Z is produced by Z = X + E.

Evaluation Metric. Our goal is to estimate the correct subspace of the underlying data. Here, we evaluate the fitness of our estimated subspace basis L against the ground-truth basis U by the Expressed Variance (EV) [22]:

EV(U, L) := Tr(L⊤UU⊤L) / Tr(UU⊤).

EV values range in [0, 1], and a higher EV value indicates more accurate subspace recovery.

Other Settings. Throughout the experiments, we set the ambient dimension p = 400 and the total number of samples n = 5000 unless otherwise specified. We fix the tunable parameters λ₁ = λ₂ = 1/√p, and use default parameters for all baseline algorithms we compare with. Each experiment is repeated 10 times and we report the averaged EV as the result.

(a) OMRMD    (b) OR-PCA
Figure 1: Performance of subspace recovery under different rank and corruption fractions. Brighter color means better performance.

We first study the effectiveness of the algorithm, measured by the EV value of its output after the last sample, and compare it to the nuclear-norm based online RPCA (OR-PCA) algorithm [8]. Specifically, we vary the intrinsic dimension d from 0.02p to 0.5p, with a step size of 0.04p, and the corruption fraction ρ from 0.02 to 0.5, with a step size of 0.04. The results are reported in Figure 1, where brighter color means higher EV (hence better performance). We observe that for easier tasks (i.e., when corruption and rank are low), both algorithms perform comparably. On the other hand, for more difficult cases, OMRMD outperforms OR-PCA. This is possibly because the max-norm is a tighter approximation to the matrix rank.

We next study the convergence of OMRMD by plotting the EV curve against the number of samples. Besides OR-PCA, we also add Principal Component Pursuit (PCP) [6] and an online PCA algorithm [1] as baselines. The results are reported in Figure 2. As expected, PCP achieves the best performance since it is a batch method and accesses all the data throughout the algorithm. Online PCA degrades significantly even with low corruption (Figure 2a). OMRMD is comparable to OR-PCA when the corruption is low (Figure 2a), and converges significantly faster when the corruption is high (Figures 2b and 2c). Indeed, this holds even with high dimension and as many as 100,000 samples (Figure 2d). This observation agrees with Figure 1, and again suggests that under heavy corruption, the max-norm may be a better fit than the nuclear norm. Additional experimental results are available in the full paper.

(a) ρ = 0.01    (b) ρ = 0.3    (c) ρ = 0.5    (d) p = 3000, d = 300, ρ = 0.3
Figure 2: EV value against number of samples. p = 400 and d = 80 in (a) to (c).

6 Conclusion

In this paper, we developed an online algorithm for the max-norm regularized matrix decomposition problem. 
Using the matrix factorization form of the max-norm, we convert the original problem to a constrained one, which facilitates an online implementation for solving the original problem. We established theoretical guarantees that the solutions converge to a stationary point asymptotically. Moreover, we empirically compared our proposed algorithm with OR-PCA, a recently proposed online algorithm for nuclear-norm based matrix decomposition. The simulation results suggest that the proposed algorithm outperforms OR-PCA, in particular for harder tasks (i.e., when a large fraction of the entries are corrupted). Our experiments, to an extent, empirically suggest that the max-norm might be a tighter relaxation of the rank function than the nuclear norm.

Acknowledgments

The research of Jie Shen and Ping Li is partially supported by NSF-DMS-1444124, NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Part of the work of Jie Shen was conducted at Shanghai Jiao Tong University. The work of Huan Xu is partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265-000-443-112.

References
[1] Matej Artac, Matjaz Jogan, and Ales Leonardis. Incremental PCA for on-line visual learning and recognition. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 3, pages 781–784. IEEE, 2002.
[2] J. Frédéric Bonnans and Alexander Shapiro. Optimization problems with perturbations: A guided tour. SIAM Review, 40(2):228–264, 1998.
[3] Léon Bottou. Online learning and stochastic approximations. 
On-line Learning in Neural Networks, 17(9), 1998.
[4] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[5] Tony Cai and Wen-Xin Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. Journal of Machine Learning Research, 14:3619–3647, 2014.
[6] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[7] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[8] Jiashi Feng, Huan Xu, and Shuicheng Yan. Online robust PCA via stochastic optimization. In Advances in Neural Information Processing Systems, pages 404–412, 2013.
[9] Rina Foygel, Nathan Srebro, and Ruslan Salakhutdinov. Matrix reconstruction with the local max norm. In NIPS, pages 944–952, 2012.
[10] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
[11] Ali Jalali and Nathan Srebro. Clustering using max-norm constrained optimization. In ICML, 2012.
[12] Jason D. Lee, Ben Recht, Ruslan Salakhutdinov, Nathan Srebro, and Joel A. Tropp. Practical large-scale optimization for max-norm regularization. In NIPS, pages 1297–1305, 2010.
[13] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184, 2013.
[14] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. 
Journal of Machine Learning Research, 11:19–60, 2010.
[15] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[16] Jasson D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713–719. ACM, 2005.
[17] Ruslan Salakhutdinov and Nathan Srebro. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In NIPS, 2010.
[18] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
[19] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. In NIPS, volume 17, pages 1329–1336, 2004.
[20] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In Learning Theory, pages 545–560. Springer, 2005.
[21] Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
[22] Huan Xu, Constantine Caramanis, and Shie Mannor. Principal component analysis with contaminated data: The high dimensional case. In COLT, pages 490–502, 2010.
[23] Huan Xu, Constantine Caramanis, and Shie Mannor. Outlier-robust PCA: the high-dimensional case. IEEE Transactions on Information Theory, 59(1):546–572, 2013.
[24] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.
[25] Zhengdong Zhang, Arvind Ganesh, Xiao Liang, and Yi Ma. 
TILT: Transform invariant low-rank textures. International Journal of Computer Vision, 99(1):1–24, 2012.
", "award": [], "sourceid": 902, "authors": [{"given_name": "Jie", "family_name": "Shen", "institution": "Rutgers University"}, {"given_name": "Huan", "family_name": "Xu", "institution": "NUS"}, {"given_name": "Ping", "family_name": "Li", "institution": "Rutgers University"}]}