{"title": "Beyond Alternating Updates for Matrix Factorization with Inertial Bregman Proximal Gradient Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 4266, "page_last": 4276, "abstract": "Matrix Factorization is a popular non-convex optimization problem, for which alternating minimization schemes are mostly used. They usually suffer from the major drawback that the solution is biased towards one of the optimization variables. A remedy is non-alternating schemes. However, due to a lack of Lipschitz continuity of the gradient in matrix factorization problems, convergence cannot be guaranteed. A recently developed approach relies on the concept of Bregman distances, which generalizes the standard Euclidean distance. We exploit this theory by proposing a novel Bregman distance for matrix factorization problems, which, at the same time, allows for simple/closed form update steps. Therefore, for non-alternating schemes, such as the recently introduced Bregman Proximal Gradient (BPG) method and an inertial variant Convex--Concave Inertial BPG (CoCaIn BPG), convergence of the whole sequence to a stationary point is proved for Matrix Factorization. In several experiments, we observe a superior performance of our non-alternating schemes in terms of speed and objective value at the limit point.", "full_text": "Beyond Alternating Updates for Matrix Factorization\nwith Inertial Bregman Proximal Gradient Algorithms\n\nMahesh Chandra Mukkamala\nMathematical Optimization Group\n\nSaarland University, Germany\nmukkamala@math.uni-sb.de\n\nPeter Ochs\n\nMathematical Optimization Group\n\nSaarland University, Germany\n\nochs@math.uni-sb.de\n\nAbstract\n\nMatrix Factorization is a popular non-convex optimization problem, for which\nalternating minimization schemes are mostly used. They usually suffer from\nthe major drawback that the solution is biased towards one of the optimization\nvariables. A remedy is non-alternating schemes. 
However, due to a lack of Lipschitz continuity of the gradient in matrix factorization problems, convergence cannot be guaranteed. A recently developed approach relies on the concept of Bregman distances, which generalizes the standard Euclidean distance. We exploit this theory by proposing a novel Bregman distance for matrix factorization problems which, at the same time, allows for simple, closed-form update steps. Therefore, for non-alternating schemes, such as the recently introduced Bregman Proximal Gradient (BPG) method and an inertial variant, Convex–Concave Inertial BPG (CoCaIn BPG), convergence of the whole sequence to a stationary point is proved for Matrix Factorization. In several experiments, we observe a superior performance of our non-alternating schemes in terms of speed and objective value at the limit point.

1 Introduction

Matrix factorization has numerous applications in Machine Learning [43, 57], Computer Vision [17, 58, 62, 28], Bio-informatics [56, 12] and many others. Given a matrix A ∈ R^{M×N}, one is interested in the factors U ∈ R^{M×K} and Z ∈ R^{K×N} such that A ≈ UZ holds. This is usually cast into the following non-convex optimization problem

min_{U ∈ 𝒰, Z ∈ 𝒵} { Ψ ≡ (1/2)‖A − UZ‖²_F + R_1(U) + R_2(Z) },  (1.1)

where 𝒰, 𝒵 are constraint sets and R_1, R_2 are regularization terms. The most frequently used techniques for solving matrix factorization problems involve alternating updates (Gauss–Seidel type methods [26]) like PALM [8], iPALM [53], BCD [63], BC-VMFB [18], HALS [19] and many others. A common disadvantage of these schemes is their bias towards one of the optimization variables. Such alternating schemes fix one block of variables while updating the other.
In order to guarantee convergence to a stationary point, alternating schemes require the first term in (1.1) to have a Lipschitz continuous gradient only with respect to each block of variables. However, in general, Lipschitz continuity of the gradient fails to hold jointly in all variables. The same problem appears in various practical applications such as Quadratic Inverse Problems, Poisson Linear Inverse Problems, Cubic Regularized Non-convex Quadratic Problems and Robust Denoising Problems with Non-convex Total Variation Regularization [46, 9, 4]. They belong to the following broad class of non-convex additive composite minimization problems

inf{ Ψ ≡ f(x) + g(x) : x ∈ C },  (P)  (1.2)

where f is a potentially non-convex extended real-valued function, g is a smooth (possibly non-convex) function and C is a nonempty, closed, convex set in R^d.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In order to use non-alternating schemes for (1.1), the gradient Lipschitz continuity must be generalized. Such a generalization was initially proposed by [6], popularized by [4] in the convex setting, and extended to non-convex problems in [9]. These works are based on a generalized proximity measure known as the Bregman distance and have recently led to new algorithms to solve (1.2): the Bregman Proximal Gradient (BPG) method [9] and its inertial variant Convex–Concave Inertial BPG (CoCaIn BPG) [46].

BPG generalizes the proximal gradient method from Euclidean distances to Bregman distances as proximity measures. Its convergence theory relies on the generalized Lipschitz assumption discussed above, called the L-smad property [9]. It involves an upper bound and a lower bound, where the upper bound involves a convex majorant to control the step-size of BPG. However, the significance of the lower bounds for BPG was not clear.
In the non-convex optimization literature, lower bounds involving concave minorants were largely ignored. Recently, building on [61, 50], CoCaIn BPG changed this trend by justifying the usage of lower bounds to incorporate inertia for faster convergence [46]. Moreover, the generated inertia is adaptive, in the sense that it changes according to the function behavior; i.e., CoCaIn BPG does not use an inertial parameter depending on the iteration counter, unlike the Nesterov Accelerated Gradient (NAG) method [47] (also FISTA [5]) in the convex setting.

In this paper we ask the question: "Can we apply BPG and CoCaIn BPG efficiently for Matrix Factorization problems?" This question is significant, since convergence of the Bregman minimization variants BPG and CoCaIn BPG relies on the L-smad property, which is non-trivial to verify and an open problem for Matrix Factorization. Another crucial issue is the efficient computability of the algorithm's update steps, which is particularly hard due to the coupling between the two blocks of variables. We successfully solve these challenges.

Contributions. We make the recently introduced, powerful Bregman minimization based algorithms BPG [9] and CoCaIn BPG [46], and the corresponding convergence results, applicable to matrix factorization problems. Experiments show a significant advantage of BPG and CoCaIn BPG, which are non-alternating by construction, compared to popular alternating minimization schemes, in particular PALM [8] and iPALM [53].
The proposed algorithms require the following non-trivial contributions:

• We propose a novel Bregman distance for Matrix Factorization with the following auxiliary function (called kernel generating distance), for certain c_1, c_2 > 0:

h(U, Z) = c_1 ( (‖U‖²_F + ‖Z‖²_F)/2 )² + c_2 ( (‖U‖²_F + ‖Z‖²_F)/2 ).

The generated Bregman distance embeds the crucial coupling between the variables U and Z. We prove the L-smad property with such a kernel generating distance and infer convergence of BPG and CoCaIn BPG to a stationary point.

• We compute the analytic solution for the subproblems of the proposed variants of BPG, for which the usual analytic solutions based on Euclidean distances cannot be used.

Simple Illustration of BPG for Matrix Factorization. Consider the following simple matrix factorization optimization problem, where we set R_1 := 0 and R_2 := 0 in (1.1):

min_{U ∈ R^{M×K}, Z ∈ R^{K×N}} { Ψ(U, Z) = (1/2)‖A − UZ‖²_F }.  (1.3)

For this problem, the update steps of Bregman Proximal Gradient for Matrix Factorization (BPG-MF) given in Section 2.1 (also see Section 2.4) with a chosen λ ∈ (0, 1) are the following: In each iteration, compute t_k = 3(‖U^k‖²_F + ‖Z^k‖²_F) + ‖A‖_F and perform the intermediary gradient descent steps (non-alternating) for U and Z independently with step-size λ/t_k:

P^k = U^k − (λ/t_k) [(U^k Z^k − A)(Z^k)^T],  Q^k = Z^k − (λ/t_k) [(U^k)^T (U^k Z^k − A)].

Then, the additional scaling steps U^{k+1} = r t_k P^k and Z^{k+1} = r t_k Q^k are required, where the scaling factor r ≥ 0 satisfies the cubic equation

3 t_k² (‖P^k‖²_F + ‖Q^k‖²_F) r³ + ‖A‖_F r − 1 = 0.

1.1 Related Work

Alternating Minimization is the go-to strategy for matrix factorization problems due to the coupling between the two blocks of variables [24, 1, 64]. In the context of non-convex and non-smooth optimization, PALM [8] was proposed recently and convergence to a stationary point was proved. An inertial variant, iPALM, was proposed in [53]. However, such methods require a subset of variables to be fixed. We remove this restriction here and take the contrary view by proposing non-alternating schemes based on a powerful Bregman proximal minimization framework, which we review below.

Bregman Proximal Minimization extends the standard proximal minimization by using Bregman distances as proximity measures. Based on the initial works [6, 4, 9], related inertial variants were proposed in [46, 67]. Related line-search methods were proposed in [52] based on [10, 11]. More related works in convex optimization include [49, 40, 42]. Recently, the symmetric non-negative matrix factorization problem was solved with a non-alternating Bregman proximal minimization scheme [21] with the kernel generating distance

h(U) = (1/4)‖U‖⁴_F + (1/2)‖U‖²_F.

However, for the following applications, such an h is not suitable, unlike our Bregman distance.

Non-negative Matrix Factorization (NMF) is a variant of the matrix factorization problem which requires the factors to have non-negative entries [25, 37]. Some applications are hyperspectral unmixing, clustering and others [24, 22]. The non-negativity constraints pose new challenges [37], and only convergence to a stationary point [24, 31] is guaranteed, as NMF is NP-hard in general. Under certain restrictions, NMF can be solved exactly [2, 44], but such methods are computationally infeasible.
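Returning to the simple illustration of BPG-MF above: one iteration for problem (1.3) can be sketched in a few lines of NumPy. This is a minimal numerical sketch of the stated updates, not the authors' reference implementation; the function name `bpg_mf_step` is ours. The cubic in r has a unique real root, since its left-hand side is strictly increasing in r and negative at r = 0.

```python
import numpy as np

def bpg_mf_step(U, Z, A, lam=0.9):
    """One BPG-MF iteration for min 0.5*||A - U Z||_F^2, following the
    illustration in Section 1 (lam plays the role of lambda in (0, 1))."""
    fro2 = lambda X: np.linalg.norm(X, 'fro') ** 2
    tk = 3.0 * (fro2(U) + fro2(Z)) + np.linalg.norm(A, 'fro')
    R = U @ Z - A                     # residual U^k Z^k - A
    P = U - (lam / tk) * (R @ Z.T)    # gradient step in U
    Q = Z - (lam / tk) * (U.T @ R)    # gradient step in Z
    # Scaling factor r >= 0 solves the cubic
    # 3 tk^2 (||P||_F^2 + ||Q||_F^2) r^3 + ||A||_F r - 1 = 0.
    coeffs = [3.0 * tk ** 2 * (fro2(P) + fro2(Q)), 0.0,
              np.linalg.norm(A, 'fro'), -1.0]
    r = max(rt.real for rt in np.roots(coeffs) if abs(rt.imag) < 1e-9)
    return r * tk * P, r * tk * Q
```

Iterating this step from a random initialization monotonically decreases the objective, as guaranteed by the BPG theory for λ ∈ (0, 1) once the L-smad property of Section 2.2 is in place.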
We give efficient algorithms for NMF and show their superior performance empirically.

Matrix Completion is another variant of Matrix Factorization, arising in recommender systems [35] and bio-informatics [39, 60], and is an active research topic due to the hard non-convex optimization problem [15, 23]. State-of-the-art methods were proposed in [33, 65]; other recent methods include [66]. Here, our algorithms are either faster or competitive.

Our algorithms are also applicable to Graph Regularized NMF (GNMF) [13], Sparse NMF [8], Nuclear Norm Regularized problems [14, 32], and Symmetric NMF via a non-symmetric extension [68].

2 Matrix Factorization Problem Setting and Algorithms

Notation. We refer to [55] for standard notation, unless specified otherwise.

Formally, in a matrix factorization problem, given a matrix A ∈ R^{M×N}, we want to obtain factors U ∈ R^{M×K} and Z ∈ R^{K×N} such that A ≈ UZ, which is captured by the following non-convex problem

min_{U ∈ 𝒰, Z ∈ 𝒵} { Ψ(U, Z) := (1/2)‖A − UZ‖²_F + R_1(U) + R_2(Z) },  (2.1)

where R_1(U) + R_2(Z) is the separable regularization term, (1/2)‖A − UZ‖²_F is the data-fitting term, and 𝒰, 𝒵 are the constraint sets for U and Z, respectively. Here, R_1(U) and R_2(Z) can be potentially non-convex extended real-valued functions, and possibly non-smooth. In this paper, we propose to make use of BPG and its inertial variant CoCaIn BPG to solve (2.1). The introduction of these algorithms requires the following preliminary considerations.

Definition 2.1 (Kernel Generating Distance [9]). Let C be a nonempty, convex and open subset of R^d.
Associated with C, a function h : R^d → (−∞, +∞] is called a kernel generating distance if it satisfies: (i) h is proper, lower semicontinuous and convex, with dom h ⊂ C and dom ∂h = C, and (ii) h is C¹ on int dom h ≡ C. We denote the class of kernel generating distances by G(C).

For every h ∈ G(C), the associated Bregman distance D_h : dom h × int dom h → R₊ is given by

D_h(x, y) := h(x) − [h(y) + ⟨∇h(y), x − y⟩].

As examples, consider the following kernel generating distances:

h_0(x) = (1/2)‖x‖²,  h_1(x) = (1/4)‖x‖⁴ + (1/2)‖x‖²,  and  h_2(x) = (1/3)‖x‖³ + (1/2)‖x‖².

The Bregman distance associated with h_0 is the Euclidean distance. The Bregman distances associated with h_1 and h_2 appear in the context of non-convex quadratic inverse problems [9, 46] and non-convex cubic regularized problems [46], respectively. For a review of the recent literature, we refer the reader to [59], and for early work on Bregman distances to [16].

These distance measures are key for the development of algorithms for the following class of non-convex additive composite problems

inf{ Ψ ≡ f(x) + g(x) : x ∈ C },  (P)  (2.2)

which is assumed to satisfy the following standard assumption [9].

Assumption A. (i) h ∈ G(C) with C = dom h. (ii) f : R^d → (−∞, +∞] is a proper and lower semicontinuous function (potentially non-convex) with dom f ∩ C ≠ ∅. (iii) g : R^d → (−∞, +∞] is a proper and lower semicontinuous function (potentially non-convex) with dom h ⊂ dom g, which is continuously differentiable on C.
(iv) v(P) := inf{ Ψ(x) : x ∈ C } > −∞.

Matrix Factorization Example. A special case of (2.2) is the following problem:

inf{ Ψ(U, Z) := f_1(U) + f_2(Z) + g(U, Z) : (U, Z) ∈ C }.  (2.3)

We denote f(U, Z) = f_1(U) + f_2(Z). Many practical matrix factorization problems can be cast into the form of (2.1). The choice of f and g depends on the problem; we provide some examples in Section 3. Here, f_1, f_2 satisfy the assumptions on f, with dimensions chosen accordingly. Moreover, by definition, f is separable in U and Z, which we assume only for practical reasons. Also, the choice of f, g may not be unique. For example, in (2.1) with R_1(U) = (λ_0/2)‖U‖²_F and R_2(Z) = (λ_0/2)‖Z‖²_F, one choice of f as in (2.3) is R_1 + R_2 with g = (1/2)‖A − UZ‖²_F. Another choice is to set g = Ψ and f := 0.

2.1 BPG-MF: Bregman Proximal Gradient for Matrix Factorization

We require the notion of the Bregman Proximal Gradient mapping [9, Section 3.1], given by

T_λ(x) = argmin{ f(u) + ⟨∇g(x), u − x⟩ + (1/λ) D_h(u, x) : u ∈ C }.  (2.4)

Then, the update step of Bregman Proximal Gradient (BPG) [9] for solving (2.2) is x^{k+1} ∈ T_λ(x^k), for some λ > 0 and h ∈ G(C). Convergence of BPG relies on a generalized notion of Lipschitz continuity, the so-called L-smad property (Definition 2.2).

Beyond Lipschitz continuity. BPG extends the popular proximal gradient methods, for which convergence relies on Lipschitz continuity of the smooth part of the objective in (2.2).
However, such a notion of Lipschitz continuity is restrictive for many practical applications such as Poisson linear inverse problems [4], quadratic inverse problems [9, 46], cubic regularized problems [46] and robust denoising problems with non-convex total variation regularization [46]. Generalized notions of Lipschitz continuity of the gradient are an active area of research [6, 4, 40, 9]. We consider the following one from [9].

Definition 2.2 (L-smad property). The function g is said to be L-smooth adaptable (L-smad) on C with respect to h if and only if Lh − g and Lh + g are convex on C.

When h(x) = (1/2)‖x‖², the L-smad property is implied by a Lipschitz continuous gradient. Consider the function f(x) = x⁴: it is L-smad with respect to h(x) = x⁴ for L ≥ 1; however, ∇f is not Lipschitz continuous.

Now, we are ready to present the BPG algorithm for Matrix Factorization.

BPG-MF: BPG for Matrix Factorization.
Input. Choose h ∈ G(C) with C ≡ int dom h such that g satisfies L-smad with respect to h on C.
Initialization. (U¹, Z¹) ∈ int dom h and λ > 0.
General Step. For k = 1, 2, . . ., compute

P^k = λ∇_U g(U^k, Z^k) − ∇_U h(U^k, Z^k),  Q^k = λ∇_Z g(U^k, Z^k) − ∇_Z h(U^k, Z^k),  (2.5)

(U^{k+1}, Z^{k+1}) ∈ argmin_{(U, Z) ∈ C} { λ f(U, Z) + ⟨P^k, U⟩ + ⟨Q^k, Z⟩ + h(U, Z) }.

Under Assumption A and the following one (mostly satisfied in practice), BPG is well-defined [9].

Assumption B. The range of T_λ lies in C and, for all λ > 0, the function h + λf is supercoercive.

The update step for BPG-MF is easy to derive from BPG; however, convergence of BPG also relies on the "right" choice of kernel generating distance h and the L-smad condition.
Finding an h such that the L-smad property holds (see Section 2.2) and such that the update step can be given in closed form (see Section 2.4) is our main contribution, and allows us to invoke the convergence results from [9]. The convergence result states that the whole sequence of iterates generated by BPG-MF converges to a stationary point, precisely given in Theorem 2.1. The result depends on the non-smooth KL property (see [7, 3, 8]), which is a mild requirement and is satisfied by most practical objectives. We provide below the convergence result of [9, Theorem 4.1] adapted to BPG-MF.

Theorem 2.1 (Global Convergence of BPG-MF). Let Assumptions A and B hold and let g be L-smad with respect to h, where h is assumed to be σ-strongly convex with full domain. Assume ∇g, ∇h to be Lipschitz continuous on any bounded subset. Let {(U^{k+1}, Z^{k+1})}_{k∈N} be a bounded sequence generated by BPG-MF with 0 < λL < 1, and suppose Ψ satisfies the KL property. Then, such a sequence has finite length and converges to a critical point.

2.2 New Bregman Distance for Matrix Factorization

We prove the L-smad property for the term g(U, Z) = (1/2)‖A − UZ‖²_F of the matrix factorization problem in (2.1). The kernel generating distance is a linear combination of

h_1(U, Z) := ( (‖U‖²_F + ‖Z‖²_F)/2 )²  and  h_2(U, Z) := (‖U‖²_F + ‖Z‖²_F)/2,  (2.6)

and it is designed to also allow for closed form updates (see Section 2.4).

Proposition 2.1. Let g, h_1, h_2 be as defined above. Then, for L ≥ 1, the function g satisfies the L-smad property with respect to the following kernel generating distance

h_a(U, Z) = 3 h_1(U, Z) + ‖A‖_F h_2(U, Z).  (2.7)

The proof is given in Section G.1 in the supplementary material.
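Proposition 2.1 can be sanity-checked numerically: L-smad with L = 1 means −D_{h_a}(x, y) ≤ D_g(x, y) ≤ D_{h_a}(x, y) for all pairs of points. The following sketch checks this on random matrices (a spot check of the inequality, not a proof; the helper names are ours):

```python
import numpy as np

fro2 = lambda X: np.linalg.norm(X, 'fro') ** 2

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
nA = np.linalg.norm(A, 'fro')

def g(U, Z):                       # data-fitting term of (2.1)
    return 0.5 * fro2(A - U @ Z)

def grad_g(U, Z):
    R = U @ Z - A
    return R @ Z.T, U.T @ R

def h_a(U, Z):                     # kernel generating distance (2.7)
    s = fro2(U) + fro2(Z)
    return 3.0 * (0.5 * s) ** 2 + nA * 0.5 * s

def grad_h(U, Z):
    t = 3.0 * (fro2(U) + fro2(Z)) + nA
    return t * U, t * Z

def breg(val, grad, x, y):
    """D(x, y) = val(x) - val(y) - <grad(y), x - y>; may be negative
    when val is non-convex (as g is)."""
    gU, gZ = grad(*y)
    return (val(*x) - val(*y)
            - np.sum(gU * (x[0] - y[0])) - np.sum(gZ * (x[1] - y[1])))

for _ in range(1000):              # |D_g| <= D_{h_a}, i.e. L-smad with L = 1
    x = (rng.standard_normal((8, 3)), rng.standard_normal((3, 8)))
    y = (rng.standard_normal((8, 3)), rng.standard_normal((3, 8)))
    assert abs(breg(g, grad_g, x, y)) <= breg(h_a, grad_h, x, y) + 1e-8
```

The separable kernels of earlier works fail exactly this check: the quartic term of h_1 supplies the cross-curvature in (U, Z) that the coupled objective g requires.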
The Bregman distances considered in previous works [46, 9] are separable and not applicable to matrix factorization problems. The inherent coupling between the two blocks of variables U, Z is the main source of non-convexity in the objective g. The kernel generating distance (in particular h_1 in (2.7)) contains the interaction/coupling terms between U and Z, which makes it amenable to matrix factorization problems.

2.3 CoCaIn BPG-MF: An Adaptive Inertial Bregman Proximal Gradient Method

The goal of this section is to introduce an inertial variant of BPG-MF, called CoCaIn BPG-MF. The effective step-size choice for BPG-MF can be restrictive due to large constants like ‖A‖_F (see (2.7)), for which we present a practical example in the numerical experiments. In order to allow for larger step-sizes, one needs to adapt the step-size locally, which is often done via a backtracking procedure. CoCaIn BPG-MF combines inertial steps with a novel backtracking procedure proposed in [46].

Inertial algorithms often lead to better convergence [51, 53, 46]. The classical Nesterov Accelerated Gradient (NAG) method [47] and the popular Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [5] employ an extrapolation-based inertial strategy. However, the extrapolation is governed by a parameter which is typically scheduled to follow a certain iteration-dependent scheme [47, 29] and is restricted to the convex setting. Recently, with Convex–Concave Inertial Bregman Proximal Gradient (CoCaIn BPG) [46], it was shown that one can leverage the upper bound (convexity of Lh − g) and the lower bound (convexity of Lh + g) to incorporate inertia in an adaptive manner.

We recall now the update steps of CoCaIn BPG [46] to solve (2.2).
Let h ∈ G(C), λ > 0, and let x⁰ = x¹ ∈ R^d be an initialization. In each iteration, the extrapolated point y^k = x^k + γ_k(x^k − x^{k−1}) is computed, followed by a BPG-like update (at y^k) given by x^{k+1} ∈ T_{τ_k}(y^k), where γ_k is the inertial parameter and τ_k is the step-size parameter. Conditions similar to BPG are required for convergence to a stationary point. We use CoCaIn BPG for Matrix Factorization (CoCaIn BPG-MF), and our proposed novel kernel generating distance h from (2.7) makes the convergence results of [46] applicable. Along with Assumption B, we require the following assumption.

Assumption C. (i) There exists α ∈ R such that f(U, Z) − (α/2)(‖U‖²_F + ‖Z‖²_F) is convex. (ii) The kernel generating distance h is σ-strongly convex on R^{M×K} × R^{K×N}.

Assumption C(i) refers to the notion of semi-convexity of the function f (see [50, 46]) and seems to be closely connected to the inertial feature of an algorithm. For notational brevity, we use D_g(x, y) := g(x) − [g(y) + ⟨∇g(y), x − y⟩], which may also be negative if g is not a kernel generating distance. Moreover, we write D_h((X_1, Y_1), (X_2, Y_2)) as D_h(X_1, Y_1, X_2, Y_2). We provide CoCaIn BPG-MF below.

CoCaIn BPG-MF: Convex–Concave Inertial BPG for Matrix Factorization.
Input. Choose δ, ε > 0 with 1 > δ > ε, and h ∈ G(C) with C ≡ int dom h such that g is L-smad on C w.r.t. h.
Initialization. (U¹, Z¹) = (U⁰, Z⁰) ∈ int dom h ∩ dom f, L̄₀ > −α/((1 − δ)σ) and τ₀ ≤ L̄₀⁻¹.
General Step. For k = 1, 2, . . ., compute the extrapolated points

Y^k_U = U^k + γ_k (U^k − U^{k−1})  and  Y^k_Z = Z^k + γ_k (Z^k − Z^{k−1}),  (2.8)

where γ_k ≥ 0 is chosen such that

(δ − ε) D_h(U^{k−1}, Z^{k−1}, U^k, Z^k) ≥ (1 + L_k τ_{k−1}) D_h(U^k, Z^k, Y^k_U, Y^k_Z),  (2.9)

where L_k satisfies

D_g(U^k, Z^k, Y^k_U, Y^k_Z) ≥ −L_k D_h(U^k, Z^k, Y^k_U, Y^k_Z).  (2.10)

Then, compute

P^k = τ_k ∇_U g(Y^k_U, Y^k_Z) − ∇_U h(Y^k_U, Y^k_Z),  Q^k = τ_k ∇_Z g(Y^k_U, Y^k_Z) − ∇_Z h(Y^k_U, Y^k_Z).

Choose L̄_k ≥ L̄_{k−1} and set τ_k ≤ min{τ_{k−1}, L̄_k⁻¹}. Now, compute

(U^{k+1}, Z^{k+1}) ∈ argmin_{(U, Z) ∈ C} { τ_k f(U, Z) + ⟨P^k, U⟩ + ⟨Q^k, Z⟩ + h(U, Z) },  (2.11)

such that L̄_k satisfies

D_g(U^{k+1}, Z^{k+1}, Y^k_U, Y^k_Z) ≤ L̄_k D_h(U^{k+1}, Z^{k+1}, Y^k_U, Y^k_Z).  (2.12)

The extrapolation step is performed in (2.8), which is similar to NAG/FISTA. However, the inertia cannot be arbitrary: the analysis from [46] requires step (2.9), which is governed by the convexity of the lower bound L_k h + g, however only locally as in (2.10). The update step (2.11) is similar to BPG-MF, but the step-size is controlled via the convexity of the upper bound L̄_k h − g, again only locally as in (2.12). The local adaptation of the steps (2.10) and (2.12) is performed via backtracking. Since L̄_k can be much smaller than L, potentially large steps can be taken. There is no restriction on L_k in each iteration, and a smaller L_k can result in a high value for the inertial parameter γ_k.
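The benefit of enforcing (2.12) only locally can be seen numerically: for a single pair of nearby points, the smallest constant satisfying the upper bound is the ratio D_g / D_{h_a}, and this local value is often well below the global L = 1 from Proposition 2.1, so backtracking admits larger steps. A rough sketch of this estimate (random data; the function name `local_L_bar` is ours, not from [46]):

```python
import numpy as np

fro2 = lambda X: np.linalg.norm(X, 'fro') ** 2

def local_L_bar(U, Z, Y_U, Y_Z, A):
    """Ratio D_g / D_{h_a} for one pair of points: the smallest constant
    satisfying the local upper bound (2.12) at this pair (a sketch)."""
    nA = np.linalg.norm(A, 'fro')
    def g(U, Z):
        return 0.5 * fro2(A - U @ Z)
    def h(U, Z):
        s = fro2(U) + fro2(Z)
        return 3.0 * (0.5 * s) ** 2 + nA * 0.5 * s
    # gradients of g and h_a at the base point (Y_U, Y_Z)
    grad_g = ((Y_U @ Y_Z - A) @ Y_Z.T, Y_U.T @ (Y_U @ Y_Z - A))
    t = 3.0 * (fro2(Y_U) + fro2(Y_Z)) + nA
    grad_h = (t * Y_U, t * Y_Z)
    def breg(val, grad):
        return (val(U, Z) - val(Y_U, Y_Z)
                - np.sum(grad[0] * (U - Y_U)) - np.sum(grad[1] * (Z - Y_Z)))
    return breg(g, grad_g) / breg(h, grad_h)
```

By Proposition 2.1 this ratio never exceeds 1, and it can even be negative when g curves downwards between the two points, which is exactly the "local concavity" that (2.10) exploits for inertia.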
Thus, the algorithm in essence aims to detect "local convexity" of the objective. The update steps of CoCaIn BPG-MF can be executed sequentially, without any nested loops for the backtracking. One can always find the inertial parameter γ_k in (2.9) due to [46, Lemma 4.1]. For certain cases, (2.9) yields an explicit condition on γ_k. For example, for h(U, Z) = (1/2)(‖U‖²_F + ‖Z‖²_F), we have 0 ≤ γ_k ≤ √((δ − ε)/(1 + τ_{k−1} L_k)). We now provide the convergence result from [46, Theorem 5.2] adapted to CoCaIn BPG-MF.

Theorem 2.2 (Global Convergence of CoCaIn BPG-MF). Let Assumptions A, B and C hold, and let g be L-smad with respect to h with full domain. Assume ∇g, ∇h to be Lipschitz continuous on any bounded subset. Let {(U^{k+1}, Z^{k+1})}_{k∈N} be a bounded sequence generated by CoCaIn BPG-MF, and suppose f, g satisfy the KL property. Then, such a sequence has finite length and converges to a critical point.

2.4 Closed Form Solutions for Update Steps of BPG-MF and CoCaIn BPG-MF

Our second significant contribution is to make BPG-MF and CoCaIn BPG-MF an efficient choice for solving Matrix Factorization, namely closed form expressions for the main update steps (2.5), (2.11). For the derivation, we refer to the supplementary material; here we just state our results.

For the L2-regularized problem

g(U, Z) = (1/2)‖A − UZ‖²_F,  f(U, Z) = (λ_0/2)(‖U‖²_F + ‖Z‖²_F),  h = h_a,

with c_1 = 3, c_2 = ‖A‖_F and 0 < λ < 1, the BPG-MF updates are:

U^{k+1} = −r P^k,  Z^{k+1} = −r Q^k  with r ≥ 0,  c_1(‖−P^k‖²_F + ‖−Q^k‖²_F) r³ + (c_2 + λ_0) r − 1 = 0.

For NMF with additional non-negativity constraints, we replace −P^k and −Q^k by
Π₊(−P^k) and Π₊(−Q^k), respectively, where Π₊(·) = max{0, ·} and the max is applied element-wise.

Now consider the following L1-regularized problem

g(U, Z) = (1/2)‖A − UZ‖²_F,  f(U, Z) = λ_1(‖U‖₁ + ‖Z‖₁),  h = h_a.  (2.13)

The soft-thresholding operator is defined for any y ∈ R^d by S_θ(y) = max{|y| − θ, 0} sgn(y), where θ > 0. Set c_1 = 3, c_2 = ‖A‖_F and 0 < λ < 1; the BPG-MF updates with the above g, f, h are:

U^{k+1} = r S_{λ_1 λ}(−P^k),  Z^{k+1} = r S_{λ_1 λ}(−Q^k)  with r ≥ 0 and

c_1(‖S_{λ_1 λ}(−P^k)‖²_F + ‖S_{λ_1 λ}(−Q^k)‖²_F) r³ + c_2 r − 1 = 0.

We denote a vector of ones by e_D ∈ R^D. For additional non-negativity constraints, we need to replace S_{λ_1 λ}(−P^k) with Π₊(−(P^k + λ_1 λ e_M e_K^T)) and S_{λ_1 λ}(−Q^k) with Π₊(−(Q^k + λ_1 λ e_K e_N^T)).

Excluding the gradient computation, the computational complexity of our updates is only O(MK + NK), thanks to linear operations. PALM and iPALM additionally involve calculating Lipschitz constants with at most O(K² max{M, N}²) computations. Examples like Graph Regularized NMF (GNMF) [13], Sparse NMF [8], Matrix Completion [35], Nuclear Norm Regularization [14, 32] and Symmetric NMF [68], as well as proofs, are given in the supplementary material.

3 Experiments

In this section, we show experiments for (2.1).
Denote the regularization settings: R1: no regularization, R_1 ≡ R_2 ≡ 0; R2: L2 regularization, R_1(U) = (λ_0/2)‖U‖²_F and R_2(Z) = (λ_0/2)‖Z‖²_F for some λ_0 > 0; R3: L1 regularization, R_1(U) = λ_0‖U‖₁ and R_2(Z) = λ_0‖Z‖₁ for some λ_0 > 0.

Algorithms. We compare our first-order optimization algorithms, BPG-MF and CoCaIn BPG-MF, with the recent state-of-the-art optimization methods iPALM [53] and PALM [8]. We focus on algorithms that guarantee convergence to a stationary point. We also use BPG-MF-WB, where WB stands for "with backtracking", which is equivalent to CoCaIn BPG-MF with γ_k ≡ 0. We use two settings for iPALM, where all the extrapolation parameters are set to a single value β of 0.2 and 0.4, respectively. PALM is equivalent to iPALM with β = 0. We use the same initialization for all methods.

Simple Matrix Factorization. We set 𝒰 = R^{M×K} and 𝒵 = R^{K×N}. We use a randomly generated synthetic data matrix A ∈ R^{200×200} and report performance in terms of function value for the three regularization settings R1, R2 and R3 with K = 5. Note that this enforces a factorization into matrices U and Z of rank at most 5, which yields an additional implicit regularization. For R2 and R3 we use λ_0 = 0.1. CoCaIn BPG-MF is superior¹, as shown in Figure 1.

Statistical Evaluation. We also provide a statistical evaluation of all the algorithms in Figure 2, for the above problem. The optimization variables are sampled from [0, 0.1] and 50 random seeds are considered. CoCaIn BPG outperforms the other methods; however, the PALM methods are also very competitive. In the L1 regularization setting, the performance of CoCaIn BPG is the best. In all settings, the performance of BPG-MF is worst due to a constant step-size, which might change in settings where local adaptation with a backtracking line search is computationally not feasible.

Matrix Completion.
In recommender systems [35], given a matrix A with entries only at the index pairs in a set Ω, the goal is to obtain factors U and Z that generalize, via the following optimization problem

min_{U ∈ R^{M×K}, Z ∈ R^{K×N}} { Ψ(U, Z) := (1/2)‖P_Ω(A − UZ)‖²_F + (λ_0/2)(‖U‖²_F + ‖Z‖²_F) },  (3.1)

where P_Ω preserves the given matrix entries and sets the others to zero. We use 80% of the data of the MovieLens-100K, MovieLens-1M and MovieLens-10M [30] datasets and use the other 20% for testing (details in the supplementary material). CoCaIn BPG-MF is faster than all other methods, as shown in Figure 3.

¹Note that in the y-axis label, v(P) is the least objective value attained by any of the methods.

Figure 1: Simple Matrix Factorization on Synthetic Dataset. (a) No Regularization, (b) L2-Regularization, (c) L1-Regularization.

Figure 2: Statistical Evaluation on Simple Matrix Factorization. (a) No Regularization, (b) L2-Regularization, (c) L1-Regularization.

Figure 3: Matrix Completion on MovieLens Datasets [30]. (a) MovieLens-100K, (b) MovieLens-1M, (c) MovieLens-10M.

As evident from Figures 1, 4, 3, CoCaIn BPG-MF and BPG-MF-WB can result in better performance than the well-known alternating methods. BPG-MF is not better than PALM and iPALM because of prohibitively small step-sizes (due to ‖A‖_F in (2.7)), which is resolved by CoCaIn BPG-MF and BPG-MF-WB using backtracking. Time comparisons are provided in the supplementary material, where we show that our methods are competitive.

Conclusion and Extensions

We proposed non-alternating algorithms to solve matrix factorization problems, contrary to the typical alternating strategies. We use the Bregman proximal algorithms BPG [9] and the inertial variant CoCaIn BPG [46] for matrix factorization problems.
We developed a novel Bregman distance, which is crucial for proving convergence to a stationary point. Moreover, we provide non-trivial, efficient closed-form update steps for many matrix factorization problems. This line of thinking raises new open questions, such as extensions to tensor factorization [34], to robust matrix factorization [65], to stochastic variants [20, 27, 45, 48] and to state-of-the-art matrix factorization models [33].

References

[1] P. Ablin, D. Fagot, H. Wendt, A. Gramfort, and C. Févotte. 
A quasi-Newton algorithm on the orthogonal manifold for NMF with transform learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 700–704, 2019.

[2] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 145–162. ACM, 2012.

[3] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1-2):5–16, 2009.

[4] H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2017.

[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[6] B. Birnbaum, N. R. Devanur, and L. Xiao. Distributed algorithms via gradient descent for Fisher markets. In Proceedings of the 12th ACM Conference on Electronic Commerce, pages 127–136. ACM, 2011.

[7] J. Bolte, A. Daniilidis, A. S. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.

[8] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.

[9] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131–2151, 2018.

[10] S. Bonettini, I. Loris, F. Porta, and M. Prato. Variable metric inexact line-search-based methods for nonsmooth optimization. 
SIAM Journal on Optimization, 26(2):891–921, 2016.

[11] S. Bonettini, I. Loris, F. Porta, M. Prato, and S. Rebegoldi. On the convergence of a linesearch based proximal-gradient method for nonconvex optimization. Inverse Problems, 33(5), 2017.

[12] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12):4164–4169, 2004.

[13] D. Cai, X. He, J. Han, and T. S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.

[14] J. F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

[15] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[16] Y. Censor and A. Lent. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321–353, 1981.

[17] S. Chaudhuri, R. Velmurugan, and R. M. Rameshan. Blind image deconvolution. Springer, 2016.

[18] E. Chouzenoux, J. C. Pesquet, and A. Repetti. A block coordinate variable metric forward–backward algorithm. Journal of Global Optimization, 66(3):457–485, 2016.

[19] A. Cichocki, R. Zdunek, and S. Amari. Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In International Conference on Independent Component Analysis and Signal Separation, pages 169–176. Springer, 2007.

[20] D. Davis, D. Drusvyatskiy, and K. J. MacPhee. Stochastic model-based minimization under high-order growth. ArXiv preprint arXiv:1807.00255, 2018.

[21] R. A. Dragomir, A. d'Aspremont, and J. Bolte. ArXiv preprint arXiv:1901.10791, 2019.

[22] F. Esposito, N. 
Gillis, and N. D. Buono. Orthogonal joint sparse NMF for microarray data analysis. Journal of Mathematical Biology, pages 1–25, 2019.

[23] H. Fang, Z. Zhang, Y. Shao, and C. J. Hsieh. Improved bounded matrix completion for large-scale recommender systems. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1654–1660. AAAI Press, 2017.

[24] N. Gillis. The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12(257), 2014.

[25] N. Gillis and S. A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):698–714, 2014.

[26] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3. Johns Hopkins University Press, 2012.

[27] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtarik. SGD: General analysis and improved rates. ArXiv preprint arXiv:1901.09401, 2019.

[28] B. D. Haeffele and R. Vidal. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[29] F. Hanzely, P. Richtarik, and L. Xiao. Accelerated Bregman proximal gradient methods for relatively smooth convex optimization. ArXiv preprint arXiv:1808.03045, 2018.

[30] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. Transactions on Interactive Intelligent Systems (TIIS), 5(4):19, 2016.

[31] C. J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In International Conference on Knowledge Discovery and Data Mining (ICKDDM), pages 1064–1072. ACM, 2011.

[32] C. J. Hsieh and P. Olsen. Nuclear norm minimization via active subspace selection. 
In International Conference on Machine Learning, pages 575–583, 2014.

[33] P. Jawanpuria and B. Mishra. A unified framework for structured low-rank matrix learning. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 2254–2263. PMLR, 2018.

[34] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[35] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

[36] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.

[37] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.

[38] W. Li and D.-Y. Yeung. Relation regularized matrix factorization. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1126–1131, 2009.

[39] C. Lu, M. Yang, F. Luo, F. X. Wu, M. Li, Y. Pan, Y. Li, and J. Wang. Prediction of lncRNA–disease associations based on inductive matrix completion. Bioinformatics, 34(19):3357–3364, 2018.

[40] H. Lu, R. M. Freund, and Y. Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.

[41] R. Luss and M. Teboulle. Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Review, 55(1):65–98, 2013.

[42] C. J. Maddison, D. Paulin, Y. W. Teh, and A. Doucet. Dual space preconditioning for gradient descent. ArXiv preprint arXiv:1902.02257, 2019.

[43] A. Mnih and R. R. Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264, 2008.

[44] A. Moitra. 
An almost optimal algorithm for computing nonnegative rank. SIAM Journal on Computing, 45(1):156–173, 2016.

[45] M. C. Mukkamala and M. Hein. Variants of RMSProp and Adagrad with logarithmic regret bounds. In International Conference on Machine Learning (ICML), pages 2545–2553, 2017.

[46] M. C. Mukkamala, P. Ochs, T. Pock, and S. Sabach. Convex-Concave backtracking for inertial Bregman proximal gradient algorithms in non-convex optimization. ArXiv preprint arXiv:1904.03537, 2019.

[47] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Doklady Akademii Nauk SSSR, 269(3):543–547, 1983.

[48] L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. ArXiv preprint arXiv:1802.03801, 2018.

[49] Q. V. Nguyen. Forward–Backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.

[50] P. Ochs. Local convergence of the heavy-ball method and iPiano for non-convex optimization. Journal of Optimization Theory and Applications, 177(1):153–180, 2018.

[51] P. Ochs, Y. Chen, T. Brox, and T. Pock. iPiano: inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.

[52] P. Ochs, J. Fadili, and T. Brox. Non-smooth non-convex Bregman minimization: Unification and new algorithms. Journal of Optimization Theory and Applications, 181(1):244–278, 2019.

[53] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM Journal on Imaging Sciences, 9(4):1756–1787, 2016.

[54] M. Powell. On search directions for minimization algorithms. Mathematical Programming, 4(1):193–201, 1973.

[55] R. T. Rockafellar and R. J.-B. Wets. 
Variational Analysis, volume 317 of Fundamental Principles of Mathematical Sciences. Springer-Verlag, Berlin, 1998.

[56] S. Sra and I. S. Dhillon. Generalized nonnegative matrix approximations with Bregman divergences. In Advances in Neural Information Processing Systems, pages 283–290, 2006.

[57] N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2005.

[58] J.-L. Starck, F. Murtagh, and J. Fadili. Sparse image and signal processing: wavelets, curvelets, morphological diversity. Cambridge University Press, 2010.

[59] M. Teboulle. A simplified view of first order methods for optimization. Mathematical Programming, 170(1):67–96, 2018.

[60] K. Thung, P. T. Yap, E. Adeli, S. W. Lee, D. Shen, and Alzheimer's Disease Neuroimaging Initiative. Conversion and time-to-conversion predictions of mild cognitive impairment using low-rank affinity pursuit denoising and matrix completion. Medical Image Analysis, 45:68–82, 2018.

[61] B. Wen, X. Chen, and T. K. Pong. Linear convergence of proximal gradient algorithm with extrapolation for a class of nonconvex nonsmooth minimization problems. SIAM Journal on Optimization, 27(1):124–145, 2017.

[62] Y. Xu, Z. Li, J. Yang, and D. Zhang. A survey of dictionary learning algorithms for face recognition. IEEE Access, 5:8502–8514, 2017.

[63] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.

[64] L. Yang, T. K. Pong, and X. Chen. A nonmonotone alternating updating method for a class of matrix factorization problems. SIAM Journal on Optimization, 28(4):3402–3430, 2018.

[65] Q. Yao and J. Kwok. Scalable robust matrix factorization with nonconvex loss. 
In Advances in Neural Information Processing Systems, pages 5061–5070, 2018.

[66] A. W. Yu, W. Ma, Y. Yu, J. Carbonell, and S. Sra. Efficient structured matrix rank minimization. In Advances in Neural Information Processing Systems, pages 1350–1358, 2014.

[67] X. Zhang, R. Barrio, M. Martinez, H. Jiang, and L. Cheng. Bregman proximal gradient algorithm with extrapolation for a class of nonconvex nonsmooth minimization problems. ArXiv preprint arXiv:1904.11295, 2019.

[68] Z. Zhu, X. Li, K. Liu, and Q. Li. Dropping symmetry for fast symmetric nonnegative matrix factorization. In Advances in Neural Information Processing Systems, pages 5154–5164, 2018.
", "award": [], "sourceid": 2399, "authors": [{"given_name": "Mahesh Chandra", "family_name": "Mukkamala", "institution": "Saarland University"}, {"given_name": "Peter", "family_name": "Ochs", "institution": "Saarland University"}]}