{"title": "Low-rank Interaction with Sparse Additive Effects Model for Large Data Frames", "book": "Advances in Neural Information Processing Systems", "page_first": 5496, "page_last": 5506, "abstract": "Many applications of machine learning involve the analysis of large data frames -- matrices collecting heterogeneous measurements (binary, numerical, counts, etc.) across samples -- with missing values. Low-rank models, as studied by Udell et al. (2016), are popular in this framework for tasks such as visualization, clustering and missing value imputation. Yet, available methods with statistical guarantees and efficient optimization do not allow explicit modeling of main additive effects such as row and column, or covariate effects. In this paper, we introduce a low-rank interaction and sparse additive effects (LORIS) model which combines matrix regression on a dictionary and low-rank design, to estimate main effects and interactions simultaneously. We provide statistical guarantees in the form of upper bounds on the estimation error of both components. Then, we introduce a mixed coordinate gradient descent (MCGD) method which provably converges sub-linearly to an optimal solution and is computationally efficient for large scale data sets. 
We show on simulated and survey data that the method has a clear advantage over current practices.", "full_text": "Low-rank Interaction with Sparse Additive Effects\n\nModel for Large Data Frames\n\nGenevi\u00e8ve Robin\n\nCentre de Math\u00e9matiques Appliqu\u00e9es\n\u00c9cole Polytechnique, XPOP, INRIA\n\n91120 Palaiseau, France\n\ngenevieve.robin@polytechnique.edu\n\nHoi-To Wai\n\nDepartment of SE&EM\n\nThe Chinese University of Hong Kong\n\nShatin, Hong Kong\n\nhtwai@se.cuhk.edu.hk\n\nJulie Josse\n\nCentre de Math\u00e9matiques Appliqu\u00e9es\n\u00c9cole Polytechnique, XPOP, INRIA\n\n91120 Palaiseau, France\n\njulie.josse@polytechnique.edu\n\nOlga Klopp\n\nESSEC Business School\n\nCREST, ENSAE\n\n95021 Cergy, France\nklopp@essec.edu\n\n\u00c9ric Moulines\n\nCentre de Math\u00e9matiques Appliqu\u00e9es\n\u00c9cole Polytechnique, XPOP, INRIA\n\n91120 Palaiseau, France\n\neric.moulines@polytechnique.edu\n\nAbstract\n\nMany applications of machine learning involve the analysis of large data frames \u2013\nmatrices collecting heterogeneous measurements (binary, numerical, counts, etc.)\nacross samples \u2013 with missing values. Low-rank models, as studied by Udell\net al. [27], are popular in this framework for tasks such as visualization, clustering\nand missing value imputation. Yet, available methods with statistical guarantees\nand ef\ufb01cient optimization do not allow explicit modeling of main additive effects\nsuch as row and column, or covariate effects. In this paper, we introduce a low-\nrank interaction and sparse additive effects (LORIS) model which combines\nmatrix regression on a dictionary and low-rank design, to estimate main effects\nand interactions simultaneously. We provide statistical guarantees in the form of\nupper bounds on the estimation error of both components. 
Then, we introduce a\nmixed coordinate gradient descent (MCGD) method which provably converges\nsub-linearly to an optimal solution and is computationally ef\ufb01cient for large scale\ndata sets. We show on simulated and survey data that the method has a clear\nadvantage over current practices.\n\n1\n\nIntroduction\n\nRecently, a lot of effort has been devoted towards the ef\ufb01cient analysis of large data frames, a term\ncoined by Udell et al. [27]. A data frame is a large table of heterogeneous data (binary, numerical,\ncounts) with missing entries, where each row represents an example and each column a feature. In\norder to analyze them, a powerful technique is to use low-rank models that embed rows and columns\nof data frames into low-dimensional spaces [15, 25, 27], enabling effective data analytics such as\nclustering, visualization and missing value imputation; see also [18] and the references therein.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fCharacterizing additive effects of side information \u2013 such as covariates, row or column effects \u2013\nsimultaneously with low rank interactions is an important extension to plain low-rank models. For\nexample, in data frames obtained from recommender systems, user information and item characteris-\ntics are known to in\ufb02uence the ratings in addition to interactions between users and items [7]. These\nmodi\ufb01cations to the low rank model have been advocated in the statistics literature, but they have\nbeen implemented only for small data frames [1].\nIn the large-scale low-rank matrix estimation literature, available methods either do not take additive\neffects into account [6, 20, 27, 22, 8], or only handle the numerical data [12, 11]. As a common\nheuristics for preprocessing, prior work such as [20, 27] remove the row and column means and\napply some normalization of the row and column variance. 
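For concreteness, the centering/normalization heuristic referred to above can be sketched as follows (illustrative code, not from any cited implementation; the function name and the NaN-for-missing convention are our choices). Note that double centering annihilates any purely additive structure, which is exactly the information the LORIS model keeps explicit:

```python
import numpy as np

def center_and_scale(Y, eps=1e-8):
    """Common preprocessing heuristic: remove row and column means, then
    normalize by row/column standard deviations, ignoring missing entries
    (encoded as NaN). Sketch only; the paper argues this operation is
    inappropriate for large heterogeneous data frames."""
    row_mean = np.nanmean(Y, axis=1, keepdims=True)
    col_mean = np.nanmean(Y, axis=0, keepdims=True)
    # double centering: subtract row and column means, add back grand mean
    Z = Y - row_mean - col_mean + np.nanmean(Y)
    row_sd = np.nanstd(Z, axis=1, keepdims=True) + eps
    col_sd = np.nanstd(Z, axis=0, keepdims=True) + eps
    return Z / np.sqrt(row_sd * col_sd)
```

Applied to a matrix whose entries are purely additive in row and column (e.g. Y_ij = 4i + j), this returns exactly zero, i.e. the main effects are silently removed rather than estimated.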
We show in numerical experiments this\napparently benign operation is not appropriate for large and heterogenous data frames, and can cause\nsevere impairments in the analysis.\nThe present work investigates a generalization of previous contributions in the analysis of data frames.\nOur contributions can be summarized as follows.\n\nContributions We present a new framework that is statistically and computationally ef\ufb01cient for\nanalyzing large and incomplete heterogeneous data frames.\n\n\u2022 We describe in Section 2 the low-rank interaction with sparse additive effects (LORIS)\nmodel, which combines matrix regression on a dictionary with low rank approximation. We\npropose a convex doubly penalized quasi-maximum likelihood approach, where the rank\nconstraint is relaxed with a nuclear norm penalty, to estimate the regression coef\ufb01cients and\nthe low rank component simultaneously. We establish non-asymptotic upper bounds on the\nestimation errors.\n\n\u2022 We propose in Section 3 a Mixed Coordinate Gradient Descent (MCGD) method to solve\nef\ufb01ciently the LORIS estimation problem. It uses a mixed update strategy including a\nproximal update for the sparse component and a conditional gradient (CG) for the low-rank\ncomponent. We show that the MCGD method converges to an \u0001-optimal solution in O(1/\u0001)\niterations. We also outline an extension to ef\ufb01cient distributed implementation.\n\n\u2022 We demonstrate in Section 4 the ef\ufb01cacy of our method both in terms of estimation and\n\nimputation quality on simulated and survey data examples.\n\nRelated work Our statistical model and analysis are related to prior work on low-rank plus sparse\nmatrix decomposition [28, 3, 4, 13, 17]; these papers provide statistical results for a particular case\nwhere the loss function is quadratic and the sparse component is entry-wise sparse. In comparison,\nthe originality of the present work is two-fold. 
First, the sparsity pattern of the main effects is not\nrestricted to entry-wise sparsity. Second, the data \ufb01tting term is not quadratic, but a heterogeneous\nexponential family quasi log-likelihood. This new framework enables us to tackle many more data\nsets combining heterogeneous data, main effects and interactions.\nFor the algorithmic development, our proposed method is related to the prior work such as [21, 26, 5,\n11, 29, 14, 24, 9, 19, 2, 10]. These are based on various \ufb01rst-order optimization methods and shall be\nreviewed in detail in Section 3. Among others, the MCGD method is mostly related to the recent\nFW-T method by Mu et al. [24] that uses a mixed update rule to tackle a similar estimation problem.\nThere are two differences: \ufb01rst, FW-T is focused on a quadratic loss which is a special case of the\nstatistical estimation problem that we analyze; second, the per-iteration complexity of MCGD is\nlower as the update rules are simpler. Despite the simpli\ufb01cations, using a new proof technique, we\nprove that the convergence rate of MCGD is strictly faster than FW-T.\nNotations: For any m \u2208 N, [m] := {1, ..., m}. The operator P\u2126(\u00b7) : Rn\u00d7p \u2192 Rn\u00d7p is the\nprojection operator on the set of entries in \u2126 \u2282 [n] \u00d7 [p], and (\u00b7)+ : R \u2192 R+ is the projection\noperator on the non-negative orthant (x)+ := max{0, x}. For matrices, we denote by (cid:107)\u00b7(cid:107)F the\nFrobenius norm, (cid:107)\u00b7(cid:107)(cid:63) the nuclear norm, (cid:107)\u00b7(cid:107) the operator norm, and (cid:107)\u00b7(cid:107)\u221e the entry-wise in\ufb01nity norm.\nFor vectors, we denote by (cid:107)\u00b7(cid:107)1 is the (cid:96)1-norm, (cid:107)\u00b7(cid:107)2 the Euclidean norm, (cid:107)\u00b7(cid:107)\u221e the in\ufb01nity norm,\nand (cid:107)\u00b7(cid:107)0 the number of non zero coef\ufb01cients. The binary operator (cid:104)X, Y (cid:105) denotes the Frobenius\ninner product. 
A function f : Rq \u2192 R is said to be \u03c3-smooth if f is continuously differentiable and\n(cid:107)\u2207f (\u03b8) \u2212 \u2207f (\u03b8(cid:48))(cid:107)2 \u2264 \u03c3(cid:107)\u03b8 \u2212 \u03b8(cid:48)(cid:107)2 for all \u03b8, \u03b8(cid:48) \u2208 Rq.\n\n2\n\n\f2 Problem Formulation\n\nHeterogenous Data Model Let (Y, X) be a probability space equipped with a \u03c3-\ufb01nite measure \u00b5.\nThe canonical exponential family distribution {Exph,g(m), m \u2208 X} with base measure h : Y \u2192 R+,\nlink function g : X \u2192 R, and scalar parameter, m \u2208 X, has a density given by\n\nfm(y) = h(y) exp (ym \u2212 g(m)) .\n\n(1)\n\nThe exponential family is a \ufb02exible framework to model different types of data. For example,\n(Y = R, g(m) = m2\u03c32/2, h(y) = (2\u03c0\u03c32)\u22121/2 exp(\u2212y2/2\u03c32)) yields a Gaussian distribution with\nmean m and variance \u03c32 for numerical data; (Y = {0, 1}, g(m) = log(1 + exp(m)), h(y) = 1)\nyields a Bernoulli distribution with success probability 1/(1 + exp(\u2212m)) for binary data;\n(Y = N, g(m) = exp(am), h(y) = 1/y!) where a \u2208 R yields a Poisson distribution with intensity\nexp(am) for count data. In these cases, the parameter space is X = R.\nLet {(Yj, gj, hj), j \u2208 [p]} be a collection of observation spaces, base and link functions correspond-\ning to the column types of a data frame Y = [Yij](i,j)\u2208[n]\u00d7[p] \u2208 Yn\np . For each i \u2208 [n]\nand j \u2208 [p], we denote by M0\nij the target parameter minimizing the Kullback-Leibler divergence\nbetween the distribution of Yij and the exponential family Exphj ,gj , j \u2208 [p], given by\n\n1 \u00d7 . . . 
\u00d7 Yn\n\nM0\n\nij = arg max\n\nm\n\nEYij [log(hj(Yij)) + Yijm \u2212 gj(m)] .\n\n(2)\n\nWe propose the following model to estimate M0 = [M0\neffects and interactions.\n\nij](i,j)\u2208[n]\u00d7[p] in the presence of additive\n\nLOw-rank Interaction with Sparse additive effects (LORIS) model For every entry Yij, as-\nsume a vector of covariates xij \u2208 Rq is also available, e.g., user information and item characteristics.\nDenote xij(k), k \u2208 [q] the k-th component of xij and de\ufb01ne the matrix X(k) = [xij(k)](i,j)\u2208[n]\u00d7[p].\nWe introduce the following decomposition of the parameter matrix M0:\n\nM0 =\n\n\u03b10\n\nkX(k) + \u03980.\n\n(3)\n\nWe call (3) the LORIS model, where \u03b1 \u2208 Rq is a sparse vector with unknown support modeling\nadditive effects and \u03980 \u2208 Rn\u00d7p a low-rank matrix modeling the interactions.\nIn fact, LORIS is a generalization of robust matrix completion [3], where the parameter matrix can\nbe decomposed as the sum of two matrices, one is low-rank and the other has some complementary\nlow-dimensional structure such as entry-wise or column-wise sparsity. Statistical recoverability\nresults in robust matrix estimation under a noiseless setting can be found in [28, 3, 4, 13]; the additive\nnoise setting can be found in a recent work [17]. [23] also provide exact recovery results for more\ngeneral sparsity patterns.\nEstimation Problem Denote \u2126 = {(i, j) \u2208 [n] \u00d7 [p] : Yij is observed} as the observation set.\nFor M \u2208 Rn\u00d7p, L(M) is the negative log-likelihood of the observed data (Y, \u2126) parameterized by\nM. 
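As an illustration of (1)-(2) (a minimal sketch; the dictionary-of-links layout, the unit-variance Gaussian, and the a = 1 Poisson are our simplifications), note that the maximizer in (2) satisfies the first-order condition g_j'(M0_ij) = E[Y_ij], so each target entry is the inverse derivative of the column's link applied to the entrywise mean:

```python
import numpy as np

# Link functions g_j for the three exponential-family examples of (1),
# paired with the inverse of g', which solves (2): the KL-optimal
# parameter satisfies g'(M0_ij) = E[Y_ij].
LINKS = {
    # numerical data: Gaussian with unit variance, g(m) = m^2 / 2
    "gaussian": (lambda m: 0.5 * m**2, lambda mu: mu),
    # binary data: Bernoulli, g(m) = log(1 + e^m)
    "bernoulli": (lambda m: np.log1p(np.exp(m)),
                  lambda mu: np.log(mu / (1.0 - mu))),
    # count data: Poisson with a = 1, g(m) = e^m
    "poisson": (lambda m: np.exp(m), lambda mu: np.log(mu)),
}

def target_parameter(mean, column_type):
    """M0_ij from (2): apply (g')^{-1} to the entrywise mean E[Y_ij]."""
    _, g_prime_inv = LINKS[column_type]
    return g_prime_inv(mean)
```

For instance, a binary column with success probability 1/2 has target parameter log(0.5/0.5) = 0, consistent with the Bernoulli parameterization above.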
Up to an additive constant,\n\nL(M) =\n\n{\u2212YijMij + gj(Mij)} .\n\nq(cid:88)\n\nk=1\n\nFor a > 0, we consider the following estimation problem:\n\n(cid:33)\n\n\u03b1kX(k) + \u0398\n\n+ \u03bbS (cid:107)\u03b1(cid:107)1 + \u03bbL (cid:107)\u0398(cid:107)(cid:63) .\n\n( \u02c6\u03b1, \u02c6\u0398) \u2208 argmin\n(cid:107)\u03b1(cid:107)\u221e\u2264a\n(cid:107)\u0398(cid:107)\u221e\u2264a\n\nL\n\nWe denote by \u02c6M =(cid:80)q\n\nk=1 \u02c6\u03b1kX(k) + \u02c6\u0398 the estimated parameter matrix. The (cid:96)1 and nuclear norm\npenalties are convex relaxations of the sparsity and low-rank constraints, and the regularization\nparameters \u03bbS and \u03bbL serve as trade-offs between \ufb01tting the data and enforcing sparsity of \u03b1 and\ncontrolling the \"effective rank\" of \u0398.\n\n(cid:88)\n(cid:32) q(cid:88)\n\n(i,j)\u2208\u2126\n\nk=1\n\n3\n\n(4)\n\n(5)\n\n\fk=1 |X(k)ij| \u2264 \u03bd.\n\nk (cid:54)= 0, (cid:104)\u03980, X(k)(cid:105) = 0.\n\nStatistical Guarantees Here we establish convergence rates for the joint estimation of \u03b10 and \u03980;\nthe proofs can be found in the supplementary material. Consider the following assumptions.\n\nH1 (cid:13)(cid:13)\u03980(cid:13)(cid:13)\u221e \u2264 a,(cid:13)(cid:13)\u03b10(cid:13)(cid:13)\u221e \u2264 a and for all k \u2208 [q] such that \u03b10\n[n] \u00d7 [p],(cid:80)q\nIn particular, H2 guarantees that for all (\u0398, \u03b1) satisfying H1, the matrix M =(cid:80)q\n\nIn particular, H1 guarantees the uniqueness of the decomposition in the LORIS model (3).\nH2 For \u03bd > 0, all k \u2208 [q] and (i, j) \u2208 [n] \u00d7 [p], X(k)ij \u2208 [\u22121, 1]. Furthermore for all (i, j) \u2208\n\nk=1 \u03b1kX(k) + \u0398\nsatis\ufb01es (cid:107)M(cid:107)\u221e \u2264 (1 + \u03bd)a. Let G be the q \u00d7 q Gram matrix of the dictionary (X(1), . . . 
, X(q))\nde\ufb01ned by G = [(cid:104)X(k), X(l)(cid:105)](k,l)\u2208[q]\u00d7[q].\nH3 For \u03ba > 0 and all \u03b1 \u2208 Rq, \u03b1(cid:62)G\u03b1 \u2265 \u03ba2 (cid:107)\u03b1(cid:107)2\n2 .\nNote we do not consider the case where the Gram matrix is singular, e.g., q > np. For 0 < \u03c3\u2212 \u2264\n\u03c3+ < +\u221e and 0 < \u03b3 < \u221e consider the following assumption on the link functions gj:\nH4 The functions gj are twice differentiable, and for all x \u2208 [\u2212(1 + \u03bd)a \u2212 \u03b3, (1 + \u03bd)a + \u03b3],\n\n\u03c32\u2212 \u2264 g(cid:48)(cid:48)\n\nj (x) \u2264 \u03c32\n\n+, j \u2208 [p].\n\nH4 implies the data \ufb01tting term L(M) is smooth and satis\ufb01es a restricted strong convexity property.\nH5 For all (i, j) \u2208 [n] \u00d7 [p], Yij is a sub-exponential random variable with scale and variance\n+.\nparameters 1/\u03b3 and \u03c32\n\nIf the random variables Yij are actually distributed according to an exponential family distribution of\nthe form (1), then H4 implies H5.\nH6 For (i, j) \u2208 [n]\u00d7 [p], the events \u03c9ij = {(i, j) \u2208 \u2126} are independent with occurrence probability\n\u03c0ij. Furthermore, there exists 0 < \u03c0 \u2264 1 such that for all (i, j) \u2208 [n] \u00d7 [p], \u03c0ij \u2265 \u03c0.\nH6 implies a data missing-at-random scenario where Yij is observed with probability at least \u03c0.\nTheorem 1 Assume H1-6. Set\n\n(cid:112)\u03c0 max(n, p) log(n + p), and \u03bbS = 24 max\n\n\u03bbL = 2C\u03c3+\n\n(6)\n\nwhere C is a positive constant. Assume that max(n, p) \u2265 4\u03c32\n2 exp(\u03c32\n\n+/\u03b32 + 2\u03c32\n\n(cid:107)X(k)(cid:107)1 log(n + p)/\u03b3,\n\n+/\u03b36 log2((cid:112)min(n, p)/(\u03c0\u03b3\u03c3\u2212)) +\n+\u03b3a). 
Then, with probability at least 1 \u2212 9(n + p)\u22121,\n(cid:19)\n\ns maxk (cid:107)X(k)(cid:107)1 log(n + p)\n\n(cid:18) r max(n, p)\n\n+ D\u03b1,\ns maxk (cid:107)X(k)(cid:107)1\n\n\u03ba2\u03c0\n\n(7)\n\nk\n\n(cid:13)(cid:13) \u02c6\u03b1 \u2212 \u03b10(cid:13)(cid:13)2\n(cid:13)(cid:13)(cid:13) \u02c6\u0398 \u2212 \u03980(cid:13)(cid:13)(cid:13)2\n\n2\n\nF\n\n\u2264 C1\n\u2264 C2\n\n+\n\n\u03c0\n\n\u03c0\n\nlog(n + p) + D\u0398.\n\nIn (7), s := (cid:107)\u03b10(cid:107)0, r := rank(\u03980). C1 and C2 are positive constants and D\u03b1 and D\u0398 are residuals\nof lower order whose exact values are given in Appendix A.\n\nThe proof can be found in Appendix A. In Theorem 1, the rate obtained for \u03b10 is the same as\nthe bound obtained in [17] in the special case of robust matrix completion. Examples satisfying\nmaxk (cid:107)X(k)(cid:107)1 /\u03ba2 = O(1) include the case where the elements of the dictionary are matrices are\nall zeros except a row or a column of one, (to model row and column effects) and the number of rows\nn and columns p are of the same order; or when the covariates xij are categorical and the categories\nare balanced, i.e., the number of samples per category is of the same order.\nThe rate obtained for \u03980 is the sum of the standard low-rank matrix completion rate of order\nr max(n, p)/\u03c0, e.g., [16], and of a term which boils down to sparse vector estimation rate as long\nas maxk (cid:107)X(k)(cid:107)1 = O(1). Again, the latter can be satis\ufb01ed by the special case of robust matrix\ncompletion, for which our rates match the results of [17].\n\n4\n\n\f3 A Mixed Coordinate Gradient Descent Method for LORIS\n\nThis section introduces a mixed coordinate gradient descent (MCGD) method to solve the LORIS\nestimation problem (5). We assume that a is suf\ufb01ciently large such that the constraints (cid:107)\u03b1(cid:107)\u221e \u2264\na,(cid:107)\u0398(cid:107)\u221e \u2264 a are always inactive. 
To simplify notation, we denote the log-likelihood function as\n\nL(\u03b1, \u0398) := L ((cid:80)q\n\nk=1 \u03b1kX(k) + \u0398). We assume\n\nH7 (a) L(\u03b1, \u0398) is \u03c3\u0398-smooth w.r.t. \u0398ij for (i, j) \u2208 \u2126 and (b) \u03c3\u03b1-smooth w.r.t. \u03b1; (c) the gradient\n\u2207\u03b1L(\u03b1, \u0398) is \u02c6\u03c3\u0398-Lipschitz w.r.t. \u0398. Moreover, the gradient \u2207\u0398L(\u03b1, \u0398) is bounded as long as\n\u03b1, \u0398 are bounded.\n\nThe above is implied by H4 for bounded (\u03b1, \u0398). We consider the augmented objective function:\n\nF (\u03b1, \u0398, R) := L(\u03b1, \u0398) + \u03bbS(cid:107)\u03b1(cid:107)1 + \u03bbLR .\n\n(8)\nFor some RUB \u2265 0, if an optimal solution ( \u02c6\u03b1, \u02c6\u0398) to (5) satis\ufb01es (cid:107) \u02c6\u0398(cid:107)(cid:63) \u2264 RUB, then any optimal\nsolution to the following problem\n\nP(RUB) :\n\nmin\n\n\u03b1\u2208Rq,\u0398\u2208Rn\u00d7p,R\u2208R+\n\nF (\u03b1, \u0398, R) s.t. RUB \u2265 R \u2265 (cid:107)\u0398(cid:107)(cid:63) ,\n\n(9)\n\nwill also be optimal to (5). For example, ( \u02c6\u03b1, \u02c6\u0398, \u02c6R) with \u02c6R = (cid:107) \u02c6\u0398(cid:107)(cid:63) is an optimal solution to (9). We\nhave de\ufb01ned the problem as P(RUB) to emphasize its dependence on the upper bound RUB. Later we\nshall describe a simple strategy to estimate RUB. We \ufb01x the set \u039e \u2286 [n] \u00d7 [p] where \u2126 \u2286 \u039e is the\ntarget coordinate set for the low rank matrix \u02c6\u0398 that we are interested in.\n\nProposed Method A natural way to exploit structure in P(RUB) is to apply coordinate gradient\ndescent to update \u03b1 and (\u0398, R) separately. While the trace-norm constraint on (\u0398, R) can be handled\nby the conditional gradient (CG) method [14], the (cid:96)1 norm penalization on \u03b1 is more ef\ufb01ciently\ntackled by the proximal gradient method in practice. 
In addition, we tighten the upper bound RUB\non-the-\ufb02y as the algorithm proceeds. The MCGD method goes as follows. At the tth iteration, we are\ngiven the previous iterate (\u03b1(t\u22121), \u0398(t\u22121), R(t\u22121)) and the upper bound R(t)\nUB is computed. The \ufb01rst\nblock \u03b1 is updated with a proximal gradient step:\n\n(cid:0)\u03b1(t\u22121) \u2212 \u03b3\u2207\u03b1L(\u03b1(t\u22121), \u0398(t\u22121))(cid:1)\n\n(cid:0)\u03b1(t\u22121) \u2212 \u03b3\u2207\u03b1L(\u03b1(t\u22121), \u0398(t\u22121))(cid:1) .\n\n\u03b1(t) = prox\u03b3\u03bbS(cid:107)\u00b7(cid:107)1\n\n= T\u03b3\u03bbS\n\n(10)\n\nIn (10), \u2207\u03b1L(\u00b7) is the gradient of the log-likelihood function taken w.r.t. \u03b1, \u03b3 > 0 is a pre-de\ufb01ned\nstep size parameter and T\u03bb(x) := sign(x) (cid:12) (x \u2212 \u03bb1)+ is the component-wise soft thresholding\noperator. Alternatively, we can exactly solve the problem\n\n\u03b1(t) \u2208 arg min\u03b1\u2208Rq F (\u03b1, \u0398(t\u22121), R(t\u22121)) ,\n\nfor which closed-form solution can be obtained in certain special cases (see below).\nThe second block (\u0398, R) is updated with a CG step\n\n(\u0398(t), R(t)) = (\u0398(t\u22121), R(t\u22121)) + \u03b2t( \u02c6\u0398(t) \u2212 \u0398(t\u22121), \u02c6R(t) \u2212 R(t\u22121)) ,\n\nwhere \u03b2t \u2208 [0, 1] is a step size to be de\ufb01ned later. ( \u02c6\u0398(t), \u02c6R(t)) is a direction evaluated as\n(cid:104)Z,\u2207\u0398L(\u03b1(t), \u0398(t\u22121))(cid:105) + \u03bb1R s.t. (cid:107)Z(cid:107)(cid:63) \u2264 R \u2264 R(t)\nUB ,\n\n( \u02c6\u0398(t), \u02c6R(t)) \u2208 arg min\n\nZ,R\n\n(11)\n\n(12)\n\n(13)\n\nand \u2207\u0398L(\u00b7) is the gradient of L(\u00b7) taken w.r.t. \u0398. If (\u0398(t\u22121), R(t\u22121)) is feasible to P(R(t)\nUB), then\n(\u0398(t), R(t)) must also be feasible to P(R(t)\nUB). 
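The proximal update (10) reduces to a gradient step followed by component-wise soft thresholding; a minimal sketch (function names are ours):

```python
import numpy as np

def soft_threshold(x, lam):
    """Component-wise soft thresholding T_lam(x) = sign(x) * (|x| - lam)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_step_alpha(alpha, grad_alpha, gamma, lam_S):
    """Proximal gradient update (10) for the sparse block alpha:
    a step of size gamma along -grad, then the prox of gamma*lam_S*||.||_1."""
    return soft_threshold(alpha - gamma * grad_alpha, gamma * lam_S)
```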
Furthermore, if we let u1, v1 be the top left and right\nsingular vectors of the gradient matrix \u2207\u0398L(\u03b1(t), \u0398(t\u22121)) and \u03c31(\u2207\u0398L(\u03b1(t), \u0398(t\u22121))) be the top\nsingular value, then ( \u02c6\u0398(t), \u02c6R(t)) admits a simple closed form solution:\n\n( \u02c6\u0398(t), \u02c6R(t)) =\n\n(0, 0),\n(\u2212R(t)\n\nUBu1v(cid:62)\n\n1 , R(t)\n\nUB),\n\nif \u03bbL \u2265 \u03c31(\u2207\u0398L(\u03b1(t), \u0398(t\u22121))) ,\nif \u03bbL < \u03c31(\u2207\u0398L(\u03b1(t), \u0398(t\u22121))) .\n\n(14)\n\n(cid:40)\n\n5\n\n\fLastly, the step size \u03b2t is determined by:\n\n(cid:110)\n\n\u03b2t = min\n\n1,\n\n(cid:104)\u0398(t\u22121) \u2212 \u02c6\u0398(t),\u2207\u0398L(\u03b1(t), \u0398(t\u22121))(cid:105) + \u03bbL(R(t\u22121) \u2212 \u02c6R(t))\n\n\u03c3\u0398(cid:107)P\u2126( \u02c6\u0398(t) \u2212 \u0398(t\u22121))(cid:107)2\n\nF\n\n(cid:111)\n\n.\n\n(15)\n\nThe step size strategy ensures decrease in the objective value between successive iterations. This is\nessential for establishing convergence of the proposed method [cf. Theorem 2]. We remark that the\narithmetics in the MCGD method are not affected when we restrict the update of \u0398(t) in (12) to the\nentries in \u039e only. This is due to L(X) = L(P\u2126(X)) and the CG update direction (13) only involves\nthe gradient of \u2207\u0398L(\u03b1(t), \u0398(t\u22121)) w.r.t. entries of \u0398 in \u2126, where \u2126 \u2286 \u039e.\n\nUB We describe a strategy for computing a valid upper bound\n\nComputing the Upper Bound R(t)\nUB for \u02c6R and (cid:107) \u02c6\u0398(cid:107)(cid:63) during the updates in the MCGD method. Let us assume that:\nR(t)\nH8 For all \u0398 and \u03b1, we have L(\u03b1, \u0398) \u2265 0.\nThe above can be enforced as the log-likelihood function is lower bounded [cf. H4]. 
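The closed-form direction (14) only requires the top singular triplet of the gradient; the following sketch uses a dense SVD for clarity (in practice a power method or sparse top-SVD on the |Ω|-supported gradient is preferable):

```python
import numpy as np

def cg_direction(grad_theta, lam_L, R_ub):
    """Conditional-gradient direction (14): either (0, 0), or the rank-one
    atom -R_ub * u1 v1^T scaled by the current upper bound R_ub, depending
    on whether the top singular value of the gradient exceeds lam_L."""
    # dense SVD for illustration only; singular values come sorted descending
    U, s, Vt = np.linalg.svd(grad_theta, full_matrices=False)
    if lam_L >= s[0]:
        return np.zeros_like(grad_theta), 0.0
    u1, v1 = U[:, 0], Vt[0, :]
    return -R_ub * np.outer(u1, v1), R_ub
```

The sign ambiguity of (u1, v1) is harmless since the outer product u1 v1^T is invariant under a joint sign flip.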
From (5) and\nusing the above assumption, it is obvious that\n\nF0(0, 0) = L(0, 0) \u2265 L( \u02c6\u03b1, \u02c6\u0398) + \u03bbS(cid:107) \u02c6\u03b1(cid:107)1 + \u03bbL(cid:107) \u02c6\u0398(cid:107)(cid:63) \u2265 \u03bbL(cid:107) \u02c6\u0398(cid:107)(cid:63),\n\n(16)\nL L(0 + fU (0)) is a valid upper bound to (cid:107) \u02c6\u0398(cid:107)(cid:63); furthermore it can be tightened\nand thus R0\nas we progress in the MCGD method. In particular, observe that ( \u02c6\u03b1, \u02c6\u0398, \u02c6R) with \u02c6R = (cid:107) \u02c6\u0398(cid:107)(cid:63) is an\noptimal solution to P(R0\n\nUB := \u03bb\u22121\n\nUB), we have\n\nF (\u03b1, \u0398, R) \u2265 F ( \u02c6\u03b1, \u02c6\u0398, \u02c6R) = L( \u02c6\u03b1, \u02c6\u0398) + \u03bbS(cid:107) \u02c6\u03b1(cid:107)1 + \u03bbL \u02c6R \u2265 \u03bbL \u02c6R.\n\n(17)\nUB), \u03bb\u22121\nL F (\u03b1, \u0398, R) is an upper bound to \u02c6R and\nL F (\u03b1(t), \u0398(t\u22121), R(t\u22121)) at iteration t, where\nUB) and\n\nUB := \u03bb\u22121\n\nUB \u2265 R(t\u22121). That is, (\u03b1(t), \u0398(t\u22121), R(t\u22121)) is feasible to both P(R(t)\n\n). Lastly, we summarize the MCGD method in Algorithm 1.\n\nUB\n\nIn other words, for all feasible (\u03b1, \u0398, R) to P(R0\n(cid:107) \u02c6\u0398(cid:107)(cid:63). The above motivates us to select R(t)\nwe observe that R(t)\nP(R(t\u22121)\nComputation Complexity Consider the MCGD\nmethod in Algorithm 1. Observe that line 3 re-\nquires computing the gradient w.r.t. \u03b1 which in-\nvolves |\u2126|q Floating Points Operations (FLOPS)\nand the soft thresholding operator involves O(q)\nFLOPS. As the log-likelihood function L(\u00b7) is\nevaluated element-wisely on \u0398, evaluating the\nobjective value and the derivative w.r.t. \u0398 re-\nquires O(|\u2126|) FLOPS. 
As such, line 4 can be\nevaluated in O(|\u2126|) FLOPS and line 5 requires\nO(|\u2126| max{n, p} log(1/\u03b4)) FLOPS where the\nadditional complexity is due to the top SVD\ncomputation and \u03b4 is a preset accuracy level\nof SVD computation. Lastly, line 6 requires\nO(|\u039e|) FLOPS since we only need to update\nthe entries of \u0398 in \u039e [cf. see the remark after\n(15)]. The overall per-iteration complexity is\nO(|\u039e| + |\u2126|(max{n, p} log(1/\u03b4) + q)).\n\nAlgorithm 1 MCGD Method for (9).\n1: Initialize: \u2014 \u0398(0), \u03b1(0), R(0).\n\nE.g.,\n\n\u0398(0), \u03b1(0), R(0) = (0, 0, 0).\n\n2: for t = 1, 2, . . . , T do\n3:\n\n4:\n\n// Update for \u03b1 //\nCompute the proximal update using (10)\n[or exact update via (11)] to obtain \u03b1(t).\n// Update for (\u0398, R) //\nCompute the upper bound as R(t)\n\u03bb\u22121\nL F (\u03b1(t), \u0398(t\u22121), R(t\u22121)).\n\nUB :=\n\n5: Compute the update direction, ( \u02c6\u0398(t), \u02c6R(t)),\n\nusing Eq. (14).\n\n6: Compute the CG update using (12), where\n\nthe step size \u03b2t is set as Eq. (15).\n\n7: end for\n8: Return: \u0398(T ), \u03b1(T ), R(T ).\n\nFrom the above, the per-iteration computation complexity of the MCGD method scales linearly with\nthe problem dimension max{n, p} and |\u2126|. This is comparable to [24, 9], where the former focuses\nonly on the least square loss case. The following theorem, whose proof can be found in Appendix C,\nshows that the MCGD method converges at a sublinear rate.\n\nTheorem 2 Assume H7 and H8. 
De\ufb01ne the quantity\n\n(cid:110) 24(Q(t))2\n\nC(t) := max\n\n,\n\n24\u02c6\u03c32\n\n\u0398(Q(t))2\n\u03c3\u0398\n\n\u03b3\n\n+ max{6R(t)\n\nUB(\u03bbL + M (t)), 24\u03c3\u0398(R(t)\n\n,\n\n(18)\n\nUB)2}(cid:111)\n\n6\n\n\f,\n\nUB)2}(cid:111)\n(cid:17)\u22121\n\n1\n\n(cid:16) 1\n\nT(cid:88)\n\nT\n\nt=1\n\nC(t)\n\n(cid:17)\n\n(cid:110) 24(Q(0))2\n\n(cid:16) 1\n\n\u0001\n\n:= \u03bb\u22121\n\nwhere Q(t)\n:=\nL F (\u03b1(t), \u0398(t\u22121), R(t\u22121)). If we choose the step sizes as \u03b3 \u2264 1/\u03c3\u03b1 and \u03b2t as in (15), then\n\u03bb\u22121\n(i) the above quantity is upper bounded as C(t) \u2264 C for all t \u2265 1, where\n\nS F (\u03b1(t), \u0398(t), R(t)), M (t)\n\n:= (cid:107)\u2207\u0398L(\u03b1(t), \u0398(t\u22121))(cid:107)2 and R(t)\n\nUB\n\nC := max\n\n,\n\n24\u02c6\u03c32\n\n\u0398(Q(0))2\n\u03c3\u0398\n\n\u03b3\n\n+ max{6R(0)\n\nUB(\u03bbL + \u00afM ), 24\u03c3\u0398(R(0)\n\n(19)\n\nsuch that \u00afM is an upper bound to M (t), and (ii) the MCGD method converges to an \u0001-optimal\nsolution to (5) in T iterations, i.e., F0(\u03b1(T ), \u0398(T )) \u2212 F0( \u02c6\u03b1, \u02c6\u0398) \u2264 \u0001, where\n\nT \u2265 C(T )\n\n\u2212\n\n1\n\nF0(\u03b1(0), \u0398(0)) \u2212 F0( \u02c6\u03b1, \u02c6\u0398)\n\nwith C(T ) :=\n\n+\n\n.\n\n(20)\n\nIn particular, as C(T ) \u2264 C, at most C(\u0001\u22121 \u2212 (F0(\u03b1(0), \u0398(0)) \u2212 F0( \u02c6\u03b1, \u02c6\u0398))\u22121)+ iterations are\nrequired for the MCGD method to reach an \u0001-optimal solution to (5).\n\nDetailed Comparison to Prior Algorithms Previous contributions have focused on the special\ncase of (5) where q = np, the dictionary (X(1), . . . , X(q)) is the canonical basis of Rn\u00d7p, and the\nlink functions are quadratic. In this particular case, (5) becomes the estimation problem solved in\nsparse plus low-rank matrix decomposition. 
Popular examples are the alternating direction method of\nmultiplier [21, 26] or the projected gradient method on a reformulated problem [5]. These methods\neither require computing a complete SVD or knowing the optimal rank number of \u0398 a priori. When\nn, p (cid:29) 1, it is computationally prohibitive to evaluate the complete SVD since each iteration would\nrequire O(max{n2p, p2n}) FLOPS. Other related work rely on factorizing the low-rank component,\nyielding nonconvex problems [11]; see also [29] and references therein.\nSimilar to the development of MCGD, a natural alternative is to apply algorithms based on the CG\n(a.k.a. Frank-Wolfe) method [14], whose iterations only require the computation of a top SVD. The\npresent work is closely related to the efforts in [24, 9] which focused on the quadratic setting. Mu\net al. [24] combines the CG method with proximal update as a two-steps procedure; Garber et al. [9]\ncombines a CD method with CG updates on both the sparse and low-rank components. The work in\n[9] is also related to [19, 2] which combine CD with CG updates for solving constrained problems,\ninstead of penalized problems like (5). Sublinear convergence rates are proven for the above methods.\nFinally, Fithian and Mazumder [8] also suggested to apply CD on (5), yet the convergence properties\nwere not discussed.\nIn fact, when the MCGD\u2019s result is specialized to the same setting as [24], our worst-case bound on\niteration number computed with C match the bound in [24]. As shown in the supplementary material,\nwe have C(t) \u2192 C (cid:63), where C (cid:63) depends on the optimal objective value of (9) and is smaller than\nC. Since the quantity C(T ) in (20) is an average of {C(t)}T\nt=1, this implies that the MCGD method\nrequires less number of iterations for convergence than that is required by [24]. Such reduction is\npossible due to the on-the-\ufb02y update for R(t)\nUB. 
Moreover, our analysis in Theorem 2 holds when the\nMCGD method is implemented with a few practical modi\ufb01cations.\n\nExact Partial Minimization for \u03b1 Consider the special case of (5) where the link functions are\neither quadratic or exponential and the dictionary matrices satisfy:\n\nsupp(X(k)) \u2229 supp(X(k(cid:48))) = \u2205, k (cid:54)= k(cid:48) and [X(k)]i,j = ck, \u2200 (i, j) \u2208 supp(X(k)) .\n\n(21)\nIn this case, the partial minimization (11) can be decoupled into q scalar optimizations involving\none coordinate of \u03b1, which can be solved in closed form. Note that this modi\ufb01cation to the MCGD\nmethod is supported by Theorem 2 and the sublinear convergence rate holds. On the contrary, closed\nform update of \u03b1 is not supported by prior works such as [24, 9, 19, 2].\n\nDistributed MCGD Optimization Consider the case where the observed data entries are stored\nacross K workers, each of them communicating with a central server. It is natural to distribute the\nMCGD optimization over these workers to of\ufb02oad computation burden, or for privacy protection.\nFormally, we divide \u2126 into K disjoint partitions such that \u2126 = \u21261 \u222a\u00b7\u00b7\u00b7\u222a \u2126K and worker k holds \u2126k.\nk=1 Lk(\u03b1, \u0398), where Lk(\u03b1, \u0398) is de\ufb01ned by replacing the summation\nover \u2126 with \u2126k in (4). Clearly, when \u03b1 and P\u2126k (\u0398) are given to the kth worker, the worker will be\n\nIn this way, L(\u03b1, \u0398) =(cid:80)K\n\n7\n\n\fsingular vectors of the gradient matrix \u2207\u0398L(\u03b1, \u0398) =(cid:80)K\n\nable to evaluate the local loss function and its gradient.\nAs shown in Appendix D, the MCGD method can be easily extended to utilize distributed computation.\nThe proximal update in line 3 is replaced by the following procedure. First, the local gradients\ncomputed by the workers are aggregated, then the soft thresholding operation is performed at\nthe central server. 
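The distributed update for α just described can be sketched as follows (a single-process simulation in which worker/server communication is abstracted into a Python list of local gradient vectors):

```python
import numpy as np

def soft_threshold(x, lam):
    """Component-wise soft thresholding T_lam(x) = sign(x) * (|x| - lam)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def distributed_alpha_update(alpha, local_grads, gamma, lam_S):
    """Distributed version of the proximal step in line 3 of Algorithm 1:
    each worker k contributes the gradient of its local loss L_k w.r.t.
    alpha (a length-q vector); the server sums these and applies the
    soft-thresholding operator."""
    grad = np.sum(local_grads, axis=0)  # aggregation at the central server
    return soft_threshold(alpha - gamma * grad, gamma * lam_S)
```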
Meanwhile, as the CG update in line 5 essentially requires computing the top\nk=1 \u2207\u0398Lk(\u03b1,P\u2126k (\u0398)), the latter can be\nimplemented through a distributed version of the power method exploiting the decomposable structure\nof the gradient, such as described in [30]. It only requires O(log(1/\u03b4)) power iterations to compute\na top SVD solution of accuracy \u03b4. Thus, for a suf\ufb01ciently small \u03b4 > 0, the overall per-iteration\ncomplexity of the distributed method at the tth iteration is reduced to O(|\u039e| + max{n, p} log(1/\u03b4))\nat the central server, and O(|\u2126k|(max{n, p} log(1/\u03b4) + q)) at the kth worker.\n\n4 Numerical Experiments\n\nExperimental Setup We \ufb01rst generate the target parameter M0 according to the LORIS model in\n(3). For the sparse additive effects component, we consider q = pn/5 where we set (X(k))ij = 1 if\nj(n \u2212 1) + i \u2208 {5(k \u2212 1) + 1, ..., 5k}. This models a categorical variable containing n/5 categories.\nFurthermore, the target sparse component \u03b10 has a sparsity level of 10%. For the low-rank component,\nthe target parameter \u03980 is generated as a rank-4 matrix formed by the outer product of random\northogonal vectors. Notice that due to the structure of sparse additive effects, the surveyed prior\nmethods [21, 11, 5] cannot be applied directly.\n\nGaussian Design To compare our framework to a reasonable benchmark, we focus on a homoge-\nnous setting with numerical data modeled with the quadratic link function g(m) = m2. We set the\nregularization parameters \u03bbS and \u03bbL to the theoretical values given in Theorem 1. We compare\nour result with a common two-step procedure where the components \u03b1kj are \ufb01rst estimated in\na preprocessing step as the means of the variables taken by group; then \u0398 is estimated using the\nsoftImpute method proposed in [12]. 
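The two-step baseline just described can be sketched as follows. This is an illustrative re-implementation under a quadratic loss with fully observed data (function and variable names are hypothetical), not the authors' or the softImpute package's code:

```python
import numpy as np

def two_step_baseline(Y, groups, lam):
    """Two-step baseline: (1) estimate additive effects as group means,
    (2) fit the low-rank part by singular-value soft-thresholding of
    the residuals (softImpute-style). `groups` is an integer array of
    the same shape as Y mapping each entry to its dictionary group."""
    # Step 1: one mean per group, stored in alpha.
    n_groups = groups.max() + 1
    alpha = np.zeros(n_groups)
    main = np.zeros_like(Y, dtype=float)
    for k in range(n_groups):
        mask = (groups == k)
        alpha[k] = Y[mask].mean()
        main[mask] = alpha[k]
    # Step 2: soft-threshold the singular values of the residual matrix.
    U, s, Vt = np.linalg.svd(Y - main, full_matrices=False)
    s_thr = np.maximum(s - lam, 0.0)
    Theta = (U * s_thr) @ Vt
    return alpha, Theta
```

The key point of the comparison is that this baseline estimates α without accounting for Θ, whereas LORIS estimates both components jointly.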
The regularization parameter for [12] is set to the same value λL. We compare the results in terms of estimation error and computing time in Table 1, after letting the two methods converge to the same precision of 10⁻⁵. We observe that the two methods perform equally well in terms of estimating Θ. LORIS yields constant estimation errors of α0 as the dimension increases while the support of α0 is kept constant, contrary to the two-step procedure, for which the estimation error of α0 increases with the dimension. As expected, the two-step method is faster for small data sets, whereas for large data sets LORIS is superior in computational time. The above results are consistent with our theoretical findings.

problem size (n × p)    time (secs)            ‖Θ0 − Θ̂‖²_F            ‖α0 − α̂‖²₂
                        LORIS     two-step     LORIS      two-step     LORIS    two-step
150 × 30                0.17      0.02         52         52           1.8      3.0
1,500 × 300             13.8      10.7         175.5      234          0.95     17.1
15,000 × 300            130.2     136.6        675        720          0.95     16.2
15,000 × 3,000          348       528          2.7 × 10³  2.6 × 10³    2.34     180

Table 1: Comparison of the proposed method with a two-step method in terms of computation time and estimation error for increasing dimensions (averaged over 10 experiments).

Survey data  To test the efficacy of our framework with heterogeneous data, we examine a survey conducted by the French National Institute of Statistics (Insee: http://www.insee.fr/) concerning the hobbies of French people. The data set contains n = 8,403 individuals and p = 19 binary and quantitative variables, indicating whether or not the person has been involved in different activities (reading, fishing, etc.), the number of hours spent watching TV, and the overall number of hobbies of the individuals.
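Group-wise additive effects such as these enter the LORIS model through indicator dictionary matrices with disjoint supports, as required by condition (21). A small illustrative construction (names hypothetical), which builds one indicator matrix per (group, column) pair:

```python
import numpy as np

def group_dictionaries(row_groups, p):
    """Build indicator dictionary matrices X^(k), one per
    (group g, column j) pair: [X^(k)]_{ij} = 1 iff row i belongs to
    group g and the column is j. The supports are pairwise disjoint
    and each X^(k) is constant (c_k = 1) on its support, matching the
    structure assumed in (21)."""
    n = len(row_groups)
    dicts = []
    for g in np.unique(row_groups):
        rows = (row_groups == g)
        for j in range(p):
            X = np.zeros((n, p))
            X[rows, j] = 1.0
            dicts.append(X)
    return dicts
```

With this encoding, one coefficient of α corresponds to one (group, variable) pair, which is exactly the layout of the age-category effects reported below.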
Individuals are grouped by age category (15–25, 25–35, etc.): this categorical variable is used as a predictor of the survey responses in the subsequent experiment. We introduce 30% missing values into the data set, and compare the imputation errors of LORIS with a mixed data model (using a quadratic loss for numeric columns, a logistic loss for binary columns and a Poisson loss for counts) and of LORIS with a Gaussian data model against that of softImpute.

Figure 2: Imputation error of LORIS with mixed data model and Gaussian data model, and softImpute (10 replications), for categorical variables (left) and quantitative variables (right).

The results are given in Figure 2 across 10 replications of the experiment, and show that, for this example, both LORIS models improve on the baseline softImpute by a factor of 2. We also observe that explicitly modeling the binary variables leads to better imputation.
Finally, we apply LORIS with a mixed data model to the original data set. A subset of the resulting α vector is given in Table 2. There is a coefficient αkj for every age category k and every variable j. The coefficients in Table 2 indicate that young individuals engage in activities such as music and sport more than older people, with the opposite trend for collecting, knitting and fishing. Some coefficients are set to zero, indicating the absence of an effect of the age category on the variable.
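These exact zeros are a direct consequence of the ℓ1 penalty on α: the proximal step of MCGD applies entrywise soft-thresholding, which shrinks every coefficient and maps those with magnitude below the threshold exactly to zero. A minimal sketch of that operator:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||x||_1: shrinks every entry of x
    toward zero by lam and sets entries with |x_i| <= lam exactly
    to zero, producing the sparse effect estimates."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# e.g. soft_threshold(np.array([2.2, 0.05, -1.1]), 0.1)
# shrinks 2.2 to 2.1, zeroes out 0.05, and shrinks -1.1 to -1.0
```

This is why, unlike a group-means preprocessing step, the estimated α can certify "no effect" for a (category, variable) pair.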
We also observe that younger people engage in more activities overall than older people.

Age category   Music   Sport   Collecting   Mechanic   Knitting   Fishing   Nb activities
25-35          2.2     0.4     -1.7         -2.1       0          -1.9      10.0
35-45          2.0     0.3     -2.3         -2.7       0          -2.3      13.0
45-55          1.1     -0.8    -2.7         -2.1       0          -2.7      13.8
55-65          0       -2.2    -1.0         -1.9       0          -1.6      8.8
65-75          0       -2.1    -0.7         -1.4       -1.1       -1.3      5.5
75-85          -0.1    -0.9    -0.1         -0.6       -0.5       -0.6      2.2

Table 2: Estimated age category effects (α).

Conclusion  In this paper, we proposed a new framework for handling large data frames with heterogeneous data and missing values which incorporates additive effects. It consists of a doubly penalized quasi-maximum likelihood estimator and a new optimization algorithm to implement the estimator. We examined both the statistical and computational efficiency of the framework and derived worst-case bounds on its performance. Future work includes the incorporation of qualitative features with more than two categories, and of missing values in the dictionary matrices.

5 Acknowledgements

The authors would like to thank the three anonymous reviewers for their useful comments. HTW's work was supported by the grant NSF CCF-BSF 1714672.

References
[1] A. Agresti. Categorical Data Analysis, 3rd Edition. Wiley, 2013.

[2] A. Beck, E. Pauwels, and S. Sabach. The cyclic block conditional gradient method for convex optimization problems. SIAM Journal on Optimization, 25(4):2024–2049, 2015.

[3] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, June 2011. doi: 10.1145/1970392.1970395.

[4] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky.
Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011. doi: 10.1137/090761793.

[5] Y. Chen and M. J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. CoRR, abs/1509.03025, 2015.

[6] J. de Leeuw. Principal component analysis of binary data by iterated singular value decomposition. Comput. Stat. Data Anal., 50(1):21–39, Jan. 2006. doi: 10.1016/j.csda.2004.07.010.

[7] A. Feuerverger, Y. He, and S. Khatri. Statistical significance of the Netflix challenge. Statist. Sci., 27(2):202–231, 2012. doi: 10.1214/11-STS368.

[8] W. Fithian and R. Mazumder. Flexible Low-Rank Statistical Modeling with Missing Data and Side Information. Statistical Science, 33(2):238–260, 2018.

[9] D. Garber, S. Sabach, and A. Kaplan. Fast generalized conditional gradient method with applications to matrix recovery problems. arXiv preprint arXiv:1802.05581, 2018.

[10] G. Gidel, F. Pedregosa, and S. Lacoste-Julien. Frank-Wolfe splitting via augmented Lagrangian method. In OPTML 2017: 10th NIPS Workshop on Optimization for Machine Learning (NIPS 2017), page 21, 2017. URL http://opt-ml.org/papers/OPT2017_paper_21.pdf.

[11] Q. Gu, Z. W. Wang, and H. Liu. Low-rank and sparse structure pursuit via alternating minimization. In A. Gretton and C. C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 600–609, Cadiz, Spain, 2016. PMLR.

[12] T. Hastie, R. Mazumder, J. Lee, and R. Zadeh. Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.
The Journal of Machine Learning Research, 16:3367–3402, 2015.

[13] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221–7234, 2011.

[14] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML (1), pages 427–435, 2013.

[15] H. A. L. Kiers. Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56(2):197–212, Jun 1991. doi: 10.1007/BF02294458.

[16] O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.

[17] O. Klopp, K. Lounici, and A. B. Tsybakov. Robust matrix completion. Probability Theory and Related Fields, 169(1):523–564, Oct 2017. doi: 10.1007/s00440-016-0736-y.

[18] N. K. Kumar and J. Schneider. Literature survey on low rank approximation of matrices. Linear and Multilinear Algebra, 65(11):2212–2244, 2017. doi: 10.1080/03081087.2016.1267104.

[19] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning, Volume 28, pages I–53. JMLR.org, 2013.

[20] A. J. Landgraf and Y. Lee. Generalized principal component analysis: Projection of saturated model parameters. Technical report, The Ohio State University, Department of Statistics, 2015.

[21] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In Advances in Neural Information Processing Systems, pages 612–620, 2011.

[22] L. T. Liu, E. Dobriban, and A. Singer.
ePCA: High dimensional exponential family PCA. Annals of Applied Statistics, to appear, 2018.

[23] M. Mardani, G. Mateos, and G. B. Giannakis. Recovery of low-rank plus compressed sparse matrices with application to unveiling traffic anomalies. IEEE Transactions on Information Theory, 59(8):5186–5205, Aug 2013. doi: 10.1109/TIT.2013.2257913.

[24] C. Mu, Y. Zhang, J. Wright, and D. Goldfarb. Scalable robust matrix recovery: Frank-Wolfe meets proximal methods. SIAM Journal on Scientific Computing, 38(5):A3291–A3317, 2016. doi: 10.1137/15M101628X.

[25] J. Pagès. Multiple Factor Analysis by Example Using R. Chapman and Hall/CRC, 2014.

[26] M. Tao and X. Yuan. Recovering low-rank and sparse components of matrices from incomplete and noisy observations. SIAM Journal on Optimization, 21(1):57–81, 2011.

[27] M. Udell, C. Horn, R. Zadeh, and S. Boyd. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1), 2016. doi: 10.1561/2200000055.

[28] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, NIPS'10, pages 2496–2504, 2010. Curran Associates Inc.

[29] X. Zhang, L. Wang, and Q. Gu. A unified framework for nonconvex low-rank plus sparse matrix recovery. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, pages 1097–1107, 2018. URL http://proceedings.mlr.press/v84/zhang18c.html.

[30] W. Zheng, A. Bellet, and P. Gallinari. A distributed Frank-Wolfe framework for learning low-rank matrices with the trace norm.
arXiv preprint arXiv:1712.07495, 2017.