{"title": "An Improved Analysis of Alternating Minimization for Structured Multi-Response Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 6616, "page_last": 6627, "abstract": "Multi-response linear models aggregate a set of vanilla linear models by assuming correlated noise across them, which has an unknown covariance structure. To find the coefficient vector, estimators with a joint approximation of the noise covariance are often preferred than the simple linear regression in view of their superior empirical performance, which can be generally solved by alternating-minimization type procedures. Due to the non-convex nature of such joint estimators, the theoretical justification of their efficiency is typically challenging. The existing analyses fail to fully explain the empirical observations due to the assumption of resampling on the alternating procedures, which requires access to fresh samples in each iteration. In this work, we present a resampling-free analysis for the alternating minimization algorithm applied to the multi-response regression. In particular, we focus on the high-dimensional setting of multi-response linear models with structured coefficient parameter, and the statistical error of the parameter can be expressed by the complexity measure, Gaussian width, which is related to the assumed structure. More importantly, to the best of our knowledge, our result reveals for the first time that the alternating minimization with random initialization can achieve the same performance as the well-initialized one when solving this multi-response regression problem. Experimental results support our theoretical developments.", "full_text": "An Improved Analysis of Alternating Minimization\n\nfor Structured Multi-Response Regression\n\nSheng Chen \u2217\nThe Voleon Group\n\nchen2832@umn.edu\n\nDept. 
of Computer Science & Engineering\n\nUniversity of Minnesota, Twin Cities\n\nArindam Banerjee\n\nbanerjee@cs.umn.edu\n\nAbstract\n\nMulti-response linear models aggregate a set of vanilla linear models by assuming\ncorrelated noise across them, which has an unknown covariance structure. To \ufb01nd\nthe coef\ufb01cient vector, estimators with a joint approximation of the noise covariance\nare often preferred than the simple linear regression in view of their superior\nempirical performance, which can be generally solved by alternating-minimization-\ntype procedures. Due to the non-convex nature of such joint estimators, the\ntheoretical justi\ufb01cation of their ef\ufb01ciency is typically challenging. The existing\nanalyses fail to fully explain the empirical observations due to the assumption of\nresampling on the alternating procedures, which requires access to fresh samples in\neach iteration. In this work, we present a resampling-free analysis for the alternating\nminimization algorithm applied to the multi-response regression. In particular,\nwe focus on the high-dimensional setting of multi-response linear models with\nstructured coef\ufb01cient parameter, and the statistical error of the parameter can be\nexpressed by the complexity measure, Gaussian width, which is related to the\nassumed structure. More importantly, to the best of our knowledge, our result\nreveals for the \ufb01rst time that the alternating minimization with random initialization\ncan achieve the same performance as the well-initialized one when solving this\nmulti-response regression problem. 
Experimental results support our theoretical developments.

1 Introduction

We consider the following multi-response linear model [1, 5, 18] with m real-valued outputs,

    y = Xθ* + η ,  where η = Σ*^{1/2} η̃ ,    (1)

where y ∈ R^m is the response vector, X ∈ R^{m×p} consists of m p-dimensional feature vectors, and η̃ ∈ R^m is a zero-mean isotropic noise vector. The m responses share the same underlying parameter θ* ∈ R^p, which corresponds to the so-called pooled model [17]. Without loss of generality, the counterpart of (1) with response-specific parameters can be equivalently written in the above form, by block-diagonalizing rows of X and concatenating the different parameters into a single vector. What makes this model different from vanilla linear models is the correlated noise η across responses, which is assumed to be a linear transformation of η̃. The noise covariance of η is given by Cov(η) = Σ*. This model has found a number of real-world applications, such as econometrics [17], computational biology [22] and climate informatics [14, 15], just to name a few.

In practice, we are given n observations of (X, y), denoted by D = {(X_i, y_i)}_{i=1}^n, while the noise covariance structure Σ* between responses is typically unknown. Our goal is to estimate the parameter θ*, ideally together with Σ*.

*This work was done while the author was a student at the University of Minnesota, Twin Cities.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this work, we additionally focus on the high-dimensional regime, where the true parameter θ* is assumed to possess certain low-complexity structure measured by some function f : R^p → R_+, which can be either convex (e.g., norms) or non-convex (e.g., L0 cardinality). For the low-dimensional setting, it has been shown, both empirically and theoretically, that simultaneously estimating Σ and θ leads to better performance than ordinary least squares [20]. Inspired by this fact, we consider the joint estimator of (Σ, θ) in high dimension as follows,

    (θ̂_n, Σ̂_n) = argmin_{θ ∈ R^p, Σ ≻ 0}  (1/2n) ∑_{i=1}^n ‖Σ^{-1/2}(y_i − X_i θ)‖_2^2 + (1/2) log|Σ|   s.t.  f(θ) ≤ λ ,    (2)

which corresponds to the constrained maximum likelihood estimator (MLE) of (Σ, θ) when the noise is multivariate Gaussian. The structural assumption on θ* is encoded by the inequality constraint. Though the noise structure is accounted for in this joint estimator, one challenge faced by the associated optimization problem is the non-convexity of the objective function. In light of the simplicity of the marginal optimization over Σ and θ when the other is fixed, a popular approach to dealing with such a problem is alternating minimization (AltMin), i.e., alternately solving for Σ (resp. θ) while keeping θ (resp. Σ) fixed. For problem (2), the updates of AltMin can be written as

    Σ̂^{(t+1)} = (1/n) ∑_{i=1}^n (y_i − X_i θ̂^{(t)})(y_i − X_i θ̂^{(t)})^T ,    (3)

    θ̂^{(t+1)} = argmin_{θ ∈ R^p}  (1/2n) ∑_{i=1}^n ‖Σ̂_{(t+1)}^{-1/2}(y_i − X_i θ)‖_2^2   s.t.  f(θ) ≤ λ ,    (4)

which will be executed for a number of iterations, say T. Here the new Σ̂^{(t+1)} is obtained by computing the empirical covariance of the residuals estimated at θ̂^{(t)}.
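Concretely, the two updates (3)–(4) can be sketched as follows (a minimal NumPy sketch; the constraint f(θ) ≤ λ is dropped for illustration, so the θ-step below is plain whitened least squares rather than the constrained solver the analysis assumes):

```python
import numpy as np

def sigma_update(X_list, y_list, theta):
    # Eq. (3): empirical covariance of the residuals at the current theta.
    R = np.stack([y - X @ theta for X, y in zip(X_list, y_list)])  # n x m
    return R.T @ R / len(X_list)

def theta_update(X_list, y_list, Sigma):
    # Eq. (4) without the f(theta) <= lambda constraint: whiten each
    # observation by L^{-1} where Sigma = L L^T (L^{-1} plays the role of
    # Sigma^{-1/2} in the quadratic form), then solve stacked least squares.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma))
    Xw = np.vstack([L_inv @ X for X in X_list])
    yw = np.concatenate([L_inv @ y for y in y_list])
    theta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return theta
```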
Though f can potentially be non-convex, the update of θ̂^{(t+1)} merely solves a constrained least squares problem, for which various algorithms are guaranteed to find the global minimum under mild conditions on the data [21, 3, 33]. Generally speaking, both steps are easy to implement, which makes AltMin more attractive than other optimization algorithms that jointly update θ and Σ. In the low-dimensional setting, the AltMin algorithm for multi-response regression was initially proposed by [28]. For the high-dimensional counterpart with sparse parameters, previous works [32, 23, 31] considered regularized MLE approaches, which are also solved by AltMin-type algorithms. Unfortunately, none of those works provide finite-sample statistical guarantees for their algorithms. The first attempt to establish a non-asymptotic error bound for this AltMin approach was made by [20] for the low-dimensional regime, with a brief extension to the sparse parameter setting using the iterative hard thresholding method [21], but they did not allow more general structure of the parameter. One closely related work is [10], which focuses on general parameter structures captured by norms. They proposed an alternating estimation framework, in which the generalized Dantzig selector [8] is used for the θ-step as an alternative to the regularized and the constrained estimators.

The AltMin technique has also been applied to many other estimation problems, such as matrix completion [19], phase retrieval [27], and mixed linear regression [44]. However, the current theoretical understanding of AltMin is still incomplete. Including the aforementioned works, the statistical guarantees for non-convex AltMin procedures are often shown under the resampling assumption, which assumes that each iteration receives a fresh sample. Although this can be achieved by partitioning the data into disjoint subsets and using a different batch in each update, people seldom do so in practice, as it usually results in worse performance than using all the data in every iteration. From the theoretical perspective, the resampling assumption oversimplifies the analysis of the algorithm used in practice, which may otherwise require sophisticated proof techniques [36].

In this paper, we aim at a better way to bound the statistical error of the above AltMin procedure for general structure-inducing f. In principle, non-asymptotic statistical analyses for the high-dimensional setting typically involve bounding suprema of stochastic processes [26, 2, 42, 30]. The difficulty of analyzing AltMin lies in the dependency between the data and the obtained iterates, and this lack of independence prevents applications of various concentration inequalities to the suprema of the target processes. The resampling assumption facilitates the analysis of AltMin by assuming access to new data that are independent of previous iterates. In contrast to resampling, we here resort to uniform bounds to tackle the dependency issue. That is, instead of dealing with the processes involving the specific iterates generated by AltMin, we try to bound their worst-case counterparts that consider all possible iterates before running AltMin. This treatment of the dependency leads to more complicated stochastic processes, which need careful handling. By applying generic chaining [37], an advanced tool from probability theory, we are able to obtain the desired bounds for the processes under consideration, and eventually express the error bound in terms of a complexity measure called Gaussian width [16, 7] (see Section 3.1). In particular, we analyze the AltMin procedure under two different choices of initialization, one with an arbitrarily initialized iterate and the other starting at a point close to θ*.
The L2-error for both types of AltMin is shown to converge geometrically to a certain minimum achievable error e_min with overwhelming probability, i.e.,

    ‖θ̂^{(T)} − θ*‖_2 ≤ e_min + ρ_n^T · (‖θ̂^{(0)} − θ*‖_2 − e_min) ,    (5)

where ρ_n < 1 is the contraction factor and e_min is given by

    e_min = O((w(C) + m)/√n)   (arbitrary initialization) ,    (6)
    e_min = O(w(C)/√n)         (good initialization) .    (7)

Here w(C) is the Gaussian width of a set C related to the structure of θ* (see Definition 2). Surprisingly, the error for good initializations matches, up to a constant, the resampling-based result, which requires more fresh data to achieve such a bound. In general, our work improves the results in [20, 10] in several aspects. First, our analysis does not rely on the resampling assumption. Second, our statistical guarantees work for general sub-Gaussian noise, while [20] and [10] only considered Gaussian noise. Third, we allow the complexity function f to be non-convex, whereas [10] required f to be a norm. Last but not least, our result suggests that when the amount of data is adequate for the error bound (6) to meet the requirement of good initialization, AltMin with arbitrary initialization can achieve the same level of error as the well-initialized one. Although this type of guarantee for arbitrary initializations was discovered for other problems [43], it has not been revealed for multi-response regression, and our proof technique is also different from the existing ones.

The rest of the paper is organized as follows. In Section 2, we outline our strategy for combating non-convexity and present the algorithmic details of the AltMin procedure for structured multi-response regression.
In Section 3, we present the statistical guarantees for the AltMin algorithm under suitable probabilistic assumptions. We provide some experimental results in Section 4, and conclude in Section 5. All proofs are deferred to the supplementary material.

2 Strategy to Conquer Non-Convexity

For many statistical estimation problems, we can construct an estimator of the underlying model parameter w* by minimizing a certain loss function on the given sample D,

    ŵ = argmin_{w ∈ W} L(w; D) .    (8)

In order to show recovery guarantees for non-convex estimation, there are mainly two commonly-used strategies. One strategy is to show certain local convergence in a neighborhood N of the global minimizer ŵ of (8) [6, 29, 39, 45, 25]. With a proper initialization inside N, subsequent iterates produced by some local search might be able to converge to ŵ, whose statistical error is expected to be small. This strategy is particularly suitable for the noiseless setting, as ŵ is equal to w*, and most of the existing works use gradient-descent-type algorithms or their variants as workhorses. The other strategy is to show that there are no spurious local minima of L under the assumed statistical models, so that any optimization algorithm that provably converges to local minima will suffice for a good estimation [34, 35, 4, 13, 24, 12].

For our multi-response regression problem, however, it is difficult to apply the aforementioned strategies. First, bounding the statistical error of the global minimizer is nontrivial in the noisy setting, especially when the objective L(w) involves more than one set of variables as in multi-response regression, let alone characterizing the equivalence of all local minima. Second, gradient-based local search is inefficient for problem (2), since the update of Σ involves matrix inversion and projection onto the positive semidefinite (PSD) cone. In contrast, the AltMin procedure has a closed-form solution for the Σ-step, which is preferred in this setting.

In this work, we consider another strategy for non-convex estimation in which w (resp. w*) is composed of two parameters, a and b (resp. a* and b*). The loss L is assumed to be jointly non-convex over a and b, but might be marginally convex w.r.t. a (resp. b) when b (resp. a) is fixed. When the marginal subproblems are easy to solve, an alternating minimization procedure is appealing for the purpose of estimation, which is true for multi-response regression. The AltMin algorithm executes the following updates,

    â^{(t+1)} = argmin_{a ∈ A} L(a, b̂^{(t)}; D) ,    b̂^{(t+1)} = argmin_{b ∈ B} L(â^{(t+1)}, b; D) .    (9)

The basic idea for showing the statistical guarantees of AltMin is to derive statistical error bounds for both the a- and b-steps when the other parameter is fixed to the latest estimate. Since both subproblems in (9) are usually simpler, the separate errors might be easier to characterize than when considered jointly; they are ideally of the form

    d_1(â^{(t+1)}, a*) ≤ e_1(d_2(b̂^{(t)}, b*)) ,    d_2(b̂^{(t+1)}, b*) ≤ e_2(d_1(â^{(t+1)}, a*)) .    (10)

The function d_1 (respectively d_2) characterizes the closeness between â^{(t+1)} and a* (resp. b̂^{(t+1)} and b*); it is nonnegative with d_1(a*, a*) = 0 (resp. d_2(b*, b*) = 0), but not necessarily a metric. The choice of d_1 and d_2 depends on the goal of the analysis for the problem under consideration, and a suitable combination of d_1 and d_2 may facilitate the proof. The upper bound e_1 (respectively e_2) may depend on other quantities such as n, but our emphasis is the dependence on the estimation accuracy of b (resp. a).
It is natural to expect that e_1 (resp. e_2) will shrink as b̂^{(t)} (resp. â^{(t)}) moves closer to b* (resp. a*). Under this condition, we can apply the bounds in (10) alternatingly and recursively,

    d_1(â^{(T)}, a*) ≤ e_1(d_2(b̂^{(T−1)}, b*)) ≤ ··· ≤ e_1(e_2(··· e_1(d_2(b̂^{(0)}, b*)) ···))   [composition of T e_1(·)'s and T−1 e_2(·)'s] ,    (11)

    d_2(b̂^{(T)}, b*) ≤ e_2(d_1(â^{(T)}, a*)) ≤ ··· ≤ e_2(e_1(··· e_1(d_2(b̂^{(0)}, b*)) ···))   [composition of T e_2(·)'s and T e_1(·)'s] ,    (12)

which may imply the error of â^{(T)} and b̂^{(T)} under other metrics of interest as well. Compared with the previous strategies, one notable difference of our treatment is that we do not care about the optimization convergence of AltMin: we neither characterize the error of any local minimizer of L(·) nor show any iterate convergence to those minimizers. Instead, the ingredients we need are simply the statistical error bounds in (10). Given this fact, our analysis can be extended to the alternating estimation (AltEst) procedure [10], which need not optimize a joint objective over a and b and certainly cannot be handled by the earlier strategies.

In order to get (10), the analysis for each AltMin step is often confronted with a technical challenge due to the dependency between the data and the iterates obtained so far, which is bypassed by many existing analyses via the resampling assumption. Essentially, a resampling-based result states that for any fixed b̂^{(t)} (resp. â^{(t+1)}), given a fresh sample D^{(t)} independent of b̂^{(t)} (resp. â^{(t+1)}), the next iterate â^{(t+1)} (resp. b̂^{(t+1)}) satisfies the corresponding bound in (10) with high probability. To avoid resampling, we leverage the idea of uniform bounds [40], which aims to show that, given a sample D, the bounds in (10) hold uniformly with high probability for all possible values of the inputs b̂^{(t)} and â^{(t+1)}. This argument asks for no fresh data in each iteration, and the probability of the error bounds being true does not deteriorate with a growing number of iterations. For structured multi-response regression, we will focus on the AltMin procedure shown in Algorithm 1. For the rest of the paper, C_0, C_1, c_0, c_1 and so on are reserved for absolute constants.

3 Statistical Guarantees of Alternating Minimization

In this section, we apply the resampling-free analysis strategy introduced in Section 2 to the multi-response regression problem, for which a = Σ and b = θ. First we introduce a few notations. Given a set A ⊆ R^p, define cone A = {c · a | c ≥ 0, a ∈ A}.
Algorithm 1 Alternating minimization for multi-response regression
Input: number of iterations T, data D = {(X_i, y_i)}_{i=1}^n, and tuning parameter λ
Output: estimate θ̂^{(T)}
1: Initialize θ̂^{(0)} (e.g., by solving (4) with Σ̂^{(0)} = I)
2: for t := 0 to T − 1 do
3:   Compute Σ̂^{(t+1)} according to (3)
4:   Compute θ̂^{(t+1)} by solving (4)
5: end for
6: return θ̂^{(T)}

We denote the smallest and the largest eigenvalues of Σ* by σ−* and σ+*, and assume Diag(Σ*) = I_{m×m} throughout the paper for simplicity. In addition, we drop the subscripts indexing the iteration, and analyze both the Σ-update and the θ-update in a broader setting, where the other parameter is fixed as a generic input in a certain region, i.e.,

    Σ̂(θ) = (1/n) ∑_{i=1}^n (y_i − X_i θ)(y_i − X_i θ)^T ,    (13)

    θ̂(Σ) = argmin_{f(θ) ≤ f(θ*)}  (1/2n) ∑_{i=1}^n ‖Σ^{-1/2}(y_i − X_i θ)‖_2^2 .    (14)

Note that here the tuning parameter λ in (4) for the θ-step is set as λ = f(θ*), which will be kept for the rest of the analysis. Given the recent progress in non-convex optimization [3], we also assume that θ̂(Σ) can be solved globally despite the potential non-convexity of f. The input regions we consider for θ and Σ are respectively given by

    R = {θ ∈ R^p | f(θ) ≤ f(θ*)} ,    (15)

    M(e_0) = {Σ̂(θ) ∈ R^{m×m} | θ ∈ R, ‖θ − θ*‖_2 ≤ e_0} ,    (16)

in which e_0 is the error tolerance to be specified for the initialization.
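A compact sketch of Algorithm 1 for the sparse case f = ‖·‖_0 is given below. The θ-step is handled by a simple two-stage truncation heuristic (keep the s largest coordinates of the whitened least-squares solution, then refit on that support), a simplified stand-in for the HTP solver used in the paper's experiments:

```python
import numpy as np

def altmin(X_list, y_list, s, T=10):
    # Sketch of Algorithm 1 with f = L0-cardinality. The theta-step uses a
    # truncate-then-refit heuristic in place of HTP; the Sigma-step is the
    # closed-form residual covariance of eq. (3), plus a tiny ridge so the
    # Cholesky factorization stays well-defined.
    n, (m, p) = len(X_list), X_list[0].shape
    Sigma = np.eye(m)                      # Sigma_hat^(0) = I, cf. line 1
    for _ in range(T + 1):
        # theta-step, cf. eq. (4): whiten by L^{-1} with Sigma = L L^T
        L_inv = np.linalg.inv(np.linalg.cholesky(Sigma))
        Xw = np.vstack([L_inv @ X for X in X_list])
        yw = np.concatenate([L_inv @ y for y in y_list])
        full, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        support = np.argsort(-np.abs(full))[:s]
        refit, *_ = np.linalg.lstsq(Xw[:, support], yw, rcond=None)
        theta = np.zeros(p)
        theta[support] = refit
        # Sigma-step, cf. eq. (3)
        R = np.stack([y - X @ theta for X, y in zip(X_list, y_list)])
        Sigma = R.T @ R / n + 1e-8 * np.eye(m)
    return theta, Sigma
```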
Note that the input region M(e_0) implicitly depends on R as well as on the sample D = {(X_i, y_i)}_{i=1}^n used for computing Σ̂(θ).

3.1 Preliminaries

To apply the proof strategy for AltMin, we first define the distance functions d_1 and d_2.

Definition 1 (distance functions) The distance functions for the Σ-step and the θ-step are defined as

    d_1(Σ, Σ*) = ξ(Σ)/ξ(Σ*) − 1 ,  where ξ(Σ) = √(Tr(Σ^{-1} Σ* Σ^{-1})) / Tr(Σ^{-1}) ,    (17)

    d_2(θ, θ*) = ‖θ − θ*‖_2 .    (18)

Although d_1 may look odd at first glance, it actually arises as a natural choice once we fix d_2, as the L2-error of θ is our primary goal in the statistical analysis. It is worth noting that ξ(Σ) is minimized at Σ = Σ*. The following definition is critical to the analysis for general structures of θ* [7].

Definition 2 (error spherical cap) For a structure-inducing f, its error spherical cap is defined as

    C = cone{u ∈ R^p | f(θ* + u) ≤ f(θ*)} ∩ S^{p−1} ,    (19)

where S^{p−1} = {u | ‖u‖_2 = 1} is the unit sphere of R^p.

The probabilistic analysis of d_1 and d_2 is built upon the concept of sub-Gaussian vectors and matrices, which are defined below.

Definition 3 (sub-Gaussian vector and matrix) A vector x ∈ R^p is said to be sub-Gaussian if its ψ2-norm satisfies

    |||x|||_{ψ2} = sup_{u ∈ S^{p−1}} |||⟨x, u⟩|||_{ψ2} ≤ κ < +∞ ,    (20)

where |||·|||_{ψ2} is defined for a random variable x ∈ R as |||x|||_{ψ2} = sup_{q ≥ 1} (E|x|^q)^{1/q} / √q. A matrix X ∈ R^{m×p} is sub-Gaussian if the following ψ2-norm for X is finite,

    |||X|||_{ψ2} = sup_{u ∈ S^{p−1}, v ∈ S^{m−1}} |||u^T Γ_v^{−1/2} X^T v|||_{ψ2} ≤ κ < +∞ ,    (21)

where Γ_v = E[X^T v v^T X]. Further, Γ_v for any v ∈ S^{m−1} is assumed to satisfy the condition 0 < μ− ≤ λ_min(Γ_v) ≤ λ_max(Γ_v) ≤ μ+ < +∞, for some constants μ− and μ+.

This definition is adopted from [41, 20]. If the rows of X are i.i.d. copies of an isotropic sub-Gaussian random vector x with |||x|||_{ψ2} ≤ κ, it is not difficult to verify that |||X|||_{ψ2} ≤ Cκ for a universal constant C, and μ− = μ+ = 1. Our assumptions on {X_i} and {η̃_i} are given below.

(A1) The designs X_1, ..., X_n are i.i.d. copies of a sub-Gaussian X with parameters κ, μ− and μ+.
(A2) The isotropic noises η̃_1, ..., η̃_n are i.i.d. copies of a sub-Gaussian η̃ with parameter τ.

Another key ingredient in the analysis is the complexity measure of the parameter structure captured by C, which turns out to be the notion of Gaussian width [16].

Definition 4 (Gaussian width) The Gaussian width w(A) of a set A ⊆ R^p is defined as

    w(A) = E_{g ∼ N(0, I)} [ sup_{u ∈ A} ⟨g, u⟩ ] .    (22)

The Gaussian width is easy to calculate or bound for the error spherical caps induced by many f of interest [7, 9]. Based on the Gaussian width, the proofs of the error bounds utilize a powerful tool from probability theory called generic chaining [37].
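For intuition, w(A) in Definition 4 can be estimated by Monte Carlo whenever the supremum has a closed form. The sketch below (illustrative only, not from the paper) does this for the set of s-sparse unit vectors, which is closely related to, though not identical to, the cap C induced by f = ‖·‖_0; for this set the supremum equals the ℓ2-norm of the s largest-magnitude entries of g:

```python
import numpy as np

def gaussian_width_sparse(p, s, trials=2000, seed=0):
    # Monte Carlo estimate of w(A) for A = {s-sparse unit vectors in R^p}:
    # sup_{u in A} <g, u> is attained by putting u on the s largest |g_i|,
    # so the supremum is the l2-norm of those s entries.
    rng = np.random.default_rng(seed)
    G = np.abs(rng.standard_normal((trials, p)))
    top = np.sort(G, axis=1)[:, -s:]          # s largest |g_i| per draw
    return float(np.mean(np.linalg.norm(top, axis=1)))
```

The estimate scales like √(s log(p/s)), matching the well-known order of the width for sparse structures.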
We refer the interested readers to the recent monograph [38] and references therein.

3.2 Error Bound for Arbitrary Initializations

Given the definitions of the distance functions d_1 and d_2, we first focus on the separate error bounds for the Σ-step and the θ-step in (13) and (14). To allow arbitrary initializations, we consider the tolerance of initialization error e_0 = +∞, which appears in the definition of M(e_0).

Lemma 1 (error bound for Σ-estimation) Under the assumptions (A1) and (A2), if the sample size n ≥ C_0 max{1, τ^4, κ^4 (σ+* μ+ / (σ−* μ−))^2} · max{w^4(C)/m, m}, then with probability at least 1 − C_2 exp(−C_1 m), Σ̂(θ) given in (13) is invertible for any θ ∈ R and its error satisfies

    d_1(Σ̂(θ), Σ*) ≤ C_3 τ^2 √(m/n) + C_4 √(μ+/σ−*) · d_2(θ, θ*) .    (23)

Remark: If θ = θ*, the Σ-step computes the sample covariance of the noise, for which d_2(θ, θ*) = 0, and the remaining O(√(m/n)) term in (23) is the typical statistical rate for covariance estimation.

Lemma 2 (error bound for θ-estimation) Under the assumptions (A1) and (A2), if the sample size n ≥ C_0 max{1, τ^4, κ^4 (σ+* μ+ / (σ−* μ−))^2} · max{w^4(C)/m, m}, then with probability at least 1 − C_2 exp(−C_1 m), the following bound holds for θ̂(Σ) given in (14) with any input Σ ∈ M(+∞),

    d_2(θ̂(Σ), Σ*) ≤ (1 + d_1(Σ, Σ*)) · (C_4 κ √μ+ / (μ− √(Tr(Σ*^{-1})))) · (m + w(C))/√n ,    (24)

where ξ(Σ), which underlies d_1, is given in Definition 1.

Remark: For Σ = Σ* and Σ = I, the θ-step corresponds to the oracle estimator θ̂_orc and the ordinary least squares (OLS) estimator θ̂_odn respectively, i.e.,

    θ̂_orc = argmin_{f(θ) ≤ f(θ*)}  (1/2n) ∑_{i=1}^n ‖Σ*^{-1/2}(y_i − X_i θ)‖_2^2 ,    (25)

    θ̂_odn = argmin_{f(θ) ≤ f(θ*)}  (1/2n) ∑_{i=1}^n ‖y_i − X_i θ‖_2^2 .    (26)

An analysis similar to [10] shows that with high probability the L2-errors of θ̂_orc and θ̂_odn satisfy

    ‖θ̂_orc − θ*‖_2 ≤ (C′κ √μ+ / (μ− √(Tr(Σ*^{-1})))) · w(C)/√n ≜ e_orc ,    (27)

    ‖θ̂_odn − θ*‖_2 ≤ (C′κ √μ+ / (μ− √m)) · w(C)/√n ≜ e_odn ,    (28)

which indicates that the oracle estimator improves on the OLS by a factor of

    e_orc / e_odn = √(m / Tr(Σ*^{-1})) .    (29)

In practice, this improvement can be significant, especially when there is strong cross-correlation among the responses, so that Σ* is close to singular.

By assembling Lemmas 1 and 2, we obtain the following theorem for the error of AltMin, which exhibits a geometric convergence to a certain minimum achievable error.

Theorem 1 (error bound for arbitrarily-initialized AltMin) Under the assumptions (A1) and (A2), if the sample size n ≥ C_0 · max{1, τ^4, κ^4 (μ+ σ+* / (μ− σ−*))^2, κ^2 (μ+/μ−)^2 (σ+*/σ−*)} · max{w^4(C)/m, m}, and θ̂^{(0)} is a feasible initialization (i.e., f(θ̂^{(0)}) ≤ f(θ*)), then with probability at least 1 − C_2 exp(−C_1 m), the following error bound holds for θ̂^{(T)} returned by Algorithm 1,

    ‖θ̂^{(T)} − θ*‖_2 ≤ e_min + ρ_n^T · (‖θ̂^{(0)} − θ*‖_2 − e_min) ,    (30)

in which ρ_n and e_min satisfy the inequalities below, with δ_n = C_5 τ^2 √(m/n) ≤ 1/4,

    ρ_n ≤ (C_3 κ μ+ / (μ− √(σ−* Tr(Σ*^{-1})))) · (m + w(C))/√n ≤ 1/2 ,    (31)

    e_min ≤ (C_4 κ √μ+ / (μ− √(Tr(Σ*^{-1})))) · ((m + w(C))/√n) · (1 + δ_n)/(1 − ρ_n) .    (32)

Remark: The inequality (30) indicates that the upper bound on the error of the AltMin procedure decreases geometrically to the minimum achievable error e_min at rate ρ_n. Though the initialization condition f(θ̂^{(0)}) ≤ f(θ*) may not hold for an arbitrary θ̂^{(0)}, it is satisfied by the first iterate θ̂^{(1)}, from which Theorem 1 starts to apply.

Note that ρ_n in (30) not only controls the convergence rate of the error, but also affects the value of e_min. The e_min is of the same order as the right-hand side of (24) with Σ = Σ*, which has an extra additive O(m/√n) term compared with e_orc. This is due to the uniformity considered for the θ-step over all Σ ∈ M(+∞).
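The geometric contraction in (30) is easy to trace numerically. The following toy sketch (illustrative only, with hypothetical values of e_min and ρ_n) iterates the bound:

```python
def altmin_error_bound(e0, e_min, rho, T):
    # eq. (30): after T iterations the error bound is
    # e_min + rho^T * (e0 - e_min), contracting geometrically toward e_min.
    return e_min + rho ** T * (e0 - e_min)

# hypothetical values: initial error 1.0, e_min = 0.05, rho_n = 0.5
bounds = [altmin_error_bound(1.0, 0.05, 0.5, t) for t in range(8)]
```

With ρ_n ≤ 1/2, the excess over e_min at least halves every iteration, so a moderate T already brings the bound within a small factor of e_min.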
To improve the bound for AltMin, we can consider a small e0 for M(e0).

3.3 Improved Bound with Good Initializations

As discussed above, we consider a smaller input region M(e0) for the θ-step with e0 = √(σ−*/μ+). Before presenting the results, we introduce the set called the error spherical sector.

Definition 5 (error spherical sector) For a structure-inducing f, its error spherical sector is defined as

    S = cone{ u ∈ R^p | f(θ* + u) ≤ f(θ*) } ∩ B^p,

where B^p = { u | ‖u‖₂ ≤ 1 } is the unit ball of R^p.

Geometrically, S is closely related to the previously defined C in (19), and their Gaussian widths satisfy w(S) ≤ w(C) + c for some universal constant c. Based on this definition, the following theorem characterizes the sharpened error of AltMin under good initializations.

Theorem 2 (error bound for well-initialized AltMin) Under the assumptions (A1) and (A2), if the sample size satisfies

    n ≥ C0 · max{ w⁴(C)/m , m³/w²(C) , m² } · max{ 1, τ⁴, κ⁴( μ+σ+*/(μ−σ−*) )², κ²( μ+/μ− )²( σ+*/σ−* )² },    (33)

and a feasible initialization θ̂^(0) satisfies ‖θ̂^(0) − θ*‖₂ ≤ √(σ−*/μ+), then with probability at least 1 − C2·exp(−C1·min{w²(C), m}), the error bound (30) holds for θ̂^(T) returned by Algorithm 1, with ρ_n and e_min satisfying

    ρ_n ≤ ( C3·κ·μ+ / ( μ−·√( σ−*·Tr(Σ*^{-1}) ) ) ) · ( w(S)/√n ) ≤ 1/2,    (34)

    e_min ≤ ( C4·κ·√μ+ / ( μ−·√( Tr(Σ*^{-1}) ) ) ) · ( w(S)/√n ) · ( (1 + δ_n)/(1 − ρ_n) ),    (35)

where δ_n is the same as the one given in Theorem 1.

Remark: Since w(S) only differs from w(C) by a constant, the above error bound matches the order of the oracle error e_orc. For instance, if θ* is s-sparse and f = ‖·‖₀, then w(S) and e_min satisfy

    w(S) = O( √(s·log p) )  ⟹  e_min = O( √( s·log p / n ) ).

The initialization condition is a result of setting a small value of e0, which yields an improved version of Lemma 2 so that we can obtain a better bound in Theorem 2. A reasonably good initialization of θ̂^(0) can be obtained by solving the OLS θ̂_odn, whose error bound is given in (27). On the other hand, the iterates obtained by running arbitrarily-initialized AltMin may also satisfy the initialization requirement, since Theorem 1 guarantees a moderate error. Once the requirement is met during the iterations, the arbitrarily-initialized AltMin can attain this sharper bound as well as the well-initialized one.

4 Experiments

In this section, we present experimental results to support our theoretical analysis. Specifically, we focus on the sparsity structure of θ*, and consider the L0-cardinality as the complexity function f. Throughout the experiments, we fix the problem dimension p = 1000, the sparsity level of θ* to s = 20, and the number of iterations to T = 10. Entries of X and η̃ are generated as i.i.d. standard Gaussians, and θ* = [1, …, 1, −1, …, −1, 0, …, 0]^T with ten 1's, ten −1's, and 980 zeros. Σ* is given as a block diagonal matrix with Σ' = [[1, a], [a, 1]] replicated along the diagonal.
All the plots are obtained by averaging over 100 random trials. First, we set a = 0.9 and m = 10, and vary the sample size n from 30 to 80. We run AltMin initialized by both OLS and a Gaussian random vector, where the θ-step is solved by the hard-thresholding pursuit (HTP) algorithm [11]. The error plots are shown in Figure 1. Second, we fix m = 10 and vary the parameter a in Σ* from 0.5 to 0.9 for n = 30, 40, 50 and 60. The plots in Figure 2(a) show the error of AltMin against a. As indicated by (29), the improvement of the oracle least squares over the ordinary one is amplified with increasingly large a. Figure 2(b) compares the actual ratio of e_orc to e_odn with the suggested one. Finally, we fix a = 0.8 and range the number of responses m from 10 to 18 for n = 30, 40, 50 and 60. The results are presented in Figures 2(c) and 2(d).

[Figure 1 panels: (a) L2-error vs. n; (b) L2-error vs. t (random initialization).]

Figure 1: (a) A phase transition is observed for the randomly-initialized AltMin around n = 40, whose error is on a par with the well-initialized one for n ≥ 40. This coincides with the remark for Theorem 2. Also, the error of AltMin is close to that of the oracle estimator, which is significantly better than OLS. (b) Our theoretical results suggest that a larger sample size leads to a smaller ρ_n, so AltMin converges faster, as shown in the plots.

[Figure 2 panels: (a) L2-error vs. a; (b) e_orc/e_odn (n = 60); (c) L2-error vs. m; (d) L2-error vs. m (n = 30).]

Figure 2: (a) With a varying from 0.5 to 0.9, the responses become increasingly correlated and the error of AltMin reduces more quickly.
(b) The actual ratio of e_orc to e_odn is very close to the predicted one given by (29). (c) As m increases from 10 to 18, the error of AltMin does not decrease drastically. The main reason is the increasingly large error in the estimation of Σ*. (d) Compared with the error of OLS, the advantage of AltMin becomes marginal with growing m, while its gap with the oracle estimator widens.

5 Conclusions

In this paper, we investigate the alternating minimization (AltMin) algorithm for high-dimensional multi-response linear models, which allow general structures on the underlying parameter. In particular, we present a resampling-free analysis of the statistical error of the non-convex AltMin procedure. Our error bound matches the resampling-based result up to a constant, and is of the same order as that of the oracle estimator. Above all, the error bounds suggest that arbitrarily-initialized AltMin is able to attain the same level of estimation error as the one with good initializations.

Acknowledgements

The research was supported by NSF grants IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, NASA grant NNX12AQ39A, and gifts from Adobe, IBM, and Yahoo.

References

[1] T. W. Anderson.
An Introduction to Multivariate Statistical Analysis. Wiley-Interscience, 2003.

[2] A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In Advances in Neural Information Processing Systems (NIPS), 2014.

[3] R. Barber and W. Ha. Gradient descent with nonconvex constraints: local concavity determines convergence. arXiv preprint arXiv:1703.07755, 2017.

[4] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873-3881, 2016.

[5] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3-54, 1997.

[6] E. J. Candes, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985-2007, 2015.

[7] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.

[8] S. Chatterjee, S. Chen, and A. Banerjee. Generalized Dantzig selector: Application to the k-support norm. In Advances in Neural Information Processing Systems (NIPS), 2014.

[9] S. Chen and A. Banerjee. Structured estimation with atomic norms: General bounds and applications. In NIPS, pages 2908-2916, 2015.

[10] S. Chen and A. Banerjee. Alternating estimation for structured high-dimensional multi-response models. In Advances in Neural Information Processing Systems, pages 2835-2844, 2017.

[11] S. Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543-2563, 2011.

[12] R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis.
arXiv preprint arXiv:1704.00708, 2017.

[13] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973-2981, 2016.

[14] A. Goncalves, P. Das, S. Chatterjee, V. Sivakumar, F. J. Von Zuben, and A. Banerjee. Multi-task sparse structure learning. In CIKM, pages 451-460, 2014.

[15] A. Goncalves, F. J. Von Zuben, and A. Banerjee. Multi-task sparse structure learning with Gaussian copula models. The Journal of Machine Learning Research, 17(1):1205-1234, 2016.

[16] Y. Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265-289, 1985.

[17] W. H. Greene. Econometric Analysis. Prentice Hall, 7th edition, 2011.

[18] A. J. Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, 2008.

[19] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In STOC, pages 665-674, 2013.

[20] P. Jain and A. Tewari. Alternating minimization for regression problems with vector-valued outputs. In Advances in Neural Information Processing Systems (NIPS), pages 1126-1134, 2015.

[21] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In NIPS, pages 685-693, 2014.

[22] S. Kim and E. P. Xing. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann. Appl. Stat., 6(3):1095-1117, 2012.

[23] W. Lee and Y. Liu. Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. J. Multivar. Anal., 111:241-255, 2012.

[24] Q. Li and G. Tang. The nonconvex geometry of low-rank matrix optimizations with general objective functions. arXiv preprint arXiv:1611.03060, 2016.

[25] C. Ma, K. Wang, Y. Chi, and Y. Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. arXiv preprint arXiv:1711.10467, 2017.

[26] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for the analysis of regularized M-estimators. Statistical Science, 27(4):538-557, 2012.

[27] P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. In NIPS, 2013.

[28] W. Oberhofer and J. Kmenta. A general procedure for obtaining maximum likelihood estimates in generalized regression models. Econometrica: Journal of the Econometric Society, pages 579-590, 1974.

[29] S. Oymak, B. Recht, and M. Soltanolkotabi. Sharp time-data tradeoffs for linear inverse problems. arXiv preprint arXiv:1507.04793, 2015.

[30] Y. Plan, R. Vershynin, and E. Yudovina. High-dimensional estimation with geometric constraints. Information and Inference, 2016.

[31] P. Rai, A. Kumar, and H. Daume. Simultaneously leveraging output and task structures for multiple-output regression. In NIPS, pages 3185-3193, 2012.

[32] A. J. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947-962, 2010.

[33] J. Shen and P. Li. On the iteration complexity of support recovery via hard thresholding pursuit. In International Conference on Machine Learning, pages 3115-3124, 2017.

[34] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. arXiv preprint arXiv:1602.06664, 2016.

[35] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853-884, 2017.

[36] R. Sun and Z.-Q. Luo. Guaranteed matrix completion via nonconvex factorization.
In FOCS, 2015.

[37] M. Talagrand. The Generic Chaining. Springer, 2005.

[38] M. Talagrand. Upper and Lower Bounds for Stochastic Processes. Springer, 2014.

[39] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.

[40] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[41] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, chapter 5, pages 210-268. Cambridge University Press, 2012.

[42] R. Vershynin. Estimation in High Dimensions: A Geometric Perspective, pages 3-66. Springer International Publishing, 2015.

[43] I. Waldspurger. Phase retrieval with random Gaussian sensing vectors by alternating projections. IEEE Trans. Information Theory, 64(5):3301-3312, 2018.

[44] X. Yi, C. Caramanis, and S. Sanghavi. Alternating minimization for mixed linear regression. In ICML, pages 613-621, 2014.

[45] Q. Zheng and J. Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, pages 109-117, 2015.