{"title": "Handling correlated and repeated measurements with the smoothed multivariate square-root Lasso", "book": "Advances in Neural Information Processing Systems", "page_first": 3959, "page_last": 3970, "abstract": "A limitation of Lasso-type estimators is that the optimal regularization parameter depends on the unknown noise level. Estimators such as the concomitant Lasso address this dependence by jointly estimating the noise level and the regression coefficients. Additionally, in many applications, the data is obtained by averaging multiple measurements: this reduces the noise variance, but it dramatically reduces sample sizes and prevents refined noise modeling. In this work, we propose a concomitant estimator that can cope with complex noise structure by using non-averaged measurements, its data-fitting term arising as a smoothing of the nuclear norm. The resulting optimization problem is convex and amenable, thanks to smoothing theory, to state-of-the-art optimization techniques that leverage the sparsity of the solutions. Practical benefits are demonstrated on toy datasets, realistic simulated data and real neuroimaging data.", "full_text": "Handling correlated and repeated measurements with\n\nthe smoothed multivariate square-root Lasso\n\nQuentin Bertrand \u2217\n\nUniversit\u00e9 Paris Saclay, Inria, CEA\n\nPalaiseau, 91120, France\n\nquentin.bertrand@inria.fr\n\nMathurin Massias \u2217\n\nUniversit\u00e9 Paris Saclay, Inria, CEA\n\nPalaiseau, 91120, France\n\nmathurin.massias@inria.fr\n\nAlexandre Gramfort\n\nUniversit\u00e9 Paris Saclay, Inria, CEA\n\nPalaiseau, 91120, France\n\nalexandre.gramfort@inria.fr\n\nJoseph Salmon\n\nUniv. Montpellier, CNRS\n\nMontpellier, France\n\njoseph.salmon@umontpellier.fr\n\nAbstract\n\nA limitation of Lasso-type estimators is that the optimal regularization parameter\ndepends on the unknown noise level. 
Estimators such as the concomitant Lasso\naddress this dependence by jointly estimating the noise level and the regression\ncoef\ufb01cients. Additionally, in many applications, the data is obtained by averaging\nmultiple measurements: this reduces the noise variance, but it dramatically reduces\nsample sizes and prevents re\ufb01ned noise modeling. In this work, we propose a\nconcomitant estimator that can cope with complex noise structure by using non-\naveraged measurements, its data-\ufb01tting term arising as a smoothing of the nuclear\nnorm. The resulting optimization problem is convex and amenable, thanks to\nsmoothing theory, to state-of-the-art optimization techniques that leverage the\nsparsity of the solutions. Practical bene\ufb01ts are demonstrated on toy datasets,\nrealistic simulated data and real neuroimaging data.\n\n1\n\nIntroduction\n\nIn many statistical applications, the number of parameters p is much larger than the number of\nobservations n. A popular approach to tackle linear regression problems in such scenarios is to\nconsider convex (cid:96)1-type penalties, as popularized by Tibshirani (1996). The use of these penalties\nrelies on a regularization parameter \u03bb trading data \ufb01delity versus sparsity. Unfortunately, Bickel\net al. (2009) showed that, in the case of white Gaussian noise, the optimal \u03bb depends linearly on\nthe standard deviation of the noise \u2013 referred to as noise level. Because the latter is rarely known in\npractice, one can jointly estimate the noise level and the regression coef\ufb01cients, following pioneering\nwork on concomitant estimation (Huber and Dutter, 1974; Huber, 1981). Adaptations to sparse\nregression (Owen, 2007) have been analyzed under the names of square-root Lasso (Belloni et al.,\n2011) or scaled Lasso (Sun and Zhang, 2012). 
Generalizations have been proposed in the multitask setting, the canonical estimator being the Multi-Task Lasso (Obozinski et al., 2010).
The latter estimators take their roots in a white Gaussian noise model. However, some real-world data (such as magneto-electroencephalographic data) are contaminated with strongly non-white Gaussian noise (Engemann and Gramfort, 2015). From a statistical point of view, the non-uniform noise level case has been widely explored: Daye et al. (2012); Wagener and Dette (2012); Kolar and Sharpnack (2012); Dalalyan et al. (2013). In a more general case, with a correlated Gaussian noise model, estimators based on non-convex optimization problems were proposed (Lee and Liu, 2012) and analyzed for sub-Gaussian covariance matrices (Chen and Banerjee, 2017) through the lens of penalized Maximum Likelihood Estimation (MLE). Other estimators (Rothman et al., 2010; Rai et al., 2012) assume that the inverse of the covariance (the precision matrix) is sparse, but the underlying optimization problems remain non-convex. A convex approach to regression with correlated noise, the Smooth Generalized Concomitant Lasso (SGCL), was proposed by Massias et al. (2018a). Relying on smoothing techniques (Moreau, 1965; Nesterov, 2005; Beck and Teboulle, 2012), the SGCL jointly estimates the regression coefficients and the noise co-standard deviation matrix (the square root of the noise covariance matrix). However, in applications such as M/EEG, the number of parameters in the co-standard deviation matrix (≈ 10^4) is typically equal to the number of observations, making it statistically hard to estimate accurately.
In this article we consider applications to M/EEG data in the context of neuroscience.

* These authors contributed equally.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
M/EEG data consists of recordings of the electric and magnetic fields at the surface of, or close to, the head. Here we tackle the source localization problem, which aims at estimating which regions of the brain are responsible for the observed electro-magnetic signals: this problem can be cast as a multitask high-dimensional linear regression (Ndiaye et al., 2015). MEG and EEG data are obtained from heterogeneous types of sensors: magnetometers, gradiometers and electrodes, leading to samples contaminated with different noise distributions, and thus non-white Gaussian noise. Moreover, the additive noise in M/EEG data is correlated between sensors and rather strong: the noise variance is commonly even stronger than the signal power. It is thus customary to make several repetitions of the same cognitive experiment, e.g., showing 50 times the same image to a subject in order to record 50 times the electric activity of the visual cortex. The multiple measurements are then classically averaged across the experiment's repetitions in order to increase the signal-to-noise ratio. In other words, popular estimators for M/EEG usually discard the individual observations, and rely on Gaussian i.i.d. noise models (Ou et al., 2009; Gramfort et al., 2013).
In this work we propose the Concomitant Lasso with Repetitions (CLaR), an estimator that is

• designed to exploit all available measurements collected during repetitions of experiments,
• defined as the solution of a convex minimization problem, handled efficiently by proximal block coordinate descent techniques,
• built thanks to an explicit connection with nuclear norm smoothing¹. This can also be viewed as a partial smoothing of the multivariate square-root Lasso (van de Geer and Stucky, 2016),
• shown (through extensive benchmarks w.r.t.
existing estimators) to leverage experimental repetitions to improve support identification,
• available as open source code to reproduce all the experiments.

In Section 2, we recall the framework of concomitant estimation, and introduce CLaR. In Section 3, we detail the properties of CLaR, and derive an algorithm to solve it. Finally, Section 4 is dedicated to experimental results.

2 Concomitant estimation with correlated noise

Probabilistic model. Let r be the number of repetitions of the experiment. The r observation matrices are denoted Y(1), ..., Y(r) ∈ R^{n×q}, with n the number of sensors/samples and q the number of tasks/time samples. The mean over the repetitions of the observation matrices is written Ȳ = (1/r) Σ_{l=1}^r Y(l). Let X ∈ R^{n×p} be the design (or gain) matrix, with p features stored column-wise: X = [X:1| ... |X:p], where for a matrix A ∈ R^{m×n} its jth column (resp. row) is denoted A:j ∈ R^{m×1} (resp. Aj: ∈ R^{1×n}). The matrix B* ∈ R^{p×q} contains the coefficients of the linear regression model. Each measurement (i.e., repetition of the experiment) follows the model:

∀l ∈ [r],   Y(l) = XB* + S* E(l) ,   (1)

where the entries of E(l) are i.i.d. samples from standard normal distributions, the E(l)'s are independent, and S* ∈ S^n_{++} is the co-standard deviation matrix; S^n_{++} (resp. S^n_+) stands for the set of positive definite (resp. positive semi-definite) matrices. Note that even if the observations Y(1), ..., Y(r) differ because of the noise E(1), ..., E(r), B* and the noise structure S* are shared across repetitions.

¹ Other Schatten norms are treated in Appendix A.2.

Notation. We write ‖·‖ (resp. ⟨·,·⟩) for the Euclidean norm (resp. inner product) on vectors and matrices, and ‖·‖_p for the ℓ_p norm, for any p ∈ [1,∞). For a matrix B ∈ R^{p×q}, ‖B‖_{2,1} = Σ_{j=1}^p ‖Bj:‖ (resp. ‖B‖_{2,∞} = max_{j∈[p]} ‖Bj:‖), and for any p ∈ [1,∞], we write ‖B‖_{S,p} for the Schatten p-norm (i.e., the ℓ_p norm of the singular values of B). The unit ℓ_p ball is written B_p, p ∈ [1,∞). ‖A‖_S = √(Tr(A⊤SA)) is the Mahalanobis norm induced by S ∈ S^n_{++}. For S1 and S2 ∈ S^n_+, S1 ≽ S2 if S1 − S2 ∈ S^n_+; when we write S1 ≽ S2 we implicitly assume that both matrices belong to S^n_+. For a square matrix A ∈ R^{n×n}, Tr(A) represents the trace of A. For a, b ∈ R, we denote (a)_+ = max(a, 0), a ∨ b = max(a, b) and a ∧ b = min(a, b). The block soft-thresholding operator at level τ > 0 is denoted BST(·, τ), and reads, for any vector x, BST(x, τ) = (1 − τ/‖x‖)_+ x. The identity matrix of size n × n is denoted Id_n, and [r] is the set of integers from 1 to r.

2.1 The proposed CLaR estimator

To leverage the multiple repetitions while taking into account the noise structure, we introduce the Concomitant Lasso with Repetitions (CLaR):
Definition 1.
CLaR estimates the parameters of Model (1) by solving:

(B̂_CLaR, Ŝ_CLaR) ∈ arg min_{B ∈ R^{p×q}, S ≽ σ Id_n} f(B, S) + λ ‖B‖_{2,1} , with f(B, S) ≜ (1/(2nqr)) Σ_{l=1}^r ‖Y(l) − XB‖²_{S⁻¹} + Tr(S)/(2n) ,   (2)

where λ > 0 controls the sparsity of B̂_CLaR and σ > 0 controls the smallest eigenvalue of Ŝ_CLaR.

2.2 Connections with concomitant Lasso on averaged data

In low SNR settings, a standard way to deal with strong noise is to use the averaged observation Ȳ ∈ R^{n×q} instead of the raw observations. The associated model reads:

Ȳ = XB* + S̃* Ẽ ,   (3)

with S̃* ≜ S*/√r, and Ẽ has i.i.d. entries drawn from a standard normal distribution. The SNR² is multiplied by √r, yet the number of samples goes from rnq to nq, making it statistically difficult to estimate the O(n²) parameters of S*. CLaR generalizes the Smoothed Generalized Concomitant Lasso (Massias et al., 2018a), which has the drawback of only targeting averaged observations:

Definition 2 (SGCL, Massias et al. 2018a). SGCL estimates the parameters of Model (3) by solving:

(B̂_SGCL, Ŝ_SGCL) ∈ arg min_{B ∈ R^{p×q}, S̃ ≽ (σ/√r) Id_n} f̃(B, S̃) + λ ‖B‖_{2,1} , with f̃(B, S̃) ≜ (1/(2nq)) ‖Ȳ − XB‖²_{S̃⁻¹} + Tr(S̃)/(2n) .   (4)

Remark 3. Note that Ŝ_CLaR estimates S*, while Ŝ_SGCL estimates S̃* = S*/√r. Since we impose the constraint Ŝ_CLaR ≽ σ Id_n, we rescale the constraint so that Ŝ_SGCL ≽ (σ/√r) Id_n in (4) for future comparisons.
Also note that CLaR and SGCL are the same when r = 1 and Y(1) = Ȳ.
The justification for CLaR is the following: if the quadratic loss ‖Y − XB‖² were used, the parameters of Model (1) could be estimated with either ‖Ȳ − XB‖² or (1/r) Σ_{l=1}^r ‖Y(l) − XB‖² as a data-fitting term. Yet, both alternatives yield the same solutions, as the two terms are equal up to constants. Hence, the quadratic loss does not leverage the multiple repetitions and ignores the noise structure. On the contrary, the more refined data-fitting term of CLaR takes the individual repetitions into account, leading to improved performance in applications.

3 Results and properties of CLaR

We start this part by introducing some elements of smoothing theory (Moreau, 1965; Nesterov, 2005; Beck and Teboulle, 2012) that shed some light on the origin of the data-fitting term introduced earlier.

² See the definition we consider in Eq. (16).

3.1 Smoothing of the nuclear norm

Let us analyze the data-fitting term of CLaR by connecting it to the Schatten 1-norm. We derive a formula for the smoothing of this norm (Proposition 4), which paves the way for a more general smoothing theory for matrix variables (see Appendix A). Let us define the following smoothing function:

ω_σ(·) ≜ (1/2) (‖·‖²/σ + nσ) ,   (5)

and the inf-convolution of functions f1 and f2, (f1 □ f2)(y) ≜ inf_x f1(x) + f2(y − x).
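To build intuition for this inf-convolution, here is a one-dimensional sketch (our own illustration; the function names `inf_conv` and `huber` are ours): smoothing the absolute value with a scaled quadratic yields the classical Huber function, and a brute-force grid evaluation of the infimum matches its closed form.

```python
import numpy as np

def inf_conv(f1, f2, y, grid):
    # numerically evaluate (f1 □ f2)(y) = inf_x f1(x) + f2(y - x) over a grid
    return np.min(f1(grid) + f2(y - grid))

def huber(y, sigma):
    # closed form of |.| □ (1/(2*sigma))(.)^2 : the Huber function
    return y ** 2 / (2 * sigma) if abs(y) <= sigma else abs(y) - sigma / 2

sigma = 0.5
grid = np.linspace(-10.0, 10.0, 400001)
for y in (-3.0, -0.2, 0.0, 0.4, 2.5):
    num = inf_conv(np.abs, lambda x: x ** 2 / (2 * sigma), y, grid)
    assert abs(num - huber(y, sigma)) < 1e-6
```

The same mechanism, applied to the singular values of a matrix, is what the results below make precise for the Schatten 1-norm.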
The name "smoothing" used in this paper comes from the following fact: if f1 is a closed proper convex function, then f1* + (1/2)‖·‖² is strongly convex, and thus its Fenchel transform (f1* + (1/2)‖·‖²)* = f1 □ (1/2)‖·‖² is smooth (see Appendix A.1 for a detailed proof).
The next propositions are key to our framework and show the connection between the SGCL, CLaR and the Schatten 1-norm:

Proposition 4 (Proof in Appendix A.3). The ω_σ-smoothing of the Schatten 1-norm, i.e., the function ‖·‖_{S,1} □ ω_σ : R^{n×q} → R, is the solution of the following smooth optimization problem:

(‖·‖_{S,1} □ ω_σ)(Z) = min_{S ≽ σ Id_n} (1/2) ‖Z‖²_{S⁻¹} + (1/2) Tr(S) .   (6)

Moreover, ‖·‖_{S,1} □ ω_σ is a σ-smooth (nσ/2)-approximation of ‖·‖_{S,1}.

Definition 5 (Clipped Square Root). For Σ ∈ S^n_+ with spectral decomposition Σ = U diag(γ1, ..., γn) U⊤ (U orthogonal), let us define the Clipped Square Root operator:

ClSqrt(Σ, σ) = U diag(√γ1 ∨ σ, ..., √γn ∨ σ) U⊤ .   (7)

Proposition 6 (Proof in Appendix B.1). Any solution (B̂, Ŝ) = (B̂_CLaR, Ŝ_CLaR) of the CLaR Problem (2) is also a solution of:

B̂ = arg min_{B ∈ R^{p×q}} (‖·‖_{S,1} □ ω_σ)( [Y(1) − XB | ... | Y(r) − XB] / √(qr) ) + λn ‖B‖_{2,1} ,
Ŝ = ClSqrt( (1/(qr)) R̂R̂⊤, σ ) , where R̂ = [Y(1) − X B̂ | ... | Y(r) − X B̂] .

Properties similar to Proposition 6 can be traced back to van de Geer and Stucky (2016, Sec 2.2), who introduced the multivariate square-root Lasso:

B̂ ∈ arg min_{B ∈ R^{p×q}} (1/(n√q)) ‖Ȳ − XB‖_{S,1} + λ ‖B‖_{2,1} ,   (8)

and showed that if (Ȳ − X B̂)(Ȳ − X B̂)⊤ ≻ 0, the latter optimization problem admits a variational³ formulation:

(B̂, Ŝ) ∈ arg min_{B ∈ R^{p×q}, S ≻ 0} (1/(2nq)) ‖Ȳ − XB‖²_{S⁻¹} + Tr(S)/(2n) + λ ‖B‖_{2,1} .   (9)

In other words, Proposition 6 generalizes van de Geer (2016, Lemma 3.4) to all matrices Ȳ − X B̂, getting rid of the condition (Ȳ − X B̂)(Ȳ − X B̂)⊤ ≻ 0. In the present contribution, the problem formulation in Proposition 4 is motivated by computational aspects, as it helps to address the combined non-smoothness of the data-fitting term ‖·‖_{S,1} and the penalty term ‖·‖_{2,1}. Note that another smoothing of the nuclear norm was proposed in Argyriou et al. (2008); Bach et al. (2012, Sec. 5.2):

Z ↦ min_{S ≻ 0} (1/2) Tr[Z⊤ S⁻¹ Z] + (1/2) Tr(S) + (σ²/2) Tr(S⁻¹) ,   (10)

which is a σ-smooth nσ-approximation of ‖·‖_{S,1} (see Appendix A.5), and is therefore less precise than ours.

³ Also called concomitant formulation, since minimization is performed over an additional variable (Owen, 2007; Ndiaye et al., 2017).

Algorithm 1 ALTERNATE MINIMIZATION FOR CLAR
  input: X, Ȳ, σ, λ, T_S_update, T
  init: B = 0_{p,q}, S⁻¹ = σ⁻¹ Id_n, R̄ = Ȳ, covY = (1/r) Σ_{l=1}^r Y(l) Y(l)⊤ // precomputed
  for t = 1, ..., T do
      if t ≡ 1 (mod T_S_update) then // noise update
          RR⊤ = RRT(covY, Y, X, B) // Eq. (15)
          S ← ClSqrt((1/(qr)) RR⊤, σ) // Eq. (12)
          for j = 1, ..., p do L_j = X:j⊤ S⁻¹ X:j
      for j = 1, ..., p do // coef. update
          R̄ ← R̄ + X:j Bj:
          Bj: ← BST(X:j⊤ S⁻¹ R̄ / L_j, λnq / L_j)
          R̄ ← R̄ − X:j Bj:
  return B, S

Other alternatives to exploit the multiple repetitions without simply averaging them would consist in investigating other Schatten p-norms:

arg min_{B ∈ R^{p×q}} (1/√(rq)) ‖[Y(1) − XB | ... | Y(r) − XB]‖_{S,p} + λn ‖B‖_{2,1} .   (11)

Without smoothing, problems of the form given in Equation (11) present the drawback of having two non-smooth terms, calling for primal-dual algorithms (Chambolle and Pock, 2011) with costly proximal operators. Even if the non-smooth Schatten 1-norm is replaced by the formula in Equation (6), numerical challenges remain: S can approach 0 arbitrarily, hence the gradient w.r.t. S of the data-fitting term is not Lipschitz over the optimization domain. Recently, Molstad (2019) proposed two algorithms to solve Equation (11) directly: a prox-linear ADMM, and an accelerated proximal gradient descent, the latter lacking convergence guarantees since the composite objective has two non-smooth terms. Before that, van de Geer and Stucky (2016) devised a fixed point method, lacking descent guarantees. A similar problem was raised for the concomitant Lasso by Ndiaye et al. (2017), who used smoothing techniques to address it. Here we replace the nuclear norm (p = 1) by its smoothed version ‖·‖_{S,1} □ ω_σ.
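The pieces above can be assembled into a compact NumPy sketch of Algorithm 1 (a simplified illustration, not the released Numba implementation: S is refit at every pass, residuals are recomputed naively, and all function names are ours):

```python
import numpy as np

def clsqrt(M, sigma):
    # Clipped Square Root (Definition 5): square-root the eigenvalues of the
    # PSD matrix M, then clip them from below at sigma.
    eigvals, U = np.linalg.eigh(M)
    return (U * np.maximum(np.sqrt(np.maximum(eigvals, 0.0)), sigma)) @ U.T

def bst(x, tau):
    # block soft-thresholding: BST(x, tau) = (1 - tau / ||x||)_+ x
    nrm = np.linalg.norm(x)
    return np.zeros_like(x) if nrm <= tau else (1.0 - tau / nrm) * x

def clar(X, Ys, lam, sigma, n_iter=20):
    # alternate minimization for the CLaR problem (2)
    r, n, q = Ys.shape
    p = X.shape[1]
    B = np.zeros((p, q))
    Y_bar = Ys.mean(axis=0)
    S = sigma * np.eye(n)
    for _ in range(n_iter):
        # S-update, closed form (Proposition 8)
        R = np.hstack([Y - X @ B for Y in Ys])   # n x (r q) residual matrix
        S = clsqrt(R @ R.T / (r * q), sigma)
        S_inv = np.linalg.inv(S)
        # B-update: one pass of block coordinate descent (Proposition 9)
        for j in range(p):
            L_j = X[:, j] @ S_inv @ X[:, j]
            R_j = Y_bar - X @ B + np.outer(X[:, j], B[j])  # residual w/o feature j
            B[j] = bst(X[:, j] @ S_inv @ R_j / L_j, lam * n * q / L_j)
    return B, S
```

Each block update is an exact minimization (Propositions 8 and 9), so the objective of Problem (2) decreases at every pass, and the constraint S ≽ σ Id_n holds by construction of the clipped square root.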
Similar results for the Schatten 2-norm and the Schatten ∞-norm are provided in the Appendix (Propositions 21 and 22).

3.2 Algorithmic details: convexity, (block) coordinate descent, parameter influence

We detail the principal results needed to solve Problem (2) numerically, leading to the implementation proposed in Algorithm 1. We first recall useful results for alternate minimization of convex composite problems.

Proposition 7 (Proof in Appendix B.2). CLaR is jointly convex in (B, S). Moreover, f is convex and smooth on the feasible set, and ‖·‖_{2,1} is convex and separable in the Bj:'s; thus minimizing the objective alternately in S and in the Bj:'s (see Algorithm 1) converges to a global minimum.

Hence, for our alternate minimization implementation, we only need to consider solving problems with B or S fixed, which we detail in the next propositions.

Proposition 8 (Minimization in S; proof in Appendix B.3). Let B ∈ R^{p×q} be fixed. The minimization of f(B, S) w.r.t. S under the constraint S ≽ σ Id_n admits the closed-form solution:

S = ClSqrt( (1/(rq)) Σ_{l=1}^r (Y(l) − XB)(Y(l) − XB)⊤ , σ ) .   (12)

Proposition 9 (Proof in Appendix B.4). For a fixed S ∈ S^n_{++}, each step of the block minimization of f(·, S) + λ ‖·‖_{2,1} in the jth row of B admits a closed-form solution:

Bj: = BST( Bj: + X:j⊤ S⁻¹ (Ȳ − XB) / ‖X:j‖²_{S⁻¹} , λnq / ‖X:j‖²_{S⁻¹} ) .   (13)

As for other Lasso-type estimators, there exists λ_max ≥ 0 such that whenever λ ≥ λ_max, the estimated coefficients vanish.
This λ_max helps to roughly calibrate λ in practice, by choosing it as a fraction of λ_max.

Proposition 10 (Critical regularization parameter; proof in Appendix B.5). For the CLaR estimator we have, with S_max ≜ ClSqrt( (1/(qr)) Σ_{l=1}^r Y(l) Y(l)⊤ , σ ):

∀λ ≥ λ_max ≜ (1/(nq)) ‖X⊤ S_max⁻¹ Ȳ‖_{2,∞} ,   B̂_CLaR = 0 .   (14)

Convex formulation benefits. Thanks to the convex formulation, convergence of Algorithm 1 can be ensured using the duality gap as a stopping criterion (as it guarantees a targeted sub-optimality level). To compute the duality gap, we derive the dual of Problem (2) in Proposition 24. In addition, convexity allows one to leverage acceleration methods such as working set strategies (Fan and Lv, 2008; Tibshirani et al., 2012; Johnson and Guestrin, 2015; Massias et al., 2018b) or safe screening rules (El Ghaoui et al., 2012; Fercoq et al., 2015) while retaining theoretical convergence guarantees. Such techniques are trickier to adapt in the non-convex case (see Appendix C), as they could change the local minima reached.

Choice of σ. Although σ has a smoothing interpretation, from a practical point of view it remains a hyperparameter to set. As in Massias et al. (2018a), σ is always chosen as follows: σ = ‖Y‖/(1000 × nq). In practice, the experimental results were little affected by the choice of σ.

Remark 11. Once covY ≜ (1/r) Σ_{l=1}^r Y(l) Y(l)⊤ is pre-computed, the cost of updating S does not depend on r, i.e., it is the same as working with averaged data. Indeed, with R = [Y(1) − XB | . .
. | Y(r) − XB], the following computation can be done in O(qn²) (details are in Appendix B.7):

RR⊤ = RRT(covY, Y, X, B) ≜ r covY + r (XB)(XB)⊤ − r Ȳ(XB)⊤ − r (XB)Ȳ⊤ .   (15)

Statistical properties showing the advantages of using CLaR (over the SGCL) can be found in Appendix B.8. As one could expect, using r times more observations improves the covariance estimation.

4 Experiments

Our Python code (with Numba compilation, Lam et al. 2015) is released as an open source package: https://github.com/QB3/CLaR. We compare CLaR to other estimators: SGCL (Massias et al., 2018a), an ℓ2,1 version of the MLE (Chen and Banerjee, 2017; Lee and Liu, 2012) (ℓ2,1-MLE), a version of the ℓ2,1-MLE with multiple repetitions (ℓ2,1-MLER), an ℓ2,1-penalized version of MRCE (Rothman et al., 2010) with repetitions (ℓ2,1-MRCER), and the Multi-Task Lasso (MTL, Obozinski et al. 2010). The cost of an epoch of block coordinate descent is summarized in Table 1 in Appendix C.4 for each algorithm⁴. All competitors are detailed in Appendix C.

Synthetic data. Here we demonstrate the ability of our estimator to recover the support, i.e., to identify the predictive features. There are n = 150 observations, p = 500 features and q = 100 tasks. The design X is random, with Toeplitz-correlated features with parameter ρ_X = 0.6 (the correlation between X:i and X:j is ρ_X^{|i−j|}), and its columns have unit Euclidean norm. The true coefficient matrix B* has 30 non-zero rows, whose entries are independent and normally distributed with zero mean. S* is a Toeplitz matrix with parameter ρ_S.
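Returning briefly to Remark 11: the O(qn²) update of Eq. (15), written so that every term is n × n (RR⊤ = r covY + r(XB)(XB)⊤ − r Ȳ(XB)⊤ − r(XB)Ȳ⊤), can be checked numerically against the direct computation (a sketch with our own variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 8, 12, 6, 4
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, q))
Ys = rng.standard_normal((r, n, q))
Y_bar = Ys.mean(axis=0)
cov_Y = sum(Y @ Y.T for Y in Ys) / r   # precomputed once

XB = X @ B
# Eq. (15): no dependence on r once cov_Y is available
RRT = r * cov_Y + r * XB @ XB.T - r * Y_bar @ XB.T - r * XB @ Y_bar.T

# direct O(r q n^2) computation, for comparison
R = np.concatenate([Y - XB for Y in Ys], axis=1)
assert np.allclose(RRT, R @ R.T)
```

This is exactly the `RRT(covY, Y, X, B)` step of Algorithm 1.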
The SNR is fixed and constant across all repetitions:

SNR ≜ ‖XB*‖ / (√r ‖XB* − Ȳ‖) .   (16)

For Figures 1 to 3, the figure of merit is the ROC curve, i.e., the true positive rate (TPR) against the false positive rate (FPR). For each estimator, the ROC curve is obtained by varying the value of the regularization parameter λ on a geometric grid of 160 points, from λ_max (specific to each algorithm) to λ_min, the latter also being estimator-specific and chosen to obtain an FPR larger than 0.4.

Influence of noise structure. Figure 1 represents the ROC curves for different values of ρ_S. As ρ_S increases, the noise becomes more and more correlated. From left to right, the performance of CLaR, SGCL, ℓ2,1-MRCER, ℓ2,1-MLE and ℓ2,1-MLER increases, as they are designed to exploit correlations in the noise, while the performance of MTL decreases, as its i.i.d. Gaussian noise model becomes less and less valid.

Influence of SNR. On Figure 2 we can see that when the SNR is high (left), all estimators (except ℓ2,1-MLE) reach the (0, 1) point. This means that for each algorithm (except ℓ2,1-MLE), there exists a

⁴ The cost of computing the duality gap is also provided whenever available.

Figure 1 – Influence of noise structure. ROC curves of support recovery (ρ_X = 0.6, SNR = 0.03, r = 20) for different ρ_S values.

Figure 2 – Influence of SNR. 
ROC curves of support recovery (ρ_X = 0.6, ρ_S = 0.4, r = 20) for different SNR values.

Figure 3 – Influence of the number of repetitions. ROC curves of support recovery (ρ_X = 0.6, SNR = 0.03, ρ_S = 0.4) for different r values.

Figure 4 – Influence of the number of repetitions. ROC curves with empirical X and S and simulated B* (amp = 2 nAm), for different numbers of repetitions.

Figure 5 – Amplitude influence. ROC curves with empirical X and S and simulated B* (r = 50), for different amplitudes of the signal.

λ such that the estimated support is exactly the true one. However, when the SNR decreases (middle), the performance of SGCL and MTL starts to drop, while that of CLaR, ℓ2,1-MLER and ℓ2,1-MRCER remains stable (CLaR performing better), highlighting their capacity to leverage multiple repetitions of measurements to handle the noise structure. Finally, when the SNR is too low (right), all algorithms perform poorly, but CLaR, ℓ2,1-MLER and ℓ2,1-MRCER still perform better.

Influence of the number of repetitions. Figure 3 shows ROC curves of all compared approaches for different r, starting from r = 1 (left) to 100 (right). Even with r = 20 (middle), CLaR outperforms the other estimators, and when r = 100 CLaR can better leverage the large number of repetitions.

Realistic data. We now evaluate the estimators on realistic magneto- and electroencephalography (M/EEG) data. The M/EEG recordings measure the electrical potential and magnetic fields induced by the active neurons. Data are time series of length q, with n sensors and p sources mapping to locations in the brain. Because the propagation of the electromagnetic fields is driven by the linear Maxwell equations, one can assume that the relation between the measurements Y(1), . . .
, Y(r) and the amplitudes of sources in the brain B* is linear.
The M/EEG inverse problem consists in identifying B*. Because of the limited number of sensors (a few hundred in practice), as well as the physics of the problem, the M/EEG inverse problem is severely ill-posed and needs to be regularized. Moreover, the experiments being usually short (less than 1 s) and focused on specific cognitive functions, the number of active sources is expected to be small, i.e., B* is assumed to be row-sparse. This plausible biological assumption motivates the framework of Section 2 (Ou et al., 2009).

Dataset. We use the sample dataset⁵ from the MNE software (Gramfort et al., 2014). The experimental conditions here are auditory stimulations in the right or left ear, leading to two main foci of activations in bilateral auditory cortices (i.e., 2 non-zero rows for B*). For this experiment, we keep only the gradiometer magnetic channels. After removing one channel corrupted by artifacts, this leads to n = 203 signals. The length of the temporal series is q = 100, and the data contains r = 50 repetitions. We choose a source space of size p = 1281, which corresponds to about 1 cm distance between neighboring sources. The orientation is fixed, and normal to the cortical mantle.

Realistic MEG data simulations. We use here true empirical values for X and S, by solving Maxwell equations and taking an empirical co-standard deviation matrix.
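The generative process used throughout the experiments is an instance of Model (1); a generic sketch (with synthetic stand-ins for X and S*, not the MNE sample data; all sizes and names are ours) reads:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r, k = 50, 120, 40, 20, 5   # k = number of active rows of B*

X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)        # unit-norm columns, as in Section 4

B_star = np.zeros((p, q))
support = rng.choice(p, size=k, replace=False)
B_star[support] = rng.standard_normal((k, q))   # row-sparse coefficients

# Toeplitz co-standard deviation matrix (rho_S = 0.4), positive definite
S_star = 0.4 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Model (1): one correlated-noise realization per repetition
Ys = np.stack([X @ B_star + S_star @ rng.standard_normal((n, q))
               for _ in range(r)])
Y_bar = Ys.mean(axis=0)

# SNR as defined in Eq. (16)
snr = np.linalg.norm(X @ B_star) / (np.sqrt(r) * np.linalg.norm(X @ B_star - Y_bar))
```

B* and S* are shared across the r repetitions; only the noise realizations E(l) differ.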
To generate realistic MEG data, we simulate neural responses B* with 2 non-zero rows corresponding to areas known to be related to auditory processing (Brodmann area 22). Each non-zero row of B* is chosen as a sinusoidal signal with realistic frequency (5 Hz) and amplitude (amp ∼ 1 to 10 nAm). We finally simulate r MEG signals Y(l) = XB* + S*E(l), the E(l) being matrices with i.i.d. normal entries.
The signals being contaminated with correlated noise, if one wants to use homoscedastic solvers it is necessary to whiten the data first (and thus to have an estimate of the covariance matrix, the latter often being unknown). In this experiment we demonstrate that, without this whitening step, the homoscedastic solver MTL fails, as do the solvers that do not take the repetitions into account: SGCL and ℓ2,1-MLE. In this scenario CLaR, ℓ2,1-MLER and ℓ2,1-MRCER do succeed in recovering the sources, CLaR leading to the best results. As for the synthetic data, Figures 4 and 5 are obtained by varying the estimator-specific regularization parameter λ from λ_max to λ_min on a geometric grid.
Amplitude influence. Figure 5 shows ROC curves for different values of the amplitude of the signal. When the amplitude is high (right), all the algorithms perform well; however, when the amplitude decreases (middle), only CLaR leads to good results, almost hitting the (0, 1) corner. When the amplitude gets lower (left), all algorithms perform worse, CLaR still yielding the best results.
Influence of the number of repetitions. Figure 4 shows ROC curves for different numbers of repetitions r.
When the number of repetitions is high (right, r = 50), the algorithms taking all the repetitions into account (CLaR, ℓ2,1-MLER, ℓ2,1-MRCER) perform best, almost hitting the (0, 1) corner, whereas the algorithms that do not (ℓ2,1-MLE, MTL, SGCL) perform poorly. As soon as the number of repetitions decreases (middle and left), the performance of all algorithms except CLaR drops severely. CLaR is once again the algorithm that benefits the most from the number of repetitions.

Real data As before, we use the sample dataset, keeping only the magnetometer magnetic channels (n = 102 signals). We choose a source space of size p = 7498 (about 5 mm between neighboring sources). The orientation is fixed, and normal to the cortical mantle. As for the realistic data, X is the empirical design matrix, but this time we use the empirical measurements Y (1), . . . , Y (r). The experiments are left or right auditory stimulations; extensive results for right auditory stimulations (resp. visual stimulations) can be found in Appendix D.3 (resp. Appendices D.4 and D.5). As two sources are expected (one in each hemisphere, in bilateral auditory cortices), we vary λ by dichotomy between λmax (returning 0 sources) and λmin (returning more than 2 sources), until finding a λ giving exactly 2 sources. Results are provided in Figures 6 and 7. Running times of each algorithm are of the same order of magnitude and can be found in Appendix D.2.
Comments on Figure 6, left auditory stimulations. Sources found by the algorithms are represented by red spheres. SGCL, ℓ2,1-MLE and ℓ2,1-MRCER completely fail, finding sources that are not in the auditory cortices at all (the SGCL sources are deep, thus not in the auditory cortices, and cannot be seen).
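The dichotomy over λ described above, between λmax (0 active sources) and a smaller value returning too many, can be sketched as follows. Here `count_sources` is a hypothetical callback standing for one run of the sparse solver at a given λ, returning the number of nonzero rows of the estimate; it is not part of the paper's code.

```python
def lambda_for_k_sources(count_sources, lam_max, k_target=2, max_iter=50):
    """Bisection search for a regularization lam giving exactly k_target
    active sources. Assumes the source count is (roughly) monotone
    non-increasing in lam, with count_sources(lam_max) == 0."""
    lam_lo, lam_hi = 0.0, lam_max  # lam_lo: too many sources; lam_hi: too few
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        k = count_sources(lam)
        if k == k_target:
            return lam
        if k > k_target:   # too many sources: strengthen regularization
            lam_lo = lam
        else:              # too few sources: weaken regularization
            lam_hi = lam
    return None  # no lam with exactly k_target sources found
```

Because the number of selected sources is only piecewise constant in λ, the loop may terminate without an exact match; in practice a finer grid or more iterations resolves this.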
MTL and ℓ2,1-MLER do find sources in the auditory cortices, but only in one hemisphere (left for MTL and right for ℓ2,1-MLER). CLaR is the only one that finds one source in each hemisphere, in the auditory cortices, as expected.
Comments on Figure 7, right auditory stimulations. In this experiment we keep only r = 33 repetitions (out of the 65 available); only CLaR finds correct sources, MTL finds sources in one hemisphere only, and all the other algorithms find sources that are not in the auditory cortices. This highlights the robustness of CLaR, even with a limited number of repetitions, confirming the previous experiments (see Figure 3).

⁵ Publicly available real M/EEG data recorded after auditory or visual stimulations.

Figure 6 – Real data, left auditory stimulations (n = 102, p = 7498, q = 76, r = 63). Panels: (a) CLaR, (b) SGCL, (c) ℓ2,1-MLER, (d) ℓ2,1-MLE, (e) ℓ2,1-MRCER, (f) MTL. Sources found in the left hemisphere (top) and the right hemisphere (bottom) after left auditory stimulations.

Figure 7 – Real data, right auditory stimulations (n = 102, p = 7498, q = 76, r = 33). Panels: (a) CLaR, (b) SGCL, (c) ℓ2,1-MLER, (d) ℓ2,1-MLE, (e) ℓ2,1-MRCER, (f) MTL. Sources found in the left hemisphere (top) and the right hemisphere (bottom) after right auditory stimulations.

Conclusion This work introduces CLaR, a sparse estimator for multitask regression. It is designed to handle correlated Gaussian noise in the context of repeated observations, a standard framework in applied sciences such as neuroimaging.
The resulting optimization problem can be solved efficiently with state-of-the-art convex solvers, and the algorithmic cost is the same as for single-repetition data. The theory of smoothing connects CLaR to the Schatten 1-Lasso in a principled manner, which opens the way to the use of more sophisticated data-fitting terms. The benefits of CLaR for support recovery in the presence of non-white Gaussian noise were extensively evaluated against a large number of competitors, both on simulations and on empirical MEG data.

Acknowledgments This work was funded by ERC Starting Grant SLAB ERC-YStG-676943.

References

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

A. Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.

A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM J. Optim., 22(2):557–580, 2012.

A. Belloni, V. Chernozhukov, and L. Wang. Square-root Lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732, 2009.

S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.

A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.

S. Chen and A. Banerjee. Alternating estimation for structured high-dimensional multi-response models. In NeurIPS, pages 2838–2848, 2017.

A. S. Dalalyan, M. Hebiri, K. Meziani, and J. Salmon.
Learning heteroscedastic models by convex programming under group sparsity. In ICML, 2013.

J. Daye, J. Chen, and H. Li. High-dimensional heteroscedastic regression with an application to eQTL data analysis. Biometrics, 68(1):316–326, 2012.

L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667–698, 2012.

D. A. Engemann and A. Gramfort. Automated model selection in covariance estimation and spatial whitening of MEG and EEG signals. NeuroImage, 108:328–342, 2015.

J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(5):849–911, 2008.

O. Fercoq, A. Gramfort, and J. Salmon. Mind the duality gap: safer rules for the lasso. In ICML, pages 333–342, 2015.

J. Friedman, T. J. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

A. Gramfort, D. Strohmeier, J. Haueisen, M. S. Hämäläinen, and M. Kowalski. Time-frequency mixed-norm estimates: Sparse M/EEG imaging with non-stationary source activations. NeuroImage, 70:410–422, 2013.

A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86:446–460, 2014. doi: http://dx.doi.org/10.1016/j.neuroimage.2013.10.027.

P. J. Huber. Robust Statistics. John Wiley & Sons Inc., 1981.

P. J. Huber and R. Dutter. Numerical solution of robust regression problems. In Compstat 1974 (Proc. Sympos. Computational Statist., Univ. Vienna, Vienna, 1974), pages 165–172. Physica Verlag, Vienna, 1974.

T. B. Johnson and C. Guestrin. Blitz: A principled meta-algorithm for scaling sparse optimization. In ICML, pages 1171–1179, 2015.

M. Kolar and J. Sharpnack.
Variance function estimation in high-dimensions. In ICML, pages 1447–1454, 2012.

S. K. Lam, A. Pitrou, and S. Seibert. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1–6. ACM, 2015.

W. Lee and Y. Liu. Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis, 111:241–255, 2012.

M. Massias, O. Fercoq, A. Gramfort, and J. Salmon. Generalized concomitant multi-task lasso for sparse multimodal regression. In AISTATS, volume 84, pages 998–1007, 2018a.

M. Massias, A. Gramfort, and J. Salmon. Celer: a fast solver for the Lasso with dual extrapolation. In ICML, 2018b.

A. J. Molstad. Insights and algorithms for the multivariate square-root lasso. arXiv preprint arXiv:1909.05041, 2019.

J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93:273–299, 1965.

E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. Gap safe screening rules for sparse multi-task and multi-class models. In NeurIPS, pages 811–819, 2015.

E. Ndiaye, O. Fercoq, A. Gramfort, V. Leclère, and J. Salmon. Efficient smoothed concomitant lasso estimation for high dimensional regression. Journal of Physics: Conference Series, 904(1):012006, 2017.

Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005.

G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.

W. Ou, M. Hämäläinen, and P. Golland. A distributed spatio-temporal EEG/MEG inverse solver. NeuroImage, 44(3):932–946, 2009.

A. B. Owen. A robust hybrid of lasso and ridge regression.
Contemporary Mathematics, 443:59–72, 2007.

N. Parikh, S. Boyd, E. Chu, B. Peleato, and J. Eckstein. Proximal algorithms. Foundations and Trends in Machine Learning, 1(3):1–108, 2013.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830, 2011.

P. Rai, A. Kumar, and H. Daume. Simultaneously leveraging output and task structures for multiple-output regression. In NeurIPS, pages 3185–3193, 2012.

A. J. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962, 2010.

T. Sun and C.-H. Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.

R. Tibshirani, J. Bien, J. Friedman, T. J. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser. B Stat. Methodol., 74(2):245–266, 2012.

P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.

P. Tseng and S. Yun. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. J. Optim. Theory Appl., 140(3):513, 2009.

S. van de Geer. Estimation and testing under sparsity, volume 2159 of Lecture Notes in Mathematics. Springer, 2016. Lecture notes from the 45th Probability Summer School held in Saint-Flour, 2015, École d'Été de Probabilités de Saint-Flour.

S. van de Geer and B. Stucky.
\u03c7 2-con\ufb01dence sets in high-dimensional regression. In Statistical\n\nanalysis for high-dimensional data, pages 279\u2013306. Springer, 2016.\n\nJ. Wagener and H. Dette. Bridge estimators and the adaptive Lasso under heteroscedasticity. Math.\n\nMethods Statist., 21:109\u2013126, 2012.\n\n12\n\n\f", "award": [], "sourceid": 2180, "authors": [{"given_name": "Quentin", "family_name": "Bertrand", "institution": "INRIA"}, {"given_name": "Mathurin", "family_name": "Massias", "institution": "Inria"}, {"given_name": "Alexandre", "family_name": "Gramfort", "institution": "INRIA"}, {"given_name": "Joseph", "family_name": "Salmon", "institution": "Universit\u00e9 de Montpellier"}]}