{"title": "A Dirty Model for Multi-task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 964, "page_last": 972, "abstract": "We consider the multiple linear regression problem, in a setting where some of the set of relevant features could be shared across the tasks. A lot of recent research has studied the use of $\\ell_1/\\ell_q$ norm block-regularizations with $q > 1$  for such (possibly) block-structured problems, establishing strong guarantees on recovery even under high-dimensional scaling where the number of features scale with the number of observations. However, these papers also caution that the performance of such block-regularized methods are very dependent on the {\\em extent} to which the features are shared across tasks. Indeed they show~\\citep{NWJoint} that if the extent of overlap is less than a threshold, or even if parameter {\\em values} in the shared features are highly uneven, then block $\\ell_1/\\ell_q$ regularization could actually perform {\\em worse} than simple separate elementwise $\\ell_1$ regularization. We are far away from a realistic multi-task setting: not only do the set of relevant features have to be exactly the same across tasks, but their values have to as well.  Here, we ask the question: can we leverage support and parameter overlap when it exists, but not pay a penalty when it does not? Indeed, this falls under a more general question of whether we can model such \\emph{dirty data} which may not fall into a single neat structural bracket (all block-sparse, or all low-rank and so on). Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we decompose  the parameters into two components and {\\em regularize these differently.} We show both theoretically and empirically, our method strictly and noticeably outperforms both $\\ell_1$ and $\\ell_1/\\ell_q$ methods, over the entire range of possible overlaps. 
We also provide theoretical guarantees that the method performs well under high-dimensional scaling.", "full_text": "A Dirty Model for Multi-task Learning\n\nAli Jalali\n\nUniversity of Texas at Austin\nalij@mail.utexas.edu\n\nPradeep Ravikumar\n\nUniversity of Texas at Austin\n\npradeepr@cs.utexas.edu\n\nSujay Sanghavi\n\nUniversity of Texas at Austin\n\nsanghavi@mail.utexas.edu\n\nChao Ruan\n\nUniversity of Texas at Austin\nruan@cs.utexas.edu\n\nAbstract\n\nWe consider multi-task learning in the setting of multiple linear regression, and\nwhere some relevant features could be shared across the tasks. Recent research\nhas studied the use of \u21131/\u2113q norm block-regularizations with q > 1 for such\nblock-sparse structured problems, establishing strong guarantees on recovery even under\nhigh-dimensional scaling where the number of features scales with the number of\nobservations. However, these papers also caution that the performance of such\nblock-regularized methods is very dependent on the extent to which the features\nare shared across tasks. Indeed they show [8] that if the extent of overlap is less\nthan a threshold, or even if parameter values in the shared features are highly\nuneven, then block \u21131/\u2113q regularization could actually perform worse than\nsimple separate elementwise \u21131 regularization. Since these caveats depend on the\nunknown true parameters, we might not know when and which method to apply.\nEven otherwise, we are far away from a realistic multi-task setting: not only does the\nset of relevant features have to be exactly the same across tasks, but their values\nhave to as well.\nHere, we ask the question: can we leverage parameter overlap when it exists,\nbut not pay a penalty when it does not? Indeed, this falls under a more general\nquestion of whether we can model such dirty data which may not fall into a single\nneat structural bracket (all block-sparse, or all low-rank and so on). 
With the\nexplosion of such dirty high-dimensional data in modern settings, it is vital to\ndevelop tools \u2013 dirty models \u2013 to perform biased statistical estimation tailored\nto such data. Here, we take a \ufb01rst step, focusing on developing a dirty model\nfor the multiple regression problem. Our method uses a very simple idea: we\nestimate a superposition of two sets of parameters and regularize them differently.\nWe show both theoretically and empirically, our method strictly and noticeably\noutperforms both \u21131 or \u21131/\u2113q methods, under high-dimensional scaling and over\nthe entire range of possible overlaps (except at boundary cases, where we match\nthe best method).\n\n1\n\nIntroduction: Motivation and Setup\n\nHigh-dimensional scaling. In \ufb01elds across science and engineering, we are increasingly faced with\nproblems where the number of variables or features p is larger than the number of observations n.\nUnder such high-dimensional scaling, for any hope of statistically consistent estimation, it becomes\nvital to leverage any potential structure in the problem such as sparsity (e.g. in compressed sens-\ning [3] and LASSO [14]), low-rank structure [13, 9], or sparse graphical model structure [12]. It is in\nsuch high-dimensional contexts in particular that multi-task learning [4] could be most useful. Here,\n\n1\n\n\fmultiple tasks share some common structure such as sparsity, and estimating these tasks jointly by\nleveraging this common structure could be more statistically ef\ufb01cient.\n\nBlock-sparse Multiple Regression. A common multiple task learning setting, and which is the focus\nof this paper, is that of multiple regression, where we have r > 1 response variables, and a common\nset of p features or covariates. 
The r tasks could share certain aspects of their underlying distri-\nbutions, such as common variance, but the setting we focus on in this paper is where the response\nvariables have simultaneously sparse structure: the index set of relevant features for each task is\nsparse; and there is a large overlap of these relevant features across the different regression prob-\nlems. Such \u201csimultaneous sparsity\u201d arises in a variety of contexts [15]; indeed, most applications\nof sparse signal recovery in contexts ranging from graphical model learning, kernel learning, and\nfunction estimation have natural extensions to the simultaneous-sparse setting [12, 2, 11].\n\nIt is useful to represent the multiple regression parameters via a matrix, where each column corre-\nsponds to a task, and each row to a feature. Having simultaneous sparse structure then corresponds\nto the matrix being largely \u201cblock-sparse\u201d \u2013 where each row is either all zero or mostly non-zero,\nand the number of non-zero rows is small. A lot of recent research in this setting has focused on\n\u21131/\u2113q norm regularizations, for q > 1, that encourage the parameter matrix to have such block-\nsparse structure. Particular examples include results using the \u21131/\u2113\u221e norm [16, 5, 8], and the \u21131/\u21132\nnorm [7, 10].\n\nDirty Models. Block-regularization is \u201cheavy-handed\u201d in two ways. By strictly encouraging shared-\nsparsity, it assumes that all relevant features are shared, and hence suffers under settings, arguably\nmore realistic, where each task depends on features speci\ufb01c to itself in addition to the ones that are\ncommon. The second concern with such block-sparse regularizers is that the \u21131/\u2113q norms can be\nshown to encourage the entries in the non-sparse rows taking nearly identical values. 
Thus we are\nfar away from the original goal of multitask learning: not only do the set of relevant features have\nto be exactly the same, but their values have to as well. Indeed recent research into such regularized\nmethods [8, 10] caution against the use of block-regularization in regimes where the supports and\nvalues of the parameters for each task can vary widely. Since the true parameter values are unknown,\nthat would be a worrisome caveat.\n\nWe thus ask the question: can we learn multiple regression models by leveraging whatever overlap\nof features there exist, and without requiring the parameter values to be near identical? Indeed this\nis an instance of a more general question on whether we can estimate statistical models where the\ndata may not fall cleanly into any one structural bracket (sparse, block-sparse and so on). With\nthe explosion of dirty high-dimensional data in modern settings, it is vital to investigate estimation\nof corresponding dirty models, which might require new approaches to biased high-dimensional\nestimation. In this paper we take a \ufb01rst step, focusing on such dirty models for a speci\ufb01c problem:\nsimultaneously sparse multiple regression.\n\nOur approach uses a simple idea: while any one structure might not capture the data, a superposition\nof structural classes might. Our method thus searches for a parameter matrix that can be decomposed\ninto a row-sparse matrix (corresponding to the overlapping or shared features) and an elementwise\nsparse matrix (corresponding to the non-shared features). As we show both theoretically and em-\npirically, with this simple \ufb01x we are able to leverage any extent of shared features, while allowing\ndisparities in support and values of the parameters, so that we are always better than both the Lasso\nor block-sparse regularizers (at times remarkably so).\n\nThe rest of the paper is organized as follows: In Sec 2. basic de\ufb01nitions and setup of the problem\nare presented. 
The main results of the paper are discussed in Sec 3. Experimental results and simulations\nare presented in Sec 4.\nNotation: For any matrix $M$, we denote its $j$-th row as $M_j$, and its $k$-th column as $M^{(k)}$. The set\nof all non-zero rows (i.e. all rows with at least one non-zero element) is denoted by RowSupp($M$)\nand its support by Supp($M$). Also, for any matrix $M$, let $\\|M\\|_{1,1} := \\sum_{j,k} |M_j^{(k)}|$, i.e. the sum of\nabsolute values of the elements, and $\\|M\\|_{1,\\infty} := \\sum_j \\|M_j\\|_\\infty$, where $\\|M_j\\|_\\infty := \\max_k |M_j^{(k)}|$.\n\n2 Problem Set-up and Our Method\n\nMultiple regression. We consider the following standard multiple linear regression model:\n\n$$y^{(k)} = X^{(k)} \\bar{\\theta}^{(k)} + w^{(k)}, \\qquad k = 1, \\ldots, r,$$\n\nwhere $y^{(k)} \\in \\mathbb{R}^n$ is the response for the $k$-th task, regressed on the design matrix $X^{(k)} \\in \\mathbb{R}^{n \\times p}$\n(possibly different across tasks), while $w^{(k)} \\in \\mathbb{R}^n$ is the noise vector. We assume each $w^{(k)}$ is\ndrawn independently from $\\mathcal{N}(0, \\sigma^2)$. The total number of tasks or target variables is $r$, the number\nof features is $p$, while the number of samples we have for each task is $n$. For notational convenience,\nwe collate these quantities into matrices $Y \\in \\mathbb{R}^{n \\times r}$ for the responses, $\\bar{\\Theta} \\in \\mathbb{R}^{p \\times r}$ for the regression\nparameters and $W \\in \\mathbb{R}^{n \\times r}$ for the noise.\nDirty Model. In this paper we are interested in estimating the true parameter $\\bar{\\Theta}$ from data by\nleveraging any (unknown) extent of simultaneous-sparsity. In particular, certain rows of $\\bar{\\Theta}$ would have\nmany non-zero entries, corresponding to features shared by several tasks (\u201cshared\u201d rows), while\ncertain rows would be elementwise sparse, corresponding to those features which are relevant for\nsome tasks but not all (\u201cnon-shared rows\u201d), while certain rows would have all zero entries,\ncorresponding to those features that are not relevant to any task. We are interested in estimators $\\hat{\\Theta}$ that\nautomatically adapt to different levels of sharedness, and yet enjoy the following guarantees:\n\nSupport recovery: We say an estimator $\\hat{\\Theta}$ successfully recovers the true signed support if\n$\\mathrm{sign}(\\mathrm{Supp}(\\hat{\\Theta})) = \\mathrm{sign}(\\mathrm{Supp}(\\bar{\\Theta}))$. We are interested in deriving sufficient conditions under which\nthe estimator succeeds. We note that this is stronger than merely recovering the row-support of $\\bar{\\Theta}$,\nwhich is the union of its supports for the different tasks. In particular, we denote by $U_k$ the support of the\n$k$-th column of $\\bar{\\Theta}$, and let $U = \\bigcup_k U_k$.\n\nError bounds: We are also interested in providing bounds on the elementwise \u2113\u221e norm error of the\nestimator $\\hat{\\Theta}$,\n\n$$\\|\\hat{\\Theta} - \\bar{\\Theta}\\|_\\infty = \\max_{j=1,\\ldots,p} \\; \\max_{k=1,\\ldots,r} \\left| \\hat{\\Theta}_j^{(k)} - \\bar{\\Theta}_j^{(k)} \\right|.$$\n\n2.1 Our Method\n\nOur method explicitly models the dirty block-sparse structure. We estimate a sum of two parameter\nmatrices B and S with different regularizations for each: encouraging block-structured row-sparsity\nin B and elementwise sparsity in S. The corresponding \u201cclean\u201d models would either just use\nblock-sparse regularizations [8, 10] or just elementwise sparsity regularizations [14, 18], so that either\nmethod would perform better in certain suited regimes. 
Interestingly, as we will see in the main\nresults, by explicitly allowing both block-sparse and elementwise sparse components, we are\nable to outperform both classes of these \u201cclean models\u201d, for all regimes of $\\bar{\\Theta}$.\n\nAlgorithm 1 Dirty Block Sparse\nSolve the following convex optimization problem:\n\n$$(\\hat{S}, \\hat{B}) \\in \\arg\\min_{S,B} \\; \\frac{1}{2n} \\sum_{k=1}^{r} \\left\\| y^{(k)} - X^{(k)} \\left( S^{(k)} + B^{(k)} \\right) \\right\\|_2^2 + \\lambda_s \\|S\\|_{1,1} + \\lambda_b \\|B\\|_{1,\\infty}. \\qquad (1)$$\n\nThen output $\\hat{\\Theta} = \\hat{B} + \\hat{S}$.\n\n3 Main Results and Their Consequences\n\nWe now provide precise statements of our main results. A number of recent results have shown that\nthe Lasso [14, 18] and \u21131/\u2113\u221e block-regularization [8] methods succeed in recovering signed\nsupports with controlled error bounds under high-dimensional scaling regimes. Our first two theorems\nextend these results to our dirty model setting. In Theorem 1, we consider the case of deterministic\ndesign matrices $X^{(k)}$, and provide sufficient conditions guaranteeing signed support recovery, and\nelementwise \u2113\u221e norm error bounds. In Theorem 2, we specialize this theorem to the case where the\nrows of the design matrices are random from a general zero mean Gaussian distribution: this allows\nus to provide scaling on the number of observations required in order to guarantee signed support\nrecovery and bounded elementwise \u2113\u221e norm error.\nOur third result is the most interesting in that it explicitly quantifies the performance gains of our\nmethod vis-a-vis Lasso and the \u21131/\u2113\u221e block-regularization method. Since this entailed finding the\nprecise constants underlying earlier theorems, and a correspondingly more delicate analysis, we\nfollow Negahban and Wainwright [8] and focus on the case where there are two tasks (i.e. r = 2),\nand where we have standard Gaussian design matrices as in Theorem 2. 
Further, while each of the two\ntasks depends on s features, only a fraction \u03b1 of these are common. It is then interesting to see how\nthe behaviors of the different regularization methods vary with the extent of overlap \u03b1.\nComparisons. Negahban and Wainwright [8] show that there is actually a \u201cphase transition\u201d in the\nscaling of the probability of successful signed support-recovery with the number of observations.\nDenote a particular rescaling of the sample-size $\\theta_{\\mathrm{Lasso}}(n, p, \\alpha) = \\frac{n}{s \\log(p - s)}$. Then as Wainwright\n[18] shows, when the rescaled number of samples scales as $\\theta_{\\mathrm{Lasso}} > 2 + \\delta$ for any $\\delta > 0$, Lasso\nsucceeds in recovering the signed support of all columns with probability converging to one. But\nwhen the sample size scales as $\\theta_{\\mathrm{Lasso}} < 2 - \\delta$ for any $\\delta > 0$, Lasso fails with probability converging\nto one. For the \u21131/\u2113\u221e-regularized multiple linear regression, define a similar rescaled sample size\n$\\theta_{1,\\infty}(n, p, \\alpha) = \\frac{n}{s \\log(p - (2 - \\alpha)s)}$. Then as Negahban and Wainwright [8] show there is again a\ntransition in probability of success from near zero to near one, at the rescaled sample size of $\\theta_{1,\\infty} = (4 - 3\\alpha)$.\nThus, for \u03b1 < 2/3 (\u201cless sharing\u201d) Lasso would perform better since its transition is at\na smaller sample size, while for \u03b1 > 2/3 (\u201cmore sharing\u201d) the \u21131/\u2113\u221e regularized method would\nperform better.\n\nAs we show in our third theorem, the phase transition for our method occurs at the rescaled sample\nsize of $\\theta_{1,\\infty} = (2 - \\alpha)$, which is strictly before either the Lasso or the \u21131/\u2113\u221e regularized method\nexcept for the boundary cases: \u03b1 = 0, i.e. the case of no sharing, where we match Lasso, and for\n\u03b1 = 1, i.e. full sharing, where we match \u21131/\u2113\u221e. Everywhere else, we strictly outperform both\nmethods. 
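
The three transition points quoted above can be compared directly. The sketch below is illustrative only: the function names are hypothetical and do not come from the paper, and it assumes the Lasso rescaling n / (s log(p - s)) and the block rescaling n / (s log(p - (2 - alpha)s)) can be read as the same unit, which is only approximately true.

```python
# Illustrative sketch; function names are hypothetical, not from the paper.
# Each function returns the rescaled sample size at which the quoted
# phase transition occurs for a 2-task problem with overlap fraction alpha.

def lasso_threshold(alpha):
    # Lasso: transition at 2, independent of the overlap.
    return 2.0

def block_threshold(alpha):
    # l1/linf block regularization: transition at 4 - 3*alpha.
    return 4.0 - 3.0 * alpha

def dirty_threshold(alpha):
    # Dirty model: transition at 2 - alpha.
    return 2.0 - alpha

def best_clean_threshold(alpha):
    # The better clean method is the one with the smaller threshold.
    return min(lasso_threshold(alpha), block_threshold(alpha))

for alpha in (0.0, 0.3, 2.0 / 3.0, 0.8, 1.0):
    print(alpha, lasso_threshold(alpha), block_threshold(alpha), dirty_threshold(alpha))
```

Under these assumptions the dirty-model threshold coincides with Lasso at alpha = 0 and with the block method at alpha = 1, and lies below both in between.
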
Figure 1 shows the empirical performance of each of the three methods; as can be seen,\nthey agree very well with the theoretical analysis. (Further details are given in the experiments, Section 4.)\n\n3.1 Sufficient Conditions for Deterministic Designs\n\nWe first consider the case where the design matrices $X^{(k)}$ for $k = 1, \\ldots, r$ are deterministic,\nand start by specifying the assumptions we impose on the model. We note that similar sufficient\nconditions for the deterministic $X^{(k)}$\u2019s case were imposed in papers analyzing Lasso [18] and\nblock-regularization methods [8, 10].\n\nA0 Column Normalization: $\\|X_j^{(k)}\\|_2 \\le \\sqrt{2n}$ for all $j = 1, \\ldots, p$, $k = 1, \\ldots, r$.\n\nLet $U_k$ denote the support of the $k$-th column of $\\bar{\\Theta}$, and $U = \\bigcup_k U_k$ denote the union of\nsupports for each task. Then we require that\n\nA1 Incoherence Condition:\n$$\\gamma_b := 1 - \\max_{j \\in U^c} \\sum_{k=1}^{r} \\left\\| \\left\\langle X_j^{(k)}, X_{U_k}^{(k)} \\right\\rangle \\left( \\left\\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \\right\\rangle \\right)^{-1} \\right\\|_1 > 0.$$\nWe will also find it useful to define\n$$\\gamma_s := 1 - \\max_{1 \\le k \\le r} \\max_{j \\in U_k^c} \\left\\| \\left\\langle X_j^{(k)}, X_{U_k}^{(k)} \\right\\rangle \\left( \\left\\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \\right\\rangle \\right)^{-1} \\right\\|_1.$$\nNote that by the incoherence condition A1, we have $\\gamma_s > 0$.\n\nA2 Eigenvalue Condition: $C_{min} := \\min_{1 \\le k \\le r} \\lambda_{\\min}\\left( \\frac{1}{n} \\left\\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \\right\\rangle \\right) > 0$.\n\nA3 Boundedness Condition: $D_{max} := \\max_{1 \\le k \\le r} \\left\\| \\left( \\frac{1}{n} \\left\\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \\right\\rangle \\right)^{-1} \\right\\|_{\\infty,1} < \\infty$.\n\nFurther, we require the regularization penalties be set as\n\n$$\\lambda_s > \\frac{2(2 - \\gamma_s)\\sigma\\sqrt{\\log(pr)}}{\\gamma_s \\sqrt{n}} \\quad \\text{and} \\quad \\lambda_b > \\frac{2(2 - \\gamma_b)\\sigma\\sqrt{\\log(pr)}}{\\gamma_b \\sqrt{n}}. \\qquad (2)$$\n\n[Figure 1, three panels: (a) \u03b1 = 0.3, (b) \u03b1 = 2/3, (c) \u03b1 = 0.8. Each panel plots Probability of Success against Control Parameter \u03b8 for the Dirty Model, the L1/Linf Regularizer and LASSO, with p = 128, 256, 512.]\n\nFigure 1: Probability of success in recovering the true signed support using dirty model, Lasso and \u21131/\u2113\u221e\nregularizer. For a 2-task problem, the probability of success for different values of feature-overlap fraction \u03b1\nis plotted. As we can see in the regimes that Lasso is better than, as good as and worse than \u21131/\u2113\u221e regularizer\n((a), (b) and (c) respectively), the dirty model outperforms both of the methods, i.e., it requires fewer\nobservations for successful recovery of the true signed support compared to Lasso and \u21131/\u2113\u221e regularizer. Here\n$s = \\lfloor p/10 \\rfloor$ always.\n\nTheorem 1. Suppose A0-A3 hold, and that we obtain estimate $\\hat{\\Theta}$ from our algorithm with\nregularization parameters chosen according to (2). Then, with probability at least $1 - c_1 \\exp(-c_2 n) \\to 1$,\nwe are guaranteed that the convex program (1) has a unique optimum and\n\n(a) The estimate $\\hat{\\Theta}$ has no false inclusions, and has bounded \u2113\u221e norm error so that\n$$\\mathrm{Supp}(\\hat{\\Theta}) \\subseteq \\mathrm{Supp}(\\bar{\\Theta}), \\quad \\text{and} \\quad \\|\\hat{\\Theta} - \\bar{\\Theta}\\|_{\\infty,\\infty} \\le \\sqrt{\\frac{4\\sigma^2 \\log(pr)}{n\\, C_{min}}} + \\lambda_s D_{max} =: b_{min}.$$\n\n(b) $\\mathrm{sign}(\\mathrm{Supp}(\\hat{\\Theta})) = \\mathrm{sign}(\\mathrm{Supp}(\\bar{\\Theta}))$ provided that\n$$\\min_{(j,k) \\in \\mathrm{Supp}(\\bar{\\Theta})} \\left| \\bar{\\theta}_j^{(k)} \\right| > b_{min}.$$\n\nHere the positive constants $c_1, c_2$ depend only on $\\gamma_s, \\gamma_b, \\lambda_s, \\lambda_b$ and $\\sigma$, but are otherwise independent\nof n, p, r, the problem dimensions of interest.\n\nRemark: Condition (a) guarantees that the estimate will have no false inclusions; i.e. all included\nfeatures will be relevant. If in addition, we require that it have no false exclusions and that it recover\nthe support exactly, we need to impose the assumption in (b) that the non-zero elements are large\nenough to be detectable above the noise.\n\n3.2 General Gaussian Designs\n\nOften the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task\n$k = 1, \\ldots, r$ the design matrix $X^{(k)} \\in \\mathbb{R}^{n \\times p}$ is such that each row $X_i^{(k)} \\in \\mathbb{R}^p$ is a zero-mean\nGaussian random vector with covariance matrix $\\Sigma^{(k)} \\in \\mathbb{R}^{p \\times p}$, and is independent of every other\nrow. Let $\\Sigma^{(k)}_{V,U} \\in \\mathbb{R}^{|V| \\times |U|}$ be the submatrix of $\\Sigma^{(k)}$ with rows corresponding to V and columns to\nU. 
We require these covariance matrices to satisfy the following conditions:\n\nC1 Incoherence Condition:\n$$\\gamma_b := 1 - \\max_{j \\in U^c} \\sum_{k=1}^{r} \\left\\| \\Sigma^{(k)}_{j,U_k} \\left( \\Sigma^{(k)}_{U_k,U_k} \\right)^{-1} \\right\\|_1 > 0.$$\n\nC2 Eigenvalue Condition: $C_{min} := \\min_{1 \\le k \\le r} \\lambda_{\\min}\\left( \\Sigma^{(k)}_{U_k,U_k} \\right) > 0$, so that the minimum eigenvalue\nis bounded away from zero.\n\nC3 Boundedness Condition: $D_{max} := \\left\\| \\left( \\Sigma^{(k)}_{U_k,U_k} \\right)^{-1} \\right\\|_{\\infty,1} < \\infty$.\n\nThese conditions are analogues of the conditions for deterministic designs; they are now imposed\non the covariance matrix of the (randomly generated) rows of the design matrix.\nFurther, defining $s := \\max_k |U_k|$, we require the regularization penalties be set as\n\n$$\\lambda_s > \\frac{\\left( 4\\sigma^2 C_{min} \\log(pr) \\right)^{1/2}}{\\gamma_s \\sqrt{n C_{min}} - \\sqrt{2s \\log(pr)}} \\quad \\text{and} \\quad \\lambda_b > \\frac{\\left( 4\\sigma^2 C_{min}\\, r \\left( r \\log(2) + \\log(p) \\right) \\right)^{1/2}}{\\gamma_b \\sqrt{n C_{min}} - \\sqrt{2sr \\left( r \\log(2) + \\log(p) \\right)}}. \\qquad (3)$$\n\nTheorem 2. Suppose assumptions C1-C3 hold, and that the number of samples scales as\n$$n > \\max\\left( \\frac{2s \\log(pr)}{C_{min} \\gamma_s^2}, \\; \\frac{2sr \\left( r \\log(2) + \\log(p) \\right)}{C_{min} \\gamma_b^2} \\right).$$\nSuppose we obtain estimate $\\hat{\\Theta}$ from our algorithm with regularization parameters chosen according to (3). Then,\nwith probability at least $1 - c_1 \\exp\\left( -c_2 \\left( r \\log(2) + \\log(p) \\right) \\right) - c_3 \\exp(-c_4 \\log(rs)) \\to 1$ for some\npositive numbers $c_1 - c_4$, we are guaranteed that the algorithm estimate $\\hat{\\Theta}$ is unique and satisfies\nthe following conditions:\n\n(a) the estimate $\\hat{\\Theta}$ has no false inclusions, and has bounded \u2113\u221e norm error so that\n$$\\mathrm{Supp}(\\hat{\\Theta}) \\subseteq \\mathrm{Supp}(\\bar{\\Theta}), \\quad \\text{and} \\quad \\|\\hat{\\Theta} - \\bar{\\Theta}\\|_{\\infty,\\infty} \\le \\sqrt{\\frac{50\\sigma^2 \\log(rs)}{n C_{min}}} + \\lambda_s \\left( \\frac{4s}{C_{min} \\sqrt{n}} + D_{max} \\right) =: g_{min}.$$\n\n(b) $\\mathrm{sign}(\\mathrm{Supp}(\\hat{\\Theta})) = \\mathrm{sign}(\\mathrm{Supp}(\\bar{\\Theta}))$ provided that\n$$\\min_{(j,k) \\in \\mathrm{Supp}(\\bar{\\Theta})} \\left| \\bar{\\theta}_j^{(k)} \\right| > g_{min}.$$\n\n3.3 Sharp Transition for 2-Task Gaussian Designs\n\nThis is one of the most important results of this paper. Here, we perform a more delicate and\nfiner analysis to establish precise quantitative gains of our method. We focus on the special case\nwhere r = 2 and the design matrix has rows generated from the standard Gaussian distribution\n$\\mathcal{N}(0, I_{n \\times n})$, so that C1 \u2212 C3 hold, with $C_{min} = D_{max} = 1$. As we will see both analytically and\nexperimentally, our method strictly outperforms both Lasso and \u21131/\u2113\u221e-block-regularization\nin all cases, except at the extreme endpoints of no support sharing (where it matches that of Lasso)\nand full support sharing (where it matches that of \u21131/\u2113\u221e). We now present our analytical results; the\nempirical comparisons are presented next in Section 4. 
The results will be in terms of a particular\nrescaling of the sample size n as\n\n$$\\theta(n, p, s, \\alpha) := \\frac{n}{(2 - \\alpha)\\, s \\log\\left( p - (2 - \\alpha)s \\right)}.$$\n\nWe will also require the assumptions that\n\nF1 $$\\lambda_s > \\frac{\\left( 4\\sigma^2 \\left( 1 - \\sqrt{s/n} \\right) \\left( \\log(r) + \\log(p - (2 - \\alpha)s) \\right) \\right)^{1/2}}{n^{1/2} - s^{1/2} - \\left( (2 - \\alpha)\\, s \\left( \\log(r) + \\log(p - (2 - \\alpha)s) \\right) \\right)^{1/2}},$$\n\nF2 $$\\lambda_b > \\frac{\\left( 4\\sigma^2 \\left( 1 - \\sqrt{s/n} \\right) r \\left( r \\log(2) + \\log(p - (2 - \\alpha)s) \\right) \\right)^{1/2}}{n^{1/2} - s^{1/2} - \\left( (1 - \\alpha/2)\\, sr \\left( r \\log(2) + \\log(p - (2 - \\alpha)s) \\right) \\right)^{1/2}}.$$\n\nTheorem 3. Consider a 2-task regression problem (n, p, s, \u03b1), where the design matrix has rows\ngenerated from the standard Gaussian distribution $\\mathcal{N}(0, I_{n \\times n})$.\nSuppose $\\max_{j \\in B^*} \\left| \\left| \\Theta_j^{*(1)} \\right| - \\left| \\Theta_j^{*(2)} \\right| \\right| = o(\\lambda_s)$, where $B^*$ is the submatrix of $\\Theta^*$ with rows where both entries are non-zero.\nThen the estimate $\\hat{\\Theta}$ of the problem (1) satisfies the following:\n\n(Success) Suppose the regularization coefficients satisfy F1 \u2212 F2. Further, assume that the number\nof samples scales as $\\theta(n, p, s, \\alpha) > 1$. Then, with probability at least $1 - c_1 \\exp(-c_2 n)$ for\nsome positive numbers $c_1$ and $c_2$, we are guaranteed that $\\hat{\\Theta}$ satisfies the support-recovery\nand \u2113\u221e error bound conditions (a-b) in Theorem 2.\n\n(Failure) If $\\theta(n, p, s, \\alpha) < 1$ there is no solution $(\\hat{B}, \\hat{S})$ for any choices of $\\lambda_s$ and $\\lambda_b$ such that\n$\\mathrm{sign}(\\mathrm{Supp}(\\hat{\\Theta})) = \\mathrm{sign}(\\mathrm{Supp}(\\bar{\\Theta}))$.\n\nWe note that we require the gap $\\left| \\left| \\Theta_j^{*(1)} \\right| - \\left| \\Theta_j^{*(2)} \\right| \\right|$ to be small only on rows where both entries are\nnon-zero. As we show in a more general theorem in the appendix, even in the case where the gap is\nlarge, the dependence of the sample scaling on the gap is quite weak.\n\n4 Empirical Results\n\nIn this section, we investigate the performance of our dirty block sparse estimator on synthetic and\nreal-world data. The synthetic experiments explore the accuracy of Theorem 3, and compare our\nestimator with LASSO and the \u21131/\u2113\u221e regularizer. We see that Theorem 3 is very accurate indeed.\nNext, we apply our method to a real-world dataset containing hand-written digits for classification.\nAgain we compare against LASSO and the \u21131/\u2113\u221e.\n(a multi-task regression dataset) with r = 2 tasks. On this real-world data, we show that the\ndirty model outperforms both LASSO and \u21131/\u2113\u221e in practice. For each method, the parameters are\nchosen via cross-validation; see supplemental material for more details.\n\n4.1 Synthetic Data Simulation\n\nWe consider an r = 2-task regression problem as discussed in Theorem 3, for a range of parameters\n(n, p, s, \u03b1). The design matrices X have each entry being i.i.d. Gaussian with mean 0 and variance\n1. 
For each fixed set of (n, s, p, \u03b1), we generate 100 instances of the problem. In each instance,\ngiven p, s, \u03b1, the locations of the non-zero entries of the true $\\bar{\\Theta}$ are chosen at random; each\nnon-zero entry is then chosen to be i.i.d. Gaussian with mean 0 and variance 1. n samples are then\ngenerated from this. We then attempt to estimate using three methods: our dirty model, \u21131/\u2113\u221e\nregularizer and LASSO. In each case, and for each instance, the penalty regularizer coefficients are\nfound by cross validation. After solving the three problems, we compare the signed support of the\nsolution with the true signed support and decide whether or not the program was successful in signed\nsupport recovery. We describe this process in more detail in this section.\nPerformance Analysis: We ran the algorithm for three different values of the overlap ratio \u03b1 \u2208\n{0.3, 2/3, 0.8} with three different numbers of features p \u2208 {128, 256, 512}. For any instance of the\nproblem (n, p, s, \u03b1), if the recovered matrix $\\hat{\\Theta}$ has the same sign support as the true $\\bar{\\Theta}$, then we\ncount it as success, otherwise failure (even if one element has different sign, we count it as failure).\n\nAs Theorem 3 predicts and Fig 1 shows, the right scaling for the number of observations is\n$\\frac{n}{s \\log(p - (2 - \\alpha)s)}$, where all curves stack on the top of each other at 2 \u2212 \u03b1. Also, the number of\nobservations required by the dirty model for true signed support recovery is always less than both LASSO and\n\u21131/\u2113\u221e regularizer. Fig 1(a) shows the probability of success for the case \u03b1 = 0.3 (when LASSO\nis better than \u21131/\u2113\u221e regularizer) and that the dirty model outperforms both methods. When \u03b1 = 2/3\n(see Fig 1(b)), LASSO and \u21131/\u2113\u221e regularizer perform the same, but the dirty model requires almost\n33% fewer observations for the same performance. 
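
The figure of roughly 33% at an overlap of 2/3 follows from the thresholds quoted from Theorem 3 and [8]: both clean methods need a rescaled sample size of 2, while the dirty model needs 2 - 2/3 = 4/3. A quick arithmetic check, using a hypothetical helper name that is not from the paper:

```python
# Illustrative arithmetic check; the helper name is hypothetical.

def relative_saving(alpha):
    # Fraction of observations saved by the dirty model (threshold 2 - alpha)
    # relative to the better of Lasso (threshold 2) and l1/linf
    # (threshold 4 - 3*alpha), all in the same rescaled units.
    best_clean = min(2.0, 4.0 - 3.0 * alpha)
    dirty = 2.0 - alpha
    return (best_clean - dirty) / best_clean

print(round(relative_saving(2.0 / 3.0), 3))  # prints 0.333
```

The same helper returns zero at the two boundary cases of no sharing and full sharing, where the dirty model only matches the best clean method.
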
As \u03b1 grows toward 1, e.g. \u03b1 = 0.8 as shown in\nFig 1(c), \u21131/\u2113\u221e performs better than LASSO. Still, the dirty model performs better than both methods\nin this case as well.\n\n[Figure 2: Phase Transition Threshold plotted against Shared Support Parameter \u03b1 for the Dirty Model, the L1/Linf Regularizer and LASSO, with p = 128, 256, 512.]\n\nFigure 2: Verification of the result of Theorem 3 on the behavior of the phase transition threshold by changing\nthe parameter \u03b1 in a 2-task (n, p, s, \u03b1) problem for dirty model, LASSO and \u21131/\u2113\u221e regularizer. The y-axis\nis $\\frac{n}{s \\log(p - (2 - \\alpha)s)}$, where n is the number of samples at which threshold was observed. Here $s = \\lfloor p/10 \\rfloor$. Our\ndirty model method shows a gain in sample complexity over the entire range of sharing \u03b1. The pre-constant in\nTheorem 3 is also validated.\n\nn | Measure | Our Model | \u21131/\u2113\u221e | LASSO\n10 | Average Classification Error | 8.6% | 9.9% | 10.8%\n10 | Variance of Error | 0.53% | 0.64% | 0.51%\n10 | Average Row Support Size | B:165, B + S:171 | 170 | 123\n10 | Average Support Size | S:18, B + S:1651 | 1700 | 539\n20 | Average Classification Error | 3.0% | 3.5% | 4.1%\n20 | Variance of Error | 0.56% | 0.62% | 0.68%\n20 | Average Row Support Size | B:211, B + S:226 | 217 | 173\n20 | Average Support Size | S:34, B + S:2118 | 2165 | 821\n40 | Average Classification Error | 2.2% | 3.2% | 2.8%\n40 | Variance of Error | 0.57% | 0.68% | 0.85%\n40 | Average Row Support Size | B:270, B + S:299 | 368 | 354\n40 | Average Support Size | S:67, B + S:2761 | 3669 | 2053\n\nTable 1: Handwriting Classification Results for our model, \u21131/\u2113\u221e and LASSO\n\nScaling Verification: To verify that the phase transition threshold changes linearly with \u03b1 as\npredicted by Theorem 3, we plot the phase transition threshold versus \u03b1. For five different values of\n\u03b1 \u2208 {0.05, 0.3, 2/3, 0.8, 0.95} and three different values of p \u2208 {128, 256, 512}, we find the phase\ntransition threshold for dirty model, LASSO and \u21131/\u2113\u221e regularizer. We consider the point where\nthe probability of success in recovery of signed support exceeds 50% as the phase transition\nthreshold. We find this point by interpolation on the closest two points. Fig 2 shows that the phase transition\nthreshold for the dirty model is always lower than the phase transition for LASSO and \u21131/\u2113\u221e\nregularizer.\n4.2 Handwritten Digits Dataset\n\nWe use the handwritten digit dataset [1], containing features of handwritten numerals (0-9) extracted\nfrom a collection of Dutch utility maps. This dataset has been used by a number of papers [17, 6]\nas a reliable dataset for handwritten recognition algorithms. 
There are thus r = 10 tasks, and each handwritten sample consists of p = 649 features.\nTable 1 shows the results of our analysis for different sizes n of the training set. We measure the classification error for each digit to get the 10-vector of errors. Then, we find the average error and the variance of the error vector to show how the error is distributed over all tasks. We compare our method with the \u21131/\u2113\u221e regularizer method and LASSO. Again, in all methods, parameters are chosen via cross-validation.\n\nFor our method we separate out the B and S matrices that our method finds, so as to illustrate how many features it identifies as \u201cshared\u201d and how many as \u201cnon-shared\u201d. For the other methods we just report the straight row and support numbers, since they do not make such a separation.\n\nAcknowledgements\n\nWe acknowledge support from NSF grant IIS-101842, and the NSF CAREER program, Grant 0954059.\n\nReferences\n\n[1] A. Asuncion and D. J. Newman. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine, CA, 2007.\n\n[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179\u20131225, 2008.\n\n[3] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118\u2013121, 2007.\n\n[4] R. Caruana. Multitask learning. Machine Learning, 28:41\u201375, 1997.\n\n[5] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567\u20131594, 2008.\n\n[6] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.\n\n[7] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference On Learning Theory (COLT), 2009.\n\n[8] S. Negahban and M. J. Wainwright. 
Joint support recovery under high-dimensional scaling: Benefits and perils of \u21131,\u221e-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.\n\n[9] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.\n\n[10] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.\n\n[11] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B.\n\n[12] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using \u21131-regularized logistic regression. Annals of Statistics, 2009.\n\n[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. In Allerton Conference, Allerton House, Illinois, 2007.\n\n[14] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n\n[15] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, Special issue on \u201cSparse approximations in signal and image processing\u201d, 86:572\u2013602, 2006.\n\n[16] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349\u2013363, 2005.\n\n[17] M. van Breukelen, R. P. W. Duin, D. M. J. Tax, and J. E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381\u2013386, 1998.\n\n[18] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using \u21131-constrained quadratic programming (lasso). 
IEEE Transactions on Information Theory, 55:2183\u20132202, 2009.", "award": [], "sourceid": 1164, "authors": [{"given_name": "Ali", "family_name": "Jalali", "institution": null}, {"given_name": "Sujay", "family_name": "Sanghavi", "institution": null}, {"given_name": "Chao", "family_name": "Ruan", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}]}