{"title": "Phase transitions for high-dimensional joint support recovery", "book": "Advances in Neural Information Processing Systems", "page_first": 1161, "page_last": 1168, "abstract": "We consider the following instance of transfer learning: given a pair of regression problems, suppose that the regression coefficients share a partially common support, parameterized by the overlap fraction $\\overlap$ between the two supports. This set-up suggests the use of $1, \\infty$-regularized linear regression for recovering the support sets of both regression vectors. Our main contribution is to provide a sharp characterization of the sample complexity of this $1,\\infty$ relaxation, exactly pinning down the minimal sample size $n$ required for joint support recovery as a function of the model dimension $\\pdim$, support size $\\spindex$ and overlap $\\overlap \\in [0,1]$. For measurement matrices drawn from standard Gaussian ensembles, we prove that the joint $1,\\infty$-regularized method undergoes a phase transition characterized by order parameter $\\orpar(\\numobs, \\pdim, \\spindex, \\overlap) = \\numobs{(4 - 3 \\overlap) s \\log(p-(2-\\overlap)s)}$. More precisely, the probability of successfully recovering both supports converges to $1$ for scalings such that $\\orpar > 1$, and converges to $0$ to scalings for which $\\orpar < 1$. An implication of this threshold is that use of $1, \\infty$-regularization leads to gains in sample complexity if the overlap parameter is large enough ($\\overlap > 2/3$), but performs worse than a naive approach if $\\overlap < 2/3$. We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 
Thus, our results illustrate both the benefits and dangers associated with block-$\\ell_1/\\ell_\\infty$ regularization in high-dimensional inference.", "full_text": "Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization

Sahand Negahban
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Berkeley, CA 94720-1770
sahand_n@eecs.berkeley.edu

Martin J. Wainwright
Department of Statistics, and Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Berkeley, CA 94720-1770
wainwrig@eecs.berkeley.edu

Abstract

Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of ℓ1/ℓ∞-regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of ℓ1/ℓ∞-regularized quadratic programming, considering both consistency rates in ℓ∞-norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ℓ∞-error, as well as sufficient conditions for exact variable selection, both for fixed design matrices and for designs drawn randomly from general Gaussian ensembles. These results show that the high-dimensional scaling of ℓ1/ℓ∞-regularization is qualitatively similar to that of ordinary ℓ1-regularization. 
Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the ℓ1/ℓ∞-regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞(n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α)s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of ℓ1,∞-regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations.

1 Introduction

The area of high-dimensional statistical inference is concerned with the behavior of models and algorithms in which the dimension p is comparable to, or possibly even larger than, the sample size n. In the absence of additional structure, it is well known that many standard procedures—among them linear regression and principal component analysis—are not consistent unless the ratio p/n converges to zero. 
Since this scaling precludes having p comparable to or larger than n, an active line of research is based on imposing structural conditions on the data—for instance, sparsity, manifold constraints, or graphical model structure—and then studying conditions under which various polynomial-time methods are either consistent, or conversely inconsistent.

This paper deals with high-dimensional scaling in the context of solving multiple regression problems, where the regression vectors are assumed to have shared sparse structure. More specifically, suppose that we are given a collection of r different linear regression models in p dimensions, with regression vectors β^i ∈ R^p, for i = 1, . . . , r. We let S(β^i) = {j | β^i_j ≠ 0} denote the support set of β^i. In many applications—among them sparse approximation, graphical model selection, and image reconstruction—it is natural to impose a sparsity constraint, corresponding to restricting the cardinality |S(β^i)| of each support set. Moreover, one might expect some amount of overlap between the sets S(β^i) and S(β^j) for indices i ≠ j, since they correspond to the sets of active regression coefficients in each problem. For instance, consider the problem of image denoising or reconstruction, using wavelets or some other type of multiresolution basis. It is well known that natural images tend to have sparse representations in such bases. 
Moreover, similar images—say the same scene taken from multiple cameras—would be expected to share a similar subset of active features in the reconstruction. Similarly, in analyzing the genetic underpinnings of a given disease, one might have results from different subjects and/or experiments, meaning that the covariate realizations and regression vectors would differ in their numerical values, but one expects the same subsets of genes to be active in controlling the disease, which translates to a condition of shared support in the regression coefficients. Given these structural conditions of shared sparsity in these and other applications, it is reasonable to consider how this common structure can be exploited so as to increase the statistical efficiency of estimation procedures.

In this paper, we study the high-dimensional scaling of block ℓ1/ℓ∞ regularization. Our main contribution is to obtain some precise—and arguably surprising—insights into the benefits and dangers of using block ℓ1/ℓ∞ regularization, as compared to simpler ℓ1-regularization (a separate Lasso for each regression problem). We begin by providing a general set of sufficient conditions for consistent support recovery, both for fixed design matrices and for random Gaussian design matrices. 
In addition to these basic consistency results, we then seek to characterize rates, for the particular case of standard Gaussian designs, in a manner precise enough to address the following questions.

(a) First, under what structural assumptions on the data does the use of ℓ1/ℓ∞ block-regularization provide a quantifiable reduction in the scaling of the sample size n, as a function of the problem dimension p and other structural parameters, required for consistency?

(b) Second, are there any settings in which ℓ1/ℓ∞ block-regularization can be harmful relative to computationally less expensive procedures?

Answers to these questions yield useful insight into the tradeoff between computational and statistical efficiency. Indeed, the convex programs that arise from using block-regularization typically require a greater computational cost to solve. Accordingly, it is important to understand under what conditions this increased computational cost guarantees that fewer samples are required for achieving a fixed level of statistical accuracy.

As a representative instance of our theory, consider the special case of standard Gaussian design matrices and two regression problems (r = 2), with the supports S(β^1) and S(β^2) each of size s and overlapping in a fraction α ∈ [0, 1] of their entries. For this problem, we prove that block ℓ1/ℓ∞ regularization undergoes a phase transition in terms of the rescaled sample size

θ1,∞(n, p, s, α) := n / {(4 − 3α)s log(p − (2 − α)s)}.   (1)

In words, for any δ > 0 and for scalings of the quadruple (n, p, s, α) such that θ1,∞ ≥ 1 + δ, the probability of successfully recovering both S(β^1) and S(β^2) converges to one, whereas for scalings such that θ1,∞ ≤ 1 − δ, the probability of success converges to zero. 
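The rescaled sample size (1) is easy to evaluate numerically. The following small sketch (function names are ours, not from the paper) computes θ1,∞ and the predicted critical sample size at which θ1,∞ = 1:

```python
import math

def theta_block(n, p, s, alpha):
    """Rescaled sample size theta_{1,infty}(n, p, s, alpha) from equation (1)."""
    return n / ((4 - 3 * alpha) * s * math.log(p - (2 - alpha) * s))

def n_critical(p, s, alpha):
    """Sample size at which theta_{1,infty} = 1, i.e. the predicted threshold."""
    return (4 - 3 * alpha) * s * math.log(p - (2 - alpha) * s)
```

For instance, with p = 512 and s = 51, full overlap (α = 1) gives a threshold of roughly s log(p − s) ≈ 313 samples, whereas disjoint supports (α = 0) require roughly 4s log(p − 2s) ≈ 1227, consistent with the α > 2/3 crossover discussed above.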
By comparison to previous theory on the behavior of the Lasso (ordinary ℓ1-regularized quadratic programming), the scaling (1) has two interesting implications. For the s-sparse regression problem with standard Gaussian designs, the Lasso has been shown [10] to undergo a phase transition as a function of the rescaled sample size

θLas(n, p, s) := n / {2s log(p − s)},   (2)

so that solving two separate Lasso problems, one for each regression problem, would recover both supports for problem sequences (n, p, s) such that θLas > 1. Thus, one consequence of our analysis is to provide a precise confirmation of the natural intuition: if the data is well-aligned with the regularizer, then block-regularization increases statistical efficiency. On the other hand, our analysis also conveys a cautionary message: if the overlap is too small—more precisely, if α < 2/3—then block ℓ1,∞ is actually harmful relative to the naive Lasso-based approach. This fact illustrates that some care is required in the application of block regularization schemes.

The remainder of this paper is organized as follows. In Section 2, we provide a precise description of the problem. Section 3 is devoted to the statement of our main result, some discussion of its consequences, and illustration by comparison to empirical simulations.

2 Problem set-up

We begin by setting up the problem to be studied in this paper, including multivariate regression and the family of block-regularized programs for estimating sparse vectors.

2.1 Multivariate regression

In this paper, we consider the following form of multivariate regression. For each i = 1, . . . , r, let β^i ∈ R^p be a regression vector, and consider the family of linear observation models

y^i = X^i β^i + w^i,   i = 1, 2, . . . , r.   (3)

Here each X^i ∈ R^{n×p} is a design matrix, possibly different for each vector β^i, and w^i ∈ R^n is a noise vector. We assume that the noise vectors w^i and w^j are independent for different regression problems i ≠ j. In this paper, we assume that each w^i has a multivariate Gaussian N(0, σ²I_{n×n}) distribution. However, we note that qualitatively similar results will hold for any noise distribution with sub-Gaussian tails (see the book [1] for more background).

2.2 Block-regularization schemes

For compactness in notation, we frequently use B to denote the p × r matrix with β^i ∈ R^p as the ith column. Given a parameter q ∈ [1, ∞], we define the ℓ1/ℓq block-norm as follows:

‖B‖_{ℓ1/ℓq} := Σ_{k=1}^p ‖(β^1_k, β^2_k, . . . , β^r_k)‖_q,   (4)

corresponding to applying the ℓq norm to each row of B, and the ℓ1-norm across all of these blocks. We note that all of these block norms are special cases of the CAP family of penalties [12].

This family of block-regularizers (4) suggests a natural family of M-estimators for estimating B, based on solving the block-ℓ1/ℓq-regularized quadratic program

B̂ ∈ arg min_{B ∈ R^{p×r}} { (1/2n) Σ_{i=1}^r ‖y^i − X^i β^i‖²_2 + λ_n ‖B‖_{ℓ1/ℓq} },   (5)

where λ_n > 0 is a user-defined regularization parameter. Note that the data term is separable across the different regression problems i = 1, . . . , r, due to our assumption of independence on the noise vectors. Any coupling between the different regression problems is induced by the block-norm regularization.

In the special case of univariate regression (r = 1), the parameter q plays no role, and the block-regularized scheme (5) reduces to the Lasso [7, 3]. If q = 1 and r ≥ 2, the block-regularization function (like the data term) is separable across the different regression problems i = 1, . . .
, r, and so the scheme (5) reduces to solving r separate Lasso problems. For r ≥ 2 and q = 2, the program (5) is frequently referred to as the group Lasso [11, 6]. Another important case [9, 8], and the focus of this paper, is block ℓ1/ℓ∞ regularization.

The motivation for using block ℓ1/ℓ∞ regularization is to encourage shared sparsity among the columns of the regression matrix B. Geometrically, like the ℓ1 norm that underlies the ordinary Lasso, the ℓ1/ℓ∞ block norm has a polyhedral unit ball. However, the block norm captures potential interactions between the columns β^i of the matrix B. Intuitively, taking the maximum encourages the elements (β^1_k, β^2_k, . . . , β^r_k) in any given row k = 1, . . . , p to be zero simultaneously, or to be non-zero simultaneously. Indeed, if β^i_k ≠ 0 for at least one i ∈ {1, . . . , r}, then there is no additional penalty to have β^j_k ≠ 0 as well, as long as |β^j_k| ≤ |β^i_k|.

2.3 Estimation in ℓ∞ norm and support recovery

For a given λ_n > 0, suppose that we solve the block ℓ1/ℓ∞ program, thereby obtaining an estimate

B̂ ∈ arg min_{B ∈ R^{p×r}} { (1/2n) Σ_{i=1}^r ‖y^i − X^i β^i‖²_2 + λ_n ‖B‖_{ℓ1/ℓ∞} }.   (6)

We note that under high-dimensional scaling (p ≫ n), this convex program (6) is not necessarily strictly convex, since the quadratic term is rank deficient and the block ℓ1/ℓ∞ norm is polyhedral. However, a consequence of our analysis is that under appropriate conditions, the optimal solution B̂ is in fact unique.

In this paper, we study the accuracy of the estimate B̂, as a function of the sample size n, regression dimensions p and r, and the sparsity index s = max_{i=1,...,r} |S(β^i)|. 
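For intuition, a program of the form (6) can be solved by proximal gradient descent: since ℓ∞ is the dual norm of ℓ1, the proximal operator of the row-wise ℓ∞ penalty follows from the Moreau decomposition together with a standard sort-based projection onto the ℓ1-ball. The sketch below is our own illustration of this approach, not the authors' implementation:

```python
import numpy as np

def project_l1(v, t):
    """Euclidean projection of v onto the l1-ball of radius t (sort-based)."""
    if np.abs(v).sum() <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - t)[0][-1]
    tau = (css[k] - t) / (k + 1)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_linf(v, t):
    """Prox of t * ||.||_inf, via Moreau: v minus the l1-ball projection."""
    return v - project_l1(v, t)

def block_linf_regression(Xs, ys, lam, step, iters=500):
    """Proximal-gradient sketch of the block l1/l_inf program (6)."""
    n, p = Xs[0].shape
    r = len(Xs)
    B = np.zeros((p, r))
    for _ in range(iters):
        # gradient of the separable least-squares data term
        G = np.column_stack(
            [Xs[i].T @ (Xs[i] @ B[:, i] - ys[i]) / n for i in range(r)])
        V = B - step * G
        for k in range(p):          # row-wise prox of lam * ||row||_inf
            B[k] = prox_linf(V[k], step * lam)
    return B
```

On a small noiseless instance with shared sparsity, this sketch zeroes out the inactive rows while coupling the active rows across columns; the step size must be below the reciprocal of the largest eigenvalue of X^T X / n for convergence.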
There are various metrics with which to assess the "closeness" of the estimate B̂ to the truth B, including predictive risk, various types of norm-based bounds on the difference B̂ − B, and variable selection consistency. In this paper, we prove results bounding the elementwise ℓ∞/ℓ∞ difference

‖B̂ − B‖_{ℓ∞/ℓ∞} := max_{k=1,...,p} max_{i=1,...,r} |B̂^i_k − B^i_k|.

In addition, we prove results on support recovery criteria. Recall that for each vector β^i ∈ R^p, we use S(β^i) = {k | β^i_k ≠ 0} to denote its support set. The problem of union support recovery corresponds to recovering the set

J := ∪_{i=1}^r S(β^i),   (7)

corresponding to the subset J ⊆ {1, . . . , p} of indices that are active in at least one regression problem. Note that the cardinality |J| is upper bounded by rs, but can be substantially smaller (as small as s) if there is overlap among the different supports.

In some results, we also study the more refined criterion of recovering the individual signed supports, meaning the signed quantities sign(β^i_k), where the sign function is given by

sign(t) = +1 if t > 0,  0 if t = 0,  −1 if t < 0.   (8)

There are multiple ways in which the support (or signed support) can be estimated, depending on whether we use primal or dual information from an optimal solution.

ℓ1/ℓ∞ primal recovery: Solve the block-regularized program (6), thereby obtaining a (primal) optimal solution B̂ ∈ R^{p×r}, and estimate the signed support vectors

[S_pri(β̂^i)]_k = sign(β̂^i_k).   (9)

ℓ1/ℓ∞ dual recovery: Solve the block-regularized program (6), thereby obtaining a primal solution B̂ ∈ R^{p×r}. For each row k = 1, . . . , p, compute the set M_k := arg max_{i=1,...,r} |β̂^i_k|, and estimate the signed support via

[S_dua(β̂^i)]_k = sign(β̂^i_k) if i ∈ M_k, and 0 otherwise.   (10)

As our development will clarify, this procedure corresponds to estimating the signed support on the basis of a dual optimal solution associated with the optimal primal solution.

2.4 Notational conventions

Throughout this paper, we use the index i ∈ {1, . . . , r} as a superscript indexing the different regression problems, or equivalently the columns of the matrix B ∈ R^{p×r}. Given a design matrix X ∈ R^{n×p} and a subset S ⊆ {1, . . . , p}, we use X_S to denote the n × |S| sub-matrix obtained by extracting those columns indexed by S. For a pair of matrices A ∈ R^{m×ℓ} and B ∈ R^{m×n}, we use the notation ⟨A, B⟩ := A^T B for the resulting ℓ × n matrix.

We use the following standard asymptotic notation: for functions f, g, the notation f(n) = O(g(n)) means that there exists a fixed constant 0 < C < +∞ such that f(n) ≤ Cg(n); the notation f(n) = Ω(g(n)) means that f(n) ≥ Cg(n); and f(n) = Θ(g(n)) means that f(n) = O(g(n)) and f(n) = Ω(g(n)).

3 Main results and their consequences

In this section, we provide precise statements of the main results of this paper. Our first main result (Theorem 1) provides sufficient conditions for deterministic design matrices X^1, . . . , X^r. This result allows for an arbitrary number r of regression problems. 
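The two recovery rules (9) and (10) above are straightforward to state in matrix form; the helpers below (our own naming) take the estimate B̂ as a p × r array:

```python
import numpy as np

def primal_signed_support(B_hat):
    """Primal recovery (9): entrywise signs of the estimate."""
    return np.sign(B_hat).astype(int)

def dual_signed_support(B_hat):
    """Dual recovery (10): keep the sign only where |B_hat[k, i]| attains the
    row maximum (the set M_k); set all other entries of the row to zero."""
    S = np.zeros_like(B_hat, dtype=int)
    for k, row in enumerate(np.abs(B_hat)):
        m = row.max()
        if m > 0:
            Mk = row == m                  # possibly non-unique maximizers
            S[k, Mk] = np.sign(B_hat[k, Mk]).astype(int)
    return S
```

As the text notes, the dual rule is the more conservative of the two: it can only shrink the primal estimate's signed support, never enlarge it.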
Not surprisingly, these results show that the high-dimensional scaling of block ℓ1/ℓ∞ is qualitatively similar to that of ordinary ℓ1-regularization: for instance, in the case of random Gaussian designs and bounded r, our sufficient conditions in [5] ensure that n = Ω(s log p) samples are sufficient to recover the union of supports correctly with high probability, which matches known results on the Lasso [10].

As discussed in the introduction, we are also interested in the more refined question: can we provide necessary and sufficient conditions that are sharp enough to reveal quantitative differences between ordinary ℓ1-regularization and block regularization? In order to provide precise answers to this question, our final two results concern the special case of r = 2 regression problems, both with supports of size s that overlap in a fraction α of their entries, and with design matrices drawn randomly from the standard Gaussian ensemble. In this setting, our final two results (Theorems 2 and 3) show that block ℓ1/ℓ∞ regularization undergoes a phase transition specified by the rescaled sample size. We then discuss some consequences of these results, and illustrate their sharpness with some simulation results.

3.1 Sufficient conditions for deterministic designs

In addition to the sample size n, problem dimensions p and r, and sparsity index s, our results are stated in terms of the minimum eigenvalue C_min of the |J| × |J| matrices (1/n)⟨X^i_J, X^i_J⟩—that is,

λ_min((1/n)⟨X^i_J, X^i_J⟩) ≥ C_min   for all i = 1, . . . , r,   (11)

as well as an ℓ∞-operator norm of their inverses:

|||((1/n)⟨X^i_J, X^i_J⟩)^{−1}|||_∞ ≤ D_max   for all i = 1, . . . , r.   (12)

It is natural to think of these quantities as being constants (independent of p and s), although our results do allow them to scale.

We assume that the columns of each design matrix X^i, i = 1, . . . , r, are normalized so that

‖X^i_k‖²_2 ≤ 2n   for all k = 1, 2, . . . , p.   (13)

The choice of the factor 2 in this bound is for later technical convenience. We also require that the following incoherence condition on the design matrices is satisfied: there exists some γ ∈ (0, 1] such that

max_{ℓ ∈ J^c} Σ_{i=1}^r ‖⟨X^i_ℓ, X^i_J⟩ (⟨X^i_J, X^i_J⟩)^{−1}‖_1 ≤ (1 − γ),   (14)

and we also define the support minimum value B_min := min_{k∈J} max_{i=1,...,r} |β^i_k|. For a parameter ξ > 1 (to be chosen by the user), we define the probability

φ_1(ξ, p, s) := 1 − 2 exp(−(ξ − 1)[r + log p]) − 2 exp(−(ξ² − 1) log(rs)),   (15)

which specifies the precise rate with which the "high probability" statements in Theorem 1 hold.

Theorem 1. Consider the observation model (3) with design matrices X^i satisfying the column bound (13) and incoherence condition (14). Suppose that we solve the block-regularized ℓ1/ℓ∞ convex program (6) with regularization parameter ρ²_n ≥ 4ξσ² (r² + r log(p)) / (γ² n) for some ξ > 1. 
Then with probability greater than φ_1(ξ, p, s) → 1, we are guaranteed that:

(a) The block-regularized program has a unique solution B̂ such that ∪_{i=1}^r S(β̂^i) ⊆ J, and it satisfies the elementwise bound

max_{i=1,...,r} max_{k=1,...,p} |β̂^i_k − β^i_k| ≤ ξ √(4σ² log(rs) / (C_min n)) + D_max ρ_n =: b_1(ξ, ρ_n, n, s).   (16)

(b) If in addition B_min ≥ b_1(ξ, ρ_n, n, s), then ∪_{i=1}^r S(β̂^i) = J, so that the solution B̂ correctly specifies the union of supports J.

Remarks: To clarify the scope of the claims, part (a) guarantees that the estimator recovers the union support J correctly, whereas part (b) guarantees that for any given i = 1, . . . , r and k ∈ S(β^i), the sign sign(β̂^i_k) is correct. Note that we are guaranteed that β̂^i_k = 0 for all k ∉ J. However, within the union support J, when using the primal recovery method, it is possible to have false non-zeros—i.e., there may be an index k ∈ J\S(β^i) such that β̂^i_k ≠ 0. Of course, this cannot occur if the support sets S(β^i) are all equal. This phenomenon is related to geometric properties of the block ℓ1/ℓ∞ norm: in particular, for any given index k, when β̂^j_k ≠ 0 for some j ∈ {1, . . . , r}, then there is no further penalty to having β̂^i_k ≠ 0 for other column indices i ≠ j.

The dual signed support recovery method (10) is more conservative in estimating the individual support sets. In particular, for any given i ∈ {1, . . . , r}, it only allows an index k to enter the signed support estimate S_dua(β̂^i) when |β̂^i_k| achieves the maximum magnitude (possibly non-unique) across all indices i = 1, . . . , r. Consequently, Theorem 1 guarantees that the dual signed support method will never incorrectly include an index in the individual supports. 
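For a concrete design, the constants C_min, D_max and the incoherence level γ appearing in conditions (11), (12) and (14) can be computed directly. The helper below is our own sketch, assuming the form of (14) as stated above:

```python
import numpy as np

def design_constants(Xs, J):
    """Evaluate C_min (11), D_max (12), and gamma = 1 - incoherence (14)
    for design matrices Xs = [X^1, ..., X^r] and union support J."""
    p = Xs[0].shape[1]
    Jc = [k for k in range(p) if k not in set(J)]
    Cmin, Dmax, Ginvs = np.inf, 0.0, []
    for X in Xs:
        n = X.shape[0]
        G = X[:, J].T @ X[:, J] / n                 # (1/n) <X_J, X_J>
        Cmin = min(Cmin, float(np.linalg.eigvalsh(G).min()))
        Ginv = np.linalg.inv(G)
        # l_inf operator norm: maximum absolute row sum
        Dmax = max(Dmax, float(np.abs(Ginv).sum(axis=1).max()))
        Ginvs.append(Ginv)
    incoh = max(
        sum(float(np.abs((X[:, ell] @ X[:, J] / X.shape[0]) @ Ginv).sum())
            for X, Ginv in zip(Xs, Ginvs))
        for ell in Jc) if Jc else 0.0
    return Cmin, Dmax, 1.0 - incoh
```

On an orthogonal design the off-support columns are uncorrelated with the support, so the incoherence sum vanishes and γ attains its maximal value 1.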
However, it may incorrectly exclude indices of some supports; but like the primal support estimator, it is always guaranteed to correctly recover the union of supports J.

We note that it is possible to ensure, under some conditions, that the dual support method will correctly recover each of the individual signed supports, without any incorrect exclusions. However, as illustrated by Theorem 2, doing so requires additional assumptions on the size of the gap |β^i_k| − |β^j_k| for indices k ∈ B := S(β^i) ∩ S(β^j).

3.2 Sharp results for standard Gaussian ensembles

Our results thus far show that under standard mutual incoherence or irrepresentability conditions, the block ℓ1/ℓ∞ method produces consistent estimators for n = Ω(s log(p − s)). In qualitative terms, these results match the known scaling for the Lasso, or ordinary ℓ1-regularization. In order to provide keener insight into the (dis)advantages associated with using ℓ1/ℓ∞ block regularization, we specialize the remainder of our analysis to the case of r = 2 regression problems, where the corresponding design matrices X^i, i = 1, 2, are sampled from the standard Gaussian ensemble [2, 4]—i.e., with i.i.d. rows N(0, I_{p×p}). Our goal in studying this special case is to be able to make quantitative comparisons with the Lasso.

We consider a sequence of models indexed by the triplet (p, s, α), corresponding to the problem dimension p, support size s, and overlap parameter α ∈ [0, 1]. We assume that s ≤ p/2, capturing the intuition of a (relatively) sparse model. Suppose that for a given model, we take n = n(p, s, α) observations according to equation (3). 
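To experiment with this ensemble, one can draw r = 2 instances as follows. This is a sketch with our own parameter choices; the paper prescribes the support structure but not a particular pattern of coefficient values:

```python
import numpy as np

def sample_instance(n, p, s, alpha, sigma=0.5, bmin=1.0, rng=None):
    """Draw a two-problem instance: standard Gaussian designs and supports of
    size s overlapping in a fraction alpha of their entries."""
    rng = np.random.default_rng() if rng is None else rng
    overlap = int(round(alpha * s))
    S1 = np.arange(s)                                # first support
    S2 = np.concatenate([np.arange(overlap),         # shared indices
                         np.arange(s, 2 * s - overlap)])
    Xs, ys, betas = [], [], []
    for S in (S1, S2):
        beta = np.zeros(p)
        beta[S] = bmin * rng.choice([-1.0, 1.0], size=s)  # random signs
        X = rng.normal(size=(n, p))                  # standard Gaussian ensemble
        y = X @ beta + sigma * rng.normal(size=n)    # observation model (3)
        Xs.append(X); ys.append(y); betas.append(beta)
    return Xs, ys, betas
```

By construction the union support has size (2 − α)s, matching the logarithmic factor in the rescaled sample size.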
We can then study the probability of successful recovery as a function of the model triplet and the sample size n. In order to state our main result, we define the order parameter or rescaled sample size θ1,∞(n, p, s, α) := n / {(4 − 3α)s log(p − (2 − α)s)}. We also define the support gap value Bgap := ||β^1_B| − |β^2_B||, as well as the ℓ∞-gap c∞ := (1/ρ_n)‖T(Bgap)‖_∞, where T(Bgap) := ρ_n ∧ Bgap.

3.2.1 Sufficient conditions

We begin with a result that provides sufficient conditions for support recovery using block ℓ1/ℓ∞ regularization.

Theorem 2 (Achievability). Consider the observation model (3) with random designs drawn with i.i.d. standard Gaussian entries, and consider problem sequences (n, p, s, α) for which θ1,∞(n, p, s, α) > 1 + δ for some δ > 0. If we solve the block-regularized program (6) with ρ_n = ξ √(log p / n) and c∞ → 0, then with probability greater than 1 − c_1 exp(−c_2 log(p − (2 − α)s)), the following properties hold:

(i) The block ℓ1,∞-program (6) has a unique solution (β̂^1, β̂^2), with supports S(β̂^1) ⊆ J and S(β̂^2) ⊆ J. Moreover, we have the elementwise bound

max_{i=1,2} max_{k=1,...,p} |β̂^i_k − β^i_k| ≤ ξ √(100 log(s)/n) + ρ_n [4s/√n + 1] =: b_3(ξ, ρ_n, n, s).   (17)

(ii) If the support minimum B_min > 2 b_3(ξ, ρ_n, n, s), then the primal support method successfully recovers the support union J = S(β^1) ∪ S(β^2). Moreover, using the primal signed support recovery method (9), we have

[S_pri(β̂^i)]_k = sign(β^i_k)   for all k ∈ S(β^i).   (18)

3.2.2 Necessary conditions

We now turn to the question of finding matching necessary conditions for support recovery.

Theorem 3 (Lower bounds). 
Consider the observation model (3) with random designs drawn with i.i.d. standard Gaussian entries.

(a) For problem sequences (n, p, s, α) such that θ1,∞(n, p, s, α) < 1 − δ for some δ > 0, and for any non-increasing regularization sequence ρ_n > 0, no solution B̂ = (β̂^1, β̂^2) to the block-regularized program (6) has the correct support union S(β̂^1) ∪ S(β̂^2).

(b) Recalling the definition of Bgap, define the rescaled gap limit c_2(ρ_n, Bgap) := lim sup_{(n,p,s)} ‖T(Bgap)‖_2 / (ρ_n √s). If the sample size n is bounded as

n < (1 − δ) [(4 − 3α) + (c_2(ρ_n, Bgap))²] s log[p − (2 − α)s]

for some δ > 0, then the dual recovery method (10) fails to recover the individual signed supports.

It is important to note that c∞ ≥ c_2, which implies that as long as c∞ → 0, then c_2 → 0, so that the conditions of Theorem 3(a) and (b) are equivalent. However, note that if c_2 does not go to 0, then the method could in fact fail to recover the correct support even if θ1,∞ > 1 + δ. This result is key to understanding the ℓ1,∞-regularization term: the gap between the vectors plays a fundamental role in reducing the sampling complexity. Namely, if the gap is too large, then the sampling efficiency is greatly reduced as compared to when the gap is very small. In summary, while (a) and (b) seem equivalent on the surface, the requirement in (b) is in fact stronger than that in (a), and demonstrates the importance of the gap condition c∞ → 0 in Theorem 2. 
It shows that if the gap is too large, then correct joint support recovery is not possible.

3.3 Illustrative simulations and some consequences

In this section, we provide some illustrative simulations of the phase transitions predicted by Theorems 2 and 3, and show that the theory provides an accurate description of practice even for relatively small problem sizes (e.g., p = 128). Figure 1 plots the probability of successful recovery of the individual signed supports using dual support recovery (10)—namely, P[S_dua(β̂^1) = S_±(β^1), S_dua(β̂^2) = S_±(β^2)]—versus the order parameter θ1,∞(n, p, s, α). The plot contains four sets of "stacked" curves, each corresponding to a different choice of the overlap parameter, ranging from α = 1 (left-most stack) to α = 0.1 (right-most stack). Each stack contains three curves, corresponding to the problem sizes p ∈ {128, 256, 512}. In all cases, we fixed the support size s = 0.1p. The stacking behavior of these curves demonstrates that we have isolated the correct order parameter, and the step-function behavior is consistent with the theoretical predictions of a sharp threshold.

Theorems 2 and 3 have some interesting consequences, particularly in comparison to the behavior of the "naive" Lasso-based individual decoding of signed supports—that is, the method that simply applies the Lasso (ordinary ℓ1-regularization) to each column i = 1, 2 separately. By known results [10] on the Lasso, the performance of this naive approach is governed by the order parameter

θLas(n, p, s) = n / {2s log(p − s)},   (19)

meaning that for any δ > 0, it succeeds for sequences such that θLas > 1 + δ, and conversely fails for sequences such that θLas < 1 − δ. 
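Dividing θLas by θ1,∞ gives a direct numeric comparison of the two thresholds; the sample size n cancels, leaving a function of (p, s, α) alone (the function name below is ours):

```python
import math

def relative_efficiency(p, s, alpha):
    """theta_Las / theta_{1,infty}
       = (4 - 3*alpha)/2 * log(p - (2 - alpha)*s) / log(p - s).
    Values below 1 favor the block l1/l_inf method; above 1, the naive Lasso."""
    return (4 - 3 * alpha) / 2 * math.log(p - (2 - alpha) * s) / math.log(p - s)
```

For sublinear sparsity the log ratio tends to one, so the sign of the comparison is governed by (4 − 3α)/2, which crosses 1 exactly at α = 2/3.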
To compare the two methods, we define the relative efficiency coefficient R(θ1,∞, θLas) := θLas(n, p, s) / θ1,∞(n, p, s, α). A value of R < 1 implies that the block method is more efficient, while R > 1 implies that the naive method is more efficient.

With this notation, we have the following:

Corollary 1. The relative efficiency of the block ℓ1,∞ program (6) compared to the Lasso is given by R(θ1,∞, θLas) = [(4 − 3α)/2] · log(p − (2 − α)s) / log(p − s). Thus, for sublinear sparsity s/p → 0, the block scheme has greater statistical efficiency for all overlaps α ∈ (2/3, 1], but lower statistical efficiency for overlaps α ∈ [0, 2/3).

[Figure: probability of success for the ℓ1,∞ relaxation, for s = 0.1p and overlaps α ∈ {1, 0.7, 0.4, 0.1}, with problem sizes p ∈ {128, 256, 512}.]

Figure 1. Probability of success in recovering the joint signed supports plotted against the control parameter θ1,∞ = n/[2s log(p − (2 − α)s)] for linear sparsity s = 0.1p. Each stack of graphs corresponds to a fixed overlap α, as labeled on the figure. The three curves within each stack correspond to problem sizes p ∈ {128, 256, 512}; note how they all align with each other and exhibit step-like behavior, consistent with Theorems 2 and 3. The vertical lines correspond to the thresholds θ*_{1,∞}(α) predicted by Theorems 2 and 3; note the close agreement between theory and simulation.

References

[1] V. V. Buldygin and Y. V. Kozachenko. 
Metric characterization of random variables and random processes. American Mathematical Society, Providence, RI, 2000.

[2] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 2006.

[3] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Computing, 20(1):33–61, 1998.

[4] D. L. Donoho and J. M. Tanner. Counting faces of randomly-projected polytopes when the projection radically lowers dimension. Technical report, Stanford University, 2006. Submitted to Journal of the AMS.

[5] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. Technical report, Department of Statistics, UC Berkeley, January 2009.

[6] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection for grouped classification. Technical report, Statistics Department, UC Berkeley, 2007.

[7] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[8] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, 86:572–602, April 2006. Special issue on "Sparse approximations in signal and image processing".

[9] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349–363, 2005.

[10] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity using ℓ1-constrained quadratic programs. Technical Report 709, Department of Statistics, UC Berkeley, 2006.

[11] Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica, 16(2), 2006.

[12] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. 
Technical report, Statistics Department, UC Berkeley, 2007.", "award": [], "sourceid": 903, "authors": [{"given_name": "Sahand", "family_name": "Negahban", "institution": null}, {"given_name": "Martin", "family_name": "Wainwright", "institution": null}]}