{"title": "Structured Estimation with Atomic Norms: General Bounds and Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 2908, "page_last": 2916, "abstract": "For structured estimation problems with atomic norms, recent advances in the literature express sample complexity and estimation error bounds in terms of certain geometric measures, in particular Gaussian width of the unit norm ball, Gaussian width of a spherical cap induced by a tangent cone, and a restricted norm compatibility constant. However, given an atomic norm, bounding these geometric measures can be difficult. In this paper, we present general upper bounds for such geometric measures, which only require simple information of the atomic norm under consideration, and we establish tightness of these bounds by providing the corresponding lower bounds. We show applications of our analysis to certain atomic norms, especially k-support norm, for which existing result is incomplete.", "full_text": "Structured Estimation with Atomic Norms:\n\nGeneral Bounds and Applications\n\nSheng Chen\n\nArindam Banerjee\n\nDept. of Computer Science & Engg., University of Minnesota, Twin Cities\n\n{shengc,banerjee}@cs.umn.edu\n\nAbstract\n\nFor structured estimation problems with atomic norms, recent advances in the lit-\nerature express sample complexity and estimation error bounds in terms of certain\ngeometric measures, in particular Gaussian width of the unit norm ball, Gaussian\nwidth of a spherical cap induced by a tangent cone, and a restricted norm com-\npatibility constant. However, given an atomic norm, bounding these geometric\nmeasures can be dif\ufb01cult. In this paper, we present general upper bounds for such\ngeometric measures, which only require simple information of the atomic nor-\nm under consideration, and we establish tightness of these bounds by providing\nthe corresponding lower bounds. 
We show applications of our analysis to certain atomic norms, especially k-support norm, for which existing result is incomplete.\n\n1 Introduction\n\nAccurate recovery of structured sparse signal/parameter vectors from noisy linear measurements has been extensively studied in the fields of compressed sensing, statistics, etc. The goal is to recover a high-dimensional signal (parameter) θ∗ ∈ R^p which is sparse (only has a few nonzero entries), possibly with additional structure such as group sparsity. Typically one assumes a linear model, y = Xθ∗ + ω, in which X ∈ R^{n×p} is the design matrix consisting of n samples, y ∈ R^n is the observed response vector, and ω ∈ R^n is an unknown noise vector. By leveraging the sparsity of θ∗, previous work has shown that certain L1-norm based estimators [22, 7, 8] can find a good approximation of θ∗ using sample size n ≪ p. Recent work has extended the notion of unstructured sparsity to other structures in θ∗ which can be captured or approximated by some norm R(·) [10, 18, 3, 11, 6, 19] other than L1, e.g., (non)overlapping group sparsity with L1/L2 norm [24, 15], etc. In general, two broad classes of estimators are considered in recovery analysis: (i) Lasso-type estimators [22, 18, 3], which solve the regularized optimization problem\n\nθ̂_{λ_n} = argmin_{θ ∈ R^p} (1/(2n)) ‖Xθ − y‖_2^2 + λ_n R(θ) ,   (1)\n\nand (ii) Dantzig-type estimators [7, 11, 6], which solve the constrained problem\n\nθ̂_{λ_n} = argmin_{θ ∈ R^p} R(θ) s.t. R∗(X^T(Xθ − y)) ≤ λ_n ,   (2)\n\nwhere R∗(·) is the dual norm of R(·). 
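To make the two estimator classes concrete, the following minimal sketch evaluates the objective in (1) and the constraint value in (2) for the special case R = L1, so that R∗ is the L-infinity norm. The function names and the tiny design matrix are illustrative assumptions, not part of the paper, and the sketch only evaluates these quantities; it does not solve either optimization problem.

```python
def matvec(X, v):
    # X is a list of rows; returns the product X v as a list.
    return [sum(x_ij * v_j for x_ij, v_j in zip(row, v)) for row in X]


def residual(X, y, theta):
    # X theta - y
    return [p - t for p, t in zip(matvec(X, theta), y)]


def lasso_objective(X, y, theta, lam):
    # (1 / (2n)) * ||X theta - y||_2^2 + lam * ||theta||_1, i.e. (1) with R = L1.
    n = len(y)
    fit = sum(r * r for r in residual(X, y, theta)) / (2.0 * n)
    return fit + lam * sum(abs(t) for t in theta)


def dantzig_constraint_value(X, y, theta):
    # R*(X^T (X theta - y)) with R = L1, so R* is the L-infinity norm.
    r = residual(X, y, theta)
    columns = list(zip(*X))
    return max(abs(sum(c * r_i for c, r_i in zip(col, r))) for col in columns)


# Hypothetical toy data: identity design, one active coordinate.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.0]
print(lasso_objective(X, y, [1.0, 0.0], 0.5))
print(dantzig_constraint_value(X, y, [0.0, 0.0]))
```

A candidate θ is feasible for (2) exactly when the value returned by dantzig_constraint_value is at most λ_n, which is why the choice of λ_n is tied to a bound on R∗(X^T ω).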
Variants of these estimators exist [10, 19, 23], but the recovery analysis proceeds along similar lines as these two classes of estimators.\n\nTo establish recovery guarantees, [18] focused on Lasso-type estimators and R(·) from the class of decomposable norms, e.g., L1, non-overlapping L1/L2 norm. The upper bound for the estimation error ‖θ̂_{λ_n} − θ∗‖_2 for any decomposable norm is characterized in terms of three geometric measures: (i) a dual norm bound, as an upper bound for R∗(X^T ω), (ii) sample complexity, the minimal sample size needed for a certain restricted eigenvalue (RE) condition to be true [4, 18], and (iii) a restricted norm compatibility constant between R(·) and L2 norms [18, 3]. The non-asymptotic estimation error bound typically has the form ‖θ̂_{λ_n} − θ∗‖_2 ≤ c/√n, where c depends on a product of dual norm bound and restricted norm compatibility, whereas the sample complexity characterizes the minimum number of samples after which the error bound starts to be valid. In recent work, [3] extended the analysis of Lasso-type estimator for decomposable norm to any norm, and gave a more succinct characterization of the dual norm bound for R∗(X^T ω) and the sample complexity for the RE condition in terms of Gaussian widths [14, 10, 20, 1] of suitable sets where, for any set A ⊆ R^p, the Gaussian width is defined as\n\nw(A) = E sup_{u ∈ A} ⟨u, g⟩ ,   (3)\n\nwhere g is a standard Gaussian random vector. For Dantzig-type estimators, [11, 6] obtained similar extensions. To be specific, assume entries in X and ω are i.i.d. normal, and define the tangent cone,\n\nT_R(θ∗) = cone{u ∈ R^p | R(θ∗ + u) ≤ R(θ∗)} .   (4)\n\nThen one can get a (high-probability) upper bound for R∗(X^T ω) as O(√n w(Ω_R)), where Ω_R = {u ∈ R^p | R(u) ≤ 1} is the unit norm ball, and the RE condition is satisfied with O(w^2(T_R(θ∗) ∩ S^{p−1})) samples, in which S^{p−1} is the unit sphere. For convenience, we denote by C_R(θ∗) the spherical cap T_R(θ∗) ∩ S^{p−1} throughout the paper. Further, the restricted norm compatibility is given by Ψ_R(θ∗) = sup_{u ∈ T_R(θ∗)} R(u)/‖u‖_2 (see Section 2 for details).\n\nThus, for any given norm, it suffices to get a characterization of (i) w(Ω_R), the width of the unit norm ball, (ii) w(C_R(θ∗)), the width of the spherical cap induced by the tangent cone T_R(θ∗), and (iii) Ψ_R(θ∗), the restricted norm compatibility in the tangent cone. For the special case of L1 norm, accurate characterizations of all three measures exist [10, 18]. However, for more general norms, the literature is rather limited. For w(Ω_R), the characterization is often reduced to comparison with either w(C_R(θ∗)) [3] or known results on other norm balls [13]. While w(C_R(θ∗)) has been investigated for certain decomposable norms [10, 9, 1], little is known about general non-decomposable norms. One general approach for upper bounding w(C_R(θ∗)) is via the statistical dimension [10, 19, 1], which computes the expected squared distance between a Gaussian random vector and the polar cone of T_R(θ∗). To specify the polar, one needs full information of the subdifferential ∂R(θ∗), which could be difficult to obtain for non-decomposable norms. A notable bound for (overlapping) L1/L2 norms is presented in [21], which yields tight bounds for mildly non-overlapping cases, but is loose for highly overlapping ones. 
For \u03a8R(\u03b8\u2217), the restricted norm\ncompatibility, results are only available for decomposable norms [18, 3].\nIn this paper, we present a general set of bounds for the width w(\u2126R) of the norm ball, the width\nw(CR(\u03b8\u2217)) of the spherical cap, and the restricted norm compatibility \u03a8R(\u03b8\u2217). For the analysis,\nwe consider the class of atomic norms that are invariant under sign-changes, i.e., the norm of a\nvector stays unchanged if any entry changes only by \ufb02ipping its sign. The class is quite general, and\ncovers most of the popular norms used in practical applications, e.g., L1 norm, ordered weighted\nL1 (OWL) norm [5] and k-support norm [2]. Speci\ufb01cally we show that sharp bounds on w(\u2126R)\ncan be obtained using simple calculation based on a decomposition inequality from [16]. To upper\nbound w(CR(\u03b8\u2217)) and \u03a8R(\u03b8\u2217), instead of a full speci\ufb01cation of TR(\u03b8\u2217), we only require some\ninformation regarding the subgradient of R(\u03b8\u2217), which is often readily accessible. The key insight\nis that bounding statistical dimension often ends up computing the expected distance from Gaussian\nvector to a single point rather than to the whole polar cone, thus the full information on \u2202R(\u03b8\u2217) is\nunnecessary. In addition, we derive the corresponding lower bounds to show the tightness of our\nresults. As examples, we illustrate the bounds for L1 and OWL norms [5]. Finally, we give sharp\nbounds for the recently proposed k-support norm [2], for which existing analysis is incomplete.\nThe rest of the paper is organized as follows: we \ufb01rst review the relevant background for Dantzig-\ntype estimator and atomic norm in Section 2. In Section 3, we introduce the general bounds for the\ngeometric measures. In Section 4, we discuss the tightness of our bounds. 
Section 5 is dedicated to the example of k-support norm, and we conclude in Section 6.\n\n2 Background\n\nIn this section, we briefly review the recovery guarantee for the generalized Dantzig selector in (2) and the basics on atomic norms. The following lemma, originally [11, Theorem 1], provides an error bound for ‖θ̂_{λ_n} − θ∗‖_2. Related results have appeared for other estimators [18, 10, 19, 3, 23].\n\nLemma 1 Assume that y = Xθ∗ + ω, where entries of X and ω are i.i.d. copies of standard Gaussian random variable. If λ_n ≥ c_1 √n w(Ω_R) and n > c_2 w^2(T_R(θ∗) ∩ S^{p−1}) = c_2 w^2(C_R(θ∗)) for some constants c_1, c_2 > 1, then with high probability, the estimate θ̂_{λ_n} given by (2) satisfies\n\n‖θ̂_{λ_n} − θ∗‖_2 ≤ O( Ψ_R(θ∗) w(Ω_R)/√n ) .   (5)\n\nIn this lemma, there are three geometric measures, w(Ω_R), w(C_R(θ∗)) and Ψ_R(θ∗), which need to be determined for specific R(·) and θ∗. In this work, we focus on general atomic norms R(·). Given a set of atomic vectors A ⊂ R^p, the corresponding atomic norm of any θ ∈ R^p is given by\n\n‖θ‖_A = inf{ Σ_{a ∈ A} c_a : θ = Σ_{a ∈ A} c_a a, c_a ≥ 0 ∀ a ∈ A } .   (6)\n\nIn order for ‖·‖_A to be a valid norm, atomic vectors in A have to span R^p, and a ∈ A iff −a ∈ A. The unit ball of atomic norm ‖·‖_A is given by Ω_A = conv(A). In addition, we assume that the atomic set A contains v ⊙ a for any v ∈ {±1}^p if a belongs to A, where ⊙ denotes the elementwise (Hadamard) product for vectors. 
This assumption guarantees that both ‖·‖_A and its dual norm are invariant under sign-changes, which is satisfied by many widely used norms, such as L1 norm, OWL norm [5] and k-support norm [2]. For the rest of the paper, we will use Ω_A, T_A(θ∗), C_A(θ∗) and Ψ_A(θ∗) with A replaced by appropriate subscript for specific norms. For any vector u and coordinate set S, we define u_S by zeroing out all the coordinates outside S.\n\n3 General Analysis for Atomic Norms\n\nIn this section, we present detailed analysis of the general bounds for the geometric measures, w(Ω_A), w(C_A(θ∗)) and Ψ_A(θ∗). In general, knowing the atomic set A is sufficient for bounding w(Ω_A). For w(C_A(θ∗)) and Ψ_A(θ∗), we only need a single subgradient of ‖θ∗‖_A and some simple additional calculations.\n\n3.1 Gaussian width of unit norm ball\n\nAlthough the atomic set A may contain uncountably many vectors, we assume that A can be decomposed as a union of M “simple” sets, A = A_1 ∪ A_2 ∪ . . . ∪ A_M. By “simple,” we mean the Gaussian width of each A_i is easy to compute/bound. Such a decomposition assumption is often satisfied by commonly used atomic norms, e.g., L1, L1/L2, OWL, k-support norm. The Gaussian width of the unit norm ball of ‖·‖_A can be easily obtained using the following lemma, which is essentially Lemma 2 in [16]. Related results appear in [16].\n\nLemma 2 Let M > 4, A_1, ···, A_M ⊂ R^p, and A = ∪_m A_m. The Gaussian width of unit norm ball of ‖·‖_A satisfies\n\nw(Ω_A) = w(conv(A)) = w(A) ≤ max_{1≤m≤M} w(A_m) + 2 sup_{z ∈ A} ‖z‖_2 √(log M) .   (7)\n\nNext we illustrate application of this result to bounding the width of the unit norm ball of L1 and OWL norm.\n\nExample 1.1 (L1 norm): Recall that the L1 norm can be viewed as the atomic norm induced by the set A_{L1} = {±e_i : 1 ≤ i ≤ p}, where {e_i}_{i=1}^p is the canonical basis of R^p. Since the Gaussian width of a singleton is 0, if we treat A as the union of individual {+e_i} and {−e_i}, we have\n\nw(Ω_{L1}) ≤ 0 + 2√(log 2p) = O(√(log p)) .   (8)\n\nExample 1.2 (OWL norm): A recent variant of L1 norm is the so-called ordered weighted L1 (OWL) norm [13, 25, 5] defined as ‖θ‖_owl = Σ_{i=1}^p w_i |θ|↓_i, where w_1 ≥ w_2 ≥ . . . ≥ w_p ≥ 0 are pre-specified ordered weights, and |θ|↓ is the permutation of |θ| with entries sorted in decreasing order. In [25], the OWL norm is proved to be an atomic norm with atomic set\n\nA_owl = ∪_{1≤i≤p} A_i = ∪_{1≤i≤p} ∪_{|S|=i} { u ∈ R^p : u_{S^c} = 0, u_S = v_S / Σ_{j=1}^i w_j, v ∈ {±1}^p } .   (9)\n\nWe first apply Lemma 2 to each set A_i, and note that each A_i contains 2^i (p choose i) atomic vectors.\n\nw(A_i) ≤ 0 + 2 √( i / (Σ_{j=1}^i w_j)^2 ) √( log( 2^i (p choose i) ) ) ≤ ( 2i / Σ_{j=1}^i w_j ) √( 2 + log(p/i) ) ≤ (2/w̄) √( 2 + log(p/i) ) ,\n\nwhere w̄ is the average of w_1, . . . , w_p. Then we apply the lemma again to A_owl and obtain\n\nw(Ω_owl) = w(A_owl) ≤ (2/w̄) √(2 + log p) + (2/w̄) √(log p) = O( √(log p)/w̄ ) ,   (10)\n\nwhich matches the result in [13].\n\n3.2 Gaussian width of the intersection of tangent cone and unit sphere\n\nIn this subsection, we consider the computation of general w(C_A(θ∗)). Using the definition of dual norm, we can write ‖θ∗‖_A as ‖θ∗‖_A = sup_{‖u‖∗_A ≤ 1} ⟨u, θ∗⟩, where ‖·‖∗_A denotes the dual norm of ‖·‖_A. The u∗ for which ⟨u∗, θ∗⟩ = ‖θ∗‖_A is a subgradient of ‖θ∗‖_A. One can obtain u∗ by simply solving the so-called polar operator [26] for the dual norm ‖·‖∗_A,\n\nu∗ ∈ argmax_{‖u‖∗_A ≤ 1} ⟨u, θ∗⟩ .   (11)\n\nBased on polar operator, we start with Lemma 3, which plays a key role in our analysis.\n\nLemma 3 Let u∗ be a solution to the polar operator (11), and define the weighted L1 semi-norm ‖·‖_{u∗} as ‖v‖_{u∗} = Σ_{i=1}^p |u∗_i| · |v_i|. Then the following relation holds\n\nT_A(θ∗) ⊆ T_{u∗}(θ∗) ,\n\nwhere T_{u∗}(θ∗) = cone{v ∈ R^p | ‖θ∗ + v‖_{u∗} ≤ ‖θ∗‖_{u∗}}.\n\nThe proof of this lemma is in supplementary material. Note that the solution to (11) may not be unique. 
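As an illustration of this non-uniqueness, consider the L1 norm, whose dual is the L-infinity norm. The minimal sketch below, with hypothetical helper names that do not come from the paper, builds two different maximizers of (11): on the support of θ∗ the entries must equal the sign of θ∗, while off the support any value in [−1, 1] is feasible and optimal.

```python
def l1_norm(theta):
    return sum(abs(t) for t in theta)


def linf_norm(u):
    return max(abs(x) for x in u)


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(t):
    return (t > 0) - (t < 0)


def l1_polar_solution(theta, off_support_value):
    # On the support of theta the maximizer equals sign(theta_i);
    # off the support we are free to pick any value in [-1, 1].
    return [float(sign(t)) if t != 0 else off_support_value for t in theta]


theta = [3.0, -2.0, 0.0, 0.0]
u_sparse = l1_polar_solution(theta, 0.0)  # zeros off the support
u_dense = l1_polar_solution(theta, 1.0)   # no zero entries

for u in (u_sparse, u_dense):
    # Both candidates are dual-feasible and attain the L1 norm of theta.
    assert linf_norm(u) <= 1.0
    assert inner(u, theta) == l1_norm(theta)
```

Both vectors solve the polar operator, but they differ in how many zero entries they have, which is the property that matters when a particular u∗ is selected.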
A good criterion for choosing u∗ is to avoid zeros in u∗, as any u∗_i = 0 will lead to the unboundedness of unit ball of ‖·‖_{u∗}, which could potentially increase the size of T_{u∗}(θ∗). Next we present the upper bound for w(C_A(θ∗)).\n\nTheorem 4 Suppose that u∗ is one of the solutions to (11), and define the following sets,\n\nQ = {i | u∗_i = 0}, S = {i | u∗_i ≠ 0, θ∗_i ≠ 0}, R = {i | u∗_i ≠ 0, θ∗_i = 0} .\n\nThe Gaussian width w(C_A(θ∗)) is upper bounded by\n\nw(C_A(θ∗)) ≤ { √p , if R is empty ;  √( m + (3/2) s + (2κ_max^2/κ_min^2) s log((p − m)/s) ) , if R is nonempty } ,   (12)\n\nwhere m = |Q|, s = |S|, κ_min = min_{i ∈ R} |u∗_i| and κ_max = max_{i ∈ S} |u∗_i|.\n\nProof: By Lemma 3, we have w(C_A(θ∗)) ≤ w(T_{u∗}(θ∗) ∩ S^{p−1}) ≜ w(C_{u∗}(θ∗)). Hence we can focus on bounding w(C_{u∗}(θ∗)). We first analyze the structure of v that satisfies ‖θ∗ + v‖_{u∗} ≤ ‖θ∗‖_{u∗}. For the coordinates Q = {i | u∗_i = 0}, the corresponding entries v_i’s can be arbitrary since they do not affect the value of ‖θ∗ + v‖_{u∗}. Thus all possible v_Q form an m-dimensional subspace, where m = |Q|. For S ∪ R = {i | u∗_i ≠ 0}, we define θ̃ = θ∗_{S∪R} and ṽ = v_{S∪R}, and ṽ needs to satisfy\n\n‖ṽ + θ̃‖_{u∗} ≤ ‖θ̃‖_{u∗} ,\n\nwhich is similar to the L1-norm tangent cone except that coordinates are weighted by |u∗|. Therefore we use the techniques for proving Proposition 3.10 in [10]. 
Based on the structure of v, the normal cone at θ∗ for T_{u∗}(θ∗) is given by\n\nN(θ∗) = {z : ⟨z, v⟩ ≤ 0 ∀ v s.t. ‖v + θ∗‖_{u∗} ≤ ‖θ∗‖_{u∗}}\n= {z : z_i = 0 for i ∈ Q, z_i = |u∗_i| sign(θ̃_i) t for i ∈ S, |z_i| ≤ |u∗_i| t for i ∈ R, for any t ≥ 0} .\n\nGiven a standard Gaussian random vector g, using the relation between Gaussian width and statistical dimension (Proposition 2.4 and 10.2 in [1]), we have\n\nw^2(C_{u∗}(θ∗)) ≤ E[ inf_{z ∈ N(θ∗)} ‖z − g‖_2^2 ] = E[ inf_{z ∈ N(θ∗)} Σ_{i ∈ Q} g_i^2 + Σ_{j ∈ S} (z_j − g_j)^2 + Σ_{k ∈ R} (z_k − g_k)^2 ]\n= |Q| + E[ inf_{z_{S∪R} ∈ N(θ∗)} Σ_{j ∈ S} (|u∗_j| sign(θ̃_j) t − g_j)^2 + Σ_{k ∈ R} (z_k − g_k)^2 ]\n≤ |Q| + t^2 Σ_{j ∈ S} |u∗_j|^2 + |S| + E[ Σ_{k ∈ R} inf_{|z_k| ≤ |u∗_k| t} (z_k − g_k)^2 ]\n≤ |Q| + t^2 Σ_{j ∈ S} |u∗_j|^2 + |S| + Σ_{k ∈ R} (2/√(2π)) ∫_{|u∗_k| t}^{+∞} (g_k − |u∗_k| t)^2 exp(−g_k^2/2) dg_k\n≤ |Q| + t^2 Σ_{j ∈ S} |u∗_j|^2 + |S| + Σ_{k ∈ R} (2/√(2π)) (1/(|u∗_k| t)) exp(−|u∗_k|^2 t^2/2)   (∗) .\n\nThe details for the derivation above can be found in Appendix C of [10]. 
If R is empty, by taking t = 0, we have\n\n(∗) ≤ |Q| + t^2 Σ_{j ∈ S} |u∗_j|^2 + |S| = |Q| + |S| = p .\n\nIf R is nonempty, we denote κ_min = min_{i ∈ R} |u∗_i| and κ_max = max_{i ∈ S} |u∗_i|. Taking t = (1/κ_min) √( 2 log(|S ∪ R|/|S|) ), we obtain\n\n(∗) ≤ |Q| + |S|(κ_max^2 t^2 + 1) + 2|R| exp(−κ_min^2 t^2/2) / (√(2π) κ_min t)\n= |Q| + |S| ( (2κ_max^2/κ_min^2) log(|S ∪ R|/|S|) + 1 ) + |R||S| / ( |S ∪ R| √( π log(|S ∪ R|/|S|) ) )\n≤ |Q| + (2κ_max^2/κ_min^2) |S| log(|S ∪ R|/|S|) + (3/2) |S| .\n\nSubstituting |Q| = m, |S| = s and |S ∪ R| = p − m into the last inequality completes the proof.\n\nSuppose that θ∗ is an s-sparse vector. We illustrate the above bound on the Gaussian width of the spherical cap using L1 norm and OWL norm as examples.\n\nExample 2.1 (L1 norm): The dual norm of L1 is L∞ norm, and it is easy to verify that u∗ = [1, 1, . . . , 1]^T ∈ R^p is a solution to (11). Applying Theorem 4 to u∗, we have\n\nw(C_{L1}(θ∗)) ≤ √( (3/2) s + 2s log(p/s) ) = O( √( s + s log(p/s) ) ) .\n\nExample 2.2 (OWL norm): For OWL, its dual norm is given by ‖u‖∗_owl = max_{b ∈ A_owl} ⟨b, u⟩. W.l.o.g. we assume θ∗ = |θ∗|↓, and a solution to (11) is given by u∗ = [w_1, . . . , w_s, w̃, w̃, . . . , w̃]^T, in which w̃ is the average of w_{s+1}, . . . , w_p. If all w_i’s are nonzero, the Gaussian width satisfies\n\nw(C_owl(θ∗)) ≤ √( (3/2) s + (2w_1^2/w̃^2) s log(p/s) ) .\n\n3.3 Restricted norm compatibility\n\nThe next theorem gives general upper bounds for the restricted norm compatibility Ψ_A(θ∗).\n\nTheorem 5 Assume that ‖u‖_A ≤ max{β_1‖u‖_1, β_2‖u‖_2} for all u ∈ R^p. Under the setting of Theorem 4, the restricted norm compatibility Ψ_A(θ∗) is upper bounded by\n\nΨ_A(θ∗) ≤ { Φ , if R is empty ;  Φ_Q + max{ β_2, β_1 (1 + κ_max/κ_min) √s } , if R is nonempty } ,   (13)\n\nwhere Φ = sup_{u ∈ R^p} ‖u‖_A/‖u‖_2 and Φ_Q = sup_{supp(u) ⊆ Q} ‖u‖_A/‖u‖_2.\n\nProof: As analyzed in the proof of Theorem 4, v_Q for v ∈ T_{u∗}(θ∗) can be arbitrary, and v_{S∪R} = v_{Q^c} satisfies\n\n‖v_{Q^c} + θ∗_{Q^c}‖_{u∗} ≤ ‖θ∗_{Q^c}‖_{u∗}\n⟹ Σ_{i ∈ S} |θ∗_i + v_i||u∗_i| + Σ_{j ∈ R} |v_j||u∗_j| ≤ Σ_{i ∈ S} |θ∗_i||u∗_i|\n⟹ Σ_{i ∈ S} (|θ∗_i| − |v_i|)|u∗_i| + Σ_{j ∈ R} |v_j||u∗_j| ≤ Σ_{i ∈ S} |θ∗_i||u∗_i|\n⟹ κ_min‖v_R‖_1 ≤ κ_max‖v_S‖_1 .\n\nIf R is empty, by Lemma 3, we obtain\n\nΨ_A(θ∗) ≤ Ψ_{u∗}(θ∗) ≜ sup_{v ∈ T_{u∗}(θ∗)} ‖v‖_A/‖v‖_2 ≤ sup_{v ∈ R^p} ‖v‖_A/‖v‖_2 = Φ .\n\nIf R is nonempty, we have\n\nΨ_A(θ∗) ≤ Ψ_{u∗}(θ∗) ≤ sup_{v ∈ T_{u∗}(θ∗)} (‖v_Q‖_A + ‖v_{Q^c}‖_A)/‖v‖_2 ≤ sup_{supp(v) ⊆ Q, supp(v′) ⊆ Q^c, κ_min‖v′_R‖_1 ≤ κ_max‖v′_S‖_1} (‖v‖_A + ‖v′‖_A)/‖v + v′‖_2\n≤ sup_{supp(v) ⊆ Q} ‖v‖_A/‖v‖_2 + sup_{supp(v′) ⊆ Q^c, κ_min‖v′_R‖_1 ≤ κ_max‖v′_S‖_1} max{β_1‖v′‖_1, β_2‖v′‖_2}/‖v′‖_2\n≤ Φ_Q + max{ β_2, sup_{supp(v′) ⊆ S} β_1 (1 + κ_max/κ_min) ‖v′‖_1/‖v′‖_2 } ≤ Φ_Q + max{ β_2, β_1 (1 + κ_max/κ_min) √s } ,\n\nin which the last inequality in the first line uses the property of T_{u∗}(θ∗).\n\nRemark: We call Φ the unrestricted norm compatibility, and Φ_Q the subspace norm compatibility, both of which are often easier to compute than Ψ_A(θ∗). The β_1 and β_2 in the assumption of ‖·‖_A can have multiple choices, and one has the flexibility to choose the one that yields the tightest bound.\n\nExample 3.1 (L1 norm): To apply Theorem 5 to L1 norm, we can choose β_1 = 1 and β_2 = 0. We recall the u∗ for L1 norm, whose Q is empty while R is nonempty. 
So we have for s-sparse θ∗\n\nΨ_{L1}(θ∗) ≤ 0 + max{ 0, (1 + 1/1) √s } = 2√s .\n\nExample 3.2 (OWL norm): For OWL, note that ‖·‖_owl ≤ w_1‖·‖_1. Hence we choose β_1 = w_1 and β_2 = 0. As a result, we similarly have for s-sparse θ∗\n\nΨ_owl(θ∗) ≤ 0 + max{ 0, w_1 (1 + w_1/w̃) √s } ≤ (2w_1^2/w̃) √s .\n\n4 Tightness of the General Bounds\n\nSo far we have shown that the geometric measures can be upper bounded for general atomic norms. One might wonder how tight the bounds in Section 3 are for these measures. For w(Ω_A), as the result from [16] depends on the decomposition of A for the ease of computation, it might be tricky to discuss its tightness in general. Hence we will focus on the other two, w(C_A(θ∗)) and Ψ_A(θ∗). To characterize the tightness, we need to compare the lower bounds of w(C_A(θ∗)) and Ψ_A(θ∗) with their upper bounds determined by u∗. While there can be multiple u∗, it is easy to see that any convex combination of them is also a solution to (11). Therefore we can always find a u∗ that has the largest support, i.e., supp(u′) ⊆ supp(u∗) for any other solution u′. We will use such u∗ to generate the lower bounds. First we need the following lemma for the cone T_A(θ∗).\n\nLemma 6 Consider a solution u∗ to (11), which satisfies supp(u′) ⊆ supp(u∗) for any other solution u′. Under the setting of notations in Theorem 4, we define an additional set of coordinates P = {i | u∗_i = 0, θ∗_i = 0}. Then the tangent cone T_A(θ∗) satisfies\n\nT_1 ⊕ T_2 ⊆ cl(T_A(θ∗)) ,   (14)\n\nwhere ⊕ denotes the direct (Minkowski) sum operation, cl(·) denotes the closure, T_1 = {v ∈ R^p | v_i = 0 for i ∉ P} is a |P|-dimensional subspace, and T_2 = {v ∈ R^p | sign(v_i) = −sign(θ∗_i) for i ∈ supp(θ∗), v_i = 0 for i ∉ supp(θ∗)} is a |supp(θ∗)|-dimensional orthant.\n\nThe proof of Lemma 6 is given in supplementary material. The following theorem gives us the lower bound for w(C_A(θ∗)) and Ψ_A(θ∗).\n\nTheorem 7 Under the setting of Theorem 4 and Lemma 6, the following lower bounds hold,\n\nw(C_A(θ∗)) ≥ O(√(m + s)) ,   (15)\nΨ_A(θ∗) ≥ Φ_{Q∪S} .   (16)\n\nProof: To lower bound w(C_A(θ∗)), we use Lemma 6 and the relation between Gaussian width and statistical dimension (Proposition 10.2 in [1]),\n\nw(C_A(θ∗)) ≥ w((T_1 ⊕ T_2) ∩ S^{p−1}) ≥ √( E[ inf_{z ∈ N_{T_1⊕T_2}(θ∗)} ‖z − g‖_2^2 ] − 1 )   (∗) ,\n\nwhere the normal cone N_{T_1⊕T_2}(θ∗) of T_1 ⊕ T_2 is given by N_{T_1⊕T_2}(θ∗) = {z : z_i = 0 for i ∈ P, sign(z_i) = sign(θ∗_i) for i ∈ supp(θ∗)}. Hence we have\n\n(∗) = √( E[ Σ_{i ∈ P} g_i^2 + Σ_{j ∈ supp(θ∗)} g_j^2 I{g_j θ∗_j < 0} ] − 1 ) = √( |P| + |supp(θ∗)|/2 − 1 ) = O(√(m + s)) ,\n\nwhere the last equality follows from the fact that P ∪ supp(θ∗) = Q ∪ S. This completes the proof of (15). To prove (16), we again use Lemma 6 and the fact P ∪ supp(θ∗) = Q ∪ S. 
Noting that ‖·‖_A is invariant under sign-changes, we get\n\nΨ_A(θ∗) = sup_{v ∈ T_A(θ∗)} ‖v‖_A/‖v‖_2 ≥ sup_{v ∈ T_1⊕T_2} ‖v‖_A/‖v‖_2 = sup_{supp(v) ⊆ P ∪ supp(θ∗)} ‖v‖_A/‖v‖_2 = Φ_{Q∪S} .\n\nRemark: We compare the lower bounds (15) (16) with the upper bounds (12) (13). If R is empty, m + s = p, and the lower bounds actually match the upper bounds up to a constant factor for both w(C_A(θ∗)) and Ψ_A(θ∗). If R is nonempty, the lower and upper bounds of w(C_A(θ∗)) differ by a multiplicative factor (2κ_max^2/κ_min^2) log((p − m)/s), which can be small in practice. For Ψ_A(θ∗), as Φ_{Q∪S} ≥ Φ_Q, we usually have at most an additive O(√s) term in upper bound, since the assumption on ‖·‖_A often holds with a constant β_1 and β_2 = 0 for most norms.\n\n5 Application to the k-Support Norm\n\nIn this section, we apply our general results on geometric measures to a non-trivial example, k-support norm [2], which has been proved effective for sparse recovery [11, 17, 12]. The k-support norm can be viewed as an atomic norm, for which A = {a ∈ R^p | ‖a‖_0 ≤ k, ‖a‖_2 ≤ 1}. 
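A minimal sketch of how this atomic set determines the dual norm: since the dual of an atomic norm is the supremum of the inner product over the atoms, maximizing over vectors with at most k nonzeros and L2 norm at most 1 keeps the k largest magnitudes of the input and normalizes them. The function names below are hypothetical and chosen only for illustration.

```python
import math


def dual_k_support_norm(u, k):
    # Supremum of <a, u> over atoms a with at most k nonzeros and ||a||_2 <= 1:
    # the L2 norm of the k largest entries of |u|.
    top = sorted((abs(x) for x in u), reverse=True)[:k]
    return math.sqrt(sum(x * x for x in top))


def maximizing_atom(u, k):
    # One atom attaining the supremum: u restricted to its k largest
    # magnitudes, rescaled to unit L2 norm (the zero vector if u is zero).
    order = sorted(range(len(u)), key=lambda i: abs(u[i]), reverse=True)[:k]
    scale = dual_k_support_norm(u, k)
    atom = [0.0] * len(u)
    if scale > 0:
        for i in order:
            atom[i] = u[i] / scale
    return atom


u = [3.0, -4.0, 1.0, 0.5]
value = dual_k_support_norm(u, 2)
atom = maximizing_atom(u, 2)
print(value)
print(atom)
```

Sorting dominates the cost of this sketch, so evaluating the dual norm takes O(p log p) time for a p-dimensional input.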
The k-support norm can be explicitly expressed as an infimum convolution given by\n\n‖θ‖^sp_k = inf_{Σ_i u_i = θ} { Σ_i ‖u_i‖_2 | ‖u_i‖_0 ≤ k } ,   (17)\n\nand its dual norm is the so-called 2-k symmetric gauge norm defined as\n\n‖θ‖^{sp∗}_k = ‖θ‖_(k) = ‖ |θ|↓_{1:k} ‖_2 .   (18)\n\nIt is straightforward to see that the dual norm is simply the L2 norm of the largest k entries in |θ|. Suppose that all the sets of coordinates with cardinality k can be listed as S_1, S_2, . . . , S_{(p choose k)}. Then A can be written as A = A_1 ∪ . . . ∪ A_{(p choose k)}, where each A_i = {a ∈ R^p | supp(a) ⊆ S_i, ‖a‖_2 ≤ 1}. It is not difficult to see that w(A_i) = E[ sup_{a ∈ A_i} ⟨a, g⟩ ] = E‖g_{S_i}‖_2 ≤ √( E‖g_{S_i}‖_2^2 ) ≤ √k. Using Lemma 2, we know the Gaussian width of the unit ball of k-support norm\n\nw(Ω^sp_k) ≤ √k + 2√( log (p choose k) ) ≤ √k + 2√( k log(p/k) + k ) = O( √( k log(p/k) + k ) ) ,   (19)\n\nwhich matches that in [11]. Now we turn to the calculation of w(C^sp_k(θ∗)) and Ψ^sp_k(θ∗). As we have seen in the general analysis, the solution u∗ to the polar operator (11) is important in characterizing the two quantities. We first present a simple procedure in Algorithm 1 for solving the polar operator for ‖·‖^{sp∗}_k. The time complexity is only O(p log p + k). 
This procedure can be utilized to compute\nthe k-support norm, or be applied to estimation with (cid:107) \u00b7 (cid:107)sp\u2217\nusing generalized conditional gradient\nmethod [26], which requires solving the polar operator in each iteration.\n\n(cid:18)(cid:114)\n\n(cid:18)p\n\n(cid:16) p\n\nk ) \u2264\n\n(cid:114)\n\nw(\u2126sp\n\n(cid:19)\n\nk + 2\n\nk + 2\n\n(cid:17)\n\nk log\n\nk log\n\n+ k\n\n\u221a\n\n\u221a\n\n\u2264\n\nlog\n\nk\n\nk\n\nk\n\n,\n\n7\n\n\fk\n\nAlgorithm 1 Solving polar operator for (cid:107) \u00b7 (cid:107)sp\u2217\nInput: \u03b8\u2217 \u2208 Rp, positive integer k\nsolution u\u2217 to the polar operator (11)\nOutput:\n1: z = |\u03b8\u2217|\u2193, t = 0\n2: for i = 1 to k do\n3:\n\n\u03b31 = (cid:107)z1:i\u22121(cid:107)2, \u03b32 = (cid:107)zi:p(cid:107)1, d = k \u2212 i + 1, \u03b2 =\nif \u03b32\n\n2\u03b1 + \u03b2\u03b32, u\u2217 = [w, \u03b21]T\n\n2\u03b1 + \u03b2\u03b32 > t and \u03b2 < wi\u22121 then\nt = \u03b32\n\n4:\n5:\nend if\n6:\n7: end for\n8: change the sign and order of u\u2217 to conform with \u03b8\u2217\n9: return u\u2217\n\n1\n\n1\n\n\u03b32\u221a\n\n1 d2 , \u03b1 =\n\n\u03b32\n2 d+\u03b32\n\n\u221a\n\n2\n\n\u03b31\n1\u2212\u03b22d\n\n, w = z1:i\u22121\n2\u03b1\n\n(1 is (p \u2212 i + 1)-dimensional vector with all ones)\n\nTheorem 8 For a given \u03b8\u2217, Algorithm 1 returns a solution to polar operator (11) for (cid:107) \u00b7 (cid:107)sp\u2217\nk .\nThe proof of this theorem is provided in supplementary material. Now we consider w(Csp\nk (\u03b8\u2217))\nk (\u03b8\u2217) for s-sparse \u03b8\u2217 (here s-sparse \u03b8\u2217 means | supp(\u03b8\u2217)| = s) in three scenarios: (i) over-\nand \u03a8sp\nspeci\ufb01ed k, where s < k, (ii) exactly speci\ufb01ed k, where s = k, and (iii) under-speci\ufb01ed k, where\ns > k. 
The bounds are given in Theorem 9, and the proof is also in the supplementary material.\n\nTheorem 9 For given s-sparse $\\theta^* \\in \\mathbb{R}^p$, the Gaussian width $w(C_k^{sp}(\\theta^*))$ and the restricted norm compatibility $\\Psi_k^{sp}(\\theta^*)$ for a specified k are given by\n\n$$w(C_k^{sp}(\\theta^*)) \\le \\begin{cases} \\sqrt{p} , & \\text{if } s < k \\\\ \\sqrt{\\frac{3}{2} s + \\frac{2\\theta_{\\max}^{*2}}{\\theta_{\\min}^{*2}} s \\log\\left(\\frac{p}{s}\\right)} , & \\text{if } s = k \\\\ \\sqrt{\\frac{3}{2} s + \\frac{2\\kappa_{\\max}^2}{\\kappa_{\\min}^2} s \\log\\left(\\frac{p}{s}\\right)} , & \\text{if } s > k \\end{cases} , \\qquad \\Psi_k^{sp}(\\theta^*) \\le \\begin{cases} \\sqrt{\\frac{2p}{k}} , & \\text{if } s < k \\\\ \\sqrt{2} \\left(1 + \\frac{\\theta_{\\max}^*}{\\theta_{\\min}^*}\\right) , & \\text{if } s = k \\\\ \\left(1 + \\frac{\\kappa_{\\max}}{\\kappa_{\\min}}\\right) \\sqrt{\\frac{2s}{k}} , & \\text{if } s > k \\end{cases} , \\qquad (20)$$\n\nwhere $\\theta_{\\max}^* = \\max_{i \\in \\mathrm{supp}(\\theta^*)} |\\theta_i^*|$ and $\\theta_{\\min}^* = \\min_{i \\in \\mathrm{supp}(\\theta^*)} |\\theta_i^*|$.\n\nRemark: Previously $\\Psi_k^{sp}(\\theta^*)$ was unknown, and the bound on $w(C_k^{sp}(\\theta^*))$ given in [11] is loose, as it used the result in [21]. Based on Theorem 9, we note that the choice of k can affect the recovery guarantees. Over-specified k leads to a direct dependence on the dimensionality p for $w(C_k^{sp}(\\theta^*))$ and $\\Psi_k^{sp}(\\theta^*)$, resulting in a weak error bound. The bounds are sharp for exactly specified or under-specified k. Thus, it is better to under-specify k in practice, where the estimation error satisfies\n\n$$\\|\\hat{\\theta}_{\\lambda_n} - \\theta^*\\|_2 \\le O\\left(\\sqrt{\\frac{s + s \\log(p/k)}{n}}\\right) . \\qquad (21)$$\n\n6 Conclusions\n\nIn this work, we study the problem of structured estimation with general atomic norms that are invariant under sign-changes. 
Based on Dantzig-type estimators, we provide general bounds for the geometric measures. For $w(\\Omega_A)$, instead of comparison with other results or direct calculation, we demonstrate a third way to compute it, based on a decomposition of the atomic set $A$. For $w(C_A(\\theta^*))$ and $\\Psi_A(\\theta^*)$, we derive general upper bounds, which only require the knowledge of a single subgradient of $\\|\\theta^*\\|_A$. We also show that these upper bounds are close to the lower bounds, which makes them practical in general. To illustrate our results, we discuss the application to the k-support norm in detail and shed light on the choice of k in practice.\n\nAcknowledgements\nThe research was supported by NSF grants IIS-1447566, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, and by NASA grant NNX12AQ39A.\n\nReferences\n[1] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: Phase transitions in convex programs with random data. Inform. Inference, 3(3):224\u2013294, 2014.\n\n[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems (NIPS), 2012.\n\n[3] A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[4] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705\u20131732, 2009.\n\n[5] M. Bogdan, E. van den Berg, W. Su, and E. Cand\u00e8s. Statistical estimation and testing via the sorted L1 norm. arXiv:1310.1969, 2013.\n\n[6] T. T. Cai, T. Liang, and A. Rakhlin. Geometrizing Local Rates of Convergence for High-Dimensional Linear Inverse Problems. arXiv:1404.4408, 2014.\n\n[7] E. Cand\u00e8s and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. 
The Annals of Statistics, 35(6):2313\u20132351, 2007.\n\n[8] E. J. Cand\u00e8s, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207\u20131223, 2006.\n\n[9] E. J. Cand\u00e8s and B. Recht. Simple bounds for recovering low-complexity models. Math. Program., 141(1-2):577\u2013589, 2013.\n\n[10] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805\u2013849, 2012.\n\n[11] S. Chatterjee, S. Chen, and A. Banerjee. Generalized Dantzig selector: Application to the k-support norm. In Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[12] S. Chen and A. Banerjee. One-bit compressed sensing with the k-support norm. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.\n\n[13] M. A. T. Figueiredo and R. D. Nowak. Sparse estimation with strongly correlated variables using ordered weighted l1 regularization. arXiv:1409.4005, 2014.\n\n[14] Y. Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265\u2013289, 1985.\n\n[15] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning (ICML), 2009.\n\n[16] A. Maurer, M. Pontil, and B. Romera-Paredes. An Inequality with Applications to Structured Sparsity and Multitask Dictionary Learning. In Conference on Learning Theory (COLT), 2014.\n\n[17] A. M. McDonald, M. Pontil, and D. Stamos. Spectral k-support norm regularization. In Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[18] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for the analysis of regularized M-estimators. Statistical Science, 27(4):538\u2013557, 2012.\n\n[19] S. Oymak, C. Thrampoulidis, and B. Hassibi. 
The Squared-Error of Generalized Lasso: A Precise Analysis. arXiv:1311.0830, 2013.\n\n[20] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482\u2013494, 2013.\n\n[21] N. Rao, B. Recht, and R. Nowak. Universal Measurement Bounds for Structured Sparse Signal Recovery. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.\n\n[22] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n\n[23] J. A. Tropp. Convex recovery of a structured signal from independent random linear measurements. In Sampling Theory, a Renaissance. 2015.\n\n[24] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49\u201367, 2006.\n\n[25] X. Zeng and M. A. T. Figueiredo. The Ordered Weighted $\\ell_1$ Norm: Atomic Formulation, Projections, and Algorithms. arXiv:1409.4271, 2014.\n\n[26] X. Zhang, Y. Yu, and D. Schuurmans. Polar operators for structured sparse estimation. In Advances in Neural Information Processing Systems (NIPS), 2013.", "award": [], "sourceid": 1659, "authors": [{"given_name": "Sheng", "family_name": "Chen", "institution": "University of Minnesota"}, {"given_name": "Arindam", "family_name": "Banerjee", "institution": "University of Minnesota"}]}