{"title": "Multi-Stage Dantzig Selector", "book": "Advances in Neural Information Processing Systems", "page_first": 1450, "page_last": 1458, "abstract": "We consider the following sparse signal recovery (or feature selection) problem: given a design matrix $X\\in \\mathbb{R}^{n\\times m}$ $(m\\gg n)$ and a noisy observation vector $y\\in \\mathbb{R}^{n}$ satisfying $y=X\\beta^*+\\epsilon$ where $\\epsilon$ is the noise vector following a Gaussian distribution $N(0,\\sigma^2I)$, how to recover the signal (or parameter vector) $\\beta^*$ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal $\\beta^*$. We show that if $X$ obeys a certain condition, then with a large probability the difference between the solution $\\hat\\beta$ estimated by the proposed method and the true solution $\\beta^*$ measured in terms of the $l_p$ norm ($p\\geq 1$) is bounded as \\begin{equation*} \\|\\hat\\beta-\\beta^*\\|_p\\leq \\left(C(s-N)^{1/p}\\sqrt{\\log m}+\\Delta\\right)\\sigma, \\end{equation*} $C$ is a constant, $s$ is the number of nonzero entries in $\\beta^*$, $\\Delta$ is independent of $m$ and is much smaller than the first term, and $N$ is the number of entries of $\\beta^*$ larger than a certain value in the order of $\\mathcal{O}(\\sigma\\sqrt{\\log m})$. The proposed method improves the estimation bound of the standard Dantzig selector approximately from $Cs^{1/p}\\sqrt{\\log m}\\sigma$ to $C(s-N)^{1/p}\\sqrt{\\log m}\\sigma$ where the value $N$ depends on the number of large entries in $\\beta^*$. When $N=s$, the proposed algorithm achieves the oracle solution with a high probability. 
In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector.", "full_text": "Multi-Stage Dantzig Selector\n\nJi Liu, Peter Wonka, Jieping Ye\n\n{ji.liu,peter.wonka,jieping.ye}@asu.edu\n\nArizona State University\n\nAbstract\n\nWe consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ R^{n×m} (m ≫ n) and a noisy observation vector y ∈ R^n satisfying y = Xβ* + ε, where ε is the noise vector following a Gaussian distribution N(0, σ²I), how to recover the signal (or parameter vector) β* when the signal is sparse?\n\nThe Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β*. We show that if X obeys a certain condition, then with a large probability the difference between the solution β̂ estimated by the proposed method and the true solution β* measured in terms of the l_p norm (p ≥ 1) is bounded as\n\n‖β̂ − β*‖_p ≤ (C(s − N)^{1/p} √(log m) + Δ)σ,\n\nwhere C is a constant, s is the number of nonzero entries in β*, Δ is independent of m and is much smaller than the first term, and N is the number of entries of β* larger than a certain value in the order of O(σ√(log m)). The proposed method improves the estimation bound of the standard Dantzig selector approximately from Cs^{1/p}√(log m) σ to C(s − N)^{1/p}√(log m) σ, where the value N depends on the number of large entries in β*. When N = s, the proposed algorithm achieves the oracle solution with a high probability. 
In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector.\n\n1 Introduction\n\nThe sparse signal recovery problem has been studied in many areas including machine learning [18, 19, 22], signal processing [8, 14, 17], and mathematics/statistics [2, 5, 7, 10, 11, 12, 13, 20]. In the sparse signal recovery problem, one is mainly interested in the signal recovery accuracy, i.e., the distance between the estimation β̂ and the original signal or the true solution β*. If the design matrix X is considered as a feature matrix, i.e., each column is a feature vector, and the observation y as a target object vector, then the sparse signal recovery problem is equivalent to feature selection (or model selection). In feature selection, one is concerned with the feature selection accuracy. Typically, a group of features corresponding to the coefficient values in β̂ larger than a threshold form the supporting feature set. The difference between this set and the true supporting set (i.e., the set of features corresponding to nonzero coefficients in the original signal) measures the feature selection accuracy.\n\nTwo well-known algorithms for learning sparse signals include LASSO [15] and the Dantzig selector [7]:\n\nLASSO: min_β (1/2)‖Xβ − y‖₂² + λ‖β‖₁   (1)\n\nDantzig Selector: min_β ‖β‖₁ s.t. ‖X^T(Xβ − y)‖∞ ≤ λ   (2)\n\nStrong theoretical results concerning LASSO and the Dantzig selector have been established in the literature [4, 5, 7, 17, 20, 22].\n\n1.1 Contributions\n\nIn this paper, we propose a multi-stage procedure based on the Dantzig selector, which estimates the supporting feature set F0 and the signal β̂ iteratively. 
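Problem (2) above is a linear program once β is split into its positive and negative parts β = u − v with u, v ≥ 0: the l1 objective becomes Σ(u_j + v_j) and the l∞ constraint becomes 2m linear inequalities. The following is a minimal sketch of this reformulation (our illustration, not the authors' implementation), assuming numpy and scipy are available:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve min ||beta||_1 s.t. ||X^T (X beta - y)||_inf <= lam as an LP.

    beta is split as beta = u - v with u, v >= 0, so ||beta||_1 = sum(u + v).
    """
    n, m = X.shape
    B = X.T @ X                      # the correlation matrix A = X^T X
    c = X.T @ y
    # |B (u - v) - c| <= lam, written as two one-sided inequalities
    A_ub = np.block([[B, -B], [-B, B]])
    b_ub = np.concatenate([lam + c, lam - c])
    res = linprog(np.ones(2 * m), A_ub=A_ub, b_ub=b_ub,
                  bounds=(0, None), method="highs")
    u, v = res.x[:m], res.x[m:]
    return u - v
```

Any LP solver could be substituted here; the point is only that (2) needs no specialized machinery.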
The intuition behind the proposed multi-stage method is that feature selection and signal recovery are tightly correlated and they can benefit from each other: a more accurate estimation of the supporting features can lead to a better signal recovery, and a more accurate signal recovery can help identify a better set of supporting features.\n\nIn the proposed method, the supporting set F0 starts from an empty set and its size increases by one after each iteration. At each iteration, we employ the basic framework of the Dantzig selector and the information about the current supporting feature set F0 to estimate the new signal β̂. In addition, we select the supporting feature candidates in F0 among all features in the data at each iteration, thus allowing incorrect features from the previous supporting feature set to be removed.\n\nThe main contributions of this paper lie in the theoretical analysis of the proposed method. Specifically, we show: 1) the proposed method can improve the estimation bound of the standard Dantzig selector approximately from Cs^{1/p}√(log m) σ to C(s − N)^{1/p}√(log m) σ, where the value N depends on the number of large entries in β*; 2) when N = s, the proposed algorithm can achieve the oracle solution with a high probability; 3) with a high probability, the proposed method can select the same number of correct features under a milder condition than the standard Dantzig selector method. The numerical experiments validate these theoretical results.\n\n1.2 Related Work\n\nSparse signal recovery without the observation noise was studied in [6]. It has been shown that under certain irrepresentable conditions, the 0-support of the LASSO solution is consistent with the true solution. 
It was shown that when the absolute value of each element in the true solution is large enough, a weaker condition (coherence property) can guarantee the feature selection accuracy [5]. The prediction bound of LASSO, i.e., ‖X(β̂ − β*)‖₂, was also presented. A comprehensive analysis for LASSO, including the recovery accuracy in an arbitrary l_p norm (p ≥ 1), was presented in [20]. In [7], the Dantzig selector was proposed for sparse signal recovery and a bound of recovery accuracy with the same order as LASSO was presented. An approximate equivalence between the LASSO estimator and the Dantzig selector was shown in [1]. In [11], the l∞ convergence rate was studied simultaneously for LASSO and Dantzig estimators in a high-dimensional linear regression model under a mutual coherence assumption. In [9], conditions on the design matrix X under which the LASSO and Dantzig selector coefficient estimates are identical for certain tuning parameters were provided.\n\nMany heuristic methods have been proposed in the past, including greedy least squares regression [16, 8, 19, 21, 3], two-stage LASSO [20], multiple thresholding procedures [23], and adaptive LASSO [24]. They have been shown to outperform the standard convex methods in many practical applications. It was shown [16] that under an irrepresentable condition the solution of the greedy least squares regression algorithm (also named OMP or the forward greedy algorithm) guarantees the feature selection consistency in the noiseless case. The results in [16] were extended to the noisy case [19]. Very recently, the results were further improved in [21] by considering arbitrary loss functions (not necessarily quadratic). In [3], the consistency of OMP was shown under the mutual incoherence conditions. A multiple thresholding procedure was proposed to refine the solution of LASSO or the Dantzig selector [23]. 
An adaptive forward-backward greedy algorithm was proposed [18], and it was shown that under the restricted isometry condition the feature selection consistency is achieved if the minimal nonzero entry in the true solution is larger than O(σ√(log m)). The adaptive LASSO was proposed to adaptively tune the weight value for the L1 penalty, and it was shown to enjoy the oracle properties [24].\n\n1.3 Definitions, Notations, and Basic Assumptions\n\nWe use X ∈ R^{n×m} to denote the design matrix and focus on the case m ≫ n, i.e., the signal dimension is much larger than the observation dimension. The correlation matrix A is defined as A = X^T X with respect to the design matrix. The noise vector ε follows the multivariate normal distribution ε ∼ N(0, σ²I). The observation vector y ∈ R^n satisfies y = Xβ* + ε, where β* denotes the original signal (or true solution). β̂ is used to denote the solution of the proposed algorithm. The α-supporting set (α ≥ 0) for a vector β is defined as\n\nsupp_α(β) = {j : |β_j| > α}.\n\nThe "supporting" set of a vector refers to the 0-supporting set. F denotes the supporting set of the original signal β*. For any index set S, |S| denotes the size of the set and S̄ denotes the complement of S in {1, 2, 3, ..., m}. In this paper, s is used to denote the size of the supporting set F, i.e., s = |F|. We use β_S to denote the subvector of β consisting of the entries of β in the index set S. The l_p norm of a vector v is computed by ‖v‖_p = (Σ_i |v_i|^p)^{1/p}, where v_i denotes the ith entry of v. The oracle solution β̄ is defined as β̄_F = (X^T_F X_F)^{-1} X^T_F y and β̄_{F̄} = 0. 
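The oracle solution just defined is an ordinary least-squares fit restricted to the true support F. A small sketch (our illustration, assuming numpy):

```python
import numpy as np

def oracle_solution(X, y, F):
    """Oracle estimator: beta_F = (X_F^T X_F)^{-1} X_F^T y, zero off F."""
    beta = np.zeros(X.shape[1])
    # least-squares solve on the columns indexed by F
    beta[F] = np.linalg.lstsq(X[:, F], y, rcond=None)[0]
    return beta
```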
We employ the following notation to measure some properties of a PSD matrix M ∈ R^{K×K} [20]:\n\nμ^(p)_{M,k} = inf_{u∈R^k, |I|=k} ‖M_{I,I} u‖_p / ‖u‖_p,   ρ^(p)_{M,k} = sup_{u∈R^k, |I|=k} ‖M_{I,I} u‖_p / ‖u‖_p,   (3)\n\nθ^(p)_{M,k,l} = sup_{u∈R^l, |I|=k, |J|=l, I∩J=∅} ‖M_{I,J} u‖_p / ‖u‖_p,   (4)\n\nwhere p ∈ [1, ∞], I and J are disjoint subsets of {1, 2, ..., K}, and M_{I,J} ∈ R^{|I|×|J|} is a submatrix of M with rows from the index set I and columns from the index set J. Additionally, we use the following notation to denote two probabilities:\n\nη′1 = η1(π log((m − s)/η1))^{-1/2},   η′2 = η2(π log(s/η2))^{-1/2},   (5)\n\nwhere η1 and η2 are two factors between 0 and 1. In this paper, if we say "large", "larger" or "the largest", it means that the absolute value is large, larger or the largest. For simpler notation in the computation of sets, we sometimes use "S1 + S2" to indicate the union of two sets S1 and S2, and use "S1 − S2" to indicate the removal of the intersection of S1 and S2 from the first set S1. In this paper, the following assumption is always admitted.\n\nAssumption 1. We assume that s = |supp_0(β*)| < n, the number of features is much larger than the number of observations (i.e., m ≫ n), each column vector is normalized as X^T_i X_i = 1, where X_i indicates the ith column (or feature) of X, and the noise vector ε follows the Gaussian distribution N(0, σ²I).\n\nIn the literature, it is often assumed that X^T_i X_i = n, which is essentially identical to our assumption. However, this may lead to a slight difference of a factor √n in some conclusions. 
We have automatically transformed conclusions from related work according to our assumption when citing them in our paper.\n\n1.4 Organization\n\nThe rest of the paper is organized as follows. We present our multi-stage algorithm in Section 2. The main theoretical results are summarized in Section 3, with detailed proofs given in the supplemental material. The numerical simulation is reported in Section 4. Finally, we conclude the paper in Section 5.\n\n2 The Multi-Stage Dantzig Selector Algorithm\n\nIn this section, we introduce the multi-stage Dantzig selector algorithm. In the proposed method, we update the support set F0 and the estimation β̂ iteratively; the supporting set F0 starts from an empty set and its size increases by one after each iteration. At each iteration, we employ the basic framework of the Dantzig selector and the information about the current supporting set F0 to estimate the new signal β̂ by solving the following linear program:\n\nmin ‖β_{F̄0}‖₁\ns.t. ‖X^T_{F̄0}(Xβ − y)‖∞ ≤ λ\n    ‖X^T_{F0}(Xβ − y)‖∞ = 0.   (6)\n\nSince the features in F0 are considered as the supporting candidates, it is natural to enforce them to be orthogonal to the residual vector Xβ − y, i.e., one should make use of them for reconstructing the observation y. This is the rationale behind the constraint ‖X^T_{F0}(Xβ − y)‖∞ = 0. Another advantage is that when all correct features are chosen, the proposed algorithm can be shown to converge to the oracle solution. 
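Problem (6), like problem (2), is a linear program, and the iterative refinement described above simply re-solves it with a growing candidate set. A sketch of both pieces (our illustration, not the authors' code; function names are ours), assuming numpy and scipy:

```python
import numpy as np
from scipy.optimize import linprog

def extended_dantzig(X, y, lam, F0):
    """Solve problem (6): min ||beta_{~F0}||_1 subject to
    ||X_{~F0}^T (X beta - y)||_inf <= lam and X_{F0}^T (X beta - y) = 0."""
    n, m = X.shape
    F0 = np.asarray(F0, dtype=int)
    rest = np.setdiff1d(np.arange(m), F0)    # complement of F0
    B, c = X.T @ X, X.T @ y
    cost = np.zeros(2 * m)                   # beta = u - v with u, v >= 0
    cost[rest] = 1.0                         # l1 norm only over ~F0
    cost[m + rest] = 1.0
    Br = B[rest]
    A_ub = np.block([[Br, -Br], [-Br, Br]])  # two-sided inf-norm constraint
    b_ub = np.concatenate([lam + c[rest], lam - c[rest]])
    A_eq = np.hstack([B[F0], -B[F0]]) if F0.size else None
    b_eq = c[F0] if F0.size else None        # zero correlation on F0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.x[:m] - res.x[m:]

def multi_stage_dantzig(X, y, lam, N):
    """Grow the candidate support by one index per stage."""
    F0 = np.array([], dtype=int)                 # starts from the empty set
    for i in range(N + 1):                       # i = 0, 1, ..., N
        beta = extended_dantzig(X, y, lam, F0)   # beta^(i)
        F0 = np.argsort(-np.abs(beta))[:i + 1]   # i+1 largest entries
    return beta, F0
```

With N = 0 the loop runs once with an empty F0 and reduces to the standard Dantzig selector.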
The detailed procedure is formally described in Algorithm 1 below. Apparently, when F^(0)_0 = ∅ and N = 0, the proposed method is identical to the standard Dantzig selector.\n\nAlgorithm 1 Multi-Stage Dantzig Selector\nRequire: F^(0)_0, λ, N, X, y\nEnsure: β̂^(N), F^(N)_0\n1: for i = 0; i ≤ N; i++ do\n2:    Obtain β̂^(i) by solving the problem (6) with F0 = F^(i)_0.\n3:    Form F^(i+1)_0 as the index set of the i + 1 largest elements of β̂^(i).\n4: end for\n\n3 Main Results\n\n3.1 Motivation\n\nTo motivate the proposed multi-stage algorithm, we first consider a simple case where some knowledge about the supporting features is known in advance. In the standard Dantzig selector, we assume F0 = ∅. If we assume that the features belonging to a set F0 are known as supporting features, i.e., F0 ⊂ F, we have the following result:\n\nTheorem 1. Assume that Assumption 1 holds. Take F0 ⊂ F and λ = σ√(2 log((m − s)/η1)) in the optimization problem (6). 
If there exists some l such that\n\nμ^(p)_{A,s+l} − θ^(p)_{A,s+l,l} (|F̄0 − F̄|/l)^{1−1/p} > 0\n\nholds, then with a probability larger than 1 − η′1, the l_p-norm (1 ≤ p ≤ ∞) of the difference between β̂, the solution of the problem (6), and the oracle solution β̄ is bounded as\n\n‖β̂ − β̄‖_p ≤ [(|F̄0 − F̄| + l·2^p)^{1/p} (1 + (|F̄0 − F̄|/l)^{p−1})^{1/p} / (μ^(p)_{A,s+l} − θ^(p)_{A,s+l,l}(|F̄0 − F̄|/l)^{1−1/p})] σ√(2 log((m − s)/η1)),   (7)\n\nand with a probability larger than 1 − η′1 − η′2, the l_p-norm (1 ≤ p ≤ ∞) of the difference between β̂, the solution of the problem (6), and the true solution β* is bounded as\n\n‖β̂ − β*‖_p ≤ [(|F̄0 − F̄| + l·2^p)^{1/p} (1 + (|F̄0 − F̄|/l)^{p−1})^{1/p} / (μ^(p)_{A,s+l} − θ^(p)_{A,s+l,l}(|F̄0 − F̄|/l)^{1−1/p})] σ√(2 log((m − s)/η1)) + (s^{1/p} / μ^(p)_{(X^T_F X_F)^{1/2},s}) σ√(2 log(s/η2)).   (8)\n\nIt is clear that both bounds (for any 1 ≤ p ≤ ∞) are monotonically increasing with respect to the value of |F̄0 − F̄|. In other words, the larger F0 is, the lower these bounds are. This coincides with our motivation that more knowledge about the supporting features can lead to a better signal estimation. 
Most related works directly estimate the bound of ‖β̂ − β*‖_p. Since β* may not be a feasible solution of problem (6), it is not easy to directly estimate the distance between β̂ and β*. The bound in the inequality (8) consists of two terms. Since m ≫ n ≥ s, we have √(2 log((m − s)/η1)) ≫ √(2 log(s/η2)) if η1 ≈ η2. When p = 2, the following holds:\n\nμ^(2)_{A,s+l} − θ^(2)_{A,s+l,l}(|F̄0 − F̄|/l)^{1−1/2} ≤ μ^(2)_{(X^T_F X_F)^{1/2},s},\n\nsince μ^(2)_{A,s+l} ≤ μ^(2)_{A,s} ≤ μ^(2)_{X^T_F X_F,s} ≤ μ^(2)_{(X^T_F X_F)^{1/2},s}.\n\nFrom the analysis in the next section, we can see that the first term is the upper bound of the distance from the optimizer to the oracle solution ‖β̂ − β̄‖_p and the second term is the upper bound of the distance from the oracle solution to the true solution ‖β̄ − β*‖_p. Thus, the first term might be much larger than the second term.\n\n3.2 Comparison with Dantzig Selector\n\nWe first compare our estimation bound with the one in [7] for p = 2. For convenience of comparison, we rewrite the theorem in [7] equivalently as:\n\nTheorem 2. 
Suppose β* ∈ R^m is any s-sparse vector of parameters obeying δ_{2s} + θ^(2)_{A,s,2s} < 1. Setting λ_p = σ√(2 log(m/η)) (0 < η ≤ 1), with a probability at least 1 − η(π log m)^{−1/2}, the solution of the standard Dantzig selector β̂_D obeys\n\n‖β̂_D − β*‖₂ ≤ [4/(1 − δ_{2s} − θ^(2)_{A,s,2s})] s^{1/2} σ√(2 log(m/η)),   (9)\n\nwhere δ_{2s} = max(ρ^(2)_{A,2s} − 1, 1 − μ^(2)_{A,2s}).\n\nTheorem 1 also implies a bound estimation result for the Dantzig selector by letting F0 = ∅ and p = 2. Specifically, we set F0 = ∅, N = 0, and λ = σ√(2 log((m − s)/η1)) in the multi-stage method, and set p = 2, l = s, η1 = ((m − s)/m)η, and η2 = (s/m)η for a convenient comparison with Theorem 2. It follows that with probability larger than 1 − η(π log m)^{−1/2}, the following bound holds:\n\n‖β̂ − β*‖₂ ≤ [10/(μ^(2)_{A,2s} − θ^(2)_{A,2s,s}) + 1/μ^(2)_{(X^T_F X_F)^{1/2},s}] s^{1/2} σ√(2 log(m/η)).   (10)\n\nIt is easy to verify that\n\n1 − δ_{2s} − θ^(2)_{A,s,2s} ≤ μ^(2)_{A,2s} − θ^(2)_{A,2s,s} ≤ μ^(2)_{A,2s} ≤ μ^(2)_{(X^T_F X_F),s} = (μ^(2)_{(X^T_F X_F)^{1/2},s})² ≤ μ^(2)_{(X^T_F X_F)^{1/2},s} ≤ 1.\n\nThus, the bound in (10) is comparable to the one in (9). In the following, we compare the performance bound of the proposed multi-stage method (N > 0) with the one in (10).\n\n3.3 Feature Selection\n\nThe estimation bounds in Theorem 1 assume that a set F0 is given. In this section, we show how the supporting set can be estimated. 
Similar to previous work [5, 19], |β*_j| for j ∈ F is required to be larger than a threshold value. As is clear from the proof in the next section, the threshold value mainly depends on the value of ‖β̂ − β*‖∞. We essentially employ the result with p = ∞ in Theorem 1 to estimate the threshold value. In the following, we first consider the simple case when N = 0. We have shown in the last section that the estimation bound in this case is similar to the one for the Dantzig selector.\n\nTheorem 3. Under Assumption 1, if there exists an index set J such that |β*_j| > α0 for any j ∈ J and there exists a nonempty set\n\nΩ = {l | μ^(∞)_{A,s+l} − θ^(∞)_{A,s+l,l}(s/l) > 0},\n\nwhere\n\nα0 = 4 min_{l∈Ω} [max(1, s/l) / (μ^(∞)_{A,s+l} − θ^(∞)_{A,s+l,l}(s/l))] σ√(2 log((m − s)/η1)) + (1/μ^(∞)_{(X^T_F X_F)^{1/2},s}) σ√(2 log(s/η2)),\n\nthen taking F0 = ∅, N = 0, and λ = σ√(2 log((m − s)/η1)) into the problem (6) (equivalent to the Dantzig selector), the largest |J| elements of β̂_std (or β̂^(0)) belong to F with probability larger than 1 − η′1 − η′2.\n\nThe theorem above indicates that under the given condition, if min_{j∈J} |β*_j| > O(σ√(log m)) (assuming that there exists l ≥ s such that μ^(∞)_{A,s+l} − θ^(∞)_{A,s+l,l}(s/l) > 0), then with high probability the |J| features selected by the Dantzig selector belong to the true supporting set. In particular, if |J| = s, then the consistency of feature selection is 
achieved. The result above is comparable to the ones for other feature selection algorithms, including LASSO [5, 22], greedy least squares regression [16, 8, 19], two-stage LASSO [20], and the adaptive forward-backward greedy algorithm [18]. In all these algorithms, the condition min_{j∈F} |β*_j| ≥ Cσ√(log m) is required, since the noise level is O(σ√(log m)) [18]. Because C is always a coefficient in terms of the covariance matrix XX^T (or the feature matrix X), it is typically treated as a constant term; see the literature listed above.\n\nNext, we show that the condition |β*_j| > α0 in Theorem 3 can be relaxed by the proposed multi-stage procedure with N > 0, as summarized in the following theorem:\n\nTheorem 4. Under Assumption 1, if there exists a nonempty set\n\nΩ = {l | μ^(∞)_{A,s+l} − θ^(∞)_{A,s+l,l}(s/l) > 0}\n\nand there exists a set J such that |supp_{αi}(β*_J)| > i holds for all i ∈ {0, 1, ..., |J| − 1}, where\n\nαi = 4 min_{l∈Ω} [max(1, (s − i)/l) / (μ^(∞)_{A,s+l} − θ^(∞)_{A,s+l,l}((s − i)/l))] σ√(2 log((m − s)/η1)) + (1/μ^(∞)_{(X^T_F X_F)^{1/2},s}) σ√(2 log(s/η2)),\n\nthen taking F^(0)_0 = ∅, λ = σ√(2 log((m − s)/η1)), and N = |J| − 1 into Algorithm 1, the solution after N iterations satisfies F^(N)_0 ⊂ F (i.e., |J| correct features are selected) with probability larger than 1 − η′1 − η′2.\n\nAssume that one aims to select N correct features by the standard Dantzig selector and the multi-stage method. 
These two theorems show that the standard Dantzig selector requires that at least N of the |β*_j|'s with j ∈ F are larger than the threshold value α0, while the proposed multi-stage method requires that at least i of the |β*_j|'s are larger than the threshold value α_{i−1}, for i = 1, ..., N. Since {α_j} is a strictly decreasing sequence satisfying, for some l ∈ Ω,\n\nα_{i−1} − α_i > [4θ^(∞)_{A,s+l,l} / (l (μ^(∞)_{A,s+l} − θ^(∞)_{A,s+l,l}((s − i)/l))²)] σ√(2 log((m − s)/η1)),\n\nthe proposed multi-stage method requires a strictly weaker condition for selecting N correct features than the standard Dantzig selector.\n\n3.4 Signal Recovery\n\nIn this section, we derive the estimation bound of the proposed multi-stage method by combining results from Theorems 1, 3, and 4.\n\nTheorem 5. 
Under Assumption 1, if there exists l such that\n\nμ^(∞)_{A,s+l} − θ^(∞)_{A,s+l,l}(s/l) > 0 and μ^(p)_{A,2s} − θ^(p)_{A,2s,s} > 0,\n\nand there exists a set J such that |supp_{αi}(β*_J)| > i holds for all i ∈ {0, 1, ..., |J| − 1}, where the αi's are defined in Theorem 4, then\n\n(1) taking F0 = ∅, N = 0, and λ = σ√(2 log((m − s)/η1)) into Algorithm 1, with probability larger than 1 − η′1 − η′2, the solution of the Dantzig selector β̂_D (i.e., β̂^(0)) obeys\n\n‖β̂_D − β*‖_p ≤ [(2^{p+1} + 2)^{1/p} s^{1/p} / (μ^(p)_{A,2s} − θ^(p)_{A,2s,s})] σ√(2 log((m − s)/η1)) + (s^{1/p} / μ^(p)_{(X^T_F X_F)^{1/2},s}) σ√(2 log(s/η2));   (11)\n\n(2) taking F0 = ∅, N = |J|, and λ = σ√(2 log((m − s)/η1)) into Algorithm 1, with probability larger than 1 − η′1 − η′2, the solution of the multi-stage method β̂_mul (i.e., β̂^(N)) obeys\n\n‖β̂_mul − β*‖_p ≤ [(2^{p+1} + 2)^{1/p} (s − N)^{1/p} / (μ^(p)_{A,2s−N} − θ^(p)_{A,2s−N,s−N})] σ√(2 log((m − s)/η1)) + (s^{1/p} / μ^(p)_{(X^T_F X_F)^{1/2},s}) σ√(2 log(s/η2)).   (12)\n\nSimilar to the analysis in Theorem 1, the first term (i.e., the distance from β̂ to the oracle solution β̄) dominates in the estimated bounds. Thus, the performance of the multi-stage method approximately improves upon that of the standard Dantzig selector from Cs^{1/p}√(log m) σ to C(s − N)^{1/p}√(log m) σ. 
When p = 2, our estimation has the same order as the greedy least squares regression algorithm [19] and the adaptive forward-backward greedy algorithm [18].\n\n3.5 The Oracle Solution\n\nThe oracle solution is the minimum-variance unbiased estimator of the true solution given the noisy observation. We show in the following theorem that the proposed method can obtain the oracle solution with high probability under certain conditions:\n\nTheorem 6. Under Assumption 1, if there exists l such that μ^(∞)_{A,s+l} − θ^(∞)_{A,s+l,l}((s − i)/l) > 0 and the supporting set F of β* satisfies |supp_{αi}(β*_F)| > i for all i ∈ {0, 1, ..., s − 1}, where the αi's are defined in Theorem 4, then taking F0 = ∅, N = s, and λ = σ√(2 log((m − s)/η1)) into Algorithm 1, the oracle solution can be achieved, i.e., F^(N)_0 = F and β̂^(N) = β̄, with probability larger than 1 − η′1 − η′2.\n\nThe theorem above shows that when the nonzero elements of the true coefficient vector β* are large enough, the oracle solution can be achieved with high probability.\n\n4 Simulation Study\n\nWe have performed simulation studies to verify our theoretical analysis. Our comparison includes two aspects: signal recovery accuracy and feature selection accuracy. The signal recovery accuracy is measured by the relative signal error: SRA = ‖β̂ − β*‖₂/‖β*‖₂, where β̂ is the solution of a specific algorithm. The feature selection accuracy is measured by the percentage of correct features selected: FSA = |F̂ ∩ F|/|F|, where F̂ is the estimated feature candidate set.\n\nWe generate an n × m random matrix X. 
Each element of X follows an independent standard Gaussian distribution N(0, 1). We then normalize the length of the columns of X to be 1. The s-sparse original signal β* is generated with s nonzero elements independently uniformly distributed from [−10, 10]. We form y by y = Xβ* + ε, where the noise vector ε is generated by the Gaussian distribution N(0, σ²I). For a fair comparison, we choose the same λ = σ√(2 log m) in both algorithms. The following experiments are repeated 20 times and we report their average performance.\n\nFigure 1: Numerical simulation. We compare the solutions of the standard Dantzig selector method (N = 0), the proposed method for different values of N, and the oracle solution. The SRA and FSA comparisons are reported on the top row and the bottom row, respectively. The starting point of each curve records the SRA (or FSA) value of the standard Dantzig selector method; the ending point records the value of the oracle solution; the middle part of each curve records the results by the proposed method for different values of N.\n\nWe run the proposed algorithm with F^(0)_0 = ∅ and output the β̂^(N)'s. Note that the solution of the standard Dantzig selector algorithm is equivalent to β̂^(0) with N = 0. We report the SRA curve of β̂^(N) with respect to N in the top row of Figure 1. Based on β̂^(N), we compute the supporting set F̂^(N) as the index of the N largest entries in β̂^(N). Note that the supporting set we compute here is different from the supporting set F̂^(N)_0, which only contains the N largest feature indexes. The bottom row of Figure 1 shows the FSA curve with respect to N. 
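The data generation and the two accuracy metrics described above can be reproduced in a few lines. A sketch (our illustration; the function names are ours), assuming numpy:

```python
import numpy as np

def simulate(n=50, m=200, s=15, sigma=0.1, seed=0):
    """One trial of the synthetic setup: unit-norm Gaussian columns,
    an s-sparse signal with entries uniform on [-10, 10], Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, m))
    X /= np.linalg.norm(X, axis=0)            # normalize column lengths to 1
    beta = np.zeros(m)
    support = rng.choice(m, size=s, replace=False)
    beta[support] = rng.uniform(-10, 10, size=s)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta, set(support.tolist())

def sra(beta_hat, beta):
    """Relative signal error: ||beta_hat - beta||_2 / ||beta||_2."""
    return np.linalg.norm(beta_hat - beta) / np.linalg.norm(beta)

def fsa(F_hat, F):
    """Fraction of correct features selected: |F_hat & F| / |F|."""
    return len(set(F_hat) & set(F)) / len(F)
```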
We can observe from Figure 1 that: 1) the multi-stage method obtains a solution with a smaller distance to the original signal than the standard Dantzig selector method; 2) the multi-stage method selects a larger percentage of correct features than the standard Dantzig selector method; 3) the multi-stage method can achieve the oracle solution. Overall, the SRA curve decreases and the FSA curve increases with an increasing value of N.

5 Conclusion

In this paper, we propose a multi-stage Dantzig selector method which iteratively selects the supporting features and recovers the original signal. The proposed method uses the information of the supporting features to estimate the signal and simultaneously uses the information of the estimated signal to select the supporting features. Our theoretical analysis shows that the proposed method improves upon the standard Dantzig selector in both signal recovery and supporting feature selection. The numerical simulations validate our theoretical analysis.

Since the multi-stage procedure can improve the Dantzig selector, a natural question is whether the analysis can be extended to other related techniques such as the LASSO. The two-stage LASSO has been shown to outperform the standard LASSO. We plan to extend our analysis to the multi-stage LASSO in the future.
In addition, we plan to improve the proposed algorithm by adopting stopping rules similar to the ones recently proposed in [3, 19, 21].

Acknowledgments

This work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, IIS-0953662, and NGA HM1582-08-1-0016.

[Figure 1 panels: SRA (top row) and FSA (bottom row) versus N for the settings n=50, m=200, s=15 and n=50, m=500, s=10, each with σ=0.001 and σ=0.1; each panel shows the standard Dantzig selector, the oracle solution, and the multi-stage method.]

References

[1] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37:1705–1732, 2009.

[2] F. Bunea, A. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169–194, 2007.

[3] T. Cai and L. Wang. Orthogonal matching pursuit for sparse signal recovery. Technical report, 2010.

[4] T. Cai, G. Xu, and J. Zhang. On recovery of sparse signals via l1 minimization. IEEE Transactions on Information Theory, 55(7):3388–3397, 2009.

[5] E. J. Candes and Y. Plan. Near-ideal model selection by l1 minimization. Annals of Statistics, 37:2145–2177, 2009.

[6] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[7] E. J. Candes and T. Tao.
The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.

[8] D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

[9] G. M. James, P. Radchenko, and J. Lv. DASSO: connections between the Dantzig selector and Lasso. Journal of the Royal Statistical Society, Series B, 71(1):127–142, 2009.

[10] V. Koltchinskii and M. Yuan. Sparse recovery in large ensembles of kernel machines. COLT, pages 229–238, 2008.

[11] K. Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90–102, 2008.

[12] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.

[13] P. Ravikumar, G. Raskutti, M. J. Wainwright, and B. Yu. Model selection in Gaussian graphical models: High-dimensional consistency of l1-regularized MLE. NIPS, pages 1329–1336, 2008.

[14] J. Romberg. The Dantzig selector and generalized thresholding. CISS, pages 22–25, 2008.

[15] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[16] J. A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50:2231–2242, 2004.

[17] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

[18] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. NIPS, pages 1921–1928, 2008.

[19] T. Zhang.
On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555–568, 2009.

[20] T. Zhang. Some sharp performance bounds for least squares regression with l1 regularization. Annals of Statistics, 37:2109–2144, 2009.

[21] T. Zhang. Sparse recovery with orthogonal matching pursuit under RIP. arXiv:1005.2249, 2010.

[22] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

[23] S. Zhou. Thresholding procedures for high dimensional variable selection and statistical estimation. NIPS, pages 2304–2312, 2009.

[24] H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.