{"title": "Nonconvex Penalization Using Laplace Exponents and Concave Conjugates", "book": "Advances in Neural Information Processing Systems", "page_first": 602, "page_last": 610, "abstract": null, "full_text": "Nonconvex Penalization Using Laplace Exponents\n\nand Concave Conjugates\n\nZhihua Zhang and Bojun Tu\n\nCollege of Computer Science & Technology\n\nZhejiang University\n\nHangzhou, China 310027\n\n{zhzhang, tubojun}@zju.edu.cn\n\nAbstract\n\nIn this paper we study sparsity-inducing nonconvex penalty functions using L\u00b4evy\nprocesses. We de\ufb01ne such a penalty as the Laplace exponent of a subordina-\ntor. Accordingly, we propose a novel approach for the construction of sparsity-\ninducing nonconvex penalties. Particularly, we show that the nonconvex logarith-\nmic (LOG) and exponential (EXP) penalty functions are the Laplace exponents\nof Gamma and compound Poisson subordinators, respectively. Additionally, we\nexplore the concave conjugate of nonconvex penalties. We \ufb01nd that the LOG and\nEXP penalties are the concave conjugates of negative Kullback-Leiber (KL) dis-\ntance functions. Furthermore, the relationship between these two penalties is due\nto asymmetricity of the KL distance.\n\n1 Introduction\n\nVariable selection plays a fundamental role in statistical modeling for high-dimensional data sets,\nespecially when the underlying model has a sparse representation. The approach based on penalty\ntheory has been widely used for variable selection in the literature. A principled approach is to\ndue the lasso of [17], which uses the \u21131-norm penalty. 
Recently, some nonconvex alternatives, such as the bridge penalty, the nonconvex exponential penalty (EXP) [3, 8], the logarithmic penalty (LOG) [19, 13], the smoothly clipped absolute deviation (SCAD) penalty [6], and the minimax concave penalty (MCP) [20], have been demonstrated to have attractive properties both theoretically and practically.
There has also been work on nonconvex penalties within a Bayesian framework. Zou and Li [23] derived their local linear approximation (LLA) algorithm by combining the EM algorithm with an inverse Laplace transformation. In particular, they showed that the bridge penalty can be obtained by mixing the Laplace distribution with a stable distribution. However, Zou and Li [23] proved that neither MCP nor SCAD can be cast into this framework. Other authors have shown that the prior induced from the LOG penalty has an interpretation as a scale mixture of Laplace distributions with an inverse gamma density [5, 9, 12, 2]. Recently, Zhang et al. [22] extended this class of Laplace variance mixtures by using a generalized inverse Gaussian density. Additionally, Griffin and Brown [11] devised a family of normal-exponential-gamma priors.
Our work is motivated by recent developments of Bayesian nonparametric methods in feature selection [10, 18, 4, 15]. In particular, Polson and Scott [15] proposed a nonparametric approach for normal variance mixtures using Lévy processes, which embeds finite-dimensional normal variance mixtures in infinite ones. We develop a Bayesian nonparametric approach for the construction of sparsity-inducing nonconvex penalties. Specifically, we show that Laplace transformations of Lévy processes can be viewed as pseudo-priors and the corresponding Laplace exponents then form sparsity-inducing nonconvex penalties.
Moreover, we show that the LOG and EXP penalties can be respectively regarded as the Laplace exponents of Gamma and compound Poisson subordinators. In addition, we show that both LOG and EXP can be constructed via the Kullback-Leibler distance. This construction reveals an inherent connection between LOG and EXP. Moreover, it provides us with an approach for adaptively updating tuning hyperparameters, which is a very important computational issue in nonconvex sparse penalization. Typically, the multi-stage LLA and SparseNet algorithms with nonconvex penalties [21, 13] implement a two-dimensional grid search, so they incur higher computational costs. However, we do not claim that our method will always be optimal in generalization performance.

2 Lévy Processes for Nonconvex Penalty Functions

Suppose we are given a set of training data {(xi, yi) : i = 1, . . . , n}, where the xi ∈ Rp are the input vectors and the yi are the corresponding outputs. Moreover, we assume that ∑_{i=1}^{n} xi = 0 and ∑_{i=1}^{n} yi = 0. We now consider the following linear regression model:

y = Xb + ε,

where y = (y1, . . . , yn)T is the n×1 output vector, X = [x1, . . . , xn]T is the n×p input matrix, and ε is a Gaussian error vector N(ε|0, σIn). We aim to find a sparse estimate of the regression vector b = (b1, . . . , bp)T under the MAP framework. We particularly study the use of Laplace variance mixtures in sparsity modeling.
For this purpose, we define a hierarchical model:

[bj|ηj, σ] ind∼ L(bj|0, σ(2ηj)^{−1}),   [ηj] iid∼ p(ηj),   p(σ) = "Constant",

where the ηj are known as the local shrinkage parameters and L(b|u, η) denotes a Laplace distribution with density

L(b|u, η) = (1/(4η)) exp(−|b − u|/(2η)).

The classical regularization framework is based on a penalty function induced from the marginal prior p(bj|σ). Let

ψ(|b|) = − log p(b|σ),

where p(b|σ) = ∫_0^∞ L(b|0, ση^{−1}) p(η) dη. Then the penalized regression problem is

min_b { F(b) ≜ (1/2)‖y − Xb‖₂² + λ ∑_{j=1}^{p} ψ(|bj|) }.

By some direct calculations, we can obtain that dψ(|b|)/d|b| > 0 and d²ψ(|b|)/d|b|² < 0. This implies that ψ(|b|) is nondecreasing and concave in |b|. In other words, ψ(|b|) forms a class of nonconvex penalty functions of b.
Motivated by the use of Bayesian nonparametrics in sparsity modeling, we now explore Laplace scale mixtures by relating η to a subordinator. We thus have a Bayesian nonparametric formulation for the construction of joint priors of the bj's.

2.1 Subordinators and Laplace Exponents

Before we go into the presentation, we give some notions and lemmas that will be used later. Let f ∈ C∞(0,∞) with f ≥ 0. We say f is completely monotone if (−1)^n f^(n) ≥ 0 for all n ∈ N, and a Bernstein function if (−1)^n f^(n) ≤ 0 for all n ∈ N.
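As a quick numerical illustration (ours, not part of the paper), the sign conditions above can be checked by finite differences for f(s) = log(αs + 1)/γ, a function that will reappear in Section 3.1 as a Laplace exponent; the parameter values below are arbitrary choices:

```python
import math

# Numerically check the Bernstein sign pattern (-1)^n f^(n)(s) <= 0 for
# n >= 1, for the candidate f(s) = log(alpha*s + 1)/gamma.
# alpha, gamma, h are illustrative values of our own choosing.
alpha, gamma, h = 2.0, 1.0, 1e-3

def f(s):
    return math.log(alpha * s + 1.0) / gamma

def num_deriv(f, s, n, h):
    # n-th derivative via the central finite-difference stencil
    return sum((-1.0) ** k * math.comb(n, k) * f(s + (n / 2.0 - k) * h)
               for k in range(n + 1)) / h ** n

s = 1.0
signs = [(-1.0) ** n * num_deriv(f, s, n, h) for n in (1, 2, 3)]
print(all(v <= 0 for v in signs))  # True, consistent with f being Bernstein
```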
The following lemma will be useful.

Lemma 1 Let ν be a Lévy measure, i.e., such that ∫_0^∞ min(u, 1) ν(du) < ∞.

(1) f is a Bernstein function if and only if the mapping s ↦ exp(−tf(s)) is completely monotone for all t ≥ 0.

(2) f is a Bernstein function if and only if it has the representation

f(s) = α + βs + ∫_0^∞ [1 − exp(−su)] ν(du) for all s > 0,   (1)

where α, β ≥ 0.

Our work is based on the notion of subordinators. Roughly speaking, a subordinator is a one-dimensional Lévy process that is non-decreasing (a.s.) [16]. An important property of subordinators is given in the following lemma.

Lemma 2 If T = (T(t) : t ≥ 0) is a subordinator, then the Laplace transformation of its density takes the form

E(e^{−sT(t)}) = ∫_0^∞ e^{−sT(t)} p(T(t)) dT(t) = e^{−tψ(s)},

where

ψ(s) = βs + ∫_0^∞ [1 − e^{−su}] ν(du) for s > 0.   (2)

Here β ≥ 0 and ν is the Lévy measure defined in Lemma 1. Conversely, if ψ is an arbitrary mapping from (0,∞) → (0,∞) of the form (2), then e^{−tψ(s)} is the Laplace transformation of the density of a subordinator.

Lemmas 1 and 2 can be found in [1, 16]. The function ψ in (2) is usually called the Laplace exponent of the subordinator, and it satisfies ψ(0) = 0. Lemma 1 implies that the Laplace exponent ψ is a Bernstein function and the corresponding Laplace transformation exp(−tψ(s)) is completely monotone.
Recall that the Laplace exponent ψ(s) is nonnegative, nondecreasing and concave on (0,∞). Thus, if we let s = |b|, then ψ(|b|) defines a nonconvex penalty function of b on (−∞,∞).
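For a concrete sketch (our own, with arbitrary α, γ > 0), take ψ(s) = log(αs + 1)/γ, which Section 3.1 shows to be the Laplace exponent of a Gamma subordinator, and verify numerically that ψ(|b|) is not convex in b and has a kink at the origin:

```python
import math

# Our own illustration: the Laplace exponent psi(s) = log(alpha*s + 1)/gamma
# evaluated at s = |b| gives a nonconvex penalty on b.
alpha, gamma = 2.0, 1.0
psi = lambda s: math.log(alpha * s + 1.0) / gamma
pen = lambda b: psi(abs(b))

# Concave in |b|: the midpoint value exceeds the chord, so pen is
# not convex in b ...
lhs = pen(0.5 * (0.0 + 2.0))
rhs = 0.5 * (pen(0.0) + pen(2.0))
print(lhs > rhs)  # True

# ... and the right-hand slope at 0 is +alpha/gamma (by symmetry the
# left-hand slope is -alpha/gamma), so pen is kinked at the origin,
# which is what induces sparsity.
eps = 1e-8
right = (pen(eps) - pen(0.0)) / eps
print(abs(right - alpha / gamma) < 1e-4)  # True
```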
Moreover, such ψ(|b|) is nondifferentiable at the origin because ψ'(0+) > 0 and ψ'(0−) < 0. Thus, it is able to induce sparsity. In this regard, exp(−tψ(|b|)) forms a pseudo-prior for b.¹ Lemma 2 shows that the prior can be defined by a Laplace transformation. In summary, we have the following theorem.

Theorem 1 Let ψ(s) be a nonzero Bernstein function of s on (0,∞) with ψ(0) = 0. Then ψ(|b|) is a nondifferentiable and nonconvex function of b on (−∞,∞). Furthermore,

exp(−tψ(|b|)) = ∫_0^∞ exp(−|b|T(t)) p(T(t)) dT(t),   t ≥ 0,

where (T(t) : t ≥ 0) is some subordinator.

The subordinator T(t) plays the same role as the local shrinkage parameter η, which is also called a latent variable. Moreover, we will see that t plays the role of a tuning hyperparameter. Theorem 1 thus gives an explicit relationship between the local shrinkage parameter and the corresponding tuning hyperparameter; i.e., the former is a stochastic process in the latter. It is also worth noting that

exp(−tψ(|b|)) = 2 ∫_0^∞ L(b|0, (2T(t))^{−1}) T(t)^{−1} p(T(t)) dT(t).

Thus, if ∫_0^∞ T(t)^{−1} p(T(t)) dT(t) = 1/C < ∞, then p*(T(t)) ≜ C T(t)^{−1} p(T(t)) defines a new proper density for T(t). In this case, the proper prior C exp(−tψ(|b|)) is a Laplace scale mixture, i.e., the mixture of L(b|0, (2T(t))^{−1}) with p*(T(t)). If ∫_0^∞ T(t)^{−1} p(T(t)) dT(t) = ∞, then p*(T(t)) ≜ T(t)^{−1} p(T(t)) defines an improper density for T(t). Thus, the improper prior exp(−tψ(|b|)) is a mixture of L(b|0, (2T(t))^{−1}) with p*(T(t)).

¹ If ∫_0^∞ exp(−tψ(s)) ds is infinite, exp(−tψ(|b|)) is an improper density w.r.t. the Lebesgue measure. Otherwise, it forms a proper density.
In any case, we use the terminology pseudo-prior for exp(−tψ(|b|)).

2.2 The MAP Estimation

Based on the subordinator given in the previous subsection, we rewrite the hierarchical representation for the joint prior of the bj under the regression framework. That is,

[bj|ηj, σ] ind∼ L(bj|0, σ(2ηj)^{−1}),   p*(ηj) ∝ σηj^{−1} p(ηj),

which is equivalent to

[bj, ηj|σ] ind∝ exp(−(ηj/σ)|bj|) p(ηj).

Here T(tj) = ηj. The joint marginal pseudo-prior of the bj's is

p*(b|σ) = ∏_{j=1}^{p} ∫_0^∞ exp(−(ηj/σ)|bj|) p(ηj) dηj = ∏_{j=1}^{p} exp(−tj ψ(|bj|/σ)).

Thus, the MAP estimate of b is based on the following optimization problem:

min_b { (1/2)‖y − Xb‖₂² + σ ∑_{j=1}^{p} tj ψ(|bj|/σ) }.

Clearly, the tj's are tuning hyperparameters and the ηj's are latent variables. Moreover, it is interesting that ηj (= T(tj)) is defined as a subordinator w.r.t. tj.

3 Gamma and Compound Poisson Subordinators

In [15], the authors discussed the use of α-stable subordinators and inverted-beta subordinators. In this section we study applications of Gamma and compound Poisson subordinators in constructing nonconvex penalty functions. We establish an interesting connection of these two subordinators with the nonconvex logarithmic (LOG) and exponential (EXP) penalties.
Particularly, these two penalties are the Laplace exponents of the two subordinators, respectively.

3.1 The LOG Penalty and Gamma Subordinator

The LOG penalty function is defined by

ψ(|b|) = (1/γ) log(α|b| + 1),   α, γ > 0.   (3)

Clearly, ψ(|b|) is a Bernstein function of |b| on (0,∞). Thus, it is the Laplace exponent of a subordinator. In particular, we have the following theorem.

Theorem 2 Let ψ(s) be defined by (3) with s = |b|. Then,

(1/γ) log(αs + 1) = ∫_0^∞ [1 − exp(−su)] ν(du),

where the Lévy measure ν is

ν(du) = (1/(γu)) exp(−u/α) du.

Furthermore,

exp(−tψ(s)) = (αs + 1)^{−t/γ} = ∫_0^∞ exp(−sT(t)) p(T(t)) dT(t),

where {T(t) : t ≥ 0} is a Gamma subordinator and each T(t) has density

p(T(t) = η) = (α^{−t/γ}/Γ(t/γ)) η^{t/γ − 1} exp(−α^{−1}η).

As we see, T(t) follows the Gamma distribution Ga(T(t)|t/γ, α). Thus, {T(t) : t ≥ 0} is called the Gamma subordinator.
We also note that the corresponding pseudo-prior is

exp(−tψ(|b|)) = (α|b| + 1)^{−t/γ} ∝ ∫_0^∞ L(b|0, T(t)^{−1}) T(t)^{−1} p(T(t)) dT(t).

Furthermore, if t > γ, we can form the pseudo-prior as a proper distribution, which is the mixture of L(b|0, T(t)^{−1}) with the Gamma distribution Ga(T(t)|t/γ − 1, α).

3.2 The EXP Penalty and Compound Poisson Subordinator

We call {K(t) : t ≥ 0} a Poisson process of intensity λ > 0 if K takes values in N ∪ {0} and each K(t) ∼ Po(K(t)|λt), namely,

P(K(t) = k) = ((λt)^k / k!) e^{−λt},   for k = 0, 1, 2, . . . .

Let {Z(k) : k ∈ N} be a sequence of i.i.d. real random variables with common law μZ, and let K be a Poisson process of intensity λ that is independent of all the Z(k). Then T(t) ≜ Z(1) + ··· + Z(K(t)) for t ≥ 0 follows a compound Poisson distribution (denoted T(t) ∼ Po(T(t)|λt, μZ)). We then call {T(t) : t ≥ 0} a compound Poisson process. It is well known that Poisson processes are subordinators. A compound Poisson process is a subordinator if and only if the Z(k) are nonnegative random variables [16].
In this section we employ the compound Poisson process to explore the EXP penalty, which is

ψ(|b|) = (1/γ)(1 − exp(−α|b|)),   α, γ > 0.   (4)

It is easily seen that ψ(|b|) is a Bernstein function of |b| on (0,∞). Moreover, we have

Theorem 3 Let ψ(s) be defined by (4) with s = |b|. Then

ψ(s) = ∫_0^∞ [1 − exp(−su)] ν(du)

with the Lévy measure ν(du) = γ^{−1}δα(u) du. Furthermore,

exp(−tψ(s)) = ∫_0^∞ exp(−sT(t)) p(T(t)) dT(t),

where {T(t) : t ≥ 0} is a compound Poisson subordinator, each T(t) ∼ Po(T(t)|t/γ, δα(·)), and δu(·) is the Dirac delta measure.

Note that ∫_R (1 − exp(−α|b|)) db = ∞, so γ^{−1}(1 − exp(−α|b|)) is an improper prior of b.
As we see, there are two parameters α and γ in both the LOG and EXP penalties. Usually, for the LOG penalty one sets γ = log(1 + α), because the corresponding ψ(|b|) then goes from ‖b‖1 to ‖b‖0 as α varies from 0 to ∞. For the same reason, one sets γ = 1 − exp(−α) for the EXP penalty. Thus, α (or γ) measures the sparseness. It makes sense to set α = p (i.e., the dimension of the input vector) in the following experiments.
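The interpolation between the ℓ1 norm and the ℓ0 indicator under these conventions can be checked numerically; the following is our own sketch, not code from the paper:

```python
import math

def log_pen(b, alpha):
    # LOG penalty with the convention gamma = log(1 + alpha)
    return math.log(alpha * abs(b) + 1.0) / math.log(1.0 + alpha)

def exp_pen(b, alpha):
    # EXP penalty with the convention gamma = 1 - exp(-alpha)
    return (1.0 - math.exp(-alpha * abs(b))) / (1.0 - math.exp(-alpha))

b = 0.7
# alpha -> 0: both penalties behave like the l1 norm, i.e. pen(b) ~ |b|
print(abs(log_pen(b, 1e-6) - abs(b)) < 1e-3)  # True
print(abs(exp_pen(b, 1e-6) - abs(b)) < 1e-3)  # True
# alpha large: both approach the l0 indicator, i.e. 1 for b != 0
# (EXP converges quickly; LOG only logarithmically in alpha)
print(exp_pen(b, 50.0))
```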
Interestingly, the following theorem shows a limiting property of the subordinators.

Theorem 4 Assume that α > 0 and γ > 0.

(1) If γ = log(1 + α), then lim_{α→0} Ga(T(t)|t/γ, α) →d δt(T(t)).

(2) If γ = 1 − e^{−α}, then lim_{α→0} Po(T(t)|t/γ, δα(·)) →d δt(T(t)).

In this section we have established an interesting connection between the LOG and EXP penalties based on the relationship between the Gamma and compound Poisson subordinators. Subordinators help us establish a direct connection between the tuning hyperparameters tj and the latent variables ηj (= T(tj)). However, when we implement the MAP estimation, it is challenging to select these tuning hyperparameters. Recently, Palmer et al. [14] considered the application of concave conjugates in developing variational EM algorithms for non-Gaussian latent variable models. In the next section we rederive the nonconvex LOG and EXP penalties via concave conjugates. This derivation is able to deal with the challenge.

4 A View of Concave Conjugate

Our derivation of the LOG and EXP penalties is based on the Kullback-Leibler (KL) distance. Given two nonnegative vectors a = (a1, . . . , ap)T and s = (s1, . . . , sp)T, the KL distance between them is

KL(a, s) = ∑_{j=1}^{p} [aj log(aj/sj) − aj + sj],

where 0 log(0/0) ≜ 0. It is well known that KL(a, s) ≥ 0 and KL(a, s) = 0 if and only if a = s, but typically KL(a, s) ≠ KL(s, a).

Theorem 5 Let a = (a1, . . . , ap)T be a nonnegative vector and |b| = (|b1|, . . . , |bp|)T. Then,

∑_{j=1}^{p} aj ψ(|bj|) ≜ ∑_{j=1}^{p} (aj/α) log(α|bj| + 1) = min_{w≥0} { wT|b| + (1/α) KL(a, w) },

where the minimum is attained at wj = aj/(1 + α|bj|), and

∑_{j=1}^{p} aj ψ(|bj|) ≜ ∑_{j=1}^{p} (aj/α) [1 − exp(−α|bj|)] = min_{w≥0} { wT|b| + (1/α) KL(w, a) },

where the minimum is attained at wj = aj exp(−α|bj|).

When setting aj = (α/γ) tj, we readily recover the LOG and EXP penalties. Thus, Theorem 5 illustrates a very interesting connection between the LOG and EXP penalties. Since KL(a, w) is strictly convex in either w or a, the LOG and EXP penalties are respectively the concave conjugates of −α^{−1}KL(a, w) and −α^{−1}KL(w, a).
This construction of the nonconvex penalties provides us with a new approach for solving the corresponding penalized regression model. In particular, to solve the nonconvex penalized regression problem

min_b { J(b, a) ≜ (1/2)‖y − Xb‖₂² + ∑_{j=1}^{p} aj ψ(|bj|) },   (5)

we equivalently formulate it as

min_b { min_{w≥0} { (1/2)‖y − Xb‖₂² + wT|b| + (1/α) D(w, a) } }.   (6)

Here D(w, a) is either KL(a, w) or KL(w, a). Moreover, we are also interested in adaptive estimation of a in solving problem (6). Accordingly, we develop a new training algorithm, which consists of two steps.
We are given initial values w(0), e.g., w(0) = (1, . . . , 1)T.
After the kth estimates (b(k), a(k)) of (b, a) are obtained, the (k+1)th iteration of the algorithm is defined as follows.
The first step calculates w(k) via

w(k) = argmin_{w≥0} { ∑_{j=1}^{p} wj|bj(k)| + (1/α) D(w, a(k)) }.

In particular, wj(k) = aj(k)/(1 + α|bj(k)|) for LOG, while wj(k) = aj(k) exp(−α|bj(k)|) for EXP.
The second step then calculates (b(k+1), a(k+1)) via

(b(k+1), a(k+1)) = argmin_{b, a} { (1/2)‖y − Xb‖₂² + |b|T w(k) + (1/α) D(w(k), a) }.

Note that given w(k), b and a are independent. Thus, this step can be split into two parts. Namely, a(k+1) = w(k) and

b(k+1) = argmin_b { (1/2)‖y − Xb‖₂² + ∑_{j=1}^{p} wj(k)|bj| }.

Recall that the LOG and EXP penalties are differentiable and strictly concave in |b| on [0,∞). Thus, the above algorithm enjoys the same convergence property as the LLA algorithm studied by Zou and Li [23] (see Theorem 1 and Proposition 1 therein).

5 Experimental Analysis

We conduct an experimental analysis of our algorithms with the LOG and EXP penalties given in the previous section. We also implement the Lasso, adaptive Lasso (AdLasso) and MCP-based methods. All these methods are solved by the coordinate descent algorithm. For the LOG and EXP algorithms, we fix α = p (the dimension of the input vector) and set w(0) = ω1, where ω is selected by cross-validation and 1 is the vector of ones. For Lasso, AdLasso and MCP, we use cross-validation to select the tuning parameters (λ in Lasso; λ and γ in AdLasso and MCP).
In this simulation example, we use the following data model:

y = xT b + σϵ,

where ϵ ∼ N(0, 1) and b is a 200-dimensional vector with only 10 non-zeros such that bi = b100+i = 0.2i, i = 1, . . . , 5.
Each data point x is sampled from a multivariate normal distribution with zero mean and covariance matrix Σ = {0.7^{|i−j|}}_{1≤i,j≤200}. We choose σ such that the Signal-to-Noise Ratio (SNR), which is defined as

SNR = √(bT Σ b) / σ,

is a specified value. Our experiment is performed with n = 100 and two different SNR values. We generate N = 1000 test data points for each test. Let b̂ denote the solution given by each algorithm. The Standardized Prediction Error (SPE) is defined as

SPE = ∑_{i=1}^{N} (yi − xiT b̂)² / (Nσ²),

and the Feature Selection Error (FSE) is the proportion of coefficients in b̂ which is correctly set to zero or non-zero based on the true b.
Figure 1 reports the average results over 20 repeats. From the figure, we see that both LOG and EXP outperform the other methods in prediction accuracy and sparseness in most cases. Our methods usually take about 10 iterations to converge. Thus, our methods are computationally more efficient than AdLasso and MCP.
In the second experiment, we apply our methods to regression problems on four datasets from the UCI Machine Learning Repository and the cookie (Near-Infrared (NIR) Spectroscopy of Biscuit Doughs) dataset [7]. For the four UCI datasets, we randomly select 70% of the data for training and the rest for test, and repeat this process 20 times. We report the mean and standard deviation of the Root Mean Square Error (RMSE) and the model sparsity (proportion of zero coefficients in the model) in Tables 1 and 2. For the NIR dataset, we follow the setup for the original dataset: 40 instances for training and 32 instances for test. We form four different datasets for the four responses ("fat", "sucrose", "dry flour" and "water"), and report the RMSE on the test set and the model sparsity in Table 3.
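For reference, the evaluation quantities defined above can be computed as follows (a small sketch of our own; `support_agreement` compares zero patterns in the spirit of the FSE measure, and all names are ours):

```python
import numpy as np

def snr(b, Sigma, sigma):
    # signal-to-noise ratio sqrt(b' Sigma b) / sigma
    return np.sqrt(b @ Sigma @ b) / sigma

def spe(y_test, X_test, b_hat, sigma):
    # standardized prediction error sum_i (y_i - x_i' b_hat)^2 / (N sigma^2)
    return np.sum((y_test - X_test @ b_hat) ** 2) / (len(y_test) * sigma ** 2)

def support_agreement(b_hat, b_true, tol=1e-8):
    # fraction of coefficients whose zero/non-zero status matches b_true
    return np.mean((np.abs(b_hat) > tol) == (np.abs(b_true) > tol))

b_true = np.zeros(4)
b_true[:2] = [1.0, -2.0]
b_hat = np.array([0.9, 0.0, 0.0, 0.3])
print(support_agreement(b_hat, b_true))  # 0.5: one miss, one false positive
```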
We can see that all the methods are competitive in prediction accuracy, while the nonconvex LOG, EXP and MCP have a stronger ability in feature selection.

Figure 1: Box-and-whisker plots of the SPE and FSE results under SNR = 3.0 and SNR = 10.0. Here (a), (b), (c), (d), (e) are for LOG, EXP, Lasso, AdLasso, and MCP, respectively.

Table 1: Root Mean Square Error on real datasets

         Abalone        Housing        Pyrim          Triazines
LOG      2.207(±0.077)  4.880(±0.405)  0.138(±0.032)  0.156(±0.018)
EXP      2.208(±0.077)  4.883(±0.405)  0.130(±0.033)  0.153(±0.020)
Lasso    2.208(±0.078)  4.886(±0.414)  0.118(±0.035)  0.146(±0.017)
AdLasso  2.208(±0.078)  4.887(±0.413)  0.127(±0.028)  0.146(±0.017)
MCP      2.209(±0.078)  4.889(±0.412)  0.122(±0.036)  0.148(±0.017)

Table 2: Sparsity on real datasets

         Abalone        Housing        Pyrim           Triazines
LOG      12.50(±0.00)   11.54(±5.70)   57.22(±35.32)   68.17(±31.19)
EXP      10.63(±4.46)   8.08(±5.15)    88.15(±5.69)    76.25(±21.84)
Lasso    1.88(±4.46)    3.08(±5.10)    36.48(±24.52)   62.08(±14.65)
AdLasso  8.75(±5.73)    8.07(±7.08)    34.62(±28.81)   63.58(±15.18)
MCP      12.50(±0.00)   11.54(±6.66)   41.48(±23.88)   73.00(±18.77)

Table 3: Root Mean Square Error and sparsity on the NIR datasets

         NIR(fat)         NIR(sucrose)     NIR(dry flour)   NIR(water)
         RMSE   Sparsity  RMSE  Sparsity   RMSE   Sparsity  RMSE   Sparsity
LOG      0.334  99.14     1.45  98.71      0.992  99.71     0.400  98.14
EXP      0.307  97.29     1.47  97.71      0.908  98.86     0.484  94.14
Lasso    0.437  68.86     2.54  53.43      0.785  92.29     0.378  65.57
AdLasso  0.835  88.14     2.22  86.14      0.862  99.14     0.407  85.86
MCP      0.943  94.14     2.07  95.43      0.839  99.71     0.504  96.29

6 Conclusion

In this paper we have
introduced subordinators of Lévy processes into the definition of nonconvex penalties. This leads to a Bayesian nonparametric approach for constructing sparsity-inducing penalties. In particular, we have illustrated the construction of the LOG and EXP penalties. Along this line, it would be interesting to investigate other penalty functions via subordinators and to compare the performance of these penalties. We will conduct a comprehensive study in future work.

Acknowledgments

This work has been supported in part by the Natural Science Foundations of China (No. 61070239).

References

[1] D. Applebaum. Lévy Processes and Stochastic Calculus. Cambridge University Press, Cambridge, UK, 2004.
[2] A. Armagan, D. Dunson, and J. Lee. Generalized double Pareto shrinkage. Technical report, Duke University Department of Statistical Science, February 2011.
[3] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 82–90. Morgan Kaufmann Publishers, San Francisco, California, 1998.
[4] F. Caron and A. Doucet. Sparse Bayesian nonparametric regression. In Proceedings of the 25th International Conference on Machine Learning, page 88, 2008.
[5] V. Cevher. Learning with compressible priors. In Advances in Neural Information Processing Systems 22, pages 261–269, 2009.
[6] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1361, 2001.
[7] B. G. Osborne, T. Fearn, A. R. Miller, and S. Douglas. Application of near-infrared reflectance spectroscopy to compositional analysis of biscuits and biscuit dough. Journal of the Science of Food and Agriculture, 35(1):99–105, 1984.
[8] C. Gao, N. Wang, Q. Yu, and Z. Zhang.
A feasible nonconvex relaxation approach to feature selection. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI'11), 2011.
[9] P. J. Garrigues and B. A. Olshausen. Group sparse coding with a Laplacian scale mixture prior. In Advances in Neural Information Processing Systems 22, 2010.
[10] Z. Ghahramani, T. Griffiths, and P. Sollich. Bayesian nonparametric latent feature models. In World Meeting on Bayesian Statistics, 2006.
[11] J. E. Griffin and P. J. Brown. Bayesian adaptive lassos with non-convex penalization. Technical report, University of Kent, 2010.
[12] A. Lee, F. Caron, A. Doucet, and C. Holmes. A hierarchical Bayesian framework for constructing sparsity-inducing priors. Technical report, University of Oxford, UK, 2010.
[13] R. Mazumder, J. Friedman, and T. Hastie. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.
[14] J. A. Palmer, D. P. Wipf, K. Kreutz-Delgado, and B. D. Rao. Variational EM algorithms for non-Gaussian latent variable models. In Advances in Neural Information Processing Systems 18, 2006.
[15] N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes, and regularized regression. Journal of the Royal Statistical Society (Series B), 74(2):287–311, 2012.
[16] K.-I. Sato. Lévy Processes and Infinitely Divisible Distributions. Cambridge University Press, Cambridge, UK, 1999.
[17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
[18] M. K. Titsias. The infinite gamma-Poisson feature model. In Advances in Neural Information Processing Systems 20, 2007.
[19] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear models and kernel methods.
Journal of Machine Learning Research, 3:1439–1461, 2003.
[20] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942, 2010.
[21] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081–1107, 2010.
[22] Z. Zhang, S. Wang, D. Liu, and M. I. Jordan. EP-GIG priors and applications in Bayesian sparse learning. Journal of Machine Learning Research, 13:2031–2061, 2012.
[23] H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509–1533, 2008.
", "award": [], "sourceid": 4763, "authors": [{"given_name": "Zhihua", "family_name": "Zhang", "institution": null}, {"given_name": "Bojun", "family_name": "Tu", "institution": null}]}