{"title": "Multi-Stage Multi-Task Feature Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1988, "page_last": 1996, "abstract": "Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness in approximating an $\\ell_0$-type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with state-of-the-art multi-task sparse feature learning algorithms.", "full_text": "Multi-Stage Multi-Task Feature Learning\u2217\n\n\u2020Pinghua Gong, \u2021Jieping Ye, \u2020Changshui Zhang\n\n\u2020State Key Laboratory on Intelligent Technology and Systems\nTsinghua National Laboratory for Information Science and Technology (TNList)\nDepartment of Automation, Tsinghua University, Beijing 100084, China\n\u2021Computer Science and Engineering, Center for Evolutionary Medicine and Informatics\nThe Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA\n\u2020{gph08@mails, zcs@mail}.tsinghua.edu.cn, \u2021jieping.ye@asu.edu\n\nAbstract\n\nMulti-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. 
It has been successfully applied to\nmany applications including computer vision and biomedical informatics. Most\nof the existing multi-task sparse feature learning algorithms are formulated as\na convex sparse regularization problem, which is usually suboptimal, due to its\nlooseness for approximating an \u21130-type regularizer. In this paper, we propose a\nnon-convex formulation for multi-task sparse feature learning based on a novel\nregularizer. To solve the non-convex optimization problem, we propose a Multi-\nStage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present\na detailed theoretical analysis showing that MSMTFL achieves a better parameter\nestimation error bound than the convex formulation. Empirical studies on both\nsynthetic and real-world data sets demonstrate the effectiveness of MSMTFL in\ncomparison with the state of the art multi-task sparse feature learning algorithms.\n\n1 Introduction\n\nMulti-task learning (MTL) exploits the relationships among multiple related tasks to improve the\ngeneralization performance. It has been applied successfully to many applications such as speech\nclassi\ufb01cation [16], handwritten character recognition [14, 17] and medical diagnosis [2]. One com-\nmon assumption in multi-task learning is that all tasks should share some common structures in-\ncluding the prior or parameters of Bayesian models [18, 21, 24], a similarity metric matrix [16], a\nclassi\ufb01cation weight vector [6], a low rank subspace [4, 13] and a common set of shared features\n[1, 8, 10, 11, 12, 14, 20].\nIn this paper, we focus on multi-task feature learning, in which we learn the features speci\ufb01c to\neach task as well as the common features shared among tasks. Although many multi-task feature\nlearning algorithms have been proposed, most of them assume that the relevant features are shared\nby all tasks. This is too restrictive in real-world applications [9]. To overcome this limitation, Jalali\net al. 
(2010) [9] proposed an \u21131 + \u21131,\u221e regularized formulation, called the dirty model, to leverage the common features shared among tasks. The dirty model allows a certain feature to be shared by some tasks but not all tasks. Jalali et al. (2010) also presented a theoretical analysis under the incoherence condition [5, 15], which is more restrictive than RIP [3, 27]. The \u21131 + \u21131,\u221e regularizer is a convex relaxation of the \u21130-type one, which, however, is too loose to well approximate the \u21130-type regularizer and usually achieves suboptimal performance (requiring restrictive conditions or obtaining a suboptimal error bound) [23, 26, 27]. To remedy this shortcoming, we propose to use a non-convex regularizer for multi-task feature learning in this paper.\n\n\u2217This work was completed when the first author visited Arizona State University.\n\nContributions: We propose to employ a capped-\u21131,\u21131 regularized formulation (non-convex) to learn the features specific to each task as well as the common features shared among tasks. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm, using the concave duality [26]. Although the MSMTFL algorithm may not obtain a globally optimal solution, we theoretically show that this solution achieves good performance. Specifically, we present a detailed theoretical analysis of the parameter estimation error bound for the MSMTFL algorithm. Our analysis shows that, under the sparse eigenvalue condition, which is weaker than the incoherence condition in Jalali et al. (2010) [9], MSMTFL improves the error bound during the multi-stage iteration, i.e., the error bound at the current iteration improves upon the one at the last iteration. 
Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of the MSMTFL algorithm in comparison with state-of-the-art algorithms.\nNotations: Scalars and vectors are denoted by lower case letters and bold face lower case letters, respectively. Matrices and sets are denoted by capital letters and calligraphic capital letters, respectively. The \u21131 norm, Euclidean norm, \u2113\u221e norm and Frobenius norm are denoted by \u2225 \u00b7 \u22251, \u2225 \u00b7 \u2225, \u2225 \u00b7 \u2225\u221e and \u2225 \u00b7 \u2225F, respectively. | \u00b7 | denotes the absolute value of a scalar or the number of elements in a set, depending on the context. We define the \u2113p,q norm of a matrix X as \u2225X\u2225p,q = (\u2211i (\u2211j |xij|q)p/q)1/p. We define Nn as {1, \u00b7\u00b7\u00b7, n} and N(\u00b5, \u03c32) as a normal distribution with mean \u00b5 and variance \u03c32. For a d \u00d7 m matrix W and sets Ii \u2286 Nd \u00d7 {i}, I \u2286 Nd \u00d7 Nm, we let wIi be a d \u00d7 1 vector with the j-th entry being wji, if (j, i) \u2208 Ii, and 0, otherwise. We also let WI be a d \u00d7 m matrix with the (j, i)-th entry being wji, if (j, i) \u2208 I, and 0, otherwise.\n\n2 The Proposed Formulation\n\nAssume we are given m learning tasks associated with training data {(X1, y1), \u00b7\u00b7\u00b7, (Xm, ym)}, where Xi \u2208 Rni\u00d7d is the data matrix of the i-th task with each row as a sample; yi \u2208 Rni is the response of the i-th task; d is the data dimensionality; ni is the number of samples for the i-th task. We consider learning a weight matrix W = [w1, \u00b7\u00b7\u00b7, wm] \u2208 Rd\u00d7m consisting of the weight vectors for m linear predictive models: yi \u2248 fi(Xi) = Xiwi, i \u2208 Nm. In this paper, we propose a non-convex multi-task feature learning formulation to learn these m models simultaneously, based on the capped-\u21131,\u21131 regularization. 
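To make the \u2113p,q matrix norm from the notations concrete, here is a minimal NumPy sketch (the helper name `lpq_norm` is ours, for illustration only; it follows the definition above, i.e., the \u2113p norm of the vector of row-wise \u2113q norms):

```python
import numpy as np

def lpq_norm(X, p, q):
    # ||X||_{p,q} = (sum_i (sum_j |x_ij|^q)^{p/q})^{1/p}:
    # first take the l_q norm of each row, then the l_p norm of the result.
    row_norms = np.sum(np.abs(X) ** q, axis=1) ** (1.0 / q)
    return float(np.sum(row_norms ** p) ** (1.0 / p))

X = np.array([[3.0, -4.0],
              [0.0,  0.0],
              [1.0,  2.0]])
print(lpq_norm(X, p=1, q=1))  # entrywise l_1 norm: 10.0
```

Special cases recover familiar norms: p = q = 1 gives the entrywise \u21131 norm, and p = q = 2 gives the Frobenius norm.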
Specifically, we first impose the \u21131 penalty on each row of W, obtaining a column vector. Then, we impose the capped-\u21131 penalty [26, 27] on that vector. Formally, we formulate our proposed model as follows:\n\nminW { l(W) + \u03bb \u2211j=1,...,d min(\u2225wj\u22251, \u03b8) }, (1)\n\nwhere l(W) is an empirical loss function of W; \u03bb (> 0) is a parameter balancing the empirical loss and the regularization; \u03b8 (> 0) is a thresholding parameter; wj is the j-th row of the matrix W. In this paper, we focus on the quadratic loss function: l(W) = \u2211i=1,...,m (1/(mni)) \u2225Xiwi \u2212 yi\u22252.\n\nAlgorithm 1: MSMTFL: Multi-Stage Multi-Task Feature Learning\n1 Initialize \u03bb(0)j = \u03bb;\n2 for \u2113 = 1, 2, \u00b7\u00b7\u00b7 do\n3 Let \u02c6W(\u2113) be a solution of the following problem:\n\nminW\u2208Rd\u00d7m { l(W) + \u2211j=1,...,d \u03bb(\u2113\u22121)j \u2225wj\u22251 }. (2)\n\nLet \u03bb(\u2113)j = \u03bbI(\u2225(\u02c6w(\u2113))j\u22251 < \u03b8) (j = 1, \u00b7\u00b7\u00b7, d), where (\u02c6w(\u2113))j is the j-th row of \u02c6W(\u2113) and I(\u00b7) denotes the {0, 1} valued indicator function.\n4 end\n\nIntuitively, due to the capped-\u21131,\u21131 penalty, the optimal solution of Eq. (1), denoted as W\u22c6, has many zero rows. For a nonzero row (w\u22c6)k, some entries may be zero, due to the \u21131-norm imposed on each row of W. Thus, under the formulation in Eq. (1), a certain feature can be shared by some tasks but not all the tasks. Therefore, the proposed formulation can leverage the common features shared among tasks.\nThe formulation in Eq. (1) is non-convex and is difficult to solve. To this end, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm (see Algorithm 1). 
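As a concrete illustration, the sketch below evaluates the objective in Eq. (1) and runs the multi-stage loop of Algorithm 1, solving each stage's reweighted-\u21131 subproblem in Eq. (2) with a plain proximal-gradient (ISTA) loop. The inner solver, step size and iteration counts are our illustrative choices, not prescribed by the paper:

```python
import numpy as np

def soft(z, t):
    # entrywise soft-thresholding: the proximal operator of a weighted l1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(W, Xs, ys, lam, theta):
    # Eq. (1): l(W) + lambda * sum_j min(||w^j||_1, theta)
    m = len(Xs)
    loss = sum(np.sum((X @ W[:, i] - y) ** 2) / (m * len(y))
               for i, (X, y) in enumerate(zip(Xs, ys)))
    return loss + lam * np.minimum(np.abs(W).sum(axis=1), theta).sum()

def msmtfl(Xs, ys, lam, theta, n_stages=5, n_inner=300):
    # Algorithm 1: stage 1 is Lasso; later stages drop the penalty on rows
    # whose l1-norm already reaches theta (lambda_j = lam * I(||w^j||_1 < theta)).
    m, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((d, m))
    lam_rows = np.full(d, lam)                          # lambda_j^{(0)} = lambda
    for _ in range(n_stages):
        # for fixed row weights, Eq. (2) separates across tasks
        for i, (X, y) in enumerate(zip(Xs, ys)):
            n_i = len(y)
            L = 2.0 * np.linalg.norm(X, 2) ** 2 / (m * n_i)   # gradient Lipschitz constant
            w = W[:, i].copy()
            for _ in range(n_inner):
                grad = 2.0 / (m * n_i) * X.T @ (X @ w - y)
                w = soft(w - grad / L, lam_rows / L)
            W[:, i] = w
        lam_rows = lam * (np.abs(W).sum(axis=1) < theta)
    return W
```

Running a single stage reproduces the \u21131-regularized (Lasso) solution; each additional stage refines it by removing the penalty from rows that already look reliably nonzero.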
Note that if we termi-\nnate the algorithm with \u2113 = 1, the MSMTFL algorithm is equivalent to the \u21131 regularized multi-task\nfeature learning algorithm (Lasso). Thus, the solution obtained by MSMTFL can be considered\nas a re\ufb01nement of that of Lasso. Although Algorithm 1 may not \ufb01nd a globally optimal solution,\nthe solution has good performance. Speci\ufb01cally, we will theoretically show that the solution ob-\ntained by Algorithm 1 improves the performance of the parameter estimation error bound during\nthe multi-stage iteration. Moreover, empirical studies also demonstrate the effectiveness of our pro-\nposed MSMTFL algorithm. We provide more details about intuitive interpretations, convergence\nanalysis and reproducibility discussions of the proposed algorithm in the full version [7].\n\n3 Theoretical Analysis\n\n(\n\nIn this section, we theoretically analyze the parameter estimation performance of the solution ob-\ntained by the MSMTFL algorithm. To simplify the notations in the theoretical analysis, we assume\nthat the number of samples for all the tasks are the same. 
However, our theoretical analysis can be easily extended to the case where the tasks have different sample sizes.\nWe first present a sub-Gaussian noise assumption which is very common in the analysis of the sparse regularization literature [23, 25, 26, 27].\nAssumption 1 Let \u00afW = [\u00afw1, \u00b7\u00b7\u00b7, \u00afwm] \u2208 Rd\u00d7m be the underlying sparse weight matrix and yi = Xi\u00afwi + \u03b4i, Eyi = Xi\u00afwi, where \u03b4i \u2208 Rn is a random vector with all entries \u03b4ji (j \u2208 Nn, i \u2208 Nm) being independent sub-Gaussians: there exists \u03c3 > 0 such that \u2200j \u2208 Nn, i \u2208 Nm, t \u2208 R: E\u03b4ji exp(t\u03b4ji) \u2264 exp(\u03c32t2/2).\nRemark 1 We call the random variable satisfying the condition in Assumption 1 sub-Gaussian, since its moment generating function is upper bounded by that of a zero mean Gaussian random variable. That is, if a normal random variable x \u223c N(0, \u03c32), then we have E exp(tx) = \u222b (1/(\u221a(2\u03c0)\u03c3)) exp(tx) exp(\u2212x2/(2\u03c32)) dx = exp(\u03c32t2/2) \u222b (1/(\u221a(2\u03c0)\u03c3)) exp(\u2212(x \u2212 \u03c32t)2/(2\u03c32)) dx = exp(\u03c32t2/2) \u2265 E\u03b4ji exp(t\u03b4ji), where both integrals are taken over (\u2212\u221e, +\u221e).\nRemark 2 Based on Hoeffding's Lemma, for any random variable x \u2208 [a, b] with Ex = 0, we have E(exp(tx)) \u2264 exp(t2(b \u2212 a)2/8). Therefore, both zero mean Gaussian and zero mean bounded random variables are sub-Gaussians. Thus, the sub-Gaussian noise assumption is more general than the Gaussian noise assumption which is commonly used in the literature [9, 11].\n\nWe next introduce the following sparse eigenvalue concept, which is also common in the analysis of the sparse regularization literature [22, 23, 25, 26, 27].\nDefinition 1 Given 1 \u2264 k \u2264 d, we define\n\n\u03c1+i(k) = supw {\u2225Xiw\u22252/(n\u2225w\u22252) : \u2225w\u22250 \u2264 k}, \u03c1+max(k) = maxi\u2208Nm \u03c1+i(k),\n\u03c1\u2212i(k) = infw {\u2225Xiw\u22252/(n\u2225w\u22252) : \u2225w\u22250 \u2264 k}, \u03c1\u2212min(k) = mini\u2208Nm \u03c1\u2212i(k).\n\nRemark 3 \u03c1+i(k) (\u03c1\u2212i(k)) is in fact the maximum (minimum) eigenvalue of (Xi)TS(Xi)S/n, where S is a set satisfying |S| \u2264 k and (Xi)S is a submatrix composed of the columns of Xi indexed by S. In the MTL setting, we need to exploit the relations of \u03c1+i(k) (\u03c1\u2212i(k)) among multiple tasks.\n\nWe present our parameter estimation error bound on MSMTFL in the following theorem:\n\nTheorem 1 Let Assumption 1 hold. Define \u00afFi = {(j, i) : \u00afwji \u0338= 0} and \u00afF = \u222ai\u2208Nm \u00afFi. Denote \u00afr as the number of nonzero rows of \u00afW. We assume that, for some integer s \u2265 \u00afr, 
\u2200(j, i) \u2208 \u00afF : \u2225\u00afwj\u22251 \u2265 2\u03b8, (3)\n\n\u03c1+i(s)/\u03c1\u2212i(2\u00afr + 2s) \u2264 1 + s/(2\u00afr) (\u2200i \u2208 Nm). (4)\n\nIf we choose \u03bb and \u03b8 such that\n\n\u03bb \u2265 12\u03c3\u221a(2\u03c1+max(1) ln(2dm/\u03b7)/n), (5)\n\n\u03b8 \u2265 11m\u03bb/\u03c1\u2212min(2\u00afr + s), (6)\n\nthen the following parameter estimation error bound holds with probability larger than 1 \u2212 \u03b7:\n\n\u2225\u02c6W(\u2113) \u2212 \u00afW\u22252,1 \u2264 0.8\u2113/2 \u00b7 9.1m\u03bb\u221a\u00afr/\u03c1\u2212min(2\u00afr + s) + 39.5m\u03c3\u221a(\u03c1+max(\u00afr)(7.4\u00afr + 2.7 ln(2/\u03b7))/n)/\u03c1\u2212min(2\u00afr + s), (7)\n\nwhere \u02c6W(\u2113) is a solution of Eq. (2).\nRemark 4 Eq. (3) assumes that the \u21131-norm of each nonzero row of \u00afW is bounded away from zero. This requires that the true nonzero coefficients be large enough, in order to distinguish them from the noise. Eq. (4) is called the sparse eigenvalue condition [27], which requires the eigenvalue ratio \u03c1+i(s)/\u03c1\u2212i(s) to grow sub-linearly with respect to s. Such a condition is very common in the analysis of sparse regularization [22, 25] and it is slightly weaker than the RIP condition [3, 27].\nRemark 5 When \u2113 = 1 (corresponding to Lasso), the first term of the right-hand side of Eq. (7) dominates the error bound, in the order of\n\n\u2225\u02c6W Lasso \u2212 \u00afW\u22252,1 = O(m\u221a(\u00afr ln(dm/\u03b7)/n)), (8)\n\nsince \u03bb satisfies the condition in Eq. (5). Note that the first term of the right-hand side of Eq. (7) shrinks exponentially as \u2113 increases. When \u2113 is sufficiently large, in the order of O(ln(m\u221a(\u00afr/n)) + ln ln(dm)), this term tends to zero and we obtain the following parameter estimation error bound:\n\n\u2225\u02c6W(\u2113) \u2212 \u00afW\u22252,1 = O(m(\u221a(\u00afr/n) + \u221a(ln(1/\u03b7)/n))). (9)\n\nJalali et al. (2010) [9] gave an \u2113\u221e,\u221e-norm error bound \u2225\u02c6W Dirty \u2212 \u00afW\u2225\u221e,\u221e = O(\u221a(ln(dm/\u03b7)/n)) as well as a sign consistency result between \u02c6W and \u00afW. A direct comparison between these two bounds is difficult due to the use of different norms. On the other hand, the worst-case estimate of the \u21132,1-norm error bound of the algorithm in Jalali et al. (2010) [9] is of the same order as Eq. (8), that is: \u2225\u02c6W Dirty \u2212 \u00afW\u22252,1 = O(m\u221a(\u00afr ln(dm/\u03b7)/n)). When dm is large and the ground truth has a large number of zero rows (i.e., \u00afr is a small constant), the bound in Eq. (9) is significantly better than the ones for the Lasso and Dirty model.\nRemark 6 Jalali et al. (2010) [9] presented an \u2113\u221e,\u221e-norm parameter estimation error bound and hence a sign consistency result can be obtained. The results are derived under the incoherence condition, which is more restrictive than the RIP condition and hence more restrictive than the sparse eigenvalue condition in Eq. (4). From the viewpoint of the parameter estimation error, our proposed algorithm can achieve a better bound under weaker conditions. Please refer to [19, 25, 27] for more details about the incoherence condition, the RIP condition, the sparse eigenvalue condition and their relationships.\nRemark 7 The capped-\u21131 regularized formulation in Zhang (2010) [26] is a special case of our formulation when m = 1. 
However, extending the analysis from the single task to the multi-task setting is nontrivial. Different from previous work on multi-stage sparse learning, which focuses on a single task [26, 27], we study a more general multi-stage framework in the multi-task setting. We need to exploit the relationship among tasks, by using the relations of the sparse eigenvalues \u03c1+i(k) (\u03c1\u2212i(k)) and treating the \u21131-norm on each row of the weight matrix as a whole for consideration. Moreover, we simultaneously exploit the relations of each column and each row of the matrix.\n\n4 Proof Sketch\n\nWe first provide several important lemmas (please refer to the full version [7] or the supplementary materials for detailed proofs) and then complete the proof of Theorem 1 based on these lemmas.\nLemma 1 Let \u00af\u03a5 = [\u00af\u03f51, \u00b7\u00b7\u00b7, \u00af\u03f5m] with \u00af\u03f5i = [\u00af\u03f51i, \u00b7\u00b7\u00b7, \u00af\u03f5di]T = (1/n)XiT(Xi\u00afwi \u2212 yi) (i \u2208 Nm). Define \u00afH \u2287 \u00afF such that (j, i) \u2208 \u00afH (\u2200i \u2208 Nm), provided there exists (j, g) \u2208 \u00afF (\u00afH is a set consisting of the indices of all entries in the nonzero rows of \u00afW). Under the conditions of Assumption 1 and the notations of Theorem 1, the following hold with probability larger than 1 \u2212 \u03b7:\n\n\u2225\u00af\u03a5\u2225\u221e,\u221e \u2264 \u03c3\u221a(2\u03c1+max(1) ln(2dm/\u03b7)/n), (10)\n\n\u2225\u00af\u03a5\u00afH\u22252F \u2264 m\u03c32\u03c1+max(\u00afr)(7.4\u00afr + 2.7 ln(2/\u03b7))/n. (11)\n\nLemma 1 gives bounds on the residual correlation (\u00af\u03a5) with respect to \u00afW. We note that Eq. (10) and Eq. (11) are closely related to the assumption on \u03bb in Eq. (5) and the second term of the right-hand side of Eq. (7) (the error bound), respectively. 
This lemma provides a fundamental basis for the proof of Theorem 1.\nLemma 2 Use the notations of Lemma 1 and consider Gi \u2286 Nd \u00d7 {i} such that \u00afFi \u2229 Gi = \u2205 (i \u2208 Nm). Let \u02c6W = \u02c6W(\u2113) be a solution of Eq. (2) and \u2206\u02c6W = \u02c6W \u2212 \u00afW. Denote \u02c6\u03bbi = \u02c6\u03bb(\u2113\u22121)i = [\u03bb(\u2113\u22121)1i, \u00b7\u00b7\u00b7, \u03bb(\u2113\u22121)di]T. Let \u02c6\u03bbGi = min(j,i)\u2208Gi \u02c6\u03bbji, \u02c6\u03bbG = mini \u02c6\u03bbGi, \u02c6\u03bb0i = maxj \u02c6\u03bbji and \u02c6\u03bb0 = maxi \u02c6\u03bb0i. If 2\u2225\u00af\u03f5i\u2225\u221e < \u02c6\u03bbGi, then the following inequality holds at any stage \u2113 \u2265 1:\n\n\u2211i=1,...,m \u2211(j,i)\u2208Gi |\u02c6w(\u2113)ji| \u2264 ((2\u2225\u00af\u03a5\u2225\u221e,\u221e + \u02c6\u03bb0)/(\u02c6\u03bbG \u2212 2\u2225\u00af\u03a5\u2225\u221e,\u221e)) \u2211i=1,...,m \u2211(j,i)\u2208Gci |\u2206\u02c6w(\u2113)ji|.\n\nDenote G = \u222ai\u2208Nm Gi and \u00afF = \u222ai\u2208Nm \u00afFi, and notice that \u00afF \u2229 G = \u2205 \u21d2 \u2206\u02c6W(\u2113)G = \u02c6W(\u2113)G. Lemma 2 says that \u2225\u2206\u02c6W(\u2113)G\u22251,1 = \u2225\u02c6W(\u2113)G\u22251,1 is upper bounded in terms of \u2225\u2206\u02c6W(\u2113)Gc\u22251,1, which indicates that the error of the estimated coefficients located outside of \u00afF should be small enough. This provides an intuitive explanation of why the parameter estimation error of our algorithm can be small.\nLemma 3 Using the notations of Lemma 2, we denote G = G(\u2113) = \u00afHc \u2229 {(j, i) : \u02c6\u03bb(\u2113\u22121)ji = \u03bb} = \u222ai\u2208Nm Gi, with \u00afH being defined as in Lemma 1 and Gi \u2286 Nd \u00d7 {i}. 
Let Ji be the indices of the largest s coefficients (in absolute value) of \u02c6wGi, Ii = Gci \u222a Ji, I = \u222ai\u2208Nm Ii and \u00afF = \u222ai\u2208Nm \u00afFi. Then, the following inequalities hold at any stage \u2113 \u2265 1:\n\n\u2225\u2206\u02c6W(\u2113)\u22252,1 \u2264 (1 + 1.5\u221a(2\u00afr/s)) \u00b7 \u221a(8m(4\u2225\u00af\u03a5Gc(\u2113)\u22252F + \u2211(j,i)\u2208\u00afF (\u02c6\u03bb(\u2113\u22121)ji)2))/\u03c1\u2212min(2\u00afr + s), (12)\n\n\u2225\u2206\u02c6W(\u2113)\u22252,1 \u2264 9.1m\u03bb\u221a\u00afr/\u03c1\u2212min(2\u00afr + s). (13)\n\nLemma 3 is established based on Lemma 2, by considering the relationship between Eq. (5) and Eq. (10), and the specific definition of G = G(\u2113). Eq. (12) provides a parameter estimation error bound in terms of the \u21132,1-norm by \u2225\u00af\u03a5Gc(\u2113)\u22252F and the regularization parameters \u02c6\u03bb(\u2113\u22121)ji (see the definition of \u02c6\u03bbji (\u02c6\u03bb(\u2113\u22121)ji) in Lemma 2). This is the result directly used in the proof of Theorem 1. Eq. (13) states that the error bound is upper bounded in terms of \u03bb; its right-hand side constitutes the shrinkage part of the error bound in Eq. (7).\n\nLemma 4 Let \u02c6\u03bbji = \u03bbI(\u2225\u02c6wj\u22251 < \u03b8) (j \u2208 Nd, \u2200i \u2208 Nm) with some \u02c6W \u2208 Rd\u00d7m, and let \u00afH \u2287 \u00afF be defined as in Lemma 1. Then under the condition of Eq. (3), we have:\n\n\u2211(j,i)\u2208\u00afF \u02c6\u03bb2ji \u2264 \u2211(j,i)\u2208\u00afH \u02c6\u03bb2ji \u2264 m\u03bb2\u2225\u00afW\u00afH \u2212 \u02c6W\u00afH\u22252 2,1/\u03b82.\n\nLemma 4 establishes an upper bound of \u2211(j,i)\u2208\u00afF \u02c6\u03bb2ji by \u2225\u00afW\u00afH \u2212 \u02c6W\u00afH\u22252 2,1, which is critical for building the recursive relationship between \u2225\u02c6W(\u2113) \u2212 \u00afW\u22252,1 and \u2225\u02c6W(\u2113\u22121) \u2212 \u00afW\u22252,1 in the proof of Theorem 1. This recursive relation is crucial for the shrinkage part of the error bound in Eq. (7).\n\n4.1 Proof of Theorem 1\n\nProof For notational simplicity, we denote the right-hand side of Eq. (11) as:\n\nu = m\u03c32\u03c1+max(\u00afr)(7.4\u00afr + 2.7 ln(2/\u03b7))/n. (14)\n\nBased on \u00afH \u2286 Gc(\u2113), Lemma 1 and Eq. (5), the following holds with probability larger than 1 \u2212 \u03b7:\n\n\u2225\u00af\u03a5Gc(\u2113)\u22252F = \u2225\u00af\u03a5\u00afH\u22252F + \u2225\u00af\u03a5Gc(\u2113)\\\u00afH\u22252F \u2264 u + |Gc(\u2113)\\\u00afH| \u00b7 \u2225\u00af\u03a5\u22252\u221e,\u221e \u2264 u + \u03bb2|Gc(\u2113)\\\u00afH|/144 \u2264 u + (1/144)m\u03bb2\u03b8\u22122\u2225\u02c6W(\u2113\u22121)Gc(\u2113)\\\u00afH \u2212 \u00afWGc(\u2113)\\\u00afH\u22252 2,1, (15)\n\nwhere the last inequality follows from \u2200(j, i) \u2208 Gc(\u2113)\\\u00afH: \u2225(\u02c6w(\u2113\u22121))j \u2212 \u00afwj\u22252 1/\u03b82 = \u2225(\u02c6w(\u2113\u22121))j\u22252 1/\u03b82 \u2265 1 \u21d2 |Gc(\u2113)\\\u00afH| \u2264 m\u03b8\u22122\u2225\u02c6W(\u2113\u22121)Gc(\u2113)\\\u00afH \u2212 \u00afWGc(\u2113)\\\u00afH\u22252 2,1. According to Eq. (12), we have:\n\n\u2225\u02c6W(\u2113) \u2212 \u00afW\u22252 2,1 = \u2225\u2206\u02c6W(\u2113)\u22252 2,1 \u2264 8m(1 + 1.5\u221a(2\u00afr/s))2 (4\u2225\u00af\u03a5Gc(\u2113)\u22252F + \u2211(j,i)\u2208\u00afF (\u02c6\u03bb(\u2113\u22121)ji)2)/(\u03c1\u2212min(2\u00afr + s))2 \u2264 78m(4u + (37/36)m\u03bb2\u03b8\u22122\u2225\u02c6W(\u2113\u22121) \u2212 \u00afW\u22252 2,1)/(\u03c1\u2212min(2\u00afr + s))2 \u2264 0.8\u2225\u02c6W(\u2113\u22121) \u2212 \u00afW\u22252 2,1 + 312mu/(\u03c1\u2212min(2\u00afr + s))2 \u2264 0.8\u2113\u2225\u02c6W(0) \u2212 \u00afW\u22252 2,1 + ((1 \u2212 0.8\u2113)/(1 \u2212 0.8)) \u00b7 312mu/(\u03c1\u2212min(2\u00afr + s))2 \u2264 0.8\u2113 \u00b7 9.12m2\u03bb2\u00afr/(\u03c1\u2212min(2\u00afr + s))2 + 1560mu/(\u03c1\u2212min(2\u00afr + s))2.\n\nIn the above derivation, the first inequality is due to Eq. (12); the second inequality is due to the assumption s \u2265 \u00afr in Theorem 1, Eq. (15) and Lemma 4; the third inequality is due to Eq. (6); the last inequality follows from Eq. (13) and 1 \u2212 0.8\u2113 \u2264 1 (\u2113 \u2265 1). Thus, following the inequality \u221a(a + b) \u2264 \u221aa + \u221ab (\u2200a, b \u2265 0), we obtain:\n\n\u2225\u02c6W(\u2113) \u2212 \u00afW\u22252,1 \u2264 0.8\u2113/2 \u00b7 9.1m\u03bb\u221a\u00afr/\u03c1\u2212min(2\u00afr + s) + 39.5\u221a(mu)/\u03c1\u2212min(2\u00afr + s).\n\nSubstituting Eq. (14) into the above inequality, we verify Theorem 1. \u25a1\n\n5 Experiments\n\nWe compare our proposed MSMTFL algorithm with three competing multi-task feature learning algorithms: the \u21131-norm multi-task feature learning algorithm (Lasso), the \u21131,2-norm multi-task feature learning algorithm (L1,2) [14] and the dirty model multi-task feature learning algorithm (DirtyMTL) [9]. In our experiments, we employ the quadratic loss function for all the compared algorithms.\n\n5.1 Synthetic Data Experiments\n\nWe generate synthetic data by setting the number of tasks as m; each task has n samples which are of dimensionality d; each element of the data matrix Xi \u2208 Rn\u00d7d (i \u2208 Nm) for the i-th task is sampled i.i.d. from the Gaussian distribution N(0, 1) and we then normalize all columns to length 1; each entry of the underlying true weight matrix \u00afW \u2208 Rd\u00d7m is sampled i.i.d. from the uniform distribution in the interval [\u221210, 10]; we randomly set 90% of the rows of \u00afW as zero vectors and 80% of the entries in the remaining nonzero rows as zeros; each entry of the noise \u03b4i \u2208 Rn is sampled i.i.d. from the Gaussian distribution N(0, \u03c32); the responses are computed as yi = Xi\u00afwi + \u03b4i (i \u2208 Nm).\nWe first report the averaged parameter estimation error \u2225\u02c6W \u2212 \u00afW\u22252,1 vs. stage (\u2113) plots for MSMTFL (Figure 1). We observe that the error decreases as \u2113 increases, which shows the advantage of our proposed algorithm over Lasso. This is consistent with the theoretical result in Theorem 1. 
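The data-generating protocol described above can be scripted directly. A sketch follows; the function name is ours, and the default sizes mirror one of the synthetic settings shown in Figure 1 (m = 15, n = 40, d = 250, \u03c3 = 0.01):

```python
import numpy as np

def gen_synthetic(m=15, n=40, d=250, sigma=0.01, seed=0):
    # i.i.d. N(0,1) design with unit-length columns; true weights uniform on
    # [-10, 10] with 90% zero rows, then 80% of the surviving entries zeroed;
    # responses y_i = X_i w_i + delta_i with N(0, sigma^2) noise.
    rng = np.random.default_rng(seed)
    W = rng.uniform(-10.0, 10.0, size=(d, m))
    zero_rows = rng.choice(d, size=int(0.9 * d), replace=False)
    W[zero_rows] = 0.0
    nonzero = np.flatnonzero(W)                     # entries of the surviving rows
    off = rng.choice(nonzero, size=int(0.8 * nonzero.size), replace=False)
    W.flat[off] = 0.0
    Xs, ys = [], []
    for i in range(m):
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=0)              # normalize columns to length 1
        y = X @ W[:, i] + sigma * rng.standard_normal(n)
        Xs.append(X)
        ys.append(y)
    return Xs, ys, W
```

This makes the sparsity pattern explicit: a feature (row) can survive the 90% row-level pruning yet still be zero for some tasks, which is exactly the shared-but-not-by-all structure the formulation targets.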
Moreover, the parameter estimation error decreases quickly and converges in a few stages.\nWe then report the averaged parameter estimation error \u2225\u02c6W \u2212 \u00afW\u22252,1 of the four algorithms in different parameter settings (Figure 2). For a fair comparison, we compare the smallest estimation errors of the four algorithms over all the parameter settings [25, 26]. As expected, the parameter estimation error of the MSMTFL algorithm is the smallest among the four algorithms. This empirical result demonstrates the effectiveness of the MSMTFL algorithm. We also have the following observations: (a) When \u03bb is large enough, all four algorithms tend to have the same parameter estimation error. This is reasonable, because the solutions \u02c6W obtained by the four algorithms are all zero matrices when \u03bb is very large. (b) The performance of the MSMTFL algorithm is similar for different \u03b8's, when \u03bb exceeds a certain value.\n\nFigure 1: Averaged parameter estimation error \u2225\u02c6W \u2212 \u00afW\u22252,1 vs. stage (\u2113) plots for MSMTFL on the synthetic data set (averaged over 10 runs). Here we set \u03bb = \u03b1\u221a(ln(dm)/n), \u03b8 = 50m\u03bb. Note that \u2113 = 1 corresponds to Lasso; the results show the stage-wise improvement over Lasso.\n\nFigure 2: Averaged parameter estimation error \u2225\u02c6W \u2212 \u00afW\u22252,1 vs. \u03bb plots on the synthetic data set (averaged over 10 runs). MSMTFL has the smallest parameter estimation error among the four algorithms. Both DirtyMTL and MSMTFL have two parameters; we set \u03bbs/\u03bbb = 1, 0.5, 0.2, 0.1 for DirtyMTL (1/m \u2264 \u03bbs/\u03bbb \u2264 1 was adopted in Jalali et al. (2010) [9]) and \u03b8/\u03bb = 50m, 10m, 2m, 0.4m for MSMTFL.\n\n5.2 Real-World Data Experiments\n\nWe conduct experiments on two real-world data sets: the MRI and Isolet data sets. 
(1) The MRI data set is collected from the ADNI database, which contains 675 patients' MRI data preprocessed using FreeSurfer1. The MRI data include 306 features and the response (target) is the Mini Mental State Examination (MMSE) score coming from 6 different time points: M06, M12, M18, M24, M36, and M48. We remove the samples which fail the MRI quality controls or have missing entries. Thus, we have 6 tasks, with each task corresponding to a time point, and the sample sizes corresponding to the 6 tasks are 648, 642, 293, 569, 389 and 87, respectively. (2) The Isolet data set2 is collected from 150 speakers who speak the name of each English letter of the alphabet twice. Thus, there are 52 samples from each speaker. The speakers are grouped into 5 subsets which respectively include 30 similar speakers, and the subsets are named Isolet1, Isolet2, Isolet3, Isolet4, and Isolet5. Thus, we naturally have 5 tasks, with each task corresponding to a subset. The 5 tasks respectively have 1560, 1560, 1560, 1558, and 1559 samples (three samples are historically missing), where each sample includes 617 features and the response is the English letter label (1-26).\nIn the experiments, we treat the MMSE scores and letter labels as the regression targets for the MRI data set and the Isolet data set, respectively. For both data sets, we randomly extract the training samples from each task with different training ratios (15%, 20% and 25%) and use the rest of the samples to form the test set. 
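The per-task random split used in this protocol can be sketched in a few lines (the helper name is ours; the per-task rounding of the training ratio is an assumption, since the text does not specify it):

```python
import numpy as np

def split_per_task(Xs, ys, ratio=0.15, seed=0):
    # Randomly draw `ratio` of each task's samples for training and
    # use the remaining samples of that task for testing.
    rng = np.random.default_rng(seed)
    train, test = [], []
    for X, y in zip(Xs, ys):
        idx = rng.permutation(len(y))
        k = int(round(ratio * len(y)))
        tr, te = idx[:k], idx[k:]
        train.append((X[tr], y[tr]))
        test.append((X[te], y[te]))
    return train, test
```

Splitting within each task (rather than pooling all tasks) keeps every task represented at every training ratio, which matters here because the MRI tasks have very different sample sizes.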
We evaluate the four multi-task feature learning algorithms in terms of the normalized mean squared error (nMSE) and the averaged mean squared error (aMSE), which are commonly used in multi-task learning problems [28, 29].\n\n1www.loni.ucla.edu/ADNI/\n2www.zjucadcg.cn/dengcai/Data/data.html\n\n[Figure 1-2 plots omitted; the three synthetic settings shown are (m, n, d, \u03c3) = (15, 40, 250, 0.01), (20, 30, 200, 0.005) and (10, 60, 300, 0.001).]\n\nTable 1: Comparison of four multi-task feature learning algorithms on the MRI data set in terms of averaged nMSE and aMSE (standard deviation), averaged over 10 random splittings.\n\nmeasure | training ratio | Lasso | L1,2 | DirtyMTL | MSMTFL\nnMSE | 0.15 | 0.6651(0.0280) | 0.6633(0.0470) | 0.6224(0.0265) | 0.5539(0.0154)\nnMSE | 0.20 | 0.6254(0.0212) | 0.6489(0.0275) | 0.6140(0.0185) | 0.5542(0.0139)\nnMSE | 0.25 | 0.6105(0.0186) | 0.6577(0.0194) | 0.6136(0.0180) | 0.5507(0.0142)\naMSE | 0.15 | 0.0189(0.0008) | 0.0187(0.0010) | 0.0172(0.0006) | 0.0159(0.0004)\naMSE | 0.20 | 0.0179(0.0006) | 0.0184(0.0005) | 0.0171(0.0005) | 0.0161(0.0004)\naMSE | 0.25 | 0.0172(0.0009) | 0.0183(0.0006) | 0.0167(0.0008) | 0.0157(0.0006)\n\n
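The text does not spell out the exact nMSE and aMSE formulas, so the sketch below follows one common multi-task learning convention (an assumption on our part, not the paper's definition): the pooled squared error is normalized by the variance of the pooled targets for nMSE, and by their mean squared value for aMSE:

```python
import numpy as np

def nmse_amse(y_true_list, y_pred_list):
    # Pool the per-task test errors and targets, then normalize.
    # Assumed convention: nMSE = MSE / Var(y), aMSE = MSE / mean(y^2).
    err = np.concatenate([yt - yp for yt, yp in zip(y_true_list, y_pred_list)])
    y = np.concatenate(y_true_list)
    mse = np.mean(err ** 2)
    return mse / np.var(y), mse / np.mean(y ** 2)
```

Both metrics are scale-free, which makes scores comparable across the MRI (MMSE) and Isolet (letter label) targets.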
For each training ratio, both nMSE and aMSE are averaged over 10 random splittings of the training and test sets, and the standard deviation is also shown. All parameters of the four algorithms are tuned via 3-fold cross validation.

Figure 3: Averaged test error (nMSE and aMSE) vs. training ratio on the Isolet data set. The results are averaged over 10 random splittings.

Table 1 and Figure 3 show the experimental results in terms of the averaged nMSE (aMSE) and the standard deviation. From these results, we observe that: (a) Our proposed MSMTFL algorithm outperforms all the competing feature learning algorithms on both data sets, with the smallest regression errors (nMSE and aMSE) as well as the smallest standard deviations. (b) On the MRI data set, the MSMTFL algorithm performs well even with a small training ratio: the performance at the 15% training ratio is comparable to that at the 25% training ratio. (c) On the Isolet data set, as the training ratio increases from 15% to 25%, the performance of the MSMTFL algorithm improves and its superiority over the other three algorithms becomes more significant. These results demonstrate the effectiveness of the proposed algorithm.

6 Conclusions

In this paper, we propose a non-convex multi-task feature learning formulation based on the capped-ℓ1,ℓ1 regularization. The proposed formulation learns the specific features of each task as well as the common features shared among tasks. We propose to solve the non-convex optimization problem by employing a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm, using concave duality. We also present a detailed theoretical analysis in terms of the parameter estimation error bound for the MSMTFL algorithm.
The analysis shows that our MSMTFL algorithm achieves good performance under the sparse eigenvalue condition, which is weaker than the incoherence condition. Experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our proposed MSMTFL algorithm in comparison with the state of the art multi-task feature learning algorithms. In future work, we will focus on a general non-convex regularization framework for multi-task feature learning settings (involving different loss functions and non-convex regularization terms) and derive the corresponding theoretical bounds.

Acknowledgements

This work is supported in part by the 973 Program (2013CB329503), NSFC (Grant No. 91120301, 60835002 and 61075004), NIH (R01 LM010730) and NSF (IIS-0953662, CCF-1025177).

References
[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
[2] J. Bi, T. Xiong, S. Yu, M. Dundar, and R. Rao. An improved multi-task learning approach with applications in medical diagnosis. Machine Learning and Knowledge Discovery in Databases, pages 117-132, 2008.
[3] E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, 2005.
[4] J. Chen, J. Liu, and J. Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. In SIGKDD, pages 1179-1188, 2010.
[5] D. Donoho, M. Elad, and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6-18, 2006.
[6] T. Evgeniou and M. Pontil. Regularized multi-task learning. In SIGKDD, pages 109-117, 2004.
[7] P. Gong, J. Ye, and C. Zhang. Multi-stage multi-task feature learning.
arXiv:1210.5806, 2012.
[8] P. Gong, J. Ye, and C. Zhang. Robust multi-task feature learning. In SIGKDD, pages 895-903, 2012.
[9] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In NIPS, pages 964-972, 2010.
[10] S. Kim and E. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, pages 543-550, 2009.
[11] K. Lounici, M. Pontil, A. Tsybakov, and S. Van De Geer. Taking advantage of sparsity in multi-task learning. In COLT, pages 73-82, 2009.
[12] S. Negahban and M. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. In NIPS, pages 1161-1168, 2008.
[13] S. Negahban and M. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069-1097, 2011.
[14] G. Obozinski, B. Taskar, and M. Jordan. Multi-task feature selection. Technical report, Statistics Department, UC Berkeley, 2006.
[15] G. Obozinski, M. Wainwright, and M. Jordan. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1):1-47, 2011.
[16] S. Parameswaran and K. Weinberger. Large margin multi-task metric learning. In NIPS, pages 1867-1875, 2010.
[17] N. Quadrianto, A. Smola, T. Caetano, S. Vishwanathan, and J. Petterson. Multitask learning without label correspondences. In NIPS, pages 1957-1965, 2010.
[18] A. Schwaighofer, V. Tresp, and K. Yu. Learning Gaussian process kernels via hierarchical Bayes. In NIPS, pages 1209-1216, 2005.
[19] S. Van De Geer and P. Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.
[20] X. Yang, S. Kim, and E. Xing. Heterogeneous multitask learning with joint sparsity constraints. In NIPS, pages 2151-2159, 2009.
[21] K. Yu, V.
Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In ICML, pages 1012-1019, 2005.
[22] C. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567-1594, 2008.
[23] C. Zhang and T. Zhang. A general theory of concave regularization for high dimensional sparse estimation problems. Statistical Science, 2012.
[24] J. Zhang, Z. Ghahramani, and Y. Yang. Learning multiple related tasks using latent independent component analysis. In NIPS, pages 1585-1592, 2006.
[25] T. Zhang. Some sharp performance bounds for least squares regression with ℓ1 regularization. The Annals of Statistics, 37:2109-2144, 2009.
[26] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. JMLR, 11:1081-1107, 2010.
[27] T. Zhang. Multi-stage convex relaxation for feature selection. Bernoulli, 2012.
[28] Y. Zhang and D. Yeung. Multi-task learning using generalized t process. In AISTATS, 2010.
[29] J. Zhou, J. Chen, and J. Ye. Clustered multi-task learning via alternating structure optimization. In NIPS, pages 702-710, 2011.