{"title": "Moreau-Yosida Regularization for Grouped Tree Structure Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1459, "page_last": 1467, "abstract": "We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.", "full_text": "Moreau-Yosida Regularization for Grouped Tree Structure Learning

Jun Liu
Computer Science and Engineering
Arizona State University
J.Liu@asu.edu

Jieping Ye
Computer Science and Engineering
Arizona State University
Jieping.Ye@asu.edu

Abstract

We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. 
The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.

1 Introduction

Many machine learning algorithms can be formulated as a penalized optimization problem:

min_x l(x) + λφ(x),    (1)

where l(x) is the empirical loss function (e.g., the least squares loss and the logistic loss), λ > 0 is the regularization parameter, and φ(x) is the penalty term. Recently, sparse learning via ℓ1 regularization [20] and its various extensions has received increasing attention in many areas including machine learning, signal processing, and statistics. In particular, the group Lasso [1, 16, 22] utilizes the group information of the features, and yields a solution with grouped sparsity. The traditional group Lasso assumes that the groups are non-overlapping. However, in many applications the features may form more complex overlapping groups. Zhao et al. 
[23] extended the group Lasso to the case of overlapping groups, imposing hierarchical relationships for the features. Jacob et al. [6] considered group Lasso with overlaps, and studied theoretical properties of the estimator. Jenatton et al. [7] considered the consistency property of the structured overlapping group Lasso, and designed an active set algorithm.

In many applications, the features can naturally be represented using certain tree structures. For example, the image pixels of the face image shown in Figure 1 can be represented as a tree, where each parent node contains a series of child nodes that enjoy spatial locality; genes/proteins may form certain hierarchical tree structures. Kim and Xing [9] studied the tree structured group Lasso for multi-task learning, where multiple related tasks follow a tree structure. One challenge in the practical application of the tree structured group Lasso is that the resulting optimization problem is much more difficult to solve than Lasso and group Lasso, due to the complex regularization.

Figure 1: Illustration of the tree structure of a two-dimensional face image. The 64 × 64 image (a) can be divided into 16 sub-images in (b) according to the spatial locality, where the sub-images can be viewed as the child nodes of (a). Similarly, each 16 × 16 sub-image in (b) can be divided into 16 sub-images in (c), and such a process is repeated for the sub-images in (c) to get (d).

Figure 2: A sample index tree for illustration. Root: G^0_1 = {1, 2, 3, 4, 5, 6, 7, 8}. Depth 1: G^1_1 = {1, 2}, G^1_2 = {3, 4, 5, 6}, G^1_3 = {7, 8}. Depth 2: G^2_1 = {1}, G^2_2 = {2}, G^2_3 = {3, 4}, G^2_4 = {5, 6}.

In this paper, we develop an efficient algorithm for the tree structured group Lasso, i.e., the optimization problem (1) with φ(·) being the grouped tree structure regularization (see Equation 2). One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization [17, 21] associated with the grouped tree structure. The main technical contributions of this paper include: (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and the resulting algorithm for the tree structured group Lasso has a time complexity comparable to Lasso and group Lasso, and (2) we develop an efficient algorithm for determining the effective interval for the parameter λ, which is important in the practical application of the algorithm. We have performed experimental studies using the AR and JAFFE face data sets, where the features form a hierarchical tree structure based on the spatial locality as shown in Figure 1. Our experimental results demonstrate the efficiency and effectiveness of the proposed algorithm. Note that while the present paper was under review, we became aware of a recent work by Jenatton et al. [8] which applied block coordinate ascent in the dual and showed that the algorithm converges in one pass.

2 Grouped Tree Structure Regularization

We begin with the definition of the so-called index tree:

Definition 1. For an index tree T of depth d, we let T_i = {G^i_1, G^i_2, . . . , G^i_{n_i}} contain all the node(s) corresponding to depth i, where n_0 = 1, G^0_1 = {1, 2, . . . , p} and n_i ≥ 1, i = 1, 2, . . . , d. The nodes satisfy the following conditions: 1) the nodes from the same depth level have non-overlapping indices, i.e., G^i_j ∩ G^i_k = ∅, ∀i = 1, . . . , d, j ≠ k, 1 ≤ j, k ≤ n_i; and 2) let G^{i-1}_{j_0} be the parent node of a non-root node G^i_j; then G^i_j ⊆ G^{i-1}_{j_0}.

Figure 2 shows a sample index tree. We can observe that 1) the index sets from different nodes may overlap, e.g., any parent node overlaps with its child nodes; 2) the nodes from the same depth level do not overlap; and 3) the index set of a child node is a subset of that of its parent node.

The grouped tree structure regularization is defined as:

φ(x) = Σ_{i=0}^{d} Σ_{j=1}^{n_i} w^i_j ‖x_{G^i_j}‖,    (2)

where x ∈ R^p, w^i_j ≥ 0 (i = 0, 1, . . . , d, j = 1, 2, . . . , n_i) is the pre-defined weight for the node G^i_j, ‖·‖ is the Euclidean norm, and x_{G^i_j} is a vector composed of the entries of x with the indices in G^i_j.

In the next section, we study the Moreau-Yosida regularization [17, 21] associated with (2), develop an analytical solution for such a regularization, propose an efficient algorithm for solving (1), and specify the meaningful interval for the regularization parameter λ.

3 Moreau-Yosida Regularization of φ(·)

The Moreau-Yosida regularization associated with the grouped tree structure regularization φ(·) for a given v ∈ R^p is given by:

φ_λ(v) = min_x { f(x) = (1/2)‖x − v‖^2 + λ Σ_{i=0}^{d} Σ_{j=1}^{n_i} w^i_j ‖x_{G^i_j}‖ },    (3)

for some λ > 0. Denote the minimizer of (3) as π_λ(v). The Moreau-Yosida regularization has many useful properties: 1) φ_λ(·) is continuously differentiable despite the fact that φ(·) is non-smooth; 2) π_λ(·) is a non-expansive operator. More properties on the general Moreau-Yosida regularization can be found in [5, 10]. 
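To make the regularizer in (2) concrete, the following is a minimal Python sketch that evaluates φ(x) for a given index tree. The encoding of the tree as per-depth lists of (indices, weight) pairs and the function name are our own illustrative choices, not from the paper's code.

```python
import math

def tree_penalty(x, groups_by_depth):
    """Evaluate the grouped tree structure regularizer of Eq. (2):
    a weighted sum of Euclidean norms of x restricted to each node.
    `groups_by_depth` is a hypothetical encoding: one list per depth,
    each entry an (index_list, weight) pair for a node G^i_j."""
    total = 0.0
    for level in groups_by_depth:
        for idx, w in level:
            total += w * math.sqrt(sum(x[k] ** 2 for k in idx))
    return total
```

For instance, on the Figure 2 tree with unit weights and x = [0, 0, 0, 0, 1, 1, 0, 0] (0-indexed), only the root, G^1_2, and G^2_4 contribute, each a norm of √2, so φ(x) = 3√2.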
Note that f(·) in (3) is indeed a special case of the problem (1) with l(x) = (1/2)‖x − v‖^2. Our recent study has shown that the efficient optimization of the Moreau-Yosida regularization is key to many optimization algorithms [13, Section 2]. Next, we focus on the efficient optimization of (3). For convenience of subsequent discussion, we denote λ^i_j = λ w^i_j.

3.1 An Analytical Solution

We show that the minimization of (3) admits an analytical solution. We first present the detailed procedure for finding the minimizer in Algorithm 1.

Algorithm 1 Moreau-Yosida Regularization of the tree structured group Lasso (MYtgLasso)
Input: v ∈ R^p, the index tree T with nodes G^i_j (i = 0, 1, . . . , d, j = 1, 2, . . . , n_i) that satisfy Definition 1, the weights w^i_j ≥ 0 (i = 0, 1, . . . , d, j = 1, 2, . . . , n_i), λ > 0, and λ^i_j = λ w^i_j
Output: u^0 ∈ R^p
1: Set u^{d+1} = v    (4)
2: for i = d to 0 do
3:   for j = 1 to n_i do
4:     Compute
       u^i_{G^i_j} = 0, if ‖u^{i+1}_{G^i_j}‖ ≤ λ^i_j;
       u^i_{G^i_j} = ((‖u^{i+1}_{G^i_j}‖ − λ^i_j) / ‖u^{i+1}_{G^i_j}‖) u^{i+1}_{G^i_j}, if ‖u^{i+1}_{G^i_j}‖ > λ^i_j    (5)
5:   end for
6: end for

In the implementation of the MYtgLasso algorithm, we only need to maintain a working variable u, which is initialized with v. We then traverse the index tree T in the reverse breadth-first order to update u. At the traversed node G^i_j, we update u_{G^i_j} according to the operation in (5), which reduces the Euclidean norm of u_{G^i_j} by at most λ^i_j. The time complexity of MYtgLasso is O(Σ_{i=0}^{d} Σ_{j=1}^{n_i} |G^i_j|). By using Definition 1, we have Σ_{j=1}^{n_i} |G^i_j| ≤ p. Therefore, the time complexity of MYtgLasso is O(pd). 
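As a sanity check on Algorithm 1, here is a small, self-contained Python sketch of the bottom-up traversal. The data layout (per-depth lists of (indices, weight) pairs) and the function name are our own illustrative choices.

```python
import math

def my_tg_lasso(v, groups_by_depth, lam):
    """Sketch of Algorithm 1 (MYtgLasso): traverse the index tree in
    reverse breadth-first order and shrink each group u_{G^i_j} toward
    zero by at most lam * w^i_j in Euclidean norm (group soft-thresholding)."""
    u = [float(x) for x in v]
    for level in reversed(groups_by_depth):  # depth d down to the root
        for idx, w in level:
            norm = math.sqrt(sum(u[k] ** 2 for k in idx))
            thresh = lam * w
            if norm <= thresh:
                for k in idx:
                    u[k] = 0.0
            else:
                scale = (norm - thresh) / norm
                for k in idx:
                    u[k] *= scale
    return u
```

On the Figure 2 tree with unit weights, λ = √2 and v = [1, 2, 1, 1, 4, 4, 1, 1], the traversal reproduces the solution x* = [0, 0, 0, 0, 1, 1, 0, 0] discussed in the text, up to floating-point error.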
If the tree is balanced, i.e., d = O(log p), then the time complexity of MYtgLasso is O(p log p).

MYtgLasso can help explain why the structured group sparsity can be induced. Let us analyze the tree given in Figure 2, with the solution denoted as x*. We let w^i_j = 1, ∀i, j, λ = √2, and v = [1, 2, 1, 1, 4, 4, 1, 1]^T. After traversing the nodes of depth 2, we can get that the elements of x* with indices in G^2_1 and G^2_3 are zero; and when the traversal continues to the nodes of depth 1, the elements of x* with indices in G^1_1 and G^1_3 are set to zero, but those with indices in G^2_4 are still nonzero. Finally, after traversing the root node, we obtain x* = [0, 0, 0, 0, 1, 1, 0, 0]^T.

Next, we show that MYtgLasso finds the exact minimizer of (3). The main result is summarized in the following theorem:

Theorem 1. u^0 returned by Algorithm 1 is the unique solution to (3).

Before giving the detailed proof for Theorem 1, we introduce some notations, and present several technical lemmas.

Define the mapping φ^i_j : R^p → R as

φ^i_j(x) = ‖x_{G^i_j}‖.    (6)

We can then express λφ(x), with φ(x) defined in (2), as:

λφ(x) = Σ_{i=0}^{d} Σ_{j=1}^{n_i} λ^i_j φ^i_j(x).    (7)

The subdifferential of f(·) defined in (3) at the point x can be written as:

∂f(x) = x − v + Σ_{i=0}^{d} Σ_{j=1}^{n_i} λ^i_j ∂φ^i_j(x),    (8)

where

∂φ^i_j(x) = { y ∈ R^p : ‖y‖ ≤ 1, y_{Ḡ^i_j} = 0 }, if x_{G^i_j} = 0;
∂φ^i_j(x) = { y ∈ R^p : y_{G^i_j} = x_{G^i_j} / ‖x_{G^i_j}‖, y_{Ḡ^i_j} = 0 }, if x_{G^i_j} ≠ 0,

and Ḡ^i_j denotes the complementary set of G^i_j.

Lemma 1. For any 1 ≤ i ≤ d, 1 ≤ j ≤ n_i, we can find a unique path from the node G^i_j to the root node G^0_1. Let the nodes on this path be G^l_{r_l}, for l = 0, 1, . . . , i with r_0 = 1 and r_i = j. We have

G^i_j ⊆ G^l_{r_l}, ∀l = 0, 1, . . . , i − 1,    (9)
G^i_j ∩ G^l_r = ∅, ∀r ≠ r_l, l = 1, 2, . . . , i − 1, r = 1, 2, . . . , n_l.    (10)

Proof: According to Definition 1, we can find a unique path from the node G^i_j to the root node G^0_1. In addition, based on the structure of the index tree, we have (9) and (10). □

Lemma 2. For any i = 1, 2, . . . , d, j = 1, 2, . . . , n_i, we have

u^i_{G^i_j} ∈ u^{i+1}_{G^i_j} − λ^i_j (∂φ^i_j(u^i))_{G^i_j},    (11)
∂φ^i_j(u^i) ⊆ ∂φ^i_j(u^0).    (12)

Proof: We can verify (11) using (5), (6) and (8).

For (12), it follows from (6) and (8) that it is sufficient to verify that

u^0_{G^i_j} = α^i_j u^i_{G^i_j}, for some α^i_j ≥ 0.    (13)

It follows from Lemma 1 that we can find a unique path from G^i_j to G^0_1. Denote the nodes on the path as G^l_{r_l}, where l = 0, 1, . . . , i, r_i = j, and r_0 = 1. We first analyze the relationship between u^i_{G^i_j} and u^{i−1}_{G^i_j}. If ‖u^i_{G^{i−1}_{r_{i−1}}}‖ ≤ λ^{i−1}_{r_{i−1}}, we have u^{i−1}_{G^{i−1}_{r_{i−1}}} = 0, which leads to u^{i−1}_{G^i_j} = 0 by using (9). Otherwise, if ‖u^i_{G^{i−1}_{r_{i−1}}}‖ > λ^{i−1}_{r_{i−1}}, we have

u^{i−1}_{G^{i−1}_{r_{i−1}}} = ((‖u^i_{G^{i−1}_{r_{i−1}}}‖ − λ^{i−1}_{r_{i−1}}) / ‖u^i_{G^{i−1}_{r_{i−1}}}‖) u^i_{G^{i−1}_{r_{i−1}}},

which leads to

u^{i−1}_{G^i_j} = ((‖u^i_{G^{i−1}_{r_{i−1}}}‖ − λ^{i−1}_{r_{i−1}}) / ‖u^i_{G^{i−1}_{r_{i−1}}}‖) u^i_{G^i_j}

by using (9). Therefore, we have

u^{i−1}_{G^i_j} = β_i u^i_{G^i_j}, for some β_i ≥ 0.    (14)

By a similar argument, we have

u^{l−1}_{G^l_{r_l}} = β_l u^l_{G^l_{r_l}}, β_l ≥ 0, ∀l = 1, 2, . . . , i − 1.    (15)

Together with (9), we have

u^{l−1}_{G^i_j} = β_l u^l_{G^i_j}, β_l ≥ 0, ∀l = 1, 2, . . . , i − 1.    (16)

From (14) and (16), we show that (13) holds with α^i_j = Π^i_{l=1} β_l. This completes the proof. □

We are now ready to prove our main result:

Proof of Theorem 1: It is easy to verify that f(·) defined in (3) is strongly convex, thus it admits a unique minimizer. Our methodology for the proof is to show that

0 ∈ ∂f(u^0),    (17)

which is the sufficient and necessary condition for u^0 to be the minimizer of f(·).

According to Definition 1, the leaf nodes are non-overlapping. We assume that the union of the leaf nodes equals {1, 2, . . . , p}; otherwise, we can add to the index tree the additional leaf nodes with weight 0 to satisfy the aforementioned assumption. Clearly, the original index tree and the new index tree with the additional leaf nodes of weight 0 yield the same penalty φ(·) in (2), the same Moreau-Yosida regularization in (3), and the same solution from Algorithm 1. Therefore, to prove (17), it suffices to show 0 ∈ (∂f(u^0))_{G^i_j} for all the leaf nodes G^i_j. Next, we focus on establishing the following relationship:

0 ∈ (∂f(u^0))_{G^d_1}.    (18)

It follows from Lemma 1 that we can find a unique path from the node G^d_1 to the root G^0_1. Let the nodes on this path be G^l_{r_l}, for l = 0, 1, . . . , d with r_0 = 1 and r_d = 1. By using (10) of Lemma 1, we can get that the nodes that contain the index set G^d_1 are exactly on the aforementioned path. In other words, ∀x, we have

(∂φ^l_r(x))_{G^d_1} = {0}, ∀r ≠ r_l, l = 1, 2, . . . , d − 1, r = 1, 2, . . . , n_l,    (19)

by using (6) and (8).

Applying (11) and (12) of Lemma 2 to each node on the aforementioned path, we have

u^{l+1}_{G^l_{r_l}} − u^l_{G^l_{r_l}} ∈ λ^l_{r_l} (∂φ^l_{r_l}(u^l))_{G^l_{r_l}} ⊆ λ^l_{r_l} (∂φ^l_{r_l}(u^0))_{G^l_{r_l}}, ∀l = 0, 1, . . . , d.    (20)

Making use of (9), we obtain from (20) the following relationship:

u^{l+1}_{G^d_1} − u^l_{G^d_1} ∈ λ^l_{r_l} (∂φ^l_{r_l}(u^0))_{G^d_1}, ∀l = 0, 1, . . . , d.    (21)

Adding (21) for l = 0, 1, . . . , d, we have

u^{d+1}_{G^d_1} − u^0_{G^d_1} ∈ Σ_{l=0}^{d} λ^l_{r_l} (∂φ^l_{r_l}(u^0))_{G^d_1}.    (22)

It follows from (4), (7), (19) and (22) that (18) holds.

Similarly, we have 0 ∈ (∂f(u^0))_{G^i_j} for the other leaf nodes G^i_j. Thus, we have (17). □

3.2 The Proposed Optimization Algorithm

With the analytical solution for π_λ(·), the minimizer of (3), we can apply many existing methods for solving (1). First, we show in the following lemma that the optimal solution to (1) can be computed as a fixed point. We shall show in Section 3.3 that the result in this lemma can also help determine the interval for the values of λ.

Lemma 3. Let x* be an optimal solution to (1). Then, x* satisfies:

x* = π_{λτ}(x* − τ l'(x*)), ∀τ > 0.    (23)

Proof: x* is an optimal solution to (1) if and only if

0 ∈ l'(x*) + λ∂φ(x*),    (24)

which leads to

0 ∈ x* − (x* − τ l'(x*)) + λτ ∂φ(x*), ∀τ > 0.    (25)

Thus, we have x* = arg min_x (1/2)‖x − (x* − τ l'(x*))‖^2 + λτ φ(x). Recall that π_λ(·) is the minimizer of (3). We have (23). □

It follows from Lemma 3 that we can apply the fixed point continuation method [4] for solving (1). It is interesting to note that, with an appropriately chosen τ, the scheme in (23) indeed corresponds to the gradient method developed for composite function optimization [2, 19], achieving the global convergence rate of O(1/k) for k iterations. 
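Lemma 3 is easy to check numerically on a one-dimensional instance. In the sketch below (our own toy example, not code from the paper) the tree degenerates to a single group, so π_λ is scalar soft-thresholding, and the lasso solution x* = soft(b, λ) of min_x (1/2)(x − b)^2 + λ|x| satisfies the fixed-point equation x* = π_{λτ}(x* − τ l'(x*)) for every τ > 0.

```python
def soft(v, t):
    """Scalar soft-thresholding: the prox of t * |.| (a one-node tree)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def is_fixed_point(b, lam, tau):
    """Check Lemma 3 for l(x) = 0.5 * (x - b)**2, so l'(x) = x - b and
    the minimizer of (1) is x_star = soft(b, lam)."""
    x_star = soft(b, lam)
    grad = x_star - b
    return abs(soft(x_star - tau * grad, lam * tau) - x_star) < 1e-12
```

For b = 3 and λ = 1 the minimizer is x* = 2, and the equation holds for any step size τ, which is what makes the fixed-point and proximal-gradient schemes applicable.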
In addition, the scheme in (23) can be accelerated to obtain the accelerated gradient descent [2, 19], where the Moreau-Yosida regularization also needs to be evaluated in each of its iterations. We employ the accelerated gradient descent developed in [2] for the optimization in this paper. The algorithm is called "tgLasso", which stands for the tree structured group Lasso. Note that tgLasso includes our previous algorithm [11] as a special case, when the index tree is of depth 1 and w^0_1 = 0.

3.3 The Effective Interval for the Values of λ

When estimating the model parameters via (1), a key issue is to choose the appropriate values for the regularization parameter λ. A commonly used approach is to select the regularization parameter from a set of candidate values, which, however, need to be pre-specified in advance. Therefore, it is essential to specify the effective interval for the values of λ. An analysis of MYtgLasso in Algorithm 1 shows that, with increasing λ, the entries of the solution to (3) are monotonically decreasing. Intuitively, the solution to (3) shall be exactly zero if λ is sufficiently large and all the entries of x are penalized in φ(x). Next, we summarize the main results of this subsection.

Theorem 2. The zero point is a solution to (1) if and only if the zero point is a solution to (3) with v = −l'(0). For the penalty φ(x), let us assume that all entries of x are penalized, i.e., ∀l ∈ {1, 2, . . . , p}, there exists at least one node G^i_j that contains l and meanwhile w^i_j > 0. Then, for any 0 < ‖−l'(0)‖ < +∞, there exists a unique λmax < +∞ satisfying: 1) if λ ≥ λmax, the zero point is a solution to (1), and 2) if 0 < λ < λmax, the zero point is not a solution to (1).

Proof: If x* = 0 is the solution to (1), we have (24). 
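The accelerated scheme adopted from [2] can be written generically: a gradient step on the smooth loss, followed by the Moreau-Yosida proximal step, plus momentum. The sketch below is a generic FISTA-style loop under our own naming; `prox` would be the MYtgLasso operator for the tree penalty (any prox works here, which is why the analytical solution of Section 3.1 is the key ingredient).

```python
import math

def accelerated_prox_gradient(grad, prox, step, x0, n_iter=200):
    """Generic accelerated proximal gradient sketch: `grad` is the gradient
    of the smooth loss l, `prox` the proximal operator of the scaled penalty
    (e.g., MYtgLasso with parameter lambda * step), `step` a step size 1/L
    for an L-Lipschitz gradient. Names are illustrative, not from the paper."""
    x = list(x0)
    y = list(x0)
    t = 1.0
    for _ in range(n_iter):
        g = grad(y)
        x_new = prox([yi - step * gi for yi, gi in zip(y, g)])
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [xn + ((t - 1.0) / t_new) * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x
```

As a toy check, with l(x) = (1/2)(x − 3)^2, λ = 1 and the ℓ1 prox, the iterates converge to the lasso solution x* = 2.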
Setting τ = 1 in (23), we obtain that x* = 0 is also the solution to (3) with v = −l'(0). Conversely, if x* = 0 is the solution to (3) with v = −l'(0), we have 0 ∈ l'(0) + λ∂φ(0), which indicates that x* = 0 is the solution to (1).

The function φ(x) is closed convex. According to [18, Chapter 3.1.5], ∂φ(0) is a closed convex and non-empty bounded set. From (8), it is clear that 0 ∈ ∂φ(0). Therefore, we have ‖x‖ ≤ R, ∀x ∈ ∂φ(0), where R is a finite radius constant. Let

S = {x : x = −αR l'(0)/‖l'(0)‖, α ∈ [0, 1]}

be the line segment from 0 to −R l'(0)/‖l'(0)‖. It is obvious that S is closed convex and bounded. Define I = S ∩ ∂φ(0), which is clearly closed convex and bounded. Define

λ̃max = ‖l'(0)‖ / max_{x ∈ I} ‖x‖.

It follows from ‖l'(0)‖ > 0 and the boundedness of I that λ̃max > 0. We first show λ̃max < +∞. Otherwise, we would have I = {0}; thus, ∀λ > 0, we have −l'(0)/λ ∉ ∂φ(0), which indicates that 0 is neither the solution to (1) nor to (3) with v = −l'(0). Recall the assumption that, ∀l ∈ {1, 2, . . . , p}, there exists at least one node G^i_j that contains l and meanwhile w^i_j > 0. It follows from Algorithm 1 that there exists a λ̃ < +∞ such that when λ > λ̃, 0 is a solution to (3) with v = −l'(0), leading to a contradiction. Therefore, we have 0 < λ̃max < +∞. Let λmax = λ̃max. 
The arguments hold since 1) if λ ≥ λmax, then −l'(0)/λ ∈ I ⊆ ∂φ(0); and 2) if 0 < λ < λmax, then −l'(0)/λ ∉ ∂φ(0). □

When l'(0) = 0, the problem (1) has a trivial zero solution. We next focus on the nontrivial case l'(0) ≠ 0. We present the algorithm for efficiently solving λmax in Algorithm 2. In Step 1, λ0 is an initial guess of the solution. Our empirical study shows that λ0 = sqrt( ‖l'(0)‖^2 / Σ_{i=0}^{d} Σ_{j=1}^{n_i} (w^i_j)^2 ) works quite well. In Steps 2-6, we specify an interval [λ1, λ2] in which λmax resides. Finally, in Steps 7-14, we apply bisection for computing λmax.

Algorithm 2 Finding λmax via Bisection
Input: l'(0), the index tree T with nodes G^i_j (i = 0, 1, . . . , d, j = 1, 2, . . . , n_i), the weights w^i_j ≥ 0 (i = 0, 1, . . . , d, j = 1, 2, . . . , n_i), λ0, and δ = 10^{−10}
Output: λmax
1: Set λ = λ0
2: if π_λ(−l'(0)) = 0 then
3:   Set λ2 = λ, and find the largest λ1 = 2^{−i}λ, i = 1, 2, . . ., such that π_{λ1}(−l'(0)) ≠ 0
4: else
5:   Set λ1 = λ, and find the smallest λ2 = 2^{i}λ, i = 1, 2, . . ., such that π_{λ2}(−l'(0)) = 0
6: end if
7: while λ2 − λ1 ≥ δ do
8:   Set λ = (λ1 + λ2)/2
9:   if π_λ(−l'(0)) = 0 then
10:    Set λ2 = λ
11:  else
12:    Set λ1 = λ
13:  end if
14: end while
15: λmax = λ

4 Experiments

We have conducted experiments to evaluate the efficiency and effectiveness of the proposed tgLasso algorithm on the face data sets JAFFE [14] and AR [15]. 
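The bracketing-plus-bisection idea of Algorithm 2 can be sketched generically in a few lines; `prox(v, lam)` stands in for the MYtgLasso operator π_λ, and all names below are our own illustrative choices. Following the paper, we assume l'(0) ≠ 0 so that a positive λmax exists.

```python
def find_lambda_max(prox, g0, lam0, delta=1e-10):
    """Sketch of Algorithm 2's idea: the zero point solves (1) iff
    pi_lambda(-l'(0)) = 0, which holds exactly for lambda >= lambda_max.
    Bracket that threshold by doubling/halving, then bisect to `delta`."""
    neg_g = [-g for g in g0]

    def zero_at(lam):
        return all(u == 0.0 for u in prox(neg_g, lam))

    if zero_at(lam0):
        lam2, lam1 = lam0, lam0 / 2.0
        while zero_at(lam1):      # largest 2^{-i} * lam0 that is not yet zero
            lam1 /= 2.0
    else:
        lam1, lam2 = lam0, lam0 * 2.0
        while not zero_at(lam2):  # smallest 2^{i} * lam0 that is zero
            lam2 *= 2.0
    while lam2 - lam1 >= delta:
        lam = (lam1 + lam2) / 2.0
        if zero_at(lam):
            lam2 = lam
        else:
            lam1 = lam
    return lam2

def single_group_prox(v, lam):
    """Toy pi_lambda for a one-node tree with unit weight: group
    soft-thresholding of the whole vector."""
    norm = sum(x * x for x in v) ** 0.5
    if norm <= lam:
        return [0.0] * len(v)
    return [(norm - lam) / norm * x for x in v]
```

For the single-group toy, zero solves the problem exactly when λ ≥ ‖l'(0)‖, so with l'(0) = (3, 4) the routine should return λmax ≈ 5.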
JAFFE contains 213 images of ten Japanese actresses with seven facial expressions: neutral, happy, disgust, fear, anger, sadness, and surprise. We used a subset of AR that contains 400 images corresponding to 100 subjects, with each subject containing four facial expressions: neutral, smile, anger, and scream. For both data sets, we resize the images to 64 × 64, and make use of the tree structure depicted in Figure 1. Our task is to discriminate each facial expression from the rest. Thus, we have seven and four binary classification tasks for JAFFE and AR, respectively. We employ the least squares loss for l(·), and set the regularization parameter λ = r × λmax, where λmax is computed using Algorithm 2, and r = {5×10^−1, 2×10^−1, 1×10^−1, 5×10^−2, 2×10^−2, 1×10^−2, 5×10^−3, 2×10^−3}. The source codes, included in the SLEP package [12], are available online1.

Table 1: Computational time (seconds) for one binary classification task (averaged over 7 and 4 runs for JAFFE and AR, respectively). The total time for all eight regularization parameters is reported.

                             JAFFE     AR
tgLasso                         30     73
alternating algorithm [9]     4054   5155

Efficiency of the Proposed tgLasso. We compare our proposed tgLasso with the recently proposed alternating algorithm [9] designed for the tree-guided group Lasso. We report the total computational time (seconds) for running one binary classification task (averaged over 7 and 4 tasks for JAFFE and AR, respectively) corresponding to the eight regularization parameters in Table 1. We can observe that tgLasso is much more efficient than the alternating algorithm. 
We note that the key step of tgLasso in each iteration is the associated Moreau-Yosida regularization, which can be efficiently computed due to the existence of an analytical solution; and the key step of the alternating algorithm in each iteration is a matrix inversion, which does not scale well to high-dimensional data.

Classification Performance. We compare the classification performance of tgLasso with Lasso. On AR, we use 50 subjects for training, and the remaining 50 subjects for testing; and on JAFFE, we use 8 subjects for training, and the remaining 2 subjects for testing. This subject-independent setting is challenging, as the subjects to be tested are not included in the training set. The reported results are averaged over 10 runs for randomly chosen subjects. For each binary classification task, we compute the balanced error rate [3] to cope with the unbalanced positive and negative samples.

1 http://www.public.asu.edu/~jye02/Software/SLEP/

Figure 3: Classification performance comparison between Lasso and the tree structured group Lasso. The horizontal axis corresponds to different regularization parameters λ = r × λmax.

Figure 4: Markers obtained by Lasso, and tree structured group Lasso (white pixels correspond to the markers). First row: face images of four expressions from the AR data set; Second row: the markers identified by tree structured group Lasso; Third row: the markers identified by Lasso.

We report the averaged results in Figure 3. Results show that tgLasso outperforms Lasso in both cases. This verifies the effectiveness of tgLasso in incorporating the tree structure in the formulation, i.e., the spatial locality information of the face images. Figure 4 shows the markers identified by tgLasso and Lasso under the best regularization parameter. 
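The balanced error rate [3] used above averages the per-class error rates, so the larger class cannot dominate the score. A minimal sketch (our own helper, labels assumed in {+1, −1} and both classes assumed present):

```python
def balanced_error_rate(y_true, y_pred):
    """Balanced error rate: average of the error rates on the positive
    and the negative class, robust to unbalanced class sizes."""
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == -1]
    err_pos = sum(1 for t, p in pos if p != t) / len(pos)
    err_neg = sum(1 for t, p in neg if p != t) / len(neg)
    return 0.5 * (err_pos + err_neg)
```

For example, with 4 positives of which 2 are misclassified and 2 negatives of which 1 is misclassified, the plain error rate would be 3/6 and the balanced error rate is 0.5 * (0.5 + 0.5) = 0.5 as well; the two diverge once the classes are unbalanced and errors concentrate in one class.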
We can observe from the figure that tgLasso results in a block sparsity solution, and most of the selected pixels are around the mouths and eyes.

5 Conclusion

In this paper, we consider the efficient optimization for the tree structured group Lasso. Our main technical result shows that the Moreau-Yosida regularization associated with the tree structured group Lasso admits an analytical solution. Based on the Moreau-Yosida regularization, we design an efficient algorithm for solving the grouped tree structure regularized optimization problem for smooth convex loss functions, and develop an efficient algorithm for determining the effective interval for the parameter λ. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm. We plan to apply the proposed algorithm to other applications in computer vision and bioinformatics involving the tree structure.

Acknowledgments

This work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, IIS-0953662, NGA HM1582-08-1-0016, NSFC 60905035, 61035003, and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the US Army.

References

[1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In International Conference on Machine Learning, 2004.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[3] I. Guyon, A. B. Hur, S. Gunn, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Neural Information Processing Systems, pages 545-552, 2004.
[4] E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107-1130, 2008.
[5] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer Verlag, Berlin, 1993.
[6] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning, 2009.
[7] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523v2, 2009.
[8] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In International Conference on Machine Learning, 2010.
[9] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In International Conference on Machine Learning, 2010.
[10] C. Lemaréchal and C. Sagastizábal. Practical aspects of the Moreau-Yosida regularization I: Theoretical properties. SIAM Journal on Optimization, 7(2):367-385, 1997.
[11] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Uncertainty in Artificial Intelligence, 2009.
[12] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[13] J. Liu, L. Yuan, and J. Ye. An efficient algorithm for a class of fused lasso problems. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
[14] M. J. Lyons, J. Budynek, and S. Akamatsu. Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1357-1362, 1999.
[15] A. M. Martinez and R. Benavente. The AR face database. Technical report, 1998.
[16] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B, 70:53-71, 2008.
[17] J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273-299, 1965.
[18] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
[19] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper, 2007.
[20] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267-288, 1996.
[21] K. Yosida. Functional Analysis. Springer Verlag, Berlin, 1964.
[22] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68(1):49-67, 2006.
[23] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468-3497, 2009.
", "award": [], "sourceid": 600, "authors": [{"given_name": "Jun", "family_name": "Liu", "institution": null}, {"given_name": "Jieping", "family_name": "Ye", "institution": null}]}