{"title": "Bilevel learning of the Group Lasso structure", "book": "Advances in Neural Information Processing Systems", "page_first": 8301, "page_last": 8311, "abstract": "Regression with group-sparsity penalty plays a central role in high-dimensional prediction problems. Most of existing methods require the group structure to be known a priori. In practice, this may be a too strong assumption, potentially hampering the effectiveness of the regularization method. To circumvent this issue, we present a method to estimate the group structure by means of a continuous bilevel optimization problem where the data is split into training and validation sets. Our approach relies on an approximation scheme where the lower level problem is replaced by a smooth dual forward-backward algorithm with Bregman distances. We provide guarantees regarding the convergence of the approximate procedure to the exact problem and demonstrate the well behaviour of the proposed method on synthetic experiments. Finally, a preliminary application to genes expression data is tackled with the purpose of unveiling functional groups.", "full_text": "Bilevel Learning of the Group Lasso Structure\n\nJordan Frecon\u2217,1\n\nSaverio Salzo\u2217 ,1\n\nMassimiliano Pontil1,2\n\n1 Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia (Italy)\n\n2 Department of Computer Science, University College London (UK)\n\nAbstract\n\nRegression with group-sparsity penalty plays a central role in high-dimensional\nprediction problems. However, most existing methods require the group structure\nto be known a priori. In practice, this may be a too strong assumption, potentially\nhampering the effectiveness of the regularization method. 
To circumvent this issue, we present a method to estimate the group structure by means of a continuous bilevel optimization problem where the data is split into training and validation sets. Our approach relies on an approximation scheme where the lower-level problem is replaced by a smooth dual forward-backward algorithm with Bregman distances. We provide guarantees regarding the convergence of the approximate procedure to the exact problem, and demonstrate the good behaviour of the proposed method on synthetic experiments. Finally, a preliminary application to gene expression data is tackled with the purpose of unveiling functional groups.

1 Introduction

With recent technological advances, high-dimensional datasets have become massively widespread in numerous applications ranging from social sciences to computational biology [20, 25, 1]. In addition, in many statistical problems, the number of unknown parameters can be significantly larger than the number of data samples, thus leading to underdetermined and computationally intractable problems. Nonetheless, many classes of datasets exhibit a sparse representation when expressed as a linear combination of suitable dictionary elements. This has led, over the past decades, to the development of sparsity-inducing norms and regularizers to unveil structure in the data. However, beyond the sparsity patterns of the data, there might also be a more complex structure, which is widely referred to as structured sparsity [16, 22, 23, 29]. In this line of research, a lot of work has been devoted to encoding a priori structure of the data in (possibly overlapping) groups or hierarchical trees [31, 15, 17].

In the present paper, we restrict our study to the popular Group Lasso problem [30].
Given a vector of outputs y ∈ R^N and a design matrix X ∈ R^{N×P}, the Group Lasso problem amounts to finding

    ŵ ∈ argmin_{w ∈ R^P}  (1/2)‖y − Xw‖² + λ Σ_{l=1}^L ‖w_{G_l}‖₂,    (1)

for some regularization parameter λ > 0 and a non-overlapping group structure, i.e., an unordered partition of the features into L groups {G_1, …, G_L} such that ∪_{l=1}^L G_l = {1, …, P} and G_l ∩ G_{l′} = ∅ for every l ≠ l′. The specific form of the regularizer makes it possible to enforce sparsity at the group level, thus often leading to better interpretability of the features than the standard Lasso.

However, in many applications we might have hundreds or thousands of features whose group structure {G_1, …, G_L} may be unknown, or only partially known. In addition, the number of groups L itself might not be known. Nonetheless, prior knowledge of the group structure is crucial in order to achieve a lower prediction error [19]. Note that this problem can be seen as purely combinatorial, since it amounts to searching for the best partition amongst the L^P/L! possible unordered partitions.

∗Equal contribution.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper we address the problem of learning the Group Lasso structure, within the setting of multi-task learning, through a bilevel optimization approach. We establish some basic mathematical properties of this methodology and demonstrate that it works well in practice.

Related works. We are aware of only a few approaches devoted to inferring the Group Lasso structure. A probabilistic modeling approach has been investigated in [14] to learn the relevance of pairs of features only.
More recently, [27] considered a broad family of heavy-tailed priors for the group variables, along with a variational inference scheme to learn the parameters of these priors. However, this approach becomes prohibitive when dealing with a large number of features. In addition, the setting is different: it analyzes the latent Group Lasso and it assumes that the learning tasks have a similar structure, meaning that the relevance of a given group is largely shared across the tasks.

Contributions and outline. The principal contribution of this paper is the formulation of the problem of learning the Group Lasso structure as a continuous bilevel optimization problem. In Section 2, we present our bilevel approach in a formal way. A new algorithmic solution, based on an upper-level stochastic gradient descent and a lower-level dual forward-backward scheme with Bregman distances, is devised in Section 3. The performance of the proposed approach is quantitatively assessed on synthetic data in Section 4, and shown to compare favorably against standard approaches. In addition, an application to real data in the context of gene expression analysis is provided with the goal of discovering functional groups. Finally, conclusions and perspectives are drawn in Section 5.

Notation. Let X be a Euclidean space. Γ₀(X) denotes the space of functions h : X → ]−∞, +∞] that are closed, proper and convex. We also denote by argmin h the set of minimizers of h, or the minimizer of h when it is unique.

2 Proposed Bilevel Problem for Learning the Groups

In this section, we describe a bilevel framework for estimating the Group Lasso structure based on a multi-task learning problem, without any further a priori information.

2.1 Original Problem

We encapsulate the group structure by means of a hyperparameter θ = [θ_1 … θ_L] ∈ {0,1}^{P×L}, defining at most L groups, such that, for every p ∈ {1, …, P} and every l ∈ {1, …, L}, θ_{p,l} = 1 if the p-th feature belongs to the l-th group, and 0 otherwise. Note that when no prior information on the number of groups is given, one should consider the extreme setting where there might be at most L = P groups. In order to properly select θ, we propose to consider the following bilevel problem.

Problem 2.1 (Mixed Integer-Continuous Bilevel Problem). Given some vectors of outputs y_t ∈ R^N and design matrices X_t ∈ R^{N×P} for t ∈ {1, …, T}, as well as a regularization parameter λ > 0, find

    θ̂ ∈ argmin_{θ ∈ {0,1}^{P×L}} C(ŵ(θ))   s.t.   Σ_{l=1}^L θ_l = 1_P,    (2)

where C(ŵ(θ)) = (1/T) Σ_{t=1}^T C_t(ŵ_t(θ)), C_t : R^P → R is a smooth function, and ŵ(θ) = (ŵ_1(θ), …, ŵ_T(θ)) is a minimizer of T separate Group Lasso problems sharing a common group structure, i.e., it solves

    minimize_{(w_1,…,w_T) ∈ R^{P×T}}  (1/T) Σ_{t=1}^T ( (1/2)‖y_t − X_t w_t‖² + λ Σ_{l=1}^L ‖θ_l ⊙ w_t‖₂ ),    (3)

where ⊙ denotes the Hadamard product.

The constraint in the right-hand side of (2) ensures that every feature belongs to a single group. A natural choice for C_t is the validation error C_t(ŵ_t(θ)) = (1/2)‖y_t^{(val)} − X_t^{(val)} ŵ_t(θ)‖² evaluated on a set {y^{(val)}, X^{(val)}}. For such a choice, the selection of θ is motivated by the need to generalize well to unseen data. In practice, it is often a good surrogate for the estimation error when the true features are unknown.

We note that directly solving Problem 2.1 is a challenge, since it is a mixed integer-continuous bilevel problem.
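To make the group parametrization of Problem 2.1 concrete, the following sketch (our illustrative NumPy snippet, not the authors' code; all names are our own) evaluates the penalty of the lower-level problem (3) for a binary θ and checks that, when θ encodes a partition, ‖θ_l ⊙ w‖₂ coincides with the classical group norm ‖w_{G_l}‖₂ of (1):

```python
import numpy as np

def group_penalty(w, theta, lam):
    """Penalty term of (3): lam * sum_l ||theta_l ⊙ w||_2, theta in {0,1}^{P x L}."""
    # theta[:, l] is the indicator vector of the l-th group
    return lam * sum(np.linalg.norm(theta[:, l] * w) for l in range(theta.shape[1]))

# A binary partition of P = 6 features into L = 3 contiguous groups of 2
P, L = 6, 3
theta = np.zeros((P, L))
for l in range(L):
    theta[2 * l:2 * (l + 1), l] = 1.0
# Constraint of (2): each feature belongs to exactly one group
assert np.allclose(theta.sum(axis=1), 1.0)

w = np.array([3.0, 4.0, 0.0, 0.0, 1.0, -1.0])
# For a binary partition, ||theta_l ⊙ w||_2 equals the classical ||w_{G_l}||_2
classical = sum(np.linalg.norm(w[2 * l:2 * (l + 1)]) for l in range(L))
assert np.isclose(group_penalty(w, theta, lam=1.0), classical)
```

Relaxing the entries of θ from {0, 1} to [0, 1], as done next, leaves this expression well defined, which is precisely what makes the continuous relaxation possible.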
To overcome this difficulty, we consider in the next section a relaxation of the problem in the continuous setting.

2.2 Relaxed Problem

We propose to consider a continuous relaxed version of Problem 2.1 where θ ∈ [0,1]^{P×L} and the penalty term (ε/2T) Σ_{t=1}^T ‖w_t‖₂² is added to (3), for some ε > 0, in order to ensure strong convexity of the lower-level objective function and hence the uniqueness of its minimizer. The resulting problem lies within the framework of continuous bilevel optimization [10], which has recently been gaining renewed interest in image processing [18, 5] as well as in neural networks and machine learning (see, e.g., [4, 21, 24, 12]). Our relaxation of Problem 2.1 is formulated as follows.

Problem 2.2 (Exact Bilevel Problem). Let C be as in Problem 2.1 and let ψ, ξ : [0,1] → R be increasing continuous functions such that ψ(0) = ξ(0) = 0 and ψ(1) = ξ(1) = 1. Given some vectors of outputs y_t ∈ R^N and design matrices X_t ∈ R^{N×P} for t ∈ {1, …, T}, as well as some regularization parameters λ > 0 and ε > 0, solve

    minimize_{θ ∈ Θ} U(θ),   with   ŵ(θ) = argmin_{w ∈ R^{P×T}} L(w, θ)   and   U(θ) = C(ŵ(θ)),    (4)

where

    L(w, θ) := (1/T) Σ_{t=1}^T L_t(w_t, θ),   L_t(w_t, θ) = (1/2)‖y_t − X_t w_t‖₂² + (ε/2)‖w_t‖₂² + λ Σ_{l=1}^L ‖ψ(θ_l) ⊙ w_t‖₂,    (5)

    Θ = { θ = [θ_1 … θ_L] ∈ [0,1]^{P×L} | Σ_{l=1}^L ξ(θ_l) = 1_P },

and ψ and ξ are applied component-wise to the vectors θ_l.

Remark 2.1. The functions ψ and ξ make it possible to cover different continuous relaxations of Problem 2.1, and the conditions ψ(0) = ξ(0) = 0 and ψ(1) = ξ(1) = 1 are compatibility conditions with Problem 2.1. Among the different choices of ψ and ξ, we point out ψ = ξ = Id, which corresponds to a convex relaxation in which Θ is the unit simplex.

The following result establishes the existence of solutions of Problem 2.2. The proof is given in the supplementary material (Section A.1).

Proposition 2.1 (Existence of Solutions). Suppose that Θ is a compact nonempty subset of R_+^{P×L} and that C and ψ are continuous functions. Then θ ↦ ŵ(θ) is continuous and hence Problem 2.2 admits solutions.

2.3 Approximate Problem

Usually, we do not have a closed-form expression for ŵ(θ); rather, we have an iterative procedure converging to ŵ(θ), which we stop after Q iterations. Therefore, we actually solve an approximate problem of the following form.

Problem 2.3 (Approximate Bilevel Problem). Let C and Θ be as in Problem 2.2. Given two mappings A and B, as well as a maximum number of inner iterations Q ∈ N, solve

    minimize_{θ ∈ Θ} U^{(Q)}(θ),   where
      u^{(0)}(θ) is chosen arbitrarily;
      for q = 0, 1, …, Q−1:  u^{(q+1)}(θ) = A(u^{(q)}(θ), θ);
      w^{(Q)}(θ) = B(u^{(Q)}(θ), θ);
      U^{(Q)}(θ) = C(w^{(Q)}(θ)).    (6)

Remark 2.2. Problem 2.3 encompasses many situations encountered in practice. For example, when B = Id, it reduces to the usual case where w^{(q+1)}(θ) = A(w^{(q)}(θ), θ).
In addition, this formulation also covers dual algorithms: in this case A corresponds to the dual variable update, and B denotes the primal-dual relationship (see, e.g., [3]).

The following theorem gives the conditions under which the approximate problem converges to the exact one as the number of inner iterations Q grows.

Theorem 2.1 (Convergence of the Approximate Problem). In addition to the assumptions of Problem 2.3, suppose that the iterates {w^{(Q)}(θ)}_{Q∈N} converge to ŵ(θ) uniformly on Θ as Q → +∞. Then the approximate Problem 2.3 converges to the exact Problem 2.2 in the following sense:

    inf_{θ∈Θ} U^{(Q)}(θ) → inf_{θ∈Θ} U(θ)   and   argmin_{θ∈Θ} U^{(Q)}(θ) → argmin_{θ∈Θ} U(θ)   as Q → +∞,    (7)

where the latter convergence is meant as set convergence, i.e., for every sequence (θ̂^{(Q)})_{Q∈N} such that θ̂^{(Q)} ∈ argmin U^{(Q)}, we have dist(θ̂^{(Q)}, argmin U) → 0 as Q → +∞, which is equivalent to max{dist(θ̂, argmin U) | θ̂ ∈ argmin U^{(Q)}} → 0 as Q → +∞.

Theorem 2.1 justifies the minimization of U^{(Q)} (for sufficiently large Q) instead of U.

3 Algorithmic Solution

The lower-level problem in (4)-(5) can, in principle, be addressed by several available solvers. However, since this problem is nonsmooth, these solvers are usually nonsmooth as well, that is, A and B in (6) are nonsmooth. This causes U^{(Q)} to be nonsmooth, besides being nonconvex. In that case, minimizing U^{(Q)} is a challenge. Indeed, even just determining a (hyper)subgradient of U^{(Q)} in a stable fashion, by recursively computing a subgradient of u^{(q)}(θ), might be hopeless.
Therefore, we embrace the idea proposed in [24] of devising a smooth algorithm by relying on Bregman proximity operators, and we make two advances. First, we propose a new algorithm based on a dual forward-backward scheme with Bregman distances in which A and B are smooth. Second, by relying on [2], we prove the uniform convergence of this algorithm to the solution of the lower-level problem, so as to meet the requirements of Theorem 2.1. This approach finally gives a smooth function U^{(Q)} whose gradient can be recursively computed by applying the standard chain rule [13].

3.1 Principle

Since the proposed bilevel problem is a nonconvex problem with possibly many minima, finding the global optimum is out of reach. However, local minima can still be of high quality, meaning that no improvements in the objective can be obtained by small perturbations and that the corresponding objective value is close to the infimum. Let us remark that, since in the parametrization of the groups the ordering is not relevant, the upper-level objective function is invariant under permutations of (θ_1, …, θ_L), so there are L! equivalent solutions.

In order to solve the bilevel problem, we rely on the following projected gradient descent algorithm:

    (∀k ∈ {0, …, K−1})   θ^{(k+1)} = P_Θ( θ^{(k)} − γ∇U^{(Q)}(θ^{(k)}) ),    (8)

where P_Θ denotes the projection onto Θ (see [8] for an efficient projection method when Θ is the unit simplex) and γ > 0 is a given step-size. Overall, this procedure requires computing the Q-th iterate w^{(Q)}(θ^{(k)}) as well as the hypergradient ∇U^{(Q)}(θ^{(k)}).

Since both the lower- and upper-level problems are separable with respect to the tasks, the hypergradient is the sum of T terms. In Section 3.4, we design a stochastic variant of (8) taking advantage of this structure.

3.2 Solving the Lower Level Problem

In this section, we address the lower-level problem in (4)-(5). Since it is separable with respect to the tasks, without loss of generality we can deal with a single task, omitting the index t.

Problem 3.1. Given a vector of outputs y ∈ R^N, a design matrix X ∈ R^{N×P}, regularization parameters λ > 0 and ε > 0, as well as some group structure θ ∈ Θ, find

    ŵ(θ) = argmin_{w ∈ R^P} { L(w, θ) := f(w) + g(A_θ w) },   f(w) := (1/2)‖y − Xw‖₂² + (ε/2)‖w‖₂²,   g(A_θ w) := λ Σ_{l=1}^L ‖ψ(θ_l) ⊙ w‖₂,    (9)

where f ∈ Γ₀(R^P) is smooth and ε-strongly convex, g ∈ Γ₀(R^{P×L}) is nonsmooth, and A_θ is the linear operator defined as A_θ : w ∈ R^P ↦ (ψ(θ_1) ⊙ w, …, ψ(θ_L) ⊙ w) ∈ R^{P×L}.

Let us note that, in order to solve Problem 3.1, we cannot use the standard forward-backward algorithm [6, 7], since the proximity operator of g ∘ A_θ cannot be computed in closed form. Moreover, we also ask for a smooth algorithm, meaning one for which A and B in (6) are smooth. Therefore, we tackle the dual of Problem 3.1.

Problem 3.2. Find a solution û(θ) of

    minimize_{u ∈ R^{P×L}}  f*(−A_θ^⊤ u) + g*(u),    (10)

where f* and g* denote the Fenchel conjugates of f and g respectively, and where A_θ^⊤ is the transpose of the operator A_θ, that is, A_θ^⊤ : u ∈ R^{P×L} ↦ Σ_{l=1}^L ψ(θ_l) ⊙ u_l ∈ R^P.

Note that the dual Problem 3.2 admits a solution, since strong duality holds and the primal Problem 3.1 has solutions [3].
Moreover, it is a smooth constrained convex optimization problem. Indeed, since f is closed and ε-strongly convex, it follows that f* is everywhere differentiable with ε^{-1}-Lipschitz continuous gradient, and hence ∇[f* ∘ (−A_θ^⊤)] = −A_θ ∇f* ∘ (−A_θ^⊤) is ‖A_θ‖²ε^{-1}-Lipschitz continuous. Besides, we have ∇f* = (∇f)^{-1} = (X^⊤X + εId_P)^{-1}(· + X^⊤y). On the other hand, g* is the indicator function of the product of L balls B₂(λ) × … × B₂(λ) := B₂(λ)^L, i.e., g*(u) = Σ_{l=1}^L ι_{B₂(λ)}(u_l), where B₂(λ) is the closed ball of R^P centered at zero and of radius λ.

We propose to solve Problem 3.1 by applying a forward-backward algorithm with Bregman distances to the dual Problem 3.2 [2, 28], using the primal-dual link w = ∇f*(−A_θ^⊤ u). This algorithm calls for a Bregman proximity operator of g*, which can be made smooth with an appropriate choice of the Bregman distance. In the following, we provide the related details.

Definition 3.1 (Bregman Proximity Operator [28]). Let X be a Euclidean space, h ∈ Γ₀(X) and let Φ ∈ Γ₀(X) be a Legendre function. Then, the Bregman proximity operator (in Van Nguyen's sense) of h with respect to Φ is

    prox_h^Φ(v) = argmin_{u ∈ X}  h(u) + Φ(u) − ⟨u, v⟩.    (11)

The dual forward-backward algorithm with Bregman distances (FBB) for Problem 3.1 is as follows. Given some step-size γ > 0 and u^{(0)}(θ), then

    for q = 0, 1, …, Q−1:
        u^{(q+1)}(θ) = prox_{γg*}^Φ( ∇Φ(u^{(q)}(θ)) + γ A_θ ∇f*(−A_θ^⊤ u^{(q)}(θ)) )
    w^{(Q)}(θ) = ∇f*(−A_θ^⊤ u^{(Q)}(θ)).    (12)

The updating rules in (12) define the mappings A and B in Problem 2.3. We note that in this case, since ∇f* is an affine mapping, B is smooth, whereas the smoothness of A depends on the choice of the Legendre function Φ. We consider Φ(u) = Σ_{l=1}^L φ(u_l) with dom φ = B₂(λ), so that, for every l ∈ {1, …, L},

    u_l^{(q+1)}(θ) = prox_{γι_{B₂(λ)}}^φ( ∇φ(u_l^{(q)}(θ)) + γ ψ(θ_l) ⊙ ∇f*(−A_θ^⊤ u^{(q)}(θ)) )
                 = argmin_{u ∈ R^P}  ι_{B₂(λ)}(u) + φ(u) − ⟨u, ∇φ(u_l^{(q)}(θ)) + γ ψ(θ_l) ⊙ ∇f*(−A_θ^⊤ u^{(q)}(θ))⟩
                 = argmin_{u ∈ R^P}  φ(u) − ⟨u, ∇φ(u_l^{(q)}(θ)) + γ ψ(θ_l) ⊙ ∇f*(−A_θ^⊤ u^{(q)}(θ))⟩
                 = ∇φ*( ∇φ(u_l^{(q)}(θ)) + γ ψ(θ_l) ⊙ ∇f*(−A_θ^⊤ u^{(q)}(θ)) ).    (13)

Therefore, in order to make A smooth, we need to choose the Legendre function φ so that φ* is twice differentiable. Here, we propose to resort to the following function.

Definition 3.2 (Separable Hellinger-like Function [2]).
The separable Hellinger-like function is defined as Φ(u) = Σ_{l=1}^L φ(u_l), where, for every u_l ∈ R^P,

    φ(u_l) = −√(λ² − ‖u_l‖₂²)  if u_l ∈ B₂(λ),   +∞  otherwise.    (14)

For such a choice, we have that, for every v ∈ R^P, ∇φ*(v) = λv/√(1 + ‖v‖₂²). The corresponding forward-backward scheme with Bregman distance is given in Algorithm 1 where, for the sake of readability, we introduce the primal iterates w^{(q)}(θ) and the auxiliary variables v_l^{(q)}(θ), denoting the argument of ∇φ* in (13).

Algorithm 1 Dual forward-backward with Bregman distances: FBB-GLasso(y, X, λ, θ)
Require: data y, design matrix X, regularization parameter λ and group structure θ
  Set the number of iterations Q ∈ N and the step-size γ < ελ^{-1}‖A_θ‖^{-2}
  Set L to the number of groups in θ
  Initialize u^{(0)}(θ) ≡ 0_{P×L}
  for q = 0 to Q−1 do
    w^{(q)}(θ) = (X^⊤X + εId_P)^{-1}( X^⊤y − Σ_{l=1}^L ψ(θ_l) ⊙ u_l^{(q)}(θ) )
    for l = 1 to L do
      v_l^{(q+1)}(θ) = u_l^{(q)}(θ) / √(λ² − ‖u_l^{(q)}(θ)‖₂²) + γ ψ(θ_l) ⊙ w^{(q)}(θ)
      u_l^{(q+1)}(θ) = λ v_l^{(q+1)}(θ) / √(1 + ‖v_l^{(q+1)}(θ)‖₂²)
    end for
  end for
  output w^{(Q)}(θ) = (X^⊤X + εId_P)^{-1}( X^⊤y − Σ_{l=1}^L ψ(θ_l) ⊙ u_l^{(Q)}(θ) )

The following theorem addresses the convergence of Algorithm 1. The corresponding proof is given in the supplementary material (Section A.2).

Theorem 3.1 (Convergence of the Dual FBB Scheme).
The sequence {w^{(Q)}(θ)}_{Q∈N} generated by Algorithm 1 converges to the solution ŵ(θ) of Problem 3.1 for any step-size 0 < γ < ελ^{-1}‖A_θ‖^{-2}. In addition, if γ = ελ^{-1}‖A_θ‖^{-2}/2, then

    (∀Q ∈ N)   (1/2)‖w^{(Q)}(θ) − ŵ(θ)‖₂² ≤ (2λε^{-2}/Q) ‖A_θ‖² D_Φ(û(θ), u^{(0)}),    (15)

where D_Φ is the Bregman distance associated with Φ, i.e.,

    (∀u ∈ dom Φ, ∀v ∈ int dom Φ)   D_Φ(u, v) = Φ(u) − Φ(v) − ⟨∇Φ(v), u − v⟩.    (16)

Remark 3.1. Since ran(ψ) ⊂ [0,1], ‖A_θ‖² ≤ L. If ψ² ≤ ξ, then ‖A_θ‖ ≤ 1; equality is obtained when ψ² = ξ. Therefore, since D_Φ(·, u^{(0)}) is continuous on dom Φ = B₂(λ)^L, the quantity ‖A_θ‖² D_Φ(û(θ), u^{(0)}) in (15) can be uniformly bounded from above on Θ.

Theorem 3.1 and Remark 3.1 establish that {w^{(Q)}(θ)}_{Q∈N} converges to ŵ(θ) uniformly on Θ as Q → +∞ with a sublinear rate. This result applies to every task of the lower-level objective in Problem 2.2, and hence it also applies to the collection of tasks {w^{(Q)}(θ)}_{Q∈N} and ŵ(θ). Therefore, the requirements of Theorem 2.1 are met and the solutions of Problem 2.3 converge to the solutions of Problem 2.2 as Q → +∞.

3.3 Computing the Hypergradient

In this section, we discuss the computation of the (hyper)gradient of U^{(Q)}.
It follows from (6) that, for every θ ∈ Θ,

    ∇U^{(Q)}(θ) = (1/T) Σ_{t=1}^T [ (w_t^{(Q)})′(θ) ]^⊤ ∇C_t(w_t^{(Q)}(θ)) ∈ R^{P×L},    (17)

where [(w_t^{(Q)})′(θ)]^⊤ ∈ R^{(P×L)×P}, ∇C_t(w_t^{(Q)}(θ)) ∈ R^P and ∇C_t(w_t^{(Q)}(θ)) = X_t^{(val)⊤}( X_t^{(val)} w_t^{(Q)}(θ) − y_t^{(val)} ). In equation (17), the derivative (w_t^{(Q)})′(θ) can be computed by recursively differentiating the formulas in (12). This is the so-called forward mode for the computation of the hypergradient. However, in our setting, this requires storing the derivatives (u^{(q)})′(θ), which have size (P×L)×(P×L). Here, since we are interested in the product [(w_t^{(Q)})′(θ)]^⊤ ∇C_t(w_t^{(Q)}(θ)), we implement the reverse-mode differentiation [13] (see also [11]), which gives a more efficient procedure that only requires storing matrices of size P×L. The details are given in Algorithm 2 in the supplementary material (Section B.1). Finally, as suggested in [13, Chapter 15] and more recently in [24], we implement a variant of Algorithm 2 in which all the derivatives of the mapping A are evaluated at the last iterate u^{(Q)} (instead of varying during the iterations). This reduces the execution time and memory requirements. In our experiments, we observe that the hypergradient is left unchanged by this operation as long as Q is large enough.

3.4 Solving the Approximate Bilevel Problem

Since the hypergradient in (17) has the form of a sum of T terms, each one depending on a single task, we implement a stochastic solver, estimating the hypergradient ∇U^{(Q)} on a single task chosen at random. Here, we resort to the proxSAGA algorithm [26], which is a nonconvex proximal variant of SAGA [9].
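Before stating the convergence result, the lower-level solver of Section 3.2 can be made concrete. The following is a minimal single-task NumPy sketch of Algorithm 1 with ψ = Id (an illustrative re-implementation of ours, not the authors' MATLAB toolbox; the function name, the exaggerated value of ε, and the number of iterations are our choices for the example):

```python
import numpy as np

def fbb_glasso(y, X, lam, theta, eps=0.5, Q=5000):
    """Sketch of Algorithm 1 (dual forward-backward with the Hellinger-like
    Bregman distance). theta is P x L with one group per column; psi = Id,
    so ||A_theta|| <= 1 when theta encodes a partition (Remark 3.1)."""
    P, L = theta.shape
    gamma = 0.5 * eps / lam                       # step-size < eps / (lam ||A_theta||^2)
    M = np.linalg.inv(X.T @ X + eps * np.eye(P))  # grad f* is the affine map M(. + X^T y)
    b = X.T @ y
    u = np.zeros((P, L))                          # dual variable, columns u_l in B2(lam)
    for _ in range(Q):
        w = M @ (b - (theta * u).sum(axis=1))     # primal iterate w^(q)
        # v_l = grad phi(u_l) + gamma psi(theta_l) ⊙ w^(q)
        v = u / np.sqrt(lam**2 - (u**2).sum(axis=0)) + gamma * theta * w[:, None]
        # u_l^(q+1) = grad phi*(v_l)  (smooth Bregman proximal step)
        u = lam * v / np.sqrt(1.0 + (v**2).sum(axis=0))
    return M @ (b - (theta * u).sum(axis=1))

# Sanity check on an identity design, where the minimizer has the closed form
# w_hat_{G_l} = max(0, ||y_{G_l}|| - lam) / (1 + eps) * y_{G_l} / ||y_{G_l}||
X = np.eye(4)
y = np.array([3.0, 4.0, 0.1, -0.1])
theta = np.zeros((4, 2)); theta[:2, 0] = 1.0; theta[2:, 1] = 1.0
w = fbb_glasso(y, X, lam=1.0, theta=theta)
```

On this example the first group is shrunk but kept, while the second group, whose norm is below λ, is zeroed out, matching the block soft-thresholding formula above.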
The details are given in the supplementary material (Section B.2). In the following, we provide the related convergence theorem.

Theorem 3.2 (Convergence of the Proposed Bilevel Scheme). Let β be the Lipschitz constant of ∇U^{(Q)} and γ ≤ 1/(5βT). Let {θ^{(k)}}_{k=1}^K be generated according to Algorithm 4 in the supplementary material (Section B.2). Then, for k̃ uniformly sampled from {1, …, K}, the following holds:

    E[ ‖G_γ(θ^{(k̃)})‖² ] ≤ (50βT²/(5T − 2)) · ( U^{(Q)}(θ^{(0)}) − U^{(Q)}(θ*) ) / K,    (18)

where θ* is a minimizer of U^{(Q)} and G_γ is the gradient mapping

    G_γ(θ) = (1/γ)( θ − P_Θ(θ − γ∇U^{(Q)}(θ)) ).    (19)

We note that computing the Lipschitz constant β is out of reach. Thus, we choose γ small enough so that the algorithm converges. Finally, we suggest the initialization θ^{(0)} = P_Θ(L^{-1}1_{P×L} + n), where n ∼ N(0_{P×L}, 0.1 L^{-1}1_{P×L}), in order to be as uninformative as possible regarding the group structure while still breaking the symmetry by adding a small perturbation.

4 Numerical Experiments

In this section, we first devise synthetic experiments to illustrate and assess the performance of the proposed method. Then, we tackle a real-data experiment in the context of gene expression analysis. A MATLAB® toolbox is available upon request to the authors.

4.1 Synthetic Experiments

Experimental setting. We consider the setting where N = 50, P = 100, and the group structure θ* is made of L* = 10 groups equally distributed over the features, such that, for every p ∈ {1, …, P} and every l ∈ {1, …, L*}, θ*_{p,l} = 1 if p ∈ G_l := {1 + (l−1)(P/L*), …, l(P/L*)} and 0 otherwise. In addition, if not stated otherwise, we fix T = 500 tasks, and every regressor w_t* is set to have non-zero coefficients equal to 1 in at most 2 groups chosen at random. The training, validation and test sets are all synthesized as follows. For every task t ∈ {1, …, T}, the design matrix X_t ∈ R^{N×P} is first drawn from a standard normal distribution N(0_{N×P}, 1_{N×P}) and then normalized column-wise. Finally, we define the vector of outputs y_t = X_t w_t* + n, where n ∼ N(0_N, 0.3·1_N).

We consider the convex relaxation pointed out in Remark 2.1, set (Q = 500, ε = 10^{-3}, γ = 0.1, K = 2000), and denote the proposed solution by θ_BiGL. We also consider its thresholded counterpart θ_BiGLThr, where each feature is assigned to its most dominant group. These two solutions are compared with the Lasso and the oracle Group Lasso, computed respectively for θ_Lasso = Id_P and θ_GL = θ*. In addition to the validation error, performance is quantified in terms of test and estimation error, (1/2T) Σ_{t=1}^T ‖y_t^{(test)} − X_t^{(test)} ·‖₂² and (1/2T) Σ_{t=1}^T ‖w_t* − ·‖₂², respectively.

Illustration of the method. First, we illustrate the good behaviour of the algorithmic solution, for various values of λ, when L* is known. We consider the previously mentioned setting and display in Fig. 5 (in the supplementary material) the corresponding oracle w* (top left), exhibiting 10 groups.

Figure 1: The minimization of U^{(Q)} is displayed in the left plot for various λ. Comparison of estimation errors (middle) shows that the proposed BiGL and BiGLThr estimates yield performance close to the oracle GL.
In addition, θ_BiGLThr satisfactorily agrees with the oracle θ* (right).

Figure 2: Left and middle plots illustrate the impact of Q and T on the validation and the estimation error, respectively. The right-hand figure shows that an adequate estimation of the groups can be obtained even when the number of groups is set to 20 instead of 10.

Figure 1 (left) shows, for several values of λ, how the upper-level objective decreases as the number of outer iterations k grows. Even though convergence is not yet fully reached, the corresponding solutions still yield performance close to the oracle Group Lasso, as shown by the validation, test and estimation errors (see Figure 1 and the supplementary material). More importantly, Figure 1 (right) shows that, for the λ minimizing the validation error, denoted λ_min, the corresponding estimated groups θ_BiGLThr satisfactorily agree with the oracle θ*, thus confirming that minimizing the validation error is an adequate way to learn the groups.

Impact of the number of inner iterations Q. Now that a proof of concept has been provided, we investigate the impact of Q on the validation error. To do so, we repeat the same experiment for λ = λ_min and different values of Q. Once the estimates θ_BiGL and θ_BiGLThr are obtained, the validation error (where the ŵ(·)'s are computed a posteriori for 10⁴ iterations) for each of the four methods is plotted as a function of Q in Fig. 2 (left). The results show that increasing Q sufficiently permits reaching performance close to GL. In addition, we stress that, for Q ≥ 500, the performance of BiGL and BiGLThr become indistinguishable, thus showing that the algorithm does tend to assign a single group to each feature.

Impact of the number of tasks T. Here we investigate how the estimation error varies as the number of tasks T increases; see Fig. 2 (middle).
While the performance of the Lasso and GL does not significantly depend on T, we observe that the performance of BiGL and BiGLThr gets closer to that of GL as T grows. Similar conclusions can be drawn regarding the test error. Hence, this confirms that learning the groups is intrinsically a multi-task problem that benefits from having a large number of tasks.

Impact of the number of groups L. While in the previous experiments the number of groups was known a priori (L∗ = 10), here we relax this assumption and let the algorithm find at most L = 20 groups. We repeat the experiment and show the results in Fig. 2 (right). Note that 9 out of the 10 extra groups are not displayed since they were found empty, while the remaining group contains very few features. Overall, the oracle θ∗ is still satisfactorily estimated.

Impact of the group sizes. We repeat the same experiment, except that θ∗ is now made of 5 groups of 5 features and 5 groups of 15 features. The proposed method still satisfactorily estimates groups of different sizes, as Figure 3 shows, both when L∗ is known and when L∗ is overestimated (L = 20).

Figure 3: Illustration of the estimated group structure when the oracle groups have different sizes. Left: initialization with the correct number of groups L = 10. Right: initialization with an overestimation L = 20 of the number of groups.

Figure 4: Application to the prediction of gene ontology classes from regulatory motifs. Our approach is able to reach a lower prediction error than the Lasso by partitioning the features into 30 groups.

4.2 Application to Real Data

Understanding the complexity of gene expression networks and the mechanisms involved in their regulation constitutes an extremely difficult task [25].
In this section, we conduct a preliminary experiment on gene expression data collected from https://www.ensembl.org/ using BioMart. The data consist of N = 60 genes, each characterized by P = 50 features corresponding to the regulatory motifs in promoters. These samples may belong to at most 108 gene ontology classes, each of which corresponds to a very specific molecular function of the transcripts. The data set is split into training, validation and test sets of 20 genes each. We perform a multi-task classification (T = 108) where each task consists of a one-versus-all classification problem. Our bilevel algorithm is initialized with L = 50 possible groups. Validation and test errors are displayed in Fig. 4 as functions of λ. The results show a significant decrease in prediction error when using the proposed method compared with the Lasso. In addition, θBiGLThr suggests that there exist 30 relevant groups among the features. This preliminary experiment is encouraging and paves the way for extended and more comprehensive experiments in gene data analysis.

5 Conclusion

This contribution studied the problem of learning the groups by solving a continuous bilevel problem. We replaced the exact Group Lasso optimization problem by a smooth dual forward-backward algorithm with Bregman distances, in line with what has been proposed in [24]. We also provided theoretical justifications of this approximation method which, to the best of our knowledge, are new. When compared to standard sparse regression methods, the proposed procedure achieved performance equivalent to that of the oracle Group Lasso, where the true groups are known. Moreover, when the numbers of tasks and inner iterations are sufficiently large, a satisfactory estimate of the groups can be obtained even if the number of groups is unknown.
One of the advantages of the proposed approach is that it can be easily adapted to different convex losses with Lipschitz-continuous gradient. Future work notably includes the extension to overlapping groups [15] and learning the groups in group-sparse classification problems.

Acknowledgments

We wish to thank Luca Franceschi and the anonymous referees for their useful comments. We would also like to thank Giorgio Valentini for providing the gene expression dataset. This work was supported in part by SAP SE.

References

[1] A. Ahmed and E. Xing. Recovering time-varying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences, 106(29):11878–11883, 2009.

[2] H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2016.

[3] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, USA, 2nd edition, 2017.

[4] Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.

[5] L. Calatroni, C. Chung, J. C. De los Reyes, C.-B. Schönlieb, and T. Valkonen. Bilevel approaches for learning of variational imaging models. Variational Methods, pages 252–290, 2016.

[6] G. Chen and R. T. Rockafellar. Convergence rates in forward–backward splitting. SIAM Journal on Optimization, 7(2):421–444, 1997.

[7] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.

[8] L. Condat. Fast projection onto the simplex and the ℓ1-ball.
Mathematical Programming, 158(1-2):575–585, 2016.

[9] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, Montreal, Canada, 08–13 Dec 2014.

[10] S. Dempe. Foundations of Bilevel Programming. Springer, Boston, USA, 2002.

[11] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1165–1173, Sydney, Australia, 06–11 Aug 2017.

[12] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1568–1577, Stockholm, Sweden, 10–15 Jul 2018.

[13] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia, USA, 2nd edition, 2008.

[14] D. Hernández-Lobato and J. M. Hernández-Lobato. Learning feature selection dependencies in multi-task learning. In Advances in Neural Information Processing Systems 26, pages 746–754, Lake Tahoe, USA, 05–10 Dec 2013.

[15] L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440, Montreal, Canada, 14–18 Jun 2009.

[16] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777–2824, 2011.

[17] S. Kim and E. Xing. Tree-guided group Lasso for multi-response regression with structured sparsity, with an application to eQTL mapping.
The Annals of Applied Statistics, 6(3):1095–1117, 2012.

[18] K. Kunisch and T. Pock. A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences, 6(2):938–983, 2013.

[19] K. Lounici, M. Pontil, S. Van De Geer, and A. B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164–2204, 2011.

[20] S. Ma, X. Song, and J. Huang. Supervised group Lasso with applications to microarray data analysis. BMC Bioinformatics, 8(1):60, 2007.

[21] D. Maclaurin, D. Duvenaud, and R. P. Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2113–2122, Lille, France, 2015.

[22] A. Maurer and M. Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13:671–690, 2012.

[23] C. A. Micchelli, J. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455–489, 2013.

[24] P. Ochs, R. Ranftl, T. Brox, and T. Pock. Techniques for gradient-based bilevel optimization with non-smooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175–194, 2016.

[25] M. Re and G. Valentini. Predicting gene expression from heterogeneous data. In International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics, Genoa, Italy, 15–17 Oct 2009.

[26] S. Reddi, S. Sra, B. Poczos, and A. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems 29, pages 1145–1153, Barcelona, Spain, 05–10 Dec 2016.

[27] N. Shervashidze and F. Bach. Learning the structure for structured sparsity. IEEE Transactions on Signal Processing, 63(18):4894–4902, 2015.

[28] Q. Van Nguyen.
Forward-backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.

[29] M. J. Wainwright. Structured regularizers for high-dimensional problems: statistical and computational issues. Annual Review of Statistics and Its Application, 1:233–253, 2014.

[30] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

[31] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468–3497, 2009.