{"title": "Orbit Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 3221, "page_last": 3229, "abstract": "We propose a general framework for regularization based on group majorization. In this framework, a group is defined to act on the parameter space and an orbit is fixed; to control complexity, the model parameters are confined to lie in the convex hull of this orbit (the orbitope). Common regularizers are recovered as particular cases, and a connection is revealed between the recent sorted \u21131-norm and the hyperoctahedral group. We derive the properties a group must satisfy for being amenable to optimization with conditional and projected gradient algorithms. Finally, we suggest a continuation strategy for orbit exploration, presenting simulation results for the symmetric and hyperoctahedral groups.", "full_text": "Orbit Regularization\n\nRenato Negrinho\nInstituto de Telecomunica\u00e7\u00f5es\nInstituto Superior T\u00e9cnico\n1049\u2013001 Lisboa, Portugal\nrenato.negrinho@gmail.com\n\nAndr\u00e9 F. T. Martins\u2217\nInstituto de Telecomunica\u00e7\u00f5es\nInstituto Superior T\u00e9cnico\n1049\u2013001 Lisboa, Portugal\natm@priberam.pt\n\nAbstract\n\nWe propose a general framework for regularization based on group-induced majorization. In this framework, a group is defined to act on the parameter space and an orbit is fixed; to control complexity, the model parameters are confined to the convex hull of this orbit (the orbitope). We recover several well-known regularizers as particular cases, and reveal a connection between the hyperoctahedral group and the recently proposed sorted \u21131-norm. We derive the properties a group must satisfy for being amenable to optimization with conditional and projected gradient algorithms. 
Finally, we suggest a continuation strategy for orbit exploration, presenting simulation results for the symmetric and hyperoctahedral groups.\n\n1 Introduction\n\nThe main motivation behind current sparse estimation methods and regularized empirical risk minimization is the principle of parsimony, which states that simple explanations should be preferred over complex ones. Traditionally, this has been done by defining a function \u03a9 : V \u2192 R that evaluates the complexity of a model w \u2208 V and trades off this quantity with a data-dependent term. The penalty function \u03a9 is often designed to be a convex surrogate of an otherwise intractable quantity, a strategy which has led to important achievements in sparse regression [1], compressed sensing [2], and matrix completion [3], making it possible to recover parameters from highly incomplete information. Prior knowledge about the structure of the variables and the intended sparsity pattern, when available, can be taken into account when designing \u03a9 via sparsity-inducing norms [4]. Performance bounds under different regimes have been established theoretically [5, 6], contributing to a better understanding of the success and failure modes of these techniques.\n\nIn this paper, we introduce a new way to characterize the complexity of a model via the concept of group-induced majorization. Rather than regarding complexity in an absolute manner via \u03a9, we define it relative to a prototype model v \u2208 V, by requiring that the estimated model w satisfies\n\nw \u2aafG v,    (1)\n\nwhere \u2aafG is an ordering relation on V induced by a group G. This idea is rooted in majorization theory, a well-established field [7, 8] which, to the best of our knowledge, has never been applied to machine learning. We therefore review these concepts in \u00a72, where we show that this formulation subsumes several well-known regularizers and motivates new ones. 
Then, in \u00a73, we introduce two important properties of groups that serve as building blocks for the rest of the paper: the notions of matching function and region cones. In \u00a74, we apply these tools to the permutation and signed permutation groups, unveiling connections with the recent sorted \u21131-norm [9] as a byproduct. In \u00a75 we turn to algorithmic considerations, pinpointing the group-specific operations that make a group amenable to optimization with conditional and projected gradient algorithms.\n\nA key aspect of our framework is a decoupling in which the group G captures the invariances of the regularizer, while the data-dependent term is optimized in the group orbitopes. In \u00a76, we build on this intuition to propose a simple continuation algorithm for orbit exploration. Finally, \u00a77 shows some simulation results, and we conclude in \u00a78.\n\n\u2217Also at Priberam Labs, Alameda D. Afonso Henriques, 41 - 2\u00ba, 1000\u2013123, Lisboa, Portugal.\n\nFigure 1: Examples of orbitopes for the orthogonal group O(d) (left) and the hyperoctahedral group P\u00b1 (right). Shown are also the corresponding region cones, which in the case of O(d) degenerate into a ray.\n\n2 Orbitopes and Majorization\n\n2.1 Vector Spaces and Groups\n\nLet V be a vector space with an inner product \u27e8\u00b7,\u00b7\u27e9. We will be mostly concerned with the case where V = Rd, i.e., the d-dimensional real Euclidean space, but some of the concepts introduced here generalize to arbitrary Hilbert spaces. 
A group is a set G endowed with an operation \u00b7 : G \u00d7 G \u2192 G satisfying closure (g \u00b7 h \u2208 G, \u2200g, h \u2208 G), associativity ((f \u00b7 g) \u00b7 h = f \u00b7 (g \u00b7 h), \u2200f, g, h \u2208 G), existence of identity (\u22031G \u2208 G such that 1G \u00b7 g = g \u00b7 1G = g, \u2200g \u2208 G), and existence of inverses (each g \u2208 G has an inverse g\u207b\u00b9 \u2208 G such that g \u00b7 g\u207b\u00b9 = g\u207b\u00b9 \u00b7 g = 1G). Throughout, we use boldface letters u, v, w, . . . for vectors, and g, h, . . . for group elements. We also omit the group operation symbol, writing gh instead of g \u00b7 h.\n\n2.2 Group Actions, Orbits, and Orbitopes\n\nA (left) group action of G on V [10] is a function \u03c8 : G \u00d7 V \u2192 V satisfying \u03c8(g, \u03c8(h, v)) = \u03c8(g \u00b7 h, v) and \u03c8(1G, v) = v for all g, h \u2208 G and v \u2208 V. When the action is clear from the context, we omit the letter \u03c8, writing simply gv for the action of the group element g on v, instead of \u03c8(g, v). In this paper, we always assume our actions are linear, i.e., g(c1v1 + c2v2) = c1gv1 + c2gv2 for scalars c1 and c2 and vectors v1 and v2. In some cases, we also assume they are norm-preserving, i.e., \u2016gv\u2016 = \u2016v\u2016 for any g \u2208 G and v \u2208 V. When V = Rd, we may regard the groups underlying these actions as subgroups of the general linear group GL(d) and of the orthogonal group O(d), respectively. GL(d) is the set of d-by-d invertible matrices, and O(d) the set of d-by-d orthogonal matrices {U \u2208 Rd\u00d7d | U\u22a4U = UU\u22a4 = Id}, where Id denotes the d-dimensional identity matrix.\n\nA group action defines an equivalence relation on V, namely w \u2261 v iff there is g \u2208 G such that w = gv. The orbit of a vector v \u2208 V under the action of G is the set Gv := {gv | g \u2208 G}, i.e., the vectors that result from acting on v with some element of G. 
Its convex hull is called the orbitope:\n\nOG(v) := conv(Gv).    (2)\n\nFig. 1 (left) illustrates this concept for the orthogonal group in R\u00b2. An important concept associated with group actions and orbitopes is that of G-majorization [7]:\n\nDefinition 1 Let v, w \u2208 V. We say that w is G-majorized by v, denoted w \u2aafG v, if w \u2208 OG(v).\n\nProposition 2 If the group action is linear, then \u2aafG is reflexive and transitive, i.e., it is a pre-order.\n\nProof: See supplemental material.\n\nGroup majorization plays an important role in the area of multivariate inequalities in statistics [11]. In this paper, we use this concept for representing model complexity, as described next.\n\n2.3 Orbit Regularization\n\nWe formulate our learning problem as follows:\n\nminimize L(w)  s.t.  w \u2aafG v,    (3)\n\nwhere L : V \u2192 R is a loss function, G is a given group, and v \u2208 V is a seed vector. This formulation subsumes several well-known cases, outlined below.\n\n\u2022 \u21132-regularization. If G := O(d) is the orthogonal group acting by multiplication, we recover \u21132 regularization. Indeed, we have Gv = {Uv \u2208 Rd | U \u2208 O(d)} = {w \u2208 Rd | \u2016w\u20162 = \u2016v\u20162}, for any seed v \u2208 Rd. That is, the orbitope OG(v) = conv(Gv) becomes the \u21132-ball with radius \u2016v\u20162. The only property of the seed that matters in this case is its \u21132-norm.\n\n\u2022 Permutahedron. Let P be the symmetric group (also called the permutation group), which can be represented as the set of d-by-d permutation matrices. Given v \u2208 Rd, the orbitope induced by v under P is the convex hull of all the permutations of v, which can be equivalently described as the vectors that are transformations of v through a doubly stochastic matrix:\n\nOP(v) = conv{Pv | P \u2208 P} = {Mv | M1 = 1, M\u22a41 = 1, M \u2265 0}.    (4)\n\nThis set is called the permutahedron [12]. 
We will revisit this case in \u00a74.\n\n\u2022 Signed permutahedron. Let P\u00b1 be the hyperoctahedral group (also called the signed permutation group), i.e., the d-by-d matrices with entries in {0, \u00b11} such that the sum of the absolute values in each row and column is 1. The action of P\u00b1 on Rd permutes the entries of vectors and arbitrarily switches signs. Given v \u2208 Rd, the orbitope induced by v under P\u00b1 is:\n\nOP\u00b1(v) = conv{Diag(s)Pv | P \u2208 P, s \u2208 {\u00b11}d},    (5)\n\nwhere Diag(s) denotes a diagonal matrix formed by the entries of s. We call this set the signed permutahedron; it is depicted in Fig. 1 and will also be revisited in \u00a74.\n\n\u2022 \u21131 and \u2113\u221e-regularization. As a particular case of the signed permutahedron, we recover \u21131 and \u2113\u221e balls by choosing seeds of the form v = \u03b3e1 (a scaled canonical basis vector) and v = \u03b31 (a constant vector), respectively, where \u03b3 is a scalar. In the first case, we obtain the \u21131-ball, OG(v) = \u03b3 conv({\u00b1e1, . . . , \u00b1ed}), and in the second case, we get the \u2113\u221e-ball OG(v) = \u03b3 conv({\u00b11}d).\n\n\u2022 Symmetric matrices with majorized eigenvalues. Let G := O(d) be again the orthogonal group, but now acting by conjugation on the vector space of d-by-d symmetric matrices, V = Sd. Given a seed v \u2261 A \u2208 Sd, its orbit is Gv = {UAU\u22a4 | U \u2208 O(d)} = {U Diag(\u03bb(A))U\u22a4 | U \u2208 O(d)}, where \u03bb(A) denotes a vector containing the eigenvalues of A in decreasing order (so we may assume without loss of generality that the seed is diagonal). The orbitope OG(v) becomes:\n\nOG(v) := {B \u2208 Sd | \u03bb(B) \u2aafP \u03bb(A)},    (6)\n\nwhich is the set of matrices whose eigenvalues are in the permutahedron OP(\u03bb(A)) (see example above). This is called the Schur-Horn orbitope in the literature [8].\n\n\u2022 Square matrices with majorized singular values. 
Let G := O(d) \u00d7 O(d) act on Rd\u00d7d (the space of square matrices, not necessarily symmetric) as gU,V A := UAV\u22a4. Given a seed v \u2261 A, its orbit is Gv = {UAV\u22a4 | U, V \u2208 O(d)} = {U Diag(\u03c3(A))V\u22a4 | U, V \u2208 O(d)}, where \u03c3(A) contains the singular values of A in decreasing order (so we may assume without loss of generality that the seed is diagonal and non-negative). The orbitope OG(v) becomes:\n\nOG(v) := {B \u2208 Rd\u00d7d | \u03c3(B) \u2aafP \u03c3(A)},    (7)\n\nwhich is the set of matrices whose singular values are in the permutahedron OP(\u03c3(A)).\n\n\u2022 Spectral and nuclear norm regularization. The previous case subsumes spectral and nuclear norm balls: indeed, for a seed A = \u03b3Id, the orbitope becomes the convex hull of orthogonal matrices, which is the spectral norm ball {A \u2208 Rd\u00d7d | \u2016A\u20162 := \u03c31(A) \u2264 \u03b3}; while for a seed A = \u03b3 Diag(e1), the orbitope becomes the convex hull of rank-1 matrices with norm bounded by \u03b3, which is the nuclear norm ball {A \u2208 Rd\u00d7d | \u2016A\u2016\u2217 := \u2211_i \u03c3i \u2264 \u03b3}. This norm has been widely used for low-rank matrix factorization and matrix completion [3].\n\nBesides these examples, other regularization strategies, such as non-overlapping \u21132,1 and \u2113\u221e,1 norms [13, 4], can be obtained by considering products of the groups above. We omit details for space.\n\n2.4 Relation with Atomic Norms\n\nAtomic norms have been recently proposed as a toolbox for structured sparsity [6]. Let A \u2286 V be a centrally symmetric set of atoms, i.e., v \u2208 A iff \u2212v \u2208 A. The atomic norm induced by A is defined as \u2016w\u2016A := inf{t > 0 | w \u2208 t conv(A)}. The corresponding atomic ball is the set {w | \u2016w\u2016A \u2264 1} = conv(A). 
Not surprisingly, orbitopes are often atomic norm balls.\n\nProposition 3 (Atomic norms) If G is a subgroup of the general linear group GL(d) and satisfies \u2212v \u2208 Gv, then the set OG(v) is the ball of an atomic norm.\n\nProof: Under the given assumption, the set Gv is centrally symmetric, i.e., it satisfies w \u2208 Gv iff \u2212w \u2208 Gv (indeed, the left hand side implies that w = gv for some g \u2208 G, and \u2212v \u2208 Gv implies that \u2212v = hv for some h \u2208 G, therefore, \u2212w = \u2212gh\u207b\u00b9(\u2212v) = gh\u207b\u00b9v \u2208 Gv). As shown by Chandrasekaran et al. [6], this guarantees that \u2016.\u2016Gv satisfies the axioms of a norm.\n\nCorollary 4 For any choice of seed, the signed permutahedron OP\u00b1(v) and the orbitope formed by the square matrices with majorized singular values are both atomic norm balls. If d is even and v is of the form v = (v+, \u2212v+), with v+ \u2208 R^{d/2}_+, then the permutahedron OP(v) and the orbitope formed by the symmetric matrices with eigenvalues majorized by \u03bb(v) are both atomic norm balls.\n\n3 Matching Function and Region Cones\n\nWe now construct a unifying perspective that highlights the role of the group G. Two key concepts that play a crucial role in our analysis are those of matching function and region cone. In the sequel, these will work as building blocks for important algorithmic and geometric characterizations.\n\nDefinition 5 (Matching function) The matching function of G, mG : V \u00d7 V \u2192 R, is defined as:\n\nmG(u, v) := sup{\u27e8u, w\u27e9 | w \u2208 Gv}.    (8)\n\nIntuitively, mG(u, v) \u201caligns\u201d the orbits of u and v before taking the inner product. Note also that mG(u, v) = sup{\u27e8u, w\u27e9 | w \u2208 OG(v)}, since we may equivalently maximize the linear objective over OG(v), which is the convex hull of Gv. 
We therefore have the following:\n\nProposition 6 (Duality) Fix v \u2208 V, and define the indicator function of the orbitope, I_{OG(v)}(w) = 0 if w \u2208 OG(v), and +\u221e otherwise. The Fenchel dual of I_{OG(v)} is mG(., v). As a consequence, letting L\u22c6 : V \u2192 R be the Fenchel dual of the loss L, the dual problem of Eq. 3 is:\n\nmaximize \u2212L\u22c6(\u2212u) \u2212 mG(u, v)  w.r.t. u \u2208 V.    (9)\n\nNote that if \u2016.\u2016Gv is a norm (e.g., if the conditions of Prop. 3 are satisfied), then the statement above means that mG(., v) = \u2016.\u2016\u22c6Gv is its dual norm. We will revisit this dual formulation in \u00a74.\n\nThe following properties have been established in [14, 15].\n\nProposition 7 For any u, v \u2208 V, we have: (i) mG(c1u, c2v) = c1c2mG(u, v) for c1, c2 \u2265 0; (ii) mG(g1u, g2v) = mG(u, v) for g1, g2 \u2208 G; (iii) mG(u, v) = mG(v, u). Furthermore, the following three statements are equivalent: (i) w \u2aafG v, (ii) f(w) \u2264 f(v) for all G-invariant convex functions f : V \u2192 R, (iii) mG(u, w) \u2264 mG(u, v) for all u \u2208 V.\n\nIn the sequel, we always assume that G is a subgroup of the orthogonal group O(d). This implies that the orbitope OG(v) is compact for any v \u2208 V (and therefore the sup in Eq. 8 can be replaced by a max), and that \u2016gv\u2016 = \u2016v\u2016 for any v \u2208 V. Another important concept is that of the normal cone of a point w \u2208 V with respect to the orbitope OG(v), denoted as NGv(w) and defined as follows:\n\nNGv(w) := {u \u2208 V | \u27e8u, w\u2032 \u2212 w\u27e9 \u2264 0 \u2200w\u2032 \u2aafG v}.    (10)\n\nNormal cones play an important role in convex analysis [16]. The particular case of the normal cone at the seed v (illustrated in Fig. 1) is of great importance, as will be seen below.\n\nDefinition 8 (Region cone) Given v \u2208 V, the region cone at v is KG(v) := NGv(v). It is the set of points that are \u201cmaximally aligned\u201d with v in terms of the matching function:\n\nKG(v) = {u \u2208 V | mG(u, v) = \u27e8u, v\u27e9}.    (11)\n\n4 Permutahedra and Sorted \u21131-Norms\n\nIn this section, we focus on the permutahedra introduced in \u00a72. Below, given a vector w \u2208 Rd, we denote by w(k) its kth order statistic, i.e., we will \u201csort\u201d w so that w(1) \u2265 w(2) \u2265 . . . \u2265 w(d). We also consider the order statistics of the magnitudes |w|(k) by sorting the absolute values.\n\n4.1 Signed Permutahedron\n\nWe start by defining the \u201csorted \u21131-norm,\u201d proposed by Bogdan et al. [9] in their recent SLOPE method as a means to control the false discovery rate, and studied by Zeng and Figueiredo [17].\n\nDefinition 9 (Sorted \u21131-norm) Let v, w \u2208 Rd, with v1 \u2265 v2 \u2265 . . . \u2265 vd \u2265 0 and v1 > 0. The sorted \u21131-norm of w (weighted by v) is defined as: \u2016w\u2016SLOPE,v := \u2211_{j=1}^{d} vj |w|(j).\n\nIn [9] it is shown that \u2016.\u2016SLOPE,v satisfies the axioms of a norm. The rationale is that larger components of w are penalized more than smaller ones, in a way controlled by the prescribed v. For v = 1, we recover the standard \u21131-norm, while the \u2113\u221e-norm corresponds to v = e1. Another special case is the OSCAR regularizer [18, 19], \u2016w\u2016OSCAR,\u03c41,\u03c42 := \u03c41\u2016w\u20161 + \u03c42 \u2211_{i<j} max{|wi|, |wj|}, corresponding to a linearly spaced v, vj = \u03c41 + \u03c42(d \u2212 j) for j = 1, . . . , d. 
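As an illustration of Def. 9 and its special cases, the sorted \u21131-norm can be sketched in a few lines. This is a minimal sketch in pure Python under our own naming (`sorted_l1_norm` and `oscar` are hypothetical helper names, not taken from [9] or [18]); it only evaluates the definitions above on a small vector.

```python
# Sketch of the sorted l1-norm from Def. 9 (pure Python, no external libraries).
# The names sorted_l1_norm and oscar are illustrative choices, not from the paper.

def sorted_l1_norm(w, v):
    # v must be non-increasing and non-negative, with v[0] > 0 (Def. 9).
    assert len(w) == len(v)
    assert all(v[j] >= v[j + 1] for j in range(len(v) - 1))
    assert v[-1] >= 0 and v[0] > 0
    # Order statistics of the magnitudes: |w|_(1) >= ... >= |w|_(d).
    magnitudes = sorted((abs(x) for x in w), reverse=True)
    return sum(vj * mj for vj, mj in zip(v, magnitudes))

def oscar(w, tau1, tau2):
    # Direct evaluation: tau1 * ||w||_1 + tau2 * sum_{i<j} max(|w_i|, |w_j|).
    d = len(w)
    pairwise = sum(max(abs(w[i]), abs(w[j]))
                   for i in range(d) for j in range(i + 1, d))
    return tau1 * sum(abs(x) for x in w) + tau2 * pairwise

w = [0.5, -2.0, 1.0]
l1 = sorted_l1_norm(w, [1.0, 1.0, 1.0])    # v = 1 gives the l1-norm
linf = sorted_l1_norm(w, [1.0, 0.0, 0.0])  # v = e1 gives the l-infinity norm
tau1, tau2 = 0.5, 0.25
# Linearly spaced weights v_j = tau1 + tau2 * (d - j), j = 1, ..., d.
v_oscar = [tau1 + tau2 * (len(w) - j) for j in range(1, len(w) + 1)]
```

On this example, `sorted_l1_norm(w, v_oscar)` and `oscar(w, tau1, tau2)` agree, which is the OSCAR correspondence stated above checked on one vector rather than proved.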
The next proposition reveals a connection between SLOPE and the atomic norm induced by the signed permutahedron.\n\nProposition 10 Let v \u2208 Rd+ be as in Def. 9. The sorted \u21131-norm weighted by v and the atomic norm induced by the P\u00b1-orbitope seeded at v are dual to each other: \u2016.\u2016\u22c6P\u00b1v = \u2016.\u2016SLOPE,v.\n\nProof: From Prop. 6, we have \u2016w\u2016\u22c6P\u00b1v = mP\u00b1(w, v). Let P be a signed permutation matrix s.t. w\u0303 := Pw has its components sorted by decreasing magnitude, |w\u03031| \u2265 . . . \u2265 |w\u0303d|. From Prop. 7, we have mP\u00b1(w, v) = mP\u00b1(w\u0303, v) = \u27e8|w\u0303|, v\u27e9 = \u2016w\u2016SLOPE,v.\n\nThe next proposition [7, 14] provides a characterization of the P\u00b1-orbitope in terms of inequalities about the cumulative distribution of the order statistics.\n\nProposition 11 (Submajorization ordering) The orbitope OP\u00b1(v) can be characterized as:\n\nOP\u00b1(v) = {w \u2208 Rd | \u2211_{j\u2264i} |w|(j) \u2264 \u2211_{j\u2264i} |v|(j), \u2200i = 1, . . . , d}.    (12)\n\nProp. 11 leads to a precise characterization of the atomic norm \u2016w\u2016P\u00b1v, and therefore of the dual norm of SLOPE: \u2016w\u2016P\u00b1v = max_{i=1,...,d} \u2211_{j\u2264i} |w|(j) / \u2211_{j\u2264i} |v|(j).\n\n4.2 Permutahedron\n\nThe unsigned counterpart of Prop. 11 goes back to Hardy et al. [20].\n\nProposition 12 (Majorization ordering) The P-orbitope seeded at v can be characterized as:\n\nOP(v) = {w \u2208 Rd | 1\u22a4w = 1\u22a4v \u2227 \u2211_{j\u2264i} w(j) \u2264 \u2211_{j\u2264i} v(j), \u2200i = 1, . . . , d \u2212 1}.    (13)\n\nAs seen in Corollary 4, if d is even and v = (v+, \u2212v+), with v+ \u2265 0, then \u2016w\u2016Pv qualifies as a norm (we need to confine to the linear subspace V := {w \u2208 Rd | \u2211_{j=1}^{d} wj = 0}). From Prop. 12, we have that this norm can be written as: \u2016w\u2016Pv = max_{i=1,...,d\u22121} \u2211_{j\u2264i} w(j) / \u2211_{j\u2264i} v(j).\n\nProposition 13 Assume the conditions above hold and that v1 \u2265 v2 \u2265 . . . \u2265 vd/2 \u2265 0 and v1 > 0. The dual norm of \u2016.\u2016Pv is \u2016w\u2016\u22c6Pv = \u2211_{j=1}^{d/2} vj(w(j) \u2212 w(d\u2212j+1)).\n\nProof: Similar to the proof of Prop. 10.\n\n5 Conditional and Projected Gradient Algorithms\n\nTwo important classes of algorithms in sparse modeling are the conditional gradient method [21, 22] and the proximal gradient method [23, 24]. Under Ivanov regularization as in Eq. 3, the latter reduces to the projected gradient method. In this section, we show that both algorithms are a good fit for solving Eq. 3 for arbitrary groups, as long as the two building blocks mentioned in \u00a73 are available: (i) a procedure for evaluating the matching function (necessary for conditional gradient methods) and (ii) a procedure for projecting onto the region cone (necessary for projected gradient).\n\nConditional gradient:\n1: Initialize w1 = 0\n2: for t = 1, 2, . . . do\n3:   ut = arg max_{u \u2aafG v} \u27e8\u2212\u2207L(wt), u\u27e9\n4:   \u03b7t = 2/(t + 2)\n5:   wt+1 = (1 \u2212 \u03b7t)wt + \u03b7tut\n6: end for\n\nProjected gradient:\n1: Initialize w1 = 0\n2: for t = 1, 2, . . . do\n3:   Choose a stepsize \u03b7t\n4:   a = wt \u2212 \u03b7t\u2207L(wt)\n5:   wt+1 = arg min_{w \u2aafG v} \u2016w \u2212 a\u2016\n6: end for\n\nFigure 2: Conditional gradient (left) and projected gradient (right) algorithms.\n\n5.1 Conditional Gradient\n\nThe conditional gradient method is shown in Fig. 2 (left). 
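The conditional gradient loop of Fig. 2 can be sketched concretely for the signed permutation group. This is a minimal sketch in pure Python, not the authors' implementation: it assumes the loss L(w) = 0.5 * ||w - a||^2 (chosen only because its gradient is w - a), solves line 3 by sorting magnitudes and matching signs, and uses hypothetical helper names `signed_perm_oracle` and `conditional_gradient`.

```python
# Sketch of Fig. 2 (left) for G = signed permutations, in pure Python.
# Assumptions (not from the paper): the loss is L(w) = 0.5 * ||w - a||^2,
# and the helper names below are illustrative.

def signed_perm_oracle(s, v_sorted):
    # Vertex u of the signed permutahedron maximizing <s, u>: the k-th largest
    # |s_i| receives the k-th largest seed magnitude, with the sign of s_i.
    order = sorted(range(len(s)), key=lambda i: abs(s[i]), reverse=True)
    u = [0.0] * len(s)
    for rank, i in enumerate(order):
        u[i] = v_sorted[rank] if s[i] >= 0 else -v_sorted[rank]
    return u

def conditional_gradient(a, v, num_iters=200):
    # Pre-sort the seed magnitudes once, before the main loop.
    v_sorted = sorted((abs(x) for x in v), reverse=True)
    w = [0.0] * len(a)                                   # line 1
    for t in range(1, num_iters + 1):                    # line 2
        s = [ai - wi for ai, wi in zip(a, w)]            # s = -grad L(w) = a - w
        u = signed_perm_oracle(s, v_sorted)              # line 3
        eta = 2.0 / (t + 2)                              # line 4
        w = [(1 - eta) * wi + eta * ui
             for wi, ui in zip(w, u)]                    # line 5
    return w

a = [3.0, -0.2, 1.0]
v = [1.0, 0.5, 0.0]
u0 = signed_perm_oracle(a, v)
w = conditional_gradient(a, v)
```

Every iterate is a convex combination of orbit points and of the initial point 0, so `w` stays inside the orbitope without any projection step; only the sort in the oracle is specific to the group.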
We assume that a procedure is available for computing the gradient of the loss. The relevant part is the maximization in line 3, which corresponds precisely to an evaluation of the matching function mG(s, v), with s = \u2212\u2207L(wt) (cf. Eq. 8). Fortunately, this step is efficient for a variety of cases:\n\nPermutations. If G = P, the matching function can be evaluated in time O(d log d) with a simple sort operation. Without loss of generality, we assume the seed v is sorted in descending order (otherwise, pre-sort it before the main loop starts). Then, each time we need to evaluate mG(s, v), we compute a permutation P such that Ps is also sorted. The maximizer in line 3 will equal P\u207b\u00b9v.\n\nSigned permutations. If G = P\u00b1, a similar procedure with the same O(d log d) runtime also works, except that now we sort the absolute values, and set the signs of P\u207b\u00b9v to match those of s.\n\nSymmetric matrices with majorized eigenvalues. Let A = U_A \u03bb(A) U_A\u22a4 \u2208 Sd and B = U_B \u03bb(B) U_B\u22a4 \u2208 Sd, where the eigenvalues \u03bb(A) and \u03bb(B) are sorted in decreasing order. In this case, the matching function becomes mG(A, B) = max_{V \u2208 O(d)} trace(A\u22a4VBV\u22a4) = \u27e8\u03bb(A), \u03bb(B)\u27e9 due to von Neumann\u2019s trace inequality [25], the maximizer being V = U_A U_B\u22a4. Therefore, we need only to make an eigen-decomposition and set B\u2032 = U_A \u03bb(B) U_A\u22a4.\n\nSquare matrices with majorized singular values. Let A = U_A \u03c3(A) V_A\u22a4 \u2208 Rd\u00d7d and B = U_B \u03c3(B) V_B\u22a4 \u2208 Rd\u00d7d, where the singular values are sorted. We have mG(A, B) = max_{U,V \u2208 O(d)} trace(A\u22a4UBV\u22a4) = \u27e8\u03c3(A), \u03c3(B)\u27e9 also from von Neumann\u2019s inequality [25]. 
To evaluate the matching function, we need only to make an SVD and set B\u2032 = U_A \u03c3(B) V_A\u22a4.\n\n5.2 Projected Gradient\n\nThe projected gradient algorithm is illustrated in Fig. 2 (right); the relevant part is line 5, which involves a projection onto the orbitope OG(v). This projection may be hard to compute directly, since the orbitope may lack a concise half-space representation. However, we next transform this problem into a projection onto the region cone KG(v) (the proof is in the supplemental material).\n\nProposition 14 Assume G is a subgroup of O(d). Let g \u2208 G be such that \u27e8a, gv\u27e9 = mG(a, v). Then, the solution of the problem in line 5 is w\u2217 = a \u2212 \u03a0KG(gv)(a \u2212 gv).\n\nThus, all that is necessary is computing the arg-max associated with the matching function, and a black box that projects onto the region cone KG(v). Again, this step is efficient in several cases:\n\nPermutations. If G = P, the region cone of a point v is the set of points w satisfying vi > vj \u21d2 wi \u2265 wj, for all i, j \u2208 {1, . . . , d}. Projecting onto this cone is a well-studied problem in isotonic regression [26, 27], with existing O(d) algorithms.\n\nSigned permutations. If G = P\u00b1, this problem is precisely the evaluation of the proximity operator of the sorted \u21131-norm, also solvable in O(d) runtime with a stack-based algorithm [9].\n\n6 Continuation Algorithm\n\nFinally, we present a general continuation procedure for exploring regularization paths when L is a convex loss function (not necessarily differentiable) and the seed v is not prescribed. The procedure, outlined in Fig. 3, solves instances of Eq. 3 for a sequence of seeds v1, v2, . . ., using a simple heuristic for choosing the next seed given the previous one and the current solution.\n\nRequire: Factor \u03b5 > 0, interpolation parameter \u03b1 \u2208 [0, 1]\n1: Initialize seed v0 randomly and set \u2016v0\u2016 = \u03b5\n2: Set t = 0\n3: repeat\n4:   Solve wt = arg min_{w \u2aafG vt} L(w)\n5:   Pick v\u2032t \u2208 Gvt \u2229 KG(wt)\n6:   Set next seed vt+1 = (1 + \u03b5)(\u03b1v\u2032t + (1 \u2212 \u03b1)wt)\n7:   t \u2190 t + 1\n8: until \u2016wt\u2016Gvt < 1.\n9: Use cross-validation to choose the best \u0175 \u2208 {w1, w2, . . .}\n\nFigure 3: Left: Continuation algorithm. Right: Reachable region WG for the hyperoctahedral group, with a reconstruction loss L(w) = \u2016w \u2212 a\u20162. Only points v s.t. \u2212\u2207L(v) = a \u2212 v \u2208 KG(v) belong to this set. Different initializations of v0 lead to different paths along WG, all ending in a.\n\nThe basic principle behind this procedure is the same as in other homotopy continuation methods [28, 29, 30, 31]: we start with very strong regularization (using a small norm ball), and then gradually weaken the regularization (increasing the ball) while \u201ctracking\u201d the solution. The process stops when the solution is found to be in the interior of the ball (the condition in line 8), which means the regularization constraint is no longer active. The main difference with respect to classical homotopy methods is that we do not just scale the ball (in our case, the G-orbitope); we also generate new seeds that shape the ball along the way. To do so, we adopt a simple heuristic (line 6) to make the seed move toward the current solution wt before scaling the orbitope. This procedure depends on the initialization (see Fig. 3 for an illustration), which drives the search into different regions. Reasoning in terms of groups, line 4 makes us move inside the orbits, while line 6 is a heuristic to jump to a nearby orbit. 
For any choice of \u03b5 > 0 and \u03b1 \u2208 [0, 1], the algorithm is convergent and produces a strictly decreasing sequence L(w1) > L(w2) > \u00b7\u00b7\u00b7 before it terminates (a proof is provided as supplementary material). We expect that, eventually, a seed v will be generated that is close to the true model \u0175. Although it may not be obvious at first sight why it would be desirable that v \u2248 \u0175, we provide a simple result below (Prop. 15) that sheds some light on this matter, by characterizing the set of points in V that are \u201creachable\u201d by optimizing Eq. 3.\n\nFrom the optimality conditions of convex programming [32, p. 257], we have that w\u2217 is a solution of the optimization problem in Eq. 3 if and only if 0 \u2208 \u2202L(w\u2217) + NGv(w\u2217), where \u2202L(w) denotes the subdifferential of L at w, and NGv(w) is the normal cone to OG(v) at w, defined in \u00a73. For certain seeds v \u2208 V, it may happen that the optimal solution w\u2217 of Eq. 3 is the seed itself. Let WG be the set of seeds with this property:\n\nWG := {v \u2208 V | L(v) \u2264 L(w), \u2200w \u2aafG v} = {v \u2208 V | 0 \u2208 \u2202L(v) + KG(v)},    (14)\n\nwhere KG(v) is the region cone and the right hand side follows from the optimality conditions. We next show that this set is all we need to care about.\n\nProposition 15 Consider the set of points that are solutions of Eq. 3 for some seed v \u2208 V, \u0174G := {w\u2217 \u2208 V | \u2203v \u2208 V : w\u2217 \u2208 arg min_{w \u2aafG v} L(w)}. We have \u0174G = WG.\n\nProof: Obviously, v \u2208 WG \u21d2 v \u2208 \u0174G. 
For the reverse direction, suppose that w\u2217 \u2208 \u0174G, in which case there is some v \u2208 V such that w\u2217 \u2aafG v and L(w\u2217) \u2264 L(w) for any w \u2aafG v. Since \u2aafG is a pre-order, it must hold in particular that L(w\u2217) \u2264 L(w) for any w \u2aafG w\u2217 \u2aafG v. Therefore, we also have that w\u2217 \u2208 arg min_{w \u2aafG w\u2217} L(w), i.e., w\u2217 \u2208 WG.\n\n7 Simulation Results\n\nWe describe the results of numerical experiments when regularizing with the permutahedron (symmetric group) and the signed permutahedron (hyperoctahedral group). All problems were solved using the conditional gradient algorithm, as described in \u00a75. We generated the true model \u0175 \u2208 Rd by sampling the entries from a uniform distribution in [0, 1] and subtracting the mean, keeping k \u2264 d nonzeros; after which \u0175 was normalized to have unit \u21132-norm. Then, we sampled a random n-by-d matrix X with i.i.d. Gaussian entries and variance \u03c3\u00b2 = 1/d, and simulated measurements y = X\u0175 + n, where n \u223c N(0, \u03c3n\u00b2) is Gaussian noise. We set d = 500 and \u03c3n = 0.3\u03c3.\n\nFor the first set of experiments (Fig. 4), we set k \u2208 {150, 250, 400, 500} and varied the number of measurements n. To assess the advantage of knowing the true parameters up to a group transformation, we used for the orbitope regularizers a seed in the orbit of the true \u0175, up to a constant factor (this constant, and the regularization constants for \u21131 and \u21132, were all chosen with validation in a held-out set). As expected, this information was beneficial, and no significant difference was observed between the permutahedron and the signed permutahedron. For the second set of experiments (Fig. 5), where the aim is to assess the performance of the continuation method, no information about the true model was given. Here, we fixed n = 250 and k = 300 and ran the continuation algorithm with \u03b5 = 0.1 and \u03b1 = 0.0, for 5 different initializations of v0. We observe that this procedure was effective at exploring the orbits, eventually finding a slightly better model than the one found with \u21131 and \u21132 regularizers.\n\nFigure 4: Learning curves for the permutahedron and signed permutahedron regularizers with a perfect seed. Shown are averages and standard deviations over 10 trials. The baselines are \u21131 (three leftmost plots, resp. with k = 150, 250, 400), and \u21132 (last plot, with k = 500).\n\nFigure 5: Mean squared errors in the training set (left) and the test set (right) along the regularization path. For the permutahedra regularizers, this path was traced with the continuation algorithm. The baseline is \u21131 regularization. The horizontal lines in the right plot show the solutions found with validation in a held-out set.\n\n8 Conclusions and Future Work\n\nIn this paper, we proposed a group-based regularization scheme using the notion of orbitopes. Simple choices of groups recover commonly used regularizers such as \u21131, \u21132, \u2113\u221e, spectral and nuclear matrix norms; as well as some new ones, such as the permutahedron and signed permutahedron. As a byproduct, we revealed a connection between the permutahedra and the recently proposed sorted \u21131-norm. We derived procedures for learning with these orbit regularizers via conditional and projected gradient algorithms, and a continuation strategy for orbit exploration.\n\nThere are several avenues for future research. For example, certain classes of groups, such as reflection groups [33], have additional properties that may be exploited algorithmically. Our work should be regarded as a first step toward group-based regularization; we believe that the regularizers studied here are just the tip of the iceberg. 
Groups and their representations are well studied in other\ndisciplines [10], and chances are high that this framework can lead to new regularizers that are a\ngood fit to specific machine learning problems.\n\nAcknowledgments\n\nWe thank all reviewers for their valuable comments. This work was partially supported by FCT\ngrants PTDC/EEI-SII/2312/2012 and PEst-OE/EEI/LA0008/2011, and by the EU/FEDER programme,\nQREN/POR Lisboa (Portugal), under the Intelligo project (contract 2012/24803).\n\nReferences\n[1] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, pages 267\u2013288, 1996.\n[2] D. Donoho. Compressed Sensing. IEEE Transactions on Information Theory, 52(4):1289\u20131306, 2006.\n[3] E. Cand\u00e8s and B. Recht. Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics, 9(6):717\u2013772, 2009.\n[4] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning. MIT Press, 2011.\n[5] S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A Unified Framework for High-Dimensional Analysis of M-estimators with Decomposable Regularizers. In Neural Information Processing Systems, pages 1348\u20131356, 2009.\n[6] V. Chandrasekaran, B. Recht, P. Parrilo, and A. Willsky. The Convex Geometry of Linear Inverse Problems. Foundations of Computational Mathematics, 12(6):805\u2013849, 2012.\n[7] A. Marshall, I. Olkin, and B. Arnold. Inequalities: Theory of Majorization and Its Applications. Springer, 2010.\n[8] R. Sanyal, F. Sottile, and B. Sturmfels. Orbitopes. Technical report, arXiv:0911.5436, 2009.\n[9] M. Bogdan, E. Berg, W. Su, and E. Cand\u00e8s. Statistical estimation and testing via the ordered \u2113_1 norm. Technical report, arXiv:1310.1969, 2013.\n[10] J. Serre and L. Scott. Linear Representations of Finite Groups, volume 42. Springer, 1977.\n[11] Y. Tong. Probability Inequalities in Multivariate Distributions, volume 5. Academic Press New York, 1980.\n[12] G. Ziegler. Lectures on Polytopes, volume 152. Springer, 1995.\n[13] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B (Statistical Methodology), 68(1):49, 2006.\n[14] M. Eaton. On group induced orderings, monotone functions, and convolution theorems. Lecture Notes-Monograph Series, pages 13\u201325, 1984.\n[15] A. Giovagnoli and H. Wynn. G-Majorization with Applications to Matrix Orderings. Linear Algebra and its Applications, 67:111\u2013135, 1985.\n[16] R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.\n[17] X. Zeng and M. A. T. Figueiredo. Decreasing weighted sorted \u2113_1 regularization. Technical report, arXiv:1404.3184, 2014.\n[18] H. Bondell and B. Reich. Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR. Biometrics, 64(1):115\u2013123, 2008.\n[19] L. Zhong and J. Kwok. Efficient Sparse Modeling with Automatic Feature Grouping. IEEE Transactions on Neural Networks and Learning Systems, 23(9):1436\u20131447, 2012.\n[20] G. Hardy, J. Littlewood, and G. P\u00f3lya. Inequalities. Cambridge University Press, 1952.\n[21] M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 3(1-2):95\u2013110, 1956.\n[22] M. Jaggi. Revisiting Frank-Wolfe: Projection-free Sparse Convex Optimization. In Proc. of the International Conference on Machine Learning, pages 427\u2013435, 2013.\n[23] S.J. Wright, R. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479\u20132493, 2009.\n[24] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n[25] L. Mirsky. A trace inequality of John von Neumann. Monatshefte f\u00fcr Mathematik, 79(4):303\u2013306, 1975.\n[26] P. Pardalos and G. Xue. Algorithms for a Class of Isotonic Regression Problems. Algorithmica, 23(3):211\u2013222, 1999.\n[27] R. Luss, S. Rosset, and M. Shahar. Decomposing Isotonic Regression for Efficiently Solving Large Problems. In Neural Information Processing Systems, pages 1513\u20131521, 2010.\n[28] M.R. Osborne, B. Presnell, and B.A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20:389\u2013403, 2000.\n[29] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407\u2013499, 2004.\n[30] M. A. T. Figueiredo, R. Nowak, and S. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586\u2013597, 2007.\n[31] E. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for l1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19:1107\u20131130, 2008.\n[32] D.P. Bertsekas, A. Nedic, and A.E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.\n[33] A. Steerneman. g-majorization, group-induced cone orderings, and reflection groups. Linear Algebra and its Applications, 127:107\u2013119, 1990.\n[34] J.J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. CR de l\u2019Acad\u00e9mie des Sciences de Paris S\u00e9rie A Mathematics, 255:2897\u20132899, 1962.\n", "award": [], "sourceid": 1643, "authors": [{"given_name": "Renato", "family_name": "Negrinho", "institution": "Instituto de Telecomunica\u00e7\u00f5es"}, {"given_name": "Andre", "family_name": "Martins", "institution": "CMU"}]}