{"title": "An Embedding Framework for Consistent Polyhedral Surrogates", "book": "Advances in Neural Information Processing Systems", "page_first": 10781, "page_last": 10791, "abstract": "We formalize and study the natural approach of designing convex surrogate loss functions via embeddings for problems such as classification or ranking. In this approach, one embeds each of the finitely many predictions (e.g. classes) as a point in \\reals^d, assigns the original loss values to these points, and convexifies the loss in some way to obtain a surrogate. We prove that this approach is equivalent, in a strong sense, to working with polyhedral (piecewise linear convex) losses. Moreover, given any polyhedral loss L, we give a construction of a link function through which L is a consistent surrogate for the loss it embeds. We go on to illustrate the power of this embedding framework with succinct proofs of consistency or inconsistency of various polyhedral surrogates in the literature.", "full_text": "An Embedding Framework for Consistent Polyhedral\n\nSurrogates\n\nJessie Finocchiaro\n\njefi8453@colorado.edu\n\nCU Boulder\n\nRafael Frongillo\n\nraf@colorado.edu\n\nCU Boulder\n\nBo Waggoner\n\nbwag@colorado.edu\n\nCU Boulder\n\nAbstract\n\nWe formalize and study the natural approach of designing convex surrogate loss\nfunctions via embeddings for problems such as classi\ufb01cation or ranking. In this\napproach, one embeds each of the \ufb01nitely many predictions (e.g. classes) as a point\nin Rd, assigns the original loss values to these points, and convexi\ufb01es the loss in\nsome way to obtain a surrogate. We prove that this approach is equivalent, in a\nstrong sense, to working with polyhedral (piecewise linear convex) losses. More-\nover, given any polyhedral loss L, we give a construction of a link function through\nwhich L is a consistent surrogate for the loss it embeds. 
We go on to illustrate the power of this embedding framework with succinct proofs of consistency or inconsistency of various polyhedral surrogates in the literature.

1 Introduction

Convex surrogate losses are a central building block in machine learning for classification and classification-like problems. A growing body of work seeks to design and analyze convex surrogates for given loss functions, and more broadly, to understand when such surrogates can and cannot be found. For example, recent work has developed tools to bound the required number of dimensions of the surrogate's hypothesis space [13, 24]. Yet in some cases these bounds are far from tight, such as for abstain loss (classification with an abstain option) [4, 24, 25, 33, 34]. Furthermore, the kinds of strategies available for constructing surrogates, and their relative power, are not well understood.

We augment this literature by studying a particularly natural approach for finding convex surrogates, wherein one "embeds" a discrete loss. Specifically, we say a convex surrogate L embeds a discrete loss ℓ if there is an injective embedding from the discrete reports (predictions) to a vector space such that (i) the original loss values are recovered, and (ii) a report is ℓ-optimal if and only if the embedded report is L-optimal. If this embedding can be extended to a calibrated link function, which maps approximately L-optimal reports to ℓ-optimal reports, then consistency follows [2]. Common examples of this general construction include hinge loss as a surrogate for 0-1 loss and the abstain surrogate mentioned above.

Using tools from property elicitation, we show a tight relationship between such embeddings and the class of polyhedral (piecewise-linear convex) loss functions.
In particular, by focusing on Bayes risks, we show that every discrete loss is embedded by some polyhedral loss, and every polyhedral loss embeds some discrete loss. Moreover, we show that any polyhedral loss gives rise to a calibrated link function to the loss it embeds, thus giving a very general framework to construct consistent convex surrogates for arbitrary losses.

Related works. The literature on convex surrogates focuses mainly on smooth surrogate losses [4, 5, 7, 8, 26, 30]. Nevertheless, nonsmooth losses, such as the polyhedral losses we consider, have been proposed and studied for a variety of classification-like problems [19, 31, 32]. A notable addition to this literature is Ramaswamy et al. [25], who argue that nonsmooth losses may enable dimension reduction of the prediction space (the range of the surrogate hypothesis) relative to smooth losses, illustrating this conjecture with a surrogate for abstain loss needing only log n dimensions for n labels, whereas the best known smooth loss needs n − 1. Their surrogate is a natural example of an embedding (cf. Section 5.1), and serves as inspiration for our work.

While property elicitation by now has an extensive literature [10, 12, 15, 17, 18, 22, 28, 29], these works are mostly concerned with point estimation problems. Literature directly connecting property elicitation to consistency is sparse, with the main reference being Agarwal and Agarwal [2]; note however that they consider single-valued properties, whereas properties elicited by general convex losses are necessarily set-valued.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Setting

For discrete prediction problems like classification, because directly optimizing a given discrete loss is hard, many machine learning algorithms can be thought of as minimizing a surrogate loss function with better optimization qualities, e.g., convexity.
Of course, to show that this surrogate loss successfully addresses the original problem, one needs to establish consistency, which depends crucially on the choice of link function mapping surrogate reports (predictions) to original reports. After introducing notation and terminology from property elicitation, we thus give a sufficient condition for consistency (Def. 4) which depends solely on the conditional distribution over Y.

2.1 Notation and Losses

Let Y be a finite outcome (label) space, and throughout let n = |Y|. The set of probability distributions on Y is denoted Δ_Y ⊆ ℝ^Y, represented as vectors of probabilities. We write p_y for the probability of outcome y ∈ Y drawn from p ∈ Δ_Y. We first discuss the conditional setting, with just labels Y and no features X, and show in § 2.3 how these notions relate to the usual X × Y setting.

We assume that a given discrete prediction problem, such as classification, is given in the form of a discrete loss ℓ : R → ℝ^Y_+, which maps a report (prediction) r from a finite set R to the vector of loss values ℓ(r) = (ℓ(r)_y)_{y∈Y} for each possible outcome y ∈ Y. We assume throughout that the given discrete loss is non-redundant, meaning every report is uniquely optimal (minimizes expected loss) for some distribution p ∈ Δ_Y. Similarly, surrogate losses will be written L : ℝ^d → ℝ^Y_+, typically with reports written u ∈ ℝ^d. We write the corresponding expected loss when Y ∼ p as ⟨p, ℓ(r)⟩ and ⟨p, L(u)⟩.
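As a quick concrete check of this notation (our illustration, not an example from the paper), the expected loss is simply an inner product between the distribution vector and the loss vector; here for 0-1 loss over three labels:

```python
import numpy as np

# Expected loss as an inner product <p, l(r)> (our illustration).
# Here l is 0-1 loss over three labels: l(r)_y = 1{r != y}.

Y = [0, 1, 2]                     # outcome space, n = 3
p = np.array([0.5, 0.3, 0.2])     # a distribution in the simplex

def loss_vector(r):
    """The vector l(r) = (l(r)_y)_{y in Y} for 0-1 loss."""
    return np.array([float(r != y) for y in Y])

expected = {r: float(p @ loss_vector(r)) for r in Y}
# Report 0 minimizes expected loss, with <p, l(0)> = 1 - p_0 = 0.5.
assert min(expected, key=expected.get) == 0
assert np.isclose(expected[0], 0.5)
```

The optimal report is the most likely label, which is exactly the mode property discussed below.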
The Bayes risk of a loss L : ℝ^d → ℝ^Y_+ is the function L̲ : Δ_Y → ℝ_+ given by L̲(p) := inf_{u∈ℝ^d} ⟨p, L(u)⟩; naturally, for discrete losses we write ℓ̲ (and the infimum is over R). For example, 0-1 loss is a discrete loss with R = Y = {−1, 1} given by ℓ_0-1(r)_y = 1{r ≠ y}, with Bayes risk ℓ̲_0-1(p) = 1 − max_{y∈Y} p_y. Two important surrogates for ℓ_0-1 are hinge loss L_hinge(u)_y = (1 − yu)_+, where (x)_+ = max(x, 0), and logistic loss L(u)_y = log(1 + exp(−yu)), both for u ∈ ℝ.

Most of the surrogates L we consider will be polyhedral, meaning piecewise linear and convex; we therefore briefly recall the relevant definitions. In ℝ^d, a polyhedral set or polyhedron is the intersection of a finite number of closed halfspaces. A polytope is a bounded polyhedral set. A convex function f : ℝ^d → ℝ is polyhedral if its epigraph is polyhedral, or equivalently, if it can be written as a pointwise maximum of a finite set of affine functions [27].

Definition 1 (Polyhedral loss). A loss L : ℝ^d → ℝ^Y_+ is polyhedral if L(u)_y is a polyhedral (convex) function of u for each y ∈ Y.

For example, hinge loss is polyhedral, whereas logistic loss is not. To motivate our focus on polyhedral losses, we echo Ramaswamy et al. [25, Section 1.2], who note that smooth surrogates often encode much more information than necessary, and that in such cases non-smooth surrogates are the only candidates achieving a low dimension d.

2.2 Property Elicitation

To make headway, we will appeal to concepts and results from the property elicitation literature, which elevates the property, or map from distributions to optimal reports, to a central object of study in its own right. In our case, this map will often be multivalued, meaning a single distribution could yield multiple optimal reports.
(For example, when p = (1/2, 1/2), both r = 1 and r = −1 optimize 0-1 loss.) To this end, we will use double-arrow notation to denote a mapping to all nonempty subsets, so that Γ : Δ_Y ⇉ R is shorthand for Γ : Δ_Y → 2^R \ {∅}. See the discussion following Definition 3 for conventions regarding R, Γ, γ, L, ℓ, etc.

Definition 2 (Property, level set). A property is a function Γ : Δ_Y ⇉ R. The level set of Γ for report r is the set Γ_r := {p : r ∈ Γ(p)}.

Intuitively, Γ(p) is the set of reports which should be optimal for a given distribution p, and Γ_r is the set of distributions for which the report r should be optimal. For example, the mode is the property mode(p) = arg max_{y∈Y} p_y, and captures the set of optimal reports for 0-1 loss: for each distribution over the labels, one should report the most likely label. In this case we say 0-1 loss elicits the mode, as we formalize below.

Definition 3 (Elicits). A loss L : R → ℝ^Y_+ elicits a property Γ : Δ_Y ⇉ R if

  ∀p ∈ Δ_Y,  Γ(p) = arg min_{r∈R} ⟨p, L(r)⟩ .  (1)

As Γ is uniquely defined by L, we write prop[L] to refer to the property elicited by a loss L. For finite properties (those with |R| < ∞) and discrete losses, we will use lowercase notation γ and ℓ, respectively, with reports r ∈ R; for surrogate properties and losses we use Γ and L, with reports u ∈ ℝ^d. For general properties and losses, we will also use Γ and L, as above.

2.3 Links and Embeddings

To assess whether a surrogate and link function align with the original loss, we turn to the common condition of calibration. Roughly, a surrogate and link are calibrated if the best possible expected loss achieved by linking to an incorrect report is strictly suboptimal.

Definition 4.
Let an original loss ℓ : R → ℝ^Y_+, a proposed surrogate L : ℝ^d → ℝ^Y_+, and a link function ψ : ℝ^d → R be given. We say (L, ψ) is calibrated with respect to ℓ if for all p ∈ Δ_Y,

  inf_{u∈ℝ^d : ψ(u)∉γ(p)} ⟨p, L(u)⟩ > inf_{u∈ℝ^d} ⟨p, L(u)⟩ .  (2)

It is well known that calibration implies consistency, in the following sense (cf. [2]). Given a feature space X, fix a distribution D ∈ Δ(X × Y). Let L* be the best possible expected L-loss achieved by any hypothesis H : X → ℝ^d, and ℓ* the best expected ℓ-loss for any hypothesis h : X → R. Then (L, ψ) is consistent if, for every sequence of surrogate hypotheses H_1, H_2, . . . whose L-loss converges to L*, the ℓ-loss of ψ ∘ H_1, ψ ∘ H_2, . . . converges to ℓ*. As Definition 4 does not involve the feature space X, we will drop it for the remainder of the paper.

Several consistent convex surrogates in the literature can be thought of as "embeddings", wherein one maps the discrete reports to a vector space and finds a convex loss which agrees with the original loss. A key condition is that the original reports should be optimal exactly when the corresponding embedded points are optimal. We formalize this notion as follows.

Definition 5. A loss L : ℝ^d → ℝ^Y embeds a loss ℓ : R → ℝ^Y if there exists some injective embedding φ : R → ℝ^d such that (i) for all r ∈ R we have L(φ(r)) = ℓ(r), and (ii) for all p ∈ Δ_Y, r ∈ R we have

  r ∈ prop[ℓ](p) ⟺ φ(r) ∈ prop[L](p) .  (3)

Note that it is not clear whether embeddings give rise to calibrated links; indeed, apart from mapping the embedded points back to their original reports via ψ(φ(r)) = r, how to map the remaining values is far from clear.
We address the question of when embeddings lead to calibrated links in Section 4.

To illustrate the idea of embedding, let us examine hinge loss in detail as a surrogate for 0-1 loss for binary classification. Recall that we have R = Y = {−1, +1}, with L_hinge(u)_y = (1 − uy)_+ and ℓ_0-1(r)_y := 1{r ≠ y}, typically with link function ψ(u) = sgn(u). We will see that hinge loss embeds (2 times) 0-1 loss, via the embedding φ(r) = r. For condition (i), it is straightforward to check that L_hinge(r)_y = 2ℓ_0-1(r)_y for all r, y ∈ {−1, 1}. For condition (ii), let us compute the property each loss elicits, i.e., the set of optimal reports for each p:

  prop[ℓ_0-1](p) = {1} if p_1 > 1/2;  {−1, 1} if p_1 = 1/2;  {−1} if p_1 < 1/2 .

  prop[L_hinge](p) = [1, ∞) if p_1 = 1;  {1} if p_1 ∈ (1/2, 1);  [−1, 1] if p_1 = 1/2;  {−1} if p_1 ∈ (0, 1/2);  (−∞, −1] if p_1 = 0 .

In particular, we see that −1 ∈ prop[ℓ_0-1](p) ⟺ p_1 ∈ [0, 1/2] ⟺ −1 ∈ prop[L_hinge](p), and 1 ∈ prop[ℓ_0-1](p) ⟺ p_1 ∈ [1/2, 1] ⟺ 1 ∈ prop[L_hinge](p). With both conditions of Definition 5 satisfied, we conclude that L_hinge embeds 2ℓ_0-1. In this particular case, it is known that (L_hinge, ψ) is calibrated for ψ(u) = sgn(u); in Section 4 we show that, perhaps surprisingly, all embeddings lead to calibration with an appropriate link.

3 Embeddings and Polyhedral Losses

In this section, we establish a tight relationship between the technique of embedding and the use of polyhedral (piecewise-linear convex) surrogate losses.
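Before developing the general theory, the worked hinge example above can be verified numerically; a minimal sketch (our illustration, not code from the paper):

```python
import numpy as np

# Numerical check that hinge loss embeds 2*(0-1 loss) via phi(r) = r.
# Loss vectors are indexed by outcomes y in {-1, +1}, in that order.

def hinge(u):
    return np.array([max(1 + u, 0.0), max(1 - u, 0.0)])   # (L(u)_{-1}, L(u)_{+1})

def zero_one(r):
    return np.array([float(r != -1), float(r != 1)])      # (l(r)_{-1}, l(r)_{+1})

# Condition (i): loss values at the embedded points match 2 * (0-1 loss).
for r in (-1, 1):
    assert np.allclose(hinge(r), 2 * zero_one(r))

# Condition (ii): r is l-optimal iff the embedded point phi(r) = r is
# L-optimal, checked over a grid of surrogate reports for several p.
grid = np.linspace(-3, 3, 601)
for p1 in (0.2, 0.5, 0.8):                                # p1 = Pr[Y = +1]
    p = np.array([1 - p1, p1])
    best_hinge = min(p @ hinge(u) for u in grid)
    best_01 = min(p @ zero_one(r) for r in (-1, 1))
    for r in (-1, 1):
        r_optimal = np.isclose(p @ zero_one(r), best_01)
        embed_optimal = np.isclose(p @ hinge(r), best_hinge)
        assert r_optimal == embed_optimal
```

At p_1 = 1/2 both reports pass both tests, matching the level sets computed above.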
We defer to the following section the question of when such surrogates are consistent.

To begin, we observe that our embedding condition in Definition 5 is equivalent to merely matching Bayes risks. This useful fact will drive many of our results.

Proposition 1. A loss L embeds a discrete loss ℓ if and only if L̲ = ℓ̲.

Proof. Throughout we have L : ℝ^d → ℝ^Y_+, ℓ : R → ℝ^Y_+, and define Γ = prop[L] and γ = prop[ℓ].

Suppose L embeds ℓ via the embedding φ. Letting U := φ(R), define γ′ : Δ_Y ⇉ U by γ′ : p ↦ Γ(p) ∩ U. To see that γ′(p) ≠ ∅ for all p ∈ Δ_Y, note that by the definition of γ as the property elicited by ℓ we have some r ∈ γ(p), and by the embedding condition (3), φ(r) ∈ Γ(p). By [9, Lemma 3], we see that L|_U (the loss L with reports restricted to U) elicits γ′ and L̲ = L̲|_U. As L(φ(·)) = ℓ(·) by the embedding, we have

  ℓ̲(p) = min_{r∈R} ⟨p, ℓ(r)⟩ = min_{r∈R} ⟨p, L(φ(r))⟩ = min_{u∈U} ⟨p, L(u)⟩ = L̲|_U(p)

for all p ∈ Δ_Y. Combining with the above, we now have L̲ = ℓ̲.

For the reverse implication, assume that L̲ = ℓ̲. In what follows, we implicitly work in the affine hull of Δ_Y, so that interiors are well-defined, and ℓ̲ may be differentiable on the (relative) interior of Δ_Y. Since ℓ is discrete, −ℓ̲ is polyhedral as the pointwise maximum of a finite set of linear functions. The projection of its epigraph E_ℓ onto Δ_Y forms a power diagram by [3], whose cells are full-dimensional and correspond to the level sets γ_r of γ = prop[ℓ]. For each r ∈ R, let p_r be a distribution in the interior of γ_r, and let u_r ∈ Γ(p_r).
Observe that, by definition of the Bayes risk and Γ, for all u ∈ ℝ^d the hyperplane v ↦ ⟨v, −L(u)⟩ supports the epigraph E_L of −L̲ at the point (p, −⟨p, L(u)⟩) if and only if u ∈ Γ(p). Thus, the hyperplane v ↦ ⟨v, −L(u_r)⟩ supports E_L = E_ℓ at the point (p_r, −⟨p_r, L(u_r)⟩), and thus does so at the entire facet {(p, −⟨p, L(u_r)⟩) : p ∈ γ_r}; by the above, u_r ∈ Γ(p) for all such distributions as well. We conclude that u_r ∈ Γ(p) ⟺ p ∈ γ_r ⟺ r ∈ γ(p), satisfying condition (3) for φ : r ↦ u_r. To see that the loss values match, we merely note that the supporting hyperplanes to the facets of E_L and E_ℓ are the same, and the loss values are uniquely determined by the supporting hyperplane. (In particular, if h supports the facet corresponding to γ_r, we have ℓ(r)_y = L(u_r)_y = h(δ_y), where δ_y is the point distribution on outcome y.)

From this more succinct embedding condition, we can in turn simplify the condition that a loss embeds some discrete loss: it does if and only if its Bayes risk is polyhedral. (We say a concave function is polyhedral if its negation is a polyhedral convex function.) Note that the Bayes risk, a function from distributions over Y to the reals, may be polyhedral even if the loss itself is not.

Proposition 2. A loss L embeds a discrete loss if and only if L̲ is polyhedral.

Proof. If L embeds ℓ, Proposition 1 gives us L̲ = ℓ̲, and its proof already argued that ℓ̲ is polyhedral. For the converse, let L̲ be polyhedral; we again examine the proof of Proposition 1. The projection of L̲ onto Δ_Y forms a power diagram by [3] with finitely many cells C_1, . . . , C_k, which we can index by R := {1, . . . , k}. Defining the property γ : Δ_Y ⇉ R by γ_r = C_r for r ∈ R, we see that the same construction gives us points u_r ∈ ℝ^d such that u_r ∈ Γ(p) ⟺ r ∈ γ(p). Defining ℓ : R → ℝ^Y_+ by ℓ(r) = L(u_r), the same proof shows that L embeds ℓ.

Combining Proposition 2 with the observation that polyhedral losses have polyhedral Bayes risks [9, Lemma 5], we obtain the first direction of our equivalence between polyhedral losses and embeddings.

Theorem 1. Every polyhedral loss L embeds a discrete loss.

We now turn to the reverse direction: which discrete losses are embedded by some polyhedral loss? Perhaps surprisingly, we show that every discrete loss is embeddable, using a construction via convex conjugate duality which has appeared several times in the literature (e.g. [1, 8, 11]). Note however that the number of dimensions d required could be as large as |Y|.

Theorem 2. Every discrete loss ℓ is embedded by a polyhedral loss.

Proof. Let n = |Y|, and let C : ℝ^n → ℝ be given by (−ℓ̲)*, the convex conjugate of −ℓ̲. From standard results in convex analysis, C is polyhedral as −ℓ̲ is, and C is finite on all of ℝ^Y as the domain of −ℓ̲ is bounded [27, Corollary 13.3.1]. Note that −ℓ̲ is a closed convex function, as ℓ̲ is an infimum of affine functions, and thus (−ℓ̲)** = −ℓ̲. Define L : ℝ^n → ℝ^Y by L(u) = C(u)1 − u, where 1 ∈ ℝ^Y is the all-ones vector. We first show that L embeds ℓ, and then establish that the range of L is in fact ℝ^Y_+, as desired.

We compute Bayes risks and apply Proposition 1 to see that L embeds ℓ.
For any p ∈ Δ_Y, we have

  L̲(p) = inf_{u∈ℝ^n} ⟨p, C(u)1 − u⟩
        = inf_{u∈ℝ^n} C(u) − ⟨p, u⟩
        = − sup_{u∈ℝ^n} (⟨p, u⟩ − C(u))
        = −C*(p) = −(−ℓ̲)**(p) = ℓ̲(p) .

It remains to show L(u)_y ≥ 0 for all u ∈ ℝ^n, y ∈ Y. Letting δ_y ∈ Δ_Y be the point distribution on outcome y ∈ Y, we have for all u ∈ ℝ^n that L(u)_y ≥ inf_{u′∈ℝ^n} L(u′)_y = L̲(δ_y) = ℓ̲(δ_y) ≥ 0, where the final inequality follows from the nonnegativity of ℓ.

4 Consistency via Calibrated Links

We have now seen the tight relationship between polyhedral losses and embeddings; in particular, every polyhedral loss embeds some discrete loss. The embedding itself tells us how to link the embedded points back to the discrete reports (map φ(r) to r), but it is not clear when this link can be extended to the remaining reports, and whether such a link can lead to consistency. In this section, we give a construction that generates calibrated links for any polyhedral loss.

The full version [9, Appendix D] contains the full proof; this section provides a sketch along with the main construction and result. The first step is to give a link ψ such that exactly minimizing expected surrogate loss L, followed by applying ψ, always exactly minimizes expected original loss ℓ. The existence of such a link is somewhat subtle, because in general some point u that is far from any embedding point can minimize expected loss for two very different distributions p, p′, making it unclear whether there exists a choice ψ(u) ∈ R that is ℓ-optimal for both distributions. We show that, as we vary p over Δ_Y, there are only finitely many sets of the form U = arg min_{u∈ℝ^d} ⟨p, L(u)⟩ [9, Lemma 4].
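This finiteness is easy to observe numerically for hinge loss; a small sketch (ours, not from the paper), with reports restricted to a bounded grid, which necessarily truncates unbounded optimal sets such as [1, ∞):

```python
import numpy as np

# For a polyhedral loss, only finitely many distinct optimal sets
# U = argmin_u <p, L(u)> arise as p varies. Sketch for hinge loss over
# outcomes (-1, +1), with u on a bounded grid.

grid = np.round(np.linspace(-2, 2, 401), 2)   # reports u, step 0.01

def hinge(u):
    return np.array([max(1 + u, 0.0), max(1 - u, 0.0)])  # outcomes (-1, +1)

optimal_sets = set()
for p1 in (0.0, 0.2, 0.5, 0.8, 1.0):          # p1 = Pr[Y = +1]
    p = np.array([1 - p1, p1])
    vals = np.array([p @ hinge(u) for u in grid])
    U = frozenset(grid[np.isclose(vals, vals.min())])
    optimal_sets.add(U)

# Only five distinct optimal sets appear across these distributions:
# {-1}, {1}, [-1, 1], and the two truncated end segments [-2, -1], [1, 2].
assert len(optimal_sets) == 5
```

Sampling further interior distributions p_1 ∈ (0, 1/2) or (1/2, 1) yields no new sets, illustrating the finiteness claim.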
Associating each U with the set R_U ⊆ R of reports whose embedding points lie in U, we enforce that all points in U link to some report in R_U. (As a special case, embedding points must link to their corresponding reports.) Proving that these choices are well-defined uses a chain of arguments involving the Bayes risk, ultimately showing that if u lies in multiple such sets U, the corresponding report sets R_U all intersect at some r =: ψ(u).

Intuitively, to ensure calibration, we just need to "thicken" this construction, by mapping all approximately optimal points u to optimal reports r. Let U contain all optimal report sets U of the form above. A key step in the following definition is to narrow down a "link envelope" Ψ, where Ψ(u) denotes the legal or valid choices for ψ(u).

Definition 6. Given a polyhedral L that embeds some ℓ, an ε > 0, and a norm ‖·‖, the ε-thickened link ψ is constructed as follows. First, initialize Ψ : ℝ^d ⇉ R by setting Ψ(u) = R for all u. Then, for each U ∈ U, for all points u such that inf_{u*∈U} ‖u* − u‖ < ε, update Ψ(u) = Ψ(u) ∩ R_U. Finally, define ψ(u) ∈ Ψ(u), breaking ties arbitrarily. If Ψ(u) has become empty, leave ψ(u) undefined.

Theorem 3. Let L be polyhedral, and let ℓ be the discrete loss it embeds from Theorem 1. Then for small enough ε > 0, the ε-thickened link ψ is well-defined and, furthermore, is a calibrated link from L to ℓ.

Sketch. Well-defined: For the initial construction above, we argued that if some collection such as U, U′, U′′ overlaps at a point u, then the report sets R_U, R_{U′}, R_{U′′} also overlap, so there is a valid choice r = ψ(u).
Now, we thicken all sets U ∈ U by a small enough ε; it can be shown that if the thickened sets overlap at u, then U, U′, U′′ themselves overlap, so again R_U, R_{U′}, R_{U′′} overlap and there is a valid choice r = ψ(u).

Calibrated: By construction of the thickened link, if u maps to an incorrect report, i.e. ψ(u) ∉ γ(p), then u must be at distance at least ε from the optimal set U. We then show that the minimal gradient of the expected loss along any direction away from U is lower-bounded, giving a constant excess expected loss at u.

Note that the construction given above in Definition 6 is not necessarily computationally efficient as the number of labels n grows. In practice this potential inefficiency is not typically a concern, as the family of losses typically has some closed-form expression in terms of n, and thus the construction can proceed at the symbolic level. We illustrate this formulaic approach in § 5.1.

5 Application to Specific Surrogates

Our results give a framework to construct consistent surrogates and link functions for any discrete loss, but they also provide a way to verify the consistency or inconsistency of given surrogates. Below, we illustrate the power of this framework with specific examples from the literature, as well as new examples. In some cases we simplify existing proofs, while in others we give new results, such as a new calibrated link for abstain loss, and the inconsistency of the recently proposed Lovász hinge.

5.1 Consistency of abstain surrogate and link construction

In classification settings with a large number of labels, several authors consider a variant of classification with the addition of a "reject" or abstain option. For example, Ramaswamy et al.
[25] study the loss ℓ_α : ([n] ∪ {⊥}) → ℝ^Y_+ defined by ℓ_α(r)_y = 0 if r = y, α if r = ⊥, and 1 otherwise. Here the report ⊥ corresponds to "abstaining" if no label is sufficiently likely, specifically, if no y ∈ Y has p_y ≥ 1 − α. Ramaswamy et al. [25] provide a polyhedral surrogate for ℓ_α, which we present here for α = 1/2. Letting d = ⌈log₂(n)⌉, their surrogate is L_{1/2} : ℝ^d → ℝ^Y_+ given by

  L_{1/2}(u)_y = (max_{j∈[d]} B(y)_j u_j + 1)_+ ,  (4)

where B : [n] → {−1, 1}^d is an arbitrary injection; let us assume n = 2^d so that we have a bijection. Consistency is proven for the following link function:

  ψ(u) = ⊥ if min_{i∈[d]} |u_i| ≤ 1/2, and ψ(u) = B^{−1}(sgn(−u)) otherwise.  (5)

In light of our framework, we can see that L_{1/2} is an excellent example of an embedding, where φ(y) = B(y) and φ(⊥) = 0 ∈ ℝ^d. Moreover, the link function ψ can be recovered from Theorem 3 with norm ‖·‖_∞ and ε = 1/2; see Figure 1 (L). Hence, our framework would have simplified the process of finding such a link, and the corresponding proof of consistency. To illustrate this point further, we give an alternate link ψ₁ corresponding to ‖·‖₁ and ε = 1, shown in Figure 1 (R):

  ψ₁(u) = ⊥ if ‖u‖₁ ≤ 1, and ψ₁(u) = B^{−1}(sgn(−u)) otherwise.  (6)

Figure 1: Constructing links for the abstain surrogate L_{1/2} with d = 2. The embedding is shown in bold, labeled by the corresponding reports. (L) The link envelope Ψ resulting from Theorem 3 using ‖·‖_∞ and ε = 1/2, and a possible link ψ which matches eq. (5) from [25].
(M) An illustration of the thickened sets from Definition 6 for two sets U ∈ U, using ‖·‖₁ and ε = 1. (R) The Ψ and ψ from Theorem 3 using ‖·‖₁ and ε = 1.

Theorem 3 immediately gives calibration of (L_{1/2}, ψ₁) with respect to ℓ_{1/2}. Aside from its simplicity, one possible advantage of ψ₁ is that it appears to yield the same constant in generalization bounds as ψ, yet assigns ⊥ to much less of the surrogate space ℝ^d. It would be interesting to compare the two links in practice.

5.2 Inconsistency of Lovász hinge

Many structured prediction settings can be thought of as making multiple predictions at once, with a loss function that jointly measures error based on the relationship between these predictions [14, 16, 23]. In the case of k binary predictions, these settings are typically formalized by taking the predictions and outcomes to be ±1 vectors, so R = Y = {−1, 1}^k. One then defines a joint loss function, which is often merely a function of the set of mispredictions, meaning we may write ℓ_g(r)_y = g({i ∈ [k] : r_i ≠ y_i}) for some set function g : 2^[k] → ℝ. For example, Hamming loss is given by g(S) = |S|. In an effort to provide a general convex surrogate for these settings when g is a submodular function, Yu and Blaschko [32] introduce the Lovász hinge, which leverages the well-known convex Lovász extension of submodular functions. While the authors provide theoretical justification and experiments, consistency of the Lovász hinge is left open, which we resolve.

Rather than formally defining the Lovász hinge, we defer the complete analysis to the full version of the paper [9] and focus here on the k = 2 case.
For brevity, we write g_∅ := g(∅), g_{1,2} := g({1, 2}), etc. Assuming g is normalized and increasing (meaning g_{1,2} ≥ g_1, g_2 ≥ g_∅ = 0), the Lovász hinge L_g : ℝ^k → ℝ^Y_+ is given by

  L_g(u)_y = max{ (1 − u_1 y_1)_+ g_1 + (1 − u_2 y_2)_+ (g_{1,2} − g_1),
                  (1 − u_2 y_2)_+ g_2 + (1 − u_1 y_1)_+ (g_{1,2} − g_2) } ,  (7)

where (x)_+ = max{x, 0}. We explore the range of values of g for which L_g is consistent, where the link function ψ : ℝ² → {−1, 1}² is fixed as ψ(u)_i = sgn(u_i), with ties broken arbitrarily.

Let us consider the coefficients g_∅ = 0, g_1 = g_2 = g_{1,2} = 1, for which ℓ_g is merely 0-1 loss on Y. For consistency, for any distribution p ∈ Δ_Y, we must have that whenever u ∈ arg min_{u′∈ℝ²} ⟨p, L_g(u′)⟩, the outcome ψ(u) must be the most likely, i.e., in arg max_{y∈Y} p(y). Simplifying eq. (7), however, we have

  L_g(u)_y = max{ (1 − u_1 y_1)_+, (1 − u_2 y_2)_+ } = max{ 1 − u_1 y_1, 1 − u_2 y_2, 0 } ,  (8)

which is exactly the abstain surrogate (4) for d = 2. We immediately conclude that L_g cannot be consistent with ℓ_g, as the origin will be the unique optimal report for L_g under distributions with p_y < 0.5 for all y, and one can simply take a distribution which disagrees with the way ties are broken in ψ.
For example, if we take sgn(0) = 1, then under p((1, 1)) = p((1,\u22121)) = p((\u22121, 1)) = 0.2\nand p((\u22121,\u22121)) = 0.4, we have {0} = arg minu\u2208R2(cid:104)p, Lg(u)(cid:105), yet we also have \u03c8(0) = (1, 1) /\u2208\n{(\u22121,\u22121)} = arg minr\u2208R(cid:104)p, (cid:96)g(r)(cid:105).\n\n7\n\nu1u2\u20221\u20222\u20223\u20224\u2022\u22a5RRRR1,\u22a51,\u22a52,\u22a52,\u22a53,\u22a53,\u22a54,\u22a54,\u22a51,2,\u22a51,4,\u22a52,3,\u22a53,4,\u22a5u1u2UU0\u20221\u20222\u20223\u20224\u2022\u22a5u1u2\u20221\u20222\u20223\u20224\u2022\u22a5\fFigure 2: Minimizers of (cid:104)p, (cid:96)top-2(cid:105) and (cid:104)p, (cid:96)2(cid:105), respectively, varying p over \u22063.\n\nIn fact, this example is typical: using our embedding framework, and characterizing when 0 \u2208 R2\nis an embedded point, one can show that Lg is consistent if and only if g1,2 = g1 + g2. Moreover,\nin this linear case, which corresponds to g being modular, the Lov\u00e1sz hinge reduces to weighted\nHamming loss, which is trivially consistent from the consistency of hinge loss for 0-1 loss. In the full\nversion of the paper [9], we generalize this observation for all k: Lg is consistent if and only if g is\nmodular. In other words, even for k > 2, the only consistent Lov\u00e1sz hinge is weighted Hamming\nloss. These results cast doubt on the effectiveness of the Lov\u00e1sz hinge in practice.\n\n5.3\n\nInconsistency of top-k losses\n\nIn certain classi\ufb01cation problems when ground truth may be ambiguous, such as object identi\ufb01cation,\nit is common to predict a set of possible labels. As one instance, the top-k classi\ufb01cation problem\nis to predict the set of k most likely labels; formally, we have R := {r \u2208 {0, 1}n : (cid:107)r(cid:107)0 = k},\n1 < k < n, Y = [n], and discrete loss (cid:96)top-k(r)y = 1 \u2212 ry. Surrogates for this problem commonly\ntake reports u \u2208 Rn, with the link \u03c8(u) = {u[1], . . . 
, u[k]}, where u[i] is the ith largest entry of u. Lapin et al. [19, 20, 21] provide the following convex surrogate loss for this problem, which Yang and Koyejo [31] show to be inconsistent:¹

    Lk(u)y := ( 1 − uy + (1/k) Σ_{i=1}^k (u − ey)[i] )+ ,            (9)

where ey is 1 in component y and 0 elsewhere. With our framework, we can say more. Specifically, while (Lk, ψ) is not consistent for ℓtop-k, since Lk is polyhedral, we know from Theorem 1 that it embeds some discrete loss ℓk, and from Theorem 3 there is a link ψ′ such that (Lk, ψ′) is calibrated (and consistent) for ℓk. We therefore turn to deriving this discrete loss ℓk.

For concreteness, consider the case with k = 2 over n = 3 outcomes. We can re-write L2(u)y = ( 1 − uy + ½(u[1] + u[2] − min(1, uy)) )+. By inspection, we can derive the properties elicited by ℓtop-2 and L2, respectively, which reveals that the set R′ consisting of all permutations of (1, 0, 0), (1, 1, 0), and (2, 1, 0) is always represented among the minimizers of L2. Thus, L2 embeds the loss ℓ2 given by ℓ2(r)y = 0 if ry = 2 and ℓ2(r)y = 1 − ry + ½⟨r, 1 − ey⟩ otherwise. Observe that ℓ2 is just ℓtop-2 with an extra term punishing weight on elements other than y, and a reward for a weight of 2 on y. Moreover, we can visually inspect the corresponding properties (Fig. 2) to immediately see why L2 is inconsistent: for distributions where the two least likely labels are roughly equally (un)likely, the minimizer will put all weight on the most likely label, and thus fail to distinguish the other two. 
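To make the embedding concrete, the following sketch (our own code, not from the paper; names are ours, and labels are 0-indexed rather than 1-indexed) checks that L2 agrees with ℓ2 at every embedded report in R′, and then exhibits the failure mode above: under p = (0.6, 0.2, 0.2), where the two least likely labels tie, the best embedded report concentrates on label 0 alone.

```python
from itertools import permutations

def L2(u, y):
    """Surrogate L_2(u)_y: eq. (9) with k = 2, n = 3."""
    diff = sorted((u[i] - (1.0 if i == y else 0.0) for i in range(3)),
                  reverse=True)                      # entries of u - e_y
    return max(1 - u[y] + 0.5 * (diff[0] + diff[1]), 0.0)

def ell2(r, y):
    """Discrete loss ℓ_2(r)_y embedded by L_2."""
    if r[y] == 2:
        return 0.0
    return 1 - r[y] + 0.5 * sum(r[i] for i in range(3) if i != y)

# R': all permutations of the embedded reports.
R_prime = {q for base in [(1, 0, 0), (1, 1, 0), (2, 1, 0)]
           for q in permutations(base)}

# (i) L2 recovers ℓ2 at every embedded point.
for r in R_prime:
    for y in range(3):
        assert abs(L2(r, y) - ell2(r, y)) < 1e-12

# (ii) With the two least likely labels tied, the expected-loss minimizer
# over R' puts all weight on the most likely label, so no top-2 set can
# be read off: labels 1 and 2 remain indistinguishable.
p = (0.6, 0.2, 0.2)
risks = {r: sum(p[y] * ell2(r, y) for y in range(3)) for r in R_prime}
best = min(risks, key=risks.get)
print(best)  # (1, 0, 0)
```

Restricting the minimization to R′ is justified by the embedding property: these points are always represented among the minimizers of ⟨p, L2⟩.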
More generally, L2 cannot be consistent because the property it embeds does not "refine" (subdivide) the top-k property, so not just ψ, but no link function, could make L2 consistent.

¹Yang and Koyejo also introduce a consistent surrogate, but it is non-convex.

6 Conclusion and Future Directions

This paper formalizes an intuitive way to design convex surrogate losses for classification-like problems: by embedding the reports into R^d. We establish a close relationship between embeddings and polyhedral surrogates, showing both that every polyhedral loss embeds a discrete loss (Theorem 1) and that every discrete loss is embedded by some polyhedral loss (Theorem 2). We then construct a calibrated link function from any polyhedral loss to the discrete loss it embeds, giving consistency for all such losses (Theorem 3). We conclude with examples of how the embedding framework presented can be applied to understand existing surrogates in the literature, including those for the abstain loss, top-k loss, and Lovász hinge. In particular, our link construction recovers the link function proposed by Ramaswamy et al. [25] for abstain loss, as well as another, simpler link based on the L1 norm.

One open question of particular interest involves the dimension of the surrogate prediction space: given a discrete loss, can we construct a surrogate that embeds it of minimal dimension? If we naïvely embed the reports into an n-dimensional space, the dimensionality of the problem scales linearly in the number of possible labels n. 
As the dimension of the optimization problem is a function of this embedding dimension d, a promising direction is to leverage tools from elicitation complexity [13, 18] and convex calibration dimension [24] to understand when we can take d ≪ n.

Acknowledgements

We thank Arpit Agarwal and Peter Bartlett for many early discussions, which led to several important insights. We thank Eric Balkanski for help with a lemma about submodular functions. This material is based upon work supported by the National Science Foundation under Grant No. 1657598.

References

[1] Jacob Abernethy, Yiling Chen, and Jennifer Wortman Vaughan. Efficient market making via convex optimization, and a connection to online learning. ACM Transactions on Economics and Computation, 1(2):12, 2013. URL http://dl.acm.org/citation.cfm?id=2465777.

[2] Arpit Agarwal and Shivani Agarwal. On consistent surrogate risk minimization and property elicitation. In JMLR Workshop and Conference Proceedings, volume 40, pages 1–19, 2015. URL http://www.jmlr.org/proceedings/papers/v40/Agarwal15.pdf.

[3] Franz Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987. URL http://epubs.siam.org/doi/pdf/10.1137/0216006.

[4] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008.

[5] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. URL http://amstat.tandfonline.com/doi/abs/10.1198/016214505000000907.

[6] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[7] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. 
Journal of Machine Learning Research, 2(Dec):265–292, 2001.

[8] John Duchi, Khashayar Khosravi, Feng Ruan, et al. Multiclass classification, information, divergence and surrogate risk. The Annals of Statistics, 46(6B):3246–3275, 2018.

[9] Jessie Finocchiaro, Rafael Frongillo, and Bo Waggoner. An embedding framework for consistent polyhedral surrogates. In Advances in Neural Information Processing Systems, 2019.

[10] Tobias Fissler, Johanna F. Ziegel, et al. Higher order elicitability and Osband's principle. The Annals of Statistics, 44(4):1680–1707, 2016.

[11] Rafael Frongillo and Ian Kash. General truthfulness characterizations via convex analysis. In Web and Internet Economics, pages 354–370. Springer, 2014.

[12] Rafael Frongillo and Ian Kash. Vector-valued property elicitation. In Proceedings of the 28th Conference on Learning Theory, pages 1–18, 2015.

[13] Rafael Frongillo and Ian A. Kash. On elicitation complexity. In Advances in Neural Information Processing Systems 29, 2015.

[14] Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 341–358, 2011.

[15] T. Gneiting. Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762, 2011.

[16] Tamir Hazan, Joseph Keshet, and David A. McAllester. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems, pages 1594–1602, 2010.

[17] Nicolas S. Lambert. Elicitation and evaluation of statistical forecasts. 2018. URL https://web.stanford.edu/~nlambert/papers/elicitability.pdf.

[18] Nicolas S. Lambert, David M. Pennock, and Yoav Shoham. Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, pages 129–138, 2008.

[19] Maksim Lapin, Matthias Hein, and Bernt Schiele. 
Top-k multiclass SVM. In Advances in Neural Information Processing Systems, pages 325–333, 2015.

[20] Maksim Lapin, Matthias Hein, and Bernt Schiele. Loss functions for top-k error: Analysis and insights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1468–1477, 2016.

[21] Maksim Lapin, Matthias Hein, and Bernt Schiele. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1533–1554, 2018.

[22] Kent Osband and Stefan Reichelstein. Information-eliciting compensation schemes. Journal of Public Economics, 27(1):107–115, June 1985. ISSN 0047-2727. doi: 10.1016/0047-2727(85)90031-3. URL http://www.sciencedirect.com/science/article/pii/0047272785900313.

[23] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, pages 302–313, 2017.

[24] Harish G. Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices. The Journal of Machine Learning Research, 17(1):397–441, 2016.

[25] Harish G. Ramaswamy, Ambuj Tewari, Shivani Agarwal, et al. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530–554, 2018.

[26] M. D. Reid and R. C. Williamson. Composite binary losses. The Journal of Machine Learning Research, 9999:2387–2422, 2010.

[27] R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1997.

[28] L. J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, pages 783–801, 1971.

[29] Ingo Steinwart, Chloé Pasin, Robert Williamson, and Siyu Zhang. 
Elicitation and identification of properties. In Proceedings of the 27th Conference on Learning Theory, pages 482–526, 2014.

[30] Robert C. Williamson, Elodie Vernet, and Mark D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17(223):1–52, 2016.

[31] Forest Yang and Sanmi Koyejo. On the consistency of top-k surrogate losses. CoRR, abs/1901.11141, 2019. URL http://arxiv.org/abs/1901.11141.

[32] Jiaqian Yu and Matthew B. Blaschko. The Lovász hinge: A novel convex surrogate for submodular losses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[33] Ming Yuan and Marten Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(Jan):111–130, 2010.

[34] Chong Zhang, Wenbo Wang, and Xingye Qiao. On reject and refine options in multicategory classification. Journal of the American Statistical Association, 113(522):730–745, 2018. doi: 10.1080/01621459.2017.1282372. URL https://doi.org/10.1080/01621459.2017.1282372.