{"title": "Differentiable Ranking and Sorting using Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 6861, "page_last": 6871, "abstract": "Sorting is used pervasively in machine learning, either to define elementary algorithms, such as $k$-nearest neighbors ($k$-NN) rules, or to define test-time metrics, such as top-$k$ classification accuracy or ranking losses. Sorting is however a poor match for the end-to-end, automatically differentiable pipelines of deep learning. Indeed, sorting procedures output two vectors, neither of which is differentiable: the vector of sorted values is piecewise linear, while the sorting permutation itself (or its inverse, the vector of ranks) has no differentiable properties to speak of, since it is integer-valued. We propose in this paper to replace the usual \\texttt{sort} procedure with a differentiable proxy. Our proxy builds upon the fact that sorting can be seen as an optimal assignment problem, one in which the $n$ values to be sorted are matched to an \\emph{auxiliary} probability measure supported on any \\emph{increasing} family of $n$ target values. From this observation, we propose extended rank and sort operators by considering optimal transport (OT) problems (the natural relaxation for assignments) where the auxiliary measure can be any weighted measure supported on $m$ increasing values, where $m \\ne n$. We recover differentiable operators by regularizing these OT problems with an entropic penalty, and solve them by applying Sinkhorn iterations. 
Using these smoothed rank and sort operators, we propose differentiable proxies for the classification 0/1 loss as well as for the quantile regression loss.", "full_text": "Differentiable Ranks and Sorting using Optimal Transport\n\nMarco Cuturi, Olivier Teboul, Jean-Philippe Vert\n\nGoogle Research, Brain Team\n\n{cuturi,oliviert,jpvert}@google.com\n\nAbstract\n\nSorting is used pervasively in machine learning, either to define elementary algorithms, such as k-nearest neighbors (k-NN) rules, or to define test-time metrics, such as top-k classification accuracy or ranking losses. Sorting is however a poor match for the end-to-end, automatically differentiable pipelines of deep learning. Indeed, sorting procedures output two vectors, neither of which is differentiable: the vector of sorted values is piecewise linear, while the sorting permutation itself (or its inverse, the vector of ranks) has no differentiable properties to speak of, since it is integer-valued. We propose in this paper to replace the usual sort procedure with a differentiable proxy. Our proxy builds upon the fact that sorting can be seen as an optimal assignment problem, one in which the n values to be sorted are matched to an auxiliary probability measure supported on any increasing family of n target values. From this observation, we propose extended rank and sort operators by considering optimal transport (OT) problems (the natural relaxation for assignments) where the auxiliary measure can be any weighted measure supported on m increasing values, where m ≠ n. We recover differentiable operators by regularizing these OT problems with an entropic penalty, and solve them by applying Sinkhorn iterations. Using these smoothed rank and sort operators, we propose differentiable proxies for the classification 0/1 loss as well as for the quantile regression loss.\n\n1 Introduction\n\nSorting n real values stored in an array x = (x1, . . . , xn) ∈ Rn requires finding a permutation σ in the symmetric group Sn such that xσ := (xσ1, . . . , xσn) is increasing. A call to a sorting procedure returns either the vector of sorted values S(x) := xσ, or the vector R(x) of the ranks of these values, namely the inverse of the sorting permutation, R(x) := σ−1. For instance, if the input vector x = (0.38, 4, −2, 6, −9), one has σ = (5, 3, 1, 2, 4), and the sorted vector S(x) is xσ = (−9, −2, 0.38, 4, 6), while R(x) = σ−1 = (3, 4, 2, 5, 1) lists the rank of each entry in x.\n\nOn (not) learning with sorting and ranking. Operators R and S play an important role across statistics and machine learning. For instance, R is the main workhorse behind order statistics [12], but also appears prominently in k-NN rules, in which R is applied on a vector of distances to select the closest neighbors to a query point. Ranking is also used to assess the performance of an algorithm: either at test time, such as 0/1 and top-k classification accuracies and NDCG metrics when learning-to-rank [21], or at train time, by selecting pairs [9, 8] and triplets [37] of points of interest. The sorting operator S is of no less importance, and can be used to handle outliers in robust statistics, as in trimmed [20] and least-quantile regression [32] or median-of-means estimators [26, 25]. Yet, and although examples of using R and S abound in ML, neither R nor S are actively used in end-to-end learning approaches: while S is not differentiable everywhere, R is outright pathological, since it is piecewise constant and therefore has a Jacobian ∂R/∂x that is almost everywhere zero.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nEverywhere differentiable proxies to ranking and sorting. 
Replacing the usual ranking and sorting operators by differentiable approximations holds an interesting promise, as it would immediately enable an end-to-end training of any algorithm or metric that uses sorting. For instance, all of the test metrics enumerated above could be upgraded to training losses, if one were able to replace their inner calls to R and S by differentiable proxies. More generally, one can envision applications in which these proxies can be used to impose rank/sorting based constraints, such as fairness considerations that rely on the quantiles of (logistic) regression outputs [14, 22]. In the literature, such smoothed rank operators appeared first in [36], where a softranks operator is defined as the expectation of the rank operator under a random perturbation, Ez[R(x + z)], where z is a standard Gaussian random vector. That expectation (and its gradient w.r.t. x) were approximated in [36] using a O(n3) algorithm. Shortly after, [29] used the fact that the rank of each value xi in x can be written as ∑j 1xi>xj, and smoothed these indicator functions with logistic maps gτ(u) := (1 + exp(−u/τ))−1. The soft-rank operator they propose is A1n where A = gτ(D), where gτ is applied elementwise to the pairwise matrix of differences D = [xi − xj]ij, for a total of O(n2) operations. A similar yet more refined approach was recently proposed by [18], building on the same pairwise difference matrix D to output a unimodal row-stochastic matrix. This yields as in [36] a probabilistic rank for each input.\n\nOur contribution: smoothed R and S operators using optimal transport (OT). We show first that the sorting permutation σ for x can be recovered by solving an optimal assignment (OA) problem, from an input measure supported on all values in x to a second auxiliary target measure supported on any increasing family y = (y1 < ··· < yn). 
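To fix ideas, the hard operators R, S and the pairwise-logistic relaxation of [29] sketched above can be written in a few lines of NumPy (our own illustration, not code from the paper; the logistic map gτ is expressed with tanh to avoid overflow for small τ):

```python
import numpy as np

def hard_rank_sort(x):
    """R(x) and S(x): integer ranks (1-based) and sorted values."""
    sigma = np.argsort(x)                        # sorting permutation
    ranks = np.empty(len(x), dtype=int)
    ranks[sigma] = np.arange(1, len(x) + 1)      # R(x) = sigma^{-1}
    return ranks, x[sigma]

def soft_rank_pairwise(x, tau=1e-3):
    """O(n^2) soft ranks of [29]: row sums of g_tau([x_i - x_j]_ij)."""
    D = x[:, None] - x[None, :]                  # pairwise differences
    A = 0.5 * (1.0 + np.tanh(D / (2.0 * tau)))   # equals the logistic map g_tau(D)
    return A.sum(axis=1)

x = np.array([0.38, 4.0, -2.0, 6.0, -9.0])       # the example from the introduction
R, S = hard_rank_sort(x)                         # R = [3 4 2 5 1], S = [-9 -2 0.38 4 6]
r = soft_rank_pairwise(x)                        # ~ R - 0.5 for small tau (diagonal adds 1/2)
```

As τ shrinks the soft ranks converge to the hard counts (up to the constant 1/2 coming from the diagonal of D), but the map then loses useful gradients, which is the trade-off the paper addresses with OT instead.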
Indeed, a key result from OT theory states that, pending a simple condition on the matching cost, the OA is achieved by matching the smallest element in x to y1, the second smallest to y2, and so forth, therefore “revealing” the sorting permutation of x. We leverage the flexibility of OT to introduce generalized “split” ranking and sorting operators that use target measures with only m ≠ n weighted target values, and use the resulting optimal transport plans to compute convex combinations of ranks and sorted values. These operators are however far too costly to be of practical interest and, much like sorting algorithms, remain non-differentiable. To recover tractable and differentiable operators, we regularize the OT problem and solve it using the Sinkhorn algorithm [10], at a cost of O(nmℓ) operations, where ℓ is the number of Sinkhorn iterations needed for the algorithm to converge. We show that the size m of the target measure can be set as small as 3 in some applications, while ℓ rarely exceeds 100 with the settings we consider.\n\nOutline. We recall first the link between the R and S operators and OT between 1D measures, to define then generalized Kantorovich rank and sort operators in §2. We turn them into differentiable operators using entropic regularization, and discuss in §3 the several parameters that can shape this smoothness. Using these smooth operators, we propose in §4 alternatives to cross-entropy and least-quantile losses to learn classifiers and regression functions.\n\nNotations. We write On ⊂ Rn for the set of increasing vectors of size n, and Σn ⊂ Rn+ for the probability simplex. 1n is the n-vector of ones. Given c = (c1, . . . , cn) ∈ Rn, we write c̄ for the cumulative sum of c, namely the vector (c1 + ··· + ci)i. 
Given two permutations σ ∈ Sn, τ ∈ Sm and a matrix A ∈ Rn×m, we write Aστ for the n × m matrix [Aσiτj]ij obtained by permuting the rows and columns of A using σ, τ. For any x ∈ R, δx is the Dirac measure on x. For a probability measure ξ ∈ P(R), we write Fξ for its cumulative distribution function (CDF), and Qξ for its quantile function (generalized if ξ is discrete). Functions are applied element-wise on vectors or matrices; the ∘ operator stands for the element-wise product of vectors.\n\n2 Ranking and Sorting as an Optimal Transport Problem\n\nThe fact that solving the OT problem between two discrete univariate measures boils down to sorting is well known [33, §2]. The usual narrative states that the Wasserstein distance between two univariate measures reduces to comparing their quantile functions, which can be obtained by inverting CDFs, which are themselves computed by considering the sorted values of the supports of these measures. This downstream connection from OT to quantiles, CDFs and finally sorting has been exploited in several works, notably because the n log n price for sorting is far cheaper than the order n3 log n [35] one has to pay to solve generic OT problems. This is evidenced by the recent surge in interest for sliced Wasserstein distances [30, 5, 23]. We propose in this section to go instead upstream, that is to redefine ranking and sorting functions as byproducts of the resolution of an optimal assignment problem between measures supported on the reals. We then propose in Def. 1 generalized rank and sort operators using the Kantorovich formulation of OT.\n\nSolving the OT problem between 1D measures using sorting. Let ξ, υ be two discrete probability measures on R, defined respectively by their supports x, y and probability weight vectors a, b as ξ = ∑i aiδxi and υ = ∑j bjδyj. We consider in what follows a translation invariant and non-negative ground metric defined as (x, y) ∈ R2 ↦ h(y − x), where h : R → R+. With that ground cost, the OT problem between ξ and υ boils down to the following LP, writing Cxy := [h(yj − xi)]ij,\n\nOTh(ξ, υ) := min P∈U(a,b) ⟨P, Cxy⟩, where U(a, b) := {P ∈ Rn×m+ | P1m = a, PT1n = b}. (1)\n\nWe make in what follows the additional assumption that h is convex. A fundamental result [33, Theorem 2.9] states that in that case (see also [13] for the more involved case where h is concave) OTh(ξ, υ) can be computed in closed form using the quantile functions Qξ, Qυ of ξ, υ:\n\nOTh(ξ, υ) = ∫[0,1] h(Qυ(u) − Qξ(u)) du. (2)\n\nTherefore, to compute OT between ξ and υ, one only needs to integrate the difference in their quantile functions, which can be done by inverting the empirical distribution functions for ξ, υ, which itself only requires sorting the entries in x and y to obtain their sorting permutations σ and τ. Additionally, Eq. (2) allows us not only to recover the value of OTh as defined in Eq. (1), but it can also be used to recover the corresponding optimal solution P⋆ in n + m operations, using the permutations σ and τ to build a so-called north-west corner solution [28, §3.4.2]:\n\nProposition 1. Let σ and τ be sorting permutations for x and y. Define N to be the north-west corner solution using permuted weights aσ, bτ. Then Nσ−1,τ−1 is optimal for (1).\n\nSuch a permuted north-western corner solution is illustrated in Figure 1(b). 
It is indeed easy to check that in that case (P⋆)σ,τ runs from the top-left (north-west) to the bottom right corner. In the simple case where n = m and a = b = 1n/n, the solution Nσ−1,τ−1 is a permutation matrix divided by n, namely a matrix equal to 0 everywhere except for its entries indexed by (i, (τ ∘ σ−1)i)i which are all equal to 1/n. That solution is a vertex of the Birkhoff [3] polytope, namely, an optimal assignment which to the i-th value in x associates the (τ ∘ σ−1)i-th value in y; informally, this solution assigns the i-th smallest entry in x to the i-th smallest entry in y.\n\nGeneralizing sorting, CDFs and quantiles using optimal transport. From now on in this paper, we make the crucial assumption that y is already sorted, that is, y1 < ··· < ym. τ is therefore the identity permutation. When in addition n = m, the i-th value in x is simply assigned to the σ−1i-th value in y. Conversely, and as illustrated in Figure 1(a), the rank i value in x is assigned to the i-th value yi. Because of this, R and S can be rewritten using the optimal assignment matrix P⋆:\n\nProposition 2. Let n = m and a = b = 1n/n. Then for all strictly convex functions h and y ∈ On, if P⋆ is an optimal solution to (1), then\n\nR(x) = n2P⋆b̄ = nP⋆(1, . . . , n)T = nFξ(x), S(x) = nP⋆Tx = Qξ(b̄) ∈ On.\n\nThese identities stem from the fact that nP⋆ is a permutation matrix, which can be applied to the vector nb̄ = (1, . . . , n) to recover the rank of each entry in x, or transposed and applied to x to recover the sorted values of x. The former expression can be equivalently interpreted as n times the CDF of ξ evaluated elementwise at x, the latter as the quantiles of ξ at levels b̄. The identities in Prop. 
2 are valid when the input measures ξ, υ are uniform and of the same size. The first contribution of this paper is to consider more general scenarios, in which m, the size of y, can be smaller than n, and where weights a, b need not be uniform. This is a major departure from previous references [18, 36, 29], which all require pairwise comparisons between the entries in x. We show in our applications that m can be as small as 3 when trying to recover a quantile, as in Figs. 1, 3.\n\nKantorovich ranks and sorts. The so-called Kantorovich formulation of OT [33, §1.5] can be used to compare discrete measures of varying sizes and weights. Solving that problem usually requires splitting the mass ai of a point xi so that it is assigned across many points yj (or vice-versa). As a result, the i-th line (or j-th column) of a solution P⋆ ∈ Rn×m+ usually has more than one positive entry. Extending directly the formulas presented in Prop. 2, we recover extended operators that we call Kantorovich ranking and sorting operators. These operators are new to the best of our knowledge. The K-ranking operator computes convex combinations of rank values (as described in the entries nb̄) while the K-sorting operator computes convex combinations of values contained in x directly. Note that we consider here convex combinations (weighted averages) of these ranks/values, according to the Euclidean geometry. Extending more generally these combinations to Fréchet means using alternative geometries (KL, hyperbolic, etc.) on these ranks/values is left for future work. Because these quantities are only defined pointwisely (we output vectors and not functions) and depend on the ordering of a, x, b, y, we drop our reference to measure ξ in notations.\n\nDefinition 1. 
For any (x, a, y, b) ∈ Rn × Σn × Om × Σm, let P⋆ ∈ U(a, b) be an optimal solution for (1) with a given convex function h. The K-ranks and K-sorts of x w.r.t. a evaluated using (b, y) are respectively:\n\nR̃(a, x; b, y) := na−1 ∘ (P⋆b̄) ∈ [0, n]n,\nS̃(a, x; b, y) := b−1 ∘ (P⋆Tx) ∈ Om.\n\nThe K-rank vector map R̃ outputs a vector of size n containing a continuous rank for each entry of x (these entries can be alternatively interpreted as n times a “synthetic” CDF value in [0, 1], itself a convex mixture of the CDF values b̄j of the yj onto which each xi is transported). S̃ is a split-quantile operator outputting m increasing values which are each, respectively, barycenters of some of the entries in x. The fact that these values are increasing can be obtained by a simple argument in which ξ and υ are cast again as uniform measures of the same size using duplicated supports xi and yj, and then use the monotonicity given by the third identity of Prop. 2.\n\nFigure 1: (a) sorting seen as transporting optimally x to milestones in y. (b) Kantorovich sorting generalizes the latter by considering target measures y with m = 3 non-uniformly weighted points (here b = [.48, .16, .36]). K-ranks and K-sorted vectors R̃, S̃ are generalizations of R and S that operate by mixing ranks in b̄ or mixing original values in x to form continuous ranks for the elements in x and m “synthetic” quantiles at levels b̄. (c) Entropy regularized OT generalizes further K-operations by solving OT with the Sinkhorn algorithm, which results in dense transport plans differentiable in all inputs.\n\nComputations and Non-differentiability. The generalized ranking and sorting operators presented in Def. 1 are interesting in their own right, but have very little practical appeal. 
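Prop. 2 and Def. 1 can be illustrated numerically with a small NumPy sketch of the north-west corner solution of Prop. 1 (our own code; the function names are ours, and the m = 3 test below reuses the weights b = [.48, .16, .36] of Figure 1(b) with an arbitrary x ordered as in that figure):

```python
import numpy as np

def north_west(a, b):
    """North-west corner plan between two sorted 1D measures (greedy filling)."""
    P = np.zeros((len(a), len(b)))
    i = j = 0
    ra, rb = a[0], b[0]                   # mass left in current row / column
    while True:
        m = min(ra, rb)
        P[i, j] = m
        ra -= m
        rb -= m
        if ra <= 1e-12:                   # row exhausted: move south
            i += 1
            if i == len(a):
                break
            ra = a[i]
        if rb <= 1e-12:                   # column exhausted: move east
            j += 1
            if j == len(b):
                break
            rb = b[j]
    return P

def k_rank_sort(x, a, b):
    """K-ranks and K-sorts of Def. 1 (y sorted; the plan only depends on the
    ordering of x and on the weights, so y and h stay implicit here)."""
    n = len(x)
    sigma = np.argsort(x)
    P = np.zeros((n, len(b)))
    P[sigma] = north_west(a[sigma], b)    # Prop. 1: un-permute the rows
    b_bar = np.cumsum(b)
    return n * (P @ b_bar) / a, (P.T @ x) / b

# n = m with uniform weights: recovers R(x) and S(x), as in Prop. 2
x = np.array([0.38, 4.0, -2.0, 6.0, -9.0])
u5 = np.ones(5) / 5
R, S = k_rank_sort(x, u5, u5)             # R = [3 4 2 5 1], S = sorted x

# m = 3 weighted targets, x ordered so that x4 < x5 < x1 < x2 < x3
x2 = np.array([1.0, 2.0, 3.0, -2.0, 0.0])
Rk, Sk = k_rank_sort(x2, u5, np.array([0.48, 0.16, 0.36]))
# Rk mixes the levels b_bar = (.48, .64, 1); Sk is an increasing 3-vector
```

With the Figure 1(b) weights, the continuous ranks come out as 5 · (.576, .928, 1, .48, .48), matching the mixtures of CDF levels described above.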
For one, their computation relies on solving an OT problem at a cost of O(nm(n + m) log(nm)) [35] and remains therefore far more costly than regular sorting, even when m is very small. Furthermore, these operators remain fundamentally not differentiable. This can be hinted at by the simple fact that it is difficult to guarantee in general that a solution P⋆ to (1) is unique. Most importantly, the Jacobian ∂P⋆/∂x is, very much like R, null almost everywhere. This can be visualized by looking at Figure 1(b) to notice that an infinitesimal change in x would not change P⋆ (notice however that an infinitesimal change in weights a would; that Jacobian would involve north-west corner type mass transfers). All of these pathologies — computational cost, non-uniqueness of the optimal solution and non-differentiability — can be avoided by using regularized OT [10].\n\n3 The Sinkhorn Ranking and Sorting Operators\n\nBoth the K-rank R̃ and K-sort S̃ operators are expressed using the optimal solution P⋆ to the linear program in (1). However, P⋆ is not differentiable w.r.t. inputs a, x nor parameters b, y [2, §5]. We propose instead to rely on a differentiable variant [10, 11] of the OT problem that uses entropic regularization [38, 17, 24], as detailed in [28, §4]. This differentiability is reflected in the fact that the optimal regularized transport plan is a dense matrix (yielding more arrows in Fig. 1(c)), which ensures differentiability everywhere w.r.t. 
both a and x.\n\n[Figure 1, worked example: for an x ordered as σ = (4, 5, 1, 2, 3), R(x) = 5 · (.6, .8, 1, .2, .4) and S(x) = (x4, x5, x1, x2, x3); with m = 3 targets weighted by b = [.48, .16, .36], R̃(x) = 5 · (.576, .928, 1, .48, .48) and S̃(x) = (.166x1 + .4167(x4 + x5), .75x1 + .25x2, .444x2 + .556x3).]\n\nFigure 2: Behaviour of the S-ranks R̃ε(a, x; b, y) and S-sort operators S̃ε(a, x; b, y) as a function of ε. Here n = m = 10, b is uniform and y = (0, . . . , m − 1)/(m − 1) is the regular grid in [0, 1]. (left) input data x presented as a bar plot. (center) Vector output of R̃ε(a, x; b, y) (continuous ranks) as a function of ε. When ε is small, one recovers an integer valued vector of ranks. As ε increases, regularization kicks in and produces mixtures of rank values that are continuous. These mixed ranks are closer for values that are close in absolute terms, as is the case with the 0-th and 9-th index of the input vector whose continuous ranks are almost equal when ε ≈ 10−2. (right) vector of “soft” sorted values. These converge to the average of values in x as ε is increased.\n\nConsider first a regularization strength ε > 0 to define the solution to the regularized OT problem:\n\nPε⋆ := argmin P∈U(a,b) ⟨P, Cxy⟩ − εH(P), where H(P) = −∑i,j Pij(log Pij − 1).\n\nOne can easily show [10] that Pε⋆ has the factorized form D(u)KD(v), where K = exp(−Cxy/ε) and u ∈ Rn and v ∈ Rm are fixed points of the Sinkhorn iteration outlined in Alg. 1. To differentiate Pε⋆ w.r.t. a or x one can use the implicit function theorem, but this would require solving a linear system using K. 
We consider here a more direct approach, using algorithmic differentiation of the Sinkhorn iterations, after a number ℓ of iterations needed for Alg. 1 to converge [19, 4, 15]. That number ℓ depends on the choice of ε [16]: typically, the smaller ε, the more iterations ℓ are needed to ensure that each successive update in v, u brings the column-sum of the iterate D(u)KD(v) closer to b, namely that the difference between v ∘ KTu and b (as measured by a discrepancy function ∆ as used in Alg. 1) falls below a tolerance parameter η. Assuming Pε⋆ has been computed, we introduce Sinkhorn ranking and sorting operators by simply appending an ε subscript to the quantities presented in Def. 1, and replacing P⋆ in these definitions by the regularized OT solution Pε⋆ = D(u)KD(v).\n\nAlgorithm 1: Sinkhorn\nInputs: a, b, x, y, ε, h, η\nCxy ← [h(yj − xi)]ij;\nK ← e−Cxy/ε, u = 1n;\nrepeat\n v ← b/KTu, u ← a/Kv\nuntil ∆(v ∘ KTu, b) < η;\nResult: u, v, K\n\nDefinition 2 (Sinkhorn Rank & Sort). Given a regularization strength ε > 0, run Alg. 1 to define\n\nR̃ε(a, x; b, y) := na−1 ∘ u ∘ K(v ∘ b̄) ∈ [0, n]n,\nS̃ε(a, x; b, y) := b−1 ∘ v ∘ KT(u ∘ x) ∈ Rm.\n\nSensitivity to ε. Parameter ε plays the same role as other temperature parameters in previously proposed smoothed sorting operators [29, 36, 18]: the smaller ε is, the closer the Sinkhorn operator’s output is to the original vectors of ranks and sorted values; the bigger ε, the closer Pε⋆ to the matrix abT, and therefore all entries of R̃ε collapse to the average of nb̄, while all entries of S̃ε collapse to the weighted average (using a) of x, as illustrated in Fig. 2. Although choosing a small value for ε might seem natural, in the sense that R̃ε, S̃ε approximate more faithfully R, S, one should not forget that this would result in recovering the deficiencies of R, S in terms of differentiability. When learning with such operators, it may therefore be desirable to use a value for ε that is large enough to ensure ∂Pε⋆/∂x has non-null entries. We usually set ε = 10−2 or 10−3 when x, y lie in [0, 1] as in Fig. 2. We have kept ε fixed throughout Alg. 1, but we do notice some speedups using scheduling as advocated by [34].\n\nParallelization. The Sinkhorn computations laid out in Algorithm 1 imply the application of kernels K or KT to vectors v and u of size m and n respectively. These computations can be carried out in parallel to compare S vectors x1, . . . , xS ∈ Rn of real numbers, with respective probability weights a1, . . . , aS, to a single vector y with weights b. To do so, one can store all kernels Ks := e−Cs/ε in a tensor of size S × n × m, where Cs = Cxsy.\n\nNumerical Stability. When using small regularization strengths, we recommend to cast Sinkhorn iterations in the log-domain by considering the following stabilized iterations for each pair of vectors xs, y, resulting in the following updates (with α and β initialized to 0n and 0m),\n\nα ← ε log a + minε(Cxsy − α1Tm − 1nβT) + α, (3)\nβ ← ε log b + minε(CTxsy − 1mαT − β1Tn) + β, (4)\n\nwhere minε is the soft-minimum operator applied linewise to a matrix to output a vector, namely for M ∈ Rn×m, minε(M) ∈ Rn and is such that [minε(M)]i = −ε log ∑j e−Mij/ε. The rationale behind the subtractions/additions of α and β above is that once a Sinkhorn iteration is carried out, the terms inside the parenthesis above are normalized, in the sense that once divided by ε, their exponentials sum to one (they can be used to recover a coupling). Therefore, they must be negative, which improves the stability of summing exponentials [28, §4.4].\n\nCost function. Any nonnegative convex function h can be used to define the ground cost, notably h(u) = |u|p, with p set to either 1 or 2. Another important result that we inherit from OT is that, assuming ε is close enough to 0, the transport matrices Pε⋆ we obtain should not vary under the application of any increasing map to each entry in x or y. We take advantage of this important result to stabilize Sinkhorn’s algorithm further, and at the same time resolve the thorny issue of being able to settle for a value for ε that can be used consistently, regardless of the range of values in x. 
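Putting Alg. 1, Def. 2 and the log-domain updates (3)-(4) together, here is a self-contained NumPy sketch (our own code, with names of our choosing and a fixed iteration count in place of the tolerance test on ∆; the plan is recovered as Pε⋆ = exp((α1Tm + 1nβT − Cxy)/ε)):

```python
import numpy as np

def softmin(M, eps):
    """Row-wise soft minimum: [min_eps(M)]_i = -eps * log sum_j exp(-M_ij / eps)."""
    mn = M.min(axis=1, keepdims=True)     # shift by the row minimum for stability
    return mn[:, 0] - eps * np.log(np.exp(-(M - mn) / eps).sum(axis=1))

def sinkhorn_rank_sort(x, y, a, b, eps=1e-3, h=lambda u: u ** 2, n_iter=300):
    """Sinkhorn ranks/sorts of Def. 2, via stabilized log-domain iterations."""
    C = h(y[None, :] - x[:, None])        # C_ij = h(y_j - x_i)
    alpha, beta = np.zeros(len(x)), np.zeros(len(y))
    for _ in range(n_iter):               # updates (3)-(4)
        beta = eps * np.log(b) + softmin(C.T - beta[:, None] - alpha[None, :], eps) + beta
        alpha = eps * np.log(a) + softmin(C - alpha[:, None] - beta[None, :], eps) + alpha
    P = np.exp((alpha[:, None] + beta[None, :] - C) / eps)   # P_eps = D(u) K D(v)
    R_eps = len(x) * (P @ np.cumsum(b)) / a    # n a^{-1} o (P b_bar)
    S_eps = (P.T @ x) / b                      # b^{-1} o (P^T x)
    return R_eps, S_eps

x = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
y = np.linspace(0.0, 1.0, 5)
w = np.ones(5) / 5
R_eps, S_eps = sinkhorn_rank_sort(x, y, w, w)
# for small eps, R_eps ~ (1, 4, 2, 5, 3) and S_eps ~ (0.1, 0.3, 0.5, 0.7, 0.9)
```

In an autodiff framework (the paper's setting), the same loop written with differentiable primitives yields gradients of R̃ε, S̃ε w.r.t. a and x by backpropagating through the iterations.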
We propose to set y to be the regular grid on [0, 1] with m points, and rescale the input entries of x so that they cover [0, 1] to define the cost matrix Cxy. We rescale the entries of x using an increasing squashing function, such as arctan or a logistic map. We also notice in our experiments that it is important to standardize input vectors x before squashing them into [0, 1]n, namely to apply, given a squashing function g, the map g̃ on x before computing the cost matrix Cxy:\n\ng̃ : x ↦ g((x − x̄1n)/(n−1/2‖x − x̄1n‖2)), where x̄ = xT1n/n.\n\nThe choices that we have made are summarized in Alg. 2, but we believe there are opportunities to perfect them depending on the task.\n\nAlgorithm 2: Sinkhorn Ranks/Sorts\nInputs: (as, xs)s ∈ (Σn × Rn)S, (b, y) ∈ Σm × Om, h, ε, η, g̃.\n∀s, x̃s = g̃(xs), Cs = [h(yj − (x̃s)i)]ij, αs = 0n, βs = 0m.\nrepeat\n ∀s, βs ← ε log b + minε(CTs − 1mαTs − βs1Tn) + βs\n ∀s, αs ← ε log as + minε(Cs − αs1Tm − 1nβTs) + αs\nuntil maxs ∆(exp(−(CTs − 1mαTs − βs1Tn)/ε)1n, b) < η;\n∀s, R̃ε(xs) ← na−1s ∘ exp(−(Cs − αs1Tm − 1nβTs)/ε) b̄,\n∀s, S̃ε(xs) ← b−1 ∘ exp(−(CTs − 1mαTs − βs1Tn)/ε) xs.\nResult: (R̃ε(xs), S̃ε(xs))s.\n\nSoft τ quantiles. To illustrate the flexibility offered by the freedom to choose a non-uniform target measure b, y, we consider the problem of computing a smooth approximation of the τ quantile of a discrete distribution ξ, where τ ∈ [0, 1]. 
This smooth approximation can be obtained by transporting ξ towards a tilted distribution, with weights split roughly as τ on the left and (1 − τ) on the right, with the addition of a small “filler” weight in the middle. This filler weight is set to a small value t, and is designed to “capture” whatever values may lie close to that quantile. This choice results in m = 3, with weights b = [τ − t/2, t, 1 − τ − t/2] and target values y = [0, 1/2, 1] as in Figure 3, in which t = 0.1. With such weights/locations, a differentiable approximation to the τ-quantile of the inputs can be recovered as the second entry of vector S̃ε:\n\nq̃ε(x; τ, t) = [S̃ε(1n/n, x; [τ − t/2, t, 1 − τ − t/2], [0, 1/2, 1], h)]2. (5)\n\nFigure 3: Computing the 30% quantile of 20 values as the weighted average of values that are selected by the Sinkhorn algorithm to send their mass onto the filler weight t located halfway in [0, 1], and “sandwiched” by two masses approximately equal to τ, 1 − τ.\n\nFigure 4: Error bars (averages over 12 runs) for test accuracy curves on CIFAR-10 using the same network structures, a vanilla CNN with 4 convolution layers on the left and a resnet18 on the right. We use the ADAM optimizer with a constant stepsize set to 10−4.\n\nFigure 5: Identical setup to Fig. 4, with the CIFAR-100 database.\n\n4 Learning with Smoothed Ranks and Sorts\n\nDifferentiable approximation of the top-k Loss. Given a set of labels {1, . . . , L} and a space Ω of input points, a parameterized multiclass classifier on Ω is a function fθ : Ω → RL. 
The function decides the class attributed to ω by selecting a label with largest activation, l⋆ ∈ argmaxl[fθ(ω)]l. To train the classifier using a training set {(ωi, li)} ∈ (Ω × L)N, one typically resorts to minimizing the cross-entropy loss, which results in solving minθ ∑i 1TL log fθ(ωi) − [fθ(ωi)]li.\n\nWe propose a differentiable variant of the 0/1 and more generally top-k losses that neither relies on combinatorial considerations [27, 39] nor builds upon non-differentiable surrogates [6]. Ignoring the degenerate case in which l⋆ is not unique, given a query ω, stating that the label l⋆ has been selected is equivalent to stating that the entry indexed at l⋆ of the vector of ranks R(fθ(ω)) is L. Given a labelled pair (ω, l), the 0/1 loss of the classifier for that pair is therefore,\n\nL0/1(fθ(ω), l) = H(L − [R(fθ(ω))]l), (6)\n\n
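The hard loss (6) takes a few lines of NumPy (a toy check of our own; the scores are made up):

```python
import numpy as np

def zero_one_loss(scores, label):
    """Hard 0/1 loss of Eq. (6): H(L - [R(f_theta(omega))]_label)."""
    L = len(scores)
    ranks = np.empty(L, dtype=int)
    ranks[np.argsort(scores)] = np.arange(1, L + 1)   # vector of ranks R
    return float(L - ranks[label] > 0)                # Heaviside H

scores = np.array([0.1, 0.2, 2.0, 0.4])
# label 2 has the largest activation (rank L), so its loss is 0; others get 1
```

This makes the two pathologies explicit: both the rank computation and the Heaviside step are piecewise constant in `scores`, hence gradient-free almost everywhere.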
Our\nmethod performs better than the method presented in [18] for all the sorting tasks, with the exact\nsame network architecture.\n\n0.920 (0.946)\n0.919 (0.945)\n0.928 (0.950)\n\n0.452 (0.829)\n0.434 (0.824)\n0.497 (0.847)\n\n0.636 (0.873)\n0.610 (0.862)\n0.656 (0.882)\n\nOur\n\nn=5\n\n0.790 (0.907)\n0.777 (0.901)\n0.811 (0.917)\n\nwhere H is the heaviside function: H(u) = 1 if u > 0 and H(u) = 0 for u \u2264 0. More generally, if\nfor some labelled input \u03c9, the entry [R(f\u03b8)]lo is bigger than L \u2212 k + 1, then that labelled example\nhas a top-k error of 0. Conversely, if [R(f\u03b8)]l is smaller than L \u2212 k + 1, then the top-k error is 1.\nThe top-k error can be therefore formulated as in (6), where the argument L \u2212 [R(f\u03b8(\u03c9)]l within the\nHeaviside function is replaced by L \u2212 [R(f\u03b8(\u03c9)]l \u2212 k + 1.\nThe 0/1 and top-k losses are unstable on two different counts: H is discontinuous, and so is R\nwith respect to the entries f\u03b8(\u03c9). The differentiable loss that we propose, as a replacement for\ncross-entropy (or more generalized top-k cross entropy losses [1]), leverages therefore both the\nSinkhorn rank operator and a smoothed Heaviside like function. Because Sinkhorn ranks are always\nwithin the boundaries of [0, L], we propose to modify this loss by considering a continuous increasing\nfunction Jk from [0, L] to R:\n\n(cid:101)Lk,\u03b5(f\u03b8(\u03c9), l) = Jk(cid:18)L \u2212(cid:20)(cid:101)R\u03b5(cid:18) 1L\n\nL\n\n, f\u03b8(\u03c9);\n\n1L\nL\n\n,\n\n1L\nL\n\n, h(cid:19)(cid:21)l(cid:19) ,\n\nWe propose the simple family of ReLU losses Jk(u) = max(0, u \u2212 k + 1), and have focused our\nexperiments on the case k = 1. We train a vanilla CNN (4 Conv2D with 2 max-pooling layers,\nReLU activation, 2 fully connected layers, batchnorm on each) and a Resnet18 on CIFAR-10 and\nCIFAR-100. Fig. 4 and 5 report test-set classi\ufb01cation accuracies / epochs. 
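Before the experimental details below, this loss (soft ranks plus the ReLU gate Jk) can be sketched in a toy numpy example. This is our own illustrative simplification, not the paper's code, and the names `sinkhorn_plan`, `soft_rank` and `soft_zero_one_loss` are ours: soft ranks are obtained by transporting the uniform measure on the rescaled scores onto L uniformly weighted increasing targets, and J1 is applied to L minus the soft rank of the true label.

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps, n_iter=500):
    """Entropy-regularized OT plan between histograms a and b, cost matrix C."""
    K = np.exp(-C / eps)
    u = np.ones(len(a))
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def soft_rank(scores, eps=0.05):
    """Soft ranks in (roughly) [1, L]; the largest score gets rank close to L."""
    L = len(scores)
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    y = (np.arange(L) + 0.5) / L              # increasing targets in [0, 1]
    C = (s[:, None] - y[None, :]) ** 2        # squared-distance cost h
    P = sinkhorn_plan(np.full(L, 1 / L), np.full(L, 1 / L), C, eps)
    # rescale rows of the plan to sum to 1, then take barycentric target ranks
    return (L * P) @ np.arange(1, L + 1)

def soft_zero_one_loss(scores, label, k=1, eps=0.05):
    """J_k(L - soft rank of the true label): small when the label is in the top k."""
    L = len(scores)
    r = soft_rank(scores, eps)
    return max(0.0, L - r[label] - k + 1)

scores = np.array([0.1, 2.0, 0.5, -1.0])
loss_good = soft_zero_one_loss(scores, label=1)  # label 1 has the top score
loss_bad = soft_zero_one_loss(scores, label=3)   # label 3 has the lowest score
# loss_good should be near 0, loss_bad near L - 1 = 3
```

Unlike the hard 0/1 loss, this quantity varies smoothly with every entry of the score vector, so gradients flow to all logits rather than only to the selected one.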
We used ε = 10⁻³, η = 10⁻³, a squared distance cost h(u) = u² and a stepsize of 10⁻⁴ with the ADAM optimizer.

Learning CNNs by sorting handwritten numbers. We use the MNIST experiment setup of [18], in which a CNN is given n numbers between 0 and 9999, each given as 4 concatenated MNIST images. The labels are the ranks (within n pairs) of each of these n numbers. We use the code kindly made available by the authors. We use 100 epochs, and confirm experimentally that S-sort performs on par with their neural-sort function. We set ε = 0.005.

Least quantile regression. The goal of least quantile regression [32] is to minimize, given a vector of response variables z1, . . . , zN ∈ R and regressor variables W = [w1, . . . , wN] ∈ R^{d×N}, the τ-quantile of the loss between response and predicted values; namely, writing x = (|zi − fθ(wi)|)_i and setting a = 1_N/N and ξ the measure with weights a and support x, to minimize w.r.t. θ the τ-quantile of ξ.

We proceed by drawing mini-batches of size 512. Our baseline method (labelled ε = 0) consists in identifying which point, among those 512, has an error that is equal to the desired quantile, and then taking gradient steps according to that point. Our proposal is to consider the soft τ-quantile operator q̃ε(x; τ, t) defined in (5), using for the filler weight t = 1/512. This is labelled as ε = 10⁻². We use the datasets considered in [31] and consider the same regressor architecture, namely a NN with 2 hidden layers of size 64, the ADAM optimizer and steplength 10⁻⁴. Results are summarized in Table 2.
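For concreteness, the selection step of the ε = 0 baseline described above can be sketched as follows (a minimal numpy sketch; the name `hard_quantile_grad_weights` is our own): it produces a one-hot weight vector over the minibatch, so a gradient step only flows through the single point sitting at the empirical quantile.

```python
import numpy as np

def hard_quantile_grad_weights(errors, tau):
    """eps = 0 baseline: one-hot weights selecting the single point whose
    error is the empirical tau-quantile of the minibatch errors."""
    n = len(errors)
    idx = np.argsort(errors)[int(np.ceil(tau * n)) - 1]
    w = np.zeros(n)
    w[idx] = 1.0
    return w

errors = np.array([3.0, 1.0, 2.0, 5.0, 4.0])
w = hard_quantile_grad_weights(errors, tau=0.5)
# only the median point (error 3.0, at index 0) receives gradient
```

The soft τ-quantile of (5) replaces this one-hot vector with the dense transport weights sent to the filler atom, spreading the gradient over all points close to the quantile, which is what makes the resulting loss differentiable.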
We consider two quantiles, τ = 50% and 90%. For each quantile/dataset pair, we report the original (non-regularized) τ-quantile of the errors evaluated on the entire training set, on an entire held-out test set, and the MSE on the test set of the function that is recovered.

Figure 6: Test accuracy on the simultaneous MNIST CNN / sorting task proposed in [18] (average of 12 runs).

Table 2: Least quantile losses (averaged on 12 runs) obtained on datasets compiled by [31]. We consider two quantiles, at 50% and 90%. The baseline method (ε = 0) consists in estimating the quantile empirically and taking a gradient step with respect to that point. Our method (ε = 10⁻²) uses the soft quantile operator q̃ε(x; τ, t) defined in (5), using for the filler weight t = 1/512. We observe better performance at train time (which may be due to a "smoothed" optimization landscape with fewer local minima) but different behaviors on test sets, either using the quantile loss or the MSE. Note that we report here, for both methods and for both train and test sets, the "true" quantile error metric.

τ = 50%            ε = 0                      ε = 10⁻² (our)
Dataset       Train   Test   MSE        Train   Test   MSE
bio           0.33    0.83   0.31       0.28    0.81   0.28
bike          0.23    0.82   0.46       0.14    0.87   0.49
star          0.00    0.01   0.18       0.04    0.04   0.19
facebook      0.55    0.80   0.68       0.33    0.89   0.74
concrete      0.35    0.58   0.45       0.25    0.61   0.51
community     0.27    0.30   0.48       0.06    0.32   0.53

τ = 90%            ε = 0                      ε = 10⁻² (our)
Dataset       Train   Test   MSE        Train   Test   MSE
bio           1.17    0.74   1.19       1.15    1.18   1.17
bike          0.76    0.65   1.60       0.69    0.63   1.57
star          0.21    0.27   0.27       0.27    0.22   0.27
facebook      1.29    1.55   0.77       1.15    0.77   1.57
concrete      0.83    0.50   1.08       0.72    0.51   1.08
community     0.77    0.98   0.46       0.56    0.44   0.98
We notice that our algorithm reaches overall better quantile errors on the training set (this is our main goal), but comparable test/MSE errors.

Conclusion. We have proposed in this paper differentiable proxies to the ranking and sorting operations. These proxies build upon the existing connection between sorting and the computation of OT in 1D. By generalizing sorting using OT, and then introducing a regularized form that can be solved using Sinkhorn iterations, we recover the simple benefit that all of its steps can be easily and automatically differentiated. We have shown that, with a focus on numerical stability, one can use these operators in various settings, including smooth extensions of test-time metrics that rely on sort, which can now be used as training losses. For instance, we have used the Sinkhorn sort operator to provide a smooth approximation of quantiles to solve least-quantile regression problems, and the Sinkhorn rank operator to formulate an alternative to the cross-entropy that can mimic the 0/1 loss in multiclass classification. This smooth approximation to the rank, and the resulting gradient flow that we obtain, is strongly reminiscent of rank-based dynamics, in which players in a given game produce an effort (a gradient) that is a direct function of their rank (or standing) within the game, as introduced by [7]. Our use of the Sinkhorn algorithm can therefore be interpreted as a smooth mechanism to enact such dynamics. Several open questions remain: although the choice of a cost function h, target vector y and squashing function g (used to form the vector x̃ in Alg. 1, using Eq. 4) have in principle no influence on the vector of Sinkhorn ranks or sorted values in the limit when ε goes to 0 (they all converge to R and S), these choices strongly shape the differentiability of R̃ε and S̃ε when ε > 0.
Our empirical findings suggest that whitening and squashing all entries within [0, 1] is crucial to obtain numerical stability, and more generally to retain consistent gradients across iterations, without having to re-define ε at each iteration.

References

[1] Leonard Berrada, Andrew Zisserman, and M Pawan Kumar. Smooth loss functions for deep top-k classification. arXiv preprint arXiv:1802.07595, 2018.

[2] Dimitris Bertsimas and John N Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.

[3] Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Universidad Nacional de Tucumán Revista Series A, 5:147–151, 1946.

[4] Nicolas Bonneel, Gabriel Peyré, and Marco Cuturi. Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Transactions on Graphics, 35(4):71:1–71:10, 2016.

[5] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

[6] Stephen Boyd, Corinna Cortes, Mehryar Mohri, and Ana Radovanovic. Accuracy at the top. In Advances in Neural Information Processing Systems, pages 953–961, 2012.

[7] Yann Brenier. Rearrangement, convection, convexity and entropy. Philosophical Transactions of the Royal Society A, 371:20120343, 2013.

[8] Christopher Burges, Krysta Svore, Paul Bennett, Andrzej Pastusiak, and Qiang Wu. Learning to rank using an ensemble of lambda-gradient models. In Proceedings of the Learning to Rank Challenge, pages 25–35, 2011.

[9] Christopher J Burges, Robert Ragno, and Quoc V Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems, pages 193–200, 2007.

[10] Marco Cuturi. Sinkhorn distances: lightspeed computation of optimal transport.
In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.

[11] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of ICML, volume 32, pages 685–693, 2014.

[12] Herbert Aron David and Haikady Navada Nagaraja. Order statistics. Encyclopedia of Statistical Sciences, 2004.

[13] Julie Delon, Julien Salomon, and Andrei Sobolevski. Local matching indicators for transport problems with concave costs. SIAM Journal on Discrete Mathematics, 26(2):801–827, 2012.

[14] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268. ACM, 2015.

[15] Rémi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.

[16] Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735, 1989.

[17] Alfred Galichon and Bernard Salanié. Matching with trade-offs: revealed preferences over competing characteristics. Technical report, Preprint SSRN-1487307, 2009.

[18] Aditya Grover, Eric Wang, Aaron Zweig, and Stefano Ermon. Stochastic optimization of sorting networks via continuous relaxation. In Proceedings of ICLR 2019, 2019.

[19] Tatsunori Hashimoto, David Gifford, and Tommi Jaakkola. Learning population-level diffusions with generative RNNs. In International Conference on Machine Learning, pages 2417–2426, 2016.

[20] Peter J Huber. Robust statistics. Springer, 2011.

[21] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques.
ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.

[22] Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. Wasserstein fair classification. arXiv preprint arXiv:1907.12059, 2019.

[23] Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5258–5267, 2016.

[24] JJ Kosowsky and Alan L Yuille. The invisible hand algorithm: solving the assignment problem with statistical physics. Neural Networks, 7(3):477–490, 1994.

[25] Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice. Annals of Statistics, 2019. To appear.

[26] Gábor Lugosi, Shahar Mendelson, et al. Regularization, sparse recovery, and median-of-means tournaments. Bernoulli, 25(3):2075–2106, 2019.

[27] Tan Nguyen and Scott Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pages 1085–1093, 2013.

[28] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

[29] Tao Qin, Tie-Yan Liu, and Hang Li. A general approximation framework for direct optimization of information retrieval measures. Information Retrieval, 13(4):375–397, 2010.

[30] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.

[31] Yaniv Romano, Evan Patterson, and Emmanuel J Candès. Conformalized quantile regression. arXiv preprint arXiv:1905.03222, 2019.

[32] Peter J Rousseeuw. Least median of squares regression.
Journal of the American Statistical Association, 79(388):871–880, 1984.

[33] Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, 2015.

[34] Bernhard Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems. arXiv preprint arXiv:1610.06519, 2016.

[35] Robert E. Tarjan. Dynamic trees as search trees via Euler tours, applied to the network simplex algorithm. Mathematical Programming, 78(2):169–177, 1997.

[36] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. SoftRank: optimizing non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 77–86. ACM, 2008.

[37] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.

[38] Alan Geoffrey Wilson. The use of entropy maximizing models, in the theory of trip distribution, mode split and route split. Journal of Transport Economics and Policy, pages 108–126, 1969.

[39] Shaodan Zhai, Tian Xia, Ming Tan, and Shaojun Wang. Direct 0-1 loss minimization and margin maximization with boosting. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 872–880. Curran Associates, Inc., 2013.