{"title": "Sinkhorn Distances: Lightspeed Computation of Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 2292, "page_last": 2300, "abstract": "Optimal transportation distances are a fundamental family of parameterized distances for histograms in the probability simplex. Despite their appealing theoretical properties, excellent performance and intuitive formulation, their computation involves the resolution of a linear program whose cost is prohibitive whenever the histograms' dimension exceeds a few hundreds. We propose in this work a new family of optimal transportation distances that look at transportation problems from a maximum-entropy perspective. We smooth the classical optimal transportation problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transportation solvers. We also report improved performance on the MNIST benchmark problem over competing distances.", "full_text": "Sinkhorn Distances:\n\nLightspeed Computation of Optimal Transport\n\nGraduate School of Informatics, Kyoto University\n\nMarco Cuturi\n\nmcuturi@i.kyoto-u.ac.jp\n\nAbstract\n\nOptimal transport distances are a fundamental family of distances for probability\nmeasures and histograms of features. Despite their appealing theoretical proper-\nties, excellent performance in retrieval tasks and intuitive formulation, their com-\nputation involves the resolution of a linear program whose cost can quickly be-\ncome prohibitive whenever the size of the support of these measures or the his-\ntograms\u2019 dimension exceeds a few hundred. We propose in this work a new family\nof optimal transport distances that look at transport problems from a maximum-\nentropy perspective. 
We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.\n\n1 Introduction\n\nChoosing a suitable distance to compare probabilities is a key problem in statistical machine learning. When little is known about the probability space on which these probabilities are supported, various information divergences with minimalistic assumptions have been proposed to play that part, among which the Hellinger, χ², total variation or Kullback-Leibler divergences. When the probability space is a metric space, optimal transport distances (Villani, 2009, §6), a.k.a. earth mover's distances (EMD) in computer vision (Rubner et al., 1997), define a more powerful geometry to compare probabilities.\n\nThis power comes, however, with a heavy computational price tag. No matter what algorithm is employed — network simplex or interior point methods — the cost of computing optimal transport distances scales at least in O(d³ log d) when comparing two histograms of dimension d or two point clouds each of size d in a general metric space (Pele and Werman, 2009, §2.1).\nIn the particular case that the metric probability space of interest can be embedded in Rⁿ and n is small, computing or approximating optimal transport distances can become reasonably cheap. Indeed, when n = 1, their computation only requires O(d log d) operations.
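To make the n = 1 claim concrete, the 1-D optimal transport distance between two histograms reduces, after an O(d log d) sort of the support, to an L1 distance between cumulative distribution functions. A minimal NumPy sketch (illustrative, not from the paper; the function name is ours):

```python
import numpy as np

def emd_1d(x, r, c):
    """W1 distance between histograms r and c supported on points x (1-D).

    Uses the closed form W1 = integral of |CDF_r - CDF_c|; the only
    super-linear step is sorting the common support.
    """
    order = np.argsort(x)                # the O(d log d) step
    x, r, c = x[order], r[order], c[order]
    cdf_gap = np.cumsum(r - c)[:-1]      # CDF difference between consecutive atoms
    return float(np.sum(np.abs(cdf_gap) * np.diff(x)))

# Moving one unit of mass from bin 0 to bin 1 on the grid {0, 1, 2} costs 1.
print(emd_1d(np.array([0., 1., 2.]),
             np.array([1., 0., 0.]),
             np.array([0., 1., 0.])))   # prints 1.0
```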
When n ≥ 2, embeddings of measures can be used to approximate them in linear time (Indyk and Thaper, 2003; Grauman and Darrell, 2004; Shirdhonkar and Jacobs, 2008) and network simplex solvers can be modified to run in quadratic time (Gudmundsson et al., 2007; Ling and Okada, 2007). However, the distortions of such embeddings (Naor and Schechtman, 2007), as well as the exponential increase of costs incurred by such modifications as n grows, make these approaches inapplicable when n exceeds 4. Outside of the perimeter of these cases, computing a single distance between a pair of measures supported by a few hundred points/bins in an arbitrary metric space can take more than a few seconds on a single CPU. This issue severely hinders the applicability of optimal transport distances in large-scale data analysis and goes as far as putting into question their relevance within the field of machine learning, where high-dimensional histograms and measures in high-dimensional spaces are now prevalent.\n\nWe show in this paper that another strategy can be employed to speed up optimal transport, and even potentially define a better distance in inference tasks. Our strategy is valid regardless of the metric characteristics of the original probability space. Rather than exploit properties of the metric probability space of interest (such as embeddability in a low-dimensional Euclidean space), we choose to focus directly on the original transport problem, and regularize it with an entropic term. We argue that this regularization is intuitive given the geometry of the optimal transport problem and has, in fact, been long known and favored in transport theory to predict traffic patterns (Wilson, 1969). From an optimization point of view, this regularization has multiple virtues, among which that of turning the transport problem into a strictly convex problem that can be solved with matrix scaling algorithms.
Such algorithms include Sinkhorn's celebrated fixed point iteration (1967), which is known to have a linear convergence (Franklin and Lorenz, 1989; Knight, 2008). Unlike other iterative simplex-like methods that need to cycle through complex conditional statements, the execution of Sinkhorn's algorithm only relies on matrix-vector products. We propose a novel implementation of this algorithm that can compute simultaneously the distance of a single point to a family of points using matrix-matrix products, and which can therefore be implemented on GPGPU architectures. We show that, on the benchmark task of classifying MNIST digits, regularized distances perform better than standard optimal transport distances, and can be computed several orders of magnitude faster.\n\nThis paper is organized as follows: we provide reminders on optimal transport theory in Section 2, introduce Sinkhorn distances in Section 3 and provide algorithmic details in Section 4. We follow with an empirical study in Section 5 before concluding.\n\n2 Reminders on Optimal Transport\n\nTransport Polytope and Interpretation as a Set of Joint Probabilities. In what follows, ⟨·, ·⟩ stands for the Frobenius dot-product. For two probability vectors r and c in the simplex Σ_d := {x ∈ R^d_+ : xᵀ1_d = 1}, where 1_d is the d-dimensional vector of ones, we write U(r, c) for the transport polytope of r and c, namely the polyhedral set of d × d matrices\n\n$$U(r, c) := \{P \in \mathbb{R}^{d\times d}_{+} \mid P 1_d = r,\; P^\top 1_d = c\}.$$\n\nU(r, c) contains all nonnegative d × d matrices with row and column sums r and c respectively. U(r, c) has a probabilistic interpretation: for X and Y two multinomial random variables taking values in {1, · · · , d}, each with distribution r and c respectively, the set U(r, c) contains all possible joint probabilities of (X, Y).
Indeed, any matrix P ∈ U(r, c) can be identified with a joint probability for (X, Y) such that p(X = i, Y = j) = p_ij. We define the entropy h and the Kullback-Leibler divergence of P, Q ∈ U(r, c) and of a marginal r ∈ Σ_d as\n\n$$h(r) = -\sum_{i=1}^d r_i \log r_i, \qquad h(P) = -\sum_{i,j=1}^d p_{ij} \log p_{ij}, \qquad KL(P\|Q) = \sum_{ij} p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$\n\nOptimal Transport Distance Between r and c. Given a d × d cost matrix M, the cost of mapping r to c using a transport matrix (or joint probability) P can be quantified as ⟨P, M⟩. The problem defined in Equation (1),\n\n$$d_M(r, c) := \min_{P \in U(r,c)} \langle P, M\rangle, \qquad (1)$$\n\nis called an optimal transport (OT) problem between r and c given cost M. An optimal table P⋆ for this problem can be obtained, among other approaches, with the network simplex (Ahuja et al., 1993, §9). The optimum of this problem, d_M(r, c), is a distance between r and c (Villani, 2009, §6.1) whenever the matrix M is itself a metric matrix, namely whenever M belongs to the cone of distance matrices (Avis, 1980; Brickell et al., 2008):\n\n$$\mathcal{M} = \{M \in \mathbb{R}^{d\times d}_{+} : \forall i, j \le d,\; m_{ij} = 0 \Leftrightarrow i = j;\; \forall i, j, k \le d,\; m_{ij} \le m_{ik} + m_{kj}\}.$$\n\nFor a general matrix M, the worst case complexity of computing that optimum scales in O(d³ log d) for the best algorithms currently proposed, and turns out to be super-cubic in practice as well (Pele and Werman, 2009, §2.1).\n\n3 Sinkhorn Distances: Optimal Transport with Entropic Constraints\n\nEntropic Constraints on Joint Probabilities. The following information theoretic inequality (Cover and Thomas, 1991, §2) for joint probabilities\n\n$$\forall r, c \in \Sigma_d,\; \forall P \in U(r, c),\quad h(P) \le h(r) + h(c)$$\n\nis tight, since the independence table rcᵀ (Good, 1963) has entropy h(rcᵀ) = h(r) + h(c).
By the concavity of entropy, we can introduce the convex set\n\n$$U_\alpha(r, c) := \{P \in U(r, c) \mid KL(P\|rc^\top) \le \alpha\} = \{P \in U(r, c) \mid h(P) \ge h(r) + h(c) - \alpha\} \subset U(r, c).$$\n\nThese two definitions are indeed equivalent, since one can easily check that KL(P‖rcᵀ) = h(r) + h(c) − h(P), a quantity which is also the mutual information I(X; Y) of two random variables (X, Y) should they follow the joint probability P (Cover and Thomas, 1991, §2). Hence, the set of tables P whose Kullback-Leibler divergence to rcᵀ is constrained to lie below a certain threshold can be interpreted as the set of joint probabilities P in U(r, c) which have sufficient entropy with respect to h(r) and h(c), or small enough mutual information. For reasons that will become clear in Section 4, we call the quantity below the Sinkhorn distance of r and c:\n\nDefinition 1 (Sinkhorn Distance). $d_{M,\alpha}(r, c) := \min_{P \in U_\alpha(r,c)} \langle P, M\rangle.$\n\nWhy consider an entropic constraint in optimal transport? The first reason is computational, and is detailed in Section 4. The second reason is built upon the following intuition. As a classic result of linear optimization, the OT problem is always solved on a vertex of U(r, c). Such a vertex is a sparse d × d matrix with only up to 2d − 1 non-zero elements (Brualdi, 2006, §8.1.3). From a probabilistic perspective, such vertices are quasi-deterministic joint probabilities: if p_ij > 0, then very few probabilities p_ij′ for j′ ≠ j will be non-zero in general. Rather than considering such outliers of U(r, c) as the basis of OT distances, we propose to restrict the search for low cost joint probabilities to tables with sufficient smoothness.
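The identity KL(P‖rcᵀ) = h(r) + h(c) − h(P) invoked above is straightforward to check numerically; a small NumPy sketch (illustrative values, not from the paper):

```python
import numpy as np

r = np.array([0.5, 0.3, 0.2])
c = np.array([0.2, 0.2, 0.6])

# A transport plan in U(r, c): perturb the independence table rc^T in a
# way that leaves both marginals unchanged.
P = np.outer(r, c)
P[:2, :2] += 0.02 * np.array([[1., -1.], [-1., 1.]])
assert np.allclose(P.sum(axis=1), r) and np.allclose(P.sum(axis=0), c)

def h(p):
    """Shannon entropy of a nonnegative array summing to 1."""
    p = p[p > 0]
    return -float(np.sum(p * np.log(p)))

kl = float(np.sum(P * np.log(P / np.outer(r, c))))  # KL(P || rc^T)
print(np.isclose(kl, h(r) + h(c) - h(P)))           # prints True
```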
Note that this is equivalent to considering the maximum-entropy principle (Jaynes, 1957; Darroch and Ratcliff, 1972) if we were to maximize entropy while keeping the transport cost constrained.\n\nBefore proceeding to the description of the properties of Sinkhorn distances, we note that Ferradans et al. (2013) have recently explored similar ideas. They relax and penalize (through graph-based norms) the original transport problem to avoid undesirable properties exhibited by the original optima in the problem of color matching. Combined, their idea and ours suggest that many more smooth regularizers will be worth investigating to solve the OT problem, driven by either or both computational and modeling motivations.\n\nMetric Properties of Sinkhorn Distances. When α is large enough, the Sinkhorn distance coincides with the classic OT distance. When α = 0, the Sinkhorn distance has a closed form and becomes a negative definite kernel if one assumes that M is itself a negative definite distance, or equivalently a Euclidean distance matrix¹.\n\nProperty 1. For α large enough, the Sinkhorn distance d_{M,α} is the transport distance d_M.\n\nProof. Since for any P ∈ U(r, c), h(P) is lower bounded by ½(h(r) + h(c)), we have that for α large enough U_α(r, c) = U(r, c) and thus both quantities coincide. ∎\n\nProperty 2 (Independence Kernel). d_{M,0} = rᵀMc. If M is a Euclidean distance matrix, d_{M,0} is a negative definite kernel and e^{−t d_{M,0}}, the independence kernel, is positive definite for all t > 0.\n\nThe proof is provided in the appendix. Beyond these two extreme cases, the main theorem of this section states that Sinkhorn distances are symmetric and satisfy triangle inequalities for all possible values of α.
Since for α small enough d_{M,α}(r, r) > 0 for any r such that h(r) > 0, Sinkhorn distances cannot satisfy the coincidence axiom (d(x, y) = 0 ⇔ x = y holds for all x, y). However, multiplying d_{M,α} by 1_{r≠c} suffices to recover the coincidence property if needed.\n\nTheorem 1. For all α ≥ 0 and M ∈ ℳ, d_{M,α} is symmetric and satisfies all triangle inequalities. The function (r, c) ↦ 1_{r≠c} d_{M,α}(r, c) satisfies all three distance axioms.\n\n¹∃n, ∃ϕ₁, · · · , ϕ_d ∈ Rⁿ such that m_ij = ‖ϕ_i − ϕ_j‖₂². Recall that, in that case, M raised to power t element-wise, [m_ij^t], 0 < t < 1, is also a Euclidean distance matrix (Berg et al., 1984, p.78, §3.2.10).\n\nFigure 1: Transport polytope U(r, c) and Kullback-Leibler ball U_α(r, c) of level α centered around rcᵀ. This drawing implicitly assumes that the optimal transport P⋆ is unique. The Sinkhorn distance d_{M,α}(r, c) is equal to ⟨P_α, M⟩, the minimum of the dot product with M on that ball. For α large enough, both objectives coincide, as U_α(r, c) gradually overlaps with U(r, c) in the vicinity of P⋆. The dual-Sinkhorn divergence d^λ_M(r, c), the minimum of the transport problem regularized by minus the entropy divided by λ, reaches its minimum at a unique solution P^λ, forming a regularization path for varying λ from rcᵀ (λ = 0) to P⋆ (λ → ∞, up to the machine-precision limit λ_max).
For a given value of α and a pair (r, c), there exists λ ∈ [0, ∞] such that d^λ_M(r, c) and d_{M,α}(r, c) coincide. d^λ_M can be efficiently computed using Sinkhorn's fixed point iteration (1967). Although the convergence to P⋆ of this fixed point iteration is theoretically guaranteed as λ → ∞, the procedure cannot work beyond a problem-dependent value λ_max beyond which some entries of e^{−λM} are represented as zeroes in memory.\n\nThe gluing lemma (Villani, 2009, p.19) is key to proving that OT distances are indeed distances. We propose a variation of this lemma to prove our result:\n\nLemma 1 (Gluing Lemma With Entropic Constraint). Let α ≥ 0 and x, y, z ∈ Σ_d. Let P ∈ U_α(x, y) and Q ∈ U_α(y, z). Let S be the d × d matrix defined as s_ik := Σ_j p_ij q_jk / y_j. Then S ∈ U_α(x, z).\n\nThe proof is provided in the appendix. We can prove the triangle inequality for d_{M,α} by using the same proof strategy as that used for classic transport distances:\n\nProof of Theorem 1. The symmetry of d_{M,α} is a direct result of M's symmetry. Let x, y, z be three elements in Σ_d. Let P ∈ U_α(x, y) and Q ∈ U_α(y, z) be two optimal solutions for d_{M,α}(x, y) and d_{M,α}(y, z) respectively. Using the matrix S of U_α(x, z) provided in Lemma 1, we proceed with the following chain of inequalities:\n\n$$\begin{aligned} d_{M,\alpha}(x, z) = \min_{P \in U_\alpha(x,z)} \langle P, M\rangle \le \langle S, M\rangle &= \sum_{ik} m_{ik} \sum_j \frac{p_{ij} q_{jk}}{y_j} \le \sum_{ijk} (m_{ij} + m_{jk}) \frac{p_{ij} q_{jk}}{y_j} \\ &= \sum_{ij} m_{ij} p_{ij} \sum_k \frac{q_{jk}}{y_j} + \sum_{jk} m_{jk} q_{jk} \sum_i \frac{p_{ij}}{y_j} \\ &= \sum_{ij} m_{ij} p_{ij} + \sum_{jk} m_{jk} q_{jk} = d_{M,\alpha}(x, y) + d_{M,\alpha}(y, z). \end{aligned}$$
∎\n\n4 Computing Regularized Transport with Sinkhorn's Algorithm\n\nWe consider in this section a Lagrange multiplier for the entropy constraint of Sinkhorn distances:\n\n$$\text{For } \lambda > 0,\quad d^\lambda_M(r, c) := \langle P^\lambda, M\rangle, \quad \text{where } P^\lambda = \operatorname*{argmin}_{P \in U(r,c)} \langle P, M\rangle - \frac{1}{\lambda} h(P). \qquad (2)$$\n\nBy duality theory we have that to each α corresponds a λ ∈ [0, ∞] such that d_{M,α}(r, c) = d^λ_M(r, c) holds for that pair (r, c). We call d^λ_M the dual-Sinkhorn divergence and show that it can be computed for a much cheaper cost than the original distance d_M. Figure 1 summarizes the relationships between d_M, d_{M,α} and d^λ_M. Since the entropy of P^λ decreases monotonically with λ, computing d_{M,α} can be carried out by computing d^λ_M with increasing values of λ until h(P^λ) reaches h(r) + h(c) − α. We do not consider this problem here and only use the dual-Sinkhorn divergence in our experiments.\n\nComputing d^λ_M with Matrix Scaling Algorithms. Adding an entropy regularization to the optimal transport problem enforces a simple structure on the optimal regularized transport P^λ:\n\nLemma 2. For λ > 0, the solution P^λ is unique and has the form P^λ = diag(u) K diag(v), where u and v are two non-negative vectors of R^d uniquely defined up to a multiplicative factor and K := e^{−λM} is the element-wise exponential of −λM.\n\nProof. The existence and unicity of P^λ follows from the boundedness of U(r, c) and the strict convexity of minus the entropy.
The fact that P^λ can be written as a rescaled version of K is a well known fact in transport theory (Erlander and Stewart, 1990, §3.3): let L(P, α, β) be the Lagrangian of Equation (2) with dual variables α, β ∈ R^d for the two equality constraints in U(r, c):\n\n$$L(P, \alpha, \beta) = \sum_{ij} \frac{1}{\lambda} p_{ij} \log p_{ij} + p_{ij} m_{ij} + \alpha^\top(P 1_d - r) + \beta^\top(P^\top 1_d - c).$$\n\nFor any couple (i, j), (∂L/∂p_ij = 0) ⇒ p_ij = e^{−1/2−λα_i} e^{−λm_ij} e^{−1/2−λβ_j}. Since K is strictly positive, Sinkhorn's theorem (1967) states that there exists a unique matrix of the form diag(u) K diag(v) that belongs to U(r, c), where u, v ≥ 0_d. P^λ is thus necessarily that matrix, and can be computed with Sinkhorn's fixed point iteration (u, v) ← (r./Kv, c./Kᵀu). ∎\n\nGiven K and marginals r and c, one only needs to iterate Sinkhorn's update a sufficient number of times to converge to P^λ. One can show that these successive updates carry out iteratively the projection of K on U(r, c) in the Kullback-Leibler sense. This fixed point iteration can be written as a single update u ← r./K(c./Kᵀu). When r > 0_d, diag(1./r)K can be stored in a d × d matrix K̃ to save one Schur vector product operation with the update u ← 1./(K̃(c./Kᵀu)).
This can be easily ensured by selecting the positive indices of r, as seen in the first line of Algorithm 1.\n\nAlgorithm 1 Computation of d = [d^λ_M(r, c₁), · · · , d^λ_M(r, c_N)], using Matlab syntax.\nInput M, lambda, r, C := [c₁, · · · , c_N].\nI = (r > 0); r = r(I); M = M(I, :); K = exp(-lambda*M)\nu = ones(length(r), N)/length(r);\nKtilde = bsxfun(@rdivide, K, r) % equivalent to Ktilde = diag(1./r)*K\nwhile u changes or any other relevant stopping criterion do\n    u = 1./(Ktilde*(C./(K'*u)))\nend while\nv = C./(K'*u)\nd = sum(u.*((K.*M)*v))\n\nParallelism, Convergence and Stopping Criteria. As can be seen right above, Sinkhorn's algorithm can be vectorized and generalized to N target histograms c₁, · · · , c_N. When N = 1 and C is a vector in Algorithm 1, we recover the simple iteration mentioned in the proof of Lemma 2. When N > 1, the computations for N target histograms can be simultaneously carried out by updating a single matrix of scaling factors u ∈ R^{d×N}_+ instead of updating a scaling vector u ∈ R^d_+. This important observation makes the execution of Algorithm 1 particularly suited to GPGPU platforms. Despite ongoing research in that field (Bieling et al., 2010), such speed-ups have not yet been achieved on complex iterative procedures such as the network simplex. Using Hilbert's projective metric, Franklin and Lorenz (1989) prove that the convergence of the scaling factor u (as well as v) is linear, with a rate bounded above by κ(K)², where\n\n$$\kappa(K) = \frac{\sqrt{\theta(K)} - 1}{\sqrt{\theta(K)} + 1} < 1, \quad \text{and} \quad \theta(K) = \max_{i,j,l,m} \frac{K_{il} K_{jm}}{K_{jl} K_{im}}.$$\n\nThe upper bound κ(K) tends to 1 as λ grows, and we do observe a slower convergence as P^λ gets closer to the optimal vertex P⋆ (or the optimal facet of U(r, c) if it is not unique).
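Algorithm 1 above is stated in Matlab syntax; the following NumPy transcription (a sketch with a fixed iteration budget in place of the stopping test; names are ours) makes the matrix-matrix structure explicit:

```python
import numpy as np

def sinkhorn_divergences(M, lam, r, C, n_iter=200):
    """Return [d^lam_M(r, c_1), ..., d^lam_M(r, c_N)], as in Algorithm 1.

    M : (d, d) cost matrix; lam : regularization parameter lambda;
    r : (d,) source histogram; C : (d, N) stack of target histograms.
    """
    I = r > 0                              # keep only positive entries of r
    r, M = r[I], M[I, :]
    K = np.exp(-lam * M)
    Kt = K / r[:, None]                    # diag(1./r) K
    u = np.full((r.size, C.shape[1]), 1.0 / r.size)
    for _ in range(n_iter):
        u = 1.0 / (Kt @ (C / (K.T @ u)))   # u <- 1./(Ktilde (C./(K' u)))
    v = C / (K.T @ u)
    return np.sum(u * ((K * M) @ v), axis=0)

# Symmetric 2-bin check: for r = c = (1/2, 1/2) and a 0/1 cost matrix, the
# regularized optimum has the closed form exp(-lam) / (1 + exp(-lam)).
M = np.array([[0., 1.], [1., 0.]])
r = np.array([0.5, 0.5])
C = np.array([[0.5], [0.5]])
print(sinkhorn_divergences(M, 1.0, r, C))
```

With `C` holding N columns, one call prices r against N targets at once, which is exactly the vectorization that makes the method GPU-friendly.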
Different stopping criteria can be used for Algorithm 1. We consider two in this work, which we detail below.\n\n5 Experimental Results\n\nMNIST Digits. We test the performance of dual-Sinkhorn divergences on the MNIST digits dataset. Each image is converted to a vector of intensities on the 20 × 20 pixel grid, which are then normalized to sum to 1. We consider a subset of N ∈ {3, 5, 12, 17, 25} × 10³ points in the dataset. For each subset, we provide mean and standard deviation of classification error using a 4 fold (3 test, 1 train) cross validation (CV) scheme repeated 6 times, resulting in 24 different experiments. Given a distance d, we form the kernel e^{−d/t}, where t > 0 is chosen by CV on each training fold within {1, q₁₀(d), q₂₀(d), q₅₀(d)}, where q_s is the s% quantile of a subset of distances observed in that fold. We regularize non-positive definite kernel matrices resulting from this computation by adding a sufficiently large diagonal term. SVM's were run with Libsvm (one-vs-one) for multiclass classification. We select the regularization C in 10^{−2,0,4} using 2 folds/2 repeats CV on the training fold. We consider the Hellinger, χ², total variation and squared Euclidean (Gaussian kernel) distances. M is the 400 × 400 matrix of Euclidean distances between the 20 × 20 bins in the grid. We also tried Mahalanobis distances on this example using exp(−tM.^2), t > 0, as well as its inverse, with varying values of t, but none of these results proved competitive. For the Independence kernel we considered [m_ij^a] where a ∈ {0.01, 0.1, 1} is chosen by CV on each training fold. We select λ in {5, 7, 9, 11} × 1/q₅₀(M(:)), where q₅₀(M(:)) is the median distance between pixels. We set the number of fixed-point iterations to an arbitrary number of 20 iterations. In most (though not all) folds, the value λ = 9 comes up as the best setting.
The dual-Sinkhorn divergence beats by a safe margin all other distances, including the classic optimal transport distance, here labeled as EMD.\n\nFigure 2: Average test errors with shaded confidence intervals. Errors are computed using 1/4 of the dataset for train and 3/4 for test. Errors are averaged over 4 folds × 6 repeats = 24 experiments.\n\nDoes the Dual-Sinkhorn Divergence Converge to the EMD? We study the convergence of the dual-Sinkhorn divergence towards classic optimal transport distances as λ grows. Because of the regularization in Equation (2), d^λ_M(r, c) is necessarily larger than d_M(r, c), and we expect this gap to decrease as λ increases. Figure 3 illustrates this by plotting the boxplot of the distributions of (d^λ_M(r, c) − d_M(r, c))/d_M(r, c) over 402 pairs of images from the MNIST database. d^λ_M typically approximates the EMD with a high accuracy when λ exceeds 50 (median relative gap of 3.4% and 1.2% for 50 and 100 respectively). For this experiment as well as all the other experiments below, we compute a vector of N divergences d at each iteration, and stop when none of the N values of d varies more in absolute value than a 1/100th of a percent, i.e.
we stop when ‖d_t./d_{t−1} − 1‖_∞ < 10⁻⁴.\n\nFigure 3: Decrease of the gap between the dual-Sinkhorn divergence and the EMD as a function of λ on a subset of the MNIST dataset. [Panel: "Deviation of Sinkhorn's Distance to EMD on subset of MNIST Data"; boxplots of (Sinkhorn − EMD)/EMD for λ ∈ {1, 5, 9, 15, 25, 50, 100}.]\n\n[Figure 4 panel: "Computational Speed for Histograms of Varying Dimension Drawn Uniformly on the Simplex" (log-log scale); curves: FastEMD, Rubner's emd, Sink. CPU and Sink. GPU for λ ∈ {50, 10, 1}.]\n\nSeveral Orders of Magnitude Faster. We measure the computational speed of classic optimal transport distances vs. that of dual-Sinkhorn divergences using Rubner et al.'s (1997) and Pele and Werman's (2009) publicly available implementations. We pick a random distance matrix M by generating a random graph of d vertices with edge presence probability 1/2 and edge weights uniformly distributed between 0 and 1. M is the all-pairs shortest-path matrix obtained from this connectivity matrix using the Floyd-Warshall algorithm (Ahuja et al., 1993, §5.6). Using this procedure, M is likely to be an extreme ray of the cone ℳ (Avis, 1980, p.138). The elements of M are then normalized to have unit median. We implemented Algorithm 1 in Matlab, and use emd mex and emd hat gd metric mex/C files. The EMD distances and Sinkhorn CPU are run on a single core (2.66 GHz Xeon). Sinkhorn GPU is run on an NVidia Quadro K5000 card. We consider λ in {1, 10, 50}.
λ = 1 results in a relatively dense matrix K, with results comparable to that of the Independence kernel, while for λ = 10 or 50, K = e^{−λM} has very small values. Rubner et al.'s implementation cannot be run for histograms larger than d = 512. As can be expected, the competitive advantage of dual-Sinkhorn divergences over EMD solvers increases with the dimension. Using a GPU results in a speed-up of an additional order of magnitude.\n\nFigure 4: Average computational time required to compute a distance between two histograms sampled uniformly in the d-dimensional simplex for varying values of d (d ∈ {64, 128, 256, 512, 1024, 2048}). Dual-Sinkhorn divergences are run both on a single CPU and on a GPU card.\n\nEmpirical Complexity. To provide an accurate picture of the actual cost of the algorithm, we replicate the experiments above but focus now on the number of iterations (matrix-matrix products) typically needed to obtain the convergence of a set of N divergences from a given point r, all uniformly sampled on the simplex. As can be seen in Figure 5, the number of iterations required for vector d to converge increases as e^{−λM} becomes diagonally dominant. However, the total number of iterations does not seem to vary with respect to the dimension. This observation can explain why we do observe a quadratic (empirical) time complexity O(d²) with respect to the dimension d in Figure 4 above.
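The claim that iteration counts grow with λ but are largely insensitive to d can be probed with a short script (an illustrative sketch, not the paper's benchmark code; it reuses the stopping criterion ‖d_t./d_{t−1} − 1‖_∞ < 10⁻⁴ described earlier, here for a single pair so the norm is a scalar):

```python
import numpy as np

def sinkhorn_iterations(d, lam, tol=1e-4, seed=0, max_iter=10_000):
    """Count Sinkhorn updates until the divergence stabilizes to within tol."""
    rng = np.random.default_rng(seed)
    M = rng.random((d, d))
    M = (M + M.T) / 2.0
    np.fill_diagonal(M, 0.0)              # symmetric cost with zero diagonal
    r = rng.dirichlet(np.ones(d))         # random histograms on the simplex
    c = rng.dirichlet(np.ones(d))
    K = np.exp(-lam * M)
    Kt = K / r[:, None]
    u = np.full(d, 1.0 / d)
    prev = np.inf
    for t in range(1, max_iter + 1):
        u = 1.0 / (Kt @ (c / (K.T @ u)))
        div = float(np.sum(u * ((K * M) @ (c / (K.T @ u)))))
        if abs(div / prev - 1.0) < tol:   # |d_t / d_{t-1} - 1| < tol
            return t
        prev = div
    return max_iter

for lam in (1, 10, 50):
    print(lam, sinkhorn_iterations(d=64, lam=lam))
```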
These results suggest that the costly action of keeping track of the actual approximation error (computing variations in d) is not required, and that simply predefining a fixed number of iterations can work well and yield even additional speedups.\n\nFigure 5: The influence of λ on the number of iterations required to converge on histograms uniformly sampled from the simplex.\n\n6 Conclusion\n\nWe have shown that regularizing the optimal transport problem with an entropic penalty opens the door for new numerical approaches to compute OT. This regularization yields speed-ups that are effective regardless of any assumptions on the ground metric M. Based on preliminary evidence, it seems that dual-Sinkhorn divergences do not perform worse than the EMD, and may in fact perform better in applications. Dual-Sinkhorn divergences are parameterized by a regularization weight λ which should be tuned having both computational and performance objectives in mind, but we have not observed a need to establish a trade-off between both. Indeed, reasonably small values of λ seem to perform better than large ones.\n\nAcknowledgements. The author would like to thank: Zaid Harchaoui for suggesting the title of this paper and highlighting the connection between the mutual information of P and its Kullback-Leibler divergence to rcᵀ; Lieven Vandenberghe, Philip Knight, Sanjeev Arora, Alexandre d'Aspremont and Shun-Ichi Amari for fruitful discussions; reviewers for anonymous comments.\n\n7 Appendix: Proofs\n\nProof of Property 2. The set U₀(r, c) contains all joint probabilities P for which h(P) = h(r) + h(c). In that case (Cover and Thomas, 1991, Theorem 2.6.6) applies and U₀(r, c) can only be equal to the singleton {rcᵀ}.
If M is negative definite, there exist vectors (ϕ₁, · · · , ϕ_d) in some Euclidean space Rⁿ such that m_ij = ‖ϕ_i − ϕ_j‖₂² through (Berg et al., 1984, §3.3.2). We thus have that\n\n$$r^\top M c = \sum_{ij} r_i c_j \|\varphi_i - \varphi_j\|^2 = \Big(\sum_i r_i \|\varphi_i\|^2 + \sum_i c_i \|\varphi_i\|^2\Big) - 2 \sum_{ij} \langle r_i \varphi_i, c_j \varphi_j\rangle = r^\top u + c^\top u - 2\, r^\top K c,$$\n\nwhere u_i = ‖ϕ_i‖² and K_ij = ⟨ϕ_i, ϕ_j⟩. We used the fact that Σ r_i = Σ c_i = 1 to go from the first to the second equality. rᵀMc is thus a n.d. kernel because it is the sum of two n.d. kernels: the first term (rᵀu + cᵀu) is the sum of the same function evaluated separately on r and c, and thus a negative definite kernel (Berg et al., 1984, §3.2.10); the latter term −2rᵀKc is negative definite as minus a positive definite kernel (Berg et al., 1984, Definition §3.1.1).\n\nRemark. The proof above suggests a faster way to compute the Independence kernel. Given a matrix M, one can indeed pre-compute the vector of norms u as well as a Cholesky factor L of K above to preprocess a dataset of histograms by premultiplying each observation r_i by L and only storing Lr_i, as well as precomputing its diagonal term r_iᵀu. Note that the independence kernel is positive definite on histograms with the same 1-norm, but is no longer positive definite for arbitrary vectors.\n\nProof of Lemma 1. Let T be the probability distribution on {1, · · · , d}³ whose coefficients are defined as\n\n$$t_{ijk} := \frac{p_{ij} q_{jk}}{y_j}, \qquad (3)$$\n\nfor all indices j such that y_j > 0. For indices j such that y_j = 0, all values t_ijk are set to 0. Let S := [Σ_j t_ijk]_ik. S is a transport matrix between x and z.
Indeed,

\sum_i \sum_j s_{ijk} = \sum_j \frac{q_{jk}}{y_j} \sum_i p_{ij} = \sum_j \frac{q_{jk}}{y_j}\, y_j = \sum_j q_{jk} = z_k    (column sums)

\sum_k \sum_j s_{ijk} = \sum_j \frac{p_{ij}}{y_j} \sum_k q_{jk} = \sum_j \frac{p_{ij}}{y_j}\, y_j = \sum_j p_{ij} = x_i    (row sums)

We now prove that h(S) ≥ h(x) + h(z) − α. Let (X, Y, Z) be three random variables jointly distributed as T. Since, by the definition of T in Equation (3),

p(X, Y, Z) = p(X, Y)\, p(Y, Z) / p(Y) = p(X)\, p(Y|X)\, p(Z|Y),

the triplet (X, Y, Z) is a Markov chain X → Y → Z (Cover and Thomas, 1991, Equation 2.118), and thus, by virtue of the data processing inequality (Cover and Thomas, 1991, Theorem 2.8.1), the following inequality between mutual informations applies:

I(X; Y) ≥ I(X; Z), namely
h(X, Z) − h(X) − h(Z) ≥ h(X, Y) − h(X) − h(Y) ≥ −α.

References

Ahuja, R., Magnanti, T., and Orlin, J. (1993). Network Flows: Theory, Algorithms and Applications. Prentice Hall.

Avis, D. (1980). On the extreme rays of the metric cone. Canadian Journal of Mathematics, 32(1):126–144.

Berg, C., Christensen, J., and Ressel, P. (1984). Harmonic Analysis on Semigroups. Number 100 in Graduate Texts in Mathematics. Springer Verlag.

Bieling, J., Peschlow, P., and Martini, P. (2010). An efficient GPU implementation of the revised simplex method. In Parallel Distributed Processing, 2010 IEEE International Symposium on, pages 1–8.

Brickell, J., Dhillon, I., Sra, S., and Tropp, J. (2008). The metric nearness problem. SIAM J. Matrix Anal. Appl., 30(1):375–396.

Brualdi, R. A. (2006). Combinatorial Matrix Classes, volume 108. Cambridge University Press.

Cover, T. and Thomas, J. (1991). Elements of Information Theory. Wiley & Sons.

Darroch, J. N. and Ratcliff, D. (1972). Generalized iterative scaling for log-linear models.
The Annals of Mathematical Statistics, 43(5):1470–1480.

Erlander, S. and Stewart, N. (1990). The Gravity Model in Transportation Analysis: Theory and Extensions. VSP.

Ferradans, S., Papadakis, N., Rabin, J., Peyré, G., Aujol, J.-F., et al. (2013). Regularized discrete optimal transport. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 1–12.

Franklin, J. and Lorenz, J. (1989). On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735.

Good, I. (1963). Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics, pages 911–934.

Grauman, K. and Darrell, T. (2004). Fast contour matching using approximate earth mover's distance. In IEEE Conf. Vision and Patt. Recog., pages 220–227.

Gudmundsson, J., Klein, O., Knauer, C., and Smid, M. (2007). Small Manhattan networks and algorithmic applications for the earth mover's distance. In Proceedings of the 23rd European Workshop on Computational Geometry, pages 174–177.

Indyk, P. and Thaper, N. (2003). Fast image retrieval via embeddings. In 3rd International Workshop on Statistical and Computational Theories of Vision (at ICCV).

Jaynes, E. T. (1957). Information theory and statistical mechanics. Phys. Rev., 106:620–630.

Knight, P. A. (2008). The Sinkhorn-Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275.

Ling, H. and Okada, K. (2007). An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Transactions on Patt. An. and Mach. Intell., pages 840–853.

Naor, A. and Schechtman, G. (2007). Planar earthmover is not in L1. SIAM J. Comput., 37(3):804–826.

Pele, O. and Werman, M. (2009). Fast and robust earth mover's distances.
In ICCV'09.

Rubner, Y., Guibas, L., and Tomasi, C. (1997). The earth mover's distance, multi-dimensional scaling, and color-based image retrieval. In Proceedings of the ARPA Image Understanding Workshop, pages 661–668.

Shirdhonkar, S. and Jacobs, D. (2008). Approximate earth mover's distance in linear time. In CVPR 2008, pages 1–8. IEEE.

Sinkhorn, R. (1967). Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405.

Villani, C. (2009). Optimal Transport: Old and New, volume 338. Springer Verlag.

Wilson, A. G. (1969). The use of entropy maximising models, in the theory of trip distribution, mode split and route split. Journal of Transport Economics and Policy, pages 108–126.