NIPS 2016
Mon Dec 5th through Sun Dec 11th, 2016, at Centre Convencions Internacional Barcelona
Paper ID: 1705 Stochastic Optimization for Large-scale Optimal Transport

### Reviewer 1

#### Summary

Optimal transport (OT) is a way to compare probability distributions, but it is a computationally expensive operation. The authors propose a stochastic optimization approach that only requires being able to draw samples from the distributions. This way they do not need to discretize the densities; instead they use the fact that the dual OT problem can be recast as the maximization of an expectation, and that the entropic regularization of the primal OT problem can be solved via a faster algorithm. The authors can thus compare discrete-discrete and discrete-continuous distributions, and, by using SGD over a reproducing kernel Hilbert space, they can solve the continuous-continuous problem as well!
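
For concreteness, the "dual recast as a maximization of an expectation" idea can be sketched as follows (my own schematic notation, with entropic regularization $\varepsilon > 0$ and additive constants omitted; this is an illustration, not a formula copied from the paper):

```latex
% Entropic OT between mu and nu with ground cost c and regularization eps > 0.
% The regularized dual is an expectation over independent samples x ~ mu, y ~ nu,
% which is exactly what makes stochastic gradient methods applicable:
\max_{u,\,v}\;
  \mathbb{E}_{x\sim\mu,\; y\sim\nu}
  \Big[\, u(x) + v(y) - \varepsilon\, e^{\frac{u(x)+v(y)-c(x,y)}{\varepsilon}} \,\Big]
```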

#### Qualitative Assessment

The authors have proposed a method which does not require discretizing the densities, based on two key insights: recasting the dual OT problem and entropically regularizing the primal OT problem. For comparing discrete distributions, this yields a method which is faster than the current state of the art. Their method also achieves better performance when comparing discrete and continuous distributions, and finally provides the only known method for optimizing between two continuous distributions, via SGD over a reproducing kernel Hilbert space. This is an amazing breakthrough and will prove to be very useful.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 2

#### Summary

The authors study the problem of learning the optimal transport between two measures. The potential applications of the paper concern 'large-scale' problems, leading naturally to the use of a stochastic gradient strategy. The contribution can be, roughly speaking, split into three parts:
1) The dual formulation of the OT problem is translated into an optimization problem over an expectation $(\mathcal{D}_{\epsilon})$ (see Proposition 2.1). Owing to the lack of strong convexity of the objective function, the authors naturally introduce a Kullback-Leibler entropy penalization (with a coefficient $\epsilon$ that balances the effect of this penalization).
2) The penalized dual problem is solved with a stochastic gradient procedure. Several methods are possible (SGD, Polyak averaging, averaged S.G.D.); the authors choose averaged S.G.D. (a minimal sketch of such a step is given after this summary).
3) The simulations are somewhat convincing, but I am not a specialist of the bag-of-words embeddings problem, so maybe I should not be trusted on this point ;)
The authors address several frameworks: discrete, semi-discrete, and continuous OT problems. The last one is handled with an RKHS.
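
For reference, here is a minimal sketch of what an averaged S.G.D. loop on a semi-discrete entropic semi-dual can look like (my own notation and defaults; `sample_x`, `cost`, and the $C/\sqrt{k}$ step size are assumptions made for illustration, not the paper's exact algorithm):

```python
import numpy as np

def averaged_sgd_semidual(sample_x, y, nu, cost, eps=0.1, n_iters=10_000, c0=1.0):
    """Averaged-SGD sketch for a semi-discrete entropic OT semi-dual (illustrative only).

    sample_x : callable returning one sample x from the continuous measure mu
    y        : (J, d) support points of the discrete measure nu
    nu       : (J,) weights of nu
    cost     : callable cost(x, y) -> (J,) vector of ground costs c(x, y_j)
    """
    v = np.zeros(len(nu))         # dual variable attached to the discrete side
    v_avg = np.zeros_like(v)      # Polyak-Ruppert average, used as the returned estimate
    for k in range(1, n_iters + 1):
        x = sample_x()
        # soft assignment: chi_j proportional to nu_j * exp((v_j - c(x, y_j)) / eps)
        logits = (v - cost(x, y)) / eps + np.log(nu)
        logits -= logits.max()                     # numerical stabilization
        chi = np.exp(logits)
        chi /= chi.sum()
        grad = nu - chi                            # stochastic gradient of the semi-dual in v
        v = v + (c0 / np.sqrt(k)) * grad           # ascent step with O(1/sqrt(k)) step size
        v_avg += (v - v_avg) / k                   # running average of the iterates
    return v_avg
```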

#### Qualitative Assessment

My evaluation of the paper is balanced.

Pros: The writing of the paper is very clear. The algorithms are very natural and can be expected to work well on large-scale OT problems. The dual formulation is a good idea for enabling a stochastic algorithm. Everything is fine with all of these points. But...

Cons:
1/ The mirror descent strategy is not new. Some references should be added somewhere in the paper. Even though it has not already been applied to O.T. problems, it has been studied by several authors (not me! ;) ) when Bregman divergences are involved in the optimization problem. I guess it has also been studied with stochastic algorithms.
2/ There is no real theoretical analysis in the paper. The influence of the size of epsilon is certainly very important for the convergence rates of the (Av.) S.G.D. described here, because it quantifies the amount of strong convexity. A natural extension would use a cooling strategy $\epsilon \to 0$ as $n$ increases, to jointly combine the a.s. convergence of the S.G.D. with the concentration of $(S_\epsilon)$ (briefly described in the Appendix). Of course, such a study could represent a good contribution for a top-level statistical journal, and is not expected for NIPS. Nevertheless, the authors could include some theoretical insights in their paper (with a fixed epsilon, for example).

Now, I provide a linear description of my comments:
- L.16: instead of Sinkhorn's method, we can also use other optimization procedures of Teboulle to solve O.T. Sinkhorn's methods are the state of the art for this kind of problem only because the authors concerned with this problem have not read "Interior gradient and proximal methods for convex optimization" by Auslender and Teboulle. This kind of method could certainly be applied in this framework.
- L.80: please discuss the lack of strong convexity here.
- L.99: read the sentence ;)
- Section 2: it should be mentioned somewhere that Sinkhorn's method has few theoretical guarantees (rates).
- L.129: If I remember well, I have seen somewhere that it is possible to state quantitative results stronger than the one in the Appendix, in terms of epsilon and rates ($e^{-\lambda \epsilon^{-1}}$?), in Cominetti and San Martin, 92.
- L.161-186: The choice of the step size is the icing on the cake of all stochastic gradient algorithms. As far as I can read, the authors have not written clearly anywhere what this step size is for them; it is only hidden in the description of Algorithm 1. The constant step-size choice should be discussed somewhere, and a stopping criterion should be described. In Algorithm 1, why is $i$ chosen uniformly among $\{1, \ldots, I\}$? Is it possible to use $\mu_i$ to sample $i$ at each step $k$, and then update the distribution $\mu$? (A small sketch of this sampling alternative is given after this list.)
- L.192: I do not know what GloVe word embeddings are. Some insight on the way $c$ is built would be helpful.
- L.198: epsilon is chosen very small, leading to very weak strong convexity of the OT problem. Some comments are welcome.
- Algorithm 2: this time, the step size used is $k^{-1/2}$. Again, some comments on this choice should be made somewhere, and theoretical insights would be helpful for the reader.
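
Regarding the sampling question on Algorithm 1 in the list above, a tiny sketch of the two options (hypothetical sizes and weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
I = 1000
mu = rng.dirichlet(np.ones(I))   # hypothetical weights of the discrete source measure

# Algorithm 1 as described: the index i is drawn uniformly, and the sampled
# term keeps its weight mu_i inside the stochastic gradient.
i_uniform = rng.integers(I)

# The alternative raised in the review: draw i according to mu itself.
# For the estimator to stay unbiased, the mu_i factor would then be dropped
# from the sampled gradient (a standard importance-sampling reweighting).
i_weighted = rng.choice(I, p=mu)
```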

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 3

#### Summary

This paper presents methods for the stochastic optimization of large-scale (regularized) optimal transport (OT), built upon the dual smoothed representation of entropic regularized OT [Cuturi and Peyré, 2016] (not cited). The paper shows that the dual smoothed objective (which appeared in [Cuturi and Peyré, 2016]) can be rewritten as a finite separable sum, and thus immediately admits stochastic methods that sample one separate term at a time. As such, the paper explores three different settings of OT, where one or both measures are discrete or continuous, each with one or a few illustrative numerical examples. In particular, it proposes to use SAG [16] for the discrete-discrete setting, SGD for the discrete-continuous setting, and an RKHS parametrization for the continuous-continuous setting. In this paper, the convergence complexities are stated inaccurately (see my explanation below), the experiments are done very briefly, and some important claims are not well justified. The three research problems attempted certainly go beyond what an 8-page paper could embrace.

Cuturi, Marco, and Gabriel Peyré. "A smoothed dual approach for variational Wasserstein problems." SIAM Journal on Imaging Sciences 9.1 (2016): 320-343.
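
For reference, the "finite separable sum" structure referred to above can be sketched as follows (my own notation, additive constants omitted; one term of the sum is sampled per stochastic step):

```latex
% Discrete-discrete case: mu = sum_i mu_i delta_{x_i}, nu = sum_j nu_j delta_{y_j}.
% The smoothed (semi-)dual objective is a weighted finite sum over i, so a
% stochastic method such as SAG can sample one term h_eps(x_i, v) per iteration:
\max_{v \in \mathbb{R}^J}\; \sum_{i=1}^{I} \mu_i\, h_\varepsilon(x_i, v),
\qquad
h_\varepsilon(x, v) \;=\; \sum_{j=1}^{J} v_j\, \nu_j
  \;-\; \varepsilon \log\!\Big( \sum_{j=1}^{J} \nu_j\,
        e^{\frac{v_j - c(x,\, y_j)}{\varepsilon}} \Big)
```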

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 4

#### Summary

This paper proposes to use stochastic optimization to solve optimal transport problems. The authors study three types of optimal transport problems: discrete, semi-discrete, and continuous, under the assumption that both distributions can be sampled. In the discrete case, they apply SAG, a variance-reduced gradient method for minimizing finite sums, to the semi-dual problem. In the semi-discrete case, they apply SGD with averaging to the semi-dual problem. In the continuous case, they use kernel SGD to optimize the dual problem. The correctness and effectiveness of the algorithms are justified by analysis and experiments.
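
To make the discrete-discrete case concrete, here is a minimal SAG-style sketch on the finite-sum semi-dual (my own notation and a simple constant-step default; this is not the paper's Algorithm 1 or its tuning):

```python
import numpy as np

def sag_semidual(C, mu, nu, eps=0.01, step=None, n_iters=50_000, seed=0):
    """SAG-style sketch for the discrete-discrete entropic OT semi-dual (illustrative only).

    C  : (I, J) ground-cost matrix with entries c(x_i, y_j)
    mu : (I,) source weights,  nu : (J,) target weights
    Maximizes sum_i mu_i * h_eps(x_i, v) by sampling one index i per step and
    keeping in memory the last stochastic gradient seen for every i.
    """
    rng = np.random.default_rng(seed)
    I, J = C.shape
    if step is None:
        step = eps                       # simple constant-step default, not the paper's choice
    v = np.zeros(J)
    grad_table = np.zeros((I, J))        # last gradient seen for each term i
    grad_sum = np.zeros(J)               # running sum of the rows of grad_table
    for _ in range(n_iters):
        i = rng.integers(I)
        logits = (v - C[i]) / eps + np.log(nu)
        logits -= logits.max()           # numerical stabilization
        chi = np.exp(logits)
        chi /= chi.sum()
        g = mu[i] * (nu - chi)           # gradient of the i-th term of the sum
        grad_sum += g - grad_table[i]    # refresh the aggregated gradient
        grad_table[i] = g
        v = v + step * grad_sum / I      # SAG ascent step with the averaged gradient
    return v
```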

#### Qualitative Assessment

Overall, I enjoyed reading this paper and find that it reaches the quality standard for NIPS.

Pros:
-- The key idea is forceful and well supported. The three cases provide a complete and persuasive story of using stochastic optimization for optimal transport.
-- There is a sufficient amount of novelty. While the idea of SAG or SGD is not new, their application to the optimal transport setting is. Moreover, the SGD idea leads to solutions of semi-discrete and continuous OT problems without discretization.
-- The figures and algorithm tables are very clear and nice-looking.

Cons (or questions to be addressed):
-- For the semi-discrete or continuous OT cases, methods using discretization (if there are any) should be compared against. The authors warn about the discretization errors of such methods, but do not show their performance. Since these methods are the only baseline, it is important to know whether the proposed algorithms have any performance gain besides being conceptually attractive.
-- Could you comment on why the algorithm in the kernel SGD experiment converges so slowly? As this is only a toy problem, I am worried that the algorithm is impractical (a generic kernel-SGD sketch is given after this list for reference).

Minor comment: in Algorithm 3, should $\alpha_n$ be replaced by $\alpha_k$?
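
On the kernel SGD question above, a generic kernel-SGD sketch for a continuous-continuous entropic dual looks roughly like the following (my own notation; the Gaussian kernel, `sample_xy`, `cost`, and the $C/\sqrt{k}$ step are assumptions, and this is not the paper's Algorithm 3). One visible practical feature is that the dual potentials are stored as kernel expansions that grow by one term per iteration, so the per-iteration cost increases with the iteration count.

```python
import numpy as np

def kernel_sgd_dual(sample_xy, cost, eps=0.1, sigma=1.0, n_iters=2000, c0=0.1):
    """Generic kernel-SGD sketch for a continuous-continuous entropic OT dual (illustrative only).

    The dual potentials are kept as growing kernel expansions
        u(.) = sum_k a_k * kappa(x_k, .),   v(.) = sum_k b_k * kappa(y_k, .),
    so each iteration appends one stored sample pair and one pair of weights.

    sample_xy : callable returning one pair (x, y) with x ~ mu and y ~ nu
    cost      : callable cost(x, y) -> scalar ground cost c(x, y)
    """
    def kappa(z, points):                 # Gaussian kernel of z against all stored points
        if not points:
            return np.zeros(0)
        P = np.asarray(points)
        return np.exp(-np.sum((P - z) ** 2, axis=1) / (2.0 * sigma ** 2))

    xs, ys, a, b = [], [], [], []
    for k in range(1, n_iters + 1):
        x, y = sample_xy()
        u_x = float(np.dot(a, kappa(np.atleast_1d(x), xs)))   # evaluate current expansions
        v_y = float(np.dot(b, kappa(np.atleast_1d(y), ys)))
        g = 1.0 - np.exp((u_x + v_y - cost(x, y)) / eps)      # gradient factor at this sample
        step = c0 / np.sqrt(k)
        xs.append(np.atleast_1d(x))
        a.append(step * g)                # expansion of u grows by one term
        ys.append(np.atleast_1d(y))
        b.append(step * g)                # expansion of v grows by one term, so step k costs O(k)
    return (xs, np.array(a)), (ys, np.array(b))
```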

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)