
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a new family of "transportation
distances" for discrete domains, such as histograms, for which the earth
mover's distance (EMD) has been widely used. The problem is that EMD is
very expensive to compute as it is cubic (actually d^3 log d) in the
dimension of the observation space (the number of histogram dimensions,
d). In this paper it is argued that with the inclusion of an entropic
regularizer finding the optimal transportation distance becomes a much
simpler computational problem, one that can be solved in linear time and
is easily parallelized.
I do not work with EMD or such
measures extensively, but I very much enjoyed reading this paper. The
formulation seems very clean and the results appear impressive. The problem
is well motivated, the reader is guided through the technical development
quite nicely, and the exposition is generally very good. A nice paper to
read, with novel results (to me at least), and the potential for
significant impact in several areas of vision and learning.
The weakness, as such, is in the experiments. I wouldn't have
thought that MNIST was a natural dataset for this technique. First, the
inputs are images, which can be treated as discrete inputs suited to
EMD, but most people would place them in a continuous vector
space. Also, while the bins of a histogram may have no
intrinsic ordering per se, this is not the case with an image. And
finally, since MNIST has been beaten mercilessly over the last couple of
decades, it is not surprising that the error rates reported here are not
particularly good by most standards. That said, the results do strongly
support the authors' claims that the new optimal transportation
distance is both effective and extremely efficient to compute with
what they refer to as the Sinkhorn-Knopp algorithm.
Q2: Please summarize your review in 1-2
sentences
This paper proposes a new family of optimal
transportation distances for histogram domains. By adding an entropic
regularizer to a conventional distance measure, they show that one can
compute the optimal transportation distance in linear rather than cubic
time. This should make use of the EMD much more practical for a wide
spectrum of tasks.
Submitted by
Assigned_Reviewer_9
Q1: Comments to author(s).
This manuscript proposes to regularize the
computation of the earth mover's distance (EMD) with an entropic term,
with the dual aim of improving the empirical performance of the
distance and of significantly speeding up its computation. The
manuscript presents theoretical analysis in support of this
regularization approach, and empirical results demonstrating improved
classification accuracy for regularized EMD relative to a variety of
distances (including classical EMD), as well as faster computation
time.
The theoretical analysis, explaining how the entropic
regularization works, proving that the result is a metric, and describing
how to compute it efficiently, is interesting, though it is not clear how
much of this is truly novel, as opposed to being novel within the NIPS
community. The authors mention that this type of regularization has
long been known in the transportation theory literature.
The
MNIST experiments are well done, and the results are compelling.
Similarly, the timing experiment shows convincingly that the speedup
achieved here is significant. What is missing is timing information
for a real classification problem. The manuscript should have at
least mentioned the relative timing of the computations for the MNIST
data sets.
Smaller comments:
"EMD" (p. 2) needs to be
written out when it is first used.
The abbreviations for the key
in Figure 2 need to be spelled out.
The experiment in Section 5.1
apparently uses a variety of kernels within an SVM, but the SVM is
only mentioned obliquely. The first sentence in the section makes it
sound like the distances are being used directly to solve the MNIST
problem.
What exactly is the confidence interval shown in Figure
2? Confidences should be computed w.r.t. 6 complete labelings of the
data set, not 24 (i.e., the CV folds should not contribute to the
variance).
I don't understand why the manuscript says
"Sinkhorn distances are 10.000 times faster than EMD solvers." Is the
"." a decimal point? Or is this supposed to be "10000 times faster"?
Neither conclusion seems to be justified by the results in Figure 4.
Q2: Please summarize your review in 1-2
sentences
Introduces to NIPS readers a regularized earth
mover's distance. The concept is compelling, and the empirical
results (both in terms of classification performance and running
time) are convincing.
Submitted by
Assigned_Reviewer_11
Q1: Comments to author(s).
The paper studies variants of the Earth Mover's
distance (EMD) with either a constraint on the entropy or a penalty term
in the objective linear in the entropy. When there is a bound on the
entropy, the objective turns out to be a distance metric but the paper
does not show how to compute it efficiently. The other variant is not a
distance metric but the paper shows that the optimal transportation is the
unique stochastic matrix obtained by scaling rows and columns of the
matrix with entries exp(-lambda*m_{i,j}), where m_{i,j} is the
distance/transportation cost between i and j and lambda is the coefficient
of the entropy term in the objective. This stochastic matrix can be
approximated by the Sinkhorn-Knopp algorithm. The paper makes an interesting
observation about a variant of EMD that might be efficiently approximated.
However, the comparison with EMD in the paper is not fair, since the
authors insist on EMD being computed exactly while the variant being
studied is only approximated without any statement on the convergence rate
of the algorithm. It is conceivable that for the random vectors tested in
the paper, the algorithm converges quickly but it might be slow for
pathological vectors. It is also not clear if comparing with EMD on just
the original pixel space is the right comparison when the previous work
cited by the paper computed EMD in the gradient orientation histograms
(SIFT features) as well as the pixel space. The authors might also want to
compare the results with previous work using variants/approximations of
EMD, such as Kristen Grauman and Trevor Darrell, "Approximate
Correspondences in High Dimensions," NIPS 2006. The pyramid match
approach above is inspired by ideas from work on approximating EMD e.g.
Indyk, P., and Thaper, N. 2003. Fast Image Retrieval via Embeddings.
The paper is generally well written, but the authors might want to
include a statement of the Sinkhorn-Knopp theorem, a description of the
algorithm using formulas rather than Matlab code, and some statement on the
convergence rate of the algorithm.
============================
Rebuttal comments: there are many ways to speed up EMD
computation, one such approach is to use coarser distances as taken by
Pele and Werman already cited in the paper (round so that distances can
only be 1,2,3). I am not sure if there are publicly available
implementations taking advantage of the small weight range even though
this is an area with a lot of theoretical results such as
Gabow,
Harold N., and Robert E. Tarjan. "Faster scaling algorithms for network
problems." SIAM Journal on Computing 18.5 (1989): 10131036.
It
would be interesting to compare to that approach in terms of speed and
retrieval accuracy.
Q2: Please summarize your review in 1-2 sentences
The paper makes an interesting observation that after
adding a penalty term based on the entropy of the transportation to the
objective of EMD, the solution can be found via the Sinkhorn-Knopp
algorithm. The variant seems interesting and might merit further
investigation but the comparison made in the paper with EMD is
inadequate.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their reviews. Some of the
issues they raise are crucial, and we will incorporate this discussion in
our draft. We also thank them for reading this rebuttal.
******
Reviewer 11:  However, the comparison with EMD in the paper is not
fair, since the authors insist on EMD being computed exactly
Many
thanks for this comment which underlines, in fact, a strength of our
approach. We need to clarify this.
We do not insist on getting
exact solutions for EMD: this is in fact the only practical choice we
have. All EMD solvers that we know of build upon the network simplex. None
offers an early stopping option.
Because of its combinatorial
nature, the simplex does not have obvious stopping rules. Indeed, the
change in the objective between two successive iterations is not
necessarily decreasing, and one can build examples where the simplex
achieves its largest improvement at the last iteration. For instance, in
3D, think of a feasible set shaped like an ice cream cone: a polyhedral
ball on top of a cone. Assume the simplex starts from the top of the
ice cream ball and tracks back to the optimum, the tip of the cone, moving
from vertex to vertex. The very last iteration gives the largest decrease
in the objective.
We could imagine computing in parallel a dual
simplex (this would basically double the execution cost) and use the dual
gap as a stopping rule, or arbitrarily stop the execution. To our
knowledge, neither has been studied nor advocated. This stands in stark
contrast to the ability we have to stop early with the Sinkhorn method,
and easily control this approximation by checking the convergence of x or
that of the marginals of the current iterate.
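In numpy-like code, such an early-stopping test looks as follows (an illustrative sketch, not the exact Algorithm 1 of the paper; the variable names and tolerance are ours):

```python
import numpy as np

def sinkhorn(r, c, M, lam, max_iter=1000, tol=1e-9):
    """Regularized transport cost between histograms r and c.

    r, c : 1-D marginals (nonnegative, summing to 1)
    M    : (d, d) ground cost matrix
    lam  : entropic regularization coefficient
    """
    K = np.exp(-lam * M)                      # element-wise kernel exp(-lambda*m_ij)
    u = np.full(len(r), 1.0 / len(r))
    for _ in range(max_iter):
        v = c / (K.T @ u)                     # scale columns to match marginal c
        u_next = r / (K @ v)                  # scale rows to match marginal r
        if np.max(np.abs(u_next - u)) < tol:  # early stop: u has converged
            u = u_next
            break
        u = u_next
    v = c / (K.T @ u)
    P = u[:, None] * K * v[None, :]           # transport plan diag(u) K diag(v)
    return float(np.sum(P * M))               # cost <P, M>
```

Stopping once u (or the marginals of the current plan) stops moving is exactly the kind of controlled approximation that the simplex does not offer.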
 [...] without any
statement on the convergence rate of the algorithm
Sinkhorn-Knopp
has a linear convergence rate (l.57, l.367): an upper bound on ||x_t - x^*||
decreases geometrically with t. The constant is reported by Knight (2008)
for the bistochastic case. Adapting these results for nonuniform
marginals is straightforward (private communication with Knight) and we
will add them. The constant depends on the smallest values of r_i, c_i and
exp(-lambda m_ij), and converges to 1 as these numbers go to 0.
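Concretely, linear convergence here means geometric decrease of the approximation error (C and rho below are our notation for the constants):

```latex
\exists\, C > 0,\ \rho \in (0,1) \ \text{such that}\quad
\|x_t - x^\star\| \;\le\; C\,\rho^{t} \qquad \text{for all } t \ge 0,
```

with rho the constant reported by Knight (2008) in the bistochastic case.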

It is also not clear if [...].
We chose the MNIST dataset because
there is no ambiguity on what the ground metric should be (pixel
distances) and because it is a widely recognized benchmark. We have
started looking into more advanced datasets in computer vision.

The authors might also want to compare [...]
We will. Note however
that these approximations of EMD are known to perform worse than the exact
EMD and can only compare clouds of points in low dimensions (e.g. planar
case), where the ground metric M can only be the Euclidean distance
between these points. Our approach works for any ground metric M and any
Wasserstein exponent.
 might want to include [...] We will.
****** Reviewer 5  while the ordering of bins in a histogram
may have no intrinsic ordering per se, this is not the case with an image.
Our histograms are not permutation invariant: for any two of the d=400
bins of the 20x20 pixel grid, we use for m_ij the Euclidean
distance between bins i and j (l.293). Please correct us if we
have misunderstood your comment.
****** Reviewer 9:  it is
not clear how much of this is truly novel, as opposed to being novel
within the NIPS community.
Thank you for pointing this out. Most
of the paper is truly novel; let us draw the line:
What was known
in the transportation literature & introduced here to the NIPS
community: = The regularized transportation problem can be computed
using Sinkhorn-Knopp (a.k.a. Bregman balancing) to obtain a "smoother"
optimal transportation. Such regularized optima are used to predict the
actual movements of people in transportation networks.
Our
contributions: = Using the solution of the regularized problem to
define a transportation metric = Proving that a hard threshold on the
entropy of P results in a true metric (with a new version of the gluing
lemma, lemma 1), and that alpha=0 yields a psd kernel. = Vectorizing
Sinkhorn-Knopp: optimal transportations from one r to many c's can be
computed altogether, with large speedups over separate executions.
Specialists of the algorithm we have approached (Knight) were unaware of
this fact. This remark is trivial in hindsight, but crucial to improve
performance, use GPUs, and beat the simplex by a large factor.
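As an illustrative numpy sketch of this vectorization (our own variable names; the paper's Matlab/GPU code differs):

```python
import numpy as np

def sinkhorn_vectorized(r, C, M, lam, n_iter=200):
    """Regularized transport costs from one histogram r to all N columns of C.

    r : (d,) source histogram
    C : (d, N) matrix, one target histogram per column
    M : (d, d) ground metric
    """
    K = np.exp(-lam * M)                      # kernel exp(-lambda*m_ij)
    U = np.full((len(r), C.shape[1]), 1.0 / len(r))
    for _ in range(n_iter):
        V = C / (K.T @ U)                     # all column scalings at once
        U = r[:, None] / (K @ V)              # all row scalings at once
    V = C / (K.T @ U)
    # cost_j = u_j^T (K o M) v_j, computed for all j in one matrix product
    return np.sum(U * ((K * M) @ V), axis=0)
```

The point is that each iteration costs two (d x d)-by-(d x N) matrix products instead of N separate matrix-vector products, which is what makes GPU execution effective.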

[...] relative timing of the computations for the MNIST data sets.
The MNIST dataset is such that d=400 but most vectors are sparse,
typically of support 120-180. Numbers below are for one average execution:
emd_fast: 200 ms
emd_mex: same if naively used, but can be tweaked to leverage sparsity: 43 ms
Sinkhorn (lambda=9): 0.82 ms on a single core (50x faster than emd_mex)
SinkhornGPU (lambda=9): 0.30 ms (150x faster)
The speed-up using GPUs here is less impressive (extra x3
instead of extra x10) because GPUs are not fast at indexing (extracting
nonzero components). These numbers agree with figure 4 (for d between 128
& 256).
 the SVM is only mentioned obliquely [...] Your
question is answered in l.279. We will clarify this.
 the CV
folds should not contribute to the variance We were not aware of this
issue, many thanks for pointing this out. Can you give us a reference
please? Does your remark still apply in our setting where training sets do
not overlap? (We only train on one fold and test on the three remaining folds.)
 up to 10000 times faster Indeed, we meant 10,000 and not
10.000. These speedups refer to the area for large d of figure 4. One unit
of the log-scale grid corresponds to a 10^2 speed-up, so we do indeed observe
that order of magnitude on the right side.
PS: there are two small typos in Algorithm 1. We apologize for the confusion:
l.235: Set x=ones(length(r),size(c,2))/length(r);
l.238: v=c.*(1./(K'*u))
 