Summary and Contributions: The authors propose a new algorithm, Sinkhorn natural gradient, for learning parameters in a generative model. A fairly thorough theoretical analysis is provided, and some experimental validation is presented as well. I appreciate the authors' response and have increased my previous score.
Strengths: I think it is a very good paper. It develops a new algorithm akin to natural gradient, and also develops a computational theory of the Sinkhorn information matrix, a computationally tractable alternative to the Wasserstein information matrix. The approximation theory for the gradients and Hessians is very welcome; it involves some tedious calculations, but to my understanding they are novel and a nice contribution to the community. I foresee that this paper will generate impact.
Weaknesses: The validation section is a bit weak and could be improved. I am not surprised by the computational improvements, given the use of extra Hessian information; however, the computational cost of that operation can be large. The authors should either comment on this or quantify the gains. The paper would be much stronger if the authors were able to argue that this extra computational expense is justified. Does the quality of generated images improve? The authors may show examples of generated images. How do these results compare to those of Genevay et al., 2018? It is too bad that the cost function and domain have to be bounded. How might the authors deal with the unbounded case? Perhaps they could appeal to the results in https://arxiv.org/abs/1905.11882.
Correctness: The arguments seem correct to me.
Clarity: It is well written. Some notation seems unnecessary (e.g., defining the discrete operator \mathcal{A}).
Relation to Prior Work: Prior work is correctly acknowledged. Perhaps the authors should also cite other work that deals with the Hessian of the Sinkhorn divergence (e.g., https://arxiv.org/pdf/1812.09150.pdf).
Reproducibility: Yes
Additional Feedback: It is too bad that the cost function has to be bounded. How might the authors deal with the unbounded case? Perhaps they could appeal to the results in https://arxiv.org/abs/1905.11882. Additionally, the Sinkhorn EM algorithm was recently proposed, which involves similar Hessian computations; the authors may comment on the relationship between the two methods (if any): https://arxiv.org/abs/2006.16548. There is an unsatisfying issue, which is not unique to this paper: the authors develop the natural gradient method using the Sinkhorn divergence (i.e., the debiased entropic OT distance) but then define the Sinkhorn information matrix using the entropic OT distance without debiasing. This is confusing, as it gives no insight into which of the two really matters. Some more clarification on this would be appreciated.
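To make the debiasing point concrete, the two candidate objectives are, in standard notation (a sketch following e.g. Genevay et al.; the paper's exact normalization may differ):

$$
\mathrm{OT}_\varepsilon(\mu,\nu) = \min_{\pi \in \Pi(\mu,\nu)} \int c(x,y)\, \mathrm{d}\pi(x,y) + \varepsilon\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu),
\qquad
S_\varepsilon(\mu,\nu) = \mathrm{OT}_\varepsilon(\mu,\nu) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\mu,\mu) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\nu,\nu).
$$

It would help to state explicitly whether the second-order object defining the information matrix should come from $S_\varepsilon$ (the quantity being minimized) or from $\mathrm{OT}_\varepsilon$, and whether the difference matters at the population optimum where the two arguments coincide.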
Summary and Contributions: The authors aim at deriving a formula for the natural gradient with Sinkhorn proximity constraints in the case where the probability distributions are given by the pushforward of a ground distribution through a parametrized map, the standard setting for GANs. Edit after rebuttal: I have read the authors' response and the other reviews, and I have updated my score.
Strengths: The derived methods seem well grounded in entropy-regularized optimal transport, and the derived formulas seem easy to implement.
Weaknesses: The experiments could have been more extensive, and a thorough description is missing. Several questions are left unanswered: How stable is the training with the Sinkhorn natural gradient? How good are the results? How good would the results be if the ground cost were fixed (non-adversarial)? Etc.
Correctness: The methods seem correct.
Clarity: The structure, notation, and clarity of the paper could be improved. For instance, the part 'Training adversarial generative models' is not well written.
Relation to Prior Work: Some differences are highlighted, such as the relation to Wasserstein natural gradients, but not much is discussed.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: The paper studies the minimization problem in generative models. The authors consider using a gradient operator generated by the Sinkhorn divergence, a perturbed Wasserstein divergence. They design a fast algorithm based on the Sinkhorn natural gradient. In experiments, they compare the Sinkhorn natural gradient with other methods and demonstrate its numerical benefits.
Strengths: The paper studies crucial questions in AI optimization problems, namely how to construct a parameterization-invariant gradient operator in parameter space. The current natural gradient in the Fisher-Rao metric can be ill-defined for generative models. The Wasserstein natural metric and its generalizations are very useful in this direction.
Weaknesses: I have several concerns about this work. 1. For the Sinkhorn divergence, can you provide several analytical examples in which the SIM has a closed-form solution, such as a generative model on a one-dimensional space or a Gaussian family? 2. If the ground cost is not quadratic, does the Sinkhorn divergence still provide a metric structure? For example, if the ground cost is homogeneous of degree one, e.g., c(x,y) = |x-y|_1, can you provide a Sinkhorn information matrix? Any simple example?
Correctness: The method is correct.
Clarity: The paper is well written. Some clarification is needed when the ground cost is not homogeneous of degree two.
Relation to Prior Work: For the discussion of the Sinkhorn divergence, a discussion of the connection with the Schrödinger bridge problem and Fisher information regularization is needed. See the following references: 1. F. Leger, W.C. Li. Hopf-Cole transformation via generalized Schrödinger bridge problem. 2. S.N. Chow, W.C. Li, C.C. Mou, H.M. Zhou. A discrete Schrödinger bridge problem via optimal transport on graphs. For the discussion of the Wasserstein information matrix, the following important papers are also needed: 1. W.C. Li, J.X. Zhao. Wasserstein information matrix. 2. W.C. Li, G. Montufar. Ricci curvature for parametric statistics via optimal transport, Information Geometry.
Reproducibility: Yes
Additional Feedback: The authors addressed my questions. In particular, they point out an interesting analytical example. A good paper. It would also be good to include this example in the revision.
Summary and Contributions: This paper proposes a natural gradient algorithm for training implicit generative models. The authors propose to use the Sinkhorn divergence as the "natural" metric. They then derive an update that requires the computation of the Sinkhorn Information Matrix (SIM) and propose a way to approximate it efficiently. They also provide an empirical version of the SIM. Finally, they show experimentally that the proposed method converges faster than standard SGD-type algorithms for training a generative model.
Strengths: I believe this work is an important contribution to the field of natural gradient algorithms. The work is well motivated by theory, and the derivation of the algorithm is quite rigorous; the paper tries to justify each choice as much as possible through theory.
Weaknesses: I think the experimental section of the paper is a bit weak, and I'm not convinced the proposed algorithm can scale very well (see below for more details).
Correctness: I haven't checked the proofs, but from a quick look the claims seem reasonable and well supported. The empirical methodology is a bit weak:
- The authors compare the convergence of the different methods in terms of the number of epochs; however, I believe this is a bit unfair to the other, much simpler methods. Wall-clock time would be more appropriate, as the cost per iteration of the proposed method could be several times higher than that of SGD.
- It would have been nice to have confidence intervals for the plots.
- A lot of questions remain unanswered: what is the effect on convergence of the number of conjugate gradient iterations used to compute the inverse (see the sketch after this list)? Same question for computing the SIM. Furthermore, what is the influence of the batch size? I suspect that the proposed algorithm requires a larger batch size in order to work.
- I find the setting for comparing the different methods quite limited. It would have been more interesting to compare the performance of the proposed algorithm in a standard GAN setup.
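To make the conjugate-gradient question concrete, here is a minimal sketch of a damped natural-gradient step in which the information matrix is only accessed through matrix-vector products. The names `sim_matvec` and `grad_loss` are hypothetical placeholders, not the authors' implementation, and `n_iters` is exactly the knob whose effect on convergence I would like to see quantified.

```python
import numpy as np

def conjugate_gradient(matvec, b, n_iters=10, tol=1e-8):
    """Approximately solve G x = b with n_iters conjugate gradient iterations,
    using only matrix-vector products G @ v supplied by `matvec`."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # residual b - G x
    p = r.copy()               # search direction
    rs_old = r @ r
    for _ in range(n_iters):
        Gp = matvec(p)
        alpha = rs_old / (p @ Gp)
        x = x + alpha * p
        r = r - alpha * Gp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def natural_gradient_step(theta, grad_loss, sim_matvec, lr=0.1, damping=1e-3):
    """One step theta <- theta - lr * (G + damping * I)^{-1} grad, where
    sim_matvec(theta, v) returns an approximation of the SIM-vector product."""
    g = grad_loss(theta)
    matvec = lambda v: sim_matvec(theta, v) + damping * v  # damping keeps the solve well conditioned
    direction = conjugate_gradient(matvec, g)
    return theta - lr * direction
```

The per-step cost is then roughly `n_iters` SIM-vector products plus one gradient evaluation, which is why a wall-clock comparison (and a study of how small `n_iters` can be) would strengthen the experiments.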
Clarity: The paper is overall well written and quite clear. However, the experimental section lacks some details, such as the number of conjugate gradient iterations used to compute the inverse and the number of iterations (or the accuracy) used for computing the SIM.
Relation to Prior Work: The related work is well discussed.
Reproducibility: No
Additional Feedback: === Edit after rebuttal === I'm increasing my score as the authors addressed most of my comments in the rebuttal.