__ Summary and Contributions__: The paper presents a robust optimal transport formulation to handle outliers. As the part of learning objective, it learns weights for each sample such that the outlier samples get lower weights. The method's application is shown on GAN and UDA.

__ Strengths__: - Novel method and achieves good experimental results.
- Well written paper; easy-to-follow, clearly positions the proposed method with the existing approaches.
- Overall satisfactory evaluation.

__ Weaknesses__: There are two concerns I have with this work.
- The method could lead to an increase or exacerbation of the biases present in the dataset. Consider training face generator, the face datasets like CelebA-HQ and FFHQ have some strong biases like largely Caucasian faces, very few blond males etc. Now training a GAN on these datasets with the proposed method, the GAN might completely or almost drop African faces or blond-male faces because they are very few in the dataset and further weighted low by robust OT method. On the other hand it might increase the presence of Caucasian-smiling-female faces (dominant attributes) due to higher presence and larger weights.
- Possibility of mode drop. What prevents this method from mode drop when the mode is difficult to model? For example, if you want to train a generator with 60-70% MNIST and 30-40% SVHN, the SVHN images are difficult compared MNIST images. I think there is a risk that the generator trained with robust OT method might completely drop the difficult images, which are definitely not outliers given the 30-40% presence.
My overall suggestion is to have an analysis on what is the impact of the method on the distribution, which is to observe the percentage of each class in the generated data or class wise performance in the case of UDA. If there are some drawbacks in these cases then, it should be presented.
Another point, related to the above points, what happens if you flatten the weights or make it uniform except for the outliers. It could be a trick to circumvent issues of biases or mode drop.

__ Correctness__: The empirical methodology and claims seems correct.

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: The related work is well discussed and the proposed work is also well placed through this discussion.

__ Reproducibility__: Yes

__ Additional Feedback__: ## Post-rebuttal ##
I find the rebuttal convincing for the issues raised in the reviews. My concerns about inducing biases and mode drop are well addressed with experimental results. Therefore, I increase my rating to 7.

__ Summary and Contributions__: In the paper, the authors proposed a communicationally-efficient dual form of the robust optimal transport (OT) optimization that is useful for large-scale deep learning applications. In particular, the main idea of the paper is motivated from a robust formulation of OT with unbalanced marginal constraints, which is the popular unbalanced optimal transport (UOT) in the literature. However, training the neural networks with the dual form of UOT is challenging as it involves maximizing over two discriminator functions with some constraints. Therefore, the authors proposed to constrain the transportation plan in the UOT problem to have its marginal constraints, which they refer to as relaxed distribution, lying in the probability space and the f-divergences between the target distributions and the relaxed distributions to be bounded by some given constants. They demonstrated that their formulation can work efficiently with applications of GAN and domain adaptation.

__ Strengths__: In my opinion, the robust optimal transport proposed by the authors is quite interesting. The careful experiments with GAN and domain adaptation demonstrate the effectiveness of the framework for dealing with outliers. The theoretical results in Theorem 1 and Theorem 2 are reasonable. I checked the proofs of these results in the Supplementary material and they all seem correct to me.

__ Weaknesses__: There are a few concerns with the paper that the authors should address:
(1) Regarding the formulation of robust optimal transport in equation (5), when the probability measures $P_{X}$ and $P_{Y}$ are discrete, the objective function with $\pi$ is a linear programming problem. Solving directly the linear programming problem can be expensive as the standard interior point methods have complexity of order $n^3$ where $n$ is the maximum number of supports of $P_{X}$ and $P_{Y}$. For this reason, we can utilize the entropic regularized approach to equation (5). Eventually, it leads to a strongly convex objective function as follows:
\min_{P_{\tilde{X}}, P_{\tilde{Y}}} \min_{\pi} \pi_{ij} C_{ij} - \eta H(\pi),
s.t. D_{f}(P_{\tilde{X}}|| P_{X}) \leq \rho_{1}, D_{f}(P_{\tilde{Y}}|| P_{Y}) \leq \rho_{2}.
Here, $H(\pi)$ stands for the entropy of $\pi$.
The above objective function is equivalent to solving the following optimization problem:
\min_{P_{\tilde{X}}, P_{\tilde{Y}}} \min_{\pi} \pi_{ij} C_{ij} + \tau_{1} D_{f}(P_{\tilde{X}}|| P_{X}) \leq + \tau_{2} D_{f}(P_{\tilde{Y}}|| P_{Y}) \leq \rho_{2} - \eta H(\pi),
where $\tau_{1}, \tau_{2}, \eta > 0$ are some given positive numbers.
The objective function is strongly convex with respect to $\pi$. When $D_{f}$ is KL-divergence, the dual form of that objective function has very nice form (see the reference [1]). We essentially can use Sinkhorn algorithm to solve it. The Sinkhorn framework is also differentiable; therefore, it can be used for deep learning framework.
Now, my question is: how efficient the current proposed objective functions (7) and (8) comparing to the Sinkhorn framework when we utilize the entropic regularized framework? It will be great if the authors can provide some initial experiments on the performance of Sinkhorn algorithm for their problem when $P_{X}$ and $P_{Y}$ are discrete.
(2) I am in fact confused about the objective function (8). How is it easier to implement than the objective function (7)? It will be helpful if the author can provide an explicit algorithm for solving (8) in the main text or in the supplementary material.
Note that, the complexity of the standard algorithms for solving both (7) and (8) are at least at the order of $n \times m$ (up to some accuracy level) where $n$ and $m$ are respectively the number of supports of $P_{X}^{(m)}$ and $P_{Y}^{(n)}$. When both $n$ and $m$ are large, these standard algorithms are not scalable. Therefore, we need to devise careful optimization algorithms in order to deal with the large-scale of $n \times m$.
(3) Assume that we have $n$ i.i.d. data from the corrupted distribution $P_{X} = (1 - \gamma) P_{X}^{c} + \gamma P_{X}^{a}$. Denote $P_{n}$ as the empirical measure of these data. What can we say about the gap between $P_{n}$ and $P_{X}$ under the robust optimal transport based on $n$? Can we guarantee that the gap goes to 0 as the sample size goes to infinity?
(4) Typos:
--- In Theorem 1, the function k in the maximum in equation (1) should be function D. Please also fix the proof of Theorem 1 in the Appendix accordingly as the notation are not consistent with that in the main paper.
--- Line 137: $P_{Y}^{(m)}$ should be changed to $P_{Y}^{(n)}$.
--- Line 140: $w^{x} \in \Delta^{m}$ should be changed to $w_{x} \in \Delta^{m}$.

__ Correctness__: I checked most of the claims, the theories, and the experiments. They seem correct to me.

__ Clarity__: The paper is quite well-written. There are a few typos in the main paper for which I already listed. The authors should try to proofread the paper more to make sure that it is typo-free.

__ Relation to Prior Work__: The paper does not cite several related recent work on unbalanced optimal transport. A few examples of these work are:
[1] K. Pham, K. Le, N. Ho, T. Pham, H. Bui. On Unbalanced Optimal Transport: An Analysis of Sinkhorn Algorithm. ICML, 2020.
[2] T. Séjourné, J. Feydy, F. Vialard, A. Trouvé, G. Peyré. Sinkhorn Divergences for Unbalanced Optimal Transport. arXiv preprint arXiv: 1910.12958, 2019.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors are motivated by three factors: (A) the increasing use of OT distances in objective functions for generative modeling (specifically GANs), and domain translation; (B) the sensitivity of the OT distances to outliers; and (C) the difficulty employing existing robust OT solvers in deep learning due to their instability.
The authors introduce a novel, and efficient robust OT measure (a variant of the wasserstein distance). Their formulation can be implemented as learning a re-weighting of the training data, with bounded divergence from the uniform distribution. They demonstrate the utility of their method for training GANs where the training data contains outliers, and for improving accuracy in domain adaptaion. In addition they prove that their proposed robust OT measure is upper bounded by a constant factor relative to the true OT distance when outliers are removed from the distribution.

__ Strengths__: + The authors experiments qualitatively and quantitatively demonstrate that their measure improves the sample quality of GANs when the dataset is intentionally corrupted with outliers (CIFAR+MNIST) or when the dataset naturally contains outliers (DomainNet Sketch). They also show that it does not greatly affect sample quality when the dataset is clean (CIFAR)
+ The sample weights learned by their method gives a simple way of interpreting which portions of the distribution were ignored during training
+ The proposed robust adversarial measure dramatically improves of the adversarial baseline when used for domain adaptation. Result improve further to be on par with competing methods when entropy regularization is applied to the the target logits.
+ The authors derive two versions of their measure, one suitable to discrete distributions, and the other suitable to continuous distributions (or large datasets), and describe the practical details of using both.
+ Clarity: the authors clearly explain their proposed measure by comparing and constrasting it with unbalanced OT, and the Kantrovich-Rubeninstein duality.

__ Weaknesses__: - The authors do not explore the effect of their formulation on more recent GAN architectures, or high resolution datasets.

__ Correctness__: The empirical claims of the author are well supported in the main text and supplement, and the proofs of the author's theoretical claims in the supplement seem correct.

__ Clarity__: Yes, the paper is well-written and easy to follow.

__ Relation to Prior Work__: Yes, this is discussed throughout the paper, despite the lack of a centralized related works section.

__ Reproducibility__: Yes

__ Additional Feedback__: ### Post-Rebuttal Comments ###
After reading the rebuttal I agree that the authors evaluate their method on generally accepted benchmarks for the field, and this addresses my concern. I am keeping my previous score.
###

__ Summary and Contributions__: This paper proposes to optimize unbalanced OT with a computationally effective dual form. Two solvers for the dual problem are developed and shows high stability in large-scale applications. The theoretical analysis also demonstrate the robust OT can handle outliers. Both the empirical end theorical results show the effectiveness of the proposed method.

__ Strengths__: +The paper is well-written. The motivation of the proposed method is well clarified in the Introduction. The backgrounds in Section 2 also help readers understand. The method is described clearly and easy to follow.
+The theorical analysis is solid. Theorem 2 in Section 3.3 shows that the robust OT can be upper bounded. The proposed method is demonstrated to be effective from a theoretical perspective.
+The experiment results in multiple datasets show the superiority of the proposed method.

__ Weaknesses__: -The proposed methods is both used in generative models and domain adaptation. Actually, when applying a method on the two areas (i,e., generative models and domain adaptation), special design may be needed. Could the authors compare the difference more detailly between the way of applying the proposed method on the two areas?
-For domain adaptation experiments, the proposed method is compared with the existing methods on VISDA-17. Experiments on more datasets (e.g., Office-31) are needed.

__ Correctness__: Yes.
Yes.

__ Clarity__: This paper is well written and easy to read.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: