Review for NeurIPS paper: Distribution Matching for Crowd Counting

NeurIPS 2020

Review 1

Summary and Contributions: The paper presents a distribution-matching-based loss for crowd counting. Most existing methods place Gaussians at head locations to generate a heat map, which is then used as the target for training the network. The authors argue that this practice can hurt performance, and hence propose an optimal-transport-based distribution matching loss to train the network. The authors show that using this loss results in a much tighter bound on the error. The authors evaluate their method on 4 datasets and show improvements on some of them.
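For readers unfamiliar with the convention being criticized, the following is a minimal sketch of how a Gaussian-smoothed target is commonly built from dot annotations. The function name `gaussian_density_map`, the fixed `sigma`, and the per-head renormalization are illustrative assumptions, not the exact procedure of the paper or of any baseline.

```python
import math

def gaussian_density_map(points, height, width, sigma=4.0):
    """Place one normalized 2D Gaussian per annotated head (illustrative only).

    points: list of (row, col) head locations inside the image grid.
    Returns a height x width list of lists whose sum equals len(points)
    up to floating-point error, because each Gaussian is renormalized
    over the grid.
    """
    density = [[0.0] * width for _ in range(height)]
    for (pr, pc) in points:
        # Unnormalized Gaussian weights centered at the annotated point.
        weights = [
            [math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(sum(row) for row in weights)
        # Renormalize so each head contributes exactly one unit of mass.
        for r in range(height):
            for c in range(width):
                density[r][c] += weights[r][c] / total
    return density

# Hypothetical annotations on a 32 x 32 image.
heads = [(5, 5), (10, 20), (12, 22)]
target = gaussian_density_map(heads, height=32, width=32)
count = sum(sum(row) for row in target)  # integrates to the number of heads
```

The map preserves the total count but spreads each annotation over neighboring pixels; the choice of `sigma` is the kind of design decision the paper argues against.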

Strengths: 1. Representing targets for crowd counting is a known issue, and there is recent interest in addressing it. The proposed method is an interesting approach to this issue. 2. The authors show theoretically and empirically that the proposed solution results in lower error.

Weaknesses: 1. Novelty - The contributions of the paper in terms of novelty are: (i) the idea of using an OT-based distribution matching loss, (ii) theoretical results showing that such a loss results in lower error, and (iii) empirical results showing that such a loss indeed results in lower error. For the crowd counting community, these may be considered considerable contributions. However, the paper does not address whether this is of interest to the broader vision/ML community, which is expected for a venue like NeurIPS. For example, the authors could have considered a broader set of applications, such as object detection, for evaluating their method.

2. Related work - The authors have not considered several recent works like [1-8] in their related work discussion. Further, the authors should have given better background on the recent works that have focused on improving the representation of the ground truth [2,9,10] for training the networks.

3. Comparison of results - The authors claim that they obtain significant improvements on 4 datasets compared to recent SOTA. First, they have conveniently left out comparisons to recent methods like [2,11,12,13,14] that have better or comparable results compared to the proposed method. Second, their improvements over SOTA cannot be considered significant on all the datasets. In many cases, the numbers are too close. For example: (i) on UCF-QNRF, their MAE is 85.6 vs 88.7 for BL (I agree the MSE is much better), (ii) the ShanghaiTech numbers are too close to BL, (iii) on UCF-CC-50, the difference in MAE is only 1.2 (as compared to CAN) and the MSE for OT is much worse compared to CAN.

[1] From Open Set to Closed Set: Counting Objects by Spatial Divide-and-Conquer. ICCV 2019
[2] Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting. ICCV 2019
[3] Leveraging unlabeled data for crowd counting by learning to rank.
[4] Top-down feedback for crowd counting convolutional neural network. AAAI 2019
[5] Where are the Blobs: Counting by Localization with Point Supervision. ECCV 2018
[6] Crowd Counting via Adversarial Cross-Scale Consistency Pursuit. CVPR 2018
[7] DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density. CVPR 2018
[8] Crowd Counting with Deep Negative Correlation Learning
[9] Adaptive Density Map Generation for Crowd Counting. ICCV 2019
[10] Improving Dense Crowd Counting Convolutional Neural Networks using Inverse k-Nearest Neighbor Maps and Multiscale Upsampling. VISAPP 2020
[11] Crowd Counting with Deep Structured Scale Integration Network. ICCV 2019
[12] Relational Attention Network for Crowd Counting. ICCV 2019
[13] PaDNet: Pan-Density Crowd Counting. TIP 2019
[14] HA-CCN: Hierarchical attention-based crowd counting network. TIP 2019

Correctness: Yes

Clarity: Yes

Relation to Prior Work: No.

Reproducibility: Yes

Additional Feedback: Update after the rebuttal: I had 3 main issues with this paper - 1. impact on the broader vision community, 2. lack of discussion of related work, and 3. missing comparisons.

1. Impact on the broader vision community - I had asked the authors to discuss the broader impact, which is important for NeurIPS. Similar concerns were raised by R2, who asked if it is possible to “transfer DM-count similar tasks like key point regression”. Instead of focusing on this in their rebuttal, the authors go on to argue that crowd counting is very appropriate for NeurIPS by stating that “4 dozen papers are published at top tier conferences”. This statement does not address whether their solution is applicable to other problems like the one R2 suggested. For now, the paper is narrowly focused on crowd counting.

2. Related work - The authors state that it is not possible to cite each and every paper. While this is true, they are expected to cite the papers that share the motivation of improving the representation of density maps [2,9,10]. These works also focus on improving the density maps for better learning, yet they are not discussed. The authors refer to lines 82-84 in their rebuttal, but I don’t see these references there.

3. Missing comparisons - My main concern is that they left out comparisons with methods that were better than theirs, which is misleading to the reader. I do understand that it may not be possible to outperform all the methods on all the datasets, and that is okay. However, a discussion of why this is the case would be helpful for the reader. Additionally, some of their statements are misleading: (i) “Our method ranked first in the leaderboard 26 at the time of submission, reducing SOTA error from 105 to 88” is misleading, since BL has achieved 88.7 and it is an ICCV 19 work; (ii) their method is highlighted in the table for ShanghaiTech-B, but [11] performs better.
After carefully reading the rebuttal, I find that the authors have not completely addressed my concerns, especially with respect to the impact on the broader set of vision problems. However, as I stated in my original comments, I do believe that the contributions are novel, especially for the crowd counting community. Also, after reading the comments from other reviewers and the rebuttal to them, I reconsider my earlier decision and upgrade the rating of the paper. The authors are encouraged to include a discussion of how their method could be applied to other similar problems, and to discuss at least the most relevant related work, if not all of it.

Review 2

Summary and Contributions: This work demonstrates that using a Gaussian kernel to smooth the ground-truth dot annotations can hurt the generalisation bound, and it recasts the crowd counting task as a distribution matching problem through the proposed DM-Count. There are three loss terms in DM-Count: the counting loss, the OT loss, and the Total Variation (TV) loss. The performance of DM-Count on the widely used datasets is better than state-of-the-art results.
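How the three terms might fit together can be sketched as below. The weights `lambda_ot` and `lambda_tv`, the function names, and the exact form of each term are assumptions made for illustration: the counting term is taken as the absolute difference of total counts, the TV term as half the L1 distance between the normalized maps (the standard total variation distance between two distributions), and the OT term is passed in as a precomputed value. The paper's own definitions take precedence.

```python
def counting_loss(pred, gt):
    """Absolute difference between predicted and annotated total counts."""
    return abs(sum(pred) - sum(gt))

def tv_loss(pred, gt, eps=1e-12):
    """Total variation distance between the two maps after normalization."""
    p_sum = sum(pred) + eps  # eps guards against an all-zero map
    g_sum = sum(gt) + eps
    return 0.5 * sum(abs(p / p_sum - g / g_sum) for p, g in zip(pred, gt))

def combined_loss(pred, gt, ot_value, lambda_ot=0.1, lambda_tv=0.01):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return (counting_loss(pred, gt)
            + lambda_ot * ot_value
            + lambda_tv * tv_loss(pred, gt))

# Flattened toy maps: gt is a dot map, pred is a predicted density map.
gt = [0.0, 1.0, 0.0, 1.0]
pred = [0.1, 0.9, 0.2, 0.8]
loss = combined_loss(pred, gt, ot_value=0.0)
```

In this toy case the total counts agree, so only the TV term contributes: the prediction has the right count but places some of its mass on the wrong pixels.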

Strengths: Applying Optimal Transport to crowd counting is a good idea, and a detailed theoretical proof is provided in this paper. The experiments demonstrate the effectiveness of the proposed DM-Count. This paper is well written and meaningful to the community.

Weaknesses: 1. In #141, “The OT loss will approximate well the dense areas of the crowd, but the approximation might be poorer for the low density areas of the crowd”. I wonder why the OT loss shows a performance gap between dense and low-density areas. More details should be provided to explain this. 2. In #248, “In all experiments, DM-Count outperforms all other methods except CAN under MSE in NWPU (where they are comparable)”. Why does DM-Count perform worse than CAN on RMSE? 3. In Table 3, without the TV loss, the performance of DM-Count is worse than Bayesian Loss (in Table 2). I think the OT loss may not be robust. 4. Is it possible to transfer DM-Count to other similar tasks, like keypoint regression? It would be interesting if DM-Count could solve crowding problems in human pose estimation.
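Point 2 uses both “MSE” and “RMSE”. Assuming the convention common in crowd counting papers, where the number reported as MSE is the square root of the mean squared count error, the two metrics can be sketched as follows. The function names and the toy counts are illustrative and not taken from the paper.

```python
import math

def mae(pred_counts, gt_counts):
    """Mean absolute error over per-image counts."""
    errors = [abs(p - g) for p, g in zip(pred_counts, gt_counts)]
    return sum(errors) / len(errors)

def rmse(pred_counts, gt_counts):
    """Root of the mean squared count error (often labeled MSE in this literature)."""
    squared = [(p - g) ** 2 for p, g in zip(pred_counts, gt_counts)]
    return math.sqrt(sum(squared) / len(squared))

# Toy per-image counts; the third image has a much larger error.
gt_counts = [110, 200, 80]
pred_counts = [100, 210, 50]
mae_value = mae(pred_counts, gt_counts)
rmse_value = rmse(pred_counts, gt_counts)
```

Because the squared term weights large per-image errors more heavily, a method can lead on MAE while trailing on this metric if it makes a few large mistakes.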

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Please address the problems mentioned in “Weaknesses”. Besides, the “Ablation Studies” mentioned in the supplementary is important for readers to learn more details about each component and I suggest appending it to the paper if possible.

Review 3

Summary and Contributions: This paper proposes a new objective function for crowd counting, which minimizes the Wasserstein distance between the predicted normalized density map and the ground-truth point maps without using the Gaussian assumption. Generalization bounds and theoretical analysis are also provided in this paper.

Strengths: + This paper is well written and easy to understand. + This paper has a good motivation and a solid theoretical grounding. + Although the Wasserstein distance has been widely applied in GANs and other applications, this is the first time it has been applied to crowd counting.

Weaknesses: 1 There are three terms in the loss function; the effects of the introduced hyper-parameters should be studied. 2 The third loss is named the total variation loss. However, "total variation" is a term that should be clearly defined and explained. The derivation of Eq. 6 should also be explained further. 3 The two-dimensional Wasserstein distance doesn't have a closed-form solution, so this paper applies the Sinkhorn algorithm to get an approximate solution. As mentioned in the paper, iterations are needed before each gradient descent step. What is the maximum number of iterations set in the experiments? Will these iterations significantly slow down training?
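To make the cost question in point 3 concrete, here is a minimal pure-Python sketch of the Sinkhorn iterations for entropy-regularized optimal transport between two discrete distributions. The values of `reg`, `max_iters`, and `tol`, the stopping rule, and the toy 1D cost matrix are all assumptions for illustration; they are not the settings used in the paper.

```python
import math

def sinkhorn(a, b, cost, reg=0.5, max_iters=500, tol=1e-9):
    """Entropy-regularized OT between histograms a and b (each sums to 1).

    Returns (plan, iterations_used). All defaults are illustrative.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / reg).
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    iters = 0
    for iters in range(1, max_iters + 1):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        # Row-marginal error; column marginals match exactly after the v update.
        err = sum(abs(u[i] * sum(K[i][j] * v[j] for j in range(m)) - a[i])
                  for i in range(n))
        if err < tol:
            break
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    return plan, iters

# Toy example: four bins on a line, squared distance scaled to [0, 1].
a = [0.4, 0.3, 0.2, 0.1]
b = [0.1, 0.2, 0.3, 0.4]
cost = [[((i - j) ** 2) / 9.0 for j in range(4)] for i in range(4)]
plan, iters = sinkhorn(a, b, cost)
ot_cost = sum(plan[i][j] * cost[i][j] for i in range(4) for j in range(4))
```

Each iteration amounts to two matrix-vector products with the kernel, so the per-step overhead grows with the number of pixels times the number of annotated points and with the iteration cap, which is why the reviewer's question about the maximum number of iterations matters.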

Correctness: The definition of the terminology is unclear.

Clarity: This paper is well written.

Relation to Prior Work: The relation to prior work has been clearly discussed.

Reproducibility: No

Additional Feedback: If my concerns are addressed, I am willing to increase the score.

Review 4

Summary and Contributions: This paper proposes a Distribution Matching method for crowd counting. It uses optimal transport to measure the similarity between the normalized predicted/GT density maps.

Strengths: + Shows that imposing Gaussians on annotations hurts the generalization performance of counting. + DM-Count may be a new direction for counting. It gets rid of the Gaussian-smoothed density map. + Proposes the OT loss and achieves SOTA results.

Weaknesses: - The ablation study is not adequate. The authors should add more analysis. For example, the results of each single loss should be shown (only OT loss, only TV loss). - In addition to visualization results, the authors should evaluate the quantitative counting performance in high-density regions to show the improvements. - Minor issue: the reference for PCC-Net-VGG is missing in Table 2.

Correctness: Yes.

Clarity: Good.

Relation to Prior Work: Yes. The paper shows the differences with other works, such as Bayesian loss.

Reproducibility: Yes

Additional Feedback: If the authors can release the code, I think it will be very useful for the community. After the rebuttal: Although the authors respond to my issues well, I think this paper has some key weaknesses after reading the other reviews and feedback, especially R1's. So my final score is “6. Marginally above the acceptance threshold”.