Review for NeurIPS paper: Modeling Noisy Annotations for Crowd Counting

NeurIPS 2020

Modeling Noisy Annotations for Crowd Counting

Review 1

Summary and Contributions: In this work, the authors propose to model the annotation noise in crowd counting with a Gaussian distribution. The noise that human annotation induces in a density map generated with Gaussian kernels is not i.i.d., so the correlation between pixels should be considered. By minimizing the negative log-likelihood of the approximated joint distribution of the density map values, a multivariate Gaussian with full covariance, the model gains the flexibility to handle annotation noise and achieves competitive performance in crowd counting.
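The claim that the induced noise is not i.i.d. can be checked with a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes a 1-D density map with a single head, uses \alpha for the standard deviation of the annotation noise and \beta for the kernel bandwidth (following the usage in this review), and the function names `density_maps` and `pixel_correlation` are hypothetical.

```python
import numpy as np

def density_maps(annotations, grid, beta):
    """One 1-D density map per annotation: a Gaussian kernel of bandwidth beta."""
    diff = grid[None, :] - annotations[:, None]
    return np.exp(-0.5 * (diff / beta) ** 2) / (np.sqrt(2.0 * np.pi) * beta)

def pixel_correlation(true_loc=10.0, alpha=1.0, beta=2.0, n_trials=5000, seed=0):
    """Correlation between pixel values when one annotation is jittered.

    Assumption: annotation = true location + Gaussian noise with std alpha.
    """
    rng = np.random.default_rng(seed)
    grid = np.arange(21, dtype=float)            # pixel centres 0..20
    noisy = true_loc + alpha * rng.standard_normal(n_trials)
    maps = density_maps(noisy, grid, beta)       # shape (n_trials, 21)
    return np.corrcoef(maps, rowvar=False)       # shape (21, 21)

corr = pixel_correlation()
print(f"pixels 12 and 13 (same side of the head): {corr[12, 13]:+.2f}")
print(f"pixels 8 and 12 (opposite sides):         {corr[8, 12]:+.2f}")
```

Under these assumptions, pixels on the same side of the head should rise and fall together, while pixels on opposite sides should move in opposite directions, so a per-pixel independent noise model would miss this structure.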

Strengths: 1. This work is the first to consider noise in point-wise annotations for crowd counting, which is a practical problem worth noting. 2. The proposed modeling method and the proposed loss function are novel. 3. The experiments demonstrate the effectiveness of the proposed method, which achieves competitive performance with different networks on different datasets.

Weaknesses: Missing references. Uncertainty learning has been well studied, and the related work should be discussed. Further, the proposed method should be compared with uncertainty learning methods. It is unclear why the distribution of \Phi, which is a sum of individual {\chi}^2 distributions, is approximated with a Gaussian distribution. The comparison in Tab. 2 should include “L2 + Reg (L_i)”, which is a reasonable baseline. The effects of \alpha and \beta should be considered jointly. A large \beta is believed to help handle annotation noise, since it reduces the peak value. The effect of \alpha when \beta is large should be studied. Is it still effective? The robustness of the L2 loss to different values of \beta should also be evaluated, as in Fig. 4. It would be better to further verify the proposed method on other tasks, for example key-point estimation, to demonstrate its generalizability. Post Rebuttal: The uncertainty-learning-based baseline is not convincing. Further, it would be much better to show that the proposed method generalizes to other tasks.

Correctness: It is unclear why the authors approximate the distribution of \Phi, which is a sum of individual {\chi}^2 distributions, with a Gaussian distribution.

Clarity: Yes

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper addresses the issue of annotation noise in crowd counting, which is ignored by most traditional density-map-based crowd counting methods. A new loss function based on a model of the annotation noise is proposed, which can be decomposed into a pixel-wise weighted MSE term and a correlation term. Extensive experimental results demonstrate that the designed loss function improves robustness to annotation noise in crowd counting.
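One way to see how a Gaussian negative log-likelihood can split into a pixel-wise weighted MSE term and a correlation term is sketched below. This is a hedged illustration, not the paper's derivation: it assumes the covariance has the form \Sigma = D + U U^T (diagonal plus low rank) and applies the Woodbury identity to the quadratic term only. The names `mahalanobis_split`, `diag_var`, and `low_rank` are hypothetical.

```python
import numpy as np

def mahalanobis_split(residual, diag_var, low_rank):
    """Split r^T (D + U U^T)^{-1} r into a weighted-MSE part and a correlation part.

    Woodbury identity (assumed covariance structure, not the paper's exact form):
      r^T Sigma^{-1} r = r^T D^{-1} r
                         - (U^T D^{-1} r)^T (I + U^T D^{-1} U)^{-1} (U^T D^{-1} r)
    """
    weighted = residual / diag_var                       # D^{-1} r, per pixel
    weighted_mse = float(residual @ weighted)            # sum_i r_i^2 / d_i
    k = low_rank.shape[1]
    inner = np.eye(k) + low_rank.T @ (low_rank / diag_var[:, None])
    proj = low_rank.T @ weighted                         # U^T D^{-1} r
    corr_term = -float(proj @ np.linalg.solve(inner, proj))
    return weighted_mse, corr_term

# Toy check against a direct solve with the full covariance matrix.
rng = np.random.default_rng(0)
n, k = 6, 2
r = rng.standard_normal(n)                 # residual: prediction minus target
d = rng.uniform(0.5, 2.0, size=n)          # per-pixel variances (diagonal of D)
U = rng.standard_normal((n, k))            # low-rank factor coupling the pixels
weighted_mse, corr_term = mahalanobis_split(r, d, U)
full = float(r @ np.linalg.solve(np.diag(d) + U @ U.T, r))
print(weighted_mse + corr_term, full)
```

In this toy setting the first term treats each pixel independently, and the second term is the only place where pixels interact. The split form also avoids inverting the full pixel-by-pixel covariance matrix.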

Strengths: The idea of improving crowd counting from the perspective of modeling the annotation noise is novel. The approximation of the loss function which reduces its computational cost seems ingenious. Experimental results as well as detailed ablation studies have demonstrated the effectiveness of the proposed loss function.

Weaknesses: (1) It is not very clear how the correlation between pixels is modeled in the proposed loss function. (2) The authors mention in Section 4.1 that three backbones (VGG19 [4], CSRNet [1], and MCNN [11]) were tested, but only VGG19 results are reported. Please correct me if I have missed the details. (3) Since the proposed loss function is universal, it needs to be integrated into existing SOTA crowd counting algorithms to better verify its effectiveness. (4) The related work section simply lists existing methods; it neither comments on them nor explains how they relate to the proposed one. (5) The paper is poorly written, and there are many grammatical errors and typos that make it hard to follow. (6) I don't quite understand Lines 119 to 123: “However, such an assumption is tenuous, as the observation noise is induced by the noise in the observed annotation locations D that is passed through the non-linear function in (1).” (7) I don't quite follow Line 26: "However, if an annotation moves, the changes in the density map in nearby pixels are correlated".

Correctness: Seems correct.

Clarity: The paper is poorly written, and there are many grammatical errors and typos that make it hard to follow. To name a few: (1) Line 55: "direct" should be “directly”. (2) Lines 70 and 71: “propose” should be “proposes”. (3) Line 256: “showing that are loss is relatively robust to this parameter”. (4) Line 262: "since the some correlations are not considered”.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: I have read the authors' feedback and other reviewers' comments. I tend towards maintaining my original rating.

Review 3

Summary and Contributions: The paper focusses on the problem of crowd counting. Specifically, the authors model annotation noise using a random variable with a Gaussian distribution, and they model the PDF of the crowd density value at each location in the density map. The resulting loss function leads to robust results in the presence of simulated noise. They also achieve better results than recent methods on several datasets.

Strengths: 1. The problem of modeling annotation noise has not been explored earlier in crowd counting. This is a promising direction and an important problem. 2. Empirical results show that the proposed method is more robust in terms of overall error in the presence of simulated noise. Also, the proposed loss function achieves better results compared to recent SOTA. 3. The authors have conducted extensive experiments on multiple datasets.

Weaknesses: 1. A major concern is that the authors consider only shifts in annotations as noise. However, real-world annotations contain other types of noise, such as missing or duplicate annotations. The authors do not consider this in their discussion. At first glance, it seems that the current method cannot accommodate these additional noise types. From this perspective, I would say that the paper is incomplete in modeling different types of annotation noise. 2. It is not clear why the authors approximate the PDFs phi and Psi with a Gaussian distribution. 3. It is not clear why the eta_ri term follows a non-central chi-squared distribution. 4. As far as I understand, small shifts in annotations will not affect performance much, since neural networks can be robust if the receptive field of the network is large enough. Can the authors discuss this in more detail? 5. The proposed method seems to be too specific to the counting problem. Can it be extended to other vision problems, such as object detection?

Correctness: Maybe

Clarity: Somewhat. More discussions and clarifications are required. See weaknesses section.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: The authors have partially addressed my concerns. However, I do not agree that duplicate annotations are rare. Having been involved in data collection efforts, I can say this is not uncommon, especially for point-wise annotations. I upgrade my rating slightly.

Review 4

Summary and Contributions: This paper proposes a new method for capturing noisy annotations in crowd counting. Instead of treating the point annotation as the "true" location of a person's head, this paper assumes it is a noisy version of some underlying true location, corrupted by Gaussian noise. Based on this formulation, this paper proposes a loss function for crowd counting.

Strengths: The idea of modeling annotation noise in the context of crowd counting is relatively new. Based on the experiments, the proposed loss (which captures the annotation noise) seems to work well compared with existing loss functions used in crowd counting.

Weaknesses: The novelty of the paper is somewhat limited. The idea of modeling annotation noise has been explored in [4] (although in a slightly different form). The general idea of learning from noisy labels has also been explored in many previous works. The main contribution of the paper is to use this idea in the specific context of crowd counting.

Correctness: Seems to be correct, although I did not fully verify all the math derivations

Clarity: Yes

Relation to Prior Work: The discussion of previous work on learning from noisy annotations should be expanded. The differences from these works should be elaborated further.

Reproducibility: Yes

Additional Feedback: None