NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Originality, clarity & significance: The technique presented in the paper is quite novel. The paper is generally well written and easy to read. The method could potentially improve many existing algorithms.

Quality: I am satisfied with the methodology sections. However, I have quite a few questions regarding the experimental part:

1. The experimental settings are not clear to me. For the label-conditioned branch, which I believe is one of the most important parts (e.g., the problem is raised in l-101?), it seems that only one experiment (Table 4) is performed. Is that true? Does it mean all the other evaluations and ablation studies are conducted with a single branch, rather than the full network shown in Figure 2?

2. What is the actual difference between simple-2D and full?

3. The evaluation choices appear inconsistent. E.g., (a) Table 1 shows only the performance of the full model (this is fine), but there is no full-model performance for 8 stacks. (b) There is a +gt result only for the full model in Table 4. (c) Table 5 reports neither simple-2D and full nor the 1 cm and 2 cm thresholds. It would be better to provide either consistent or complete results.
Reviewer 2
The paper considers the task of dense pose estimation, which can be decomposed into finding a semantic segmentation of body parts complemented with the regression of u-v coordinates for each pixel, relative to each body part. The authors explore several error models, all of which add output heads to an existing dense pose network. The simplest is a local error model; the most complex (the `full’ model) adds a global error plus local errors that are independent for the u-v coordinates. The output heads are trained by maximum likelihood for the regression model of the u-v coordinates. The full model performs as well as the simpler error models but achieves a better (lower) negative log-likelihood and is therefore presented as superior. Two other cases are considered: 1) training a model whose output heads for the error estimation are additionally conditioned on the ground truth, and 2) a way of combining independently trained models by using their uncertainty estimates.

It is unclear why the authors do not present the results of related work, which makes it somewhat difficult to assess their model’s performance, though it still allows comparing the relative performance of the variants. Although the work ablates the model adaptations it considers, it does not discuss the results and their implications very well. For instance, the simpler models (simple-2D) perform slightly better than the full error model, which in turn receives a better negative log-likelihood. Why is that? The model ensembling using uncertainty predictions performs only marginally better than a simple model average; there is no discussion or significance assessment. The model whose uncertainty heads are conditioned on the ground truth during training performs better at test time (without access to the ground truth). It is unclear why this conditioning helps training of the core prediction network beyond using the regression loss alone; there is no discussion of that. Such a model also could not be used to assess the model’s uncertainty at test time, which also prohibits an ensemble of this kind of model. The term `introspection ensemble’ seems a bit far-fetched; more accurately, it is an uncertainty-weighted ensemble (see the sketch below).

Since the work is concerned with highly structured aleatoric uncertainty and with modeling uncertainty in the annotation process, there is missing related work, e.g., the Probabilistic U-Net, which models aleatoric uncertainty for semantic annotations. Also, the error model is concerned with modelling the error of subvectors in u-v-space but not between the part label predictions, i.e., the inter-subvector covariance and thus the semantic segmentation; why was there no attempt to incorporate this? Another current limitation of the approach is that the error model does not consider the correlation of errors specific to regions, e.g., individual body parts. Instead the errors are either local and/or global, despite the discussed observation that the errors of individual body parts may be strongly correlated (lines 193-196).

The discussion of why learning with an uncertainty model helps training and final performance seems insufficient. It is stated that a `model’s better understanding of the uncertainty of the data’ helps. The reason it helps is presumably loss attenuation, which allows the model to down-weight the loss on difficult and thus likely ambiguous or noisy examples or pixels. There is quite a bit of literature on that which could be part of a discussion.
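To make the maximum-likelihood training and the uncertainty-weighted ensembling concrete, here is a minimal sketch of the kind of computation the review refers to (PyTorch; all names, shapes, and the single-variance simplification are my own illustration, not the paper's actual heads):

    import torch

    def gaussian_nll(pred_uv, pred_log_var, target_uv):
        # Per-pixel Gaussian negative log-likelihood for the u-v regression.
        # Minimizing this is maximum-likelihood training of the mean and
        # variance heads; the exp(-log_var) factor is the loss attenuation
        # that down-weights residuals on pixels the model deems noisy.
        inv_var = torch.exp(-pred_log_var)          # 1 / sigma^2
        sq_err = (pred_uv - target_uv) ** 2
        return 0.5 * (inv_var * sq_err + pred_log_var).mean()

    def uncertainty_weighted_ensemble(means, log_vars):
        # Precision-weighted average over K independently trained models,
        # with means and log_vars of shape (K, ...). A simple model average
        # weights every model equally; here each prediction is weighted by
        # its inverse predicted variance.
        precision = torch.exp(-log_vars)
        return (precision * means).sum(0) / precision.sum(0)

Note that when all models predict similar variances, the weighted ensemble reduces to a plain average, which may partly explain why the gains over simple averaging are marginal.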
Lastly, some typos need fixing: `ensemlbing’, `cosntrained’, `gradient descend’, `locationS’.
Reviewer 3
The paper presents a mathematical framework that estimates the uncertainty of dense correspondences learned from noisy labels. The framework models the distribution of the residuals obtained by comparing a target vector, encoding the correspondences and other properties, with its prediction. The model considers three sources of noise: a general noise affecting all dimensions of the residuals, a noise affecting the dimensions encoding the association of pixels to a human body part, and a directional noise modeling directional errors. All three sources of noise are modeled with Gaussian distributions. I think the framework is solid given the assumption of Gaussian distributions. However, I have a major and a few minor concerns.

1. My major concern is that the submission is missing other baselines on human dense correspondences in the experiments. The only baseline is based on [13]. However, there exist other recent methods (e.g., Dense Human Body Correspondences Using Convolutional Networks by Wei et al.). The experiments indeed show improvements over [13], but it is unclear whether the paper is advancing the state of the art by including the uncertainty model. In sum, I think the paper should include other CNNs for dense correspondences, apply the proposed framework to them, and show whether the benefit is consistent across models.

2. A minor concern has to do with the lack of justification for using Gaussian distributions to model the uncertainties. Although I understand that a Gaussian distribution typically simplifies the math, I don't understand why a Gaussian distribution is a good model for the residuals dealt with in this problem.

3. A third minor concern is a few typos and grammatical errors:
a) In line 67: "The recent method of [17] proposes a frequentist *methods* ..." -> "The recent method of [17] proposes a frequentist *method* ..."
b) In line 85: "In order do [...]" -> "In order to [...]"
c) In line 88: I think you mean E_q[\delta] = 0 (i.e., the expected residual is zero); or am I missing something?
d) In line 99: When defining \sigma_2, what is u? It was never defined.
e) In line 105: "In oder to [...]" -> "In order to [...]"
f) In line 126: I think Eq. 7 is just floating around without any introduction. I would suggest gently introducing the equation in line 126.
g) Eq. (9) is missing a period.
h) In parts of the text, the equations are not properly referenced (e.g., line 216 and line 218). Please clearly refer to them as Eq. (X) in the text.

Post Rebuttal: The rebuttal addressed my major concern, which was the lack of comparisons with other baselines. The new results show a marginal but clear advantage, and they should be included in the final paper. However, I would encourage the authors to discuss why a Gaussian distribution is good for modeling the errors. Just stating that it is good because it is unimodal is not a good argument: there are many other unimodal distributions (e.g., the Laplace distribution), and it would be nice if the paper showed a histogram of the errors and the fit of a Gaussian to them (see the sketch below for one way to do this).
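As an illustration of that last suggestion, here is a minimal sketch of how one might compare a Gaussian and a Laplace fit on the residuals (the residual file, bin count, and output path are my own placeholders, not from the paper):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Hypothetical 1-D array of u-v residuals collected on a validation set.
    residuals = np.load("uv_residuals.npy")

    # Fit both candidate unimodal distributions by maximum likelihood.
    mu, sigma = stats.norm.fit(residuals)
    loc, scale = stats.laplace.fit(residuals)

    # Compare average negative log-likelihoods; lower indicates a better fit.
    nll_gauss = -stats.norm.logpdf(residuals, mu, sigma).mean()
    nll_laplace = -stats.laplace.logpdf(residuals, loc, scale).mean()
    print(f"Gaussian NLL: {nll_gauss:.3f}   Laplace NLL: {nll_laplace:.3f}")

    # Overlay the fitted densities on the residual histogram.
    xs = np.linspace(residuals.min(), residuals.max(), 200)
    plt.hist(residuals, bins=100, density=True, alpha=0.5, label="residuals")
    plt.plot(xs, stats.norm.pdf(xs, mu, sigma), label="Gaussian fit")
    plt.plot(xs, stats.laplace.pdf(xs, loc, scale), label="Laplace fit")
    plt.legend()
    plt.savefig("residual_fit.png")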