NeurIPS 2019
Sun, Dec 8th through Sat, Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 4477
Title: Fixing the train-test resolution discrepancy

Reviewer 1

The technical innovation of the paper is quite limited: it explores an observation about the resolution discrepancy between training and testing images that arises when performing standard data augmentation. The paper is technically sound, but the contribution is also technically very simple. The paper is clear, but it can be improved. For instance, Figures 1 and 2 are not referred to in the text. On line 42, it is not clear what the paper means by "better results". On line 74, the paper could have explained p-pooling. The results presented by the paper do not seem very significant, because just one dataset (ImageNet-2012) and two models (ResNet and PNASNet) are used. In addition, are the results shown statistically significant?

I have read the authors' rebuttal for this paper and checked the other reviewers' comments. The authors addressed my concerns partially -- they showed results with more models, but they were vague about the significance of the results (I was expecting more formal statistical-significance tests here). Therefore, I am keeping my current score.

Reviewer 2

Clarity: The paper is clearly written and easy to follow.

Significance: The results in the paper are significant for practitioners and existing deployments, as they shed light on the train-test resolution discrepancy and suggest a method to improve test performance for already-trained models.

Novelty: The analysis in this paper is novel (though improved performance on higher-resolution images has been observed earlier).

Questions: While the focus is on fixing the discrepancy after the model has been trained, why not fix the training itself so that there is no discrepancy, as opposed to changing the size at test time and fine-tuning? Lines 110-111 derive f = sqrt(HW), which does not seem right, since k does not include the sensor size. Typically sqrt(hw) = 2f tan(theta_fov/2), where h, w are the sensor dimensions, not the resolution, and this relation is independent of the image size. Please clarify or state what additional assumptions are made.

---

After reading the author rebuttal and the other reviews, I maintain my previous rating of 8 (a very good submission). The key reasons: 1) I like the fact that the authors study and analyze an observed phenomenon that has been mentioned in passing in earlier works but never looked at in detail. I have wondered about this myself and feel that the insights provided by this work are beneficial. 2) The authors provide a method, based on their analysis, that leads to test-time improvements on pre-trained models. That is itself a major contribution and a validation of their analysis.
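For reference, the camera relation the reviewer invokes can be written out in full (this is the relation as the reviewer states it, using the geometric mean of the sensor dimensions; it is standard pinhole geometry, not a result from the paper under review):

```latex
% Pinhole camera with focal length f and a sensor of physical size h x w,
% as stated in the review above.
\[
  \tan\!\left(\frac{\theta_{\mathrm{fov}}}{2}\right) \;=\; \frac{\sqrt{hw}}{2f}
  \qquad\Longrightarrow\qquad
  \sqrt{hw} \;=\; 2f \,\tan\!\left(\frac{\theta_{\mathrm{fov}}}{2}\right).
\]
% Here h and w are physical sensor dimensions, not the pixel counts H and W,
% which is why the relation does not depend on image resolution.
```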

Reviewer 3

Overall I think the idea is quite interesting, but because it is an empirical result, it should be validated more thoroughly. The resolution difference between training and testing (mainly due to data augmentation) has long been known to the community, but little has been done to handle it. The proposed fine-tuning, albeit simple, works quite well. However, since this is an empirical submission on an empirical topic, I tend to look for more points that are "useful". For example, what is a good ratio between the data augmentation hyperparameter(s) and the train/test size ratio? In other words, a general rule of thumb for practitioners in image recognition on the ImageNet dataset. And are the observations generalizable? If I transfer the ImageNet model to a specific task (e.g., CUB), will the findings in this paper be useful? If the answer is yes, how can it be useful? What about other domains where the scale difference is not caused by data augmentation, e.g., object detection?

-----

I raised my recommended score a bit after reading the response. The author response answered most of my questions (except the detection one), and the answer to R1 concerning the state-of-the-art result (86.4%) on ImageNet is interesting.
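The train/test scale mismatch the reviewers discuss can be made concrete with a back-of-the-envelope calculation. The numbers below are our illustrative assumptions, not figures from the paper: a RandomResizedCrop-style training augmentation that samples a crop whose area fraction is uniform in [0.08, 1.0] and resizes it to 224x224, versus the common test pipeline of resizing the shorter side to 256 and center-cropping 224:

```python
import math

# Training pipeline (assumed): sample an area fraction sigma ~ U(0.08, 1.0),
# crop a region of linear size sqrt(sigma)*H from an image of height H, and
# resize it to TRAIN_SIZE. Test pipeline (assumed): resize the shorter side
# to TEST_RESIZE, then center-crop.
TRAIN_SIZE = 224
TEST_RESIZE = 256
AREA_LO, AREA_HI = 0.08, 1.0

def expected_train_magnification(h=1.0):
    # E[TRAIN_SIZE / (sqrt(sigma) * h)] with sigma ~ U(AREA_LO, AREA_HI).
    # Closed form: E[1/sqrt(sigma)] = 2*(sqrt(hi) - sqrt(lo)) / (hi - lo).
    e_inv_sqrt = 2 * (math.sqrt(AREA_HI) - math.sqrt(AREA_LO)) / (AREA_HI - AREA_LO)
    return TRAIN_SIZE * e_inv_sqrt / h

def test_magnification(h=1.0):
    # Resizing the shorter side to TEST_RESIZE magnifies by TEST_RESIZE / h.
    return TEST_RESIZE / h

ratio = expected_train_magnification() / test_magnification()
# The image height h cancels in the ratio, so the mismatch is size-independent.
print(f"objects appear ~{ratio:.2f}x larger at train time than at test time")
# -> objects appear ~1.36x larger at train time than at test time
```

Under these assumptions, objects look noticeably smaller at test time than during training, which is consistent with the paper's remedy of enlarging the test resolution (and fine-tuning) rather than testing at the training resolution.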