Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Update: I thank the authors for the feedback and the additional figure. I would gladly vote for acceptance of the paper.
-----------------------------------------------
The authors conducted the first large meta-analysis of overfitting due to test set reuse in the machine learning community. They surveyed a wide range of 112 Kaggle competitions and concluded that there is little evidence of substantial overfitting. I find the question very important to the machine learning community. The investigation is thorough in that it covers a broad spectrum of datasets. The statistical methods used are appropriate, and the conclusion is therefore trustworthy. The paper is also well written. I have a few suggestions, but overall I think this paper deserves publication in NeurIPS.
1. For the overall conclusion, I think a more precise statement would be: on the one hand, there is strong evidence that overfitting exists (p-values in Sec. 3.3); on the other hand, the extent of overfitting is mild (Sec. 3.1 and 3.2).
2. The figures (Fig. 1, 2) contain so many elements that they are a bit hard to parse. First, there are too many points; a common solution is to reduce the point size or make the points transparent. Second, since the linear trend is apparent, I would suggest removing the linear regression line. It would also be nice to draw the points on top of the y=x line so that we can see them. In addition, the confidence intervals are missing for some plots, e.g., Figure 1.
3. It is important to investigate the covariates that are likely to cause overfitting. The authors mention that one reason might be a small test set. Can the authors quantify this by plotting, e.g., the extent of overfitting against the test set size across all competitions considered? It would also be helpful to recommend a minimum test set size that keeps overfitting mild in an ML competition.
4. Another aspect of these competitions that people care about is whether the ranks of submissions are preserved between the public test set and the private test set. Can the authors add some results on this?
Minor:
5. Line 146: I am curious about the subtleties.
6. Eq. (2): missing bracket inside summation.
7. One more related work on adaptive data analysis to include is [Russo et al, 2016, Controlling...]
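One way to quantify the rank-preservation question raised in point 4 is a rank correlation between public and private scores within a competition. The following is a minimal sketch, not the authors' method: it computes Spearman's rank correlation in plain Python, and the function names and the example scores are hypothetical and chosen only for illustration.

```python
def average_ranks(scores):
    """Rank scores so the highest score gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j to cover the run of submissions tied with position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(public_scores, private_scores):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(public_scores), average_ranks(private_scores))


# Hypothetical accuracies of four submissions (not taken from the paper).
public = [0.91, 0.89, 0.88, 0.85]
private = [0.90, 0.87, 0.88, 0.84]
rho = spearman(public, private)
```

A value of rho close to 1 would indicate that the public leaderboard ordering is largely preserved on the private test set; in this toy example the second and third submissions swap places.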
This study analyses overfitting in machine learning based on a meta-analysis of over a hundred Kaggle competitions that comprise a variety of classification tasks. Kaggle provides a held-out test data set, which is used to publicly rank competing methods; competitors can aim to improve their public ranking by improving their method. This can potentially lead to overfitting. The degree of that overfitting can be evaluated, since Kaggle also maintains additional held-out data sets that are used to provide a final ranking for the competing methods. The results indicate that the degree of overfitting is relatively low, and the authors' interpretation is that this demonstrates the robustness of cross-validation as a method development technique.
In terms of strengths and weaknesses:
Quality. The overall quality of the work is high. Whereas the idea and implementation are relatively straightforward at a conceptual level, this work contributes new empirical information on overfitting, a fundamental subject in machine learning. If there are shortcomings, the restriction of the study to classification tasks is one, but expanding the present work to other machine learning tasks could be relatively straightforward; the paper introduces the new concept, and the clarity of presentation does benefit from a well-defined scope on classification tasks. Source code is also provided, further supporting the overall quality (transparency, access) of the work.
Clarity. The study design and presentation are clear, and the implementation is rigorous. Sufficient empirical analysis and discussion of related work are provided, and potential extensions are discussed (other competition platforms; other modeling tasks; scoring methods to assess overfitting).
Originality. The large-scale analysis of overfitting in machine learning studies, implemented based on public competition platforms, seems to be a new idea. The paper also includes interesting reflections on the types of competitions and data set qualities (prize money, data set size, etc.) and how these are reflected in overfitting. The work does not include substantially new theoretical or methodological ideas; the originality is mainly empirical.
Significance. The study lays the groundwork for extended analysis of overfitting in different types of machine learning models. It has implications for a better understanding of the fundamentals of the field. The weakness is that the improved understanding does not readily translate into pragmatic recommendations for analysis, besides bringing increased confidence in cross-validation as a model training technique.
A paper with a similar title was presented at an ICML 2019 workshop (https://sites.google.com/view/icml2019-generalization/schedule). The content is not the same, but there seems to be notable overlap, and the ICML 2019 paper is not cited in the present submission. It would help if the authors could clarify the main new contributions of this submission with respect to the ICML 2019 workshop paper.
UPDATE: I thank the authors for their feedback, which I have read. I am not inclined to alter my score from an 8, but once again emphasize that this paper is a good one and I hope to see it published.
----------------------------------------
Thanks to the authors for the hard work on this paper.
==Originality==
The work is original in that it is the first rigorous study of the MetaKaggle dataset as it pertains to the problem of adaptive overfitting. As the authors point out, some prior work has been done, but only on a few of the most popular image datasets.
==Quality==
The work is of high quality, and the experiments are well designed. It is unfortunate that the authors did not also perform the analyses on the regression problems, and I hope they intend to publish follow-up work. It was not clear to me why the authors conclude that only the accuracy metric has "enough" data for statistical measures. The reasoning in lines 275-278 is insufficient. Please be more concrete here. What notion of "enough data" are you using?
==Clarity==
The paper is clear and well written.
==Significance==
In general, I think this paper will be of high significance, as it begins to cast doubt on a common fear (adaptive overfitting). However, one point that kept coming up is that some of the datasets were "outliers" because their public and private test sets were not IID. It is true that this paper is entirely about adaptive overfitting in the IID case, so it makes sense to somewhat put aside non-IID cases, but in my personal experience adaptive overfitting is particularly problematic when the data is not quite IID, but close. For example, competition 3641 has 7 subjects, and the public/private split is done by subject. This is actually reasonable when considering how most ML algorithms are applied in the field. That is, this non-IID split is appropriate and better serves practical algorithm development than an IID split would have. An IID split would have given an artificially high generalization estimate for what the trained model would be able to achieve "in the field". So it is of high interest to also investigate how adaptive overfitting works in the non-IID (but realistic) cases, but in this work the non-IID "outlier" competitions are effectively put to the side. I would love to see an additional discussion (even in the appendix) on what conclusions can be drawn from this work about adaptive overfitting in non-IID cases.
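To make the distinction between the two kinds of split concrete, here is a small sketch under stated assumptions. The toy records, the choice of which subjects go to the public set, and all function names are hypothetical; this is not the actual split used in competition 3641. It only illustrates that a random (IID) split places examples from the same subject on both sides, while a subject-wise split keeps each subject entirely in one set.

```python
import random


def iid_split(records, public_fraction, seed=0):
    """Shuffle all records and cut once, ignoring which subject each came from."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return shuffled[:cut], shuffled[cut:]


def subject_split(records, public_subjects):
    """Put every record of a subject either in the public or in the private set."""
    public = [r for r in records if r["subject"] in public_subjects]
    private = [r for r in records if r["subject"] not in public_subjects]
    return public, private


def shared_subjects(public, private):
    """Subjects that contribute records to both sets."""
    return {r["subject"] for r in public} & {r["subject"] for r in private}


# Hypothetical toy data: 7 subjects with 20 examples each.
records = [{"subject": s, "example": e} for s in range(7) for e in range(20)]

iid_public, iid_private = iid_split(records, public_fraction=0.5)
subj_public, subj_private = subject_split(records, public_subjects={0, 1, 2})

iid_overlap = shared_subjects(iid_public, iid_private)     # non-empty
subj_overlap = shared_subjects(subj_public, subj_private)  # empty
```

With the subject-wise split, the private score measures performance on subjects never seen on the public leaderboard, which is closer to deployment on new subjects.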