NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5286
Title:Model Similarity Mitigates Test Set Overuse

Reviewer 1

With my limited familiarity with the area, I found the paper to be easy and enjoyable to read and follow. The authors did a good job of outlining the motivation, existing work, and limitations. The work seems like a strong combination of an interesting theoretical finding, which requires a neat extension of the union bound as explained in Section 3, and an assessment of its empirical impact. The problem addressed -- namely, quantifying the effect of test set overuse -- is very timely, and the result that model similarity provably increases the number of times one can re-use a test set is quite interesting. Given the complex expressions in Theorem 2 and Corollary 4, it would be helpful to either also provide a simplified form (under further assumptions) or analyze the results for some interesting special cases for which the expressions become simpler and dependencies on parameters more apparent. For the adaptive classification case (Section 4), the analysis using (n+1)^{k-1} as the maximum number of queries seems quite conservative. Could you comment on its tightness / usefulness in practice? For corollary 4, it seems the way the parameters (alpha, eps, etc.) are stated may not be the most natural or useful. It is not clear when the corollary provides a useful result. E.g., as stated, eps is a function of alpha, eta is a function of eps and alpha, and N_eta is required to be bounded by a function of alpha. This makes it difficult to assess simple things like whether alpha has a positive or negative effect on the result. Would it help to re-state the result in terms of (only) the largest alpha that satisfies the bound on N_eta? The Naive Bayes assumption that comes towards the end of page 6 is left mysterious till that point in the paper. I would suggest alluding to it (or at least its textual interpretation near line 237) early on in the paper.

Reviewer 2

The writing focuses on mathematical analysis of the case where all model predictions tend to be correlated as well as on empirical analysis of how correlated the model predictions are in practice. The math looks OK, but,it is the straightforward empirical analysis that is most valuable, and it seems to me that all of the theoretical conclusions could be made without theory, simply through simulation instead of work on deriving bounds. Thus, it is hard for me to see value in large parts of the paper.

Reviewer 3

It has been previously suggested that the currently massive reuse of common test data sets can lead to overfitting of machine learning models. This work provides new analysis and support for this observation that the similarity between alternative models, following from shared prior knowledge and modeling choices, can reduce the potential overfitting substantially when compared to the theoretical expectations. The paper demonstrates empirically that predictions provided by alternative models are more similar than what could be expected just based on the prediction accuracy and provide an exact quantification of this effect. The authors use this observation to derive modified confidence interval for predictions that incorporates the similarity information. Application is demonstrated with real data. Quality. The work provides new information on is of good quality; the derivations follow standard style. A weak point is that the discussion is limited to relatively specific data sets / modeling tasks. Earlier work is being appropriately discussed. Clarity. Overall the text is well written and clear to follow. A shortcoming is that the Introduction does not include any references, and existing theoretical results are mentioned without citation (e.g. rows 19-21). Originality. The idea that overfitting is mitigated thourh similarity between models is not novel as such but compared to earlier related approaches that are being cited, this work provides new empirical evidence and theoretical understanding of this phenomenon. Significance. The results are useful for understanding fundamental aspects and limitations of machine learning techniques, and the pragmatically motivated, improved error bounds can be useful in practical applications. The vast applicability across a range of sample sizes, data set types, and modeling tasks remains to be shown, however.