{"title": "Model Similarity Mitigates Test Set Overuse", "book": "Advances in Neural Information Processing Systems", "page_first": 9993, "page_last": 10002, "abstract": "Excessive reuse of test data has become commonplace in today's machine learning workflows. Popular benchmarks, competitions, industrial scale tuning, among other applications, all involve test data reuse beyond guidance by statistical confidence bounds. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse. We proffer a new explanation for the apparent longevity of test data: Many proposed models are similar in their predictions and we prove that this similarity mitigates overfitting. Specifically, we show empirically that models proposed for the ImageNet ILSVRC benchmark agree in their predictions well beyond what we can conclude from their accuracy levels alone. Likewise, models created by large scale hyperparameter search enjoy high levels of similarity. Motivated by these empirical observations, we give a non-asymptotic generalization bound that takes similarity into account, leading to meaningful confidence bounds in practical settings.", "full_text": "Model Similarity Mitigates Test Set Overuse\n\nHoria Mania\nUC Berkeley\n\nJohn Miller\nUC Berkeley\n\nLudwig Schmidt\n\nUC Berkeley\n\nhmania@berkeley.edu\n\nmiller_john@berkeley.edu\n\nludwig@berkeley.edu\n\nMoritz Hardt\nUC Berkeley\n\nBenjamin Recht\n\nUC Berkeley\n\nhardt@berkeley.edu\n\nbrecht@berkeley.edu\n\nAbstract\n\nExcessive reuse of test data has become commonplace in today\u2019s machine learn-\ning work\ufb02ows. Popular benchmarks, competitions, industrial scale tuning, among\nother applications, all involve test data reuse beyond guidance by statistical con\ufb01-\ndence bounds. Nonetheless, recent replication studies give evidence that popular\nbenchmarks continue to support progress despite years of extensive reuse. 
We proffer a new explanation for the apparent longevity of test data: Many proposed models are similar in their predictions and we prove that this similarity mitigates over\ufb01tting. Speci\ufb01cally, we show empirically that models proposed for the ImageNet ILSVRC benchmark agree in their predictions well beyond what we can conclude from their accuracy levels alone. Likewise, models created by large scale hyperparameter search enjoy high levels of similarity. Motivated by these empirical observations, we give a non-asymptotic generalization bound that takes similarity into account, leading to meaningful con\ufb01dence bounds in practical settings.\n\n1 Introduction\n\nBe it validation sets for model tuning, popular benchmark data, or machine learning competitions, the holdout method is central to the scienti\ufb01c and industrial activities of the machine learning community. As compute resources scale, a growing number of practitioners evaluate an unprecedented number of models against various holdout sets. These practices, collectively, put signi\ufb01cant pressure on the statistical guarantees of the holdout method. Theory suggests that for k models chosen independently of n test data points, the holdout method provides valid risk estimates for each of these models up to a deviation on the order of \u221a(log(k)/n) [5]. But this bound is the consequence of an unrealistic assumption. In practice, models incorporate prior information about the available test data since human analysts choose models in a manner guided by previous results. Adaptive hyperparameter search algorithms similarly evolve models on the basis of past trials [12].\n\nAdaptivity signi\ufb01cantly complicates the theoretical guarantees of the holdout method. A simple adaptive strategy, resembling the practice of selectively ensembling k models, can bias the holdout method by as much as \u221a(k/n) [5]. 
If this bound were attained in practice, holdout data across the board would rapidly lose its value over time. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse [15, 20].\n\nIn this work, we contribute a new explanation for why the adaptive bound is not attained in practice and why even the standard non-adaptive bound is more pessimistic than it needs to be. Our explanation centers around the phenomenon of model similarity. Practitioners evaluate models that incorporate common priors, past experiences, and standard practices. As we show empirically, this\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n(a) Pairwise model similarities on ImageNet\n\n(b) Number of models to be tested\n\nFigure 1: (a) shows the empirical pairwise similarity between ImageNet models and the hypothetical similarity between models if they were making mistakes independently. (b) plots the number of testable models on ImageNet such that the population error rates for all models are estimated up to \u00b11% error with probability 0.95. We compare the guarantee of the standard union bound with that of a union bound which considers model similarities.\n\nresults in models that exhibit signi\ufb01cant agreement in their predictions, well beyond what would follow from their accuracy values alone. Complementing our empirical investigation of model similarity, we provide a new theoretical analysis of the holdout method that takes model similarity into account, vastly improving over known bounds in the adaptive and non-adaptive cases when model similarity is high.\n\n1.1 Our contributions\n\nOur contributions are two-fold. 
On the empirical side, we demonstrate that a large number of proposed ImageNet [3, 16] and CIFAR-10 [9] models exhibit a high degree of similarity: Their predictions agree far more than we would be able to deduce from their accuracy levels alone. Complementing our empirical \ufb01ndings, we give new generalization bounds that incorporate a measure of similarity. Our generalization bounds help to explain why holdout data has much greater longevity than prior bounds suggest when models are highly similar, as is the case in practice. Figure 1 summarizes these two complementary developments.\n\nUnderlying Figure 1a is a family of representative ImageNet models whose pairwise similarity we evaluate. The mean level of similarity of these models, together with a re\ufb01ned union bound, offers a 4\u00d7 improvement over a carefully optimized baseline bound that does not take model similarity into account. In Figure 1b we compare our guarantee on the number of holdout reuses with the baseline bound. This illustrates that our bound is not just asymptotic, but concrete: it gives meaningful values in the practical regime. Moreover, in Section 5 we discuss how an additional assumption on model predictions can boost the similarity-based guarantee by multiple orders of magnitude.\n\nInvestigating model similarity in practice further, we evaluate the similarity of models encountered during the course of a large random hyperparameter search and a large neural architecture search for the CIFAR-10 dataset. We \ufb01nd that the pairwise model similarities throughout both procedures remain high. The similarity provides a counterweight to the massive number of model evaluations, limiting the amount of over\ufb01tting we observe.\n\n1.2 Related work\n\nRecht et al. [15] recently created new test sets for ImageNet and CIFAR-10, carefully following the original test set creation processes. 
Reevaluating all proposed models on the new test sets showed that while there was generally an absolute performance drop, the effect of over\ufb01tting due to adaptive behavior was limited to non-existent. Indeed, newer and better models on the old test set also performed better on the new test set, even though they had in principle more time to adapt to the test set. Also, Yadav and Bottou [20] recently released a new test set for the seminal MNIST task, on which they observed no over\ufb01tting.\n\nDwork et al. [5] recognized the issue of adaptivity in holdout reuse and provided new holdout mechanisms based on noise addition that support quadratically more queries than the standard method in the worst case. There is a rich line of work on adaptive data analysis; Smith [18] offers a comprehensive survey of the \ufb01eld.\n\nWe are not the \ufb01rst to proffer an explanation for the apparent lack of over\ufb01tting in machine learning benchmarks. Blum and Hardt [2] argued that if analysts only check if they improved on the previous best model, while ignoring models that did not improve, better adaptive generalization bounds are possible. Zrnic and Hardt [21] offered improved guarantees for adaptive analysts that satisfy natural assumptions, e.g. the analyst is unable to arbitrarily use information from queries asked far in the past. More recently, Feldman et al. [6] gave evidence that the number of classes in a classi\ufb01cation problem helps mitigate over\ufb01tting in benchmarks. We see these different explanations as playing together in what is likely the full explanation of the available empirical evidence. 
In parallel to our work, Yadav and Bottou [20] discussed the advantages of comparing models on the same test set; pairing tests can provide tighter con\ufb01dence bounds for model comparisons in this setting than individual con\ufb01dence intervals for each model.\n\n2 Problem setup\n\nLet f : X \u2192 Y be a classi\ufb01er mapping examples from domain X to a label from the set Y. Moreover, we consider a test set S = {(x1, y1), . . .} of n examples sampled i.i.d. from a data distribution D. The main quantity we aim to analyze is the gap between the accuracy of the classi\ufb01er f on the test set S and the population accuracy of the same classi\ufb01er under the distribution D. If the gap between the two accuracies is large, we say f over\ufb01t to the test set.\n\nAs is commonly done in the adaptive data analysis literature [1], we formalize interactions with the test set via statistical queries q : X \u00d7 Y \u2192 R. In our case, the queries are {0, 1}-valued; given a classi\ufb01er f we consider the query qf de\ufb01ned by qf (z) = 1{f (x) \u2260 y}, where z = (x, y). Then, we denote the empirical mean of query qf on the test set S (i.e., f \u2019s test error) by ES[qf ] = (1/n) \u2211_{i=1}^{n} qf (zi). The population mean (population error) is accordingly de\ufb01ned as ED[q] = E_{z\u223cD} q(z).\n\nWhen discussing over\ufb01tting, we are usually interested in a set of classi\ufb01ers, e.g., obtained via a hyperparameter search. Let f1, . . . , fk be such a set of classi\ufb01ers and q1, . . . , qk be the set of corresponding queries. 
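These empirical quantities are easy to compute from stored model predictions. A minimal Python sketch (the toy labels and predictions below are hypothetical, not from the paper's testbed):

```python
# Statistical queries for the 0-1 loss: q_f(z) = 1{f(x) != y}.
# Toy example with hypothetical labels and two classifiers' predictions.
labels  = [0, 1, 1, 0, 1, 0, 1, 1]
preds_1 = [0, 1, 0, 0, 1, 0, 1, 1]   # classifier f1
preds_2 = [0, 1, 0, 0, 1, 1, 1, 1]   # classifier f2

def query(preds, labels):
    """Evaluate q_f on every test point: 1 if f errs, else 0."""
    return [int(p != y) for p, y in zip(preds, labels)]

q1, q2 = query(preds_1, labels), query(preds_2, labels)
n = len(labels)

test_error_1 = sum(q1) / n                               # E_S[q1]
similarity = sum(a == b for a, b in zip(q1, q2)) / n     # empirical P(q1(z) = q2(z))
print(test_error_1, similarity)  # prints: 0.125 0.875
```

Note that two classifiers can have identical test errors yet low similarity, or differing errors yet high similarity; the agreement probability is the extra information the analysis below exploits.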
To quantify the probability that over\ufb01tting occurs (i.e., one of the fi has a large deviation between test and population accuracy), we would like to upper bound the probability\n\nP( max_{1\u2264i\u2264k} |ES[qi] \u2212 ED[qi]| \u2265 \u03b5 ).    (1)\n\nA standard way to bound (1) is to invoke the union bound and treat each query separately:\n\nP( max_{1\u2264i\u2264k} |ES[qi] \u2212 ED[qi]| \u2265 \u03b5 ) \u2264 \u2211_{i=1}^{k} P( |ES[qi] \u2212 ED[qi]| \u2265 \u03b5 ).    (2)\n\nWe can then utilize standard concentration results to bound the right hand side. However, such an approach inherently cannot capture dependencies between the queries qi (or classi\ufb01ers fi). In particular, we are interested in the similarity between two queries q and q\u2032 measured by P (q(z) = q\u2032(z)) (the probability of agreement between the 0-1 losses of the corresponding two classi\ufb01ers). The main goal of this paper is to understand how high similarity can lead to better bounds on (1), both in theory and in numerical experiments with real data from ImageNet and CIFAR-10.\n\n3 Non-adaptive classi\ufb01cation\n\nWe begin by analyzing the effect of the classi\ufb01er similarity when the classi\ufb01ers to be evaluated are chosen non-adaptively. For instance, this is the case when the algorithm designer \ufb01xes a grid of hyperparameters to be explored before evaluating any of the classi\ufb01ers on the test set. To draw valid gains from the hyperparameter search, it is important that the resulting test accuracies re\ufb02ect the true population accuracies, i.e., probability (1) is small.\n\nBound (2) is sharp when the events {|ES[qi] \u2212 ED[qi]| \u2265 \u03b5} are almost disjoint, which is not true when the queries are similar to each other. To address this issue, we modify our use of the union bound. We consider the left tails Ei = {ES[qi] \u2212 ED[qi] \u2265 \u03b5}. 
For any t \u2265 0, we obtain\n\nP( \u222a_{i=1}^{k} Ei ) \u2264 P( {ES[q1] \u2212 ED[q1] \u2265 \u03b5 \u2212 t} \u222a (\u222a_{i=2}^{k} Ei) )\n= P( ES[q1] \u2212 ED[q1] \u2265 \u03b5 \u2212 t ) + P( \u222a_{i=2}^{k} (Ei \u2229 {ES[q1] \u2212 ED[q1] < \u03b5 \u2212 t}) )\n\u2264 P( ES[q1] \u2212 ED[q1] \u2265 \u03b5 \u2212 t ) + \u2211_{i=2}^{k} P( Ei \u2229 {ES[q1] \u2212 ED[q1] < \u03b5 \u2212 t} ).    (3)\n\nIntuitively, the terms P (Ei \u2229 {ES[q1] \u2212 ED[q1] < \u03b5 \u2212 t}) are small when the queries q1 and qi are similar: if P(q1(z) = qi(z)) is large, we cannot simultaneously have ES[q1] < ED[q1] + \u03b5 \u2212 t and ES[qi] \u2265 ED[qi] + \u03b5 since the deviations go into opposite directions. In the rest of this section, we make this intuition precise and derive an upper bound on (1) in terms of the query similarities. Before we state our main result, we introduce the following notion of a similarity covering.\n\nDe\ufb01nition 1. Let F be a set of queries. We say a query set M is an \u03b7 similarity cover of F if for any query q \u2208 F there exist q\u2032, q\u2032\u2032 \u2208 M such that ED[q\u2032] \u2264 ED[q], ED[q\u2032\u2032] \u2265 ED[q], P(q\u2032(z) = q(z)) \u2265 \u03b7, and P(q\u2032\u2032(z) = q(z)) \u2265 \u03b7 (M does not necessarily have to be a subset of F). Let N\u03b7(F) denote the size of a minimal \u03b7 similarity cover of F (when the query set F is clear from context we use the simpler notation N\u03b7).\n\nTheorem 2. Let F = {q1, q2, . . . , qk} be a collection of queries qi : Z \u2192 {0, 1} independent of the test set {z1, z2, . . . , zn}. 
Then, for any \u03b7 \u2208 [0, 1] we have\n\nP( max_{1\u2264i\u2264k} |ES[qi] \u2212 ED[qi]| \u2265 \u03b5 ) \u2264 2N\u03b7 e^{\u2212n\u03b5\u00b2/2} + 2k e^{\u2212(n\u03b5/4) log(1 + \u03b5/(4(1\u2212\u03b7)))}.    (4)\n\nThen, for all \u03b7 \u2264 1 \u2212 max{ 2 log(4k/\u03b4)/n , \u221a(log(4N\u03b7/\u03b4)/(2n)) }, we have with probability 1 \u2212 \u03b4\n\nmax_{1\u2264i\u2264k} |ES[qi] \u2212 ED[qi]| \u2264 max( \u221a(2 log(4N\u03b7/\u03b4)/n) , \u221a(32(1 \u2212 \u03b7) log(4k/\u03b4)/n) ).    (5)\n\nMoreover, if \u03b5 = \u221a(log((2N\u03b7 + 1)/\u03b4)/n) and \u03b7 \u2265 1 \u2212 \u03b5/(4(e^{2\u03b5}(2k)^{4/(n\u03b5)} \u2212 1)), we have with probability 1 \u2212 \u03b4\n\nmax_{1\u2264i\u2264k} |ES[qi] \u2212 ED[qi]| \u2264 \u03b5.    (6)\n\nTo elucidate how model similarity \u03b7 controls the number of queries k for which Theorem 2 gives a non-trivial bound, consider the case where N\u03b7 = 1, i.e. at least one model is \u03b7-similar to all of the others. As the similarity \u03b7 of the model collection grows, the number of queries k grows as well, as the following simple result shows.\n\nCorollary 3. Let F = {q1, q2, . . . , qk} be a collection of k queries qi : Z \u2192 {0, 1} \ufb01xed independently of the test set. Choose \u03b7\u22c6 so that N\u03b7\u22c6 = 1. Suppose n \u2265 c1 max{1/\u03b5, 1/\u03b5\u00b2} and the number of queries k satis\ufb01es\n\nk \u2264 c2 \u03b5/(1 \u2212 \u03b7\u22c6)\n\nfor positive constants c1, c2. Then, with probability 3/4, max_{1\u2264i\u2264k} |ES[qi] \u2212 ED[qi]| \u2264 \u03b5.\n\nThe proof of Theorem 2 starts with the re\ufb01ned union bound (3), or a standard triangle inequality, and then applies the Chernoff concentration bound shown in Lemma 4 for random variables which take values in {\u22121, 0, 1}. We defer the proof details of both the lemma and the theorem to Appendix A.\n\nLemma 4. Suppose Xi are i.i.d. 
discrete random variables which take values \u22121, 0, and 1 with probabilities p\u22121, p0, and p1 respectively, and hence EXi = p1 \u2212 p\u22121. Then, for any t \u2265 0 such that p1 \u2212 p\u22121 + t/2 \u2265 0 we have\n\nP( (1/n) \u2211_{i=1}^{n} Xi > p1 \u2212 p\u22121 + t ) \u2264 e^{\u2212(nt/2) log(1 + t/(2p1))}.\n\nDiscretization arguments based on coverings are standard in statistical learning theory. Covers based on the population Hamming distance P(q\u2032(z) \u2260 q(z)) have been previously studied [4, 11] (note that for {0, 1}-valued queries the Hamming distance is equal to the L2 and L1 distances). An important distinction between our result and prior work is that prior work requires \u03b7 to be greater than 1 \u2212 \u03b5. Theorem 2 can offer an improvement over the standard guarantee \u221a(log(k)/n) even when \u03b7 is much smaller than 1 \u2212 \u03b5. First of all note that (5) holds for \u03b7 bounded away from one. Moreover, since e^{2\u03b5} \u2248 1 + 2\u03b5, if (2k)^{4/(n\u03b5)} \u2264 1 + \u221a\u03b5 (the choice of 1 + \u221a\u03b5 is somewhat arbitrary), we see the requirement on \u03b7 for (6) is satis\ufb01ed when \u03b7 is on the order of 1 \u2212 \u221a\u03b5.\n\n4 Adaptive classi\ufb01cation\n\nIn the previous section, we showed similarity can prevent over\ufb01tting when the sequence of queries is chosen non-adaptively, i.e. when the queries {q1, q2, . . . , qk} are \ufb01xed independently of the test set S. In the adaptive setting, we assume the query qt can be selected as a function of the previous queries {q1, q2, . . . , qt\u22121} and estimates {ES[q1], ES[q2], . . . , ES[qt\u22121]}. 
Even when queries are chosen adaptively, we show leveraging similarity can provide sharper bounds on the probability of over\ufb01tting, P( max_{1\u2264i\u2264k} |ES[qi] \u2212 ED[qi]| \u2265 \u03b5 ).\n\nIn the adaptive setting, the \ufb01eld of adaptive data analysis offers a rich technical repertoire to address over\ufb01tting [5, 18]. In this framework, analogous to the typical machine learning work\ufb02ow, an analyst iteratively selects a classi\ufb01er and then queries a mechanism to provide an estimate of test-set performance. In practice, the mechanism often used is the Trivial Mechanism, which computes the empirical mean of the query on the test set and returns the exact value to the analyst. For simplicity, we study how similarity improves the performance of the Trivial Mechanism.\n\nThe empirical mean of any query can take at most n + 1 values, and thus a deterministic analyst might ask at most (n + 1)^{k\u22121} queries in k rounds of interaction with the Trivial Mechanism. Let F denote the set of (n + 1)^{k\u22121} possible queries. Then, we apply Theorem 2 to F.\n\nCorollary 5. Let F be the set of queries that a \ufb01xed analyst A might ask the Trivial Mechanism. We assume that the Trivial Mechanism has access to a test set of size n. Let \u03b1 \u2208 [0, 1],\n\n\u03b5 = \u221a( 4(k^{1\u2212\u03b1} log(n + 1) + log(2/\u03b4)) / n ),\n\nand \u03b7 = 1 \u2212 \u03b5/(4(e^{\u03b5k^\u03b1} \u2212 1)). If N\u03b7(F) \u2264 (n + 1)^{k^{1\u2212\u03b1}}, we have with probability 1 \u2212 \u03b4\n\nmax_{1\u2264i\u2264k} |ES[qi] \u2212 ED[qi]| \u2264 \u03b5,    (8)\n\nfor any queries q1, q2, . . . , qk chosen adaptively by A.\n\nProof. Note that when \u03b7 = 1 \u2212 \u03b5/(4(e^{\u03b5k^\u03b1} \u2212 1)) we have log(1 + \u03b5/(4(1\u2212\u03b7))) \u2265 \u03b5k^\u03b1. Then, the result follows from the \ufb01rst part of Theorem 2.\n\nIn Corollary 5, the parameter \u03b1 quanti\ufb01es the strength of the similarity assumption. 
For \u03b1 = 0, there is no similarity requirement, and Corollary 5 always applies. In this case, the bound matches standard results for the Trivial Mechanism with \u03b5 = \u02dcO(\u221a(k/n)). However, as \u03b1 grows, the similarity requirement becomes restrictive while the corresponding con\ufb01dence interval becomes increasingly tight. In particular, for any \u03b1 > 0, if F permits a similarity cover N\u03b7(F) \u2264 (n + 1)^{k^{1\u2212\u03b1}} for \u03b7 = 1 \u2212 (\u03b5/4)(e^{\u03b5k^\u03b1} \u2212 1)^{\u22121}, we obtain a superlinear improvement in the dependence on k. For instance, if \u03b1 = 1/2, then \u03b5 = \u02dcO(\u221a(k^{1/2}/n)), and we obtain a quadratic improvement in the number of queries for a \ufb01xed sample size. This improvement is similar to that achieved by the Gaussian mechanism [1, 5]. Moreover, since our technique is essentially tightening a union bound, this improvement easily extends to other mechanisms that rely on compression-based arguments, for instance, the Ladder Mechanism [2].\n\n5 Empirical results\n\nSo far, we have established theoretically that similarity between classi\ufb01ers allows us to evaluate a larger number of classi\ufb01ers on the test set without over\ufb01tting. In this section, we investigate whether these improvements already occur in the regime of contemporary machine learning. We speci\ufb01cally focus on ImageNet and CIFAR-10, two widely used machine learning benchmarks that have recently been shown to exhibit little to no adaptive over\ufb01tting in spite of almost a decade of test set re-use [15]. For both datasets, we empirically measure two main quantities: (i) The similarity between a wide range of models, some of them arising from hyperparameter search experiments. 
(ii) The resulting increase in the number of models we can evaluate in a non-adaptive setting compared to a baseline that does not utilize the model similarities.\n\n5.1 Similarities on ImageNet\n\nWe utilize the model testbed from Recht et al. [15],1 who collected a dataset of 66 image classi\ufb01ers that includes a wide range of standard ImageNet models such as AlexNet [10], ResNets [7], DenseNets [8], VGG [17], Inception [19], and several other models. As a baseline for the observed similarities between these models, we compare them to classi\ufb01ers with the same accuracy but otherwise random predictions: given two models f1 and f2 with population error rates \u00b51 and \u00b52, we know that the similarity P(1{f1(x) \u2260 y} = 1{f2(x) \u2260 y}) equals \u00b51\u00b52 + (1 \u2212 \u00b51)(1 \u2212 \u00b52) if the random variables 1{f1(x) \u2260 y} and 1{f2(x) \u2260 y} are independent. Figure 1a in the introduction shows these model similarities assuming the models make independent mistakes and also the empirical data for the (66 choose 2) = 2,145 pairs of models. We see that the empirical similarities are signi\ufb01cantly higher than the random baseline (mean 0.85 vs 0.62).\n\nThe corresponding Figure 1b shows two lower bounds on the number of models that can be evaluated for the empirical ImageNet data. In particular, we use n = 50,000 (the size of the ImageNet validation set) and a target probability \u03b4 = 0.05 for the over\ufb01tting event (1) with error \u03b5 = 0.01. We compare two methods for computing the number of non-adaptively testable models: a guarantee based on the simple union bound (2) and a guarantee based on our more re\ufb01ned union bound derived from our theoretical analysis in Section 3. 
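The independence baseline above follows directly from the formula for agreement of independent 0-1 losses; a one-function Python sketch:

```python
# If two classifiers with population error rates mu1, mu2 made mistakes
# independently, their 0-1 losses would agree with probability
#   mu1*mu2 + (1 - mu1)*(1 - mu2).
def independent_similarity(mu1, mu2):
    return mu1 * mu2 + (1 - mu1) * (1 - mu2)

# Two hypothetical models at the paper's average top-1 error rate (24.4%):
baseline = independent_similarity(0.244, 0.244)
print(round(baseline, 3))  # prints: 0.631
```

This is well below the mean empirical similarity of 0.85 reported for the 66 ImageNet models (the paper's baseline mean of 0.62 averages over pairs with differing accuracies).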
Later in this section, we introduce an even stronger bound that utilizes higher-order interactions between the model similarities and yields signi\ufb01cantly larger improvements under an assumption on the structure among the classi\ufb01ers.\n\nTo obtain meaningful quantities in the regime of ImageNet, all bounds here require signi\ufb01cantly sharper numerical calculations than the standard theoretical tools such as Chernoff bounds. We now describe these calculations at a high level and defer the details to Appendix B. After introducing the three methods, we compare them on the ImageNet data.\n\nStandard union bound. Given n, \u03b5, and the population error rate of all models ED[qi], we can compute the right hand side of (2) exactly.2 It is well known that higher accuracies lead to smaller probability of error and hence allow for a larger number of test set reuses. We assume all models have population accuracy 75.6%, the average top-1 accuracy of the 66 ImageNet models. In this case, the vanilla union bound (2) guarantees that k = 257,397 models can be evaluated on a test set of size 50,000 so that their empirical accuracies would lie in the con\ufb01dence interval 0.756 \u00b1 0.01 with probability at least 95%.\n\nSimilarity union bound. While the union bound (2) is easy to use, it does not leverage the dependencies between the random variables 1{fi(x) \u2260 y} for i \u2208 {1, 2, . . . , k}. To exploit this property, we utilize the re\ufb01ned union bound (3), which is guaranteed to be an improvement over (2) when the parameter t is optimized. In order to use (3), we must compute the probabilities\n\nP( {ES[q2] \u2212 ED[q2] \u2264 \u03b12} \u2229 {ES[q1] \u2212 ED[q1] \u2265 \u03b11} )    (9)\n\nfor given \u03b11, \u03b12, ED[q1], ED[q2], and similarity P(q1(z) = q2(z)). 
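For the standard union bound baseline above, the computation can be sketched with exact binomial tails in the Python standard library. This is an order-of-magnitude sketch only: the paper's exact count of 257,397 depends on how the left and right tails are decoupled (see its Appendix B), which we do not reproduce here.

```python
import math

# A single model with population error p deviates by >= eps on n test points
# with probability P(X <= floor(n(p-eps))) + P(X >= ceil(n(p+eps))),
# where X ~ Binomial(n, p). The union bound (2) then certifies roughly
# k = floor(delta / tail) non-adaptively testable models.

def log_pmf(j, n, p):
    """Log of the Binomial(n, p) pmf at j, via lgamma for stability."""
    return (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p) + (n - j) * math.log(1.0 - p))

def two_sided_tail(n, p, eps):
    lo = math.floor(n * (p - eps))
    hi = math.ceil(n * (p + eps))
    lower = sum(math.exp(log_pmf(j, n, p)) for j in range(0, lo + 1))
    upper = sum(math.exp(log_pmf(j, n, p)) for j in range(hi, n + 1))
    return lower + upper

# The paper's ImageNet setting: n = 50,000, error 0.244, eps = 0.01, delta = 0.05.
n, p, eps, delta = 50_000, 0.244, 0.01, 0.05
tail = two_sided_tail(n, p, eps)
k = int(delta / tail)
print(k)  # on the order of 10^5 models, comparable to the paper's 257,397
```

The tail here is roughly a 5-sigma event, which is why even the vanilla bound already certifies hundreds of thousands of reuses at this accuracy level.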
In Appendix B, we show that we can compute these probabilities ef\ufb01ciently by assigning success probabilities to three independent Bernoulli random variables X1, X2, and W such that (X1W, X2W ) is equal to (q1(z), q2(z)) in distribution. Let pw := P(W = 1). Then, given i.i.d. draws X1i, X2i, and Wi, we condition on the values of Wi to express probability (9) as\n\nP( {ES[q2] \u2212 ED[q2] \u2264 \u03b12} \u2229 {ES[q1] \u2212 ED[q1] \u2265 \u03b11} ) = \u2211_{j=0}^{n} (n choose j) pw^j (1 \u2212 pw)^{n\u2212j} P( \u2211_{i=1}^{j} X2i \u2264 \u230an(p2 + \u03b12)\u230b ) P( \u2211_{i=1}^{j} X1i \u2265 \u2308n(p1 + \u03b11)\u2309 ).    (10)\n\nWe refer the reader to Appendix B for more details. The two tail probabilities for X1i and X2i can be computed ef\ufb01ciently with the use of beta functions. Using (10) and (3) with a binary search over t, we can compute the probability of making an error \u03b5 when estimating the population error rates of k models with given error rates and pairwise similarities. Figure 1b shows the maximum number of models k that can be evaluated on the same test set so that the probability of making an \u03b5 = 0.01 error in estimating all their error rates is at most 0.05 when the models satisfy ED[qi] = 0.244 and P(qi(z) = qj(z)) \u2265 0.85 for all 1 \u2264 i, j \u2264 k. The \ufb01gure shows that our new bound offers a signi\ufb01cant improvement over the guarantee given by the standard union bound (2).\n\n1Available at https://github.com/modestyachts/ImageNetV2.\n2After an additional union bound to decouple the left and right tails.\n\nSimilarity union bound with a Naive Bayes assumption. 
While the previous computation uses the pairwise similarities observed empirically to offer an improved guarantee on the number of allowed test set reuses, it does not take into account higher-order dependencies between the models. In particular, Figure 4 in Appendix C shows that 27.8% of test images are correctly classi\ufb01ed by all the models, 55.9% of test images are correctly classi\ufb01ed by 60 of the 66 models considered, and 4.7% of test images are incorrectly classi\ufb01ed by all the models. We now show how this kind of agreement between models enables a larger number of test set reuses. Inspired by the coupling used in (10), we make the following assumption.\n\nAssumption A1 (Naive Bayes). Let q1, q2, . . . , qk be a collection of queries such that ED[qi] = p and P(qi(z) = qj(z)) = \u03b7 for some p and \u03b7, for all 1 \u2264 i, j \u2264 k. We say such a collection has a Naive Bayes structure if there exist px and pw in [0, 1] such that (q1(z), q2(z), . . . , qk(z)) is equal to (X1W, X2W, . . . , XkW ) in distribution, where W, X1, . . . , Xk are independent Bernoulli random variables with P(W = 1) = pw and P(Xi = 1) = px for all 1 \u2264 i \u2264 k.\n\nIntuitively, a collection of queries 1{fi(x) \u2260 y} has a Naive Bayes structure if the data distribution D generates easy examples (x, y) with probability 1 \u2212 pw such that all the models fi classify correctly, and if an example is not easy, the models make mistakes independently. As mentioned before, Figure 4 supports the existence of such an easy set. When a test point in the ImageNet test set is not an easy example, the models do not make mistakes independently. Therefore, Assumption A1 is not exactly satis\ufb01ed by existing ImageNet models. However, we know that independent Bernoulli trials saturate the standard union bound (2). This effect can also be observed in Figure 2. As the similarity between the models decreases, i.e. 
1 \u2212 pw decreases, the models make mistakes independently and the guarantee with Assumption A1 converges to the standard union bound guarantee. So while Assumption A1 is not exactly satis\ufb01ed in practice, the violation among the ImageNet classi\ufb01ers likely implies an even better lower bound on the number of testable models.\n\nAssumption A1 is computationally advantageous. It allows us to compute the over\ufb01tting probability (1) exactly, as we detail in Appendix B. Figure 2 is an extension of Figure 1b; it shows the relative improvement of our bounds over the standard union bound in terms of the number of testable models when \u03b5 = 0.01 and \u03b4 = 0.01. Moreover, Figure 2 also shows that the relative improvement of our bounds increases quickly with \u03b5. According to Figure 2, Assumption A1 implies that we can evaluate 10^8 models on the test set in the regime of ImageNet without over\ufb01tting. While this number of models might seem unnecessarily large, in Section 4 we saw that when models are chosen adaptively we must consider a tree of possible models, which can easily contain 10^8 models.\n\n5.2 Similarities on CIFAR-10\n\nPractitioners often evaluate many more models than the handful that ultimately appear in publication. The choice of architecture is the result of a long period of iterative re\ufb01nement, and the hyperparameters for any \ufb01xed architecture are often chosen by evaluating a large grid of plausible models. 
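Returning briefly to Assumption A1: given a target error rate p and similarity level eta, the coupling parameters (pw, px) can be recovered in closed form, since eta = (1 - pw) + pw*(px^2 + (1 - px)^2) simplifies to 1 - 2p + 2p*px when pw*px = p. A short sketch (the specific numbers are illustrative, taken from the paper's ImageNet regime):

```python
# Invert the Naive Bayes coupling of Assumption A1:
#   error rate:  p   = pw * px
#   similarity:  eta = (1 - pw) + pw * (px**2 + (1 - px)**2) = 1 - 2*p + 2*p*px
def coupling_params(p, eta):
    """Solve for (pw, px) given error rate p and pairwise similarity eta."""
    px = (eta - 1 + 2 * p) / (2 * p)
    pw = p / px
    return pw, px

# ImageNet-like regime: mean error 0.244, mean similarity 0.85.
pw, px = coupling_params(0.244, 0.85)

# Sanity check: plugging back in recovers p = 0.244 and eta = 0.85.
p_back = pw * px
eta_back = (1 - pw) + pw * (px**2 + (1 - px)**2)
print(round(pw, 3), round(px, 3))  # prints: 0.352 0.693
```

With these parameters, the overfitting probability (1) under Assumption A1 reduces to conditioning on the number of hard examples (W = 1), as in equation (10).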
Using data from CIFAR-10, we demonstrate that both of these common practices generate large classes of very similar models.\n\nFigure 2: Left \ufb01gure shows the multiplicative gains in the number of testable models, as a function of model similarity, over the guarantee offered by the standard union plus binomial bound, with \u03b5 = 0.01 and \u03b4 = 0.05. Right \ufb01gure shows the same multiplicative gains, but as a function of \u03b5, when \u03b4 = 0.05 and the pairwise similarity is \u03b7 = 0.85.\n\nFigure 3: Model similarities and covering numbers for random hyperparameter search on CIFAR-10.\n\nRandom hyperparameter search. To understand the similarity between models evaluated in hyperparameter search, we ran our own random search to choose hyperparameters for a ResNet-110. The grid included properties of the architecture (e.g. type of residual block), the optimization algorithm (e.g. choice of optimizer), and the data distribution (e.g. data augmentation strategies). A full speci\ufb01cation of the grid is included in Appendix D. We sample and train 320 models, and, for each model, we select 10 checkpoints evenly spaced throughout training. The best model considered achieves an accuracy of 96.6%, and, after restricting to models with accuracy at least 50%, we are left with 1,235 model checkpoints. In Figure 3, we show the similarity for each pair of checkpoints and compute an upper bound on the corresponding similarity covering number N\u03b7(F) for each possible value of \u03b7. As in the case of ImageNet, CIFAR-10 models found by random search are signi\ufb01cantly more similar than random chance would suggest.\n\nNeural architecture search. In the random search experiment, all of the models were chosen non-adaptively: the grid of models is \ufb01xed in advance. However, similarity also protects against over\ufb01tting in the adaptive setting. 
To illustrate the adaptive case, we compute the similarity of models evaluated by automatic neural architecture search. In particular, we ran the DARTS neural architecture search pipeline to adaptively evaluate a large number of plausible models in search of promising configurations [13, 14]. In Table 1, we report the mean accuracies and pairwise similarities for 20 randomly selected configurations evaluated by DARTS, as well as the top 20 scoring configurations according to DARTS' internal scoring mechanism. Table 1 also shows that the multiplicative gains in the number of testable models offered by our similarity bound (SB) and our naive Bayes bound (NBB) over the standard union bound are between one and four orders of magnitude. Therefore, even in a high accuracy regime we can guarantee a significantly higher number of test set reuses without overfitting when taking model similarities into account.

Table 1: Neural Architecture Search Similarities

Models              Mean Accuracy   Mean Similarity   Increase in Testable Models (SB / NBB)
20 Random           96.8%           97.5%             9.9× / 1.6 · 10^4×
20 Highest Scoring  96.9%           97.6%             12.0× / 3.4 · 10^4×

6 Conclusions and future work

We have shown that contemporary image classification models are highly similar, and that this similarity increases the longevity of the test set both in theory and in experiment.
It is worth noting that model similarity does not preclude progress on the test set: two models that are 85% similar can differ by as much as 15% in accuracy (for context: the top-5 accuracy improvement from the seminal AlexNet to the current state of the art on ImageNet is about 17%). In addition, it is well known that higher model accuracy implies a larger number of test set reuses without overfitting. So as the machine learning practitioner explores increasingly better performing models that also become more similar, it can actually become harder to overfit.

There are multiple important avenues for future work. First, one natural question is why the classification models turn out to be so similar. In addition, it would be insightful to understand whether the similarity phenomenon is specific to image classification or also arises in other classification tasks. There may also be further structural dependencies between models that mitigate the amount of overfitting. Finally, it would be ideal to have a statistical procedure that leverages such model structure to provide reliable and accurate performance bounds for test set reuse.

Acknowledgements. We thank Vitaly Feldman for helpful discussions. This work is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, a Siemens Futuremakers Fellowship, an Amazon AWS AI Research Award, a gift from Microsoft Research, and the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814.

References

[1] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability for adaptive data analysis. In Symposium on Theory of Computing (STOC), 2016. https://arxiv.org/abs/1511.02513.

[2] A. Blum and M. Hardt.
The Ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning (ICML), 2015. https://arxiv.org/abs/1502.04585.

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009. http://www.image-net.org/papers/imagenet_cvpr09.pdf.

[4] L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition. Springer, 1996. http://www.szit.bme.hu/~gyorfi/pbook.pdf.

[5] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In Symposium on Theory of Computing (STOC), 2015. https://arxiv.org/abs/1411.2664.

[6] V. Feldman, R. Frostig, and M. Hardt. The advantages of multiple classes for reducing overfitting from test set reuse. In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1905.10360.

[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016. https://arxiv.org/abs/1512.03385.

[8] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017. https://arxiv.org/abs/1608.06993.

[9] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.

[11] J. Langford.
Quantitatively Tight Sample Complexity Bounds. PhD thesis, Carnegie Mellon University, 2002. http://hunch.net/~jl/projects/prediction_bounds/thesis/thesis.pdf.

[12] L. Li and K. Jamieson. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:1–52, 2018.

[13] L. Li and A. Talwalkar. Random search and reproducibility for neural architecture search. In Conference on Uncertainty in Artificial Intelligence (UAI), 2019. https://arxiv.org/abs/1902.07638.

[14] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019. https://arxiv.org/abs/1806.09055.

[15] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1902.10811.

[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015. https://arxiv.org/abs/1409.0575.

[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2014. https://arxiv.org/abs/1409.1556.

[18] A. Smith. Information, privacy and stability in adaptive data analysis, 2017. https://arxiv.org/abs/1706.00820.

[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015. https://arxiv.org/abs/1409.4842v1.

[20] C. Yadav and L. Bottou. Cold Case: The Lost MNIST Digits, 2019. https://arxiv.org/abs/1905.10498.

[21] T. Zrnic and M. Hardt. Natural analysts in adaptive data analysis.
In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1901.11143.