{"title": "A Meta-Analysis of Overfitting in Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9179, "page_last": 9189, "abstract": "We conduct the first large meta-analysis of overfitting due to test set reuse in the machine learning community. Our analysis is based on over one hundred machine learning competitions hosted on the Kaggle platform over the course of several years. In each competition, numerous practitioners repeatedly evaluated their progress against a holdout set that forms the basis of a public ranking available throughout the competition. Performance on a separate test set used only once determined the final ranking. By systematically comparing the public ranking with the final ranking, we assess how much participants adapted to the holdout set over the course of a competition. Our study shows, somewhat surprisingly, little evidence of substantial overfitting. These findings speak to the robustness of the holdout method across different data domains, loss functions, model classes, and human analysts.", "full_text": "A Meta-Analysis of Over\ufb01tting in Machine Learning\n\nRebecca Roelofs\u2217\n\nUC Berkeley\n\nroelofs@berkeley.edu\n\nJohn Miller\nUC Berkeley\n\nSara Fridovich-Keil\u2217\n\nUC Berkeley\n\nsfk@berkeley.edu\n\nVaishaal Shankar\n\nUC Berkeley\n\nmiller_john@berkeley.edu\n\nvaishaal@berkeley.edu\n\nMoritz Hardt\nUC Berkeley\n\nBenjamin Recht\n\nUC Berkeley\n\nLudwig Schmidt\n\nUC Berkeley\n\nhardt@berkeley.edu\n\nbrecht@berkeley.edu\n\nludwig@berkeley.edu\n\nAbstract\n\nWe conduct the \ufb01rst large meta-analysis of over\ufb01tting due to test set reuse in the\nmachine learning community. Our analysis is based on over one hundred machine\nlearning competitions hosted on the Kaggle platform over the course of several\nyears. 
In each competition, numerous practitioners repeatedly evaluated their progress against a holdout set that forms the basis of a public ranking available throughout the competition. Performance on a separate test set used only once determined the final ranking. By systematically comparing the public ranking with the final ranking, we assess how much participants adapted to the holdout set over the course of a competition. Our study shows, somewhat surprisingly, little evidence of substantial overfitting. These findings speak to the robustness of the holdout method across different data domains, loss functions, model classes, and human analysts.\n\n1 Introduction\n\nThe holdout method is central to empirical progress in the machine learning community. Competitions, benchmarks, and large-scale hyperparameter search all rely on splitting a data set into multiple pieces to separate model training from evaluation. However, when practitioners repeatedly reuse holdout data, the danger of overfitting to the holdout data arises [6, 13].\n\nDespite its importance, there is little empirical research into the manifested robustness and validity of the holdout method in practical scenarios. Real-world use cases of the holdout method often fall outside the guarantees of existing theoretical bounds, making questions of validity a matter of guesswork.\n\nRecent replication studies [16] demonstrated that the popular CIFAR-10 [10] and ImageNet [5, 18] benchmarks continue to support progress despite years of intensive use. The longevity of these benchmarks perhaps suggests that overfitting to holdout data is less of a concern than reasoning from first principles might have suggested. However, this is evidence from only two, albeit important, computer vision benchmarks. It remains unclear whether the observed phenomenon is specific to the data domain, model class, or practices of vision researchers. 
Unfortunately, these replication studies required assembling new test sets from scratch, resulting in a highly labor-intensive analysis that is difficult to scale.\n\n\u2217Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we empirically study holdout reuse at a significantly larger scale by analyzing data from 120 machine learning competitions on the popular Kaggle platform [2]. Kaggle competitions are a particularly well-suited environment for studying overfitting since data sources are diverse, contestants use a wide range of model families, and training techniques vary greatly. Moreover, Kaggle competitions use public and private test data splits which provide a natural experimental setup for measuring overfitting on various datasets.\n\nTo provide a detailed analysis of each competition, we introduce a coherent methodology to characterize the extent of overfitting at three increasingly fine scales. Our approach allows us both to discuss the overall \u201chealth\u201d of a competition across all submissions and to inspect signs of overfitting separately among the top submissions. In addition, we develop a statistical test specific to the classification competitions on Kaggle to compare the submission scores to those arising in an ideal null model that assumes no overfitting. Observed data that are close to the data predicted by the null model are strong evidence against overfitting.\n\nOverall, we conclude that the classification competitions on Kaggle show little to no signs of overfitting. While there are some outlier competitions in the data, these competitions usually have pathologies such as non-i.i.d. data splits or (effectively) small test sets. Among the remaining competitions, the public and private test scores show a remarkably good correspondence. 
The picture becomes more nuanced among the highest scoring submissions, but the overall effect sizes of (potential) overfitting are typically small (e.g., less than 1% classification accuracy). Thus, our findings show that substantial overfitting is unlikely to occur naturally in regular machine learning workflows.\n\n2 Background and setup\n\nBefore we delve into the analysis of the Kaggle data, we briefly define the type of overfitting we study and then describe how the Kaggle competition format naturally lends itself to investigating overfitting in machine learning competitions.\n\n2.1 Adaptive overfitting\n\n\u201cOverfitting\u201d is often used as an umbrella term to describe any unwanted performance drop of a machine learning model. Here, we focus on adaptive overfitting, which is overfitting caused by test set reuse. While other phenomena under the overfitting umbrella are also important aspects of reliable machine learning (e.g., performance drops due to distribution shifts), they are beyond the scope of our paper since they require an experimental setup different from ours.\n\nFormally, let f : X \u2192 Y be a trained model that maps examples x \u2208 X to output values y \u2208 Y (e.g., class labels or regression targets). The standard approach to measuring the performance of such a trained model is to define a loss function L : Y \u00d7 Y \u2192 R and to draw samples S = {(x_1, y_1), . . . , (x_n, y_n)} from a data distribution D, which we then use to evaluate the test loss L_S(f) = (1/n) \u2211_{i=1}^{n} L(f(x_i), y_i). 
As long as the model f does not depend on the test set S, standard concentration results [19] show that L_S(f) is a good approximation of the true performance given by the population loss L_D(f) = E_D[L(f(x), y)].\n\nHowever, machine learning practitioners often undermine the assumption that f does not depend on the test set by selecting models and tuning hyperparameters based on the test loss. Especially when algorithm designers evaluate a large number of different models on the same test set, the final classifier may only perform well on the specific examples in the test set. The failure to generalize to the entire data distribution D manifests itself in a large adaptivity gap L_D(f) \u2212 L_S(f) and leads to overly optimistic performance estimates.\n\n2.2 Kaggle\n\nKaggle is the most widely used platform for machine learning competitions, currently hosting 1,461 active and completed competitions. Various organizations (companies, educators, etc.) provide the datasets and evaluation rules for the competitions, which are generally open to any participant. Each competition is centered around a dataset consisting of a training set and a test set.\n\nConsidering the danger of overfitting to the test set in a competitive environment, Kaggle subdivides each test set into public and private components. The subsets are randomly shuffled together and the entire test set is released without labels, so that participants should not know which test samples belong to which split. Hence participants submit predictions for the entire test set. The Kaggle server then internally evaluates each submission on both public and private splits and updates the public competition leaderboard only with the score on the public split. 
At the end of the competition, Kaggle releases the private scores, which determine the winner.\n\nKaggle has released the MetaKaggle dataset^2, which contains detailed information about competitions, submissions, etc. on the Kaggle platform. The structure of Kaggle competitions makes MetaKaggle a useful dataset for investigating overfitting empirically at a large scale. In particular, we can view the public test split S_public as the regular test set and use the held-out private test split S_private to approximate the population loss. Since Kaggle competitors do not receive feedback from S_private until the competition has ended, we assume that the submitted models may be overfit to S_public but not to S_private.^3 Under this assumption, the difference between private and public loss L_{S_private}(f) \u2212 L_{S_public}(f) is an approximation of the adaptivity gap L_D(f) \u2212 L_S(f). Hence our setup allows us to estimate the amount of overfitting occurring in a typical machine learning competition. In the rest of this paper, we will analyze the public versus private score differences as a proxy for adaptive overfitting.\n\nDue to the large number of competitions on the Kaggle platform, we restrict our attention to the most popular classification competitions. In particular, we survey the competitions with at least 1,000 submissions before the competition deadline. Moreover, we include only competitions whose evaluation metric is shared by at least 10 such competitions. These metrics are classification accuracy, AUC (area under curve), MAP@K (mean average precision), the logistic loss, and a multiclass variant of the logistic loss. 
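As a minimal sketch of this experimental setup (the function and the toy numbers are ours; Kaggle's actual split sizes vary by competition):

```python
import random

def public_private_accuracies(correct, n_public, seed=0):
    """Shuffle per-example 0/1 correctness indicators into a public and a
    private split and return the accuracy on each split."""
    rng = random.Random(seed)
    shuffled = correct[:]
    rng.shuffle(shuffled)
    public, private = shuffled[:n_public], shuffled[n_public:]
    return sum(public) / len(public), sum(private) / len(private)

# A submission that is correct on 90% of a 10,000-example test set.
correct = [1] * 9000 + [0] * 1000
pub_acc, priv_acc = public_private_accuracies(correct, n_public=3000)

# For an i.i.d. split, the gap (a proxy for the adaptivity gap) is small.
print(round(pub_acc - priv_acc, 3))
```

The key property exploited in the paper is visible here: when the split is uniformly random and reasonably large, the two accuracies agree closely unless the submission has adapted to one of the splits.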
Appendix A provides more details about our selection criteria.\n\n3 Detailed analysis of competitions scored with classification accuracy\n\nWe begin with a detailed look at classification competitions scored with the standard accuracy metric. Classification is the prototypical machine learning task and accuracy is a widely understood performance measure. This makes the corresponding competitions a natural starting point to understand nuances in the Kaggle data. Moreover, there is a large number of accuracy competitions, which enables us to meaningfully compare effect sizes across competitions. As we will see in Section 3.3, accuracy competitions also offer the advantage that we can obtain measures of statistical uncertainty. Later sections will then present an overview of the competitions scored with other metrics.\n\nWe conduct our analysis of overfitting at three levels of granularity that become increasingly stringent. The first level considers all submissions in a competition and checks for systematic overfitting that would affect a substantial number of submissions (e.g., if the public and private scores diverge early in the competition). The second level then zooms into the top 10% of submissions (measured by public accuracy) and conducts a similar comparison of public to private scores. The goal here is to understand whether there is more overfitting among the best submissions since they are likely most adapted to the test set. The third analysis level then takes a mainly quantitative approach and computes the probabilities of the observed public vs. private accuracy differences under an ideal null model. This allows us to check if the observed gaps are larger than purely random fluctuations.\n\nIn the following subsections, we will apply these three analysis methods to investigate four accuracy competitions. 
These four competitions are the accuracy competitions with the largest number of submissions and serve as representative examples for a typical accuracy competition in the MetaKaggle dataset (see Table 1 for information about these competitions). Section 3.4 then complements these analyses with a quantitative look at all competitions before we summarize our findings for accuracy competitions in Section 3.5.\n\n^2 https://www.kaggle.com/kaggle/meta-kaggle\n^3 Since test examples without labels are available, contestants may still overfit to S_private using an unsupervised approach. While such overfitting may have occurred in a limited number of competitions, we base our analysis on the assumption that unsupervised overfitting did not occur widely with effect sizes that are large compared to overfitting to the public test split.\n\nTable 1: The four accuracy competitions with the largest number of submissions. n_public is the size of the public test set and n_private is the size of the private test set.\n\nID    Name                                         # Submissions   n_public   n_private\n5275  Can we predict voting outcomes?              35,247          249,344    249,343\n3788  Allstate Purchase Prediction Challenge       24,532          59,657     139,199\n7634  TensorFlow Speech Recognition Challenge      24,263          3,171      155,365\n7115  Cdiscount\u2019s Image Classification Challenge   5,859           530,455    1,237,727\n\n3.1 First analysis level: visualizing the overall trend\n\nAs mentioned in Section 2.2, the main quantities of interest are the accuracies on the public and private parts of the test set. In order to visualize this information at the level of all submissions to a competition, we create a scatter plot of the public and private accuracies. In an ideal competition with a large test set and without any overfitting, the public and private accuracies of a submission would all be almost identical and lie near the y = x diagonal. 
On the other hand, substantial deviations from the diagonal would indicate gaps between the public and private accuracy and present possible evidence of overfitting in the competition.\n\nFigure 1 shows such scatter plots for the four accuracy competitions mentioned above. In addition to a point for each submission, the scatter plots also contain a linear regression fit to the data. All four plots show a linear fit that is close to the y = x diagonal. In addition, three of the four plots show very little variation around the diagonal. The variation around the diagonal in the remaining competition is largely symmetric, which indicates that it is likely the effect of random chance.\n\nThese scatter plots can be seen as indicators of overall competition \u201chealth\u201d: in case of pervasive overfitting, we would expect a plateauing trend where later points mainly move on the x-axis (public accuracy) but stagnate on the y-axis (private accuracy). In contrast, the four plots in Figure 1 show that as submissions progress on the public test set, they see corresponding improvements also on the private test set. Moreover, the public scores remain representative of the private scores.\n\nAs can be seen in Appendix B.3, this overall trend is representative of the 34 accuracy competitions. All except one competition show a linear fit close to the main diagonal. The only competition with a substantial deviation is the \u201cTAU Robot Surface Detection\u201d competition (ID 12598). We contacted the authors of this competition and confirmed that there are subtleties in the public / private split which undermine the assumption that the two splits are i.i.d. Hence we consider this competition to be an outlier since it does not conform to the experimental setup described in Section 2.2. 
So at least on the coarse scale of the first analysis level, there is little to no sign of adaptive overfitting: it is easier to make genuine progress on the data distribution in these competitions than to substantially overfit to the test set.\n\nCompetition 5275 | Competition 3788 | Competition 7634 | Competition 7115\n\nFigure 1: Private versus public accuracy for all submissions in the most popular Kaggle accuracy competitions. Each point corresponds to an individual submission (shown with 95% Clopper-Pearson confidence intervals, although the confidence intervals are smaller than the plotted data points).\n\n3.2 Second analysis level: zooming in to the top submissions\n\nWhile the scatter plots discussed above give a comprehensive picture of an entire competition, one concern is that overfitting may be more prevalent among the submissions with the highest public accuracy since they may be more adapted to the public test set. Moreover, the best submissions are those where overfitting would be most serious since invalid accuracies there would give a misleading impression of performance on future data. So to analyze the best submissions in more detail, we also created scatter plots for the top 10% of submissions (as scored by public accuracy).\n\nFigure 2 shows scatter plots for the same four competitions as before. Since the axes now encompass a much smaller range (often only a few percent), they give a more nuanced picture of the performance among the best submissions.\n\nIn the leftmost and rightmost plot, the linear fit for the submissions still closely tracks the y = x diagonal. Hence there is little sign of overfitting among the top 10% of submissions either. 
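The Clopper-Pearson intervals shown in these figures need only a submission's number of correct answers and the split size. A self-contained sketch (our own implementation, bisecting the exact binomial tail rather than calling a statistics library):

```python
import math

def binom_tail(n, k, p):
    """P[X >= k] for X ~ Binomial(n, p), summed via log-pmf for stability."""
    if p <= 0.0:
        return 0.0 if k > 0 else 1.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided Clopper-Pearson interval for k successes in n trials,
    found by bisecting the monotone binomial tail probabilities."""
    def root(predicate):
        lo, hi = 0.0, 1.0  # predicate(lo) is True, predicate(hi) is False
        for _ in range(50):
            mid = (lo + hi) / 2.0
            lo, hi = (mid, hi) if predicate(mid) else (lo, mid)
        return (lo + hi) / 2.0
    lower = 0.0 if k == 0 else root(lambda p: binom_tail(n, k, p) < alpha / 2)
    upper = 1.0 if k == n else root(lambda p: binom_tail(n, k + 1, p) < 1 - alpha / 2)
    return lower, upper

# 90 correct answers out of a 100-example public split:
low, high = clopper_pearson(90, 100)
print(round(low, 3), round(high, 3))  # roughly 0.824 and 0.951
```

For the large Kaggle test sets in Table 1 these intervals shrink far below one percent, which is why they vanish behind the plotted points in Figure 1.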
On the other hand, the middle two plots in Figure 2 show noticeable deviations from the main diagonal. Interestingly, the linear fit is above the diagonal in Competition 7634 and below the main diagonal in Competition 3788. The trend in Competition 3788 is more concerning since it indicates that the public accuracy overestimates the private accuracy. However, in both competitions the absolute effect size (deviation from the diagonal) is small (about 1%). It is also worth noting that the accuracies in Competition 3788 are not in a high-accuracy regime but around 55%. So the relative error from the public to the private test set is small as well.\n\nAppendix B.4 contains scatter plots for the remaining accuracy competitions that show a similar overall trend. Besides Competition 12598 discussed in the previous subsection, there are two additional competitions with a substantial public vs. private deviation. One is Competition 3641, which has a total test set size of about 4,000 but is derived from only 7 human test subjects (the dataset consists of magnetoencephalography (MEG) recordings from these subjects). The other is Competition 12681, which contains a public test set of size 90. Very small (effective) test sets make it easier to reconstruct the public / private split (and then to overfit), and also make the public and private scores more noisy. Hence we consider these two competitions to be outliers. Overall, the second analysis level shows that if overfitting occurs among the top submissions, it only does so to a small extent.\n\nCompetition 5275 | Competition 3788 | Competition 7634 | Competition 7115\n\nFigure 2: Private versus public accuracy for the top 10% of submissions in the most popular Kaggle accuracy competitions. 
Each point corresponds to an individual submission (shown with 95% Clopper-Pearson confidence intervals).\n\n3.3 Third analysis level: quantifying the amount of random variation\n\nWhen discussing deviations from the ideal y = x diagonal in the previous two subsections, an important question is how much variation we should expect from random chance. Due to the finite sizes of the test sets, the public and private accuracies will never match exactly, and it is a priori unclear how much of the deviation we can attribute to random chance and how much to overfitting.\n\nTo quantitatively understand the expected random variation, we compute the probability of a given public vs. private deviation (p-value) for a simple null model. By inspecting the distribution of the resulting p-values, we can then investigate to what extent the observed deviations can be attributed to random chance.\n\nWe consider the following null model under which we compute p-values for observing certain gaps between the public and private accuracies. We fix a submission that makes a given number of mistakes on the entire test set (public and private split combined). We then randomly split the test set into two parts with sizes corresponding to the public and private splits of the competition. This leads to a certain number of mistakes (and hence accuracy) on each of the two parts. The p-value for this submission is then given by the probability of the event\n\n|public_accuracy \u2212 private_accuracy| \u2265 \u03b5   (1)\n\nwhere \u03b5 is the observed deviation between public and private accuracy. We describe the details of computing these p-values in Appendix B.5.\n\nFigure 3 plots the distribution of these p-values for the same four competitions as before. 
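Under this null model, the number of mistakes that land in the public split follows a hypergeometric distribution, so the probability in Eq. (1) can be enumerated exactly. The following is our own sketch of that computation; the authors' precise procedure is in their Appendix B.5:

```python
from math import comb

def split_p_value(n_public, n_private, mistakes, epsilon):
    """P[|public_accuracy - private_accuracy| >= epsilon] when `mistakes`
    errors are spread over a uniformly random public/private split."""
    total = n_public + n_private
    denom = comb(total, n_public)
    p_value = 0.0
    # k = number of mistakes falling into the public split (hypergeometric)
    for k in range(max(0, mistakes - n_private), min(mistakes, n_public) + 1):
        public_acc = 1 - k / n_public
        private_acc = 1 - (mistakes - k) / n_private
        if abs(public_acc - private_acc) >= epsilon:
            p_value += comb(mistakes, k) * comb(total - mistakes, n_public - k) / denom
    return p_value

# A submission with 80% accuracy on a 2,000-example test set, split evenly:
print(split_p_value(1000, 1000, 400, epsilon=0.02))
```

Setting epsilon to the gap actually observed for a submission yields the p-value plotted in Figure 3; with epsilon = 0 the sum covers the whole distribution and returns 1.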
To see potential effects of overfitting, we show p-values for all submissions, the top 10% of submissions (as in the previous subsection), and the first submission of each team in the competition. We plot the first submissions separately since they should be least adapted to the test set and hence show the smallest amount of overfitting.\n\nCompetition 5275 | Competition 3788 | Competition 7634 | Competition 7115\n\nFigure 3: CDFs of the p-values for three (sub)sets of submissions in the four accuracy competitions with the largest number of submissions.\n\nUnder our ideal null hypothesis, the p-values would have a uniform distribution with a CDF following the y = x diagonal. However, this is only the case for Competition 5275 and the first submissions in Competitions 7634 & 7115, and even there only approximately. Most other curves show a substantially larger number of small p-values (large gaps between public and private accuracy) than expected under the null model. Moreover, the top 10% of submissions always exhibit the smallest p-values, followed by all submissions and then the first submissions.\n\nWhen interpreting these p-value plots, we emphasize that the null model is highly idealized. In particular, we assume that every submission is evaluated on its own independent random public / private split, which is clearly not the case for Kaggle competitions. So for correlated submissions, we would expect clusters of approximately equal p-values that are unlikely to arise under the null model. 
Since it is plausible that many models are trained with standard libraries such as XGBoost [4] or scikit-learn [14], we conjecture that correlated models are behind the jumps in the p-value CDFs.\n\nIn addition, it is important to note that for large test sets (e.g., the 198,856 examples in Competition 3788), even very small systematic deviations between the public and private accuracies are statistically significant and hence lead to small p-values. So while the analysis based on p-value plots does point towards irregularities such as overfitting, the overall effect size (deviation between public and private accuracies) can still be small.\n\nGiven the highly discriminative nature of the p-values, a natural question is whether any competition exhibits approximately uniform p-values. As can be seen in Appendix B.5, some competitions indeed have p-value plots that are close to the uniform distribution under the null model (Figure 11 in the same appendix highlights four examples). Due to the idealized null hypothesis, this is strong evidence that these competitions are free from overfitting.\n\n3.4 Aggregate view of the accuracy competitions\n\nThe previous subsections provided tools for analyzing individual competitions and then relied on a qualitative survey of all accuracy competitions. Due to the many nuances and failure modes in machine learning competitions, we believe that this case-by-case approach is the most suitable for the Kaggle data. But since this approach also carries the risk of missing overall trends in the data, we complement it here with a quantitative analysis of all accuracy competitions.\n\nIn order to compare the amount of overfitting across competitions, we compute the mean accuracy difference across all submissions in a competition. Specifically, let C be the set of submissions for a given competition. 
Then the mean accuracy difference of the competition is defined as\n\nmean_accuracy_difference = (1/|C|) \u2211_{i\u2208C} (public_accuracy(i) \u2212 private_accuracy(i)).   (2)\n\nA larger mean accuracy difference indicates more potential overfitting. Note that since accuracy is in a sense the opposite of loss, our mean accuracy difference carries the opposite sign convention to our earlier expression for the adaptivity gap.\n\nFigure 4: Left: Empirical CDF of the mean accuracy differences (%) for 34 accuracy competitions with at least 1,000 submissions. Right: Mean accuracy differences versus competition end date for the same competitions.\n\nFigure 4 shows the empirical CDF of the mean accuracy differences for all accuracy competitions. While the score differences between \u22122% and +2% are approximately symmetric and centered at zero (as a central limit theorem argument would suggest), the plot also shows a tail with larger score differences consisting of five competitions. Figure 8 in Appendix B.2 aggregates our three types of analysis plots for these competitions.\n\nWe have already mentioned three of these outliers in the discussion so far. The worst outlier is Competition 12598, which contains a non-i.i.d. data split (see Section 3.1). Section 3.2 noted that Competitions 3641 and 12681 had very small test sets. Similarly, the other two outliers (#12349 and #5903) have small private test sets of size 209 and 100, respectively. Moreover, the public - private accuracy gap decreases among the very best submissions in these competitions (see Figure 8 in Appendix B.2). Hence we do not consider these competitions as examples of adaptive overfitting.\n\nAs a second aggregate comparison, the right plot in Figure 4 shows the mean accuracy differences vs. competition end date and separates two types of competitions: in-class and other, the latter being mainly the \u201cfeatured\u201d competitions on Kaggle with prize money. 
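Eq. (2) is straightforward to evaluate from per-submission score pairs; a minimal sketch (the example numbers are invented):

```python
def mean_accuracy_difference(submissions):
    """Eq. (2): the average of public minus private accuracy over the
    submissions C of one competition, given as (public, private) pairs."""
    return sum(pub - priv for pub, priv in submissions) / len(submissions)

# Three hypothetical submissions to one competition:
subs = [(0.914, 0.910), (0.880, 0.883), (0.902, 0.899)]
print(round(mean_accuracy_difference(subs), 4))  # 0.0013
```

A clearly positive value means the public leaderboard systematically flatters the submissions, which is the signature of potential adaptive overfitting that the tail of Figure 4 probes.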
Interestingly, many of the competitions with large public - private deviations are from 2019. Moreover, the large deviations occur almost exclusively in in-class competitions. This may indicate that in-class competitions undergo less quality control than the featured competitions (e.g., have smaller test sets), and that the quality control standards on Kaggle may change over time.\n\nAs a third aggregate comparison, Figure 5 shows the mean accuracy differences vs. test set sizes, with orange dots to indicate competitions we flagged as having either a known or a non-i.i.d. public vs. private test set split. Although a reliable recommendation for applied machine learning will require broader investigation, our results for accuracy competitions suggest that at least 10,000 examples is a reasonable minimum test set size to protect against adaptive overfitting.\n\nFigure 5: Mean accuracy differences versus test set size (public and private combined) for 32 accuracy competitions with at least 1,000 submissions and available test set size (the test set sizes for two competitions with at least 1,000 submissions were not available from the MetaKaggle dataset).\n\n3.5 Did we observe overfitting?\n\nThe preceding subsections provided an increasingly fine-grained analysis of the Kaggle competitions evaluated with classification accuracy. The scatter plots for all submissions in Section 3.1 show a good fit to the y = x diagonal with small and approximately symmetric deviations from the diagonal. This is strong evidence that the overall competition is not affected by substantial overfitting. When restricting the plots to the top submissions in Section 3.2, the picture becomes more varied, but the largest overfitting effect size (public - private accuracy deviation) is still small. Thus for some competitions we observe evidence of mild overfitting, but always with small effect size. 
In both sections we have identified outlier competitions that do not follow the overall trend but also have issues such as small test sets or non-i.i.d. splits. At the finest level of our analysis (Section 3.3), the p-value plots show that the data is only sometimes in agreement with an idealized null model of no overfitting.\n\nOverall we see more signs of overfitting as we sharpen our analysis to highlight smaller effect sizes. So while we cannot rule out every form of overfitting, we view our findings as evidence that overfitting did not pose a significant danger in the most popular classification accuracy competitions on Kaggle. In spite of up to 35,000 submissions to these competitions, there are no large overfitting effects. So while overfitting may have occurred to a small extent, it did not invalidate the overall conclusions from the competitions, such as which submissions rank among the top or how well they perform on the private test split.\n\n4 Classification competitions with further evaluation metrics\n\nIn addition to classification accuracy competitions, we also surveyed competitions evaluated with AUC, MAP@K, LogLoss, and MulticlassLoss. Unfortunately, the MetaKaggle dataset contains only aggregate scores for the public and private test set splits, not the loss values for individual predictions. For the accuracy metric, aggregate scores are sufficient to compute statistical measures of uncertainty such as the error bars in the scatter plots (Sections 3.1 & 3.2) or the p-value plots (Section 3.3). However, the aggregate data is insufficient to compute similar quantities for the other classification metrics. For instance, the lack of example-level scores precludes the use of standard tools such as the bootstrap or permutation tests, as we are unable to re-sample the test set. 
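For illustration, the bootstrap that the aggregate MetaKaggle data rules out would only take a few lines given example-level scores (everything in this sketch is hypothetical):

```python
import random

def bootstrap_ci(scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean metric,
    computed from per-example scores (exactly what MetaKaggle lacks)."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int(n_boot * alpha / 2)]
    upper = means[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lower, upper

# Per-example 0/1 correctness for a submission with 85% accuracy:
scores = [1] * 850 + [0] * 150
low, high = bootstrap_ci(scores)
print(low, high)  # an interval around 0.85
```

The same resampling works for AUC or log loss once per-example contributions are known, which is why the lack of them forces a more qualitative treatment in this section.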
Hence our analysis here is more qualitative than for the accuracy competitions in the preceding section. Nevertheless, inspecting the scatter plots can still convey overall trends in the data.\n\nFigure 6: Empirical CDF of mean score differences (across all pre-deadline submissions) for 40 AUC competitions, 17 MAP@K competitions, 15 LogLoss competitions, and 14 MulticlassLoss competitions.\n\nAppendices C to F contain the scatter plots for all submissions and for the top 10% of submissions to these competitions. The overall picture is similar to the accuracy plots: many competitions have scatter plots with a good linear fit close to the y = x diagonal. There is more variation in the top 10% plots, but due to the lack of error bars it is difficult to attribute this to overfitting vs. random noise. As before, a small number of competitions show more variation that may be indicative of overfitting. In all cases, the Kaggle website (data descriptions and discussion forums) gives possible reasons for this behavior (non-i.i.d. splits, competitions with two stages and different test sets, etc.). Thus, we view these competitions as outliers that do not contradict the overall trend.\n\nFigure 6 shows plots with aggregate statistics (mean score difference) similar to Section 3.4. As for the accuracy competitions, the empirical distribution has an approximately symmetric part centered at 0 (no score change) and a tail with larger score differences that consists mainly of outlier competitions.\n\n5 Related work\n\nAs mentioned in the introduction, the reproducibility experiment of Recht et al. [16] also points towards a surprising absence of adaptive overfitting in popular machine learning benchmarks. However, there are two important differences from our work. First, Recht et al. [16] assembled new test sets from scratch, which makes it hard to disentangle the effects of adaptive overfitting and distribution shifts. 
In contrast, most of the public / private splits in Kaggle competitions are i.i.d., which removes distribution shifts as a confounder. Second, Recht et al. [16] investigated only two image classification benchmarks on which most models come from the same model class (CNNs) [9, 11]. We survey 120 competitions in which the Kaggle competitors experimented with a broad range of models and training approaches. Hence our conclusions about overfitting apply to machine learning more broadly.

The adaptive data analysis literature [6, 17] provides a range of theoretical explanations for how the common machine learning workflow may implicitly mitigate overfitting [3, 8, 12, 23]. Our work is complementary to these papers and conducts a purely empirical study of overfitting in machine learning competitions. We hope that our findings can help test and refine the theoretical understanding of overfitting in future work.

The Kaggle community has analyzed competition “shake-up”, i.e., rank changes between the public and private leaderboards of a competition. We refer the reader to the comprehensive data analysis conducted by Trotman [21] as a concrete example. Focusing on rank changes is complementary to our approach based on submission scores. From a competition perspective, where the winning submissions are defined by the private leaderboard rank, large rank changes are indeed undesirable. However, from the perspective of adaptive overfitting, large rank changes can be a natural consequence of random noise in the evaluation. For instance, consider a setting where a large number of competitors submit solutions with very similar public leaderboard scores.
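This setting is easy to simulate. The sketch below is our own illustration (all parameters are hypothetical): every competitor has the same true accuracy, so any leaderboard movement is pure evaluation noise rather than adaptive overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical field: 1,000 competitors with identical true accuracy,
# scored on disjoint public and private test splits.
n_competitors, true_acc = 1_000, 0.85
n_pub, n_priv = 3_000, 7_000

# Observed scores are binomial sample means: noise only, no adaptivity.
pub = rng.binomial(n_pub, true_acc, n_competitors) / n_pub
priv = rng.binomial(n_priv, true_acc, n_competitors) / n_priv

# Leaderboard rank on each split (rank 0 = best score).
pub_rank = (-pub).argsort().argsort()
priv_rank = (-priv).argsort().argsort()

mean_rank_change = np.abs(pub_rank - priv_rank).mean()
max_score_gap = np.abs(pub - priv).max()
```

In runs of this simulation, the mean absolute rank change comes out near a third of the field, even though every individual public–private score gap is small (on the order of a percentage point).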
Due to the limited size of the public and private test sets, the public scores are only approximately equal to the private scores (even in the absence of any adaptive overfitting), which can lead to substantial rank changes even though all score deviations are small and of (roughly) equal size. Since the ranking approach can result in such “false positives” from the perspective of adaptive overfitting, we decided not to investigate rank changes in our paper. Nevertheless, we note that we also manually compared the shake-up results to our analysis and generally found agreement on the set of problematic competitions (e.g., competitions with non-i.i.d. splits or known public / private splits).

6 Conclusion and future work

We surveyed 120 competitions on Kaggle covering a wide range of classification tasks but found little to no sign of adaptive overfitting. Our results cast doubt on the standard narrative that adaptive overfitting is a significant danger in the common machine learning workflow.

Moreover, our findings call into question whether common practices such as limiting test set re-use increase the reliability of machine learning. We have seen multiple competitions where a non-i.i.d. split led to substantial gaps between public and private scores, suggesting that distribution shifts [7, 15, 16, 20] may be a more pressing problem than test set re-use in current machine learning.

There are multiple directions for empirically understanding overfitting in more detail. Our analysis here focused on classification competitions, but Kaggle also hosts many regression competitions. Is there more adaptive overfitting in regression? Answering this question will likely require access to the individual predictions of the Kaggle submissions to appropriately handle outlier submissions. In addition, there are still open questions among the classification competitions.
For instance, one refinement of our analysis here is to obtain statistical measures of uncertainty for competitions evaluated with metrics such as AUC (which will also require a more fine-grained version of the Kaggle data). Finally, another important question is whether other competition platforms such as CodaLab [1] or EvalAI [22] also show little sign of adaptive overfitting.

Acknowledgments

This research was generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, a Siemens Futuremakers Fellowship, and an Amazon AWS AI Research Award.

References

[1] CodaLab. https://competitions.codalab.org/competitions/.

[2] Kaggle. https://www.kaggle.com/.

[3] A. Blum and M. Hardt. The Ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning (ICML), 2015.

[4] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In International Conference on Knowledge Discovery and Data Mining (KDD), 2016.

[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[6] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In ACM Symposium on Theory of Computing (STOC), 2015.

[7] L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry. Exploring the landscape of spatial robustness. In International Conference on Machine Learning (ICML), 2019.

[8] V. Feldman, R. Frostig, and M. Hardt. The advantages of multiple classes for reducing overfitting from test set reuse. In International Conference on Machine Learning (ICML), 2019.

[9] K. Fukushima.
Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 1980.

[10] A. Krizhevsky. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf, 2009.

[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[12] H. Mania, J. Miller, L. Schmidt, M. Hardt, and B. Recht. Model similarity mitigates test set overuse. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[13] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. Chapter 1.4.8.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.

[15] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.

[16] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML), 2019.

[17] A. Roth and A. Smith. Lecture notes “The Algorithmic Foundations of Adaptive Data Analysis”, 2017. https://adaptivedataanalysis.com/.

[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.

[19] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[20] A. Torralba and A. A. Efros.
Unbiased look at dataset bias. In Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[21] J. Trotman. Meta Kaggle: Competition shake-up, 2019. https://www.kaggle.com/jtrotman/meta-kaggle-competition-shake-up.

[22] D. Yadav, R. Jain, H. Agrawal, P. Chattopadhyay, T. Singh, A. Jain, S. Singh, S. Lee, and D. Batra. EvalAI: Towards better evaluation systems for AI agents, 2019. http://arxiv.org/abs/1902.03570.

[23] T. Zrnic and M. Hardt. Natural analysts in adaptive data analysis. In International Conference on Machine Learning (ICML), 2019.