{"title": "Cold Case: The Lost MNIST Digits", "book": "Advances in Neural Information Processing Systems", "page_first": 13443, "page_last": 13452, "abstract": "Although the popular MNIST dataset \\citep{mnist} is derived from the NIST database \\citep{nist-sd19}, the precise processing steps of this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc. We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the remaining 50,000 were never distributed, they enable us to investigate the impact of twenty-five years of MNIST experiments on the reported testing performances. Our results unambiguously confirm the trends observed by \\citet{recht2018cifar,recht2019imagenet}: although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.", "full_text": "Cold Case: the Lost MNIST Digits\n\nChhavi Yadav\n\nNew York University\n\nNew York, NY\n\nchhavi@nyu.edu\n\nL\u00e9on Bottou\n\nFacebook AI Research\nand New York University\n\nNew York, NY\n\nleon@bottou.org\n\nAbstract\n\nAlthough the popular MNIST dataset [LeCun et al., 1994] is derived from the NIST database [Grother and Hanaoka, 1995], the precise processing steps for this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc. 
We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the remaining 50,000 were never distributed, they can be used to investigate the impact of twenty-five years of MNIST experiments on the reported testing performances. Our limited results unambiguously confirm the trends observed by Recht et al. [2018, 2019]: although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.\n\n1 Introduction\n\nThe MNIST dataset [LeCun et al., 1994, Bottou et al., 1994] has been used as a standard machine learning benchmark for more than twenty years. During the last decade, many researchers have expressed the opinion that this dataset has been overused. In particular, the small size of its test set, merely 10,000 samples, has been a cause of concern. Hundreds of publications report increasingly good performance on this same test set. Did they overfit the test set? Can we trust any new conclusion drawn on this dataset? How quickly do machine learning datasets become useless?\nThe first partitions of the large NIST handwritten character collection [Grother and Hanaoka, 1995] had been released one year earlier, with a training set written by 2000 Census Bureau employees and a substantially more challenging test set written by 500 high school students. One of the objectives of LeCun, Cortes, and Burges was to create a dataset with similarly distributed training and test sets. The process they describe produces two sets of 60,000 samples. The test set was then downsampled to only 10,000 samples, possibly because manipulating such a dataset with the computers of the times could be annoyingly slow. 
The remaining 50,000 test samples have since been lost.\nThe initial purpose of this work was to recreate the MNIST preprocessing algorithms in order to\ntrace back each MNIST digit to its original writer in NIST. This reconstruction was \ufb01rst based on the\navailable information and then considerably improved by iterative re\ufb01nements. Section 2 describes\nthis process and measures how closely our reconstructed samples match the of\ufb01cial MNIST samples.\nThe reconstructed training set contains 60,000 images matching each of the MNIST training images.\nSimilarly, the \ufb01rst 10,000 images of the reconstructed test set match each of the MNIST test set\nimages. The next 50,000 images are a reconstruction of the 50,000 lost MNIST test images.1\n\n1Code and data are available at https://github.com/facebookresearch/qmnist.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThe original NIST test contains 58,527 digit images written by 500 dif-\nferent writers. In contrast to the training set, where blocks of data from\neach writer appeared in sequence, the data in the NIST test set is scram-\nbled. Writer identities for the test set is available and we used this infor-\nmation to unscramble the writers. We then split this NIST test set in two:\ncharacters written by the \ufb01rst 250 writers went into our new training set.\nThe remaining 250 writers were placed in our test set. Thus we had two\nsets with nearly 30,000 examples each.\n\nThe new training set was completed with enough samples from the\nold NIST training set, starting at pattern #0, to make a full set of 60,000\ntraining patterns. Similarly, the new test set was completed with old\ntraining examples starting at pattern #35,000 to make a full set with\n60,000 test patterns. All the images were size normalized to \ufb01t in a 20\nx 20 pixel box, and were then centered to \ufb01t in a 28 x 28 image using\ncenter of gravity. 
Grayscale pixel values were used to reduce the effects of aliasing. These are the training and test sets used in the benchmarks described in this paper. In this paper, we will call them the MNIST data.\n\nFigure 1: The two paragraphs of Bottou et al. [1994] describing the MNIST preprocessing. The hsf4 partition of the NIST dataset, that is, the original test set, contains in fact 58,646 digits.\n\nIn the same spirit as [Recht et al., 2018, 2019], the rediscovery of the 50,000 lost MNIST test digits provides an opportunity to quantify the degradation of the official MNIST test set over a quarter-century of experimental research. Section 3 compares and discusses the performances of well-known algorithms measured on the original MNIST test samples, on their reconstructions, and on the reconstructions of the 50,000 lost test samples. Our results provide a well-controlled confirmation of the trends identified by Recht et al. [2018, 2019] on a different dataset.\n\n2 Recreating MNIST\n\nRecreating the algorithms that were used to construct the MNIST dataset is a challenging task. Figure 1 shows the two paragraphs that describe this process in [Bottou et al., 1994]. Although this was the first paper mentioning MNIST, the creation of the dataset predates this benchmarking effort by several months.2 Curiously, this description incorrectly reports the number of digits in the hsf4 partition, that is, the original NIST testing set, as 58,527 instead of 58,646.3\nThese two paragraphs give a relatively precise recipe for selecting the 60,000 digits that compose the MNIST training set. Alas, applying this recipe produces a set that contains one more zero and one fewer eight than the actual MNIST training set. Although they do not match, these class distributions are too close to make it plausible that 119 digits were really missing from the hsf4 partition.\nThe description of the image processing steps is much less precise. 
How are the 128x128 binary NIST images cropped? Which heuristics, if any, are used to disregard noisy pixels that do not belong to the digits themselves? How are rectangular crops centered in a square image? How are these square images resampled to 20x20 gray-level images? How are the coordinates of the center of gravity rounded for the final centering step?\n\n2.1 An iterative process\n\nOur initial reconstruction algorithms were informed by the existing description and, crucially, by our knowledge of a mysterious resampling algorithm found in ancient parts of the Lush codebase: instead of using a bilinear or bicubic interpolation, this code computes the exact overlap of the input and output image pixels.4\n\n2When LB joined this effort during the summer of 1994, the MNIST dataset was already ready.\n3The same description also appears in [LeCun et al., 1994, Le Cun et al., 1998]. These more recent texts incorrectly use the names SD1 and SD3 to denote the original NIST test and training sets. An additional sentence explains that only a subset of 10,000 test images was used or made available, \u201c5000 from SD1 and 5000 from SD3.\u201d\n\n4See https://tinyurl.com/y5z7qtcg.\n\n\fMagnification: MNIST #0, NIST #229421\n\nFigure 2: Side-by-side display of the first sixteen digits in the MNIST and QMNIST training set. The magnified view of the first one illustrates the correct reconstruction of the antialiased pixels.\n\nAlthough our first reconstructed dataset, dubbed QMNISTv1, behaves very much like MNIST in machine learning experiments, its digit images could not be reliably matched to the actual MNIST digits. In fact, because many digits have similar shapes, we must rely on subtler details such as the anti-aliasing pixel patterns. It was however possible to identify a few matches. For instance, we found that the lightest zero in the QMNIST training set matches the lightest zero in the MNIST training set. 
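As an illustration of this pixel-overlap idea (a minimal sketch in our own words, not the original Lush code; `overlap_resample` is our name), each output pixel can be computed as the exact area-weighted average of the input pixels covered by its footprint:

```python
import numpy as np

def overlap_resample(img, out_h, out_w):
    """Resample by exact pixel-overlap averaging: each output pixel is the
    average of the input image over the exact rectangular footprint it
    covers, so no sampling grid or interpolation kernel is involved."""
    in_h, in_w = img.shape
    sy, sx = in_h / out_h, in_w / out_w          # footprint size in input pixels
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y0, y1 = i * sy, (i + 1) * sy        # footprint of output pixel (i, j)
            x0, x1 = j * sx, (j + 1) * sx
            acc = 0.0
            for y in range(int(y0), int(np.ceil(y1))):
                wy = min(y + 1, y1) - max(y, y0)         # vertical overlap length
                for x in range(int(x0), int(np.ceil(x1))):
                    wx = min(x + 1, x1) - max(x, x0)     # horizontal overlap length
                    acc += wy * wx * img[y, x]
            out[i, j] = acc / (sy * sx)          # normalize by footprint area
    return out
```

This is the same principle as area ("box") averaging: on a constant image the output is exactly constant, and the result does not depend on where a sampling grid happens to fall.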
We were able to reproduce their antialiasing patterns by fine-tuning the initial centering and resampling algorithms, leading to QMNISTv2.\nWe then found that the smallest L2 distance between MNIST digits and jittered QMNIST digits was a reliable match indicator. Running the Hungarian assignment algorithm on the two training sets gave good matches for most digits. A careful inspection of the worst matches allowed us to further tune the cropping algorithms, and to discover, for instance, that the extra zero in the reconstructed training set was in fact a duplicate digit that the MNIST creators had identified and removed. The ability to obtain reliable matches allowed us to iterate much faster and explore more aspects of the image processing algorithm space, leading to QMNISTv3, v4, and v5. Note that all this tuning was achieved by matching training set images only.\nThis seemingly pointless quest for an exact reconstruction was surprisingly addictive. Supposedly urgent tasks could be indefinitely delayed with this important procrastination pretext. Since all good things must come to an end, we eventually had to freeze one of these datasets and call it QMNIST.\n\n2.2 Evaluating the reconstruction quality\n\nAlthough the QMNIST reconstructions are closer to the MNIST images than we had envisioned, they remain imperfect.\nTable 2 indicates that about 0.25% of the QMNIST training set images are shifted by one pixel relative to their MNIST counterparts. This occurs when the center of gravity computed during the last centering step (see Figure 1) is very close to a pixel boundary. Because the image reconstruction is imperfect, the reconstructed center of gravity sometimes lands on the other side of the pixel boundary, and the alignment code shifts the image by a whole pixel.\n\n\fTable 1: Quartiles of the jittered distances between matching MNIST and QMNIST training digit images with pixels in range 0...255. 
An L2 distance of 255 would indicate a one-pixel difference. The L\u221e distance represents the largest absolute difference between image pixels.\n\n                       Min   25%   Med   75%   Max\nJittered L2 distance     0   7.1   8.7  10.5  17.3\nJittered L\u221e distance     0     1     1     1     3\n\nTable 2: Count of training samples for which the MNIST and QMNIST images align best without translation or with a \u00b11 pixel translation.\n\nJitter               0 pixels   \u00b11 pixel\nNumber of matches       59853        147\n\nTable 3: Misclassification rates of a Lenet5 convolutional network trained on both the MNIST and QMNIST training sets and tested on the MNIST test set, on the 10K QMNIST testing examples matching the MNIST testing set, and on the 50K remaining QMNIST testing examples.\n\nTest on            MNIST            QMNIST10K        QMNIST50K\nTrain on MNIST     0.82% (\u00b10.2%)    0.81% (\u00b10.2%)    1.08% (\u00b10.1%)\nTrain on QMNIST    0.81% (\u00b10.2%)    0.80% (\u00b10.2%)    1.08% (\u00b10.1%)\n\nTable 1 gives the quartiles of the L2 and L\u221e distances between the MNIST and QMNIST images, after accounting for these occasional single-pixel shifts. An L2 distance of 255 would indicate a full pixel of difference. The L\u221e distance represents the largest absolute difference between image pixels, expressed as integers in range 0...255.\nIn order to further verify the reconstruction quality, we trained a variant of the Lenet5 network described by Le Cun et al. [1998]. Its original implementation is still available as a demonstration in the Lush codebase. Lush [Bottou and LeCun, 2001] descends from the SN neural network software [Bottou and Le Cun, 1988] and from its AT&T Bell Laboratories variants developed in the nineties. This particular variant of Lenet5 omits the final Euclidean layer described in [Le Cun et al., 1998] without incurring a performance penalty. 
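The jittered distances of Table 1 can be computed along the following lines (a sketch under our own naming, not the authors' code; we take the minimum L2 distance over all \u00b11-pixel translations):

```python
import numpy as np

def jittered_l2(a, b):
    """Smallest L2 distance between image a and image b translated by
    -1, 0, or +1 pixel along each axis (9 candidate alignments).
    np.roll wraps around, which is harmless when the image border is blank."""
    best = np.inf
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            shifted = np.roll(b, (dy, dx), axis=(0, 1))
            best = min(best, np.linalg.norm(a.astype(float) - shifted.astype(float)))
    return best
```

On this scale, two images that differ by a single fully saturated pixel are at distance 255, the reading used in the Table 1 caption.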
Following the pattern set by the original implementation, the training protocol consists of three sets of 10 epochs with global stepsizes 10^-4, 10^-5, and 10^-6. Each set starts with estimating the diagonal of the Hessian. Per-weight stepsizes are then computed by dividing the global stepsize by the estimated curvature plus 0.02. Table 3 reports insignificant differences when one trains with the MNIST or QMNIST training set, or tests with the MNIST test set or the matching part of the QMNIST test set. On the other hand, we observe a more substantial difference when testing on the remaining part of the QMNIST test set, that is, the reconstructions of the lost MNIST test digits. Such discrepancies will be discussed more precisely in Section 3.\n\n2.3 MNIST trivia\n\nThe reconstruction effort allowed us to uncover a number of previously unreported facts about MNIST.\n\n1. There are exactly three duplicate digits in the entire NIST handwritten character collection. Only one of them falls within the segments used to generate MNIST, and it was removed by the MNIST authors.\n\n2. The first 5001 images of the MNIST test set seem randomly picked from those written by writers #2350-#2599, all high school students. The next 4999 images are the consecutive NIST images #35,000-#39,998, in this order, written by only 48 Census Bureau employees, writers #326-#373, as shown in Figure 5. Although this small number could make us fear for statistical significance, these comparatively very clean images contribute little to the total test error.\n\n\f3. Even-numbered images among the first 58,100 MNIST training set samples exactly match the digits written by writers #2100-#2349, all high school students, in random order. The remaining images are the NIST images #0 to #30949, in that order. The beginning of this sequence is visible in Figure 2. 
Therefore, half of the images found in a typical minibatch of consecutive MNIST training images are likely to have been written by the same writer. We can only recommend shuffling the training set before assembling the minibatches.\n\n4. There is a rounding error in the final centering of the 28x28 MNIST images. The average center of mass of the MNIST digits is in fact located half a pixel away from the geometrical center of the image. This is important because training on correctly centered images yields substantially worse performance on the standard MNIST testing set.\n\n5. A slight defect in the MNIST resampling code generates low-amplitude periodic patterns in the dark areas of thick characters. These patterns, illustrated in Figure 3, can be traced to a 0.99 fudge factor that is still visible in the Lush legacy code.5 Since the period of these patterns depends on the sizes of the input images passed to the resampling code, we were able to determine that the small NIST images were not upsampled by directly calling the resampling code, but by first doubling their resolution, then downsampling to size 20x20.\n6. Converting the continuous-valued pixels of the subsampled images into integer-valued pixels is delicate. Our code linearly maps the range observed in each image to the interval [0.0,255.0], rounding to the closest integer. Comparing the pixel histograms (see Figure 4) reveals that MNIST has substantially more pixels with value 128 and fewer pixels with value 255. We could not think of a plausibly simple algorithm compatible with this observation.\n\n3 Generalization Experiments\n\nThis section takes advantage of the reconstruction of the lost 50,000 testing samples to revisit some MNIST performance results reported during the last twenty-five years. Recht et al. 
[2018, 2019] perform a similar study on the CIFAR10 and ImageNet datasets and identify very interesting trends. However, they also explain that they cannot fully ascertain how closely the distribution of the reconstructed dataset matches the distribution of the original dataset, raising the possibility of the reconstructed dataset being substantially harder than the original. Because the published MNIST test set was subsampled from a larger set, we have a much tighter control of the data distribution and can confidently confirm their findings.\nBecause the MNIST testing error rates are usually low, we start with a careful discussion of the computation of confidence intervals and of the statistical significance of error comparisons in the context of repeated experiments. We then report on MNIST results for several methods: k-nearest neighbors (KNN), support vector machines (SVM), multilayer perceptrons (MLP), and several flavors of convolutional networks (CNN).\n\n3.1 About confidence intervals\n\nSince we want to know whether the actual performance of a learning system differs from the performance estimated using an overused testing set with run-of-the-mill confidence intervals, all confidence intervals reported in this work were obtained using the classic Wald method: when we observe n1 misclassifications out of n independent samples, the error rate $\\nu = n_1/n$ is reported with confidence $1-\\eta$ as\n\n$\\nu \\pm z \\sqrt{\\nu(1-\\nu)/n}$,   (1)\n\nwhere $z = \\sqrt{2}\\,\\mathrm{erfc}^{-1}(\\eta)$ is approximately equal to 2 for a 95% confidence interval. For instance, an error rate close to 1.0% measured on the usual 10,000 test examples is reported as a 1% \u00b1 0.2% error rate, that is, 100 \u00b1 20 misclassifications. 
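As a sketch (our own illustration, not the authors' code; `wald_interval` is our name), the Wald interval of equation (1) can be computed with the standard library alone, using the fact that sqrt(2)*erfc^-1(eta) equals the (1 - eta/2) quantile of the standard normal distribution:

```python
from statistics import NormalDist

def wald_interval(n1, n, eta=0.05):
    """Wald interval (eq. 1): nu +/- z*sqrt(nu*(1-nu)/n) with confidence 1-eta."""
    nu = n1 / n
    # sqrt(2)*erfc^-1(eta) is exactly the (1 - eta/2) quantile of N(0, 1).
    z = NormalDist().inv_cdf(1 - eta / 2)
    half = z * (nu * (1 - nu) / n) ** 0.5
    return nu - half, nu + half

lo, hi = wald_interval(100, 10000)   # 100 errors out of 10,000 samples
# The half-width is close to 0.2%, matching the 1% +/- 0.2% example in the text.
```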
This approach is widely used despite the fact that it only holds for a single use of the testing set and that it relies on an imperfect central limit approximation.\nThe simplest way to account for repeated uses of the testing set is the Bonferroni correction [Bonferroni, 1936], that is, dividing $\\eta$ by the number K of potential experiments, simultaneously defined\n\n5See https://tinyurl.com/y5z7abyt\n\n\fFigure 3: We have reproduced a defect of the original resampling code that creates low-amplitude periodic patterns in the dark areas of thick characters.\n\nFigure 4: Histogram of pixel values in range 1-255 in the MNIST (red dots) and QMNIST (blue line) training set. Logarithmic scale.\n\nFigure 5: Histogram of Writer IDs and Number of digits written by the writer in MNIST Train, MNIST Test 10K and QMNIST Test 50K sets.\n\nbefore performing any measurement. Although relaxing this simultaneity constraint progressively requires all the apparatus of statistical learning theory [Vapnik, 1982, \u00a76.3], the correction still takes the form of a divisor K applied to the confidence level $\\eta$. Because of the asymptotic properties of the erfc function, the width of the actual confidence intervals essentially grows like $\\sqrt{\\log K}$.\nIn order to complete this picture, one also needs to take into account the benefits of using the same testing set. Ordinary confidence intervals are overly pessimistic when we merely want to know whether a first classifier with error rate $\\nu_1 = n_1/n$ is worse than a second classifier with error rate $\\nu_2 = n_2/n$. 
Because these error rates are measured on the same test samples, we can instead rely on a pairing argument: the first classifier can be considered worse with confidence $1-\\eta$ when\n\n$\\nu_1 - \\nu_2 = \\frac{n_{12} - n_{21}}{n} \\ge z \\, \\frac{\\sqrt{n_{12} + n_{21}}}{n}$,   (2)\n\nwhere n12 represents the count of examples misclassified by the first classifier but not the second classifier, n21 is the converse, and $z = \\sqrt{2}\\,\\mathrm{erfc}^{-1}(2\\eta)$ is approximately 1.7 for a 95% confidence. For instance, four additional misclassifications out of 10,000 examples are sufficient to make such a determination. This corresponds to a difference in error rate of 0.04%, roughly ten times smaller than what would be needed to observe disjoint error bars (1). This advantage becomes very significant when combined with a Bonferroni-style correction: K pairwise comparisons remain simultaneously\n\n\fvalid with confidence $1-\\eta$ if all comparisons satisfy\n\n$n_{12} - n_{21} \\ge \\sqrt{2}\\,\\mathrm{erfc}^{-1}\\!\\left(\\frac{2\\eta}{K}\\right) \\sqrt{n_{12} + n_{21}}$.\n\nFor instance, in the realistic situation\n\n$n = 10000$, $n_1 = 200$, $n_{12} = 40$, $n_{21} = 10$, $n_2 = n_1 - n_{12} + n_{21} = 170$,\n\nthe conclusion that classifier 1 is worse than classifier 2 remains valid with confidence 95% as long as it is part of a series of K \u2264 4545 pairwise comparisons. In contrast, after merely K = 50 experiments, the 95% confidence interval for the absolute error rate of classifier 1 is already 2% \u00b1 0.5%, too large to distinguish it from the error rate of classifier 2. 
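The paired decision rule above, with its Bonferroni-style correction, can be sketched as follows (our illustration, with our own function name; we again use the fact that sqrt(2)*erfc^-1(2a) is the (1 - a) quantile of the standard normal):

```python
from statistics import NormalDist

def classifier1_worse(n12, n21, eta=0.05, K=1):
    """Paired test: decide that classifier 1 is worse than classifier 2 with
    confidence 1-eta, valid simultaneously for K such comparisons, when
    n12 - n21 >= sqrt(2) * erfc^-1(2*eta/K) * sqrt(n12 + n21)."""
    # sqrt(2)*erfc^-1(2a) equals the (1 - a) quantile of N(0, 1), here a = eta/K.
    z = NormalDist().inv_cdf(1 - eta / K)
    return n12 - n21 >= z * (n12 + n21) ** 0.5
```

With the example of the text (n12 = 40, n21 = 10), the decision survives a correction for thousands of simultaneous comparisons, whereas four extra errors (n12 = 4, n21 = 0) are just enough for a single uncorrected test.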
We should therefore expect that repeated model selection on the same test set leads to decisions that remain valid far longer than the corresponding absolute error rates.6\n\n3.2 Results\n\nWe report results using two training sets, namely the MNIST training set and the QMNIST reconstructions of the MNIST training digits, and three testing sets, namely the official MNIST testing set with 10,000 samples (MNIST), the reconstruction of the official MNIST testing digits (QMNIST10K), and the reconstruction of the lost 50,000 testing samples (QMNIST50K). We use the names TMTM, TMTQ10, and TMTQ50 to identify results measured on these three testing sets after training on the MNIST training set. Similarly, we use the names TQTM, TQTQ10, and TQTQ50 for results obtained after training on the QMNIST training set and testing on the three test sets. None of these results involves data augmentation or preprocessing steps such as deskewing, noise removal, blurring, jittering, elastic deformations, etc.\nFigure 6 (left plot) reports the testing error rates obtained with KNN for various values of the parameter k, using the MNIST training set as reference points. The QMNIST50K results are slightly worse but within the confidence intervals. The best k determined on MNIST is also the best k for QMNIST50K. Figure 6 (right plot) reports similar results and conclusions when using the QMNIST training set as the reference point.\nFigure 7 reports testing error rates obtained with RBF kernel SVMs after training on the MNIST training set with various values of the hyperparameters C and g. The QMNIST50K results are consistently higher but still fall within the confidence intervals, except maybe for mis-regularized models. 
Again, the hyperparameters achieving the best MNIST performance also achieve the best QMNIST50K performance.\nFigure 8 (left plot) provides similar results for a single-hidden-layer multilayer network with various hidden layer sizes, averaged over five runs. The QMNIST50K results again appear consistently worse than the MNIST test set results. On the one hand, the best QMNIST50K performance is achieved by a network with 1100 hidden units, whereas the best MNIST testing error is achieved by a network with 700 hidden units. On the other hand, all networks with 300 to 1100 hidden units perform very similarly on both MNIST and QMNIST50K, as can be seen in the plot. A 95% confidence interval paired test on representative runs reveals no statistically significant differences between the MNIST test performances of these networks. Each point in Figure 8 (right plot) gives the MNIST and QMNIST50K testing error rates of one MLP experiment. This plot includes experiments with several hidden layer sizes and also several minibatch sizes and learning rates. We were only able to replicate the 1.6% error rate reported by Le Cun et al. [1998] using minibatches of five or fewer examples.\nFinally, Figure 9 summarizes all the experiments reported above. It also includes several flavors of convolutional networks: the Lenet5 results were already presented in Table 3; the VGG-11 [Simonyan and Zisserman, 2014] and ResNet-18 [He et al., 2016] results are representative of the modern CNN architectures currently popular in computer vision. We also report results obtained using four models from the TF-KR MNIST challenge.7 Model TFKR-a8 is an ensemble of two VGG-like and one ResNet-like models trained with an augmented version of the MNIST training set. 
\n6See [Feldman et al., 2019] for a different perspective on this issue.\n7https://github.com/hwalsuklee/how-far-can-we-go-with-MNIST\n8TFKR-a: https://github.com/khanrc/mnist\n\n\fFigure 6: KNN error rates for various values of k using either the MNIST (left plot) or QMNIST (right plot) training sets. Red circles: testing on MNIST. Blue triangles: testing on its QMNIST counterpart. Green stars: testing on the 50,000 new QMNIST testing examples.\n\nFigure 7: SVM error rates for various values of the regularization parameter C (left plot) and the RBF kernel parameter g (right plot) after training on the MNIST training set, using the same colors and symbols as Figure 6.\n\nFigure 8: Left plot: MLP error rates for various hidden layer sizes after training on MNIST, using the same colors and symbols as Figure 6. Right plot: scatter plot comparing the MNIST and QMNIST50K testing errors for all our MLP experiments.\n\n\fFigure 9: Scatter plot comparing the MNIST and QMNIST50K testing performance of all the models trained on MNIST during the course of this study.\n\nModels TFKR-b9, TFKR-c10, and TFKR-d11 are single CNN models with varied architectures. This scatter plot shows that the QMNIST50K error rates are consistently slightly higher than the MNIST testing errors. However, the plot also shows that comparing the MNIST testing set performances of various models provides a near-perfect ranking of the corresponding QMNIST50K performances. In particular, the best performing model on MNIST, TFKR-a, remains the best performing model on QMNIST50K.\n\n4 Conclusion\n\nWe have recreated a close approximation of the MNIST preprocessing chain. Not only did we track each MNIST digit to its NIST source image and associated metadata, but we also recreated the original MNIST test set, including the 50,000 samples that were never distributed. 
These fresh testing samples allow us to investigate how the results reported on a standard testing set suffer from repeated experimentation. Our results confirm the trends observed by Recht et al. [2018, 2019], albeit on a different dataset and in a substantially more controlled setup. All these results essentially show that the \u201ctesting set rot\u201d problem exists but is far less severe than feared. Although the repeated usage of the same testing set impacts absolute performance numbers, it also delivers pairing advantages that help model selection in the long run. In practice, this suggests that a shifting data distribution is far more dangerous than overusing an adequately distributed testing set.\n\n9TFKR-b: https://github.com/bart99/tensorflow/tree/master/mnist\n10TFKR-c: https://github.com/chaeso/dnn-study\n11TFKR-d: https://github.com/ByeongkiJeong/MostAccurableMNIST_keras\n\n\fAcknowledgments\n\nWe thank Chris Burges, Corinna Cortes, and Yann LeCun for the precious information they were able to share with us about the birth of MNIST. We thank Larry Jackel for instigating the whole MNIST project and for commenting on this \"cold case\". We thank Maithra Raghu for pointing out how QMNIST could be used to corroborate the results of Recht et al. [2019]. We thank Ben Recht, Ludwig Schmidt and Roman Werpachowski for their constructive comments.\n\nReferences\n\nCarlo E. Bonferroni. Teoria statistica delle classi e calcolo delle probabilit\u00e0. Pubblicazioni del R. Istituto superiore di scienze economiche e commerciali di Firenze. Libreria internazionale Seeber, 1936.\n\nL\u00e9on Bottou and Yann Le Cun. SN: A simulator for connectionist models. In Proceedings of NeuroNimes 88, pages 371\u2013382, Nimes, France, 1988.\n\nL\u00e9on Bottou and Yann LeCun. Lush Reference Manual. http://lush.sf.net/doc, 2001.\n\nL\u00e9on Bottou, Corinna Cortes, John S. Denker, Harris Drucker, Isabelle Guyon, Lawrence D. Jackel, Yann Le Cun, Urs A. 
Muller, Eduard S\u00e4ckinger, Patrice Simard, and Vladimir Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Conference B: Computer Vision & Image Processing, volume 2, pages 77\u201382, Jerusalem, October 1994. IEEE.\n\nVitaly Feldman, Roy Frostig, and Moritz Hardt. The advantages of multiple classes for reducing overfitting from test set reuse. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1892\u20131900. PMLR, 2019.\n\nPatrick J. Grother and Kayee K. Hanaoka. NIST Special Database 19: Handprinted forms and characters database. https://www.nist.gov/srd/nist-special-database-19, 1995. SD1 was released in 1990, SD3 and SD7 in 1992, SD19 in 1995, SD19 2nd edition in 2016.\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\nYann Le Cun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\nYann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1994. MNIST was created in 1994 and released in 1998.\n\nBenjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint arXiv:1806.00451, 2018.\n\nBenjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.\n\nKaren Simonyan and Andrew Zisserman. 
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\nV. N. Vapnik. Estimation of dependences based on empirical data. Springer Series in Statistics. Springer Verlag, Berlin, New York, 1982.\n", "award": [], "sourceid": 7431, "authors": [{"given_name": "Chhavi", "family_name": "Yadav", "institution": "NYU"}, {"given_name": "Leon", "family_name": "Bottou", "institution": "Facebook AI Research"}]}