{"title": "Statistical Model Criticism using Kernel Two Sample Tests", "book": "Advances in Neural Information Processing Systems", "page_first": 829, "page_last": 837, "abstract": "We propose an exploratory approach to statistical model criticism using maximum mean discrepancy (MMD) two sample tests. Typical approaches to model criticism require a practitioner to select a statistic by which to measure discrepancies between data and a statistical model. MMD two sample tests are instead constructed as an analytic maximisation over a large space of possible statistics and therefore automatically select the statistic which most shows any discrepancy. We demonstrate on synthetic data that the selected statistic, called the witness function, can be used to identify where a statistical model most misrepresents the data it was trained on. We then apply the procedure to real data where the models being assessed are restricted Boltzmann machines, deep belief networks and Gaussian process regression and demonstrate the ways in which these models fail to capture the properties of the data they are trained on.", "full_text": "Statistical Model Criticism\n\nusing Kernel Two Sample Tests\n\nJames Robert Lloyd\n\nDepartment of Engineering\nUniversity of Cambridge\n\nZoubin Ghahramani\n\nDepartment of Engineering\nUniversity of Cambridge\n\nAbstract\n\nWe propose an exploratory approach to statistical model criticism using maximum\nmean discrepancy (MMD) two sample tests. Typical approaches to model criti-\ncism require a practitioner to select a statistic by which to measure discrepancies\nbetween data and a statistical model. MMD two sample tests are instead con-\nstructed as an analytic maximisation over a large space of possible statistics and\ntherefore automatically select the statistic which most shows any discrepancy. 
We demonstrate on synthetic data that the selected statistic, called the witness function, can be used to identify where a statistical model most misrepresents the data it was trained on. We then apply the procedure to real data where the models being assessed are restricted Boltzmann machines, deep belief networks and Gaussian process regression and demonstrate the ways in which these models fail to capture the properties of the data they are trained on.

1 Introduction

Statistical model criticism or checking1 is an important part of a complete statistical analysis. When one fits a linear model to a data set a complete analysis includes computing e.g. Cook's distances [3] to identify influential points or plotting residuals against fitted values to identify non-linearity or heteroscedasticity. Similarly, modern approaches to Bayesian statistics view model criticism as an important component of a cycle of model construction, inference and criticism [4].

As statistical models become more complex and diverse in response to the challenges of modern data sets there will be an increasing need for a greater range of model criticism procedures that are either automatic or widely applicable. This will be especially true as automatic modelling methods [e.g. 5, 6, 7] and probabilistic programming [e.g. 8, 9, 10, 11] mature.

Model criticism typically proceeds by choosing a statistic of interest, computing it on data and comparing this to a suitable null distribution. Ideally these statistics are chosen to assess the utility of the statistical model under consideration (see applied examples [e.g. 4]) but this can require considerable expertise on the part of the modeller. We propose an alternative to this manual approach by using a statistic defined as a supremum over a broad class of measures of discrepancy between two distributions, the maximum mean discrepancy (MMD) [e.g. 12].
The advantage of this approach\nis that the discrepancy measure attaining the supremum automatically identi\ufb01es regions of the data\nwhich are most poorly represented by the statistical model \ufb01t to the data.\nWe demonstrate MMD model criticism on toy examples, restricted Boltzmann machines and deep\nbelief networks trained on MNIST digits and Gaussian process regression models trained on several\ntime series. Our proposed method identi\ufb01es discrepancies between the data and \ufb01tted models that\nwould not be apparent from predictive performance focused metrics. It is our belief that more effort\nshould be expended on attempting to falsify models \ufb01tted to data, using model criticism techniques\nor otherwise. Not only would this aid research in targeting areas for improvement but it would give\ngreater con\ufb01dence in any conclusions drawn from a model.\n\n1We follow Box [1] using the term \u2018model criticism\u2019 for similar reasons to O\u2019Hagan [2].\n\n1\n\n\f2 Model criticism\n\nSuppose we observe data Y obs = (yobs\ni )i=1...n and we attempt to \ufb01t a model M with parameters\n\u03b8. After performing a statistical analysis we will have either an estimate, \u02c6\u03b8, or an (approximate)\nposterior, p(\u03b8 | Y obs, M ), for the parameters. How can we check whether any aspects of the data\nwere poorly modelled?\n\nCriticising prior assumptions The classical approach to model criticism is to attempt to falsify\nthe null hypothesis that the data could have been generated by the model M for some value of the\nparameters \u03b8 i.e. Y obs \u223c p(Y | \u03b8, M ). This is typically achieved by constructing a statistic T of the\ndata whose distribution does not depend on the parameters \u03b8 i.e. a pivotal quantity. 
The extent to which the observed data Y^obs differs from expectations under the model M can then be quantified with a tail-area based p-value

p_freq(Y^obs) = P(T(Y) ≥ T(Y^obs)) where Y ∼ p(Y | θ, M) for any θ. (2.1)

Analogous quantities in a Bayesian analysis are the prior predictive p-values of Box [1]. The null hypothesis is replaced with the claim that the data could have been generated from the prior predictive distribution Y^obs ∼ ∫ p(Y | θ, M) p(θ | M) dθ. A tail-area p-value can then be constructed for any statistic T of the data

p_prior(Y^obs) = P(T(Y) ≥ T(Y^obs)) where Y ∼ ∫ p(Y | θ, M) p(θ | M) dθ. (2.2)

Both of these procedures construct a function of the data p(Y^obs) whose distribution under a suitable null hypothesis is uniform i.e. a p-value. The p-value quantifies how surprising it would be for the data Y^obs to have been generated by the model. The different null hypotheses reflect the different uses of the word 'model' in frequentist and Bayesian analyses. A frequentist model is a class of probability distributions over data indexed by parameters whereas a Bayesian model is a joint probability distribution over data and parameters.

Criticising estimated models or posterior distributions A contrasting method of Bayesian model criticism is the calculation of posterior predictive p-values p_post [e.g. 13, 14] where the prior predictive distribution in (2.2) is replaced with the posterior predictive distribution Y ∼ ∫ p(Y | θ, M) p(θ | Y^obs, M) dθ. The corresponding test for an analysis resulting in a point estimate of the parameters θ̂ would use the plug-in predictive distribution Y ∼ p(Y | θ̂, M) to form the plug-in p-value p_plug.

These p-values quantify how surprising the data Y^obs is even after having observed it.
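A plug-in p-value of this kind can be estimated by straightforward Monte Carlo simulation from the fitted model. The sketch below is our own minimal illustration, not code from the paper; the statistic `T`, the `sampler`, and all names are placeholder assumptions:

```python
import numpy as np

def plugin_p_value(T, y_obs, sampler, n_rep=1000, seed=None):
    """Monte Carlo estimate of p_plug = P(T(Y) >= T(Y_obs)) where
    Y ~ p(Y | theta_hat, M).  T maps a data set to a scalar statistic;
    sampler(rng) draws one replicate data set from the plug-in predictive."""
    rng = np.random.default_rng(seed)
    t_obs = T(y_obs)
    t_rep = np.array([T(sampler(rng)) for _ in range(n_rep)])
    # add-one smoothing keeps the estimate strictly inside (0, 1)
    return (1 + np.sum(t_rep >= t_obs)) / (1 + n_rep)

# Hypothetical usage: data centred at 5 is very surprising under a N(0, 1)
# model when the statistic is the sample mean, so the p-value is tiny.
y_obs = np.random.default_rng(0).normal(loc=5.0, size=100)
p = plugin_p_value(np.mean, y_obs, lambda r: r.normal(size=100), seed=1)
```

The same template gives prior or posterior predictive p-values by swapping the sampler for draws from the corresponding predictive distribution.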
A simple variant of this method of model criticism is to use held out data Y*, generated from the same distribution as Y^obs, to compute a p-value i.e. p(Y*) = P(T(Y) ≥ T(Y*)). This quantifies how surprising the held out data is after having observed Y^obs.

Which type of model criticism should be used? Different forms of model criticism are appropriate in different contexts, but we believe that posterior predictive and plug-in p-values will be most often useful for highly flexible models. For example, suppose one is fitting a deep belief network to data. Classical p-values would assume a null hypothesis that the data could have been generated from some deep belief network. Since the space of all possible deep belief networks is very large it will be difficult to ever falsify this hypothesis. A more interesting null hypothesis to test in this example is whether or not our particular deep belief network can faithfully mimic the distribution of the sample it was trained on. This is the null hypothesis of posterior or plug-in p-values.

3 Model criticism using maximum mean discrepancy two sample tests

We assume that our data Y^obs are i.i.d. samples from some distribution, (y^obs_i)_{i=1...n} ∼ iid p(y | θ, M). After performing inference resulting in a point estimate of the parameters θ̂, the null hypothesis associated with a plug-in p-value is (y^obs_i)_{i=1...n} ∼ iid p(y | θ̂, M). We can test this null hypothesis using a two sample test [e.g. 15, 16]. In particular, we have samples of data (y^obs_i)_{i=1...n} and we can generate samples from the plug-in predictive distribution (y^rep_i)_{i=1...m} ∼ iid p(y | θ̂, M) and then test whether or not these samples could have been generated from the same distribution.
For consistency with the two sample testing literature we now switch notation; suppose we have samples X = (x_i)_{i=1...m} and Y = (y_i)_{i=1...n} drawn i.i.d. from distributions p and q respectively. The two sample problem asks if p = q.

A way of answering the two sample problem is to consider maximum mean discrepancy (MMD) [e.g. 12] statistics

MMD(F, p, q) = sup_{f ∈ F} (E_{x∼p}[f(x)] − E_{y∼q}[f(y)]) (3.1)

where F is a set of functions. When F is a reproducing kernel Hilbert space (RKHS) the function attaining the supremum can be derived analytically and is called the witness function

f(x) = E_{x′∼p}[k(x, x′)] − E_{x′∼q}[k(x, x′)] (3.2)

where k is the kernel of the RKHS. Substituting (3.2) into (3.1) and squaring yields

MMD²(F, p, q) = E_{x,x′∼p}[k(x, x′)] − 2 E_{x∼p, y∼q}[k(x, y)] + E_{y,y′∼q}[k(y, y′)]. (3.3)

This expression only involves expectations of the kernel k which can be estimated empirically by

MMD²_b(F, X, Y) = (1/m²) Σ_{i,j=1}^{m} k(x_i, x_j) − (2/mn) Σ_{i,j=1}^{m,n} k(x_i, y_j) + (1/n²) Σ_{i,j=1}^{n} k(y_i, y_j). (3.4)

One can also estimate the witness function from finite samples

f̂(x) = (1/m) Σ_{i=1}^{m} k(x, x_i) − (1/n) Σ_{i=1}^{n} k(x, y_i) (3.5)

i.e. the empirical witness function is the difference of two kernel density estimates [e.g. 17, 18]. This means that we can interpret the witness function as showing where the estimated densities of p and q are most different.
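The estimators (3.4) and (3.5) are short to implement. The sketch below is our own illustration with an RBF kernel and a fixed lengthscale; it estimates the null distribution by randomly re-partitioning the pooled samples (a permutation variant, whereas the paper uses the bootstrap method of [12] with a cross-validated lengthscale):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 ell^2))."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * ell**2))

def mmd2_b(X, Y, ell=1.0):
    """Biased empirical MMD^2 of equation (3.4)."""
    return rbf(X, X, ell).mean() - 2 * rbf(X, Y, ell).mean() + rbf(Y, Y, ell).mean()

def witness(x, X, Y, ell=1.0):
    """Empirical witness function of equation (3.5) at query points x:
    the difference of two kernel density style estimates."""
    return rbf(x, X, ell).mean(1) - rbf(x, Y, ell).mean(1)

def mmd_p_value(X, Y, ell=1.0, n_perm=500, seed=None):
    """p-value for the two sample test; the null distribution of MMD^2 is
    estimated by randomly re-partitioning the pooled samples."""
    rng = np.random.default_rng(seed)
    observed = mmd2_b(X, Y, ell)
    Z, m = np.concatenate([X, Y]), len(X)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        null.append(mmd2_b(Z[idx[:m]], Z[idx[m:]], ell))
    return (1 + np.sum(np.array(null) >= observed)) / (1 + n_perm)
```

The witness function is positive where the first sample's estimated density exceeds the second's, which is what makes it useful for locating discrepancies.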
While MMD two sample tests are well known in the literature the main\ncontribution of this work is to show that this interpretability of the witness function makes them a\nuseful tool as an exploratory form of statistical model criticism.\n\n4 Examples on toy data\n\nTo illustrate the use of the MMD two sample test as a tool for model criticism we demonstrate its\nproperties on two simple datasets and models.\n\nNewcomb\u2019s speed of light data A histogram of Simon Newcomb\u2019s 66 measurements used to\ndetermine the speed of light [19] is shown on the left of \ufb01gure 1. We \ufb01t a normal distribution to this\ndata by maximum likelihood and ask whether this model is a faithful representation of the data.\n\nFigure 1: Left: Histogram of Simon Newcomb\u2019s speed of light measurements. Middle: Histogram\ntogether with density estimate (red solid line) and MMD witness function (green dashed line). Right:\nHistogram together with updated density estimate and witness function.\n\nWe sampled 1000 points from the \ufb01tted distribution and performed an MMD two sample test using\na radial basis function kernel2. The estimated p-value of the test was less than 0.001 i.e. a clear\ndisparity between the model and data.\nThe data, \ufb01tted density estimate (normal distribution) and witness function are shown in the middle\nof \ufb01gure 1. The witness function has a trough at the centre of the data and peaks either side indicating\nthat the \ufb01tted model has placed too little mass in its centre and too much mass outside its centre.\n\n2 Throughout this paper we estimate the null distribution of the MMD statistic using the bootstrap method\ndescribed in [12] using 1000 replicates. 
We use a radial basis function kernel and select the lengthscale by 5-fold cross validation using predictive likelihood of the kernel density estimate as the selection criterion.

This suggests that we should modify our model by either using a distribution with heavy tails or explicitly modelling the possibility of outliers. However, to demonstrate some of the properties of the MMD two sample test we make an unusual choice of fitting a Gaussian by maximum likelihood, but ignoring the two outliers in the data. The new fitted density estimate (the normal distribution) and witness function of an MMD test are shown on the right of figure 1. The estimated p-value associated with the MMD two sample test is roughly 0.5 despite the fitted model being a very poor explanation of the outliers.

The nature of an MMD test depends on the kernel defining the RKHS in equation (3.1). In this paper we use the radial basis function kernel which encodes for smooth functions with a typical lengthscale [e.g. 20]. Consequently the test identifies 'dense' discrepancies, only identifying outliers if the model and inference method are not robust to them. This is not a failure; a test that can identify too many types of discrepancy would have low statistical power (see [12] for discussion of the power of the MMD test and alternatives).

High dimensional data The interpretability of the witness function comes from its being equal to the difference of two kernel density estimates.
In high dimensional spaces, kernel density estima-\ntion is a very high variance procedure that can result in poor density estimates which destroy the\ninterpretability of the method. In response, we consider using dimensionality reduction techniques\nbefore performing two sample tests.\nWe generated synthetic data from a mixture of 4 Gaussians and a t-distribution in 10 dimensions3.\nWe then \ufb01t a mixture of 5 Gaussians and performed an MMD two sample test. We reduced the di-\nmensionality of the data using principal component analysis (PCA), selecting the \ufb01rst two principal\ncomponents. To ensure that the MMD test remains well calibrated we include the PCA dimensional-\nity reduction within the bootstrap estimation of the null distribution. The data and plug-in predictive\nsamples are plotted on the left of \ufb01gure 2. While we can see that one cluster is different from the\nrest, it is dif\ufb01cult to assess by eye if these distributions are different \u2014 due in part to the dif\ufb01culty\nof plotting two sets of samples on top of each other.\n\nFigure 2: Left: PCA projection of synthetic high dimensional cluster data (green circles) and projec-\ntion of samples from \ufb01tted model (red circles). Right: Witness function of MMD model criticism.\nThe poorly \ufb01t cluster is clearly identi\ufb01ed.\n\nThe MMD test returns a p-value of 0.05 and the witness function (right of \ufb01gure 2) clearly identi\ufb01es\nthe cluster that has been incorrectly modelled. Presented with this discrepancy a statistical modeller\nmight try a more \ufb02exible clustering model [e.g. 21, 22]. 
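The dimensionality reduction step just described can be sketched as follows. This is our own minimal illustration (the function name is ours), projecting both samples with a single PCA basis fit on the pooled data; in the full procedure the projection is refit inside every bootstrap replicate, as noted above, to keep the test calibrated:

```python
import numpy as np

def pca_project(X, Y, k=2):
    """Project both samples onto the top-k principal components of the
    pooled, centred data.  Refit this inside each bootstrap replicate
    when estimating the null distribution of the test."""
    Z = np.concatenate([X, Y])
    mean = Z.mean(axis=0)
    # right singular vectors of the centred pooled data are the principal axes
    _, _, Vt = np.linalg.svd(Z - mean, full_matrices=False)
    W = Vt[:k].T
    return (X - mean) @ W, (Y - mean) @ W
```

Fitting the projection on the pooled data (rather than on one sample) treats both samples symmetrically, which matters for the exchangeability argument behind the resampled null distribution.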
The p-value of the MMD statistic can also be made non-significant by fitting a mixture of 10 Gaussians; this is a sufficient approximation to the t-distribution such that no discrepancy can be detected with the amount of data available.

5 What exactly do neural networks dream about?

"To recognize shapes, first learn to generate images" quoth Hinton [23]. Restricted Boltzmann Machine (RBM) pretraining of neural networks was shown by [24] to learn a deep belief network (DBN) for the data i.e. a generative model. In agreement with this observation, as well as computing estimates of marginal likelihoods and testing errors, it is standard to demonstrate the effectiveness of a generative neural network by generating samples from the distribution it has learned.

3For details see code at [redacted]

When trained on the MNIST handwritten digit data, samples from RBMs (see figure 3a for random samples4) and DBNs certainly look like digits, but it is hard to detect any systematic anomalies purely by visual inspection. We now use MMD model criticism to investigate how faithfully RBMs and DBNs can capture the distribution over handwritten digits.

RBMs can consistently mistake the identity of digits We trained an RBM with architecture (784) ↔ (500) ↔ (10)5 using 15 epochs of persistent contrastive divergence (PCD-15), a batch size of 20 and a learning rate of 0.1 (i.e. we used the same settings as the code available at the deep learning tutorial [25]). We generated 3000 independent samples from the learned generative model by initialising the network with a random training image and performing 1000 Gibbs updates with the digit labels clamped6 to generate each image (as in e.g. 
[23]). Since we generated digits from the class conditional distributions we compare each class separately. Rather than show plots of the witness function for each digit we summarise the witness function by examples of digits closest to the peaks and troughs of the witness function (the witness function estimate is differentiable so we can find the peaks and troughs by gradient based optimisation). We apply MMD model criticism to each class conditional distribution, using PCA to reduce to 2 dimensions as in section 4.

Figure 3: a) Random samples from an RBM. b) Peaks of the witness function for the RBM (digits that are over-represented by the model). c) Peaks of the witness function for samples from 1500 RBMs (with differently initialised pseudo random number generators during training). d) Peaks of the witness function for the DBN. e) Troughs (digits that are under-represented by the model) of the witness function for samples from 1500 RBMs. f) Troughs of the witness function for the DBN.

Figure 3b shows the digits closest to the two most extreme peaks of the witness function for each class; the peaks indicate where the fitted distribution over-represents the distribution of true digits. The estimated p-value for all tests was less than 0.001. The most obvious problem with these digits is that the first 2 and 3 look quite similar.

To test that this was not just a single unlucky RBM, we trained 1500 RBMs (with differently initialised pseudo random number generators) and generated one sample from each and performed the same tests. The estimated p-values were again all less than 0.001 and the summaries of the peaks of the witness function are shown in figure 3c. On the first toy data example we observed that the MMD statistic does not highlight outliers and therefore we can conclude that RBMs are making consistent mistakes e.g. 
generating a 0 from the 7 distribution or a 5 when it should have been generating an 8.

DBNs have nightmares about ghosts We now test the effectiveness of deep learning to represent the distribution of MNIST digits. In particular, we fit a DBN with architecture (784) ← (500) ← (500) ↔ (2000) ↔ (10) using RBM pre-training and a generative fine tuning algorithm described in [24]. Performing the same tests with 3000 samples results in estimated p-values of less than 0.001 except for the digit 4 (0.150) and digit 7 (0.010). Summaries of the witness function peaks are shown in figure 3d.

4 Specifically these are the activations of the visible units before sampling binary values. This procedure is an attempt to be consistent with the grayscale input distribution of the images. Analogous discrepancies would be discovered if we had instead sampled binary pixel values.
5 That is, 784 input pixels and 10 indicators of the class label are connected to 500 hidden neurons.
6 Without clamping the label neurons, the generative distribution is heavily biased towards certain digits.

The witness function no longer shows any class label mistakes (except perhaps for the digit 1 which looks very peculiar) but the 2, 3, 7 and 8 appear 'ghosted' — the digits fade in and out. For comparison, figure 3f shows digits closest to the troughs of the witness function; there is no trace of ghosting. This discrepancy could be due to errors in the autoassociative memory of a DBN propagating down the hidden layers resulting in spurious features in several visible neurons.

6 An extension to non i.i.d. data

We now describe how the MMD statistic can be used for model criticism of non i.i.d. predictive distributions. 
In particular we construct a model criticism procedure for regression models.

We assume that our data consists of pairs of inputs and outputs (x^obs_i, y^obs_i)_{i=1...n}. A typical formulation of the problem of regression is to estimate the conditional distribution of the outputs given the inputs p(y | x, θ). Ignoring that our data are not i.i.d. we can generate data from the plug-in conditional distribution y^rep_i ∼ p(y | x^obs_i, θ̂) and compute the empirical MMD estimate (3.4) between (x^obs_i, y^obs_i)_{i=1...n} and (x^obs_i, y^rep_i)_{i=1...n}. The only difference between this test and the MMD two sample test is that our data is generated from a conditional distribution, rather than being i.i.d. The null distribution of this statistic can be trivially estimated by sampling several sets of replicate data from the plug-in predictive distribution.

To demonstrate this test we apply it to 4 regression algorithms and 13 time series analysed in [7]. In this work the authors compare several methods for constructing Gaussian process [e.g. 20] regression models. Example data sets are shown in figures 4 and 5. While it is clear that simple methods will fail to capture all of the structure in this data, it is not clear a priori how much better the more advanced methods will fare.

To construct p-values we use held out data using the same split of training and testing data as the interpolation experiment in [7]7. Table 1 shows a table of p-values for 13 data sets and 4 regression methods. The four methods are linear regression (Lin), Gaussian process regression using a squared exponential kernel (SE), spectral mixture kernels [26] (SP) and the method proposed in [7] (ABCD). 
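The regression variant just described can be sketched as follows. This is our own illustration with invented names and a simple stand-in conditional model (the paper's experiments use Gaussian process posteriors); the null distribution is estimated by comparing independent replicate sets to each other, as the text suggests:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * ell**2))

def mmd2(X, Y, ell=1.0):
    return rbf(X, X, ell).mean() - 2 * rbf(X, Y, ell).mean() + rbf(Y, Y, ell).mean()

def criticise_regression(x, y, sample_y_rep, n_null=100, seed=None):
    """Compare observed pairs (x_i, y_i) with replicate pairs (x_i, y_rep_i)
    drawn from the fitted conditional via MMD^2.  The null distribution is
    estimated by comparing independent replicate sets to each other."""
    rng = np.random.default_rng(seed)
    obs = np.column_stack([x, y])
    rep = lambda: np.column_stack([x, sample_y_rep(x, rng)])
    observed_stat = mmd2(obs, rep())
    null = np.array([mmd2(rep(), rep()) for _ in range(n_null)])
    return (1 + np.sum(null >= observed_stat)) / (1 + n_null)
```

Because the inputs x are shared by both sets of pairs, any detected discrepancy must come from the conditional distribution of the outputs.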
Values in bold\nindicate a positive discovery after a Benjamini\u2013\nHochberg [27] procedure with a false discovery\nrate of 0.05 applied to each model construction\nmethod.\nWe now investigate the type of discrepancies found by this test by looking at the witness function\n(which can still be interpreted as the difference of kernel density estimates). Figure 4 shows the\nsolar and gas production data sets, the posterior distribution of the SE \ufb01ts to this data and the witness\nfunctions for the SE \ufb01t. The solar witness function has a clear narrow trough, indicating that the data\nis more dense than expected by the \ufb01tted model in this region. We can see that this has identi\ufb01ed a\nregion of low variability in the data i.e. it has identi\ufb01ed local heteroscedasticity not captured by the\nmodel. Similar conclusions can be drawn about the gas production data and witness function.\nOf the four methods compared here, only ABCD is able to model heteroscedasticity, explaining\nwhy it is the only method with a substantially different set of signi\ufb01cant p-values. 
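The multiple testing correction applied to table 1 is a short step-up procedure. The following is our own implementation of Benjamini–Hochberg [27]:

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Boolean array marking discoveries under the Benjamini-Hochberg
    step-up procedure at the given false discovery rate."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        # largest i (1-based) with p_(i) <= (i/m) * fdr; reject all smaller p
        k = np.max(np.nonzero(below)[0])
        discoveries[order[:k + 1]] = True
    return discoveries
```

Applied to the ABCD column of table 1 this yields four discoveries, consistent with the four datasets on which the text says the procedure is still potentially failing to capture structure.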
However, the procedure is still potentially failing to capture structure on four of the datasets.

Table 1: Two sample test p-values applied to 13 time series and 4 regression algorithms. Bold values indicate a positive discovery using a Benjamini–Hochberg procedure with a false discovery rate of 0.05 for each method.

Dataset         Lin    SE     SP     ABCD
Airline         0.34   0.36   0.07   0.15
Solar           0.00   0.00   0.00   0.05
Mauna           0.00   0.99   0.34   0.21
Wheat           0.00   0.00   0.00   0.19
Temperature     0.44   0.54   0.68   0.75
Internet        0.00   0.00   0.05   0.01
Call centre     0.00   0.02   0.00   0.07
Gas production  0.00   0.00   0.00   0.00
Sulphuric       0.00   0.00   0.01   0.11
Unemployment    0.00   0.29   0.34   0.52
Radio           0.00   0.00   0.00   0.01
Births          0.00   0.00   0.00   0.12
Wages           0.00   0.00   0.01   0.00

7 Gaussian processes when applied to regression problems learn a joint distribution of all output values. However this joint distribution information is rarely used; typically only the pointwise conditional distributions p(y | x^obs_i, θ̂) are used as we have done here.

Figure 4: From left to right: Solar data with SE posterior. Witness function of SE fit to solar. Gas production data with SE posterior. Witness function of SE fit to gas production.

Figure 5 shows the unemployment and Internet data sets, the posterior distribution for the ABCD fits to the data and the witness functions of the ABCD fits. The ABCD method has captured much of the structure in these data sets, making it difficult to visually identify discrepancies between model and data. The witness function for unemployment shows peaks and troughs at similar values of the input x. Comparing to the raw data we see that at these input values there are consistent outliers. Since ABCD is based on Gaussianity assumptions these consistent outliers have caused the method to estimate a large variance in this region, when the true data is non-Gaussian.
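Peaks and troughs of an empirical witness function, like the ones discussed here, can be located by gradient based optimisation, as was done for the MNIST summaries in section 5. The following is our own sketch for the RBF kernel; names, step size and iteration count are illustrative assumptions:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * ell**2))

def witness_grad(x, X, Y, ell=1.0):
    """Gradient of the empirical witness function (3.5) at query points x.
    For the RBF kernel, grad_x k(x, z) = k(x, z) * (z - x) / ell^2."""
    Kx = rbf(x, X, ell)
    Ky = rbf(x, Y, ell)
    gx = (Kx[:, :, None] * (X[None] - x[:, None])).mean(1) / ell**2
    gy = (Ky[:, :, None] * (Y[None] - x[:, None])).mean(1) / ell**2
    return gx - gy

def find_peak(x0, X, Y, ell=1.0, step=0.5, n_iter=200):
    """Plain gradient ascent to a local peak of the witness function
    (run on -witness to locate a trough)."""
    x = np.atleast_2d(np.asarray(x0, dtype=float))
    for _ in range(n_iter):
        x = x + step * witness_grad(x, X, Y, ell)
    return x[0]
```

In the MNIST experiments the data points closest to the located extrema were then displayed as summaries of where the model over- or under-represents the data.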
There is also a similar pattern of peaks and troughs on the Internet data suggesting that non-normality has again been detected. Indeed, the data appears to have a hard lower bound which is inconsistent with Gaussianity.

Figure 5: From left to right: Unemployment data with ABCD posterior. Witness function of ABCD fit to unemployment. Internet data with ABCD posterior. Witness function of ABCD fit to Internet.

7 Discussion of model criticism and related work

Are we criticising a particular model, or class of models? In section 2 we interpreted the differences between classical, Bayesian prior/posterior and plug-in p-values as corresponding to different null hypotheses and interpretations of the word 'model'. In particular classical p-values test a null hypothesis that the data could have been generated by a class of distributions (e.g. all normal distributions) whereas all other p-values test a particular probability distribution.

Robins, van der Vaart & Ventura [28] demonstrated that Bayesian and plug-in p-values are not classical p-values (frequentist p-values in their terminology) i.e. they do not have a uniform distribution under the relevant null hypothesis. However, this was presented as a failure of these methods; in particular they demonstrated that methods proposed by Bayarri & Berger [29] based on posterior predictive p-values are asymptotically classical p-values.

This claimed inadequacy of posterior predictive p-values was rebutted [30] and while their usefulness is becoming more accepted (see e.g. introduction of [31]) it would appear there is still confusion on the subject [32]. We hope that our interpretation of the differences between these methods as different null hypotheses — appropriate in different circumstances — sheds further light on the matter.

Should we worry about using the same data for training and criticism? Plug-in and posterior predictive p-values test the null hypothesis that the observed data could have been generated by the fitted model or posterior predictive distribution. In some situations it may be more appropriate to attempt to falsify the null hypothesis that future data will be generated by the plug-in or posterior predictive distribution. As mentioned in section 2 this can be achieved by reserving a portion of the data to be used for model criticism alone, rather than fitting a model or updating a posterior on the full data. Cross validation methods have also been investigated in this context [e.g. 33, 34].

Other methods for evaluating statistical models Other typical methods of model evaluation include estimating the predictive performance of the model, analyses of sensitivities to modelling parameters / priors, graphical tests, and estimates of model utility. For a recent survey of Bayesian methods for model assessment, selection and comparison see [35] which phrases many techniques as estimates of the utility of a model. For some discussion of sensitivity analysis and graphical model comparison see [e.g. 
4].

In this manuscript we have focused on methods that compare statistics of data with predictive distributions, ignoring parameters of the model. The discrepancy measures of [36] compute statistics of data and parameters; examples can be found in [4]. O'Hagan [2] also proposes a method and selectively reviews techniques for model criticism that also take model parameters into account.

In the spirit of scientific falsification [e.g. 37], ideally all methods of assessing a model should be performed to gain confidence in any conclusions made. Of course, when performing multiple hypothesis tests care must be taken in the interpretation of individual p-values.

8 Conclusions and future work

In this paper we have demonstrated an exploratory form of model criticism based on two sample tests using kernel maximum mean discrepancy. In contrast to other methods for model criticism, the test analytically maximises over a broad class of statistics, automatically identifying the statistic which most demonstrates the discrepancy between the model and data. We demonstrated how this method of model criticism can be applied to neural networks and Gaussian process regression and demonstrated the ways in which these models were misrepresenting the data they were trained on.

We have demonstrated an application of MMD two sample tests to model criticism, but they can also be applied to any aspect of statistical modelling where two sample tests are appropriate. This includes for example, Geweke's tests of Markov chain posterior sampler validity [38] and tests of Markov chain convergence [e.g. 39].

The two sample tests proposed in this paper naturally apply to i.i.d. data and models, but model criticism techniques should of course apply to models with other symmetries (e.g. exchangeable data, longitudinal data / time series, graphs, and many others). 
We have demonstrated an adaptation of the MMD test to regression models, but investigating extensions to a greater number of model classes would be a profitable area for future study.
We conclude with a question. Do you know how the model you are currently working with most misrepresents the data it is attempting to model? In proposing a new method of model criticism we hope we have also introduced readers unfamiliar with model criticism to its utility in diagnosing potential inadequacies of a model.

References
[1] George E P Box. Sampling and Bayes' inference in scientific modelling and robustness. J. R. Stat. Soc. Ser. A, 143(4):383–430, 1980.
[2] A O'Hagan. HSSS model criticism. Highly Structured Stochastic Systems, pages 423–444, 2003.
[3] Dennis Cook and Sanford Weisberg. Residuals and influence in regression. Monographs on Statistics and Applied Probability, 1982.
[4] A Gelman, J B Carlin, H S Stern, D B Dunson, A Vehtari, and D B Rubin. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2013.
[5] Roger B Grosse, Ruslan Salakhutdinov, William T Freeman, and Joshua B Tenenbaum. Exploiting compositionality to explore a large space of model structures. In Conf. on Unc. in Art. Int. (UAI), 2012.
[6] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proc. Int. Conf. on Knowledge Discovery and Data Mining, KDD '13, pages 847–855, New York, NY, USA, 2013. ACM.
[7] James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI), July 2014.
[8] D Koller, D McAllester, and A Pfeffer. Effective Bayesian inference for stochastic programs. In
Association for the Advancement of Artificial Intelligence (AAAI), 1997.
[9] B Milch, B Marthi, S Russell, D Sontag, D L Ong, and A Kolobov. BLOG: Probabilistic models with unknown objects. In Proc. Int. Joint Conf. on Artificial Intelligence, 2005.
[10] Noah D Goodman, Vikash K Mansinghka, Daniel M Roy, Keith Bonawitz, and Joshua B Tenenbaum. Church: a language for generative models. In Conf. on Unc. in Art. Int. (UAI), 2008.
[11] Stan Development Team. Stan: A C++ library for probability and sampling, version 2.2, 2014.
[12] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel method for the two-sample problem. Journal of Machine Learning Research, 1:1–10, 2008.
[13] Irwin Guttman. The use of the concept of a future observation in goodness-of-fit problems. J. R. Stat. Soc. Series B Stat. Methodol., 29(1):83–100, 1967.
[14] Donald B Rubin. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Stat., 12(4):1151–1172, 1984.
[15] Harold Hotelling. A generalized t-test and measure of multivariate dispersion. In Proc. 2nd Berkeley Symp. Math. Stat. and Prob. The Regents of the University of California, 1951.
[16] P J Bickel. A distribution free version of the Smirnov two sample test in the p-variate case. Ann. Math. Stat., 40(1):1–23, February 1969.
[17] Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. Ann. Math. Stat., 27(3):832–837, September 1956.
[18] E Parzen. On estimation of a probability density function and mode. Ann. Math. Stat., 1962.
[19] Stephen M Stigler. Do robust estimators work with real data? Ann. Stat., 5(6):1055–1098, November 1977.
[20] C E Rasmussen and C K Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
[21] D Peel and G J McLachlan.
Robust mixture modelling using the t distribution. Stat. Comput., 10(4):339–348, October 2000.
[22] Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani. Warped mixtures for nonparametric cluster shapes. In Conf. on Unc. in Art. Int. (UAI), 2013.
[23] Geoffrey E Hinton. To recognize shapes, first learn to generate images. Prog. Brain Res., 165:535–547, 2007.
[24] Geoffrey E Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, 2006.
[25] Deep learning tutorial - http://www.deeplearning.net/tutorial/, 2014.
[26] Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process covariance kernels for pattern discovery and extrapolation. In Proc. Int. Conf. Machine Learn., 2013.
[27] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol., 1995.
[28] James M Robins, Aad van der Vaart, and Valérie Ventura. Asymptotic distribution of p-values in composite null models. J. Am. Stat. Assoc., 95(452):1143–1156, 2000.
[29] M J Bayarri and J O Berger. Quantifying surprise in the data and model verification. Bayes. Stat., 1999.
[30] Andrew Gelman. A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. Int. Stat. Rev., 2003.
[31] M J Bayarri and M E Castellanos. Bayesian checking of the second levels of hierarchical models. Stat. Sci., 22(3):322–343, August 2007.
[32] Andrew Gelman. Understanding posterior p-values. Elec. J. Stat., 2013.
[33] A E Gelfand, D K Dey, and H Chang. Model determination using predictive distributions with implementation via sampling-based methods. Technical Report 462, Department of Statistics, Stanford University, 1992.
[34] E C Marshall and D J Spiegelhalter. Identifying outliers in Bayesian hierarchical models: a simulation-based approach.
Bayesian Anal., 2(2):409–444, June 2007.
[35] Aki Vehtari and Janne Ojanen. A survey of Bayesian predictive methods for model assessment, selection and comparison. Stat. Surv., 6:142–228, 2012.
[36] Andrew Gelman, Xiao-Li Meng, and Hal Stern. Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sin., 6:733–807, 1996.
[37] K Popper. The logic of scientific discovery. Routledge, 2005.
[38] John Geweke. Getting it right. J. Am. Stat. Assoc., 99(467):799–804, September 2004.
[39] Mary Kathryn Cowles and Bradley P Carlin. Markov chain Monte Carlo convergence diagnostics: A comparative review. J. Am. Stat. Assoc., 91(434):883–904, June 1996.