{"title": "Bias-Corrected Bootstrap and Model Uncertainty", "book": "Advances in Neural Information Processing Systems", "page_first": 521, "page_last": 528, "abstract": "", "full_text": "Bias-Corrected Bootstrap and Model\n\nUncertainty\n\nHarald Steck\u2217\nMIT CSAIL\n\n200 Technology Square\nCambridge, MA 02139\n\nharald@ai.mit.edu\n\nTommi S. Jaakkola\n\nMIT CSAIL\n\n200 Technology Square\nCambridge, MA 02139\n\ntommi@ai.mit.edu\n\nAbstract\n\nThe bootstrap has become a popular method for exploring model\n(structure) uncertainty. Our experiments with arti\ufb01cial and real-\nworld data demonstrate that the graphs learned from bootstrap\nsamples can be severely biased towards too complex graphical mod-\nels. Accounting for this bias is hence essential, e.g., when explor-\ning model uncertainty. We \ufb01nd that this bias is intimately tied to\n(well-known) spurious dependences induced by the bootstrap. The\nleading-order bias-correction equals one half of Akaike\u2019s penalty\nfor model complexity. We demonstrate the e\ufb00ect of this simple\nbias-correction in our experiments. We also relate this bias to the\nbias of the plug-in estimator for entropy, as well as to the di\ufb00er-\nence between the expected test and training errors of a graphical\nmodel, which asymptotically equals Akaike\u2019s penalty (rather than\none half).\n\n1 Introduction\nEfron\u2019s bootstrap is a powerful tool for estimating various properties of a given\nstatistic, most commonly its bias and variance (cf. [5]). It quickly gained popularity\nalso in the context of model selection. When learning the structure of graphical\nmodels from small data sets, like gene-expression data, it has been applied to explore\nmodel (structure) uncertainty [7, 6, 8, 12].\nHowever, the bootstrap procedure also involves various problems (e.g., cf. [4] for an\noverview). 
For instance, in the non-parametric bootstrap, where bootstrap samples D^{(b)} (b = 1, ..., B) are generated by drawing the data points from the given data D with replacement, each bootstrap sample D^{(b)} often contains multiple identical data points, which is a typical property of discrete data. When the given data D is in fact continuous (with a vanishing probability of two data points being identical), e.g., as in gene-expression data, the bootstrap procedure introduces a spurious discreteness in the samples D^{(b)}. A statistic computed from these discrete bootstrap samples may differ from one based on the continuous data D. As noted in [4], however, the effects due to this induced spurious discreteness are typically negligible.\n\n* Now at: ETH Zurich, Institute for Computational Science, 8092 Zurich, Switzerland.\n\nIn this paper, we focus on the spurious dependences induced by the bootstrap procedure, even when given discrete data. We demonstrate that the consequences of those spurious dependences cannot be neglected when exploring model (structure) uncertainty by means of the bootstrap, whether parametric or non-parametric. Graphical models learned from the bootstrap samples are biased towards too complex models, and this bias can be considerably larger than the variability of the graph structure, especially in the interesting case of limited data. As a result, too many edges are present in the learned model structures, and the confidence in the presence of edges is overestimated. This suggests that a bias-corrected bootstrap procedure is essential for exploring model structure uncertainty. Similarly to the statistics literature, we give a derivation of the bias-correction term to amend several popular scoring functions when applied to bootstrap samples (cf. Section 3.2). This bias-correction term asymptotically equals one half of the penalty term for model complexity in the Akaike Information Criterion (AIC), cf. 
Section 3.2. The (huge) effects of this bias and the proposed bias-correction are illustrated in our experiments in Section 5.\n\nAs the maximum likelihood score and the entropy are intimately tied to each other in the exponential family of probability distributions, we also relate this bias towards too complex models to the bias of the plug-in estimator for entropy (Section 3.1). Moreover, we show in Section 4, similarly to [13, 1], how the (bootstrap) bias-correction can be used to obtain a scoring function whose penalty for model complexity asymptotically equals Akaike's penalty (rather than one half of that).\n\n2 Bootstrap Bias-Estimation and Bias-Correction\n\nIn this section, we introduce relevant notation and briefly review the bootstrap bias estimation of an arbitrary statistic as well as the bootstrap bias-correction (cf. also [5, 4]). The scoring functions commonly used for graphical models, such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Minimum Description Length (MDL), or the posterior probability, can be viewed as special cases of a statistic.\n\nIn a domain of n discrete random variables, X = (X_1, ..., X_n), let p(X) denote the (unknown) true distribution from which the given data D has been sampled. The empirical distribution implied by D is then given by \hat{p}(X), where \hat{p}(x) = N(x)/N, where N(x) is the frequency of state X = x and N = \sum_x N(x) is the sample size of D. A statistic T is any number that can be computed from the given data D. Its bias is defined as Bias_T = \langle T(D) \rangle_{D \sim p} - T(p), where \langle T(D) \rangle_{D \sim p} denotes the expectation over the data sets D of size N sampled from the (unknown) true distribution p. While T(D) is an arbitrary statistic, T(p) is the associated, but possibly slightly different, statistic that can be computed from a (normalized) distribution. Since the true distribution p is typically unknown, Bias_T cannot be computed. However, it can be approximated by the bootstrap bias-estimate, where p is replaced by the empirical distribution \hat{p}, and the average over the data sets D is replaced by the one over the bootstrap samples D^{(b)} generated from \hat{p}, where b = 1, ..., B with sufficiently large B (e.g., cf. [5]):\n\n\widehat{Bias}_T = \langle T(D^{(b)}) \rangle_b - T(\hat{p})    (1)\n\nThe estimator T(\hat{p}) is a so-called plug-in statistic, as the empirical distribution is "plugged in" in place of the (unknown) true one. For example, T_{\sigma^2}(\hat{p}) = E(X^2) - E(X)^2 is the familiar plug-in statistic for the variance, while T^{unbiased}_{\sigma^2}(D) = N/(N-1) \, T_{\sigma^2}(\hat{p}) is the unbiased estimator.\n\nObviously, a plug-in statistic yields an unbiased estimate concerning the distribution that is plugged in. Consequently, when the empirical distribution is plugged in, a plug-in statistic typically does not give an unbiased estimate concerning the (unknown) true distribution. Only plug-in statistics that are linear functions of \hat{p}(x) are inherently unbiased (e.g., the arithmetic mean). However, most statistics, including the above scoring functions, are non-linear functions of \hat{p}(x) (or equivalently of N(x)). In this case, the bias does not vanish in general. In the special case where a plug-in statistic is a convex (concave) function of \hat{p}, it follows immediately from the Jensen inequality that its bias is positive (negative). For example, the statistic T_{\sigma^2}(\hat{p}) is a negative quadratic, and thus concave, function of \hat{p}, and hence underestimates the variance of the (unknown) true distribution.\n\nThe general procedure of bias-correction can be used to reduce the bias of a biased statistic considerably. 
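As a concrete toy illustration of Eq. 1 (our own sketch, not an experiment from the paper), the bootstrap bias estimate of the plug-in variance statistic T_{\sigma^2}(\hat{p}) can be computed as follows; the sample size N, the number of bootstrap samples B, and the Gaussian data are arbitrary illustrative choices:

```python
# Sketch: non-parametric bootstrap bias estimate (Eq. 1) of the plug-in
# variance statistic. All numbers (N, B, the Gaussian data) are illustrative.
import random

random.seed(0)

def plug_in_var(xs):
    # T_sigma2(p_hat) = E[X^2] - E[X]^2 under the empirical distribution
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

N = 50
data = [random.gauss(0.0, 1.0) for _ in range(N)]

B = 2000  # number of bootstrap samples
boot_stats = []
for _ in range(B):
    resample = [random.choice(data) for _ in range(N)]  # draw with replacement
    boot_stats.append(plug_in_var(resample))

bias_hat = sum(boot_stats) / B - plug_in_var(data)  # Eq. 1
print(bias_hat)
```

For this statistic, the bootstrap bias estimate comes out close to -T_{\sigma^2}(\hat{p})/N, mirroring the exact bias -\sigma^2/N of the plug-in variance; the factor N/(N-1) mentioned above removes it exactly.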
The bootstrap bias-corrected estimator T^{BC} is given by\n\nT^{BC}(D) = T(D) - \widehat{Bias}_T = 2 T(D) - \langle T(D^{(b)}) \rangle_b,    (2)\n\nwhere \widehat{Bias}_T is the bootstrap bias estimate according to Eq. 1 (note that \langle T(D^{(b)}) \rangle_b by itself is not the bias-corrected statistic). Typically, T^{BC}(D) agrees with the corresponding unbiased estimator in leading order in N (cf., e.g., [5]). Higher-order corrections can be achieved by "bootstrapping the bootstrap" [5].\n\nBias-correction can be dangerous in practice (cf. [5]): even though T^{BC}(D) is less biased than T(D), the bias-corrected estimator may have substantially larger variance. This is due to a possibly higher variability in the estimate of the bias, particularly when computed from small data sets. However, this is not an issue in this paper, since the "estimate" of the bias turns out to be independent of the empirical distribution (in leading order in N).\n\n3 Bias-Corrected Scoring-Functions\n\nIn this section, we show that the above popular scoring functions are (considerably) biased towards too complex models when applied to bootstrap samples (in place of the given data). These scoring functions can be amended by an additional penalty term that accounts for this bias. Using the bootstrap bias-correction in a slightly non-standard way, a simple expression for this penalty term follows easily (Section 3.2) from the well-known bias of the plug-in estimator of the entropy, which is reviewed in Section 3.1 (cf. also, e.g., [11, 2, 16]).\n\n3.1 Bias-Corrected Estimator for True Entropy\n\nThe entropy of the (true) distribution p(X) is defined by H(p(X)) = -\sum_x p(x) \log p(x). Since this is a concave function of the p's, the plug-in estimator H(\hat{p}(X)) tends to underestimate the true entropy H(p(X)) (cf. Section 2). The bootstrap bias estimate of H(\hat{p}(X)) is \widehat{Bias}_H = \langle H(D^{(b)}) \rangle_b - H(\hat{p}), where\n\n\langle H(D^{(b)}) \rangle_b = (1/B) \sum_{b=1}^B H(D^{(b)}(X)) = -\sum_x \langle (\nu(x)/N) \log(\nu(x)/N) \rangle_{\nu(x) \sim Bin(N, \hat{p}(x))},    (3)\n\nwhere Bin(N, \hat{p}(x)) denotes the Binomial distribution that originates from the resampling procedure in the bootstrap; N is the sample size; \hat{p}(x) is the probability of sampling a data point with X = x. An exact evaluation of Eq. 3 is computationally prohibitive in most cases. Monte Carlo methods, while yielding accurate results, are computationally costly. An analytical approximation of Eq. 3 follows immediately from the second-order Taylor expansion of L(q(x)) := q(x) \log q(x) about \hat{p}(x), where q(x) = \nu(x)/N (this approximation can be applied analogously to Bias_H instead of the bootstrap estimate \widehat{Bias}_H, and the same leading-order term is obtained):\n\n-\sum_x \langle L(\nu(x)/N) \rangle_{\nu(x)} = H(\hat{p}(X)) - (1/2) \sum_x L''(\hat{p}(x)) \langle [\nu(x)/N - \hat{p}(x)]^2 \rangle_{\nu(x)} + O(1/N^2) = H(\hat{p}(X)) - (1/(2N)) (|X| - 1) + O(1/N^2),    (4)\n\nwhere -L''(\hat{p}(x)) = -1/\hat{p}(x) is the observed Fisher information evaluated at the empirical value \hat{p}(x), and \langle [\nu(x) - N \hat{p}(x)]^2 \rangle_{\nu(x)} = N \hat{p}(x) (1 - \hat{p}(x)) is the well-known variance of the Binomial distribution induced by the bootstrap. In Eq. 4, |X| is the number of (joint) states of X. The bootstrap bias-corrected estimator for the entropy of the (unknown true) distribution is thus given by H^{BC}(\hat{p}(X)) = H(\hat{p}(X)) + (1/(2N)) (|X| - 1) + O(1/N^2).\n\n3.2 Bias-Correction for Bootstrapped Scoring-Functions\n\nThis section is concerned with the bias of popular scoring functions that is induced by the bootstrap procedure. For the moment, let us focus on the BIC when learning a Bayesian network structure m,\n\nT_{BIC}(D, m) = N \sum_{i=1}^n \sum_{x_i, \pi_i} \hat{p}(x_i, \pi_i) \log [ \hat{p}(x_i, \pi_i) / \hat{p}(\pi_i) ] - (1/2) \log N \cdot |\theta|.    (5)\n\nThe maximum likelihood term involves a summation over all the variables (i = 1, ..., n) and all the joint states of each variable X_i and its parents \Pi_i according to graph m. The number of independent parameters in the Bayesian network is given by\n\n|\theta| = \sum_{i=1}^n (|X_i| - 1) \cdot |\Pi_i|,    (6)\n\nwhere |X_i| denotes the number of states of variable X_i, and |\Pi_i| the number of (joint) states of its parents \Pi_i. Like other scoring functions, the BIC is obviously intended to be applied to the given data. If done so, optimizing the BIC yields an "unbiased" estimate of the true network structure underlying the given data. However, when the BIC is applied to a bootstrap sample D^{(b)} (instead of the given data D), the BIC cannot be expected to yield an "unbiased" estimate of the true graph. This is because the maximum likelihood term in the BIC is biased when computed from the bootstrap sample D^{(b)} instead of the given data D. This bias reads \widehat{Bias}_{T_{BIC}} = \langle T_{BIC}(D^{(b)}) \rangle_b - T_{BIC}(D). It differs conceptually from Eq. 1 in two ways. First, it is the (exact) bias induced by the bootstrap procedure, while Eq. 1 is a bootstrap approximation of the (unknown) true bias. Second, while Eq. 1 applies to a statistic in general, the last term in Eq. 1 necessarily has to be a plug-in statistic. In contrast, both terms involved in \widehat{Bias}_{T_{BIC}} comprise the same general statistic.\n\nSince the maximum likelihood term is intimately tied to the entropy in the exponential family of probability distributions, the leading-order approximation of the bias of the entropy carries over (cf. Eq. 4):\n\n\widehat{Bias}_{T_{BIC}} = (1/2) \sum_{i=1}^n ( \{|X_i| \cdot |\Pi_i| - 1\} - \{|\Pi_i| - 1\} ) + O(1/N) = (1/2) |\theta| + O(1/N),    (7)\n\nwhere |\theta| is the number of independent parameters in the model, as given in Eq. 6 for Bayesian networks. Note that this bias is identical to one half of the penalty for model complexity in the Akaike Information Criterion (AIC). Hence, this bias due to the bootstrap cannot be neglected compared to the penalty terms inherent in all popular scoring functions. Our experiments in Section 5 also confirm the dominating effect of this bias when exploring model uncertainty.\n\nThis bias in the maximum likelihood gives rise to spurious dependences induced by the bootstrap (a well-known property). In this paper, we are mainly interested in structure learning of graphical models. In this context, the bootstrap procedure obviously gives rise to a (considerable) bias towards too complex models. As a consequence, too many edges are present in the learned graph structure, and the confidence in the presence of edges is overestimated. Moreover, the (undesirable) additional directed edges in Bayesian networks tend to point towards variables that already have a large number of parents. This is because the bias is proportional to the number of joint states of the parents of a variable (cf. Eqs. 7 and 6). Hence, the amount of the induced bias generally varies among the different edges in the graph.\n\nConsequently, the BIC has to be amended when applied to a bootstrap sample D^{(b)} (instead of the given data D). 
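To make Eqs. 6 and 7 concrete, the parameter count |\theta| and the resulting leading-order bias |\theta|/2 can be computed in a few lines; the three-variable toy structure below is our own illustrative assumption, not a network from the paper:

```python
# Sketch: number of independent parameters |theta| of a Bayesian network
# (Eq. 6) and the leading-order bootstrap bias |theta|/2 (Eq. 7).
# The toy structure X1 -> X3 <- X2 is an illustrative assumption.

def num_params(cards, parents):
    """cards[i]: number of states of X_i; parents[i]: list of parent indices."""
    total = 0
    for i, card in enumerate(cards):
        n_parent_states = 1
        for j in parents[i]:
            n_parent_states *= cards[j]          # |Pi_i|: joint parent states
        total += (card - 1) * n_parent_states    # (|X_i| - 1) * |Pi_i|
    return total

cards = [2, 3, 2]             # binary X1, ternary X2, binary X3
parents = [[], [], [0, 1]]    # X3 has parents X1 and X2
theta = num_params(cards, parents)
print(theta)                  # (2-1)*1 + (3-1)*1 + (2-1)*(2*3) = 9
bias = theta / 2              # leading-order bootstrap bias, Eq. 7
print(bias)                   # 4.5
```

Note how the contribution of X_3 dominates the count: the bias grows with the number of joint parent states, which is why spurious edges tend to attach to variables that already have many parents.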
The bias-corrected BIC reads T^{BC}_{BIC}(D^{(b)}, m) = T_{BIC}(D^{(b)}, m) - (1/2) |\theta| (in leading order in N). Since the bias originates from the maximum likelihood term involved in the BIC, the same bias-correction applies to the AIC and MDL scores. Moreover, as the BIC approximates the (Bayesian) log marginal likelihood, \log p(D|m), for large N, the leading-order bias-correction in Eq. 7 can also be expected to account for most of the bias of \log p(D^{(b)}|m) when applied to bootstrap samples D^{(b)}.\n\n4 Bias-Corrected Maximum-Likelihood\n\nIt may be surprising that the bias derived in Eq. 7 equals only one half of the AIC penalty. In this section, we demonstrate that this is indeed consistent with the AIC score. Using the standard bootstrap bias-correction procedure (cf. Section 2), we obtain a scoring function that asymptotically equals the AIC. This approach is similar to the ones in [1, 13].\n\nAssume that we are given some data D sampled from the (unknown) true distribution p(X). The goal is to learn a Bayesian network model with p(X|\hat{\theta}, m), or \hat{p}(X|m) in short, where m is the graph structure and \hat{\theta} are the maximum likelihood parameter estimates, given data D. An information-theoretic measure for the quality of graph m is the KL divergence between the (unknown) true distribution p(X) and the one described by the Bayesian network, \hat{p}(X|m) (cf. the approach in [1]). Since the entropy of the true distribution p(X) is an irrelevant constant when comparing different graphs, minimizing the KL divergence is equivalent to minimizing the statistic\n\nT(p, \hat{p}, m) = -\sum_x p(x) \log \hat{p}(x|m),    (8)\n\nwhich is the test error of the learned model when using the log loss. When p is unknown, one cannot evaluate T(p, \hat{p}, m), but may approximate it by the training error,\n\nT(\hat{p}, m) = -\sum_x \hat{p}(x) \log \hat{p}(x|m) = -\sum_x \hat{p}(x|m) \log \hat{p}(x|m),    (9)\n\n(assuming exponential family distributions). Note that T(\hat{p}, m) is equal to the negative maximum log likelihood up to the irrelevant factor N. It is well known that the training error underestimates the test error. However, the "bias-corrected training error",\n\nT^{BC}(\hat{p}, m) = T(\hat{p}, m) - Bias_{T(\hat{p}, m)},    (10)\n\ncan serve as a surrogate, (nearly) unbiased estimator for the unknown test error, T(p, \hat{p}, m), and hence as a scoring function for model selection. The bias is given by the difference between the expected training error and the expected test error,\n\nBias_T = \sum_x p(x|m) \langle \log \hat{p}(x|m) \rangle_{D \sim p} - \sum_x \langle \hat{p}(x|m) \log \hat{p}(x|m) \rangle_{D \sim p} \approx -(1/N) |\theta|,    (11)\n\nwhere the first term equals -H(p(X|m)) - (1/(2N)) |\theta| + O(1/N^2) and the second sum equals -H(p(X|m)) + (1/(2N)) |\theta| + O(1/N^2). The expectation is taken over the various data sets D (of sample size N) sampled from the unknown true distribution p; H(p(X|m)) is the (unknown) conditional entropy of the true distribution. In the leading-order approximation in N (cf. also Section 3.1), the number of independent parameters of the model, |\theta|, is given in Eq. 6 for Bayesian networks. Note that both the expected test error and the expected training error give rise to one half of the AIC penalty each. 
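This can be checked numerically. The following simulation (our own sketch) uses a single maximum-likelihood multinomial with K states, so that |\theta| = K - 1, instead of a full Bayesian network; the averaged gap between test and training error should come out close to |\theta|/N = 0.02:

```python
# Sketch: Monte Carlo check that the gap between expected test error and
# training error of an ML multinomial is close to |theta|/N (cf. Eq. 11).
# The true distribution p, N, and R are illustrative choices.
import math
import random

random.seed(1)

p = [0.2, 0.3, 0.5]   # assumed true distribution; |theta| = K - 1 = 2
N = 100               # sample size per simulated data set
R = 4000              # number of simulated data sets

gaps = []
for _ in range(R):
    counts = [0] * len(p)
    for _ in range(N):
        u, c = random.random(), 0.0
        for k, pk in enumerate(p):
            c += pk
            if u < c or k == len(p) - 1:  # last bucket catches rounding
                counts[k] += 1
                break
    if 0 in counts:   # avoid log(0); essentially never happens here
        continue
    p_hat = [ck / N for ck in counts]
    train = -sum(ph * math.log(ph) for ph in p_hat)             # training error
    test = -sum(pk * math.log(ph) for pk, ph in zip(p, p_hat))  # test error
    gaps.append(test - train)

mean_gap = sum(gaps) / len(gaps)
print(mean_gap, (len(p) - 1) / N)   # mean gap vs. |theta|/N = 0.02
```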
The overall bias amounts to |\theta|/N, which exactly equals the AIC penalty for model complexity. Note that, while the AIC asymptotically favors the same models as cross-validation [15], it typically does not select the true model underlying the given data, but a more complex model.\n\nWhen the bootstrap estimate of the (exact) bias in Eq. 11 is inserted in the scoring function in Eq. 10, the resulting score may be viewed as the frequentist version of the (Bayesian) Deviance Information Criterion (DIC) [13] (up to a factor 2): while averaging over the distribution of the model parameters is natural in the Bayesian approach, this is mimicked by the bootstrap in the frequentist approach.\n\n5 Experiments\n\nIn our experiments with artificial and real-world data, we demonstrate the crucial effect of the bias induced by the bootstrap procedure when exploring model uncertainty. We also show that the penalty term in Eq. 7 can compensate for most of this (possibly large) bias in structure learning of Bayesian networks.\n\nIn the first experiment, we used data sampled from the alarm network (37 discrete variables, 46 edges). Comprising 300 and 1,000 data points, respectively, the generated data sets can be expected to entail some model structure uncertainty. We examined two different scoring functions, namely the BIC and the posterior probability (uniform prior over network structures, equivalent sample size \alpha = 1, cf. [10]). We used the K2 search strategy [3] because of its computational efficiency and its accuracy in structure learning, which is high compared to local search (even when combined with simulated annealing) [10]. This accuracy is due to the additional input required by the K2 algorithm, namely a correct topological ordering of the variables according to the true network structure. 
Consequently, the reported variability in the learned network structures tends to be smaller than the uncertainty determined by local search (without this additional information). However, we are mainly interested here in the bias induced by the bootstrap, which can be expected to be largely unaffected by the search strategy.\n\nAlthough the true alarm network is known, we use the network structures learned from the given data D as a reference in our experiments: as expected, the optimal graphs learned from our small data sets tend to be sparser than the original graph in order to avoid over-fitting (cf. Table 1).3\n\nWe generated 200 bootstrap samples from the given data D (as suggested in [5]), and then learned the network structure from each. Table 1 shows that the bias induced by the bootstrap procedure is considerable for both the BIC and the posterior probability: it cannot be neglected compared to the standard deviation of the distribution over the number of edges. Also note that, despite the small data sets, the bootstrap yields graphs that have even more edges than the true alarm network. In contrast, Table 1 illustrates that this bias towards too complex models can be reduced dramatically by the bias-correction outlined in Section 3.2. Note, however, that the bias-correction does not work perfectly, as it is only the leading-order correction in N (cf. Eq. 7).\n\nThe jackknife is an alternative resampling method, and can be viewed as an approximation to the bootstrap (e.g., cf. [5]). In the delete-d jackknife procedure, subsamples are generated from the given data D by deleting d data points.4 The choice d = 1 is most popular, but leads to inconsistencies for non-smooth statistics (e.g., cf. [5]). 
These inconsistencies can be resolved by choosing a larger value for d, roughly speaking \sqrt{N} < d \ll N, cf. [5].\n\n3 Note that the greedy K2 algorithm yields exactly one graph from each given data set.\n\n4 As a consequence, unlike bootstrap samples, jackknife samples do not contain multiple identical data points when generated from a given continuous data set (cf. Section 1).\n\n         alarm network data, N = 300   alarm network data, N = 1,000   pheromone, N = 320\n         BIC           posterior      BIC           posterior        BIC            posterior\ndata D   41            40             43            44               63.0 ± 1.5     –\nboot BC  40.7 ± 4.9    40.5 ± 3.5     44.2 ± 2.6    44.1 ± 2.9       57.8 ± 3.5     –\nboot     49.1 ± 11.5   47.8 ± 10.9    47.3 ± 4.6    47.9 ± 4.8       135.7 ± 51.1   –\njack 1   41.0 ± 0.0    40.0 ± 0.0     43.0 ± 0.0    44.0 ± 0.0       63.2 ± 1.5     –\njack d   41.1 ± 0.9    40.1 ± 0.3     43.1 ± 0.3    43.7 ± 0.4       63.1 ± 2.3     –\n\nTable 1: Number of edges (mean ± standard deviation) in the network structures learned from the given data set D, and when using various resampling methods: bias-corrected bootstrap (boot BC), naive bootstrap (boot), delete-1 jackknife (jack 1), and delete-d jackknife (jack d; here d = N/10).\n\nFigure 1: The axes of these scatter plots show the confidence in the presence of the edges in the graphs learned from the pheromone data. The vertical and horizontal lines indicate the threshold values according to the mean number of edges in the graphs determined by the three methods (cf. Table 1).\n\nThe underestimation of both the bias and the variance of a statistic is often considered a disadvantage of the jackknife procedure: the "raw" jackknife estimates of bias and variance typically have to be multiplied by a so-called "inflation factor", which is usually of the order of the sample size N. 
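For completeness, a minimal sketch of how delete-d jackknife subsamples can be generated (the helper name and the toy data are our own illustration); unlike bootstrap resamples, each subsample simply omits d points and hence contains no duplicate data points:

```python
# Sketch: delete-d jackknife subsampling (cf. the "jack d" rows in Table 1,
# where d = N/10). Toy data and parameter values are illustrative.
import random

random.seed(2)

def jackknife_samples(data, d, n_samples):
    """Each subsample drops d distinct points from the given data."""
    out = []
    for _ in range(n_samples):
        keep = random.sample(range(len(data)), len(data) - d)
        out.append([data[i] for i in sorted(keep)])
    return out

data = list(range(20))
subs = jackknife_samples(data, d=2, n_samples=5)
print(len(subs[0]))                              # 18 points per subsample
print(all(len(set(s)) == len(s) for s in subs))  # no duplicate points
```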
In the context of model selection, however, one may take advantage of the extremely small bias of the "raw" jackknife estimate when determining, e.g., the mean number of edges in the model. Table 1 shows that the "raw" jackknife is typically less biased than the bias-corrected bootstrap in our experiments. However, it is not clear in the context of model selection how meaningful the "raw" jackknife estimate of model variability is.\n\nOur second experiment essentially confirms the above results. The yeast pheromone response data contains 33 variables and 320 data points (measurements) [9]. We discretized this gene-expression data using the average optimal number of discretization levels for each variable as determined in [14]. Unlike in [14], we simply discretized the data in a preprocessing step, and then conducted our experiments based on this discretized data set (of course, the bias-correction according to Eq. 7 also applies to the joint optimization of the discretization and graph structure when given a bootstrap sample). Since the correct network structure is unknown in this experiment, we used local search combined with simulated annealing in order to optimize the BIC score and the posterior probability (\alpha = 25, cf. [14]). As a reference in this experiment, we used 320 network structures learned from the given (discretized) data D, each of which is the highest-scoring graph found in a run of local search combined with simulated annealing (using the annealing parameters suggested in [10], each run of simulated annealing resulted in a different network structure, i.e., a local optimum, in practice). Each resampling procedure is also based on 320 subsamples.\n\nWhile the pheromone data experiments in Table 1 qualitatively confirm the previous results, the bias induced by the bootstrap is even larger here. We suspect that this difference in the bias is caused by the rather extreme parameter values in the original alarm network model, which lead to a relatively large signal-to-noise ratio even in small data sets. In contrast, gene-expression data is known to be extremely noisy.\n\nAnother effect of the spurious dependences induced by the bootstrap procedure is shown in Figure 1: the overestimation of the confidence in the presence of individual edges in the network structures. The confidence in an individual edge can be estimated as the ratio between the number of learned graphs where that edge is present and the overall number of learned graphs. Each mark in Figure 1 corresponds to an edge, and its coordinates reflect the confidence estimated by the different methods. Obviously, the naive application of the bootstrap leads to a considerable overestimation of the confidence in the presence of many edges in Figure 1, particularly of those whose absence is favored by both our reference and the bias-corrected bootstrap. In contrast, the confidence estimated by the bias-corrected bootstrap aligns quite well with the confidence determined by our reference in Figure 1, leading to more trustworthy results in our experiments.\n\nReferences\n\n[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. 
International Symposium on Information Theory, pp. 267-281, 1973.\n\n[2] Carlton. On the bias of information estimates. Psychological Bulletin, 71:108-113, 1969.\n\n[3] G. Cooper and E. Herskovits. A Bayesian method for constructing Bayesian belief networks from databases. UAI, pp. 86-94, 1991.\n\n[4] A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application. 1997.\n\n[5] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. 1993.\n\n[6] N. Friedman, M. Goldszmidt, and A. Wyner. Data analysis with Bayesian networks: A bootstrap approach. UAI, pp. 196-205, 1999.\n\n[7] N. Friedman, M. Goldszmidt, and A. Wyner. On the application of the bootstrap for computing confidence measures on features of induced Bayesian networks. AI & Statistics, pp. 197-202, 1999.\n\n[8] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601-620, 2000.\n\n[9] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young. Combining location and expression data for principled discovery of genetic regulatory networks. Pacific Symposium on Biocomputing, 2002.\n\n[10] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.\n\n[11] G. A. Miller. Note on the bias of information estimates. Information Theory in Psychology: Problems and Methods, pp. 95-100, 1955.\n\n[12] D. Pe'er, A. Regev, G. Elidan, and N. Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics, 1:1-9, 2001.\n\n[13] D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. van der Linde. Bayesian measures of model complexity and fit. J. R. Stat. Soc. B, 64:583-639, 2002.\n\n[14] H. Steck and T. S. Jaakkola. (Semi-)predictive discretization during model selection. AI Memo 2003-002, MIT, 2003.\n\n[15] M. 
Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. R. Stat. Soc. B, 36:44-47, 1977.\n\n[16] J. D. Victor. Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Computation, 12:2797-2804, 2000.\n", "award": [], "sourceid": 2427, "authors": [{"given_name": "Harald", "family_name": "Steck", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}