{"title": "Conformal Prediction Under Covariate Shift", "book": "Advances in Neural Information Processing Systems", "page_first": 2530, "page_last": 2540, "abstract": "We extend conformal prediction methodology beyond the case of exchangeable data. In particular, we show that a weighted version of conformal prediction can be used to compute distribution-free prediction intervals for problems in which the test and training covariate distributions differ, but the likelihood ratio between the two distributions is known---or, in practice, can be estimated accurately from a set of unlabeled data (test covariate points). Our weighted extension of conformal prediction also applies more broadly, to settings in which the data satisfies a certain weighted notion of exchangeability. We discuss other potential applications of our new conformal methodology, including latent variable and missing data problems.", "full_text": "Conformal Prediction Under Covariate Shift\n\nRyan J. Tibshirani\n\nDepartment of Statistics\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh PA, 15213\nryantibs@cmu.edu\n\nEmmanuel J. Cand\u00e8s\nDepartment of Statistics\n\nDepartment of Mathematics\n\nStanford University\nStanford CA, 94305\n\ncandes@stanford.edu\n\nRina Foygel Barber\nDepartment of Statistics\nUniversity of Chicago\n\nChicago, IL 60637\n\nrina@uchicago.edu\n\nAaditya Ramdas\n\nDepartment of Statistics\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh PA, 15213\naramdas@cmu.edu\n\nAbstract\n\nWe extend conformal prediction methodology beyond the case of exchangeable\ndata. 
In particular, we show that a weighted version of conformal prediction can be used to compute distribution-free prediction intervals for problems in which the test and training covariate distributions differ, but the likelihood ratio between the two distributions is known, or, in practice, can be estimated accurately from a set of unlabeled data (test covariate points). Our weighted extension of conformal prediction also applies more broadly, to settings in which the data satisfies a certain weighted notion of exchangeability. We discuss other potential applications of our new conformal methodology, including latent variable and missing data problems.

1 Introduction

Let (X_i, Y_i) ∈ R^d × R, i = 1, ..., n denote the training data, assumed to be i.i.d. from an arbitrary distribution P. Given a desired coverage rate 1 − α ∈ (0, 1), consider the problem of constructing a band Ĉ_n : R^d → {subsets of R}, based on the training data, such that for a new i.i.d. point (X_{n+1}, Y_{n+1}),

P{Y_{n+1} ∈ Ĉ_n(X_{n+1})} ≥ 1 − α,   (1)

where this probability is taken over the n + 1 points (X_i, Y_i), i = 1, ..., n + 1 (the n training points and the test point). Crucially, we will require (1) to hold with no assumptions whatsoever on the underlying distribution P.

Conformal prediction, a framework pioneered by Vladimir Vovk and colleagues in the 1990s, provides a means for achieving this goal, relying only on exchangeability of the training and test data. The definitive reference is the book by Vovk et al. [2005]; see also Shafer and Vovk [2008], Vovk et al. [2009], Vovk [2013], Burnaev and Vovk [2014], and http://www.alrw.net for an often-updated list of conformal prediction work by Vovk and colleagues. Moreover, see Lei and Wasserman [2014], Lei et al. 
[2018] for recent developments in the areas of nonparametric and high-dimensional regression.

In this work, we extend conformal prediction beyond the setting of exchangeable data, allowing for provably valid inference even when the training and test data are not drawn from the same distribution. We begin by reviewing the basics of conformal prediction, in this section. In Section 2, we describe an extension of conformal prediction to the setting of covariate shift, and give supporting empirical results. In Section 3, we cover the mathematical details behind our conformal extension. We conclude in Section 4 with a short discussion.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Quantile lemma

Before explaining the basic ideas behind conformal inference (i.e., conformal prediction; we will use these two terms interchangeably), we introduce some notation. We denote by Quantile(β; F) the level β quantile of a distribution F, i.e., for Z ∼ F,

Quantile(β; F) = inf{z : P{Z ≤ z} ≥ β}.

In our use of quantiles, we will allow for distributions F on the augmented real line, R ∪ {∞}. For values v_1, ..., v_n, we write v_{1:n} = {v_1, ..., v_n} to denote their multiset. Note that this is unordered, and allows for multiple instances of the same element; thus in the present case, if v_i = v_j for i ≠ j, then this value appears twice in v_{1:n}. To denote quantiles of the empirical distribution of the values v_1, ..., v_n, we abbreviate

Quantile(β; v_{1:n}) = Quantile(β; (1/n) Σ_{i=1}^n δ_{v_i}),

where δ_a denotes a point mass at a (i.e., the distribution that places all mass at the value a). The next result is a simple but key component underlying conformal prediction. Its proof, as with all proofs in this paper, is deferred to the supplement.

Lemma 1. If V_1, ..., V_{n+1} are exchangeable random variables, then for any β ∈ (0, 1), we have

P{V_{n+1} ≤ Quantile(β; V_{1:n} ∪ {∞})} ≥ β.

Furthermore, if ties between V_1, ..., V_{n+1} occur with probability zero, then the above probability is upper bounded by β + 1/(n + 1).

1.2 Conformal prediction

We now return to the regression setting.¹ Denote Z_i = (X_i, Y_i), i = 1, ..., n. In what follows, we describe the construction of a prediction band satisfying (1), using conformal inference, due to Vovk et al. [2005]. We first choose a score function S, whose arguments consist of a point (x, y), and a multiset Z.² Informally, a low value of S((x, y), Z) indicates that the point (x, y) "conforms" to Z, whereas a high value indicates that (x, y) is atypical relative to the points in Z. For example, we might choose to define S by

S((x, y), Z) = |y − μ̂(x)|,   (2)

where μ̂ : R^d → R is a regression function, fitted by running an algorithm A on Z. Next, at a given x ∈ R^d, we define Ĉ_n(x), the conformal prediction interval³, by repeating the following procedure for each y ∈ R. We calculate the nonconformity scores

V_i^{(x,y)} = S(Z_i, Z_{1:n} ∪ {(x, y)}), i = 1, ..., n,  and  V_{n+1}^{(x,y)} = S((x, y), Z_{1:n} ∪ {(x, y)}),   (3)

and include y in our prediction interval Ĉ_n(x) if

V_{n+1}^{(x,y)} ≤ Quantile(1 − α; V_{1:n}^{(x,y)} ∪ {∞}),

where V_{1:n}^{(x,y)} = {V_1^{(x,y)}, ..., V_n^{(x,y)}}. Importantly, the symmetry in the construction of the nonconformity scores (3) guarantees exact coverage in finite samples. The next theorem summarizes this coverage result. The lower bound is a standard result from the conformal literature, see Vovk et al. [2005]; the upper bound, as far as we know, was first pointed out by Lei et al. [2018].

¹Throughout this paper, we focus on regression, where the response Y is continuous, for simplicity. The same ideas can be applied to classification, where Y is discrete.

²We emphasize that by defining Z to be a multiset, we are treating its points as unordered. Hence, to be perfectly explicit, the score function S cannot accept the points in Z in any particular order, and it must take them in as unordered. The same is true of the base algorithm A used to define the fitted regression function μ̂, in the choice of absolute residual score function (2).

³For convenience, throughout, we will refer to Ĉ_n(x) as an "interval", even though this may actually be a union of multiple nonoverlapping intervals. Similarly, for simplicity, we will refer to Ĉ_n as a "band".

Theorem 1 (Vovk et al. 2005, Lei et al. 2018). Assume that (X_i, Y_i) ∈ R^d × R, i = 1, ..., n + 1 are exchangeable. For any score function S, and any α ∈ (0, 1), define the conformal band (based on the first n samples) at x ∈ R^d by

Ĉ_n(x) = {y ∈ R : V_{n+1}^{(x,y)} ≤ Quantile(1 − α; V_{1:n}^{(x,y)} ∪ {∞})},   (4)

where V_i^{(x,y)}, i = 1, ..., n + 1 are as defined in (3). Then Ĉ_n satisfies

P{Y_{n+1} ∈ Ĉ_n(X_{n+1})} ≥ 1 − α.

Furthermore, if ties between V_1^{(X_{n+1},Y_{n+1})}, ..., V_{n+1}^{(X_{n+1},Y_{n+1})} occur with probability zero, then this probability is upper bounded by 1 − α + 1/(n + 1).

Remark 1. Theorem 1 is stated assuming exchangeable samples (X_i, Y_i), i = 1, ..., n + 1, which is weaker than assuming i.i.d. samples. As we will see in what follows, it is possible to relax the exchangeability assumption, under an appropriate modification to the conformal procedure.

Remark 2. 
If we use an appropriate random tie-breaking rule (to determine the rank of V_{n+1} among V_1, ..., V_{n+1}), then the upper bounds in Lemma 1 and Theorem 1 hold in general (without assuming there are no ties almost surely).

The result in Theorem 1, albeit simple to prove, is quite remarkable. It gives a recipe for distribution-free prediction intervals, having nearly exact coverage, starting from an arbitrary score function S; e.g., absolute residuals defined using a fitted regression function from any base algorithm A, as in (2). For more discussion of conformal prediction, its properties, and its variants, see Vovk et al. [2005], Lei et al. [2018] and references therein.

2 Covariate shift

In this paper, we are concerned with settings in which the data (X_i, Y_i), i = 1, ..., n + 1 are no longer exchangeable. Our primary focus will be a setting in which we observe data according to

(X_i, Y_i) ~ P = P_X × P_{Y|X}, i.i.d., i = 1, ..., n,  and  (X_{n+1}, Y_{n+1}) ~ P̃ = P̃_X × P_{Y|X}, independently.   (5)

Notice that the conditional distribution of Y|X is assumed to be the same for both the training and test data. Such a setting is often called covariate shift (e.g., see Shimodaira 2000, Quinonero-Candela et al. 2009; see also Remark 4 below for more discussion of this literature). The key realization is the following: if we know the ratio of test to training covariate likelihoods, dP̃_X/dP_X, then we can still perform a modified version of conformal inference, using a quantile of a suitably weighted empirical distribution of nonconformity scores. 
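Before turning to the weighted procedure, the unweighted quantile rule of Lemma 1 can be checked numerically. Below is a minimal, self-contained Python sketch; the simulation setup (i.i.d. Gaussian scores, n = 20, the function name `quantile_with_inf`) is our own illustrative choice, not from the paper:

```python
import math
import random

def quantile_with_inf(beta, values):
    """Quantile(beta; v_{1:n} ∪ {∞}): level-beta quantile of the empirical
    distribution on the n given values plus a point mass at +∞."""
    v = sorted(values) + [math.inf]
    k = math.ceil(beta * len(v))  # smallest k with k/(n+1) >= beta
    return v[k - 1]

# Empirical check of Lemma 1: for exchangeable (here i.i.d.) scores,
# P{V_{n+1} <= Quantile(beta; V_{1:n} ∪ {∞})} >= beta.
random.seed(0)
beta, n, trials = 0.9, 20, 5000
hits = 0
for _ in range(trials):
    v = [random.gauss(0, 1) for _ in range(n + 1)]
    if v[-1] <= quantile_with_inf(beta, v[:-1]):
        hits += 1
print(hits / trials)  # should land a bit above 0.9 (exact value is 19/21)
```

With continuous scores and n = 20, the event holds exactly when V_{21} is at most the 19th smallest of the other 20 scores, which has probability 19/21 ≈ 0.905, matching the lower bound β = 0.9 and the upper bound β + 1/(n+1) of Lemma 1.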
The next subsection gives details; following this, we give an empirical demonstration.

2.1 Weighted conformal prediction

In conformal prediction, we form a prediction interval by comparing the value of a nonconformity score at a test point to the empirical distribution of nonconformity scores at the training points. In the covariate shift case, where the covariate distributions P_X, P̃_X in our training and test sets differ, we will now weight each nonconformity score V_i^{(x,y)} (measuring how well Z_i = (X_i, Y_i) conforms to the other points) by a probability proportional to the likelihood ratio w(X_i) = dP̃_X(X_i)/dP_X(X_i). Therefore, we will no longer be interested in the empirical distribution (1/(n+1)) Σ_{i=1}^n δ_{V_i^{(x,y)}} + (1/(n+1)) δ_∞, as in Theorem 1, but rather, a weighted version

Σ_{i=1}^n p_i^w(x) δ_{V_i^{(x,y)}} + p_{n+1}^w(x) δ_∞,

where the weights are defined by

p_i^w(x) = w(X_i) / (Σ_{j=1}^n w(X_j) + w(x)), i = 1, ..., n,  and  p_{n+1}^w(x) = w(x) / (Σ_{j=1}^n w(X_j) + w(x)).   (6)

Due to this careful weighting, draws from the discrete distribution in the second to last display resemble nonconformity scores computed on the test population, and thus, they "look exchangeable" with the nonconformity score at our test point. Our main result below formalizes these claims.

Corollary 1. Assume data from the model (5). Assume P̃_X is absolutely continuous with respect to P_X, and denote w = dP̃_X/dP_X. For any score function S, and any α ∈ (0, 1), define for x ∈ R^d,

Ĉ_n(x) = {y ∈ R : V_{n+1}^{(x,y)} ≤ Quantile(1 − α; Σ_{i=1}^n p_i^w(x) δ_{V_i^{(x,y)}} + p_{n+1}^w(x) δ_∞)},   (7)

where V_i^{(x,y)}, i = 1, ..., n + 1 are as defined in (3), and p_i^w, i = 1, ..., n + 1 are as defined in (6). Then Ĉ_n satisfies

P{Y_{n+1} ∈ Ĉ_n(X_{n+1})} ≥ 1 − α.

Corollary 1 is a special case of a more general result presented later in Theorem 2, which extends conformal inference to a setting in which the data are what we call weighted exchangeable.

Remark 3. The same result as in Corollary 1 holds if w ∝ dP̃_X/dP_X, i.e., with unknown normalization constant, because this constant cancels out in the calculation of probabilities in (6).

Remark 4. Though the basic premise of covariate shift, and certainly the techniques employed in addressing it, are related to much older ideas in statistics, the specific setup in (5) has recently generated great interest in machine learning: e.g., see Sugiyama and Muller [2005], Sugiyama et al. [2007], Quinonero-Candela et al. [2009], Agarwal et al. [2011], Wen et al. [2014], Reddi et al. [2015], Chen et al. [2016] and references therein. The focus is usually on correcting estimators, model evaluation, or model selection approaches to account for covariate shift. Correcting distribution-free prediction intervals, as we examine in this work, is (as far as we know) a new contribution. As one might expect, the likelihood ratio dP̃_X/dP_X, a key component of our conformal construction in Corollary 1, also plays a critical role in much of the literature on covariate shift.

2.2 Airfoil data example

We demonstrate conformal prediction in the covariate shift setting using an empirical example. 
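The weighted quantile in (7) is just a quantile of a discrete distribution with atoms at the scores and at ∞, weighted as in (6). A minimal Python sketch (the numeric values and helper names are illustrative, not from the paper):

```python
import math

def weighted_quantile(beta, values, probs):
    """Level-beta quantile of the discrete distribution sum_i probs[i]*delta_{values[i]}:
    the smallest value whose cumulative probability reaches beta."""
    pairs = sorted(zip(values, probs))
    cum = 0.0
    for v, p in pairs:
        cum += p
        if cum >= beta - 1e-12:
            return v
    return pairs[-1][0]

def shift_probs(train_w, test_w):
    """The probabilities p_i^w(x) in (6): proportional to the likelihood ratio
    w(X_i) at each training point, with p_{n+1}^w(x) ∝ w(x) for the ∞ atom."""
    tot = sum(train_w) + test_w
    return [w / tot for w in train_w] + [test_w / tot]

# With unit weights we recover the unweighted conformal quantile:
probs = shift_probs([1.0] * 5, 1.0)  # each of the 6 atoms gets mass 1/6
scores = [0.1, 0.5, 0.2, 0.9, 0.4]
q = weighted_quantile(0.8, scores + [math.inf], probs)
print(q)  # prints 0.9
```

Note that when the test-point weight dominates (w(x) large relative to Σ_j w(X_j)), the ∞ atom carries most of the mass and the returned quantile can be ∞, yielding a trivial interval; this is the price of a severe shift.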
We consider the airfoil data set from the UCI Machine Learning Repository [Dua and Graff, 2019], which has N = 1503 observations of a response Y (scaled sound pressure level of NASA airfoils), and a covariate X with d = 5 dimensions (log frequency, angle of attack, chord length, free-stream velocity, and suction side log displacement thickness). For efficiency, we use a variant of conformal prediction called split conformal prediction [Papadopoulos et al., 2002, Lei et al., 2015], which we extend to the covariate shift case in the same way (using weighted quantiles); see the supplement. For R code to reproduce the results that follow, see http://www.github.com/ryantibs/conformal/.

Creating training data, test data, and covariate shift. We repeated an experiment for 5000 trials, where for each trial we randomly partitioned the data {(X_i, Y_i)}_{i=1}^N into two sets D_train, D_test, and also constructed a covariate shift test set D_shift, which have the following roles.

• D_train, containing 50% of the data, is our training set, i.e., (X_i, Y_i), i = 1, ..., n, used to compute conformal prediction intervals (using the split conformal variant).
• D_test, containing 50% of the data, is our test set (as these data points are exchangeable with those in D_train, there is no covariate shift in this test set).
• D_shift is a second test set, constructed to simulate covariate shift, by sampling 25% of the points from D_test with replacement, with probabilities proportional to

w(x) = exp(xᵀβ), where β = (−1, 0, 0, 0, 1).   (8)

As the original data points D_train ∪ D_test can be seen as draws from the same underlying distribution, we can view w(x) as the likelihood ratio of covariate distributions between the test set D_shift and training set D_train. 
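The construction of D_shift above amounts to importance resampling with weights w(x) = exp(xᵀβ). Here is a sketch on synthetic Gaussian covariates standing in for the airfoil features (the data below are simulated for self-containedness; they are not the UCI data set):

```python
import math
import random

random.seed(1)

# Synthetic stand-in covariates: 200 points in R^5 (the real experiment uses
# the 5 airfoil features; Gaussians here keep the sketch self-contained).
d_test = [[random.gauss(0, 1) for _ in range(5)] for _ in range(200)]

beta = [-1.0, 0.0, 0.0, 0.0, 1.0]  # the tilting direction from (8)

def w(x):
    """Likelihood ratio w(x) = exp(x^T beta), up to normalization."""
    return math.exp(sum(b * xi for b, xi in zip(beta, x)))

# Sample 25% of the points with replacement, probability proportional to w(x),
# mirroring how D_shift is built from D_test above.
d_shift = random.choices(d_test, weights=[w(x) for x in d_test], k=len(d_test) // 4)
print(len(d_shift))  # prints 50
```

`random.choices` draws with replacement according to the given (unnormalized) weights, which is exactly the exponential-tilting resampling scheme; the normalization constant is irrelevant, as in Remark 3.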
Note that the test covariate distribution P̃_X, which satisfies dP̃_X ∝ exp(xᵀβ) dP_X as we have defined it here, is called an exponential tilting of the training covariate distribution P_X. The supplement displays kernel density estimates fit to the airfoil data set, pre and post exponential tilting, to visualize the differences in the covariate distributions.

Loss of coverage of ordinary conformal prediction under covariate shift. First, we examine the performance of ordinary split conformal prediction. The nominal coverage level was set to be 90% (meaning α = 0.1), here and throughout. The results are displayed in the top row of Figure 1. In each of the 5000 trials, we computed the empirical coverage from the split conformal intervals over points in the test sets, and the histograms show the distribution of these empirical coverages over the trials. We see that for the original test data D_test (no covariate shift, shown in red), split conformal works as expected, with the average of the empirical coverages (over the 5000 trials) being 90.2%; for the nonuniformly subsampled test data D_shift (covariate shift, in blue), split conformal considerably undercovers, with its average coverage being 82.2%.

Coverage of weighted conformal prediction with oracle weights. Next, displayed in the middle row of Figure 1, we consider weighted split conformal prediction, to cover the points in D_shift (shown in orange). At the moment, we assume oracle knowledge of the true weight function w in (8) needed to calculate the probabilities in (6). We see that this brings the coverage back to the desired level, with the average coverage being 90.8%. However, the histogram is more dispersed than it is when there is no covariate shift (compare to the top row, in red). This is because, by using a quantile of the weighted empirical distribution of nonconformity scores, we are relying on a reduced "effective sample size". Given training points X_1, ..., X_n, and a likelihood ratio w of test to training covariate distributions, a popular heuristic formula from the covariate shift literature for the effective sample size of X_1, ..., X_n is [Gretton et al., 2009, Reddi et al., 2015]:

n̂ = [Σ_{i=1}^n |w(X_i)|]² / Σ_{i=1}^n |w(X_i)|² = ‖w(X_{1:n})‖₁² / ‖w(X_{1:n})‖₂²,

where we abbreviate w(X_{1:n}) = (w(X_1), ..., w(X_n)) ∈ R^n. To compare weighted conformal prediction against the unweighted method at the same effective sample size, in each trial, we ran unweighted split conformal on the original test set D_test, but we used only n̂ subsampled points from D_train to compute the quantile of nonconformity scores. The results (the middle row of Figure 1, in purple) line up closely with those from weighted conformal, which shows that the overdispersion in the coverage histogram from the latter is fully explained by the reduced effective sample size.

Coverage of weighted conformal with estimated weights. Denote by X_1, ..., X_n the covariate points in D_train and by X_{n+1}, ..., X_{n+m} the covariate points in D_shift. Here we describe how to estimate w = dP̃_X/dP_X, the likelihood ratio of interest, by applying logistic regression or random forests (more generally, any classifier that outputs estimated probabilities of class membership) to the feature-class pairs (X_i, C_i), i = 1, ..., n + m, where C_i = 0 for i = 1, ..., n and C_i = 1 for i = n + 1, ..., n + m. 
Noting that

P(C = 1|X = x) / P(C = 0|X = x) = [P(C = 1) / P(C = 0)] · (dP̃_X/dP_X)(x),

we can take the conditional odds ratio w(x) = P(C = 1|X = x)/P(C = 0|X = x) as an equivalent representation for the oracle weight function (since we only need to know the likelihood ratio up to a proportionality constant, recall Remark 3). Therefore, if p̂(x) is an estimate of P(C = 1|X = x) obtained by fitting a classifier to the data (X_i, C_i), i = 1, ..., n + m, then we can use

ŵ(x) = p̂(x) / (1 − p̂(x))   (9)

as our estimated weight function for the calculation of probabilities (6) that are needed for conformal prediction. There is in fact a sizeable literature on density ratio estimation, and the method just described falls into a class called probabilistic classification approaches; two other classes are based on moment matching, and minimization of φ-divergences (e.g., Kullback-Leibler divergence). For a comprehensive review of these approaches, and supporting theory, see Sugiyama et al. [2012].

The bottom row of Figure 1 shows the results from using weighted split conformal prediction to cover the points in D_shift, where the weight function ŵ has been estimated as in (9), using logistic regression (in gray) and random forests⁴ (in green) to fit the class probability function p̂. Note that logistic regression is well-specified in this example, as it assumes the log odds is a linear function of x, which is exactly as in (8). 
Random forests, of course, allows more flexibility in the fitted model.

⁴In the random forests approach, we clipped the estimated test class probability p̂(x) to lie in between 0.01 and 0.99, to prevent the estimated weight (likelihood ratio) ŵ(x) from being infinite. Without clipping, the estimated probability of being in the test class was sometimes exactly 1 (this occurred in about 2% of the cases encountered over all 5000 repetitions), resulting in an infinite weight, and causing numerical issues.

Figure 1: Empirical coverages of conformal prediction intervals, computed using 5000 different random splits of the airfoil data set. The averages of empirical coverages in each histogram are marked on the x-axis. (Panels, top to bottom: no covariate shift vs. covariate shift; oracle weights vs. no shift with fewer samples; logistic regression weights vs. random forest weights.)

Both classification approaches deliver weights that translate into good average coverage, being 91.0% for each approach. Furthermore, their histograms are only a little more dispersed than that for the oracle weights (middle row, in orange). For more simulation results, see the supplement.

3 Weighted exchangeability

In this section, we develop a general result on conformal prediction for settings in which the data satisfy what we call weighted exchangeability. First we precisely define this concept, then we extend Lemma 1 to this new setting, and extend conformal prediction as well.

3.1 Generalizing exchangeability

We first define a generalized notion of exchangeability.

Definition 1. We call random variables V_1, ..., V_n weighted exchangeable, with weight functions w_1, ..., w_n, if the density⁵ f of their joint distribution can be factorized as

f(v_1, ..., v_n) = Π_{i=1}^n w_i(v_i) · g(v_1, ..., v_n),

where g does not depend on the ordering of its inputs, i.e., g(v_{σ(1)}, ..., v_{σ(n)}) = g(v_1, ..., v_n) for any permutation σ of 1, ..., n.

Clearly, weighted exchangeability with weight functions w_i ≡ 1, i = 1, ..., n reduces to ordinary exchangeability. Furthermore, independent draws (where all marginal distributions are absolutely continuous with respect to, say, the first one) are always weighted exchangeable, with weight functions given by the appropriate Radon-Nikodym derivatives, i.e., likelihood ratios. This is stated next; the proof follows directly from Definition 1 and is omitted.

Lemma 2. Let Z_i ~ P_i, i = 1, ..., n be independent draws, where each P_i is absolutely continuous with respect to P_1, for i ≥ 2. Then Z_1, ..., Z_n are weighted exchangeable, with weight functions w_1 ≡ 1, and w_i = dP_i/dP_1, i ≥ 2.

Lemma 2 highlights an important special case (which, we note, includes the covariate shift model in (5)). But it is worth being clear that our definition of weighted exchangeability encompasses more than independent sampling, and allows for a nontrivial dependency structure between the variables.

3.2 Generalizing conformal prediction

Now we give a weighted generalization of Lemma 1.

Lemma 3. Let Z_i, i = 1, ..., n + 1 be weighted exchangeable, with weight functions w_1, ..., w_{n+1}. Let V_i = S(Z_i, Z_{1:(n+1)}), for i = 1, ..., n + 1, where S is an arbitrary score function. Define

p_i^w(z_1, ..., z_{n+1}) = [Σ_{σ:σ(n+1)=i} Π_{j=1}^{n+1} w_j(z_{σ(j)})] / [Σ_σ Π_{j=1}^{n+1} w_j(z_{σ(j)})], i = 1, ..., n + 1,   (10)

where the summations are taken over permutations σ of the numbers 1, ..., n + 1. Then for any β ∈ (0, 1),

P{V_{n+1} ≤ Quantile(β; Σ_{i=1}^n p_i^w(Z_1, ..., Z_{n+1}) δ_{V_i} + p_{n+1}^w(Z_1, ..., Z_{n+1}) δ_∞)} ≥ β.

Remark 5. When V_1, ..., V_{n+1} are exchangeable, we have w_i ≡ 1 for i = 1, ..., n + 1, and so p_i^w ≡ 1/(n + 1) for i = 1, ..., n + 1. 
Note that, in this special case, the lower bound in Lemma 3 reduces to the ordinary unweighted lower bound in Lemma 1. On the other hand, obtaining a meaningful upper bound on the probability in question in Lemma 3, as was done in Lemma 1 (when we assume almost surely no ties), does not seem possible without further conditions on the weight functions. This is because the largest jump in the cumulative distribution function of V_{n+1}|E_z is of size max_{i=1,...,n+1} p_i^w(z_1, ..., z_{n+1}), which can potentially be very large; in the unweighted case, this jump is always of size 1/(n + 1).

⁵More generally, f may be the Radon-Nikodym derivative with respect to an arbitrary base measure.

A weighted version of conformal prediction follows immediately from Lemma 3.

Theorem 2. Assume that Z_i = (X_i, Y_i) ∈ R^d × R, i = 1, ..., n + 1 are weighted exchangeable with weight functions w_1, ..., w_{n+1}. For any score function S, and any α ∈ (0, 1), define the weighted conformal band (based on the first n samples) at a point x ∈ R^d by

Ĉ_n(x) = {y ∈ R : V_{n+1}^{(x,y)} ≤ Quantile(1 − α; Σ_{i=1}^n p_i^w(Z_1, ..., Z_n, (x, y)) δ_{V_i^{(x,y)}} + p_{n+1}^w(Z_1, ..., Z_n, (x, y)) δ_∞)},   (11)

where V_i^{(x,y)}, i = 1, ..., n + 1 are as defined in (3), and p_i^w, i = 1, ..., n + 1 are as defined in (10). Then Ĉ_n satisfies

P{Y_{n+1} ∈ Ĉ_n(X_{n+1})} ≥ 1 − α.

Observe that Corollary 1 follows by taking w_i ≡ 1 for i = 1, ..., n, and w_{n+1}((x, y)) = w(x).

4 Discussion

We described an extension of conformal prediction to handle weighted exchangeable data, covering exchangeable data, and independent (but not identically distributed) data, as special cases. 
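For small n, the general weights in (10) can be evaluated by brute force over all (n+1)! permutations; a sketch (the factorial cost on display here is exactly why the covariate-shift simplification matters):

```python
import math
from itertools import permutations

def permutation_weights(points, weight_fns):
    """p_i^w(z_1, ..., z_{n+1}) from (10): for each i, sum the products
    prod_j w_j(z_{sigma(j)}) over permutations sigma with sigma(n+1) = i,
    then normalize by the sum over all permutations."""
    m = len(points)
    num = [0.0] * m
    den = 0.0
    for sigma in permutations(range(m)):
        prod = 1.0
        for j, idx in enumerate(sigma):
            prod *= weight_fns[j](points[idx])
        den += prod
        num[sigma[-1]] += prod
    return [v / den for v in num]

z = [0.3, 1.2, -0.5, 2.0]

# All weight functions ≡ 1 (exchangeability): every p_i^w equals 1/(n+1).
p_unit = permutation_weights(z, [lambda v: 1.0] * 4)

# Covariate shift (w_i ≡ 1 for i <= n, w_{n+1} = w): p_i^w ∝ w(z_i), as in (6).
p_shift = permutation_weights(z, [lambda v: 1.0] * 3 + [lambda v: math.exp(v)])
print(p_unit, p_shift)
```

The second call illustrates the collapse exploited by Corollary 1: with only the last weight function nontrivial, each permutation's product depends only on which point lands in position n + 1, so the (n+1)!-term sums reduce to the n + 1 ratios in (6).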
In general, the new weighted methodology requires computing quantiles of a weighted discrete distribution of nonconformity scores, whose weights are combinatorially hard to compute. But the computations simplify dramatically for a case of significant practical interest, where the test covariate distribution P̃_X differs from the training covariate distribution P_X by a known likelihood ratio dP̃_X/dP_X (and the conditional distribution P_{Y|X} remains unchanged). In this case, called covariate shift, the new weighted conformal prediction methodology is just as easy, computationally, as ordinary conformal prediction. When the likelihood ratio dP̃_X/dP_X is not known, it can be estimated given access to unlabeled data (test covariate points), which we showed empirically, on a low-dimensional example, can still yield correct coverage.

Beyond the setting of covariate shift that we have focused on (as the main application in this paper), our weighted conformal methodology can be applied to several other closely related settings, where ordinary conformal prediction will not directly yield correct coverage. We discuss two such settings below; a third, on approximate conditional inference, is discussed in the supplement.

Graphical models with covariate shift. Assume that the training data (Z, X, Y) ~ P obeys the Markovian structure Z → X → Y. As an example, to make matters concrete, suppose that Z is a low-dimensional covariate (such as ancestry information), X is a high-dimensional set of features for a person (such as genetic measurements), and Y is a real-valued outcome of interest (such as life expectancy). Suppose that on the test data (Z, X, Y) ~ P̃, the distribution of Z has changed, causing a change in the distribution of X, and thus causing a change in the distribution of the unobserved Y (however, the distribution of X|Z is unchanged). 
One plausible solution to this problem would be to just ignore Z in both training and test sets, and run weighted conformal prediction on only (X, Y), treating this like a usual covariate shift problem. But, as X is high-dimensional, this would require estimating a ratio of two high-dimensional densities, which would be difficult. Since Z is low-dimensional, we can instead estimate the weights by estimating the likelihood ratio of Z between test and training sets, which follows because for the joint covariate (Z, X),

P̃_{Z,X}(z, x) / P_{Z,X}(z, x) = [P̃_Z(z) P_{X|Z=z}(x)] / [P_Z(z) P_{X|Z=z}(x)] = P̃_Z(z) / P_Z(z).

This may be a more tractable quantity to estimate for the purpose of weighted conformal inference. These ideas may be generalized to more complex graphical settings, which we leave to future work.

Missing covariates with known summaries. As another concrete example, suppose that hospital A has collected a private training data set (Z, X, Y) ~ P^A, where Z ∈ {0, 1} is a sensitive patient covariate, X represents other covariates, and Y is a real-valued response that is expensive to measure. Suppose that hospital B also has its own data set, but in order to save money and not measure the responses for their patients, it asks hospital A for help to produce prediction intervals for these responses. Instead of sharing the collected data (Z, X) ~ P^B for each patient with hospital A, due to privacy concerns, hospital B only provides hospital A with the X covariate for each patient, along with a summary statistic for Z, representing the fraction of Z values that equal one (more accurately, the probability of drawing a patient with Z = 1 from their underlying patient population). Assume that P^A_{X|Z} = P^B_{X|Z} (e.g., if Z is the sex of the patient, then this assumes there is one joint distribution on X for males and one for females, which does not depend on the hospital). The likelihood ratio of covariate distributions thus again reduces to calculating the likelihood ratio of Z between P^B and P^A, which we can easily do, and use weighted conformal prediction.

Towards local conditional coverage? We finish by describing how our weighted conformal methodology can be used to construct prediction bands with a certain approximate notion of conditional coverage. Given i.i.d. (X_i, Y_i), i = 1, ..., n + 1, consider, instead of the original goal (1),

P{Y_{n+1} ∈ Ĉ_n(x_0) | X_{n+1} = x_0} ≥ 1 − α.   (12)

This is (exact) conditional coverage at x_0 ∈ R^d. As it turns out, asking for (12) to hold at P_X-almost every x_0 ∈ R^d, and for all distributions P, is far too strong: Vovk [2012], Lei and Wasserman [2014] prove that any method with such a property must yield an interval Ĉ_n(x_0) with infinite expected length at any non-atom point x_0, for any underlying distribution P. Thus we must relax (12) and seek some notion of approximate conditional coverage, if we hope to achieve it with a nontrivial prediction band. Some relaxations were recently considered in Barber et al. [2019], most of which were also impossible to achieve in a nontrivial way. A different, natural relaxation of (12) is

[∫ P{Y_{n+1} ∈ Ĉ_n(x_0) | X_{n+1} = x} K((x − x_0)/h) dP_X(x)] / [∫ K((x − x_0)/h) dP_X(x)] ≥ 1 − α,   (13)

where K is a kernel function and h > 0 is a bandwidth parameter. 
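At the center point x = x_0, the kernel localization just defined turns into normalized kernel evaluations at the training covariates, plus an atom for the test point; a one-dimensional sketch with a Gaussian kernel (the kernel choice, data, and function name are illustrative):

```python
import math

def kernel_localization_probs(train_x, x0, h):
    """Normalized Gaussian-kernel weights K((X_i - x0)/h) over the training
    points, plus a final entry for the test-point atom (the query point is
    taken to be x0 itself here, so its kernel value is K(0) = 1)."""
    k = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in train_x]
    tot = sum(k) + 1.0
    return [v / tot for v in k] + [1.0 / tot]

p = kernel_localization_probs([0.0, 0.1, 3.0], x0=0.0, h=0.5)
print(p)  # the far-away point at 3.0 gets essentially zero mass
```

Shrinking h concentrates the mass on training points near x_0, pushing toward conditional coverage at the cost of a smaller effective sample size, the same trade-off seen in the airfoil experiment.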
Here we are asking for a prediction\nband whose average conditional coverage, in some locally-weighted sense around x, is at least 1 \u2212 \u03b1.\nWe can equivalently write (13) as\n\n(14)\nwhere the probability is taken over the n+1 data points and an independent draw \u03c9 from a distribution\nwhose density is proportional to K. As we can see from (13) (or (14)), this kind of locally-weighted\nguarantee should be close to a guarantee on conditional coverage, when the bandwidth h is small.\nIn order to achieve (13) in a distribution-free manner, we can invoke the weighted conformal inference\nmethodology. In particular, note that we can once more rewrite (14) as\n\n(cid:110)\nYn+1 \u2208 (cid:98)Cn((cid:101)Xn+1)\n\n(cid:111) \u2265 1 \u2212 \u03b1,\n\nPx0\n\n(cid:32)\n\n(cid:40)\n\n(15)\nindependent test point ((cid:101)Xn+1, Yn+1), from (cid:101)P = (cid:101)PX \u00d7 PY |X, where d(cid:101)PX /dPX \u221d K((\u00b7 \u2212 x0)/h).\nwhere Px0 integrates over training points (Xi, Yi), i = 1, . . . , n i.i.d. from P = PX \u00d7 PY |X and an\nNote that this precisely \ufb01ts into the covariate shift setting (5). To be explicit, for any score function S,\n(cid:33)(cid:41)\nand any \u03b1 \u2208 (0, 1), given a center point x0 \u2208 Rd of interest, de\ufb01ne\n(cid:98)Cn(x) =\n\n+ K(cid:0) x\u2212x0\n(cid:1)\u03b4V (x,y)\ni=1 K(cid:0) Xi\u2212x0\n(cid:80)n\n(cid:1) + K(cid:0) x\u2212x0\ni=1 K(cid:0) Xi\u2212x0\n(cid:1)\n(cid:80)n\n(cid:111) \u2265 1 \u2212 \u03b1.\n(cid:110)\nYn+1 \u2208 (cid:98)Cn(Xn+1; x0)\n\n(16)\nThis is \u201calmost\u201d of the desired form (15) (equivalently (13), or (14)), except for one critical caveat.\n\nThe band (cid:98)Cn(\u00b7 ; x0) in (16) was constructed based on knowing the center point x0 in advance. If\nwe were to ask for local conditional coverage at a new point x0, then the entire band (cid:98)Cn(\u00b7; x0) must\n\n, i = 1, . . . , n + 1, are as in (3). 
Then by Corollary 1,\n\ny \u2208 R : V (x,y)\n\nn+1 \u2264 Quantile\n\nwhere V (x,y)\n\ni\n\n(cid:1)\u03b4\u221e\n\n1 \u2212 \u03b1;\n\nh\n\nh\n\ni\n\nh\n\nPx0\n\nh\n\n,\n\nchange (must be recomputed) in order to accommodate the new guarantee.\n\nAcknowledgements. The authors thank the American Institute of Mathematics for supporting and\nhosting our collaboration. R.F.B. was partially supported by the National Science Foundation under\ngrant DMS-1654076 and by an Alfred P. Sloan fellowship. E.J.C. was partially supported by the\nOf\ufb01ce of Naval Research under grant N00014-16-1-2712, by the National Science Foundation under\ngrant DMS-1712800, and by a generous gift from TwoSigma. R.J.T. was partially supported by the\nNational Science Foundation under grant DMS-1554123.\n\n9\n\n\fReferences\nDeepak Agarwal, Lihong Li, and Alex Smola. Linear-time estimators for propensity scores. Interna-\n\ntional Conference on Arti\ufb01cial Intelligence and Statistics, 2011.\n\nRina Foygel Barber, Emmanuel J. Candes, Aaditya Ramdas, and Ryan J. Tibshirani. The limits of\n\ndistribution-free conditional predictive inference. arXiv preprint arXiv:1903.04684, 2019.\n\nEvgeny Burnaev and Vladimir Vovk. Ef\ufb01ciency of conformalized ridge regression. Annual Conference\n\non Learning Theory, 2014.\n\nXiangli Chen, , Mathew Monfort, Anqi Liu, and Brian Da Ziebart. Robust covariate shift regression.\n\nInternational Conference on Arti\ufb01cial Intelligence and Statistics, 2016.\n\nDheeru Dua and Casey Graff. UCI machine learning repository, 2019. URL http://archive.ics.\n\nuci.edu/ml.\n\nArthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard\nScholkopf. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning,\nchapter 8, pages 131\u2013160. MIT Press, 2009.\n\nJing Lei and Larry Wasserman. 
Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B, 76(1):71–96, 2014.

Jing Lei, Alessandro Rinaldo, and Larry Wasserman. A conformal prediction approach to explore functional data. Annals of Mathematics and Artificial Intelligence, 74(1–2):29–43, 2015.

Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. European Conference on Machine Learning, 2002.

Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2009.

Sashank J. Reddi, Barnabas Poczos, and Alex Smola. Doubly robust covariate shift correction. AAAI Conference on Artificial Intelligence, 2015.

Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9:371–421, 2008.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Masashi Sugiyama and Klaus-Robert Muller. Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions, 23(4):249–279, 2005.

Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Muller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.

Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

Vladimir Vovk. Conditional validity of inductive conformal predictors.
Asian Conference on Machine Learning, 2012.

Vladimir Vovk. Transductive conformal predictors. Symposium on Conformal and Probabilistic Prediction with Applications, 2013.

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.

Vladimir Vovk, Ilia Nouretdinov, and Alex Gammerman. On-line predictive linear regression. Annals of Statistics, 37(3):1566–1590, 2009.

Junfeng Wen, Chun-Nam Yu, and Russell Greiner. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. International Conference on Machine Learning, 2014.
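As a closing illustration of the locally-weighted construction (16) discussed above, the kernel-weighted band at a fixed center point x_0 can be sketched in a few lines. This is our own sketch, not the authors' code: a Gaussian kernel and split-conformal absolute-residual scores are assumed, and all names are ours.

```python
# Sketch (our own illustration) of the kernel-localized weighted conformal
# band (16) at a fixed center x0, with split-conformal absolute-residual
# scores and a Gaussian kernel (both assumptions of this sketch).
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * np.asarray(u, float) ** 2)

def local_band(x0, x_hold, v_hold, x, mu_x, h, alpha=0.1):
    """Interval C_n(x; x0): the weighted (1 - alpha)-quantile of the holdout
    scores v_hold, with kernel weights K((X_i - x0)/h) on the scores and an
    extra point mass at +infinity carrying weight K((x - x0)/h)."""
    k = gaussian_kernel((x_hold - x0) / h)   # weights on holdout scores
    k_x = gaussian_kernel((x - x0) / h)      # weight on the +inf atom
    p = np.append(k, k_x) / (k.sum() + k_x)  # normalized, as in (16)
    v = np.append(v_hold, np.inf)
    order = np.argsort(v)
    cum = np.cumsum(p[order])
    qhat = v[order][min(np.searchsorted(cum, 1 - alpha), len(v) - 1)]
    return mu_x - qhat, mu_x + qhat
```

As the text emphasizes, the band depends on x_0: asking for local coverage around a new center requires recomputing the weights, and hence the entire band.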