{"title": "Confounding-Robust Policy Improvement", "book": "Advances in Neural Information Processing Systems", "page_first": 9269, "page_last": 9279, "abstract": "We study the problem of learning personalized decision policies from observational data while accounting for possible unobserved confounding in the data-generating process. Unlike previous approaches that assume unconfoundedness, i.e., no unobserved confounders affected both treatment assignment and outcomes, we calibrate policy learning for realistic violations of this unverifiable assumption with uncertainty sets motivated by sensitivity analysis in causal inference. Our framework for confounding-robust policy improvement optimizes the minimax regret of a candidate policy against a baseline or reference \"status quo\" policy, over an uncertainty set around nominal propensity weights. We prove that if the uncertainty set is well-specified, robust policy learning can do no worse than the baseline, and only improve if the data supports it. We characterize the adversarial subproblem and use efficient algorithmic solutions to optimize over parametrized spaces of decision policies such as logistic treatment assignment. We assess our methods on synthetic data and a large clinical trial, demonstrating that confounded selection can hinder policy learning and lead to unwarranted harm, while our robust approach guarantees safety and focuses on well-evidenced improvement.", "full_text": "Confounding-Robust Policy Improvement\n\nCornell University and Cornell Tech\n\nCornell University and Cornell Tech\n\nAngela Zhou\n\nNew York, NY\n\naz434@cornell.edu\n\nNathan Kallus\n\nNew York, NY\n\nkallus@cornell.edu\n\nAbstract\n\nWe study the problem of learning personalized decision policies from observational\ndata while accounting for possible unobserved confounding in the data-generating\nprocess. 
Unlike previous approaches that assume unconfoundedness, i.e., no\nunobserved confounders affected both treatment assignment and outcomes, we\ncalibrate policy learning for realistic violations of this unveri\ufb01able assumption\nwith uncertainty sets motivated by sensitivity analysis in causal inference. Our\nframework for confounding-robust policy improvement optimizes the minimax\nregret of a candidate policy against a baseline or reference \u201cstatus quo\u201d policy,\nover an uncertainty set around nominal propensity weights. We prove that if the\nuncertainty set is well-speci\ufb01ed, robust policy learning can do no worse than the\nbaseline, and only improve if the data supports it. We characterize the adversarial\nsubproblem and use ef\ufb01cient algorithmic solutions to optimize over parametrized\nspaces of decision policies such as logistic treatment assignment. We assess our\nmethods on synthetic data and a large clinical trial, demonstrating that confounded\nselection can hinder policy learning and lead to unwarranted harm, while our robust\napproach guarantees safety and focuses on well-evidenced improvement.\n\nIntroduction\n\n1\nThe problem of learning personalized decision policies to study \u201cwhat works and for whom\u201d in areas\nsuch as medicine and e-commerce often endeavors to draw insights from observational data, since\ndata from randomized experiments may be scarce and costly or unethical to acquire [12, 3, 30, 6,\n13]. These and other approaches for drawing conclusions from observational data in the Neyman-\nRubin potential outcomes framework generally appeal to methodologies such as inverse-propensity\nweighting, matching, and balancing, which compare outcomes across groups constructed such that\nassignment is almost as if at random [23]. 
These methods rely on the controversial assumption of\nunconfoundedness, which requires that the data are suf\ufb01ciently informative of treatment assignment\nsuch that no unobserved confounders jointly affect treatment assignment and individual response [24].\nThis key assumption may be made to hold ex ante by directly controlling the treatment assignment\npolicy as sometimes done in online advertising [4], but in other domains of key interest such as\npersonalized medicine where electronic medical records (EMRs) are increasingly being analyzed ex\npost, unconfoundedness may never truly hold in fact.\nAssuming unconfoundedness, also called ignorability, conditional exogeneity, or selection on observ-\nables, is controversial because it is fundamentally unveri\ufb01able since the counterfactual distribution\nis not identi\ufb01ed from the data, thus rendering any insights from observational studies vulnerable to\nthis fundamental critique [11]. If the data is truly unconfounded, it would be known by construction\nbecause it would come from an RCT or logged bandit; any data whose unconfoundedness is uncertain\nmust be confounded to some extent. The growing availability of richer observational data such as\nfound in EMRs renders unconfoundedness more plausible, yet it still may never be fully satis\ufb01ed in\npractice. Because unconfoundedness may fail to hold, existing policy learning methods that assume\nit can lead to personalized decision policies that seek to exploit individual-level effects that are not\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\freally there, may intervene where not necessary, and may in fact lead to net harm rather than net\ngood. 
Such dangers constitute obvious impediments to the use of policy learning to enhance decision\nmaking in such sensitive applications as medicine, public policy, and civics.\nTo address this de\ufb01ciency, in this paper we develop a framework for robust policy learning and\nimprovement that can ensure that a personalized decision policy derived from observational data,\nwhich inevitably may have some unobserved confounding, does no worse than a current policy such as\nthe standard of care and in fact does better if the data can indeed support it. We do so by recognizing\nand accounting for the potential confounding in the data and require that the learned policy improve\nupon a baseline no matter the direction of confounding. Thus, we calibrate personalized decision\npolicies to address sensitivity to realistic violations of the unconfoundedness assumption. For the\npurposes of informing reliable and personalized decision-making that leverages modern machine\nlearning, point identi\ufb01cation of individual-level causal effects, which previous approaches rely on,\nmay not be at all necessary for success, but accounting for the lack of identi\ufb01cation is.\nFunctionally, our approach is to optimize a policy to achieve the best worst-case improvement relative\nto a baseline treatment assignment policy such as treat all or treat none, where the improvement\nis measured using a weighted average of outcomes and weights take values in an uncertainty set\naround the nominal inverse propensity weights (IPW). This generalizes the popular class of IPW-\nbased approaches to policy learning, which optimize an unbiased estimator for policy value under\nunconfoundedness [15, 28, 27]. Unlike standard approaches, in our approach the choice of baseline\nis material and changes the resulting policy chosen by our method. 
This framing supports reliable decision-making in practice, as often a practitioner is seeking evidence of substantial improvement upon the standard of care or a default option, and/or the intervention under consideration introduces risk of toxicity or adverse effects and should not be applied without strong evidence.
Our contributions are as follows: we provide a framework for performing policy improvement which is robust in the face of unobserved confounding. Our framework allows for the specification of data-driven uncertainty sets, based on the sensitivity parameter describing a pointwise multiplicative bound, as well as allowing for a global uncertainty budget which restricts the total deviation proportionally to the maximal ℓ1 discrepancy between the true propensities and nominal propensities. Leveraging the optimization structure of the robust subproblem, we provide algorithms for performing policy optimization. We assess performance on a synthetic example as well as a large clinical trial.

2 Problem Statement and Preliminaries
We assume the observational data consists of tuples of random variables {(X_i, T_i, Y_i) : i = 1, …, n}, comprising covariates X_i ∈ X, assigned treatment T_i ∈ {−1, +1}, and real-valued outcomes Y_i ∈ R. Using the Neyman-Rubin potential outcomes framework, we let Y_i(+1) and Y_i(−1) denote the potential outcomes of applying treatments +1 and −1, respectively. We assume that the observed outcome is the potential outcome for the observed treatment, Y_i = Y_i(T_i), encapsulating non-interference and consistency, also known as SUTVA [25]. We also use the convention that the outcomes Y_i correspond to losses, so that lower outcomes are better.
We consider evaluating and learning a (randomized) treatment assignment policy mapping covariates to the probability of assigning treatment, π : X → [0, 1]. We focus on a policy class π ∈ F of restricted complexity.
Examples include linear policies π(x) = I[θ⊤x > 0], logistic policies π(x) = σ(θ⊤x) where σ(z) = 1/(1 + e^{−z}), or decision trees of bounded depth. We allow the candidate policy π to be either deterministic or stochastic, and denote the random variable indicating the realization of treatment assignment for some X_i to be a Bernoulli random variable Z^π_i such that π(X_i) = Pr[Z^π_i = 1 | X_i].
The goal of policy evaluation is to assess the policy value,

V(π) = E[Y(Z^π)] = E[π(X_i) Y(+1) + (1 − π(X_i)) Y(−1)],

the population average outcome induced by the policy π. The problem of policy optimization seeks to find the best such policy over the parametrized function class F. Both of these tasks are hindered by residual confounding, since then V(π) cannot actually be identified from the data.
Motivated by the sensitivity model in [22] and without loss of generality, we assume that there is an additional but unobserved covariate U_i such that unconfoundedness would hold if we were to control for both X_i and U_i, that is, such that E[Y_i(t) | X_i, U_i, T_i] = E[Y_i(t) | X_i, U_i] for t ∈ {−1, +1}.

Equivalently, we can treat the data as collected under an unknown logging policy that based its assignment on both X_i and U_i and that assigned T_i = +1 with probability e(X_i, U_i) = Pr[T = +1 | X_i, U_i]. Here, e(X_i, U_i) is precisely the true propensity score of unit i. Since we do not have access to U_i in our data, we instead presume that we have access only to nominal propensities ê(X_i) = Pr[T = +1 | X_i], which do not account for the potential unobserved confounding. These are either part of the data or can be estimated directly from the data using a probabilistic classification model such as logistic regression. 
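For concreteness, the policy value V(π) can be computed exactly on synthetic data where both potential outcomes are available; a minimal sketch with made-up numbers (on real observational data only one potential outcome per unit is observed, which is exactly why the identification problem below arises):

```python
import math

def logistic_policy(x, theta):
    """pi(x) = sigma(theta^T x): probability of assigning treatment +1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def policy_value(X, Y_pos, Y_neg, pi):
    """Empirical analogue of V(pi) = E[pi(X) Y(+1) + (1 - pi(X)) Y(-1)].
    Needs both potential outcomes, so it is only computable on synthetic data."""
    n = len(X)
    return sum(pi(x) * yp + (1.0 - pi(x)) * yn
               for x, yp, yn in zip(X, Y_pos, Y_neg)) / n

# Tiny synthetic example; outcomes are losses, lower is better.
X = [[1.0, 0.5], [1.0, -0.5], [1.0, 2.0]]
Y_pos = [1.0, 2.0, 0.0]   # hypothetical losses under treatment +1
Y_neg = [0.5, 0.5, 3.0]   # hypothetical losses under treatment -1
pi = lambda x: logistic_policy(x, [0.0, 1.0])
v = policy_value(X, Y_pos, Y_neg, pi)
```

The treat-none baseline π0(x) = 0 recovers the average loss under control, mean(Y(−1)), as a special case.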
For compactness, we denote ê_{T_i}(X_i) = ½(1 + T_i) ê(X_i) + ½(1 − T_i)(1 − ê(X_i)) and e_{T_i}(X_i, U_i) = ½(1 + T_i) e(X_i, U_i) + ½(1 − T_i)(1 − e(X_i, U_i)).

2.1 Related Work
Our work builds upon the literatures on policy learning from observational data and on sensitivity analysis in causal inference.
Sensitivity analysis. Sensitivity analysis in causal inference tests the robustness of qualitative conclusions made from observational data to model specification or assumptions such as unconfoundedness. In this work, we focus on structural assumptions bounding how unobserved confounding affects selection, without restriction on how unobserved confounding affects outcomes. In particular, we focus on the implications of confounding for personalized treatment decisions.
Rosenbaum's model for sensitivity analysis assesses the robustness of matched-pairs randomization inference to the presence of unobserved confounding by considering a uniform bound on the impact of confounding on the odds ratio of treatment assignment [22]. Motivated by a logistic specification, in this model the odds ratio for two units with the same covariates X_i = X_j, which differs due to the units' different values U_i, U_j for the unobserved confounder, is e^{log(Γ)(U_i − U_j)}, and U_i, U_j ∈ [0, 1] may be arbitrary. We consider a variant, also called the "marginal sensitivity model" in [34], which instead bounds the log-odds ratio between e(X_i) and e(X_i, U_i).
In the sampling literature, the weight-normalized estimator for the population mean is known as the Hajek estimator, and Aronow and Lee [1] derive sharp bounds on the estimator arising from a uniform bound on the sampling weights, showing a closed-form solution to the fractional linear program for a uniform bound on the sampling probabilities. 
[34] considers bounds on the Hajek estimator, but imposes a parametric model on the treatment assignment probability.
Sensitivity analysis is also related to the literature on partial identification of treatment effects [17]. Similar bounds studied in [33] in the transfer learning setting rely on no knowledge but the law of total probability. Our approach instead uses sensitivity analysis based on the estimated propensities as a starting point and leverages additional information about how far they are from the true propensities to achieve tighter bounds that interpolate between the fully-unconfounded and arbitrarily-confounded regimes. [19] considers tightening the bounds from the Hajek estimator by adding shape constraints, such as log-concavity, on the cumulative distribution of outcomes Y. [18] considers sharp partially identified bounds under the assumption of a uniform bound on nominal propensities, sup_U |Pr[T = +1 | X] − Pr[T = +1 | X, U]| ≤ c. We focus on the implications of sensitivity analysis for policy-learning approaches to learning optimal treatment policies from observational data.
Policy learning from observational data under unconfoundedness. A variety of approaches for learning personalized intervention policies that maximize causal effect have been proposed, but all under the assumption of unconfoundedness. These fall under regression-based strategies [21], reweighting-based strategies [3, 12, 13, 28], or doubly robust combinations thereof [6, 30]. Regression-based strategies estimate the conditional average treatment effect (CATE), E[Y(+1) − Y(−1) | X], either directly or by differencing two regressions, and use it to score the policy. 
Without unconfoundedness, however, CATE is not identifiable from the data and these methods have no guarantees.
Reweighting-based strategies use inverse-probability weighting (IPW) to change measure from the outcome distribution induced by a logging policy to that induced by the policy π. Specifically, these methods use the fact that, under unconfoundedness, V̂_IPW(π) is unbiased for V(π) [15], where

V̂_IPW(π) = (1/n) Σ_{i=1}^n (1 + T_i(2π(X_i) − 1)) Y_i / (2 ê_{T_i}(X_i)).   (1)

Optimizing V̂_IPW(π) can be phrased as a weighted classification problem [3]. Since dividing by propensities can lead to extreme weights and high-variance estimates, additional strategies such as clipping the probabilities away from 0 and normalizing by the sum of weights as a control variate are typically necessary for good performance [27, 32]. With or without these fixes, if there are unobserved confounders, none of these estimators is consistent for V(π) and learned policies may introduce more harm than good.
A separate literature in reinforcement learning considers the idea of safe policy improvement by minimizing the regret against a baseline policy, forming an uncertainty set around the presumed unknown transition probabilities between states as in [29], or forming a trust region for safe policy exploration via concentration inequalities on the importance-reweighted estimates of policy risk [20].

3 Robust policy evaluation and improvement
Our framework for confounding-robust policy improvement minimizes a bound on policy regret against a specified baseline policy π_0, R_{π_0}(π) = V(π) − V(π_0). 
Our bound is achieved by maximizing a reweighting-based regret estimate over an uncertainty set around the nominal propensities. This ensures that we cannot do any worse than π_0 and may do better, even if the data is confounded.
The baseline policy π_0 can be any fixed policy that we want to make sure not to do worse than, or deviate from unnecessarily. This is usually the current standard of care, established from prior evidence, and can be a policy that actually depends on x. Generally, we think of this as the policy that always assigns control. Alternatively, if a reliable estimate of the average treatment effect, E[Y(+1) − Y(−1)], is available, then π_0 can be the constant π_0(x) = I[E[Y(+1) − Y(−1)] < 0]. In an agnostic extreme, π_0 can be the complete randomization policy π_0(x) = 1/2.

3.1 Confounding-robust policy learning by optimizing minimax regret
If we had oracle access to the true inverse propensities W*_i = 1/e_{T_i}(X_i, U_i), we could form the correct IPW estimate by replacing nominal with true propensities in eq. (1). We may go a step further and, recognizing that E[1/e_{T_i}(X_i, U_i)] = 2, use the empirical sum of true inverse propensities as a control variate by normalizing our IPW estimate by them. This gives rise to the following Hajek estimators of V(π) and correspondingly R_{π_0}(π):

V̂*(π) = Σ_{i=1}^n W*_i (1 + T_i(2π(X_i) − 1)) Y_i / Σ_{i=1}^n W*_i,
R̂*_{π_0}(π) = V̂*(π) − V̂*(π_0) = 2 Σ_{i=1}^n W*_i (π(X_i) − π_0(X_i)) T_i Y_i / Σ_{i=1}^n W*_i.

It follows by Slutsky's theorem that these estimates remain consistent (if we know W*_i). 
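The nominal IPW estimator of eq. (1) and the Hajek-style regret estimate are direct to compute; a minimal sketch with hypothetical data (function and variable names are ours). With `w` set to the nominal inverse propensities this recovers the unconfounded estimate; the robust approach instead takes a worst case over `w`:

```python
def e_obs(e1, t):
    """Propensity of the observed treatment:
    e_T(x) = (1+T)/2 * e(x) + (1-T)/2 * (1 - e(x))."""
    return e1 if t == 1 else 1.0 - e1

def ipw_value(Y, T, e1, pi_vals):
    """Eq. (1): (1/n) sum_i (1 + T_i (2 pi(X_i) - 1)) Y_i / (2 e_{T_i}(X_i))."""
    n = len(Y)
    return sum((1 + t * (2 * p - 1)) * y / (2 * e_obs(e, t))
               for y, t, e, p in zip(Y, T, e1, pi_vals)) / n

def hajek_regret(Y, T, w, pi_vals, pi0_vals):
    """Weighted regret estimate:
    2 sum_i w_i (pi(X_i) - pi0(X_i)) T_i Y_i / sum_i w_i."""
    num = 2 * sum(wi * (p - p0) * t * y
                  for wi, p, p0, t, y in zip(w, pi_vals, pi0_vals, T, Y))
    return num / sum(w)

# Hypothetical data: T in {+1, -1}, Y are losses, e1 = nominal P(T = +1 | X).
Y = [1.0, -0.5, 2.0, 0.0]
T = [1, -1, 1, -1]
e1 = [0.6, 0.3, 0.8, 0.5]
pi_vals = [1.0, 1.0, 0.0, 0.0]   # candidate policy pi(X_i)
pi0_vals = [0.0, 0.0, 0.0, 0.0]  # baseline: always assign control (-1)
w = [1.0 / e_obs(e, t) for e, t in zip(e1, T)]  # nominal inverse propensities
```

Note that the regret of the baseline against itself is exactly zero, reflecting the "do no worse" anchor of the framework.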
Note that had we known W*_i, both the normalization and the choice of π_0 would have amounted to constant shifts and scales of R̂*_{π_0}(π) that would not have changed the choice of π to minimize the regret estimate. This will not be true of our bound, where both the normalization and the choice of π_0 will be material.
Since the oracle weights W*_i are unknown, we instead minimize the worst-case possible value of our regret estimate, ranging over the space of possible values of e_{T_i}(X_i, U_i) that are consistent with the observed data and our assumptions about the confounded data-generating process. Specifically, our model restricts the extent to which unobserved confounding may affect assignment probabilities. We first consider an uncertainty set motivated by the odds-ratio characterization in [22], which restricts how far the weights can vary pointwise from the nominal propensities. Given a bound Γ ≥ 1, the odds-ratio restriction on e(x, u) is that it satisfy the following inequalities:

Γ^{−1} ≤ (1 − ê(x)) e(x, u) / (ê(x)(1 − e(x, u))) ≤ Γ.   (2)

This restriction is motivated by (but is more general than) considering a logistic model where e(x, u) = σ(g(x) + θu), g is any function, u ∈ [0, 1] is bounded without loss of generality, and |θ| ≤ log(Γ). Such a model would necessarily give rise to eq. (2). This restriction also immediately leads to an uncertainty set for the true inverse propensities of the observed treatments of each unit, 1/e_{T_i}(X_i, U_i), which we denote as follows:

U^Γ_n = {W ∈ R^n_+ : a^Γ_i ≤ W_i ≤ b^Γ_i, ∀i = 1, …, n}, where
a^Γ_i = (Γ^{−1}(1 − ê_{T_i}(X_i)) + ê_{T_i}(X_i)) / ê_{T_i}(X_i),  b^Γ_i = (Γ(1 − ê_{T_i}(X_i)) + ê_{T_i}(X_i)) / ê_{T_i}(X_i).

The corresponding bound on empirical regret is R̄_{π_0}(π; U^Γ_n), where for any U ⊆ R^n_+ we define

R̄_{π_0}(π; U) = sup_{W∈U} 2 Σ_{i=1}^n W_i (π(X_i) − π_0(X_i)) T_i Y_i / Σ_{i=1}^n W_i.

We then choose the policy π in our class that minimizes this regret bound, i.e., π(F, U^Γ_n, π_0), where

π(F, U, π_0) ∈ argmin_{π∈F} R̄_{π_0}(π; U).   (3)

In particular, for our estimate R̄_{π_0}(π; U^Γ_n), weight normalization is crucial for only enforcing robustness against consequential realizations of confounding which affect the relative weighting of patient outcomes; otherwise, robustness against confounding would simply assign weights at their highest possible bounds for positive Y_i T_i. If the baseline policy is in the policy class F, it already achieves 0 regret; thus, minimizing regret necessitates learning regions of policy treatment assignment where evidence from observed outcomes suggests benefits in terms of decreased loss. Different baseline policies π_0 = 0, 1 structurally change the solution to the adversarial subproblem by shifting the contribution of the loss term Y_i T_i (π(X_i) − π_0(X_i)) to emphasize improvement upon the baseline.
Budgeted uncertainty sets to address "local" confounding. Our approach can be pessimistic in ensuring robustness against worst-case realizations of unobserved confounding "globally" for each unit, whereas concerns about unobserved confounding may be restricted to a subset of the population, due to subgroup risk factors or outliers. For the Rosenbaum model in hypothesis testing, this has been recognized by [7, 9], who address it by limiting the average of the unobserved propensities by an additional sensitivity parameter. 
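The pointwise bounds a^Γ_i, b^Γ_i depend only on Γ and the nominal propensity of the observed treatment; a minimal sketch in plain Python (names are ours, not the paper's):

```python
def weight_bounds(e_hat, gamma):
    """Bounds on the true inverse propensity 1/e_{T_i}(X_i, U_i) implied by the
    odds-ratio restriction of eq. (2):
      a_i = (Gamma^{-1} (1 - e_hat) + e_hat) / e_hat,
      b_i = (Gamma (1 - e_hat) + e_hat) / e_hat,
    where e_hat is the nominal propensity of the observed treatment."""
    a = ((1.0 - e_hat) / gamma + e_hat) / e_hat
    b = (gamma * (1.0 - e_hat) + e_hat) / e_hat
    return a, b

# At Gamma = 1 (unconfoundedness) the interval collapses to the nominal weight 1/e_hat.
a, b = weight_bounds(0.25, 1.0)
# As Gamma grows, the interval widens around the nominal weight 1/0.25 = 4.
a2, b2 = weight_bounds(0.25, 2.0)
```

This makes explicit how Γ interpolates between the unconfounded point estimate and progressively wider worst-case sets.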
Motivated by this, we next consider an alternative uncertainty set, where we fix a budget Λ for how much the weights can diverge from the nominal inverse propensity weights in total. Specifically, letting Ŵ_i = 1/ê_{T_i}(X_i), we construct the uncertainty set

U^{Γ,Λ}_n = {W ∈ R^n_+ : Σ_{i=1}^n |W_i − Ŵ_i| ≤ Λ, a^Γ_i ≤ W_i ≤ b^Γ_i, ∀i = 1, …, n}.

When plugged into eq. (3), this provides an alternative policy choice criterion that is less conservative. We suggest to calibrate Λ as a fraction ρ < 1 of the total deviation allowed by U^Γ_n. Specifically, Λ = ρ Σ_{i=1}^n max(Ŵ_i − a^Γ_i, b^Γ_i − Ŵ_i). This is the approach we take in our empirical investigation.

3.2 The Improvement Guarantee

We next prove that if we appropriately bounded the potential hidden confounding, then our worst-case empirical regret objective is asymptotically an upper bound on the true population regret. On the one hand, since our objective is necessarily non-positive if π_0 ∈ F, this says we do no worse. On the other hand, if our objective is negative, which we can check by just evaluating it, then we are assured some strict improvement. Our result is generic for both U^Γ_n and U^{Γ,Λ}_n.
Our upper bound depends on the complexity of our policy class. Define its Rademacher complexity:

R_n(F) = (1/2^n) Σ_{ε∈{−1,+1}^n} sup_{π∈F} (1/n) Σ_{i=1}^n ε_i π(X_i).

All the policy classes we consider have √n-vanishing complexities, i.e., R_n(F) = O(n^{−1/2}).
Theorem 1. Suppose that (1/e(X_1, U_1), …, 1/e(X_n, U_n)) ∈ U and that ν ≤ e(x, u) ≤ 1 − ν for some ν > 0 and |Y| ≤ C for some C ≥ 1. 
Then for any δ > 0 such that n ≥ ν^{−2} log(5/δ)/2, we have that, with probability at least 1 − δ,

R_{π_0}(π) = V(π) − V(π_0) ≤ R̄_{π_0}(π; U) + 2 R_n(F) + (C/ν) √(8 log(5/δ)/n)  ∀π ∈ F.   (4)

In particular, if we let π = π(F, U, π_0) be as in eq. (3), then eq. (4) holds for π, which minimizes the right-hand side. So, if the objective R̄_{π_0}(π; U) is negative, we are (almost) assured of getting some improvement on π_0. At the same time, so long as π_0 ∈ F, the objective is necessarily non-positive, so we are also (almost) assured of doing no worse than π_0. Our guarantee of improvement holds, under well-specification, without requiring effect identification despite the hidden confounding. Thus, Theorem 1 exactly captures the appeal of our approach.

3.3 Calibration of the uncertainty parameter Γ
In our framework, appropriate choice of Γ is both important for ensuring that we avoid harm and context-dependent. The assumption that there exists a finite Γ < ∞ that satisfies eq. (2) is itself untestable, just like unconfoundedness (which corresponds to Γ = 1). Since we focus on enabling safe policy learning in domains where one errs toward safety in case of ignorance, if absolutely nothing is known then Γ = ∞ is the right choice and there is no hope for strictly safe improvement. However, practitioners generally have domain-level knowledge about the missing variables that may impact selection. This can guide the choice of Γ < ∞, which our method leverages to offer some improvement while ensuring safety. In particular, one way that the value of Γ can be calibrated is by judging it against the discrepancies in estimated propensities that are induced by omitting observed variables [10]. 
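This calibration idea can be made concrete: refit the propensity model with an observed covariate (say, age) deleted, and take the largest odds ratio between the two fitted propensities as a reference value for Γ. A minimal sketch (our own illustrative code, not the procedure of [10] verbatim; the fitted propensity vectors are assumed given):

```python
def implied_gamma(e_full, e_dropped):
    """Largest odds ratio between propensities fit with and without a covariate.
    Mirrors eq. (2): Gamma bounds the odds ratio of e(x,u) against e_hat(x)."""
    def odds(p):
        return p / (1.0 - p)
    ratios = [max(odds(p) / odds(q), odds(q) / odds(p))
              for p, q in zip(e_full, e_dropped)]
    return max(ratios)

# Hypothetical fitted propensities with and without "age" in the model.
e_full = [0.30, 0.55, 0.70]
e_dropped = [0.35, 0.50, 0.60]
gamma_cal = implied_gamma(e_full, e_dropped)
```

A practitioner who believes no omitted variable moves treatment odds more than age does would then choose Γ at least as large as `gamma_cal`.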
Then, determining a reasonable upper bound for Γ can be phrased in terms of whether one thinks one has omitted a variable that could have increased or decreased the probability of treatment by as much as a particular observed variable. For example, a bound for Γ can be implied by claiming one has not omitted a variable with as much impact on treatment as does, say, age, if age were observed. Additionally, when alternative outcome data is available, other approaches such as negative controls can be used to provide a lower bound for Γ [16]. If one knows that the treatment does not have an effect on a particular outcome but one is observed in the data, then Γ must be sufficiently large to invalidate that observed effect. These tools can be combined to derive a reasonable range for Γ in practice. Since our focus is on safety, we suggest to err toward larger Γ.

4 Optimizing Robust Policies
We next discuss how to solve the policy optimization problem in eq. (3). We focus on differentiable parametric policies, F = {π(·; θ) : θ ∈ Θ}, such as logistic policies. We first discuss how to solve the worst-case regret subproblem for a fixed policy, which we will then use to develop our algorithm.

4.1 Dual Formulation of Worst-Case Regret
The minimization in eq. (3) for U = U^Γ_n involves an inner supremum, namely R̄_{π_0}(π; U^Γ_n). Moreover, this supremum over weights W does not on the face of it appear to be convex. We next proceed to characterize this supremum, formulate it as a linear program, and, by dualizing it, provide an efficient procedure for finding the pessimal weights.
For compactness and generality, we address the optimization problem Q(r; U^Γ_n) parameterized by an arbitrary reward vector r ∈ R^n, where

Q(r; U) = max_{W∈U} Σ_{i=1}^n r_i W_i / Σ_{i=1}^n W_i.   (5)

To recover R̄_{π_0}(π; U), we would simply set r_i = 2(π(X_i) − π_0(X_i)) T_i Y_i. 
Since U^Γ_n involves only linear constraints on W, eq. (5) for U = U^Γ_n is a linear fractional program. We can reformulate it as a linear program by applying the Charnes-Cooper transformation [5], requiring weights to sum to 1, and rescaling the pointwise bounds by a nonnegative scale factor t. We obtain the following equivalent linear program, where we let w ∈ R^n_+ denote the normalized weights:

Q(r; U^Γ_n) = max_{t≥0, w≥0} {Σ_{i=1}^n r_i w_i : Σ_{i=1}^n w_i = 1; t a^Γ_i ≤ w_i ≤ t b^Γ_i, ∀i = 1, …, n}.   (6)

The dual problem to eq. (6) has dual variables λ ∈ R for the weight normalization constraint and u, v ∈ R^n_+ for the lower bound and upper bound constraints on weights, respectively, and is given by

min_{u,v≥0, λ∈R} {λ : b⊤v − a⊤u ≤ 0, λ + v_i − u_i ≥ r_i, ∀i = 1, …, n}.   (7)

We use this to show that solving the adversarial subproblem requires only sorting the data and ternary search to optimize a unimodal function, generalizing the result of Aronow and Lee [1] to arbitrary pointwise bounds on the weights. Crucially, the algorithmically efficient solution will allow for faster subproblem solutions when optimizing our regret bound over policies in a given policy class.
Theorem 2 (Normalized optimization solution). Let (i) denote the ordering such that r_{(1)} ≤ r_{(2)} ≤ ··· ≤ r_{(n)}. Then Q(r; U^Γ_n) = λ(k*), where k* = inf{k = 1, …, n + 1 : λ(k) < λ(k − 1)} and

λ(k) = (Σ_{i<k} a^Γ_{(i)} r_{(i)} + Σ_{i≥k} b^Γ_{(i)} r_{(i)}) / (Σ_{i<k} a^Γ_{(i)} + Σ_{i≥k} b^Γ_{(i)}).

Moreover, λ(k) is a discrete concave unimodal function.

Fig. 2(c): average death prognosis among patients treated with π(X) > 0.4.

The difference-in-means estimate of the ATE for the composite score in the full data is significant at 0.13, suggesting that heparin is overall harmful. Without access to the true counterfactual outcomes for patients, our oracle estimates are IPW-based estimates from the held-out RCT data with treatment-assignment probabilities p_1 = 2/3 and p_{−1} = 1/3. 
We use an out-of-sample Horvitz-Thompson estimate of policy regret relative to π_0(x) = 0 based on the held-out dataset S_test, R^test_{π_0}(π) = (1/|S_test|) Σ_{i∈S_test} Y_i T_i π(X_i)/p_{T_i}. In Fig. 2a, we evaluate on 10 draws from the dataset, comparing our policies against the vanilla IPW estimator Σ_i Y_i Pr[π_i = T_i]/Pr[T = T_i] with a probabilistic policy, and against assigning based on the sign of the CATE prediction from causal forests [31]. The selected datasets average a size of n_train = 2430. We evaluate logistic parametric policies (CRLogit) and budgeted ones (CRLogit.L1) with ρ = 0.5. For the parametric policies, we optimize with the same parameters as earlier. We evaluate log(Γ) = 0.1, 0.2, every 0.025 between 0.25 and 0.45, every 0.2 between log(Γ) = 0.5 and 1.5, and Γ = 2. For small values of log(Γ), our methods perform similarly to IPW. As log(Γ) increases, our methods achieve policy improvement, though the ℓ1-budgeted method (CRLogit.L1) achieves worse performance. For log(Γ) > 0.9, the robust policy essentially learns the all-control policy; our finite-sample regret estimator simply indicates good regret for a negligible number of patients (5-6).
In Figs. 2b-2c, we study the behavior of the robust policies. The IST trial recorded a prognosis score of probability of death at 6 months for patients, using an externally validated model, which we do not include in the training data but use to assess the validity of our robust policy. In Fig. 2c, we consider the average prognosis score of death among patients treated with π(X) > 0.4. In Fig. 2b, for log(Γ) ∈ [0.3, 0.5], the policy considers treating 1-20% of patients, and the average prognosis score of the population under consideration increases, indicating that the policy is learning and treating on appropriate indicators of severity from the available covariates. 
For log() > 0.9, the\nnoise in the prognosis score is due to the small treated subgroups (while the unbudgeted policy does\nnot learn a policy that improves upon control, so we default to control and truncate the plot).\nOur learned policies suggest that improvements from heparin may be seen in the highest-risk patients,\nconsistent with the \ufb01ndings of [2], a systematic review comparing anticoagulants such as heparin\nagainst aspirin. They conclude from a study of a number of trials, including IST, that heparin provides\nlittle therapeutic bene\ufb01t, with the caveat that the trial evidence base is lacking for the highest-risk\npatients where heparin may be of bene\ufb01t. Thus, our robust method appropriately treats those, and\nonly those, who stand to bene\ufb01t from the more aggressive treatment regime.\n\n6 Conclusion\n\nWe developed a framework for estimating and optimizing for robust policy improvement, which\noptimizes the minimax regret of a candidate personalized decision policy against a baseline policy.\nWe optimize over uncertainty sets centered at the nominal propensities, and leverage the optimization\nstructure of normalized estimators to perform policy optimization ef\ufb01ciently by subgradient descent\non the robust risk. Assessments on synthetic and clinical data demonstrate the bene\ufb01ts of robust\npolicy improvement.\n\n9\n\n\fAcknowledgments\nThis material is based upon work supported by the National Science Foundation under Grant No.\n1656996. Angela Zhou is supported through the National Defense Science & Engineering Graduate\nFellowship Program.\n\nReferences\n[1] P. Aronow and D. Lee. Interval estimation of population means under unknown but bounded\n\nprobabilities of sample selection. Biometrika, 2012.\n\n[2] E. Berge and P. A. Sandercock. Anticoagulants versus antiplatelet agents for acute ischaemic\n\nstroke. The Cochrane Library of Systematic Reviews, 2002.\n\n[3] A. Beygelzimer and J. Langford. 
The offset tree for learning with partial labels. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.

[4] L. Bottou, J. Peters, J. Quinonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems. Journal of Machine Learning Research, 2013.

[5] A. Charnes and W. Cooper. Programming with linear fractional functionals. Naval Research Logistics Quarterly, 1962.

[6] M. Dudik, D. Erhan, J. Langford, and L. Li. Doubly robust policy evaluation and optimization. Statistical Science, 2014.

[7] C. Fogarty and R. Hasegawa. An extended sensitivity analysis for heterogeneous unmeasured confounding. 2017.

[8] International Stroke Trial Collaborative Group. The International Stroke Trial (IST): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19435 patients with acute ischaemic stroke. Lancet, 1997.

[9] R. Hasegawa and D. Small. Sensitivity analysis for matched pair analysis of binary data: From worst case to average case analysis. Biometrics, 2017.

[10] J. Y. Hsu and D. S. Small. Calibrating sensitivity analyses to observed covariates in observational studies. Biometrics, 69(4):803–811, 2013.

[11] G. Imbens and D. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

[12] N. Kallus. Recursive partitioning for personalization using observational data. Proceedings of the Thirty-Fourth International Conference on Machine Learning, 2017.

[13] T. Kitagawa and A. Tetenov. Empirical welfare maximization. 2015.

[14] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

[15] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms.
Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, 2011.

[16] M. Lipsitch, E. T. Tchetgen, and T. Cohen. Negative controls: A tool for detecting confounding and bias in observational studies. Epidemiology, 2010.

[17] C. Manski. Social Choice with Partial Knowledge of Treatment Response. The Econometric Institute Lectures, 2005.

[18] M. Masten and A. Poirier. Identification of treatment effects under conditional partial independence. Econometrica, 2018.

[19] L. W. Miratrix, S. Wager, and J. R. Zubizarreta. Shape-constrained partial identification of a population mean under unknown probabilities of sample selection. Biometrika, 2018.

[20] M. Petrik, M. Ghavamzadeh, and Y. Chow. Safe policy improvement by minimizing robust baseline regret. 29th Conference on Neural Information Processing Systems, 2016.

[21] M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.

[22] P. Rosenbaum. Observational Studies. Springer Series in Statistics, 2002.

[23] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 1983.

[24] D. Rubin. Estimating causal effect of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 1974.

[25] D. B. Rubin. Comments on "Randomization analysis of experimental data: The Fisher randomization test comment". Journal of the American Statistical Association, 75(371):591–593, 1980.

[26] G. Still. Lectures on parametric optimization: An introduction. Optimization Online, 2018.

[27] A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. Proceedings of NIPS, 2015.

[28] A. Swaminathan and T. Joachims. Counterfactual risk minimization. Journal of Machine Learning Research, 2015.

[29] P. Thomas, G.
Theocharous, and M. Ghavamzadeh. High confidence policy improvement. Proceedings of the 32nd International Conference on Machine Learning, 2015.

[30] S. Wager and S. Athey. Efficient policy learning. 2017.

[31] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2017.

[32] Y.-X. Wang, A. Agarwal, and M. Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. Proceedings of Neural Information Processing Systems, 2017.

[33] J. Zhang and E. Bareinboim. Transfer learning in multi-armed bandits: A causal approach. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1340–1346, 2017. doi: 10.24963/ijcai.2017/186. URL https://doi.org/10.24963/ijcai.2017/186.

[34] Q. Zhao, D. S. Small, and B. B. Bhattacharya. Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. arXiv, 2017.