{"title": "Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments", "book": "Advances in Neural Information Processing Systems", "page_first": 15193, "page_last": 15202, "abstract": "We consider the estimation of heterogeneous treatment effects with arbitrary machine learning methods in the presence of unobserved confounders with the aid of a valid instrument. Such settings arise in A/B tests with an intent-to-treat structure, where the experimenter randomizes over which user will receive a recommendation to take an action, and we are interested in the effect of the downstream action. We develop a statistical learning approach to the estimation of heterogeneous effects, reducing the problem to the minimization of an appropriate loss function that depends on a set of auxiliary models (each corresponding to a separate prediction task). The reduction enables the use of all recent algorithmic advances (e.g. neural nets, forests). We show that the estimated effect model is robust to estimation errors in the auxiliary models, by showing that the loss satisfies a Neyman orthogonality criterion. Our approach can be used to estimate projections of the true effect model on simpler hypothesis spaces. When these spaces are parametric, then the parameter estimates are asymptotically normal, which enables construction of confidence sets. We applied our method to estimate the effect of membership on downstream webpage engagement for a major travel webpage, using as an instrument an intent-to-treat A/B test among 4 million users, where some users received an easier membership sign-up process. 
We also validate our method on synthetic data and on public datasets for the effects of schooling on income.", "full_text": "Machine Learning Estimation of Heterogeneous\n\nTreatment Effects with Instruments\n\nVasilis Syrgkanis\nMicrosoft Research\n\nvasy@microsoft.com\n\nVictor Lei\nTripAdvisor\n\nvlei@tripadvisor.com\n\nMiruna Oprescu\nMicrosoft Research\n\nmoprescu@microsoft.com\n\nMaggie Hei\n\nMicrosoft Research\n\nMaggie.Hei@microsoft.com\n\nGreg Lewis\n\nMicrosoft Research\n\nglewis@microsoft.com\n\nKeith Battocchi\nMicrosoft Research\n\nkebatt@microsoft.com\n\nAbstract\n\nWe consider the estimation of heterogeneous treatment effects with arbitrary ma-\nchine learning methods in the presence of unobserved confounders with the aid of\na valid instrument. Such settings arise in A/B tests with an intent-to-treat structure,\nwhere the experimenter randomizes over which user will receive a recommendation\nto take an action, and we are interested in the effect of the downstream action. We\ndevelop a statistical learning approach to the estimation of heterogeneous effects,\nreducing the problem to the minimization of an appropriate loss function that\ndepends on a set of auxiliary models (each corresponding to a separate prediction\ntask). The reduction enables the use of all recent algorithmic advances (e.g. neural\nnets, forests). We show that the estimated effect model is robust to estimation errors\nin the auxiliary models, by showing that the loss satis\ufb01es a Neyman orthogonality\ncriterion. Our approach can be used to estimate projections of the true effect model\non simpler hypothesis spaces. When these spaces are parametric, then the parame-\nter estimates are asymptotically normal, which enables construction of con\ufb01dence\nsets. 
We applied our method to estimate the effect of membership on downstream\nwebpage engagement on TripAdvisor, using as an instrument an intent-to-treat\nA/B test among 4 million TripAdvisor users, where some users received an easier\nmembership sign-up process. We also validate our method on synthetic data and\non public datasets for the effects of schooling on income.1\n\n1\n\nIntroduction\n\nA/B testing is the gold standard of causal inference. But even when A/B testing is feasible, estimating\nthe effect of a treatment on an outcome might not be a straightforward task. One major dif\ufb01culty is\nnon-compliance: even if we randomize what treatment to recommend to a subject, the subject might\nnot comply with the recommendation due to unobserved factors and follow the alternate action. The\nimpact that unobserved factors might have on the measured outcome is a source of endogeneity and\ncan lead to biased estimates of the effect. This problem arises in large scale data problems in the\ndigital economy; when optimizing a digital service, we might often want to estimate the effect of\nsome action taken by our users on downstream metrics. However, the service cannot force users to\ncomply, but can only \ufb01nd means of incentivizing or recommending the action. 
The unobserved factors of compliance can lead to biased estimates if we consider the takers and non-takers as exogenously assigned and employ machine learning approaches to estimate the potentially heterogeneous effect of the action on the downstream metric.

1Prototype code for all the algorithms presented and the synthetic data experimental study can be found at https://github.com/Microsoft/EconML/tree/master/prototypes/dml_iv.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The problem can be solved by using the technique of instrumental variable (IV) regression: as long as the recommendation increases the probability of taking the treatment, then we know that there is at least some fraction of users that were assigned the treatment “exogenously”. IV regression parses out this population of “exogenously treated” users and estimates an effect based solely on them.
Most classical IV approaches estimate a constant average treatment effect. However, to make personalized policy decisions (an emerging trend in most digital services) one might want to estimate a heterogeneous effect based on observable characteristics of the user. The latter is a daunting task, as we seek to estimate a function of observable characteristics as opposed to a single number. Hence, statistical power is at stake. Even estimating an ATE is non-trivial when effect and compliance are correlated through observables. The emergence of large data-sets in the digital economy alleviates this concern; with A/B tests running on millions of users it is possible to estimate complex heterogeneous effect models, even if compliance levels are relatively weak. 
Moreover, as we control for more and more observable features of the user, we also reduce the risk that correlation between effect and compliance is stemming from unobserved factors.
This leads to the question this work seeks to answer: how can we blend the power of modern machine learning approaches (e.g. random forests, gradient boosting, penalized regressions, neural networks) with instrumental variable methods, so as to estimate complex heterogeneous effect rules? Recent work at the intersection of machine learning and econometrics has proposed powerful methods for estimating the effect of a treatment on an outcome, while using machine learning methods for learning nuisance models that help de-bias the final effect rule. However, the majority of the work has either focused on 1) estimating average treatment effects or low dimensional parametric effect models (e.g. the double machine learning approach of [11]), 2) developing new algorithms for estimating fully non-parametric models of the effect (e.g. the IV forest method of [4], the DeepIV method of [15]), or 3) assuming that the treatment is exogenous once we condition on the observable features and reducing the problem to an appropriate square loss minimization framework (see e.g. [26, 21]).
Nevertheless, a general reduction of IV based machine learning estimation of heterogeneous effects to a more standard statistical learning problem that can incorporate existing algorithms in a black-box manner has not been formulated in prior work. In fact, the recent work of [26], which develops a statistical learning based approach in the setting with no unobserved confounders, leaves as a major open question the development of an analogous statistical learning approach for our setting, with unobserved confounders and access to valid instruments. 
Such a reduction can help us leverage\nthe recent algorithmic advances in statistical learning theory so as to work with large data-sets.\nOur work proposes the reduction of heterogeneous effects estimation via instruments to a square\nloss minimization problem over a hypothesis space. This enables us to learn not only the true\nheterogeneous effect model, but also the projections of the true model in simpler hypothesis spaces\nfor interpretability. Moreover, our work leverages recent advances in statistical learning with nuisance\nfunctions [12, 13], to show that the mean squared error (MSE) of the learned model is robust to\nthe estimation error of auxiliary models that need to be estimated (as is standard in IV regression).\nThus we achieve MSE rates where the leading term depends only on the sample complexity of the\nhypothesis space of the heterogeneous effect model.\nSome advantages of reducing our problem to a set of standard regression problems include being\nable to use existing algorithms and implementations, as well as recent advances of interpretability\nin machine learning. For instance, in our application we deploy the SHAP framework [23, 22] to\ninterpret random forest based models of the heterogeneous effect. Furthermore, when the hypothesis\nspace is low dimensional and parametric then our approach falls in the setting studied by prior\nwork of [11] and, hence, not only MSE rates but also con\ufb01dence interval construction is relatively\nstraightforward. This enables hypothesis testing on parametric projections of the true effect model.\nWe apply our approach to an intent-to-treat A/B test among 4 million users on a major travel webpage\nso as to estimate the effect of membership on downstream engagement. We identify sources of\nheterogeneity that have policy implications on which users the platform should engage more and\npotentially how to re-design the recommendation to target users with large effects. 
We validate the findings on a different cohort in a separate experiment among 10 million users on the same platform. Even though the new experiment was deployed on a much broader and different cohort, we identify common leading factors of effect heterogeneity, hence confirming our findings. As a robustness check we create semi-synthetic data with similar features and marginal distributions of variables as the real data, but where we know the ground truth. We find that our method performs well in terms of MSE, identification of the relevant factors, and coverage of the confidence intervals.
Finally, we apply our method to a more traditional IV application: estimating the effect of schooling on wages. We use a well-studied public data set and observe that our approach automatically identifies sources of heterogeneity that were previously uncovered using more structural approaches. We also validate our method in this application on semi-synthetic data that emulate the true data.

2 Estimation of Heterogeneous Treatment Effects with Instruments

We consider estimation of heterogeneous treatment effects with respect to a set of features X, of an endogenous treatment T on an outcome Y with an instrument Z. For simplicity of exposition, we will restrict attention to the case where Y, Z and T are scalar variables, but several of our results extend to the case of multi-dimensional treatments and instruments. Z is an instrumental variable if it has an effect on the treatment but does not have a direct effect on the outcome other than through the treatment. More formally, we assume the following moment condition:

E[Y − θ0(X) T − f0(X) | Z, X] = 0   (1)

Equivalently we assume that: Y = θ0(X) T + f0(X) + e, with E[e | Z, X] = 0. We allow for the presence of confounders, i.e. 
e could be correlated with T via some unobserved common factor ν. However, our exclusion restriction on the instrument implies that the residual is mean zero conditional on the instrument. This, together with the fact that the instrument also has an effect on the treatment at any value of the feature X, i.e. Var(E[T | Z, X] | X) ≥ λ, allows us to identify the heterogeneous effect function θ0(X). We focus on the case where the effect is linear in the treatment T, which is wlog in the binary treatment setting, which is our main application, and since our goal is to focus on the non-linearity wrt X (this greatly simplifies our problem, see [9, 10, 25, 16]).2
Given n i.i.d. samples from the data generating process, our goal is to estimate a model θ̂(X) that achieves small expected mean-squared-error, i.e. E[‖θ̂ − θ0‖²] := E[(θ̂(X) − θ0(X))²] ≤ Rn. Since the true θ0 function can be very complex and difficult to estimate in finite samples, we are also interested in estimating projections of the true θ0 on simpler hypothesis spaces Θπ. Projections are also useful for interpretability: one might want to understand what is the best linear projection of θ0(X) on X, i.e. α0 = arg min_α E[(⟨α, X⟩ − θ0(X))²]. In this case we will denote with θ* the projection of θ0 on Θπ, i.e. θ* = arg min_{θ∈Θπ} E[(θ(X) − θ0(X))²], and our goal would be to achieve small mean squared error with respect to θ*. When Θπ is a low dimensional parametric class (e.g. a linear function on a low-dimensional feature space or a constant function), we are also interested in performing inference; i.e. 
constructing confidence intervals that asymptotically contain the correct parameter with probability equal to some target confidence level.
Warm-Up: Estimating the Average Treatment Effect (ATE)  For estimation of the average treatment effect (ATE), assuming that either there is no effect heterogeneity with respect to X or there is no heterogeneous compliance with respect to X, [11] propose a method for estimating the ATE that solves the empirical analogue of the following moment equation:

E[(Y − E[Y | X] − θ (T − E[T | X])) (Z − E[Z | X])] = 0   (2)

This moment function is orthogonal to all the functions q0(X) = E[Y | X], p0(X) = E[T | X] and r0(X) = E[Z | X] that also need to be estimated from data. This moment avoids the estimation of the expected T conditional on Z, X and satisfies an orthogonality condition that enables robustness of the estimate

θ̂ = En[(Y − q̂(X)) (Z − r̂(X))] / En[(T − p̂(X)) (Z − r̂(X))]

to errors in the nuisance estimates q̂, r̂ and p̂. The estimate is asymptotically normal with variance equal to the variance of the method if the estimates were the

2Implicitly our moment condition abstracts away the low level conditions that allow one to interpret the parameter that satisfies the moment condition as causal. For instance, in the case of binary instruments and binary treatments, to interpret the solution to the moment condition as the causal effect one requires a monotonicity condition on the compliance structure [18], i.e. if a unit does not take the treatment when recommended, it would have also not taken the treatment without the recommendation. However, the fact that we also condition on X and we estimate an effect conditional on X only requires these conditions to hold conditional on X, i.e. the directionality of the effect of the instrument on the treatment can change for different X's. 
So the requirements are milder. This weakening has also been observed in prior work in econometrics [19].

correct ones, assuming that the mean squared error of these estimates decays at least at a rate of n^{−1/4} (see [11] for more details). This result requires that the nuisance estimates are fitted in a cross-fitting manner, i.e. we use half of the data to fit a model for each of these functions and then predict the values of the model on the other half of the samples. We refer to this algorithm as DMLATEIV.3
Inconsistency under Effect and Compliance Heterogeneity  The above estimate θ̂ is a consistent estimate of the average treatment effect as long as there is either no effect heterogeneity with respect to X or there is no heterogeneous compliance (i.e. the effect of the instrument on the treatment) with respect to X. Otherwise it is inconsistent. The reason is that, if we let T̃ = T − p0(X) and Z̃ = Z − r0(X), then the population quantity β0(X) = E[T̃ Z̃ | X] is a function of X. If we also have effect heterogeneity, then we are solving for a constant θ̂ that in the limit satisfies E[(Ỹ − θ̂ T̃) Z̃] = 0, where Ỹ = Y − q0(X). On the other hand the true heterogeneous model satisfies the equation E[(Ỹ − θ0(X) T̃) Z̃] = 0. In the limit, the two quantities are related via the equation θ̂ E[T̃ Z̃] = E[θ0(X) T̃ Z̃]. Then the constant effect that we estimate converges to the quantity θ̂ = E[θ0(X) β0(X)] / E[β0(X)]. If θ0(X) is not independent of β0(X), then θ̂ is a re-weighted version of the true average treatment effect E[θ0(X)], re-weighted by the heterogeneous compliance. 
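To make the cross-fitted DMLATEIV recipe concrete, here is a minimal sketch in Python. This is a hypothetical helper built on scikit-learn, not the EconML API; the gradient boosting nuisance learner and the two-fold split are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def dml_ate_iv(Y, T, Z, X):
    """Cross-fitted DMLATEIV sketch: solves the empirical analogue of moment (2),
    theta = En[(Y - q(X))(Z - r(X))] / En[(T - p(X))(Z - r(X))],
    with q(X)=E[Y|X], p(X)=E[T|X], r(X)=E[Z|X] fit on the opposite fold."""
    n = len(Y)
    q, p, r = np.zeros(n), np.zeros(n), np.zeros(n)
    for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
        # Each nuisance is fit on one half and predicted on the other (cross-fitting).
        q[test] = GradientBoostingRegressor().fit(X[train], Y[train]).predict(X[test])
        p[test] = GradientBoostingRegressor().fit(X[train], T[train]).predict(X[test])
        r[test] = GradientBoostingRegressor().fit(X[train], Z[train]).predict(X[test])
    return np.mean((Y - q) * (Z - r)) / np.mean((T - p) * (Z - r))
```

On data with homogeneous effect or homogeneous compliance this ratio recovers the ATE; under joint effect and compliance heterogeneity it converges to the β0-reweighted average described above.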
To account for this heterogeneous compliance we need to change our moment equation so as to re-weight based on β0(X), which is unknown and also needs to be estimated from data. Given that this function could be arbitrarily complex, we want our final estimate to be robust to estimation errors of β0(X). We can achieve this by considering a doubly robust approach to estimating θ̂. Suppose that we had some other method of computing an estimate θ̂(X) of the heterogeneous treatment effect θ0(X); then we can combine both estimates to get a more robust method for the ATE, e.g.:

θ̂_DR = E[ θ̂(X) + (Ỹ − θ̂(X) T̃) Z̃ / β̂(X) ]   (3)

This approach has been analyzed in [27] in the case of constant treatment effects, and an analogue of this average effect was also used by [5] in a policy learning problem as opposed to an estimation problem. In particular, the quantity Z̃/β(X) is known as the compliance score [1, 3]. Our methodological contribution in the next two sections is two-fold: i) first, we propose a model-based stable approach for estimating a preliminary estimate θ̂(X), which does not necessarily require that β(X) > 0 everywhere (an assumption that is implicit in the latter method); ii) second, we show that this doubly robust quantity can be used as a regression target, and that minimizing the square loss with respect to this target corresponds to an orthogonal loss, as defined in [12, 13].

2.1 Preliminary Estimate of Conditional Average Treatment Effect (CATE)

Let h0(Z, X) = E[T | Z, X] and p0, q0 as in the previous section. Then observe that we can re-write the moment condition as E[Y − θ0(X) h0(Z, X) − f0(X) | Z, X] = 0. Moreover, observe that the functions p0, q0 and f0 are related via: q0(X) = θ0(X) p0(X) + f0(X). 
Thus we can further re-write the moment condition in terms of q0, p0 instead of f0: E[Y − q0(X) − θ0(X) (h0(Z, X) − p0(X)) | Z, X] = 0. Moreover, we can identify θ(X) with the following subset of conditional moments, where the conditioning on Z is removed: E[(Y − q0(X) − θ(X) (h0(Z, X) − p0(X))) (h0(Z, X) − p0(X)) | X] = 0. Equivalently, θ(X) is a minimizer of the square loss

L1(θ; q0, h0, p0) := E[(Y − q0(X) − θ(X) (h0(Z, X) − p0(X)))²]   (4)

since the derivative of this loss with respect to θ(X) is equal to the moment equation and, thus, the first order condition for the loss minimization problem is satisfied by the true model θ0. Moreover, if the loss function satisfies a functional analogue of strong convexity, then any minimizer of the loss

3For Double Machine Learning ATE estimation with Instrumental Variables.

achieves small mean squared error with respect to θ0. This leads to the following approach:

Algorithm 1: HETEROGENEOUS EFFECTS: DMLIV (partially orthogonal, convex loss)
1 On a half-sample S1: regress i) Y on X, ii) T on X, Z, iii) T on X, to learn estimates q̂, ĥ and p̂ correspondingly;
2 Minimize the empirical analogue of the square loss over some hypothesis space Θ on the other half-sample S2:

θ̂ = arg inf_{θ∈Θ} (2/n) Σ_{i∈S2} (Yi − q̂(Xi) − θ(Xi) (ĥ(Zi, Xi) − p̂(Xi)))² := L1_n(θ; q̂, ĥ, p̂)   (5)

or any learning algorithm that achieves small generalization error w.r.t. loss L1(θ; q̂, ĥ, p̂) over Θ.

This method is an extension of the classical two-stage-least-squares (2SLS) approach [2] to allow for arbitrary machine learning models; ignoring the residualization part (i.e. 
if for instance q(X) = p(X) = 0), then it boils down to: 1) predict the mean treatment from the instrument and X with an arbitrary regression/classification method, 2) predict the outcome from the predicted treatment multiplied by the heterogeneous effect model θ(X). Residualization helps us remove the dependence of the mean squared error on the complexity of the baseline function f0(X). We achieve this by showing that this loss is orthogonal with respect to p, q (see [13] for the definition of an orthogonal loss). However, orthogonality does not hold with respect to h. This finding is reasonable since we are using h(Z, X) as our regressor. Hence, any error in the measurement of the regressor can directly propagate to an error in θ(X). This is the same reason why in classical IV regression one cannot ignore the variance from the first stage of 2SLS when calculating confidence intervals.
Lemma 1. The loss function L1(θ; q, h, p) is orthogonal to the nuisance functions p, q, but not h.
Strong convexity and overlap. Note that both the empirical loss L1_n and the population loss L1 are convex in the prediction, which typically implies computational tractability. Moreover, the second order directional derivative of the population loss in any functional direction θ(·) − θ0(·) is E[(ĥ(Z, X) − p̂(X))² (θ(X) − θ0(X))²]. Let V(X) := E[(ĥ(Z, X) − p̂(X))² | X]. To be able to achieve mean-squared-error rates based on our loss minimization, we need the population version L1 of the loss function to satisfy a functional analogue of λ-strong convexity:

∀θ ∈ Θ : E[V(X) · (θ(X) − θ0(X))²] ≥ λ E[(θ(X) − θ0(X))²]   (6)

This setting falls under the “single-index” setup of [13]. Using arguments from Lemma 1 of [13], if

∀θ ∈ Θ : E[V0(X) · (θ(X) − θ0(X))²] ≥ λ0 E[(θ(X) − θ0(X))²]   (7)

where V0(X) := E[(h0(Z, X) − p0(X))² | X] = Var(E[T | Z, X] | X), then λ ≥ λ0 − O(‖ĥ − h0‖²_4 + ‖p̂ − p0‖²_4) = λ0 − o(1). A sufficient condition is that V0(X) ≥ λ0 for all X. This is a standard "overlap" condition that the instrument is exogenously varying at any X and has a direct effect on the treatment at any X. DMLIV only requires an "average" overlap condition, tailored particularly to the hypothesis space Θ, hence it can handle settings where the instrument is weak for some subset of the population. For instance, if Θ is a linear function class Θ = {⟨θ, φ(X)⟩ : θ ∈ S ⊆ R^d}, then for the oracle strong convexity to hold it suffices that E[V(X) φ(X) φ(X)ᵀ] ⪰ λI. Lemma 1, combined with the above discussion and the results of [13], yields:4
Corollary 2. Assume all random variables are bounded and consider any algorithm that achieves expected generalization error R²_n with respect to loss L1(θ; q̂, ĥ, p̂). Moreover, suppose that the nuisance estimates satisfy ‖q̂ − q0‖_4, ‖p̂ − p0‖_4 = o(g_n) and ‖ĥ − h0‖_4 = o(h_n). Then θ̂ returned by DMLIV satisfies ‖θ̂ − θ0‖²_2 ≤ O((R²_n + g⁴_n + h²_n) / λ0). If empirical risk minimization is used in the final stage, then R²_n = δ²_n, where δ_n is the critical radius of the hypothesis space Θ as defined via the localized Rademacher complexity [20].
Computational considerations. The empirical loss L1_n is not a standard square loss. However, letting Ỹi = Yi − q̂(Xi) and γ(Xi) = ĥ(Zi, Xi) − p̂(Xi), we can re-write it as Σ_i γ(Xi)² (Ỹi/γ(Xi) − θ(Xi))². Thus the problem is equivalent to a standard square loss minimization with label Ỹi/γ(Xi) and sample weights γ(Xi)². Thus we can use any out-of-the-box machine learning method that accepts sample weights, such as stochastic gradient based regression methods and gradient boosted or random forests. Alternatively, if we assume a linear representation of the effect function θ(X) = ⟨θ, φ(X)⟩, then the problem is equivalent to regressing Ỹ on the scaled features φ(X) γ(X), and again any method for fitting linear models can be invoked.

4This corollary follows by small modifications of the proofs of Theorem 1 and Theorem 3 of [13] that account for the non-orthogonality w.r.t. h, so we omit its proof.

2.2 DRIV: Orthogonal Loss for IV Estimation of CATE and Projections

We now present the main estimation algorithm that combines the doubly robust approach presented for ATE estimation with the preliminary estimator of the CATE to obtain a fully orthogonal and strongly convex loss. This method achieves a second order effect from all nuisance estimation errors and enables oracle rates for the target effect class Θ and asymptotically valid inference for low dimensional target effect classes. 
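The weighted-square-loss reduction of DMLIV can be sketched end to end as follows. This is a hypothetical scikit-learn-based helper, not the EconML implementation; the clipping threshold guarding against near-zero γ (a weak instrument at some X) is an illustrative choice.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def dmliv(Y, T, Z, X, final_model=None):
    """DMLIV sketch (Algorithm 1): nuisances on one half-sample, weighted
    square loss on the other; returns a model whose predict(X) estimates theta(X)."""
    n = len(Y)
    S1, S2 = np.arange(n) < n // 2, np.arange(n) >= n // 2
    XZ = np.column_stack([X, Z])
    # Step 1 on S1: q(X) = E[Y|X], h(Z,X) = E[T|Z,X], p(X) = E[T|X].
    q = GradientBoostingRegressor().fit(X[S1], Y[S1])
    h = GradientBoostingRegressor().fit(XZ[S1], T[S1])
    p = GradientBoostingRegressor().fit(X[S1], T[S1])
    # Step 2 on S2: with gamma = h - p, note gamma^2 (Ytil/gamma - theta)^2 =
    # (Ytil - gamma*theta)^2, so minimize a standard square loss with
    # label Ytil/gamma and sample weight gamma^2.
    Ytil = Y[S2] - q.predict(X[S2])
    gamma = h.predict(XZ[S2]) - p.predict(X[S2])
    gamma = np.where(np.abs(gamma) < 1e-3, 1e-3, gamma)  # guard near-zero gamma
    model = final_model if final_model is not None else GradientBoostingRegressor()
    model.fit(X[S2], Ytil / gamma, sample_weight=gamma ** 2)
    return model
```

Any learner that accepts `sample_weight` can be swapped in for the final stage, which is exactly the black-box property the reduction is meant to deliver.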
In particular, given access to a first stage model of heterogeneous effects θpre (such as the one produced by DMLIV), we can estimate a more robust model of heterogeneous effects via minimizing a square loss that treats the doubly robust quantity used in Equation (3) as the label:

min_{θ∈Θπ} L2(θ; θpre, β, p, q, r) := E[(θpre(X) + (Ỹ − θpre(X) T̃) Z̃ / β(X) − θ(X))²]   (8)

We allow for a model space Θπ that is not necessarily equal to Θ. The solution in Equation (3) is a special case of this minimization problem where the space Θπ contains only constant functions. Our main result shows that this loss is orthogonal to all the nuisance functions θpre, β, q, p, r. Moreover, it is strongly convex in the prediction θ(X), since conditional on all the nuisance estimates it is a standard square loss. Moreover, we show that the loss is orthogonal irrespective of the model space Θπ, even if Θπ ≠ Θ, as long as the preliminary estimate θpre is consistent with respect to the true CATE θ0 (i.e. fit a flexible preliminary CATE and use it to project to a simpler hypothesis space).
Lemma 3. 
The loss L2 is orthogonal with respect to the nuisance functions θpre, β, p, q and r.

Algorithm 2: DRIV (orthogonal convex loss for CATE and projections of CATE)
1 Estimate a preliminary estimate θpre of the CATE θ0(X) using DMLIV on half-sample S1;
2 Using half-sample S1, regress i) Y on X, ii) T on X, iii) Z on X to learn estimates q̂, p̂, r̂ correspondingly;
3 Regress T · Z on X using S1 to learn an estimate f̂ of the function f0(X) = E[T · Z | X];
4 ∀i ∈ S2, let Ỹi = Yi − q̂(Xi), T̃i = Ti − p̂(Xi), Z̃i = Zi − r̂(Xi), β̂(Xi) = f̂(Xi) − p̂(Xi) r̂(Xi);
5 Minimize the empirical analogue of the square loss L2 over hypothesis space Θπ on the other half-sample S2, i.e.

θ̂_DR = arg inf_{θ∈Θπ} (2/n) Σ_{i∈S2} (θpre(Xi) + (Ỹi − θpre(Xi) T̃i) Z̃i / β̂(Xi) − θ(Xi))² := L2_n(θ; θpre, β̂, p̂, q̂, r̂)

or any learning algorithm that has small generalization error w.r.t. loss L2(θ; θpre, β̂, p̂, q̂, r̂) on Θπ.

If we use DMLIV for θpre, then even though DMLIV has a first order impact from the error in h, the second stage estimate has only a second order impact, since it has a second order impact from the first stage CATE error. Lemma 3 together with the results of [13] and [12] implies the following corollary:
Corollary 4. Assume all random variables are bounded and consider any algorithm that achieves expected generalization error R²_n with respect to loss L2(θ; θpre, β̂, p̂, q̂, r̂). Moreover, suppose that each nuisance estimate ĝ ∈ {θpre, β̂, p̂, q̂, r̂} satisfies ‖ĝ − g0‖_4 ≤ g_n. Then θ̂ returned by DRIV satisfies ‖θ̂ − θ*‖²_2 ≤ O(R²_n + g⁴_n), where θ* = arg min_{θ∈Θπ} L2(θ; θ0, β0, p0, q0, r0). If empirical risk minimization is used in the final stage, then R²_n = δ²_n, where δ_n is the critical radius of the hypothesis space Θ as defined via the localized Rademacher complexity [20]. If Θ is high-dimensional sparse linear, i.e. θ(X) = ⟨ξ, φ(X)⟩ with ‖ξ‖_0 ≤ s, φ(X) ∈ R^p and E[φ(X) φ(X)ᵀ] ≥ λ0 I, then if an ℓ1-penalized square loss minimization is used in the final step of DRIV, it suffices that ‖ĝ − g0‖_2 ≤ g_n to get ‖ξ̂ − ξ*‖²_2 ≤ O((s² log(p)/n + g⁴_n) / λ0).
Interpretability through projections. The fact that our loss function can be used with any target Θπ allows us to perform inference on the projection of θ0 on a simple space Θπ (e.g. decision trees, linear functions) for interpretability purposes. If we let Y_i^DR denote the label in the final regression of DRIV, then observe that when the nuisance estimates take their true values, E[Y_i^DR | X] = θ0(X), since the second part of Y_i^DR has mean zero. Hence: L2(θ; θ0, β0, p0, q0, r0) = E[(Y_i^DR − θ0(X))²] + E[(θ0(X) − θ(X))²]. The first part is independent of θ and hence minimizing the oracle L2 is equivalent to minimizing E[(θ0(X) − θ(X))²] over θ ∈ Θπ, which is exactly the projection of θ0 on Θπ. One version of an interpretable model is estimating the CATE with respect to a subset T of the variables, i.e. θ(X_T) = E[θ0(X) | X_T] (e.g. how treatment effect varies with a single feature). 
This boils down to setting \u0398\u03c0 some space of functions of XT .\nIf T is a low dimensional set of features and \u0398\u03c0 is a the space of linear functions of XT , i.e.\n\u0398\u03c0 = {X \u2192 (cid:104)\u03b8T , XT(cid:105) : \u03b8T \u2208 R|T|}, then the \ufb01rst order condition of our loss is equal to the\n\n| X] = \u03b80(X), since the second part of Y DR\n\ns2 log(p)/n+g4\n\nn\n\nIf we let Y DR\n\ni\n\n2 \u2264 O\n\n(cid:17)\n\n.\n\n\u03bb0\n\ni\n\ni\n\n6\n\n\fmoment condition E[(Y DR \u2212 (cid:104)\u03b8T , XT(cid:105))XT ] = 0. Then orthogonality of our loss implies that DRIV\nis equivalent to an orthogonal moment estimation method [11]. Thus using the results of [11] we\nget that the estimate \u02c6\u03b8T of DRIV is asympotically normal with asymptotic variance equal to the\nhypothetical variance of \u03b8T as if the nuisance estimates had their true values. Hence, we can use\nout-of-the-box packages for calculating CIs of an OLS regression to get p-values on the coef\ufb01cients.\n\n3 Estimating Effects of Membership at TripAdvisor\nWe apply our methods to estimate the treatment effect of membership on the number of days a user\nvisits TripAdvisor. The instrument used was a 14-day intent-to-treat A/B test run during 2018, where\nusers in group A received a new, easier membership sign-up process, while the users in group B did\nnot. The treatment is whether a user became a member or not. Becoming a member and logging\ninto TripAdvisor gives users exclusive access to trip planning tools, special deals and price alerts,\nand personalized ideas and travel advice. Our data consists of 4,606,041 total users in a 50:50 A/B\ntest. For each user, we have a 28-day pre-experiment summary about their browsing and purchasing\nactivity on TripAdvisor (see Sec. B.2). 
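Before turning to the experimental results, the DRIV construction of Algorithm 2 can be sketched in a few lines of numpy on simulated data. The data-generating process, the plain least-squares auxiliary models, and the constant Wald-style preliminary effect standing in for DMLIV are all illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of Algorithm 2 (DRIV) on simulated data with a valid,
# randomized instrument Z and an unobserved confounder U.
import numpy as np

def fit_linear(X, y):
    # least-squares fit with an intercept column
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_linear(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

rng = np.random.default_rng(0)
n = 40_000
X = rng.normal(size=(n, 2))
Z = rng.binomial(1, 0.5, n).astype(float)            # randomized instrument
U = rng.normal(size=n)                               # unobserved confounder
T = rng.binomial(1, 0.2 + 0.5 * Z + 0.2 * (U > 0)).astype(float)
theta0 = 1.0 + X[:, 0]                               # true CATE
Y = theta0 * T + U + rng.normal(size=n)

S1, S2 = np.arange(n) < n // 2, np.arange(n) >= n // 2

# Steps 2-3: auxiliary regressions q, p, f on half-sample S1; E[Z|X] = 0.5 known.
w_q = fit_linear(X[S1], Y[S1])                       # E[Y | X]
w_p = fit_linear(X[S1], T[S1])                       # E[T | X]
w_f = fit_linear(X[S1], (T * Z)[S1])                 # E[T*Z | X]
r = 0.5

# Step 4: residuals and conditional covariance beta on the other half S2.
Yt = Y[S2] - predict_linear(w_q, X[S2])
Tt = T[S2] - predict_linear(w_p, X[S2])
Zt = Z[S2] - r
beta = predict_linear(w_f, X[S2]) - predict_linear(w_p, X[S2]) * r

# Step 1 stand-in: a constant Wald-style preliminary effect instead of DMLIV.
theta_pre = np.cov(Yt, Zt)[0, 1] / np.cov(Tt, Zt)[0, 1]

# Step 5: doubly robust label and final-stage projection of the CATE on X.
Y_dr = theta_pre + (Yt - theta_pre * Tt) * Zt / beta
w_cate = fit_linear(X[S2], Y_dr)
print(w_cate)  # ≈ [1.0, 1.0, 0.0], i.e. recovers theta0(X) = 1 + X[:, 0]
```

With well-specified auxiliary models the final regression recovers θ₀(X) up to sampling noise; as the paper notes, the final stage can be any learner with small generalization error for L².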
The instrument significantly increased the rate of treatment, and is assumed to satisfy the exclusion restriction. We applied two sets of nuisance estimation models with different complexity characteristics: LASSO regression and logistic regression with an L2 penalty (LM); and gradient boosting regression and classification (GB). The only exception was E[Z|X], where we used a fixed estimate of 0.5, since the instrument was a large randomized experiment. See Sec. B.1 for details.5

    Nuisance  Method      ATE Est [95% CI]
    LM        DMLATEIV    0.117 [-0.051, 0.285]
    LM        DRIV        0.113 [-0.052, 0.279]
    GB        DMLATEIV    0.127 [-0.031, 0.285]
    GB        DRIV        0.125 [-0.061, 0.311]

Table 1: ATE Estimates for 2018 Experiment at TripAdvisor

We estimate the ATE using DRIV projected onto a constant (Table 1). Using linear nuisance models results in very similar ATE estimates between DMLATEIV and DRIV. To understand why, we compare the covariate associations with both heterogeneity and compliance under DRIV: if some covariates had significant non-zero associations with both heterogeneity and compliance, this could lead to different estimates between DRIV and DMLATEIV (and vice versa). Replacing the CATE projection model with a linear regression, we obtain valid inferences for the covariates associated with treatment effect heterogeneity (Figure 1). For compliance, we run a linear regression of the estimated quantity β(X) on X to assess its association with each of the features (see Sec. B.1 for details). Comparing treatment and compliance coefficients, os_type_linux and revenue_pre are the only coefficients substantially different from 0 in both. However, only a very small proportion of users in the experiment are Linux users, and the distribution of revenue is very positively skewed. This explains the minor difference between the DMLATEIV and DRIV estimates.
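The compliance check described above, regressing an estimate of β(X) = E[T·Z | X] − E[T | X] E[Z | X] on the features, can be sketched on simulated data as follows. The feature meanings, functional forms, and plain least-squares fits are illustrative assumptions rather than the production setup:

```python
# Sketch of the compliance-association check: estimate beta(X) from auxiliary
# regressions and regress it on the features X.
import numpy as np

def fit_linear(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_linear(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

rng = np.random.default_rng(1)
n = 30_000
X = rng.normal(size=(n, 2))                    # e.g. pre-period engagement features
Z = rng.binomial(1, 0.5, n).astype(float)      # 50:50 A/B assignment
# compliance (take-up under encouragement) increases with the first feature
T = rng.binomial(1, np.clip(0.3 + 0.3 * Z * (1 + 0.5 * X[:, 0]), 0, 1)).astype(float)

w_f = fit_linear(X, T * Z)                     # E[T*Z | X]
w_p = fit_linear(X, T)                         # E[T | X]
beta_hat = predict_linear(w_f, X) - predict_linear(w_p, X) * 0.5   # E[Z|X] = 0.5
w_comp = fit_linear(X, beta_hat)               # compliance association with X
print(w_comp)  # ≈ [0.075, 0.0375, 0.0]: compliance loads on the first feature
```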
Moreover, we fit a shallow, heavily regularized random forest and interpret it using Shapley Additive Explanations (SHAP) [24]. SHAP gave a directionally similar impact of each feature on the effect (Figure 1). However, since we constrained the model to have depth at most one, it essentially ranks the features by their usefulness if we were to split the population based on a single feature. This explains why the ordering of features by importance in the forest does not match the ordering by coefficient magnitude in the linear model: the two have different interpretations. The features picked up by the forest intuitively make sense, since an already highly engaged member of TripAdvisor, or a user who has recently made a booking, is less likely to further increase their visits to TripAdvisor. Using gradient boosting nuisance models, we show that many inferences remain similar (Figure 3 in Appendix). The most notable changes in heterogeneity were for features with a highly skewed distribution (e.g. visits to specific pages on TripAdvisor) or which appear rarely in the data (e.g. Linux users). The linear CATE projection model coefficients are largely similar for both residualization models (except the Linux operating system feature, which appears rarely in the data). Moving to a random forest for the CATE projection model with SHAP presents greater differences, especially for the highly skewed features.

Similar instrument from a recent experiment. A recent 2019 A/B test of the same membership sign-up process provided another viable instrument. This 21-day A/B test included a much larger, more diverse population of users than in 2018, due to fewer restrictions on eligibility (see Sec. B.2 for

5 We attempted to use the R implementation of Generalized Random Forests (GRF) [4] to compare with our results.
However, we could not fit it due to the size of the data, running into insufficient-memory errors (with 64GB RAM).

Figure 1: (From left to right) Linear CATE projection, SHAP summary of random forest CATE projection, Linear CATE projection coefficients. Using linear nuisance models.

details). We apply DRIV with gradient boosting residualization models and a linear projection of the CATE. The CATE distribution has generally higher values compared to the 2018 experiment, which reflects the different experimental population. In particular, users in the 2018 experiment had much higher engagement and significantly higher revenue in the pre-experiment period. This was largely because users were only included in the 2018 experiment on their second visit. The higher baseline naturally makes it more difficult to achieve high treatment effects, explaining the generally lower CATE distribution in the 2018 experiment. We note that, unlike in 2018, the revenue coefficient is no longer significant. We again attribute this to the much higher revenue baseline in 2018. Despite the population differences, however, we observe that "days_visited_vrs_pre" continues to have a very significant positive association. "days_visited_exp_pre" now also appears to have a significantly positive association, as does the iPhone device (which was not a feature in the 2018 experiment). The inclusion of iPhone users is another big domain shift between the two experiments.

Policy recommendations for TripAdvisor. Our results offer several policy implications for TripAdvisor. Firstly, encourage iPhone users, and users who frequent vacation rentals pages, to sign up for membership. These users exhibited high treatment effects from membership. For frequent visitors to vacation rentals pages, this effect was robust across residualization models, CATE projections, and even different instruments (e.g.
by providing stronger encouragements for sign-up on particular sub-pages). Secondly, find ways to improve the membership offering for users who are already engaged, e.g. those who recently made a booking (high revenue_pre) or were already frequent visitors (high days_visited_free_pre).

Validation on Semi-Synthetic Data. In Appendix C, we validate the correctness of the ATE and CATE from DRIV by creating a semi-synthetic dataset with the same variables, such that the marginal distribution of each variable looks similar to the TripAdvisor data, but where we know the true effect model. We find that DRIV recovers a good estimate of the ATE. The CATE of DRIV with linear regression as the final stage also recovers the true coefficients, and a random forest final stage picks the correct factors of heterogeneity as the most important features. Moreover, coverage of the DRIV ATE confidence intervals is almost nominal at 94%, while DMLATEIV can be very biased and have 0 coverage.6

4 Estimating the Effect of Schooling on Wages

The causal impact of schooling on wages has been studied at length in economics (see [14], [6], [7], [17]), and although it is generally agreed that there is a positive impact, it is difficult to obtain a consistent estimate of the effect due to self-selection into education levels. To account for this endogeneity, Card [6] proposes using proximity to a 4-year college as an IV for schooling. We analyze Card's data from the National Longitudinal Survey of Young Men (NLSYM, 1966) to estimate the ATE of education on wages and find sources of heterogeneity. We describe the NLSYM data in depth in Appendix D. At a high level, the data contains 3,010 rows with 22 mostly binary covariates X, log wages (y), years of schooling (T), and a 4-year college proximity indicator (Z). We apply DMLATEIV and DRIV with linear (LM) or gradient boosted (GBM) nuisance models to estimate the ATE (Table 2 and Table 8 in Appx. D).
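Before presenting the estimates, it is worth illustrating why DMLATEIV can be biased when compliance and effect heterogeneity are correlated, while the DRIV moment is not. In the sketch below the functional forms are invented, and oracle nuisances are plugged in to isolate the two estimands from fitting error:

```python
# Compliance comp(x) and the true effect theta0(x) both increase with x, so the
# compliance-weighted DMLATEIV estimand overweights high-effect users.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)
Z = rng.binomial(1, 0.5, n).astype(float)
U = rng.normal(size=n)                                 # unobserved confounder
comp = 0.3 + 0.2 * np.tanh(x)                          # compliance grows with x
T = rng.binomial(1, 0.1 + comp * Z + 0.2 * (U > 0)).astype(float)
theta0 = 1.0 + x                                       # true CATE also grows with x
Y = theta0 * T + U + rng.normal(size=n)

# Oracle nuisances for this design (derived analytically for the demo):
p = 0.2 + 0.5 * comp                                   # E[T | X]
q = theta0 * p                                         # E[Y | X]
beta = 0.25 * comp                                     # Cov(T, Z | X)
Yt, Tt, Zt = Y - q, T - p, Z - 0.5

# DMLATEIV: a single IV moment, converging to E[theta0 * beta] / E[beta].
ate_dmlateiv = np.mean(Yt * Zt) / np.mean(Tt * Zt)
# DRIV projected on a constant: mean of the doubly robust label.
theta_pre = ate_dmlateiv                               # preliminary effect estimate
Y_dr = theta_pre + (Yt - theta_pre * Tt) * Zt / beta
ate_driv = np.mean(Y_dr)
# DMLATEIV lands well above the true ATE of 1.0; DRIV stays near 1.0.
print(ate_dmlateiv, ate_driv)
```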
While the DMLATEIV results are consistent with Card's (0.134, [0.026, 0.242] 95% CI), this estimate is likely biased in the presence of compliance and effect heterogeneity (see Sec. 2). The DRIV ATE estimates, albeit lower, still lie within the 95% CI of the DMLATEIV ATE.

                            Observational Data          Semi-Synthetic Data
    Nuisance  Method        ATE Est  95% CI             ATE Est  95% CI             Cover ‡
    LM        DMLATEIV      0.137    [0.027, 0.248]     0.654    [0.621, 0.687]     10%
    LM        DRIV          0.065    [-0.02, 0.151]     0.587    [0.521, 0.652] †   92%

    † Contains the true ATE (0.609)
    ‡ Coverage for 95% CI over 100 Monte Carlo simulations

Table 2: NLSYM ATE Estimates for Observational and Semi-synthetic Data

We study effect heterogeneity with a shallow random forest as the last stage of DRIV. Fig. 2 depicts the spread of treatment effects and the important features selected. Most effects (89%) are positive, with very few very negative outliers. The heterogeneity is driven mainly by parental education variables. To study this effect, we project the DRIV treatment effect on the mother's education variable. In Fig. 2, we note that treatment effects are highest among children of less educated mothers.

6 Results on the coverage experiment can be recovered by running coverage.py followed by post-processing with post_processing.ipynb at https://github.com/microsoft/EconML/tree/master/prototypes/dml_iv. Single synthetic instance results on the quality of the recovered estimates and comparisons with benchmark approaches can be found in TA_DGP_Analysis.ipynb at the same location.
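The single-variable projection used above (onto mother's education) can be sketched as follows on simulated stand-in data: regress the doubly robust labels on the one covariate, and read a confidence interval off a plain OLS fit with robust standard errors. The design and oracle nuisances below are assumptions for illustration:

```python
# Project DRIV's doubly robust labels on a single covariate m and compute
# heteroskedasticity-robust OLS standard errors with numpy.
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
m = rng.normal(size=n)                      # stand-in for scaled mother's education
Z = rng.binomial(1, 0.5, n).astype(float)
T = rng.binomial(1, 0.2 + 0.5 * Z).astype(float)
theta0 = 0.5 - 0.2 * m                      # effect declines with mother's education
Y = theta0 * T + rng.normal(size=n)

# Oracle nuisances for this simple design:
p, q, beta = 0.45, theta0 * 0.45, 0.125     # E[T|X], E[Y|X], Cov(T, Z|X)
Yt, Tt, Zt = Y - q, T - p, Z - 0.5
theta_pre = np.mean(Yt * Zt) / np.mean(Tt * Zt)
Y_dr = theta_pre + (Yt - theta_pre * Tt) * Zt / beta

# OLS of Y_dr on [1, m] with a sandwich (heteroskedasticity-robust) covariance:
A = np.column_stack([np.ones(n), m])
w = np.linalg.lstsq(A, Y_dr, rcond=None)[0]
resid = Y_dr - A @ w
bread = np.linalg.inv(A.T @ A)
cov = bread @ (A.T * resid**2) @ A @ bread
se = np.sqrt(np.diag(cov))
print(w, w - 1.96 * se, w + 1.96 * se)      # slope ≈ -0.2 with a tight 95% CI
```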
This pattern has also been observed in [6] and [17].\n\nFigure 2: Treatment effect distribution, heterogeneity features, and linear projection on mother\u2019s education.\n\nSemi-synthetic Data Results. We created semi-synthetic data from the NLSYM covariates X and in-\nstrument Z, with generated treatments and outcomes based on known compliance and treatment func-\ntions (see Appx. D for details). In Table 2, we see that DMLATEIV ATE (true ATE=0.609) is upwards\nbiased and has poor coverage over 100 runs, whereas the DRIV ATE is less biased and has overall good\ncoverage. With DRIV, we also recover the correct \u03b8(X) coef\ufb01cients: 0.142 ([0.037, 0.245] 95% CI)\nvs 0.1, 0.049 ([0.015, 0.083]) vs 0.05, and \u22120.147 ([\u22120.365, 0.072]) vs \u22120.1.\n\n5 Acknowledgements\n\nWe thank Jeff Palmucci, Brett Malone, Baskar Mohan, Molly Steinkrauss, Gwyn Fisher and Matthew\nDacey from TripAdvisor for their support and assistance in making this collaboration possible.\n\nReferences\n[1] Alberto Abadie. Semiparametric instrumental variable estimation of treatment response models. Journal\n\nof Econometrics, 113(2):231 \u2013 263, 2003.\n\n[2] Joshua D Angrist and J\u00f6rn-Steffen Pischke. Mostly harmless econometrics: An empiricist\u2019s companion.\n\nPrinceton university press, 2008.\n\n[3] Peter M Aronow and Allison Carnegie. Beyond late: Estimation of the average treatment effect with an\n\ninstrumental variable. Political Analysis, 21(4):492\u2013506, 2013.\n\n[4] Susan Athey, Julie Tibshirani, Stefan Wager, et al. Generalized random forests. The Annals of Statistics,\n\n47(2):1148\u20131178, 2019.\n\n[5] Susan Athey and Stefan Wager. Ef\ufb01cient policy learning. arXiv preprint arXiv:1702.02896, 2017.\n\n[6] David Card. Using geographic variation in college proximity to estimate the return to schooling. Technical\n\nreport, National Bureau of Economic Research, 1993.\n\n[7] David Card. 
Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica, 69(5):1127–1160, 2001.

[8] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.

[9] Xiaohong Chen and Zhipeng Liao. Sieve semiparametric two-step gmm under weak dependence. Cowles Foundation Discussion Papers 2012, Cowles Foundation for Research in Economics, Yale University, 2015.

[10] Xiaohong Chen and Demian Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321, 2012.

[11] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.

[12] Victor Chernozhukov, Denis Nekipelov, Vira Semenova, and Vasilis Syrgkanis. Plug-in regularized estimation of high-dimensional parameters in nonlinear semiparametric models. arXiv preprint arXiv:1806.04823, 2018.

[13] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.

[14] Zvi Griliches. Estimating the returns to schooling: Some econometric problems.
Econometrica: Journal\n\nof the Econometric Society, pages 1\u201322, 1977.\n\n[15] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A \ufb02exible approach\nfor counterfactual prediction. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th\nInternational Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research,\npages 1414\u20131423, International Convention Centre, Sydney, Australia, 06\u201311 Aug 2017. PMLR.\n\n[16] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep iv: A \ufb02exible approach for\n\ncounterfactual prediction. In International Conference on Machine Learning, pages 1414\u20131423, 2017.\n\n[17] John Hudson and John G Sessions. Parental education, labor market experience and earnings: new wine in\n\nan old bottle? Economics Letters, 113(2):112\u2013115, 2011.\n\n[18] Guido W. Imbens and Joshua D. Angrist. Identi\ufb01cation and estimation of local average treatment effects.\n\nEconometrica, 62(2):467\u2013475, 1994.\n\n[19] Tobias J. Klein. Heterogeneous treatment effects: Instrumental variables without monotonicity? Journal\n\nof Econometrics, 155(2):99 \u2013 116, 2010.\n\n[20] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems.\n\nSpringer, 2011.\n\n[21] S\u00f6ren R K\u00fcnzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. Meta-learners for estimating heterogeneous\n\ntreatment effects using machine learning. arXiv preprint arXiv:1706.03461, 2017.\n\n[22] Scott M Lundberg, Gabriel G Erion, and Su-In Lee. Consistent individualized feature attribution for tree\n\nensembles. arXiv preprint arXiv:1802.03888, 2018.\n\n[23] Scott M Lundberg and Su-In Lee. A uni\ufb01ed approach to interpreting model predictions. In I. Guyon, U. V.\nLuxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural\nInformation Processing Systems 30, pages 4765\u20134774. 
Curran Associates, Inc., 2017.

[24] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.

[25] Whitney K. Newey and James L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.

[26] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.

[27] Ryo Okui, Dylan S Small, Zhiqiang Tan, and James M Robins. Doubly robust instrumental variable regression. Statistica Sinica, pages 173–205, 2012.