{"title": "Feedback Detection for Live Predictors", "book": "Advances in Neural Information Processing Systems", "page_first": 3428, "page_last": 3436, "abstract": "A predictor that is deployed in a live production system may perturb the features it uses to make predictions. Such a feedback loop can occur, for example, when a model that predicts a certain type of behavior ends up causing the behavior it predicts, thus creating a self-fulfilling prophecy. In this paper we analyze predictor feedback detection as a causal inference problem, and introduce a local randomization scheme that can be used to detect non-linear feedback in real-world problems. We conduct a pilot study for our proposed methodology using a predictive system currently deployed as a part of a search engine.", "full_text": "Feedback Detection for Live Predictors\n\nStefan Wager, Nick Chamandy, Omkar Muralidharan, and Amir Najmi\n\nswager@stanford.edu, {chamandy, omuralidharan, amir}@google.com\n\nStanford University and Google, Inc.\n\nAbstract\n\nA predictor that is deployed in a live production system may perturb the features it uses to make predictions. Such a feedback loop can occur, for example, when a model that predicts a certain type of behavior ends up causing the behavior it predicts, thus creating a self-fulfilling prophecy. In this paper we analyze predictor feedback detection as a causal inference problem, and introduce a local randomization scheme that can be used to detect non-linear feedback in real-world problems. We conduct a pilot study for our proposed methodology using a predictive system currently deployed as a part of a search engine.\n\n1 Introduction\n\nWhen statistical predictors are deployed in a live production environment, feedback loops can become a concern. 
Predictive models are usually tuned using training data that has not been influenced by the predictor itself; thus, most real-world predictors cannot account for the effect they themselves have on their environment. Consider the following caricatured example: A search engine wants to train a simple classifier that predicts whether a search result is \u201cnewsy\u201d or not, meaning that the search result is relevant for people who want to read the news. This classifier is trained on historical data, and learns that high click-through rate (CTR) has a positive association with \u201cnewsiness.\u201d Problems may arise if the search engine deploys the classifier, and starts featuring search results that are predicted to be newsy for some queries: promoting the search result may lead to a higher CTR, which in turn leads to higher newsiness predictions, which causes the result to be featured even more.\n\nIf we knew beforehand all the channels through which predictor feedback can occur, then detecting feedback would not be too difficult. For example, in the context of the above example, if we knew that feedback could only occur through some changes to the search result page that were directly triggered by our model, then we could estimate feedback by running small experiments where we turn off these triggering rules. However, in large industrial systems where networks of classifiers all feed into each other, we can no longer hope to understand a priori all the ways in which feedback may occur. We need a method that lets us detect feedback from sources we might not have even known to exist.\n\nThis paper proposes a simple method for detecting feedback loops from unknown sources in live systems. Our method relies on artificially inserting a small amount of noise into the predictions made by a model, and then measuring the effect of this noise on future predictions made by the model. 
If future model predictions change when we add artificial noise, then our system has feedback.\n\nTo understand how random noise can enable us to detect feedback, suppose that we have a model with predictions $\\hat{y}$ in which tomorrow\u2019s prediction $\\hat{y}^{(t+1)}$ has a linear feedback dependence on today\u2019s prediction $\\hat{y}^{(t)}$: if we increase $\\hat{y}^{(t)}$ by $\\delta$, then $\\hat{y}^{(t+1)}$ increases by $\\gamma\\delta$ for some $\\gamma \\in \\mathbb{R}$. Intuitively, we should be able to fit this slope by perturbing $\\hat{y}^{(t)}$ with a small amount of noise $\\nu \\sim \\mathcal{N}(0, \\sigma_\\nu^2)$ and then regressing the new $\\hat{y}^{(t+1)}$ against the noise; the reason least squares should work here is that the noise $\\nu$ is independent of all other variables by construction. The main contribution of this paper is to turn this simple estimation idea into a general procedure that can be used to detect feedback in realistic problems where the feedback has non-linearities and jumps.\n\nCounterfactuals and Causal Inference\nFeedback detection is a problem in causal inference. A model suffers from feedback if the predictions it makes today affect the predictions it will make tomorrow. We are thus interested in discovering a causal relationship between today\u2019s and tomorrow\u2019s predictions; simply detecting a correlation is not enough. The distinction between causal and associational inference is acute in the case of feedback: today\u2019s and tomorrow\u2019s predictions are almost always strongly correlated, but this correlation by no means implies any causal relationship.\n\nIn order to discover causal relationships between consecutive predictions, we need to use some form of randomized experimentation. In our case, we add a small amount of random noise to our predictions. 
Because the noise is fully artificial, we can reasonably ask counterfactual questions of the type: \u201cHow would tomorrow\u2019s predictions have changed if we added more/less noise to the predictions today?\u201d The noise acts as an independent instrument that lets us detect feedback. We frame our analysis in terms of a potential outcomes model that asks how the world would have changed had we altered a treatment; in our case, the treatment is the random noise we add to our predictions. This formalism, often called the Rubin causal model [1], is regularly used for understanding causal inference [2, 3, 4]. Causal models are useful for studying the behavior of live predictive systems on the internet, as shown by, e.g., the recent work of Bottou et al. [5] and Chan et al. [6].\n\nOutline and Contributions\nIn order to define a rigorous feedback detection procedure, we need to have a precise notion of what we mean by feedback. Our first contribution is thus to provide such a model by defining statistical feedback in terms of a potential outcomes model (Section 2). Given this feedback model, we propose a local noising scheme that can be used to fit feedback functions with non-linearities and jumps (Section 4). Before presenting the general version of our approach, however, we begin by discussing the linear case in Section 3 to elucidate the mathematics of feedback detection: as we will show, the problem of linear feedback detection using local perturbations reduces to linear regression. Finally, in Section 5 we conduct a pilot study based on a predictive model currently deployed as a part of a search engine.\n\n2 Feedback Detection for Statistical Predictors\n\nSuppose that we have a model that makes predictions $\\hat{y}_i^{(t)}$ in time periods $t = 1, 2, \\ldots$ for examples $i = 1, \\ldots, n$. The predictive model itself is taken as given; our goal is to understand feedback effects between consecutive pairs of predictions $\\hat{y}_i^{(t)}$ and $\\hat{y}_i^{(t+1)}$. We define statistical feedback in terms of counterfactual reasoning: we want to know what would have happened to $\\hat{y}_i^{(t+1)}$ had $\\hat{y}_i^{(t)}$ been different. We use potential outcomes notation [e.g., 7] to distinguish between counterfactuals: let $\\hat{y}_i^{(t+1)}[y_i^{(t)}]$ be the predictions our model would have made at time $t+1$ if we had published $y_i^{(t)}$ as our time-$t$ prediction. In practice we only get to observe $\\hat{y}_i^{(t+1)}[y_i^{(t)}]$ for a single $y_i^{(t)}$; all other values of $\\hat{y}_i^{(t+1)}[y_i^{(t)}]$ are counterfactual. We also consider $\\hat{y}_i^{(t+1)}[\\varnothing]$, the prediction our model would have made at time $t+1$ if the model never made any of its predictions public and so did not have the chance to affect its environment. With this notation, we define feedback as\n\n$$\\mathrm{feedback}_i^{(t)} = \\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] - \\hat{y}_i^{(t+1)}[\\varnothing], \\qquad (1)$$\n\ni.e., the difference between the predictions our model actually made and the predictions it would have made had it not had the chance to affect its environment by broadcasting predictions in the past. Thus, statistical feedback is a difference in potential outcomes.\n\nAn additive feedback model\nIn order to get a handle on feedback as defined above, we assume that feedback enters the model additively: $\\hat{y}_i^{(t+1)}[y_i^{(t)}] = \\hat{y}_i^{(t+1)}[\\varnothing] + f(y_i^{(t)})$, where $f$ is a feedback function, and $y_i^{(t)}$ is the prediction published at time $t$. In other words, we assume that the predictions made by our model at time $t+1$ are the sum of the prediction the model would have made if there were no feedback, plus a feedback term that only depends on the previous prediction made by the model. Our goal is to estimate the feedback function $f$.\n\nArtificial noising for feedback detection\nThe relationship between $\\hat{y}_i^{(t)}$ and $\\hat{y}_i^{(t+1)}$ can be influenced by many things, such as trends, mean reversion, random fluctuations, as well as feedback. In order to isolate the effect of feedback, we need to add some noise to the system to create a situation that resembles a randomized experiment. Ideally, we might hope to sometimes turn our predictive system off in order to get estimates of $\\hat{y}_i^{(t)}[\\varnothing]$. However, predictive models are often deeply integrated into large software systems, and it may not be clear what the correct system behavior would be if we turned the predictor off. To side-step this concern, we randomize our system by adding artificial noise to predictions: at time $t$, instead of deploying the prediction $\\hat{y}_i^{(t)}$, we deploy $\\check{y}_i^{(t)} = \\hat{y}_i^{(t)} + \\nu_i^{(t)}$, where $\\nu_i^{(t)} \\overset{\\mathrm{iid}}{\\sim} N$ is artificial noise drawn from some distribution $N$. Because the noise $\\nu_i^{(t)}$ is independent from everything else, it puts us in a randomized experimental setup that allows us to detect feedback as a causal effect. If the time $t+1$ prediction $\\hat{y}_i^{(t+1)}$ is affected by $\\nu_i^{(t)}$, then our system must have feedback because the only way $\\nu_i^{(t)}$ can influence $\\hat{y}_i^{(t+1)}$ is through the interaction between our model predictions and the surrounding environment at time $t$.\n\nLocal average treatment effect\nIn practice, we want the noise $\\nu_i^{(t)}$ to be small enough that it does not disturb the regular operation of the predictive model too much. Thus, our experimental setup allows us to measure feedback as a local average treatment effect [4], where the artificial noise $\\nu_i^{(t)}$ acts as a continuous treatment. 
Provided our additive model holds, we can then piece together these local treatment effects into a single global feedback function $f$.\n\n3 Linear Feedback\n\nWe begin with an analysis of linear feedback problems; the linear setup allows us to convey the main insights with less technical overhead. We discuss the non-linear case in Section 4. Suppose that we have some natural process $x^{(1)}, x^{(2)}, \\ldots$ and a predictive model of the form $\\hat{y} = w \\cdot x$. (Suppose for notational convenience that $x$ includes the constant, and the intercept term is folded into $w$.) For our purposes, $w$ is fixed and known; for example, $w$ may have been set by training on historical data. At some point, we ship a system that starts broadcasting the predictions $\\hat{y} = w \\cdot x$, and there is a concern that the act of broadcasting the $\\hat{y}$ may perturb the underlying $x^{(t)}$ process. Our goal is to detect any such feedback. Following earlier notation we write $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] = w \\cdot x_i^{(t+1)}[\\hat{y}_i^{(t)}]$ for the time $t+1$ variables perturbed by feedback, and $\\hat{y}_i^{(t+1)}[\\varnothing] = w \\cdot x_i^{(t+1)}[\\varnothing]$ for the counterparts we would have observed without any feedback.\n\nIn this setup, any effect of $\\hat{y}_i^{(t)}$ on $x_i^{(t+1)}[\\hat{y}_i^{(t)}]$ is feedback. A simple way to constrain this relationship is using a linear model $x_i^{(t+1)}[\\hat{y}_i^{(t)}] = x_i^{(t+1)}[\\varnothing] + \\beta \\hat{y}_i^{(t)}$. In other words, we assume that $x_i^{(t+1)}[\\hat{y}_i^{(t)}]$ is perturbed by an amount that scales linearly with $\\hat{y}_i^{(t)}$. Given this simple model, we find that:\n\n$$\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] = \\hat{y}_i^{(t+1)}[\\varnothing] + w \\cdot \\beta \\, \\hat{y}_i^{(t)}, \\qquad (2)$$\n\nand so $f(y) = \\gamma y$ with $\\gamma = w \\cdot \\beta$; $f$ is the feedback function we want to fit.\n\nWe cannot work with (2) directly, because $\\hat{y}_i^{(t+1)}[\\varnothing]$ is not observed. In order to get around this problem, we add artificial noise to our predictions: at time $t$, we publish predictions $\\check{y}_i^{(t)} = \\hat{y}_i^{(t)} + \\nu_i^{(t)}$ instead of the raw predictions $\\hat{y}_i^{(t)}$. As argued in Section 2, this method lets us detect feedback because $\\hat{y}_i^{(t+1)}$ can only depend on $\\nu_i^{(t)}$ through a feedback mechanism, and so any relationship between $\\hat{y}_i^{(t+1)}$ and $\\nu_i^{(t)}$ must be a symptom of feedback.\n\nA Simple Regression Approach\nWith the linear feedback model (2), the effect of $\\nu_i^{(t)}$ on $\\hat{y}_i^{(t+1)}$ is $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}] = \\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] + \\gamma \\nu_i^{(t)}$. This relationship suggests that we should be able to recover $\\gamma$ by regressing $\\hat{y}_i^{(t+1)}$ against the added noise $\\nu_i^{(t)}$. The following result confirms this intuition.\n\nTheorem 1. Suppose that (2) holds, and that we add noise $\\nu_i^{(t)}$ to our time $t$ predictions. If we estimate $\\gamma$ using linear least squares\n\n$$\\hat{\\gamma} = \\frac{\\widehat{\\operatorname{Cov}}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}], \\, \\nu_i^{(t)}\\right]}{\\widehat{\\operatorname{Var}}\\left[\\nu_i^{(t)}\\right]}, \\quad \\text{then} \\quad \\sqrt{n}\\,(\\hat{\\gamma} - \\gamma) \\Rightarrow \\mathcal{N}\\left(0, \\, \\frac{\\operatorname{Var}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}]\\right]}{\\sigma_\\nu^2}\\right), \\qquad (3)$$\n\nwhere $\\sigma_\\nu^2 = \\operatorname{Var}[\\nu_i^{(t)}]$ and $n$ is the number of examples to which we applied our predictor.\n\nTheorem 1 gives us a baseline understanding for the difficulty of the feedback detection problem: the precision of our feedback estimates scales as the ratio of the artificial noise $\\sigma_\\nu^2$ to natural noise $\\operatorname{Var}[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}]]$. 
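
As a rough illustration of Theorem 1, the following Python sketch simulates the linear feedback model (2) and recovers the slope by regressing next-period predictions on the injected noise. The data-generating process, the function names simulate and slope, and all numerical settings are assumptions chosen for illustration; they are not taken from the experiments in this paper. For contrast, the sketch also regresses on the raw predictions, which mixes trend with feedback.

```python
import random

def slope(xs, ys):
    # Least-squares slope of y on x: sample covariance over sample variance.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def simulate(n, gamma, sigma_nu, seed=0):
    # Hypothetical process (our assumption): the no-feedback prediction follows
    # a trend in y_hat, and feedback adds gamma times the deployed prediction.
    rng = random.Random(seed)
    y_hat, nu, y_next = [], [], []
    for _ in range(n):
        yh = rng.gauss(0.0, 1.0)                # time-t prediction
        v = rng.gauss(0.0, sigma_nu)            # artificial noise
        base = 0.5 * yh + rng.gauss(0.0, 0.2)   # prediction without feedback
        y_hat.append(yh)
        nu.append(v)
        y_next.append(base + gamma * (yh + v))  # linear feedback on deployed value
    return y_hat, nu, y_next

y_hat, nu, y_next = simulate(n=100000, gamma=0.2, sigma_nu=0.2)
gamma_hat = slope(nu, y_next)   # regression on the noise: targets gamma
naive = slope(y_hat, y_next)    # regression on y_hat: trend plus feedback
print(round(gamma_hat, 3), round(naive, 3))
```

In this toy setup the regression on noise targets the feedback slope of 0.2, whereas the naive regression on the raw predictions targets 0.7, the sum of the trend and feedback slopes; only the former isolates the causal effect.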
Note that the proof of Theorem 1 assumes that we only used predictions from a single time period $t+1$ to fit feedback, and that the raw predictions $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}]$ are all independent. If we relax these assumptions we get a regression problem with correlated errors, and need to be more careful with technical conditions.\n\nEfficiency and Conditioning\nThe simple regression model (3) treats the term $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}]$ as noise. This is quite wasteful: if we know $\\hat{y}_i^{(t)}$ we usually have a fairly good idea of what $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}]$ should be, and not using this information needlessly inflates the noise. Suppose that we knew the function (in practice we do not know $\\mu$, but we can estimate it; see Section 4)\n\n$$\\mu(y) := \\mathbb{E}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] \\,\\middle|\\, \\hat{y}_i^{(t)} = y\\right]. \\qquad (4)$$\n\nThen, we could write our feedback model as\n\n$$\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}] = \\mu(\\hat{y}_i^{(t)}) + \\left(\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] - \\mu(\\hat{y}_i^{(t)})\\right) + \\gamma \\nu_i^{(t)}, \\qquad (5)$$\n\nwhere $\\mu(\\hat{y}_i^{(t)})$ is a known offset. Extracting this offset improves the precision of our estimate $\\hat{\\gamma}$.\n\nTheorem 2. Under the conditions of Theorem 1 suppose that the function $\\mu$ from (4) is known and that the $\\hat{y}_i^{(t+1)}$ are all independent of each other conditional on $\\hat{y}_i^{(t)}$. Then, given the information available at time $t$, the estimate\n\n$$\\hat{\\gamma}^* = \\frac{\\widehat{\\operatorname{Cov}}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}] - \\mu(\\hat{y}_i^{(t)}), \\, \\nu_i^{(t)}\\right]}{\\widehat{\\operatorname{Var}}\\left[\\nu_i^{(t)}\\right]} \\qquad (6)$$\n\nhas asymptotic distribution\n\n$$\\sqrt{n}\\,(\\hat{\\gamma}^* - \\gamma) \\Rightarrow \\mathcal{N}\\left(0, \\, \\frac{\\mathbb{E}\\left[\\operatorname{Var}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] \\,\\middle|\\, \\hat{y}_i^{(t)}\\right]\\right]}{\\sigma_\\nu^2}\\right). \\qquad (7)$$\n\nMoreover, if the variance of $\\eta_i^{(t)} = \\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] - \\mu(\\hat{y}_i^{(t)})$ does not depend on $\\hat{y}_i^{(t)}$, then $\\hat{\\gamma}^*$ is the best linear unbiased estimator of $\\gamma$.\n\nTheorem 2 extends the general result from above that the precision with which we can estimate feedback scales as the ratio of artificial noise to natural noise. The reason why $\\hat{\\gamma}^*$ is more efficient than $\\hat{\\gamma}$ is that we managed to condition away some of the natural noise, and reduced the variance of our estimate for $\\gamma$ by\n\n$$\\operatorname{Var}\\left[\\mu(\\hat{y}_i^{(t)})\\right] = \\operatorname{Var}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}]\\right] - \\mathbb{E}\\left[\\operatorname{Var}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] \\,\\middle|\\, \\hat{y}_i^{(t)}\\right]\\right]. \\qquad (8)$$\n\nIn other words, the variance reduction we get from $\\hat{\\gamma}^*$ directly matches the amount of variability we can explain away by conditioning. The estimator (6) is not practical as stated, because it requires knowledge of the unknown function $\\mu$ and is restricted to the case of linear feedback. In the next section, we generalize this estimator into one that does not require prior knowledge of $\\mu$ and can handle non-linear feedback.\n\n4 Fitting Non-Linear Feedback\n\nSuppose now that we have the same setup as in the previous section, except that now feedback has a non-linear dependence on the prediction: $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] = \\hat{y}_i^{(t+1)}[\\varnothing] + f(\\hat{y}_i^{(t)})$ for some arbitrary function $f$. For example, in the case of a linear predictive model $\\hat{y} = w \\cdot x$, this kind of feedback could arise if we have feature feedback $x_i^{(t+1)}[\\hat{y}_i^{(t)}] = x_i^{(t+1)}[\\varnothing] + f^{(x)}(\\hat{y}_i^{(t)})$; the feedback function then becomes $f(\\cdot) = w \\cdot f^{(x)}(\\cdot)$. When we add noise $\\nu_i^{(t)}$ to the above predictions, we only affect the feedback term $f(\\cdot)$:\n\n$$\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}] - \\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] = f(\\hat{y}_i^{(t)} + \\nu_i^{(t)}) - f(\\hat{y}_i^{(t)}). \\qquad (9)$$\n\nThus, by adding artificial noise $\\nu_i^{(t)}$, we are able to cancel out the nuisance terms, and isolate the feedback function $f$ that we want to estimate. We cannot use (9) in practice, though, as we can only observe one of $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}]$ or $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}]$ in reality; the other one is counterfactual. We can get around this problem by conditioning on $\\hat{y}_i^{(t)}$ as in Section 3. Let\n\n$$\\mu(y) = \\mathbb{E}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}] \\,\\middle|\\, \\hat{y}_i^{(t)} = y\\right] = t(y) + \\varphi_N * f(y), \\quad \\text{where} \\quad t(y) = \\mathbb{E}\\left[\\hat{y}_i^{(t+1)}[\\varnothing] \\,\\middle|\\, \\hat{y}_i^{(t)} = y\\right] \\qquad (10)$$\n\nis a term that captures trend effects that are not due to feedback. The $*$ denotes convolution:\n\n$$\\varphi_N * f(y) = \\mathbb{E}\\left[f(\\hat{y}_i^{(t)} + \\nu_i^{(t)}) \\,\\middle|\\, \\hat{y}_i^{(t)} = y\\right] \\quad \\text{with} \\quad \\nu_i^{(t)} \\sim N. \\qquad (11)$$\n\nUsing the conditional mean function $\\mu$ we can write our expression of interest as\n\n$$\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}] - \\mu(\\hat{y}_i^{(t)}) = f(\\hat{y}_i^{(t)} + \\nu_i^{(t)}) - \\varphi_N * f(\\hat{y}_i^{(t)}) + \\eta_i^{(t)}, \\qquad (12)$$\n\nwhere $\\eta_i^{(t)} := \\hat{y}_i^{(t+1)}[\\varnothing] - t(\\hat{y}_i^{(t)})$. If we have a good idea of what $\\mu$ is, the left-hand side can be measured, as it only depends on $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}]$ and $\\hat{y}_i^{(t)}$. Meanwhile, conditional on $\\hat{y}_i^{(t)}$, the first two terms on the right-hand side only depend on $\\nu_i^{(t)}$, while $\\eta_i^{(t)}$ is independent of $\\nu_i^{(t)}$ and mean-zero. The upshot is that we can treat (12) as a regression problem where $\\eta_i^{(t)}$ is noise. In practice, we estimate $\\mu$ from an auxiliary problem where we regress $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}]$ against $\\hat{y}_i^{(t)}$.\n\nA Pragmatic Approach\nThere are many possible approaches to solving the non-parametric system of equations (12) for $f$ [e.g., 8, Chapter 5]. Here, we take a pragmatic approach, and constrain ourselves to solutions of the form $\\hat{\\mu}(y) = \\hat{\\beta}_\\mu \\cdot b_\\mu(y)$ and $\\hat{f}(y) = \\hat{\\beta}_f \\cdot b_f(y)$, where $b_\\mu : \\mathbb{R} \\to \\mathbb{R}^{p_\\mu}$ and $b_f : \\mathbb{R} \\to \\mathbb{R}^{p_f}$ are predetermined basis expansions. This approach transforms our problem into an ordinary least-squares problem, and works well in terms of producing reasonable feedback estimates in real-world problems (see Section 5). If this relation in fact holds for some values $\\beta_\\mu$ and $\\beta_f$, the result below shows that we can recover $\\beta_f$ by least-squares.\n\nTheorem 3. Suppose that $\\beta_\\mu$ and $\\beta_f$ are defined as above, and that we have an unbiased estimator $\\hat{\\beta}_\\mu$ of $\\beta_\\mu$ with variance $V_\\mu = \\operatorname{Var}[\\hat{\\beta}_\\mu]$. 
Then, if we fit $\\beta_f$ by least squares using (12) as described in Appendix A, the resulting estimate $\\hat{\\beta}_f$ is unbiased and has variance\n\n$$\\operatorname{Var}\\left[\\hat{\\beta}_f\\right] = \\left(X_f^\\top X_f\\right)^{-1} X_f^\\top \\left(V_Y + X_\\mu V_\\mu X_\\mu^\\top\\right) X_f \\left(X_f^\\top X_f\\right)^{-1}, \\qquad (13)$$\n\nwhere the design matrices $X_\\mu$ and $X_f$ are defined as\n\n$$X_\\mu = \\begin{pmatrix} \\vdots \\\\ b_\\mu^\\top(\\hat{y}_i^{(t)}) \\\\ \\vdots \\end{pmatrix} \\quad \\text{and} \\quad X_f = \\begin{pmatrix} \\vdots \\\\ b_f^\\top(\\hat{y}_i^{(t)} + \\nu_i^{(t)}) - (\\varphi_N * b_f)^\\top(\\hat{y}_i^{(t)}) \\\\ \\vdots \\end{pmatrix} \\qquad (14)$$\n\nand $V_Y$ is a diagonal matrix with $(V_Y)_{ii} = \\operatorname{Var}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] \\,\\middle|\\, \\hat{y}_i^{(t)}\\right]$.\n\nIn the case where our spline model is misspecified, we can obtain a similar result using methods due to Huber [9] and White [10]. In practice, we can treat $\\hat{\\mu}$ as known since fitting $\\mu(\\cdot)$ is usually easier than fitting $f(\\cdot)$: estimating $\\mu(\\cdot)$ is just a smoothing problem whereas estimating $f(\\cdot)$ requires fitting differences. If we also treat the errors $\\eta_i^{(t)}$ in (12) as roughly homoscedastic, (13) reduces to\n\n$$\\operatorname{Var}\\left[\\hat{\\beta}_f\\right] \\approx \\frac{\\mathbb{E}\\left[\\operatorname{Var}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)}] \\,\\middle|\\, \\hat{y}_i^{(t)}\\right]\\right]}{n \\, \\mathbb{E}\\left[\\|s_i\\|_2^2\\right]}, \\quad \\text{where} \\quad s_i = b_f(\\hat{y}_i^{(t)} + \\nu_i^{(t)}) - \\varphi_N * b_f(\\hat{y}_i^{(t)}). \\qquad (15)$$\n\nThis simplified form again shows that the precision of our estimate of $f(\\cdot)$ scales roughly as the ratio of the variance of the artificial noise $\\nu_i^{(t)}$ to the variance of the natural noise.\n\nOur Method in Practice\nFor convenience, we summarize the steps needed to implement our method here: (1) At time $t$, compute model predictions $\\hat{y}_i^{(t)}$ and draw noise terms $\\nu_i^{(t)} \\overset{\\mathrm{iid}}{\\sim} N$ for some noise distribution $N$. Deploy predictions $\\check{y}_i^{(t)} = \\hat{y}_i^{(t)} + \\nu_i^{(t)}$ in the live system. (2) Fit a non-parametric least-squares regression of $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}] \\sim \\mu(\\hat{y}_i^{(t)})$ to learn the function $\\mu(y) := \\mathbb{E}\\left[\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}] \\,\\middle|\\, \\hat{y}_i^{(t)} = y\\right]$. We use the R formula notation, where $a \\sim g(b)$ means that we want to learn a function $g(b)$ that predicts $a$. (3) Set up the non-parametric least-squares regression problem\n\n$$\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}] - \\mu(\\hat{y}_i^{(t)}) \\sim f(\\hat{y}_i^{(t)} + \\nu_i^{(t)}) - \\varphi_N * f(\\hat{y}_i^{(t)}), \\qquad (16)$$\n\nwhere the goal is to learn $f$. Here, $\\varphi_N$ is the density of $\\nu_i^{(t)}$, and $*$ denotes convolution. In Appendix A we show how to carry out these steps using standard R libraries.\n\nThe resulting function $f(y)$ is our estimate of feedback: If we make a prediction $\\check{y}_i^{(t)}$ at time $t$, then our time $t+1$ prediction will be boosted by $f(\\check{y}_i^{(t)})$. The above equation only depends on $\\hat{y}_i^{(t)}$, $\\nu_i^{(t)}$, and $\\hat{y}_i^{(t+1)}[\\hat{y}_i^{(t)} + \\nu_i^{(t)}]$, which are all quantities that can be observed in the context of an experiment with noised predictions. Note that as we only fit $f$ using the differences in (16), the intercept of $f$ is not identifiable. We fix the intercept (rather arbitrarily) by setting the average fitted feedback over all training examples to 0; we do not include an intercept term in the basis $b_f$.\n\nChoice of Noising Distribution\nAdding noise to deployed predictions often has a cost that may depend on the shape of the noise distribution $N$. A good choice of $N$ should reflect this cost. For example, if the practical cost of adding noise only depends on the largest amount of noise we ever add, then it may be a good idea to draw $\\nu_i^{(t)}$ uniformly at random from $\\{\\pm\\varepsilon\\}$ for some $\\varepsilon > 0$. 
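
To make steps (1) to (3) concrete with the two-point noise just described, here is a minimal Python sketch. It is not the R implementation from Appendix A: the simulated data, the linear basis used for the conditional mean, the single step-function basis for the feedback (a jump at zero), and every function name are assumptions made for illustration. With noise drawn from plus or minus eps, the convolution in (16) is an exact two-term average, so no numerical integration is needed.

```python
import random

def simulate(n, eps, jump, seed=0):
    # Hypothetical data-generating process (an assumption, not from the paper):
    # next prediction = trend + additive feedback f(deployed) + natural noise,
    # with f(y) = jump * 1{y > 0}.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y_hat = rng.gauss(0.0, 1.0)               # time-t prediction
        nu = eps if rng.random() < 0.5 else -eps  # step (1): two-point noise
        deployed = y_hat + nu                     # noised prediction that is deployed
        feedback = jump if deployed > 0 else 0.0
        y_next = 0.8 * y_hat + feedback + rng.gauss(0.0, 0.3)
        rows.append((y_hat, nu, y_next))
    return rows

def fit_line(xs, ys):
    # Ordinary least squares for y ~ a + b * x, in closed form.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def step(y):
    # Single basis function b_f(y) = 1{y > 0}: a jump at zero.
    return 1.0 if y > 0 else 0.0

def estimate_jump(rows, eps):
    y_hat = [r[0] for r in rows]
    y_next = [r[2] for r in rows]
    # Step (2): estimate mu by regressing next predictions on y_hat
    # (a linear basis, chosen here only for brevity).
    a, b = fit_line(y_hat, y_next)
    num = 0.0
    den = 0.0
    for yh, nu, yn in rows:
        resid = yn - (a + b * yh)
        # Step (3): regressor b_f(y_hat + nu) - (phi_N * b_f)(y_hat).
        conv = 0.5 * (step(yh + eps) + step(yh - eps))
        s = step(yh + nu) - conv
        num += s * resid
        den += s * s
    return num / den   # least-squares coefficient on the jump basis

rows = simulate(n=20000, eps=0.5, jump=0.3)
jump_hat = estimate_jump(rows, eps=0.5)
print(round(jump_hat, 3))
```

Because the regressor in step (3) has mean zero given the raw prediction, errors in the fitted conditional mean add noise to the estimate rather than bias, which is why a crude linear fit suffices in this toy example. Only examples whose raw prediction lies within eps of the jump contribute to the fit, which illustrates why the feedback is measured locally.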
In our experiments, we draw noise from a Gaussian distribution $\\nu_i^{(t)} \\sim \\mathcal{N}(0, \\sigma_\\nu^2)$.\n\n5 A Pilot Study\n\nThe original motivation for this research was to develop a methodology for detecting feedback in real-world systems. Here, we present results from a pilot study, where we added signal to historical data that we believe should emulate actual feedback. The reason for monitoring feedback on this system is that our system was about to be more closely integrated with other predictive systems, and there was a concern that the integration could induce bad feedback loops. Having a reliable method for detecting feedback would provide us with an early warning system during the integration.\n\nThe predictive model in question is a logistic regression classifier. We added feedback to historical data collected from log files according to half a dozen rules of the form \u201cif $a_i^{(t)}$ is high and $\\check{y}_i^{(t)} > 0$, then increase $a_i^{(t+1)}$ by a random amount\u201d; here $\\check{y}_i^{(t)}$ is the time-$t$ prediction deployed by our system (in log-odds space) and $a_i^{(t)}$ is some feature with a positive coefficient. These feedback generation rules do not obey the additive assumption. Thus our model is misspecified in the sense that there is no function $f$ such that a current prediction $\\check{y}_i^{(t)}$ increased the log-odds of the next prediction by $f(\\check{y}_i^{(t)})$, and so this example can be taken as a stretch case for our method.\n\nOur dataset had on the order of 100,000 data points, half of which were used for fitting the model itself and half of which were used for feedback simulation. We generated data for 5 simulated time periods, adding noise with $\\sigma_\\nu = 0.1$ at each step, and fit feedback using a spline basis discussed in Appendix B. 
The \u201ctrue feedback\u201d curve was obtained by fitting a spline regression to the additive feedback model by looking at the unobservable $\\hat{y}_i^{(t+1)}[\\varnothing]$; we used a df = 5 natural spline with knots evenly spread out on $[-9, 3]$ in log-odds space plus a jump at 0.\n\nFor our classifier of interest, we have fairly strong reasons to believe that the feedback function may have a jump at zero, but probably shouldn\u2019t have any other big jumps. Assuming that we know a priori where to look for jumps does not seem to be too big a problem for the practical applications we have considered. Results for feedback detection are shown in Figure 1. Although the fit is not perfect, we appear to have successfully detected the shape of feedback. The error bars for estimated feedback were obtained using a non-parametric bootstrap [11] for which we resampled pairs of (current, next) predictions.\n\nThis simulation suggests that our method can be used to accurately detect feedback on scales that may affect real-world systems. Knowing that we can detect feedback is reassuring from an engineering point of view. On a practical level, the feedback curve shown in Figure 1 may not be too big a concern yet: the average feedback is well within the noise level of the classifier. But in large-scale systems the ways in which a model interacts with its environment are always changing, and it is entirely plausible that some innocuous-looking change in the future would increase the amount of feedback. Our methodology provides us with a way to continuously monitor how feedback is affected by changes to the system, and can alert us to changes that cause problems. 
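
The pairs bootstrap used for the error bars above can be sketched as follows. This is an illustrative Python sketch under assumptions of our own: the records are simulated, the statistic is the simple slope on the injected noise rather than the spline fit used in the pilot study, and the number of replicates and all names are hypothetical.

```python
import random
import statistics

def noise_slope(records):
    # Least-squares slope of the next prediction on the injected noise.
    n = len(records)
    mean_nu = sum(r[1] for r in records) / n
    mean_next = sum(r[2] for r in records) / n
    cov = sum((r[1] - mean_nu) * (r[2] - mean_next) for r in records)
    var = sum((r[1] - mean_nu) ** 2 for r in records)
    return cov / var

def bootstrap_se(records, estimator, n_boot, seed=0):
    # Resample whole records with replacement so that each current prediction
    # stays paired with its own noise draw and next prediction.
    rng = random.Random(seed)
    n = len(records)
    replicates = []
    for _ in range(n_boot):
        sample = [records[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(sample))
    return statistics.stdev(replicates)

# Hypothetical records of (current prediction, noise, next prediction).
rng = random.Random(1)
records = []
for _ in range(2000):
    y_hat = rng.gauss(0.0, 1.0)
    nu = rng.gauss(0.0, 0.2)
    y_next = 0.5 * y_hat + 0.2 * (y_hat + nu) + rng.gauss(0.0, 0.2)
    records.append((y_hat, nu, y_next))

se = bootstrap_se(records, noise_slope, n_boot=200)
print(round(noise_slope(records), 3), round(se, 3))
```

Resampling whole records, rather than resampling current and next predictions separately, preserves the dependence between each prediction, its noise draw, and its successor, which is what the standard error is meant to reflect.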
In Appendix B, we show some simulations with a wider range of effect sizes.\n\n[Figure 1 plot: feedback (y-axis) against prediction (x-axis), with curves labeled True Feedback and Estimated Feedback.]\n\nFigure 1: Simulation aiming to replicate realistic feedback in a real-world classifier. The red solid line is our feedback estimate; the black dashed line is the best additive approximation to the true feedback. The x-axis shows predictions in probability space; the y-axis shows feedback in log-odds space. The error bars indicate pointwise confidence intervals obtained using a non-parametric bootstrap with B = 10 replicates, and stretch 1 SE in each direction. Further experiments are provided in Appendix B.\n\n6 Conclusion\n\nIn this paper, we proposed a randomization scheme that can be used to detect feedback in real-world predictive systems. Our method involves adding noise to the predictions made by the system; this noise puts us in a randomized experimental setup that lets us measure feedback as a causal effect. In general, the scale of the artificial noise required to detect feedback is smaller than the scale of the natural predictor noise; thus, we can deploy our feedback detection method without disturbing our system of interest too much. The method does not require us to make hypotheses about the mechanism through which feedback may propagate, and so it can be used to continuously monitor predictive systems and alert us if any changes to the system lead to an increase in feedback.\n\nRelated Work\nThe interaction between models and the systems they attempt to describe has been extensively studied across many fields. Models can have different kinds of feedback effects on their environments. 
At one extreme of the spectrum, models can become self-fulfilling prophecies: for example, models that predict economic growth may in fact cause economic growth by instilling market confidence [12, 13]. At the other end, models may distort the phenomena they seek to describe and therefore become invalid. A classical example of this is a concern that any metric used to regulate financial risk may become invalid as soon as it is widely used, because actors in the financial market may attempt to game the metric to avoid regulation [14]. However, much of the work on model feedback in fields like finance, education, or macro-economic theory has focused on negative results: there is an emphasis on understanding when feedback can happen and promoting awareness about how feedback can interact with policy decisions, but there does not appear to be much focus on actually fitting feedback. One notable exception is a paper by Akaike [15], who showed how to fit cross-component feedback in a system with many components; however, he did not add artificial noise to the system, and so was unable to detect feedback of a single component on itself.\n\nAcknowledgments\nThe authors are grateful to Alex Blocker, Randall Lewis, and Brad Efron for helpful suggestions and interesting conversations. S. W. is supported by a B. C. and E. J. Eaves Stanford Graduate Fellowship.\n\nReferences\n\n[1] Paul W Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945\u2013960, 1986.\n\n[2] Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444\u2013455, 1996.\n\n[3] Bradley Efron and David Feldman. Compliance as an explanatory variable in clinical trials. 
Journal of the American Statistical Association, 86(413):9\u201317, 1991.\n[4] Guido W Imbens and Joshua D Angrist. Identification and estimation of local average treatment effects. Econometrica, 62(2):467\u2013475, 1994.\n[5] L\u00e9on Bottou, Jonas Peters, Joaquin Qui\u00f1onero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207\u20133260, 2013.\n[6] David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, and Diane Lambert. Evaluating online ad campaigns in a pipeline: Causal models at scale. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 7\u201316. ACM, 2010.\n[7] Donald B Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469):322\u2013331, 2005.\n[8] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer New York, second edition, 2009.\n[9] Peter J Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 221\u2013233, 1967.\n[10] Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, 48(4):817\u2013838, 1980.\n[11] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. CRC Press, 1993.\n[12] Robert K Merton. The self-fulfilling prophecy. The Antioch Review, 8(2):193\u2013210, 1948.\n[13] Fabrizio Ferraro, Jeffrey Pfeffer, and Robert I Sutton. Economics language and assumptions: How theories can become self-fulfilling. Academy of Management Review, 30(1):8\u201324, 2005.\n[14] J\u00f3n Dan\u00edelsson. 
The emperor has no clothes: Limits to risk modelling. Journal of Banking & Finance, 26(7):1273\u20131296, 2002.\n[15] Hirotugu Akaike. On the use of a linear model for the identification of feedback systems. Annals of the Institute of Statistical Mathematics, 20(1):425\u2013439, 1968.\n[16] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1\u201350, 2000.\n[17] Federico Girosi, Michael Jones, and Tomaso Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219\u2013269, 1995.\n[18] Peter J Green and Bernard W Silverman. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall London, 1994.\n[19] Trevor Hastie and Robert Tibshirani. Generalized Additive Models. CRC Press, 1990.\n[20] Grace Wahba. Spline Models for Observational Data. SIAM, 1990.\n[21] Stefanie Biedermann, Holger Dette, and David C Woods. Optimal design for additive partially nonlinear models. Biometrika, 98(2):449\u2013458, 2011.\n[22] Werner G M\u00fcller. Optimal design for local fitting. Journal of Statistical Planning and Inference, 55(3):389\u2013397, 1996.\n[23] William J Studden and D J VanArman. Admissible designs for polynomial spline regression. The Annals of Mathematical Statistics, 40(5):1557\u20131569, 1969.\n[24] Erich Leo Lehmann and George Casella. Theory of Point Estimation. Springer, second edition, 1998.\n", "award": [], "sourceid": 1783, "authors": [{"given_name": "Stefan", "family_name": "Wager", "institution": "Stanford University"}, {"given_name": "Nick", "family_name": "Chamandy", "institution": "Google"}, {"given_name": "Omkar", "family_name": "Muralidharan", "institution": "Google"}, {"given_name": "Amir", "family_name": "Najmi", "institution": "Google"}]}