{"title": "Off-policy evaluation for slate recommendation", "book": "Advances in Neural Information Processing Systems", "page_first": 3632, "page_last": 3642, "abstract": "This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context---a common scenario in web search, ads, and recommendation. We build on techniques from combinatorial bandits to introduce a new practical estimator that uses logged data to estimate a policy's performance.  A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance. We derive conditions under which our estimator is unbiased---these conditions are weaker than prior heuristics for slate evaluation---and experimentally demonstrate a smaller bias than parametric approaches, even when these conditions are violated. Finally, our theory and experiments also show exponential savings in the amount of required data compared with general unbiased estimators.", "full_text": "Off-policy evaluation for slate recommendation\n\nAdith Swaminathan\n\nMicrosoft Research, Redmond\nadswamin@microsoft.com\n\nAkshay Krishnamurthy\n\nUniversity of Massachusetts, Amherst\n\nakshay@cs.umass.edu\n\nAlekh Agarwal\n\nMicrosoft Research, New York\nalekha@microsoft.com\n\nMiroslav Dud\u00edk\n\nMicrosoft Research, New York\nmdudik@microsoft.com\n\nJohn Langford\n\nMicrosoft Research, New York\n\njcl@microsoft.com\n\nDamien Jose\n\nMicrosoft, Redmond\n\ndajose@microsoft.com\n\nImed Zitouni\n\nMicrosoft, Redmond\n\nizitouni@microsoft.com\n\nAbstract\n\nThis paper studies the evaluation of policies that recommend an ordered set of\nitems (e.g., a ranking) based on some context\u2014a common scenario in web search,\nads, and recommendation. 
We build on techniques from combinatorial bandits to\nintroduce a new practical estimator that uses logged data to estimate a policy\u2019s\nperformance. A thorough empirical evaluation on real-world data reveals that our\nestimator is accurate in a variety of settings, including as a subroutine in a learning-\nto-rank task, where it achieves competitive performance. We derive conditions\nunder which our estimator is unbiased\u2014these conditions are weaker than prior\nheuristics for slate evaluation\u2014and experimentally demonstrate a smaller bias than\nparametric approaches, even when these conditions are violated. Finally, our theory\nand experiments also show exponential savings in the amount of required data\ncompared with general unbiased estimators.\n\n1\n\nIntroduction\n\nIn recommendation systems for e-commerce, search, or news, we would like to use the data collected\nduring operation to test new content-serving algorithms (called policies) along metrics such as\nrevenue and number of clicks [4, 25]. This task is called off-policy evaluation. General approaches,\nnamely inverse propensity scores (IPS) [13, 18], require unrealistically large amounts of logged\ndata to evaluate whole-page metrics that depend on multiple recommended items, which happens\nwhen showing ranked lists. The key challenge is that the number of possible lists (called slates) is\ncombinatorially large. As a result, the policy being evaluated is likely to choose different slates from\nthose recorded in the logs most of the time, unless it is very similar to the data-collection policy. 
This\nchallenge is fundamental [34], so any off-policy evaluation method that works with large slates needs\nto make some structural assumptions about the whole-page metric or the user behavior.\nPrevious work on off-policy evaluation and whole-page optimization improves the probability of a\nmatch between logging and evaluation by restricting attention to small slate spaces [35, 26], introducing\nassumptions that allow for partial matches between the proposed and observed slates [27],\nor assuming that the policies used for logging and evaluation are similar [4, 32]. Another line of\nwork constructs parametric models of slate quality [8, 16, 14] (see also Sec. 4.3 of [17]). While these\napproaches require less data, they can have large bias, and their use in practice requires an expensive\ntrial-and-error cycle involving weeks-long A/B tests to develop new policies [20]. In this paper we\ndesign a method more robust to problems with bias and with only modest data requirements, with the\ngoal of substantially shortening this cycle and accelerating the policy development process.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Off-policy evaluation of two whole-page user-satisfaction metrics on proprietary search\nengine data. Average RMSE of different estimators over 50 runs on a log-log scale. Our method\n(PI) achieves the best performance with moderate data sizes. The unbiased IPS method suffers high\nvariance, and direct modeling (DM) of the metrics suffers high bias. ONPOLICY is the expensive\nchoice of deploying the policy, for instance, in an A/B test.\n\nWe frame the slate recommendation problem as a combinatorial generalization of contextual bandits\n[3, 23, 13]. In combinatorial contextual bandits, for each context, a policy selects a slate consisting\nof component actions, after which a reward for the entire slate is observed. 
In web search, the context\nis the search query augmented with a user profile, the slate is the search results page consisting of a\nlist of retrieved documents (actions), and example reward metrics are page-level measures such as\ntime-to-success, NDCG (position-weighted relevance), or other measures of user satisfaction. As\ninput we receive contextual bandit data obtained by some logging policy, and our goal is to estimate\nthe reward of a new target policy. This off-policy setup differs from online learning in contextual\nbandits, where the goal is to adaptively maximize the reward in the presence of an explore-exploit\ntrade-off [5].\nInspired by work in combinatorial and linear bandits [7, 31, 11], we propose an estimator that\nmakes only a weak assumption about the evaluated metric, while exponentially reducing the data\nrequirements in comparison with IPS. Specifically, we posit a linearity assumption, stating that the\nslate-level reward (e.g., time to success in web search) decomposes additively across actions, but\nthe action-level rewards are not observed. Crucially, the action-level rewards are allowed to depend\non the context, and we do not require that they be easily modeled from the features describing the\ncontext. In fact, our method is completely agnostic to the representation of contexts.\nWe make the following contributions:\n1. The pseudoinverse estimator (PI) for off-policy evaluation: a general-purpose estimator from\nthe combinatorial bandit literature, adapted for off-policy evaluation. When ranking \u2113 out of\nm items under the linearity assumption, PI typically requires $O(\\ell m/\\varepsilon^2)$ samples to achieve\nerror at most \u03b5, an exponential gain over the $m^{\\Omega(\\ell)}$ sample complexity of IPS. We provide\ndistribution-dependent bounds based on the overlap between logging and target policies.\n\n2. 
Experiments on real-world search ranking datasets: The strong performance of the PI estimator\nprovides, to our knowledge, the first demonstration of high-quality off-policy evaluation of\nwhole-page metrics, comprehensively outperforming prior baselines (see Fig. 1).\n\n3. Off-policy optimization: We provide a simple procedure for learning to rank (L2R) using the\nPI estimator to impute action-level rewards for each context. This allows direct optimization of\nwhole-page metrics via pointwise L2R approaches, without requiring pointwise feedback.\n\n[Figure 1 plot panels: Reward: Negative Time-to-success and Reward: Utility Rate; x-axis: number of logged samples (n); y-axis: RMSE; curves: OnPolicy, IPS, DM: tree, PI.]\n\nRelated work Large slate spaces have typically been studied in the online, or on-policy, setting.\nSome works assume specific parametric (e.g., linear) models relating the metrics to the features\ndescribing a slate [2, 31, 15, 10, 29]; this can lead to bias if the model is inaccurate (e.g., we might\nnot have access to sufficiently predictive features). Others posit the same linearity assumption as we\ndo, but further assume a semi-bandit feedback model where the rewards of all actions on the slate\nare revealed [19, 22, 21]. While much of the research focuses on the on-policy setting, the off-policy\nparadigm studied in this paper is often preferred in practice since it might not be possible to implement\nlow-latency updates needed for online learning, or we might be interested in many different metrics\nand require a manual review of their trade-offs before deploying new policies.\nAt a technical level, the PI estimator has been used in online learning [7, 31, 11], but the analysis\nthere is tailored to the specific data collection policies used by the learner. 
In contrast, we provide\ndistribution-dependent bounds without any assumptions on the logging or target policy.\n\n2 Setting and notation\n\nIn combinatorial contextual bandits, a decision maker repeatedly interacts as follows:\n1. the decision maker observes a context x drawn from a distribution D(x) over some space X;\n2. based on the context, the decision maker chooses a slate s = (s1, . . . , s\u2113) consisting of actions sj,\nwhere a position j is called a slot, the number of slots is \u2113, actions at position j come from some\nspace Aj(x), and the slate s is chosen from a set of allowed slates S(x) \u2286 A1(x) \u00d7 \u00b7\u00b7\u00b7 \u00d7 A\u2113(x);\n3. given the context and slate, a reward r \u2208 [\u22121, 1] is drawn from a distribution D(r | x, s); rewards\nin different rounds are independent, conditioned on contexts and slates.\nThe context space X can be infinite, but the set of actions is finite. We assume |Aj(x)| = mj for all\ncontexts x \u2208 X and define m := maxj mj as the maximum number of actions per slot. The goal of\nthe decision maker is to maximize the reward. The decision maker is modeled as a stochastic policy\n\u03c0 that specifies a conditional distribution \u03c0(s | x) (a deterministic policy is a special case). The value\nof a policy \u03c0, denoted V (\u03c0), is defined as the expected reward when following \u03c0:\n\n$V(\\pi) := \\mathbb{E}_{x \\sim D} \\, \\mathbb{E}_{s \\sim \\pi(\\cdot \\mid x)} \\, \\mathbb{E}_{r \\sim D(\\cdot \\mid x, s)}[r]$ .   (1)\n\nTo simplify derivations, we extend the conditional distribution \u03c0 into a distribution over triples\n(x, s, r) as \u03c0(x, s, r) := D(r | x, s)\u03c0(s | x)D(x). With this shorthand, we have V (\u03c0) = E\u03c0[r].\nTo finish this section, we introduce notation for the expected reward for a given context and slate,\nwhich we call the slate value, and denote as:\n\n$V(x, s) := \\mathbb{E}_{r \\sim D(\\cdot \\mid x, s)}[r]$ .   (2)\n\nExample 1 (Cartesian product). 
Consider the optimization of a news portal where the reward is the\nwhole-page advertising revenue. Context x is the user profile, slate is the news-portal page with slots\ncorresponding to news sections,1 and actions are the articles. The set of valid slates is the Cartesian\nproduct $S(x) = \\prod_{j \\le \\ell} A_j(x)$. The number of valid slates is exponential in \u2113: $|S(x)| = \\prod_{j \\le \\ell} m_j$.\n\nExample 2 (Ranking). Consider web search and ranking. Context x is the query along with user\nprofile. Actions correspond to search items (such as webpages). The policy chooses \u2113 of m items,\nwhere the set A(x) of m items for a context x is chosen from a corpus by a filtering step (e.g., a\ndatabase query). We have Aj(x) = A(x) for all j \u2264 \u2113, but the allowed slates S(x) have no repetitions.\nThe number of valid slates is exponential in \u2113: $|S(x)| = m!/(m - \\ell)! = m^{\\Omega(\\ell)}$. A reward could be\nthe negative time-to-success, i.e., negative of the time taken by the user to find a relevant item.\n\n2.1 Off-policy evaluation and optimization\n\nIn the off-policy setting, we have access to the logged data (x1, s1, r1), . . . , (xn, sn, rn) collected\nusing a past policy \u00b5, called the logging policy. Off-policy evaluation is the task of estimating the\nvalue of a new policy \u03c0, called the target policy, using the logged data. Off-policy optimization is the\nharder task of finding a policy $\\hat{\\pi}$ that achieves maximal reward.\nThere are two standard approaches for off-policy evaluation. The direct method (DM) uses the logged\ndata to train a (parametric) model $\\hat{r}(x, s)$ for predicting the expected reward for a given context and\nslate. 
V (\u03c0) is then estimated as\n\n$\\hat{V}_{\\mathrm{DM}}(\\pi) = \\frac{1}{n} \\sum_{i=1}^{n} \\sum_{s \\in S(x_i)} \\hat{r}(x_i, s) \\, \\pi(s \\mid x_i)$ .   (3)\n\n1 For simplicity, we do not discuss the more general setting of showing multiple articles in each news section.\n\nThe direct method is often biased due to mismatch between model assumptions and ground truth.\nThe second approach, which is provably unbiased (under modest assumptions), is the inverse propensity\nscore (IPS) estimator [18]. The IPS estimator re-weights the logged data according to ratios of\nslate probabilities under the target and logging policy. It has two common variants:\n\n$\\hat{V}_{\\mathrm{IPS}}(\\pi) = \\frac{1}{n} \\sum_{i=1}^{n} r_i \\cdot \\frac{\\pi(s_i \\mid x_i)}{\\mu(s_i \\mid x_i)}$ ,   $\\hat{V}_{\\mathrm{wIPS}}(\\pi) = \\sum_{i=1}^{n} r_i \\cdot \\frac{\\pi(s_i \\mid x_i)}{\\mu(s_i \\mid x_i)} \\Big/ \\Big( \\sum_{i=1}^{n} \\frac{\\pi(s_i \\mid x_i)}{\\mu(s_i \\mid x_i)} \\Big)$ .   (4)\n\nwIPS generally has lower variance, at the cost of a bias that vanishes asymptotically. The variance of both estimators\ngrows linearly with $\\frac{\\pi(s \\mid x)}{\\mu(s \\mid x)}$, which can be \u2126(|S(x)|). This is prohibitive when $|S(x)| = m^{\\Omega(\\ell)}$.\n\n3 Our approach\n\nThe IPS estimator is minimax optimal [34], so its exponential variance is unavoidable in the worst\ncase. We circumvent this hardness by positing an assumption on the structure of rewards. Specifically,\nwe assume that the slate-level reward is a sum of unobserved action-level rewards that depend on the\ncontext, the action, and the position on the slate, but not on the other actions on the slate.\nFormally, we consider slate indicator vectors in $\\mathbb{R}^{\\ell m}$ whose components are indexed by pairs (j, a)\nof slots and possible actions in them. A slate is described by an indicator vector $\\mathbf{1}_s \\in \\mathbb{R}^{\\ell m}$ whose\nentry at position (j, a) is equal to 1 if the slate s has action a in slot j, i.e., if sj = a. The above\nassumption is formalized as follows:\nAssumption 1 (Linearity Assumption). For each context x \u2208 X there exists an (unknown) intrinsic\nreward vector $\\phi_x \\in \\mathbb{R}^{\\ell m}$ such that the slate value satisfies $V(x, s) = \\mathbf{1}_s^T \\phi_x = \\sum_{j=1}^{\\ell} \\phi_x(j, s_j)$.\nThe slate indicator vector can be viewed as a feature vector, representing the slate, and \u03c6x can be\nviewed as a context-specific weight vector. The assumption refers to the fact that the value of a slate\nis a linear function of its feature representation. However, note that this linear dependence is allowed\nto be completely different across contexts, because we make no assumptions on how \u03c6x depends\non x, and in fact our method does not even attempt to accurately estimate \u03c6x. Being agnostic to the\nform of \u03c6x is the key departure from the direct method and parametric bandits.\nWhile Assumption 1 rules out interactions among different actions on a slate,2 its ability to vary\nintrinsic rewards arbitrarily across contexts captures many common metrics in information retrieval,\nsuch as the normalized discounted cumulative gain (NDCG) [6], a common metric in web ranking:\nExample 3 (NDCG). For a slate s, we first define $\\mathrm{DCG}(x, s) := \\sum_{j=1}^{\\ell} \\frac{2^{\\mathrm{rel}(x, s_j)} - 1}{\\log_2(j+1)}$ where rel(x, a)\nis the relevance of document a on query x. Then $\\mathrm{NDCG}(x, s) := \\mathrm{DCG}(x, s)/\\mathrm{DCG}^{\\star}(x)$ where\n$\\mathrm{DCG}^{\\star}(x) = \\max_{s \\in S(x)} \\mathrm{DCG}(x, s)$, so NDCG takes values in [0, 1]. Thus, NDCG satisfies Assumption 1\nwith $\\phi_x(j, a) = \\big(2^{\\mathrm{rel}(x, a)} - 1\\big) \\big/ \\big(\\log_2(j+1) \\, \\mathrm{DCG}^{\\star}(x)\\big)$.\n\nIn addition to Assumption 1, we also make the standard assumption that the logging policy puts\nnon-zero probability on all slates that can be potentially chosen by the target policy. This assumption\nis also required for IPS, otherwise unbiased off-policy evaluation is impossible [24].\nAssumption 2 (Absolute Continuity). 
The off-policy evaluation problem satisfies the absolute\ncontinuity assumption if \u00b5(s | x) > 0 whenever \u03c0(s | x) > 0 with probability one over x \u223c D.\n\n2 We discuss limitations of Assumption 1 and directions to overcome them in Sec. 5.\n\n3.1 The pseudoinverse estimator\n\nUsing Assumption 1, we can now apply the techniques from the combinatorial bandit literature to\nour problem. In particular, our estimator closely follows the recipe of Cesa-Bianchi and Lugosi\n[7], albeit with some differences to account for the off-policy and contextual nature of our setup.\nUnder Assumption 1, we can view the recovery of \u03c6x for a given context x as a linear regression\nproblem. The covariates $\\mathbf{1}_s$ are drawn according to \u00b5(\u00b7 | x), and the reward follows a linear model,\nconditional on s and x, with \u03c6x as the \u201cweight vector\u201d. Thus, we can write the MSE of an estimate w\nas $\\mathbb{E}_{s \\sim \\mu(\\cdot \\mid x)} \\mathbb{E}_{r \\sim D(\\cdot \\mid s, x)}[(\\mathbf{1}_s^T w - r)^2]$, or more compactly as $\\mathbb{E}_{\\mu}[(\\mathbf{1}_s^T w - r)^2 \\mid x]$, using our definition\nof \u00b5 as a distribution over triples (x, s, r). We estimate \u03c6x by the MSE minimizer with the smallest\nnorm, which can be written in closed form as\n\n$\\bar{\\phi}_x = \\big(\\mathbb{E}_{\\mu}[\\mathbf{1}_s \\mathbf{1}_s^T \\mid x]\\big)^{\\dagger} \\, \\mathbb{E}_{\\mu}[r \\mathbf{1}_s \\mid x]$ ,   (5)\n\nwhere $M^{\\dagger}$ is the Moore-Penrose pseudoinverse of a matrix M. Note that this idealized \u201cestimator\u201d\n$\\bar{\\phi}_x$ uses conditional expectations over s \u223c \u00b5(\u00b7 | x) and r \u223c D(\u00b7 | s, x). To simplify the notation,\nwe write $\\Gamma_{\\mu,x} := \\mathbb{E}_{\\mu}[\\mathbf{1}_s \\mathbf{1}_s^T \\mid x] \\in \\mathbb{R}^{\\ell m \\times \\ell m}$ to denote the (uncentered) covariance matrix for our\nregression problem, appearing on the right-hand side of Eq. (5). We also introduce notation for the\nsecond term in Eq. 
(5) and its empirical estimate: $\\theta_{\\mu,x} := \\mathbb{E}_{\\mu}[r \\mathbf{1}_s \\mid x]$, and $\\hat{\\theta}_i := r_i \\mathbf{1}_{s_i}$.\nThus, our regression estimator (5) is simply $\\bar{\\phi}_x = \\Gamma^{\\dagger}_{\\mu,x} \\theta_{\\mu,x}$. Under Assumptions 1 and 2, it is easy to\nshow that $V(x, s) = \\mathbf{1}_s^T \\bar{\\phi}_x = \\mathbf{1}_s^T \\Gamma^{\\dagger}_{\\mu,x} \\theta_{\\mu,x}$. Replacing $\\theta_{\\mu,x}$ with $\\hat{\\theta}_i$ motivates the following estimator\nfor V (\u03c0), which we call the pseudoinverse estimator or PI:\n\n$\\hat{V}_{\\mathrm{PI}}(\\pi) = \\frac{1}{n} \\sum_{i=1}^{n} \\sum_{s \\in S} \\pi(s \\mid x_i) \\, \\mathbf{1}_s^T \\Gamma^{\\dagger}_{\\mu,x_i} \\hat{\\theta}_i = \\frac{1}{n} \\sum_{i=1}^{n} r_i \\cdot q_{\\pi,x_i}^T \\Gamma^{\\dagger}_{\\mu,x_i} \\mathbf{1}_{s_i}$ .   (6)\n\nIn Eq. (6) we have expanded the definition of $\\hat{\\theta}_i$ and introduced the notation $q_{\\pi,x}$ for the expected\nslate indicator under \u03c0 conditional on x, $q_{\\pi,x} := \\mathbb{E}_{\\pi}[\\mathbf{1}_s \\mid x]$. The summation over s required to obtain\n$q_{\\pi,x_i}$ in Eq. (6) can be replaced by a small sample. We can also derive a weighted variant of PI:\n\n$\\hat{V}_{\\mathrm{wPI}}(\\pi) = \\frac{\\sum_{i=1}^{n} r_i \\cdot q_{\\pi,x_i}^T \\Gamma^{\\dagger}_{\\mu,x_i} \\mathbf{1}_{s_i}}{\\sum_{i=1}^{n} q_{\\pi,x_i}^T \\Gamma^{\\dagger}_{\\mu,x_i} \\mathbf{1}_{s_i}}$ .   (7)\n\nWe prove the following unbiasedness property in Appendix A.\nProposition 1. If Assumptions 1 and 2 hold, then the estimator $\\hat{V}_{\\mathrm{PI}}$ is unbiased, i.e., $\\mathbb{E}_{\\mu^n}[\\hat{V}_{\\mathrm{PI}}] = V(\\pi)$,\nwhere the expectation is over the n logged examples sampled i.i.d. from \u00b5.\n\nAs special cases, PI reduces to IPS when \u2113 = 1, and simplifies to $\\sum_{i=1}^{n} r_i / n$ when \u03c0 = \u00b5 (see\nAppendix C). To build further intuition, we consider the settings of Examples 1 and 2, and simplify\nthe PI estimator to highlight the improvement over IPS.\nExample 4 (PI for a Cartesian product when \u00b5 is a product distribution). The PI estimator for the\nCartesian product slate space, when \u00b5 factorizes across slots as $\\mu(s \\mid x) = \\prod_j \\mu(s_j \\mid x)$, simplifies to\n\n$\\hat{V}_{\\mathrm{PI}}(\\pi) = \\frac{1}{n} \\sum_{i=1}^{n} r_i \\cdot \\Big( \\sum_{j=1}^{\\ell} \\frac{\\pi(s_{ij} \\mid x_i)}{\\mu(s_{ij} \\mid x_i)} - \\ell + 1 \\Big)$ ,\n\nby Prop. 2 in Appendix D. Note that unlike IPS, which divides by probabilities of whole slates, the PI\nestimator only divides by probabilities of actions appearing in individual slots. Thus, the magnitude\nof each term of the outer summation is only $O(\\ell m)$, whereas the IPS terms are $m^{\\Omega(\\ell)}$.\nExample 5 (PI for rankings with \u2113 = m and uniform logging). In this case,\n\n$\\hat{V}_{\\mathrm{PI}}(\\pi) = \\frac{1}{n} \\sum_{i=1}^{n} r_i \\cdot \\Big( \\sum_{j=1}^{\\ell} \\frac{\\pi(s_{ij} \\mid x_i)}{1/(m-1)} - m + 2 \\Big)$ ,\n\nby Prop. 4 in Appendix E.1. The summands are again $O(\\ell m) = O(m^2)$.\n\n3.2 Deviation analysis\n\nSo far, we have shown that PI is unbiased under our assumptions and overcomes the deficiencies of\nIPS in specific examples. We now derive a finite-sample error bound, based on the overlap between \u03c0\nand \u00b5. We use Bernstein\u2019s inequality, for which we define the variance and range terms:\n\n$\\sigma^2 := \\mathbb{E}_{x \\sim D}\\big[ q_{\\pi,x}^T \\Gamma^{\\dagger}_{\\mu,x} q_{\\pi,x} \\big]$ ,   $\\rho := \\sup_{x} \\sup_{s : \\mu(s \\mid x) > 0} \\big| q_{\\pi,x}^T \\Gamma^{\\dagger}_{\\mu,x} \\mathbf{1}_s \\big|$ .   (8)\n\nThe quantity $\\sigma^2$ bounds the variance whereas \u03c1 bounds the range. They capture the \u201caverage\u201d and\n\u201cworst-case\u201d mismatch between \u00b5 and \u03c0. They equal one when \u03c0 = \u00b5 (see Appendix C), and yield\nthe following deviation bound:\nTheorem 1. Under Assumptions 1 and 2, let $\\sigma^2$ and \u03c1 be defined as in Eq. (8). 
Then, for any\n\u03b4 \u2208 (0, 1), with probability at least 1 \u2212 \u03b4,\n\n$\\big| \\hat{V}_{\\mathrm{PI}}(\\pi) - V(\\pi) \\big| \\le \\sqrt{\\frac{2 \\sigma^2 \\ln(2/\\delta)}{n}} + \\frac{2(\\rho + 1) \\ln(2/\\delta)}{3n}$ .\n\nWe observe that this finite sample bound is structurally different from the regret bounds studied in the\nprior works on combinatorial bandits. The bound incorporates the extent of overlap between \u03c0 and\n\u00b5 so that we have higher confidence in our estimates when the logging and evaluation policies are\nsimilar, an important consideration in off-policy evaluation.\nWhile the bound might look complicated, it simplifies if we consider the class of \u03b5-uniform logging\npolicies. Formally, for any policy \u00b5, define $\\mu_{\\varepsilon}(s \\mid x) = (1 - \\varepsilon) \\mu(s \\mid x) + \\varepsilon \\nu(s \\mid x)$, with \u03bd being the\nuniform distribution over the set S(x). For suitably small \u03b5, such logging policies are widely used in\npractice. We have the following corollary for these policies, proved in Appendix E:\nCorollary 1. In the settings of Example 1 or Example 2, if the logging is done with $\\mu_{\\varepsilon}$ for some\n\u03b5 > 0, we have $| \\hat{V}_{\\mathrm{PI}}(\\pi) - V(\\pi) | \\le O\\big( \\sqrt{\\varepsilon^{-1} \\ell m / n} \\big)$.\nAgain, this turns the $\\Omega(m^{\\ell})$ data dependence of IPS into $O(m \\ell)$. The key step in the proof is the\nbound on a certain norm of $\\Gamma^{\\dagger}_{\\nu}$, similar to the bounds of Cesa-Bianchi and Lugosi [7], but our results\nare a bit sharper.\n\n4 Experiments\n\nWe empirically evaluate the performance of the pseudoinverse estimator for ranking problems. We first\nshow that PI outperforms prior works in a comprehensive semi-synthetic study using a public dataset.\nWe then use our estimator for off-policy optimization, i.e., to learn ranking policies, competitively with\nsupervised learning that uses more information. 
Finally, we demonstrate substantial improvements\non proprietary data from search engine logs for two user-satisfaction metrics used in practice:\ntime-to-success and utility rate, which do not satisfy the linearity assumption. More detailed results are\ndeferred to Appendices F and G. All of our code is available online.3\n\n4.1 Semi-synthetic evaluation\n\nOur semi-synthetic evaluation uses labeled data from the Microsoft Learning to Rank Challenge\ndataset [30] (MSLR-WEB30K) to create a contextual bandit instance. Queries form the contexts\nx and actions a are the available documents. The dataset contains over 31K queries, each with\nup to 1251 judged documents, where the query-document pairs are judged on a 5-point scale,\nrel(x, a) \u2208 {0, . . . , 4}. Each pair (x, a) has a feature vector f (x, a), which can be partitioned into\ntitle and body features ($f_{\\mathrm{title}}$ and $f_{\\mathrm{body}}$). We consider two slate rewards: NDCG from Example 3, and\nthe expected reciprocal rank, ERR [9], which does not satisfy linearity, and is defined as\n\n$\\mathrm{ERR}(x, s) := \\sum_{r=1}^{\\ell} \\frac{1}{r} \\prod_{i=1}^{r-1} (1 - R(s_i)) \\, R(s_r)$ , where $R(a) = \\frac{2^{\\mathrm{rel}(x, a)} - 1}{2^{\\mathrm{maxrel}}}$ with maxrel = 4.\n\nTo derive several distinct logging and target policies, we first train two lasso regression models,\ncalled $\\mathrm{lasso}_{\\mathrm{title}}$ and $\\mathrm{lasso}_{\\mathrm{body}}$, and two regression tree models, called $\\mathrm{tree}_{\\mathrm{title}}$ and $\\mathrm{tree}_{\\mathrm{body}}$, to predict\nrelevances from $f_{\\mathrm{title}}$ and $f_{\\mathrm{body}}$, respectively. To create the logs, queries x are sampled uniformly, and\nthe set A(x) consists of the top m documents according to $\\mathrm{tree}_{\\mathrm{title}}$. The logging policy is parametrized\nby a model, either $\\mathrm{tree}_{\\mathrm{title}}$ or $\\mathrm{lasso}_{\\mathrm{title}}$, and a scalar \u03b1 \u2265 0. It samples from a multinomial distribution\nover documents $p_{\\alpha}(a \\mid x) \\propto 2^{-\\alpha \\lfloor \\log_2 \\mathrm{rank}(x, a) \\rfloor}$ where rank(x, a) is the rank of document a for query\nx according to the corresponding model. 
Slates are constructed slot-by-slot, sampling without\nreplacement according to $p_{\\alpha}$. Varying \u03b1 interpolates between uniformly random and deterministic\nlogging. Thus, all logging policies are based on the models derived from $f_{\\mathrm{title}}$. We consider two\ndeterministic target policies based on the two models derived from $f_{\\mathrm{body}}$, i.e., $\\mathrm{tree}_{\\mathrm{body}}$ and $\\mathrm{lasso}_{\\mathrm{body}}$,\nwhich select the top \u2113 documents according to the corresponding model. The four base models are\nfairly distinct: on average fewer than 2.75 documents overlap among top 10 (see Appendix H).\n\n3 https://github.com/adith387/slates_semisynth_expts\n\nFigure 2: Top: RMSE of various estimators under four experimental conditions (see Appendix F for\nall 40 conditions). Middle: CDF of normalized RMSE at 600k samples; each plot aggregates over 10\nlogging-target combinations; closer to top-left is better. Bottom: Same as middle but at 60k samples.\n\nWe compare the weighted estimator wPI with the direct method (DM) and weighted IPS (wIPS).\n(Weighted variants outperformed the unweighted ones.) We implement two variants of DM: regression\ntrees and lasso, each trained on the first n/2 examples and using the remaining n/2 examples for\nevaluation according to Eq. (3). We also include an aspirational baseline, ONPOLICY, which\ncorresponds to deploying the target policy as in an A/B test and returning the average of observed\nrewards. This is the expensive alternative we wish to avoid.\nWe evaluate the estimators by recording the root mean square error (RMSE) as a function of\nthe number of samples, averaged over at least 25 independent runs. We do this for 40 different\nexperimental conditions, considering two reward metrics, two slate-space sizes, and 10 combinations\nof target and logging policies (including the choice of \u03b1). The top row of Fig. 2 shows results for four\nrepresentative conditions (see Appendix F for all results), while the middle and bottom rows aggregate\nacross conditions. 
To produce the aggregates, we shift and rescale the RMSE of all methods, at 600k\n(middle row) or 60k (bottom row) samples, so the best performance is at 0.001 and the worst is at\n1.0 (excluding ONPOLICY). (We use 0.001 instead of 0.0 to allow plotting on a log scale.) The\naggregate plots display the cumulative distribution function of these normalized RMSE values across\n10 target-logging combinations, keeping the metric and the slate-space size \ufb01xed.\nThe pseudoinverse estimator wPI easily dominates wIPS across all experimental conditions, as can\nbe seen in Fig. 2 (top) and in Appendix F. While wIPS and IPS are (asymptotically) unbiased even\nwithout linearity assumption, they both suffer from a large variance caused by the slate size. The\nvariance and hence the mean square error of wIPS and IPS grows exponentially with the slate size, so\nthey perform poorly beyond the smallest slate sizes. DM performs well in some cases, especially\nwith few samples, but often plateaus or degrades eventually as it over\ufb01ts on the logging distribution,\nwhich is different from the target. While wPI does not always outperform DM methods (e.g., Fig. 2,\ntop row, second from right), it is the only method that works robustly across all conditions, as can\nbe seen in the aggregate plots. In general, choosing between DM and wPI is largely a matter of\nbias-variance tradeoff. DM can be particularly good with very small data sizes, because of its low\nvariance, and in those settings it is often the best choice. However, PI performs comprehensively\nbetter given enough data (see Fig. 
2, middle row).\n\n[Figure 2 plot panels. Top row: log10(RMSE) versus number of samples (n) for NDCG, m=100, l=10 (logging: uniform, target: tree); NDCG, m=100, l=10, \u03b1=1.0 (logging: tree, target: tree); ERR, m=100, l=10 (logging: uniform, target: lasso); ERR, m=10, l=5 (logging: uniform, target: tree). Middle and bottom rows: number of conditions versus normalized RMSE at 600k and 60k samples, for NDCG and ERR with m=10, l=5 and with m=100, l=10. Curves: OnPolicy, wIPS, DM: lasso, DM: tree, wPI.]\n\nIn the top row of Fig. 2, we see that, as expected, wPI is biased for the ERR metric since ERR does\nnot satisfy linearity. The right two panels also demonstrate the effect of varying m and \u2113. While wPI\ndeteriorates somewhat for the larger slate space, it still gives a meaningful estimate. In contrast, wIPS\nfails to produce any meaningful estimate in the larger slate space and its RMSE barely improves with\nmore data. Finally, the left two plots in the top row show that wPI is fairly insensitive to the amount\nof stochasticity in logging, whereas DM improves with more overlap between logging and target.\n\n4.2 Semi-synthetic policy optimization\n\nWe now show how to use the pseudoinverse estimator for off-policy optimization. We leverage\npointwise learning to rank (L2R) algorithms, which learn a scoring function for query-document pairs\nby fitting to relevance labels. 
We call this the supervised approach, as it requires relevance labels.\nInstead of requiring relevance labels, we use the pseudoinverse estimator to convert page-level reward\ninto per-slot reward components (the estimates of \u03c6x(j, a)), and these become targets for regression.\nThus, the pseudoinverse estimator enables pointwise L2R to optimize whole-page metrics even\nwithout relevance labels. Given a contextual bandit dataset {(xi, si, ri)}i\u2264n collected by the logging\npolicy \u00b5, we begin by creating the estimates of $\\phi_{x_i}$: $\\hat{\\phi}_i = \\Gamma^{\\dagger}_{\\mu,x_i} \\hat{\\theta}_i$, turning the i-th contextual bandit\nexample into \u2113m regression examples. The trained regression model is used to create a slate, starting\nwith the highest scoring slot-action pair, and continuing greedily (excluding the pairs with the already\nchosen slots or actions). This procedure is detailed in Appendix G. Note that without the linearity\nassumptions, our imputed regression targets might not lead to the best possible learned policy, but we\nstill expect to adapt somewhat to the slate-level metric.\nWe use the MSLR-WEB10K dataset [30] to compare our approach with benchmarked results [33] for\nNDCG@3 (i.e., \u2113 = 3).4 This dataset contains 10k queries, over 1.2M relevance judgments, and up to\n908 judged documents per query. The state-of-the-art listwise L2R method on this dataset is a highly\ntuned variant of LambdaMART [1] (with an ensemble of 1000 trees, each with up to 70 leaves).\nWe use the provided 5-fold split and always train on bandit data collected by uniform logging from\nfour folds, while evaluating with supervised data on the fifth. We compare our approach, titled\nPI-OPT, against the supervised approach (SUP), trained to predict the gains, equal to $2^{\\mathrm{rel}(x, a)} - 1$,\ncomputed using annotated relevance judgements in the training folds (predicting raw relevances was\ninferior). 
Both PI-OPT and SUP train gradient boosted regression trees (with 1000 trees, each with\nup to 70 leaves). We also experimented with the ERR metric.\nThe average test-set performance (computed using ground-truth relevance judgments for each test\nset) across the 5 folds is reported in Table 1. Our method, PI-OPT, is competitive with the supervised\nbaseline SUP for NDCG, and is substantially superior for ERR. A different transformation instead\nof gains might yield a stronger supervised baseline for ERR, but this only illustrates the key benefit\nof PI-OPT: the right pointwise targets are automatically inferred for any whole-page metric. Both\nPI-OPT and SUP are slightly worse than LambdaMART for NDCG@3, but they are arguably not as\nhighly tuned, and PI-OPT only uses the slate-level metric.\n\nTable 1: Comparison of L2R approaches optimizing NDCG@3 and ERR@3. LambdaMART is a\ntuned list-wise approach. SUP and PI-OPT use the same pointwise L2R learner; SUP uses 8 \u00d7 10^5\nrelevance judgments, PI-OPT uses 10^7 samples (under uniform logging) with page-level rewards.\n\nMetric    LambdaMART    uniformly random    SUP      PI-OPT\nNDCG@3    0.457         0.152               0.438    0.421\nERR@3     \u2014             0.096               0.311    0.321\n\n\u2074Our dataset here differs from the dataset MSLR-WEB30K used in Sec. 4.1. There our goal was to study\nrealistic problem dimensions, e.g., constructing length-10 rankings out of 100 candidates. Here, we use\nMSLR-WEB10K, because it is the largest dataset with public benchmark numbers by state-of-the-art approaches\n(specifically LambdaMART).\n\n4.3 Real-world experiments\n\nFinally, we evaluate all methods using logs collected from a popular search engine. The dataset\nconsists of search queries, for which the logging policy randomly (non-uniformly) chooses a slate of\nsize \u2113 = 5 from a small pre-filtered set of documents of size m \u2264 8. 
After preprocessing, there are 77\nunique queries and 22K total examples, meaning that for each query, we have logged impressions for\nmany of the available slates. As before, we create the logs by sampling queries uniformly at random,\nand by using a logging policy that samples uniformly from the slates shown for this query.\nWe consider two page-level metrics: time-to-success (TTS) and UTILITYRATE. TTS measures the\nnumber of seconds between presenting the results and the first satisfied click from the user, defined\nas any click for which the user stays on the linked page sufficiently long. The TTS value is capped and\nscaled to [0, 1]. UTILITYRATE is a more complex page-level metric of user satisfaction. It captures\nthe interaction of a user with the page as a timeline of events (such as clicks) and their durations. The\nevents are classified as revealing a positive or negative utility to the user and their contribution is\nproportional to their duration. UTILITYRATE takes values in [\u22121, 1].\nWe evaluate a target policy based on a logistic regression classifier that is trained to predict clicks and whose\npredicted probabilities are used to score slates. We restrict the target policy to pick among the slates in our\nlogs, so we know the ground-truth slate-level reward. Since we know the query distribution, we can\ncalculate the target policy\u2019s value exactly, and measure RMSE relative to this true value.\nWe compare our estimator (PI) with three baselines similar to those from Sec. 4.1: DM, IPS and\nONPOLICY. DM uses regression trees over roughly 20,000 slate-level features.\nFig. 1 from the introduction shows that PI provides a consistent multiplicative improvement in RMSE\nover IPS, which suffers due to high variance. 
Starting at moderate sample sizes, PI also outperforms\nDM, which suffers due to substantial bias.\n\n5 Discussion\n\nIn this paper we have introduced a new estimator (PI) for off-policy evaluation in combinatorial\ncontextual bandits under a linearity assumption on the slate-level rewards. Our theoretical and\nempirical analysis demonstrates the merits of the approach. The empirical results show a favorable\nbias-variance tradeoff. Even in datasets and metrics where our assumptions are violated, the PI\nestimator typically outperforms all baselines. Its performance, especially at smaller sample sizes,\ncould be further improved by designing doubly-robust variants [12] and possibly also incorporating\nweight clipping [34].\nOne promising approach to relax Assumption 1 is to posit a decomposition over pairs (or tuples) of\nslots to capture higher-order interactions such as diversity. More generally, one could replace slate\nspaces by arbitrary compact convex sets, as done in linear bandits. In these settings, the pseudoinverse\nestimator could still be applied, but a tight sample-complexity analysis remains open for future research.\n\nReferences\n[1] Nima Asadi and Jimmy Lin. Training efficient tree-based models for document ranking. In European Conference on Advances in Information Retrieval, 2013.\n\n[2] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 2002.\n\n[3] Peter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.\n\n[4] L\u00e9on Bottou, Jonas Peters, Joaquin Qui\u00f1onero-Candela, Denis Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 2013.\n\n[5] S\u00e9bastien Bubeck and Nicol\u00f2 Cesa-Bianchi. 
Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.\n\n[6] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning, 2005.\n\n[7] Nicol\u00f2 Cesa-Bianchi and G\u00e1bor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 2012.\n\n[8] Olivier Chapelle and Ya Zhang. A dynamic Bayesian network click model for web search ranking. In International Conference on World Wide Web, 2009.\n\n[9] Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. Expected reciprocal rank for graded relevance. In Conference on Information and Knowledge Management, 2009.\n\n[10] Wei Chu, Lihong Li, Lev Reyzin, and Robert E Schapire. Contextual bandits with linear payoff functions. In Artificial Intelligence and Statistics, 2011.\n\n[11] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, 2008.\n\n[12] Miroslav Dud\u00edk, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In International Conference on Machine Learning, 2011.\n\n[13] Miroslav Dud\u00edk, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 2014.\n\n[14] Georges E. Dupret and Benjamin Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR Conference on Research and Development in Information Retrieval, 2008.\n\n[15] Sarah Filippi, Olivier Capp\u00e9, Aur\u00e9lien Garivier, and Csaba Szepesv\u00e1ri. Parametric bandits: The generalized linear case. 
In Advances in Neural Information Processing Systems, 2010.\n\n[16] Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang, and Christos Faloutsos. Click chain model in web search. In International Conference on World Wide Web, 2009.\n\n[17] Katja Hofmann, Lihong Li, Filip Radlinski, et al. Online evaluation for information retrieval. Foundations and Trends in Information Retrieval, 2016.\n\n[18] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 1952.\n\n[19] Satyen Kale, Lev Reyzin, and Robert E Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, 2010.\n\n[20] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. Controlled experiments on the web: survey and practical guide. Knowledge Discovery and Data Mining, 2009.\n\n[21] Akshay Krishnamurthy, Alekh Agarwal, and Miroslav Dud\u00edk. Efficient contextual semi-bandit learning. In Advances in Neural Information Processing Systems, 2016.\n\n[22] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesv\u00e1ri. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, 2015.\n\n[23] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.\n\n[24] John Langford, Alexander Strehl, and Jennifer Wortman. Exploration scavenging. In International Conference on Machine Learning, 2008.\n\n[25] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, 2010.\n\n[26] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. 
In International Conference on Web Search and Data Mining, 2011.\n\n[27] Lihong Li, Imed Zitouni, and Jin Young Kim. Toward predicting the outcome of an A/B experiment for search relevance. In International Conference on Web Search and Data Mining, 2015.\n\n[28] Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 2008.\n\n[29] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In International Conference on Data Mining, 2014.\n\n[30] Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. arXiv:1306.2597, 2013.\n\n[31] Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 2010.\n\n[32] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, 2015.\n\n[33] Niek Tax, Sander Bockting, and Djoerd Hiemstra. A cross-benchmark comparison of 87 learning to rank methods. Information Processing and Management, 2015.\n\n[34] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dud\u00edk. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, 2017.\n\n[35] Yue Wang, Dawei Yin, Luo Jie, Pengyuan Wang, Makoto Yamada, Yi Chang, and Qiaozhu Mei. Beyond ranking: Optimizing whole-page presentation. 
In International Conference on Web Search and Data Mining, pages 103\u2013112, 2016.\n", "award": [], "sourceid": 2032, "authors": [{"given_name": "Adith", "family_name": "Swaminathan", "institution": "Microsoft Research"}, {"given_name": "Akshay", "family_name": "Krishnamurthy", "institution": null}, {"given_name": "Alekh", "family_name": "Agarwal", "institution": "Microsoft Research"}, {"given_name": "Miro", "family_name": "Dudik", "institution": "Microsoft Research"}, {"given_name": "John", "family_name": "Langford", "institution": "Microsoft Research New York"}, {"given_name": "Damien", "family_name": "Jose", "institution": "Microsoft"}, {"given_name": "Imed", "family_name": "Zitouni", "institution": "Microsoft"}]}