{"title": "A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 2227, "page_last": 2236, "abstract": "Bandit learning is characterized by the tension between long-term exploration and short-term exploitation. However, as has recently been noted, in settings in which the choices of the learning algorithm correspond to important decisions about individual people (such as criminal recidivism prediction, lending, and sequential drug trials), exploration corresponds to explicitly sacrificing the well-being of one individual for the potential future benefit of others. In such settings, one might like to run a ``greedy'' algorithm, which always makes the optimal decision for the individuals at hand --- but doing this can result in a catastrophic failure to learn. In this paper, we consider the linear contextual bandit problem and revisit the performance of the greedy algorithm.\n\nWe give a smoothed analysis, showing that even when contexts may be chosen by an adversary, small perturbations of the adversary's choices suffice for the algorithm to achieve ``no regret'', perhaps (depending on the specifics of the setting) with a constant amount of initial training data. This suggests that in slightly perturbed environments, exploration and exploitation need not be in conflict in the linear setting.", "full_text": "A Smoothed Analysis of the Greedy Algorithm for the\n\nLinear Contextual Bandit Problem\n\nSampath Kannan\n\nUniversity of Pennsylvania\n\nJamie Morgenstern\n\nGeorgia Tech\n\nAaron Roth\n\nUniversity of Pennsylvania\n\nBo Waggoner\n\nMicrosoft Research, NYC\n\nZhiwei Steven Wu\n\nUniversity of Minnesota\n\nAbstract\n\nBandit learning is characterized by the tension between long-term exploration and\nshort-term exploitation. 
However, as has recently been noted, in settings in which\nthe choices of the learning algorithm correspond to important decisions about\nindividual people (such as criminal recidivism prediction, lending, and sequential\ndrug trials), exploration corresponds to explicitly sacri\ufb01cing the well-being of one\nindividual for the potential future bene\ufb01t of others. In such settings, one might like\nto run a \u201cgreedy\u201d algorithm, which always makes the optimal decision for the indi-\nviduals at hand \u2014 but doing this can result in a catastrophic failure to learn. In this\npaper, we consider the linear contextual bandit problem and revisit the performance\nof the greedy algorithm. We give a smoothed analysis, showing that even when\ncontexts may be chosen by an adversary, small perturbations of the adversary\u2019s\nchoices suf\ufb01ce for the algorithm to achieve \u201cno regret\u201d, perhaps (depending on\nthe speci\ufb01cs of the setting) with a constant amount of initial training data. This\nsuggests that in slightly perturbed environments, exploration and exploitation need\nnot be in con\ufb02ict in the linear setting.1\n\n1\n\nIntroduction\n\nLearning algorithms often need to operate in partial feedback settings (also known as bandit settings),\nin which the decisions of the algorithm determine the data that it observes. Many real-world\napplication domains of machine learning have this \ufb02avor. Predictive policing algorithms [Rudin,\n2013] deploy police of\ufb01cers and receive feedback about crimes committed and observed in areas the\nalgorithm chose to deploy of\ufb01cers. Lending algorithms [Byrnes, 2016] observe whether individuals\nwho were granted loans pay them back, but do not get to observe counterfactuals: would an individual\nnot granted a loan have repaid such a loan? 
Algorithms which inform bail and parole decisions\n[Barry-Jester et al., 2015] observe whether individuals who are released go on to recidivate, but do\nnot get to observe whether individuals who remain incarcerated would have committed crimes had\nthey been released. Algorithms assigning drugs to patients in clinical trials do not get to observe the\neffects of the drugs that were not assigned to particular patients.\nLearning in partial feedback settings faces the well-understood tension between exploration and\nexploitation. In order to perform well, the algorithms need at some point to exploit the information\nthey have gathered and make the best decisions they can. But they also need to explore: to make\ndecisions which do not seem optimal according to the algorithm\u2019s current point-predictions, in order\nto gather more information about less explored portions of the decision space.\n\n1The full version of this paper is available at https://arxiv.org/abs/1801.03423.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fHowever, in practice, decision systems often do not explicitly explore, for a number of reasons.\nExploration is important for maximizing long-run performance, but decision makers might be myopic\n\u2014 more interested in their short-term reward. In other situations, the decisions made at each round\naffect the lives of individuals, and explicit exploration might be objectionable on its face: it can be\nconsidered immoral to harm an individual today (explicitly sacri\ufb01cing present utility) for a potential\nbene\ufb01t to future individuals (long-term learning rates) [Bird et al., 2016, Bastani et al., 2017]. For\nexample, in a medical trial, it may be repugnant to knowingly assign a patient a drug that is thought to\nbe sub-optimal (or even dangerous) given the current state of knowledge, simply to increase statistical\ncertainty. 
In a parole scenario, we may not want to release a criminal that we estimate is at high risk for committing violent crime.
On the other hand, a lack of exploration can lead to a catastrophic failure to learn, which is highly undesirable – and which can also lead to unfairness. A lack of exploration (and a corresponding failure to correctly learn about crime statistics) has been blamed as a source of "unfairness" in predictive policing algorithms [Ensign et al., 2017]. In this paper, we seek to quantify how costly we should expect a lack of exploration to be when the instances are not entirely worst-case. In other words: is myopia a friction that we should generically expect to be quickly overcome, or is it really a long-term obstacle to learning? Empirical evaluation shows that greedy algorithms often do well — even outperforming algorithms with explicit exploration [Bietti et al., 2018]. Our work provides a theoretical explanation for this phenomenon.

1.1 Our Results

We study the linear contextual bandits problem, which, informally, represents the following learning scenario taking place over a sequence of rounds t (formal definitions appear in Section 2). At each round t, the learner must make a decision amongst k choices, which are represented by contexts x^t_i ∈ R^d. If the learner chooses action i_t at round t, he observes a reward r^t_{i_t} — but does not observe the rewards corresponding to choices not taken. The rewards are stochastic, and their expectations are governed by unknown linear functions of the contexts: for an unknown set of parameters β_i ∈ R^d, E[r^t_i] = β_i · x^t_i. We consider two variants of the problem: in one (the single parameter setting), all of the rewards are governed by the same linear function: β_1 = ... = β_k = β. In the other (the multiple parameter setting), the parameter vectors β_i for each choice can be distinct. 
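To make the reward model concrete, here is a minimal simulation sketch of the setting just described; the dimension d = 2, the parameter vectors, and the contexts are illustrative choices, not part of the paper's formal model:

```python
import random

def make_rewards(betas, contexts, noise=0.1, rng=random):
    """Draw one stochastic reward per arm: E[r_i] = beta_i . x_i.
    This is the multi-parameter setting; passing the same beta for
    every arm recovers the single-parameter setting."""
    return [sum(b * x for b, x in zip(beta, ctx)) + rng.gauss(0.0, noise)
            for beta, ctx in zip(betas, contexts)]

# Two arms in d = 2; the parameter vectors and contexts are illustrative.
betas = [[1.0, 0.0], [0.0, 1.0]]
contexts = [[0.5, 0.2], [0.1, 0.9]]
rewards = make_rewards(betas, contexts, noise=0.0)
# With zero noise the rewards equal the linear means beta_i . x_i.
```

With `noise=0.0` the two rewards are exactly the linear means 0.5 and 0.9; with positive noise they are Gaussian around those means.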
Normally, these two settings are equivalent to one another (up to a factor of k in the problem dimension) — but as we show, in our case they have distinct properties2. The single-parameter setting can model, for example, the choice of which of some subset of individuals should participate in a particular clinical trial. The multi-parameter setting can model, for example, the risk of criminal recidivism amongst different individuals who come from different backgrounds, when observable features correlate differently to crime risk amongst different groups of individuals.
We study the greedy algorithm, which trains least-squares estimates β̂^t_i on the current set of observations, and at each round picks the arm with the highest predicted reward: i_t = arg max_i β̂^t_i · x^t_i. In the single parameter setting, greedy simply maintains a single estimate β̂^t.
It is well known that the greedy algorithm does not obtain any non-trivial worst-case regret bound. We give a smoothed analysis which shows that the worst case is brittle, however. Specifically, we consider a model in which the contexts x^t_i are chosen at each round by an adaptive adversary, but are then perturbed by independent Gaussian perturbations in each coordinate, with standard deviation σ. We show that under smoothed analysis, there is a qualitative distinction between the single parameter and multiple parameter settings:

1. In the single parameter setting (Section 4), the greedy algorithm with high probability obtains regret bounded by Õ(√(Td)/σ²) over T rounds.

2. In the multiple parameter setting (Section 5), the greedy algorithm requires a "warm start" – that is, to start with a small number of observations for each action – to obtain non-trivial regret bounds, even when facing a perturbed adversary. 
We show that if the warm start provides for each arm a small number of examples (depending polynomially on fixed parameters of the instance, like 1/σ, d, k, and 1/(min_i ||β_i||)), that may themselves be chosen by an adversary and perturbed, then with high probability greedy obtains regret Õ(√(Tk)/σ²). Moreover, this warm start is necessary: we give lower bounds showing that if the greedy algorithm is not initialized with a number of examples n that grows polynomially with both 1/σ and with 1/min_i ||β_i||, then there are simple fixed instances that force the algorithm to have regret growing linearly with T, with constant probability. (See Section 6 for a formal statement of the lower bounds.)
Our results extend beyond this particular perturbed adversary: we give general conditions on the distribution over contexts which imply our regret bounds. All missing proofs can be found in the full version.

2To convert a multi-parameter problem to a single-parameter one, concatenate the parameter vectors β_i ∈ R^d into a single vector β ∈ R^{kd}, and lift each context x^t_i into kd dimensions with zeros in all irrelevant kd − d coordinates.

1.2 Related Work

The most closely related piece of work (from which we take direct inspiration) is Bastani et al. [2017], who, in a stochastic setting, give conditions on the sampling distribution over contexts that cause the greedy algorithm to have diminishing regret in a closely related but incomparable version of the two-armed linear contextual bandits problem3. The conditions on the context distribution given in that work are restrictive, however. They imply, for example, that every linear policy (and in particular the optimal policy) will choose each action with constant probability bounded away from zero. When translated to our perturbed adversarial setting, the distributional conditions of Bastani et al. 
[2017] do not imply regret bounds that are sub-exponential in either the perturbation magnitude σ or the dimension d of the problem. There is also strong empirical evidence that exploration-free algorithms perform well on real datasets [Bietti et al., 2018]. Our work can be viewed as providing an explanation of this phenomenon. Finally, building on our work, Raghavan et al. [2018] use the same diversity condition that we introduce in this paper to show a stronger result in a more restrictive setting. They show that in the single parameter setting, when one further assumes that 1) the linear parameter is drawn from a Bayesian prior that is not too concentrated, 2) the contexts are drawn i.i.d. from a fixed distribution and then perturbed, and 3) the algorithm is allowed to make its decisions in "batches" of polylog(d, t)/σ² many rounds, then the greedy algorithm is essentially instance optimal in terms of Bayesian regret, and moreover its regret grows at a rate of O(T^{1/3}) in the worst case. In contrast, we make substantially weaker assumptions (the parameter vector and contexts can be worst case, we need not be in the single parameter setting, and we do not need batches), but prove a worse regret bound of O(T^{1/2}), without a guarantee of instance optimality.
A large literature focuses on designing no-regret algorithms for contextual bandit problems (e.g. Li et al. [2010], Agarwal et al. [2014], Li et al. [2011]), particularly for linear contextual bandits (e.g. Chu et al. [2011], Abbasi-Yadkori et al. [2011]). Some of these (e.g. Syrgkanis et al. [2016]) use "follow the perturbed leader" style algorithms, which invite a natural comparison to our setting. However, the phenomenon we are exploiting is quite different. 
It is very important in our setting that the perturbations are added by nature, and if the perturbations were instead added by our algorithm (against worst-case contexts), the regret guarantee would cease to hold. To see this, note that against worst-case adversaries, the single parameter and multiple parameter settings are equivalent to one another — but in our smoothed setting, we prove a qualitative separation.
We defer further related work, including work on smoothed analysis and algorithmic fairness, to the full version.

2 Model and Preliminaries

We now introduce the notation and definitions we use for this work. For a vector x, ||x|| represents its Euclidean norm. We consider two variants of the k-arm linear contextual bandits problem. The first setting has a single d-dimensional parameter vector β which governs rewards for all contexts x ∈ R^d; the second has k distinct parameter vectors β_i ∈ R^d governing the rewards for different arms.

3Bastani et al. [2017] assume a single context at each round, shared between two actions. We consider each action as parameterized by its own context, and k can be arbitrary.

In rounds t, contexts x^t_1, ..., x^t_k are presented, where x^t_i ∈ R^d is treated as a row vector unless otherwise noted. The learner chooses an arm i_t ∈ {1, ..., k}, and obtains an s²-subgaussian4 reward r_t whose mean satisfies E[r_t] = β · x^t_{i_t} in the single parameter setting and E[r_t] = β_{i_t} · x^t_{i_t} in the multi-parameter setting. The regret of a sequence of actions and contexts of length T is (again, in the single parameter setting all β_i = β):

Regret = Regret(x^1, i_1, ..., x^T, i_T) = Σ_{t=1}^{T} ( max_i β_i · x^t_i − β_{i_t} · x^t_{i_t} ).

We next formalize the history or transcript of an algorithm on a sequence of contexts. 
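The regret expression above can be evaluated directly for any realized sequence of contexts and pulls; the following is a minimal sketch (the two arms, parameter vectors, and contexts are hypothetical):

```python
def regret(betas, contexts_by_round, pulls):
    """Regret of a realized action sequence: sum over rounds of
    (best expected reward among the round's contexts) minus
    (expected reward of the pulled arm). betas[i] is arm i's parameter
    vector; in the single parameter setting all betas[i] coincide."""
    total = 0.0
    for ctxs, it in zip(contexts_by_round, pulls):
        means = [sum(b * x for b, x in zip(betas[i], ctxs[i]))
                 for i in range(len(betas))]
        total += max(means) - means[it]
    return total

# One round, two arms in d = 2: pulling the suboptimal arm 0 costs 0.4.
betas = [[1.0, 0.0], [1.0, 0.0]]
ctxs = [[[0.5, 0.0], [0.9, 0.0]]]
r = regret(betas, ctxs, pulls=[0])
```

Note the comparison is against the per-round best arm, matching the benchmark in the displayed definition, not against a single fixed arm.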
A history entry is a member of H = (R^d)^k × {1, ..., k} × R. A history is a list of history entries, i.e. a member of H*. Given a history H ∈ H^T, entry t is denoted h_t = (x_1, ..., x_k, i_t, r^t_{i_t}).
Formally, an adaptive adversary A is a (possibly randomized) algorithm that maps a history to k contexts: A : H* → (R^d)^k. We denote the output of the adversary by (μ_1, μ_2, ..., μ_k)5. We assume that ||μ_i|| ≤ 1 always. Next we define the notion of a perturbed adversary, which encompasses both stages of the context-generation process.
Definition 1 (Perturbed Adversary). For any adversary A, the σ-perturbed adversary A_σ is defined by the following process. In round t:

1. Given history H^{t−1} ∈ H^{t−1}, let (μ^t_1, ..., μ^t_k) = A(H^{t−1}).
2. Perturbations e^t_1, ..., e^t_k are drawn independently from N(0, σ²I).
3. Output the list of contexts (x^t_1, ..., x^t_k) = (μ^t_1 + e^t_1, ..., μ^t_k + e^t_k).

We define a perturbed adversary to be R-bounded if with probability 1, ||x^t_i|| ≤ R for all i and t and all histories. We call perturbations (r, δ)-centrally bounded if, for each history and fixed unit vectors w_1, ..., w_k (possibly all equal), we have with probability 1 − δ that max_{i=1,...,k} w_i · e^t_i ≤ r.
We can interpret the output of a perturbed adversary as being a mild perturbation of the (unperturbed) adaptive adversary when the magnitude of the perturbations is smaller than the magnitude of the original context choices μ_i themselves. Said another way, we can think of the perturbations as being mild when they do not substantially increase the norms of the contexts with probability at least 1 − δ. 
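Definition 1 translates directly into code; the sketch below assumes a hypothetical `fixed` adversary that ignores the history, purely for illustration:

```python
import random

def perturbed_adversary(adversary, sigma, history, rng=random):
    """One round of the sigma-perturbed adversary (Definition 1):
    the adversary maps the history to means mu_1..mu_k, then each
    coordinate receives independent N(0, sigma^2) noise."""
    mus = adversary(history)
    return [[m + rng.gauss(0.0, sigma) for m in mu] for mu in mus]

# A hypothetical adaptive adversary that ignores the history and
# always plays the same two unit-norm contexts.
fixed = lambda history: [[1.0, 0.0], [0.0, 1.0]]
contexts = perturbed_adversary(fixed, sigma=0.1, history=[])
```

Setting `sigma=0.0` recovers the unperturbed adversary exactly, which is the worst case where greedy can fail.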
This will be the case throughout the run of the algorithm (via a union bound over T) when σ ≤ Õ(1/√d). We refer to this case as the "low perturbation regime". We view it as the most interesting case because otherwise, the perturbations tend to be large enough to overwhelm the adversarial choices and the problem becomes easier. Here we focus on presenting results for the low perturbation regime, leaving the rest to the full version.

3 Proof Approach and Key Conditions

Our goal will be to show that the greedy algorithm achieves no regret against any perturbed adversary in both the single-parameter and multiple-parameter settings. The key idea is to show that the distributions on contexts generated by perturbed adversaries satisfy certain conditions which suffice to prove a regret bound. The conditions we work with are related to (but substantially weaker than) the conditions shown to be sufficient for a no-regret guarantee by Bastani et al. [2017].
The first key condition, diversity of contexts, considers the positive semidefinite d × d matrix E[xᵀx] for a context x, and asks for a lower bound on its minimum eigenvalue. This implies the distribution over x has non-trivial variance in all directions, which is necessary for the least squares estimator to converge to the underlying parameter β. It implies that observations of β · x convey information about β in all directions.
However, we only observe the rewards for contexts x conditioned on Greedy selecting them: we see a biased (conditional) distribution on x. Thus we need the diversity condition to hold on these conditional distributions.
Condition 1 (Diversity). Let e ∼ D on R^d and let r, λ_0 ≥ 0. 
We call D (r, λ_0)-diverse if for all β̂, all μ with ||μ|| ≤ 1, and all b̂ ≤ r||β̂||, for x = μ + e:

λ_min( E_{e∼D}[ xᵀx | β̂ · e ≥ b̂ ] ) ≥ λ_0.

In the single parameter setting, diversity will imply a regret guarantee: when any arm is pulled, the context-reward pair gives useful information about all components of the (single) parameter β. In the multiple parameter setting, diversity will suffice to guarantee that the learner's estimate of arm i's parameter vector converges to β_i as a function of the number of times arm i is pulled; but alone it does not cause arm i to be pulled sufficiently often (even in rounds where i is the best alternative, when failing to pull it will cause our algorithm to suffer regret).
Thus the multiple parameter setting will require a second key condition, margins. Margins will imply that conditioned on an arm being optimal on a given round, there is a non-trivial probability (over the randomness in the contexts) that Greedy perceives it to be optimal based on current estimates {β̂^t_i}, so long as the current estimates achieve at least some constant baseline accuracy.

4A random variable Y with mean μ is s²-subgaussian if E[e^{t(Y−μ)}] ≤ e^{t²s²/2} for all t.
5The notation is chosen since μ_i will be the mean around which the perturbed context is drawn.
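Condition 1 can be sanity-checked numerically: the sketch below Monte-Carlo estimates the smallest eigenvalue of the conditional second-moment matrix in d = 2 for Gaussian perturbations. The choices of μ, β̂, b̂, and σ are illustrative, and this is an empirical check of the condition, not a proof:

```python
import math, random

def min_eig2(m):
    # Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, c]].
    a, b, c = m[0][0], m[0][1], m[1][1]
    return 0.5 * ((a + c) - math.sqrt((a - c) ** 2 + 4 * b * b))

def conditional_diversity(mu, beta_hat, b_cut, sigma, n=20000, seed=0):
    """Monte Carlo estimate of lambda_min(E[x^T x | beta_hat . e >= b_cut])
    for x = mu + e, e ~ N(0, sigma^2 I) in d = 2."""
    rng = random.Random(seed)
    s = [[0.0, 0.0], [0.0, 0.0]]
    kept = 0
    while kept < n:
        e = [rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)]
        if beta_hat[0] * e[0] + beta_hat[1] * e[1] < b_cut:
            continue  # condition on the selection event
        x = [mu[0] + e[0], mu[1] + e[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += x[i] * x[j]
        kept += 1
    return min_eig2([[v / n for v in row] for row in s])

lam = conditional_diversity(mu=[0.5, 0.5], beta_hat=[1.0, 0.0],
                            b_cut=0.0, sigma=0.5)
```

Even after conditioning on the selection event, the estimated minimum eigenvalue stays bounded away from zero, which is the content of the diversity condition.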
A small initial training set can guarantee that initial estimates achieve constant error, and so the margin condition implies that Greedy will continue to explore arms with a frequency that is proportional to the number of rounds for which they are optimal; then diversity implies that estimates of those arms' parameters will improve quickly (without promising anything about arms that are rarely optimal – and hence inconsequential for regret).
Condition 2 (Conditional Margins). Let e ∼ D and let r, α, γ ≥ 0. We say D has (r, α, γ) margins if for all β ≠ 0 and b ≤ r||β||,

P[ β · e > b + α||β|| | β · e ≥ b ] ≥ γ.

So, on rounds for which arm i has the largest expected reward, with probability at least γ its expected reward is largest by at least some margin (α||β||). If Greedy has sufficiently accurate estimates {β̂^t_i}, this implies that Greedy will pull arm i. We say a perturbed adversary satisfies the diversity and margin conditions if the distributions of e^t_i are independent and satisfy these conditions for all i, t.
We will show that the diversity condition implies no-regret in single-parameter settings, and that the diversity and margin conditions together imply no-regret in multi-parameter settings. We further show that the perturbation distribution N(0, σ²I) satisfies these conditions. We note that our choice of Gaussian perturbations was convenient and natural but not necessary (other perturbation distributions also satisfy our conditions, implying similar results for those perturbations).

Complications: extreme perturbation realizations. When the realizations of the Gaussian perturbations have extremely large magnitude, the diversity and margin conditions will not hold6. 
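For Gaussian perturbations the margin probability in Condition 2 has a closed form, since β · e is a one-dimensional Gaussian with standard deviation σ||β||; a small sketch (the parameter values are illustrative):

```python
import math

def gauss_tail(z):
    # P[N(0,1) > z]
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def margin_gamma(beta_norm, sigma, b, alpha):
    """For e ~ N(0, sigma^2 I), beta . e ~ N(0, (sigma ||beta||)^2), so
    the conditional margin probability of Condition 2 is a ratio of
    Gaussian tails: P[beta.e > b + alpha ||beta|| | beta.e >= b]."""
    s = sigma * beta_norm
    return gauss_tail((b + alpha * beta_norm) / s) / gauss_tail(b / s)

gamma = margin_gamma(beta_norm=1.0, sigma=1.0, b=0.0, alpha=0.5)
# gamma = P[Z > 0.5] / P[Z >= 0], roughly 0.617 for a standard normal Z.
```

For a fixed cutoff b, this ratio is a valid lower bound γ; the role of the restriction b ≤ r||β|| in Condition 2 is that the ratio degrades as the cutoff grows.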
This is potentially problematic, because the probabilistic conditioning in both conditions increases the likelihood that the perturbations will be large. This is the role of the parameter r in both conditions: to provide a reasonable upper bound on the threshold that a perturbation variable should not exceed. In the succeeding sections, we will use conditions we call "good" to formalize the intuition that this is unlikely to happen often, when the perturbations satisfy a centrally-bounded condition.

4 Single Parameter Setting

We define the "Greedy Algorithm" as the algorithm which myopically pulls the "best" arm at each round according to the predictions of the classic least-squares estimator. Let X^t denote the (t−1) × d design matrix at time t, in which each row t′ is the observed context x^{t′}_{i_{t′}} of the arm i_{t′} selected at round t′ < t. The corresponding vector of rewards is denoted y^t = (r^1_{i_1}, ..., r^{t−1}_{i_{t−1}}). The transposes of a matrix Z and a vector z are denoted Zᵀ and zᵀ. At each round t, Greedy first computes the least-squares estimator based on the historical contexts and rewards: β̂^t ∈ arg min_β ||X^t β − y^t||_2^2, and then greedily selects the arm with the highest estimated reward: i_t = arg max_i β̂^t · x^t_i. We defer the formal description of the algorithm to the full version.

"Reasonable" rounds. As discussed in Section 2, the diversity condition will only hold when an arm's perturbations e^t_i are not too large; we formalize these "good" situations here. Fix a round t,
6E.g. 
for margins, consider the one-dimensional case: a lower truncated Gaussian tightly concentrates on its minimal support value.

the current Greedy hypothesis β̂^t, and any choices of the adversary μ^t_1, ..., μ^t_k conditioned on the entire history up to round t. Now each value β̂^t · x^t_i = β̂^t · μ^t_i + β̂^t · e^t_i is a random variable, and Greedy selects the arm corresponding to the largest realized value. In particular, we define the "threshold" for Greedy to pull i as follows.
Definition 2. Fix a round t, Greedy's hypothesis β̂^t, and the adversary's choices μ^t_1, ..., μ^t_k. We define ĉ^t_i := max_{j≠i} β̂^t · x^t_j. We say a realization of ĉ^t_i is r-ĝood if ĉ^t_i ≤ β̂^t · μ^t_i + r||β̂^t||.
The "hat" on ĝood corresponds to those on ĉ^t_i and β̂^t. In the multiple parameter setting we will use analogous conditions without the hats. Notice that ĉ^t_i is a random variable that depends on all the perturbations e^t_{i′} of all arms i′ ≠ i, and Greedy pulls i if and only if β̂^t · x^t_i ≥ ĉ^t_i. The event that ĉ^t_i is r-ĝood is determined by the perturbations e^t_j for j ≠ i. Intuitively, if ĉ^t_i is r-ĝood, then e^t_i need not be too large for arm i to be selected.

4.1 Regret framework for perturbed adversaries

We first observe an upper bound on Greedy's regret as a function of the distance between β̂^t and the true model β. This allows us to focus on the diversity condition, which will guarantee that this distance shrinks. Let i*(t) = arg max_i β · x^t_i, the optimal arm at time t.
Lemma 4.1. Suppose for all i, t that ||x^t_i|| ≤ R. 
In the single-parameter setting, for any t_min ∈ [T], we have:

Regret(x^1, i_1, ..., x^T, i_T) ≤ 2R t_min + 2R Σ_{t=t_min}^{T} ||β − β̂^t||.

To apply Lemma 4.1, we need to show that the estimates β̂^t → β quickly. The key idea is that if the input contexts are "diverse" enough (captured formally by Definition 1), we will be able to infer β. Lemma 4.2 shows β̂^t approaches β at a rate governed by the minimum eigenvalue of the design matrix.
Lemma 4.2. Fix a round t and let Z^t = (X^t)ᵀ X^t. Suppose all contexts satisfy ||x^t_i|| ≤ R and recall that rewards are s²-subgaussian. Then with probability 1 − δ over the randomness in rewards, we have

||β − β̂^t|| ≤ √(2tdRs² ln(td/δ)) / λ_min(Z^t).

Observe that the matrix Z^t = Σ_{t′≤t} (x^{t′}_{i_{t′}})ᵀ x^{t′}_{i_{t′}}. The next step is to show that λ_min(Z^t) grows at a rate of Θ(t) with high probability, which will imply via Lemma 4.2 that ||β − β̂^t|| ≤ O(1/√t), fixing all other parameters. This is proven in the following key result, Lemma 4.3. The proof uses a concentration result for the minimum eigenvalue to show that λ_min(Z^t) grows at a rate Θ(t) with high probability. This relies crucially on the (r, λ_0) diversity condition, which intuitively lower-bounds the expected increase in λ_min(Z^t) at each round. The details are more complicated, as this increase only holds when Greedy's choice of i has an r-ĝood ĉ^t_i; we show this happens with constant probability for an (r, 1/2)-centrally bounded adversary.
Lemma 4.3. 
For Greedy in the single parameter setting with an R-bounded, (r, 1/2)-centrally bounded, (r, λ_0)-diverse adversary, we have with probability 1 − δ that for all t ≥ max{0, (20R²/λ_0) ln(20R²/(λ_0 d δ))}, λ_min(Z^t) ≥ tλ_0/4.

Combining these results gives a bound on the regret of Greedy against general perturbed adversaries.

The Gaussian, σ-perturbed adversary. We need to show that our σ-perturbed adversary satisfies the diversity condition (and another technical condition that we defer to the supplementary materials). For the diversity condition, we show that the diversity parameter λ can be lower bounded by the variance of a single-dimensional truncated Gaussian, then analyze this variance using tight Gaussian tail bounds. Our proof makes use of a careful choice of truncations of A′_σ using a different orthonormal change of basis each round, which maintains the perturbation's Gaussian distribution but allows the form of the conditioning to be much simplified. Finally, we arrive at the main result for this section:

Theorem 4.1. In the single parameter setting against the σ-perturbed adversary A_σ, fix any choice of parameters such that σ ≤ 1/(2√(2d ln(Tkd/δ))) (the low perturbation regime) and d ≤ e^{O(s²T)}. With probability at least 1 − δ, Greedy has

Regret ≤ O( √(Tds² ln(Td/δ)) ln(k) / σ² )

where d is the dimension of contexts, k is the number of arms, rewards are s²-subgaussian, and in all cases O(·) hides an absolute constant.

5 Multiple Parameter Setting

In the multi-parameter setting, we cannot hope for the greedy algorithm to achieve vanishing regret without any initial information, as it never learns about the parameters of arms it does not pull (formalized in a lower bound in Section 6). 
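In code, the multi-parameter version of Greedy fits a separate estimate for each arm from that arm's own observations, which is why it needs some initial data for every arm. A minimal sketch with d = 1, so the per-arm least-squares fit is a one-line ratio; the warm-start observations below are hypothetical:

```python
def fit_1d(data):
    # 1-D least squares over (context, reward) pairs:
    # beta_hat = sum(x * r) / sum(x * x).
    return sum(x * r for x, r in data) / sum(x * x for x, _ in data)

def greedy_multi(history, contexts):
    """Greedy in the multi-parameter setting (d = 1 for brevity): fit a
    separate beta_hat_i from arm i's own observations, then pull the arm
    with the highest predicted reward. history[i] must be non-empty,
    i.e. the algorithm is given an n-sample warm start for each arm."""
    beta_hats = [fit_1d(history[i]) for i in range(len(contexts))]
    return max(range(len(contexts)), key=lambda i: beta_hats[i] * contexts[i])

# Hypothetical warm start: arm 0 looks like beta_0 near 1, arm 1 near 2.
history = [[(1.0, 1.0), (2.0, 2.1)], [(1.0, 2.0), (0.5, 1.0)]]
arm = greedy_multi(history, contexts=[1.0, 1.0])
# beta_hat_0 = 1.04 and beta_hat_1 = 2.0, so Greedy pulls arm 1.
```

If `history[i]` were empty for some arm, no estimate exists for it and Greedy can never discover it, which is the failure mode the warm start rules out.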
If, however, Greedy receives a small amount of initial information in\nthe form of a constant number of n samples (xi, ri) for each arm i, perturbations will imply vanishing\nregret. We refer to this as an n-sample \u201cwarm start\u201d to Greedy. (See the full version for a formal\ndescription of the algorithm.)\nFor this setting, we show that the diversity and margin conditions together on a generic bounded\nadversary imply low regret. We then leverage this to give regret bounds for the Gaussian adversary\nA\u03c3. As discussed in Section 3, the key idea is as follows. Analogous to the single parameter setting,\nthe diversity condition implies that additional datapoints we collect for an arm improve the accuracy\nof its estimate \u02c6\u03b2t\ni . Meanwhile, the margin condition implies that for suf\ufb01ciently accurate estimates,\nwhen an arm is optimal (\u03b2ixt\ni is largest), the perturbations have a good chance of causing Greedy to\npull that arm ( \u02c6\u03b2t\ni is largest). Thus, the initial data sample kickstarts Greedy with reasonably accurate\nestimates, causing it to regularly pull optimal arms and accrue more data points, thus becoming more\naccurate.\n\ni xt\n\nNotation and preliminaries. Recall that it is the arm pulled by Greedy at round t, i.e. it =\ni \u00b7 xt\ni. Similarly let i\u2217(t) be the optimal arm at round t, i.e. i\u2217(t) = arg maxi \u03b2i \u00b7 xt\n\u02c6\u03b2t\ni. Let\narg maxi\nni(t) be the number of times arm i is pulled prior to round t, including the warm start (so ni(1) will\ni = {t : i\u2217(t) = i}, the rounds where i is pulled and is\nbe nonzero). Let Si = {t : it = i} and let S\u2217\ni is a threshold that i must exceed to be pulled by Greedy, and the r-(cid:100)good condition\noptimal respectively.\n\nRecall that \u02c6ct\ncaptures cases where this can happen without the perturbation et\ncondition formally for the multiple parameter case. 
We also need a similar threshold cᵗᵢ that i must exceed to be the optimal arm, and an analogous r-good condition.

Definition 3. Fix a round t, the current Greedy hypotheses β̂ᵗ₁, …, β̂ᵗₖ, and the choices of an adversary μᵗ₁, …, μᵗₖ. Define ĉᵗᵢ := max_{j≠i} β̂ᵗⱼ · xᵗⱼ, a random variable depending on {eᵗⱼ : j ≠ i}. Say an outcome of ĉᵗᵢ is r-ĝood if ĉᵗᵢ ≤ β̂ᵗᵢ · μᵗᵢ + r‖β̂ᵗᵢ‖. Similarly, define cᵗᵢ := max_{j≠i} βⱼ · xᵗⱼ and say an outcome of cᵗᵢ is r-good if cᵗᵢ ≤ βᵢ · μᵗᵢ + r‖βᵢ‖.

5.1 Regret framework for perturbed adversaries

Similarly to Lemma 4.1, here the regret of Greedy shrinks as each β̂ᵗᵢ → βᵢ. The proof is essentially identical, but in this case we prove this for each arm i ∈ [k].

Lemma 5.1. In the multiple parameter setting, the regret of Greedy is bounded by Σᵢ₌₁ᵏ Regretᵢ(T) with

    Regretᵢ(T) = R ( Σ_{t ∈ Sᵢ} ‖βᵢ − β̂ᵗᵢ‖ ) + R ( Σ_{t ∈ S*ᵢ} ‖βᵢ − β̂ᵗᵢ‖ ).

As in the single parameter setting, the diversity condition implies that with enough observations nᵢ(t) for arm i, we have ‖βᵢ − β̂ᵗᵢ‖ ≤ O(1/√nᵢ(t)). We omit the details as they are analogous to the single parameter case, and move on to the margin condition.

We wish to capture the benefits of the margin condition, i.e. that arms which are often optimal are also actually pulled often by Greedy.
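The thresholds of Definition 3 are cheap to compute in simulation. The following sketch (made-up parameters and estimates, not taken from the paper) evaluates ĉᵗᵢ and cᵗᵢ for each arm and checks both goodness conditions; one consequence is immediate from the definitions, namely that the arm Greedy pulls always clears its empirical threshold, since iₜ = argmaxᵢ β̂ᵗᵢ · xᵗᵢ.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, sigma, r = 3, 4, 0.3, 0.5
betas = rng.normal(size=(k, d))                    # true per-arm parameters
beta_hat = betas + 0.1 * rng.normal(size=(k, d))   # current Greedy hypotheses
mu = rng.uniform(-1, 1, size=(k, d))               # adversary's choices mu_i^t
x = mu + sigma * rng.normal(size=(k, d))           # perturbed contexts x_i^t

scores_hat = np.einsum("ij,ij->i", beta_hat, x)    # \hat beta_j^t . x_j^t
scores = np.einsum("ij,ij->i", betas, x)           # beta_j . x_j^t

def thresholds(i):
    c_hat = max(scores_hat[j] for j in range(k) if j != i)  # \hat c_i^t
    c = max(scores[j] for j in range(k) if j != i)          # c_i^t
    return c_hat, c

for i in range(k):
    c_hat, c = thresholds(i)
    good_hat = c_hat <= beta_hat[i] @ mu[i] + r * np.linalg.norm(beta_hat[i])
    good = c <= betas[i] @ mu[i] + r * np.linalg.norm(betas[i])
    print(i, bool(good_hat), bool(good))
```

The r-good conditions rule out rounds where a competing arm's score is inflated by an unusually large perturbation, which is what makes the margin argument below go through.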
The \ufb01rst step is to leverage the margin condition to argue that when\ni is r-good), it is optimal by a signi\ufb01cant margin (\u03b1(cid:107)\u03b2i(cid:107)) with a signi\ufb01cant\narm i is optimal (and ct\nprobability (\u03b3). Combining this with accurate initial estimates implies that it will actually be pulled\nby Greedy.\nLemma 5.2. Suppose the perturbed adversary is R-bounded and has (r, \u03b1, \u03b3) margins for some\nr \u2264 R. Consider any round t where for all j we have (cid:107)\u03b2j \u2212 \u02c6\u03b2t\n\nj(cid:107) \u2264 \u03b1 minj(cid:48) (cid:107)\u03b2j(cid:48)(cid:107)\n\n. Then\n\n2R\n\nP(cid:2)it = i(cid:12)(cid:12) i\u2217(t) = i, ct\n\ni is r-good(cid:3) \u2265 \u03b3.\n\nRecall that Si, S\u2217\ni are respectively the set of rounds in which it = i (Greedy pulls arm i) and i\u2217(t) = i\n(arm i is optimal), respectively. The following key result leverages the margin condition to argue that,\nif i is optimal for a signi\ufb01cant number of rounds, then it is pulled by Greedy. This is vital to a good\nregret bound because it shows that ni(t), the number of samples from arm i, is steadily increasing in\nt if i is often optimal, which we know from the diversity condition implies that the estimate \u02c6\u03b2t\ni is\nconverging.\nLemma 5.3. Consider an R-bounded perturbed adversary with (r, \u03b1, \u03b3) margins and assume\nfor all i and t. With probability at least 1 \u2212 \u03b4, we have for all natural\n(cid:107)\u03b2i \u2212 \u02c6\u03b2t\nnumbers N, |{t \u2208 S\u2217\n\u03b4 times\nbefore being pulled by Greedy.\n\n\u03b4 . 
That is, arm i can be optimal at most 5\n\ni : ni(t) = N}| \u2264 5\n\ni(cid:107) \u2264 \u03b1 minj (cid:107)\u03b2j(cid:107)\n\n\u03b3 ln 2\n\n\u03b3 ln 2\n\n2R\n\nThe Gaussian, \u03c3-perturbed adversary At a high level, all that remains to complete our analysis\nis to prove that our perturbed adversary A\u03c3 produces distributions that satisfy our margin condition.\nThere are some complications that make the details of this argument slightly circuitous, that we defer\nto the full version in the supplement (we \ufb01rst prove this result for an adversary that uses a truncated\nGaussian distribution, and hence always produces bounded contexts, and then use this to argue that\nour actual adversary also has the properties that we need). Once this is proven, we obtain the main\nresult of the multiple parameter setting. In particular, in the small-perturbation regime, a constant-size\n\u221a\nwarm start (i.e. independent of T , as long as \u03c3 is small) suf\ufb01ces to initialize Greedy such that, with\nhigh probability, it can obtain \u02dcO(\nTheorem 5.1. 
In the multiple parameter setting, against the σ-perturbed adversary A_σ, with a warm start of size

    n ≥ Ω( d s² ln( dks² / (δσ minⱼ ‖βⱼ‖²) ) / (σ¹² minⱼ ‖βⱼ‖²) ),

for any setting of parameters such that σ ≤ 1/(3√(d ln(2Tkd/δ))), Greedy satisfies with probability 1 − δ

    Regret ≤ O( √(T k d s²) · (ln(Tkd/δ))^{3/2} / σ² ),

where d is the dimension of the contexts, k is the number of arms, rewards are s²-subgaussian, and in all cases O(·), Ω(·) hide absolute constants.

6 Lower Bounds for the Multi-Parameter Setting

Finally, in this section, we show that our results for the multi-parameter setting are qualitatively tight. Namely, Greedy can be forced to suffer linear regret in the multi-parameter setting unless it is given a "warm start" that scales polynomially with 1/σ, where σ is the perturbation parameter, and with 1/minᵢ ‖βᵢ‖, the inverse norm of the smallest parameter vector. This shows that the polynomial dependencies on these parameters in our upper bound cannot be removed, and in particular proves a qualitative separation between the multi-parameter setting and the single parameter setting (in which a warm start is not required). Both of our lower bounds are in the fully stochastic setting: they are based on instances in which contexts are drawn from a fixed distribution, and do not require that we make use of an adaptive adversary. First, we focus on the perturbation parameter σ.

Theorem 6.1. Suppose Greedy is given a warm start of size n ≤ (1/(100σ²)) ln(ρ/100) in the σ-perturbed setting. Then, there exists an instance for which Greedy incurs regret Ω(ρ/√n) with constant probability in its first ρ rounds.

Remark 1.
Theorem 6.1 implies that for T < exp(1/σ), either

• n = Ω(poly(1/σ)), or

• Greedy suffers linear regret.

The lower bound instance is simple: it is one-dimensional, with two arms and model parameters β₁ = β₂ = 1. In each round (including the warm start) the unperturbed contexts are μ₁ = 1 and μ₂ = 1 − 1/√n, and so the perturbed contexts x₁ᵗ and x₂ᵗ are drawn independently from the Gaussian distributions N(1, σ²) and N(1 − 1/√n, σ²), for σ = √(1/(100n ln(ρ/100))). We show the estimators after the warm start have additive error Ω(√(1/n)) with a constant probability, and when this is true, with constant probability, arm 1 will only be pulled Õ(n^{2/3}) rounds. So, with constant probability, Greedy will pull arm 2 nearly every round, even though arm 1 will be better in a constant fraction of rounds.

We now turn our attention to showing that the warm start must also grow with 1/minᵢ ‖βᵢ‖. Informally, the instance we use to show this lower bound has unperturbed contexts μᵗᵢ = 1 for both arms and all rounds, and β₁ = 8ε, β₂ = 10ε. We show again that a warm start of size n yields, with constant probability, estimators with error cᵢ/√n, causing Greedy to choose arm 1 rather than arm 2 for a large number of rounds. When arm 2 is not pulled too many times, with constant probability its estimate remains small and it continues to be passed over in favor of arm 1.

Theorem 6.2. Let ε = minᵢ |βᵢ|, σ < 1/√(ln(T/δ)), and T/δ < 2n^{1/3}. Suppose Greedy is given a warm start of size n ≤ 1/(2ε).
Then, some instances cause Greedy to incur regret

    R(T) = Ω( ε ( e^{1/(18σ²)} − n^{2/3} ) ).

Remark 2. Observe again that this implies that Greedy can be forced to incur linear regret for exponentially many rounds if its warm start size does not grow with 1/minᵢ ‖βᵢ‖.
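The one-dimensional instance behind Theorem 6.1 is easy to simulate. The sketch below uses illustrative values (n = 100, σ = 0.01, unit-variance reward noise; these are not the constants from the theorem) and repeats the experiment many times to exhibit the constant-probability failure event: the warm start leaves β̂₁ underestimated by roughly 1/√n, after which Greedy pulls arm 2 almost every round even though arm 1 is optimal in nearly every round.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, runs, sigma, s = 100, 500, 200, 0.01, 1.0
beta = np.array([1.0, 1.0])                       # beta_1 = beta_2 = 1
mu = np.array([1.0, 1.0 - 1.0 / np.sqrt(n)])      # mu_1 = 1, mu_2 = 1 - 1/sqrt(n)

starved = 0
for _ in range(runs):
    # 1-D least squares via sufficient statistics: beta_hat_i = sum(x r) / sum(x^2)
    sx2, sxr = np.zeros(2), np.zeros(2)
    for i in range(2):                            # n-sample warm start per arm
        x = mu[i] + sigma * rng.normal(size=n)
        r = beta[i] * x + s * rng.normal(size=n)
        sx2[i] += np.sum(x * x)
        sxr[i] += np.sum(x * r)
    pulls = np.zeros(2, dtype=int)
    arm1_opt = 0
    for t in range(T):
        x = mu + sigma * rng.normal(size=2)       # perturbed contexts
        i = int(np.argmax((sxr / sx2) * x))       # Greedy choice
        r = beta[i] * x[i] + s * rng.normal()
        sx2[i] += x[i] ** 2
        sxr[i] += x[i] * r
        pulls[i] += 1
        arm1_opt += int(beta[0] * x[0] > beta[1] * x[1])
    if pulls[0] < 0.1 * T and arm1_opt > 0.5 * T:
        starved += 1

print(f"{starved}/{runs} runs starve arm 1 although it is usually optimal")
```

A noticeable fraction of runs exhibit starvation: arm 1's warm-start estimate lands low enough that the tiny σ-perturbations can never flip the comparison, so arm 1 is never pulled and its estimate never corrects, matching the feedback-loop failure the lower bound formalizes.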