{"title": "Conservative Contextual Linear Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 3910, "page_last": 3919, "abstract": "Safety is a desirable property that can immensely increase the applicability of learning algorithms in real-world decision-making problems. It is much easier for a company to deploy an algorithm that is safe, i.e., guaranteed to perform at least as well as a baseline. In this paper, we study the issue of safety in contextual linear bandits that have application in many different fields including personalized ad recommendation in online marketing. We formulate a notion of safety for this class of algorithms. We develop a safe contextual linear bandit algorithm, called conservative linear UCB (CLUCB), that simultaneously minimizes its regret and satisfies the safety constraint, i.e., maintains its performance above a fixed percentage of the performance of a baseline strategy, uniformly over time. We prove an upper-bound on the regret of CLUCB and show that it can be decomposed into two terms: 1) an upper-bound for the regret of the standard linear UCB algorithm that grows with the time horizon and 2) a constant term that accounts for the loss of being conservative in order to satisfy the safety constraint. We empirically show that our algorithm is safe and validate our theoretical analysis.", "full_text": "Conservative Contextual Linear Bandits\n\nAbbas Kazerouni\nStanford University\n\nMohammad Ghavamzadeh\n\nDeepMind\n\nabbask@stanford.edu\n\nghavamza@google.com\n\nYasin Abbasi-Yadkori\n\nAdobe Research\n\nabbasiya@adobe.com\n\nBenjamin Van Roy\nStanford University\nbvr@stanford.edu\n\nAbstract\n\nSafety is a desirable property that can immensely increase the applicability of\nlearning algorithms in real-world decision-making problems. It is much easier\nfor a company to deploy an algorithm that is safe, i.e., guaranteed to perform at\nleast as well as a baseline. 
In this paper, we study the issue of safety in contextual\nlinear bandits that have application in many different \ufb01elds including personalized\nrecommendation. We formulate a notion of safety for this class of algorithms. We\ndevelop a safe contextual linear bandit algorithm, called conservative linear UCB\n(CLUCB), that simultaneously minimizes its regret and satis\ufb01es the safety con-\nstraint, i.e., maintains its performance above a \ufb01xed percentage of the performance\nof a baseline strategy, uniformly over time. We prove an upper-bound on the regret\nof CLUCB and show that it can be decomposed into two terms: 1) an upper-bound\nfor the regret of the standard linear UCB algorithm that grows with the time horizon\nand 2) a constant term that accounts for the loss of being conservative in order to\nsatisfy the safety constraint. We empirically show that our algorithm is safe and\nvalidate our theoretical analysis.\n\nIntroduction\n\n1\nMany problems in science and engineering can be formulated as decision-making problems under\nuncertainty. Although many learning algorithms have been developed to \ufb01nd a good policy/strategy\nfor these problems, most of them do not provide any guarantee for the performance of their resulting\npolicy during the initial exploratory phase. This is a major obstacle in using learning algorithms in\nmany different \ufb01elds, such as online marketing, health sciences, \ufb01nance, and robotics. Therefore,\ndeveloping learning algorithms with safety guarantees can immensely increase the applicability of\nlearning in solving decision problems. A policy generated by a learning algorithm is considered to be\nsafe, if it is guaranteed to perform at least as well as a baseline. The baseline can be either a baseline\nvalue or the performance of a baseline strategy. 
It is important to note that since the policy is learned\nfrom data, it is a random variable, and thus, the safety guarantees are in high probability.\nSafety can be studied in both of\ufb02ine and online scenarios. In the of\ufb02ine case, the algorithm learns\nthe policy from a batch of data, usually generated by the current strategy or recent strategies of the\ncompany, and the question is whether the learned policy will perform as well as the current strategy or\nno worse than a baseline value, when it is deployed. This scenario has been recently studied heavily\nin both model-based (e.g., Petrik et al. [2016]) and model-free (e.g., Bottou et al. 2013; Thomas et\nal. 2015a,b; Swaminathan and Joachims 2015a,b) settings. In the model-based approach, we \ufb01rst\nuse the batch of data and build a simulator that mimics the behavior of the dynamical system under\nstudy (hospital\u2019s ER, \ufb01nancial market, robot), and then use this simulator to generate data and learn\nthe policy. The main challenge here is to have guarantees on the performance of the learned policy,\ngiven the error in the simulator. This line of research is closely related to the area of robust learning\nand control. In the model-free approach, we learn the policy directly from the batch of data, without\nbuilding a simulator. This line of research is related to off-policy evaluation and control. While the\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fmodel-free approach is more suitable for problems in which we have access to a large batch of data,\nsuch as in online marketing, the model-based approach works better in problems in which data is\nharder to collect, but instead, we have good knowledge about the underlying dynamical system that\nallows us to build an accurate simulator.\nIn the online scenario, the algorithm learns a policy while interacting with the real system. 
Although\n(reasonable) online algorithms will eventually learn a good or an optimal policy, there is no guarantee\nfor their performance along the way (the performance of their intermediate policies), especially at\nthe very beginning, when they perform a large amount of exploration. Thus, in order to guarantee\nsafety in online algorithms, it is important to control their exploration and make it more conservative.\nConsider a manager that allows our learning algorithm runs together with her company\u2019s current\nstrategy (baseline policy), as long as it is safe, i.e., the loss incurred by letting a portion of the traf\ufb01c\nhandled by our algorithm (instead of by the baseline policy) does not exceed a certain threshold.\nAlthough we are con\ufb01dent that our algorithm will eventually perform at least as well as the baseline\nstrategy, it should be able to remain alive (not terminated by the manager) long enough for this to\nhappen. Therefore, we should make it more conservative (less exploratory) in a way not to violate the\nmanager\u2019s safety constraint. This setting has been studied in the multi-armed bandit (MAB) [Wu et\nal., 2016]. Wu et al. [2016] considered the baseline policy as a \ufb01xed arm in MAB, formulated safety\nusing a constraint de\ufb01ned based on the performance of the baseline policy (mean of the baseline arm),\nand modi\ufb01ed the UCB algorithm [Auer et al., 2002] to satisfy this constraint.\nIn this paper, we study the notion of safety in contextual linear bandits, a setting that has application\nin many different \ufb01elds including personalized recommendation. We \ufb01rst formulate safety in this\nsetting, as a constraint that must hold uniformly in time, in Section 2. 
Our goal is to design learning algorithms that minimize regret under the constraint that at any given time, their expected sum of rewards should be above a fixed percentage of the expected sum of rewards of the baseline policy. This fixed percentage depends on the amount of risk that the manager is willing to take. In Section 3, we propose an algorithm, called conservative linear UCB (CLUCB), that satisfies the safety constraint. At each round, CLUCB plays the action suggested by the standard linear UCB (LUCB) algorithm (e.g., Dani et al. 2008; Rusmevichientong and Tsitsiklis 2010; Abbasi-Yadkori et al. 2011; Chu et al. 2011; Russo and Van Roy 2014), only if it satisfies the safety constraint for the worst choice of the parameter in the confidence set, and plays the action suggested by the baseline policy, otherwise. We prove an upper-bound for the regret of CLUCB, which can be decomposed into two terms. The first term is an upper-bound on the regret of LUCB that grows at the rate √T log(T). The second term is constant (does not grow with the horizon T) and accounts for the loss of being conservative in order to satisfy the safety constraint. This improves over the regret bound derived in Wu et al. [2016] for the MAB setting, where the regret of being conservative grows with time. In Section 4, we show how CLUCB can be extended to the case that the reward of the baseline policy is unknown, without a change in its rate of regret.
Finally, in Section 5, we report experimental results that show CLUCB behaves as expected in practice and validate our theoretical analysis.

2 Problem Formulation
In this section, we first review the standard linear bandit setting and then introduce the conservative linear bandit formulation considered in this paper.

2.1 Linear Bandit
In the linear bandit setting, at any time t, the agent is given a set of (possibly) infinitely many actions/options A_t, where each action a ∈ A_t is associated with a feature vector φ_a^t ∈ R^d. At each round t, the agent selects an action a_t ∈ A_t and observes a random reward y_t generated as

    y_t = ⟨θ*, φ_{a_t}^t⟩ + η_t,   (1)

where θ* ∈ R^d is the unknown reward parameter, ⟨θ*, φ_{a_t}^t⟩ = r_{a_t}^t = E[y_t] is the expected reward of action a_t at time t, and η_t is a random noise such that

Assumption 1 Each element η_t of the noise sequence {η_t}_{t=1}^∞ is conditionally σ-sub-Gaussian, i.e., E[e^{ζ η_t} | a_{1:t}, η_{1:t−1}] ≤ exp(ζ²σ²/2), ∀ζ ∈ R.

The sub-Gaussian assumption implies that E[η_t | a_{1:t}, η_{1:t−1}] = 0 and Var[η_t | a_{1:t}, η_{1:t−1}] ≤ σ².

Note that the above formulation contains time-varying action sets and time-dependent feature vectors for each action, and thus, includes the linear contextual bandit setting.
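To make the setting concrete, here is a minimal simulation sketch of the reward model (1). The dimension d = 4, the noise level σ = 0.1, the finite action set of 10 feature vectors, and the Gaussian noise (which is σ-sub-Gaussian, satisfying Assumption 1) are all illustrative choices, not values from the paper; the sketch also ignores the [0, 1] normalization of mean rewards required later by Assumption 2.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4                                        # feature dimension (illustrative)
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)     # enforce ||theta*||_2 <= B with B = 1

def observe_reward(phi, sigma=0.1):
    """Observe y_t = <theta*, phi> + eta_t with Gaussian noise."""
    return theta_star @ phi + sigma * rng.normal()

# One round: the action set A_t is a finite batch of feature vectors phi_a^t.
action_set = rng.normal(size=(10, d))
expected_rewards = action_set @ theta_star
best_action = int(np.argmax(expected_rewards))   # a_t^* = argmax_a <theta*, phi_a^t>
```

The agent observes only the noisy `observe_reward` values, never `expected_rewards`; the regret in (2) is defined with respect to the latter.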
In linear contextual bandit, if we denote by x_t the state of the system at time t, the time-dependent feature vector φ_a^t for action a will be equal to φ(x_t, a), the feature vector of the state-action pair (x_t, a).
We also make the following standard assumption on the unknown parameter θ* and the feature vectors:

Assumption 2 There exist constants B, D ≥ 0 such that ‖θ*‖₂ ≤ B, ‖φ_a^t‖₂ ≤ D, and ⟨θ*, φ_a^t⟩ ∈ [0, 1], for all t and all a ∈ A_t.

We define B = {θ ∈ R^d : ‖θ‖₂ ≤ B} and F = {φ ∈ R^d : ‖φ‖₂ ≤ D, ⟨θ*, φ⟩ ∈ [0, 1]} to be the parameter space and feature space, respectively.
Obviously, if the agent knows θ*, she will choose the optimal action a*_t = arg max_{a∈A_t} ⟨θ*, φ_a^t⟩ at each round t. Since θ* is unknown, the agent's goal is to maximize her cumulative expected rewards after T rounds, i.e., Σ_{t=1}^T ⟨θ*, φ_{a_t}^t⟩, or equivalently, to minimize her (pseudo)-regret, i.e.,

    R_T = Σ_{t=1}^T ⟨θ*, φ_{a*_t}^t⟩ − Σ_{t=1}^T ⟨θ*, φ_{a_t}^t⟩,   (2)

which is the difference between the cumulative expected rewards of the optimal and agent's strategies.

2.2 Conservative Linear Bandit
The conservative linear bandit setting is exactly the same as the linear bandit, except that there exists a baseline policy π_b (e.g., the company's current strategy) that at each round t, selects action b_t ∈ A_t and incurs the expected reward r_{b_t}^t = ⟨θ*, φ_{b_t}^t⟩.
We assume that the expected rewards of the actions taken by the baseline policy, r_{b_t}^t, are known (see Remark 1). We relax this assumption in Section 4 and extend our proposed algorithm to the case that the reward function of the baseline policy is not known in advance. Another difference between the conservative and standard linear bandit settings is the performance constraint, which is defined as follows:

Definition 1 (Performance Constraint) At each round t, the difference between the performances of the baseline and the agent's policies should remain below a pre-defined fraction α ∈ (0, 1) of the baseline performance. This constraint may be written formally as

    Σ_{i=1}^t r_{b_i}^i − Σ_{i=1}^t r_{a_i}^i ≤ α Σ_{i=1}^t r_{b_i}^i,   ∀t ∈ {1, . . . , T},   or equivalently as   Σ_{i=1}^t r_{a_i}^i ≥ (1 − α) Σ_{i=1}^t r_{b_i}^i.   (3)

The parameter α controls the level of conservatism of the agent. Small values show that only small losses are tolerated and the agent should be overly conservative, whereas large values indicate that the manager is willing to take risk and the agent can be more explorative. Here, given the value of α, the agent should select her actions in a way to both minimize her regret (2) and satisfy the performance constraint (3). In the next section, we propose a linear bandit algorithm that achieves this goal with high probability.
Remark 1. Since the baseline policy is often our company's strategy, it is reasonable to assume that a large amount of data generated by this policy is available, and thus, we have an accurate estimate of its reward function. If in addition to this accurate estimate, we have access to the actual data, we can use them in our algorithms.
The reason we do not use the data generated by the actions suggested by\nthe baseline policy in constructing the con\ufb01dence sets of our algorithm in Section 3 is mainly to keep\nthe analysis simple. However, when dealing with the more general case of unknown baseline reward\nin Section 4, we construct the con\ufb01dence sets using all available data, including those generated by\nthe baseline policy. It is important to note that having a good estimate of the baseline reward function\ndoes not necessarily mean that we know the unknown parameter \u03b8\u2217. This is because the data used for\nthis estimate has been generated by the baseline policy, and thus, may only provide a good estimate\nof \u03b8\u2217 in a limited subspace.\n3 A Conservative Linear Bandit Algorithm\nIn this section, we propose a linear bandit algorithm, called conservative linear upper con\ufb01dence\nbound (CLUCB), whose pseudocode is shown in Algorithm 1. CLUCB is based on the optimism\nin the face of uncertainty principle, and given the value of \u03b1, minimizes the regret (2) and satis\ufb01es\nthe performance constraint (3) with high probability. 
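The constraint (3) of Definition 1 is a statement about every prefix of the reward sequences, not just the horizon T. A small helper (names and numbers are illustrative, not from the paper) makes this uniform-in-time nature explicit:

```python
import numpy as np

def satisfies_constraint(agent_rewards, baseline_rewards, alpha):
    """Check (3): sum_{i<=t} r_{a_i} >= (1 - alpha) * sum_{i<=t} r_{b_i}
    for EVERY prefix t, not only at the final round."""
    agent = np.cumsum(agent_rewards)
    base = np.cumsum(baseline_rewards)
    return bool(np.all(agent >= (1.0 - alpha) * base))

# An agent that explores a poor action early can violate (3) even if it
# catches up later; this is why exploration must be budgeted.
print(satisfies_constraint([0.45, 0.9], [0.5, 0.5], alpha=0.2))        # True
print(satisfies_constraint([0.0, 0.9, 0.9], [0.5, 0.5, 0.5], alpha=0.2))  # False
```

The second call fails at t = 1 even though the agent's cumulative reward eventually overtakes the baseline, illustrating why the constraint must be enforced at every round.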
Algorithm 1 CLUCB
  Input: α, B, F
  Initialize: S₀ = ∅, z₀ = 0 ∈ R^d, and C₁ = B
  for t = 1, 2, 3, · · · do
    Find (a′_t, θ̃_t) ∈ arg max_{(a,θ)∈A_t×C_t} ⟨θ, φ_a^t⟩
    Compute L_t = min_{θ∈C_t} ⟨θ, z_{t−1} + φ_{a′_t}^t⟩
    if L_t + Σ_{i∈S^c_{t−1}} r_{b_i}^i ≥ (1 − α) Σ_{i=1}^t r_{b_i}^i then
      Play a_t = a′_t and observe reward y_t defined by (1)
      Set z_t = z_{t−1} + φ_{a_t}^t, S_t = S_{t−1} ∪ {t}, S^c_t = S^c_{t−1}
      Given a_t and y_t, construct the confidence set C_{t+1} according to (5)
    else
      Play a_t = b_t and observe reward y_t defined by (1)
      Set z_t = z_{t−1}, S_t = S_{t−1}, S^c_t = S^c_{t−1} ∪ {t}, C_{t+1} = C_t
    end if
  end for

At each round t, CLUCB uses the previous observations and builds a confidence set C_t that with high probability contains the unknown parameter θ*. It then selects the optimistic action a′_t ∈ arg max_{a∈A_t} max_{θ∈C_t} ⟨θ, φ_a^t⟩, which has the best performance among all the actions available in A_t, within the confidence set C_t. In order to make sure that the constraint (3) is satisfied, the algorithm plays the optimistic action a′_t, only if it satisfies the constraint for the worst choice of the parameter θ ∈ C_t. To make this more precise, let S_{t−1} be the set of rounds i < t at which CLUCB has played the optimistic action, i.e., a_i = a′_i.
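The per-round logic of Algorithm 1 can be sketched in code. For the ellipsoidal confidence set of (5), the inner max/min over θ have the closed forms ⟨θ̂, φ⟩ ± β‖φ‖_{V⁻¹}, which the sketch exploits; the function names and toy quantities are illustrative assumptions, not part of the paper.

```python
import numpy as np

def ucb_value(phi, theta_hat, V, beta):
    """max of <theta, phi> over the ellipsoid {||theta - theta_hat||_V <= beta}."""
    return theta_hat @ phi + beta * np.sqrt(phi @ np.linalg.solve(V, phi))

def lcb_value(phi, theta_hat, V, beta):
    """min of <theta, phi> over the same ellipsoid."""
    return theta_hat @ phi - beta * np.sqrt(phi @ np.linalg.solve(V, phi))

def clucb_round(action_set, theta_hat, V, beta, z, r_cons, r_base_cum, alpha):
    """One round of Algorithm 1 (known baseline rewards).
    z          : sum of features of past optimistic plays (z_{t-1})
    r_cons     : sum of known baseline rewards over past conservative rounds
    r_base_cum : sum of baseline rewards over ALL rounds 1..t
    Returns (index of optimistic action, True if it is safe to play it)."""
    ucbs = [ucb_value(phi, theta_hat, V, beta) for phi in action_set]
    i_opt = int(np.argmax(ucbs))
    # Pessimistic value of the agent's cumulative reward if a'_t is played:
    L_t = lcb_value(z + action_set[i_opt], theta_hat, V, beta)
    safe = L_t + r_cons >= (1.0 - alpha) * r_base_cum
    return i_opt, bool(safe)
```

When `safe` is False the caller plays b_t instead, exactly as in the else-branch of Algorithm 1.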
Similarly, S^c_{t−1} = {1, 2, · · · , t − 1} − S_{t−1} is the set of rounds j < t at which CLUCB has followed the baseline policy, i.e., a_j = b_j.
In order to guarantee that it does not violate constraint (3), at each round t, CLUCB plays the optimistic action, i.e., a_t = a′_t, only if

    min_{θ∈C_t} [ Σ_{i∈S^c_{t−1}} r_{b_i}^i + ⟨θ, Σ_{i∈S_{t−1}} φ_{a_i}^i + φ_{a′_t}^t⟩ ] ≥ (1 − α) Σ_{i=1}^t r_{b_i}^i,

where z_{t−1} = Σ_{i∈S_{t−1}} φ_{a_i}^i, and plays the conservative action, i.e., a_t = b_t, otherwise. In the following, we describe how CLUCB constructs and updates its confidence sets C_t.

3.1 Construction of Confidence Sets
CLUCB starts from the most general confidence set C₁ = B and updates its confidence set only when it plays an optimistic action. This is mainly to simplify the analysis and is based on the idea that since the reward function of the baseline policy is known ahead of time, playing a baseline action does not provide any new information about the unknown parameter θ*. However, this can be easily changed to update the confidence set after each action. In fact, this is what we do in the algorithm proposed in Section 4. We follow the approach of Abbasi-Yadkori et al. [2011] to build confidence sets for θ*. Let S_t = {i₁, . . . , i_{m_t}} be the set of rounds up to and including round t at which CLUCB has played the optimistic action. Note that we have defined m_t = |S_t|. For a fixed value of λ > 0, let

    θ̂_t = (Φ_t Φ_t^⊤ + λI)^{−1} Φ_t Y_t,   (4)

be the regularized least-squares estimate of θ at round t, where Φ_t = [φ_{a_{i₁}}^{i₁}, . . . , φ_{a_{i_{m_t}}}^{i_{m_t}}] and Y_t = [y_{i₁}, . . . , y_{i_{m_t}}]^⊤. For a fixed confidence parameter δ ∈ (0, 1), we construct the confidence set for the next round t + 1 as

    C_{t+1} = { θ ∈ R^d : ‖θ − θ̂_t‖_{V_t} ≤ β_{t+1} },   (5)

where β_{t+1} = σ √( d log( (1 + (m_t + 1)D²/λ) / δ ) ) + √λ B, V_t = λI + Φ_t Φ_t^⊤, and the weighted norm is defined as ‖x‖_V = √(x^⊤ V x) for any x ∈ R^d and any positive definite V ∈ R^{d×d}. Note that similar to the linear UCB algorithm (LUCB) in Abbasi-Yadkori et al. [2011], the sub-Gaussian parameter σ and the regularization parameter λ that appear in the definitions of β_{t+1} and V_t should also be given to the CLUCB algorithm as input. The following proposition (Theorem 2 in Abbasi-Yadkori et al. 2011) shows that the confidence sets constructed by (5) contain the true parameter θ* with high probability.

Proposition 1 For the confidence set C_t defined by (5), we have P[θ* ∈ C_t, ∀t ∈ N] ≥ 1 − δ.

As mentioned before, CLUCB ensures that performance constraint (3) holds for all θ ∈ C_t at all rounds t. As a result, if all the confidence sets hold (i.e., contain the true parameter θ*), CLUCB is guaranteed to satisfy performance constraint (3). Proposition 1 indicates that this happens with probability at least 1 − δ. It is worth noting that satisfying constraint (3) implies that CLUCB is at least as good as the baseline policy at all rounds.
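Equations (4) and (5) translate directly into code. The sketch below uses illustrative parameter values and noiseless rewards (so that the estimate can be checked against the truth); it assumes features are stored as the columns of Φ.

```python
import numpy as np

def rls_estimate(Phi, Y, lam=1.0):
    """Equation (4): theta_hat_t = (Phi_t Phi_t^T + lam*I)^{-1} Phi_t Y_t.
    Phi is d x m (one column per optimistic round), Y holds the m rewards."""
    d = Phi.shape[0]
    V = lam * np.eye(d) + Phi @ Phi.T        # V_t = lam*I + Phi_t Phi_t^T
    return np.linalg.solve(V, Phi @ Y), V

def radius(m, d, sigma, B, D, lam, delta):
    """Ellipsoid radius beta_{t+1} in (5)."""
    return sigma * np.sqrt(d * np.log((1 + (m + 1) * D**2 / lam) / delta)) \
        + np.sqrt(lam) * B

# Sanity check: with many noiseless observations, theta_hat approaches theta*.
rng = np.random.default_rng(1)
d, m = 3, 500
theta_star = rng.normal(size=d)
Phi = rng.normal(size=(d, m))
Y = Phi.T @ theta_star                       # eta_t = 0, purely for the check
theta_hat, V = rls_estimate(Phi, Y)
```

The regularizer λ biases θ̂ toward zero, which is why the check only holds approximately; the bias shrinks as the number of optimistic plays m grows.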
In this vein, Proposition 1 guarantees that, with probability at least 1 − δ, CLUCB performs no worse than the baseline policy at all rounds.

3.2 Regret Analysis of CLUCB
In this section, we prove a regret bound for the proposed CLUCB algorithm. Let ∆_{b_t}^t = r_{a*_t}^t − r_{b_t}^t be the baseline gap at round t, i.e., the difference between the expected rewards of the optimal and baseline actions at round t. This quantity shows how sub-optimal the action suggested by the baseline policy is at round t. We make the following assumption on the performance of the baseline policy π_b.

Assumption 3 There exist 0 ≤ ∆_l ≤ ∆_h and 0 < r_l ≤ r_h such that, at each round t,

    ∆_l ≤ ∆_{b_t}^t ≤ ∆_h   and   r_l ≤ r_{b_t}^t ≤ r_h.   (6)

An obvious candidate for both ∆_h and r_h is 1, as all the mean rewards are confined in [0, 1]. The reward lower-bound r_l ensures that the baseline policy maintains a minimum level of performance at each round. Finally, ∆_l = 0 is a reasonable candidate for the lower-bound of the baseline gap.
The following proposition shows that the regret of CLUCB can be decomposed into the regret of a linear UCB (LUCB) algorithm (e.g., Abbasi-Yadkori et al.
2011) and a regret caused by being conservative in order to satisfy the performance constraint (3).

Proposition 2 The regret of CLUCB can be decomposed into two terms as follows:

    R_T(CLUCB) ≤ R_{S_T}(LUCB) + n_T ∆_h,   (7)

where R_{S_T}(LUCB) is the cumulative (pseudo)-regret of LUCB at rounds t ∈ S_T and n_T = |S^c_T| = T − m_T is the number of rounds (in T rounds) at which CLUCB has played a conservative action.

Proof: From the definition of regret (2), we have

    R_T(CLUCB) = Σ_{t=1}^T r_{a*_t}^t − Σ_{t=1}^T r_{a_t}^t = Σ_{t∈S_T} (r_{a*_t}^t − r_{a_t}^t) + Σ_{t∈S^c_T} (r_{a*_t}^t − r_{b_t}^t) ≤ Σ_{t∈S_T} (r_{a*_t}^t − r_{a_t}^t) + n_T ∆_h,   (8)

since r_{a*_t}^t − r_{b_t}^t = ∆_{b_t}^t ≤ ∆_h by Assumption 3. The result follows from the fact that for t ∈ S_T, CLUCB plays the exact same actions as LUCB, and thus, the first term in (8) represents LUCB's regret for these rounds. □
The regret bound of LUCB for the confidence set (5) can be derived from the results of Abbasi-Yadkori et al. [2011]. Let E be the event that θ* ∈ C_t, ∀t ∈ N, which according to Proposition 1 holds w.p. at least 1 − δ. The following proposition provides a bound on R_{S_T}(LUCB). Since this proposition is a direct application of Thm. 3 in Abbasi-Yadkori et al.
[2011], we omit its proof here.

Proposition 3 On event E = {θ* ∈ C_t, ∀t ∈ N}, for any T ∈ N, we have

    R_{S_T}(LUCB) ≤ 4 √( m_T d log( λ + m_T D²/d ) ) × ( √λ B + σ √( 2 log(1/δ) + d log( 1 + m_T D²/(λd) ) ) ) = O( d log( DT/(λδ) ) √T ).   (9)

Now in order to bound the regret of CLUCB, we only need to find an upper-bound on n_T, i.e., the number of times that CLUCB deviates from LUCB and selects the action suggested by the baseline policy. We prove an upper-bound on n_T in Theorem 4, which is the main technical result of this section. Due to space constraint, we only provide a proof sketch for Theorem 4 in the paper and report its detailed proof in Appendix A. The proof requires several technical lemmas that have been proved in Appendix C.

Theorem 4 Let λ ≥ max(1, D²). Then, on event E, for any horizon T ∈ N, we have

    n_T ≤ 1 + (114 d² (B√λ + σ)²) / (α r_l (∆_l + α r_l)) × [ log( (62 d (B√λ + σ)) / (δ (∆_l + α r_l)) ) ]².

Proof Sketch: Let τ = max{1 ≤ t ≤ T | a_t ≠ a′_t} be the last round at which CLUCB takes an action suggested by the baseline policy. We first show that at round τ, the following holds:

    α Σ_{t=1}^τ r_{b_t}^t ≤ −(m_{τ−1} + 1) ∆_l + 2 Σ_{t∈S_{τ−1}} β_t ‖φ_{a_t}^t‖_{V_t^{−1}} + 2 β_τ ‖φ_{a′_τ}^τ‖_{V_τ^{−1}}.

Next, using Lemmas 7 and 8 (reported in Appendix C), and the Cauchy-Schwartz inequality, we deduce that

    α Σ_{t=1}^τ r_{b_t}^t ≤ −(m_{τ−1} + 1) ∆_l + 8 d (B√λ + σ) log( 2(m_{τ−1} + 1)/δ ) √(m_{τ−1} + 1).

Since r_{b_t}^t ≥ r_l for all t, and τ = n_{τ−1} + m_{τ−1} + 1, it follows that

    α r_l n_{τ−1} ≤ −(m_{τ−1} + 1)(∆_l + α r_l) + 8 d (B√λ + σ) log( 2(m_{τ−1} + 1)/δ ) √(m_{τ−1} + 1).   (10)

Note that n_{τ−1} and m_{τ−1} appear on the LHS and RHS of (10), respectively. The key point is that the RHS is positive only for a finite number of integers m_{τ−1}, and thus, it has a finite upper bound. Using Lemma 9 (reported and proved in Appendix C), we prove that

    α r_l n_{τ−1} ≤ (114 d² (B√λ + σ)²) / (∆_l + α r_l) × [ log( (62 d (B√λ + σ)) / (δ (∆_l + α r_l)) ) ]².

Finally, the fact that n_T = n_τ = n_{τ−1} + 1 completes the proof. □
We now have all the necessary ingredients to derive a regret bound on the performance of the CLUCB algorithm.
We report the regret bound of CLUCB in Theorem 5, whose proof is a direct consequence of the results of Propositions 2 and 3, and Theorem 4.

Theorem 5 Let λ ≥ max(1, D²). With probability at least 1 − δ, the CLUCB algorithm satisfies the performance constraint (3) for all t ∈ N, and has the regret bound

    R_T(CLUCB) = O( d log( DT/(λδ) ) √T + (K ∆_h)/(α r_l) ),   (11)

where K is a constant that only depends on the parameters of the problem as

    K = 1 + (114 d² (B√λ + σ)²) / (∆_l + α r_l) × [ log( (62 d (B√λ + σ)) / (δ (∆_l + α r_l)) ) ]².

Remark 2. The first term in the regret bound (11) is the regret of LUCB, which grows at the rate √T log(T). The second term accounts for the loss incurred by being conservative in order to satisfy the performance constraint (3). Our results indicate that this loss does not grow with time (since CLUCB acts conservatively only in a finite number of rounds). This is a clear improvement over the regret bound reported in Wu et al. [2016] for the MAB setting, in which the regret of being conservative grows with time. Furthermore, the regret bound of Theorem 5 clearly indicates that CLUCB's regret is larger for smaller values of α. This perfectly matches the intuition that the agent must be more conservative, and thus, suffers higher regret, for smaller values of α.
Theorem 5 also indicates that CLUCB's regret is smaller for smaller values of ∆_h, because when the baseline policy π_b is close to optimal, the algorithm does not lose much by being conservative.

Algorithm 2 CLUCB2
  Input: α, r_l, B, F
  Initialize: n ← 0, z ← 0, w ← 0, v ← 0 and C₁ ← B
  for t = 1, 2, 3, · · · do
    Let b_t be the action suggested by π_b at round t
    Find (a′_t, θ̃) = arg max_{(a,θ)∈A_t×C_t} ⟨θ, φ_a^t⟩
    Find R_t = max_{θ∈C_t} ⟨θ, v + φ_{b_t}^t⟩ and L_t = min_{θ∈C_t} ⟨θ, z + φ_{a′_t}^t⟩ + α max{ min_{θ∈C_t} ⟨θ, w⟩, n r_l }
    if L_t ≥ (1 − α) R_t then
      Play a_t = a′_t and observe y_t defined by (1)
      Set z ← z + φ_{a′_t}^t and v ← v + φ_{b_t}^t
    else
      Play a_t = b_t and observe y_t defined by (1)
      Set w ← w + φ_{b_t}^t and n ← n + 1
    end if
    Given a_t and y_t, construct the confidence set C_{t+1} according to (15)
  end for

4 Unknown Baseline Reward
In this section, we consider the case where the expected rewards of the actions taken by the baseline policy, r_{b_t}^t, are unknown at the beginning. We show how the CLUCB algorithm presented in Section 3 should be changed to handle this case, and present a new algorithm, called CLUCB2. We prove a regret bound for CLUCB2, which is at the same rate as that for CLUCB. This shows that the lack of knowledge about the reward function of the baseline policy does not hurt our algorithm in terms of the rate of the regret. The pseudocode of CLUCB2 is shown in Algorithm 2. The main difference with CLUCB is in the condition that should be checked at each round t to see whether we should play the optimistic action a′_t or the conservative action b_t.
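The safety test L_t ≥ (1 − α)R_t of Algorithm 2 replaces every unknown baseline reward by a confidence-set estimate: pessimistic on the agent's side, optimistic on the baseline's side. A sketch of the test, again assuming the closed-form max/min over an ellipsoidal set (the helper names and the variable layout mirror the pseudocode but are otherwise illustrative):

```python
import numpy as np

def ucb(phi, theta_hat, V, beta):
    """max of <theta, phi> over {||theta - theta_hat||_V <= beta}."""
    return theta_hat @ phi + beta * np.sqrt(phi @ np.linalg.solve(V, phi))

def lcb(phi, theta_hat, V, beta):
    """min of <theta, phi> over the same ellipsoid."""
    return theta_hat @ phi - beta * np.sqrt(phi @ np.linalg.solve(V, phi))

def clucb2_safe(phi_opt, phi_base, theta_hat, V, beta, z, v, w, n, r_l, alpha):
    """Safety test of Algorithm 2 (CLUCB2), for unknown baseline rewards.
    z : sum of features of past optimistic plays
    v : sum of baseline features at optimistic rounds
    w : sum of baseline features at the n past conservative rounds."""
    L = lcb(z + phi_opt, theta_hat, V, beta) \
        + alpha * max(lcb(w, theta_hat, V, beta), n * r_l)
    R = ucb(v + phi_base, theta_hat, V, beta)
    return bool(L >= (1.0 - alpha) * R)
```

As the confidence set shrinks (small β), L and R approach the true cumulative rewards, and the test reduces to the exact constraint that CLUCB checks with known baseline rewards.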
This condition should be selected in a way\nthat CLUCB2 satis\ufb01es constraint (3). We may rewrite (3) as\n\n(cid:88)\n\ni\u2208St\u22121\n\nri\nai\n\n+ rt\na(cid:48)\n\nt\n\n+ \u03b1\n\nri\nbi\n\n\u2265 (1 \u2212 \u03b1)(cid:0)rt\n\n(cid:88)\n\n+\n\nbt\n\ni\u2208St\u22121\n\nIf we lower-bound the LHS and upper-bound the RHS of (12), we obtain\n(cid:104)\u03b8,\n\n(cid:104)\u03b8,\n\n(cid:104)\u03b8,\n\n\u03c6i\nbi\n\n(cid:105) \u2265 (1 \u2212 \u03b1) max\n\u03b8\u2208Ct\n\n(cid:105) + \u03b1 min\n\u03b8\u2208Ct\n\n\u03c6i\nai\n\n+ \u03c6t\na(cid:48)\n\nt\n\nmin\n\u03b8\u2208Ct\n\n(cid:88)\n\ni\u2208St\u22121\n\n(cid:88)\n(cid:88)\n\nt\u22121\n\ni\u2208Sc\n\ni\u2208Sc\n\nt\u22121\n\nri\nbi\n\n(cid:1).\n(cid:88)\n\ni\u2208St\u22121\n\n(12)\n\n\u03c6i\nbi\n\n+ \u03c6t\nbt\n\n(cid:105).\n\n(13)\n\nSince each con\ufb01dence set Ct is built in a way to contain the true parameter \u03b8\u2217 with high probability,\nit is easy to see that (12) is satis\ufb01ed whenever (13) is true.\nCLUCB2 uses both optimistic and conservative actions, and their corresponding rewards in building\n], Yt = [y1, y2,\u00b7\u00b7\u00b7 , yt]\n(cid:124),\nits con\ufb01dence sets. Speci\ufb01cally for any t, we let \u03a6t = [\u03c61\na1\nVt = \u03bbI + \u03a6\n\n(cid:124)\nt \u03a6t, and de\ufb01ne the least-square estimate after round t as\n\n,\u00b7\u00b7\u00b7 , \u03c6t\n\n, \u03c62\na2\n\nat\n\n(cid:124)\nt + \u03bbI)\n\n(cid:98)\u03b8t = (\u03a6t\u03a6\n\u22121 \u03a6tYt.\nGiven Vt and(cid:98)\u03b8t, the con\ufb01dence set for round t + 1 is constructed as\n(cid:110)\n(cid:111)\n\u03b8 \u2208 Ct : (cid:107)\u03b8 \u2212(cid:98)\u03b8t(cid:107)Vt \u2264 \u03b2t+1\n(cid:16) 1+tD2/\u03bb\n\n(cid:114)\ni.e., P(cid:2)\u03b8\u2217 \u2208 Ct, \u2200t \u2208 N(cid:3) \u2265 1 \u2212 \u03b4.\n\nCt+1 =\n\n(cid:17)\n\n+ B\n\n\u221a\n\nwhere C1 = B and \u03b2t = \u03c3\n\u03bb. 
Similar to Proposition 1, we can easily prove that the confidence sets built by (15) contain the true parameter $\theta^*$ with high probability, i.e., $\mathbb{P}\big[\theta^* \in \mathcal{C}_t,\; \forall t \in \mathbb{N}\big] \geq 1 - \delta$.
Remark 3. Note that unlike the CLUCB algorithm, here we build nested confidence sets, i.e., $\cdots \subseteq \mathcal{C}_{t+1} \subseteq \mathcal{C}_t \subseteq \mathcal{C}_{t-1} \subseteq \cdots$, which is necessary for the proof of the algorithm. This can potentially increase the computational complexity of CLUCB2, but from a practical point of view, the confidence sets become nested automatically after sufficient data has been observed. Therefore, the nested constraint in building the confidence sets can be relaxed after a sufficiently large number of rounds.
The following theorem guarantees that CLUCB2 satisfies the safety constraint (3) with high probability, while its regret has the same rate as that of CLUCB and is worse than that of LUCB only up to an additive constant.
Theorem 6 Let $\lambda \geq \max(1, D^2)$ and $\delta \leq 2/e$. Then, with probability at least $1 - \delta$, the CLUCB2 algorithm satisfies the performance constraint (3) for all $t \in \mathbb{N}$, and has the regret bound

  $R_T(\text{CLUCB2}) = O\Big( d \log\big(\tfrac{DT}{\lambda \delta}\big) \sqrt{T} + \tfrac{K \Delta_h}{\alpha^2 r_l^2} \Big),$   (16)

where $K$ is a constant that depends only on the parameters of the problem as

  $K = 256\, d^2 (B\sqrt{\lambda} + \sigma)^2 \Big[ \log\Big( \tfrac{10\, d\, (B\sqrt{\lambda} + \sigma)}{\alpha\, r_l\, \delta^{1/4}} \Big) + 1 \Big]^2.$

We report the proof of Theorem 6 in Appendix B. The proof follows the same steps as that of Theorem 5, with additional non-trivial technicalities that have been highlighted there.

5 Simulation Results
In this section, we provide simulation results to illustrate the performance of the proposed CLUCB algorithm. We considered a time-independent action set of 100 arms, each having a time-independent feature vector living in $\mathbb{R}^4$. These feature vectors and the parameter $\theta^*$ are randomly drawn from $N(0, I_4)$ such that the mean reward associated to each arm is positive. The observation noise at each time step is also generated independently from $N(0, 1)$, and the mean reward of the baseline policy at any time is taken to be the reward associated to the 10th best action. We have taken $\lambda = 1$, $\delta = 0.001$, and the results are averaged over 1,000 realizations.

Figure 1: Average per-step regret (over 1,000 runs) of LUCB and CLUCB for different values of $\alpha$.

In Figure 1, we plot the per-step regret (i.e., $R_t/t$) of LUCB and CLUCB for different values of $\alpha$ over a horizon $T = 40{,}000$. Figure 1 shows that the per-step regret of CLUCB remains constant at the beginning (the conservative phase). This is because during this phase, CLUCB follows the baseline policy to make sure that the performance constraint (3) is satisfied. As expected, the length of the conservative phase decreases as $\alpha$ is increased, since the performance constraint is relaxed for larger values of $\alpha$, and hence, CLUCB starts playing optimistic actions more quickly. After this initial conservative phase, CLUCB has learned enough about the optimal action, and its performance starts converging to that of LUCB. On the other hand, Figure 1 shows that the per-step regret of CLUCB in the first few periods remains much lower than that of LUCB. This is because LUCB plays agnostic to the safety constraint, and thus may select very poor actions in its initial exploration phase. In this regard, Figure 2(a) plots the percentage of the rounds, in the first 1,000 rounds, at which the safety constraint (3) is violated by LUCB and CLUCB for different values of $\alpha$.
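For concreteness, the experimental setup described in this section can be generated with a short script. This is our own illustrative sketch, not the authors' code: in particular, the sign-flip used below to guarantee positive mean rewards is just one possible reading of "drawn ... such that the mean reward associated to each arm is positive", and all names are ours.

```python
import numpy as np

def make_environment(n_arms=100, d=4, seed=0):
    """Generate the simulation setup: fixed arms in R^4 with theta* ~ N(0, I_4)."""
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d)
    features = rng.standard_normal((n_arms, d))   # one feature vector per arm
    # Make every arm's mean reward positive (our reading of the setup): flip the
    # sign of any feature vector whose inner product with theta* is negative.
    signs = np.where(features @ theta_star < 0, -1.0, 1.0)
    features *= signs[:, None]
    mean_rewards = features @ theta_star
    # Baseline policy: always play the 10th-best action, as in the text.
    baseline_arm = int(np.argsort(mean_rewards)[-10])
    return features, theta_star, mean_rewards, baseline_arm

def observe(mean_rewards, arm, rng):
    """Noisy reward y_t = <phi_a, theta*> + eta_t with eta_t ~ N(0, 1)."""
    return mean_rewards[arm] + rng.standard_normal()
```

A full experiment would run LUCB and CLUCB on this environment for $T = 40{,}000$ rounds and average the per-step regret over 1,000 such seeds.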
Figure 2: (a) Percentage of the rounds, in the first 1,000 rounds, at which the safety constraint is violated by LUCB and CLUCB for different values of $\alpha$; (b) per-step regret of LUCB and CLUCB for different values of $\alpha$, at round $t = 40{,}000$.

According to Figure 2(a), CLUCB satisfies the performance constraint for all values of $\alpha$, while LUCB fails in a significant number of rounds, especially for small values of $\alpha$ (i.e., a tight constraint).
To better illustrate the effect of the performance constraint (3) on the regret of the algorithms, Figure 2(b) plots the per-step regret achieved by CLUCB at round $t = 40{,}000$ for different values of $\alpha$, as well as that of LUCB. As expected from our analysis and shown in Figure 1, the performance of CLUCB converges to that of LUCB after an initial conservative phase. Figure 2(b) confirms that the convergence happens more quickly for larger values of $\alpha$, where the constraint is more relaxed.

6 Conclusions
In this paper, we studied the concept of safety in contextual linear bandits to address the challenges that arise in implementing such algorithms in practical situations such as personalized recommendation systems. Most of the existing linear bandit algorithms, such as LUCB [Abbasi-Yadkori et al., 2011], suffer from a large regret in their initial exploratory rounds. This unsafe behavior is not acceptable in many practical situations, where having a reasonable performance at all times is necessary for a learning algorithm to be considered reliable and to remain in production.
To guarantee safe learning, we formulated a conservative linear bandit problem, where the performance of the learning algorithm (measured in terms of its cumulative rewards) at any time is constrained to be at least as good as a fraction of the performance of a baseline policy.
We proposed a conservative version of the LUCB algorithm, called CLUCB, to solve this constrained problem, and showed that it satisfies the safety constraint with high probability, while achieving a regret bound equivalent to that of LUCB up to an additive time-independent constant. We designed two versions of CLUCB that can be used depending on whether the reward function of the baseline policy is known or unknown, and showed that in each case, CLUCB acts conservatively (i.e., plays the action suggested by the baseline policy) only in a finite number of rounds, which depends on how suboptimal the baseline policy is. We reported simulation results that support our analysis and show the performance of the proposed CLUCB algorithm.

References
Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
L. Bottou, J. Peters, J. Quiñonero-Candela, D. Charles, D. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260, 2013.
W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
V. Dani, T. Hayes, and S. Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.
M. Petrik, M. Ghavamzadeh, and Y. Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pages 2298–2306, 2016.
P. Rusmevichientong and J. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015.
A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In Proceedings of the Thirty-Second International Conference on Machine Learning, pages 2380–2388, 2015.
Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvári. Conservative bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1254–1262, 2016.