{"title": "Scalar Posterior Sampling with Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 7685, "page_last": 7693, "abstract": "We propose a practical non-episodic PSRL algorithm that unlike recent state-of-the-art PSRL algorithms uses a deterministic, model-independent episode switching schedule. Our algorithm termed deterministic schedule PSRL (DS-PSRL) is efficient in terms of time, sample, and space complexity. We prove a Bayesian regret bound under mild assumptions. Our result is more generally applicable to multiple parameters and continuous state action problems. We compare our algorithm with state-of-the-art PSRL algorithms on standard discrete and continuous problems from the literature. Finally, we show how the assumptions of our algorithm satisfy a sensible parameterization for a large class of problems in sequential recommendations.", "full_text": "Scalar Posterior Sampling with Applications\n\nGeorgios Theocharous\n\nAdobe Research\n\ntheochar@adobe.com\n\nZheng Wen\n\nAdobe Research\nzwen@adobe.com\n\nYasin Abbasi-Yadkori\n\nAdobe Research\n\nabbasiya@adobe.com\n\nNikos Vlassis\n\nNet\ufb02ix\n\nnvlassis@netflix.com\n\nAbstract\n\nWe propose a practical non-episodic PSRL algorithm that unlike recent state-of-the-\nart PSRL algorithms uses a deterministic, model-independent episode switching\nschedule. Our algorithm termed deterministic schedule PSRL (DS-PSRL) is ef-\n\ufb01cient in terms of time, sample, and space complexity. We prove a Bayesian\nregret bound under mild assumptions. Our result is more generally applicable\nto multiple parameters and continuous state action problems. We compare our\nalgorithm with state-of-the-art PSRL algorithms on standard discrete and con-\ntinuous problems from the literature. 
Finally, we show how the assumptions of\nour algorithm satisfy a sensible parametrization for a large class of problems in\nsequential recommendations.\n\n1\n\nIntroduction\n\nThompson sampling [Thompson, 1933], or posterior sampling for reinforcement learning (PSRL), is\na conceptually simple approach to deal with unknown MDPs [Strens, 2000; Osband et al., 2013].\nPSRL begins with a prior distribution over the MDP model parameters (transitions and/or rewards)\nand typically works in episodes. At the start of each episode, an MDP model is sampled from the\nposterior belief and the agent follows the policy that is optimal for that sampled MDP until the end of\nthe episode. The posterior is updated at the end of every episode based on the observed actions, states,\nand rewards. A special case of MDP under which PSRL has been recently extensively studied is\nMDP with state resetting, either explicitly or implicitly. Speci\ufb01cally, in [Osband et al., 2013; Osband\nand Van Roy, 2014] the considered MDPs are assumed to have \ufb01xed-length episodes, and at the\nend of each episode the MDP\u2019s state is reset according to a \ufb01xed state distribution. In [Gopalan and\nMannor, 2015], there is an assumption that the environment is ergodic and that there exists a recurrent\nstate under any policy. Both approaches have developed variants of PSRL algorithms under their\nrespective assumptions, as well as state-of-the-art regret bounds, Bayesian in [Osband et al., 2013;\nOsband and Van Roy, 2014] and Frequentist in [Gopalan and Mannor, 2015].\nHowever, many real-world problems are of a continuing and non-resetting nature. These include\nsequential recommendations and other common examples found in controlled mechanical systems\n(e.g., control of manufacturing robots), and process optimization (e.g., controlling a queuing system),\nwhere \u2018resets\u2019 are rare or unnatural. 
Many of these real-world examples could easily be parametrized with a scalar parameter, where each value of the parameter specifies a complete model. These types of domains do not have the luxury of state resetting, and the agent needs to learn to act without necessarily revisiting states. Extensions of PSRL algorithms to general MDPs without state resetting have so far produced impractical algorithms and, in some cases, flawed theoretical analyses. This is due to the difficulty of analyzing regret under policy switching schedules that depend on various dynamic statistics produced by the true underlying model (e.g., doubling of the visitations of state and action pairs, or uncertainty reduction of the parameters). Next we summarize the literature for this general case of PSRL.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n\fThe earliest such general case was analyzed as Bayes regret in a ‘lazy’ PSRL algorithm [Abbasi-Yadkori and Szepesvári, 2015]. In this approach a new model is sampled, and a new policy is computed from it, every time the uncertainty over the underlying model is sufficiently reduced; however, the corresponding analysis was shown to contain a gap [Osband and Van Roy, 2016].\nA recent general case PSRL algorithm with Bayes regret analysis was proposed in [Ouyang et al., 2017b]. At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria: a new episode starts either when the length of the current episode exceeds the previous length by one, or when the number of visits to some state-action pair doubles. 
They establish Õ(HS√(AT)) bounds on expected regret under a Bayesian setting, where S and A are the sizes of the state and action spaces, T is time, H is the bound of the span, and the Õ notation hides logarithmic factors. However, despite the state-of-the-art regret analysis, the algorithm is not well suited for large and continuous state and action spaces due to the requirement to count state and action visitations for all state-action pairs.\nIn another recent work [Agrawal and Jia, 2017], the authors present a general case PSRL algorithm that achieves near-optimal worst-case regret bounds when the underlying Markov decision process is communicating with a finite, though unknown, diameter. Their main result is a high-probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions, and diameter D, when T ≥ S^5 A. Despite the nice form of the regret bound, this algorithm suffers from similar practicality issues as the algorithm in [Ouyang et al., 2017b]. The epochs are computed based on doubling the visitations of state and action pairs, which implies tabular representations. In addition, it employs a stricter assumption than previous work: a fully communicating MDP with some unknown diameter. 
Finally, in order for the bound to hold, T ≥ S^5 A is required, which would be impractical for large-scale problems.\nBoth of the above two recent state-of-the-art algorithms [Ouyang et al., 2017b; Agrawal and Jia, 2017] do not use generalization, in that they learn separate parameters for each state-action pair. In such a non-parametrized case, there are several other modern reinforcement learning algorithms, such as UCRL2 [Jaksch et al., 2010], REGAL [Bartlett and Tewari, 2009], and R-max [Brafman and Tennenholtz, 2002], which learn MDPs using the well-known ‘optimism under uncertainty’ principle. In these approaches a confidence interval is maintained for each state-action pair, and observing a particular state transition and reward provides information for only that state and action. Such approaches are inefficient in cases where the whole structure of the MDP can be determined by a scalar parameter.\nDespite the elegant regret bounds for the general case PSRL algorithms developed in [Ouyang et al., 2017b; Agrawal and Jia, 2017], both of them focus on tabular reinforcement learning and hence are sample inefficient for many practical problems with exponentially large or even continuous state/action spaces. On the other hand, in many practical RL problems, the MDPs are parametrized in the sense that system dynamics and reward/loss functions are assumed to lie in a known parametrized low-dimensional manifold [Gopalan and Mannor, 2015]. Such model parametrization (i.e., model generalization) allows researchers to develop sample-efficient algorithms for large-scale RL problems. Our paper belongs to this line of research. Specifically, we propose a novel general case PSRL algorithm, referred to as DS-PSRL, that exploits model parametrization (generalization). We prove an Õ(√T) Bayes regret bound for DS-PSRL, assuming we can model every MDP with a single smooth parameter.\nDS-PSRL also has lower computational and space complexities than the algorithms proposed in [Ouyang et al., 2017b; Agrawal and Jia, 2017]. In the case of [Ouyang et al., 2017b] the number of policy switches in the first T steps is K_T = O(√(2SAT log(T))); on the other hand, DS-PSRL adopts a deterministic schedule and its number of policy switches is K_T ≤ log(T). Since the major computational burden of PSRL algorithms is to solve a sampled MDP at each policy switch, DS-PSRL is computationally more efficient than the algorithm proposed in [Ouyang et al., 2017b]. As to the space complexity, both algorithms proposed in [Ouyang et al., 2017b; Agrawal and Jia, 2017] need to store counts of state and action visitations. In contrast, DS-PSRL uses a model-independent schedule and as a result does not need to store such statistics.\nIn the rest of the paper we describe the DS-PSRL algorithm and derive a state-of-the-art Bayes regret analysis. We demonstrate and compare our algorithm with the state of the art on standard problems from the literature. Finally, we show how the assumptions of our algorithm satisfy a sensible parametrization for a large class of problems in sequential recommendations.\n\n2 Problem Formulation\n\nWe consider the reinforcement learning problem in a parametrized Markov decision process (MDP) (X, A, ℓ, P^{θ*}), where X is the state space, A is the action space, ℓ : X × A → R is the instantaneous loss function, and P^{θ*} is an MDP transition model parametrized by θ*. We assume that the learner knows X, A, ℓ, and the mapping from the parameter θ* to the transition model P^{θ*}, but does not know θ*. Instead, the learner has a prior belief P_0 on θ* at time t = 0, before it starts to interact with the MDP. We also use Θ to denote the support of the prior belief P_0. 
Note that in this paper, we do not assume X or A to be finite; they can be infinite or even continuous. For any time t = 1, 2, . . ., let x_t ∈ X be the state at time t and a_t ∈ A be the action at time t. Our goal is to develop an algorithm (controller) that adaptively selects an action a_t at every time step t, based on prior information and past observations, to minimize the long-run Bayes average loss\n\nE[ lim sup_{n→∞} (1/n) Σ_{t=1}^n ℓ(x_t, a_t) ].\n\nSimilarly to the existing literature [Osband et al., 2013; Ouyang et al., 2017b], we measure the performance of such an algorithm using the Bayes regret:\n\nR_T = E[ Σ_{t=1}^T ( ℓ(x_t, a_t) − J^{θ*}_{π*} ) ],   (1)\n\nwhere J^{θ*}_{π*} is the average loss of running the optimal policy under the true model θ*. Note that under the mild ‘weakly communicating’ assumption, J^{θ*}_{π*} is independent of the initial state.\nThe Bayes regret analysis of PSRL relies on the key observation that at each stopping time τ the true MDP model θ* and the sampled model θ̃_τ are identically distributed [Ouyang et al., 2017b]. This fact allows us to relate quantities that depend on the true, but unknown, MDP θ* to those of the sampled MDP θ̃_τ that is fully observed by the agent. This is formalized by the following Lemma 1.\nLemma 1 (Posterior Sampling [Ouyang et al., 2017b]). Let (F_s)_{s=1}^∞ be a filtration (F_s can be thought of as the historic information until current time s) and let τ be an almost surely finite F_s-stopping time. Then, for any measurable function g,\n\nE[ g(θ*) | F_τ ] = E[ g(θ̃_τ) | F_τ ].   (2)\n\nAdditionally, the above implies that E[g(θ*)] = E[g(θ̃_τ)] through the tower property.\n\n3 The Proposed Algorithm: Deterministic Schedule PSRL\n\nIn this section, we propose a PSRL algorithm with a deterministic policy update schedule, shown in Figure 1. The algorithm changes the policy in an exponentially rare fashion; if the length of the current episode is L, the next episode has length 2L. This switching schedule ensures that the total number of switches is O(log T). We also note that, when sampling a new parameter θ̃_t, the algorithm finds the optimal policy assuming that the sampled parameter is the true parameter of the system. Any planning algorithm can be used to compute this optimal policy [Sutton and Barto, 1998]. In our analysis, we assume that we have access to the exact optimal policy, although it can be shown that this computation need not be exact and a near-optimal policy suffices (see [Abbasi-Yadkori and Szepesvári, 2015]).\n\nInputs: P_1, the prior distribution of θ*.\nL ← 1.\nfor t ← 1, 2, . . . do\n  if t = L then\n    Sample θ̃_t ∼ P_t.\n    L ← 2L.\n  else\n    θ̃_t ← θ̃_{t−1}.\n  end if\n  Calculate near-optimal action a_t ← π*(x_t, θ̃_t).\n  Execute action a_t and observe the new state x_{t+1}.\n  Update P_t with (x_t, a_t, x_{t+1}) to obtain P_{t+1}.\nend for\n\nFigure 1: The DS-PSRL algorithm with deterministic schedule of policy updates.\n\nTo measure the performance of our algorithm we use the Bayes regret R_T defined in Equation 1. The slower the regret grows, the closer is the performance of the learner to that of an optimal policy. If the growth rate of R_T is sublinear (R_T = o(T)), the average loss per time step will converge to the optimal average loss as T gets large, and in this sense we can say that the algorithm is asymptotically optimal. 
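The deterministic doubling schedule of Figure 1 is straightforward to implement. Below is a minimal Python sketch, assuming hypothetical callables `sample_posterior`, `solve_mdp`, `step`, and `update_posterior` supplied by the user (the planner and posterior update are placeholders, not part of the paper's specification):

```python
def ds_psrl_switch_times(T):
    """Deterministic switching times of DS-PSRL: a new model is sampled at
    t = 1, 2, 4, 8, ..., so episode lengths double and the number of policy
    switches within T steps is at most log2(T) + 1."""
    times, L = [], 1
    while L <= T:
        times.append(L)
        L *= 2
    return times

def ds_psrl(T, sample_posterior, solve_mdp, step, update_posterior, x0):
    """Sketch of the DS-PSRL loop in Figure 1. `solve_mdp(theta)` stands in
    for any planner returning the optimal policy of the sampled model; the
    posterior P_t is updated after every single transition."""
    switch = set(ds_psrl_switch_times(T))
    x, policy = x0, None
    for t in range(1, T + 1):
        if t in switch:                 # deterministic, model-independent schedule
            theta = sample_posterior()  # draw theta_t from the current posterior
            policy = solve_mdp(theta)   # plan as if theta_t were the true model
        a = policy(x)
        x_next = step(x, a)
        update_posterior(x, a, x_next)  # posterior update happens every step
        x = x_next
```

Note that, unlike TSDE, nothing in the schedule depends on visitation counts, which is why no per-state-action statistics need to be stored.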
Our main result shows that, under certain conditions, the construction of such asymptotically optimal policies can be reduced to efficiently sampling from the posterior of θ* and solving classical (non-Bayesian) optimal control problems.\nFirst we state our assumptions. We assume that the MDP is weakly communicating. This is a standard assumption, and under it the optimal average loss satisfies the Bellman equation. Further, we assume that the dynamics are parametrized by a scalar parameter and satisfy a smoothness condition.\nAssumption A1 (Lipschitz Dynamics) There exists a constant C such that for any state x, action a, and parameters θ, θ′ ∈ Θ ⊂ R,\n\n‖P(· | x, a, θ) − P(· | x, a, θ′)‖_1 ≤ C |θ − θ′|.\n\nWe also make a concentrating posterior assumption, which states that the variance of the difference between the true parameter and the sampled parameter gets smaller as more samples are gathered.\nAssumption A2 (Concentrating Posterior) Let N_j be one plus the number of steps in the first j episodes, and let θ̃_j be sampled from the posterior at the current episode j. Then there exists a constant C′ such that\n\nmax_j E[ N_{j−1} (θ* − θ̃_j)^2 ] ≤ C′ log T.\n\nAssumption A2 simply says that the variance of the posterior decreases given more data. In other words, we assume that the problem is learnable and not a degenerate case. A2 was shown to hold for two general categories of problems, finite MDPs and linearly parametrized problems with Gaussian noise [Abbasi-Yadkori and Szepesvári, 2015]. In addition, in this paper we prove that this assumption is satisfied for a large class of practical problems, such as the smoothly parametrized sequential recommendation systems of Section 6.\nNow we are ready to state the main theorem. We show a sketch of the analysis in the next section. More details are in the appendix.\nTheorem 1 Under Assumptions A1 and A2, the regret of the DS-PSRL algorithm is bounded as\n\nR_T = Õ(C √(C′ T)),\n\nwhere the Õ notation hides logarithmic factors.\nNotice that the regret bound in Theorem 1 does not directly depend on S or A. Moreover, notice that the regret bound is smaller if the Lipschitz constant C is smaller or the posterior concentrates faster (i.e., C′ is smaller).\n\n4 Sketch of Analysis\n\nTo analyze the algorithm shown in Figure 1, we first decompose the regret into a number of terms, which are then bounded one by one. Let x̃^a_{t+1} ∼ P(· | x_t, a, θ̃_t), i.e., an imaginary next-state sample assuming we take action a in state x_t when the parameter is θ̃_t. Also let x̃_{t+1} ∼ P(· | x_t, a_t, θ̃_t) and x_{t+1} ∼ P(· | x_t, a_t, θ*). By the average-cost Bellman optimality equation [Bertsekas, 1995], for a system parametrized by θ̃_t, we can write\n\nJ(θ̃_t) + h_t(x_t) = min_{a ∈ A} { ℓ(x_t, a) + E[ h_t(x̃^a_{t+1}) | F_t, θ̃_t ] }.   (3)\n\nHere h_t(x) = h(x, θ̃_t) is the differential value function for a system with parameter θ̃_t. We assume there exists H > 0 such that h_t(x) ∈ [0, H] for any x ∈ X. Because the algorithm takes the optimal action with respect to parameter θ̃_t and a_t is the action at time t, the right-hand side of the above equation is minimized and thus\n\nJ(θ̃_t) + h_t(x_t) = ℓ(x_t, a_t) + E[ h_t(x̃_{t+1}) | F_t, θ̃_t ].   (4)\n\nThe regret decomposes into two terms as shown in Lemma 2.\nLemma 2 We can decompose the regret as follows:\n\nR_T = Σ_{t=1}^T E[ ℓ(x_t, a_t) − J(θ*) ] ≤ H Σ_{t=1}^T E[ 1{A_t} ] + Σ_{t=1}^T E[ h_t(x_{t+1}) − h_t(x̃_{t+1}) ] + H,\n\nwhere A_t denotes the event that the algorithm has changed its policy at time t.\nThe first term H Σ_{t=1}^T E[1{A_t}] is related to the sequential changes in the differential value functions, h_{t+1} − h_t. We control this term by keeping the number of switches small; h_{t+1} = h_t as long as the same parameter θ̃_t is used. Notice that under DS-PSRL, Σ_{t=1}^T 1{A_t} ≤ log2(T) always holds. Thus, the first term can be bounded by H Σ_{t=1}^T E[1{A_t}] ≤ H log2(T).\nThe second term Σ_{t=1}^T E[h_t(x_{t+1}) − h_t(x̃_{t+1})] is related to how fast the posterior concentrates around the true parameter vector. To simplify the exposition, we define\n\nΔ_t = ∫_X ( P(x | x_t, a_t, θ*) − P(x | x_t, a_t, θ̃_t) ) h_t(x) dx = E[ h_t(x_{t+1}) − h_t(x̃_{t+1}) | x_t, a_t ].\n\nRecall that x̃_{t+1} ∼ P(· | x_t, a_t, θ̃_t) while x_{t+1} ∼ P(· | x_t, a_t, θ*); thus, from the tower rule, we have\n\nE[Δ_t] = E[ h_t(x_{t+1}) − h_t(x̃_{t+1}) ].\n\nThe following two lemmas bound Σ_{t=1}^T E[Δ_t] under Assumptions A1 and A2.\nLemma 3 Under Assumption A1, let m be the number of schedules up to time T; we can show:\n\nE[ Σ_{t=1}^T Δ_t ] ≤ C H √( T E[ Σ_{j=1}^m M_j (θ* − θ̃_j)^2 ] ),\n\nwhere M_j is the number of steps in the jth episode.\nLemma 4 Given Assumption A2 we can show:\n\nE[ Σ_{j=1}^m M_j (θ* − θ̃_j)^2 ] ≤ 2 C′ log2 T.\n\nThus,\n\nE[ Σ_{t=1}^T Δ_t ] ≤ C H √(2 C′ T log2 T) = O(√T log T).\n\nCombining the above results, we have\n\nR_T ≤ H log2(T) + C H √(2 C′ T log2 T) + H = O(C H √(C′ T) log T).\n\nThis concludes the proof.\n\n5 Experiments\n\nIn this section we compare through simulations the performance of the DS-PSRL algorithm with the latest PSRL algorithm, called Thompson Sampling with Dynamic Episodes (TSDE) [Ouyang et al., 2017b]. We experimented with the RiverSwim environment [Strehl and Littman, 2008], which was the domain used to show how TSDE outperforms all previously known algorithms in [Ouyang et al., 2017b]. The RiverSwim example models an agent swimming in a river who can choose to swim either left or right. 
The MDP consists of K states arranged in a chain, with the agent starting in the leftmost state (s = 1). If the agent decides to move left, i.e., with the river current, then it always succeeds, but if it decides to move right it may ‘fail’ with some probability. The reward function is given by: r(s, a) = 5 if s = 1, a = left; r(s, a) = 10000 if s = K, a = right; and r(s, a) = 0 otherwise.\n\n5.1 Scalar Parametrization\n\nFor scalar parametrization, a single scalar value defines the transition dynamics of the whole MDP. We did two types of experiments. In the first experiment the transition dynamics (or fail probability) were the same for all states for a given scalar value. In the second experiment we allowed a single scalar value to define different fail probabilities for different states. We assumed two probabilities of failure, a high probability P1 and a low probability P2, and two scalar values {θ1, θ2}. We compared an algorithm that switches every time step, which we call t-mod-1, with the TSDE and DS-PSRL algorithms. We assumed the true model of the world was θ* = θ2 and that the agent starts in the leftmost state.\nIn the first experiment, θ1 sets P1 to be the fail probability for all states and θ2 sets P2 to be the fail probability for all states. For θ1 the optimal policy was to go left for the states closer to the left end and right for the states closer to the right end. For θ2 the optimal policy was to always go right. The results are shown in Figure 2(a), where all schedules quickly learn to optimize the reward.\nIn the second experiment, θ1 sets P1 to be the fail probability for all states, while θ2 sets P1 for the first few states on the left end and P2 for the remaining states. The optimal policies were similar to the first experiment. 
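For concreteness, the RiverSwim chain and reward described in this section can be sketched in a few lines of Python. This is a minimal sketch: the per-state fail probability plays the role of the scalar parametrization, and the standard benchmark's failure transition (which may also slip the agent left) is simplified here to staying put.

```python
import random

def riverswim_step(s, a, K, fail_prob, rng=random):
    """One transition of a K-state RiverSwim-style chain, states 1..K.
    Moving left (with the current) always succeeds; moving right fails
    with probability fail_prob(s), in which case the agent stays put
    (a simplifying assumption relative to the standard benchmark)."""
    if a == 'left':
        return max(1, s - 1)
    if rng.random() < fail_prob(s):
        return s                    # failed attempt to swim right
    return min(K, s + 1)

def riverswim_reward(s, a, K):
    """Reward function from the text: 5 at the left end for 'left',
    10000 at the right end for 'right', and 0 otherwise."""
    if s == 1 and a == 'left':
        return 5
    if s == K and a == 'right':
        return 10000
    return 0
```

In the two scalar experiments, `fail_prob` would map every state to the same P1 or P2 (first experiment), or assign P1 to the leftmost states and P2 to the rest (second experiment).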
However, the transition dynamics are the same for states closer to the left end, while the policies conflict: for θ1 the optimal policy is to go left, and for θ2 it is to go right, in states closer to the left end. This leads to oscillating behavior when uncertainty about the true θ is high and policy switching is done frequently. The results are shown in Figure 2(b), where t-mod-1 and TSDE underperform significantly. Nonetheless, when the policy is switched only after multiple interactions, the agent is likely to end up in parts of the state space where it becomes easy to identify the true model of the world. The second experiment is an example where multi-step exploration is necessary.\n\n5.2 Multiple Parameters\n\nEven though our theoretical analysis does not account for the case with multiple parameters, we tested our algorithm empirically with multiple parameters. We assumed a Dirichlet prior for every state-action pair. The initial parameters of the priors were set to one (uniform) for the non-zero transition probabilities of the RiverSwim problem and zero otherwise. Updating the posterior in this case is equivalent to updating the parameters after every transition. We did not compare with the t-mod-1 schedule, due to the computational cost of sampling and solving an MDP at every time step. 
Unlike the scalar case, we cannot define a small finite number of values for which we can pre-compute the MDP policies. The ground-truth model used was θ2 from the second scalar experiment. Our results are shown in Figures 3(a) and 3(b). DS-PSRL performs better than TSDE as we increase the number of parameters.\n\n[Figure 2: average reward vs. time; (a) #switches DS-PSRL: 14, TSDE: 159, #states: 50, #param: 1; (b) #switches DS-PSRL: 14, TSDE: 315, #states: 50, #param: 1.]\nFigure 2: When multi-step exploration is necessary DS-PSRL outperforms.\n\n[Figure 3: average reward vs. time; (a) #switches DS-PSRL: 14, TSDE: 274, #states: 15, #param: 43; (b) #switches DS-PSRL: 14, TSDE: 301, #states: 20, #param: 58; (c) average cost on the LQ problem, #switches DS-PSRL: 10, TSDE: 60.]\nFigure 3: Multiple parameters (a,b) and continuous domain (c).\n\n5.3 Continuous Domains\n\nIn a final experiment we tested the ability of the DS-PSRL algorithm in continuous state and action domains. 
Specifically, we implemented the discrete infinite-horizon linear quadratic (LQ) problem of Abbasi-Yadkori and Szepesvári [2015, 2011]:\n\nx_{t+1} = A* x_t + B* u_t + w_{t+1} and c_t = x_t^T Q x_t + u_t^T R u_t,\n\nwhere t = 0, 1, . . ., u_t ∈ R^d is the control at time t, x_t ∈ R^n is the state at time t, c_t ∈ R is the cost at time t, w_{t+1} is the ‘noise’, A* ∈ R^{n×n} and B* ∈ R^{n×d} are unknown matrices, while Q ∈ R^{n×n} and R ∈ R^{d×d} are known (positive definite) matrices. The problem is to design a controller based on past observations to minimize the average expected cost. Uncertainty is modeled as a multivariate normal distribution. In our experiment we set n = 2 and d = 2.\nWe compared DS-PSRL with t-mod-1 and a recent TSDE algorithm for learning-based control of unknown linear systems with Thompson sampling [Ouyang et al., 2017a]. This version of TSDE uses two dynamic conditions. The first condition is the same as in the discrete case, and activates when the episode length increases by one over the previous episode. The second condition activates when the determinant of the sample covariance matrix falls below half of its previous value. All algorithms quickly learn the optimal A* and B*, as shown in Figure 3(c). The fact that switching every time step works well indicates that this problem does not require multi-step exploration.\n\n6 Application to Sequential Recommendations\n\nWith ‘sequential recommendations’ we refer to the problem where a system recommends various ‘items’ to a person over time to achieve a long-term objective. One example is a recommendation system at a website that recommends various offers. Another example is a tutorial recommendation system, where the sequence of tutorials is important in advancing the user from novice to expert over time. 
Finally, consider a points-of-interest (POI) recommendation system, where the system recommends various locations for a person to visit in a city, or attractions in a theme park. Personalized sequential recommendations are not sufficiently discussed in the literature and are practically non-existent in the industry. This is due to the increased difficulty of accurately modeling long-term user behavior and non-myopic decision making. Part of the difficulty arises from the fact that there may not be a previous sequential recommendation system deployed for data collection, otherwise known as the cold-start problem.\nFortunately, there is an abundance of sequential data in the real world. These data are usually ‘passive’ in that they do not include past recommendations. A practical approach that learns from passive data was proposed in Theocharous et al. [2017]. The idea is to first learn a model from passive data that predicts the next activity given the history of activities. This can be thought of as the ‘no-recommendation’ or passive model. To create actions for recommending the various activities, the authors perturb the passive model. Each perturbed model increases the probability of following the recommendations by a different amount. This leads to a set of models, each one with a different ‘propensity to listen’. In effect, they used a single ‘propensity to listen’ parameter to turn a passive model into a set of active models. When there are multiple models, one can use online algorithms, such as posterior sampling for reinforcement learning (PSRL), to identify the best model for a new user [Strens, 2000; Osband et al., 2013]. In fact, the algorithm used in Theocharous et al. [2017] was a deterministic-schedule PSRL algorithm. However, there was no theoretical analysis. 
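A minimal sketch of this ‘propensity to listen’ perturbation, assuming the passive model is available as a next-item distribution (the helper name is hypothetical): the recommended item's probability is raised to the power 1/θ, which increases it since probabilities are below one, and the remaining probabilities are rescaled so the result is still a distribution.

```python
def perturb_passive_model(p_next, a, theta):
    """Turn a passive next-item distribution p_next (dict item -> prob)
    into an active model for recommended item a, with scalar
    'propensity to listen' theta >= 1. The recommended item's
    probability is boosted to p^(1/theta); the rest are divided by a
    normalizing factor z so the probabilities still sum to one."""
    boosted = p_next[a] ** (1.0 / theta)
    rest = sum(p for s, p in p_next.items() if s != a)
    z = rest / (1.0 - boosted)          # normalizing factor
    return {s: (boosted if s == a else p / z) for s, p in p_next.items()}
```

Note that θ = 1 recovers the passive model exactly, and larger θ boosts the recommended item more strongly, matching the intuition of a family of models indexed by a single scalar.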
The perturbation function used was the following:\n\nP(s | X, a, θ) = P(s | X)^{1/θ} if a = s, and P(s | X)/z(θ) otherwise,   (5)\n\nwhere s is a POI, X = (s_1, s_2, . . . , s_t) is a history of POIs, and z(θ) = Σ_{s≠a} P(s | X) / (1 − P(s = a | X)^{1/θ}) is a normalizing factor. Here we show how this model satisfies both assumptions of our regret analysis.\n\nLipschitz Dynamics We first prove that the dynamics are Lipschitz continuous:\nLemma 5 (Lipschitz Continuity) Assume the dynamics are given by Equation 5. Then for all θ, θ′ ≥ 1 and all X and a, we have\n\n‖P(· | X, a, θ) − P(· | X, a, θ′)‖_1 ≤ (2/e) |θ − θ′|.\n\nPlease refer to Appendix D for the proof of this lemma.\n\nConcentrating Posterior As detailed in Appendix E (see Lemma 6), we can also show that Assumption A2 holds in this POI recommendation example. Specifically, we can show that under mild technical conditions, we have\n\nmax_j E[ N_{j−1} (θ* − θ̃_j)^2 ] = O(1).\n\n7 Summary and Conclusions\n\nWe proposed a practical general case PSRL algorithm, called DS-PSRL, with provable guarantees. The algorithm has similar regret to the state of the art. However, our result is more generally applicable to continuous state-action problems; when the dynamics of the system are parametrized by a scalar, our regret is independent of the number of states. In addition, our algorithm is practical. The algorithm provides for generalization, and uses a deterministic policy switching schedule of logarithmic order, which is independent of the true model of the world. This leads to efficiency in sample, space, and time complexity. We demonstrated empirically how the algorithm outperforms state-of-the-art PSRL algorithms. Finally, we showed how the assumptions satisfy a sensible parametrization for a large class of problems in sequential recommendations.\n\nReferences\nYasin Abbasi-Yadkori and Csaba Szepesvári. 
Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.\nYasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11, 2015.\nShipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, 2017.\nPeter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42, 2009.\nDimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA, 1995.\nRonen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.\nAditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, pages 861–898, 2015.\nThomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.\nIan Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In NIPS, pages 1466–1474, 2014.\nIan Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016.\nIan Osband, Dan Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.\nYi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based control of unknown linear systems with Thompson sampling. arXiv preprint arXiv:1709.04047, 2017.\nYi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, 2017.\nAlexander L. Strehl and Michael L. Littman. 
An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008. Learning Theory 2005.\nMalcolm Strens. A Bayesian framework for reinforcement learning. In ICML, pages 943–950, 2000.\nRichard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.\nGeorgios Theocharous, Nikos Vlassis, and Zheng Wen. An interactive points of interest guidance system. In IUI Companion, pages 49–52. ACM, 2017.\nWilliam R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.\n", "award": [], "sourceid": 3793, "authors": [{"given_name": "Georgios", "family_name": "Theocharous", "institution": "Adobe Research"}, {"given_name": "Zheng", "family_name": "Wen", "institution": "Adobe Research"}, {"given_name": "Yasin", "family_name": "Abbasi Yadkori", "institution": "Adobe Research"}, {"given_name": "Nikos", "family_name": "Vlassis", "institution": "Netflix"}]}