{"title": "Large Scale Markov Decision Processes with Changing Rewards", "book": "Advances in Neural Information Processing Systems", "page_first": 2340, "page_last": 2350, "abstract": "We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of $O( \\sqrt{\\tau (\\ln|S|+\\ln|A|)T}\\ln(T))$, where $S$ is the state space, $A$ is the action space, $\\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d\\ll|S|$, we propose a modified algorithm with a computational complexity polynomial in $d$ and independent of $|S|$. We also prove a regret bound for this modified algorithm, which to the best of our knowledge, is the first $\\tilde{O}(\\sqrt{T})$ regret bound in the large-scale MDP setting with adversarially changing rewards.", "full_text": "Large Scale Markov Decision Processes with\n\nChanging Rewards\n\nAdrian Rivera Cardoso, He Wang\n\nSchool of Industrial and Systems Engineering\n\nGeorgia Institute of Technology\n\nadrian.riv@gatech.edu, he.wang@isye.gatech.edu\n\nHuan Xu\n\nAlibaba Group\n\nhuan.xu@alibaba-inc.com\n\nAbstract\n\nWe consider Markov Decision Processes (MDPs) where the rewards are unknown\nand may change in an adversarial manner. We provide an algorithm that achieves a\n\nregret bound of O((cid:112)\u03c4 (ln|S| + ln|A|)T ln(T )), where S is the state space, A is\n\nthe action space, \u03c4 is the mixing time of the MDP, and T is the number of periods.\nThe algorithm\u2019s computational complexity is polynomial in |S| and |A|. 
We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d \ll |S|$, we propose a modified algorithm with a computational complexity polynomial in $d$ and independent of $|S|$. We also prove a regret bound for this modified algorithm, which to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret bound in the large-scale MDP setting with adversarially changing rewards.

1 Introduction

In this paper, we study Markov Decision Processes (hereafter MDPs) with arbitrarily varying rewards. MDP provides a general mathematical framework for modeling sequential decision making under uncertainty [8, 24, 35]. In the standard MDP setting, if the process is in some state $s$, the decision maker takes an action $a$ and receives an expected reward of $r(s, a)$. The process then randomly enters a new state according to some known transition probability. In particular, the standard MDP model assumes that the decision maker has complete knowledge of the reward function $r(s, a)$, which does not change over time.

Over the past two decades, there has been much interest in sequential learning and decision making in an unknown and possibly adversarial environment. A wide range of sequential learning problems can be modeled using the framework of Online Convex Optimization (OCO) [45, 20]. In the OCO setting, the decision maker plays a repeated game against an adversary for a given number of rounds. At the beginning of each round indexed by $t$, the decision maker chooses an action $a_t$ from a convex compact set $\mathcal{A}$ and the adversary chooses a concave reward function $r_t(\cdot)$, hence a reward of $r_t(a_t)$ is received. After observing the realized reward function, the decision maker chooses its next action $a_{t+1}$ and so on.
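The round structure of this repeated game is easy to simulate. The sketch below is our own toy illustration, not from the paper: the decision set is the interval $[0, 1]$, the adversary plays linear reward functions $r_t(a) = c_t a$, and the learner takes a projected gradient step after each round.

```python
# Minimal OCO protocol: projected online gradient steps on the interval
# [0, 1] against linear rewards r_t(a) = c_t * a. The decision set,
# reward sequence, and step size are illustrative choices.

def online_gradient_ascent(reward_slopes, step=0.1):
    """Play a_t, collect r_t(a_t), take a gradient step, project to [0, 1]."""
    a = 0.5  # initial action
    total_reward = 0.0
    for c in reward_slopes:          # c is the slope of r_t(a) = c * a
        total_reward += c * a        # reward collected this round
        a = min(1.0, max(0.0, a + step * c))  # gradient step + projection
    return total_reward

# Adversarial-looking slopes; the best fixed action in hindsight is a = 1
# because the slopes sum to a positive number.
slopes = [1, -1, 1, 1, -1, 1] * 50
alg = online_gradient_ascent(slopes)
best_fixed = max(sum(slopes) * a for a in (0.0, 1.0))
```

The gap `best_fixed - alg` is exactly the regret defined next; for this sequence it stays small relative to the horizon $T = 300$.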
Since the decision maker does not know the future reward functions, its goal is to achieve a small regret; that is, the cumulative reward earned throughout the game should be close to the cumulative reward if the decision maker had been given the benefit of hindsight to choose a fixed action. We can express the regret for $T$ rounds as
$$\text{Regret}(T) = \max_{a \in \mathcal{A}} \sum_{t=1}^{T} r_t(a) - \sum_{t=1}^{T} r_t(a_t).$$

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The OCO model has many applications such as universal portfolios [13, 27, 23], online shortest path [38], and online submodular minimization [22]. It is also closely related with areas such as convex optimization [21, 7] and game theory [10]. There are many algorithms that guarantee sublinear regret, e.g., Online Gradient Descent [45], Perturbed Follow the Leader [28], and Regularized Follow the Leader [37, 4]. Compared with the MDP setting, the main difference is that in OCO there is no notion of states; however, the payoffs may be chosen by an adversary.

In this work, we study a general problem framework that unites MDP and OCO, which we call the Online MDP problem. More specifically, we consider MDPs where the transition probabilities are known but the rewards are sequentially chosen by an adversary. We list below some canonical motivating examples that can be modeled as Online MDPs.

• Adversarial Multi-Armed Bandits with Constraints [43]: We can generalize the adversarial multi-armed bandits problem with $k$ arms (see Auer et al. [5]) with various constraints such as: restricting the number of times that an arm can be chosen in a given time interval, limiting how we switch between arms, etc. These constraints can be captured easily by defining appropriate states in the Online MDP.

• The Paging Problem [17]: Suppose we are given $n$ pages. A memory can hold at most $k$ ($k < n$) of them.
An arbitrary sequence of paging requests arrives. A page request is a hit if the associated page is in memory, and is a miss otherwise. After each request, the decision maker may swap any page in memory by paying some cost. Note that the state of the memory and the swapping decisions can be modeled using an MDP. The decision maker's goal is to maximize the number of hits minus the switching costs.

• The k-Server Problem [29, 17]: In this classical problem in computer science, there are $k$ servers, represented as points in a metric space. Requests arrive to the metric space, which are also represented as points. As each request arrives, the decision maker can choose to move one of the servers to the requested point. The goal is to minimize the total distance all servers move. If the arrivals of requests are adversarial, this problem can be modeled as an Online MDP problem, where the state represents the position of the servers.

Notice that in all of the problems above, the transition probabilities are known, while the adversarial rewards/costs are observed by the decision maker sequentially after each decision epoch. Moreover, in each of these Online MDP problems, the size of the state space may grow exponentially with the number $k$. Some other noteworthy examples are the stochastic inventory control problem [35] and some server queuing problems [14, 3].

1.1 Main Results

We propose a new computationally efficient algorithm that achieves near optimal regret for the Online MDP problem. Our algorithm is based on the (dual) linear programming formulation of infinite-horizon average reward MDPs, which uses the occupancy measure of state-action pairs as decision variables.
This approach differs from other papers that have studied the Online MDP problem previously; see the review in §1.2.

We prove that the algorithm's regret is bounded by $O(\tau + \sqrt{\tau T(\ln|S| + \ln|A|)}\ln(T))$, where $S$ denotes the state space, $A$ denotes the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. Notice that this regret bound depends logarithmically on the size of the state and action space. The algorithm solves a regularized linear program in each period with poly(|S||A|) complexity. The regret bound and the computational complexity compare favorably to the existing methods, which are summarized in §1.2.

We then extend our results to the case where the state space $S$ is extremely large, so that poly(|S||A|) computational complexity is impractical. We assume the state-action occupancy measures associated with stationary policies are approximated with a linear architecture of dimension $d \ll |S|$. We design an approximate algorithm combining several innovative techniques for solving large scale MDPs inspired by [2, 3]. A salient feature of this algorithm is that its computational complexity does not depend on the size of the state space but instead on the number of features $d$. The algorithm has a regret bound $O(c_{S,A}(\ln|S| + \ln|A|)\sqrt{\tau T}\ln T)$, where $c_{S,A}$ is a problem dependent constant. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{T})$ regret result for large scale Online MDPs.

1.2 Related Work

The history of MDP goes back to the seminal work of Bellman [6] and Howard [24] from the 1950's. Some classic algorithms for solving MDPs include policy iteration, value iteration, policy gradient, Q-learning and their approximate versions (see [35, 8, 9] for an excellent discussion). In
In\nthis paper, we will focus on a relatively less used approach, which is based on \ufb01nding the occupancy\nmeasure using linear programming, as done recently in [12, 39, 2] to solve MDPs with static rewards\n(see more details in Section 3.1). To deal with the curse of dimensionality, Chen et al. [12] uses\nbilinear functions to approximate the occupancy measures and Abbasi-Yadkori et al. [2] uses a linear\napproximation.\nThe Online MDP problem was \ufb01rst studied a decade ago by [43, 17]. Even-Dar et al. [17] developed\n\nno regret algorithms where the bound scales as O(\u03c4 2(cid:112)T ln(|A|)), where \u03c4 is the mixing time de\ufb01ned\n\nin \u00a72. Their method runs an expert algorithm (e.g. Weighted Majority [31]) on every state where the\nactions are the experts. However, the authors did not consider the case with large state space in their\npaper. Yu et al. [43] proposed a more computationally ef\ufb01cient algorithm using a variant of Follow\nthe Perturbed Leader [28], but unfortunately their regret bound becomes O(|S||A|2\u03c4 T 3/4+\u0001). They\nalso considered approximation algorithm for large state space, but did not establish an exact regret\nbound. The work most closely related to ours is that from Dick et al. [15], where the authors also use\na linear programming formulation of MDP similar to ours. However, there seem to be some gaps\nin the proof of their results.1 That issue aside, in order to solve large-scale MDPs, their focus is to\nef\ufb01ciently solve the quadratic sub-problems that de\ufb01ne their iterates ef\ufb01ciently. Instead, we leverage\nthe linear approximation scheme introduced in [2].\nMa et al. [32] also considers Online MDPs with large state space. Under some conditions, they show\nsublinear regret using a variant of approximate policy iteration, but the regret rate is left unspeci\ufb01ed\nin their paper. 
Zimin and Neu [44] considered a special class of MDPs called episodic MDPs and designed algorithms using the occupancy measure LP formulation. Following this line of work, Neu et al. [34] show that several reinforcement learning algorithms can be viewed as variants of Mirror Descent [25]; thus one can establish convergence properties of these algorithms. In [33], the authors considered Online MDPs with bandit feedback and provide an algorithm based on [17]'s with regret of $O(T^{2/3})$. Some other related work can be found in [11, 30, 26].

A more general problem than the Online MDP setting considered here is one where the MDP transition probabilities also change in an adversarial manner, which is beyond the scope of this paper. It is believed that this problem is much less tractable computationally; see the discussion in [16]. Yu and Mannor [42] studied MDPs with changing transition probabilities, although [33] questions the correctness of their result, as the regret obtained seems to break a lower bound. In [19], the authors use a sliding window approach under a particular definition of regret. Abbasi-Yadkori et al. [1] achieved sublinear regret with changing transition probabilities when compared against a restricted policy class.

2 Problem Formulation: Online MDP

We consider a general Markov Decision Process (MDP) with known transition probabilities but unknown and adversarially chosen rewards. Let $S$ denote the set of possible states, and $A$ denote the set of actions. (For notational simplicity, we assume the set of actions a player can take is the same for all states, but this assumption can be relaxed easily.) At each period $t \in [T]$, if the system is in state $s_t \in S$, the decision maker chooses an action $a_t \in A$ and collects a reward $r_t(s_t, a_t)$. Here, $r_t : S \times A \to [-1, 1]$ denotes a reward function for period $t$.
We assume that the sequence of reward functions $\{r_t\}_{t=1}^T$ is initially unknown to the decision maker. The function $r_t$ is revealed only after the action $a_t$ has been chosen. We allow the sequence $\{r_t\}_{t=1}^T$ to be chosen by an adaptive adversary, meaning $r_t$ can be chosen using the history $\{s_i\}_{i=1}^{t}$ and $\{a_i\}_{i=1}^{t-1}$. In particular, the adversary does not observe the action $a_t$ when choosing $r_t$. After $a_t$ is chosen, the system then proceeds to state $s_{t+1}$ in the next period with probability $P(s_{t+1}|s_t, a_t)$. We assume the decision maker has complete knowledge of the transition probabilities given by $P : S \times A \to \Delta_S$.

Suppose that the initial state of the MDP follows $s_1 \sim \nu_1$, where $\nu_1$ is a probability distribution over $S$. The objective of the decision maker is to choose a sequence of actions based on the history of states and rewards observed, such that the cumulative reward in $T$ periods is close to that of the optimal offline static policy. Formally, let $\pi$ denote a stationary (possibly randomized) policy $\pi : S \to \Delta_A$, where $\Delta_A$ is the set of probability distributions over the action set $A$. Let $\Pi$ denote the set of all stationary policies.

¹In particular, we believe the proof of Lemma 1 in [15] is incorrect. Equation (8) in their paper states that the regret relative to a policy is equal to the sum of a sequence of vector products; however, the dimensions of the vectors involved in these dot products are incompatible. By their definition, the variable $\nu_t$ is a vector of dimension $|S|$, which is being multiplied with a loss vector of dimension $|S||A|$.
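A stationary randomized policy and the cumulative reward it earns are easy to simulate. The two-state MDP below is our own toy illustration (not an example from the paper): at each period the policy samples an action from $\pi(\cdot\,|\,s)$, a reward is collected, and the state transitions according to $P$.

```python
import random

# Toy two-state, two-action MDP used only for illustration.
# P[s][a] is the distribution over next states; r(s, a) is a fixed reward.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.2, (1, 1): 0.5}

# A stationary randomized policy pi : S -> Delta_A.
pi = {0: [0.7, 0.3], 1: [0.4, 0.6]}

def run(T, seed=0):
    """Simulate T periods starting from s = 0; return the total reward."""
    rng = random.Random(seed)
    s, total = 0, 0.0
    for _ in range(T):
        a = rng.choices([0, 1], weights=pi[s])[0]   # sample a ~ pi(. | s)
        total += r[(s, a)]                          # collect reward
        s = rng.choices([0, 1], weights=P[s][a])[0] # transition
    return total

avg = run(10_000) / 10_000  # long-run average reward of policy pi
```

Replacing the fixed table `r` with a sequence $r_t$ revealed after each action gives exactly the Online MDP protocol above.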
We aim to find an algorithm that minimizes
$$\text{MDP-Regret}(T) \triangleq \sup_{\pi\in\Pi} R(T, \pi), \quad \text{with } R(T, \pi) \triangleq \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s^\pi_t, a^\pi_t)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big], \qquad (1)$$
where the expectations are taken with respect to random transitions of the MDP and (possibly) external randomization of the algorithm.

3 Preliminaries

Next, we provide additional notation for the MDP. Let $P^\pi_{s,s'} \triangleq P(s' \mid s, \pi(s))$ be the probability of transitioning from state $s$ to $s'$ given a policy $\pi$. Let $P^\pi$ be an $|S| \times |S|$ matrix with entries $P^\pi_{s,s'}$ ($\forall s, s' \in S$). We use the row vector $\nu_t \in \Delta_S$ to denote the probability distribution over states at time $t$. Let $\nu^\pi_{t+1}$ be the distribution over states at time $t+1$ under policy $\pi$, given by $\nu^\pi_{t+1} = \nu_t P^\pi$. Let $\nu^\pi_{st}$ denote the stationary distribution for policy $\pi$, which satisfies the linear equation $\nu^\pi_{st} = \nu^\pi_{st} P^\pi$.

We assume the following condition on the convergence to the stationary distribution, which is commonly used in the MDP literature [see 43, 17, 33].

Assumption 1. There exists a real number $\tau \geq 0$ such that for any policy $\pi \in \Pi$ and any pair of distributions $\nu, \nu' \in \Delta_S$, it holds that $\|\nu P^\pi - \nu' P^\pi\|_1 \leq e^{-\frac{1}{\tau}}\|\nu - \nu'\|_1$.

We refer to $\tau$ in Assumption 1 as the mixing time, which measures the convergence speed to the stationary distribution. In particular, the assumption implies that $\nu^\pi_{st}$ is unique for a given policy $\pi$.

We use $\mu(s, a)$ to denote the proportion of time that the MDP visits state-action pair $(s, a)$ in the long run. We call $\mu^\pi \in \mathbb{R}^{|S|\times|A|}$ the occupancy measure of policy $\pi$.
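For a fixed policy, the stationary distribution and the occupancy measure can be computed numerically. A small sketch on a toy chain of our own (not from the paper): iterate $\nu \leftarrow \nu P^\pi$, which converges geometrically under Assumption 1, then set $\mu^\pi(s, a) = \nu^\pi_{st}(s)\,\pi(a \mid s)$.

```python
# Toy illustration: compute the stationary distribution nu_st of a fixed
# policy's transition matrix P_pi by power iteration (nu <- nu * P_pi),
# then form the occupancy measure mu(s, a) = nu_st(s) * pi(a | s).
P_pi = [[0.8, 0.2],
        [0.3, 0.7]]                  # 2-state chain induced by policy pi
pi = [[0.6, 0.4], [0.5, 0.5]]        # pi(a | s) for two actions

nu = [0.5, 0.5]
for _ in range(200):                 # geometric convergence under mixing
    nu = [sum(nu[s] * P_pi[s][s2] for s in range(2)) for s2 in range(2)]

mu = [[nu[s] * pi[s][a] for a in range(2)] for s in range(2)]
```

For this chain the fixed point of $\nu = \nu P^\pi$ is $(0.6, 0.4)$, and the entries of `mu` sum to 1, as an occupancy measure must.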
Let $\rho^\pi_t$ be the long-run average reward under policy $\pi$ when the reward function is fixed to be $r_t$ every period, i.e., $\rho^\pi_t \triangleq \lim_{T\to\infty} \frac{1}{T}\sum_{i=1}^{T} \mathbb{E}[r_t(s^\pi_i, a^\pi_i)]$. We define $\rho_t \triangleq \rho^{\pi_t}_t$, where $\pi_t$ is the policy selected by the decision maker at time $t$.

3.1 Linear Programming Formulation for the Average Reward MDP

Given a reward function $r : S \times A \to [-1, 1]$, suppose one wants to find a policy $\pi$ that maximizes the long-run average reward: $\rho^* = \sup_\pi \lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r(s^\pi_t, a^\pi_t)$. Under Assumption 1, the Markov chain induced by any policy is ergodic and the long-run average reward is independent of the starting state (see [8]). It is well known that the optimal policy can be obtained by solving the Bellman equation, which in turn can be written as a linear program (in the dual form):
$$\begin{aligned} \rho^* = \max_{\mu} \ & \sum_{s\in S}\sum_{a\in A} \mu(s, a)\, r(s, a) \\ \text{s.t.} \ & \sum_{s\in S}\sum_{a\in A} \mu(s, a) P(s'|s, a) = \sum_{a\in A} \mu(s', a) \quad \forall s' \in S \\ & \sum_{s\in S}\sum_{a\in A} \mu(s, a) = 1, \quad \mu(s, a) \geq 0 \quad \forall s \in S, \forall a \in A. \end{aligned} \qquad (2)$$

Let $\mu^*$ be an optimal solution to the LP (2). We can construct an optimal policy of the MDP by defining $\pi^*(s, a) \triangleq \frac{\mu^*(s,a)}{\sum_{a\in A}\mu^*(s,a)}$ for all $s \in S$ such that $\sum_{a\in A}\mu^*(s,a) > 0$; for states where the
Let \u03bd\u2217\nst be the stationary distribution over states under this optimal policy.\nFor simplicity, we will write the \ufb01rst constraint of LP (2) in the matrix form as \u00b5(cid:62)(P \u2212 B) = 0,\nwhere B is an appropriately chosen matrix with 0-1 entries. We denote the feasible set of the above\nLP as \u2206M (cid:44) {\u00b5 \u2208 R : \u00b5 \u2265 0, \u00b5(cid:62)1 = 1, \u00b5(cid:62)(P \u2212 B) = 0}. The following de\ufb01nition will be used in\nthe analysis later.\nDe\ufb01nition 1. Let \u03b40 \u2265 0 be the largest real number such that for all \u03b4 \u2208 [0, \u03b40], the set \u2206M,\u03b4 (cid:44)\n{\u00b5 \u2208 R|S|\u00d7|A| : \u00b5 \u2265 \u03b4, \u00b5(cid:62)1 = 1, \u00b5(cid:62)(P \u2212 B) = 0} is nonempty.\n\n4 A Sublinear Regret Algorithm for Online MDP\n\nIn this section, we present an algorithm for the Online MDP problem. The algorithm is very intuitive\ngiven the LP formulation (2) for the static problem. As the rewards may change each round, the\nalgorithm simply treats the Online MDP problem as an Online Convex Optimization (OCO) problem\nwith reward functions {rt}T\n\nt=1 and decision set \u2206M .\n\ns\u2208S\n\na\u2208A \u00b5(s, a) ln(\u00b5(s, a))\n\nAlgorithm 1 (MDP-RFTL)\n\ninput: parameter \u03b4 > 0, \u03b7 > 0, regularization term R(\u00b5) =(cid:80)\nif(cid:80)\n\ninitialization: choose any \u00b51 \u2208 \u2206M,\u03b4 \u2282 R|S|\u00d7|A|\nfor t = 1, ...T do\n\nobserve current state st\n\n(cid:80)\n\na\u2208A \u00b5t(st, a) > 0 then\n\n(cid:80)\nchoose action a \u2208 A with probability\na \u00b5t(st,a).\nelse\nchoose action a \u2208 A with probability 1|A|\n(cid:104)(cid:104)ri, \u00b5(cid:105) \u2212 1\nend if\nobserve reward function rt \u2208 [\u22121, 1]|S||A|\nupdate \u00b5t+1 \u2190 arg max\u00b5\u2208\u2206M,\u03b4\n\n(cid:80)t\n\n\u00b5t(st,a)\n\ni=1\n\n(cid:105)\n\n\u03b7 R(\u00b5)\n\nend for\n\n\u00b5t(st,a)\n\nfar,(cid:80)t\n\nAt the beginning of each round t \u2208 [T ], the algorithm starts 
with an occupancy measure $\mu_t$. If the MDP is in state $s_t$, we play action $a \in A$ with probability $\mu_t(s_t, a)/\sum_a \mu_t(s_t, a)$. If the denominator is 0, the algorithm picks any action in $A$ with equal probability. After observing reward function $r_t$ and collecting reward $r_t(s_t, a_t)$, the algorithm changes the occupancy measure to $\mu_{t+1}$.

The new occupancy measure is chosen according to the Regularized Follow the Leader (RFTL) algorithm [37, 4]. RFTL chooses the best occupancy measure for the cumulative reward observed so far, $\sum_{i=1}^{t} r_i$, plus a regularization term $R(\mu)$. The regularization term forces the algorithm not to drastically change the occupancy measure from round to round. In particular, we choose $R(\mu)$ to be the entropy function. This choice will allow us to get $\ln(|S||A|)$ dependence in the regret bound. The complete algorithm is shown in Algorithm 1. The main result of this section is the following.

Theorem 1. Suppose $\{r_t\}_{t=1}^T$ is an arbitrary sequence of rewards such that $|r_t(s, a)| \leq 1$ for all $s \in S$ and $a \in A$. For $T \geq \ln^2(1/\delta_0)$, the MDP-RFTL algorithm with parameters $\eta = \sqrt{T\ln(|S||A|)/\tau}$, $\delta = e^{-\sqrt{T/\tau}}$ guarantees
$$\text{MDP-Regret}(T) \leq O\left(\tau + 4\sqrt{\tau T(\ln|S| + \ln|A|)}\,\ln(T)\right).$$

The regret bound in Theorem 1 is near optimal: a lower bound of $\Omega(\sqrt{T\ln|A|})$ exists for the problem of learning with expert advice [18, 20], a special case of Online MDP where the state space is a singleton. We note that the bound only depends logarithmically on the size of the state space and action space. The state-of-the-art regret bound for Online MDPs is that of [17], which is $O(\tau + \tau^2\sqrt{\ln(|A|)T})$. Compared to their result, our bound is better by a factor of $\tau^{3/2}$. However, our bound depends on $\sqrt{\ln|S| + \ln|A|}$, whereas the bound in [17] depends on $\sqrt{\ln|A|}$.
Both algorithms require poly(|S||A|) computation time, but are based on different ideas: the algorithm of [17] is based on expert algorithms and requires computing Q-functions at each time step, whereas our algorithm is based on RFTL. In the next section, we will show how to extend our algorithm to the case with large state space.

4.1 Proof Idea for Theorem 1

The key to analyzing our algorithm is to decompose the regret with respect to policy $\pi \in \Pi$ as follows:
$$R(T, \pi) = \left[\mathbb{E}\Big[\sum_{t=1}^{T} r_t(s^\pi_t, a^\pi_t)\Big] - \sum_{t=1}^{T}\rho^\pi_t\right] + \left[\sum_{t=1}^{T}\rho^\pi_t - \sum_{t=1}^{T}\rho_t\right] + \left[\sum_{t=1}^{T}\rho_t - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big]\right]. \qquad (3)$$

This decomposition was first used by [17]. We now give some intuition on why $R(T, \pi)$ should be sublinear. By the mixing condition in Assumption 1, the state distribution $\nu^\pi_t$ at time $t$ under a policy $\pi$ differs from the stationary distribution $\nu^\pi_{st}$ by at most $O(\tau)$. This result can be used to bound the first term of (3).

The second term of (3) can be related to the online convex optimization (OCO) problem through the linear programming formulation from §3.1. Notice that $\rho^\pi_t = \sum_{s\in S}\sum_{a\in A} \mu^\pi(s, a)\, r_t(s, a) = \langle\mu^\pi, r_t\rangle$, and $\rho_t = \sum_{s\in S}\sum_{a\in A} \mu^{\pi_t}(s, a)\, r_t(s, a) = \langle\mu^{\pi_t}, r_t\rangle$. Therefore, we have
$$\sum_{t=1}^{T}\rho^\pi_t - \sum_{t=1}^{T}\rho_t = \sum_{t=1}^{T}\langle\mu^\pi, r_t\rangle - \sum_{t=1}^{T}\langle\mu^{\pi_t}, r_t\rangle, \qquad (4)$$
which is exactly the regret quantity commonly studied in the OCO problem.
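The entropy-regularized follow-the-leader step is easy to see in isolation. The sketch below is our own simplification, not the paper's implementation: over the bare probability simplex, with the flow constraints $\mu^\top(P - B) = 0$ dropped, the RFTL update has a closed form, namely a softmax of the cumulative rewards.

```python
import math

def rftl_simplex(rewards, eta):
    """Entropy-regularized FTL over the probability simplex.

    argmax_{mu in simplex} <sum_i r_i, mu> - (1/eta) sum_j mu_j ln mu_j
    equals softmax(eta * cumulative reward); this closed form only holds
    without the occupancy-measure flow constraints.
    """
    n = len(rewards[0])
    cum = [0.0] * n
    total = 0.0
    mu = [1.0 / n] * n
    for r in rewards:
        total += sum(m * ri for m, ri in zip(mu, r))  # reward this round
        cum = [c + ri for c, ri in zip(cum, r)]
        z = [math.exp(eta * c) for c in cum]
        s = sum(z)
        mu = [zi / s for zi in z]                     # RFTL update
    return total, cum

T = 400
rewards = [[1.0, 0.0] if t % 3 else [0.0, 1.0] for t in range(T)]
alg, cum = rftl_simplex(rewards, eta=0.5)
best_fixed = max(cum)  # best fixed coordinate in hindsight
```

Here coordinate 0 is rewarded on two rounds out of every three, so `best_fixed` equals 266 over $T = 400$ rounds, and the algorithm's regret `best_fixed - alg` stays bounded by a small constant.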
We are thus seeking an algorithm that can bound $\max_{\mu^\pi\in\Delta_M}\sum_{t=1}^{T}\langle\mu^\pi, r_t\rangle - \sum_{t=1}^{T}\langle\mu^{\pi_t}, r_t\rangle$. In order to achieve logarithmic dependence on $|S|$ and $|A|$ in Theorem 1, we apply the RFTL algorithm, regularized by the negative entropy function $R(\mu)$. A technical challenge we faced in the analysis is that $R(\mu)$ is not Lipschitz continuous over the feasible set $\Delta_M$. So we design the algorithm to play in a shrunk set $\Delta_{M,\delta}$ for some $\delta > 0$ (see Definition 1), in which $R(\mu)$ is indeed Lipschitz continuous.

For the last term in (3), note that it is similar to the first term, albeit more complicated: the policy $\pi$ is fixed in the first term, but the policy $\pi_t$ used by the algorithm varies over time. To resolve this challenge, the key idea is to show that the policies do not change too much from round to round, so that the third term grows sublinearly in $T$. To this end, we use the property of the RFTL algorithm with a carefully chosen regularization parameter $\eta > 0$. The complete proof of Theorem 1 can be found in Appendix A.

5 Online MDPs with Large State Space

In the previous section, we designed an algorithm for Online MDP with sublinear regret. However, the computational complexity of our algorithm is O(poly(|S||A|)) per round. MDPs in practice often have an extremely large state space $S$ due to the curse of dimensionality [8], so computing the exact solution becomes impractical. In this section, we propose an approximate algorithm that can handle large state space.

5.1 Approximating Occupancy Measures and Regret Definition

We consider an approximation scheme introduced in [3] for standard MDPs. The idea is to use $d$ feature vectors (with $d \ll |S||A|$) to approximate occupancy measures $\mu \in \mathbb{R}^{|S|\times|A|}$.
Specifically, we approximate $\mu \approx \Phi\theta$, where $\Phi$ is a given matrix of dimension $|S||A| \times d$, and $\theta \in \Theta \triangleq \{\theta \in \mathbb{R}^d_+ : \|\theta\|_\infty \leq W\}$ for some positive constant $W$. As we will restrict the occupancy measures chosen by our algorithm to satisfy $\mu = \Phi\theta$, the definition of MDP-Regret (1) is too strong, as it compares against all stationary policies. Instead, we restrict the benchmark to be the set of policies $\Pi_\Phi$ that can be represented by the matrix $\Phi$, where
$$\Pi_\Phi \triangleq \{\pi \in \Pi : \text{there exists } \mu^\pi \in \Delta_M \text{ such that } \mu^\pi = \Phi\theta \text{ for some } \theta \in \Theta\}.$$

Our goal will now be to achieve sublinear $\Phi$-MDP-Regret, defined as
$$\Phi\text{-MDP-Regret}(T) \triangleq \max_{\pi\in\Pi_\Phi} \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s^\pi_t, a^\pi_t)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big], \qquad (5)$$
where the expectation is taken with respect to random state transitions of the MDP and the randomization used in the algorithm. Additionally, we want to make the computational complexity independent of $|S|$ and $|A|$.

Choice of Matrix $\Phi$ and Computational Efficiency. The columns of the matrix $\Phi \in \mathbb{R}^{|S||A|\times d}$ represent probability distributions over state-action pairs. The choice of $\Phi$ is problem-dependent, and a detailed discussion is beyond the scope of this paper. Abbasi-Yadkori et al. [3] show that for many applications such as the game of Tetris and queuing networks, $\Phi$ can be naturally chosen as a sparse matrix, which allows constant time access to entries of $\Phi$ and efficient dot product operations. We will assume such constant time access throughout our analysis. We refer readers to [3] for further
We refer readers to [3] for further\ndetails.\n\n5.2 The Approximate Algorithm\n\nThe algorithm we propose is built on MDP-RFTL, but is signi\ufb01cantly modi\ufb01ed in several aspects.\nWe start with key ideas on how and why we need to modify the previous algorithm, and then formally\npresent the new algorithm. To aid our analysis, we make the following de\ufb01nition.\nDe\ufb01nition 2. Let \u02dc\u03b40 \u2265 0 be the largest real number such that for all \u03b4 \u2208 [0, \u02dc\u03b40] the set \u2206\u03a6\n(cid:44) {\u00b5 \u2208\nR|S||A| : there exists \u03b8 \u2208 \u0398 such that \u00b5 = \u03a6\u03b8, \u00b5 \u2265 \u03b4, \u00b5(cid:62)1 = 1, \u00b5(cid:62)(P \u2212 B) = 0} is nonempty. We\nalso write \u2206\u03a6\nM\n\n(cid:44) \u2206\u03a6\n\nM,0.\n\nM,\u03b4\n\nt+1 \u2190 arg max\u03b8\u2208\u2206\u03a6\n\nM,\u03b4 de\ufb01ned above. We then use occupancy measures \u00b5\u03a6\u03b8\u2217\n\n(cid:80)t\nAs a \ufb01rst attempt, one could replace the shrunk set of occupancy measures \u2206M,\u03b4 in Algorithm 1\nwith \u2206\u03a6\nt+1 given by the RFTL\ni=1 [(cid:104)ri, \u00b5(cid:105) \u2212 (1/\u03b7)R(\u00b5)]. The same proof of Theorem 1\nalgorithm, i.e., \u03b8\u2217\nwould apply and guarantee a sublinear \u03a6-MDP-Regret. Unfortunately, replacing \u2206M,\u03b4 with \u2206\u03a6\ndoes not reduce the time complexity of computing the iterates {\u00b5\u03a6\u03b8\u2217\nt=1, which is still poly(|S||A|).\nTo tackle this challenge, we will not apply the RFTL algorithm exactly, but will instead obtain an\napproximate solution in poly(d) time. 
We relax the constraints $\mu \geq \delta$ and $\mu^\top(P - B) = 0$ that define the set $\Delta^\Phi_{M,\delta}$, and add the following penalty term to the objective function:
$$V(\theta) \triangleq -H_t\|(\Phi\theta)^\top(P - B)\|_1 - H_t\|\min\{\delta, \Phi\theta\}\|_1. \qquad (6)$$
Here, $\{H_t\}_{t=1}^T$ is a sequence of tuning parameters that will be specified in Theorem 2. Let $\Theta_\Phi \triangleq \{\theta \in \Theta : \mathbf{1}^\top(\Phi\theta) = 1\}$. Thus, the original RFTL step in Algorithm 1 now becomes
$$\max_{\theta\in\Theta_\Phi} c_{t,\eta}(\theta), \quad \text{where } c_{t,\eta}(\theta) \triangleq \sum_{i=1}^{t}\left[\langle r_i, \Phi\theta\rangle - \frac{1}{\eta}R^\delta(\Phi\theta)\right] + V(\theta). \qquad (7)$$

In the above function, we use a modified entropy function $R^\delta(\cdot)$ as the regularization term, because the standard entropy function has an infinite gradient at the origin. More specifically, let $R_{(s,a)}(\mu) \triangleq \mu(s, a)\ln(\mu(s, a))$ be the entropy function. We define $R^\delta(\mu) = \sum_{(s,a)} R^\delta_{(s,a)}(\mu(s, a))$, where
$$R^\delta_{(s,a)} \triangleq \begin{cases} R_{(s,a)}(\mu) & \text{if } \mu(s, a) \geq \delta \\ R_{(s,a)}(\delta) + \frac{d}{d\mu(s,a)}R_{(s,a)}(\delta)\,(\mu(s, a) - \delta) & \text{otherwise.} \end{cases} \qquad (8)$$

Since computing an exact gradient of the function $c_{t,\eta}(\cdot)$ would take O(|S||A|) time, we solve problem (7) by stochastic gradient ascent. The following lemma shows how to efficiently generate stochastic subgradients for $c_{t,\eta}$ via sampling.

Lemma 1. Let $q_1$ be any probability distribution over state-action pairs, and $q_2$ be any probability distribution over all states. Sample a pair $(s', a') \sim q_1$ and $s'' \sim q_2$.
The quantity
$$g_{s',a',s''}(\theta) = \sum_{i=1}^{t}\Phi^\top r_i - \frac{t}{\eta\, q_1(s', a')}\nabla_\theta R^\delta_{(s',a')}(\Phi\theta) - \frac{H_t}{q_2(s'')}[(P - B)^\top\Phi]_{s'',:}\,\mathrm{sign}\big([(P - B)^\top\Phi]_{s'',:}\theta\big) - \frac{H_t}{q_1(s', a')}\Phi_{(s',a'),:}\,\mathbb{I}\{\Phi_{(s',a'),:}\theta \leq \delta\}$$
satisfies $\mathbb{E}_{(s',a')\sim q_1,\, s''\sim q_2}[g_{s',a',s''}(\theta)\,|\,\theta] = \nabla_\theta c_{t,\eta}(\theta)$ for any $\theta \in \Theta$. Moreover, we have $\|g(\theta)\|_2 \leq t\sqrt{d} + H_t(C_1 + C_2) + \frac{t}{\eta}(1 + \ln(Wd) + |\ln(\delta)|)C_1$ w.p. 1, where
$$C_1 = \max_{(s,a)\in S\times A}\frac{\|\Phi_{(s,a),:}\|_2}{q_1(s, a)}, \qquad C_2 = \max_{s\in S}\frac{\|(P - B)^\top_{:,s}\Phi\|_2}{q_2(s)}. \qquad (9)$$

Putting everything together, we present the complete approximate algorithm for large state online MDPs in Algorithm 2.
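The estimator in Lemma 1 rests on a standard importance-sampling identity, which can be illustrated in isolation. The numbers below are our own toy example, not the lemma's exact gradient: a gradient that is a sum of per-coordinate terms can be estimated unbiasedly by sampling one coordinate and reweighting, so no pass over all $|S||A|$ coordinates is needed.

```python
import random

# Toy illustration of the sampling trick in Lemma 1: for a gradient of
# the form grad = sum_i v_i, drawing i ~ q and returning v_i / q(i)
# gives an unbiased estimate, E[v_i / q(i)] = sum_i v_i.
v = [0.5, -1.0, 2.0, 0.25]        # per-coordinate gradient terms
q = [0.1, 0.4, 0.4, 0.1]          # sampling distribution with q_i > 0

rng = random.Random(0)
draws = rng.choices(range(len(v)), weights=q, k=200_000)
estimate = sum(v[i] / q[i] for i in draws) / len(draws)
exact = sum(v)                    # = 1.75
```

The bound (9) reflects the price of this trick: the variance of each sampled term scales with $\|v_i\|/q(i)$, which is why the constants $C_1, C_2$ carry the ratios $\|\cdot\|_2 / q(\cdot)$.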
The algorithm uses Projected Stochastic Gradient Ascent (Algorithm 3) as a subroutine, which uses the sampling method in Lemma 1 to generate stochastic subgradients.

Algorithm 2 (LARGE-MDP-RFTL)
input: matrix $\Phi$; parameters $\eta, \delta > 0$; convex function $R_\delta(\mu)$; SGA step-size schedule $\{w_t\}_{t=0}^T$; penalty term parameters $\{H_t\}_{t=1}^T$
initialize: $\tilde\theta_1 \leftarrow \mathrm{PSGA}(-R_\delta(\Phi\theta) + V(\theta),\, \Theta^\Phi,\, w_0,\, K_0)$
for $t = 1, \ldots, T$ do
  observe current state $s_t$; play action $a$ with probability $\frac{[\Phi\tilde\theta_t]^+(s_t,a)}{\sum_{a\in A}[\Phi\tilde\theta_t]^+(s_t,a)}$
  observe $r_t \in [-1,1]^{|S||A|}$
  $\tilde\theta_{t+1} \leftarrow \mathrm{PSGA}\big(\sum_{i=1}^t [\langle r_i, \Phi\theta\rangle - \frac{1}{\eta}R_\delta(\Phi\theta)] + V(\theta),\, \Theta^\Phi,\, w_t,\, K_t\big)$
end for

Algorithm 3 Projected Stochastic Gradient Ascent: $\mathrm{PSGA}(f, X, w, K)$
input: concave objective function $f$; feasible set $X$; step size $w$; initial point $x_1 \in X$
for $k = 1, \ldots, K$ do
  compute a stochastic subgradient $g_k$ such that $\mathbb{E}[g_k] = \nabla f(x_k)$ using Lemma 1
  set $x_{k+1} \leftarrow P_X(x_k + w\, g_k)$
end for
output: $\frac{1}{K}\sum_{k=1}^K x_k$

5.3 Analysis of the Approximate Algorithm

We establish a regret bound for the LARGE-MDP-RFTL algorithm as follows.

Theorem 2. Suppose $\{r_t\}_{t=1}^T$ is an arbitrary sequence of rewards such that $|r_t(s,a)| \le 1$ for all $s \in S$ and $a \in A$. For $T \ge \ln^2(\frac{1}{\delta_0})$, LARGE-MDP-RFTL with parameters $\eta = \sqrt{\frac{T}{\tau}}$, $\delta = \frac{e^{-\sqrt{T}}}{\sqrt{d}\,W}$, $K(t) = \big[W^{3/2} t^2 d^{3/2} \tau^4 (C_1 + C_2)\, T^{3/2} \ln(WTd)\big]^2$, and $w_t = \frac{\sqrt{W}}{\sqrt{K(t)}\,\big(t\sqrt{d} + H_t(C_1+C_2) + \frac{t}{\eta}C_1\big)}$ guarantees that
$$\Phi\text{-MDP-Regret}(T) \le O\big(c_{S,A}\, \ln(|S||A|)\, \sqrt{\tau T}\, \ln(T)\big).$$
Here $c_{S,A}$ is a problem dependent constant.
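As a small sanity check of the PSGA subroutine of Algorithm 3, the sketch below runs it with exact gradients, a toy concave objective, and a standard sort-based Euclidean projection onto the probability simplex standing in for the projection onto $\Theta^\Phi$ (all of these simplifications are ours):

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto {y : y >= 0, sum(y) = 1} via the usual
    sort-and-threshold method."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > (css - 1))[0][-1]
    tau = (css[rho] - 1) / (rho + 1)
    return np.maximum(x - tau, 0.0)

def psga(grad, x0, w, K, project):
    """Algorithm 3: x_{k+1} = P_X(x_k + w * g_k), output (1/K) sum_k x_k."""
    x = x0
    avg = np.zeros_like(x0)
    for _ in range(K):
        avg += x
        x = project(x + w * grad(x))
    return avg / K

# Example: maximize the concave objective <r, x> - ||x||^2 over the simplex;
# its constrained maximizer is the projection of r/2 onto the simplex.
r = np.array([0.3, 0.9, 0.1])
grad = lambda x: r - 2.0 * x
x_star = psga(grad, np.full(3, 1.0 / 3.0), w=0.1, K=2000,
              project=project_simplex)
```

With these settings the averaged iterate lands close to `project_simplex(r / 2)`; in Algorithm 2 the exact gradient is replaced by the Lemma 1 sampler and the feasible set by $\Theta^\Phi$.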
The constants $C_1, C_2$ are defined in Lemma 1.

A salient feature of the LARGE-MDP-RFTL algorithm is that its computational complexity in each period is independent of the size of the state space $|S|$ and the size of the action space $|A|$, and thus the algorithm is amenable to large scale MDPs. In particular, in Theorem 2, the number of SGA iterations, $K(t)$, is polynomial in $d$ and independent of $|S|$ and $|A|$.

Compared to Theorem 1, we achieve a regret bound with a similar dependence on the number of periods $T$ and the mixing time $\tau$. The regret bound also depends on $\ln(|S|)$ and $\ln(|A|)$, with an additional constant term $c_{S,A}$. The constant comes from a projection problem (see details in Appendix B) and may grow with $|S|$ and $|A|$ in general. But for some MDP problems, $c_{S,A}$ can be bounded by an absolute constant: an example is the well-known (Markovian) multi-armed bandit problem [41]. For a more detailed discussion of the constant $c_{S,A}$, we refer readers to Appendix C.

Proof Idea for Theorem 2. Consider the MDP-RFTL iterates $\{\theta^*_t\}_{t=1}^T$ and the occupancy measures $\{\mu^{\Phi\theta^*_t}\}_{t=1}^T$ induced by following the policies $\{\Phi\theta^*_t\}_{t=1}^T$. Since $\theta^*_t \in \Delta^\Phi_{M,\delta}$, it holds that $\mu^{\Phi\theta^*_t} = \Phi\theta^*_t$ for all $t$. Thus, following the proof of Theorem 1, we could obtain the same $\Phi$-MDP-Regret bound as in Theorem 1 if we followed the policies $\{\Phi\theta^*_t\}_{t=1}^T$. However, computing $\theta^*_t$ takes $O(\mathrm{poly}(|S||A|))$ time. The crux of the proof of Theorem 2 is to show that the iterates $\{\Phi\tilde\theta_t\}_{t=1}^T$ of Algorithm 2 induce occupancy measures $\{\mu^{\Phi\tilde\theta_t}\}_{t=1}^T$ that are close to $\{\mu^{\Phi\theta^*_t}\}_{t=1}^T$. Since the algorithm has relaxed the constraints of $\Delta^\Phi_{M,\delta}$, in general we have $\Phi\tilde\theta_t \notin \Delta^\Phi_{M,\delta}$ and thus $\mu^{\Phi\tilde\theta_t} \ne \Phi\tilde\theta_t$. So we need to show that the distance between $\mu^{\Phi\theta^*_{t+1}}$ and $\mu^{\Phi\tilde\theta_{t+1}}$ is small.
Using the triangle inequality we have
$$\|\mu^{\Phi\theta^*_t} - \mu^{\Phi\tilde\theta_t}\|_1 \le \|\mu^{\Phi\theta^*_t} - P_{\Delta^\Phi_{M,\delta}}(\Phi\tilde\theta_t)\|_1 + \|P_{\Delta^\Phi_{M,\delta}}(\Phi\tilde\theta_t) - \Phi\tilde\theta_t\|_1 + \|\Phi\tilde\theta_t - \mu^{\Phi\tilde\theta_t}\|_1,$$
where $P_{\Delta^\Phi_{M,\delta}}(\cdot)$ denotes the Euclidean projection onto $\Delta^\Phi_{M,\delta}$. We then proceed to bound each term individually. We defer the details to Appendix B, as bounding each term requires lengthy proofs.

6 Conclusion

We consider Markov Decision Processes (MDPs) where the transition probabilities are known but the rewards are unknown and may change in an adversarial manner. We provide a simple online algorithm, which applies Regularized Follow the Leader (RFTL) to the linear programming formulation of the average reward MDP. The algorithm achieves a regret bound of $O(\sqrt{\tau(\ln|S| + \ln|A|)T}\,\ln(T))$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$ per period.

We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. We approximate the state-action occupancy measures with a linear architecture of dimension $d \ll |S||A|$. We then propose an approximate algorithm that relaxes the constraints in the RFTL algorithm and solves the relaxed problem by stochastic gradient ascent.
The computational complexity of this approximate algorithm is independent of the size of the state space $|S|$ and the size of the action space $|A|$. We prove a regret bound of $O(c_{S,A}\,\ln(|S||A|)\sqrt{\tau T}\,\ln(T))$ compared to the best static policy approximated by the linear architecture, where $c_{S,A}$ is a problem dependent constant. To the best of our knowledge, this is the first $\tilde O(\sqrt{T})$ regret bound for large scale MDPs with changing rewards.

References

[1] Y. Abbasi, P. L. Bartlett, V. Kanade, Y. Seldin, and C. Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pages 2508–2516, 2013.

[2] Y. Abbasi-Yadkori, P. L. Bartlett, X. Chen, and A. Malek. Large-scale Markov decision problems via the linear programming dual. arXiv preprint arXiv:1901.01992, 2019.

[3] Y. Abbasi-Yadkori, P. L. Bartlett, and A. Malek. Linear programming for large-scale Markov decision problems. In International Conference on Machine Learning, volume 32, pages 496–504. MIT Press, 2014.

[4] J. D. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Conference on Learning Theory, 2009.

[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[6] R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679–684, 1957.

[7] A. Ben-Tal, E. Hazan, T. Koren, and S. Mannor. Oracle-based robust optimization via online learning. Operations Research, 63(3):628–638, 2015.

[8] D. P. Bertsekas. Dynamic programming and optimal control, volume 2. Athena Scientific, Belmont, MA, 4th edition, 2012.

[9] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming.
Athena Scientific, Belmont, MA, 1996.

[10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

[11] K. Chatterjee. Markov decision processes with multiple long-run average objectives. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 473–484. Springer, 2007.

[12] Y. Chen, L. Li, and M. Wang. Scalable bilinear pi learning using state and action features. arXiv preprint arXiv:1804.10328, 2018.

[13] T. M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.

[14] D. P. De Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

[15] T. Dick, A. Gyorgy, and C. Szepesvari. Online learning in Markov decision processes with changing cost sequences. In International Conference on Machine Learning, pages 512–520, 2014.

[16] E. Even-Dar, S. M. Kakade, and Y. Mansour. Experts in a Markov decision process. In Advances in Neural Information Processing Systems, pages 401–408, 2005.

[17] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.

[18] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.

[19] P. Gajane, R. Ortner, and P. Auer. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. arXiv preprint arXiv:1805.10066, 2018.

[20] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[21] E. Hazan and S. Kale. An optimal algorithm for stochastic strongly-convex optimization. arXiv preprint arXiv:1006.2425, 2010.

[22] E. Hazan and S. Kale. Online submodular minimization.
Journal of Machine Learning Research, 13(Oct):2903–2922, 2012.

[23] D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth. On-line portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325–347, 1998.

[24] R. A. Howard. Dynamic programming and Markov processes. John Wiley, 1960.

[25] A. Juditsky, A. Nemirovski, et al. First order methods for nonsmooth convex large-scale optimization, I: General purpose methods. Optimization for Machine Learning, pages 121–148, 2011.

[26] S. Junges, N. Jansen, C. Dehnert, U. Topcu, and J.-P. Katoen. Safety-constrained reinforcement learning for MDPs. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 130–146. Springer, 2016.

[27] A. Kalai and S. Vempala. Efficient algorithms for universal portfolios. Journal of Machine Learning Research, 3(Nov):423–440, 2002.

[28] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[29] E. Koutsoupias. The k-server problem. Computer Science Review, 3(2):105–118, 2009.

[30] J. Křetínský, G. A. Pérez, and J.-F. Raskin. Learning-based mean-payoff optimization in an unknown MDP under omega-regular constraints. arXiv preprint arXiv:1804.08924, 2018.

[31] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[32] Y. Ma, H. Zhang, and M. Sugiyama. Online Markov decision processes with policy iteration. arXiv preprint arXiv:1510.04454, 2015.

[33] G. Neu, A. György, C. Szepesvári, and A. Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676–691, 2014.

[34] G. Neu, A. Jonsson, and V. Gómez. A unified view of entropy-regularized Markov decision processes.
arXiv preprint arXiv:1705.07798, 2017.

[35] M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[36] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.

[37] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[38] E. Takimoto and M. K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4(Oct):773–818, 2003.

[39] M. Wang. Primal-dual pi learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100, 2017.

[40] R. Weber et al. On the Gittins index for multiarmed bandits. The Annals of Applied Probability, 2(4):1024–1033, 1992.

[41] P. Whittle. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society: Series B (Methodological), 42(2):143–149, 1980.

[42] J. Y. Yu and S. Mannor. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In 2009 International Conference on Game Theory for Networks, pages 314–322. IEEE, 2009.

[43] J. Y. Yu, S. Mannor, and N. Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.

[44] A. Zimin and G. Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pages 1583–1591, 2013.

[45] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent.
In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.