NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID 1384: Large Scale Markov Decision Processes with Changing Rewards

### Reviewer 1

-- The rebuttal answered all my theoretical questions. I still feel that the work would be greatly improved by adding numerical experiments. --

This paper considers regret minimization in Markov decision processes (MDPs). In particular, the authors study a setting called "online MDPs", where the dynamics, that is, the transition probabilities, are known while the rewards are not. Regret minimization then refers to the idea of minimizing the regret given that rewards may be chosen/observed in an adversarial manner. The authors start with a (rather technical) introduction, present related work, and explain the main ideas based on concise preliminaries. Afterwards, an extension to large state spaces is provided, using approximate occupancy measures and thereby avoiding concrete state mappings.

In general, I found the paper very well written and easy to follow. The topic seems relevant; however, I would have liked a more thorough motivation for the setting where the transition probabilities are already known while the rewards are not. The authors name a few applications (on page 2), but no further explanation is given. That leads to the second problem: no experimental evaluation is provided. While I believe that the linear programming approach will work nicely, it would be good to see how the method performs in practice. That is even more true for the approximate approach for large state spaces.

Finally, the authors are not aware of some related work, see for instance:

- Chatterjee, Krishnendu. "Markov decision processes with multiple long-run average objectives." International Conference on Foundations of Software Technology and Theoretical Computer Science. Springer, Berlin, Heidelberg, 2007.
- Křetínský, Jan, Guillermo A. Pérez, and Jean-François Raskin. "Learning-based mean-payoff optimization in an unknown MDP under omega-regular constraints." arXiv preprint arXiv:1804.08924 (2018).
- A similar setting has also been considered in: Junges, Sebastian, et al. "Safety-constrained reinforcement learning for MDPs." International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, Berlin, Heidelberg, 2016.

I also have some technical questions:

1. Is an upper bound on the reward required? In that case, approaches (exploiting dual MDP formulations) from robust/uncertain MDPs could be interesting to compare with, see for instance: Wiesemann, Wolfram, Daniel Kuhn, and Berç Rustem. "Robust Markov decision processes." Mathematics of Operations Research 38.1 (2013): 153-183.
2. Algorithm 1 basically just performs online updates based on the occupancy measure so far, that is, on the history of the current policy. Is there not more information that could be exploited?
3. As far as I can see, a discount factor is not incorporated. Can the approaches be easily adapted?

### Reviewer 2

This paper studies algorithms for average-cost MDPs with discrete states and actions, known transition dynamics, bounded adversarial rewards, and exponentially fast mixing. The problem setting and initial regret bound were described in the work of [15]. That work and follow-up papers use an experts algorithm in each state, with losses corresponding to Q-functions. The algorithm proposed in this paper is different in that it works in the dual: it optimizes the state-action occupancy measure subject to dynamics constraints, using FTRL. Compared to [15], the obtained regret bound has an improved dependence on the mixing time, but it includes an extra ln|S| term. For MDPs with a large state space, the authors approximate the occupancy measure using a linear function. This requires several changes: relaxing the constraints and adding them as a penalty, and generating stochastic subgradients in order to reduce the computational complexity to be polynomial in the number of features rather than in |S| and |A|. The corresponding regret bound is with respect to a restricted class of policies, namely those whose occupancy measure can be expressed as a linear function of the same features.

Originality: The proposed algorithm and analysis follow multiple ideas from previous work. In the tabular case, the approach is quite similar to [13]. The authors say that there are some gaps in the proof of Lemma 1 in [13]. I looked at [13] and also contacted the authors, and it seems that this problem is in fact just a typo. Thus I think a clearer statement of the contribution in the context of the work of [13] is needed. The computationally efficient version for large MDPs with linear function approximation is more novel.

Quality: The paper seems technically correct; I did not verify all of the proofs.

Clarity: The paper is well-written and well-organized. However, the authors should clarify the difference between this work and [13], in terms of both the algorithm and the analysis.

It would also be useful to clarify the benefits, if any, of working in the dual (i.e., optimizing the state-occupancy measure). In the tabular case, it adds a dependence on ln|S|. In the large-MDP case, presumably one could similarly use a linear Q-function and not have to deal with dynamics constraints. As an aside, in the LQR problem with adversarial costs and known dynamics (https://arxiv.org/abs/1806.07104), there exist some advantages of working in the dual SDP: in the primal, feasible suboptimal solutions do not necessarily correspond to stabilizing controllers.

After reading the rebuttal: My overall opinion has not changed much. Yes, the error in [13] is about |S| vs. |S||A|. Given that the motivation for the paper is "a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions", I agree with the other reviewers that it would be useful to at least provide an example of such a problem, as well as some numerical simulations. Some of the practical examples the authors mention in the rebuttal and list in the introduction *do not* correspond to the studied setting. For example, in queueing networks, the cost function is not unknown/adversarial but in fact known, and a deterministic function of the state (it is the number of jobs in the system).
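To make the "working in the dual" point concrete, here is a small self-contained sketch (my own illustration, not the paper's code) of the occupancy-measure object: for a fixed policy on a random tabular MDP it satisfies the linear dynamics ("flow") constraints that the paper's FTRL iterates are restricted to, and a policy can always be recovered from it by conditioning.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 3

# Random transition kernel P[s, a, s'] and a fixed stochastic policy pi[s, a].
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)

# Markov chain induced by pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum("sa,sax->sx", pi, P)

# Stationary distribution d: the left eigenvector of P_pi for eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d /= d.sum()

# Occupancy measure mu(s, a) = d(s) * pi(a|s).
mu = d[:, None] * pi

# Dynamics constraints: sum_a mu(s', a) = sum_{s,a} P(s'|s, a) mu(s, a).
lhs = mu.sum(axis=1)
rhs = np.einsum("sa,sax->x", mu, P)
assert np.allclose(lhs, rhs)

# The policy is recovered from mu by conditioning: pi(a|s) = mu(s,a) / sum_a mu(s,a).
pi_rec = mu / mu.sum(axis=1, keepdims=True)
assert np.allclose(pi_rec, pi)
```

The linear constraints above define the polytope over which the paper's dual algorithm runs FTRL; the snippet only verifies the correspondence between policies and occupancy measures, not the learning algorithm itself.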

### Reviewer 3

Summary: This work continues a line of work on online Markov decision processes (online MDPs). Its main focus is to tackle online MDPs with very large state spaces. It considers the full-information setting, where the dynamics of the MDP are known, while the reward function may change adversarially at each time step but is revealed to the learner after each decision. The underlying MDP dynamics have a fast mixing property. The goal of the learner is to maximize the total reward over horizon T. The authors design two algorithms, MDP-RFTL and Large-MDP-RFTL.

The paper is written clearly, and the quality of the analysis is above average in my opinion. On the other hand, the topic of online MDPs has been considered by several previous works, e.g., [15] (Even-Dar et al. 2009), [28] (Neu et al. 2014), [13] (Dick et al. 2014), etc. In terms of scope, while several of the works above also considered the bandit setting, this paper only deals with the full-information setting. The analysis and the algorithms follow several techniques from [15] (Even-Dar et al. 2009), [3] (Abbasi-Yadkori), etc.

The first algorithm, MDP-RFTL, runs Regularized-Follow-The-Leader on the feasible set of stationary distributions of the MDP, as in [13] (Dick et al. 2014). Its regret bound is \tilde{O}(\sqrt{\tau(ln|S|+ln|A|)T}ln(T)), which has an extra dependence on ln|S| but at the same time a better dependence on \tau than the state-of-the-art bound in [15]. This algorithm requires knowledge of the mixing time \tau. The authors then design the approximate algorithm, Large-MDP-RFTL, applicable to large state spaces with computational complexity independent of the state-space size, with regret \tilde{O}(\sqrt{T}) against the best comparator in hindsight within a specified subspace of policies/stationary distributions. Large-MDP-RFTL also requires knowledge of the mixing time \tau.

The approximate algorithm uses a projected stochastic gradient ascent (PSGA) sub-routine to avoid computational complexity that depends on the size of the state and action spaces. However, the number of iterations K(t) needed by the PSGA sub-routine at each time step t has to be large enough: K(t) is of the order \tilde{O}(t^4\tau^8 T^3) (for the K(t) defined in line 621), which may be large when t, \tau, T are large. Could the authors give some discussion on this? Thanks. (Also, the K(t) specified in line 254 in Theorem 2 is different from that in the proof in line 621.)

Another issue is the unknown constant c, which bounds the infinity-norm of the optimal dual variable \lambda^* for the dual problem of Primal 1. The dependence of c on |S| and |A| is not very clear, which may affect the regret bound \tilde{O}(c\sqrt{T}). (Only a few special cases were discussed in Appendix B, including an example with c \leq 2|S|.) Perhaps more concrete examples are needed to better assess the quality of the regret bound in terms of c.

Considering the real-world applicability of Large-MDP-RFTL:
1) First, the theoretical mixing-time assumption may not be easy to impose on real-world problems (while we also assume the dynamics of the MDP are given). The mixing time \tau in addition appears in the Large-MDP-RFTL hyper-parameters H(t) and K(t), so in practice one must be able to estimate the mixing time accurately. If it can only be estimated approximately, a discussion of how this might affect the algorithm would be preferable. (More real-world examples in this online-MDP setting may also help to enhance the paper's impact.)
2) Second, the choice of \Phi should have an impact on Large-MDP-RFTL, since the comparator in hindsight is restricted to the subspace determined by \Phi. (Small regret does not rule out poor performance when the policy linear subspace is ill-chosen.) In retrospect, it may be nice to include some discussion on choosing the best linear subspace.
3) Third, the constant c in the regret bound \tilde{O}(c\sqrt{T}) for Large-MDP-RFTL is not clear, at the moment, in its dependence on the size of the state and action spaces.

Based on these concerns, it is perhaps preferable to supplement the theoretical results with some experiments for this online-MDP setting.

About the constraint set for \theta, there is an inconsistency in the definition of the constant W:
1) (W as a 1-norm bound for \theta): The constraint set \Theta for \theta is defined in lines 204-205, where the constant W>0 bounds the 1-norm of \theta (||\theta||_1 \leq W). Then W can also serve as a 2-norm bound for \theta; in line 555, ||\theta||_2 \leq W is used for applying Theorem 3.
2) (W defined such that 0 \leq \theta(i) \leq W for all i \in [d]): In the proof of Lemma 11, by recasting the 1-norm minimization as the LP Primal 1, the condition \theta \in \Theta is rewritten as -\theta(i) \geq -W for all i \in [d] and \theta(i) \geq 0 for all i \in [d]. Also, lines 571 and 578 say that "since all entries of \Phi and \theta are nonnegative...". In the proof of Lemma 1, W is used in the same way.

There is an error in Lemma 16: in the proof of the upper bound for R^\delta, below line 580, the proof has a mistake: although ln(x) \leq x, what is being bounded is |ln(x)| = -ln(x). Perhaps one may argue that since R^\delta \leq R (as stated in line 579), where R is the negative entropy, both R^\delta and R are upper bounded by 0.

In the proof of Lemma 13, when bounding the Lipschitz constant G_F, |ln(dW)| is replaced by dW at the end. However, |ln(dW)| is not necessarily less than dW unless dW \geq 1 (for example, for dW = 0.1, |ln(0.1)| \approx 2.3 > 0.1), although we can usually assume dW \geq 1 for large d and W. Supposing dW \geq 1, keeping the log may be beneficial: ln(dW) is smaller, which may help in the main theorem to reduce the dependencies on d and W. Experiments may further help evaluate the algorithms, as in [3].
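For readers unfamiliar with the sub-routine, a minimal PSGA loop of the kind discussed above might look as follows. This is only an illustrative sketch: the toy objective, step sizes, and the projection onto a set of the form \Theta = {\theta \geq 0, ||\theta||_1 \leq W} are my own choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(theta, W):
    """Euclidean projection onto Theta = {theta >= 0, ||theta||_1 <= W}."""
    theta = np.maximum(theta, 0.0)
    if theta.sum() <= W:
        return theta
    # Sorted-cumsum projection onto the scaled simplex {x >= 0, sum(x) = W}.
    u = np.sort(theta)[::-1]
    css = np.cumsum(u) - W
    rho = np.nonzero(u - css / np.arange(1, len(u) + 1) > 0)[0][-1]
    return np.maximum(theta - css[rho] / (rho + 1), 0.0)

def psga(grad_oracle, d, W, K, step0=1.0):
    """K steps of projected stochastic gradient ascent; returns the averaged iterate."""
    theta = np.zeros(d)
    avg = np.zeros(d)
    for k in range(1, K + 1):
        g = grad_oracle(theta)                           # unbiased stochastic gradient
        theta = project(theta + step0 / np.sqrt(k) * g, W)
        avg += (theta - avg) / k                         # running average of iterates
    return avg

# Toy concave objective f(theta) = -||theta - target||^2, maximized at `target`,
# accessed only through noisy gradients.
d, W = 5, 2.0
target = np.array([0.9, 0.5, 0.3, 0.2, 0.1])             # feasible: nonnegative, 1-norm = W
oracle = lambda th: -2.0 * (th - target) + 0.01 * rng.standard_normal(d)

theta_hat = psga(oracle, d, W, K=20000)                  # theta_hat ends up close to target
```

The point of the concern above is visible in this structure: the per-step cost is small (polynomial in d, not |S||A|), but the total work is governed by the iteration count K, which in the paper grows as \tilde{O}(t^4\tau^8 T^3).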