NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:379
Title:Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives

Reviewer 1

Summary: This paper studies a generalization of online reinforcement learning (in the infinite horizon undiscounted setting with finite state and action space and communicating MDP) where the agent aims at maximizing a certain type of concave function of the rewards (extended to global concave functions in appendix). More precisely, every time an action "a" is played in state "s", the agent receives a vector of rewards V(s,a) (instead of a scalar reward r(s,a)) and tries to maximize a concave function of the empirical average of the vectorial outcomes. This problem is very general and models a wide variety of different settings ranging from multi-objective optimization in MDPs, to maximum entropy exploration and online learning in MDPs with knapsack constraints. In section 2 the authors introduce the necessary background and formalize the notions of "optimal gain" and "regret" in this setting. Defining the "optimal gain" (called the "offline benchmark" in the paper) is not straightforward. In section 3, the authors first show how the exploration-exploitation dilemma faced by the agent is intrinsically more challenging since the optimal policy may be non-stationary (example of Fig. 1 and Claim 3).2. They then introduce a variant of UCRL2 (called TFW-UCRL2) which can efficiently balance exploration and exploitation in this setting. The algorithm mainly differs from UCRL2 by the stopping condition of episodes: an episode also stops whenever the "gradient " becomes too far from its initial value at the beginning of the episode. This prevents TFW-UCRL2 from converging to a stationary policy whenever the optimal policy is actually not stationary. The gradient is updated using a Frank-Wolfe algorithm. When the objective function is both Lipschitz and smooth, the regret of TOC-UCRL2 can be bounded by O(1/\sqrt{T}) (Theorem 3.1). A detailed proof sketch of Theorem 5 is presented in section 4 (the main steps and arguments are given). Finally, in Section 5 the authors report experiments illustrating the theoretical findings. Clarity: Overall the paper is well-written and clear. All the statements are very precise. The examples of Fig.1 are very helpful to understand the challenges of the setting (and they are well-explained). Correctness: The paper is very rigorous (both the main body and the appendix). The proofs are very detailed and still concise (and thus easy to read and verify). The derivation of the regret proof of Theorem 3.1 is very rigorous while being easy to follow. Significance: In my opinion, the contribution of this paper is quite substantial and non-trivial. The paper (including the appendices) makes a comprehensive and insightful analysis of the problem addressed in the paper. In my opinion, the problem addressed is very relevant for the community (the setting and results are very general).

Reviewer 2

My main concerns regarding the paper are the follows: 1) the main part of the paper is really hard to follow. There are four main building blocks of the proposed algorithm: UCRL, Frank-Wolf, EVI and GTP. Based on Algorithm 1, I believe that I could figure out how the Frank-Wolf and EVI are used in the UCRL framework, but I could not find out how GTP is implemented and what its role is. I guess it a mechanism to avoid the low gradient states, but I don't see how 2) The aggregation function is not well-motivated. I don't see that it would preserve the Pareto dominance relation which is a key notion in multi-objective optimization. 3) Can you please explain me any real-world problem which might be modeled by the proposed setup. Only synthetic experiments are presented in the experimental study. 4) The technical contribution seems marginal. Tremendous work is presented in the appendix, but if I look at Lemma 4.1, it seems that, thanks to the concavity of the aggregation function, the optimization problem can be handled quite similar way to the single objective case. In Eq. (11), the regret basically boils down to the error in value estimation which is very similar to the single objective case.

Reviewer 3

i. Novelty and Significance: The problem formulation seems novel as multi-objective scenarios are quite relevant in practice. I am not entirely well versed on existing works on multi-objective MDPs in RL, but authors claimed to have the first non-asymptotic sub-linear regret guarantee for the current setting unlike some prior works [28,34] which focuses on asymptotic convergence and work under more restricted assumptions on the MDPs. ii. Writing: The paper is overall well written with properly introduced notations and the claims are easy to follow. I made a high level check of the proof techniques which appears to be reasonable. iii. Numerical Experiments: This section could have been elaborated further (although more details are included in the Appendix). I would suggest moving some of the analysis of Sec. 4 to Appendix to include more experimental evaluations in the main text. In particular, the reported empirical results does not study any comparative performances with other existing methods --- I am curious to see how TFW-UCRL2 performs with respect to the proposed algorithms of [28] and [34], even under the setting of their restricted assumptions. =================== Post Rebuttal ========================= I have read and understood the rebuttal clarifications. I keep my score unchanged.