NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 2921 Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

Reviewer 1

The paper defines "Uniform-PAC", where uniformity is over the optimality criterion, eps. It is PAC-like in that optimal actions are taken in all but a bounded number of steps. It is also regret-like in that the algorithm is eventually good relative to any epsilon---not just the one it is told to meet. I liked the paper. I thought the discussion of different performance metrics was thorough and informative. I would have liked more intuition about the iterated-logarithm idea and its main properties, but I understand that the highly technical material had to be expressed in very limited space. I would also have liked a more thorough accounting of open problems, since arguing for a new algorithmic criterion means that some new problems arise and some older problems drop in priority. For example: Do we need an analysis of the infinite-horizon case? Does it raise new issues, or do we essentially replace H by 1/(1-gamma) and everything goes through? Does the criterion pose interesting problems from a bandit perspective?

Detailed comments:
- "UBEV leverages new confidence intervals which the law of iterated logarithm allows us to more tightly bound the probability of failure events, in which the algorithm may not achieve high performance.": Confusing, if not broken. Rewrite.
- "F_PAC(1/eps, log(1/delta)": Missing closing paren. Also, F_PAC is used immediately afterward with several new parameters.
- "F_UHPR(S,A,H,T,log(1/delta)": Missing closing paren.
- "(Delta_k)_k": Isn't that second k incorrect?
- "guarantees only bounds" -> "guarantees only bound".
- "criteria overcomes" -> "criterion overcomes"?
- "that an algorithm optimal" -> "that an algorithm with optimal"?
- "Uniform-PAC is the bridge" -> "Uniform-PAC is a bridge"?
- "thet" -> "the".

I'm confused about the implications of Theorem 3 for Theorem 2. Theorem 3b says the algorithm satisfies an (eps,delta)-PAC bound. Theorem 3c says it has T^(1/2) high-probability regret.
But Theorem 2a says that an algorithm can't have an (eps,delta)-PAC bound and high-probability regret better than T^(2/3). Theorem 2b is even stronger, saying it can't do better than T. Oh, wait, maybe I see the subtlety here. The issue is not that *no* algorithm with a certain PAC bound can achieve a certain regret bound; it's that knowing an algorithm satisfies the PAC bound is insufficient to guarantee that it has the regret bound. The Uniform-PAC bound is stronger and yields algorithms that have both kinds of guarantees. I think you can make this clearer in the paper. (It's very interesting.)

More detailed comments:
- "rng" is used without explanation. I found https://en.wikipedia.org/wiki/Rng_(algebra), but it doesn't seem to use rng as an operator, just a property, so I think additional background is needed here.
- "All algorithms are run with setting where their bounds hold.": Reword?
- "increase. . One": Extra period.
- "the discounted for which": Missing word?

Note: I was very impressed with the comments of Assigned_Reviewer_3, but they didn't significantly dampen my enthusiasm for the paper. To me, the interesting contribution is the definition of a criterion that implies both PAC and regret bounds; the analysis of the finite-horizon case is just a small down payment on using this concept. That said, I would be much happier if an infinite-horizon analysis were presented, and I do not find your claim that it is "even not clear how regret should be defined here" very convincing. There are several concepts in the literature that seem worth considering; something related to the Strehl/Littman concept raised by Assigned_Reviewer_1 could work.
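One closing remark on the Theorem 2/3 subtlety above, in case it helps sharpen the exposition. Here is a rough sketch, in my own notation, of why a uniform bound converts into sublinear regret while a single (eps,delta)-PAC bound cannot: the uniform bound can be summed across all suboptimality levels at once. I am assuming, purely for illustration, a Uniform-PAC bound of the form F(eps) = C/eps^2 for some problem-dependent constant C (ignoring log factors and any lower-order C1/eps term), with Delta_k the per-episode optimality gap.

```latex
% Illustrative assumption: on a single event of probability >= 1 - \delta,
%   |\{ k : \Delta_k > \epsilon \}| \le F(\epsilon) = C/\epsilon^2
% holds for ALL \epsilon > 0 simultaneously.
% Bucket episodes by \Delta_k \in (2^{-i-1}, 2^{-i}] and cut off at
% \epsilon_0 = \sqrt{C/T}:
\begin{align*}
  R(T) \;=\; \sum_{k=1}^{T} \Delta_k
    \;&\le\; T\,\epsilon_0 \;+\; \sum_{i \,:\, 2^{-i} > \epsilon_0} 2^{-i}\, F\!\bigl(2^{-i-1}\bigr) \\
    \;&\le\; T\,\epsilon_0 \;+\; \sum_{i \,:\, 2^{-i} > \epsilon_0} 4C\,2^{\,i}
    \;\le\; T\,\epsilon_0 \;+\; \frac{8C}{\epsilon_0}
    \;=\; O\!\bigl(\sqrt{C\,T}\bigr).
\end{align*}
```

A single (eps,delta)-PAC guarantee fixes one level eps and says nothing about gaps just below it, which is exactly the slack the Theorem 2 lower bounds exploit.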