NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:1748
Title:Real-Time Reinforcement Learning

Reviewer 1

Positive:
- Overall, I feel that the paper provides an interesting contribution that may help work toward applying RL to real-world problems where an agent interacts with the physical world, e.g. in robots.
+ Reproducibility: the authors promised in the author response to make their code available.

Negative:
- One problem I see with the paper is that it is unclear at this point whether this line of work is necessary, because with increased computing power on embedded devices such as robots, the inference time of most methods turns out to be negligible (millisecond range or faster). I feel that this concern might be alleviated by a series of experiments (e.g. in the driving experiment proposed in the paper) where the agent is assumed to be super fast, very fast, fast, not fast, really slow, showing how that impacts the performance of the SAC method.
- Another problem with the paper is that the notation and reasoning are partially hard to follow (see more details below). -> The authors have promised to improve this in their response.

More detailed comments:

line 29: It is hard to understand what the authors mean by "one time-step" here. It becomes clear later that the authors refer to one agent/environment step, but this could also be read as "the agent is really, really fast". -> This is not a big problem, though, because it becomes clear later (and the figures next to the text also help). Maybe just referring to the figure inline here would already make this much clearer and prepare the reader better for the rest of the paper.

sec 3, first paragraph: The authors start using u for the action, and it is difficult to follow why a new symbol is used there. Maybe stick with a?

lines 69ff:
- t_\pi is not defined (I read it as the time it takes to evaluate the policy).
- t_s is not defined, and I don't actually see how it is different from t_\pi. Maybe just use a single symbol here (there is a bit of discussion about choosing t_s to be larger or smaller than t_\pi, but I don't see the point in that).

sections 3.1 / 3.2: This is quite confusing. Section 3.1 defines RTMRP(E) as an environment that behaves the same with turn-based interaction as E would behave with real-time interaction. Section 3.2 defines TB(E) as ? - and RTMRP(TB(E)) as ? It feels like this should be TB(RTMRP(E))? Overall, I do understand that these sections describe augmentations/reductions that convert real-time RL environments into turn-based environments and vice versa, but the description and the notation are quite confusing to me. Maybe it would be easier to follow if: E_{rt} is a real time environment

Figure 3/4: These could be merged, which would make it easier to compare the performance of the "working" SAC with the RT methods.
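To make the speed sweep suggested above concrete, here is a minimal sketch of how a turn-based environment could be wrapped so that each action takes effect only after a configurable number of environment steps. It is not taken from the paper or the authors' code: `CounterEnv`, `DelayedActionEnv`, `rollout` and the `delay` parameter are hypothetical names for a toy setting. The assumption is that `delay=0` corresponds to ordinary turn-based interaction and `delay=1` to an agent whose action selection takes exactly one environment step.

```python
from collections import deque


class CounterEnv:
    """Hypothetical toy turn-based environment: the state is an integer,
    and each action (-1, 0 or +1) is added to it immediately."""

    def reset(self):
        self.x = 0
        return self.x

    def step(self, action):
        self.x += action
        reward = -abs(self.x - 3)  # reward is highest when x == 3
        return self.x, reward


class DelayedActionEnv:
    """Applies each action `delay` steps after it was chosen (an assumption
    used here to model slow inference). The observation is augmented with
    the pending actions so that it still carries all relevant information."""

    def __init__(self, env, delay, default_action=0):
        self.env = env
        self.delay = delay
        self.default_action = default_action

    def reset(self):
        # Before the agent has acted, the queue holds default actions.
        self.pending = deque([self.default_action] * self.delay)
        return self.env.reset(), tuple(self.pending)

    def step(self, action):
        self.pending.append(action)
        applied = self.pending.popleft()  # the action chosen `delay` steps ago
        state, reward = self.env.step(applied)
        return (state, tuple(self.pending)), reward


def rollout(delay, steps, target=3):
    """Return of a delay-unaware policy that always moves toward `target`."""
    env = DelayedActionEnv(CounterEnv(), delay)
    (x, _pending) = env.reset()
    total = 0
    for _ in range(steps):
        action = (target > x) - (target < x)  # sign of (target - x)
        (x, _pending), reward = env.step(action)
        total += reward
    return total


if __name__ == "__main__":
    for delay in (0, 1, 2, 4):
        print(delay, rollout(delay, steps=20))
```

In this toy setting a policy that ignores the pending actions overshoots the target as soon as `delay` is nonzero, which is the kind of degradation a sweep over agent speeds would be expected to expose. Whether the effect is comparable in the paper's driving task is exactly what the proposed experiment would have to show.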

Reviewer 2

This paper constructs a framework for performing reinforcement learning where the environment state can change while the agent is choosing an action. This framework is inspired by real-world systems where the environment runs asynchronously from the agents and we would like agents to learn and act in "real-time". This is an important topic of study for practical application of RL to physical systems, including robotics. The authors show that the framework can represent existing MDPs. They also introduce a new learning method based on Soft Actor-Critic for learning in this framework. Empirical results show that the real-time algorithm outperforms the existing SAC algorithm on various continuous control tasks.

Would this approach handle real-world effects like jitter? I can see a scenario where jitter shifts multiple state changes into the same transition. It seems like your formulation only uses the latest state, so jitter could cause you to miss assigning rewards for important states. Would effects like this be considered a partially observable RTMDP? Or would you just have to set an action rate as low as the lowest expected delta between state observations due to jitter?

Line 78: Is this actually true? I could imagine a scenario where, if the agent had taken a lot of small fast actions with t_s < t_\pi, it would achieve the desired behavior. But if it waits to act at a slower rate by continuing computation as suggested here, the required action may be outside of the action space of the agent (assuming that its action space is bounded).

Making the code available would make this a stronger, more reproducible paper. The same applies to the car simulation used in some of the experiments.

Overall, this is an interesting area to study with potentially a lot of impact on real-world RL applications. The experiments show that even in environments where the standard MDP model works well, considering the problem in the real-time framework offers the potential for even faster learning and better final policies.

Reviewer 3

This paper introduces a new RL framework that considers the time of taking action. The relationship between the new framework and the classical MDP is also specified in the paper. Based on the new framework, the authors propose a new actor-critic algorithm. The idea of this paper is interesting, and the paper is overall well written. My main concern is that the definition of the real-time RL problem seems to be inconsistent with the motivation of the paper. I understand that the authors want to take the time of the action into consideration, but this is not accurately reflected in the definition of real-time RL.