Paper ID: 1049
Title: Reward Mapping for Transfer in Long-Lived Agents
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a model for transfer in which synthetic reward functions learned from previous tasks are mapped to new tasks to produce initial synthetic reward functions that speed up learning. This application of the relatively recent idea of a synthetic reward function that improves the performance of an underlying "utility" reward function to the transfer scenario is new and interesting, and the experiments are well thought out, thorough, and reasonably convincing.

The major criticism I had of the paper is that it does not compare to other methods of transfer. The authors are incorrect that policy transfer is not possible in their setting; that is why sub-policies, in the form of options, are often transferred instead of entire policies. (Speaking of which, the authors should probably cite Lisa Torrey's work on policy transfer.) In this case, though, it is OK to skip that comparison, because one could use either, both, or neither type of transfer in combination, so the comparison would add information but is not critical.

However, the failure to compare against a similar reward shaping scheme is more problematic. These two methods are effectively solving the same problem, except that reward shaping does not change the ultimate solution and a synthetic reward function does. In my opinion, these two methods (even outside of the transfer scenario) have not been adequately compared, which is odd because it is trivial to learn a shaping function (it is just a value function initialization, so you are just learning a value function). This lack of comparison leaves me with significant doubts about the whole synthetic reward function enterprise generally. So I think a comparison here - where the shaping function is mapped in the same way that the reward function is - would significantly improve the paper. But perhaps that is too much to ask for in a single paper, especially since the mapped shaping function could be considered a new (though somewhat obvious) method.
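For concreteness, this is the potential-based shaping I have in mind; the following is a minimal illustrative sketch (the potential phi and all names are my own, not the paper's):

```python
# Minimal sketch of potential-based reward shaping (illustrative; not the
# paper's method). A learned potential phi plays the role of an initial
# value function; shaping with it leaves the optimal policy unchanged
# (Ng et al., 1999), whereas a synthetic/internal reward may change the
# solution that a bounded agent converges to.

def shaped_reward(r_external, s, s_next, phi, gamma=0.95):
    # F(s, s') = gamma * phi(s') - phi(s) is the potential-based shaping term.
    return r_external + gamma * phi(s_next) - phi(s)
```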

The paper is very well written and was generally a pleasure to read, though this was spoiled somewhat by the repeated use of parenthetical citations as nouns. The references were poorly formatted in some cases (e.g., capitalization of "mdps").
Q2: Please summarize your review in 1-2 sentences
This paper describes a novel method for transfer in reinforcement learning domains, and is well written and well executed. A better experimental comparison to reward shaping methods would have been good, but isn't totally necessary.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors consider an agent that will experience a series of sequential decision-making tasks over its lifetime in the same environment (or a similar one). They propose a method for transferring knowledge acquired in previous problems to the current problem. Specifically, they consider transferring knowledge acquired in learning optimal reward functions. The authors demonstrate the utility of their approach in two examples.

The paper builds primarily on two pieces of earlier work. The first is the optimal rewards formulation in Singh et al. (2010), where the authors define an internal reward function as one that maximizes external reward (this internal reward function may differ from the external reward function if the agent is bounded in its capabilities, for instance if it has a small planning horizon). The second is the algorithm by Sorg et al. (2010) for incremental learning of such a reward function in a single task setting.
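For concreteness, the optimal rewards objective can be restated roughly as follows (the notation here is mine, not the paper's):

\[
R^{*}_{\mathrm{int}} \in \arg\max_{R_{\mathrm{int}} \in \mathcal{R}} \;
\mathbb{E}\Big[ \sum_{t} R_{\mathrm{ext}}(s_t, a_t) \;\Big|\; a_t \sim \pi^{\mathrm{bounded}}_{R_{\mathrm{int}}} \Big],
\]

where \(\pi^{\mathrm{bounded}}_{R_{\mathrm{int}}}\) denotes the behavior of the computationally bounded agent (e.g., a depth-limited planner) when it plans with internal reward \(R_{\mathrm{int}}\), and \(R_{\mathrm{ext}}\) is the external (objective) reward.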

The contribution of the current paper is to place this earlier work in a multi-task setting. As in Sorg et al (2010), the agent learns an internal reward function in each task. In addition, the agent learns a mapping from the external reward functions of the tasks it has experienced in the past to the internal reward functions it has learned by the end of those tasks. The agent uses this mapping to initialize its internal reward function at the beginning of each task.
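For concreteness, my reading of the transfer loop is roughly the following sketch (all names and the regressor interface are illustrative assumptions, not the authors' code):

```python
import numpy as np

# Rough sketch of the multi-task transfer loop as I read it (illustrative;
# not the authors' implementation). "mapping_model" is any regressor with
# fit/predict; "features" maps an external reward function to a vector;
# "adapt_internal_reward" stands for the within-task learning of the
# internal reward, as in Sorg et al. (2010).

def lifetime_transfer(tasks, mapping_model, features, adapt_internal_reward,
                      n_internal_params):
    X, Y = [], []
    for r_ext in tasks:
        if X:
            # Initialize the internal-reward parameters for the new task
            # from the mapping learned over previously experienced tasks.
            theta0 = mapping_model.predict([features(r_ext)])[0]
        else:
            theta0 = np.zeros(n_internal_params)  # no transfer yet
        # Continue adapting the internal reward online within the task.
        theta_final = adapt_internal_reward(r_ext, theta0)
        # Record the (external-reward features, learned internal reward) pair
        # and refit the mapping for use in future tasks.
        X.append(features(r_ext))
        Y.append(theta_final)
        mapping_model.fit(X, Y)
    return mapping_model
```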

The examples in the paper are small but sufficient to demonstrate the potential utility of the approach. In practice, the success of the algorithm will depend on the availability of good features for the mapping from external rewards to internal rewards.

The paper is well written and easy to follow. The empirical evaluation is well done, informative and useful. Figure 4, in particular, is helpful in concretely showing what the algorithm has done. Including a similar figure (or description) for the network example would be a useful addition to the paper.
Q2: Please summarize your review in 1-2 sentences
The paper is not particularly innovative but it is a well-executed, useful addition to the literature on transfer learning.

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper describes a method to "conserve" some of the experience
gained in solving an RL problem and carry it over to later tasks
with different reward functions. Importantly, the system assumes a
limited power for learning, which makes it advantageous to use a
surrogate reward function that helps the agent optimize the concrete
(external) reward. The idea is to provide the opportunity to learn
an internal reward function as a function of the external reward
function.

The idea is interesting and attractive, and the prospect of a
"bounded rationality"-type assumption behind the algorithm (although
the authors studiously - and wisely - avoid invoking the term) makes
the method a welcome step toward a more practical (and plausible)
perspective on reinforcement learning in general scenarios.

The paper is generally quite readable, but the reviewer found that it
lost clarity in the network routing domain. I'll mention some of the
issues in the details below.

In terms of methodology, the paper falls into the general category of
"reward-shaping" approaches. The success of the method in the
examples is convincing, and the general method class is, of course,
if not yet mature, at least consolidating.

- line 323: what are "trajectory-count" parameters for UCT? Number
of sample runs?

- line 332: it seems that either the colours, the coefficient signs,
or the semantics of the coefficients are inconsistent here. The text
says: "negative/dark/discouraging exploration", but that does not fit
with figure 4.

- line 370: I do not understand the point of the decomposition into
G_1, G_2, G_3. What is its purpose?

- line 403: I do not understand how the transition function is
modeled. Don't you use reward mapping anymore here? If you use it,
*and* you modify the transition function, how does that happen?
Please reformulate this section, it is completely unclear to me.

- line 418: What is the "competing policy transfer agent"? What
model does it use?
Q2: Please summarize your review in 1-2 sentences
An interesting method for transfer learning under limited
resources. Settled in existing "reward shaping" methodology
territory, the method itself looks sufficiently original and
effective to warrant publication. Some (minor) weaknesses in the
description of the second example.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their helpful reviews. Given the chance, we will incorporate into the final version their specific suggestions on how to improve clarity and readability, and we will add text discussing the relationship to the provided references.

Reviewer 1:
a) In the network routing domain we did compare our method against the specific policy transfer method developed by the creators of the domain. We will sharpen the writing to make this clear. Furthermore, we will attempt to describe the competing method better (though space limitations make this a challenge).
b) We agree with the reviewer that our statement that "there is no scope for transfer by reuse of value functions or policies" needs revision. We will be more precise by stating that, without modifications, UCT does use a reward function but does *not* use an initial value function or policy, and hence changing the reward function is a natural and consequential way to influence unmodified UCT. However, we do agree that non-trivial modifications of UCT could allow the use of value functions and/or policies.
c) We will cite and describe how Torrey et al.'s work on policy transfer differs from ours (thanks for a useful pointer). In their most related work, on transfer with model-based planning agents, their focus is on discovering for which state-action pairs their old knowledge (value functions, reward functions, and transitions) is trustworthy. Trustworthy knowledge from old tasks is reused in the new task, and trust comes from whether the old predictions remain true in the new task. So their particular research emphasis is quite different from ours, though their overall objective of transfer is similar. Their less related work focuses on transfer with model-free RL agents using learned relational knowledge. Finally, as noted above, we did compare against a policy transfer method proposed by the creators of the network routing domain.
d) While in this paper our focus was on adapting reward functions to achieve transfer in computationally bounded agents, and for this we used the existing optimal-rewards framework, we agree with the reviewer that optimal-rewards-based work could use a more careful direct comparison to reward shaping; this could be relevant future work for the authors. We note that Sorg et al. [15] does show that, in lookahead tree planning agents, potential-based reward shaping is a special case of optimal rewards and performs worse empirically.
e) We will fix the use of parenthetical citations as nouns.

Reviewer 2: Yes, the availability of good features for the reward-mapping function is key both for the existence of good internal reward functions and for the ease of the supervised learning task that maps objective rewards to internal rewards. We will acknowledge this explicitly in a revision. Thank you.

Reviewer 3: Thank you for the helpful suggestions on how to improve the writing in the network routing domain. To answer some of your questions briefly, using your line numbers:
(323) Yes, trajectory-counts are UCT sample runs, i.e., trajectories to a prescribed depth.
(332) Embarrassingly, we had colors on both the numbers (these were the relevant colors) and on the grid squares (these were meant to help visualize), and they were chosen in a contradictory manner (negative numbers used a dark-line font but the associated grid squares were light colored). In the cited statement we mean that negative numbers written in a dark-line font discouraged exploration. We will fix this in our graphic.
(370) We will clarify how the G_1, G_2, G_3 decomposition we introduced was used to allow for intuitively interesting changes to the state-transition function across tasks. Specifically, part of the state description is the destination of packets in the system. The probabilities that the destination of a packet belonged to each of the subgraphs (p^G_1, p^G_2) became parameters we could manipulate to change the state-transition probabilities.
(403) In the experiments where transition functions changed with tasks, we still used a mapping function whose *output* was the initial internal reward function, but whose *inputs* now included parameters of the state-transition function (specifically p^G_1 and p^G_2, as mentioned above).
(418) The competing policy transfer agent is from the reference Natarajan and Tadepalli [9] in the paper; they introduced the network routing domain to evaluate their policy transfer algorithm, and we use both their domain and their algorithm as a comparison. Their method stores policies and, for each new task, chooses an appropriate policy from the stored ones and then improves that initial policy with experience. (They use vector-based average-reward RL algorithms where each component of the vector corresponds to an objective reward function feature, and they store and reuse these vectors as implicit representations of policies.)
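For concreteness, a schematic sketch of the mapping input in those experiments (p_G1 and p_G2 are the transition parameters mentioned above; the code itself is illustrative, not our exact implementation):

```python
# Schematic sketch (illustrative, not our exact implementation): when the
# transition function varies across tasks, the mapping's input concatenates
# the external-reward features with the transition parameters p_G1, p_G2;
# its output is still the initial internal-reward parameters.

def mapping_input(external_reward_features, p_G1, p_G2):
    return list(external_reward_features) + [p_G1, p_G2]
```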