
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
[Light review]
This is an interesting work that proposes some value gradient methods for stochastic systems.
The work does not have any theoretical guarantees, but the claims are supported by empirical studies.
Some aspects of the paper are not very clear though:
* The discussion around equations (12, 13), and especially how to generate samples from p(eta, zeta | s, a, s'), is not clear. What does "without loss of generality" mean at lines 187-188?
* How difficult is it to learn such a noise model? Generating samples from p(eta, zeta | s, a, s') is required at line 11 of Algorithm 1 and line 12 of Algorithm 2, so a clearer discussion is needed.
* Why do we have partial derivative of V(s') w.r.t. theta in equation (4)? Isn't that derivative equal to V_{s'}(s') f_a pi_theta, which is there already?
* Please provide a better comparison with the work of Fairbank. Is the main difference in the use of the reparameterization technique?
===== [UPDATE] Thank you. You addressed my concerns.
Q2: Please summarize your review in 1-2 sentences
See comments.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s).
The paper presents a new stochastic value gradient algorithm that can combine learning a system dynamics model with learning a (state-action) value function to obtain accurate gradients for the policy update. Value gradient algorithms could so far be used only for deterministic environments and deterministic policies. They extend the formulation to the stochastic case by introducing the noise as additional variables that are supposed to be known for the trajectories (and hence, deterministic). Furthermore, they propose to use a learned model (e.g. a neural network) to compute the gradient from multi-step predictions. Yet, the model is used only for computing the gradient and not for the prediction itself (real rollouts are used), which makes the approach less reliant on the model accuracy (in contrast to other model-based RL methods such as PILCO that suffer from model errors quite severely). They introduce 3 algorithms: SVG(0) uses the Q-function to compute the policy gradient, SVG(1) uses the value function and a 1-step prediction with the learned model, and SVG(inf) only uses the model without any value function. SVG(1) is reported to work best in the experiments.
There is not much to say about this paper. Nice ideas, properly executed with a massive amount of evaluations, and well written. I think the contributions of the paper are very relevant as it nicely combines model-free and model-based RL methods in one framework and can be applied to complex neural network policies, value functions and dynamics models. The introduced extensions to standard value gradients, such as stochastic policies and models, using the model only for gradient computation and not for prediction, as well as experience replay with off-policy samples, are well motivated and important contributions to make the algorithm more efficient.
More Comments:
- The algorithm SVG(0) seems to be almost equivalent to submission 1107 (deep deterministic policy gradient). I assume the authors of the papers are the same. Please elaborate on the differences between the two algorithms and cite the other paper if both are accepted.
- The authors do not seem to update the exploration policy that generates the exploration noise used for gathering data. This seems to be a bigger limitation of the approach: the exploration noise variance cannot be learned, even though learned noise might lead to more efficient exploration strategies.
Q2: Please summarize your review in 1-2 sentences
A nice paper introducing a new class of algorithms that can learn the model, value function and policy simultaneously. The results are exhaustive and convincing.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s).
This paper introduces a family of Stochastic Value Gradient (SVG) algorithms for finding optimal policies in Markov Decision Processes. First, the authors present a standard value gradient method used for deterministic systems. The method consists in deriving the gradient of the value function (Bellman equation) with respect to the policy parameters, and backpropagating the gradient to the first timestep. To generalize this technique to stochastic systems, the authors focus on a particular type of transition functions (dynamics) where the next state is given by a deterministic function of the previous state and action, and a random noise. The noise is an exogenous variable that is assumed to be independent of the states and actions. By containing the stochasticity in a variable that is independent of states and actions, one can use a simple MC estimator to calculate the value gradient, which is computed in the same way as in a deterministic system. The gradient can be computed by fitting a model and a value function. The authors demonstrate their approach in a number of simulated systems.
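[Editor's note] As a minimal illustration of the Monte Carlo estimator described above (a hypothetical 1-D system of our own invention, not taken from the paper): drawing the noise first makes the sampled outcome a deterministic, differentiable function of the policy parameter, so an ordinary average of pathwise derivatives recovers the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D system: s' = s + theta*s + sigma*xi with exogenous xi ~ N(0,1),
# and objective r(s') = -s'^2. After fixing xi, s' is deterministic in
# theta, so dr/dtheta is an ordinary derivative (reparameterization).
sigma, s, theta = 0.1, 1.0, 0.5

def pathwise_grad(xi):
    s_next = s + theta * s + sigma * xi   # deterministic given xi
    return -2.0 * s_next * s              # d(-s'^2)/dtheta, ds'/dtheta = s

mc_grad = np.mean(pathwise_grad(rng.standard_normal(100_000)))

# Analytic gradient of E[-s'^2] = -((s + theta*s)^2 + sigma^2):
true_grad = -2.0 * (s + theta * s) * s
```

The estimate matches the analytic gradient without any likelihood-ratio term, which is the property that makes the value gradient computable as in a deterministic system.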
The paper is very clear, well-organized and well-written. Some parts are unnecessarily long, such as the first paragraph of Section 4. Some statements seem to be repeated or unnecessary, such as lines 173-177. The statement about "the first in the literature to learn a model, a value function and a policy simultaneously in a continuous domain" is too strong. I can think of several works that did that, e.g., Deisenroth's works.
My main concern with this work is the fact that the derivation of the SVG algorithms is rather straightforward. The reparameterization technique is a simple trick that can be applied only for certain types of transition functions (state-independent noise). The importance-sampling technique is extensively used in off-policy reinforcement learning. Of course, the simplicity of the derivations should never be used as an argument against a paper if the authors clearly demonstrate the advantage of their approach on real-world problems. The experiments presented in this paper are limited to toy problems, which does not indicate how much impact this work will have. Moreover, it is not very clear from Figures 2-4 which variant of SVG should be used.
On the other hand, I really liked the clarity of the paper, and I can easily see how other researchers may attempt to implement the SVG algorithms and reproduce the results. This is a big plus for the paper. The authors also carefully evaluated their algorithms on a quite large number of systems.
Some questions/comments: 1) In algorithms 1 and 2, there is a bias in the choice of actions caused by the policy parameters. How do you remove this bias in training the model? The model is tested on data points that are sampled from a distribution (policy) that is different from the distribution used in the training.
2) There seems to be an error in line 11 of Algorithm 2. The numerator should be (s, a) instead of (s^k, a^k). 3) Could you detail the derivation of Equations (3) and (4)? 4) Typo: "that that".
Q2: Please summarize your review in 1-2 sentences
This is a well-written paper. The idea of Stochastic Value Gradients is nice but seems quite straightforward, and all the experiments were performed on artificial problems.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s).
The paper introduces a formalism that allows extending value-based
policy gradient methods to stochastic policies in stochastic
environments. The formalism is based on reparameterization (modeling
stochasticity as a deterministic function of exogenous noise).
The paper presents algorithms based on this formalism and tests them on
multiple control problems.
The problem of extending value-gradient methods to stochastic policies
and environments seems important. The experimental part of the paper
covers different challenging control domains. The main issue with the
paper is that the presentation needs significant improvement to be accessible
to a general RL audience, which is a reasonable requirement from an RL
paper. Specifically, terms are used without being defined first, many
descriptions are informal where formality is required, and many critical
details are missing.
This makes it hard to follow and evaluate the exact
contributions of the paper for people outside the community that works on
RL for robotics in continuous state/action spaces.
Details:
- the abstract is currently a bit misleading: it claims to apply the
algorithms to "a range of difficult physical control problems" without
mentioning that they are *simulated* control problems. The authors should
clarify this point explicitly in the abstract in a final version of the
paper, if accepted.
line 16 - long sentence, needs a break.
line 38 - what do you mean by 'chain'? please be more formal.
line 40 - 'backpropagating' has different meanings - please
explain which one of them you refer to here. What is being backpropagated?
line 41 - 'replace the derivatives' - replace them where?
lines 51, 55 - what do you mean by 'unroll'? please clarify.
line 56 - 'learned jointly' - why is it new? many model-based
algorithms can be considered to learn jointly the model,
value function and policy.
line 57 - long sentence, break.
lines 60-61 - the sentence is vague - please be more explicit.
line 76 - define the MDP as a tuple (S, A, P, R, gamma, s0).
line 78 - please clarify why you learn time-invariant policies if
the reward is time-varying.
line 82 - define gamma mathematically (instead of 'strictly less than...').
Equation 1 - missing upper limit in the summation.
lines 92-94 - the definition of V' is confusing; also you might want to use ':='.
line 105 - you differentiate with respect to state and theta (not action).
Eq. 3-4 - please provide more details on how you perform the derivation here.
line 108 - the expression 'purely model-based formalism' is used but was never defined here.
line 108 - earlier, did you mean by 'backpropagating' the
backpropagation of the derivatives described here? if yes,
please say so explicitly.
line 111 - 'unrolling' was never defined.
lines 111-114 are unclear.
line 117 - 'model' is used but never defined.
line 118 - how do you define 'modeling errors' and how
sensitive is V_theta to them? ('highly sensitive' is not informative
enough)
line 119 - 'we start instead' - instead of what?
line 122 - 'real, not planned' - what do you mean by 'planned'? please define.
line 124 - 'critics' is used but never defined in this context.
lines 125-127 - the sentence is unclear.
line 131 - 'backpropagate efficiently' - what do you mean by
that? what is your efficiency measure?
line 136 - 'we consider conditional densities' - what do you mean by 'consider'?
line 148 - what do you mean by 'exploits' the Jacobian?
line 165 - 'standard planning method' - can you provide a reference?
please define 'planning' in this context.
Eq. 10-11 - you use notation that is not explained: what is the subscript '\eta,\xi'?
line 185 - 'we can now replace...' - why can you do that?
line 186 - why is this case 'important'?
line 187 - \Sigma was not defined.
Sec 4.2 - since it doesn't contain any details, consider moving it
forward to the place where you provide more details.
line 218 - 'the parameters of the critic' are used but were never defined here.
line 257 - 'our framework easily supports' - what do you mean by 'easily'?
Sec 6 - it would be good to mention a relevant figure first (e.g. Fig. 2) and
then describe in detail what it shows.
Section 6.1 - based on what do you determine that [25] is
state-of-the-art? also, is it state-of-the-art on a given domain?
multiple domains?
Q2: Please summarize your review in 1-2 sentences
The paper addresses an interesting problem, presents a new formalism and
algorithms, and tests them extensively on multiple control problems.
In its current state, the paper seems to be accessible only to a
subcommunity of RL. To make the paper more widely accessible (at least
to a general RL audience), and to increase the impact of the paper, the
presentation needs to be improved - see the detailed comments.
Also, the abstract is currently a bit misleading: it claims to apply the
algorithms to "a range of difficult physical control problems" without
mentioning that they are *simulated* control problems. The authors should
clarify this point explicitly in the abstract in a final version of the
paper, if accepted.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank all reviewers for taking time to understand
the manuscript and provide detailed feedback, which will greatly improve
the manuscript. We are happy to see agreement that the paper addresses an
important set of problems, introduces useful ideas and algorithms, and
presents a thorough experimental investigation on several
domains.
Below we address specific points:
[R2] Significance
of reparameterization; importance sampling novelty (IS); task complexity;
joint learning:
Reparameterization may be simple but it is a powerful
idea that has only recently found wider application in ML. Importantly,
the application here leads to a new perspective on policy gradients (i.a.
avoiding likelihood ratio estimators; allowing backprop through stoch.
model); for example, the recently published DPG algorithm follows as a
special case of reparameterization. We don't claim IS is novel but use it
to construct new, practical algorithms.
Our experiments are
performed in simulation, but with complexity on par with recent work (e.g.
ref [11]), and include locomotion, reaching, and grasping in high
dimensions with realistic, nonlinear contact physics.
PILCO does
not use value function approximators.
[R2,R3,R5] Derivation of eq.
3,4, extra term: (3,4) result from differentiating the recursive
definition of the value function (line 103) wrt. s and \theta, taking into
account that V(s') depends on \theta through s' (line 104; 2nd term in
(4)) but also directly, since the policy parameters affect future time
steps (3rd term in (4)).
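[Editor's note] The derivation described in this answer can be written out as follows (a sketch in the paper's apparent notation, with subscripts denoting partial derivatives, a = \pi(s;\theta) and s' = f(s,a); reconstructed from the rebuttal's description, not copied from the paper):

```latex
% Deterministic Bellman recursion (paper line 103):
V(s) = r(s, a) + \gamma V'(s'), \qquad a = \pi(s;\theta), \quad s' = f(s, a)

% Differentiating w.r.t. the state gives eq. (3):
V_s = r_s + r_a \pi_s + \gamma V'_{s'} \left( f_s + f_a \pi_s \right)

% Differentiating w.r.t. \theta gives eq. (4): V'(s') depends on \theta
% through s' (2nd term) and also directly, via future time steps (3rd term):
V_\theta = r_a \pi_\theta + \gamma V'_{s'} f_a \pi_\theta + \gamma V'_\theta
```

The third term in the last line is the derivative Reviewer 1 asked about: it is not equal to V'_{s'} f_a \pi_\theta, because it captures the effect of \theta on actions at future time steps.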
[R4,R5] Inferring noise variables in
(12,13); stochastic neural networks (NN):
We choose the parametric form
for policy and model and thus know the form of the joint p(s', a, \xi, \eta
| s). We can then perform inference via Bayes' rule. Depending on the
parametrization this is easy or requires approximation. For the
additive-Gaussian noise considered in the paper, \xi, \eta are
deterministic functions of (s, s', a).
To parameterize conditional
densities for model and policy we use the form in l. 134 (general case), and
l. 259/265 (for the model). We use NNs to learn \mu and learn a constant
sigma per dimension. \rho is standard normal. The NNs hence parameterize the
mean \mu of a Gaussian as a function of state (and action). Supervised
learning for such a model is straightforward. More general noise models
are possible (see below).
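[Editor's note] A concrete sketch of this additive-Gaussian parameterization (a hypothetical fixed affine map stands in for the learned mean network; all names are our own): generation maps noise to the next state, and for this noise model the noise variable is recovered deterministically from (s, a, s'), as the rebuttal states.

```python
import numpy as np

# Hypothetical additive-Gaussian model of the form described above:
# s' = mu(s, a) + sigma * eta, eta ~ N(0, I). A fixed affine map stands
# in for the learned NN mean; sigma is one constant per state dimension.
W_s, W_a, b = np.array([[0.9]]), np.array([[0.1]]), np.array([0.0])
sigma = np.array([0.05])

def mu(s, a):
    return W_s @ s + W_a @ a + b

def sample_next(s, a, eta):
    return mu(s, a) + sigma * eta       # generation: noise -> next state

def infer_eta(s, a, s_next):
    return (s_next - mu(s, a)) / sigma  # inference: (s, a, s') -> noise

s, a = np.array([1.0]), np.array([0.5])
eta = np.random.default_rng(1).standard_normal(1)
recovered = infer_eta(s, a, sample_next(s, a, eta))
```

Inverting the generative map in this way is what makes the noise "known for the trajectories", so observed transitions can be differentiated through.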
[R2] State-dependent noise: Fixed
noise was sufficient for our experiments but reparameterization admits
statedependent noise as discussed in paper lines 256ff (e.g. learning
sigmas as NNs). More general transformations allow more complex forms of
noise (see below)
[R1] Learning exploration: SVG may learn policy
noise although we are not using this here. The policy noise can be
state-dependent (as in line 134). We can learn richer stochastic policies by
feeding noise into the network. For instance, we have used SVG(0) to learn
a bimodal action distribution for a partially observed domain in which
bimodality is optimal.
[R2] Model bias: Training the model
concurrently with the policy ensures that it is updated as the agent
enters new parts of state space. Exploration ensures that actions are well
sampled around trajectories (maintaining exploration is critical). We
compute policy gradients on observed trajectories to ensure they are
evaluated only where model training data is available.
[R2] Alg.
2, L11: We importance weight the historical state-action pair (s_k, a_k)
from the database.
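[Editor's note] A minimal sketch of such a weight (Gaussian policies and all names are illustrative assumptions, not the paper's code): the stored action a_k is re-weighted by the ratio of the current policy's density to the behaviour policy's density at (s_k, a_k).

```python
import math

def gauss_pdf(x, mean, std):
    # Univariate normal density, used for both current and behaviour policy.
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def importance_weight(a_k, mean_now, mean_then, std=0.2):
    # p(a_k | s_k; theta_now) / p(a_k | s_k; theta_then)
    return gauss_pdf(a_k, mean_now, std) / gauss_pdf(a_k, mean_then, std)

w_same = importance_weight(0.3, mean_now=0.1, mean_then=0.1)  # identical policies
w_diff = importance_weight(0.3, mean_now=0.3, mean_then=0.1)  # a_k likelier now
```

Identical policies give weight 1; otherwise the historical sample is up- or down-weighted according to how likely the current policy is to take that action.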
[R5] "without loss of generality" in lines
187-188: State-independent additive noise does not affect the model
Jacobian.
[R1] SVG(0)/DPG: DPG is the deterministic policy limit
of SVG(0). Hence their Q functions are different. Both allow off-policy
learning and experience replay but in different ways. Only SVG(0) allows
the off-policy use of K-step returns for learning Q, or learning
exploration.
[R2] Which SVG: We found that SVG(1)-ER performed
very well across all problems except for the "Hand" due to its long
horizon and sparse reward (Fig. 2, left). SVG variants with a value fn.
failed, but purely model-based SVG(inf) succeeded.
[R5] Comparison
w/ Fairbank: Fairbank assumes known, deterministic dynamics. We extend to
the stochastic case, show how to robustly combine approx. models and value
fns., and tackle challenging problems.
[R4] Comparison w/ other
algorithms (PILCO; Levine): PILCO is limited to linear/RBF policies. To
the extent that the environments are comparable, our results for the
length 3 swimmer are similar to Levine & Abbeel 2014.
[R4]
Reproducibility: We will include a full description of all domains in the
final version of the paper.
[R3] Line 185: We model the joint
trajectory distribution and condition on observed data.
[R3] Line 40/108: Yes, see eqns. 3&4.
[R3] Line 148: The likelihood ratio
estimator doesn't use the Jacobian of g. E.g. D2 in ref [16].
[R3] Line 122: Planned = trajectory sampled from the learned
model.
