
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The author begins by proposing a strategy to embed feedback gains in neural network policies that makes them robust to disturbances and regression error. Trajectory optimization and policy optimization are softly coupled by a cost penalty for deviation, allowing the policy to learn reasonable approximations of optimal trajectories that are in fact simpler and more robust than the products of MPCbased trajectory optimization on their own. The tasks are optimized using sophisticated parallelization, though this is not the central purpose of the paper and more of a means to an end. Finally, a diverse set of robots are controlled using the trained feedback controllers.
This paper is of the highest quality, is incredibly clear, represents a tour de force of technique, and presents quite beautiful results in its applications. Bravo.
Q2: Please summarize your review in 12 sentences
This is a beautiful body of work that marries graphics, control, neural networks in the pursuit of generation of controllers for a diversity of body morphologies and tasks. It's inspiring.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
"Interactive control of diverse complex characters with neural networks" shows
how a small, fast neural network can be used to represent a control policy for interactive control of a character in a physics simulator.
The approach uses trajectory optimization in a training phase to learn a neural network policy. The contribution is to show that the same architecture works across a range of moreorless cyclic behaviours, and a range of body types.
General Comments 
My reading of this paper is that it is an application of deep learning to existing (albeit recent) techniques in computer graphics.
The ability of trajectory optimization to find realistic control signals has
been established by previous work, some of which is cited here.
The use of neural networks to approximate the stepbystep actions required of a policy in order to trace out trajectories similar to the ones found by trajectory optimization has also been developed in previous work (such as e.g. [9]).
The authors of this paper must therefore work harder to position the work that went into this paper relative to these previouslypublished approaches.
The
closest work to my mind is the recent line of work by Pieter Abbeel and Sergey Levine. How would the authors compare the approach here to that one? Is it meant to learn faster? To support a wider variety of body shapes? Is it meant to require less computation at runtime? To be more robust to certain environmental variabilities?
Specific Comments 
What is meant by "we found the resulting process can produce very large and illconditioned feedback gains"?
Consider moving the discussion of why you did not use LQG until after you have presented your strategy.
The notation around lines 226240 is confusing. You introduce s(X) = \bar s, and then define a cost in terms of s(X)  \bar s. The bars and squiggles are visually similar.
What is meant by "our approach produces softer feedback gains according to parameter \lambda"? Softer than what? What does it mean for gains to be either soft or not soft? Why is it a good thing to change the initial state as part of your procedure?
Q2: Please summarize your review in 12 sentences
Sound work, but the contribution relative to other recent related work is not clear enough.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
In this submission, authors propose an algorithm for realtime control of 3D models, where a neural network (NN) is used to generate animations, after having been trained offline to reproduce the output of ContactInvariant Optimization (CIO). Importantly, CIO and NN training are not independent: they are performed in an alternating fashion, the output of one being used in the criterion optimized by the other. Another core element of the method is the injection of noise, through data augmentation by small perturbations as well as additive noise in the NN's hidden layers. Experiments show that such a network is able to produce realistic and stable control policies on a range of very different character models.
This is a very interesting approach, considering that realtime control of arbitrary models is a challenging task. Although the proposed method may initially appear relatively straightforward ("learn a NN to predict the output of CIO"), the additional steps to make it work (joint training and noise injection) are significant new contributions, and shown in experiments to help get much better generalization.
Those experimental results would be stronger, however, if comparisons were made with a larger dataset. Since the base training trajectories were generated with only "between 100 to 200 trials, each with 5 branched segments", it is not surprising for the "no noise" variant to overfit. The "no joint" variant might also benefit from more diverse training trajectories  even if that is less obvious.
Overall the comparative evaluation is the weak point of the paper (acknowledged in the introduction: "A systematic comparison of these more direct methods with the present trajectoryoptimizationbased methods remains to be done"). The only comparison to a competing method is a quick one with MPC at the end of the paper (and there is no reference for this method by the way).
For the most part, the maths are clearly explained and motivated, although someone willing to replicate these results will definitely need to read up additional references, since some points are only addressed superficially (especially the CIO step  see Eq. 6 whose notations are not defined). Providing code would certainly be much appreciated by the community. There is one part I was not able to properly understand: section 3.2 about generating optimal actions for noisy inputs. Notations got really confusing for me at this point, and it was not clear to me what was being done (in particular the time index "t" was dropped, but is it still implied for some of the quantities?).
It would have also been useful to get some time measurements for computations during realtime control, since that could be a major bottleneck (and also knowing how these computations scale with the complexity of the 3D model). In an application like a video game, there is not much CPU available to animate a single character on screen.
Additional minor remarks:  "biped location": locomotion?  "It is our aim to unity these disparate approaches": unite? unify?  Am I right that the target is only about x,y,z coordinates, and there is no constraint on the angle the character is facing when reaching it? If yes, would it be easy to add an extra target angle? (which seems important in practice)  I would not call "noise injection" the step where additional data is generated from noisy inputs, because the target is recomputed to match the modified inputs. To me, this is simply "data augmentation", and it has nothing to do with the idea behind denoising autoencoders.  Are you really using the initial (random) network weights theta in the very first iteration, to compute the first X's in Alg. 1? Or are you starting with X's computed without the regularization term?  What is the justification for using the same hyperparameter eta in the two steps of Alg. 1?  "it is nontrivial to adapt them to asynchronous and stochastic setting we have": missing "the"?  Acronym LQG is used one line before it is defined.  "While we can use Linear Quadratic Gaussian (LQG) pass": missing "a"?  l. 231 the star in the argmin is in the wrong place, and is \bar{s} missing in \tilde{C}?  "the minimizer of a quadratic around optimal trajectory": missing a word?  "sX and aX are subsets of \phiX": why is that guaranteed?  "a single pass over the data to reduce the objective in (4)": it would rather be the objective from Alg. 1 since only theta is optimized  The asynchronous joint optimization scheme described in Section 6 is very different from the alternating scheme from Alg. 1. It would be worth at least mentioning this earlier.  l. 315, f should be f^{des}?  "In the current work, the dynamics constraints are enforced softly and thus may include some root forces in simulation.": I did not understand this sentence  The "no noise" variant in experiments could be split between "no data augmentation" and "no hidden noise injection".  A question I would find interesting to investigate is whether alpha can be decreased all the way to zero during optimization (for CIO only). In other words, is it working mostly because of a "curriculum learning" effect (easier task initially), or is it always useful to keep the trajectories learned by CIO close to what the network can learn?  Caption of Table 1a says 4 layers but I believe this is 3
Q2: Please summarize your review in 12 sentences
A greatlooking application of neural networks to character control, whose practical benefits and applicability still need to be established.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes to combine trajectory optimization with policy learning to enable interactive control of complex characters. The main proposal is to jointly learn the trajectories and the policy, with the additional empirical insight that decoupling the two optimization problems into alternating subproblems i) enables reusing of existing, well documented methods and ii) works sufficiently well in practice. Although I could follow the paper well, I'm not immersed in this domain well enough to comment on the potential impact of this paper, which on the surface certaintly looks like a technical prowess albeit with no novel theoretical insights.
The authors combine an impressive level of technical sophistication at all levels  from the concepts down to their implementation  with equally impressive results. I found the claimed advantages of i) using sensory noise during training and ii) learning trajectories jointly with the policy, to have been clearly demonstrated.
I also found the paper was clearly written, with what I think is the appropriate level of detail given the complexity of the problem and its implementation. Each section treats a separate aspect of the problem and there are very useful crossreferences between sections.
In think the motivations for the extra optimization setp in sec 7 could have been better explained. At this stage, given what had been announced in previous sections, I was somehow expecting the policy to run without the need of any extra optimization.
Is that because there is always a residual trajectory cost (or policy regression error?) such that feeding the action given by the policy into the physical simulator would accumulate errors? I would imagine that the forward evolution of the dynamics takes into account physical constraints so that no physical implausibility can result anyway, but I might be missing something? I appreciate, though, that reoptimizing at every timestep lets you use larger timestep.
l153: similar to recent approaches... can you perhaps give one or two refs? It's counterintuitive to me that keeping updates small in parameter space would be a sensible thing to do in general, though in the end I understood that it's useful in a scenario like yours where you're blockalternating between subproblems.
typo: l89: to unify
Q2: Please summarize your review in 12 sentences
The paper's results look rather impressive, and the proposed optimization scheme + implementation sensible to me. However this research area is slightly off my radar so it's difficult for me to be confident about novelty and significance.
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
 It would be nice to mention from the beginning that you need full model knowledge (Eq. 6)
 Sect. 4 and 6 are disproportionate.
 Sect. 4 does not mention the recurrent units from the abstract.
 l. 405: How is this a stochastic NN? The training data is stochastic but the network itself seems deterministic.
Rebuttal ======== Thank you for the additional information.
Q2: Please summarize your review in 12 sentences
LIGHT REVIEW The paper shows promising results but should be restructured.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their insightful
comments. Several reviewers inquired about the comparison to related
methods, in particular work of Levine and Abbeel. We address this below,
but as a metacomment, note that these papers are not based on
easytoimplement algorithms that can be compared sidebyside on
established benchmarks. Hopefully this emerging field will settle at some
point, and more direct comparisons will become feasible. But for now, we
have to assume that each group has pushed their methods as far as they
could, and obtained the best results that are possible with reasonable
effort. So we can compare the end results. This of course is not as solid
as a more controlled comparison, but it is the best we can offer for
now.
The concurrent line of work by Levine and Abbeel in
application to locomotion demonstrated neural network policies on steady
noninteractive controllers for a 2D biped and a swimmer. Furthermore,
their method requires an existing biped controller (Simbicon) to
initialize trajectory optimization and would not generalize to nonbiped
morphologies. We use contactinvariant optimization to generate stable 3D
locomotion, which does not require any manual initialization; we simply
repeat the initial pose for initialization.
Furthermore, our
algorithm is designed to be asynchronous and deployable on large compute
clusters while still converging. Trial trajectory states or goals can be
corrupted by noise and entire trials can be added or removed. In contrast,
approach of Levine and Abbeel requires keeping track of Lagrange
multipliers for every state and trial and thus must operate on a fixed set
of trials.
We also easily incorporate recurrent memory into our
method, and have experimented with training general RNNs with the
blockalternating optimization described in the paper and found it to
outperform BPTT and LSTM on initial synthetic experiments. We are happy to
include these results in the supplementary material of the final
submission.
Reviewers 2 and 3 questioned the necessity and
performance of an extra runtime optimization step in section 7. This step
is necessary because our policy commands highlevel control features, such
as desired joint velocity and contact forces, rather than directly
commanding torques. The motivation is that this set can potentially
include other more compact and highlevel features, such as character's
end effector positions or center of mass. As a result, we need an extra
step to convert our highlevel control features to control torques that
would be passed to the simulator. There are many standard ways in the
graphics and robotics literature to compute torques from control features,
and we used the one that is simplest to implement in our pipeline (doing
onestep trajectory optimization). Other ways, such as operationalspace
control can be implemented instead. Our implementation runs at 60 FPS on a
modern desktop machine and there exist realtime implementations of
operational/featurespace control in game production settings [1]. We have
also experimented with commanding torques directly for bird and swimmer
scenarios and found the resulting policies to work successfully without
performing an extra optimization step.
Reviewer 1 asked for
clarification on our feedback gain comparison. We observed that feedback
gain matrices produced by LQG are an illconditioned function of the
state, in particular producing very large spikes around contact events and
resulting in unstable controllers when used without tuning. Our approach
is a novel way to regularize feedback gains to be a smoother function
without changing the underlying objective.
Reviewers 2 and 3 asked
a number of questions, some of which we will address here.
Our
target does not only include x,y,z position of the character. We are able
to control the target heading of the character as well (as shown in the
video for quadruped and swimmer examples). We do use random network
weights for the first algorithm iteration. Not using regularization term
on first iteration does speed up convergence, but for simplicity and
consistency, we do not treat first iteration specially. A term that
keeps optimized parameters close to previous iteration's solution has been
used in stochastic ADMM [2] a question of whether our regularization
term is only necessary early in the optimization and can be removed at the
end is interesting! Our initial experiments show that policy performance
indeed degrades if this term is removed at later stages of optimization.
In other words, this term actually shapes the nature of the behaviors,
rather than just acting as an optimization tool.
[1] Evangelos
Kokkevis, Practical Physics for Articulated Characters, Game Developer's
Conference 2004. [2] Hua Ouyang et al, Stochastic Alternating Direction
Method of Multipliers, ICML 2013 
