{"title": "Interactive Control of Diverse Complex Characters with Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3132, "page_last": 3140, "abstract": "We present a method for training recurrent neural networks to act as near-optimal feedback controllers. It is able to generate stable and realistic behaviors for a range of dynamical systems and tasks -- swimming, flying, biped and quadruped walking with different body morphologies. It does not require motion capture or task-specific features or state machines. The controller is a neural network, having a large number of feed-forward units that learn elaborate state-action mappings, and a small number of recurrent units that implement memory states beyond the physical system state. The action generated by the network is defined as velocity. Thus the network is not learning a control policy, but rather the dynamics under an implicit policy. Essential features of the method include interleaving supervised learning with trajectory optimization, injecting noise during training, training for unexpected changes in the task specification, and using the trajectory optimizer to obtain optimal feedback gains in addition to optimal actions.", "full_text": "Interactive Control of Diverse Complex Characters\n\nwith Neural Networks\n\nIgor Mordatch, Kendall Lowrey, Galen Andrew, Zoran Popovic, Emanuel Todorov\n\nDepartment of Computer Science, University of Washington\n\n{mordatch,lowrey,galen,zoran,todorov}@cs.washington.edu\n\nAbstract\n\nWe present a method for training recurrent neural networks to act as near-optimal\nfeedback controllers.\nIt is able to generate stable and realistic behaviors for a\nrange of dynamical systems and tasks \u2013 swimming, \ufb02ying, biped and quadruped\nwalking with different body morphologies. It does not require motion capture or\ntask-speci\ufb01c features or state machines. 
The controller is a neural network, having\na large number of feed-forward units that learn elaborate state-action mappings,\nand a small number of recurrent units that implement memory states beyond the\nphysical system state. The action generated by the network is de\ufb01ned as velocity.\nThus the network is not learning a control policy, but rather the dynamics under an\nimplicit policy. Essential features of the method include interleaving supervised\nlearning with trajectory optimization, injecting noise during training, training for\nunexpected changes in the task speci\ufb01cation, and using the trajectory optimizer to\nobtain optimal feedback gains in addition to optimal actions.\n\nFigure 1: Illustration of the dynamical systems and tasks we have been able to control using the\nsame method and architecture. See the video accompanying the submission.\n\n1\n\nIntroduction\n\nInteractive real-time controllers that are capable of generating complex, stable and realistic move-\nments have many potential applications including robotic control, animation and gaming. They can\nalso serve as computational models in biomechanics and neuroscience. Traditional methods for de-\nsigning such controllers are time-consuming and largely manual, relying on motion capture datasets\nor task-speci\ufb01c state machines. Our goal is to automate this process, by developing universal synthe-\nsis methods applicable to arbitrary behaviors, body morphologies, online changes in task objectives,\nperturbations due to noise and modeling errors. This is also the ambitious goal of much work in\nReinforcement Learning and stochastic optimal control, however the goal has rarely been achieved\nin continuous high-dimensional spaces involving complex dynamics.\nDeep learning techniques on modern computers have produced remarkable results on a wide range\nof tasks, using methods that are not signi\ufb01cantly different from what was used decades ago. 
The\nobjective of the present paper is to design training methods that scale to larger and harder control\nproblems, even if most of the components were already known. Speci\ufb01cally, we combine supervised\n\n1\n\n\flearning with trajectory optimization, namely Contact-Invariant Optimization (CIO) [12], which has\ngiven rise to some of the most elaborate motor behaviors synthesized automatically. Trajectory\noptimization however is an of\ufb02ine method, so the rationale here is to use a neural network to learn\nfrom the optimizer, and eventually generate similar behaviors online. There is closely related recent\nwork along these lines [9, 11], but the method presented here solves substantially harder problems\n\u2013 in particular it yields stable and realistic locomotion in three-dimensional space, where previous\nwork was applied to only two-dimensional characters. That this is possible is due to a number of\ntechnical improvements whose effects are analyzed below.\nControl was historically among the earliest applications of neural networks, but the recent surge in\nperformance has been in computer vision, speech recognition and other classi\ufb01cation problems that\narise in arti\ufb01cial intelligence and machine learning, where large datasets are available. In contrast,\nthe data needed to learn neural network controllers is much harder to obtain, and in the case of imag-\ninary characters and novel robots we have to synthesize the training data ourselves (via trajectory\noptimization). At the same time the learning task for the network is harder. This is because we need\nprecise real-valued outputs as opposed to categorical outputs, and also because our network must\noperate not on i.i.d. samples, but in a closed loop, where errors can amplify over time and cause\ninstabilities. This necessitates specialized training procedures where the dataset of trajectories and\nthe network parameters are optimized together. 
Another challenge caused by limited datasets is the potential for over-\ufb01tting and poor generalization. Our solution is to inject different forms of noise during training. The scale of our problem requires cloud computing and a GPU implementation, and training that takes on the order of hours. Interestingly, we invest more computing resources in generating the data than in learning from it. Thus the heavy lifting is done by the trajectory optimizer, and yet the neural network complements it in a way that yields interactive real-time control.\nNeural network controllers can also be trained with more traditional methods which do not involve trajectory optimization. This has been done in discrete action settings [10] as well as in continuous control settings [3, 6, 14]. A systematic comparison of these more direct methods with the present trajectory-optimization-based methods remains to be done. Nevertheless, our impression is that networks trained with direct methods give rise to successful yet somewhat chaotic behaviors, while the present class of methods yields more realistic and purposeful behaviors.\nUsing physics-based controllers allows for interaction, but these controllers need specially designed architectures for each range of tasks or characters. For example, for biped locomotion common approaches include state machines and use of simpli\ufb01ed models (such as the inverted pendulum) and concepts (such as zero moment or capture points) [21, 18]. For quadrupedal characters, a different set of state machines, contact schedules and simpli\ufb01ed models is used [13]. For \ufb02ying and swimming yet another set of control architectures, commonly making use of explicit cyclic encodings, have been used [8, 7]. 
It is our aim to unify these disparate approaches.\n\n2 Overview\n\nLet the state of the character be de\ufb01ned as [q f r], where q is the physical pose of the character (root position, orientation and joint angles), f are the contact forces being applied on the character by the ground, and r is the recurrent memory state of the character. The motion of the character is a state trajectory of length T de\ufb01ned by X = [q0 f0 r0 ... qT fT rT]. Let X1, ..., XN be a collection of N trajectories, each starting with different initial conditions and executing a different task (such as moving the character to a particular location).\nWe introduce a neural network control policy \u03c0\u03b8 : s \u21a6 a, parametrized by neural network weights \u03b8, that maps a sensory state of the character s at each point in time to an optimal action a that controls the character. In general, the sensory state can be designed by the user to include arbitrary informative features, but in this preliminary work we use the following simple and general-purpose representation:\n\nst = [qt rt \u02d9qt\u22121 ft\u22121]\n\nat = [\u02d9qt \u02d9rt ft],\n\nwhere, e.g., \u02d9qt \u225c qt+1 \u2212 qt denotes the instantaneous rate of change of q at time t. With this representation of the action, the policy directly commands the desired velocity of the character and applied contact forces, and determines the evolution of the recurrent state r. Thus, our network learns both optimal controls and a model of dynamics simultaneously.\n\nLet Ci(X) be the total cost of the trajectory X, which rewards accurate execution of task i and physical realism of the character\u2019s motion. 
We want to jointly \ufb01nd a collection of optimal trajectories that each complete a particular task, along with a policy \u03c0\u03b8 that is able to reconstruct the sense and action pairs st(X) and at(X) of all trajectories at all timesteps:\n\nmin_{\u03b8, X1 ... XN} \u2211i Ci(Xi) subject to \u2200 i, t : at(Xi) = \u03c0\u03b8(st(Xi)). (1)\n\nThe optimized policy parameters \u03b8 can then be used to execute the policy in real-time, allowing the user to interactively control the character.\n\n2.1 Stochastic Policy and Sensory Inputs\n\nInjecting noise has been shown to produce more robust movement strategies in graphics and optimal control [20, 6], reduce over\ufb01tting and prevent feature co-adaptation in neural network training [4], and stabilize recurrent behaviour of neural networks [5]. We inject noise in a principled way to aid in learning policies that do not diverge when rolled out at execution time.\nIn particular, we inject additive Gaussian noise into the sensory inputs s given to the neural network. Let the sensory noise be denoted \u03b5 \u223c N(0, \u03c3\u03b5\u00b2 I), so the resulting noisy policy inputs are s + \u03b5. This is similar to denoising autoencoders [17] with one important difference: the change in input in our setting also induces a change in the optimal action to output. If the noise is small enough, the optimal action at nearby noisy states is given by the \ufb01rst-order expansion (2), where as (i.e., da/ds) is the matrix of optimal feedback gains around s. These gains can be calculated as a byproduct of trajectory optimization as described in section 3.2. 
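To make this construction concrete, here is a minimal sketch (our own illustration, not the paper's code) of building a noisy input/target pair from an optimal state-action pair and its feedback-gain matrix. For a policy that happens to be exactly linear, the gain-corrected target a + as\u03b5 reproduces the optimal action at the perturbed input.

```python
import numpy as np

def noisy_pair(s, a, a_s, sigma_eps, rng):
    # Perturb the sensory input with Gaussian noise and correct the
    # target action to first order using the feedback-gain matrix a_s = da/ds.
    eps = sigma_eps * rng.standard_normal(s.shape)
    return s + eps, a + a_s @ eps

# Sanity check on a hypothetical linear "policy" a = W s, where the
# first-order correction is exact: the target equals W (s + eps).
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
s = rng.standard_normal(5)
s_noisy, a_target = noisy_pair(s, W @ s, W, 1e-2, rng)
assert np.allclose(a_target, W @ s_noisy)
```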
Intuitively, such feedback helps the neural network trainer to learn a policy that can automatically correct for small deviations from the optimal trajectory and allows us to use much less training data.\n\na(s + \u03b5) = a + as \u03b5. (2)\n\n2.2 Distributed Stochastic Optimization\n\nThe resulting constrained optimization problem (1) is nonconvex and too large to solve directly. We replace the hard equality constraint with a quadratic penalty with weight \u03b1:\n\nR(s, a, \u03b8, \u03b5) = (\u03b1/2) \u2016(a + as\u03b5) \u2212 \u03c0\u03b8(s + \u03b5)\u2016\u00b2, (3)\n\nleading to the relaxed, unconstrained objective\n\nmin_{\u03b8, X1 ... XN} \u2211i Ci(Xi) + \u2211i,t R(st(Xi), at(Xi), \u03b8, \u03b5i,t). (4)\n\nWe then proceed to solve the problem in block-alternating optimization fashion, optimizing for one set of variables while holding others \ufb01xed. In particular, we independently optimize for each Xi (trajectory optimization) and for \u03b8 (neural network regression).\nAs the target action a + as\u03b5 depends on the optimal feedback gains as, the noise \u03b5 is resampled after optimizing each policy training sub-problem. In principle the noisy sensory state and corresponding action could be recomputed within the neural network training procedure, but we found it expedient to freeze the noise during NN optimization (so that the optimal feedback gains need not be passed to the NN training process). Similar to recent stochastic optimization approaches, we introduce quadratic proximal regularization terms (weighted by rate \u03b7) that keep the solution of the current iteration close to its previous optimal value. 
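As a deliberately tiny illustration of this block-alternating scheme, the sketch below (our own toy setup, not the paper's code) alternates between per-task "trajectory" updates and a scalar linear "policy" fit, each with a proximal term; each "trajectory" is a single action ai with task cost (ai \u2212 gi)\u00b2, and the policy is a = w\u00b7s with sensory state si = gi.

```python
import numpy as np

def alternating_optimization(goals, alpha=10.0, eta=1e-2, iters=300):
    g = np.asarray(goals, dtype=float)
    a = np.zeros_like(g)   # per-task actions (stand-ins for trajectories)
    w = 0.0                # scalar policy parameter
    for _ in range(iters):
        a_prev, w_prev = a.copy(), w
        # "Trajectory optimization": minimize (a-g)^2 + alpha/2 (a - w g)^2
        # + eta/2 (a - a_prev)^2, solved in closed form per task.
        a = (2.0 * g + alpha * w * g + eta * a_prev) / (2.0 + alpha + eta)
        # "Policy regression": minimize alpha/2 * sum_i (a_i - w g_i)^2
        # + eta/2 (w - w_prev)^2, also in closed form.
        w = (alpha * g @ a + eta * w_prev) / (alpha * g @ g + eta)
    return w, a
```

On this toy problem the alternation converges to w \u2248 1, i.e., the regressed policy exactly reproduces the per-task optimal actions, mirroring how the penalty R pulls trajectories and policy together.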
The resulting algorithm is\n\nAlgorithm 1: Distributed Stochastic Optimization\n1 Sample sensor noise \u00af\u03b5i,t for each t and i.\n2 Optimize N trajectories (sec 3): \u00afXi = argminX Ci(X) + \u2211t R(si,t, ai,t, \u00af\u03b8, \u00af\u03b5i,t) + (\u03b7/2) \u2016X \u2212 \u00afXi\u2016\u00b2\n3 Solve neural network regression (sec 4): \u00af\u03b8 = argmin\u03b8 \u2211i,t R(\u00afsi,t, \u00afai,t, \u03b8, \u00af\u03b5i,t) + (\u03b7/2) \u2016\u03b8 \u2212 \u00af\u03b8\u2016\u00b2\n4 Repeat.\n\nThus we have reduced a complex policy search problem in (1) to an alternating sequence of independent trajectory optimization and neural network regression problems, each of which is well-studied and allows the use of existing implementations. While previous work [9, 11] used ADMM or dual gradient descent to solve similar optimization problems, it is non-trivial to adapt them to the asynchronous and stochastic setting we have. Despite a potentially slower rate, we still observe convergence as shown in section 8.1.\n\n3 Trajectory Optimization\n\nWe wish to \ufb01nd trajectories that start with particular initial conditions and execute the task, while satisfying physical realism of the character\u2019s motion. The existing approach we use is Contact-Invariant Optimization (CIO) [12], which is a direct trajectory optimization method based on inverse dynamics. De\ufb01ne the total cost for a trajectory X:\n\nC(X) = \u2211t c(\u03c6t(X)), (5)\n\nwhere \u03c6t(X) is a function that extracts a vector of features (such as root forces, contact distances, control torques, etc.) 
from the trajectory at time t and c(\u03c6) is the state cost over these features. Physical realism is achieved by satisfying equations of motion, non-penetration, and force complementarity conditions at every point in the trajectory [12]:\n\nH(q)\u00a8q + C(q, \u02d9q) = \u03c4 + J\u22a4(q, \u02d9q)f, d(q) \u2265 0, d(q)\u22a4f = 0, f \u2208 K(q), (6)\n\nwhere d(q) is the distance of the contact to the ground and K is the contact friction cone. These constraints are implemented as soft constraints, as in [12], and are included in C(X). Initial conditions are also implemented as soft constraints in C(X). Additionally we want to make sure the task is satis\ufb01ed, such as moving to a particular location while minimizing effort. These task costs are the same for all our experiments and are described in section 8. Importantly, CIO is able to \ufb01nd solutions with trivial initializations, which makes it possible to have a broad range of characters and behaviors without requiring hand-designed controllers or motion capture for initialization.\n\n3.1 Optimal Trajectory\n\nThe trajectory optimization problem consists of \ufb01nding the optimal trajectory parameters X that minimize the total cost (5), with objective (3) now folded into C for simplicity:\n\nX\u2217 = argminX C(X). (7)\n\nWe solve the above optimization problem using Newton\u2019s method, which requires the gradient and Hessian of the total cost function. Using the chain rule, these quantities are\n\nC_X = \u2211t c_\u03c6^t \u03c6_X^t, C_XX = \u2211t (\u03c6_X^t)\u22a4 c_\u03c6\u03c6^t \u03c6_X^t + c_\u03c6^t \u03c6_XX^t \u2248 \u2211t (\u03c6_X^t)\u22a4 c_\u03c6\u03c6^t \u03c6_X^t,\n\nwhere the truncation of the last term in C_XX is the common Gauss-Newton Hessian approximation [1]. We choose cost functions for which c_\u03c6 and c_\u03c6\u03c6 can be calculated analytically. On the other hand, \u03c6_X is calculated by \ufb01nite differencing. 
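A minimal numerical sketch of this Gauss-Newton construction (our own toy instance, not the paper's code): we take \u03c6 to be an affine feature map and c(\u03c6) = \u00bd\u2016\u03c6\u2016\u00b2, so c_\u03c6 and c_\u03c6\u03c6 are analytic while \u03c6_X is finite-differenced, as in the text.

```python
import numpy as np

def jacobian_fd(phi, X, eps=1e-6):
    # Finite-difference the feature Jacobian phi_X column by column.
    p = phi(X)
    J = np.zeros((p.size, X.size))
    for j in range(X.size):
        dX = np.zeros_like(X)
        dX[j] = eps
        J[:, j] = (phi(X + dX) - p) / eps
    return J

def gauss_newton_step(phi, X):
    # For c(phi) = 0.5 ||phi||^2: c_phi = phi and c_phiphi = I.
    J = jacobian_fd(phi, X)
    C_X = J.T @ phi(X)   # gradient
    C_XX = J.T @ J       # Gauss-Newton Hessian approximation
    return X - np.linalg.solve(C_XX, C_X)

# For an affine feature map, a single step lands on the least-squares optimum.
A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
phi = lambda X: A @ X - b
X1 = gauss_newton_step(phi, np.zeros(2))
```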
The optimum can then be found by the following recursion:\n\nX\u2217 \u2190 X\u2217 \u2212 C_XX\u207b\u00b9 C_X. (8)\n\nBecause this optimization is only a sub-problem (step 2 in algorithm 1), we don\u2019t run it to convergence, and instead take between one and ten iterations.\n\n3.2 Optimal Feedback Gains\n\nIn addition to the optimal trajectory, we also need to \ufb01nd the optimal feedback gains as necessary to generate optimal actions for noisy inputs in (2). While these feedback gains are a byproduct of indirect trajectory optimization methods such as LQG, they are not an obvious result of direct trajectory optimization methods like CIO. While we could use a Linear Quadratic Gaussian (LQG) pass around our optimal solution to compute these gains, this is inef\ufb01cient as it does not make use of computation already performed during direct trajectory optimization. Moreover, we found the resulting process can produce very large and ill-conditioned feedback gains. One could change the objective function for the LQG pass when calculating feedback gains to make them smoother (for example, by adding an explicit trajectory smoothness cost), but then the optimal actions would be using feedback gains from a different objective. Instead, we describe a perturbation method that reuses computation done during direct trajectory optimization, also producing better-conditioned gains. This is a general method for producing feedback gains that stabilize resulting optimal trajectories and can be useful for other applications.\nSuppose we perturb a certain aspect of the optimal trajectory X, such that the sensory state changes: s(X) = \u00afs. We wish to \ufb01nd how the optimal action a(X) will change given this perturbation. 
We can enforce the perturbation with a soft constraint of weight \u03bb, resulting in an augmented total cost:\n\n\u02dcC(X, \u00afs) = C(X) + (\u03bb/2) \u2016s(X) \u2212 \u00afs\u2016\u00b2. (9)\n\nLet \u02dcX(\u00afs) = argminX \u02dcC(X, \u00afs) be the optimum of the augmented total cost. For \u00afs near s(X) (as is the case with local feedback control), the minimizer of the augmented cost is the minimizer of a quadratic around the optimal trajectory X:\n\n\u02dcX(\u00afs) = X \u2212 \u02dcC_XX\u207b\u00b9(X, \u00afs) \u02dcC_X(X, \u00afs) = X \u2212 (C_XX + \u03bb s_X\u22a4 s_X)\u207b\u00b9 (C_X + \u03bb s_X\u22a4 (s(X) \u2212 \u00afs)),\n\nwhere all derivatives are calculated around X. Differentiating the above w.r.t. \u00afs,\n\n\u02dcX_\u00afs = \u03bb (C_XX + \u03bb s_X\u22a4 s_X)\u207b\u00b9 s_X\u22a4 = C_XX\u207b\u00b9 s_X\u22a4 (s_X C_XX\u207b\u00b9 s_X\u22a4 + (1/\u03bb) I)\u207b\u00b9,\n\nwhere the last equality follows from the Woodbury identity and has the bene\ufb01t of reusing C_XX\u207b\u00b9, which is already computed as part of trajectory optimization. The optimal feedback gains for a are a_\u00afs = a_X \u02dcX_\u00afs. Note that s_X and a_X are subsets of \u03c6_X, and are already calculated as part of trajectory optimization. Thus, computing optimal feedback gains comes at very little additional cost.\nOur approach produces softer feedback gains according to parameter \u03bb without modifying the cost function. The intuition is that instead of holding the perturbed initial state \ufb01xed (as LQG does, for example), we make matching the initial state a soft constraint. 
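The reuse of C_XX\u207b\u00b9 via the Woodbury (push-through) identity is easy to check numerically; the sketch below (our own illustration with random stand-in matrices, not the paper's code) verifies that the two expressions for \u02dcX_\u00afs agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
B = rng.standard_normal((n, n))
C_xx = B @ B.T + n * np.eye(n)      # stand-in for the (SPD) trajectory Hessian C_XX
s_x = rng.standard_normal((m, n))   # stand-in for the sensory-state Jacobian s_X
lam = 100.0

# Direct form: lambda (C_XX + lambda s_X^T s_X)^-1 s_X^T
direct = lam * np.linalg.solve(C_xx + lam * s_x.T @ s_x, s_x.T)

# Woodbury form reusing C_XX^-1: C_XX^-1 s_X^T (s_X C_XX^-1 s_X^T + I/lambda)^-1
C_inv = np.linalg.inv(C_xx)
woodbury = C_inv @ s_x.T @ np.linalg.inv(s_x @ C_inv @ s_x.T + np.eye(m) / lam)
```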
By weakening this constraint, we can modify the initial state to better achieve the master cost function without using very aggressive feedback.\n\n4 Neural Network Policy Regression\n\nAfter performing trajectory optimization, we perform standard regression to \ufb01t a neural network to the noisy \ufb01xed input and output pairs {s + \u03b5, a + as\u03b5}i,t for each timestep and trajectory. Our neural network policy has a total of K layers, hidden layer activation function \u03c3 (tanh, in the present work) and hidden units hk for layer k. To learn a model that is robust to small changes in neural state, we add independent Gaussian noise \u03b3k \u223c N(0, \u03c3\u03b3\u00b2 I) with variance \u03c3\u03b3\u00b2 to the neural activations at each layer during training. Wager et al. [19] observed that this noise model makes hidden units tend toward saturated regions and become less sensitive to precise values of individual units.\nAs with the trajectory optimization sub-problems, we do not run the neural network trainer to convergence but rather perform only a single pass of batched stochastic gradient descent over the dataset before updating the parameters \u03b8 in step 3 of Algorithm 1.\nAll our experiments use 3 hidden layer neural networks with 250 hidden units in each layer (other network sizes are evaluated in section 8.1). The neural network weight matrices are initialized with a spectral radius of just above 1, similar to [15, 5]. This helps to make sure initial network dynamics are stable and do not vanish or explode.\n\n5 Training Trajectory Generation\n\nTo train a neural network for interactive use, we required a data set that includes a dynamically changing task goal state. The task, in this case, is the locomotion of a character to a movable goal position controlled by the user. (Our character\u2019s goal position was always set to be the origin, which encodes the character\u2019s state position in the goal position\u2019s coordinate frame. 
Thus the \u201corigin\u201d may shift relative to the character, but this keeps behavior invariant to the global frame of reference.)\nOur trajectory generation creates a dataset consisting of trials and segments. Each trial k starts with a reference physical pose and null recurrent memory [q \u02d9q r]init and must reach goal location gk,0. After generating an optimal trajectory Xk,0 according to section 3, a random timestep t is chosen to branch a new segment with [q \u02d9q r]t used as the initial state. A new goal location gk,1 is also chosen randomly for optimal trajectory Xk,1.\nThis process represents the character changing direction at some point along its original trajectory plan: \u201cinteraction\u201d in this case is simply a new change in goal position. This technique allows our initial states and goals to come from a distribution that re\ufb02ects the character\u2019s typical motion. In all our experiments, we use between 100 and 200 trials, each with 5 branched segments.\n\n6 Distributed Training Architecture\n\nOur training algorithm was implemented in an asynchronous, distributed architecture, utilizing a GPU for neural network training. Simple parallelism was achieved by distributing the trajectory optimization processes to multiple node machines, while the resulting data was used to train the NN policy on a single GPU node.\nAmazon Web Service\u2019s EC2 3.8xlarge instances provided the nodes for optimization, while a g2.2xlarge instance provided the GPU. Utilizing a star topology with the GPU instance at the center, a Network File System server distributes the training data X and network parameters \u03b8 to the necessary processes within the cluster. Each optimization node is assigned a subset of the total trials and segments for the given task. 
This simple use of \ufb01les for data storage meant that no supporting infrastructure was needed other than standard \ufb01le locking for concurrency.\nWe used a custom GPU implementation of stochastic gradient descent (SGD) to train the neural network control policy. For the \ufb01rst training epoch, all trajectories and action sequences are loaded onto the GPU, randomly shuf\ufb02ing the order of the frames. Then the neural network parameters \u03b8 are updated using batched SGD in a single pass over the data to reduce the objective in (4). At the start of subsequent training epochs, trajectories which have been updated by one of the trajectory optimization processes (and injected with new sensor noise \u03b5) are reloaded.\nAlthough this architecture is asynchronous, the proximal regularization terms in the objective prevent the training data and policy results from changing too quickly and keep the optimization from diverging. As a result, our training performance scales linearly with cluster size, up to about 30 optimization nodes per GPU machine. We run the overall optimization process until an average of 200 trajectory optimization iterations has been reached across all machines. This usually results in about 10000 neural network training epochs, and takes about 2.5 hours to complete, depending on task parameters and number of nodes.\n\n7 Policy Execution\n\nOnce we \ufb01nd the optimal policy parameters \u03b8 of\ufb02ine, we can execute the resulting policy in real-time under user control. Unlike non-parametric methods like motion graphs or Gaussian Processes, we do not need to keep any trajectory data at execution time. 
Starting with an initial state x0, we compute the sensory state s and query the policy (without noise) for the desired action [\u02d9qdes \u02d9rdes f des]. To evolve the physical state of the system, we directly optimize the next state x1 to match the desired action while satisfying the equations of motion:\n\nx1 = argminx \u2016\u02d9q \u2212 \u02d9qdes\u2016\u00b2 + \u2016\u02d9r \u2212 \u02d9rdes\u2016\u00b2 + \u2016f \u2212 f des\u2016\u00b2 subject to (6).\n\nNote that this is simply the optimization problem (7) with horizon T = 1, which can be solved at real-time rates and does not require any additional implementation. This approach is reminiscent of feature-based control in computer graphics and robotics.\n\nBecause our physical state evolution is a result of optimization (similar to an implicit integrator), it does not suffer from the instabilities or divergence that Euler integration would, and allows the use of larger timesteps (we use \u2206t of 50ms in all our experiments). In the current work, the dynamics constraints are enforced softly and thus may include some root forces in simulation.\n\n8 Results\n\nThis algorithm was applied to learn a policy that allows interactive locomotion for a range of very different three-dimensional characters. We used a single network architecture and parameters to create all controllers without any specialized initializations. While the task is locomotion, different character types exhibit very different behaviors. The experiments include three-dimensional swimming and \ufb02ying characters as well as biped and quadruped walking tasks. Unlike in two-dimensional scenarios, it is much easier for characters to fall or go into unstable regions, yet our method manages to learn successful controllers. 
We strongly suggest viewing the supplementary video for examples of the resulting behaviors.\nThe swimming creature featured four \ufb01ns with two degrees of freedom each. It is propelled by lift and drag forces for a simulated water density of 1000kg/m3. To move, orient, or maintain position, the controller learned to sweep down opposite \ufb01ns in a cyclical pattern, as in treading water. The bird creature was a modi\ufb01cation of the swimmer, with opposing two-segment wings and the medium density changed to that of air (1.2kg/m3). The learned behavior that emerged is a cyclical \ufb02apping motion (more vigorous now, because of the lower medium density) as well as utilization of lift forces to coast to distant goal positions and modulation of \ufb02apping speed to change altitude.\nThree bipedal creatures were created to explore the controller\u2019s function with respect to contact forces. Two creatures were akin to a humanoid - one large and one small, both with arms - while the other had a very wide torso compared to its height. All characters learned to walk to the target location and orientation with a regular, cyclic gait. The same algorithm also learned a stereotypical trot gait for dog-like and spider-like quadrupeds. This alternating left/right footstep cyclic behavior for bipeds or trot gait for quadrupeds emerged without any user input or hand-crafting.\nThe costs in the trajectory optimization were to reach the goal position and orientation while minimizing torque usage and contact force magnitudes. We used the MuJoCo physics simulator [16] for our dynamics calculations. 
The values of the algorithmic constants used in all experiments are \u03c3\u03b5 = 10\u207b\u00b2, \u03c3\u03b3 = 10\u207b\u00b2, \u03b1 = 10, \u03bb = 10\u00b2, \u03b7 = 10\u207b\u00b2.\n\n8.1 Comparative Evaluation\n\nWe show the performance of our method on a biped walking task in \ufb01gure 2 under the full method case. To test the contribution of our proposed joint optimization technique, we compared our algorithm to naive neural network training on a static optimal trajectory dataset. We disabled the neural network and generated optimal trajectories according to section 5. Then, we performed our regression on this static data set with no trajectories being re-optimized. The results are shown in the no joint case. We see that at test time, our full method performs two orders of magnitude better than static training. To test the contribution of noise injection, we used our full method, but disabled sensory and hidden unit noise (sections 2.1 and 4). The results are under the no noise case. We observe typical over\ufb01tting, with good training performance, but very poor test performance. In practice, both ablations above lead to policy rollouts that quickly diverge from expected behavior.\nAdditionally, we have compared the performance of different policy network architectures on the biped walking task by varying the number of layers and hidden units. The results are shown in table 1. We see that 3 hidden layers of 250 units gives the best performance/complexity tradeoff.\nModel-predictive control (MPC) is another potential choice of a real-time controller for task-driven character behavior. In fact, the trajectory costs for both MPC and our method are very similar. The resulting trajectories, however, end up being different: MPC creates effective trajectories that are not cyclical (both are shown in \ufb01gure 3 for a bird character). 
This suggests a signi\ufb01cant nullspace of task solutions; from all these solutions, our joint optimization - through the cost terms matching the neural network output - acts to regularize trajectory optimization toward predictable and less chaotic behaviors.\n\nFigure 2: Performance of our full method and two ablated con\ufb01gurations as training progresses over 10000 neural network updates. Mean and variance of the error is over 1000 training and test trials.\n\n10 neurons 0.337 \u00b1 0.06\n25 neurons 0.309 \u00b1 0.06\n100 neurons 0.186 \u00b1 0.02\n250 neurons 0.153 \u00b1 0.02\n500 neurons 0.148 \u00b1 0.02\n\n(a) Increasing neurons per layer with 4 layers\n\n1 layer 0.307 \u00b1 0.06\n2 layers 0.253 \u00b1 0.06\n3 layers 0.153 \u00b1 0.02\n4 layers 0.158 \u00b1 0.02\n\n(b) Increasing layers with 250 neurons per layer\n\nTable 1: Mean and variance of joint position error on test rollouts with our method after training with different neural network con\ufb01gurations.\n\n9 Conclusions and Future Work\n\nWe have presented an automatic way of generating neural network parameters that represent a control policy for physically consistent interactive character control, requiring only a dynamical character model and a task description. Using both trajectory optimization and stochastic neural networks together combines correct behavior with real-time interactive use. Furthermore, the same algorithm and controller architecture can provide interactive control for multiple creature morphologies.\nWhile the behavior of the characters re\ufb02ected ef\ufb01cient task completion in this work, additional modi\ufb01cations could be made to affect the style of behavior \u2013 costs during trajectory optimization can affect how a task is completed. 
Incorporating muscle actuation effects into our character models may result in more biomechanically plausible actions for that (biologically based) character. In addition to changing the character's physical characteristics, we could explore different neural network architectures and how they compare to biological systems. With this work, we have networks that enable diverse physical action, which could be augmented to further reflect biological sensorimotor systems. This model could be used to experiment with the effects of sensor delays on the resulting motions, for example [2].
This work focused on locomotion of different creatures with the same algorithm. Previous work has demonstrated behaviors such as getting up, climbing, and reaching with the same trajectory optimization method [12]. Real-time policies using this algorithm could allow interactive use of these behaviors as well. Extending beyond character animation, this work could be used to develop controllers for robotics applications that are robust to sensor noise and perturbations, provided the trained character model accurately reflects the robot's physical parameters.

Figure 3: Typical joint angle trajectories that result from MPC and our method. While both trajectories successfully maintain position for a bird character, our method generates trajectories that are cyclic and regular.

References

[1] P. Chen. Hessian matrix vs. Gauss-Newton Hessian matrix. SIAM J. Numerical Analysis, 49(4):1417-1435, 2011.

[2] H. Geyer and H. Herr. A muscle-reflex model that encodes principles of legged mechanics produces human walking dynamics and muscle activities. Neural Systems and Rehabilitation Engineering, IEEE Transactions on, 18(3):263-273, 2010.

[3] R. Grzeszczuk, D. Terzopoulos, and G. Hinton. NeuroAnimator: Fast neural network emulation and control of physics-based models.
In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '98, pages 9-20, New York, NY, USA, 1998. ACM.

[4] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[5] G. M. Hoerzer, R. Legenstein, and W. Maass. Emergence of complex computational structures from chaotic neural networks through reward-modulated Hebbian learning. Cerebral Cortex, 2012.

[6] D. Huh and E. Todorov. Real-time motor control using recurrent neural networks. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL '09. IEEE Symposium on, pages 42-49, March 2009.

[7] A. J. Ijspeert. Central pattern generators for locomotion control in animals and robots: a review, 2008.

[8] E. Ju, J. Won, J. Lee, B. Choi, J. Noh, and M. G. Choi. Data-driven control of flapping flight. ACM Trans. Graph., 32(5):151:1-151:12, Oct. 2013.

[9] S. Levine and V. Koltun. Learning complex neural network policies with trajectory optimization. In ICML '14: Proceedings of the 31st International Conference on Machine Learning, 2014.

[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

[11] I. Mordatch and E. Todorov. Combining the benefits of function approximation and trajectory optimization. In Robotics: Science and Systems (RSS), 2014.

[12] I. Mordatch, E. Todorov, and Z. Popović. Discovery of complex behaviors through contact-invariant optimization. ACM Transactions on Graphics (TOG), 31(4):43, 2012.

[13] J. R. Rebula, P. D. Neuhaus, B. V. Bonnlander, M. J. Johnson, and J. E. Pratt. A controller for the LittleDog quadruped walking on rough terrain.
In Robotics and Automation, 2007 IEEE International Conference on, pages 1467-1473. IEEE, 2007.

[14] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.

[15] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139-1147, May 2013.

[16] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IROS '12, pages 5026-5033, 2012.

[17] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. pages 1096-1103, 2008.

[18] M. Vukobratović and B. Borovac. Zero-moment point - thirty five years of its life. International Journal of Humanoid Robotics, 1(1):157-173, 2004.

[19] S. Wager, S. Wang, and P. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems (NIPS), 2013.

[20] J. M. Wang, D. J. Fleet, and A. Hertzmann. Optimizing walking controllers for uncertain inputs and environments. ACM Trans. Graph., 29(4):73:1-73:8, July 2010.

[21] K. Yin, K. Loken, and M. van de Panne. SIMBICON: Simple biped locomotion control. ACM Trans. Graph., 26(3):Article 105, 2007.