{"title": "Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images", "book": "Advances in Neural Information Processing Systems", "page_first": 2746, "page_last": 2754, "abstract": "We introduce Embed to Control (E2C), a method for model learning and control of non-linear dynamical systems from raw pixel images. E2C consists of a deep generative model, belonging to the family of variational autoencoders, that learns to generate image trajectories from a latent space in which the dynamics is constrained to be locally linear. Our model is derived directly from an optimal control formulation in latent space, supports long-term prediction of image sequences and exhibits strong performance on a variety of complex control problems.", "full_text": "Embed to Control: A Locally Linear Latent\n\nDynamics Model for Control from Raw Images\n\nManuel Watter\u2217\n\nJost Tobias Springenberg\u2217\n\nJoschka Boedecker\n\n{watterm,springj,jboedeck}@cs.uni-freiburg.de\n\nUniversity of Freiburg, Germany\n\nMartin Riedmiller\nGoogle DeepMind\n\nLondon, UK\n\nriedmiller@google.com\n\nAbstract\n\nWe introduce Embed to Control (E2C), a method for model learning and control\nof non-linear dynamical systems from raw pixel images. E2C consists of a deep\ngenerative model, belonging to the family of variational autoencoders, that learns\nto generate image trajectories from a latent space in which the dynamics is con-\nstrained to be locally linear. 
Our model is derived directly from an optimal control\nformulation in latent space, supports long-term prediction of image sequences and\nexhibits strong performance on a variety of complex control problems.\n\n1\n\nIntroduction\n\nControl of non-linear dynamical systems with continuous state and action spaces is one of the key\nproblems in robotics and, in a broader context, in reinforcement learning for autonomous agents.\nA prominent class of algorithms that aim to solve this problem are model-based locally optimal\n(stochastic) control algorithms such as iLQG control [1, 2], which approximate the general non-\nlinear control problem via local linearization. When combined with receding horizon control [3], and\nmachine learning methods for learning approximate system models, such algorithms are powerful\ntools for solving complicated control problems [3, 4, 5]; however, they either rely on a known system\nmodel or require the design of relatively low-dimensional state representations. For real autonomous\nagents to succeed, we ultimately need algorithms that are capable of controlling complex dynamical\nsystems from raw sensory input (e.g. images) only. In this paper we tackle this dif\ufb01cult problem.\nIf stochastic optimal control (SOC) methods were applied directly to control from raw image data,\nthey would face two major obstacles. First, sensory data is usually high-dimensional \u2013 i.e. images\nwith thousands of pixels \u2013 rendering a naive SOC solution computationally infeasible. 
Second,\nthe image content is typically a highly non-linear function of the system dynamics underlying the\nobservations; thus model identi\ufb01cation and control of this dynamics are non-trivial.\nWhile both problems could, in principle, be addressed by designing more advanced SOC algo-\nrithms we approach the \u201coptimal control from raw images\u201d problem differently: turning the prob-\nlem of locally optimal control in high-dimensional non-linear systems into one of identifying a\nlow-dimensional latent state space, in which locally optimal control can be performed robustly and\neasily. To learn such a latent space we propose a new deep generative model belonging to the class\nof variational autoencoders [6, 7] that is derived from an iLQG formulation in latent space. The\nresulting Embed to Control (E2C) system is a probabilistic generative model that holds a belief over\nviable trajectories in sensory space, allows for accurate long-term planning in latent space, and is\ntrained fully unsupervised. We demonstrate the success of our approach on four challenging tasks\nfor control from raw images and compare it to a range of methods for unsupervised representation\nlearning. As an aside, we also validate that deep up-convolutional networks [8, 9] are powerful\ngenerative models for large images.\n\n\u2217Authors contributed equally.\n\n1\n\n\f2 The Embed to Control (E2C) model\n\nWe brie\ufb02y review the problem of SOC for dynamical systems, introduce approximate locally optimal\ncontrol in latent space, and \ufb01nish with the derivation of our model.\n\n2.1 Problem Formulation\n\nWe consider the control of unknown dynamical systems of the form\nst+1 = f (st, ut) + \u03be, \u03be \u223c N (0, \u03a3\u03be),\n(1)\nwhere t denotes the time steps, st \u2208 Rns the system state, ut \u2208 Rnu the applied control and \u03be\nthe system noise. The function f (st, ut) is an arbitrary, smooth, system dynamics. 
We equivalently refer to Equation (1) using the notation P(st+1|st, ut), which we assume to be a multivariate normal distribution N(f(st, ut), Σξ). We further assume that we are only given access to visual depictions xt ∈ Rnx of state st. This restriction requires solving a joint state identification and control problem. For simplicity we will in the following assume that xt is a fully observed depiction of st, but relax this assumption later.\nOur goal then is to infer a low-dimensional latent state space model in which optimal control can be performed. That is, we seek to learn a function m, mapping from high-dimensional images xt to low-dimensional vectors zt ∈ Rnz with nz ≪ nx, such that the control problem can be solved using zt instead of xt:\n\nzt = m(xt) + ω,  ω ∼ N(0, Σω),  (2)\n\nwhere ω accounts for system noise; or equivalently zt ∼ N(m(xt), Σω). Assuming for the moment that such a function can be learned (or approximated), we will first define SOC in a latent space and introduce our model thereafter.\n\n2.2 Stochastic locally optimal control in latent spaces\nLet zt ∈ Rnz be the inferred latent state from image xt of state st and f^lat(zt, ut) the transition dynamics in latent space, i.e., zt+1 = f^lat(zt, ut). Thus f^lat models the changes that occur in zt when control ut is applied to the underlying system, as a latent space analogue to f(st, ut). Assuming f^lat is known, optimal controls for a trajectory of length T in the dynamical system can be derived by minimizing the function J(z1:T, u1:T), which gives the expected future costs when following (z1:T, u1:T):\n\nJ(z1:T, u1:T) = Ez[ cT(zT, uT) + Σ_{t=t0}^{T−1} c(zt, ut) ],  (3)\n\nwhere c(zt, ut) are instantaneous costs, cT(zT, uT) denotes terminal costs and z1:T = {z1, . . . , zT} and u1:T = {u1, . . . , uT} are state and action sequences, respectively. 
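As an illustration of Equation (3), the cost of one latent trajectory under the quadratic costs introduced later in Equation (5) can be accumulated as follows. This is a sketch in our own notation; `trajectory_cost`, `R_z` and `R_u` are illustrative names, not taken from the authors' released code:

```python
import numpy as np

def trajectory_cost(z, u, z_goal, R_z, R_u):
    """Accumulated cost J for one latent trajectory (Equation (3)),
    using quadratic instantaneous costs as in Equation (5) and taking
    the terminal cost equal to the instantaneous cost, c_T = c."""
    J = 0.0
    for z_t, u_t in zip(z, u):
        d = z_t - z_goal
        J += float(d @ R_z @ d + u_t @ R_u @ u_t)
    return J
```

In practice the expectation over the stochastic dynamics is approximated by evaluating this sum on sampled or mean trajectories.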
If zt contains sufficient information about st, i.e., st can be inferred from zt alone, and f^lat is differentiable, the cost-minimizing controls can be computed from J(z1:T, u1:T) via SOC algorithms [10]. These optimal control algorithms approximate the global non-linear dynamics with locally linear dynamics at each time step t. Locally optimal actions can then be found in closed form. Formally, given a reference trajectory z̄1:T – the current estimate for the optimal trajectory – together with corresponding controls ū1:T, the system is linearized as\n\nzt+1 = A(z̄t)zt + B(z̄t)ut+1 + o(z̄t) + ω,  ω ∼ N(0, Σω),  (4)\n\nwhere A(z̄t) = δf^lat(z̄t, ūt)/δz̄t and B(z̄t) = δf^lat(z̄t, ūt)/δūt are local Jacobians, and o(z̄t) is an offset. To enable efficient computation of the local controls we assume the costs to be a quadratic function of the latent representation:\n\nc(zt, ut) = (zt − zgoal)^T Rz (zt − zgoal) + ut^T Ru ut,  (5)\n\nwhere Rz ∈ Rnz×nz and Ru ∈ Rnu×nu are cost weighting matrices and zgoal is the inferred representation of the goal state. We also assume cT(zT, uT) = c(zT, uT) throughout this paper. In combination with Equation (4) this gives us a local linear-quadratic-Gaussian formulation at each time step t which can be solved by SOC algorithms such as iterative linear-quadratic regulation (iLQR) [11] or approximate inference control (AICO) [12]. 
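For intuition, a single mean-prediction step of the linearized dynamics in Equation (4), the operation that iLQR and AICO repeatedly apply along a reference trajectory, can be sketched as follows (names are ours, not the authors'):

```python
import numpy as np

def linearized_step(z, u, A, B, o):
    """Mean of one step of the locally linear dynamics, Equation (4):
    E[z_{t+1}] = A(z̄_t) z_t + B(z̄_t) u + o(z̄_t).
    The Gaussian system noise ω ~ N(0, Σ_ω) is omitted here, giving the
    deterministic prediction used while optimizing around a reference
    trajectory."""
    return A @ z + B @ u + o
```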
The result of this trajectory optimization step is a locally optimal trajectory with corresponding control sequence (z*_1:T, u*_1:T) ≈ arg min_{z1:T, u1:T} J(z1:T, u1:T).\n\nFigure 1: The information flow in the E2C model. From left to right, we encode and decode an image xt with the networks h^enc_φ and h^dec_θ, where we use the latent code zt for the transition step. The h^trans_ψ network computes the local matrices At, Bt, ot with which we can predict ẑt+1 from zt and ut. Similarity to the encoding zt+1 is enforced by a KL divergence on their distributions and reconstruction is again performed by h^dec_θ.\n\n2.3 A locally linear latent state space model for dynamical systems\n\nStarting from the SOC formulation, we now turn to the problem of learning an appropriate low-dimensional latent representation zt ∼ P(Zt|m(xt), Σω) of xt. The representation zt has to fulfill three properties: (i) it must capture sufficient information about xt (enough to enable reconstruction); (ii) it must allow for accurate prediction of the next latent state zt+1 and thus, implicitly, of the next observation xt+1; (iii) the prediction f^lat of the next latent state must be locally linearizable for all valid control magnitudes ut. Given some representation zt, properties (ii) and (iii) in particular require us to capture possibly highly non-linear changes of the latent representation due to transformations of the observed scene induced by control commands. 
Crucially, these are particularly hard to model and subsequently linearize. We circumvent this problem by taking a more direct approach: instead of learning a latent space z and transition model f^lat which are then linearized and combined with SOC algorithms, we directly impose desired transformation properties on the representation zt during learning. We will select these properties such that prediction in the latent space as well as locally linear inference of the next observation according to Equation (4) are easy.\nThe transformation properties that we desire from a latent representation can be formalized directly from the iLQG formulation given in Section 2.2. Formally, following Equation (2), let the latent representation be Gaussian, P(Z|X) = N(m(xt), Σω). To infer zt from xt we first require a method for sampling latent states. Ideally, we would generate samples directly from the unknown true posterior P(Z|X), which we, however, have no access to. Following the variational Bayes approach (see Jordan et al. [13] for an overview) we resort to sampling zt from an approximate posterior distribution Qφ(Z|X) with parameters φ.\nInference model for Qφ. In our work this is always a diagonal Gaussian distribution Qφ(Z|X) = N(μt, diag(σt²)), whose mean μt ∈ Rnz and covariance Σt = diag(σt²) ∈ Rnz×nz are computed by an encoding neural network with outputs\n\nμt = Wμ h^enc_φ(xt) + bμ,  (6)\nlog σt = Wσ h^enc_φ(xt) + bσ,  (7)\n\nwhere h^enc_φ ∈ Rne is the activation of the last hidden layer and where φ is given by the set of all learnable parameters of the encoding network, including the weight matrices Wμ, Wσ and biases bμ, bσ. Parameterizing the mean and variance of a Gaussian distribution based on a neural network gives us a natural and very expressive model for our latent space. 
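Concretely, Equations (6) and (7) amount to two linear read-outs of the encoder's last hidden layer, from which a latent sample is drawn. A minimal sketch with random placeholder weights (standing in for trained parameters; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_e, n_z = 128, 2                       # illustrative hidden/latent sizes

# Linear read-outs of the last encoder layer, Equations (6) and (7).
W_mu, b_mu = 0.01 * rng.normal(size=(n_z, n_e)), np.zeros(n_z)
W_sigma, b_sigma = 0.01 * rng.normal(size=(n_z, n_e)), np.zeros(n_z)

def encode_sample(h_enc, eps):
    """Return (z, mu, log_sigma): a reparameterized latent sample
    z = mu + sigma * eps with eps ~ N(0, I), so the sampling step stays
    differentiable with respect to the encoder parameters."""
    mu = W_mu @ h_enc + b_mu
    log_sigma = W_sigma @ h_enc + b_sigma
    z = mu + np.exp(log_sigma) * eps
    return z, mu, log_sigma
```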
It additionally comes with the benefit that we can use the reparameterization trick [6, 7] to backpropagate gradients of a loss function based on samples through the latent distribution.\nGenerative model for Pθ. Using the approximate posterior distribution Qφ we generate observed samples (images) x̃t and x̃t+1 from latent samples zt and zt+1 by enforcing a locally linear relationship in latent space according to Equation (4), yielding the following generative model:\n\nzt ∼ Qφ(Z | X) = N(μt, Σt),\nx̃t, x̃t+1 ∼ Pθ(X | Z) = Bernoulli(pt),  (8)\nẑt+1 ∼ Q̂ψ(Ẑ | Z, u) = N(Atμt + Btut + ot, Ct),\n\nwhere Q̂ψ is the next latent state posterior distribution, which exactly follows the linear form required for stochastic optimal control. With ωt ∼ N(0, Ht) as an estimate of the system noise, Ct can be decomposed as Ct = At Σt At^T + Ht. Note that while the transition dynamics in our generative model operates on the inferred latent space, it takes untransformed controls into account. That is, we aim to learn a latent space such that the transition dynamics in z linearizes the non-linear observed dynamics in x and is locally linear in the applied controls u. Reconstruction of an image from zt is performed by passing the sample through multiple hidden layers of a decoding neural network which computes the mean pt of the generative Bernoulli distribution1 Pθ(X|Z) as\n\npt = Wp h^dec_θ(zt) + bp,  (9)\n\nwhere h^dec_θ(zt) ∈ Rnd is the response of the last hidden layer in the decoding network. The set of parameters for the decoding network, including weight matrix Wp and bias bp, then make up the learned generative parameters θ.\nTransition model for Q̂ψ. 
What remains is to specify how the linearization matrices At ∈ Rnz×nz, Bt ∈ Rnz×nu and offset ot ∈ Rnz are predicted. Following the same approach as for distribution means and covariance matrices, we predict all local transformation parameters from samples zt based on the hidden representation h^trans_ψ(zt) ∈ Rnt of a third neural network with parameters ψ – to which we refer as the transformation network. Specifically, we parametrize the transformation matrices and offset as\n\nvec[At] = WA h^trans_ψ(zt) + bA,\nvec[Bt] = WB h^trans_ψ(zt) + bB,  (10)\not = Wo h^trans_ψ(zt) + bo,\n\nwhere vec denotes vectorization and therefore vec[At] ∈ R^(nz²) and vec[Bt] ∈ R^(nz·nu). To circumvent estimating the full matrix At of size nz × nz, we can choose it to be a perturbation of the identity matrix, At = (I + vt rt^T), which reduces the parameters to be estimated for At to 2nz.\nA sketch of the complete architecture is shown in Figure 1. It also visualizes an additional constraint that is essential for learning a representation for long-term predictions: we require samples ẑt+1 from the state transition distribution Q̂ψ to be similar to the encoding of xt+1 through Qφ. While it might seem that just learning a perfect reconstruction of xt+1 from ẑt+1 is enough, we require multi-step predictions for planning in Z which must correspond to valid trajectories in the observed space X. Without enforcing similarity between samples from Q̂ψ and Qφ, following a transition in latent space from zt with action ut may lead to a point ẑt+1 from which reconstruction of xt+1 is possible, but that is not a valid encoding (i.e. the model will never encode any image as ẑt+1). 
Executing another action in ẑt+1 then does not result in a valid latent state – since the transition model is conditioned on samples coming from the inference network – and thus long-term predictions fail. In a nutshell, such a divergence between encodings and the transition model results in a generative model that does not accurately model the Markov chain formed by the observations.\n\n2.4 Learning via stochastic gradient variational Bayes\nFor training the model we use a data set D = {(x1, u1, x2), . . . , (xT−1, uT−1, xT)} containing observation tuples with corresponding controls obtained from interactions with the dynamical system. Using this data set, we learn the parameters of the inference, transition and generative model by minimizing a variational bound on the true data negative log-likelihood −log P(xt, ut, xt+1) plus an additional constraint on the latent representation. The complete loss function2 is given as\n\nL(D) = Σ_{(xt,ut,xt+1)∈D} Lbound(xt, ut, xt+1) + λ KL( Q̂ψ(Ẑ | μt, ut) ‖ Qφ(Z | xt+1) ).  (11)\n\n1 A Bernoulli distribution for Pθ is a common choice when modeling black-and-white images.\n2 Note that this is the loss for the latent state space model and distinct from the SOC costs.\n\nThe first part of this loss is the per-example variational bound on the log-likelihood\n\nLbound(xt, ut, xt+1) = E_{zt∼Qφ, ẑt+1∼Q̂ψ}[ −log Pθ(xt|zt) − log Pθ(xt+1|ẑt+1) ] + KL(Qφ ‖ P(Z)),  (12)\n\nwhere Qφ, Pθ and Q̂ψ are the parametric inference, generative and transition distributions from Section 2.3 and P(Zt) is a prior on the approximate posterior Qφ; which we always chose to be 
The second KL divergence in\nEquation (11) is an additional contraction term with weight \u03bb, that enforces agreement between the\ntransition and inference models. This term is essential for establishing a Markov chain in latent space\nthat corresponds to the real system dynamics (see Section 2.3 above for an in depth discussion). This\nKL divergence can also be seen as a prior on the latent transition model. Note that all KL terms can\nbe computed analytically for our model (see supplementary for details).\nDuring training we approximate the expectation in L(D) via sampling. Speci\ufb01cally, we take one\nsample zt for each input xt and transform that sample using Equation (10) to give a valid sample\n\u02c6zt+1 from \u02c6Q\u03c8. We then jointly learn all parameters of our model by minimizing L(D) using SGD.\n\n3 Experimental Results\n\nWe evaluate our model on four visual tasks: an agent in a plane with obstacles, a visual version of the\nclassic inverted pendulum swing-up task, balancing a cart-pole system, and control of a three-link\narm with larger images. These are described in detail below.\n\n3.1 Experimental Setup\n\nModel training. We consider two different network types for our model: Standard fully connected\nneural networks with up to three layers, which work well for moderately sized images, are used for\nthe planar and swing-up experiments; A deep convolutional network for the encoder in combination\nwith an up-convolutional network as the decoder which, in accordance with recent \ufb01ndings from\nthe literature [8, 9], we found to be an adequate model for larger images. Training was performed\nusing Adam [14] throughout all experiments. The training data set D for all tasks was generated by\nrandomly sampling N state observations and actions with corresponding successor states. For the\nplane we used N =3, 000 samples, for the inverted pendulum and cart-pole system we used N =\n15, 000 and for the arm N=30, 000. 
A complete list of architecture parameters and hyperparameter\nchoices as well as an in-depth explanation of the up-convolutional network are speci\ufb01ed in the\nsupplementary material. We will make our code and a video containing controlled trajectories for all\nsystems available under http://ml.informatik.uni-freiburg.de/research/e2c .\nModel variants. In addition to the Embed to Control (E2C) dynamics model derived above, we\nalso consider two variants: By removing the latent dynamics network htrans\n\u03c8 , i.e. setting its output\nto one in Equation (10) \u2013 we obtain a variant in which At, Bt and ot are estimated as globally\nlinear matrices (Global E2C). If we instead replace the transition model with a network estimating\nthe dynamics as a non-linear function \u02c6f lat and only linearize during planning, estimating At, Bt, ot\nas Jacobians to \u02c6f lat as described in Section 2.2, we obtain a variant with nonlinear latent dynamics.\nBaseline models. For a thorough comparison and to exhibit the complicated nature of the tasks,\nwe also test a set of baseline models on the plane and the inverted pendulum task (using the same\narchitecture as the E2C model): a standard variational autoencoder (VAE) and a deep autoencoder\n(AE) are trained on the autoencoding subtask for visual problems. That is, given a data set D\nused for training our model, we remove all actions from the tuples in D and disregard temporal\ncontext between images. After autoencoder training we learn a dynamics model in latent space,\napproximating f lat from Section 2.2. We also consider a VAE variant with a slowness term on the\nlatent representation \u2013 a full description of this variant is given in the supplementary material.\nOptimal control algorithms. 
To perform optimal control in the latent space of different models, we employ two trajectory optimization algorithms: iterative linear quadratic regulation (iLQR) [11] (for the plane and inverted pendulum) and approximate inference control (AICO) [12] (all other experiments). For all VAEs both methods operate on the mean of the distributions Qφ and Q̂ψ. AICO additionally makes use of the local Gaussian covariances Σt and Ct. Except for the experiments on the planar system, control was performed in a model predictive control fashion using the receding horizon scheme introduced in [3]. To obtain closed-loop control given an image xt, it is first passed through the encoder to obtain the latent state zt. A locally optimal trajectory is subsequently found by optimizing (z*_t:t+T, u*_t:t+T) ≈ arg min_{zt:t+T, ut:t+T} J(zt:t+T, ut:t+T) with a fixed, small horizon T (T = 10 unless noted otherwise). Controls u*_t are applied to the system and a transition to zt+1 is observed (by encoding the next image xt+1). Then a new control sequence – with horizon T – starting in zt+1 is found using the last estimated trajectory as a bootstrap.\n\nFigure 2: The true state space of the planar system (left) with examples (obstacles encoded as circles) and the inferred spaces (right) of different models. The spaces are spanned by generating images for every valid position of the agent and embedding them with the respective encoders.\n\nNote that planning is performed entirely in the latent state without access to any observations except for the depiction of the current state. To compute the cost function c(zt, ut) required for trajectory optimization in z we assume knowledge of the observation xgoal of the goal state sgoal. 
This observation is then transformed into latent space and costs are computed according to Equation (5).\n\n3.2 Control in a planar system\n\nThe agent in the planar system can move in a bounded two-dimensional plane by choosing a continuous offset in x- and y-direction. The high-dimensional representation of a state is a 40 × 40 black-and-white image. Obstructed by six circular obstacles, the task is to move to the bottom right of the image, starting from a random x position at the top of the image. The encodings of obstacles are obtained prior to planning, and an additional quadratic cost term penalizes proximity to them.\nA depiction of the observations on which control is performed – together with their corresponding state values and embeddings into latent space – is shown in Figure 2. The figure also clearly shows a fundamental advantage the E2C model has over its competitors: while the separately trained autoencoders make for aesthetically pleasing pictures, the models failed to discover the underlying structure of the state space, complicating dynamics estimation and largely invalidating costs based on distances in said space. Including the latent dynamics constraints in these end-to-end models, on the other hand, yields latent spaces approaching the optimal planar embedding.\nWe test the long-term accuracy by accumulating latent and real trajectory costs to quantify whether the imagined trajectory reflects reality. The results for all models when starting from random positions at the top and executing 40 pre-computed actions are summarized in Table 1 – using a separate test set for evaluating reconstructions. While all methods achieve a low reconstruction loss, the difference in accumulated real costs per trajectory shows the superiority of the E2C model. Using the globally or locally linear E2C model, trajectories planned in latent space are as good as trajectories planned on the real state. 
All models besides E2C fail to give long-term predictions that result in good performance.\n\n3.3 Learning swing-up for an inverted pendulum\n\nWe next turn to the task of controlling the classical inverted pendulum system [15] from images. We create depictions of the state by rendering a fixed-length line starting from the center of the image at an angle corresponding to the pendulum position. The goal in this task is to swing up and balance an underactuated pendulum from a resting position (pendulum hanging down). Exemplary observations and reconstructions for this system are given in Figure 3(d). In the visual inverted pendulum task our algorithm faces two additional difficulties: the observed space is non-Markov, as the angular velocity cannot be inferred from a single image, and second, discretization errors due to rendering pendulum angles as small 48 × 48 pixel images make exact control difficult. To restore the Markov property, we stack two images (as input channels), thus observing a one-step history.\nFigure 3 shows the topology of the latent space for our model, as well as one sample trajectory in true state and latent space. The fact that the model can learn a meaningful embedding, separating\n\nTable 1: Comparison between different approaches to model learning from raw pixels for the planar and pendulum system. We compare all models with respect to their prediction quality on a test set of sampled transitions and with respect to their performance when combined with SOC (trajectory cost for control from different start states). Note that trajectory costs in latent space are not necessarily comparable. The “real” trajectory cost was computed on the dynamics of the simulator while executing planned actions. For the true models for st, real trajectory costs were 20.24 ± 4.15 for the planar system, and 9.8 ± 2.4 for the pendulum. 
Success was defined as reaching the goal state and staying ε-close to it for the rest of the trajectory (if non-terminating). All statistics are computed over 5/30 (plane/pendulum) different starting positions. A † marks separately trained dynamics networks.\n\nPlanar System\nAlgorithm        | State Loss log p(xt|x̂t) | Next State Loss log p(xt+1|x̂t, ut) | Trajectory Cost (Latent) | Trajectory Cost (Real) | Success\nAE†              | 11.5 ± 97.8  | 3538.9 ± 1395.2  | 1325.6 ± 81.2  | 273.3 ± 16.4 | 0 %\nVAE†             | 3.6 ± 18.9   | 652.1 ± 930.6    | 43.1 ± 20.8    | 91.3 ± 16.4  | 0 %\nVAE + slowness†  | 10.5 ± 22.8  | 104.3 ± 235.8    | 47.1 ± 20.5    | 89.1 ± 16.4  | 0 %\nNon-linear E2C   | 8.3 ± 5.5    | 11.3 ± 10.1      | 19.8 ± 9.8     | 42.3 ± 16.4  | 96.6 %\nGlobal E2C       | 6.9 ± 3.2    | 9.3 ± 4.6        | 12.5 ± 3.9     | 27.3 ± 9.7   | 100 %\nE2C              | 7.7 ± 2.0    | 9.7 ± 3.2        | 10.3 ± 2.8     | 25.1 ± 5.3   | 100 %\n\nInverted Pendulum Swing-Up\nAE†              | 8.9 ± 100.3  | 13433.8 ± 6238.8 | 1285.9 ± 355.8 | 194.7 ± 44.8 | 0 %\nVAE†             | 7.5 ± 47.7   | 8791.2 ± 17356.9 | 497.8 ± 129.4  | 237.2 ± 41.2 | 0 %\nVAE + slowness†  | 26.5 ± 18.0  | 779.7 ± 633.3    | 419.5 ± 85.8   | 188.2 ± 43.6 | 0 %\nE2C no latent KL | 64.4 ± 32.8  | 87.7 ± 64.2      | 489.1 ± 87.5   | 213.2 ± 84.3 | 0 %\nNon-linear E2C   | 59.6 ± 25.2  | 72.6 ± 34.5      | 313.3 ± 65.7   | 37.4 ± 12.4  | 63.33 %\nGlobal E2C       | 115.5 ± 56.9 | 125.3 ± 62.6     | 628.1 ± 45.9   | 125.1 ± 10.7 | 0 %\nE2C              | 84.0 ± 50.8  | 89.3 ± 42.9      | 275.0 ± 16.6   | 15.4 ± 3.4   | 90 %\n\nvelocities and positions, from this data is remarkable (no other model recovered this shape). Table 1 again compares the different models quantitatively. 
While the E2C model is not the best in terms of reconstruction performance, it is the only model resulting in stable swing-up and balance behavior. We explain the failure of the other models by the fact that the non-linear latent dynamics model cannot be guaranteed to be linearizable for all control magnitudes, resulting in undesired behavior around unstable fixpoints of the real system dynamics, and that for this task a globally linear dynamics model is inadequate.\n\n3.4 Balancing a cart-pole and controlling a simulated robot arm\n\nFinally, we consider control of two more complex dynamical systems from images using a six-layer convolutional inference and six-layer up-convolutional generative network, resulting in a 12-layer deep path from input to reconstruction. Specifically, we control a visual version of the classical cart-pole system [16] from a history of two 80 × 80 pixel images as well as a three-link planar robot arm based on a history of two 128 × 128 pixel images. The latent space was set to be 8-dimensional in both experiments. The real state dimensionality for the cart-pole is four and is controlled using one\n\nFigure 3: (a) The true state space of the inverted pendulum task overlaid with a successful trajectory taken by the E2C agent. (b) The learned latent space. (c) The trajectory from (a) traced out in the latent space. (d) Images x and reconstructions x̂ showing current positions (right) and history (left).\n\nFigure 4: Left: Trajectory from the cart-pole domain. Only the first image (green) is “real”, all other images are “dreamed up” by our model. Notice discretization artifacts present in the real image. 
Right: Exemplary observed (with history image omitted) and predicted images (including the history image) for a trajectory in the visual robot arm domain with the goal marked in red.\n\naction, while for the arm the real state can be described in 6 dimensions (joint angles and velocities) and controlled using a three-dimensional action vector corresponding to motor torques.\nAs in previous experiments the E2C model seems to have no problem finding a locally linear embedding of images into latent space in which control can be performed. Figure 4 depicts exemplary images – for both problems – from a trajectory executed by our system. The costs for these trajectories (11.13 for the cart-pole, 85.12 for the arm) are only slightly worse than trajectories obtained by AICO operating on the real system dynamics starting from the same start state (7.28 and 60.74 respectively). The supplementary material contains additional experiments using these domains.\n\n4 Comparison to recent work\n\nIn the context of representation learning for control (see Böhmer et al. [17] for a review), deep autoencoders (ignoring state transitions) similar to our baseline models have been applied previously, e.g. by Lange and Riedmiller [18]. A more direct route to control based on image streams is taken by recent work on (model-free) deep end-to-end Q-learning for Atari games by Mnih et al. [19], as well as kernel-based [20] and deep policy learning for robot control [21].\nClose to our approach is a recent paper by Wahlström et al. [22], where autoencoders are used to extract a latent representation for control from images, on which a non-linear model of the forward dynamics is learned. Their model is trained jointly and is thus similar to the non-linear E2C variant in our comparison. 
In contrast to our model, their formulation requires PCA pre-processing and ensures neither that long-term predictions in latent space do not diverge, nor that the latent dynamics are linearizable.

As stated above, our system belongs to the family of VAEs and is generally similar to recent work such as Kingma and Welling [6], Rezende et al. [7], Gregor et al. [23], and Bayer and Osendorfer [24]. Two additional parallels between our work and recent advances in training deep neural networks can be observed. First, the idea of enforcing desired transformations in latent space during learning – such that the data becomes easy to model – has appeared several times in the literature. This includes the development of transforming auto-encoders [25] and recent probabilistic models for images [26, 27]. Second, learning relations between pairs of images – although without control – has received considerable attention from the community in recent years [28, 29]. In a broader context, our model is related to work on state estimation in Markov decision processes (see Langford et al. [30] for a discussion) through, e.g., hidden Markov models and Kalman filters [31, 32].

5 Conclusion

We presented Embed to Control (E2C), a system for stochastic optimal control on high-dimensional image streams. Key to the approach is the extraction of a latent dynamics model that is constrained to be locally linear in its state transitions. An evaluation on four challenging benchmarks revealed that E2C can find embeddings on which control can be performed with ease, reaching performance close to that achievable by optimal control on the real system model.

Acknowledgments

We thank A. Radford, L. Metz, and T. DeWolf for sharing code, as well as A. Dosovitskiy for useful discussions.
This work was partly funded by a DFG grant within the priority program "Autonomous learning" (SPP1597) and the BrainLinks-BrainTools Cluster of Excellence (grant number EXC 1086). M. Watter is funded through the State Graduate Funding Program of Baden-Württemberg.

References

[1] D. Jacobson and D. Mayne. Differential dynamic programming. American Elsevier, 1970.
[2] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In ACC. IEEE, 2005.
[3] Y. Tassa, T. Erez, and W. D. Smart. Receding horizon differential dynamic programming. In Proc. of NIPS, 2008.
[4] Y. Pan and E. Theodorou. Probabilistic differential dynamic programming. In Proc. of NIPS, 2014.
[5] S. Levine and V. Koltun. Variational policy search via trajectory optimization. In Proc. of NIPS, 2013.
[6] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proc. of ICLR, 2014.
[7] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proc. of ICML, 2014.
[8] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Proc. of CVPR, 2010.
[9] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In Proc. of CVPR, 2015.
[10] R. F. Stengel. Optimal Control and Estimation. Dover Publications, 1994.
[11] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In Proc. of ICINCO, 2004.
[12] M. Toussaint. Robot trajectory optimization using approximate inference. In Proc. of ICML, 2009.
[13] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Machine Learning, 1999.
[14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
[15] H. Wang, K. Tanaka, and M. Griffin. An approach to fuzzy control of nonlinear systems; stability and design issues. IEEE Trans. on Fuzzy Systems, 4(1), 1996.
[16] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
[17] W. Böhmer, J. T. Springenberg, J. Boedecker, M. Riedmiller, and K. Obermayer. Autonomous learning of state representations for control. KI - Künstliche Intelligenz, 2015.
[18] S. Lange and M. Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In Proc. of IJCNN, 2010.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.
[20] H. van Hoof, J. Peters, and G. Neumann. Learning of non-parametric control policies with high-dimensional state features. In Proc. of AISTATS, 2015.
[21] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. CoRR, abs/1504.00702, 2015. URL http://arxiv.org/abs/1504.00702.
[22] N. Wahlström, T. B. Schön, and M. P. Deisenroth. From pixels to torques: Policy learning with deep dynamical models. CoRR, abs/1502.02251, 2015. URL http://arxiv.org/abs/1502.02251.
[23] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In Proc. of ICML, 2015.
[24] J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. In NIPS 2014 Workshop on Advances in Variational Inference, 2014.
[25] G. Hinton, A. Krizhevsky, and S. Wang. Transforming auto-encoders. In Proc. of ICANN, 2011.
[26] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. CoRR, abs/1410.8516, 2015. URL http://arxiv.org/abs/1410.8516.
[27] T. Cohen and M. Welling. Transformation properties of learned visual representations. In Proc. of ICLR, 2015.
[28] G. W. Taylor, L. Sigal, D. J. Fleet, and G. E. Hinton. Dynamical binary latent variable models for 3D human pose tracking. In Proc. of CVPR, 2010.
[29] R. Memisevic. Learning to relate images. IEEE Trans. on PAMI, 35(8):1829–1846, 2013.
[30] J. Langford, R. Salakhutdinov, and T. Zhang. Learning nonlinear dynamic models. In Proc. of ICML, 2009.
[31] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models (Springer Series in Statistics). Springer-Verlag, February 1997. ISBN 0387947256.
[32] T. Matsubara, V. Gómez, and H. J. Kappen. Latent Kullback Leibler control for continuous-state systems using probabilistic graphical models. In Proc. of UAI, 2014.