{"title": "Regularizing Trajectory Optimization with Denoising Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 2859, "page_last": 2869, "abstract": "Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that using regularized trajectory optimization leads to rapid initial learning in a set of popular motor control tasks, which suggests that the proposed approach can be a useful tool for improving sample efficiency.", "full_text": "Regularizing Trajectory Optimization\n\nwith Denoising Autoencoders\n\nRinu Boney\u2217\n\nAalto University & Curious AI\n\nrinu.boney@aalto.fi\n\nNorman Di Palo\u2217\n\nSapienza University of Rome\nnormandipalo@gmail.com\n\nMathias Berglund\n\nCurious AI\n\nAlexander Ilin\n\nAalto University & Curious AI\n\nJuho Kannala\nAalto University\n\nAntti Rasmus\n\nCurious AI\n\nHarri Valpola\n\nCurious AI\n\nAbstract\n\nTrajectory optimization using a learned model of the environment is one of the\ncore elements of model-based reinforcement learning. This procedure often suffers\nfrom exploiting inaccuracies of the learned model. We propose to regularize\ntrajectory optimization by means of a denoising autoencoder that is trained on the\nsame trajectories as the model of the environment. We show that the proposed\nregularization leads to improved planning with both gradient-based and gradient-\nfree optimizers. 
We also demonstrate that using regularized trajectory optimization leads to rapid initial learning in a set of popular motor control tasks, which suggests that the proposed approach can be a useful tool for improving sample efficiency.

1 Introduction

State-of-the-art reinforcement learning (RL) often requires a large number of interactions with the environment to learn even relatively simple tasks [11]. It is generally believed that model-based RL can provide better sample efficiency [9, 2, 5] but showing this in practice has been challenging. In this paper, we propose a way to improve planning in model-based RL and show that it can lead to improved performance and better sample efficiency.
In model-based RL, planning is done by computing the expected result of a sequence of future actions using an explicit model of the environment. Model-based planning has been demonstrated to be efficient in many applications where the model (a simulator) can be built using first principles. For example, model-based control is widely used in robotics and has been used to solve challenging tasks such as human locomotion [34, 35] and dexterous in-hand manipulation [21].
In many applications, however, we often do not have the luxury of an accurate simulator of the environment. Firstly, building even an approximate simulator can be very costly, even for processes whose dynamics is well understood. Secondly, it can be challenging to align the state of an existing simulator with the state of the observed process in order to plan. Thirdly, the environment is often non-stationary due to, for example, hardware failures in robotics, change of the input feed and deactivation of materials in industrial process control. Thus, learning the model of the environment is the only viable option in many applications and learning needs to be done for a live system.
And since many real-world systems are very complex, we are likely to need powerful function approximators, such as deep neural networks, to learn the dynamics of the environment.
However, planning using a learned (and therefore inaccurate) model of the environment is very difficult in practice. The process of optimizing the sequence of future actions to maximize the expected return (which we call trajectory optimization) can easily exploit the inaccuracies of the model and suggest a very unreasonable plan which produces highly over-optimistic predicted rewards. This optimization process works similarly to adversarial attacks [1, 13, 33, 7], where the input of a trained model is modified to achieve the desired output. In fact, a more efficient trajectory optimizer is more likely to fall into this trap. This can arguably be the reason why gradient-based optimization (which is very efficient at, for example, learning the models) has not been widely used for trajectory optimization.
In this paper, we study this adversarial effect of model-based planning in several environments and show that it poses a problem particularly in high-dimensional control spaces. We also propose to remedy this problem by regularizing trajectory optimization using a denoising autoencoder (DAE) [37]. The DAE is trained to denoise trajectories that appeared in the past experience and in this way the DAE learns the distribution of the collected trajectories. During trajectory optimization, we use the denoising error of the DAE as a regularization term that is subtracted from the maximized objective function.

∗ Equal contribution, rest in alphabetical order

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
The intuition is that the denoising error will be large for trajectories that are far from the training distribution, signaling that the dynamics model predictions will be less reliable as the model has not been trained on such data. Thus, a good trajectory has to give a high predicted return and it can be only moderately novel in the light of past experience.
In the experiments, we demonstrate that the proposed regularization significantly diminishes the adversarial effect of trajectory optimization with learned models. We show that the proposed regularization works well with both gradient-free and gradient-based optimizers (experiments are done with the cross-entropy method [3] and Adam [14]) in both open-loop and closed-loop control. We demonstrate that improved trajectory optimization translates to excellent results in early parts of training in standard motor-control tasks and achieves competitive performance after a handful of interactions with the environment.

2 Model-Based Reinforcement Learning

In this section, we explain the basic setup of model-based RL and present the notation used. At every time step t, the environment is in state s_t, the agent performs action a_t, receives reward r_t = r(s_t, a_t) and the environment transitions to new state s_{t+1} = f(s_t, a_t). The agent acts based on the observations o_t = o(s_t), which is a function of the environment state. In a fully observable Markov decision process (MDP), the agent observes the full state o_t = s_t. In a partially observable Markov decision process (POMDP), the observation o_t does not completely reveal s_t. The goal of the agent is to select actions {a_0, a_1, . . .} so as to maximize the return, which is the expected cumulative reward E[ ∑_{t=0}^{∞} r(s_t, a_t) ].

In the model-based approach, the agent builds the dynamics model of the environment (forward model).
For a fully observable environment, the forward model can be a fully-connected neural network trained to predict the state transition from time t to t + 1:

s_{t+1} = f_θ(s_t, a_t) .   (1)

In partially observable environments, the forward model can be a recurrent neural network trained to directly predict the future observations based on past observations and actions:

o_{t+1} = f_θ(o_0, a_0, . . . , o_t, a_t) .   (2)

In this paper, we assume access to the reward function and that it can be computed from the agent observations, that is r_t = r(o_t, a_t).
At each time step t, the agent uses the learned forward model to plan the sequence of future actions {a_t, . . . , a_{t+H}} so as to maximize the expected cumulative future reward:

G(a_t, . . . , a_{t+H}) = E[ ∑_{τ=t}^{t+H} r(o_τ, a_τ) ] ,
{a_t, . . . , a_{t+H}} = arg max G(a_t, . . . , a_{t+H}) .

This process is called trajectory optimization. The agent uses the learned model of the environment to compute the objective function G(a_t, . . . , a_{t+H}). The model (1) or (2) is unrolled H steps into the future using the current plan {a_t, . . . , a_{t+H}}.

Algorithm 1 End-to-end model-based reinforcement learning

Collect data D by random policy.
for each episode do
    Train dynamics model f_θ using D.
    for time t until the episode is over do
        Optimize trajectory {a_t, o_{t+1}, . . . , a_{t+H}, o_{t+H+1}}.
        Implement the first action a_t and get new observation o_t.
    end for
    Add data {(s_1, a_1, . . . , a_T, o_T)} from the last episode to D.
end for

The optimized sequence of actions from trajectory optimization can be directly applied to the environment (open-loop control). It can also be provided as suggestions to a human operator with the possibility for the human to change the plan (human-in-the-loop). Open-loop control is challenging because the dynamics model has to be able to make accurate long-range predictions.
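The evaluation of the objective G can be made concrete with a short sketch: the learned model is unrolled over a candidate plan and the predicted rewards are summed. Here `f_theta` and `reward_fn` are stand-in callables, not the paper's implementation:

```python
def expected_return(f_theta, reward_fn, s_t, actions):
    """Evaluate the planning objective G: unroll the learned dynamics
    model over a candidate action sequence and sum the predicted rewards."""
    s, G = s_t, 0.0
    for a in actions:
        G += reward_fn(s, a)   # predicted reward r(s_tau, a_tau)
        s = f_theta(s, a)      # predicted next state, as in eq. (1)
    return G
```

A trajectory optimizer then searches over `actions` to maximize this value; with a differentiable `f_theta`, the search can also be gradient-based.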
An approach which works better in practice is to take only the first action of the optimized trajectory and then re-plan at each step (closed-loop control). Thus, in closed-loop control, we account for possible modeling errors and feedback from the environment. In the control literature, this flavor of model-based RL is called model-predictive control (MPC) [22, 30, 16, 24].
The typical sequence of steps performed in model-based RL is: 1) collect data, 2) train the forward model f_θ, 3) interact with the environment using MPC (this involves trajectory optimization in every time step), 4) store the data collected during the last interaction and continue to step 2. The algorithm is outlined in Algorithm 1.

3 Regularized Trajectory Optimization

3.1 Problem with using learned models for planning

In this paper, we focus on the inner loop of model-based RL, which is trajectory optimization using a learned forward model f_θ. Potential inaccuracies of the trained model cause substantial difficulties for the planning process. Rather than optimizing what really happens, planning can easily end up exploiting the weaknesses of the predictive model. Planning is effectively an adversarial attack against the agent's own forward model. This results in a wide gap between expectations based on the model and what actually happens.
We demonstrate this problem using a simple industrial process control benchmark from [28]. The problem is to control a continuous nonlinear reactor by manipulating three valves which control flows in two feeds and one output stream. Further details of the process and the control problem are given in Appendix A. The task considered in [28] is to change the product rate of the process from 100 to 130 kmol/h. Fig. 1a shows how this task can be performed using a set of PI controllers proposed in [28].
We trained a forward model of the process using a recurrent neural network (2) and the data collected by implementing the PI control strategy for a set of randomly generated targets. Then we optimized the trajectory for the considered task using gradient-based optimization, which produced the results in Fig. 1b. One can see that the proposed control signals change abruptly and the trajectory imagined by the model significantly deviates from reality. For example, the pressure constraint (of max 3000 kPa) is violated. This example demonstrates how planning can easily exploit the weaknesses of the predictive model.

3.2 Regularizing Trajectory Optimization with Denoising Autoencoders

We propose to regularize trajectory optimization with denoising autoencoders (DAE). The idea is that we want to reward familiar trajectories and penalize unfamiliar ones because the model is likely to make larger errors for the unfamiliar ones.
This can be achieved by adding a regularization term to the objective function:

G_reg = G + α log p(o_t, a_t, . . . , o_{t+H}, a_{t+H}) ,   (3)

(a) Multiloop PI control   (b) No regularization   (c) DAE regularization

Figure 1: Open-loop planning for a continuous nonlinear two-phase reactor from [28]. Three subplots in every subfigure show three measured variables (solid lines): product rate, pressure and A in the purge. The black curves represent the model's imagination while the red curves represent the reality if those controls are applied in an open-loop mode. The targets for the variables are shown with dashed lines. The fourth (lower right) subplots show the three manipulated variables: valve for feed 1 (blue), valve for feed 2 (red) and valve for stream 3 (green).

Figure 2: Example: fragment of a computational graph used during trajectory optimization in an MDP.
Here, window size w = 1, that is, the DAE penalty term is c_1 = ‖g([s_1, a_1]) − [s_1, a_1]‖².

where p(o_t, a_t, . . . , o_{t+H}, a_{t+H}) is the probability of observing a given trajectory in the past experience and α is a tuning hyperparameter. In practice, instead of using the joint probability of the whole trajectory, we use marginal probabilities over short windows of size w:

G_reg = G + α ∑_{τ=t}^{t+H−w} log p(x_τ)   (4)

where x_τ = {o_τ, a_τ, . . . , o_{τ+w}, a_{τ+w}} is a short window of the optimized trajectory.
Suppose we want to find the optimal sequence of actions by maximizing (4) with a gradient-based optimization procedure. We can compute gradients ∂G_reg/∂a_i by backpropagation in a computational graph where the trained forward model is unrolled into the future (see Fig. 2). In such a backpropagation-through-time procedure, one needs to compute the gradient with respect to actions a_i:

∂G_reg/∂a_i = ∂G/∂a_i + α ∑_{τ=i}^{i+w} (∂x_τ/∂a_i) ∂ log p(x_τ)/∂x_τ ,   (5)

where we denote by x_τ a concatenated vector of observations o_τ, . . . , o_{τ+w} and actions a_τ, . . . , a_{τ+w} over a window of size w. Thus, to enable a regularized gradient-based optimization procedure, we need means to compute ∂ log p(x_τ)/∂x_τ.

In order to evaluate log p(x_τ) (or its derivative), one needs to train a separate model p(x_τ) of the past experience, which is the task of unsupervised learning. In principle, any probabilistic model can be used for that. In this paper, we propose to regularize trajectory optimization with a denoising autoencoder (DAE) which does not build an explicit probabilistic model p(x_τ) but rather learns to approximate the derivative of the log probability density.
The theory of denoising [23, 27] states that the optimal denoising function g(x̃) (for zero-mean Gaussian corruption) is given by:

g(x̃) = x̃ + σ_n² ∂ log p(x̃)/∂x̃ ,   (6)

where p(x̃) is the probability density function for data x̃ corrupted with noise and σ_n is the standard deviation of the Gaussian corruption. Thus, the DAE-denoised signal minus the original gives the gradient of the log-probability of the data distribution convolved with a Gaussian distribution: ∂ log p(x̃)/∂x̃ ∝ g(x) − x. Assuming ∂ log p(x̃)/∂x̃ ≈ ∂ log p(x)/∂x yields

∂G_reg/∂a_i = ∂G/∂a_i + α ∑_{τ=i}^{i+w} (∂x_τ/∂a_i) (g(x_τ) − x_τ) .

Using ∂ log p(x̃)/∂x̃ instead of ∂ log p(x)/∂x can behave better in practice because it is similar to replacing p(x) with its Parzen window estimate [36]. In automatic differentiation software, this gradient can be computed by adding the penalty term ‖g(x_τ) − x_τ‖² to G and stopping the gradient propagation through g.
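As a sanity check of relation (6): for Gaussian data the optimal denoiser has a closed form, and g(x̃) − x̃ indeed equals σ_n² times the score of the noise-convolved density. A toy numerical verification (not the paper's DAE; the Gaussian data assumption is ours for illustration):

```python
import numpy as np

sigma_d, sigma_n = 2.0, 0.5            # data std, corruption std
s2 = sigma_d ** 2 + sigma_n ** 2       # variance of the corrupted density

def g(x_noisy):
    # Optimal (MMSE) denoiser for N(0, sigma_d^2) data under Gaussian noise.
    return x_noisy * sigma_d ** 2 / s2

def score(x):
    # d/dx log p(x) for the noise-convolved density N(0, s2).
    return -x / s2

x = np.linspace(-3.0, 3.0, 7)
# Denoised minus noisy recovers sigma_n^2 times the score, as in eq. (6).
assert np.allclose(g(x) - x, sigma_n ** 2 * score(x))
```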
In practice, stopping the gradient through g did not yield any benefits in our experiments compared to simply adding the penalty term ‖g(x_τ) − x_τ‖² to the cumulative reward, so we used the simple penalty term in our experiments. Also, this kind of regularization can easily be used with gradient-free optimization methods such as the cross-entropy method (CEM) [3].
Our goal is to tackle high-dimensional problems and expressive models of dynamics. Neural networks tend to fare better than many other techniques in modeling high-dimensional distributions. However, using a neural network or any other flexible parameterized model to estimate the input distribution poses a dilemma: the regularizing network, which is supposed to keep planning from exploiting the inaccuracies of the dynamics model, will itself have weaknesses which planning will then exploit. Clearly, the DAE will also have inaccuracies but planning will not exploit them because, unlike most other density models, the DAE develops an explicit model of the gradient of the logarithmic probability density. The effect of adding DAE regularization in the industrial process control benchmark discussed in the previous section is shown in Fig. 1c.

3.3 Related work

Several methods have been proposed for planning with learned dynamics models. Locally linear time-varying models [17, 19] and Gaussian processes [8, 15] or mixtures of Gaussians [29] are data-efficient but have problems scaling to high-dimensional environments. Recently, deep neural networks have been successfully applied to model-based RL. Nagabandi et al. [24] use deep neural networks as dynamics models in model-predictive control to achieve good performance, and then show how model-based RL can be fine-tuned with a model-free approach to achieve even better performance. Chua et al.
[5] introduce PETS, a method to improve model-based performance by estimating and propagating uncertainty with an ensemble of networks and sampling techniques. They demonstrate how their approach can beat several recent model-based and model-free techniques. Clavera et al. [6] combine model-based RL and meta-learning with MB-MPO, training a policy to quickly adapt to slightly different learned dynamics models, thus enabling faster learning.
Levine and Koltun [20] and Kumar et al. [17] use a KL divergence penalty between action distributions to stay close to the training distribution. Similar bounds are also used to stabilize training of policy gradient methods [31, 32]. While such a KL penalty bounds the evolution of action distributions, the proposed method also bounds the familiarity of states, which could be important in high-dimensional state spaces. While penalizing unfamiliar states also penalizes exploration, it allows for more controlled and efficient exploration. Exploration is out of the scope of this paper but was studied in [10], where a non-zero optimum of the proposed DAE penalty was used as an intrinsic reward to alternate between familiarity and exploration.

4 Experiments on Motor Control

We show the effect of the proposed regularization for control in standard Mujoco environments: Cartpole, Reacher, Pusher, Half-cheetah and Ant, available in [4]. See the description of the environments in Appendix B. We use the Probabilistic Ensembles with Trajectory Sampling (PETS) model from [5] as the baseline, which achieves the best reported results on all the considered tasks except for Ant.

Trajectory optimization with CEM   |   Trajectory optimization with Adam

Figure 3: Visualization of trajectory optimization at timestep t = 50. Each row has the same model but a different optimization method. The models are obtained by 5 episodes of end-to-end training. Row above: Cartpole environment.
Row below: Half-cheetah environment. Here, the red lines denote the rewards predicted by the model (imagination) and the black lines denote the true rewards obtained when applying the sequence of optimized actions (reality). For a low-dimensional action space (Cartpole), trajectory optimizers do not exploit inaccuracies of the dynamics model and hence DAE regularization does not affect the performance noticeably. For a higher-dimensional action space (Half-cheetah), gradient-based optimization without any regularization easily exploits inaccuracies of the dynamics model but DAE regularization is able to prevent this. The effect is less pronounced with gradient-free optimization but still noticeable.

The PETS model consists of an ensemble of probabilistic neural networks and uses particle-based trajectory sampling to regularize trajectory optimization. We re-implemented the PETS model using the code provided by the authors as a reference.

4.1 Regularized trajectory optimization with models trained with PETS

In MPC, the innermost loop is open-loop control, which is then turned into closed-loop control by taking in new observations and replanning after each action. Fig. 3 illustrates the adversarial effect during open-loop trajectory optimization and how DAE regularization mitigates it. In the Cartpole environment, the learned model is very good already after a few episodes of data and trajectory optimization stays within the data distribution. As there is no problem to begin with, regularization does not improve the results. In the Half-cheetah environment, trajectory optimization manages to exploit the inaccuracies of the model, which is particularly apparent with gradient-based Adam. DAE regularization improves both, but the effect is much stronger with Adam.
The problem is exacerbated in closed-loop control since it continues optimization from the solution achieved in the previous time step, effectively iterating more per action.
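A minimal sketch of the gradient-free variant: a plain cross-entropy method over action sequences in which a penalty term (standing in for the DAE denoising error) is subtracted from the predicted return. `objective` and `penalty` are stand-in callables and the hyperparameters are illustrative, not the paper's settings:

```python
import numpy as np

def cem_plan(objective, penalty, horizon, act_dim, alpha=1.0,
             iters=5, pop=100, elites=10, seed=0):
    """Cross-entropy method maximizing objective(plan) - alpha * penalty(plan)
    over action sequences of shape (horizon, act_dim)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(iters):
        plans = mu + sigma * rng.standard_normal((pop, horizon, act_dim))
        scores = np.array([objective(p) - alpha * penalty(p) for p in plans])
        elite = plans[np.argsort(scores)[-elites:]]   # keep the best candidates
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu
```

In closed-loop control, the returned `mu` would be re-optimized at every time step and only its first action applied.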
We demonstrate how regularization can improve closed-loop trajectory optimization in the Half-cheetah environment. We first train three PETS models for 300 episodes using the best hyperparameters reported in [5]. We then evaluate the performance of the three models on five episodes using four different trajectory optimizers: 1) the cross-entropy method (CEM), which was used during training of the PETS models, 2) Adam, 3) CEM with the DAE regularization and 4) Adam with the DAE regularization. The results averaged across the three models and the five episodes are presented in Table 1.

Table 1: Comparison of PETS with CEM and Adam optimizers in Half-cheetah

Optimizer        CEM            CEM + DAE      Adam   Adam + DAE
Average Return   10955 ± 2865   12967 ± 3216   –      12796 ± 2716

We first note that planning with Adam fails completely without regularization: the proposed actions lead to unstable states of the simulator. Using Adam with the DAE regularization fixes this problem and the obtained results are better than the CEM method originally used in PETS.
CEM appears to regularize trajectory optimization, but not as efficiently as CEM+DAE. These closed-loop results are consistent with the open-loop results in Fig. 3.

4.2 End-to-end training with regularized trajectory optimization

In the following experiments, we study the performance of end-to-end training with different trajectory optimizers used during training. Our agent learns according to the algorithm presented in Algorithm 1. Since the environments are fully observable, we use a feedforward neural network as in (1) to model the dynamics of the environment. Unlike PETS, we did not use an ensemble of probabilistic networks as the forward model. We use a single probabilistic network which predicts the mean and variance of the next state (assuming a Gaussian distribution) given the current state and action.
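A common way to train such a mean-and-variance network is the Gaussian negative log-likelihood. The paper does not spell out its loss, so the following is an assumed sketch; predicting the log-variance keeps the variance positive:

```python
import numpy as np

def gaussian_nll(mean, log_var, target):
    """Gaussian negative log-likelihood of target under N(mean, exp(log_var)),
    averaged over the batch, with constant terms dropped."""
    inv_var = np.exp(-log_var)
    return 0.5 * np.mean((target - mean) ** 2 * inv_var + log_var)
```

The log-variance term penalizes overconfidence: an accurate mean with a confidently small predicted variance scores lower (better) than the same mean with unit variance.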
Although we only use the mean prediction, we found that also training to predict the variance improves the stability of training.
For all environments, we use a dynamics model with the same architecture: three hidden layers of size 200 with the Swish non-linearity [26]. Similar to prior works, we train the dynamics model to predict the difference between s_{t+1} and s_t instead of predicting s_{t+1} directly. We train the dynamics model for 100 or more epochs (see Appendix C) after every episode. This is a larger number of updates compared to the five epochs used in [5]. We found that an increased number of updates has a large effect on the performance of a single probabilistic model and a smaller effect for the ensemble of models used in PETS. This effect is shown in Fig. 6.
For the denoising autoencoder, we use the same architecture as the dynamics model. The state-action pairs in the past episodes were corrupted with zero-mean Gaussian noise and the DAE was trained to denoise them. Important hyperparameters used in our experiments are reported in Appendix C. For DAE-regularized trajectory optimization, we used either CEM or Adam as the optimizer.
The learning progress of the compared algorithms is presented in Fig. 4. Note that we report the average returns across different seeds, not the maximum return seen so far as was done in [5].2 In Cartpole, all the methods converge to the maximum cumulative reward but the proposed method converges the fastest. In the Cartpole environment, we also compare to a method which uses Gaussian processes (GP) as the dynamics model (the algorithm denoted GP-E in [5], which considers only the expectation of the next state prediction). The implementation of the GP algorithm was obtained from the code provided by [5].
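The DAE training step described above (corrupt past state-action windows with Gaussian noise, regress back to the clean window) can be sketched with a toy linear denoiser. The paper's DAE is a deep network, so everything below, including the data and hyperparameters, is illustrative:

```python
import numpy as np

def train_linear_dae(windows, noise_std=0.3, lr=0.1, steps=500, seed=0):
    """Fit a linear denoiser g(x) = x @ W that maps corrupted windows
    back to clean ones (a toy stand-in for the deep DAE)."""
    rng = np.random.default_rng(seed)
    n, d = windows.shape
    W = np.eye(d)
    for _ in range(steps):
        noisy = windows + noise_std * rng.standard_normal((n, d))
        err = noisy @ W - windows
        W -= lr * noisy.T @ err / n   # gradient step on the squared error
    return W

def dae_penalty(W, x):
    # Denoising error used as the novelty penalty during planning.
    return float(np.sum((x @ W - x) ** 2))
```

On data concentrated near a low-dimensional set, the denoising error stays small for familiar inputs and grows for unfamiliar ones, which is exactly the property the regularizer relies on.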
Interestingly, our algorithm also surpasses the Gaussian process (GP) baseline, which is known to be a sample-efficient method widely used for the control of simple systems. In Reacher, the proposed method converges to the same asymptotic performance as PETS, but faster. In Pusher, all algorithms perform similarly.
In Half-cheetah and Ant, the proposed method shows very good sample efficiency and very rapid initial learning. The agent learns an effective running gait in only a couple of episodes.3 The results demonstrate that denoising regularization is effective for both gradient-free and gradient-based planning, with gradient-based planning performing the best. The proposed algorithm learns faster than PETS in the initial phase of training. It also achieves performance that is competitive with popular model-free algorithms such as DDPG, as reported in [5].
However, the performance of the proposed method does not improve after the initial 10 episodes, so it does not reach the asymptotic performance of PETS (see results for PETS on Half-cheetah after 300 episodes in Table 1). This result is evidence of the importance of exploration: the DAE regularization essentially penalizes exploration, which can harm asymptotic performance in complex environments. In PETS, CEM leaves some noise in the trajectories, which might help to obtain better asymptotic performance. The result presented in Appendix E provides some evidence that at least part of the problem is lack of exploration.
We also compare the performance of our method with Model-Based Meta Policy Optimization (MB-MPO) [6], an approach that combines the benefits of model-based RL and meta-learning: the algorithm trains a policy using simulations generated by an ensemble of models learned from data. Meta-learning allows this policy to quickly adapt to the various dynamics, hence learning how to quickly adapt in the real environment, using Model-Agnostic Meta Learning (MAML) [12].
2 Because of the different metric used, the PETS results presented in this paper may appear worse than in [5]. However, we verified that our implementation of PETS obtains similar results to [5] for the metric used in [5].
3 Videos of our agents during training can be found at https://sites.google.com/view/regularizing-mbrl-with-dae/home.

Figure 4: Results of our experiments on the five benchmark environments, in comparison to PETS [5]. We show the return obtained in each episode. All the results are averaged across 5 seeds, with the shaded area representing standard deviation. PETS is a recent state-of-the-art model-based RL algorithm and GP-based (Gaussian processes) control algorithms are well known to be sample-efficient and are extensively used for the control of simple systems.

Figure 5: Comparison to MB-MPO [6], MB-TRPO [18] and MB-MPC [24] on Half-cheetah. We plot the average return over the last 20 episodes. Our results are averaged across 3 seeds, with the shaded area representing standard deviation. Note that the comparison numbers are picked from [6] and the results from the first 20 episodes are not reported.

In Fig. 5 we compare our method to MB-MPO and other model-based methods included in [6]. This experiment is done in the Half-cheetah environment with shorter episodes (200 timesteps) in order to compare to the results reported in [6]. The results show that our method learns faster than MB-MPO.

5 Discussion

In recent years, a lot of effort has been put into making deep reinforcement learning algorithms more sample-efficient, and thus adaptable to real-world scenarios. Model-based reinforcement learning has shown promising results, obtaining sample efficiency even orders of magnitude better than model-free counterparts, but these methods have often suffered from sub-optimal performance for a variety of reasons.
As already noted in the recent literature [24, 5], out-of-distribution errors and model overfitting are often sources of performance degradation when using complex function approximators. In this work we demonstrated how to tackle this problem using regularized trajectory optimization. Our experiments demonstrate that the proposed solution can improve the performance of model-based reinforcement learning.

While trajectory optimization is a key component in model-based RL, there are clearly several other issues which need to be tackled in complex environments:

• Local minima in trajectory optimization. There can be multiple trajectories that are reasonable solutions, but in-between trajectories can be very bad. For example, we can take a step with either the right or the left foot, but a trajectory in between the two will not work. We tackled this issue by trying multiple initializations, which worked for the considered environments, but better techniques will be needed for more complex environments.

• The planning horizon problem. In the presented experiments, the planning procedure did not care about what happens after the planning horizon. This was not a problem for the considered environments due to nicely formatted rewards. Other solutions, like value functions, multiple time scales or hierarchy for planning, are required for sparser reward problems. All of these are compatible with model-based RL.

• Open-loop vs. closed-loop control (compounding errors). The implicit planning assumption of trajectory optimization is open-loop control. However, MPC only takes the first action and then replans (closed-loop control). If the outcome is uncertain (e.g., due to stochastic environments or an imperfect forward model), this can lead to overly pessimistic controls.

• Local optima of the policy.
This is the well-known exploration-exploitation dilemma. If the model has never seen data from alternative trajectories, it may predict their consequences incorrectly and never try them (because in-between trajectories can be genuinely worse). Good trajectory optimization (exploitation) can harm long-term performance because it reduces exploration, but we believe that it is better to add explicit exploration. With model-based RL, intrinsically motivated exploration is a particularly interesting option because it makes it possible to balance exploration against the expected cost. This is particularly important in hazardous environments where safe exploration is needed.

• High-dimensional input space. Sensory systems such as cameras, lidars and microphones can produce vast amounts of data, and it is infeasible to plan based on detailed predictions at a low level such as pixels. Also, predictive models of pixels may miss the relevant state.

• Changing environments. All the considered environments were static, but real-world systems keep changing. Online learning and similar techniques are needed to keep track of the changing environment.

Still, model-based RL is an attractive approach, and not only due to its sample efficiency. Compared to model-free approaches, model-based learning makes safe exploration and the addition of known constraints or first-principles models much easier. We believe that the proposed method can be a viable solution for real-world control tasks, especially where safe exploration is of high importance.
We are currently working on applying the proposed methods to real-world problems such as assisting operators of complex industrial processes and controlling autonomous mobile machines.

Acknowledgments

We would like to thank Jussi Sainio, Jari Rosti and Isabeau Prémont-Schwarz for their valuable contributions to the experiments on industrial process control.

References

[1] Naveed Akhtar and Ajmal Mian.
Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018.

[2] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.

[3] Zdravko I. Botev, Dirk P. Kroese, Reuven Y. Rubinstein, and Pierre L'Ecuyer. The cross-entropy method for optimization. In Handbook of Statistics, volume 31, pages 35–59. Elsevier, 2013.

[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[5] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, pages 4759–4770, 2018.

[6] Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.

[7] Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 99–108, 2004.

[8] Marc Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 465–472, 2011.

[9] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.

[10] Norman Di Palo and Harri Valpola. Improving model-based control and active exploration with reconstruction uncertainty optimization. arXiv preprint arXiv:1812.03955, 2018.

[11] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1329–1338, 2016.

[12] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126–1135, 2017.

[13] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.

[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[15] Jonathan Ko, Daniel J. Klein, Dieter Fox, and Dirk Haehnel. Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In 2007 IEEE International Conference on Robotics and Automation, pages 742–747. IEEE, 2007.

[16] Basil Kouvaritakis and Mark Cannon. Non-linear Predictive Control: Theory and Practice. IET, 2001.

[17] Vikash Kumar, Emanuel Todorov, and Sergey Levine. Optimal control with learned local models: Application to dexterous manipulation. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 378–383. IEEE, 2016.

[18] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.

[19] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1071–1079, 2014.

[20] Sergey Levine and Vladlen Koltun. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1–9, 2013.

[21] Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848, 2018.

[22] David Q. Mayne, James B. Rawlings, Christopher V. Rao, and Pierre O. M. Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814, 2000.

[23] K. Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bulletin of the International Statistical Institute, 38(181–188):1–2, 1961.

[24] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.

[25] Vitchyr Pong*, Shixiang Gu*, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. In International Conference on Learning Representations, 2018.

[26] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

[27] Martin Raphan and Eero P. Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, 2011.

[28] N. Lawrence Ricker. Model predictive control of a continuous, nonlinear, two-phase reactor. Journal of Process Control, 3(2):109–123, 1993.

[29] Cédric Rommel, Frédéric Bonnans, Pierre Martinon, and Baptiste Gregorutti. Gaussian mixture penalty for trajectory optimization problems. Journal of Guidance, Control, and Dynamics, pages 1–6, 2019.

[30] John Rossiter. Model-Based Predictive Control: A Practical Approach. CRC Press, 2003.

[31] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1889–1897, 2015.

[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[33] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[34] Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913. IEEE, 2012.

[35] Yuval Tassa, Nicolas Mansard, and Emo Todorov. Control-limited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1168–1175. IEEE, 2014.

[36] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

[37] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.