{"title": "Exploiting Model Uncertainty Estimates for Safe Dynamic Control Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1047, "page_last": 1053, "abstract": null, "full_text": "Exploiting Model Uncertainty Estimates \n\nfor Safe Dynamic Control Learning \n\nJeff G. Schneider \nThe Robotics Institute \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \nschneide@cs.cmu.edu \n\nAbstract \n\nModel learning combined with dynamic programming has been shown to \nbe effective for learning control of continuous state dynamic systems. The \nsimplest method assumes the learned model is correct and applies dynamic \nprogramming to it, but many approximators provide uncertainty estimates \non the fit. How can they be exploited? This paper addresses the case \nwhere the system must be prevented from having catastrophic failures dur(cid:173)\ning learning. We propose a new algorithm adapted from the dual control \nliterature and use Bayesian locally weighted regression models with dy(cid:173)\nnamic programming. A common reinforcement learning assumption is that \naggressive exploration should be encouraged. This paper addresses the con(cid:173)\nverse case in which the system has to reign in exploration. The algorithm \nis illustrated on a 4 dimensional simulated control problem. \n\nIntroduction \n\n1 \nReinforcement learning and related grid-based dynamic programming techniques are \nincreasingly being applied to dynamic systems with continuous valued state spaces. \nRecent results on the convergence of dynamic programming methods when using \nvarious interpolation methods to represent the value (or cost-to-go) function have \ngiven a sound theoretical basis for applying reinforcement learning to continuous \nvalued state spaces [Gordon, 1995]. These are important steps toward the eventual \napplication of these methods to industrial learning and control problems. 
\n\nIt has also been reported recently that there are significant benefits in data and \ncomputational efficiency when data from running a system is used to build a model, \nrather than using it once for single value function updates (as Q-learning would \ndo) and discarding it [Sutton, 1990, Moore and Atkeson, 1993, Schaal and Atkeson, \n1993, Davies, 1996]. Dynamic programming sweeps can then be done on the learned \nmodel either off-line or on-line. In its vanilla form, this method assumes the model \nis correct and does deterministic dynamic programming using the model. This \nassumption is often not correct, especially in the early stages of learning. When \nlearning simulated or software systems, there may be no harm in the fact that this \n\n\f1048 \n\nJ. G. Schneider \n\nassumption does not hold. However, in real, physical systems there are often states \nthat really are catastrophic and must be avoided even during learning. Worse yet, \nlearning may have to occur during normal operation of the system in which case its \nperformance during learning must not be significantly degraded. \n\nThe literature on adaptive and optimal linear control theory has explored this prob(cid:173)\nlem considerably under the names stochastic control and dual control. Overviews \ncan be found in [Kendrick, 1981, Bar-Shalom and Tse, 1976]. The control decision \nis based on three components call the deterministic, cautionary, and probing terms. \nThe deterministic term assumes the model is perfect and attempts to control for \nthe best performance. Clearly, this may lead to disaster if the model is inaccurate. \nAdding a cautionary term yields a controller that considers the uncertainty in the \nmodel and chooses a control for the best expected performance. 
Finally, if the system learns while it is operating, there may be some benefit to choosing controls that are suboptimal and/or risky in order to obtain better data for the model and ultimately achieve better long-term performance. The addition of the probing term does this and gives a controller that yields the best long-term performance. \n\nThe advantage of dual control is that its strong mathematical foundation can provide the optimal learning controller under some assumptions about the system, the model, noise, and the performance criterion. Dynamic programming methods such as reinforcement learning have the advantage that they do not make strong assumptions about the system, or the form of the performance measure. It has been suggested [Atkeson, 1995, Atkeson, 1993] that techniques used in global linear control, including caution and probing, may also be applicable in the local case. In this paper we propose an algorithm that combines grid-based dynamic programming with the cautionary concept from dual control via the use of a Bayesian locally weighted regression model. \n\nOur algorithm is designed with industrial control applications in mind. A typical scenario is that a production line is being operated conservatively. There is data available from its operation, but it only covers a small region of the state space and thus cannot be used to produce an accurate model over the whole potential range of operation. Management is interested in improving the line's response to changes in set points or disturbances, but cannot risk much loss of production during the learning process. The goal of our algorithm is to collect new data and optimize the process while explicitly minimizing the risk. \n\n2 The Algorithm \n\nConsider a system whose dynamics are given by x_{k+1} = f(x_k, u_k). The state, x, and control, u, are real valued vectors and k represents discrete time increments. 
\nA model of f is denoted as j. The task is to minimize a cost functional of the \nform J = E:=D L(xk, uk, k) subject to the system dynamics. N mayor may not \nbe fixed depending on the problem. L is given, but f must be learned. The goal is \nto acquire data to learn f in order to minimize J without incurring huge penalties \nin J during learning. There is an implicit assumption that the cost function defines \ncatastrophic states. If it were known that there were no disasters to avoid, then \nsimpler, more aggressive algorithms would likely outperform the one presented here. \nThe top level algorithm is as follows: \n\n1. Acquire some data while operating the system from an existing controller. \n2. Construct a model from the data using Bayesian locally weighted regression. \n3. Perform DP with the model to compute a value function and a policy. \n4. Operate the system using the new policy and record additional data. \n\n\fExploiting Model Uncertainty Estimates for Safe Dynamic Control Learning \n\n1049 \n\n5. Repeat to step 2 while there is still some improvement in performance. \n\nIn the rest of this section we describe steps 2 and 3. \n\n2.1 Bayesian locally weighted regression \nWe use a form of locally weighted regression [Cleveland and Delvin, 1988, \nAtkeson, 1989, Moore, 1992] called Bayesian locally weighted regression [Moore \nand Schneider, 1995] to build a model from data. When a query, x q , is made, each \nof the stored data points receives a weight Wi = exp( -llxi - xql1 2 / K). K is the \nkernel width which controls the amount of localness in the regression. For Bayesian \nLWR we assume a wide, weak normal-gamma prior on the coefficients of the regres(cid:173)\nsion model and the inverse of the noise covariance. The result of a prediction is a \nt distribution on the output that remains well defined even in the absence of data \n(see [Moore and Schneider, 1995] and [DeGroot, 1970] for details) . 
\n\nThe distribution of the prediction in regions where there is little data is crucial to \nthe performance of the DP algorithm. As is often the case with learning through \nsearch and experimentation, it is at least as important that a function approximator \npredicts its own ignorance in regions of no data as it is how well it interpolates in \ndata rich regions. \n\n2.2 Grid based dynamic programming \nIn dynamic programming, the optimal value function, V, represents the cost-to-go \nfrom each state to the end of the task assuming that the optimal policy is followed \nfrom that point on. The value function can be computed iteratively by identifying \nthe best action from each state and updating it according to the expected results \nof the action as given by a model of the system. The update equation is: \n\nVk+1(x) = minL(x, u) + Vk(j(x, u\u00bb \n\n(1) \nIn our algorithm, updates to the ~que function are computed while considering \nthe probability distribution on the results of each action. If we assume that the \noutput of the real system at each time step is an independent random variable \nwhose probability density function is given by the uncertainty from the model, the \nupdate equation is as follows: \n\nVk+1(x) = minL(x, u) + E[Vk(f(x, u))lj] \n\n(2) \nNote that the independence as~~fhption does not hold when reasonably smooth \nsystem dynamics are modeled by a smooth function approximator. The model \nerror at one time step along a trajectory is highly correlated with the model error \nat the following step assuming a small distance traveled during the time step. \n\nOur algorithm for DP with model uncertainty on a grid is as follows: \n\n1. Discretize the state space, X, and the control space, U. \n2. For each state and each control cache the cost of taking this action from \nthis state. Also compute the probability density function on the next state \nfrom the model and cache the information. 
There are two cases which are shown graphically in fig. 1: \n\n• If the distribution is much narrower than the grid spacing, then the model is confident and a deterministic update will be done according to eq. 1. Multilinear interpolation is used to compute the value function at the mean of the predicted next state [Davies, 1996]. \n\n• Otherwise, a stochastic update will be done according to eq. 2. The pdf of each of the state variables is stored, discretized at the same intervals as the grid representing the value function. Output independence is assumed and later the pdf of each grid point will be computed as the product of the pdfs for each dimension and a weighted sum of all the grid points with significant weight will be computed. Also the total probability mass outside the bounds of the grid is computed and stored. \n\nFigure 1: Illustration of the two kinds of cached updates. In the high confidence scenario the transition is treated as deterministic and the value function is computed with multilinear interpolation: V^{k+1}_{10} = L(x, u) + 0.4V^k_7 + 0.3V^k_8 + 0.2V^k_{11} + 0.1V^k_{12}. In the low confidence scenario the transition is treated stochastically and the update takes a weighted sum over all the vertices of significant weight as well as the probability mass outside the grid: V^{k+1}_{10} = L(x, u) + Σ_{x' : p(x') > ε} p(x' | f̂, x, u) V^k(x') / Σ_{x' : p(x') > ε} p(x' | f̂, x, u). \n\n3. For each state, use the cached information to estimate the cost of choosing each action from that state. Update the value function at that state according to the cost of the best action found. \n\n4. 
Repeat 3 until the value function converges, or the desired number of steps has been reached in finite step problems. \n\n5. Record the best action (policy) for each grid point. \n\n3 Experiments: Minimal Time Cart-Pole Maneuvers \n\nThe inverted pendulum is a well studied problem. It is easy to learn to stabilize it in a small number of trials, but not easy to learn quick maneuvers. We demonstrate our algorithm on the harder problem of moving the cart-pole stably from one position to another as quickly as possible. We assume we have a controller that can balance the pole and would like to learn to move the cart quickly to new positions, but never drop the pole during the learning process. The simulation equations and parameters are from [Barto et al., 1983] and the task is illustrated at the top of fig. 2. The state vector is x = [ pole angle (θ), pole angular velocity (θ̇), cart position (p), cart velocity (ṗ) ]. The control vector, u, is the one dimensional force applied to the cart. x_0 is [0 0 17 0] and the cost function is J = Σ_{k=0}^{N} (x_k^T x_k + 0.01 u_k^T u_k). N is not fixed. It is determined by the amount of time it takes for the system to reach a goal region about the target state, [0 0 0 0]. If the pole is dropped, the trial ends and an additional penalty of 10^6 is incurred. \n\nThis problem has properties similar to familiar process control problems such as cooking, mixing, or cooling, because it is trivial to stabilize the system and it can be moved slowly to a new desired position while maintaining the stability by slowly changing positional setpoints. In each case, the goal is to learn how to respond faster without causing any disasters during, or after, the learning process. \n\n3.1 Learning an LQR controller \n\nWe first learn a linear quadratic regulator that balances the pole. This can be done with minimal data. 
The system is operated from the state [0 0 0 0] for 10 steps of length 0.1 seconds with a controller that chooses u randomly from a zero-mean Gaussian with standard deviation 0.5. This is repeated to obtain a total of 20 data points. That data is used to fit a global linear model mapping x onto x'. An LQR controller is constructed from the model and the given cost function following the derivation in [Dyer and McReynolds, 1970]. \n\nThe resulting linear controller easily stabilizes the pole and can even bring the system stably (although very inefficiently, as it passes through the goal several times before coming to rest there) to the origin when started as far out as x = [0 0 10 0]. If the cart is started further from the origin, the controller crashes the system. \n\n3.2 Building the initial Bayesian LWR model \n\nWe use the LQR controller to generate data for an initial model. The system is started at x = [0 0 1 0] and controlled by the LQR controller with Gaussian noise added as before. The resulting 50 data points are stored for an LWR model that maps [θ, θ̇, u] → [θ̈, p̈]. The data in each dimension of the state and control space is scaled to [0, 1]. In this scaled space, the LWR kernel width is set to 1.0. \n\nNext, we consider the deterministic DP method on this model. The grid covers the ranges [±1.0 ±4.0 ±21.0 ±20.0] and is discretized to [11 9 11 9] levels. The control is ±30.0 discretized to 15 levels. Any state outside the grid bounds is considered failure and incurs the 10^6 penalty. If we assume the model is correct, we can use deterministic DP on the grid to generate a policy. The computation is done with fixed size steps in time of 0.25 seconds. We observe that this policy is able to move the system safely from an initial state of [0 0 12 0], but crashes if it is started further out. 
Failure occurs because the best path generated using the model strays far from the region of the data (in variables θ and θ̇) used to construct the model. \n\nIt is disappointing that the use of LWR for nonlinear modeling didn't improve much over a globally linear model and an LQR controller. We believe this is a common situation. It is difficult to build better controllers from naive use of nonlinear modeling techniques because the available data models only a narrow region of operation and safely acquiring a wider range of data is difficult. \n\n3.3 Cautionary dynamic programming \n\nAt this point we are ready to test our algorithm. Step 3 is executed using the LWR model from the data generated by the LQR controller as before. A trace of the system's operation when started at a distance of 17 from the goal is shown at the top of fig. 2. The controller is extremely conservative with respect to the angle of the pole. The pole is never allowed to go outside ±0.13 radians. Even as the cart approaches the goal at a moderate velocity the controller chooses to overshoot the goal considerably rather than making an abrupt action to brake the system. \n\nThe data from this run is added to the model and the steps are repeated. Traces of the runs from three iterations of the algorithm are shown in fig. 2. At each trial, the controller becomes more aggressive and completes the task with less cost. After the third iteration, no significant improvement is observed. The costs are summarized and compared with the LQR and deterministic DP controllers in table 1. \n\nFig. 3 is another illustration of how the policy becomes increasingly aggressive. It plots the pole angle vs. the pole angular velocity for the original LQR data and the executions at each of the following three trials. In summary, our algorithm is able 
\n\n10 \n\n'A i \u2022\u2022 \n\n3 2 . \n\n10 \n\n\u2022 \n\nI \n\nI \n\nI \n\nI \n\nI \n\nI \n\nI \n\nI \n\no \n\no \n\nI \n\nI \n\nI \n\nI \n\nI \n\nI \n\nI \n\nI \n\nt \n\nI \n\nI \n\nI \n\nI \n\n, \n\n\u2022 \n\n5 \n\n01 \n\nI \n\nI \n\nI \n\nI \n\nI \n\n' \n\n, \n\nFigure 2: The task is to move the cart to the origin as quickly as possible without \ndropping the pole. The bottom three pictures show a trace of the policy execution \nobtained after one, two, and three trials (shown in increments of 0.5 seconds) \n\nNumber of data points used Cost from initial state 17 \n\nController \n\nLQR \nDeterministic D P \nStochastic DP trial 1 \nStochastic DP trial 2 \nStochastic DP trial 3 \n\nto build the controller \n20 \n50 \n50 \n221 \n272 \n\nfailure \nfailure \n12393 \n7114 \n6270 \n\nTable 1: Summary of experimental results \n\nto start from a simple controller that can stabilize the pole and learn to move it \naggressively over a long distance without ever dropping the pole during learning. \n\n4 Discussion \nWe have presented an algorithm that uses Bayesian locally weighted regression \nmodels with dynamic programming on a grid. The result is a cautionary adaptive \ncontrol algorithm with the flexibility of a non-parametric nonlinear model instead \nof the more restrictive parametric models usually considered in the dual control \nliterature. We note that this algorithm presents a viewpoint on the exploration \nvs exploitation issue that is different from many reinforcement learning algorithms, \nwhich are devised to encourage exploration (as in the probing concept in dual con(cid:173)\ntrol) . However, we argue that modeling the data first with a continuous function \napproximator and then doing DP on the model often leads to a situation where \nexploration must be inhibited to prevent disasters. This is particularly true in the \ncase of real, physical systems. 
\n\n\fExploiting Model Uncertainty Estimatesfor Safe Dynamic Control Learning \n\n1053 \n\nAngular \n\nVelocity \n\n1.5 \n\n1 \n\n0.5 \n\n0 \n\n-0.5 \n\n-1 \n\n-1.5 \n\n-0.8 \n\n\" - ' .. --. . . \n\n\" \" \" \" \" \" . \" \n\n. . \n\nLQR data 0 \n\n1st trial \n_ ....... 2nd trial \n'3td trial \n\n---\n\n-(cid:173).... \n\n'. \n\n: \n\n\" \" \" \" \" \" . \" \" \" \n\n\",,\" \" \n'. \n\n., . \n\" .. ,,\"\" \" \n\n\",,\"\"\"\" \" \n\n-0.6 \n\n-0.4 \n\n-0.2 \n\n0 \nPole Angle \n\n0.2 \n\n0.4 \n\n0.6 \n\nFigure 3: Execution trace. At each iteration, the controller is more aggressive. \n\nReferences \n[Atkeson, 1989) C. Atkeson. Using local models to control movement . In Advances in Neural Informa(cid:173)\n\ntion Processing Systems, 1989. \n\n[Atkeson, 1993] C . Atkeson. Using local trajectory optimizers to speed up global optimization in dy(cid:173)\n\nnamic programming. In Advances in Neural Information Processing Systems (NIPS-6), 1993. \n\n[Atkeson , 1995) C . Atkeson . Local methods for active learning. Invited talk at AAAI Fall Symposium \n\non Active Learning, 1995 . \n\n[Bar-Shalom and Tse, 1976) Y . Bar-Shalom and E . Tse. Concepts and Methods in Stochastic Control. \n\nAcademic Press, 1976. \n\n[Barto et al., 1983) A . Barto, R. Sutton, and C. Anderson. Neuronlike adaptive elements that can solve \n\ndifficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983. \n\n[Cleveland and Delvin, 1988) W . Cleveland and S. Delvin. Locally weighted regression: An approach to \nregression analysis by local fitting. Journal of the American Statistical Association, pages 596-610, \nSeptember 1988. \n\n[Davies, 1996] S. Davies. Applying grid-based interpolation to reinforcement learning. In Neural Infor(cid:173)\n\nmation Proceuing Systems 9, 1996. \n\n[DeGroot, 1970) M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970. \n[Dyer and McReynolds , 1970) P. Dyer and S. McReynolds. The Computation and Theory of Optimal \n\nControl. 
Academic Press, 1970. \n\n[Gordon, 1995] G. Gordon. Stable function approximation in dynamic programming. In The 12th International Conference on Machine Learning, 1995. \n\n[Kendrick, 1981] D. Kendrick. Stochastic Control for Economic Models. McGraw-Hill, 1981. \n\n[Moore and Atkeson, 1993] A. Moore and C. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13(1):103-130, 1993. \n\n[Moore and Schneider, 1995] A. Moore and J. Schneider. Memory based stochastic optimization. In Advances in Neural Information Processing Systems (NIPS-8), 1995. \n\n[Moore, 1992] A. Moore. Fast, robust adaptive control by learning only forward models. In Advances in Neural Information Processing Systems 4, 1992. \n\n[Schaal and Atkeson, 1993] S. Schaal and C. Atkeson. Assessing the quality of learned local models. In Advances in Neural Information Processing Systems (NIPS-6), 1993. \n\n[Sutton, 1990] R. Sutton. First results with dyna, an integrated architecture for learning, planning, and reacting. In AAAI Spring Symposium on Planning in Uncertain, Unpredictable, or Changing Environments, 1990. \n", "award": [], "sourceid": 1317, "authors": [{"given_name": "Jeff", "family_name": "Schneider", "institution": null}]}