{"title": "Autonomous Helicopter Flight via Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 799, "page_last": 806, "abstract": "", "full_text": "Autonomous helicopter \ufb02ight\nvia Reinforcement Learning\n\nAndrew Y. Ng\n\nStanford University\nStanford, CA 94305\n\nH. Jin Kim, Michael I. Jordan, and Shankar Sastry\n\nUniversity of California\n\nBerkeley, CA 94720\n\nAbstract\n\nAutonomous helicopter \ufb02ight represents a challenging control problem,\nwith complex, noisy, dynamics. In this paper, we describe a successful\napplication of reinforcement learning to autonomous helicopter \ufb02ight.\nWe \ufb01rst \ufb01t a stochastic, nonlinear model of the helicopter dynamics. We\nthen use the model to learn to hover in place, and to \ufb02y a number of\nmaneuvers taken from an RC helicopter competition.\n\n1 Introduction\nHelicopters represent a challenging control problem with high-dimensional, complex,\nasymmetric, noisy, non-linear, dynamics, and are widely regarded as signi\ufb01cantly more\ndif\ufb01cult to control than \ufb01xed-wing aircraft. [7] Consider, for instance, the problem of de-\nsigning a helicopter that hovers in place. We begin with a single, horizontally-oriented main\nrotor attached to the helicopter via the rotor shaft. Suppose the main rotor rotates clock-\nwise (viewed from above), blowing air downwards and hence generating upward thrust.\nBy applying clockwise torque to the main rotor to make it rotate, our helicopter experi-\nences an anti-torque that tends to cause the main chassis to spin anti-clockwise. Thus,\nin the invention of the helicopter, it was necessary to add a tail rotor, which blows air\nsideways/rightwards to generate an appropriate moment to counteract the spin. But, this\nsideways force now causes the helicopter to drift leftwards. 
So, for a helicopter to hover in place, it must actually be tilted slightly to the right, so that the main rotor's thrust is directed downwards and slightly to the left, to counteract this tendency to drift sideways.

The history of helicopters is rife with such tales of ingenious solutions to problems caused by solutions to other problems, and of complex, nonintuitive dynamics that make helicopters challenging to control. In this paper, we describe the successful application of reinforcement learning to designing a controller for autonomous helicopter flight. Due to space constraints, our description of this work is necessarily brief; a detailed treatment is provided in [9]. For a discussion of related work on autonomous flight, also see [9, 12].

2 Autonomous Helicopter

The helicopter used in this work was a Yamaha R-50 helicopter, which is approximately 3.6m long, carries up to a 20kg payload, and is shown in Figure 1a. A detailed description of the design and construction of its instrumentation is in [12]. The helicopter carries an Inertial Navigation System (INS) consisting of 3 accelerometers and 3 rate gyroscopes installed in exactly orthogonal x, y, z directions, and a differential GPS system, which, with the assistance of a ground station, gives position estimates with a resolution of 2cm. An onboard navigation computer runs a Kalman filter which integrates the sensor information from the GPS, INS, and a digital compass, and reports (at 50Hz) 12 numbers corresponding to the estimates of the helicopter's position (x, y, z), orientation (roll φ, pitch θ, yaw ω), velocity (ẋ, ẏ, ż) and angular velocities (φ̇, θ̇, ω̇).

Figure 1: (a) Autonomous helicopter.
(b) Helicopter hovering under control of learned policy.

Most helicopters are controlled via a 4-dimensional action space:

a1, a2: The longitudinal (front-back) and latitudinal (left-right) cyclic pitch controls. The rotor plane is the plane in which the helicopter's rotors rotate. By tilting this plane either forwards/backwards or sideways, these controls cause the helicopter to accelerate forward/backwards or sideways.

a3: The (main rotor) collective pitch control. As the helicopter main-rotor's blades sweep through the air, they generate an amount of upward thrust that (generally) increases with the angle at which the rotor blades are tilted. By varying the tilt angle of the rotor blades, the collective pitch control affects the main rotor's thrust.

a4: The tail rotor collective pitch control. Using a mechanism similar to the main rotor collective pitch control, this controls the tail rotor's thrust.

Using the position estimates given by the Kalman filter, our task is to pick good control actions every 50th of a second.

3 Model identification

To fit a model of the helicopter's dynamics, we began by asking a human pilot to fly the helicopter for several minutes, and recorded the 12-dimensional helicopter state and 4-dimensional helicopter control inputs as it was flown. In what follows, we used 339 seconds of flight data for model fitting, and another 140 seconds of data for hold-out testing.

There are many natural symmetries in helicopter flight. For instance, a helicopter at (0,0,0) facing east behaves in a way related only by a translation and rotation to one at (10,10,50) facing north, if we command each to accelerate forwards.
We would like to encode these symmetries directly into the model rather than force an algorithm to learn them from scratch. Thus, model identification is typically done not in the spatial (world) coordinates, but instead in the helicopter body coordinates, in which the x, y, and z axes are forwards, sideways, and down relative to the current position of the helicopter. Where there is risk of confusion, we will use superscripts s and b to distinguish between spatial and body coordinates; thus, ẋ^b is forward velocity, regardless of orientation. Our model is identified in the body coordinates (φ, θ, ẋ^b, ẏ^b, ż^b, φ̇, θ̇, ω̇), which has four fewer variables than the spatial representation. Note that once this model is built, it is easily converted back using simple geometry to one in terms of spatial coordinates.

Our main tool for model fitting was locally weighted linear regression (e.g., [11, 3]). Given a dataset {(x_i, y_i)}, where the x_i's are vector-valued inputs and the y_i's are the real-valued outputs to be predicted, we let X be the design matrix whose i-th row is x_i, and let Y be the vector of y_i's. In response to a query at x, locally weighted linear regression makes the prediction y = x^T (X^T W X)^{-1} X^T W Y, where W is a diagonal matrix with (say) W_ii = exp(-||x_i - x||^2 / (2 tau^2)), so that the regression gives datapoints near x a larger weight. Here, tau determines how weights fall off with distance from x, and was picked in our experiments via leave-one-out cross validation.(1)
Using the estimator for the noise variance sigma^2(x) given in [3], this gives a model y = ŷ(x) + ε, where ŷ(x) is the locally weighted prediction above and ε ~ N(0, sigma^2(x)).

(1) Actually, since we were fitting a model to a time-series, samples tend to be correlated in time, and the presence of temporally close-by samples, which will be spatially close-by as well, may make the data seem more abundant than it really is (leading to a bandwidth tau that may be far from optimal for test data). Thus, when leaving out a sample in cross validation, we actually left out a large window (16 seconds) of data around that sample, to diminish this bias.

Figure 2: (a) Examples of plots comparing a model fit using the parameterization described in the text (solid lines) to some other models (dash-dot lines).
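To make the regression step concrete, here is a minimal sketch of locally weighted linear regression as described above. The data, query point, and bandwidth are made-up stand-ins (in the paper, tau was chosen by windowed leave-one-out cross validation on flight data):

```python
# Sketch of locally weighted linear regression: weight training points by
# their distance to the query, then solve the weighted normal equations.
# Data and bandwidth here are hypothetical, not the paper's flight data.
import numpy as np

def lwr_predict(X, Y, x_query, tau):
    """Predict y at x_query, giving nearby training points larger weight."""
    # Diagonal weights: points near the query get weight close to 1.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # beta = (X^T W X)^{-1} X^T W Y, solved without forming the inverse.
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
    return x_query @ beta

# Tiny illustration on y = x^2, which a *local* linear fit can track.
X = np.c_[np.ones(9), np.linspace(-2, 2, 9)]   # rows are [1, x]
Y = X[:, 1] ** 2
pred = lwr_predict(X, Y, np.array([1.0, 1.0]), tau=0.5)
```

The prediction at x = 1 comes out close to the true value 1 (with a small upward bias from the curvature of x^2), even though each individual fit is linear.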
Each point plotted shows the mean-squared error between the predicted value of a state variable, when a model is used to simulate the helicopter's dynamics for a certain duration (indicated on the x-axis), and the true value of that state variable (as measured on test data) after the same duration. Top left: Comparison of ẋ-error to a model not using certain additional terms. Top right: Comparison of ẋ-error to a model omitting the intercept (bias) term. Bottom: Comparison of ẋ- and θ̇-error to the linear deterministic model identified by [12]. (b) The solid line is one component of the true helicopter state on 10s of test data. The dash-dot line is the helicopter state predicted by our model, given the initial state at time 0 and all the intermediate control inputs. The dotted lines show two standard deviations in the estimated state. Every two seconds, the estimated state is "reset" to the true state, and the track restarts with zero error. Note that the estimated state is of the full, high-dimensional state of the helicopter, but only one state variable is shown here. (c) Policy class. The pictures inside the circles indicate whether a node outputs the sum of its inputs, or the tanh of the sum of its inputs. Each edge with an arrow in the picture denotes a tunable parameter. The solid lines show the hovering policy class (Section 5). The dashed lines show the extra weights added for trajectory following (Section 6).

Applying locally weighted regression with the state s_t and action a_t as inputs, and the one-step differences (e.g., s_{t+1} - s_t) of each of the state variables in turn as the target output, this gives us a non-linear, stochastic model of the dynamics, allowing us to predict s_{t+1} as a function of s_t and a_t plus noise. We actually used several refinements to this model.
Similar to the use of body coordinates to exploit symmetries, there is other prior knowledge that can be incorporated. Since both φ and φ̇ are state variables, and we know that (at 50Hz) φ_{t+1} ≈ φ_t + φ̇_t/50, there is no need to carry out a regression for φ. Similarly, we know that the roll angle φ of the helicopter should have no direct effect on forward velocity ẋ. So, when performing regression to estimate ẋ, the coefficient corresponding to φ can be set to 0. This allows us to reduce the number of parameters that have to be fit. Similar reasoning allows us to conclude (cf. [12]) that certain other parameters should be 1/50 or g (gravity), and these were also hard-coded into the model. Finally, we added three extra (unobserved) variables to model latencies in the responses to the controls. (See [9] for details.)

Some of the (other) choices that we considered in selecting a model include whether to use certain additional terms; whether to include an intercept term; at what frequency to identify the model; whether to hardwire certain coefficients as described; and whether to use weighted or unweighted linear regression. Our main tool for choosing among the models was plots such as those shown in Figure 2a. (See figure caption.) We were particularly interested in checking how accurate a model is not just for predicting s_{t+1} from s_t, but how accurate it is at longer time scales.
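This long-horizon check, simulating the model forward for a given duration and comparing against the recorded state, can be sketched as follows. The one-dimensional dynamics and the slightly-off "identified" model here are illustrative stand-ins, not the helicopter model:

```python
# Sketch of the long-horizon model check: roll a one-step model forward
# for `horizon` steps and measure mean-squared error against held-out data.
# The dynamics and "identified model" below are hypothetical stand-ins.
import numpy as np

def rollout_mse(model_step, states, controls, horizon):
    """MSE between simulated and true state, `horizon` steps ahead."""
    errs = []
    for t in range(len(states) - horizon):
        s = states[t]
        for k in range(horizon):
            s = model_step(s, controls[t + k])   # simulate forward
        errs.append((s - states[t + horizon]) ** 2)
    return float(np.mean(errs))

# Stand-in 1-D system: true dynamics s' = 0.9 s + u, model slightly off.
rng = np.random.default_rng(0)
controls = rng.normal(size=200)
states = [0.0]
for u in controls:
    states.append(0.9 * states[-1] + u)
states = np.array(states[:-1])

model = lambda s, u: 0.88 * s + u          # imperfect identified model
short = rollout_mse(model, states, controls, horizon=1)
long = rollout_mse(model, states, controls, horizon=25)
assert short < long   # modeling error compounds over longer horizons
```

A model can look accurate one step ahead while drifting badly over seconds of simulated flight, which is exactly what the Figure 2a plots are designed to expose.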
Each of the panels in Figure 2a shows, for a model, the mean-squared error (as measured on test data) between the helicopter's true position and the estimated position at a certain time in the future (indicated on the x-axis).

The helicopter's blade-tip moves at an appreciable fraction of the speed of sound. Given the danger and expense (about $70,000) of autonomous helicopters, we wanted to verify the fitted model carefully, so as to be reasonably confident that a controller tested successfully in simulation will also be safe in real life. Space precludes a full discussion, but one of our concerns was the possibility that unmodeled correlations in the noise ε might mean the noise variance of the actual dynamics is much larger than predicted by the model. (See [9] for details.)
To check against this, we examined many plots such as shown in Figure 2, to check that the helicopter state "rarely" goes outside the errorbars predicted by our model at various time scales (see caption).

4 Reinforcement learning: The PEGASUS algorithm

We used the PEGASUS reinforcement learning algorithm of [10], which we briefly review here. Consider an MDP with state space S, initial state s_0 in S, action space A, state transition probabilities P_sa(.), reward function R: S -> R, and discount γ. Also let some family Π of policies π: S -> A be given, and suppose our goal is to find a policy in Π with high utility, where the utility of π is defined to be

U(π) = E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + ... | π],

where the expectation is over the random sequence of states s_0, s_1, s_2, ... visited over time when π is executed in the MDP starting from state s_0.

These utilities are in general intractable to calculate exactly, but suppose we have a computer simulator of the MDP's dynamics, that is, a program that inputs a state s and action a and outputs a next state s' drawn from P_sa. Then a standard way to define an estimate Û(π) of U(π) is via Monte Carlo: We can use the simulator to sample a trajectory s_0, s_1, ..., and by taking the empirical sum of discounted rewards on this sequence, we obtain one "sample" with which to estimate U(π). More generally, we could generate m such sequences, and average to obtain a better estimator.
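As a sketch, on a toy MDP with made-up dynamics and reward: estimating U(π) by averaging empirical discounted returns, and showing that fixing the simulator's random numbers in advance (the observation PEGASUS builds on) makes the estimate a deterministic function of the policy:

```python
# Monte Carlo estimate of a policy's utility in a toy 1-D MDP.
# With the random numbers fixed in advance, the estimate becomes an
# ordinary deterministic function of the policy (the PEGASUS idea).
import numpy as np

def estimate_utility(policy, m=30, horizon=100, gamma=0.95, seed=None):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(m):
        s, ret = 0.0, 0.0
        for t in range(horizon):
            # Toy stochastic dynamics: drift toward the commanded value.
            s = s + 0.1 * (policy(s) - s) + 0.05 * rng.normal()
            ret += gamma ** t * -(s - 1.0) ** 2   # reward: stay near 1.0
        total += ret
    return total / m

policy = lambda s: 1.0                   # try to hold the state at 1.0
a = estimate_utility(policy)             # two unseeded evaluations of the
b = estimate_utility(policy)             # same policy generally differ
c = estimate_utility(policy, seed=0)     # fixed random numbers:
d = estimate_utility(policy, seed=0)     # identical by construction
assert c == d
```

The unseeded estimates a and b jitter from run to run, which is precisely what makes naive stochastic search over policies difficult; the seeded estimates agree exactly.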
We can then try to optimize the estimated utilities and search for arg max over π in Π of Û(π).

Unfortunately, this is a difficult stochastic optimization problem: Evaluating Û(π) involves a Monte Carlo sampling process, and two different evaluations of Û(π) will typically give slightly different answers. Moreover, even if the number of samples m that we average over is arbitrarily large, Û(π) will fail with probability 1 to be a ("uniformly") good estimate of U(π). In our experiments, this fails to learn any reasonable controller for our helicopter.

The PEGASUS method uses the observation that almost all computer simulations of the form described sample s' ~ P_sa by first calling a random number generator to get one (or more) random numbers p, and then calculating s' as some deterministic function of the inputs s, a and the random p. If we demand that the simulator expose its interface to the random number generator, then by pre-sampling all the random numbers p in advance and fixing them, we can then use these same, fixed, random numbers to evaluate any policy. Since all the random numbers are fixed, Û(π) is just an ordinary deterministic function, and standard search heuristics can be used to search for arg max over π of Û(π). Importantly, this also allows us to show that, so long as we average over a number of samples m that is at most polynomial in all quantities of interest, then with high probability, Û will be a uniformly good estimate of U. This also allows us to give guarantees on the performance of the solutions found. For further discussion of PEGASUS and other work such as the variance reduction and gradient estimation methods (cf. [6, 5]), see [9].

5 Learning to Hover

One previous attempt had been made to use a learning algorithm to fly this helicopter, using μ-synthesis [2].
This succeeded in flying the helicopter in simulation, but not on the actual helicopter (Shim, pers. comm.). Similarly, preliminary experiments using two other controller-design methods to fly a similar helicopter were also unsuccessful. These comments should not be taken as conclusive of the viability of any of these methods; rather, we take them to be indicative of the difficulty and subtlety involved in learning a helicopter controller.
Figure 3: Comparison of hovering performance of learned controller (solid line) vs. Yamaha licensed/specially trained human pilot (dotted line). Top: x, y, z velocities. Bottom: x, y, z positions.

We began by learning a policy for hovering in place. We want a controller that, given the current helicopter state and a desired hovering position and orientation (x*, y*, z*, ω*), computes controls a = (a1, a2, a3, a4) to make it hover stably there. For our policy class, we chose the simple neural network depicted in Figure 2c (solid edges only). Each of the edges in the figure represents a weight, and the connections were chosen via simple reasoning about which control channel should be used to control which state variables. For instance, consider the longitudinal (forward/backward) cyclic pitch control a1, which causes the rotor plane to tilt forward/backward, thus causing the helicopter to pitch (and/or accelerate) forward or backward.
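The exact wiring of Figure 2c is not reproducible here, but the flavor of the policy class can be sketched with a minimal stand-in: each control is a small sum/tanh combination of errors, velocities, and attitude, with tunable weights. The wiring and the weight values below are hypothetical, not the paper's:

```python
# Minimal stand-in for the hover policy class of Figure 2c: the a1
# (longitudinal cyclic) control as a sum/tanh combination of the forward
# position error, forward velocity, and pitch. Weights are made up.
import numpy as np

def hover_policy_a1(w, x_err, xdot, theta):
    """Longitudinal cyclic command from forward error, velocity, pitch."""
    inner = w[0] * x_err + w[1] * xdot + w[2] * theta
    return np.tanh(inner + w[3])       # squash to a bounded control

w = np.array([-0.3, -0.2, -0.5, 0.0])  # hypothetical tunable weights
a1 = hover_policy_a1(w, x_err=2.0, xdot=0.5, theta=0.1)
assert -1.0 < a1 < 0.0   # positive forward error -> pitch-back command
```

Every number in w is a parameter the policy-search step (Section 5) is free to adjust; the tanh node keeps the commanded control bounded.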
From Figure 2c, we can read off the a1 control as a weighted combination (with tanh units where the figure indicates them) of the pitch θ, the forward velocity ẋ^b, and the position error x_err. Here, the w_i's are the tunable parameters (weights) of the network, and x_err is defined to be the error in the x-position (forward direction, in body coordinates) between where the helicopter currently is and where we wish it to hover.

We chose a quadratic cost function on the (spatial representation of the) state:

R(s) = -(α_1 x_err^2 + α_2 y_err^2 + α_3 z_err^2 + α_4 ẋ^2 + α_5 ẏ^2 + α_6 ż^2 + α_7 ω_err^2),   (1)

where x_err = x - x*, etc.(2) This encourages the helicopter to hover near (x*, y*, z*, ω*), while also keeping the velocity small and not making abrupt movements. The weights α_1, α_2, etc. (distinct from the weights w_i parameterizing our policy class) were chosen to scale each of the terms to be roughly the same order of magnitude. To encourage small actions and smooth control of the helicopter, we also used a quadratic penalty for actions: R(a) = -(β_1 a1^2 + β_2 a2^2 + β_3 a3^2 + β_4 a4^2), and the overall reward was R(s) + R(a).

Using the model identified in Section 3, we can now apply PEGASUS to define approximations Û(π) to the utilities of policies. Since policies are smoothly parameterized in the weights, and the dynamics are themselves continuous in the actions, the estimates of utilities are also continuous in the weights.(3) We may thus apply standard hillclimbing algorithms to maximize Û(π) in terms of the policy's weights.

(2) The ω_err error term is computed with appropriate wrapping about 2π rad, so that if the target is 0.01 rad and the helicopter is currently facing 2π - 0.01 rad, the error is 0.02 rad, not 2π - 0.02 rad.

(3) Actually, this is not true. One last component of the reward that we did not mention earlier was that, if in performing the locally weighted regression the matrix X^T W X is singular to numerical precision, then we declare the helicopter to have "crashed," terminate the simulation, and give it a huge negative (-50000) reward.
Because the test checking whether X^T W X is singular to numerical precision returns either 1 or 0, the estimated utility has a discontinuity between "crash" and "not-crash."

Figure 4: Top row: Maneuver diagrams from RC helicopter competition. [Source: www.modelaircraft.org]
Bottom row: Actual trajectories flown using learned controller.

We tried both a gradient ascent algorithm, in which we numerically evaluate the derivative of Û(π) with respect to the weights and then take a step in the indicated direction, and a random-walk algorithm in which we propose a random perturbation to the weights, and move there if it increases Û(π). Both of these algorithms worked well, though with gradient ascent, it was important to scale the derivatives appropriately, since the estimates of the derivatives were sometimes numerically unstable.(4) It was also important to apply some standard heuristics to prevent its solutions from diverging (such as verifying after each step that we did indeed take a step uphill on the objective Û, and undoing/redoing the step using a smaller stepsize if this was not the case).

The most expensive step in policy search was the repeated Monte Carlo evaluation to obtain Û(π). To speed this up, we parallelized our implementation: Monte Carlo evaluations using different samples were run on different computers, and the results were then aggregated to obtain Û(π). We ran PEGASUS using 30 Monte Carlo evaluations of 35 seconds of flying time each. Figure 1b shows the result of implementing and running the resulting policy on the helicopter. On its maiden flight, our learned policy was successful in keeping the helicopter stabilized in the air. (We note that [1] was also successful at using our PEGASUS algorithm to control a subset, the cyclic pitch controls, of a helicopter's dynamics.)

We also compare the performance of our learned policy against that of our human pilot trained and licensed by Yamaha to fly the R-50 helicopter. Figure 3 shows the velocities and positions of the helicopter under our learned policy and under the human pilot's control.
As we see, our controller was able to keep the helicopter flying more stably than was the human pilot. Videos of the helicopter flying are available at

http://www.cs.stanford.edu/~ang/nips03/

6 Flying competition maneuvers

We were next interested in making the helicopter learn to fly several challenging maneuvers. The Academy of Model Aeronautics (AMA) (to our knowledge the largest RC helicopter organization) holds an annual RC helicopter competition, in which helicopters have to be accurately flown through a number of maneuvers. This competition is organized into Class I (for beginners, with the easiest maneuvers) through Class III (with the most difficult maneuvers, for the most advanced pilots). We took the first three maneuvers from the most challenging, Class III, segment of their competition.

Figure 4 shows maneuver diagrams from the AMA web site. In the first of these maneuvers (III.1), the helicopter starts from the middle of the base of a triangle, flies backwards to the lower-right corner, performs a pirouette (turning in place), flies backwards up an edge of the triangle, backwards down the other edge, performs another pirouette, and flies backwards to its starting position. Flying backwards is a significantly less stable maneuver than flying forwards, which makes this maneuver interesting and challenging.

(4) A problem exacerbated by the discontinuities described in the previous footnote.
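Heading targets for maneuvers like these (and the yaw term in the reward of Section 5) need the 2π wrapping mentioned in the earlier footnote; a sketch of that wrapping:

```python
# Sketch of the angle-error wrapping used for the yaw term in the reward:
# map the raw difference into (-pi, pi], so that facing 2*pi - 0.01 rad
# with a target of 0.01 rad gives an error of magnitude 0.02, not ~2*pi.
import math

def yaw_error(omega, omega_target):
    """Smallest signed angular difference between two headings (rad)."""
    err = (omega - omega_target) % (2.0 * math.pi)   # in [0, 2*pi)
    if err > math.pi:
        err -= 2.0 * math.pi                         # fold into (-pi, pi]
    return err

e = yaw_error(2.0 * math.pi - 0.01, 0.01)
assert abs(e + 0.02) < 1e-9   # signed error -0.02 rad, magnitude 0.02
```

Without this wrapping, a quadratic yaw penalty would punish a near-perfect heading that happens to sit just across the 0/2π seam.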
In the second maneuver (III.2), the helicopter has to perform a nose-in turn, in which it flies backwards out to the edge of a circle, pauses, and then flies in a circle but always keeping the nose of the helicopter pointed at the center of rotation. After it finishes circling, it returns to the starting point. Many human pilots seem to find this second maneuver particularly challenging. Lastly, maneuver III.3 involves flying the helicopter in a vertical rectangle, with pirouettes in opposite directions halfway along the rectangle's vertical segments.

How does one design a controller for flying trajectories? Given a controller for keeping a system's state at a point (x*, y*, z*, ω*), one standard way to make the system move through a particular trajectory is to slowly vary (x*, y*, z*, ω*) along a sequence of set points on that trajectory. (E.g., see [4].) For instance, if we ask our helicopter to hover at some point, then a fraction of a second later ask it to hover at a point displaced slightly in the x-direction, and so on, our helicopter will slowly fly in the x-direction. By taking this procedure and "wrapping" it around our old policy class from Figure 2c, we thus obtain a computer program, that is, a new policy class, not just for hovering, but also for flying arbitrary trajectories. I.e., we now have a family of policies that take as input a trajectory, and that attempt to make the helicopter fly that trajectory. Moreover, we can now also retrain the policy's parameters for accurate trajectory following, not just hovering.

Since we are now flying trajectories and not only hovering, we also augmented the policy class to take into account more of the coupling between the helicopter's different sub-dynamics. For instance, the simplest way to turn is to change the tail rotor collective pitch/thrust, so that it yaws either left or right. This works well for small turns, but for large turns, the thrust from the tail rotor also tends to cause the helicopter to drift sideways.
Thus, we enriched the policy class to allow it to correct for this drift by applying the appropriate cyclic pitch controls. Also, having a helicopter climb or descend changes the amount of work done by the main rotor, and hence the amount of torque/anti-torque generated, which can cause the helicopter to turn. So, we also added a link between the collective pitch control and the tail rotor control. These modifications are shown in Figure 2c (dashed lines).

We also needed to specify a reward function for trajectory following. One simple choice for R would have been to use Equation (1) with the newly-defined (time-varying) (x*, y*, z*, ω*). But we did not consider this to be a good choice. Specifically, consider making the helicopter fly in the increasing x-direction, so that (x*, y*, z*, ω*) starts off at the origin (say), and has its first coordinate x* slowly increased over time. Then, while the actual helicopter position x will indeed increase, it will also almost certainly lag consistently behind x*. This is because the hovering controller is always trying to "catch up" to the moving (x*, y*, z*, ω*). Thus, x - x* may remain large, and the helicopter will continuously incur an x - x* cost, even if it is in fact flying a very accurate trajectory in the increasing x-direction exactly as desired. It would be undesirable to have the helicopter risk trying to fly more aggressively to reduce this fake "error," particularly if it is at the cost of increased error in the other coordinates. So, we changed the reward function to penalize deviation not from (x*, y*, z*, ω*), but instead deviation from (x̂, ŷ, ẑ, ω̂), where (x̂, ŷ, ẑ, ω̂) is the "projection" of the helicopter's position onto the path of the idealized, desired trajectory. (In our example of flying in a straight line along the x-axis, for a helicopter at (x, y, z), we easily see (x̂, ŷ, ẑ) = (x, 0, 0).)
Thus, we imagine an "external observer" that looks at the actual helicopter state and estimates which part of the idealized trajectory the helicopter is trying to fly through (taking care not to be confused if the trajectory loops back on itself), and the learning algorithm pays a penalty that is quadratic in the distance between the actual position and the "tracked" position on the idealized trajectory.

We also needed to make sure the helicopter is rewarded for making progress along the trajectory. To do this, we used the potential-based shaping rewards of [8].
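A minimal sketch of a potential-based shaping term in the style of [8], taking the potential Φ to be progress (arc length) along the desired trajectory (the discount γ and function names here are illustrative, not the paper's):

```python
def shaping_reward(progress_before, progress_after, gamma=0.99):
    """Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s),
    with Phi(s) = progress along the desired trajectory. Adding such an
    F to the reward leaves the optimal policy unchanged [8], while
    rewarding forward motion along the path."""
    return gamma * progress_after - progress_before

# sufficient forward progress earns a positive shaped reward...
assert shaping_reward(1.0, 1.2) > 0
# ...while stalling in place is mildly penalized (since gamma < 1)
assert shaping_reward(1.0, 1.0) < 0
```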
Since we are already tracking where along the desired trajectory the helicopter is, we chose a potential function that increases along the trajectory. Thus, whenever the helicopter's projected position (x_p, y_p, z_p) makes forward progress along this trajectory, it receives positive reward. (See [8].)

Finally, our modifications have decoupled our definition of the reward function from the evolution of (x*, y*, z*) in time. So, we are now also free to consider allowing (x*, y*, z*) to evolve in a way that is different from the path of the desired trajectory, but nonetheless in a way that allows the helicopter to follow the actual, desired trajectory more accurately. (In control theory, there is a related practice of using the inverse dynamics to obtain better tracking behavior.) We considered several alternatives, but the main one used ended up being a modification for flying trajectories that have both a vertical and a horizontal component (such as along the two upper edges of the triangle in III.1). Specifically, it turns out that the z (vertical) response of the helicopter is very fast: to climb, we need only increase the collective pitch control, which almost immediately causes the helicopter to start accelerating upwards. In contrast, the x and y responses are much slower. Thus, if (x*, y*, z*) moves upwards at an angle as in maneuver III.1, the helicopter will tend to track the z-component of the trajectory much more quickly, so that it accelerates into a climb steeper than desired, resulting in a "bowed-out" trajectory. Similarly, an angled descent results in a "bowed-in" trajectory.
To correct for this, we artificially slowed down the z-response, so that when (x*, y*, z*) is moving into an angled climb or descent, the (x*, y*) portion will evolve normally with time, but the changes to z* will be delayed by t_d seconds, where t_d here is another parameter in our policy class, to be automatically learned by our algorithm.

Using this setup and retraining our policy class's parameters for accurate trajectory following, we were able to learn a policy that flies all three of the competition maneuvers fairly accurately. Figure 4 (bottom) shows actual trajectories taken by the helicopter while flying these maneuvers. Videos of the helicopter flying these maneuvers are also available at the URL given at the end of Section 5.

References
[1] J. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Int'l Conf. Robotics and Automation. IEEE, 2001.
[2] G. Balas, J. Doyle, K. Glover, A. Packard, and R. Smith. μ-analysis and synthesis toolbox user's guide, 1995.
[3] W. Cleveland. Robust locally weighted regression and smoothing scatterplots. J. Amer. Stat. Assoc., 74, 1979.
[4] Gene F. Franklin, J. David Powell, and Abbas Emami-Naeini. Feedback Control of Dynamic Systems. Addison-Wesley, 1995.
[5] Y. Ho and X. Cao. Perturbation analysis of discrete event dynamic systems. Kluwer, 1991.
[6] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23:462-466, 1952.
[7] J. Leishman. Principles of Helicopter Aerodynamics. Cambridge Univ. Press, 2000.
[8] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proc.
16th ICML, pages 278-287, 1999.
[9] Andrew Y. Ng. Shaping and policy search in reinforcement learning. PhD thesis, EECS, University of California, Berkeley, 2003.
[10] Andrew Y. Ng and Michael I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proc. 16th Conf. Uncertainty in Artificial Intelligence, 2000.
[11] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 11, 1997.
[12] Hyunchul Shim. Hierarchical flight control system synthesis for rotorcraft-based unmanned aerial vehicles. PhD thesis, Mech. Engr., U.C. Berkeley, 2000.