{"title": "Rapidly Adapting Artificial Neural Networks for Autonomous Navigation", "book": "Advances in Neural Information Processing Systems", "page_first": 429, "page_last": 435, "abstract": null, "full_text": "Rapidly Adapting Artificial Neural Networks for Autonomous Navigation \n\nDean A. Pomerleau \n\nSchool of Computer Science \nCarnegie Mellon University \nPittsburgh, PA 15213 \n\nAbstract \n\nThe ALVINN (Autonomous Land Vehicle In a Neural Network) project addresses the problem of training artificial neural networks in real time to perform difficult perception tasks. ALVINN is a back-propagation network that uses inputs from a video camera and an imaging laser rangefinder to drive the CMU Navlab, a modified Chevy van. This paper describes training techniques which allow ALVINN to learn in under 5 minutes to autonomously control the Navlab by watching a human driver's response to new situations. Using these techniques, ALVINN has been trained to drive in a variety of circumstances including single-lane paved and unpaved roads, multilane lined and unlined roads, and obstacle-ridden on- and off-road environments, at speeds of up to 20 miles per hour. \n\n1  INTRODUCTION \n\nPrevious trainable connectionist perception systems have often ignored important aspects of the form and content of available sensor data. Because of the assumed impracticality of training networks to perform realistic high level perception tasks, connectionist researchers have frequently restricted their task domains to either toy problems (e.g., the T-C identification problem [11] [6]) or fixed low level operations (e.g., edge detection [8]). While these restricted domains can provide valuable insight into connectionist architectures and implementation techniques, they frequently ignore the complexities associated with real world problems. 
\n\nThere are exceptions to this trend towards simplified tasks. Notable successes in high level domains such as speech recognition [12], character recognition [5] and face recognition [2] have been achieved using real sensor data. However, the results have come only in very controlled environments, after careful preprocessing of the input to segment and label the training exemplars. In addition, these successful connectionist perception systems have ignored the fact that sensor data normally becomes available gradually and not as a monolithic training set. In short, artificial neural networks previously have never been successfully trained using sensor data in real time to perform a real world perception task. \n\nFigure 1: ALVINN's previous (left) and current (right) architectures. [Diagram labels: Road Intensity Feedback Unit; output units running from Sharp Left through Straight Ahead to Sharp Right; 30 Output Units; 30x32 Sensor Retina.] \n\nThe ALVINN (Autonomous Land Vehicle In a Neural Network) system remedies this shortcoming. ALVINN is a back-propagation network designed to drive the CMU Navlab, a modified Chevy van. Using real time training techniques, the system quickly learns to autonomously control the Navlab by watching a human driver's reactions. ALVINN has been trained to drive in a variety of circumstances including single-lane paved and unpaved roads, multilane lined and unlined roads and obstacle-ridden on- and off-road environments, at speeds of up to 20 miles per hour. This paper will primarily focus on improvements and extensions made to the ALVINN system since the presentation of this work at the 1988 NIPS conference [9]. 
\n\n2  NETWORK ARCHITECTURE \n\nThe current architecture for an individual ALVINN driving network is significantly simpler than the previous version (See Figure 1). The input layer now consists of a single 30x32 unit \"retina\" onto which a sensor image from either the video camera or the laser rangefinder is projected. Each of the 960 input units is fully connected to the hidden layer of 5 units, which is in turn fully connected to the output layer. The 30 unit output layer is a linear representation of the currently appropriate steering direction which may serve to keep the vehicle on the road or to prevent it from colliding with nearby obstacles (footnote 1). The centermost output unit represents the \"travel straight ahead\" condition, while units to the left and right of center represent successively sharper left and right turns. \n\nFootnote 1: The task a particular driving network performs depends on the type of input sensor image and the driving situation it has been trained to handle. \n\nThe reductions in network complexity over previous versions have been made in response to experience with ALVINN in actual driving situations. I have found that the distributed nature of the internal representation allows a network of only 5 hidden units to accurately drive in a variety of situations. I have also learned that multiple sensor inputs to a single network are redundant and can be eliminated. For instance, when training a network on a single-lane road, there is sufficient information in the video image alone for accurate driving. Similarly, for obstacle avoidance, the laser rangefinder image is sufficient and the video image is superfluous. The road intensity feedback unit has been eliminated on similar grounds. 
In the previous architecture, it provided the network with the relative intensity of the road vs. the non-road in the previous image. This information was unnecessary for accurate road following, and undefined in new ALVINN domains such as off-road driving. \n\nTo drive the Navlab, an image from the appropriate sensor is reduced to 30 x 32 pixels and projected onto the input layer. After propagating activation through the network, the output layer's activation profile is translated into a vehicle steering command. The steering direction dictated by the network is taken to be the center of mass of the \"hill\" of activation surrounding the output unit with the highest activation level. Using the center of mass of activation instead of the most active output unit when determining the direction to steer permits finer steering corrections, thus improving ALVINN's driving accuracy. \n\n3  TRAINING \"ON-THE-FLY\" \n\nThe most interesting recent improvement to ALVINN is the training technique. Originally, ALVINN was trained with back-propagation using 1200 simulated scenes portraying roads under a wide variety of weather and lighting conditions [9]. Once trained, the network was able to drive the Navlab at up to 1.8 meters per second (3.5 mph) along a 400 meter path through a wooded area of the CMU campus in weather which included snowy, rainy, sunny, and cloudy situations. \n\nDespite its apparent success, this training paradigm had serious shortcomings. It required approximately 6 hours of Sun-4 CPU time to generate the synthetic road scenes, and then an additional 45 minutes of Warp (footnote 2) computation time to train the network. 
Furthermore, while effective at training the network to drive on a single-lane road, extending the synthetic training paradigm to deal with more complex driving situations like multilane and off-road driving would have required prohibitively complex artificial scene generators. \n\nI have developed a scheme called training \"on-the-fly\" to deal with these problems. Using this technique, the network learns to imitate a person as he drives. The network is trained with back-propagation using the latest video camera image as input and the person's current steering direction as the desired output. \n\nThere are two potential problems associated with this simple training on-the-fly scheme. First, since the person steers the vehicle down the center of the road during training, the network will never be presented with situations where it must recover from misalignment errors. When driving for itself, the network may occasionally stray from the road center, so it must be prepared to recover by steering the vehicle back to the middle of the road. The second problem is that naively training the network with only the current video image and steering direction may cause it to overlearn recent inputs. If the person drives the Navlab down a stretch of straight road near the end of training, the network will be presented with a long sequence of similar images. This sustained lack of diversity in the training set will cause the network to \"forget\" what it had learned about driving on curved roads and instead learn to always steer straight ahead. \n\nBoth problems associated with training on-the-fly stem from the fact that back-propagation requires training data which is representative of the full task to be learned. 
To provide the necessary variety of exemplars while still training on real data, the simple training on-the-fly scheme described above must be modified. Instead of presenting the network with only the current video image and steering direction, each original image is shifted and rotated in software to create 14 additional images in which the vehicle appears to be situated differently relative to the environment (See Figure 2). The sensor's position and orientation relative to the ground plane are known, so precise transformations can be achieved using perspective geometry. The correct steering direction as dictated by the driver for the original image is altered for each of the transformed images to account for the altered vehicle placement (footnote 3). Using transformed training patterns allows the network to learn how to recover from driving errors. Also, overtraining on repetitive images is less of a problem, since the transformed training exemplars add variety to the training set. As additional insurance against the effects of repetitive exemplars, the training set diversity is further increased by maintaining a buffer of previously encountered training patterns. \n\nFootnote 2: There was formerly a 100 MFLOP Warp systolic array supercomputer onboard the Navlab. It has been replaced by 3 Sun-4s, further necessitating the streamlined architecture described in the previous section. \n\nFigure 2: The single original video image is shifted and rotated to create multiple training exemplars in which the vehicle appears to be at different locations relative to the road. [Panel labels: Original Image; Shifted and Rotated Images.] \n\nIn practice, training on-the-fly works as follows. A live sensor image is digitized and reduced to the low resolution image required by the network. 
This single original image is shifted and rotated 14 times to create 14 additional training exemplars (footnote 4). Fifteen old exemplars from the current training set of 200 patterns are chosen and replaced by the 15 new exemplars. The 15 exemplars to be replaced in the training set are chosen on the basis of how closely they match the steering direction of one of the new tokens. Exchanging a new token for an old token with a similar steering direction helps maintain diversity in the training buffer during monotonous stretches of road by preventing novel older patterns from being replaced by recent redundant ones. \n\nFootnote 3: A simple steering model is used when transforming the driver's original direction. It assumes the \"correct\" steering direction is the one that will eliminate the additional vehicle translation and rotation introduced by the transformation and bring the vehicle to the point the person was originally steering towards a fixed distance ahead of the vehicle. \n\nFootnote 4: The shifts are chosen randomly from the range -1.25 to +1.25 meters and the rotations from the range -6.0 to +6.0 degrees. 
\n\nAfter this replacement process, one forward and one backward pass of the back-propagation algorithm is performed on the 200 exemplars to update the network's weights. The entire process is then repeated. The network requires approximately 50 iterations through this digitize-replace-train cycle to learn to drive in the domains that have been tested. Running on a Sun-4, this takes about five minutes during which a person drives the Navlab at about 4 miles per hour over the training road. \n\nFigure 3: Video images taken on three of the test roads ALVINN has been trained to drive on. They are, from left to right, a single-lane dirt access road, a single-lane paved bicycle path, and a lined two-lane highway. \n\n4  RESULTS AND DISCUSSION \n\nOnce it has learned, the network can accurately traverse the length of road used for training and also generalize to drive along parts of the road it has never encountered under a variety of weather conditions. In addition, since determining the steering direction from the input image merely involves a forward sweep through the network, the system is able to process 25 images per second, allowing it to drive at up to the Navlab's maximum speed of 20 miles per hour (footnote 5). This is over twice as fast as any other sensor-based autonomous system has driven the Navlab [3] [7]. \n\nThe training on-the-fly scheme gives ALVINN a flexibility which is novel among autonomous navigation systems. It has allowed me to successfully train individual networks to drive in a variety of situations, including a single-lane dirt access road, a single-lane paved bicycle path, a two-lane suburban neighborhood street, and a lined two-lane highway (See Figure 3). Using other sensor modalities as input, including laser range images and laser reflectance images, individual ALVINN networks have been trained to follow roads in total darkness, to avoid collisions in obstacle rich environments, and to follow alongside railroad tracks. ALVINN networks have driven in each of these situations for up to 1/2 mile, until reaching a dead end or a difficult intersection. 
The development of a system for each of these domains using the \"traditional approach\" to autonomous navigation would require the programmer to 1) determine what features are important for the particular task, 2) program detectors (using statistical or symbolic techniques) for finding these important features and 3) develop an algorithm for determining which direction to steer from the location of the detected features. \n\nIn contrast, ALVINN is able to learn for each new domain what image features are important, how to detect them and how to use their position to steer the vehicle. Analysis of the hidden unit representations developed in different driving situations shows that the network forms detectors for the image features which correlate with the correct steering direction. When trained on multilane roads, the network develops hidden unit feature detectors for the lines painted on the road, while in single-lane driving situations, the detectors developed are sensitive to road edges and road-shaped regions of similar intensity in the image. For a more detailed analysis of ALVINN's internal representations see [9] [10]. \n\nFootnote 5: The Navlab has a hydraulic drive system which allows for very precise speed control, but which prevents the vehicle from driving over 20 miles per hour. \n\nThis ability to utilize arbitrary image features can be problematic. This was the case when ALVINN was trained to drive on a poorly defined dirt road with a distinct ditch on its right side. The network had no problem learning and then driving autonomously in one direction, but when driving the other way, the network was erratic, swerving from one side of the road to the other. After analyzing the network's hidden representation, the reason for its difficulty became clear. 
Because of the poor distinction between the road and the non-road, the network had developed only weak detectors for the road itself and instead relied heavily on the position of the ditch to determine the direction to steer. When tested in the opposite direction, the network was able to keep the vehicle on the road using its weak road detectors but was unstable because the ditch it had learned to look for on the right side was now on the left. Individual ALVINN networks have a tendency to rely on any image feature consistently correlated with the correct steering direction. Therefore, it is important to expose them to a wide enough variety of situations during training so as to minimize the effects of transient image features. \n\nOn the other hand, experience has shown that it is more efficient to train several domain specific networks for circumstances like one-lane vs. two-lane driving, instead of training a single network for all situations. To prevent this network specificity from reducing ALVINN's generality, I am currently implementing connectionist and non-connectionist techniques for combining networks trained for different driving situations. Using a simple rule-based priority system similar to the subsumption architecture [1], I have recently combined a road following network and an obstacle avoidance network. The road following network uses video camera input to follow a single-lane road. The obstacle avoidance network uses laser rangefinder images as input. It is trained to swerve appropriately to prevent a collision when confronted with obstacles and to drive straight when the terrain ahead is free of obstructions. The arbitration rule gives priority to the road following network when determining the steering direction, except when the obstacle avoidance network outputs a sharp steering command. 
In this case, the urgency of avoiding an imminent collision takes precedence over road following and the steering direction is determined by the obstacle avoidance network. Together, the two networks and the arbitration rule comprise a system capable of staying on the road and swerving to prevent collisions. \n\nTo facilitate other rule-based arbitration techniques, I am currently adding to ALVINN a non-connectionist module which maintains the vehicle's position on a map. Knowing its map position will allow ALVINN to use arbitration rules such as \"when on a stretch of two lane highway, rely primarily on the two lane highway network\". This symbolic mapping module will also allow ALVINN to make high level, goal-oriented decisions such as which way to turn at intersections and when to stop at a predetermined destination. \n\nFinally, I am experimenting with connectionist techniques, such as the task decomposition architecture [6] and the meta-pi architecture [4], for combining networks more seamlessly than is possible with symbolic rules. These connectionist arbitration techniques will enable ALVINN to combine outputs from networks trained to perform the same task using different sensor modalities and to decide when a new expert must be trained to handle the current situation. \n\nAcknowledgements \n\nThe principal support for the Navlab has come from DARPA, under contracts DACA76-85-C-0019, DACA76-85-C-0003 and DACA76-85-C-0002. This research was also funded in part by a grant from Fujitsu Corporation. \n\nReferences \n\n[1]  Brooks, R.A. (1986) A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, vol. RA-2, no. 1, pp. 14-23, April 1986. \n\n[2]  Cottrell, G.W. 
(1990) Extracting features from faces using compression networks: Face, identity, emotion and gender recognition using holons. In Connectionist Models: Proc. of the 1990 Summer School, David Touretzky (Ed.), Morgan Kaufmann, San Mateo, CA. \n\n[3]  Crisman, J.D. and Thorpe, C.E. (1990) Color vision for road following. In Vision and Navigation: The CMU Navlab, Charles Thorpe (Ed.), Kluwer Academic Publishers, Boston, MA. \n\n[4]  Hampshire, J.B. and Waibel, A.H. (1989) The meta-pi network: Building distributed knowledge representations for robust pattern recognition. Carnegie Mellon Technical Report CMU-CS-89-166-R, August 1989. \n\n[5]  LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4). \n\n[6]  Jacobs, R.A., Jordan, M.I., and Barto, A.G. (1990) Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Univ. of Massachusetts Computer and Information Science Technical Report 90-27, March 1990. \n\n[7]  Kluge, K. and Thorpe, C.E. (1990) Explicit models for robot road following. In Vision and Navigation: The CMU Navlab, Charles Thorpe (Ed.), Kluwer Academic Publishers, Boston, MA. \n\n[8]  Koch, C., Bair, W., Harris, J.G., Horiuchi, T., Hsu, A. and Luo, J. (1990) Real-time computer vision and robotics using analog VLSI circuits. In Advances in Neural Information Processing Systems, 2, D.S. Touretzky (Ed.), Morgan Kaufmann, San Mateo, CA. \n\n[9]  Pomerleau, D.A. (1989) ALVINN: An Autonomous Land Vehicle In a Neural Network. In Advances in Neural Information Processing Systems, 1, D.S. Touretzky (Ed.), Morgan Kaufmann, San Mateo, CA. \n\n[10]  Pomerleau, D.A. (1990) Neural network based autonomous navigation. 
In Vision and Navigation: The CMU Navlab, Charles Thorpe (Ed.), Kluwer Academic Publishers, Boston, MA. \n\n[11]  Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986) Learning internal representations by error propagation. In D.E. Rumelhart and J.L. McClelland (Eds.) Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. Bradford Books/MIT Press, Cambridge, MA. \n\n[12]  Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. (1988) Phoneme recognition: Neural Networks vs. Hidden Markov Models. Proceedings from Int. Conf. on Acoustics, Speech and Signal Processing, New York, New York.", "award": [], "sourceid": 432, "authors": [{"given_name": "Dean", "family_name": "Pomerleau", "institution": null}]}