{"title": "Neural Network Exploration Using Optimal Experiment Design", "book": "Advances in Neural Information Processing Systems", "page_first": 679, "page_last": 686, "abstract": null, "full_text": "Neural Network Exploration Using Optimal Experiment Design\n\nDavid A. Cohn\n\nDept. of Brain and Cognitive Sciences\n\nMassachusetts Inst. of Technology\n\nCambridge, MA 02139\n\nAbstract\n\nConsider the problem of learning input/output mappings through exploration, e.g. learning the kinematics or dynamics of a robotic manipulator. If actions are expensive and computation is cheap, then we should explore by selecting a trajectory through the input space which gives us the most information in the fewest steps. I discuss how results from the field of optimal experiment design may be used to guide such exploration, and demonstrate its use on a simple kinematics problem.\n\n1 Introduction\n\nMost machine learning research treats the learner as a passive receptacle for data to be processed. This approach ignores the fact that, in many situations, a learner is able, and sometimes required, to act on its environment to gather data.\n\nLearning control inherently involves being active; the controller must act in order to learn the result of its action. When training a neural network to control a robotic arm, one may explore by allowing the controller to \"flail\" for a length of time, moving the arm at random through coordinate space while it accumulates data from which to build a model [Kuperstein, 1988]. This is not feasible, however, if actions are expensive and must be conserved. In these situations, we should choose a training trajectory that will get the most information out of a limited number of steps. 
Manually designing such trajectories is a slow process, and intuitively \"good\" trajectories often fail to sufficiently explore the state space [Armstrong, 1989]. In this paper I discuss an alternative for exploration: automatic, incremental generation of training trajectories using results from \"optimal experiment design.\"\n\nThe study of optimal experiment design (OED) [Fedorov, 1972] is concerned with the design of experiments that are expected to minimize variances of a parameterized model. Viewing actions as experiments that move us through the state space, we can use the techniques of OED to design training trajectories.\n\nThe intent of optimal experiment design is usually to maximize confidence in a given model, minimize parameter variances for system identification, or minimize the model's output variance. Armstrong [1989] used a form of OED to identify link masses and inertial moments of a robot arm, and found that automatically generated training trajectories provided a significant improvement over human-designed trajectories. Automatic exploration strategies have been tried for neural networks (e.g. [Thrun and Moller, 1992], [Moore, 1994]), but use of OED in the neural network community has been limited. Plutowski and White [1993] successfully used it to filter a data set for maximally informative points, but its application to selecting new data has only been proposed [MacKay, 1992], not demonstrated.\n\nThe following section gives a brief description of the relevant results from optimal experiment design. Section 3 describes how these results may be adapted to guide neural network exploration and Section 4 presents experimental results of implementing this adaptation. 
Finally, Section 5 discusses implications of the results, and logical extensions of the current experiments.\n\n2 Optimal experiment design\n\nOptimal experiment design draws heavily on the technique of Maximum Likelihood Estimation (MLE) [Thisted, 1988]. Given a set of assumptions about the learner's architecture and sources of noise in the output, MLE provides a statistical basis for learning. Although the specific MLE techniques we use hold exactly only for linear models, making certain computational approximations allows them to be used with nonlinear systems such as neural networks.\n\nWe begin with a training set of input-output pairs (x_i, y_i)_{i=1}^n and a learner f_w(\u00b7). We define f_w(x) to be the learner's output given input x and weight vector w. Under an assumption of additive Gaussian noise, the maximum likelihood estimate for the weight vector, \u0175, is that which minimizes the sum squared error E_sse = \u03a3_{i=1}^n (f_w(x_i) - y_i)^2. The estimate \u0175 gives us an estimate of the output at a novel input: \u0177 = f_\u0175(x) (see e.g. Figure 1a).\n\nMLE allows us to compute the variances of our weight and output estimates. Writing the output sensitivity as g_\u0175(x) = \u2202f_\u0175(x)/\u2202w, the covariances of \u0175 are\n\n\u03c3^2_\u0175 \u2248 A^{-1} = (\u2202^2 E_sse/\u2202w^2)^{-1} \u2248 (\u03a3_{i=1}^n g_\u0175(x_i) g_\u0175(x_i)^T)^{-1},\n\nwhere the last approximation assumes local linearity of g_\u0175(x). (For brevity, the output sensitivity will be abbreviated to g(x) in the rest of the paper.)\n\nFigure 1: a) A set of training examples for a classification problem, and the network's best fit to the data. b) Maximum likelihood estimate of the network's output variance for the same problem.\n\nFor a given reference input x_r, the estimated output variance is\n\nvar(x_r) = g(x_r)^T A^{-1} g(x_r). 
\n\n(1)\n\nOutput variance corresponds to the model's estimate of the expected squared distance between its output f_\u0175(x) and the unknown \"true\" output y. Output variance, then, corresponds to the model's estimate of its mean squared error (MSE) (see Figure 1b). If the estimates are accurate, then minimizing the output variance would correspond to minimizing the network's MSE.\n\nIn optimal experiment design, we estimate how adding a new training example is expected to change the computed variances. Given a novel x_{n+1}, we can use OED to predict the effect of adding x_{n+1} and its as-yet-unknown y_{n+1} to the training set. We make the assumption that\n\nA_{n+1}^{-1} \u2248 (A_n + g(x_{n+1}) g(x_{n+1})^T)^{-1},\n\nwhich corresponds to assuming that our current model is already fairly good. Based on this assumption, the new parameter variances will be\n\nA_{n+1}^{-1} = A_n^{-1} - A_n^{-1} g(x_{n+1}) (1 + g(x_{n+1})^T A_n^{-1} g(x_{n+1}))^{-1} g(x_{n+1})^T A_n^{-1}.\n\nCombined with Equation 1, this predicts that if we take a new example at x_{n+1}, the change in output variance at reference input x_r will be\n\n\u0394var(x_r) = (g(x_r)^T A_n^{-1} g(x_{n+1}))^2 / (1 + g(x_{n+1})^T A_n^{-1} g(x_{n+1})) = cov(x_r, x_{n+1})^2 / (1 + var(x_{n+1})). (2)\n\nTo minimize the expected value of var(x_r), we should select x_{n+1} so as to maximize the right side of Equation 2. For other interesting OED measures, see MacKay [1992].\n\n3 Adapting OED to Exploration\n\nWhen building a world model, the learner is trying to build a mapping, e.g. from joint angles to Cartesian coordinates (or from state-action pairs to next states). If it is allowed to select arbitrary joint angles (inputs) in successive time steps, then the problem is one of selecting the next \"query\" to make ([Cohn, 1990], [Baum and Lang, 1991]). 
In exploration, however, one's choices for a next input are constrained by the current input. We cannot instantaneously \"teleport\" to remote parts of the state space, but must choose among inputs that are available in the next time step.\n\nOne approach to selecting a next input is to use selective sampling: evaluate a number of possible random inputs, and choose the one with the highest expected gain. In a high-dimensional action space, this is inefficient. The approach followed here is that of gradient search, differentiating Equation 2 and hillclimbing on \u2202\u0394var(x_r)/\u2202x_{n+1}.\n\nNote that Equation 2 gives the expected change in variance only at a single point x_r, while we wish to minimize the average variance over the entire domain. Explicitly integrating over the domain is intractable, so we must make do with an approximation. MacKay [1992] proposed using a fixed set of reference points and measuring the expected change in variance over them. This produces spurious local maxima at the reference points, and has the undesirable effect of arbitrarily quantizing the input space. Our approach is to iteratively draw reference points at random (either uniformly or according to a distribution of interest), and compute a stochastic approximation of \u0394var.\n\nBy climbing the stochastically approximated gradient, either to convergence or to the horizon of available next inputs, we will settle on an input/action with a (locally) optimal decrease in expected variance.\n\n4 Experimental Results\n\nIn this section, I describe two sets of experiments. The first attempts to confirm that the gains predicted by optimal experiment design may actually be realized in practice, and the second studies the application of OED to a simple learning task. 
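
Before turning to the experiments, the selection criterion of Sections 2 and 3 can be illustrated with a minimal sketch. It assumes a toy model f_w(x) = w0 + w1 * x on an input domain [0, 1), a symmetric inverse matrix supplied by the caller, and helper names (mat_vec, dot, expected_variance_drop, updated_inverse, average_drop, toy_sensitivity) that are all hypothetical. It only evaluates Equation 2 and its stochastic average at a given candidate input; it does not perform the gradient hillclimbing used in the experiments.

```python
import random

def mat_vec(matrix, vector):
    # Multiply a square matrix (list of rows) by a vector.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dot(u, v):
    # Inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def expected_variance_drop(a_inv, g_ref, g_new):
    # Equation 2: cov(x_r, x_new)^2 / (1 + var(x_new)).
    a_inv_g_new = mat_vec(a_inv, g_new)
    cov = dot(g_ref, a_inv_g_new)
    var_new = dot(g_new, a_inv_g_new)
    return cov * cov / (1.0 + var_new)

def updated_inverse(a_inv, g_new):
    # Rank-one update of the inverse after adding g_new g_new^T to A.
    # Assumes a_inv is symmetric, as it is for a sum of outer products.
    u = mat_vec(a_inv, g_new)
    scale = 1.0 / (1.0 + dot(g_new, u))
    n = len(u)
    return [[a_inv[i][j] - scale * u[i] * u[j] for j in range(n)]
            for i in range(n)]

def toy_sensitivity(x):
    # Hypothetical sensitivity g(x) for the toy model f_w(x) = w0 + w1 * x.
    return [1.0, x]

def average_drop(a_inv, sensitivity, x_new, num_refs=100, rng=random):
    # Stochastic approximation: average Equation 2 over reference inputs
    # drawn uniformly from [0, 1), which is an assumed input domain.
    g_new = sensitivity(x_new)
    total = 0.0
    for _ in range(num_refs):
        g_ref = sensitivity(rng.random())
        total += expected_variance_drop(a_inv, g_ref, g_new)
    return total / num_refs
```

In this sketch, comparing average_drop across a handful of candidate inputs corresponds to the selective-sampling approach mentioned in Section 3. Because each call draws fresh reference points, such comparisons are noisy unless the same seeded generator state is reused.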
\n\n4.1 Expected versus actual gain\n\nIt must be emphasized that the gains predicted by OED are expected gains. These expectations are based on the relatively strong assumptions of MLE, which may not strictly hold. In order for the expected gains to materialize, two \"bridges\" must be crossed. First, the expected decrease in model variance must be realized as an actual decrease in variance. Second, the actual decrease in model variance must translate into an actual decrease in model MSE.\n\n4.1.1 Expected decreases in variance \u2192 actual decreases in variance\n\nThe translation from expected to actual changes in variance requires coordination between the exploration strategy and the learning algorithm: to predict how the variance of a weight will change with a new piece of data, the predictor must know how the weight itself (and its neighboring weights) will change. Using a black box routine like backpropagation to update the weights virtually guarantees that there will be some mismatch between expected and actual decreases in variance. Experiments indicate that, in spite of this, the correlation between predicted and actual changes in variance is relatively good (Figure 2a).\n\n[Scatter plots omitted. Panel a) horizontal axis: expected delta var, with a dashed reference line marking actual = expected. Panel b) horizontal axis: actual delta var.]\n\nFigure 2: a) Correlations between expected change in output variance and actual change in output variance. b) Correlations between actual change in output variance and change in mean squared error. Correlations are plotted for a network trained on 50 examples from the arm kinematics task.\n\n4.1.2 Decreases in variance \u2192 decreases in MSE\n\nA more troubling translation is the one from model variance to model correctness. Given the highly nonlinear nature of a neural network, local minima may leave us in situations where the model is very confident but entirely wrong. Due to high confidence, the learner may reject actions that would reduce its mean squared error and instead explore areas where the model is correct, but has low confidence. Evidence of this behavior is seen in the lower right corner of Figure 2b, where some actions which produce a large decrease in variance have little effect on the network's MSE. While this decreases the utility of OED, it is not crippling. We discuss one possible solution to this problem at the end of this paper.\n\n4.2 Learning kinematics\n\nWe have used the stochastic approximation of \u0394var to guide exploration on several simple tasks involving classification and regression. Below, I detail the experiments involving exploration of the kinematics of a simple two-dimensional, two-joint arm. The task was to learn a forward model \u0398_1 \u00d7 \u0398_2 \u2192 X \u00d7 Y through exploration, which could then be used to build a controller following Jordan [1992]. The model was to be learned by a feedforward network with a sigmoid transfer function using a single hidden layer of 8 or 20 hidden units. 
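
As a rough illustration of the kind of mapping the network is asked to learn, the forward kinematics of a planar two-joint arm can be written in a few lines. This is a sketch under stated assumptions: the unit link lengths, the convention that the second angle is measured relative to the first link, and the function name forward_kinematics are all chosen for illustration and are not taken from the experiments.

```python
import math

def forward_kinematics(theta1, theta2, len1=1.0, len2=1.0):
    # Hypothetical planar two-joint arm. theta1 is the first joint angle and
    # theta2 the second joint angle relative to the first link, in radians.
    # The link lengths len1 and len2 are assumed values for this sketch.
    x = len1 * math.cos(theta1) + len2 * math.cos(theta1 + theta2)
    y = len1 * math.sin(theta1) + len2 * math.sin(theta1 + theta2)
    return x, y
```

Under these assumptions, each exploration step supplies a pair of joint angles and receives the corresponding tip position as a new training example.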
\n\nFigure 3: Learning 2D arm kinematics with 8 hidden units. a) Geometry of the 2D, two-joint arm. b) Sample trajectory using OED-based greedy exploration.\n\nOn each time step, the learner was allowed to select inputs \u03b8_1 and \u03b8_2 and was then given tip position x and y to incorporate into its training set. It then hillclimbed to find the next \u03b8_1 and \u03b8_2 within its limits of movement that would maximize the stochastic approximation of \u0394var. On each time step \u03b8_1 and \u03b8_2 were limited to change by no more than \u00b136\u00b0 and \u00b118\u00b0 respectively. Simulations were performed on the Xerion simulator (made available by the University of Toronto), approximating the variance gradient on each step with 100 randomly drawn points. A sample tip trajectory is illustrated in Figure 3b.\n\nWe compared the performance of this one-step optimal (greedy) learner, in terms of mean squared error, with that of an identical learner which explored randomly by \"flailing.\" Not surprisingly, the improvement of greedy exploration over random exploration is significant (Figure 4b). The asymptotic performance of the greedy learner was better than that of the random learner, and it reached its asymptote in far fewer steps.\n\n5 Discussion\n\nThe experiments described in this paper indicate that optimal experiment design is a promising tool for guiding neural network exploration. It requires no arbitrary discretization of state or action spaces, and is amenable to gradient search techniques. It does, however, have high computational costs and, as discussed in Section 4.1.2, may be led astray if the model settles in a local minimum.\n\n5.1 Alternatives to greedy OED\n\nThe greedy approach is prone to \"boxing itself into a corner\" while leaving important parts of the domain unexplored. 
One heuristic for avoiding local minima is to occasionally check the expected gain in other parts of the input space and move towards them if they promise much greater gain than a greedy step.\n\n[Plots omitted. Both panels show MSE on the vertical axis against number of steps on the horizontal axis.]\n\nFigure 4: Learning 2D arm kinematics. a) MSE for a single exploration trajectory (20 hidden units). b) Plot of MSE for random and greedy exploration vs. number of training examples, averaged over 12 runs (8 hidden units).\n\nThe theoretically correct but computationally expensive approach is to optimize over an entire trajectory. Trajectory optimization entails starting with an initial trajectory, computing the expected gain over it, and iteratively perturbing points on the trajectory towards optimal expected gain (subject to other points along the trajectory being explored). Experiments are currently underway to determine how much of an improvement may be realized with trajectory optimization; it is unclear whether the improvement over the greedy approach will be worth the added computational cost.\n\n5.2 Computational Costs\n\nThe computational costs of even greedy OED are great. Selecting a next action requires computation and inversion of the Hessian \u2202^2 E_sse/\u2202w^2. 
Each time an action is selected and taken, the new data must be incorporated into the training set, and the learner retrained. In comparison, when using a flailing strategy or a fixed trajectory, the data may be gathered with little computation, and the learner trained only once on the batch. In this light, the cost of data must be much greater than the cost of computation for optimal experiment design to be a preferable strategy.\n\nThere are many approximations one can make which significantly bring down the cost of OED. By only considering covariances of weights leading to the same neuron, the Hessian may be reduced to a block diagonal form, with each neuron computing its own (simpler) covariances in parallel. As an extreme, one can do away with covariances entirely and rely only on individual weight variances, whose computation is simple. By the same token, one can incorporate the new examples in small batches, only retraining every 5 or so steps. While suboptimal from a data gathering perspective, these approximations appear to still outperform random exploration, and are much cheaper than \"full-blown\" optimization.\n\n5.3 Alternative architectures\n\nWe may be able to bring down computational costs and improve performance by using a different architecture for the learner. With a standard feedforward neural network, not only is the repeated computation of variances expensive, it sometimes fails to yield estimates suitable for use as confidence intervals (as we saw in Section 4.1.2). A solution to both of these problems may lie in selection of a more amenable architecture and learning algorithm. 
One such architecture, in which output variances have a direct role in estimation, is a mixture of Gaussians, which may be efficiently trained using an EM algorithm [Ghahramani and Jordan, 1994]. We expect that it is along these lines that our future research will be most fruitful.\n\nAcknowledgements\n\nI am indebted to Michael I. Jordan and David J.C. MacKay for their help in making this research possible. This work was funded by ATR Human Information Processing Laboratories, Siemens Corporate Research and NSF grant CDA-9309300.\n\nBibliography\n\nB. Armstrong. (1989) On finding exciting trajectories for identification experiments. Int. J. of Robotics Research, 8(6):28-48.\n\nE. Baum and K. Lang. (1991) Constructing hidden units using examples and queries. In R. Lippmann et al., eds., Advances in Neural Information Processing Systems 3, Morgan Kaufmann, San Francisco, CA.\n\nD. Cohn, L. Atlas and R. Ladner. (1990) Training connectionist networks with queries and selective sampling. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, Morgan Kaufmann, San Francisco.\n\nV. Fedorov. (1972) Theory of Optimal Experiments. Academic Press, New York.\n\nZ. Ghahramani and M. Jordan. (1994) Supervised learning from incomplete data via an EM approach. In this volume.\n\nM. Jordan and D. Rumelhart. (1992) Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3):307-354.\n\nD. MacKay. (1992) Information-based objective functions for active data selection. Neural Computation, 4(4):590-604.\n\nA. Moore. (1994) The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In this volume.\n\nM. Plutowski and H. White. (1993) Selecting concise training sets from clean data. 
IEEE Trans. on Neural Networks, 4(2):305-318.\n\nR. Thisted. (1988) Elements of Statistical Computing. Chapman and Hall, NY.\n\nS. Thrun and K. Moller. (1992) Active Exploration in Dynamic Environments. In J. Moody et al., editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann, San Francisco, CA.\n", "award": [], "sourceid": 765, "authors": [{"given_name": "David", "family_name": "Cohn", "institution": null}]}