{"title": "Scheduling Straight-Line Code Using Reinforcement Learning and Rollouts", "book": "Advances in Neural Information Processing Systems", "page_first": 903, "page_last": 909, "abstract": null, "full_text": "Computation of Smooth Optical Flow in a \n\nFeedback Connected Analog Network \n\nAlan Stocker * \n\nInstitute of Neuroinforrnatics \nUniversity and ETH Zi.irich \n\nWinterthurerstrasse 190 \n8057 Zi.irich, Switzerland \n\nRodney Douglas \n\nInstitute of Neuroinforrnatics \nUniversity and ETH Zi.irich \n\nWinterthurerstrasse 190 \n8057 Zi.irich, Switzerland \n\nAbstract \n\nIn  1986, Tanner and Mead [1] implemented an interesting constraint sat(cid:173)\nisfaction  circuit  for  global  motion  sensing  in  a VLSI.  We  report  here  a \nnew  and  improved a VLSI implementation that provides smooth optical \nflow as well as global motion in a two dimensional visual field.  The com(cid:173)\nputation of optical flow  is  an ill-posed problem, which expresses itself as \nthe aperture problem.  However, the optical flow  can be estimated by the \nuse of regularization methods, in  which additional constraints are intro(cid:173)\nduced in  terms of a global energy functional that must be minimized . We \nshow how the algorithmic constraints of Hom and Schunck [2]  on com(cid:173)\nputing smooth optical flow can be mapped onto the physical constraints \nof an equivalent electronic network. \n\n1  Motivation \n\nThe perception of apparent motion is  crucial for navigation.  Knowledge of local motion of \nthe environment relative to the observer simplifies the calculation of important tasks such as \ntime-to-contact or focus-of-expansion. There are several methods to compute optical flow. \nThey  have  the  common  problem  that  their computational  load  is  large.  This  is  a  severe \ndisadvantage for autonomous agents,  whose computational power is  restricted by energy, \nsize and weight.  
Here we show how the global regularization approach, which is necessary to solve the ill-posed problem of computing optical flow, can be formulated as a local feedback constraint and implemented as a physical analog device that is computationally efficient.

* correspondence to: alan@ini.phys.ethz.ch

Computation of Optical Flow in an Analog Network    707

2 Smooth Optical Flow

Horn and Schunck [2] defined optical flow in relation to the spatial and temporal changes in image brightness. Their model assumes that the total image brightness E(x, y, t) does not change over time:

    d/dt E(x, y, t) = 0.                                                (1)

Expanding equation (1) according to the chain rule of differentiation leads to

    F ≡ ∂E/∂x u + ∂E/∂y v + ∂E/∂t = 0,                                  (2)

where u = dx/dt and v = dy/dt represent the two components of the local optical flow vector.
Since there is one equation for two unknowns at each spatial location, the problem is ill-posed, and there are an infinite number of possible solutions lying on the constraint line for every location (x, y). However, by introducing an additional constraint the problem can be regularized and a unique solution can be found.
For example, Horn and Schunck require the optical flow field to be smooth. As a measure of smoothness they choose the squares of the spatial derivatives of the flow vectors,

    S² = (∂u/∂x)² + (∂u/∂y)² + (∂v/∂x)² + (∂v/∂y)².                     (3)

One can also view this constraint as introducing a priori knowledge: the closer two points are in the image space, the more likely they belong to the projection of the same object. Under the assumption of rigid objects undergoing translational motion, this constraint implies that the points have the same, or at least very similar, motion vectors.
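The ill-posedness of equation (2) can be made concrete with a small numerical sketch (the gradient values below are invented for illustration): for fixed brightness gradients, every flow vector on the constraint line satisfies the brightness-constancy equation exactly, so the data alone cannot single out one solution.

```python
# Illustrative sketch of the aperture problem: for fixed brightness
# gradients (Ex, Ey, Et), all flow vectors (u, v) on the constraint
# line Ex*u + Ey*v + Et = 0 are equally valid solutions of eq. (2).

def brightness_constraint(Ex, Ey, Et, u, v):
    """Residual F of the brightness-constancy equation (2)."""
    return Ex * u + Ey * v + Et

# Hypothetical gradient measurements at one pixel.
Ex, Ey, Et = 1.0, 2.0, -3.0

# Parameterize the constraint line: pick any v, solve for u,
# using u = (-Et - Ey*v) / Ex with the values above.
solutions = [((3.0 - 2.0 * v) / 1.0, v) for v in (-1.0, 0.0, 2.5)]

for u, v in solutions:
    # Every point on the line drives the residual to zero.
    assert abs(brightness_constraint(Ex, Ey, Et, u, v)) < 1e-12

print("all", len(solutions), "flow vectors satisfy F = 0")
```

This is exactly why the smoothness term (3) is needed: it selects one flow field among the infinitely many that satisfy (2).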
This assumption is obviously not valid at boundaries of moving objects, and so this algorithm fails to detect motion discontinuities [3].
The computation of smooth optical flow can now be formulated as the minimization problem of a global energy functional,

    ∫∫ (F² + λ S²) dx dy  →  min,                                       (4)

with F and S² as in equations (2) and (3) respectively. Thus, we exactly apply the approach of standard regularization theory [4]:

    A x = y                        y: data
    x = A⁻¹ y                      inverse problem, ill-posed
    ‖A x − y‖ + λ ‖P x‖ = min      regularization

The regularization parameter, λ, controls the degree of smoothing of the solution and its closeness to the data. The norm, ‖·‖, is quadratic. A difference in our case is that A is not constant but depends on the data. However, if we consider motion on a discrete time-axis and look at snapshots rather than continuously changing images, A is quasi-stationary.¹ The energy functional (4) is convex, and so a simple numerical technique like gradient descent would be able to find the global minimum. To compute optical flow while preserving motion discontinuities one can modify the energy functional to include a binary line process that prevents smoothing over discontinuities [4]. However, such a functional will not be convex. Gradient descent methods would probably fail to find the global minimum amongst all the local minima, and other methods have to be applied.

¹ In the aVLSI implementation this requires a settling time constant for the network that is much shorter than the brightness changes in the image.

708    A. Stocker and R. Douglas

3 A Physical Analog Model

3.1 Continuous space

Standard regularization problems can be mapped onto electronic networks consisting of conductances and capacitors [5]. Hutchinson et al.
[6] showed how resistive networks can be used to compute optical flow, and Poggio et al. [7] introduced electronic network solutions for second-order-derivative optic flow computation. However, these proposed network architectures all require complicated and sometimes negative conductances, although Harris et al. [8] independently outlined an approach similar to the one proposed in this paper. Furthermore, such networks were not implemented practically, whereas our implementation with constant nearest-neighbor conductances is intuitive and straightforward.
Consider equation (4):

    L = L(u, v, ∇u, ∇v, x, y).

The Lagrange function L is sufficiently regular (L ∈ C²), and thus it follows from the calculus of variations that the solution of equation (4) also satisfies the linear Euler-Lagrange equations

    λ ∇²u − Ex (Ex u + Ey v + Et) = 0
    λ ∇²v − Ey (Ex u + Ey v + Et) = 0.                                  (5)

The Euler-Lagrange equations are only necessary conditions for equation (4). The sufficient condition for solutions of equations (5) to be a weak minimum is the strong Legendre condition, that is

    L∇u∇u > 0   and   L∇v∇v > 0,

which is easily shown to be true.

3.2 Discrete Space - Mapping to Resistive Network

By using a discrete five-point approximation of the Laplacian ∇² on a regular grid, equations (5) can be rewritten as

    λ(u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1} − 4u_{i,j}) − Ex_{i,j}(Ex_{i,j} u_{i,j} + Ey_{i,j} v_{i,j} + Et_{i,j}) = 0
    λ(v_{i+1,j} + v_{i−1,j} + v_{i,j+1} + v_{i,j−1} − 4v_{i,j}) − Ey_{i,j}(Ex_{i,j} u_{i,j} + Ey_{i,j} v_{i,j} + Et_{i,j}) = 0    (6)

where i and j are the indices of the sampling nodes. Consider a single node of the resistive network shown in Figure 1:

Figure 1: Single node of a resistive network.

From Kirchhoff's law it follows that

    C dV_{i,j}/dt = G(V_{i+1,j} + V_{i−1,j} + V_{i,j+1} + V_{i,j−1} − 4V_{i,j}) + Iin_{i,j}    (7)

where V_{i,j} represents the voltage and Iin_{i,j} the input current. G is the conductance between two neighboring nodes and C the node capacitance.
In steady state, equation (7) becomes

    G(V_{i+1,j} + V_{i−1,j} + V_{i,j+1} + V_{i,j−1} − 4V_{i,j}) + Iin_{i,j} = 0.               (8)

The analogy with equations (6) is obvious:

    G          ↔  λ
    Iu_in_{i,j} ↔ −Ex_{i,j}(Ex_{i,j} u_{i,j} + Ey_{i,j} v_{i,j} + Et_{i,j})
    Iv_in_{i,j} ↔ −Ey_{i,j}(Ex_{i,j} u_{i,j} + Ey_{i,j} v_{i,j} + Et_{i,j})                    (9)

To create the full system we use two parallel resistive networks in which the node voltages u_{i,j} and v_{i,j} represent the two components u and v of the optical flow vector. The input currents Iu_in_{i,j} and Iv_in_{i,j} are computed by a negative recurrent feedback loop modulated by the input data, which are the spatial and temporal intensity gradients.
Notice that the input currents are proportional to the deviation from the local brightness constraint: the less the local optical flow solution fits the data, the higher the current Iin_{i,j} will be to correct the solution, and vice versa.
Stability and convergence of the network are guaranteed by Maxwell's minimum power principle [4, 9].

4 The Smooth Optical Flow Chip

4.1 Implementation

Figure 2: A single motion cell within the three layer network. For simplicity only one resistive network is shown.
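The dynamics of the two coupled networks can be checked numerically. The sketch below (a simulation of the mathematics, not of the chip; grid size, gradient values, and step size are arbitrary illustrative choices) Euler-integrates equation (7) with the feedback currents of equations (9), which amounts to gradient descent on functional (4).

```python
import numpy as np

# Numerical sketch of the feedback network: Euler integration of
# eq. (7) with input currents from eq. (9) on a small grid.
# All values (grid size, gradients, step size) are illustrative.

rng = np.random.default_rng(0)
n, G, dt, steps = 8, 1.0, 0.05, 20000

# Synthetic brightness gradients for a pattern translating with
# true flow (u, v) = (1.0, 0.5), so Et = -(Ex*u_true + Ey*v_true).
Ex = rng.uniform(-1.0, 1.0, (n, n))
Ey = rng.uniform(-1.0, 1.0, (n, n))
Et = -(Ex * 1.0 + Ey * 0.5)

def lap(V):
    """Five-point Laplacian with reflecting (zero-flux) borders."""
    P = np.pad(V, 1, mode="edge")
    return P[:-2, 1:-1] + P[2:, 1:-1] + P[1:-1, :-2] + P[1:-1, 2:] - 4 * V

u = np.zeros((n, n))
v = np.zeros((n, n))
for _ in range(steps):
    F = Ex * u + Ey * v + Et          # brightness-constraint residual
    u += dt * (G * lap(u) - Ex * F)   # eq. (7) with current from eq. (9)
    v += dt * (G * lap(v) - Ey * F)

# The node voltages settle near the true flow everywhere.
print(float(abs(u - 1.0).max()), float(abs(v - 0.5).max()))
```

Because the synthetic gradients vary across nodes, the energy (4) has a unique minimum at the true flow, and the relaxation recovers it; with spatially uniform gradients the same code would exhibit the aperture problem.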
\n\nThe circuitry consists  of three  functional  layers  (Figure  2).  The  input  layer  includes  an \narray of adaptive photoreceptors [10]  and  provides the derivatives of the image brightness \nto the second layer,  The spatial gradients are  the first-order linear approximation obtained \nby  subtracting the two neighboring photoreceptor outputs.  The second layer computes the \ninput current to  the  third  layer according to  equations (9).  Finally  these currents are  fed \ninto the two resistive networks that report the optical flow  components. \nThe schematics of the core of a single motion cell are drawn in Figure 3.  The photoreceptor \nand  the temporal differentiator are not shown as well  as  the other half of the circuitry that \ncomputes the y-component of the flow  vector. \n\n\f710 \n\nA.  Stocker and R. Doug/as \n\nA few  remarks are  appropriate here:  First,  the two components of the optical flow  vector \nhave  to  be able to  take on  positive and negative values  with respect to some reference po(cid:173)\ntential.  Therefore, a symmetrical circuit scheme is applied where the positive and negative \n(reference  voltage)  values  are carried  on  separate signal  lines.  Thus,  the  actual  value  is \nencoded as the difference of the two potentials. \n\nE (E  V +  E) \nx \n\nx  x \n\nt \n\n~.\" .... \" ....... \" ....... \" ......... : \n\ntemporal \ndifferentiator \n\nExl \n_ f-VViBias ! \nl \nv+ \nI:\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7 .. \u00b7\u00b7 .. \u00b7 .. \u00b7\u00b7\u00b7\u00b7\u00b7 .. \u00b7\u00b7\u00b7 .. \u00b7: \nX  DiffBias \n\n1 \n\nOpBias \n\nFigure  3:  Cell  core  schematics;  only  the  circuitry  related  to  the  computation  of  the \nx-component of the flow  vector is shown. \n\nSecond, the limited linear range of the Gilbert multipliers leads to a narrow span of flow ve(cid:173)\nlocities that can be computed reliably.  
However, the tuning can be such that the operational range is either at high or at very low velocities. Newer implementations use modified multipliers with a larger linear range.
Third, consider a single motion cell (Figure 2). In principle, this cell would be able to satisfy the local constraint perfectly. In practice (see Figure 3), the finite output impedance of the p-type Gilbert multiplier slightly degrades this ideal solution by imposing an effective conductance G_load. Thus, a constant voltage on the capacitor representing a non-zero motion signal requires a net output current of the multiplier to maintain it. This requirement has two interesting consequences:
i) The reported optical flow is dependent on the spatial gradients (contrast). A single uncoupled cell according to Figure 2 has a steady state solution with

    u_{i,j} ∼ −Et_{i,j} Ex_{i,j} / (G_load + Ex²_{i,j} + Ey²_{i,j})   and   v_{i,j} ∼ −Et_{i,j} Ey_{i,j} / (G_load + Ex²_{i,j} + Ey²_{i,j})

respectively. For the same object speed, the chip reports higher velocity signals for higher spatial gradients. Preferably, G_load should be as low as possible to minimize its influence on the solution.
ii) On the other hand, the locally ill-posed problem is now well-posed because G_load imposes a second constraint. Thus, the chip behaves sensibly in the case of low contrast input (small gradients), reporting zero motion where otherwise unreliable high values would occur. This is convenient because the signal-to-noise ratio at low contrast is very poor. Furthermore, a single cell is forced to report the velocity on the constraint line with the smallest absolute value, which is normal to the spatial gradient.
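The steady-state expressions in point i) can be evaluated directly; the sketch below (with invented numbers, not measured chip parameters) shows how the same motion produces a near-normal-flow report at high contrast and a near-zero report at low contrast.

```python
# Sketch of the uncoupled-cell steady state from point i): the
# reported flow scales down gracefully as contrast (|Ex|, |Ey|)
# shrinks relative to the load conductance G_load. All values
# are illustrative, not measured chip parameters.

def single_cell_flow(Ex, Ey, Et, G_load):
    denom = G_load + Ex**2 + Ey**2
    return (-Et * Ex / denom, -Et * Ey / denom)

G_load = 0.1

# High-contrast vertical edge moving right: report close to normal flow.
u_hi, v_hi = single_cell_flow(2.0, 0.0, -2.0, G_load)

# Same motion at very low contrast: report collapses toward zero,
# which is the sensible answer when signal-to-noise is poor.
u_lo, v_lo = single_cell_flow(0.02, 0.0, -0.02, G_load)

print(u_hi, u_lo)
```

With G_load → 0 the high-contrast case approaches the exact normal flow, illustrating why a small but controllable G_load is the desired trade-off.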
That means that the chip reports normal flow when there is no neighbor connection. Since there is a trade-off between the robustness of the optical flow computation and a low conductance G_load, the follower-connected transconductance amplifier in our implementation allows us to control G_load above its small intrinsic value.

4.2 Results

The results reported below were obtained from a MOSIS tinychip containing a 7x7 array of motion cells, each 325x325 λ² in size. The chip was fabricated in 1.2 µm technology at AMI.

Figure 4: Smooth optical flow response of the chip to a left-upwards moving edge. a: photoreceptor output, the arrow indicates the actual motion direction. b: weak coupling (small conductance G). c: strong coupling.

Figure 5: Response of the optical flow chip to a plaid stimulus moving towards the left: a: photoreceptor output; b shows the normal flow computation with disabled coupling between the motion cells in the network while in c the coupling strength is at maximum.

The chip is able to compute smooth optical flow in a qualitative manner. The smoothness can be set by adjusting the coupling conductances (Figure 4).
Figure 5b presents the normal flow computation that occurs when the coupling between the motion cells is disabled. The limited resolution of this prototype chip together with the small size of the stimulus leads to a noisy response. However, it is clear that the chip perceives the two gratings as separate moving objects with motion normal to their edge orientation. When the network conductance is set very high, the chip performs a collective computation solving the aperture problem under the assumption of single object motion. Figure 5c shows how the chip can compute the correct motion of a plaid pattern.

5 Conclusion

We have presented here an aVLSI implementation of a network that computes 2D smooth optical flow. The strength of the resistive coupling can be varied continuously to obtain different degrees of smoothing, from a purely local up to a single global motion signal. The chip ideally computes smooth optical flow in the classical definition of Horn and Schunck. Instead of using negative and complex conductances, we implemented a network solution where each motion cell performs a local constraint satisfaction task in a recurrent negative feedback loop.
It is significant that the solution of a global energy minimization task can be achieved within a network of local constraint solving cells that do not have explicit access to the global computational goal.

Acknowledgments

This article is dedicated to Misha Mahowald. We would like to thank Eric Vittoz, Jörg Kramer, Giacomo Indiveri and Tobi Delbrück for fruitful discussions. We thank the Swiss National Foundation for supporting this work and MOSIS for chip fabrication.

References

[1] J. Tanner and C.A. Mead. An integrated analog optical motion sensor. In S.-Y. Kung, R. Owen, and G.
Nash, editors, VLSI Signal Processing, 2, page 59 ff. IEEE Press, 1986.

[2] B.K. Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-203, 1981.

[3] A. Yuille. Energy functions for early vision and analog networks. Biological Cybernetics, 61:115-123, 1989.

[4] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317(26):314-319, September 1985.

[5] B.K. Horn. Parallel networks for machine vision. Technical Report 1071, MIT AI Lab, December 1988.

[6] J. Hutchinson, C. Koch, J. Luo, and C. Mead. Computing motion using analog and binary resistive networks. Computer, 21:52-64, March 1988.

[7] T. Poggio, W. Yang, and V. Torre. Optical flow: Computational properties and networks, biological and analog. The Computing Neuron, pages 355-370, 1989.

[8] J.G. Harris, C. Koch, E. Staats, and J. Luo. Analog hardware for detecting discontinuities in early vision. Int. Journal of Computer Vision, 4:211-223, 1990.

[9] J. Wyatt. Little-known properties of resistive grids that are useful in analog vision chip designs. In C. Koch and H. Li, editors, Vision Chips: Implementing Vision Algorithms with Analog VLSI Circuits, pages 72-89. IEEE Computer Society Press, 1995.

[10] S.C. Liu. Silicon retina with adaptive filtering properties. In Advances in Neural Information Processing Systems 10, November 1997.

Scheduling Straight-Line Code Using Reinforcement Learning and Rollouts

Amy McGovern and Eliot Moss
{amy|moss}@cs.umass.edu
Department of Computer Science
University of Massachusetts, Amherst
Amherst, MA 01003

Abstract

The execution order of a block of computer instructions can make a difference in its running time by a factor of two or more.
In order to achieve the best possible speed, compilers use heuristic schedulers appropriate to each specific architecture implementation. However, these heuristic schedulers are time-consuming and expensive to build. In this paper, we present results using both rollouts and reinforcement learning to construct heuristics for scheduling basic blocks. The rollout scheduler outperformed a commercial scheduler, and the reinforcement learning scheduler performed almost as well as the commercial scheduler.

1 Introduction

Although high-level code is generally written as if it were going to be executed sequentially, many modern computers are pipelined and allow for the simultaneous issue of multiple instructions. In order to take advantage of this feature, a scheduler needs to reorder the instructions in a way that preserves the semantics of the original high-level code while executing it as quickly as possible. An efficient schedule can produce a speedup in execution of a factor of two or more. However, building a scheduler can be an arduous process. Architects developing a new computer must manually develop a specialized instruction scheduler each time a change is made in the proposed system. Building a scheduler automatically can save time and money. It can allow the architects to explore the design space more thoroughly and to use more accurate metrics in evaluating designs.

Moss et al. (1997) showed that supervised learning techniques can induce excellent basic block instruction schedulers for the Digital Alpha 21064 processor. Although all of the supervised learning methods performed quite well, they shared several limitations. Supervised learning requires exact input/output pairs.
Generating these training pairs requires an optimal scheduler that searches every valid permutation of the instructions within a basic block and saves the optimal permutation (the schedule with the smallest running time). However, this search was too time-consuming to perform on blocks with more than 10 instructions, because optimal instruction scheduling is NP-hard. Using a semi-supervised method such as reinforcement learning or rollouts does not require generating training pairs, so the method can be applied to larger basic blocks and can be trained without knowing optimal schedules.

904    A. McGovern and E. Moss

2 Domain Overview

Moss et al. (1997) gave a full description of the domain. This study presents an overview, the necessary details, our experimental method, and detailed results for both rollouts and reinforcement learning.

We focused on scheduling basic blocks of instructions on the 21064 version (DEC, 1992) of the Digital Alpha processor (Sites, 1992). A basic block is a set of instructions with a single entry point and a single exit point. Our schedulers could reorder instructions within a basic block but could not rewrite, add, or remove any instructions. The goal of each scheduler is to find a least-cost valid ordering of the instructions. The cost is defined as the simulated execution time of the block. A valid ordering is one that preserves the semantically necessary ordering constraints of the original code. We ensure validity by creating a dependency graph that directly represents those necessary ordering relationships. This graph is a directed acyclic graph (DAG).

The Alpha 21064 is a dual-issue machine with two different execution pipelines. Dual issue occurs only if a number of detailed conditions hold, e.g., the two instructions match the two pipelines.
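The DAG-constrained, greedy, front-to-back scheduling described above can be sketched as follows. The instruction names, edges, and the tie-breaking rule here are invented for illustration; the real heuristics model the detailed 21064 pipeline conditions.

```python
# Toy sketch of greedy list scheduling over a dependency DAG.
# Instruction names, edges, and the tie-breaking rule are invented
# for illustration; the real heuristic models the 21064 pipelines.

deps = {                 # instruction -> instructions it must follow
    "load r1":  [],
    "load r2":  [],
    "add r3":   ["load r1", "load r2"],
    "store r3": ["add r3"],
}

def greedy_schedule(deps, choose=min):
    """Build a schedule front-to-back, always picking from the ready set."""
    remaining = {i: set(d) for i, d in deps.items()}
    schedule = []
    while remaining:
        ready = [i for i, d in remaining.items() if not d]
        pick = choose(ready)          # stand-in for a real heuristic
        schedule.append(pick)
        del remaining[pick]
        for d in remaining.values():
            d.discard(pick)
    return schedule

sched = greedy_schedule(deps)
pos = {ins: k for k, ins in enumerate(sched)}
# Validity: every instruction follows all of its predecessors.
assert all(pos[p] < pos[i] for i, ps in deps.items() for p in ps)
print(sched)
```

Swapping `choose` for a learned or rollout-based selection function turns this skeleton into the schedulers compared in this paper.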
An instruction can take anywhere from one to many tens of cycles to execute. Researchers at Digital have a publicly available 21064 simulator that also includes a heuristic scheduler for basic blocks. We call that scheduler DEC. The simulator gives the running time for a given scheduled block assuming all memory references hit the cache and all resources are available at the beginning of the block. All of our schedulers used a greedy algorithm to schedule the instructions, i.e., they built schedules sequentially from beginning to end with no backtracking.

In order to test each scheduling algorithm, we used the 18 SPEC95 benchmark programs. Ten of these programs are written in FORTRAN and contain mostly floating point calculations. Eight of the programs are written in C and focus more on integer, string, and pointer calculations. Each program was compiled using the commercial Digital compiler at the highest level of optimization. We call the schedules output by the compiler ORIG. This collection has 447,127 basic blocks, containing 2,205,466 instructions.

3 Rollouts

Rollouts are a form of Monte Carlo search, first introduced by Tesauro and Galperin (1996) for use in backgammon. Bertsekas et al. (1997a,b) have explored rollouts in other domains and proven important theoretical results. In the instruction scheduling domain, rollouts work as follows: suppose the scheduler comes to a point where it has a partial schedule and a set of (more than one) candidate instructions to add to the schedule. For each candidate, the scheduler appends it to the partial schedule and then follows a fixed policy π to schedule the remaining instructions. When the schedule is complete, the scheduler evaluates the running time and returns.
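The rollout loop just described can be sketched generically. The tiny DAG and the deterministic toy cost function below are stand-ins for the real basic blocks and the 21064 simulator, and the averaging over repeated random completions anticipates the stochastic-policy case discussed next.

```python
import random

# Sketch of rollout-based instruction selection: for each candidate,
# complete the schedule with a random policy several times and keep
# the candidate with the best average cost. The DAG and the cost
# function are toy stand-ins for real blocks and the 21064 simulator.

deps = {"a": [], "b": [], "c": ["a"], "d": ["b", "c"]}

def ready(partial):
    done = set(partial)
    return [i for i in deps
            if i not in done and all(p in done for p in deps[i])]

def cost(schedule):
    """Toy stand-in for the simulated execution time of a full schedule."""
    return random.Random("|".join(schedule)).uniform(1.0, 2.0)

def random_rollout(partial, rng):
    """Complete the partial schedule with the random policy."""
    partial = list(partial)
    while len(partial) < len(deps):
        partial.append(rng.choice(ready(partial)))
    return cost(partial)

def rollout_pick(partial, n_rollouts=25, seed=0):
    rng = random.Random(seed)
    best, best_avg = None, float("inf")
    for cand in ready(partial):
        avg = sum(random_rollout(partial + [cand], rng)
                  for _ in range(n_rollouts)) / n_rollouts
        if avg < best_avg:
            best, best_avg = cand, avg
    return best

schedule = []
while len(schedule) < len(deps):
    schedule.append(rollout_pick(schedule))
print(schedule)
```

Replacing `random_rollout` with a single completion under a deterministic policy gives the ORIG-π and DEC-π variants evaluated below.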
When π is stochastic, this rollout can be repeated many times for each instruction to achieve a measure of the average expected outcome. After rolling out each candidate, the scheduler picks the one with the best average running time.

Our first set of rollout experiments compared three different rollout policies π. The theory developed by Bertsekas et al. (1997a,b) proved that if we used the DEC scheduler as π, we would perform no worse than DEC. An architect proposing a new machine might not have a good heuristic available to use as π, so we also considered policies more likely to be available. The first was the random policy, RANDOM-π, which is a choice that is clearly always available. Under this policy, the rollout makes all choices randomly. We also used the ordering produced by the optimizing compiler ORIG, denoted ORIG-π. The last rollout policy tested was the DEC scheduler itself, denoted DEC-π.

Scheduling Straight-Line Code Using RL and Rollouts    905

The scheduler performed only one rollout per candidate instruction when using ORIG-π and DEC-π because they are deterministic. We used 25 rollouts for RANDOM-π. After performing a number of rollouts for each candidate instruction, we chose the instruction with the best average running time. As a baseline scheduler, we also scheduled each block randomly. Because the running time increases quadratically with the number of rollouts, we focused our rollout experiments on one program in the SPEC95 suite: applu.

Table 1 gives the performance of each rollout scheduler as compared to the DEC scheduler on all 33,007 basic blocks of size 200 or less from applu. To assess the performance of each rollout policy π, we used the ratio of the weighted execution time of the rollout scheduler to the weighted execution time of the DEC scheduler.
More concisely, the performance measure was:

    ratio = Σ_{all blocks} (rollout scheduler execution time × number of times block is executed)
            ─────────────────────────────────────────────────────────────────────────────────────
            Σ_{all blocks} (DEC scheduler execution time × number of times block is executed)

This means that a faster running time on the part of our scheduler would give a smaller ratio.

    Scheduler    Ratio      Scheduler    Ratio
    Random       1.3150     RANDOM-π     1.0560
    ORIG-π       0.9895     DEC-π        0.9875

Table 1: Ratios of the weighted execution time of the rollout scheduler to the DEC scheduler. A ratio of less than one means that the rollouts outperformed the DEC scheduler.

All of the rollout schedulers far outperformed the random scheduler, which was 31% slower than DEC. By only adding rollouts, RANDOM-π was able to achieve a running time only 5% slower than DEC. Only the schedulers using ORIG-π and DEC-π as a model outperformed the DEC scheduler. Using ORIG-π and DEC-π for rollouts produced a schedule that was 1.1% faster than the DEC scheduler on average. Although this improvement may seem small, the DEC scheduler is known to make optimal choices 99.13% of the time for blocks of size 10 or less (Stefanovic, 1997).

Rollouts were tested only on applu rather than on the entire SPEC95 benchmark suite due to the lengthy computation time. Rollouts are costly because performing m rollouts on n instructions is O(n²m), whereas a greedy scheduling algorithm is O(n). Again, because of the time required, we performed only five runs of RANDOM-π. Since DEC-π and ORIG-π are deterministic, only one run was necessary. We also ran the random scheduler 5 times. Each number reported above is the geometric mean of the ratios across the five runs.

Part of the motivation behind using rollouts in a scheduler is to obtain fast schedules without spending the time to build a precise heuristic.
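The weighted ratio defined earlier in this section is straightforward to compute from per-block times and execution counts; a minimal sketch with made-up numbers:

```python
# Sketch of the weighted-execution-time ratio used in this section.
# Block times and execution counts are made-up example data.

def weighted_ratio(blocks):
    """blocks: iterable of (rollout_time, dec_time, times_executed)."""
    num = sum(r * n for r, _, n in blocks)
    den = sum(d * n for _, d, n in blocks)
    return num / den

blocks = [
    (10.0, 11.0, 1000),   # rollout faster on a hot block
    (7.0,  6.5,  10),     # slightly slower on a cold block
]
r = weighted_ratio(blocks)
print(round(r, 4))  # < 1.0 means the rollout scheduler wins overall
```

Weighting by execution count is what makes wins on hot blocks dominate the metric, as in the example above.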
With this in mind, we explored RANDOM-π more closely in a follow-up experiment.

Evaluation of the number of rollouts

This experiment considered how performance varies with the number of rollouts. We tested 1, 5, 10, 25, and 50 rollouts per candidate instruction. We also varied the metric for choosing among candidates. Instead of always choosing the instruction with the best average performance, we also experimented with selecting the instruction with the absolute best running time among its rollouts. We hypothesized that selection of the absolute best path might lead to better performance overall. These experiments were performed on all 33,007 basic blocks of size 200 or less from applu.

Figure 1 shows the performance of the rollout scheduler as a function of the number of rollouts. Performance is assessed in the same way as before: the ratio of weighted execution times. Thus, a lower number is better. Each data point represents the geometric mean over five different runs.

Figure 1: Performance of rollout scheduler with the random model as a function of the number of rollouts and the choice of evaluation function.

The difference in performance between one rollout and five rollouts using the average choice for each rollout is 1.16 versus 1.10. However, the difference between 25 rollouts and 50 rollouts is only 1.06 versus 1.05. This indicates the tradeoff between schedule quality and the number of rollouts. Also, choosing the instruction with the best rollout schedule did not yield better performance for any number of rollouts. We hypothesize that this is due to the stochastic nature of the rollouts.
Once the scheduler chooses an instruction, it repeats the rollout process again. By choosing the instruction with the absolute best rollout, there is no guarantee that the scheduler will find that permutation of instructions again on the next rollout. When it chooses the instruction with the best average rollout, the scheduler has a better chance of finding a good schedule on the next rollout.

Although the rollout schedulers performed quite well, the extremely long scheduling time is a major drawback. Using 25 rollouts per block took over 6 hours to schedule one program. Unless this aspect can be improved, rollouts cannot be used for all blocks in a commercial scheduler or in evaluating more than a few proposed machine architectures. However, because rollout scheduling performance is high, rollouts could be used to optimize the schedules on important (long running times or frequently executed) blocks within a program.

4 Reinforcement Learning Results

4.1 Overview

Reinforcement learning (RL) is a collection of methods for discovering near-optimal solutions to stochastic sequential decision problems (Sutton & Barto, 1998). A reinforcement learning system does not require a teacher to specify correct actions. Instead, the learning agent tries different actions and observes their consequences to determine which actions are best. More specifically, in the reinforcement learning framework, a learning agent interacts with an environment over a series of discrete time steps t = 0, 1, 2, 3, .... At each time t, the agent is in some state, denoted s_t, and chooses an action, denoted a_t, which causes the environment to transition to state s_{t+1} and to emit a reward, denoted r_{t+1}. The next state and reward depend only on the preceding state and action, but they may depend on it in a stochastic fashion.
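A concrete instance of this framework, using the tabular value update employed in this section, can be sketched on an invented three-state deterministic chain (the states, rewards, and parameters are illustrative only):

```python
# Tabular TD(0) sketch on a toy 3-state deterministic chain:
# s0 -> s1 -> s2 (terminal), rewards 0 then 1, discount gamma.
# The chain and parameter values are invented for illustration.

gamma, alpha = 0.9, 0.1
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}          # s2 is terminal, V stays 0
transitions = [("s0", 0.0, "s1"), ("s1", 1.0, "s2")]

for _ in range(2000):                          # repeated episodes
    for s, r, s_next in transitions:
        # V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

print(round(V["s0"], 3), round(V["s1"], 3))
```

On this chain the estimates converge to V(s1) = 1 and V(s0) = gamma * 1 = 0.9, the discounted returns from each state.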
The objective is to learn a (possibly stochastic) mapping from states to actions called a policy, which maximizes the cumulative discounted reward received by the agent. More precisely, the objective is to choose each action a_t so as to maximize the expected return, E{ Σ_{i=0}^∞ γ^i r_{t+i+1} }, where γ ∈ [0, 1) is a discount-rate parameter.

A common solution strategy is to approximate the optimal value function V*, which maps states to the maximal expected return that can be obtained starting in each state and taking the best action. In this paper we use temporal difference (TD) learning (Sutton, 1988). In this method, the approximation to V* is represented by a table with an entry V(s) for every state. After each transition from state s_t to state s_{t+1}, under an action with reward r_{t+1}, the estimated value function V(s_t) is updated by:

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]

where α is a positive step-size parameter.

4.2 Experimental Results

Scheeff et al. (1997) have previously experimented with reinforcement learning in this domain. However, the results were not as good as hoped. Finding the right reward structure was the difficult part of using RL in this domain. Rewarding based on the number of cycles to execute the block does not work well, as it punishes the learner on long blocks. To normalize for this effect, Scheeff et al. (1997) rewarded based on the cycles per instruction (CPI). However, learning with this reward also did not work well, as some blocks have more unavoidable idle time than others. A reward based solely on CPI does not account for this aspect. To account for this variation across blocks, we gave the RL scheduler a final reward of:

r = time to execute block − max( weighted critical path, (# of instructions) / 2 )

The scheduler received a reward of zero unless the schedule was complete. As the 21064 processor can only issue two instructions at a time, the number of instructions divided by 2 gives an absolute lower bound on the running time. The weighted critical path (wcp) helps to solve the problem of blocks of the same size being easier or harder to schedule than one another. When a block is harder to execute than another block of the same size, the wcp tends to be higher, thus causing the learner to get a different reward. The wcp is correlated with the predicted number of execution cycles for the DEC scheduler (r = 0.9), and the number of instructions divided by 2 is also correlated (r = 0.78) with the DEC scheduler. Future experiments will use a weighted combination of these two features to compute the reward.

As with the supervised learning results presented in Moss et al. (1997), the RL system learned a preferential value function between candidate instructions. That is, instead of learning the value of instruction A or instruction B, RL learned the value of choosing instruction A over instruction B. The state space consisted of a tuple of features from a current partial schedule and the two candidate instructions. These features were derived from knowledge of the DEC simulator. The features and our intuition for their importance are summarized in Table 2.

Previous experiments (Moss et al., 1997) showed that the actual values of wcp and e did not matter as much as their relative values. Thus, for those features we used the signum (σ) of the difference of their values for the two candidate instructions. Signum returns −1, 0, or 1 depending on whether the value is less than, equal to, or greater than zero.
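As a sketch of the two pieces just described, the block-normalized final reward and the signum feature, consider the following. The sign convention for the reward (negating the excess over the lower bound so that larger rewards mean faster schedules) is our assumption, and the input values are hypothetical.

```python
import math

def signum(x):
    """Return -1, 0, or 1 as x is negative, zero, or positive."""
    return (x > 0) - (x < 0)

def final_reward(execution_cycles, wcp, num_instructions):
    """Block-normalized final reward: execution time relative to a per-block
    lower bound, max(wcp, ceil(n / 2)) cycles on the dual-issue 21064.
    Negated here (an assumption about the sign convention) so that maximizing
    the reward favors schedules close to the lower bound."""
    lower_bound = max(wcp, math.ceil(num_instructions / 2))
    return -(execution_cycles - lower_bound)
```

For instance, a hypothetical 13-instruction block with wcp = 7 has lower bound max(7, 7) = 7 cycles; a schedule taking 10 cycles earns reward −3, while one achieving the bound earns 0.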
Using this representation, the RL state space consisted of the following tuple, given candidate instructions x and y and partial schedule p:

state_vec(p, x, y) = ( odd(p), ic(x), ic(y), d(x), d(y), σ(wcp(x) − wcp(y)), σ(e(x) − e(y)) )

This yields 28,800 unique states. Figure 2 shows an example partial schedule, a set of candidate instructions, and the resulting states for the RL system.

The RL scheduler does not learn over states where there are no choices to be made. The last choice point in a trajectory is given the final reward even if further instructions are scheduled from that point. The values of multiple states are updated at each time step because the instruction that is chosen affects the preference function of multiple states. For example, using the partial schedule and candidate instructions shown in Figure 2, when scheduling instruction A, the RL system would back up values for AB and AC, and the opposite values for BA and CA.

Odd Partial (odd)
  Description: Is the current number of instructions scheduled odd or even?
  Intuition for use: If TRUE, we're interested in scheduling instructions that can dual-issue with the previous instruction.

Instruction Class (ic)
  Description: The Alpha's instructions can be divided into equivalence classes with respect to timing properties.
  Intuition for use: The instructions in each class can be executed only in certain execution pipelines, etc.

Weighted Critical Path (wcp)
  Description: The height of the instruction in the DAG (the length of the longest chain of instructions dependent on this one), with edges weighted by the expected latency of the result produced by the instruction.
  Intuition for use: Instructions on longer critical paths should be scheduled first, since they affect the lower bound of the schedule cost.

Actual Dual (d)
  Description: Can the instruction dual-issue with the previously scheduled instruction?
  Intuition for use: If Odd Partial is TRUE, it is important that we find an instruction, if there is one, that can issue in the same cycle with the previously scheduled instruction.

Max Delay (e)
  Description: The earliest cycle when the instruction can begin to execute, relative to the current cycle; this takes into account any wait for inputs or for functional units to become available.
  Intuition for use: We want to schedule instructions that will have their data and functional unit available earliest.

Table 2: Features for Instructions and Partial Schedule

State label   State
AB            state_vec(p, A, B)
AC            state_vec(p, A, C)
BC            state_vec(p, B, C)
BA            state_vec(p, B, A)
CA            state_vec(p, C, A)
CB            state_vec(p, C, B)

Figure 2: On the left is a graphical depiction of a partial schedule and three candidate instructions. The table on the right shows how the RL system makes its states from this.

Using this system, we performed leave-one-out cross-validation across all blocks of the SPEC95 benchmark suite. Blocks with more than 800 instructions were broken into blocks of 800 or fewer because of memory limitations on the DEC simulator. This was true for only two applications: applu and fpppp. The RL system was trained online on 19 of the 20 applications using α = 0.05 and an ε-greedy exploration method with ε = 0.05. This was repeated 20 different times, holding one program from SPEC95 out of the training each time. We then evaluated the greedy policy (ε = 0) learned by the RL system on each program that had been held out. All ties were broken randomly.
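The state construction above and the TD(0) backup of Section 4.1 can be sketched together as follows. The dictionary-based feature lookups and the feature values are illustrative stand-ins, not the DEC simulator's actual interface.

```python
def signum(x):
    """Return -1, 0, or 1 as x is negative, zero, or positive."""
    return (x > 0) - (x < 0)

def state_vec(p, x, y):
    """Preference state for choosing candidate x over candidate y, given
    partial schedule p, mirroring the tuple in the text: odd(p), the class,
    dual-issue, and delay features of each candidate, and the signum of the
    wcp and e differences."""
    return (p["odd"], x["ic"], y["ic"], x["d"], y["d"],
            signum(x["wcp"] - y["wcp"]), signum(x["e"] - y["e"]))

def td_update(V, s, s_next, r, alpha=0.05, gamma=1.0):
    """Tabular TD(0) backup: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)].
    Unvisited states default to 0; s_next is None at the end of a trajectory."""
    v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))
```

After choosing A from candidates {A, B, C}, one would apply td_update to state_vec(p, A, B) and state_vec(p, A, C), and back up the opposite values for state_vec(p, B, A) and state_vec(p, C, A), as in the Figure 2 example.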
Performance was assessed the same way as before. The results for each benchmark are shown in Table 3. Overall, the RL scheduler performed only 2% slower than DEC. This is a geometric mean over all applications in the suite and on all blocks. Although the RL system did not outperform the DEC scheduler overall, it significantly outperformed DEC on the large blocks (applu-big and fpppp-big).

App          Ratio    App         Ratio    App         Ratio    App        Ratio
applu        1.001    applu-big   0.959    apsi        1.018    cc1        1.022
compress95   0.977    fpppp       1.055    fpppp-big   0.977    go         1.028
hydro2d      1.022    ijpeg       0.975    li          1.012    m88ksim    1.042
mgrid        1.009    perl        1.014    su2cor      1.018    swim       1.040
tomcatv      1.019    turb3d      1.218    vortex      1.032    wave5      1.032

Table 3: Performance of the greedy RL scheduler on each application in SPEC95 over all leave-one-out cross-validation runs, as compared to DEC. Applications whose running time was better than DEC have ratios below 1.0.

5 Conclusions

The advantages of the RL scheduler are its performance on the task, its speed, and the fact that it does not rely on any heuristics for training. Each run was much faster than with rollouts, and the performance came close to that of the DEC scheduler. In a system where multiple architectures are being tested, RL could provide a good scheduler with minimal setup and training.

We have demonstrated two methods of instruction scheduling that do not rely on having heuristics and that perform quite well. Future work could address tying the two methods together while retaining the speed of the RL learner, issues of global instruction scheduling, scheduling loops, and validating the techniques on other architectures.
Acknowledgments

We thank John Cavazos and Darko Stefanovic for setting up the simulator and for prior work in this domain, along with Paul Utgoff, Doina Precup, Carla Brodley, and David Scheeff. We also wish to thank Andrew Barto, Andrew Fagg, and Doina Precup for comments on earlier versions of the paper. This work is supported in part by the National Physical Science Consortium, Lockheed Martin, Advanced Technology Labs, and NSF grant IRI-9503687 to Roderic A. Grupen and Andrew G. Barto. We thank various people of Digital Equipment Corporation for the DEC scheduler and the ATOM program instrumentation tool (Srivastava & Eustace, 1994), essential to this work. We also thank Sun Microsystems and Hewlett-Packard for their support.

References

Bertsekas, D. P. (1997). Differential training of rollout policies. In Proc. of the 35th Allerton Conference on Communication, Control, and Computing. Allerton Park, Ill.

Bertsekas, D. P., Tsitsiklis, J. N. & Wu, C. (1997). Rollout algorithms for combinatorial optimization. Journal of Heuristics.

DEC (1992). DEC chip 21064-AA Microprocessor Hardware Reference Manual (first edition). Maynard, MA: Digital Equipment Corporation.

Moss, J. E. B., Utgoff, P. E., Cavazos, J., Precup, D., Stefanovic, D., Brodley, C. E. & Scheeff, D. T. (1997). Learning to schedule straight-line code. In Proceedings of Advances in Neural Information Processing Systems 10 (Proceedings of NIPS'97). MIT Press.

Scheeff, D., Brodley, C., Moss, E., Cavazos, J. & Stefanovic, D. (1997). Applying reinforcement learning to instruction scheduling within basic blocks. Technical report, University of Massachusetts, Amherst.

Sites, R. (1992). Alpha Architecture Reference Manual. Maynard, MA: Digital Equipment Corporation.

Srivastava, A. & Eustace, A. (1994). ATOM: A system for building customized program analysis tools. In Proc. ACM SIGPLAN '94 Conf. on Prog. Lang. Design and Impl. (pp. 196-205).

Stefanovic, D. (1997). The character of the instruction scheduling problem. University of Massachusetts, Amherst.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9-44.

Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Tesauro, G. & Galperin, G. R. (1996). On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing: Proceedings of the Ninth Conference. MIT Press.