{"title": "Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 974, "page_last": 980, "abstract": null, "full_text": "Reinforcement Learning for Dynamic \n\nChannel Allocation in Cellular Telephone \n\nSystems \n\nSatinder Singh \n\nDepartment of Computer Science \n\nUniversity of Colorado \nBoulder, CO 80309-0430 \nbaveja@cs.colorado.edu \n\nDimitri Bertsekas \n\nLab. for Info. and Decision Sciences \n\nMIT \n\nCambridge, MA 02139 \nbertsekas@lids.mit.edu \n\nAbstract \n\nIn cellular telephone systems, an important problem is to dynamically \nallocate the communication resource (channels) so as to maximize \nservice in a stochastic caller environment. This problem is \nnaturally formulated as a dynamic programming problem and we \nuse a reinforcement learning (RL) method to find dynamic channel \nallocation policies that are better than previous heuristic solutions. \nThe policies obtained perform well for a broad variety of call traffic \npatterns. We present results on a large cellular system with \napproximately $49^{49}$ states. \n\nIn cellular communication systems, an important problem is to allocate the communication \nresource (bandwidth) so as to maximize the service provided to a set of \nmobile callers whose demand for service changes stochastically. A given geographical \narea is divided into mutually disjoint cells, and each cell serves the calls that \nare within its boundaries (see Figure 1a). The total system bandwidth is divided \ninto channels, with each channel centered around a frequency. Each channel can be \nused simultaneously at different cells, provided these cells are sufficiently separated \nspatially, so that there is no interference between them. 
The minimum separation \ndistance between simultaneous reuse of the same channel is called the channel reuse \nconstraint. \n\nWhen a call requests service in a given cell, either a free channel (one that does not \nviolate the channel reuse constraint) is assigned to the call, or else the call is \nblocked from the system; this happens if no free channel can be found. Also, \nwhen a mobile caller crosses from one cell to another, the call is \"handed off\" to the \ncell of entry; that is, a new free channel is provided to the call at the new cell. If no \nsuch channel is available, the call must be dropped/disconnected from the system. \n\nOne objective of a channel allocation policy is to allocate the available channels \nto calls so that the number of blocked calls is minimized. An additional objective \nis to minimize the number of calls that are dropped when they are handed off to \na busy cell. These two objectives must be weighted appropriately to reflect their \nrelative importance, since dropping existing calls is generally more undesirable than \nblocking new calls. \n\nTo illustrate the qualitative nature of the channel assignment decisions, suppose \nthat there are only two channels and three cells arranged in a line. Assume a \nchannel reuse constraint of 2, i.e., a channel may be used simultaneously in cells \n1 and 3, but may not be used in cell 2 if it is already used in cell 1 or in cell \n3. Suppose that the system is serving one call in cell 1 and another call in cell \n3. Then serving both calls on the same channel results in a better channel usage \npattern than serving them on different channels, since in the former case the other \nchannel is free to be used in cell 2. 
The purpose of the channel assignment and \nchannel rearrangement strategy is, roughly speaking, to create such favorable usage \npatterns that minimize the likelihood of calls being blocked. \n\nWe formulate the channel assignment problem as a dynamic programming problem, \nwhich, however, is too complex to be solved exactly. We introduce approximations \nbased on the methodology of reinforcement learning (RL) (e.g., Barto, Bradtke and \nSingh, 1995, or the recent textbook by Bertsekas and Tsitsiklis, 1996). Our method \nlearns channel allocation policies that outperform not only the most commonly used \npolicy in cellular systems, but also the best heuristic policy we could find in the \nliterature. \n\n1 CHANNEL ASSIGNMENT POLICIES \n\nMany cellular systems are based on a fixed assignment (FA) channel allocation; that \nis, the set of channels is partitioned, and the partitions are permanently assigned \nto cells so that all cells can use all the channels assigned to them simultaneously \nwithout interference (see Figure 1a). When a call arrives in a cell, if any pre-assigned \nchannel is unused, it is assigned; otherwise the call is blocked. No rearrangement \nis done when a call terminates. Such a policy is static and cannot take advantage of \ntemporary stochastic variations in demand for service. More efficient are dynamic \nchannel allocation policies, which assign channels to different cells, so that every \nchannel is available to every cell on a need basis, unless the channel is already in use in a \nnearby cell, so that assigning it would violate the reuse constraint. \n\nThe best existing dynamic channel allocation policy we found in the literature is \nBorrowing with Directional Channel Locking (BDCL) of Zhang & Yum (1989). It \nnumbers the channels from 1 to N, partitions and assigns them to cells as in FA. 
\nThe channels assigned to a cell are its nominal channels. If a nominal channel \nis available when a call arrives in a cell, the smallest numbered such channel is \nassigned to the call. If no nominal channel is available, then the largest numbered \nfree channel is borrowed from the neighbour with the most free channels. When a \nchannel is borrowed, careful accounting is done of the directional effect of which cells can \nno longer use that channel because of interference. The call is blocked if \nthere are no free channels at all. When a call terminates in a cell and the channel \nso freed is a nominal channel, say numbered i, of that cell, then if there is a call \nin that cell on a borrowed channel, the call on the smallest numbered borrowed \nchannel is reassigned to i and the borrowed channel is returned to the appropriate \ncell. If there is no call on a borrowed channel, then if there is a call on a nominal \nchannel numbered larger than i, the call on the highest numbered nominal channel \nis reassigned to i. If the call just terminated was itself on a borrowed channel, the \ncall on the smallest numbered borrowed channel is reassigned to it and that channel \nis returned to the cell from which it was borrowed. Notice that when a borrowed \nchannel is returned to its original cell, a nominal channel becomes free in that cell \nand triggers a reassignment. Thus, in the worst case a call termination in one cell \ncan sequentially cause reassignments in arbitrarily far away cells, making BDCL \nsomewhat impractical. \n\nBDCL is quite sophisticated and combines the notions of channel-ordering, nominal \nchannels, and channel borrowing. Zhang and Yum (1989) show that BDCL is \nsuperior to its competitors, including FA. 
Generally, BDCL has continued to be \nhighly regarded in the literature as a powerful heuristic (Del Re et al., 1996). In \nthis paper, we compare the performance of dynamic channel allocation policies \nlearned by RL with both FA and BDCL. \n\n1.1 DYNAMIC PROGRAMMING FORMULATION \n\nWe can formulate the dynamic channel allocation problem using dynamic programming \n(e.g., Bertsekas, 1995). State transitions occur when channels become free due \nto call departures, or when a call arrives at a given cell and wishes to be assigned \na channel, or when there is a handoff, which can be viewed as a simultaneous call \ndeparture from one cell and a call arrival at another cell. The state at each time \nconsists of two components: \n\n(1) The list of occupied and unoccupied channels at each cell. We call this the \nconfiguration of the cellular system. The number of configurations is exponential in the number of cells. \n(2) The event that causes the state transition (arrival, departure, or handoff). This \ncomponent of the state is uncontrollable. \n\nThe decision/control applied at the time of a call departure is the rearrangement \nof the channels in use with the aim of creating a more favorable channel packing \npattern among the cells (one that will leave more channels free for future assignments). \nUnlike BDCL, our RL solution will restrict this rearrangement to the cell \nwith the current call departure. The control exercised at the time of a call arrival \nis the assignment of a free channel, or the blocking of the call if no free channel is \ncurrently available. In general, it may also be useful to do admission control, i.e., \nto allow the possibility of not accepting a new call even when there exists a free \nchannel, in order to minimize the dropping of ongoing calls during handoff in the future. 
We \naddress admission control in a separate paper and here restrict ourselves to always \naccepting a call if a free channel is available. The objective is to learn a policy that \nassigns decisions (assignment or rearrangement depending on event) to each state \nso as to maximize \n\n$$J = E\\left\\{ \\int_0^{\\infty} e^{-\\beta t} c(t)\\, dt \\right\\},$$ \n\nwhere $E\\{\\cdot\\}$ is the expectation operator, $c(t)$ is the number of ongoing calls at time \n$t$, and $\\beta$ is a discount factor that makes immediate profit more valuable than future \nprofit. Maximizing $J$ is equivalent to minimizing the expected (discounted) number \nof blocked calls over an infinite horizon. \n\n2 REINFORCEMENT LEARNING SOLUTION \n\nRL methods solve optimal control (or dynamic programming) problems by learning \ngood approximations to the optimal value function, $J^*$, given by the solution to \nthe Bellman optimality equation, which takes the following form for the dynamic \nchannel allocation problem: \n\n$$J(x) = E_e\\left\\{ \\max_{a \\in A(x,e)} \\left[ E_{\\Delta t}\\left\\{ c(x, a, \\Delta t) + \\gamma(\\Delta t) J(y) \\right\\} \\right] \\right\\}, \\qquad (1)$$ \n\nwhere $x$ is a configuration, $e$ is the random event (a call arrival or departure), $A(x, e)$ \nis the set of actions available in the current state $(x, e)$, $\\Delta t$ is the random time until \nthe next event, $c(x, a, \\Delta t)$ is the effective immediate payoff with the discounting, \nand $\\gamma(\\Delta t)$ is the effective discount for the next configuration $y$. \n\nRL learns approximations to $J^*$ using Sutton's (1988) temporal difference (TD(0)) \nalgorithm. A fixed feature extractor is used to form an approximate compact representation \nof the exponentially large configuration space of the cellular array. This approximate \nrepresentation forms the input to a function approximator (see Figure 1b) that \nlearns/stores estimates of $J^*$. No partitioning of channels is done; all channels are \navailable in each cell. 
On each event, the estimates of $J^*$ are used both to make \ndecisions and to update the estimates themselves as follows: \n\nCall Arrival: When a call arrives, evaluate the next configuration for each free \nchannel and assign the channel that leads to the configuration with the largest \nestimated value. If there is no free channel at all, no decision has to be made. \n\nCall Termination: When a call terminates, one by one each ongoing call in that \ncell is considered for reassignment to the just freed channel; the resulting configurations \nare evaluated and compared to the value of not doing any reassignment at \nall. The action that leads to the highest value configuration is then executed. \n\nOn call arrival, as long as there is a free channel, the number of ongoing calls and the \ntime to next event do not depend on the free channel assigned. Similarly, the number \nof ongoing calls and the time to next event do not depend on the rearrangement done \non call departure. Therefore, both the sample immediate payoff, which depends on \nthe number of ongoing calls and the time to next event, and the effective discount \nfactor, which depends only on the time to next event, are independent of the choice \nof action. Thus one can choose the current best action by simply considering the \nestimated values of the next configurations. The next configuration for each action \nis deterministic and trivial to compute. \n\nWhen the next random event occurs, the sample payoff and the discount factor become \navailable and are used to update the value function as follows: on a transition \nfrom configuration $x$ to $y$ on action $a$ in time $\\Delta t$, \n\n$$J_{new}(\\hat{x}) = (1 - \\alpha) J_{old}(\\hat{x}) + \\alpha \\left( c(x, a, \\Delta t) + \\gamma(\\Delta t) J_{old}(\\hat{y}) \\right), \\qquad (2)$$ \n\nwhere $\\alpha$ is the learning rate and $\\hat{x}$ is used to indicate the approximate feature-based representation of $x$. 
The \nparameters of the function approximator are then updated to best represent $J_{new}(\\hat{x})$ \nusing gradient descent on the mean-squared error $(J_{new}(\\hat{x}) - J_{old}(\\hat{x}))^2$. \n\n3 SIMULATION RESULTS \n\nCall arrivals are modeled as Poisson processes with a separate mean for each cell, \nand call durations are modeled with an exponential distribution. The first set of \nresults is on the 7 by 7 cellular array of Figure 1a with 70 channels (roughly \n$70^{49}$ configurations) and a channel reuse constraint of 3 (this problem is borrowed \nfrom Zhang and Yum's (1989) paper on an empirical comparison of BDCL and its \ncompetitors). Figures 2a, b & c are for uniform call arrival rates of 150, 200, and \n350 calls/hr respectively in each cell. The mean call duration for all the experiments \nreported here is 3 minutes. Figure 2d is for non-uniform call arrival rates. Each \ncurve plots the cumulative empirical blocking probability as a function of simulated \ntime. Each data point is therefore the percentage of system-wide calls that were \nblocked up until that point in time. All simulations start with no ongoing calls. \n\n[Figure 1 panels: a) cellular array; b) block diagram: Configuration, Feature Extractor, Features (Availability and Packing), Function Approximator, Value, with TD(0) Training.] \n\nFigure 1: a) Cellular Array. The market area is divided up into cells, shown here as \nhexagons. The available bandwidth is divided into channels. Each cell has a base station \nresponsible for calls within its area. Calls arrive randomly, have random durations \nand callers may move around in the market area creating handoffs. The channel reuse \nconstraint requires that there be a minimum distance between simultaneous reuse of the \nsame channel. 
In a fixed assignment channel allocation policy (assuming a channel reuse \nconstraint of 3), the channels are partitioned into 7 lots labeled 1 to 7 and assigned to the \ncells in the compact pattern shown here. Note that the minimum distance between cells \nwith the same number is at least three. b) A block diagram of the RL system. The exponential \nconfiguration is mapped into a feature-based approximate representation which \nforms the input to a function approximation system that learns values for configurations. \nThe parameters of the function approximator are trained using gradient descent on the \nsquared TD(0) error in value function estimates (cf. Equation 2). \n\nThe RL system uses a linear neural network and two sets of features as input: one \navailability feature for each cell and one packing feature for each cell-channel pair. \nThe availability feature for a cell is the number of free channels in that cell, while the \npacking feature for a cell-channel pair is the number of times that channel is used \nwithin a 4 cell radius. Other packing features were tried but are not reported because \ntheir effect was insignificant. The RL curves in Figure 2 show the empirical blocking \nprobability whilst learning. Note that learning is quite rapid. As the mean call \narrival rate is increased, the relative difference between the 3 algorithms decreases. \nIn fact, FA can be shown to be optimal in the limit of infinite call arrival rates \n(see McEliece and Sivarajan, 1994). With so many customers in every cell there \nare no short-term fluctuations to exploit. However, as demonstrated in Figure 2, \nfor practical traffic rates RL consistently gives a big win over FA and a smaller win \nover BDCL. 
\nOne difference between RL and BDCL is that while the BDCL policy is independent \nof call traffic, RL adapts its policy to the particulars of the call traffic it is trained \non and should therefore be less sensitive to different patterns of non-uniformity of \ncall traffic across cells. Figure 3b presents multiple sets of bar-graphs of asymptotic \nblocking probabilities for the three algorithms on a 20 by 1 cellular array with 24 \nchannels and a channel reuse constraint of 3. For each set, the average per-cell call \narrival rate is the same (120 calls/hr; mean duration of 3 minutes), but the pattern \nof call arrival rates across the 20 cells is varied. The patterns are shown on the left \nof the bar-graphs and are explained in the caption of Figure 3b. From Figure 3b \nit is apparent that RL is much less sensitive to varied patterns of non-uniformity \nthan both BDCL and FA. \n\nWe have shown that RL with a linear function approximator is able to find better \ndynamic channel allocation policies than the BDCL and FA policies. Using nonlinear \nneural networks as function approximators for RL did in some cases improve \nperformance over linear networks by a small amount but at the cost of a big increase \nin training time. We chose to present results for linear networks because they \nhave the advantage that even though training is centralized, the policy so learned \nis decentralized: because the features are local, just the weights from \nthe local features in the trained linear network can be used to choose actions in \neach cell. For large cellular arrays, training itself could be decentralized because \nthe choice of action in a particular cell has a minor effect on far away cells. We will \nexplore the effect of decentralized training in future work. \n\n[Figure 2 panels: a) 150 calls/hr, b) 200 calls/hr, c) 350 calls/hr, d) Non-Uniform. Each panel plots empirical blocking probability against time for FA, BDCL, and RL.] \n\nFigure 2: a), b), c) & d) These figures compare performance of RL, FA, and BDCL on the 7 \nby 7 cellular array of Figure 1a. The means of the call arrival (Poisson) processes are shown \nin the graph titles. Each curve presents the cumulative empirical blocking probability as \na function of time elapsed in minutes. All simulations start with no ongoing calls and \ntherefore the blocking probabilities are low in the early minutes of the performance curves. \nThe RL curves presented here are for a linear function approximator and show performance \nwhile learning. Note that learning is quite rapid. \n\n4 CONCLUSION \n\nThe dynamic channel allocation problem is naturally formulated as an optimal control \nor dynamic programming problem, albeit one with very large state spaces. Traditional \ndynamic programming techniques are computationally infeasible for such \nlarge-scale problems. Therefore, knowledge-intensive heuristic solutions that ignore \nthe optimal control framework have been developed. 
Recent approximations \nto dynamic programming introduced in the reinforcement learning (RL) community \nmake it possible to go back to the channel assignment problem and solve it \nas an optimal control problem, in the process finding better solutions than previously \navailable. We presented such a solution using Sutton's (1988) TD(0) with a \nfeature-based linear network and demonstrated its superiority on a problem with \napproximately $70^{49}$ states. Other recent examples of similar successes are the game \nof backgammon (Tesauro, 1992), elevator-scheduling (Crites & Barto, 1996), and \njob-shop scheduling (Zhang & Dietterich, 1996). The neuro-dynamic programming \ntextbook (Bertsekas and Tsitsiklis, 1996) presents a variety of related case studies. \n\nFigure 3: a) Screen dump of a Java Demonstration available publicly at \nhttp://www.cs.colorado.edu/~baveja/Demo.html b) Sensitivity of channel assignment methods to \nnon-uniform traffic patterns. This figure plots asymptotic empirical blocking probability \nfor RL, BDCL, and FA for a linear array of cells with different patterns (shown at left) of \nmean call arrival rates, chosen so that the average per cell call arrival rate is the same \nacross patterns. The symbol l is for low, m for medium, and h for high. The numeric \nvalues of l, h, and m are chosen separately for each pattern to ensure that the average per \ncell rate of arrival is 120 calls/hr. The results show that RL is able to adapt its allocation \nstrategy and thereby is better able to exploit the non-uniform call arrival rates. \n\nReferences \n\nBarto, A.G., Bradtke, S.J. & Singh, S. (1995) Learning to act using real-time dynamic \nprogramming. Artificial Intelligence, 72:81-138. \n\nBertsekas, D.P. (1995) Dynamic Programming and Optimal Control: Vols 1 and 2. Athena-Scientific, \nBelmont, MA. \n\nBertsekas, D.P. & Tsitsiklis, J. (1996) Neuro-Dynamic Programming. Athena-Scientific, \nBelmont, MA. \n\nCrites, R.H. & Barto, A.G. (1996) Improving elevator performance using reinforcement \nlearning. In Advances in Neural Information Processing Systems 8. \n\nDel Re, W., Fantacci, R. & Ronga, L. (1996) A dynamic channel allocation technique \nbased on Hopfield Neural Networks. IEEE Transactions on Vehicular Technology, 45:1. \n\nMcEliece, R.J. & Sivarajan, K.N. (1994) Performance limits for channelized cellular telephone \nsystems. IEEE Trans. Inform. Theory, pp. 21-34, Jan. \n\nSutton, R.S. (1988) Learning to predict by the methods of temporal differences. Machine \nLearning, 3:9-44. \n\nTesauro, G.J. (1992) Practical issues in temporal difference learning. Machine Learning, \n8(3/4):257-277. \n\nZhang, M. & Yum, T.P. (1989) Comparisons of Channel-Assignment Strategies in Cellular \nMobile Telephone Systems. IEEE Transactions on Vehicular Technology, Vol. 38, No. 4. \n\nZhang, W. & Dietterich, T.G. (1996) High-performance job-shop scheduling with a time-delay \nTD(lambda) network. In Advances in Neural Information Processing Systems 8. 
\n\n\f", "award": [], "sourceid": 1216, "authors": [{"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "Dimitri", "family_name": "Bertsekas", "institution": null}]}