{"title": "Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 974, "page_last": 980, "abstract": null, "full_text": "Reinforcement Learning for Dynamic \n\nC\u00b7hannel Allocation in Cellular Telephone \n\nSystems \n\nSatinder Singh \n\nDepartment of Computer Science \n\nUniversity of Colorado \nBoulder, CO 80309-0430 \nbavej a@cs.colorado.edu \n\nDimitri Bertsekas \n\nLab. for Info. and Decision Sciences \n\nMIT \n\nCambridge, MA 02139 \nbertsekas@lids.mit.edu \n\nAbstract \n\nIn cellular telephone systems, an important problem is to dynami(cid:173)\ncally allocate the communication resource (channels) so as to max(cid:173)\nimize service in a stochastic caller environment. This problem is \nnaturally formulated as a dynamic programming problem and we \nuse a reinforcement learning (RL) method to find dynamic channel \nallocation policies that are better than previous heuristic solutions. \nThe policies obtained perform well for a broad variety of call traf(cid:173)\nfic patterns. We present results on a large cellular system with \napproximately 4949 states. \n\nIn cellular communication systems, an important problem is to allocate the com(cid:173)\nmunication resource (bandwidth) so as to maximize the service provided to a set of \nmobile callers whose demand for service changes stochastically. A given geograph(cid:173)\nical area is divided into mutually disjoint cells, and each cell serves the calls that \nare within its boundaries (see Figure 1a). The total system bandwidth is divided \ninto channels, with each channel centered around a frequency. Each channel can be \nused simultaneously at different cells, provided these cells are sufficiently separated \nspatially, so that there is no interference between them. The minimum separation \ndistance between simultaneous reuse of the same channel is called the channel reuse \nconstraint. 
\n\nWhen a call requests service in a given cell either a free channel (one that does not \nviolate the channel reuse constraint) may be assigned to the call, or else the call is \nblocked from the system; this will happen if no free channel can be found. Also, \nwhen a mobile caller crosses from one cell to another, the call is \"handed off\" to the \ncell of entry; that is, a new free channel is provided to the call at the new cell. If no \nsuch channel is available, the call must be dropped/disconnected from the system. \n\n\fRLfor Dynamic Channel Allocation \n\n975 \n\nOne objective of a channel allocation policy is to allocate the available channels \nto calls so that the number of blocked calls is minimized. An additional objective \nis to minimize the number of calls that are dropped when they are handed off to \na busy cell. These two objectives must be weighted appropriately to reflect their \nrelative importance, since dropping existing calls is generally more undesirable than \nblocking new calls. \nTo illustrate the qualitative nature of the channel assignment decisions, suppose \nthat there are only two channels and three cells arranged in a line. Assume a \nchannel reuse constraint of 2, i.e., a channel may be used simultaneously in cells \n1 and 3, but may not be used in channel 2 if it is already used in cell 1 or in cell \n3. Suppose that the system is serving one call in cell 1 and another call in cell \n3. Then serving both calls on the same channel results in a better channel usage \npattern than serving them on different channels, since in the former case the other \nchannel is free to be used in cell 2. The purpose of the channel assignment and \nchannel rearrangement strategy is, roughly speaking, to create such favorable usage \npatterns that minimize the likelihood of calls being blocked. \n\nWe formulate the channel assignment problem as a dynamic programming problem, \nwhich, however, is too complex to be solved exactly. 
We introduce approximations based on the methodology of reinforcement learning (RL) (e.g., Barto, Bradtke and Singh, 1995, or the recent textbook by Bertsekas and Tsitsiklis, 1996). Our method learns channel allocation policies that outperform not only the most commonly used policy in cellular systems, but also the best heuristic policy we could find in the literature. \n\n1 CHANNEL ASSIGNMENT POLICIES \n\nMany cellular systems are based on a fixed assignment (FA) channel allocation; that is, the set of channels is partitioned, and the partitions are permanently assigned to cells so that all cells can use all the channels assigned to them simultaneously without interference (see Figure 1a). When a call arrives in a cell, if any preassigned channel is unused, it is assigned; else the call is blocked. No rearrangement is done when a call terminates. Such a policy is static and cannot take advantage of temporary stochastic variations in demand for service. More efficient are dynamic channel allocation policies, which assign channels to different cells, so that every channel is available to every cell on a need basis, unless the channel is used in a nearby cell and the reuse constraint is violated. \n\nThe best existing dynamic channel allocation policy we found in the literature is Borrowing with Directional Channel Locking (BDCL) of Zhang & Yum (1989). It numbers the channels from 1 to N, partitions and assigns them to cells as in FA. The channels assigned to a cell are its nominal channels. If a nominal channel is available when a call arrives in a cell, the smallest numbered such channel is assigned to the call. If no nominal channel is available, then the largest numbered free channel is borrowed from the neighbour with the most free channels. When a channel is borrowed, careful accounting of the directional effect of which cells can no longer use that channel because of interference is done. 
The call is blocked if there are no free channels at all. When a call terminates in a cell and the channel so freed is a nominal channel, say numbered i, of that cell, then if there is a call in that cell on a borrowed channel, the call on the smallest numbered borrowed channel is reassigned to i and the borrowed channel is returned to the appropriate cell. If there is no call on a borrowed channel, then if there is a call on a nominal channel numbered larger than i, the call on the highest numbered nominal channel is reassigned to i. If the call just terminated was itself on a borrowed channel, the call on the smallest numbered borrowed channel is reassigned to it and that channel is returned to the cell from which it was borrowed. Notice that when a borrowed channel is returned to its original cell, a nominal channel becomes free in that cell and triggers a reassignment. Thus, in the worst case a call termination in one cell can sequentially cause reassignments in arbitrarily far away cells - making BDCL somewhat impractical. \n\nBDCL is quite sophisticated and combines the notions of channel-ordering, nominal channels, and channel borrowing. Zhang and Yum (1989) show that BDCL is superior to its competitors, including FA. Generally, BDCL has continued to be highly regarded in the literature as a powerful heuristic (Del Re et al., 1996). In this paper, we compare the performance of dynamic channel allocation policies learned by RL with both FA and BDCL. \n\n1.1 DYNAMIC PROGRAMMING FORMULATION \n\nWe can formulate the dynamic channel allocation problem using dynamic programming (e.g., Bertsekas, 1995). 
State transitions occur when channels become free due to call departures, or when a call arrives at a given cell and wishes to be assigned a channel, or when there is a handoff, which can be viewed as a simultaneous call departure from one cell and a call arrival at another cell. The state at each time consists of two components: \n\n(1) The list of occupied and unoccupied channels at each cell. We call this the configuration of the cellular system. It is exponential in the number of cells. \n(2) The event that causes the state transition (arrival, departure, or handoff). This component of the state is uncontrollable. \n\nThe decision/control applied at the time of a call departure is the rearrangement of the channels in use, with the aim of creating a more favorable channel packing pattern among the cells (one that will leave more channels free for future assignments). Unlike BDCL, our RL solution will restrict this rearrangement to the cell with the current call departure. The control exercised at the time of a call arrival is the assignment of a free channel, or the blocking of the call if no free channel is currently available. In general, it may also be useful to do admission control, i.e., to allow the possibility of not accepting a new call even when there exists a free channel, in order to minimize the dropping of ongoing calls during handoff in the future. We address admission control in a separate paper and here restrict ourselves to always accepting a call if a free channel is available. The objective is to learn a policy that assigns decisions (assignment or rearrangement depending on event) to each state so as to maximize \n\nJ = E{ ∫_0^∞ e^(−βt) c(t) dt }, \n\nwhere E{·} is the expectation operator, c(t) is the number of ongoing calls at time t, and β is a discount factor that makes immediate profit more valuable than future profit. 
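As a concrete illustration of this objective (a sketch under assumed numbers, not the paper's setup): with c(t) piecewise constant between events, a segment [t0, t1] with n ongoing calls contributes n(e^(−βt0) − e^(−βt1))/β to the sample payoff, so J for one trajectory can be computed in closed form:

```python
import math

BETA = 0.01  # illustrative discount rate; the paper does not give a value

def discounted_payoff(segments, beta=BETA):
    """Discounted payoff of one sample trajectory. `segments` lists
    (t_start, t_end, n_calls) with the call count constant on each segment;
    each segment contributes the exact integral of n * e^(-beta*t)."""
    return sum(n * (math.exp(-beta * t0) - math.exp(-beta * t1)) / beta
               for (t0, t1, n) in segments)

# Example trajectory: 2 calls for the first 5 minutes, 3 calls for the
# next 5 minutes, then 1 call until minute 30.
trajectory = [(0.0, 5.0, 2), (5.0, 10.0, 3), (10.0, 30.0, 1)]
print(round(discounted_payoff(trajectory), 3))
```

Because of the e^(−βt) factor, a call carried early contributes more than the same call carried later, which is what makes immediate service more valuable than future service.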
Maximizing J is equivalent to minimizing the expected (discounted) number of blocked calls over an infinite horizon. \n\n2 REINFORCEMENT LEARNING SOLUTION \n\nRL methods solve optimal control (or dynamic programming) problems by learning good approximations to the optimal value function, J*, given by the solution to the Bellman optimality equation, which takes the following form for the dynamic channel allocation problem: \n\nJ(x) = E_e{ max_{a ∈ A(x,e)} E_{Δt}[ c(x, a, Δt) + γ(Δt) J(y) ] },    (1) \n\nwhere x is a configuration, e is the random event (a call arrival or departure), A(x, e) is the set of actions available in the current state (x, e), Δt is the random time until the next event, c(x, a, Δt) is the effective immediate payoff with the discounting, and γ(Δt) is the effective discount for the next configuration y. \n\nRL learns approximations to J* using Sutton's (1988) temporal difference (TD(0)) algorithm. A fixed feature extractor is used to form an approximate compact representation of the exponential configuration of the cellular array. This approximate representation forms the input to a function approximator (see Figure 1) that learns/stores estimates of J*. No partitioning of channels is done; all channels are available in each cell. On each event, the estimates of J* are used both to make decisions and to update the estimates themselves as follows: \n\nCall Arrival: When a call arrives, evaluate the next configuration for each free channel and assign the channel that leads to the configuration with the largest estimated value. If there is no free channel at all, no decision has to be made. \n\nCall Termination: When a call terminates, one by one each ongoing call in that cell is considered for reassignment to the just freed channel; the resulting configurations are evaluated and compared to the value of not doing any reassignment at all. 
The action that leads to the highest value configuration is then executed. \n\nOn call arrival, as long as there is a free channel, the number of ongoing calls and the time to next event do not depend on the free channel assigned. Similarly, the number of ongoing calls and the time to next event do not depend on the rearrangement done on call departure. Therefore, both the sample immediate payoff, which depends on the number of ongoing calls and the time to next event, and the effective discount factor, which depends only on the time to next event, are independent of the choice of action. Thus one can choose the current best action by simply considering the estimated values of the next configurations. The next configuration for each action is deterministic and trivial to compute. \n\nWhen the next random event occurs, the sample payoff and the discount factor become available and are used to update the value function as follows: on a transition from configuration x to y on action a in time Δt, \n\nJ_new(x~) = (1 − α) J_old(x~) + α ( c(x, a, Δt) + γ(Δt) J_old(y~) ),    (2) \n\nwhere x~ is used to indicate the approximate feature-based representation of x. The parameters of the function approximator are then updated to best represent J_new(x~) using gradient descent in the mean-squared error (J_new(x~) − J_old(x~))^2. \n\n3 SIMULATION RESULTS \n\nCall arrivals are modeled as Poisson processes with a separate mean for each cell, and call durations are modeled with an exponential distribution. The first set of results are on the 7 by 7 cellular array of Figure 1a with 70 channels (roughly 70^49 configurations) and a channel reuse constraint of 3 (this problem is borrowed from Zhang and Yum's (1989) paper on an empirical comparison of BDCL and its competitors). Figures 2a, b & c are for uniform call arrival rates of 150, 200, and 350 calls/hr respectively in each cell. The mean call duration for all the experiments reported here is 3 minutes. Figure 2d is for non-uniform call arrival rates. Each curve plots the cumulative empirical blocking probability as a function of simulated time. Each data point is therefore the percentage of system-wide calls that were blocked up until that point in time. All simulations start with no ongoing calls. \n\n[Figure 1 appears here: a) the cellular array; b) a block diagram of the RL system: Configuration -> Feature Extractor -> Features (Availability and Packing) -> Function Approximator -> Value, with TD(0) training.] \n\nFigure 1: a) Cellular Array. The market area is divided up into cells, shown here as hexagons. The available bandwidth is divided into channels. Each cell has a base station responsible for calls within its area. Calls arrive randomly, have random durations, and callers may move around in the market area creating handoffs. The channel reuse constraint requires that there be a minimum distance between simultaneous reuse of the same channel. In a fixed assignment channel allocation policy (assuming a channel reuse constraint of 3), the channels are partitioned into 7 lots labeled 1 to 7 and assigned to the cells in the compact pattern shown here. Note that the minimum distance between cells with the same number is at least three. b) A block diagram of the RL system. The exponential configuration is mapped into a feature-based approximate representation which forms the input to a function approximation system that learns values for configurations. The parameters of the function approximator are trained using gradient descent on the squared TD(0) error in value function estimates (cf. Equation 2). \n\nThe RL system uses a linear neural network and two sets of features as input: one availability feature for each cell and one packing feature for each cell-channel pair. 
\nThe availability feature for a cell is the number of free channels in that cell, while the packing feature for a cell-channel pair is the number of times that channel is used in a 4 cell radius. Other packing features were tried but are not reported because they were insignificant. The RL curves in Figure 2 show the empirical blocking probability whilst learning. Note that learning is quite rapid. As the mean call arrival rate is increased, the relative difference between the 3 algorithms decreases. In fact, FA can be shown to be optimal in the limit of infinite call arrival rates (see McEliece and Sivarajan, 1994). With so many customers in every cell there are no short-term fluctuations to exploit. However, as demonstrated in Figure 2, for practical traffic rates RL consistently gives a big win over FA and a smaller win over BDCL. \n\nOne difference between RL and BDCL is that while the BDCL policy is independent of call traffic, RL adapts its policy to the particulars of the call traffic it is trained on and should therefore be less sensitive to different patterns of non-uniformity of call traffic across cells. Figure 3b presents multiple sets of bar-graphs of asymptotic blocking probabilities for the three algorithms on a 20 by 1 cellular array with 24 channels and a channel reuse constraint of 3. For each set, the average per-cell call arrival rate is the same (120 calls/hr; mean duration of 3 minutes), but the pattern of call arrival rates across the 20 cells is varied. The patterns are shown on the left of the bar-graphs and are explained in the caption of Figure 3b. From Figure 3b it is apparent that RL is much less sensitive to varied patterns of non-uniformity than both BDCL and FA. \n\nWe have shown that RL with a linear function approximator is able to find better dynamic channel allocation policies than the BDCL and FA policies. 
[Figure 2 appears here: four panels of cumulative empirical blocking probability versus time for FA, BDCL, and RL at 150, 200, and 350 calls/hr and for a non-uniform pattern.] \n\nFigure 2: a), b), c) & d) These figures compare performance of RL, FA, and BDCL on the 7 by 7 cellular array of Figure 1a. The means of the call arrival (Poisson) processes are shown in the graph titles. Each curve presents the cumulative empirical blocking probability as a function of time elapsed in minutes. All simulations start with no ongoing calls and therefore the blocking probabilities are low in the early minutes of the performance curves. The RL curves presented here are for a linear function approximator and show performance while learning. Note that learning is quite rapid. \n\nUsing nonlinear neural networks as function approximators for RL did in some cases improve performance over linear networks by a small amount, but at the cost of a big increase in training time. We chose to present results for linear networks because they have the advantage that even though training is centralized, the policy so learned is decentralized: because the features are local, just the weights from the local features in the trained linear network can be used to choose actions in each cell. 
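The TD(0) update of Equation 2 with a linear value function can be sketched as follows. This is a generic illustration, not the paper's implementation: the feature vectors stand in for the availability/packing features, and the step size alpha is an arbitrary choice.

```python
import numpy as np

class LinearTD0:
    """Linear value function J(x~) = w . x~ trained with the TD(0) update of
    Equation 2: the target c + gamma_dt * J_old(y~) is formed from the sample
    payoff and effective discount, and w takes a gradient step toward it."""

    def __init__(self, n_features, alpha=0.01):
        self.w = np.zeros(n_features)
        self.alpha = alpha

    def value(self, phi):
        return float(self.w @ phi)

    def update(self, phi, payoff, gamma_dt, phi_next):
        target = payoff + gamma_dt * self.value(phi_next)
        # Gradient descent on (target - J(phi))^2 with the target held fixed.
        self.w += self.alpha * (target - self.value(phi)) * phi

    def greedy(self, candidate_phis):
        """Index of the candidate next configuration with the highest
        estimated value (how arrival and termination decisions are made)."""
        return max(range(len(candidate_phis)),
                   key=lambda i: self.value(candidate_phis[i]))

# Toy usage with random 10-dimensional feature vectors.
rng = np.random.default_rng(0)
td = LinearTD0(n_features=10)
phi, phi_next = rng.random(10), rng.random(10)
td.update(phi, payoff=2.0, gamma_dt=0.97, phi_next=phi_next)
print(td.value(phi), td.greedy([rng.random(10) for _ in range(3)]))
```

In the decentralized spirit described above, the trained weights on each cell's local features could then be shipped to that cell and used there for action selection.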
For large cellular arrays, training itself could be decentralized because the choice of action in a particular cell has a minor effect on far away cells. We will explore the effect of decentralized training in future work. \n\n4 CONCLUSION \n\nThe dynamic channel allocation problem is naturally formulated as an optimal control or dynamic programming problem, albeit one with very large state spaces. Traditional dynamic programming techniques are computationally infeasible for such large-scale problems. Therefore, knowledge-intensive heuristic solutions that ignore the optimal control framework have been developed. Recent approximations to dynamic programming introduced in the reinforcement learning (RL) community make it possible to go back to the channel assignment problem and solve it as an optimal control problem, in the process finding better solutions than previously available. We presented such a solution using Sutton's (1988) TD(0) with a feature-based linear network and demonstrated its superiority on a problem with approximately 70^49 states. Other recent examples of similar successes are the game of backgammon (Tesauro, 1992), elevator-scheduling (Crites & Barto, 1995), and job-shop scheduling (Zhang & Dietterich, 1995). The neuro-dynamic programming textbook (Bertsekas and Tsitsiklis, 1996) presents a variety of related case studies. \n\n[Figure 3 appears here: a) a screen dump of the Java demonstration; b) bar-graphs of asymptotic blocking probability for FA, BDCL, and RL under different traffic patterns.] \n\nFigure 3: a) Screen dump of a Java demonstration available publicly at http://www.cs.colorado.edu/~baveja/Demo.html b) Sensitivity of channel assignment methods to non-uniform traffic patterns. This figure plots asymptotic empirical blocking probability for RL, BDCL, and FA for a linear array of cells with different patterns (shown at left) of mean call arrival rates - chosen so that the average per cell call arrival rate is the same across patterns. The symbol l is for low, m for medium, and h for high. The numeric values of l, m, and h are chosen separately for each pattern to ensure that the average per cell rate of arrival is 120 calls/hr. The results show that RL is able to adapt its allocation strategy and thereby is better able to exploit the non-uniform call arrival rates. \n\nReferences \n\nBarto, A.G., Bradtke, S.J. & Singh, S. (1995) Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81-138. \n\nBertsekas, D.P. (1995) Dynamic Programming and Optimal Control: Vols 1 and 2. Athena Scientific, Belmont, MA. \n\nBertsekas, D.P. & Tsitsiklis, J. (1996) Neuro-Dynamic Programming. Athena Scientific, Belmont, MA. \n\nCrites, R.H. & Barto, A.G. (1996) Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems 8. \n\nDel Re, E., Fantacci, R. & Ronga, L. (1996) A dynamic channel allocation technique based on Hopfield Neural Networks. IEEE Transactions on Vehicular Technology, 45:1. \n\nMcEliece, R.J. & Sivarajan, K.N. (1994) Performance limits for channelized cellular telephone systems. IEEE Trans. Inform. Theory, pp. 21-34, Jan. \n\nSutton, R.S. (1988) Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44. \n\nTesauro, G.J. (1992) Practical issues in temporal difference learning. Machine Learning, 8(3/4):257-277. \n\nZhang, M. & Yum, T.P. 
(1989) Comparisons of Channel-Assignment Strategies in Cellular Mobile Telephone Systems. IEEE Transactions on Vehicular Technology, Vol. 38, No. 4. \n\nZhang, W. & Dietterich, T.G. (1996) High-performance job-shop scheduling with a time-delay TD(lambda) network. In Advances in Neural Information Processing Systems 8. \n\n\f", "award": [], "sourceid": 1216, "authors": [{"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "Dimitri", "family_name": "Bertsekas", "institution": null}]}