{"title": "Multi-Grid Methods for Reinforcement Learning in Controlled Diffusion Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1039, "abstract": "", "full_text": "Multi-Grid Methods for Reinforcement \n\nLearning in Controlled Diffusion Processes \n\nStephan Pareigis \n\nstp@numerik.uni-kiel.de \n\nLehrstuhl Praktische Mathematik \nChristian-Albrechts-U niversi tat Kiel \n\nKiel, Germany \n\nAbstract \n\nReinforcement learning methods for discrete and semi-Markov de(cid:173)\ncision problems such as Real-Time Dynamic Programming can \nbe generalized for Controlled Diffusion Processes. The optimal \ncontrol problem reduces to a boundary value problem for a fully \nnonlinear second-order elliptic differential equation of Hamilton(cid:173)\nJacobi-Bellman (HJB-) type. Numerical analysis provides multi(cid:173)\ngrid methods for this kind of equation. In the case of Learning Con(cid:173)\ntrol, however, the systems of equations on the various grid-levels are \nobtained using observed information (transitions and local cost). \nTo ensure consistency, special attention needs to be directed to(cid:173)\nward the type of time and space discretization during the obser(cid:173)\nvation. An algorithm for multi-grid observation is proposed. The \nmulti-grid algorithm is demonstrated on a simple queuing problem. \n\n1 \n\nIntroduction \n\nControlled Diffusion Processes (CDP) are the analogy to Markov Decision Problems \nin continuous state space and continuous time. A CDP can always be discretized in \nstate space and time and thus reduced to a Markov Decision Problem. Algorithms \nlike Q-Iearning and RTDP as described in [1] can then be applied to produce controls \nor optimal value functions for a fixed discretization. \n\nProblems arise when the discretization needs to be refined, or when multi-grid \ninformation needs to be extracted to accelerate the algorithm. 
The relation of time to state-space discretization parameters is crucial in both cases. Therefore a mathematical model of the discretized process is introduced, which reflects the properties of the converged empirical process. In this model, transition probabilities of the discrete process can be expressed in terms of the transition probabilities of the continuous process. Recent results in numerical methods for stochastic control problems in continuous time can be applied to give assumptions that guarantee a local consistency condition, which is needed for convergence. The same assumptions allow the application of multi-grid methods.

In section 2 Controlled Diffusion Processes are introduced. A model for the discretized process is suggested in section 3 and the main theorem is stated. Section 4 presents an algorithm for multi-grid observation according to the results in the preceding section. Section 5 shows an application of multi-grid techniques for observed processes.

2 Controlled Diffusion Processes

Consider a Controlled Diffusion Process (CDP) \xi(t) in some bounded domain \Omega \subset \mathbb{R}^n fulfilling the diffusion equation

d\xi(t) = b(\xi(t), u(t))\,dt + \sigma(\xi(t))\,dw.   (1)

The control u(t) takes values in some finite set U. The immediate reinforcement (cost) for state \xi(t) and control u(t) is

r(t) = r(\xi(t), u(t)).   (2)

The control objective is to find a feedback control law

u(t) = u(\xi(t))   (3)

that minimizes the total discounted cost

J(x, u) = \mathbb{E}_x^u \int_0^\infty e^{-\beta t} r(\xi(t), u(t))\,dt,   (4)

where \mathbb{E}_x^u is the expectation starting in x \in \Omega and applying the control law u(\cdot), and \beta > 0 is the discount. The transition probabilities of the CDP are given for any initial state x \in \Omega and subset A \subset \Omega by the stochastic kernels

P_t^u(x, A) := \mathrm{prob}\{\xi(t) \in A \mid \xi(0) = x, u\}.
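The kernels P_t^u are in general not available in closed form, but they can be sampled by simulating the diffusion. A minimal one-dimensional sketch using Euler-Maruyama steps (the coefficients b and sigma below are hypothetical, not the paper's):

```python
import numpy as np

def simulate_step(x, u, b, sigma, dt, rng):
    # One Euler-Maruyama step of d(xi) = b(xi, u) dt + sigma(xi) dw (1-D).
    dw = rng.normal(0.0, np.sqrt(dt))
    return x + b(x, u) * dt + sigma(x) * dw

def sample_kernel(x0, u, b, sigma, t, n_steps, n_samples, rng):
    # Monte-Carlo sample of the kernel P_t^u(x0, .): endpoints xi(t) of
    # independent trajectories started in x0 under constant control u.
    dt = t / n_steps
    samples = np.empty(n_samples)
    for k in range(n_samples):
        x = x0
        for _ in range(n_steps):
            x = simulate_step(x, u, b, sigma, dt, rng)
        samples[k] = x
    return samples

# Hypothetical coefficients; for small t the sample mean of xi(t) - x0 is
# close to t * b(x0, u) and the sample variance close to t * sigma(x0)^2.
rng = np.random.default_rng(0)
b = lambda x, u: -x + u
sigma = lambda x: 0.5
s = sample_kernel(1.0, 0, b, sigma, t=0.01, n_steps=10, n_samples=20000, rng=rng)
print(np.mean(s - 1.0), np.var(s - 1.0))
```

The printed moments illustrate the small-t expansions of the kernels stated next.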

It is known that the kernels have the properties

\int (y - x)\,P_t^u(x, dy) = t \cdot b(x, u) + o(t),   (6)
\int (y - x)(y - x)^T\,P_t^u(x, dy) = t \cdot \sigma(x)\sigma(x)^T + o(t).   (7)

For the optimal control it is sufficient to calculate the optimal value function V : \Omega \to \mathbb{R},

V(x) := \inf_{u(\cdot)} J(x, u).   (8)

Under appropriate smoothness assumptions V is a solution of the Hamilton-Jacobi-Bellman (HJB) equation

\min_{a \in U} \{L^a V(x) - \beta V(x) + r(x, a)\} = 0,  x \in \Omega.   (9)

Let a(x) = \sigma(x)\sigma(x)^T be the diffusion matrix; then L^a, a \in U, is defined as the elliptic differential operator

L^a := \sum_{i,j=1}^n a_{ij}(x)\,\partial_{x_i}\partial_{x_j} + \sum_{i=1}^n b_i(x, a)\,\partial_{x_i}.   (10)

3 A Model for Observed CDPs

Let \Omega_{h_i} be the centers of cells of a cell-centered grid on \Omega with cell sizes h_0, h_1 = h_0/2, h_2 = h_1/2, .... For any x \in \Omega_{h_i} we shall denote by A(x) the cell of x. Let \Delta t > 0 be a parameter for the time discretization.

Figure 1: The picture depicts three cell-centered grid levels and the trajectory of a diffusion process. The approximating value function is represented locally constant on each cell. The triangles on the path denote the position of the diffusion at sample times 0, \Delta t, 2\Delta t, 3\Delta t, .... Transitions between respective cells are then counted in matrices Q_i^a, for each control a and grid i.

By counting the transitions between cells and calculating the empirical probabilities as defined in (20) we obtain empirical processes on every grid. By the law of large numbers the empirical processes converge towards observed CDPs as subsequently defined.
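This counting step can be sketched on a single one-dimensional grid (cell size h, number of cells, and the sampled path below are hypothetical; the multi-grid observation of section 4 repeats the same bookkeeping on every level):

```python
import numpy as np

def cell_index(x, h, n_cells):
    # Index of the cell A(x) on a 1-D cell-centered grid covering [0, 1).
    return min(int(x / h), n_cells - 1)

def empirical_transition_matrix(path, h, n_cells):
    # Count transitions of a sampled path between cells (the matrix Q) and
    # normalize each row to empirical probabilities, as in (20).
    Q = np.zeros((n_cells, n_cells))
    for x, y in zip(path[:-1], path[1:]):
        Q[cell_index(x, h, n_cells), cell_index(y, h, n_cells)] += 1
    row = Q.sum(axis=1, keepdims=True)
    P = np.divide(Q, row, out=np.zeros_like(Q), where=row > 0)
    return Q, P

# Hypothetical trajectory sampled at the interpolation times:
path = [0.05, 0.12, 0.31, 0.28, 0.55, 0.61, 0.58, 0.92]
Q, P = empirical_transition_matrix(path, h=0.25, n_cells=4)
print(P)
```

Rows of P that received at least one observation are proper probability distributions; unvisited cells keep zero rows until data arrives.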

Definition 1 An observed process \xi_{h_i,\Delta t_i}(t) is a Controlled Markov Chain (i.e. discrete state space and discrete time) on \Omega_{h_i} with interpolation time \Delta t_i and the transition probabilities

prob\{\xi(\Delta t_i) \in A(y) \mid \xi(0) \in A(x), u\} = \frac{1}{h_i^n} \int_{A(x)} P_{\Delta t_i}^u(z, A(y))\,dz,   (11)

where x, y \in \Omega_{h_i} and \xi(t) is a solution of (1). Also define the observed reinforcement \rho accordingly.   (12)

On every grid \Omega_{h_i} the respective process \xi_{h_i,\Delta t_i} has its own value function V_{h_i,\Delta t_i}. By theorem 10.4.1 in Kushner, Dupuis ([5], 1992) it holds that

V_{h_i,\Delta t_i}(x) \to V(x) for all x \in \Omega,   (13)

if the following local consistency conditions hold.

Definition 2 Let \Delta\xi_{h,\Delta t} = \xi_{h,\Delta t}(\Delta t) - \xi_{h,\Delta t}(0). \xi_{h,\Delta t} is called locally consistent to a solution \xi(\cdot) of (1) iff

\mathbb{E}_x^a \Delta\xi_{h,\Delta t} = b(x, a)\Delta t + o(\Delta t),   (14)
\mathbb{E}_x^a[\Delta\xi_{h,\Delta t} - \mathbb{E}_x^a\Delta\xi_{h,\Delta t}][\Delta\xi_{h,\Delta t} - \mathbb{E}_x^a\Delta\xi_{h,\Delta t}]^T = a(x)\Delta t + o(\Delta t),   (15)
\sup_n |\Delta\xi_{h,\Delta t}(n\Delta t)| \to 0 as h \to 0.   (16)

To verify these conditions for the observed CDP, the expectation and variance can be calculated. For the expectation we get

\sum_{y \in \Omega_{h_i}} P_{h_i,\Delta t_i}(x, y)(y - x) = \frac{1}{h_i^n} \sum_{y \in \Omega_{h_i}} \int_{A(x)} (y - x)\,P_{\Delta t_i}^u(z, A(y))\,dz.   (17)

Recalling properties (6) and (7) and doing a similar calculation for the variance we obtain the following theorem.

Theorem 3 For observed CDPs \xi_{h_i,\Delta t_i} let h_i and \Delta t_i be chosen such that condition (18) holds. Furthermore, \xi_{h_i,\Delta t_i} shall be truncated at some radius R, such that R \to 0 for h_i \to 0 and expectation and variance of the truncated process differ only in the order o(\Delta t) from expectation and variance of \xi_{h_i,\Delta t_i}. Then the observed processes \xi_{h_i,\Delta t_i} truncated at R are locally consistent to the diffusion process \xi(\cdot), and therefore the value functions V_{h_i,\Delta t_i} converge to the value function V.
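The moment conditions of Definition 2 can be probed numerically: replace the state by its cell center after each sampling interval and compare the increment's mean and variance with b\,\Delta t and a\,\Delta t. A one-dimensional sketch with hypothetical constant coefficients (not the paper's example):

```python
import numpy as np

def observed_increments(x0, u, b, sigma, h, dt, n, rng):
    # Increments of a cell-rounded ("observed") chain: after each sampling
    # interval dt the state is replaced by its cell center (1-D sketch).
    to_center = lambda x: (np.floor(x / h) + 0.5) * h
    incs = np.empty(n)
    for k in range(n):
        x = to_center(x0)
        y = x
        for _ in range(10):  # Euler-Maruyama substeps within one interval
            y += b(y, u) * (dt / 10) + sigma(y) * rng.normal(0.0, np.sqrt(dt / 10))
        incs[k] = to_center(y) - x
    return incs

# Hypothetical coefficients; with h^2 much smaller than dt the cell rounding
# perturbs the moments only slightly, so the sample mean and variance of the
# increments come out close to b*dt and sigma^2*dt, matching (14)-(15).
rng = np.random.default_rng(1)
b = lambda x, u: 1.0
sigma = lambda x: 1.0
incs = observed_increments(0.3, 0, b, sigma, h=0.01, dt=0.05, n=20000, rng=rng)
print(np.mean(incs), np.var(incs))
```

Shrinking h and dt together while keeping the rounding error of lower order is exactly what the parameter schedule of the next section arranges.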

4 Identification by Multi-Grid Observation

The condition in Theorem 3 provides information on how to choose the parameters in the algorithm with empirical data. Choose discretization values h_0, \Delta t_0 for the coarsest grid \Omega_0. \Delta t_0 should typically be of order h_0/\|b\|_{\sup}. Then choose for the finer grids

grid    0           1             2             3             4              5
space   h_0         h_0/2         h_0/4         h_0/8         h_0/16         h_0/32
time    \Delta t_0  \Delta t_0/2  \Delta t_0/4  \Delta t_0/8  \Delta t_0/16  \Delta t_0/32   (19)

The sequences verify assumption (18). We may now formulate the algorithm for Multi-Grid Observation of the CDP \xi(\cdot). Note that only observation is being carried out; the actual calculation of the value function may be done separately, as described in the next section. The choice of the control is assumed to be done by a separate controller. Let \Omega_k be the finest grid, i.e. \Delta t_k and h_k the finest discretizations. Let U_l = U^{\Delta t_l/\Delta t_k} = U \times ... \times U, \Delta t_l/\Delta t_k times. Q_l^{a_l} is a |\Omega_l| \times |\Omega_l| matrix (a_l \in U_l) containing the number of transitions between cells in \Omega_l; R_l^{a_l} is a |\Omega_l|-vector containing the empirical cost for every cell in \Omega_l. The immediate cost is given by the system as r_l = \int_0^{\Delta t_l} e^{-\beta t} r(\xi(t), a_l)\,dt. T denotes the current time.

0. Initialize \Omega_l, Q_l^{a_l}, R_l^{a_l} for all a_l \in U_l, l = 0, ..., k
1. repeat {
2.   choose a = a(T) \in U and apply a constantly on [T, T + \Delta t_k)
3.   T := T + \Delta t_k
4.   for l = 0 to k do {
5.     determine the cell x_l \in \Omega_l with \xi(T - \Delta t_l) \in A(x_l)
6.     determine the cell y_l \in \Omega_l with \xi(T) \in A(y_l)
7.     if \|x_k - y_k\| \ge R (truncation radius) then goto 2, else
8.     a_l := (a(T - \Delta t_l), a(T - \Delta t_l + \Delta t_k), ..., a(T - \Delta t_k))
9.     receive the immediate cost r_l
10.    Q_l^{a_l}(x_l, y_l) := Q_l^{a_l}(x_l, y_l) + 1
11.    R_l^{a_l}(x_l) := (r_l + R_l^{a_l}(x_l) \cdot \sum_{z \in \Omega_l} Q_l^{a_l}(x_l, z)) / (1 + \sum_{z \in \Omega_l} Q_l^{a_l}(x_l, z))
  } (for-do)
} (repeat)

Before applying a multi-grid algorithm for the calculation of the value function on the basis of the observations, one should make sure that every cell has at least some data for every control. Especially in the early stages of learning, only the two coarsest grids \Omega_0, \Omega_1 may be usable for the computation of the optimal value function; finer grids can then be added (possibly locally) as learning evolves.

5 Application of Multi-Grid Techniques

The identification algorithm produces matrices Q_l^{a_l} containing the number of transitions between cells in \Omega_l. From the matrices Q we calculate the transition matrices P by the formula

P_l^{a_l}(x, y) = Q_l^{a_l}(x, y) / \sum_{z \in \Omega_l} Q_l^{a_l}(x, z),  x, y \in \Omega_l, a_l \in U_l, l = 0, ..., k.   (20)

Now we define matrices A and right-hand sides f as

A_l^{a_l} := (\beta_l P_l^{a_l} - I)/\Delta t_l,  f_l^{a_l} := R_l^{a_l}/\Delta t_l,   (21)

where \beta_l = e^{-\beta\Delta t_l}. The discrete Bellman equation then takes the form

\min_{a_l \in U_l} \{A_l^{a_l} V_l(x) + f_l^{a_l}(x)\} = 0,  x \in \Omega_l.   (22)

The problem is now in a form to which the multi-grid method due to Hoppe, Bloß ([2], 1989) can be applied. For prolongation and restriction we choose bilinear interpolation and full weighted restriction for cell-centered grids. We point out that for any cell x \in \Omega_l only those neighboring cells shall be used for prolongation and restriction for which the minimum in (22) is attained for the same control as the minimizing control in x (see [2], 1989 and [3], 1996 for details). On every grid \Omega_l the defect in equation (22) is calculated and used for a correction on grid \Omega_{l-1}. As a smoother, a nonlinear Gauss-Seidel iteration applied to (22) is used.

Our approach differs from the algorithm in Hoppe, Bloß ([2], 1989) in the special form of the matrices A_l^{a_l} in equation (22). The stars are generally larger than nine-point; in fact the stars grow with decreasing h, although the matrices remain sparse.
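The nonlinear Gauss-Seidel smoother can be sketched as follows: sweep over the cells and, at each cell, solve the minimized discrete Bellman equation for that cell's value using already-updated neighbor values. The data below is a hypothetical one-control toy chain, not the paper's problem; with several controls the inner minimum runs over all of them as written:

```python
import numpy as np

def gauss_seidel_sweep(V, P, R, beta_l):
    # One nonlinear Gauss-Seidel sweep: at each cell x solve
    #   V(x) = min_a { beta_l * (P^a V)(x) + R^a(x) }
    # for V(x), updating V in place.
    n = V.shape[0]
    for x in range(n):
        vals = []
        for a in P:
            off = beta_l * (P[a][x] @ V - P[a][x, x] * V[x]) + R[a][x]
            vals.append(off / (1.0 - beta_l * P[a][x, x]))
        V[x] = min(vals)
    return V

# Hypothetical 3-cell chain, one control: a random walk with unit cost per
# step, so the fixed point is the constant vector 1 / (1 - beta_l).
P = {0: np.array([[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]])}
R = {0: np.ones(3)}
V = np.zeros(3)
for _ in range(500):
    gauss_seidel_sweep(V, P, R, beta_l=np.exp(-0.07))
print(V)
```

Used alone this sweep already converges; in the multi-grid cycle only a few sweeps per level are needed, with the remaining error reduced by the coarse-grid correction.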
Also, when working with empirical information, the relationship between the matrices A_l^{a_l} on the various grids is based on observation of a process, which implies that coarse-grid corrections do not always correct the equation on the finest grid (especially in the early stages of learning). However, using the observed transition matrices A_l^{a_l} on the coarse grids saves the computing time which would otherwise be needed to calculate these matrices by the Galerkin product (see Hackbusch [4], 1985).

6 Simulation with Precomputed Transitions

Consider a homogeneous server problem with two servers holding data (x_1, x_2) \in [0,1] \times [0,1]. Two independent data streams arrive, one at each server. A controller has to decide to which server to route. The modeling equation for the stream shall be

dx = b(x, u)\,dt + \sigma(x)\,dw,  u \in \{1, 2\},   (23)

with constant drift vectors b(x, 1), b(x, 2) and constant diffusion matrix \sigma as specified in (24) and (25).

The boundaries at x_1 = 0 and x_2 = 0 are reflecting. The exceeding data on either server (x_1, x_2 > 1) is rejected from the system and penalized with g(x_1, 1) = g(1, x_2) = 10, g = 0 otherwise. The objective of the control policy shall be to minimize

\mathbb{E} \int_0^\infty e^{-\beta t} (x_1(t) + x_2(t) + g(x_1, x_2))\,dt.

The plots of the value function show that in the case of high load (i.e. x_1, x_2 close to 1) a maximum of the cost is attained. It is therefore cheaper to overload a server and pay the penalty than to stay close to the diagonal, as is optimal in the low-load case.

For simulation we used precomputed (i.e. converged heuristic) transition probabilities to test the multi-grid performance. The discount \beta was set to 0.7. The multi-grid algorithm reduces the error in each iteration by a factor of 0.21, using 5 grid levels, a V-cycle, and two smoothing iterations on the coarsest grid. For comparison, the iteration on the finest grid converges with a reduction factor of 0.63.
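For comparison with such reduction factors, the single-grid baseline is plain fixed-point (value) iteration on the discrete Bellman equation (22). A minimal sketch on a hypothetical two-cell, two-control system (not the server problem above):

```python
import numpy as np

def solve_discrete_bellman(P, R, beta, dt, tol=1e-10, max_iter=10000):
    # Fixed-point iteration for min_a {(beta_l P^a - I) V / dt + R^a / dt} = 0,
    # rewritten as V = min_a { beta_l P^a V + R^a } with beta_l = exp(-beta*dt).
    # P: dict control -> row-stochastic matrix, R: dict control -> cost vector.
    beta_l = np.exp(-beta * dt)
    n = next(iter(P.values())).shape[0]
    V = np.zeros(n)
    for _ in range(max_iter):
        V_new = np.min([beta_l * P[a] @ V + R[a] for a in P], axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical system: control 0 is free and drives both cells to cell 0;
# control 1 costs 1 per step, so the optimal value function is zero.
P = {0: np.array([[1.0, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, 1.0], [0.0, 1.0]])}
R = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 1.0])}
V = solve_discrete_bellman(P, R, beta=0.7, dt=0.1)
print(V)
```

Each iteration contracts the error only by beta_l = exp(-beta*dt), which approaches 1 as dt shrinks; this is the slow convergence that the multi-grid cycle accelerates.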
\n\n7 Discussion \n\nWe have given a condition for sampling controlled diffusion processes such that \nthe value functions will converge while the discretization tends to zero. Rigorous \nnumerical methods can now be applied to reinforcement learning algorithms in \ncontinuous-time, continuous-state as is demonstrated with a multi-grid algorithm \nfor the HJB-equation. Ongoing work is directed towards adaptive grid refinement \nalgorithms and application to systems that include hysteresis. \n\n\fMulti-Grid Methodsfor Reinforcement Leaming in Diffusion Processes \n\n1039 \n\nFigure 2: Contour plots of the predicted reward in a homogeneous server problem with \nnonlinear costs are shown on different grid levels. On the coarsest 4 x 4 grid a sampling rate \nof one second is used with 9-point-star transition matrices. At the finest grid (64 x 64) a \nsampling rate of t second is used with observation on 81-point-stars. Inside the egg-shaped \narea the value function assumes its maximum. \n\nReferences \n\n[lJ A. Barto, S. Bradtke, S. Singh. Learning to Act using Real-Time Dynamic Pro(cid:173)\ngramming, AI Journal on Computational Theories of Interaction and Agency, \n1993. \n\n[2J M. BloB and R. Hoppe. Numerical Computation of the Value Function of Op(cid:173)\ntimally Controlled Stochastic Switching Processes by Multi-Grid Techniques, \nNumer Funct Anal And Optim 10(3+4), 275-304, 1989. \n\n[3] S. Pareigis. Lernen der Lasung der Bellman-Gleichung durch Beobachtung von \n\nkontinuierlichen Prozessen, PhD Thesis, 1996. \n\n[4J W. Hackbusch. Multi-Grid Methods and Applications, Springer-Verlag, 1985. \n\n[5] H. Kushner and P. Dupuis. Numerical Methods for Stochastic Control Prob(cid:173)\n\nlems in Continuous Time, Springer-Verlag, 1992. \n\n\f", "award": [], "sourceid": 1273, "authors": [{"given_name": "Stephan", "family_name": "Pareigis", "institution": null}]}