{"title": "Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice", "book": "Advances in Neural Information Processing Systems", "page_first": 1068, "page_last": 1074, "abstract": null, "full_text": "Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice\n\nDirk Ormoneit\nDepartment of Computer Science\nStanford University\nStanford, CA 94305-9010\normoneit@cs.stanford.edu\n\nPeter Glynn\nEESOR\nStanford University\nStanford, CA 94305-4023\n\nAbstract\n\nMany approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. By contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem.\n\n1 Introduction\n\nTemporal-difference (TD) learning has been applied successfully to many real-world applications that can be formulated as discrete-state Markov Decision Processes (MDPs) with unknown transition probabilities. If the state variables are continuous or high-dimensional, the TD learning rule is typically combined with some sort of function approximator - e.g. a linear combination of feature vectors or a neural network - which may well lead to numerical instabilities (see, for example, [BM95, TR96]). 
Specifically, the algorithm may fail to converge under several circumstances, which, in the authors' opinion, is one of the main obstacles to a more widespread use of reinforcement learning (RL) in industrial applications. As a remedy, we adopt a non-parametric perspective on reinforcement learning in this work and suggest a new algorithm that always converges to a unique solution in a finite number of steps. In detail, we assign value function estimates to the states in a sample trajectory and update these estimates in an iterative procedure. The updates are based on local averaging using a so-called \"weighting kernel\". Besides numerical stability, a second crucial advantage of this algorithm is that additional training data always improve the quality of the approximation and eventually lead to optimal performance - that is, our algorithm is consistent in a statistical sense. To the authors' best knowledge, this is the first reinforcement learning algorithm for which consistency has been demonstrated in a continuous-space framework. Specifically, the recently advocated \"direct\" policy search or perturbation methods can by construction at most be optimal in a local sense [SMSM00, VRK00].\n\nRelevant earlier work on local averaging in the context of reinforcement learning includes [Rus97] and [Gor99]. While these papers pursue related ideas, their approaches differ fundamentally from ours in the assumption that the transition probabilities of the MDP are known and can be used for learning. By contrast, kernel-based reinforcement learning only relies on sample trajectories of the MDP and is therefore much more widely applicable in practice. While our method addresses both discounted- and average-cost problems, we focus on average costs here and refer the reader interested in discounted costs to [OS00]. 
For brevity, we also defer technical details and proofs to an accompanying paper [OG00]. Note that average-cost reinforcement learning has been discussed by several authors (e.g. [TR99]).\n\nThe remainder of this work is organized as follows. In Section 2 we provide basic definitions and describe the kernel-based reinforcement learning algorithm. Section 3 focuses on the practical implementation of the algorithm and on theoretical issues. Sections 4 and 5 present our experimental results and conclusions.\n\n2 Kernel-Based Reinforcement Learning\n\nConsider an MDP defined by a sequence of states X_t taking values in R^d, a sequence of actions a_t taking values in A = {1, 2, ..., M}, and a family of transition kernels {P_a(x, B) | a in A} characterizing the conditional probability of the event X_t in B given X_{t-1} = x and a_{t-1} = a. The cost function c(x, a) represents an immediate penalty for applying action a in state x. Strategies, policies, or controls are understood as mappings of the form μ : R^d -> A, and we let P_{x,μ} denote the probability distribution governing the Markov chain starting from X_0 = x associated with the policy μ. Several regularity conditions are listed in detail in [OG00].\n\nOur goal is to identify policies that are optimal in that they minimize the long-run average cost η_μ ≡ lim_{T->∞} E_{x,μ}[(1/T) Σ_{t=0}^{T-1} c(X_t, μ(X_t))]. An optimal policy, μ*, can be characterized as a solution to the Average-Cost Optimality Equation (ACOE):\n\nη* + h*(x) = min_a { c(x, a) + (Γ_a h*)(x) },   (1)\nμ*(x) = argmin_a { c(x, a) + (Γ_a h*)(x) },   (2)\n\nwhere η* is the minimum average cost and h*(x) has an interpretation as the differential value of starting in x as opposed to drawing a random starting position from the stationary distribution under μ*. 
Γ_a denotes the conditional expectation operator (Γ_a h)(x) ≡ E_{x,a}[h(X_1)], which is assumed to be unknown so that (1) cannot be solved explicitly. Instead, we simulate the MDP using a fixed proposal strategy μ̃ in reinforcement learning to generate a sample trajectory as training data. Formally, let S ≡ {z_0, ..., z_m} denote such an m-step sample trajectory and let A ≡ {a_0, ..., a_{m-1} | a_s = μ̃(z_s)} and C ≡ {c(z_s, a_s) | 0 ≤ s < m} be the sequences of actions and costs associated with S. Then our objective can be reformulated as the approximation of μ* based on information in S, A, and C. In detail, we will construct an approximate expectation operator, Γ̂_{m,a}, based on the training data, S, and use this approximation in place of the true operator Γ_a in this work. Formally substituting Γ̂_{m,a} for Γ_a in (1) and (2) gives the Approximate Average-Cost Optimality Equation (AACOE):\n\nη̂_m + ĥ_m(x) = min_a { c(x, a) + (Γ̂_{m,a} ĥ_m)(x) },   (3)\nμ̂_m(x) = argmin_a { c(x, a) + (Γ̂_{m,a} ĥ_m)(x) }.   (4)\n\nNote that, if the solutions η̂_m and ĥ_m to (3) are well-defined, they can be interpreted as statistical estimates of η* and h* in equation (1). However, η̂_m and ĥ_m need not exist unless Γ̂_{m,a} is defined appropriately. We therefore employ local averaging in this work to construct Γ̂_{m,a} in a way that guarantees the existence of a unique fixed point of (3). For the derivation of the local averaging operator, note that the task of approximating (Γ_a h)(x) = E_{x,a}[h(X_1)] can be interpreted alternatively as a regression of the \"target\" variable h(X_1) onto the \"input\" X_0 = x. So-called kernel smoothers address regression tasks of this sort by locally averaging the target values in a small neighborhood of x. 
This gives the following approximation:\n\n(Γ̂_{m,a} h)(x) ≡ Σ_{s=0}^{m-1} k_{m,a}(z_s, x) h(z_{s+1}),   (5)\n\nk_{m,a}(z_s, x) ≡ φ_b(||z_s - x||) 1{a_s = a} / Σ_{s'=0}^{m-1} φ_b(||z_{s'} - x||) 1{a_{s'} = a},   (6)\n\nwhere φ_b denotes a Gaussian of width b. In detail, we employ the weighting function or weighting kernel k_{m,a}(z_s, x) in (6) to determine the weights that are used for averaging in equation (5). Here k_{m,a}(z_s, x) is a multivariate Gaussian, normalized so as to satisfy the constraints k_{m,a}(z_s, x) > 0 if a_s = a, k_{m,a}(z_s, x) = 0 if a_s ≠ a, and Σ_{s=0}^{m-1} k_{m,a}(z_s, x) = 1. Intuitively, (5) assesses the future differential cost of applying action a in state x by looking at all times in the training data where a has been applied previously in a state similar to x, and by averaging the current differential value estimates at the outcomes of these previous transitions. Because the weights k_{m,a}(z_s, x) are related inversely to the distance ||z_s - x||, transitions originating in the neighborhood of x are most influential in this averaging procedure. A more statistical interpretation of (5) would suggest that ideally we could simply generate a large number of independent samples from the conditional distribution P_{x,a} and estimate E_{x,a}[h(X_1)] using Monte Carlo approximation. Practically speaking, this approach is clearly infeasible because in order to assess the value of the simulated successor states we would need to sample recursively, thereby incurring exponentially increasing computational complexity. A more realistic alternative is to estimate Γ̂_{m,a} h(x) as a local average of the rewards that were generated in previous transitions originating in the neighborhood of x, where the membership of an observation z_s in the neighborhood of x is quantified using k_{m,a}(z_s, x). Here the regularization parameter b determines the width of the Gaussian kernel and thereby also the size of the neighborhood used for averaging. Depending on the application, it may be advisable to choose b either fixed or as a location-dependent function of the training data. 
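To make the local-averaging construction concrete, the following sketch implements a normalized Gaussian weighting kernel and the approximate expectation operator (5) in Python with NumPy. The function names and the exact Gaussian form are illustrative assumptions, and the sketch assumes that action a occurs at least once in the trajectory so that the normalizing sum is nonzero.

```python
import numpy as np

def kernel_weights(states, actions, x, a, b=1.0):
    """Normalized Gaussian weights k_{m,a}(z_s, x) over an m-step trajectory.

    `states` holds z_0, ..., z_{m-1}; `actions` holds a_0, ..., a_{m-1};
    `b` is the bandwidth of the Gaussian.  Weights are zero wherever
    a_s != a and sum to one over the remaining transitions.
    """
    d2 = np.sum((states - x) ** 2, axis=1)           # squared distances ||z_s - x||^2
    w = np.exp(-d2 / (2.0 * b ** 2)) * (actions == a)  # Gaussian, masked by action
    return w / w.sum()                               # normalize to sum to one

def approx_expectation(h_next, weights):
    """(Gamma-hat_{m,a} h)(x): local average of h at the observed successors."""
    return float(weights @ h_next)
```

Since the weights are a convex combination, the estimate always lies between the smallest and largest value of h among the matching successor states, which is what guarantees the boundedness exploited later in the convergence argument.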
\n\n3 \"Self-Approximating Property\"\n\nAs we illustrated above, kernel-based reinforcement learning formally amounts to substituting the approximate expectation operator Γ̂_{m,a} for Γ_a and then applying dynamic programming to derive solutions to the approximate optimality equation (3). In this section, we outline a practical implementation of this approach and present some of our theoretical results. In particular, we consider the relative value iteration algorithm for average-cost MDPs that is described, for example, in [Ber95]. This procedure iterates a variant of equation (1) to generate a sequence of value function estimates, h^k_m, that eventually converge to a solution of (1) (or (3), respectively). An important practical problem in continuous-state MDPs is that the intermediate functions h^k_m need to be represented explicitly on a computer. This requires some form of function approximation, which may be numerically undesirable and computationally burdensome in practice. In the case of kernel-based reinforcement learning, the so-called \"self-approximating\" property allows for a much more efficient implementation in vector format (see also [Rus97]). Specifically, because our definition of Γ̂_{m,a} h in (5) only depends on the values of h at the states in S, the AACOE (3) can be solved in two steps:\n\nη̂_m + ĥ_m(z_i) = min_a { c(z_i, a) + (Γ̂_{m,a} ĥ_m)(z_i) }   for z_i in S,   (7)\nĥ_m(x) = min_a { c(x, a) + (Γ̂_{m,a} ĥ_m)(x) } - η̂_m   for arbitrary x.   (8)\n\nIn other words, we first determine the values of ĥ_m at the points in S using (7) and then compute the values at new locations x in a second step using (8). Note that (7) is a finite equation system by contrast to (3). By introducing the vectors and matrices h^k(i) ≡ ĥ^k_m(z_i), c_a(i) ≡ c(z_i, a), Φ_a(i, j) ≡ k_{m,a}(z_j, z_i) for i = 1, ..., m and j = 1, ..., m, the relative value iteration algorithm can thus be written conveniently as (for details, see [Ber95, OG00]):\n\nh^{k+1} := h̃^{k+1} - h̃^{k+1}(1) · 1,   where   h̃^{k+1} ≡ min_a { c_a + Φ_a h^k }. 
\n\n(9)\n\nHence we end up with an algorithm that is analogous to value iteration except that we use the weighting matrix Φ_a in place of the usual transition probabilities, and h^k and c_a are vectors indexed by the points in the training set S as opposed to vectors of states. Intuitively, (9) assigns value estimates to the states in the sample trajectory and updates these estimates in an iterative fashion. Here the update of each state is based on a local average over the costs and values of the samples in its neighborhood. Since Φ_a(i, j) > 0 and Σ_{j=1}^{m} Φ_a(i, j) = 1, we can further exploit the analogy between (9) and the usual value iteration in an \"artificial\" MDP with transition probabilities Φ_a to prove the following theorem:\n\nTheorem 1 The relative value iteration (9) converges to a unique fixed point.\n\nFor details, the reader is referred to [OS00, OG00]. Note that Theorem 1 illustrates a rather unique property of kernel-based reinforcement learning by comparison to alternative approaches. In addition, we can show that - under suitable regularity conditions - kernel-based reinforcement learning is consistent in the following sense:\n\nTheorem 2 The approximate optimal cost η̂_m converges to the true optimal cost η* in the sense that\n\nE_{x_0,μ̃} |η̂_m - η*| -> 0   as m -> ∞.\n\nAlso, the true cost of the approximate strategy μ̂_m converges to the optimal cost.\n\nHence μ̂_m performs as well as μ* asymptotically and we can also predict the optimal cost η* using η̂_m. From a practical standpoint, Theorem 2 asserts that the performance of approximate dynamic programming can be improved by increasing the amount of training data. Note, however, that the computational complexity of approximate dynamic programming depends on the sample size m. In detail, the complexity of a single application of (9) is O(m^2) in a naive implementation and O(m log m) in a more elaborate nearest-neighbor approach. 
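As a concrete sketch of the vector-format iteration (9), the following Python/NumPy fragment assumes the weighting matrices Φ_a and cost vectors c_a have already been precomputed from the sample trajectory; the array layout and function name are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

def relative_value_iteration(Phi, c, n_iter=200):
    """Iterate h <- min_a {c_a + Phi_a h}, re-centered at the first sample state.

    Phi: array of shape (M, m, m) holding one row-stochastic weighting matrix
         per action (standing in for transition probabilities);
    c:   array of shape (M, m) with c_a(i) = c(z_i, a).
    Returns the average-cost estimate eta and the differential values h.
    """
    M, m, _ = Phi.shape
    h = np.zeros(m)
    for _ in range(n_iter):
        q = c + Phi @ h            # shape (M, m): one backup per action
        h_tilde = q.min(axis=0)    # minimize over actions at every sample state
        eta = h_tilde[0]           # value at the reference state
        h = h_tilde - eta          # subtract it to keep the iterates bounded
    return eta, h
```

Each sweep costs O(m^2) per action in this dense form, matching the complexity figure quoted above for the naive implementation.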
This complexity issue prevents the use of very large data sets with the \"exact\" algorithm described above. As in the case of parametric reinforcement learning, we can of course restrict ourselves to a fixed amount of computational resources simply by discarding observations from the training data or by summarizing clusters of data using \"sufficient statistics\". Note that the convergence property in Theorem 1 remains unaffected by such an approximation.\n\n4 Optimal Portfolio Choice\n\nIn this section, we describe the practical application of kernel-based reinforcement learning to an investment problem where an agent in a financial market decides whether to buy or sell stocks depending on the market situation. In the finance and economics literature, this task is known as \"optimal portfolio choice\" and has created an enormous literature over the past decades. Formally, let S_t symbolize the value of the stock at time t and let the investor choose her portfolio a_t from the set A ≡ {0, 0.1, 0.2, ..., 1}, corresponding to the relative amount of wealth invested in stocks as opposed to an alternative riskless asset. At time t + 1, the stock price changes from S_t to S_{t+1}, and the portfolio of the investor participates in the price movement depending on her investment choice. Formally, if her wealth at time t is W_t, it becomes W_{t+1} = (1 + a_t (S_{t+1} - S_t)/S_t) W_t at time t + 1. To render this simulation as realistic as possible, our investor is assumed to be risk-averse in that her fear of losses dominates her appreciation of gains of equal magnitude. A standard way to express these preferences formally is to aim at maximizing the expectation of a concave \"utility function\", U(z), of the final wealth W_T. Using the choice U(z) = log(z), the investor's utility can be written as U(W_T) = Σ_{t=0}^{T-1} log(1 + a_t (S_{t+1} - S_t)/S_t). 
Hence utilities are additive over time, and the objective of maximizing E[U(W_T)] can be stated in an average-cost framework where c(x, a) = E_{x,a}[log(1 + a (S_{t+1} - S_t)/S_t)].\n\nWe present results using simulated and real stock prices. With regard to the simulated data, we adopt the common assumption in the finance literature that stock prices are driven by an Ito process with stochastic, mean-reverting volatility:\n\ndS_t = μ S_t dt + √V_t S_t dB_t,\ndV_t = φ(ν - V_t) dt + ρ √V_t dB̃_t.\n\nHere V_t is the time-varying volatility, and B_t and B̃_t are independent Brownian motions. The parameters of the model are μ = 1.03, ν = 0.3, φ = 10.0, and ρ = 5.0. We simulated daily data for a period of 13 years using the usual Euler approximation of these equations. The resulting stock prices, volatilities, and returns are shown in Figure 1.\n\nFigure 1: The simulated time series of stock prices (left), volatility (middle), and daily returns (right; r_t ≡ log(S_t/S_{t-1})) over a period of one year.\n\nNext, we grouped the simulated time series into 10 sets of training and test data such that the last 10 years are used as 10 test sets and the three years preceding each test year are used as training data. Table 1 reports the training and test performances on each of these experiments using kernel-based reinforcement learning and a benchmark buy & hold strategy.\n\nYear   RL Training   RL Test     B&H Training   B&H Test\n4      0.129753      0.096555    0.058819       0.052533\n5      0.125742      0.107905    0.043107       0.081395\n6      0.100265      -0.074588   0.053755       -0.064981\n7      0.059405      0.201186    0.018023       0.172968\n8      0.082622      0.227161    0.041410       0.197319\n9      0.077856      0.098172    0.074632       0.092312\n10     0.136525      0.199804    0.137416       0.194993\n11     0.145992      0.121507    0.147065       0.118656\n12     0.126052      -0.018110   0.125978       -0.017869\n13     0.127900      -0.022748   0.077196       -0.029886\n\nTable 1: Investment performance on the simulated data (initial wealth W_0 = 100).\n\nPerformance is measured using the Sharpe ratio, which is a standard measure of risk-adjusted investment performance. In detail, the Sharpe ratio is defined as SR = log(W_T/W_0)/σ̂, where σ̂ is the standard deviation of log(W_t/W_{t-1}) over time. Note that large values indicate good risk-adjusted performance in years of positive growth, whereas negative values cannot readily be interpreted. We used the root of the volatility (standardized to zero mean and unit variance) as input information and determined a suitable choice for the bandwidth parameter (b = 1) experimentally. Our results in Table 1 demonstrate that reinforcement learning dominates buy & hold in eight out of ten years on the training set and in all seven years with positive growth on the test set.\n\nTable 2 shows the results of an experiment where we replaced the artificial time series with eight years of daily German stock index data (DAX index, 1993-2000). We used the years 1996-2000 as test data and the three years preceding each test year for training. As the model input, we computed an approximation of the (root-)volatility using a geometric average of historical returns. Note that the training performance of reinforcement learning always dominates the buy & hold strategy, and the test results are also superior to the benchmark except in the year 2000. 
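The wealth recursion and the Sharpe-ratio criterion used in these experiments can be sketched as follows; the function names are illustrative, and the formulas simply restate the definitions W_{t+1} = (1 + a_t (S_{t+1} - S_t)/S_t) W_t and SR = log(W_T/W_0)/σ̂ given above.

```python
import math
import statistics

def wealth_path(prices, allocations, w0=100.0):
    """W_{t+1} = (1 + a_t * (S_{t+1} - S_t) / S_t) * W_t."""
    wealth = [w0]
    for t, a in enumerate(allocations):
        ret = (prices[t + 1] - prices[t]) / prices[t]  # per-period stock return
        wealth.append(wealth[-1] * (1.0 + a * ret))
    return wealth

def sharpe_ratio(wealth):
    """SR = log(W_T / W_0) / sigma, with sigma the standard deviation of
    the per-period log-returns log(W_t / W_{t-1})."""
    log_returns = [math.log(b / a) for a, b in zip(wealth, wealth[1:])]
    return math.log(wealth[-1] / wealth[0]) / statistics.stdev(log_returns)
```

Note that log(W_T/W_0) is exactly the sum of the per-period log-returns, which is the additivity property that lets the log-utility objective be cast in the average-cost framework.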
\n\nYear   RL Training   RL Test     B&H Training   B&H Test\n1996   0.083925      0.173373    0.038818       0.120107\n1997   0.119875      0.121583    0.119875       0.096369\n1998   0.123927      0.079584    0.096183       0.035204\n1999   0.141242      0.094807    0.035137       0.090541\n2000   0.085236      -0.007878   0.081271       0.148203\n\nTable 2: Investment performance on the DAX data.\n\n5 Conclusions\n\nWe presented a new, kernel-based reinforcement learning method that overcomes several important shortcomings of temporal-difference learning in continuous-state domains. In particular, we demonstrated that the new approach always converges to a unique approximation of the optimal policy and that the quality of this approximation improves with the amount of training data. Also, we described a financial application where our method consistently outperformed a benchmark model in an artificial and a real market scenario. While the optimal portfolio choice problem is relatively simple, it provides an impressive proof of concept by demonstrating the practical feasibility of our method. Efficient implementations of local averaging for large-scale problems have been discussed in the data mining community. Our work makes these methods applicable to reinforcement learning, which should be valuable to meet the real-time and dimensionality constraints of real-world problems.\n\nAcknowledgements. The work of Dirk Ormoneit was partly supported by the Deutsche Forschungsgemeinschaft. Saunak Sen helped with valuable discussions and suggestions.\n\nReferences\n\n[Ber95] D. P. Bertsekas. Dynamic Programming and Optimal Control, volumes 1 and 2. Athena Scientific, 1995.\n\n[BM95] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS 7, 1995.\n\n[Gor99] G. Gordon. 
Approximate Solutions to Markov Decision Processes. PhD thesis, Computer Science Department, Carnegie Mellon University, 1999.\n\n[OG00] D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average-cost problems. Working paper, Stanford University. In preparation.\n\n[OS00] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 2001. To appear.\n\n[Rus97] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65(3):487-516, 1997.\n\n[SMSM00] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS 12, 2000.\n\n[TR96] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic programming. Machine Learning, 22:59-94, 1996.\n\n[TR99] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799-1808, 1999.\n\n[VRK00] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In NIPS 12, 2000.\n", "award": [], "sourceid": 1849, "authors": [{"given_name": "Dirk", "family_name": "Ormoneit", "institution": null}, {"given_name": "Peter", "family_name": "Glynn", "institution": null}]}