Reinforcement Learning for Trading

Part of Advances in Neural Information Processing Systems 11 (NIPS 1998)


John Moody, Matthew Saffell


We propose to train trading systems by optimizing financial objective functions via reinforcement learning. The performance functions that we consider are profit or wealth, the Sharpe ratio and our recently proposed differential Sharpe ratio for online learning. In Moody & Wu (1997), we presented empirical results that demonstrate the advantages of reinforcement learning relative to supervised learning. Here we extend our previous work to compare Q-Learning to our Recurrent Reinforcement Learning (RRL) algorithm. We provide new simulation results that demonstrate the presence of predictability in the monthly S&P 500 Stock Index for the 25-year period 1970 through 1994, as well as a sensitivity analysis that provides economic insight into the trader's structure.

1 Introduction: Reinforcement Learning for Trading

The investor's or trader's ultimate goal is to optimize some relevant measure of trading system performance, such as profit, economic utility or risk-adjusted return. In this paper, we propose to use recurrent reinforcement learning to directly optimize such trading system performance functions, and we compare two different reinforcement learning methods. The first, Recurrent Reinforcement Learning, uses immediate rewards to train the trading systems, while the second (Q-Learning (Watkins 1989)) approximates discounted future rewards. These methodologies can be applied to optimizing systems designed to trade a single security or to trade portfolios. In addition, we propose a novel value function for risk-adjusted return that enables learning to be done online: the differential Sharpe ratio.
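The differential Sharpe ratio admits a simple incremental update. The sketch below follows the update form given in Moody & Wu (1997), using exponential moving estimates A and B of the first and second moments of the trading returns; the function and variable names are ours, not the paper's.

```python
def differential_sharpe_update(R_t, A_prev, B_prev, eta=0.01):
    """One online update step for the differential Sharpe ratio.

    A and B are exponential moving estimates of the first and second
    moments of the returns R_t, with adaptation rate eta.  Sketch based
    on the update form in Moody & Wu (1997); names are illustrative.
    """
    dA = R_t - A_prev           # innovation in the first moment
    dB = R_t ** 2 - B_prev      # innovation in the second moment
    denom = (B_prev - A_prev ** 2) ** 1.5
    # D_t measures the marginal impact of the return R_t on the Sharpe
    # ratio, making it usable as an immediate reward for online learning.
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / denom if denom > 0 else 0.0
    A_t = A_prev + eta * dA
    B_t = B_prev + eta * dB
    return D_t, A_t, B_t
```

Because D_t depends only on the current return and the two running moments, the trader can be adapted after every period rather than over a whole batch of returns.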

Trading system profits depend upon sequences of interdependent decisions, and are thus path-dependent. Optimal trading decisions, when the effects of transaction costs, market impact and taxes are included, require knowledge of the current system state. In Moody, Wu, Liao & Saffell (1998), we demonstrate that reinforcement learning provides a more elegant and effective means for training trading systems when transaction costs are included than do more standard supervised approaches.

• The authors are also with Nonlinear Prediction Systems.


Though much theoretical progress has been made in recent years in the area of reinforcement learning, there have been relatively few successful, practical applications of the techniques. Notable examples include Neurogammon (Tesauro 1989), the asset trader of Neuneier (1996), an elevator scheduler (Crites & Barto 1996) and a space-shuttle payload scheduler (Zhang & Dietterich 1996).

In this paper we present results for reinforcement learning trading systems that outperform the S&P 500 Stock Index over a 25-year test period, thus demonstrating the presence of predictable structure in US stock prices. The reinforcement learning algorithms compared here include our new recurrent reinforcement learning (RRL) method (Moody & Wu 1997, Moody et al. 1998) and Q-Learning (Watkins 1989).

2 Trading Systems and Financial Performance Functions

2.1 Structure, Profit and Wealth for Trading Systems

We consider performance functions for systems that trade a single security with price series z_t. The trader is assumed to take only long, neutral or short positions F_t ∈ {−1, 0, 1} of constant magnitude. The constant-magnitude assumption can be easily relaxed to enable better risk control. The position F_t is established or maintained at the end of each time interval t, and is re-assessed at the end of period t + 1. A trade is thus possible at the end of each time period, although nonzero trading costs will discourage excessive trading. A trading system return R_t is realized at the end of the time interval (t − 1, t] and includes the profit or loss resulting from the position F_{t−1} held during that interval and any transaction cost incurred at time t due to a difference in the positions F_{t−1} and F_t.

In order to properly incorporate the effects of transaction costs, market impact and taxes in a trader's decision making, the trader must have internal state information and must therefore be recurrent. An example of a single-asset trading system that takes into account transaction costs and market impact has the following decision function:

F_t = F(θ_t; F_{t−1}, I_t) with I_t = {z_t, z_{t−1}, z_{t−2}, ...; y_t, y_{t−1}, y_{t−2}, ...}

where θ_t denotes the (learned) system parameters at time t and I_t denotes the information set at time t, which includes present and past values of the price series z_t and an arbitrary number of other external variables denoted y_t. Trading systems can be optimized by maximizing performance functions U() such as profit, wealth, utility functions of wealth or performance ratios like the Sharpe ratio. The simplest and most natural performance function for a risk-insensitive trader is profit. The transaction cost rate is denoted δ.
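The key property of the decision function above is that F_{t−1} feeds back into the computation of F_t, so the trader "knows" its current position when deciding the next one. A minimal sketch of such a recurrent decision function is given below; the linear-tanh parameterization is our illustrative choice, not necessarily the exact trader used in the paper, and the output in [−1, 1] can be thresholded to the discrete positions {−1, 0, 1}.

```python
import numpy as np

def recurrent_trader_position(theta, F_prev, info):
    """Recurrent decision function F_t = F(theta; F_{t-1}, I_t).

    `info` is a feature vector built from recent price returns and
    external variables (the information set I_t).  The previous
    position F_prev enters as a recurrent input, which is what lets
    transaction costs influence the learned behavior.  The linear-tanh
    form is an illustrative assumption.
    """
    w, u, b = theta  # input weights, recurrent weight, bias (hypothetical names)
    activation = np.dot(w, info) + u * F_prev + b
    return np.tanh(activation)

# Usage: the position evolves as a function of both market inputs and
# the prior position, making the system path-dependent.
theta = (np.array([0.5, -0.2]), 0.8, 0.0)
F = 0.0
for x in [np.array([0.01, 0.02]), np.array([-0.03, 0.01])]:
    F = recurrent_trader_position(theta, F, x)
```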

Additive profits are appropriate to consider if each trade is for a fixed number of shares or contracts of security z_t. This is often the case, for example, when trading small futures accounts or when trading standard US$ FX contracts in dollar-denominated foreign currencies. With the definitions r_t = z_t − z_{t−1} and r^f_t = z^f_t − z^f_{t−1} for the price returns of a risky (traded) asset and a risk-free asset (like T-Bills) respectively, the additive profit accumulated over T time periods with trading position size μ > 0 is then defined as:
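The definitions above can be assembled into a per-period return and an accumulated profit. The per-period return form used in the sketch below, with a risk-free leg for the uninvested case and a cost term proportional to the position change, is our reading of this section's definitions rather than a quotation of the paper's equation.

```python
def additive_profit(prices, rf_returns, positions, mu=1.0, delta=0.001):
    """Accumulated additive profit over T periods (illustrative sketch).

    Assumed per-period return, built from the section's definitions:
        R_t = mu * (rf_t + F_{t-1} * (r_t - rf_t) - delta * |F_t - F_{t-1}|)
    where r_t = z_t - z_{t-1} is the price change of the traded asset,
    rf_t the risk-free return, F_t in {-1, 0, 1} the position held,
    mu > 0 the fixed trade size, and delta the transaction cost rate.
    """
    total = 0.0
    F_prev = 0  # start flat
    for t in range(1, len(prices)):
        r_t = prices[t] - prices[t - 1]
        rf_t = rf_returns[t]
        F_t = positions[t]
        # Profit from the position held over (t-1, t], minus the cost
        # charged when the position changes at time t.
        R_t = mu * (rf_t + F_prev * (r_t - rf_t) - delta * abs(F_t - F_prev))
        total += R_t
        F_prev = F_t
    return total
```

Note how the |F_t − F_{t−1}| cost term couples successive decisions: a trader that flips position every period pays the cost repeatedly, which is exactly the path dependence that motivates a recurrent formulation.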