{"title": "Risk Sensitive Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1031, "page_last": 1037, "abstract": null, "full_text": "Learning a Continuous Hidden Variable Model for Binary Data\n\nDaniel D. Lee\nBell Laboratories, Lucent Technologies\nMurray Hill, NJ 07974\nddlee@bell-labs.com\n\nHaim Sompolinsky\nRacah Institute of Physics and Center for Neural Computation\nHebrew University\nJerusalem, 91904, Israel\nhaim@fiz.huji.ac.il\n\nAbstract\n\nA directed generative model for binary data using a small number of hidden continuous units is investigated. A clipping nonlinearity distinguishes the model from conventional principal components analysis. The relationships between the correlations of the underlying continuous Gaussian variables and the binary output variables are utilized to learn the appropriate weights of the network. The advantages of this approach are illustrated on a translationally invariant binary distribution and on handwritten digit images.\n\nIntroduction\n\nPrincipal Components Analysis (PCA) is a widely used statistical technique for representing data with a large number of variables [1]. It is based upon the assumption that although the data is embedded in a high dimensional vector space, most of the variability in the data is captured by a much lower dimensional manifold. In particular for PCA, this manifold is described by a linear hyperplane whose characteristic directions are given by the eigenvectors of the correlation matrix with the largest eigenvalues. The success of PCA and closely related techniques such as Factor Analysis (FA) and PCA mixtures clearly indicates that much real world data exhibits the low dimensional manifold structure assumed by these models [2, 3]. 
\n\nHowever, the linear manifold structure of PCA is not appropriate for data with binary valued variables. Binary values commonly occur in data such as computer bit streams, black-and-white images, on-off outputs of feature detectors, and electrophysiological spike train data [4]. The Boltzmann machine is a neural network model that incorporates hidden binary spin variables, and in principle, it should be able to model binary data with arbitrary spin correlations [5]. Unfortunately, the computational time needed for training a Boltzmann machine renders it impractical for most applications.\n\nIn these proceedings, we present a model that uses a small number of continuous hidden variables rather than hidden binary variables to capture the variability of binary valued visible data. The generative model differs from conventional PCA because it incorporates a clipping nonlinearity. The resulting spin configurations have an entropy related to the number of hidden variables used, and the resulting states are connected by small numbers of spin flips. The learning algorithm is particularly simple, and is related to PCA by a scalar transformation of the correlation matrix.\n\nFigure 1: Generative model for N-dimensional binary data using a small number P of continuous hidden variables.\n\nGenerative Model\n\nFigure 1 shows a schematic diagram of the generative process. As in PCA, the model assumes that the data is generated by a small number P of continuous hidden variables $y_i$. Each of the hidden variables is assumed to be drawn independently from a normal distribution with unit variance:\n\n$P(y_i) = \\exp(-y_i^2/2) / \\sqrt{2\\pi}.$ (1)\n\n
The continuous hidden variables are combined using the feedforward weights $W_{ij}$, and the N binary output units are then calculated using the sign of the feedforward activations:\n\n$x_i = \\sum_{j=1}^{P} W_{ij} y_j$ (2)\n\n$s_i = \\mathrm{sgn}(x_i).$ (3)\n\nSince binary data is commonly obtained by thresholding, it seems reasonable that a proper generative model should incorporate such a clipping nonlinearity. The generative process is similar to that of a sigmoidal belief network with continuous hidden units at zero temperature. The nonlinearity will alter the relationship between the correlations of the binary variables and the weight matrix W as described below.\n\nThe real-valued Gaussian variables $x_i$ are exactly analogous to the visible variables of conventional PCA. They lie on a linear hyperplane determined by the span of the matrix W, and their correlation matrix is given by:\n\n$C^{xx} = \\langle x x^T \\rangle = W W^T.$ (4)\n\nFigure 2: Binary spin configurations $s_i$ in the vector space of continuous hidden variables $y_j$ with P = 2 and N = 3.\n\nBy construction, the correlation matrix $C^{xx}$ has rank P, which is much smaller than the number of components N. 
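The generative process of Equations 1 to 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the helper name sample_clipped_gaussian, the randomly drawn weight matrix W, and the sample count are hypothetical choices, and the sizes N = 3, P = 2 are borrowed from Figure 2.

```python
import numpy as np

# Hypothetical helper: draw binary samples s = sgn(W y) with y ~ N(0, I_P).
def sample_clipped_gaussian(W, n_samples, rng):
    N, P = W.shape
    y = rng.standard_normal((n_samples, P))   # hidden variables, unit variance (Eq. 1)
    x = y @ W.T                                # feedforward activations (Eq. 2)
    s = np.where(x >= 0, 1, -1)                # clipping nonlinearity (Eq. 3)
    return y, x, s

rng = np.random.default_rng(0)
N, P = 3, 2                                    # sizes used in Figure 2
W = rng.standard_normal((N, P))                # illustrative weights, not learned
y, x, s = sample_clipped_gaussian(W, 50_000, rng)

C_xx_model = W @ W.T                           # model correlation matrix (Eq. 4)
C_xx_empirical = x.T @ x / len(x)              # sample estimate of <x x^T>
rank = np.linalg.matrix_rank(C_xx_model)       # equals P for generic W
```

Comparing C_xx_empirical with C_xx_model gives a quick sanity check that the sampled activations follow Equation 4, and rank confirms the low rank structure for this particular W.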
Now consider the binary output variables $s_i = \\mathrm{sgn}(x_i)$. Their correlations can be calculated from the probability distribution of the Gaussian variables $x_i$:\n\n$(C^{ss})_{ij} = \\langle s_i s_j \\rangle = \\int \\prod_k dy_k \\, P(y_k) \\, \\mathrm{sgn}(x_i) \\, \\mathrm{sgn}(x_j)$ (5)\n\nwhere\n\n(6)\n\nThe integrals in Equation 5 can be done analytically, and yield the surprisingly simple result:\n\n$(C^{ss})_{ij} = \\frac{2}{\\pi} \\sin^{-1}\\left[ \\frac{C^{xx}_{ij}}{\\sqrt{C^{xx}_{ii} C^{xx}_{jj}}} \\right].$ (7)\n\nThus, the correlations of the clipped binary variables $C^{ss}$ are related to the correlations of the corresponding Gaussian variables $C^{xx}$ through the nonlinear arcsine function. The normalization in the denominator of the arcsine argument reflects the fact that the sign function is unchanged by a scale change in the Gaussian variables.\n\nAlthough the correlation matrix $C^{ss}$ and the generating correlation matrix $C^{xx}$ are easily related through Equation 7, they have qualitatively very different properties. In general, the correlation matrix $C^{ss}$ will no longer have the low rank structure of $C^{xx}$. As illustrated by the translationally invariant example in the next section, the spectrum of $C^{ss}$ may contain a whole continuum of eigenvalues even though $C^{xx}$ has only a few nonzero eigenvalues.\n\nPCA is typically used for dimensionality reduction of real variables; can this model be used for compressing the binary outputs $s_i$? Although the output correlations $C^{ss}$ no longer display the low rank structure of the generating $C^{xx}$, a more appropriate measure of data compression is the entropy of the binary output states. Consider how many of the $2^N$ possible binary states will be generated by the clipping process. The equation $x_i = \\sum_j W_{ij} y_j = 0$ defines a (P - 1)-dimensional hyperplane in the P-dimensional state space of hidden variables $y_j$; these hyperplanes are shown as dashed lines in Figure 2. 
These hyperplanes partition the half-space where $s_i = +1$ from the region where $s_i = -1$. Each of the N spin variables will have such a dividing hyperplane in this P-dimensional state space, and all of these hyperplanes will generically be unique. Thus, the total number of spin configurations $s_i$ is determined by the number of cells bounded by N dividing hyperplanes in P dimensions. The number of such cells is approximately $N^P$ for $N \\gg P$, a well-known result from perceptrons [6]. To leading order for large N, the entropy of the binary states generated by this process is then given by $S = P \\log N$. Thus, the entropy of the spin configurations generated by this model is directly proportional to the number of hidden variables P.\n\nFigure 3: Translationally invariant binary spin distribution with N = 256 units. Representative samples from the distribution are illustrated on the left, while the eigenvalue spectra of $C^{ss}$ and $C^{xx}$ are plotted against eigenvalue rank on the right.\n\nHow is the topology of the binary spin configurations $s_i$ related to the PCA manifold structure of the continuous variables $x_i$? Each of the generated spin states is represented by a polytope cell in the P-dimensional vector space of hidden variables. Each polytope has at least P + 1 neighboring polytopes which are related to it by a single or small number of spin flips. 
Therefore, although the state space of binary spin configurations is discrete, the continuous manifold structure of the underlying Gaussian variables in this model is manifested as binary output configurations with low entropy that are connected with small Hamming distances.\n\nTranslationally Invariant Example\n\nIn principle, the weights W could be learned by applying maximum likelihood to this generative model; however, the resulting learning algorithm involves analytically intractable multi-dimensional integrals. Alternatively, approximations based upon mean field theory or importance sampling could be used to learn the appropriate parameters [7]. However, Equation 7 suggests a simple learning rule that is also approximate, but is much more computationally efficient [8]. First, the binary correlation matrix $C^{ss}$ is computed from the data. Then the empirical $C^{ss}$ is mapped into the appropriate Gaussian correlation matrix using the nonlinear transformation $C^{xx} = \\sin(\\pi C^{ss}/2)$. This results in a Gaussian correlation matrix where the variances of the individual $x_i$ are fixed at unity. The weights W are then calculated using the conventional PCA algorithm. The correlation matrix $C^{xx}$ is diagonalized, and the eigenvectors with the largest eigenvalues are used to form the columns of W to yield the best low rank approximation $C^{xx} \\approx W W^T$. Scaling the variables $x_i$ will result in a correlation matrix $C^{xx}$ with slightly different eigenvalues but with the same rank.\n\nThe utility of this transformation is illustrated by the following simple example. Consider the distribution of N = 256 binary spins shown in Figure 3. 
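The learning rule just described (compute the binary correlations, apply the sine transformation, then run ordinary PCA) can be sketched for an N = 256 spin distribution as follows. This is a minimal NumPy illustration under stated assumptions rather than the authors' implementation: the data set is assumed to consist of every cyclic shift of a single bump of N/2 positive spins, and the names learn_weights and bump, as well as the 1e-6 eigenvalue threshold, are hypothetical.

```python
import numpy as np

# Hypothetical helper: arcsine-corrected PCA.
# S has shape (samples, units) with entries +1 or -1; returns W of shape (units, P).
def learn_weights(S, P):
    C_ss = S.T @ S / S.shape[0]                 # empirical binary correlations
    C_xx = np.sin(np.pi * C_ss / 2.0)           # map to Gaussian correlations
    evals, evecs = np.linalg.eigh(C_xx)         # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:P]           # indices of the P largest
    W = evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))
    return W, C_ss, C_xx

N = 256
bump = np.where(np.arange(N) < N // 2, 1.0, -1.0)      # half the spins positive
S = np.stack([np.roll(bump, k) for k in range(N)])      # every cyclic shift once

W, C_ss, C_xx = learn_weights(S, P=2)
# Count eigenvalues that are clearly above numerical noise in each spectrum.
n_modes_ss = int(np.sum(np.linalg.eigvalsh(C_ss) > 1e-6))
n_modes_xx = int(np.sum(np.linalg.eigvalsh(C_xx) > 1e-6))
```

Comparing n_modes_ss with n_modes_xx shows how many modes each correlation matrix needs, and W @ W.T can be checked against C_xx to see how well the rank-2 approximation reproduces the transformed correlations in this idealised setting.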
Half of the spins are chosen to be positive, and the location of the positive bump is arbitrary under the periodic boundary conditions. Since the distribution is translationally invariant, the correlations $C^{ss}_{ij}$ depend only on the relative distance between spins $|i - j|$. The eigenvectors are the Fourier modes, and their eigenvalues correspond to their overlap with a triangle wave. The eigenvalue spectrum of $C^{ss}$ is plotted in Figure 3, sorted by rank. In this particular case, the correlation matrix $C^{ss}$ has N/2 positive eigenvalues with a corresponding range of values.\n\nNow consider the matrix $C^{xx} = \\sin(\\pi C^{ss}/2)$. The eigenvalues of $C^{xx}$ are also shown in Figure 3. In contrast to the many different eigenvalues of $C^{ss}$, the spectrum of the Gaussian correlation matrix $C^{xx}$ has only two positive eigenvalues, with all the rest exactly equal to zero. The corresponding eigenvectors are a cosine and a sine function. The generative process can thus be understood as a linear combination of the two eigenmodes to yield a sine function with arbitrary phase. This function is then clipped to yield the positive bump seen in the original binary distribution. In comparison with the eigenvalues of $C^{ss}$, the eigenvalue spectrum of $C^{xx}$ makes obvious the low rank structure of the generative process. In this case, the original binary distribution can be constructed using only P = 2 hidden variables, whereas it is not clear from the eigenvalues of $C^{ss}$ what the appropriate number of modes is. This illustrates the utility of determining the principal components from the calculated Gaussian correlation matrix $C^{xx}$ rather than working directly with the observable binary correlation matrix $C^{ss}$.\n\nHandwritten Digits Example\n\nThis model was also applied to a more complex data set. 
A large set of 16 x 16 black and white images of handwritten twos was taken from the US Post Office digit database [9]. The pixel means and pixel correlations were directly computed from the images. The generative model needs to be slightly modified to account for the non-zero means in the binary outputs. This is accomplished by adding fixed biases $\\xi_i$ to the Gaussian variables $x_i$ before clipping:\n\n$s_i = \\mathrm{sgn}(\\xi_i + x_i).$ (8)\n\nThe biases $\\xi_i$ can be related to the means of the binary outputs through the expression:\n\n$\\xi_i = \\sqrt{2 C^{xx}_{ii}} \\, \\mathrm{erf}^{-1}(\\langle s_i \\rangle).$ (9)\n\nThis allows the biases to be directly computed from the observed means of the binary variables. Unfortunately, with non-zero biases, the relationship between the Gaussian correlations $C^{xx}$ and binary correlations $C^{ss}$ is no longer the simple expression found in Equation 7. Instead, the correlations are related by the following integral equation:\n\n(10)\n\nFigure 4: Eigenvalue spectra of $C^{ss}$ and $C^{xx}$ for handwritten images of twos, plotted against eigenvalue rank. The inset shows the P = 16 most significant eigenvectors for $C^{xx}$ arranged by rows. The right side of the figure shows a nonlinear morph between two different instances of a handwritten two using these eigenvectors.\n\nGiven the empirical pixel correlations $C^{ss}$ for the handwritten digits, the integral in Equation 10 is numerically solved for each pair of indices to yield the appropriate Gaussian correlation matrix $C^{xx}$. The correlation matrices are diagonalized and the resulting eigenvalue spectra are shown in Figure 4. 
The eigenvalues for $C^{xx}$ again exhibit a characteristic drop that is steeper than the falloff in the spectrum of the binary correlations $C^{ss}$. The corresponding eigenvectors of $C^{xx}$ with the 16 largest positive eigenvalues are depicted in the inset of Figure 4. These eigenmodes represent common image distortions such as rotations and stretching and appear qualitatively similar to those found by the standard PCA algorithm.\n\nA generative model with weights W corresponding to the P = 16 eigenvectors shown in Figure 4 is used to fit the handwritten twos, and the utility of this nonlinear generative model is illustrated in the right side of Figure 4. The top and bottom images in the figure are two different examples of a handwritten two from the data set, and the generative model is used to morph between the two examples. The hidden values $y_i$ for the original images are first determined for the different examples, and the intermediate images in the morph are constructed by linearly interpolating in the vector space of the hidden units. Because of the clipping nonlinearity, this induces a nonlinear mapping in the outputs with binary units being flipped in a particular order as determined by the generative model. In contrast, morphing using conventional PCA would result in a simple linear interpolation between the two images, and the intermediate images would not look anything like the original binary distribution [10].\n\nThe correlation matrix $C^{xx}$ also happens to contain some small negative eigenvalues. Even though the binary correlation matrix $C^{ss}$ is positive definite, the transformation in Equation 10 does not guarantee that the resulting matrix $C^{xx}$ will also be positive definite. 
The presence of these negative eigenvalues indicates a shortcoming of the generative process for modelling this data. In particular, the clipped Gaussian model is unable to capture correlations induced by global constraints in the data. As a simple illustration of this shortcoming in the generative model, consider the binary distribution defined by the probability density $P(\\{s\\}) \\propto \\lim_{\\beta \\to \\infty} \\exp(-\\beta \\sum_{ij} s_i s_j)$. The states in this distribution are defined by the constraint that the sum of the binary variables is exactly zero: $\\sum_i s_i = 0$. Now, for $N \\ge 4$, it can be shown that it is impossible to find a Gaussian distribution whose visible binary variables match the negative correlations induced by this sum constraint.\n\nThese examples illustrate the value of using the clipped generative model to learn the correlation matrix of the underlying Gaussian variables rather than using the correlations of the outputs directly. The clipping nonlinearity is convenient because the relationship between the hidden variables and the output variables is particularly easy to understand. The learning algorithm differs from other nonlinear PCA models and autoencoders because the inverse mapping function need not be explicitly learned [11, 12]. Instead, the correlation matrix is directly transformed from the observable variables to the underlying Gaussian variables. The correlation matrix is then diagonalized to determine the appropriate feedforward weights. This results in an extremely efficient training procedure that is directly analogous to PCA for continuous variables.\n\nWe acknowledge the support of Bell Laboratories, Lucent Technologies, and the US-Israel Binational Science Foundation. We also thank H. S. 
Seung for helpful discussions.\n\nReferences\n\n[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag.\n[2] Bartholomew, DJ (1987). Latent variable models and factor analysis. London: Charles Griffin & Co. Ltd.\n[3] Hinton, GE, Dayan, P & Revow, M (1996). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks 8, 65-74.\n[4] Van Vreeswijk, C, Sompolinsky, H & Abeles, M (1999). Nonlinear statistics of spike trains. In preparation.\n[5] Ackley, DH, Hinton, GE & Sejnowski, TJ (1985). A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169.\n[6] Cover, TM (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Comput. 14, 326-334.\n[7] Tipping, ME (1999). Probabilistic visualisation of high-dimensional binary data. Advances in Neural Information Processing Systems 11.\n[8] Christoffersson, A (1975). Factor analysis of dichotomized variables. Psychometrika 40, 5-32.\n[9] LeCun, Y et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 541-551.\n[10] Bregler, C & Omohundro, SM (1995). Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7, 973-980.\n[11] Hastie, T & Stuetzle, W (1989). Principal curves. Journal of the American Statistical Association 84, 502-516.\n[12] Demers, D & Cottrell, G (1993). Nonlinear dimensionality reduction. Advances in Neural Information Processing Systems 5, 580-587. 
\n\nRisk Sensitive Reinforcement Learning\n\nRalph Neuneier\nSiemens AG, Corporate Technology\nD-81730 München, Germany\nRalph.Neuneier@mchp.siemens.de\n\nOliver Mihatsch\nSiemens AG, Corporate Technology\nD-81730 München, Germany\nOliver.Mihatsch@mchp.siemens.de\n\nAbstract\n\nAs already known, the expected return of a policy in Markov Decision Problems is not always the most suitable optimality criterion. For many applications, control strategies have to meet various constraints like avoiding very bad states (risk-avoiding) or generating high profit within a short time (risk-seeking), although this might cause significant costs. We propose a modified Q-learning algorithm which uses a single continuous parameter $\\kappa \\in [-1, 1]$ to determine in which sense the resulting policy is optimal. For $\\kappa = 0$, the policy is optimal with respect to the usual expected return criterion, while $\\kappa \\to 1$ generates a solution which is optimal in the worst case. Analogously, the closer $\\kappa$ is to -1, the more risk seeking the policy becomes. In contrast to other related approaches in the field of MDPs, we do not have to transform the cost model or to increase the state space in order to take risk into account. Our new approach is evaluated by computing optimal investment strategies for an artificial stock market.\n\n1 WHY IT SOMETIMES PAYS TO ACT CAUTIOUSLY\n\nReinforcement learning (RL) deals with the computation of favorable control policies in sequential decision tasks. Its theoretical framework of Markov Decision Problems (MDPs) evaluates and compares policies by their expected (sometimes discounted or averaged) sum of the immediate returns or costs per time step (Bertsekas & Tsitsiklis, 1996). But there are numerous applications which require a more sophisticated control scheme: e.g. 
a policy should take into account that bad outcomes or states may be possible even if they are very rare, because they are so disastrous that they should certainly be avoided.\n\nAn obvious example is the field of finance, where the main question is how to invest resources among various opportunities (e.g. assets like stocks, bonds, etc.) to achieve remarkable returns while simultaneously controlling the risk exposure of the investments due to changing markets or economic conditions. Many traders try to achieve this by a Markowitz-like portfolio management which distributes capital according to return and risk estimates of the assets. A new approach using reinforcement learning techniques which additionally integrates trading costs and other market imperfections has been proposed in Neuneier, 1998. Here, these algorithms are naturally extended such that an explicit risk control is now possible. The investor can decide how much risk she/he is willing to accept and then compute an optimal risk-averse investment strategy. Similar trade-off scenarios can be formulated in robotics, traffic control and further application areas.\n\nThe fact that the popular expected value criterion is not always suitable has been already known in the field of AI (Koenig & Simmons, 1994), control theory and reinforcement learning (Heger, 1994 and Szepesvari, 1997). Several techniques have been proposed to handle this problem. The most obvious way is to transform the sum of returns $\\sum_t r_t$ using an appropriate utility function U which reflects the desired properties of the solution. Unfortunately, interesting nonlinear utility functions incorporating the variance of the return, such as $U(\\sum_t r_t) = \\sum_t r_t - \\lambda (\\sum_t r_t - E(\\sum_t r_t))^2$, lead to non-Markovian decision problems. 
The popular class of exponential utility functions $U(\\sum_t r_t) = \\exp(\\lambda \\sum_t r_t)$ preserves the Markov property but requires time dependent policies even for discounted infinite horizon MDPs. Furthermore, it is not possible to formulate a corresponding model-free learning algorithm. A further alternative changes the state space model by including past returns as an additional state element at the cost of a higher dimensionality of the MDP. Furthermore, it is not always clear in which way the states should be augmented. One may also transform the cost model, i.e. by punishing large losses more strongly than minor costs. While requiring a significant amount of prior knowledge, this also increases the complexity of the MDP.\n\nIn contrast to these approaches, we modify the popular Q-learning algorithm by introducing a control parameter which determines in which sense the resulting policy is optimal. Intuitively and loosely speaking, our algorithm simulates the learning behavior of an optimistic (pessimistic) person by overweighting (underweighting) experiences which are more positive (negative) than expected. This main idea will be made more precise in section 2 and analyzed mathematically in section 3. Using artificial data, we demonstrate some properties of the new algorithm by constructing an optimal risk-avoiding investment strategy (section 4).\n\n2 RISK SENSITIVE Q-LEARNING\n\nFor brevity we restrict ourselves to the subclass of infinite horizon discounted Markov decision problems (MDPs). Furthermore, we assume the immediate rewards to be deterministic functions of the current state and control action. Let $S = \\{1, \\ldots, n\\}$ be the finite state space and U be the finite action space. Transition probabilities and immediate rewards are denoted by $p_{ij}(u)$ and $g_i(u)$, respectively. 
$\\gamma$ denotes the discount factor. Let $\\Pi$ be the set of all deterministic policies mapping states to control actions.\n\nA commonly used objective is to learn a policy $\\pi$ that\n\nmaximizes $Q^{\\pi}(i, u) := g_i(u) + E\\left\\{ \\sum_{t=1}^{\\infty} \\gamma^t g_{i_t}(\\pi(i_t)) \\right\\}$ (1)\n\nquantifying the expected reward if one executes control action u in state i and follows the policy $\\pi$ thereafter. It is a well-known result that the optimal Q-values $Q^*(i, u) := \\max_{\\pi \\in \\Pi} Q^{\\pi}(i, u)$ satisfy the following optimality equation\n\n$Q^*(i, u) = g_i(u) + \\gamma \\sum_{j \\in S} p_{ij}(u) \\max_{u' \\in U} Q^*(j, u') \\quad \\forall i \\in S, u \\in U.$ (2)\n\nAny policy $\\pi^*$ with $\\pi^*(i) = \\arg\\max_{u \\in U} Q^*(i, u)$ is optimal with respect to the expected reward criterion.\n\nThe Q-function $Q^{\\pi}$ averages over the outcome of all possible trajectories (series of states) of the Markov process generated by following the policy $\\pi$. However, the outcome of a specific realization of the Markov process may deviate significantly from this mean value. The expected reward criterion does not consider any risk, although the cases where the discounted reward falls considerably below the mean value are of vital interest for many applications. Therefore, depending on the application at hand, the expected reward approach is not always appropriate. Alternatively, Heger (1994) and Littman & Szepesvari (1996) present a performance criterion that exclusively focuses on risk avoiding policies:\n\nmaximize $\\underline{Q}^{\\pi}(i, u) := g_i(u) + \\inf_{i_1, i_2, \\ldots : \\, p(i_1, i_2, \\ldots) > 0} \\left\\{ \\sum_{t=1}^{\\infty} \\gamma^t g_{i_t}(\\pi(i_t)) \\right\\}.$ (3)\n\nThe Q-function $\\underline{Q}^{\\pi}(i, u)$ denotes the worst possible outcome if one executes control action u in state i and follows the policy $\\pi$ thereafter. The corresponding optimality equation for $\\underline{Q}^*(i, u) := \\max_{\\pi \\in \\Pi} \\underline{Q}^{\\pi}(i, u)$ is given by\n\n$\\underline{Q}^*(i, u) = g_i(u) + \\gamma \\min_{j \\in S, \\, p_{ij}(u) > 0} \\max_{u' \\in U} \\underline{Q}^*(j, u').$ (4)\n\n
Any policy $\\underline{\\pi}$ satisfying $\\underline{\\pi}(i) = \\arg\\max_{u \\in U} \\underline{Q}^*(i, u)$ is optimal with respect to this minimal reward criterion. In most real world applications this approach is too restrictive because it takes very rare events (that in practice never happen) fully into account. This usually leads to policies with a lower average performance than the application requires. An investment manager, for instance, who acts with respect to this very pessimistic objective function will not invest at all.\n\nTo handle the trade-off between a sufficient average performance and a risk avoiding (risk seeking) behavior, we propose a family of new optimality equations parameterized by a meta-parameter $\\kappa$ ($-1 < \\kappa < 1$):\n\n$0 = \\sum_{j \\in S} p_{ij}(u) \\, \\chi^{\\kappa}\\left( g_i(u) + \\gamma \\max_{u' \\in U} Q_{\\kappa}(j, u') - Q_{\\kappa}(i, u) \\right) \\quad \\forall i \\in S, u \\in U$ (5)\n\nwhere $\\chi^{\\kappa}(x) := (1 - \\kappa \\, \\mathrm{sign}(x)) x$. (In the next section we will show that a unique solution $Q_{\\kappa}$ of the above equation (5) exists.) Obviously, for $\\kappa = 0$ we recover equation (2), the optimality equation for the expected reward criterion. If we choose $\\kappa$ to be positive ($0 < \\kappa < 1$), then we overweight negative temporal differences\n\n$g_i(u) + \\gamma \\max_{u' \\in U} Q_{\\kappa}(j, u') - Q_{\\kappa}(i, u) < 0$ (6)\n\nwith respect to positive ones. Loosely speaking, we overweight transitions to states where the future return is lower than the average one. On the other hand, we underweight transitions to states that promise a higher return than the average. Thus, an agent that behaves according to the policy $\\pi_{\\kappa}(i) := \\arg\\max_{u \\in U} Q_{\\kappa}(i, u)$ is risk avoiding if $\\kappa > 0$. In the limit $\\kappa \\to 1$ the policy $\\pi_{\\kappa}$ approaches the optimal worst-case policy $\\underline{\\pi}$, as we will show in the following section. 
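The weighting function and the modified optimality equation (5) can be illustrated numerically. The sketch below is an example under stated assumptions rather than the authors' implementation: it builds a small random MDP, and it searches for a solution of (5) with a damped fixed-point iteration whose step size alpha and iteration count are chosen ad hoc. All names (chi, solve_risk_sensitive, P, g) are hypothetical.

```python
import numpy as np

# Weighting function chi^kappa(x) = (1 - kappa * sign(x)) * x.
def chi(x, kappa):
    return (1.0 - kappa * np.sign(x)) * x

# Assumed solver: repeat Q <- Q + alpha * sum_j p_ij(u) * chi(td_j).
# P[u, i, j] holds p_ij(u); g[i, u] holds the immediate reward g_i(u).
def solve_risk_sensitive(P, g, gamma, kappa, alpha=0.5, n_iter=10_000):
    Q = np.zeros_like(g)
    for _ in range(n_iter):
        V = Q.max(axis=1)                                   # max over u' of Q(j, u')
        # td[i, u, j] = g_i(u) + gamma * V(j) - Q(i, u)
        td = g[:, :, None] + gamma * V[None, None, :] - Q[:, :, None]
        Q = Q + alpha * np.einsum('uij,iuj->iu', P, chi(td, kappa))
    return Q

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)                           # each row sums to one
g = rng.random((n_states, n_actions))

Q_neutral = solve_risk_sensitive(P, g, gamma, kappa=0.0)    # expected-reward case
Q_averse = solve_risk_sensitive(P, g, gamma, kappa=0.9)     # risk avoiding
Q_seeking = solve_risk_sensitive(P, g, gamma, kappa=-0.9)   # risk seeking
```

With kappa = 0 the iteration settles on the solution of the ordinary optimality equation (2). A positive kappa weights negative temporal differences more heavily and so lowers the resulting Q-values, while a negative kappa raises them.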
(To get an intuition about this, the reader may easily check that the optimal worst-case Q-value $\\underline{Q}^*$ fulfills the modified optimality equation (5) for $\\kappa = 1$.) Similarly, the policy $\\pi_{\\kappa}$ becomes risk seeking if we choose $\\kappa$ to be negative.\n\nIt is straightforward to formulate a risk sensitive Q-learning algorithm that is based on the modified optimality equation (5). Let $Q_{\\kappa}(i, u; w)$ be a parametric approximation of the Q-function $Q_{\\kappa}(i, u)$. The states and actions encountered at time step k during simulation are denoted by $i_k$ and $u_k$ respectively. At each time step apply the following update rule:\n\n$d^{(k)} = g_{i_k}(u_k) + \\gamma \\max_{u' \\in U} Q_{\\kappa}(i_{k+1}, u'; w^{(k)}) - Q_{\\kappa}(i_k, u_k; w^{(k)}),$\n$w^{(k+1)} = w^{(k)} + \\alpha^{(k)} \\chi^{\\kappa}(d^{(k)}) \\nabla_w Q_{\\kappa}(i_k, u_k; w^{(k)}),$ (7)\n\nwhere $\\alpha^{(k)}$ denotes a stepsize sequence. The following section analyzes the properties of the new optimality equations and the corresponding Q-learning algorithm.\n\n3 PROPERTIES OF THE RISK SENSITIVE Q-FUNCTION\n\nDue to space limitations we are not able to give detailed proofs of our results. Instead, we focus on interpreting their practical consequences. The proofs will be published elsewhere.\n\nBefore formulating the mathematical results, we introduce some notation to make the exposition more concise. Using an arbitrary stepsize $0 < \\alpha < 1$, we define the value iteration operator corresponding to our modified optimality equation (5) as\n\n$T_{\\alpha,\\kappa}[Q](i, u) := Q(i, u) + \\alpha \\sum_{j \\in S} p_{ij}(u) \\, \\chi^{\\kappa}\\left( g_i(u) + \\gamma \\max_{u' \\in U} Q(j, u') - Q(i, u) \\right).$ (8)\n\nThe operator $T_{\\alpha,\\kappa}$ acts on the space of Q-functions. For every Q-function Q and every state-action pair (i, u) we define $N_{\\kappa}[Q](i, u)$ to be the set of all successor states j for which $\\max_{u' \\in U} Q(j, u')$ attains its minimum:\n\n$N_{\\kappa}[Q](i, u) := \\{ j \\in S \\mid p_{ij}(u) > 0 \\text{ and } \\max_{u' \\in U} Q(j, u') = \\min_{j' \\in S, \\, p_{ij'}(u) > 0} \\max_{u' \\in U} Q(j', u') \\}.$ (9)\n\n
Let $p_{\\kappa}[Q](i, u) := \\sum_{j \\in N_{\\kappa}[Q](i, u)} p_{ij}(u)$ be the probability of transitions to such successor states.\n\nWe have the following lemma ensuring the contraction property of $T_{\\alpha,\\kappa}$.\n\nLemma 1 (Contraction Property) Let $|Q| = \\max_{i \\in S, u \\in U} |Q(i, u)|$ and $0 < \\alpha < 1$, $0 < \\gamma < 1$. Then\n\n$|T_{\\alpha,\\kappa}[Q_1] - T_{\\alpha,\\kappa}[Q_2]| \\le (1 - \\alpha (1 - |\\kappa|)(1 - \\gamma)) \\, |Q_1 - Q_2| \\quad \\forall Q_1, Q_2.$ (10)\n\nThe operator $T_{\\alpha,\\kappa}$ is contracting, because $0 < (1 - \\alpha (1 - |\\kappa|)(1 - \\gamma)) < 1$.\n\nThe lemma has several important consequences.\n\n1. The risk sensitive optimality equation (5), i.e. $T_{\\alpha,\\kappa}[Q] = Q$, has a unique solution $Q_{\\kappa}$ for all $-1 < \\kappa < 1$.\n\n2. The value iteration procedure $Q_{new} := T_{\\alpha,\\kappa}[Q]$ converges towards $Q_{\\kappa}$.\n\n3. The existing convergence results for traditional Q-learning (Bertsekas & Tsitsiklis 1997, Tsitsiklis & Van Roy 1997) remain valid in the risk sensitive case $\\kappa \\ne 0$. In particular, risk sensitive Q-learning (7) converges with probability one in the case of lookup table representations as well as in the case of optimal stopping problems combined with linear representations.\n\n4. The speed of convergence for both risk sensitive value iteration and Q-learning becomes worse if $|\\kappa| \\to 1$. We can remedy this to some extent if we increase the stepsize $\\alpha$ appropriately.\n\nLet $\\pi_{\\kappa}$ be a greedy policy with respect to the unique solution $Q_{\\kappa}$ of our modified optimality equation; that is, $\\pi_{\\kappa}(i) = \\arg\\max_{u \\in U} Q_{\\kappa}(i, u)$. The following theorem examines the performance of $\\pi_{\\kappa}$ for the risk avoiding case $\\kappa \\ge 0$. It gives us a feeling about the expected outcome $Q^{\\pi_{\\kappa}}$ and the worst possible outcome $\\underline{Q}^{\\pi_{\\kappa}}$ of policy $\\pi_{\\kappa}$ for different values of $\\kappa$. The theorem clarifies the limiting behavior of $\\pi_{\\kappa}$ if $\\kappa \\to 1$.\n\nTheorem 2 Let $0 \\le \\kappa < 1$. 
The following inequalities hold componentwise, i.e. for each pair $(i, u) \\in S \\times U$:\n\n$$0 \\le Q^* - Q^{\\pi_{\\kappa}} \\le 2\\kappa\\, \\frac{\\gamma}{1 - \\gamma}\\, (Q^* - \\underline{Q}^*) \\tag{11}$$\n\n$$0 \\le p_{\\kappa}[Q_{\\kappa}]\\, (\\underline{Q}^* - \\underline{Q}^{\\pi_{\\kappa}}) \\le \\frac{1 - \\kappa}{2\\kappa}\\, \\frac{\\gamma}{1 - \\gamma}\\, (Q^* - \\underline{Q}^*) \\tag{12}$$\n\nMoreover, $\\lim_{\\kappa \\to 0} Q^{\\pi_{\\kappa}} = Q^*$ and $\\lim_{\\kappa \\to 1} \\underline{Q}^{\\pi_{\\kappa}} = \\underline{Q}^*$.\n\nThe difference $Q^* - \\underline{Q}^*$ between the optimal expected reward and the optimal worst case reward is crucial in the above inequalities. It measures the amount of risk inherent in the MDP at hand. Besides the value of $\\kappa$, this quantity essentially influences the difference between the performance of the policy $\\pi_{\\kappa}$ and the optimal performance with respect to both the expected reward and the worst case criterion. The second inequality (12) states that the performance of policy $\\pi_{\\kappa}$ in the worst case sense tends to the optimal worst case performance if $\\kappa \\to 1$. The \"speed of convergence\" is influenced by the quantity $p_{\\kappa}[Q_{\\kappa}]$, i.e. the probability that a worst case transition really occurs. (Note that $p_{\\kappa}[Q_{\\kappa}]$ is bounded from below.) A higher probability $p_{\\kappa}[Q_{\\kappa}]$ of worst case transitions implies a stronger risk avoiding attitude of the policy $\\pi_{\\kappa}$.\n\n4  EXPERIMENTS: RISK-AVERSE INVESTMENT DECISIONS\n\nOur algorithm is now tested on the task of constructing an optimal investment policy for an artificial stock price, analogous to the empirical analysis in Neuneier, 1998. The task, illustrated as an MDP in fig. 1, is to decide at each time step (e.g. each day or after each major event on the market) whether to buy the stock, thereby speculating on increasing stock prices, or to keep the capital in cash, which avoids potential losses due to decreasing stock prices.\n\nFigure 1. The Markov Decision Problem. Diagram labels: disturbances, financial market, investments, return, rates and prices, investor. Legend: $x_t = (\\$_t, K_t)'$ is the state, consisting of market $\\$_t$ and portfolio $K_t$; $a_t = \\mu(x_t)$ gives the policy $\\mu$ and actions; $p(x_{t+1} | x_t)$ are the transition probabilities; $r(x_t, a_t, \\$_{t+1})$ is the return function.\n\nFigure 2. A realization of the artificial stock price for 300 time steps. It is obvious that the price follows an increasing trend, but with higher values a sudden drop to low values becomes more and more probable.\n\nIt is assumed that the investor is not able to influence the market by the investment decisions. This leads to an MDP with some of the state elements being uncontrollable and results in two computationally important implications: first, one can simulate the investments using historical data without investing (and potentially losing) real money. Second, one can formulate a very efficient (memory saving) and more robust Q-learning algorithm. Due to space restrictions we skip a detailed description of these algorithms and refer the interested reader to Neuneier, 1998.\n\nThe artificial stock price is in the range of [1, 2]. The transition probabilities are chosen such that the stock market simulates a situation where the price follows an increasing trend, but with higher values a drop to very low values becomes more and more probable (fig. 2).\n\nThe state vector consists of the current stock price and the current investment, i.e. the amount of money invested in stocks or cash. Changing the investment from cash to stocks results in some transaction costs consisting of variable and fixed terms. These costs are essential to define the investment problem as an MDP because they couple the actions made at different time steps. Otherwise we could solve the problem by a pure prediction of the next stock price. 
The function which quantifies the immediate return for each time step is defined as follows: if the capital is held in cash, then there is nothing to earn even if the stock price increases; if the investor has bought stocks, the return is equal to the relative change of the stock price weighted by the invested amount of capital, minus the transaction costs which apply if one changed from cash to stocks.\n\nFigure 3. Left: Risk neutral policy, $\\kappa = 0$. Right: A small bias of $\\kappa = 0.3$ against risk changes the policy if one is not invested (transaction costs apply in this case). Axes in each panel: stock price S and capital (in stocks or in cash).\n\nFigure 4. Left: $\\kappa = 0.5$ yields a stronger risk averse attitude. Right: With $\\kappa = 0.8$ the policy also becomes more cautious if already invested in stocks.\n\nFigure 5. Left: $\\kappa = 0.9$ leads to a policy which invests in stocks in only 5 cases. Right: The worst case solution never invests because there is always a positive probability of decreasing stock prices.\n\nAs a reinforcement learning method, Q-learning has to interact with the environment (here the stock market) to learn optimal investment behavior. Thus, a training set of 2000 data points is generated. The training phase is divided into epochs, each consisting of as many trials as there are data points in the training set. At every trial the algorithm randomly selects a stock price from the data set, chooses a random investment state and updates the tabulated Q-values according to the procedure given in Neuneier, 1998. The only difference of our new risk averse Q-learning is that negative experiences, i.e. 
smaller returns than the mean, are overweighted in comparison to positive experiences using the $\\kappa$-factor of eq. (7). Using different $\\kappa$ values from 0 (recovering the original Q-learning procedure) to 1 (leading to worst case Q-learning) we plot the resulting policies as mappings from the state space to control actions in figures 3 to 5. Obviously, with increasing $\\kappa$ the investor acts more and more cautiously because there are fewer states associated with an investment decision for stocks. In the extreme case of $\\kappa = 1$, there is no stock investment at all in order to avoid any loss. The policy is not useful in practice. This supports our introductory comments that worst case Q-learning is not appropriate in many tasks.\n\nFigure 6. QQ-plot of the distributions. The quantiles of the distributions of the discounted sum of returns for $\\kappa = 0.2$ (o) and $\\kappa = 0.4$ (+) are plotted against the quantiles for the classical risk neutral approach $\\kappa = 0$. The distributions only differ significantly for negative accumulated returns (left tail of the distributions).\n\nFor further analysis, we specify a risky start state $i_0$ for which a sudden drop of the stock price in the near future is very probable. 
Starting at $i_0$ we compute the cumulated discounted rewards of 10000 different trajectories following the policies $\\pi_0$, $\\pi_{0.2}$ and $\\pi_{0.4}$ which have been generated using $\\kappa = 0$ (risk neutral), $\\kappa = 0.2$ and $\\kappa = 0.4$. The resulting three data sets are compared using a quantile-quantile plot whose purpose is to determine whether the samples come from the same distribution type. If they do so, the plot will be linear. Fig. 6 clearly shows that for higher $\\kappa$-values the left tail of the distribution (negative returns) bends up, indicating fewer losses. On the other hand there is no significant difference for positive quantiles. In contrast to naive utility functions which penalize high variance in general, our risk sensitive Q-learning asymmetrically reduces the probability of losses, which may be more suitable for many applications.\n\n5  CONCLUSION\n\nWe have formulated a new Q-learning algorithm which can be continuously tuned towards risk seeking or risk avoiding policies. Thus, it is possible to construct control strategies which are more suitable for the problem at hand by only small modifications of the Q-learning algorithm. The advantage of our approach in comparison to already known solutions is that we have to change neither the cost nor the state model. We can prove that our algorithm converges under the usual assumptions. Future work will focus on the connections between our approach and the utility theoretic point of view.\n\nReferences\n\nD. P. Bertsekas, J. N. Tsitsiklis (1996) Neuro-Dynamic Programming. Athena Scientific.\nM. Heger (1994) Consideration of Risk in Reinforcement Learning, in Machine Learning, Proceedings of the 11th International Conference, Morgan Kaufmann Publishers.\nS. Koenig, R. G. Simmons (1994) Risk-Sensitive Planning with Probabilistic Decision Graphs. Proc. of the Fourth Int. Conf. 
on Principles of Knowledge Representation and Reasoning (KR).\nM. L. Littman, Cs. Szepesvari (1996) A generalized reinforcement-learning model: Convergence and applications. In International Conference of Machine Learning '96. Bari.\nR. Neuneier (1998) Enhancing Q-learning for Optimal Asset Allocation, in Advances in Neural Information Processing Systems 10, Cambridge, MA: MIT Press.\nM. L. Puterman (1994) Markov Decision Processes, John Wiley & Sons.\nCs. Szepesvari (1997) Non-Markovian Policies in Sequential Decision Problems, Acta Cybernetica.\nJ. N. Tsitsiklis, B. Van Roy (1997) Approximate Solutions to Optimal Stopping Problems, in Advances in Neural Information Processing Systems 9, Cambridge, MA: MIT Press.\n", "award": [], "sourceid": 1583, "authors": [{"given_name": "Ralph", "family_name": "Neuneier", "institution": null}, {"given_name": "Oliver", "family_name": "Mihatsch", "institution": null}]}