{"title": "Log-normality and Skewness of Estimated State/Action Values in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1804, "page_last": 1814, "abstract": "Under/overestimation of state/action values are harmful for reinforcement learning agents. In this paper, we show that a state/action value estimated using the Bellman equation can be decomposed to a weighted sum of path-wise values that follow log-normal distributions. Since log-normal distributions are skewed, the distribution of estimated state/action values can also be skewed, leading to an imbalanced likelihood of under/overestimation. The degree of such imbalance can vary greatly among actions and policies within a single problem instance, making the agent prone to select actions/policies that have inferior expected return and higher likelihood of overestimation. We present a comprehensive analysis to such skewness, examine its factors and impacts through both theoretical and empirical results, and discuss the possible ways to reduce its undesirable effects.", "full_text": "Log-normality and Skewness of Estimated\n\nState/Action Values in Reinforcement Learning\n\nLiangpeng Zhang1,2, Ke Tang3,1, and Xin Yao3,2\n\n1School of Computer Science and Technology,\nUniversity of Science and Technology of China\n\n2University of Birmingham, U.K.\n\n3Shenzhen Key Lab of Computational Intelligence,\nDepartment of Computer Science and Engineering,\n\nSouthern University of Science and Technology, China\n\nlxz472@cs.bham.ac.uk, tangk3@sustc.edu.cn, xiny@sustc.edu.cn\n\nAbstract\n\nUnder/overestimation of state/action values are harmful for reinforcement learn-\ning agents. In this paper, we show that a state/action value estimated using the\nBellman equation can be decomposed to a weighted sum of path-wise values that\nfollow log-normal distributions. 
Since log-normal distributions are skewed, the distribution of estimated state/action values can also be skewed, leading to an imbalanced likelihood of under/overestimation. The degree of such imbalance can vary greatly among actions and policies within a single problem instance, making the agent prone to select actions/policies that have inferior expected return and a higher likelihood of overestimation. We present a comprehensive analysis of such skewness, examine its factors and impacts through both theoretical and empirical results, and discuss possible ways to reduce its undesirable effects.\n\n1 Introduction\n\nIn reinforcement learning (RL) [1, 2], actions executed by the agent are decided by comparing relevant state values V or action values Q. In most cases, the ground-truth V and Q are not available to the agent, and the agent has to rely on estimated values V̂ and Q̂ instead. Therefore, whether or not an RL algorithm yields sufficiently accurate V̂ and Q̂ is a key factor in its performance. Many studies have proved that, for popular RL algorithms such as Q-learning [3] and value iteration [4], estimated values are guaranteed to converge in the limit to their ground-truth values [5, 6, 7, 8]. Still, under/overestimation of state/action values occurs frequently in practice. Such phenomena are often considered the result of insufficient sample size or the use of function approximation [9]. However, recent studies have pointed out that the basic estimators of V and Q derived from the Bellman equation, which were considered unbiased and have been widely applied in RL algorithms, are actually biased [10] and inconsistent [11]. For example, van Hasselt [10] showed that the max operator in the Bellman equation and its transforms introduces bias into the estimated action values, resulting in overestimation. 
New operators and algorithms have been proposed to correct such biases [12, 13, 14], inconsistency [11], and other issues of value-based RL [15, 16, 17, 18].\n\nThis paper shows that, despite these great improvements in recent years, the value estimators of RL can still suffer from under/overestimation. Specifically, we show that the distributions of estimated state/action values are very likely to be skewed, resulting in an imbalanced likelihood of under/overestimation. Such skewness and likelihood can vary dramatically among actions/policies within a single problem instance. As a result, the agent may frequently select undesirable actions/policies, even when its value estimator is unbiased.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Illustration of positive skewness (red distribution) and negative skewness (blue distribution). Thick and thin vertical lines represent the corresponding expected values and medians, respectively.\n\nThis phenomenon is illustrated in Figure 1. An estimated state/action value following the red distribution has a mean of 0.21 and a median of −0.61, and thus tends to be underestimated. Another following the blue distribution, on the other hand, has a mean of −0.92 and a median of 0.61, and thus is likely to be overestimated. Although the red expected return is noticeably greater than the blue one, the probability of an unbiased agent arriving at the opposite conclusion (blue is better) and thus selecting the inferior action/policy is around 0.59, which is even worse than random guessing.\n\nThis paper also indicates that such skewness comes from the Bellman equation passing the dispersion of the transition dynamics on to the state/action values. Therefore, as long as a value is estimated by applying the Bellman equation to observations of transitions, it can suffer from the skewness problem, regardless of the algorithm being used. 
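To make the size of this effect concrete, the following sketch samples a positively skewed estimator and a flipped, negatively skewed one with means 0.21 and −0.92 as in Figure 1, and measures how often the inferior one looks better. The spread SIGMA and the recentring scheme are invented for illustration; they are not the exact distributions behind the figure.

```python
import math
import random

random.seed(0)
SIGMA = 1.5                        # spread of the underlying log-normal (made up)
SHIFT = math.exp(SIGMA ** 2 / 2)   # mean of lognormvariate(0, SIGMA)

def red():    # positively skewed estimate, recentred to mean 0.21
    return random.lognormvariate(0.0, SIGMA) - SHIFT + 0.21

def blue():   # flipped, negatively skewed estimate, recentred to mean -0.92
    return -random.lognormvariate(0.0, SIGMA) + SHIFT - 0.92

n = 100_000
reds = [red() for _ in range(n)]
blues = [blue() for _ in range(n)]
wrong = sum(b > r for r, b in zip(reds, blues)) / n   # inferior 'blue' wins

mean_red, mean_blue = sum(reds) / n, sum(blues) / n
med_red, med_blue = sorted(reds)[n // 2], sorted(blues)[n // 2]
print(f"means:   {mean_red:+.2f} (red) vs {mean_blue:+.2f} (blue)")
print(f"medians: {med_red:+.2f} (red) vs {med_blue:+.2f} (blue)")
print(f"P(inferior 'blue' action looks better) = {wrong:.3f}")
```

With these parameters the mean ordering (red better) is reversed in the medians, and the inferior action looks better more than half the time.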
Instead of proposing new algorithms, this paper suggests two general ways to reduce the skewness. The first is to balance the impacts of positive and negative immediate rewards on the estimated values. We show that positive rewards lead to positive skewness and vice versa, and thus a balance between the two may help neutralise the harmful effect of skewness. The second way is to simply collect more observations of transitions. However, our results indicate that the effectiveness of this approach diminishes quickly as the sample size grows, and thus it is recommended only when observations are cheap to obtain.\n\nIn the rest of this paper, we elaborate our analysis of the distributions of state/action values estimated by the Bellman equation. Specifically, we show that an estimated value in a general MDP can be decomposed into path-wise values in normalised single-reward Markov chains. The path-wise values are shown to obey log-normal distributions, and thus the distribution of an estimated value is the convolution of such log-normal distributions. To understand which factors have the most impact on the skewness, we derive expressions for the parameters of these log-normal distributions. We then discuss whether the skewness of estimated values can be reduced in order to improve learning performance. Finally, we provide empirical results to complement the theoretical ones, illustrating how substantial the undesirable effect of skewness can be, as well as to what degree this effect can be reduced by obtaining more observations.\n\n2 Preliminaries\n\nThe standard RL setup of [1] is followed in this paper. 
An environment is formulated as a finite discounted Markov Decision Process (MDP) M = (S, A, P, R, γ), where S and A are finite sets of states and actions, P(s′|s, a) is a transition probability function, R(s, a, s′) is an immediate reward function, and γ ∈ (0, 1) is a discount factor. A trajectory (s1, a1, s2, r1), (s2, a2, s3, r2), ..., (st, at, st+1, rt) represents the interaction history between the agent and the MDP. The numbers of occurrences of state-action pair (s, a) and transition (s, a, s′) in such a trajectory are denoted N_{s,a} and N_{s,a,s′}, respectively.\n\nA policy is denoted π, and V^π(s) is the state value of π starting from s. An action value Q^π(s, a) is essentially a state value following a non-stationary policy that selects a at the first step but follows π thereafter. It can be analysed in the same way as V^π, so it suffices to focus on V^π in the following sections. For convenience, the superscript π in V^π will be dropped when it is clear from the context.\n\nFor any s ∈ S and policy π, it holds that V^π(s) = Σ_{s′∈S} P(s′|s, π(s))(R(s, π(s), s′) + γV^π(s′)), which is called the Bellman equation. Most model-based and model-free RL algorithms utilise this equation, its equivalents, or its transforms to estimate state values. Since P and R are unknown to the agent, estimated values V̂(s) are computed from estimated transitions P̂ and rewards R̂ instead, where P̂(s′|s, a) = N_{s,a,s′}/N_{s,a} and R̂(s, a, s′) = rt with (st, at, st+1) = (s, a, s′). 
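As a minimal, hypothetical sketch of this estimation scheme (a made-up two-state chain under a fixed policy, not an example from the paper), the snippet below counts transitions, forms P̂(s′|s) = N_{s,s′}/N_s, and iterates the Bellman equation to a fixed point for both the true and the empirical model. Rewards are deterministic here, so R̂ = R.

```python
import random

random.seed(1)
gamma = 0.9
# True dynamics under some fixed policy: state 0 advances to the absorbing
# state 1 with probability 0.3; the only non-zero reward is on that transition.
P = {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]}
R = {(0, 1): 1.0}

def step(s):
    u, acc = random.random(), 0.0
    for s2, q in P[s]:
        acc += q
        if u < acc:
            return s2
    return P[s][-1][0]

# N observations per state; P_hat(s'|s) = N_{s,s'} / N_s as in the text.
N = 5000
counts = {s: {} for s in P}
for s in P:
    for _ in range(N):
        s2 = step(s)
        counts[s][s2] = counts[s].get(s2, 0) + 1
P_hat = {s: [(s2, c / N) for s2, c in row.items()] for s, row in counts.items()}

def evaluate(model, iters=500):
    # Fixed-point iteration of V(s) = sum_s' P(s'|s) (R(s,s') + gamma V(s')).
    V = {s: 0.0 for s in model}
    for _ in range(iters):
        V = {s: sum(q * (R.get((s, s2), 0.0) + gamma * V[s2])
                    for s2, q in model[s]) for s in model}
    return V

V_true, V_est = evaluate(P), evaluate(P_hat)
print(V_true[0], V_est[0])
```

The estimated value V_est[0] scatters around the true fixed point 0.3/0.37 ≈ 0.811; the rest of the paper is about the shape of that scatter.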
This is done explicitly in model-based learning, and implicitly through the frequencies of updates in model-free learning. We will show in a later section that the skewness of estimated values is decided by the dynamic effects of the environment rather than by the learning algorithm being used; therefore, it suffices to focus on the model-based case in order to evaluate such skewness.\n\nThe skewness in this paper refers to the Pearson 2 coefficient (E[X] − median[X])/√Var[X] [19, 20]. Following this definition, a distribution has a positive skewness if and only if its mean is greater than its median, and vice versa. Assuming that the bias of V̂ is corrected or absent, we have E[V̂] = V. Thus, a positive skewness of V̂ means Pr(V̂ < V) > 0.5, indicating a higher likelihood of underestimation, while a negative skewness indicates a higher likelihood of overestimation.\n\nAn informative indicator of skewness is CDF_V̂(V) − 0.5, where CDF_V̂ is the cumulative distribution function of V̂. The sign of this indicator is consistent with the Pearson 2 coefficient, while its absolute value gives the extra probability of under/overestimation of V̂ compared to a zero-skew distribution.\n\nA log-normal distribution with location parameter μ and scale parameter σ is denoted lnN(μ, σ²). A random variable X follows lnN(μ, σ²) if and only if ln(X) follows the normal distribution N(μ, σ²). The parameters μ and σ of a log-normal distribution can be calculated from its mean and variance by μ = ln(E[X]²/√(E[X]² + Var[X])) and σ² = ln(1 + Var[X]/E[X]²), where E[X] and Var[X] are the mean and variance of X ∼ lnN(μ, σ²), respectively.\n\n3 Log-normality of Estimated State Values\n\nIn this section, we elaborate our analysis of the distributions of estimated values V̂. 
The analysis consists of three steps. First, state values in general MDPs are decomposed into the state values of relevant normalised single-reward Markov chains. Second, these are further decomposed into path-wise state values. Third, the path-wise state values are shown to obey log-normal distributions.\n\n3.1 Decomposing into Normalised Single-reward Markov Chains\n\nGiven an MDP M and a policy π, the interaction between π and M forms a Markov chain M^π, with transition probability p_{i,j} = P(s_j|s_i, π(s_i)) and reward r_{i,j} = R(s_i, π(s_i), s_j) from an arbitrary state s_i to a state s_j. Let P^π be the transition matrix of M^π, V^π be the (column) vector of state values, R^π be the reward matrix, and J be a vector of ones with the same size as V^π. Then the Bellman equation is equivalent to V^π = P^π∘R^π J + γP^π V^π = (I − γP^π)^{−1}(P^π∘R^π J), where I is an identity matrix and ∘ is the Hadamard product.\n\nThis equation indicates that a state value is a weighted sum of dynamic effects, with rewards serving as the weights of the summation. Precisely, let B = (I − γP^π)^{−1}; then the equation above becomes V^π = B(P^π∘R^π J), or V^π(s_i) = Σ_{j,k} r_{j,k}(b_{i,j} p_{j,k}). Here, the term (b_{i,j} p_{j,k}) describes the joint dynamic effect of starting from s_i and ending with transition s_j s_k, which will be elaborated in Section 3.2.\n\nLet M^π_{j,k} denote a normalised single-reward Markov chain (NSR-MC) of M^π, which has exactly the same S, A, γ and P^π as M^π, but all rewards trivially 0 except r_{j,k} = 1. For an NSR-MC M^π_{j,k}, the equation above becomes V_{M^π_{j,k}}(s_i) = b_{i,j} p_{j,k}. 
Thus, a state value V of a general MDP M can be rewritten as the weighted sum of the state values of all |S|² NSR-MCs {M^π_{j,k}} of M, i.e.\n\nV^π_M(s_i) = Σ_{j,k} r_{j,k} V_{M^π_{j,k}}(s_i).  (1)\n\nTherefore, the next step of the analysis is to examine the state values in NSR-MCs.\n\n3.2 Decomposing into Path-wise State Values\n\nSeeing the Markov chain M^π as a directed graph, a walk w of length |w| in such a graph is a sequence of |w| successive transitions through states s^1, s^2, s^3, ..., s^{|w|+1}.¹ A path is a walk without repeated states, with the exception of the last state s^{|w|+1}, which can be either a visited or an unvisited one.\n\n¹Superscripts here refer to the timestamps on w rather than the indices of specific states in S.\n\nFigure 2: Illustration of walks and a representative path. "Forward" and "backward" transitions are drawn in thick and thin arrows, respectively, and p_{i,j} denotes the transition probability from s_i to s_j.\n\nIn an NSR-MC with unique non-zero reward r_{j,k} = 1, a state value V^π(s_i) = b_{i,j} p_{j,k} can be expanded as a sum of the discounted occurrence probabilities of the walks that start from s_i and end with transition (s_j, π(s_j), s_k). Let W_{i,j,k} denote the set of all possible walks w satisfying s^1 = s_i, s^{|w|} = s_j and s^{|w|+1} = s_k. Then we have V(s_i) = Σ_{w∈W_{i,j,k}} (γ^{|w|−1} Π_{(s^t,s^{t+1}) on w} p_{s^t,s^{t+1}}). Since W_{i,j,k} is infinite, the walks in W_{i,j,k} need to be put into finitely many groups for further analysis.\n\nConcretely, a step in a walk is considered "forward" if it arrives at a previously unvisited state, and "backward" if the destination has already been visited before that step. The latter also includes the cases where s^{t+1} = s^t, that is, where the agent stays at the same state after the transition. 
The only exception to this classification is the last transition of a walk, which is always considered a "forward" one, regardless of whether its destination has been visited or not. The start state s^1 and all such "forward" transitions of a walk w form a representative path of w, denoted w̃.\n\nThis is illustrated by Figure 2. In this example, all walks from s1 passing s2 and ending with s3s4, such as (s1s1s2s3s3s4), (s1s2s3s1s2s3s4) and (s1s2s3s2s3s2s3s4), are grouped with the representative path (s1s2s3s4). Note that transition s1s3 will not happen within this group; rather, it belongs to the groups that have s1s3 in their representative paths.\n\nAs can be seen from Figure 2, all possible walks sharing one representative path w̃ compose a chain which has the same transition probability values as the original Markov chain M^π, but with only two types of transitions: (forward) s_i to s_{i+1} (i ≤ |w̃|); (backward) s_i to s_j (j ≤ i ≤ |w̃|). We call this chain the derived chain of w̃, denoted M^π(w̃), or simply M(w̃). Then the infinite sum becomes\n\nV(s) = Σ_{w̃∈W̃} V_{M(w̃)}(s),  (2)\n\nwhere W̃ is the set of all representative paths that start from s and end with the unique 1-reward transition of the relevant NSR-MC. Such V_{M(w̃)}(s) are called path-wise state values of M^π.\n\nSince the main concern of this paper is the skewness of V̂, we do not provide a constructive method of obtaining all M^π(w̃). 
Rather, we point out that the size of W̃ is at most |S|!, and thus an estimated value V̂ in NSR-MCs can be decomposed into finitely many estimated path-wise state values.\n\n3.3 Log-normality of Estimated Path-wise State Values\n\nStrictly speaking, the derived chain M(w̃) of a representative path w̃ is not necessarily a Markov chain, because only part of the transitions of the original Markov chain M^π are included, allowing the possibility that Σ_{j=1}^{i+1} p_{s^i,s^j} < 1. However, this does not make the path-wise state values violate the Bellman equation, and thus they can be treated as regular state values.\n\nSince a representative path w̃ has no repeated states (except for s^{|w̃|+1}, which can either be a new state or the same as some s^k), the superscripts here can be treated as the indices of states for convenience. Therefore, the path-wise state value V_{M(w̃)}(s_i) is denoted V_i, and p_{i,j} refers to p_{s_i,s_j} in this section. Given w̃, the most important path-wise value is V_1, which belongs to the start point of w̃.\n\nDefinition 3.1. Given a derived chain M(w̃) and discount factor γ, let p_{i,j} be the transition probability from s_i to s_j on M(w̃). The joint dynamic effect of M(w̃) for i ≤ |w̃| is recursively defined as\n\nD_i = γ p_{i,i+1} / (1 − γ(p_{i,i} + Σ_{j=1}^{i−1} p_{i,j} Π_{k=j}^{i−1} D_k)).\n\nLemma 3.2. For all i < |w̃|, path-wise state values satisfy V_i = D_i V_{i+1}.\n\nProof. By the Bellman equation, it holds that V_i = Σ_{j=1}^{|w̃|+1} p_{i,j}(r_{i,j} + γV_j). By the definition of M(w̃), we have p_{i,j} = 0 for j > i+1 and r_{i,j} = 0 for (i, j) ≠ (|w̃|, |w̃|+1). Thus V_i = γ Σ_{j=1}^{i+1} p_{i,j} V_j for i < |w̃|. When i = 1, this becomes V_1 = γ(p_{1,1}V_1 + p_{1,2}V_2) = (γp_{1,2}/(1 − γp_{1,1})) V_2 = D_1 V_2. Suppose V_i = D_i V_{i+1} holds for all i ≤ k < |w̃|−1. Then V_i = (Π_{j=i}^{k} D_j) V_{k+1} for i ≤ k, and therefore V_{k+1} = γ Σ_{j=1}^{k+2} p_{k+1,j} V_j = γ[Σ_{j=1}^{k+1} p_{k+1,j} (Π_{l=j}^{k} D_l) V_{k+1} + p_{k+1,k+2} V_{k+2}]. Solving for V_{k+1} gives V_{k+1} = γ p_{k+1,k+2} / (1 − γ(p_{k+1,k+1} + Σ_{j=1}^{k} p_{k+1,j} Π_{l=j}^{k} D_l)) · V_{k+2} = D_{k+1} V_{k+2}. Thus, by the principle of induction, V_i = D_i V_{i+1} holds for all i < |w̃|.\n\nLemma 3.3. For all i ≤ |w̃|, V_i = (1/γ) Π_{j=i}^{|w̃|} D_j. In particular, V_1 = (1/γ) Π_{j=1}^{|w̃|} D_j.\n\nProof. By the definition of w̃, there are two possible cases for the last step from s_{|w̃|} to s_{|w̃|+1}: (I) s_{|w̃|+1} ∉ {s_1, ..., s_{|w̃|}}; (II) there exists k ≤ |w̃| such that s_{|w̃|+1} = s_k.\n\n(Case I) There is no transition starting from s_{|w̃|+1} in this case, thus V_{|w̃|+1} = 0. Therefore, V_{|w̃|} = p_{|w̃|,|w̃|+1}(r_{|w̃|,|w̃|+1} + γV_{|w̃|+1}) + γ Σ_{j=1}^{|w̃|} p_{|w̃|,j} V_j = p_{|w̃|,|w̃|+1} + γ Σ_{j=1}^{|w̃|} p_{|w̃|,j} V_j. Substituting V_j = (Π_{k=j}^{|w̃|−1} D_k) V_{|w̃|} (Lemma 3.2) and solving for V_{|w̃|} gives V_{|w̃|} = p_{|w̃|,|w̃|+1} / (1 − γ(p_{|w̃|,|w̃|} + Σ_{j=1}^{|w̃|−1} p_{|w̃|,j} Π_{k=j}^{|w̃|−1} D_k)) = (1/γ) D_{|w̃|}. Thus V_i = (Π_{j=i}^{|w̃|−1} D_j) V_{|w̃|} = (1/γ) Π_{j=i}^{|w̃|} D_j.\n\n(Case II with s_{|w̃|+1} = s_k) In this case V_{|w̃|+1} = V_k and p_{|w̃|,|w̃|+1} = p_{|w̃|,k}, thus V_{|w̃|} = p_{|w̃|,|w̃|+1}(r_{|w̃|,|w̃|+1} + γV_k) + γ Σ_{j=1,j≠k}^{|w̃|} p_{|w̃|,j} V_j = p_{|w̃|,|w̃|+1} + γ Σ_{j=1}^{|w̃|} p_{|w̃|,j} V_j, which is the same expression as in the first case; therefore V_i = (1/γ) Π_{j=i}^{|w̃|} D_j also holds in this case.\n\nIn both of the cases above, V_1 is the product of D_1, D_2, ..., D_{|w̃|} given by Definition 3.1 and an additional factor 1/γ. Thus we have ln(V_1) = −ln(γ) + Σ_{j=1}^{|w̃|} ln(D_j). By replacing all p_{i,j} in Definition 3.1 with estimated transitions p̂_{i,j}, we get the "estimated"² joint dynamic effects D̂. Then the equation above becomes ln(V̂_1) = −ln(γ) + Σ_{j=1}^{|w̃|} ln(D̂_j). Treating the D̂_i as independent random variables, it can be shown by the central limit theorem that as |w̃| grows, ln(V̂_1) tends to a normal distribution, and therefore V̂_1 approximates a log-normal distribution.\n\nThe "estimated" joint dynamic effects D̂ are actually mutually dependent in most cases, so a rigorous analysis of the log-normality is more complicated. The main idea is to first prove that all D̂_i ≤ γ, and then show that the summation involving the terms p_{i,j} Π_{k=j}^{i−1} D̂_k in Definition 3.1 diminishes quickly with the size of w̃, which indicates that D̂_i is mostly decided by p̂_{i,i} and p̂_{i,i+1}, and thus that the dependency between any two D̂ is relatively weak. 
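Lemma 3.3 can be checked numerically on a small made-up derived chain (three path states plus a fresh terminal state, i.e. Case I of the proof; the transition values below are arbitrary): the D_i recursion of Definition 3.1 should reproduce the fixed point of the Bellman equation.

```python
gamma = 0.9
L = 3  # path length |w~|
# p[i][j]: transition probability from s_i to s_j on the derived chain
# (an invented example; forward steps i -> i+1, backward steps i -> j <= i).
p = {1: {1: 0.2, 2: 0.8},
     2: {1: 0.1, 2: 0.3, 3: 0.6},
     3: {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.6}}

# Joint dynamic effects D_i (Definition 3.1), computed left to right.
D = {}
for i in range(1, L + 1):
    back = p[i].get(i, 0.0)
    for j in range(1, i):
        prod = 1.0
        for k in range(j, i):
            prod *= D[k]
        back += p[i][j] * prod
    D[i] = gamma * p[i][i + 1] / (1.0 - gamma * back)

prod_all = 1.0
for i in D:
    prod_all *= D[i]
V1_closed = prod_all / gamma   # Lemma 3.3: V_1 = (1/gamma) * prod_j D_j

# Direct fixed-point iteration of the Bellman equation on the same chain,
# with the single reward r_{3,4} = 1 and V(s_4) = 0 (Case I).
V = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
for _ in range(2000):
    for i in range(1, L + 1):
        V[i] = sum(q * ((1.0 if (i, j) == (L, L + 1) else 0.0) + gamma * V[j])
                   for j, q in p[i].items())
print(V1_closed, V[1])
```

The two values agree to floating-point precision, and each D_i indeed stays below γ, as claimed in the text.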
As the focus here is the skewness of V̂_1, such analysis is skipped, and we proceed to the study of the parameters of the log-normal distribution of V̂_1.\n\nSince p̂_{i,i} and p̂_{i,i+1} are the main factors that decide D̂_i, we provide the result for the most representative case, where p_{i,i} + p_{i,i+1} = 1 and all other p_{i,j} are 0 for i < |w̃|. Such an M(w̃) is denoted M_0(w̃) in the following text. It is easy to see that all D̂_i are mutually independent in such chains. The delta method [21, 22] below is used to obtain the expressions of the parameters.\n\nLemma 3.4 (Delta method [21, 22]). Suppose X is a random variable with finite moments, E[X] being its mean and Var[X] being its variance. Suppose f is a sufficiently differentiable function. Then it holds that E[f(X)] ≈ f(E[X]) and Var[f(X)] ≈ f′(E[X])² Var[X].\n\nLemma 3.5. Let D̂_j be D_j with all p replaced by p̂, and let N_j denote the number of visits to the chain state s_j in a learning trajectory. In M_0(w̃) derived chains it holds that E[D̂_j] ≈ γp_{j,j+1}/(1 − γp_{j,j}) and Var[D̂_j] ≈ γ²(1 − γ)²/(1 − γp_{j,j})⁴ · p_{j,j} p_{j,j+1}/N_j.\n\nProof. It holds that Var[p̂_{j,j+1}] = (1/N_j)² N_j p_{j,j} p_{j,j+1} = p_{j,j} p_{j,j+1}/N_j; the result then follows by applying Lemma 3.4 to Definition 3.1.\n\n²Such "estimation" is not done explicitly in actual algorithms, but implicitly when the Bellman equation is used.\n\nLemma 3.6. In M_0(w̃) derived chains it holds that\n\nE[V̂_1] = (1/γ) Π_{j=1}^{|w̃|} E[D̂_j],  Var[V̂_1] ≈ (1/γ²) (Π_{j=1}^{|w̃|} (Var[D̂_j] + E[D̂_j]²) − Π_{j=1}^{|w̃|} E[D̂_j]²).\n\n
Proof. For independent X_1, X_2, ..., X_n it holds that Var[X_1 ⋯ X_n] = Π_{j=1}^{n} (Var[X_j] + E[X_j]²) − Π_{j=1}^{n} E[X_j]². Since all D̂ are independent in M_0(w̃), the above results are obtained by applying this identity and Lemma 3.4 to Lemma 3.3.\n\nTheorem 3.7. In M_0(w̃) with sufficiently large |w̃|, it holds that V̂_1 approximately follows lnN(μ, σ²) with μ = ln(E[V̂_1]²/√(E[V̂_1]² + Var[V̂_1])) and σ² = ln(1 + Var[V̂_1]/E[V̂_1]²), where E[V̂_1] and Var[V̂_1] are given by Lemma 3.6.\n\nProof. By applying the equations for the parameters of the log-normal distribution (see Section 2) to V̂_1.\n\n4 Skewness of Estimated State Values, and Countermeasures\n\nThis section interprets the results presented in Section 3 in terms of skewness, and discusses how to reduce the undesirable effects of skewness. The skewness is mainly decided by two factors: (a) the parameter σ of the log-normal distributions; (b) non-zero immediate rewards.\n\n4.1 Impact of Parameter σ of Log-normal Distributions\n\nA regular log-normal distribution lnN(μ, σ²) has a positive skewness, which means that a value sampled from such a distribution has a more than 0.5 probability of being less than its expected value, resulting in a higher likelihood of underestimation. Precisely, if X ∼ lnN(μ, σ²), then E[X] = exp(μ + σ²/2) and median[X] = exp(μ), thus the Pearson 2 coefficient of X is greater than 0. Additionally, since lnN(μ, σ²) has CDF(x) = 0.5(1 + erf((ln(x) − μ)/(√2 σ))), where erf(x) is the Gauss error function, our indicator CDF(E[X]) − 0.5 equals 0.5 erf(σ/√8). This indicates that σ has a stronger impact than μ on the scale of the skewness of log-normal distributions.\n\nCombining Lemma 3.6 and Theorem 3.7 shows that σ is decided by a complicated interaction between all of the observed dynamic effects D̂_j. 
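Both conversions used so far — moments to (μ, σ²) as in Theorem 3.7, and σ to the indicator 0.5 erf(σ/√8) — are easy to check numerically. The helper names and the example moments below are our own, not from the paper.

```python
import math

def lognormal_params(mean, var):
    # (mu, sigma^2) of lnN matched to a given mean and variance (Section 2).
    mu = math.log(mean ** 2 / math.sqrt(mean ** 2 + var))
    sigma2 = math.log(1.0 + var / mean ** 2)
    return mu, sigma2

def skew_indicator(sigma):
    # CDF(E[X]) - 0.5 = 0.5 * erf(sigma / sqrt(8)) for X ~ lnN(mu, sigma^2).
    return 0.5 * math.erf(sigma / math.sqrt(8.0))

mu, sigma2 = lognormal_params(mean=2.0, var=3.0)
sigma = math.sqrt(sigma2)

# Consistency checks against the closed forms E[X] = exp(mu + sigma^2/2)
# and CDF(x) = 0.5 * (1 + erf((ln x - mu) / (sqrt(2) * sigma))).
mean_back = math.exp(mu + sigma2 / 2.0)
cdf_at_mean = 0.5 * (1.0 + math.erf((math.log(2.0) - mu) / (math.sqrt(2.0) * sigma)))
print(mu, sigma, cdf_at_mean - 0.5, skew_indicator(sigma))
```

Evaluating the CDF at the mean and subtracting 0.5 reproduces 0.5 erf(σ/√8) exactly, since ln E[X] − μ = σ²/2.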
By Lemma 3.5, the transition probabilities p_{j,∗} completely decide E[D̂_j] and have a substantial impact on Var[D̂_j].\n\nThis indicates that the main cause of the skewness is the transition dynamics of the MDP rather than the learning algorithm. As an extreme case, if the forward transition of a state-action pair is deterministic (i.e. p_{j,j+1} = 1), then its Var[D̂_j] = 0, resulting in no contribution to the skewness. If an estimated value consists of a large portion of such transitions, then the likelihoods of overestimation and underestimation are both very low. On the other hand, if a backward transition probability p_{j,j} (or any p_{j,k} with k ≤ j) is close to 1, then Var[D̂_j] increases dramatically, resulting in a noticeable skewness. Real-world problems can be a mix of these two extremes, which leads to a great variety of skewness among different actions/policies, making learning significantly more difficult.\n\nBy Lemma 3.5, σ is also dependent on the number of observations N_j. As N_j grows infinitely, Var[D̂_j] slowly decreases to 0, which reduces Var[V̂_1] in Lemma 3.6 and eventually drives σ to 0. This indicates that running algorithms for more steps does help reduce the skewness of estimated values and improve the overall performance. However, the expression for Var[D̂_j] in Lemma 3.5 also indicates that the degree of improvement diminishes quickly as N_j grows. Therefore, collecting more observations is not always an efficient way to reduce the skewness.\n\nFigure 3: (a) Log-normals weighted by positive reward (red) and negative reward (blue). Thick/thin vertical lines are means & medians. 
(b, c) Convolution of two log-normals, given by the purple curve.\n\nFigure 4: A chain MDP with n states, forward probability p, goal reward rG and distraction reward rD. Transitions taken under action a+ are drawn in solid arrows, and those under a− in dotted arrows.\n\n4.2 Impact of Non-zero Immediate Rewards\n\nNon-zero immediate rewards decide not only the scale of the skewness, but also its direction. By Equations 1 and 2 in Sections 3.1 and 3.2, path-wise values are weighted by their corresponding immediate rewards before being summed into state values. If a path-wise state value is weighted by a positive reward, then the resulting distribution is still a regular log-normal, which has a positive skewness and thus a higher likelihood of underestimation. However, if it is weighted by a negative reward, then the result is a flipped log-normal, which has a negative skewness and thus a higher likelihood of overestimation. This is illustrated in Figure 3 (a), where the red and blue distributions correspond to estimated path-wise values weighted by a positive and a negative reward, respectively.\n\nIn general, the sum of positively skewed random variables is not necessarily positively skewed. However, the sum of regular log-normal random variables can be approximated by another log-normal [23], and thus is still positively skewed. 
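This behaviour is easy to reproduce by simulation (the scale 0.8 below is an arbitrary choice): the sum of two log-normal samples keeps a positive mean-minus-median gap, while the sum of a regular and a symmetrically flipped log-normal has essentially none.

```python
import random

random.seed(2)
n = 200_000
a = [random.lognormvariate(0.0, 0.8) for _ in range(n)]
b = [random.lognormvariate(0.0, 0.8) for _ in range(n)]

def mean_minus_median(xs):
    # Sign of the Pearson 2 coefficient, up to the sqrt(Var) normalisation.
    s = sorted(xs)
    return sum(s) / len(s) - s[len(s) // 2]

pos_pos = mean_minus_median([x + y for x, y in zip(a, b)])   # two positive rewards
pos_neg = mean_minus_median([x - y for x, y in zip(a, b)])   # positive + mirrored negative
print(pos_pos, pos_neg)
```

This mirrors the symmetric case of Figure 3 (b); with unequal weights the cancellation would only be partial, as in Figure 3 (c).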
Since path-wise state values are approximately log-normal, it is clear that if an MDP has only positive immediate rewards, then all estimated values are likely to be positively skewed and thus have a higher likelihood of being underestimated.\n\nOn the other hand, if an estimated value is composed of both positive and negative rewards, then the skewness of the regular and flipped log-normal distributions may be partly neutralised in their convolution. The purple distribution in Figure 3 (b) shows the result of convolving two skewed distributions that lie symmetrically about x = 0. The skewness is perfectly neutralised in this case, resulting in a symmetric distribution with a balanced likelihood of under/overestimation. In the case of Figure 3 (c), the convolution is still skewed, but the scale of this skewness is smaller than that of the originals.\n\nTo make learning easier, one may hope to design the reward function such that the more desirable actions/policies have both higher expected returns and a higher likelihood of overestimation than the less desirable ones. However, the former requires more positive rewards, while the latter calls for more negative rewards, causing an unsolvable dilemma. Therefore, it is more realistic just to balance the likelihood of under/overestimation, so that all actions/policies can compete fairly with each other. Reward shaping [24, 25] can be a promising way to achieve this goal, as it preserves the optimality of policies. Since a better balance of positive and negative rewards directly reduces the impact of the skewness of all relevant log-normal distributions, this approach might be more effective than simply collecting more observations.\n\n5 Experiments\n\nIn this section, we present our empirical results on the skewness of estimated values. 
There are two\npurposes in these experiments: (a) to demonstrate how substantial the harm of the skewness can be;\n(b) to see the improvement provided by collecting more observations, as mentioned in Section 4.1.\nWe conducted experiments in chain MDPs shown in Figure 4. There are n > 0 states s1, s2, ..., sn in\na chain MDP. At each state, the agent has two possible actions a+ and a\u2212. By taking a+ at si with\n\n7\n\n-4-2024estimated path-wise value00.20.40.60.8densitypositivenegative-4-2024estimated path-wise value00.20.40.60.8densitypositivenegativeconvolution-4-2024estimated path-wise value00.511.5densitypositivenegativeconvolution\f(a)\n\n(b)\n\nFigure 5: (a) Distribution of \u02c6V \u03c0+\n\n(s1) at m = 200. (b) Underestimation probability curve.\n\nare all 1, we have Var[ \u02c6V \u03c0\u2212\n\ni < n, the agent has probability p > 0 to be sent to si+1, and 1 \u2212 p to remain at si. Taking a+ at sn\nyields a goal reward rG > 0, and the agent remains at sn. Taking a\u2212, on the other hand, sends the\nagent from si to si\u22121 (i > 1) or s1 (i = 1) with probability 1, and if a\u2212 is taken at s1, then the agent\nwill be provided a distraction reward rD > 0.\nThe objective of the learning agent is to discover a policy that leads it to the goal sn and collects rG\nas often as possible, rather than being distracted by rD. There are two policy of interest: \u03c0+ that\nalways take a+, and \u03c0\u2212 that always take a\u2212. Other policies can be proved to be always worse than\n\u03c0+ and \u03c0\u2212 in terms of V \u03c0(s1) regardless of rG, rD, p, and discount factor \u03b3.\nSince using max operator may introduce bias [10], we modi\ufb01ed the default value iteration algorithm\n[4] to let it output the unbiased estimated state values by following predetermined policies rather than\nusing max operator. In each run of experiment, m observations were collected for each state-action\npair, resulting in a data set of size 2mn. 
Then, the observations were passed to the modified value iteration algorithm to estimate the state values of π+ and π− under discount factor γ = 0.9.

The Markov chains M^{π+} and M^{π−} here are both single-path ones, and thus the corresponding theoretical distributions of V̂ can be computed directly by applying Theorem 3.7. Further, since the transition probabilities in M^{π−} are all 1, we have Var[V̂^{π−}] = 0, and thus its estimated value always equals the ground truth trivially (i.e. it will never be under/overestimated).

The empirical and theoretical distributions of the estimated state value V̂^{π+}(s1) with m = 200, n = 20, p = 0.1, rG = 1e6 in 1000 runs are shown in Figure 5 (a). A one-sample Kolmogorov-Smirnov test was conducted against the null hypothesis that the empirical data came from the theoretical log-normal distribution. The resulting p-value was 0.1190, which failed to reject the null hypothesis at the 5% significance level, indicating no significant difference between the theoretical and sample distributions.

More importantly, Figure 5 (a) shows a clear positive skewness, indicating a higher likelihood of underestimation. The empirical value of the indicator CDF(E[V̂]) − 0.5 was +0.103, meaning that in 60.3% of runs, the state value was underestimated. This further indicates that, if the distraction reward rD is set to a value such that V^{π−}(s1) is slightly less than V^{π+}(s1), then the agent will wrongly select π− with probability close to 0.603, which is worse than a random guess.

To see whether collecting more observations helps reduce the skewness, the same experiments as above were conducted with the number of observations per state-action pair m ranging from 20 to 400. Figure 5 (b) shows the theoretical and empirical probabilities of underestimation Pr(V̂^{π+}(s1) < E[V̂^{π+}(s1)]). At m = 20, 200 and 400, the empirical underestimation probability was 0.741, 0.603 and 0.563, respectively. While from m = 20 to 200 there was a significant improvement of 0.138, or an 18.6% relative improvement, from 200 to 400 it was only 0.040, or 6.6% relative. This result supports the analysis in Section 4.1, demonstrating that the merit of collecting more observations is most noticeable when the sample size is low, and diminishes quickly as the sample size grows.

We also conducted experiments in the complex maze domain [26] in the same manner as above. In this domain, the task of the agent is to find a policy that collects all flags and brings them to the goal as often as possible, without falling into any traps. The maze used is given in Figure 6 (a).

Figure 6: (a) A complex maze. S, G, numbers, and circles stand for start, goal, flags, and traps, respectively. (b) Distribution of V̂^{π*}(s_start) at m = 10. (c) Underestimation probability curve.

The states in this domain are represented by the current position of the agent and the status of the three flags. The agent starts at the start point indicated by S with no flag. At each time step, the agent can select one of the four directions to move in. The agent is then sent to the adjacent grid in the chosen direction with probability 0.7, and in each of the other three directions with probability 0.1, unless the destination is blocked, in which case the agent remains at the current grid. Additionally, at the flag grids (numbers in Figure 6 (a)), taking actions also provides the corresponding flag to the agent if that flag has not been obtained yet.
At the goal point (G), taking any action yields an immediate reward equal to 1, 100, 100^2, or 100^3 if the agent holds 0, 1, 2 or 3 flags, respectively. The agent is then sent back to the start point, and all three flags are reset to their initial positions. Finally, at any trap grid (circles), taking actions sends the agent to S and resets all flags without yielding a goal reward.

The complex maze in Figure 6 (a) has 440 states, 4 actions, 32 non-zero immediate rewards, and complicated transition patterns, and thus is difficult to analyse manually. However, it is noticeable that all non-zero immediate rewards are positive, and thus, according to Section 4.2, the estimated state values are likely to have positive skew, resulting in a greater likelihood of underestimation.

Figure 6 (b) shows the empirical distribution of the estimated value V̂^{π*}(s_start, no flag) under γ = 0.9 and m = 10 in 1000 runs. Although it is not a path-wise state value, the distribution is approximately log-normal with parameters µ ≈ 8.21 and σ ≈ 0.480. In 67.6% of these 1000 runs, the optimal state value at the start state was underestimated.

The effect of collecting a larger sample is shown in Figure 6 (c).
The probability of underestimation decreased from 0.676 at m = 10 to 0.597 at m = 50, 0.563 at m = 100, and 0.556 at m = 200. The data points approximate an exponential function y = 0.1725 exp(−0.04015x) + 0.5546, which suggests that it can be very difficult to achieve an underestimation probability lower than 0.55 by collecting more data in this domain.

6 Conclusion and Future Work

This paper has shown that estimated state values computed using the Bellman equation can be decomposed into the relevant path-wise state values, and that the latter obey log-normal distributions. Since log-normal distributions are skewed, the estimated state values also have skewed distributions, resulting in an imbalanced likelihood of under/overestimation, which can be harmful for learning. We have also pointed out that the direction of such imbalance is decided by the signs of the immediate rewards associated with the log-normal distributions, and thus, by carefully balancing the impact of positive and negative rewards when designing the MDPs, such undesirable imbalance can possibly be neutralised. Collecting more observations, on the other hand, helps reduce the skewness to a degree, but this effect becomes less significant when the sample size is already large.

It would be interesting to see how the skewness studied in this paper interacts with function approximation (e.g. neural networks [27, 28]), policy gradient [29, 30], or Monte-Carlo tree search [31, 32]. A reasonable guess is that these techniques introduce their own skewness, and the two kinds of skewness amplify each other, making learning even more difficult. On the other hand, reducing the skewness discussed in this paper may improve learning performance even when such techniques are used.
Therefore, developing a concrete method of balancing positive and negative rewards (as discussed in Section 4.2) can be very helpful, and will be investigated in the future.

Acknowledgements

This paper was supported by the Ministry of Science and Technology of China (Grant No. 2017YFB1003102), the National Natural Science Foundation of China (Grant Nos. 61672478 and 61329302), the Science and Technology Innovation Committee Foundation of Shenzhen (Grant No. ZDSYS201703031748284), EPSRC (Grant No. J017515/1), and in part by the Royal Society Newton Advanced Fellowship (Reference No. NA150123).

References

[1] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[2] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.

[3] Christopher Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[4] Martin Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994.

[5] Peter Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8(3-4):341–362, 1992.

[6] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.

[7] Michael L. Littman, Thomas L. Dean, and Leslie P. Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 394–402. Morgan Kaufmann Publishers Inc., 1995.

[8] Csaba Szepesvári. The asymptotic convergence-rate of Q-learning.
In Proceedings of the 10th International Conference on Neural Information Processing Systems, pages 1064–1070. MIT Press, 1997.

[9] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale, NJ, 1993.

[10] Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.

[11] Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 1476–1483, 2016.

[12] Donghun Lee, Boris Defourny, and Warren B. Powell. Bias-corrected Q-learning to control max-operator bias in Q-learning. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2013 IEEE Symposium on, pages 93–99. IEEE, 2013.

[13] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 2094–2100, 2016.

[14] Carlo D'Eramo, Alessandro Nuara, Matteo Pirotta, and Marcello Restelli. Estimating the maximum expected value in continuous reinforcement learning problems. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 1840–1846, 2017.

[15] Dimitri P. Bertsekas and Huizhen Yu. Q-learning and enhanced policy iteration in discounted dynamic programming. Mathematics of Operations Research, 37(1):66–94, 2012.

[16] Paul Wagner. Policy oscillation is overshooting. Neural Networks, 52:43–61, 2014.

[17] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy.
In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

[18] Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, and Richard S. Sutton. True online temporal-difference learning. Journal of Machine Learning Research, 17(145):1–40, 2016.

[19] David P. Doane and Lori E. Seward. Measuring skewness: a forgotten statistic. Journal of Statistics Education, 19(2):1–18, 2011.

[20] Harold Hotelling and Leonard M. Solomons. The limits of a measure of skewness. The Annals of Mathematical Statistics, 3(2):141–142, 1932.

[21] Gary W. Oehlert. A note on the delta method. The American Statistician, 46(1):27–29, 1992.

[22] George Casella and Roger L. Berger. Statistical Inference. Duxbury, 2nd edition, 2002.

[23] Norman C. Beaulieu and Qiong Xie. An optimal lognormal approximation to lognormal sum distributions. IEEE Transactions on Vehicular Technology, 53(2):479–489, 2004.

[24] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 278–287, 1999.

[25] John Asmuth, Michael L. Littman, and Robert Zinkov. Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 604–609, 2008.

[26] Liangpeng Zhang, Ke Tang, and Xin Yao. Increasingly cautious optimism for practical PAC-MDP exploration. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 4033–4040, 2015.

[27] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K.
Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[28] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 1928–1937, 2016.

[29] Sham Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 2:1531–1538, 2002.

[30] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015.

[31] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, 2006.

[32] Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.