{"title": "Learning first-order Markov models for control", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 8, "abstract": null, "full_text": "Learning \ufb01rst-order Markov models for control\n\nComputer Science Department\n\nComputer Science Department\n\nAndrew Y. Ng\n\nStanford University\nStanford, CA 94305\n\nPieter Abbeel\n\nStanford University\nStanford, CA 94305\n\nAbstract\n\nFirst-order Markov models have been successfully applied to many prob-\nlems, for example in modeling sequential data using Markov chains, and\nmodeling control problems using the Markov decision processes (MDP)\nformalism.\nIf a \ufb01rst-order Markov model\u2019s parameters are estimated\nfrom data, the standard maximum likelihood estimator considers only\nthe \ufb01rst-order (single-step) transitions. But for many problems, the \ufb01rst-\norder conditional independence assumptions are not satis\ufb01ed, and as a re-\nsult the higher order transition probabilities may be poorly approximated.\nMotivated by the problem of learning an MDP\u2019s parameters for control,\nwe propose an algorithm for learning a \ufb01rst-order Markov model that ex-\nplicitly takes into account higher order interactions during training. Our\nalgorithm uses an optimization criterion different from maximum likeli-\nhood, and allows us to learn models that capture longer range effects, but\nwithout giving up the bene\ufb01ts of using \ufb01rst-order Markov models. 
Our\nexperimental results also show the new algorithm outperforming conven-\ntional maximum likelihood estimation in a number of control problems\nwhere the MDP\u2019s parameters are estimated from data.\n\nIntroduction\n\n1\nFirst-order Markov models have enjoyed numerous successes in many sequence modeling\nand in many control tasks, and are now a workhorse of machine learning.1 Indeed, even\nin control problems in which the system is suspected to have hidden state and thus be\nnon-Markov, a fully observed Markov decision process (MDP) model is often favored over\npartially observable Markov decision process (POMDP) models, since it is signi\ufb01cantly\neasier to solve MDPs than POMDPs to obtain a controller. [5]\nWhen the parameters of a Markov model are not known a priori, they are often estimated\nfrom data using maximum likelihood (ML) (and perhaps smoothing). However, in many\napplications the dynamics are not truly \ufb01rst-order Markov, and the ML criterion may lead to\npoor modeling performance. In particular, we will show that the ML model \ufb01tting criterion\nexplicitly considers only the \ufb01rst-order (one-step) transitions. If the dynamics are truly\ngoverned by a \ufb01rst-order system, then the longer-range interactions would also be well\nmodeled. But if the system is not \ufb01rst-order, then interactions on longer time scales are\noften poorly approximated by a model \ufb01t using maximum likelihood. In reinforcement\nlearning and control tasks where the goal is to maximize our long-term expected rewards,\nthe predictive accuracy of a model on long time scales can have a signi\ufb01cant impact on the\nattained performance.\n\n1To simplify the exposition, in this paper we will consider only \ufb01rst-order Markov models. 
However, the problems we describe in this paper also arise with higher order models and with more structured models (such as dynamic Bayesian networks [4, 10] and mixed memory Markov models [8, 14]), and it is straightforward to extend our methods and algorithms to these models.\n\nAs a specific motivating example, consider a system whose dynamics are governed by a random walk on the integers. Letting St denote the state at time t, we initialize the system to S0 = 0, and let St = St−1 + εt, where the increments εt ∈ {−1, +1} are equally likely to be −1 or +1. Writing St in terms of only the εt's, we have St = ε1 + ··· + εt. Thus, if the increments are independent, we have Var(ST) = T. However, if the increments are perfectly correlated (so ε1 = ε2 = ··· with probability 1), then Var(ST) = T². So, depending on the correlation between the increments, the expected value E[|ST|] can be either O(√T) or O(T). Further, regardless of the true correlation in the data, using maximum likelihood (ML) to estimate the model parameters from training data would return the same model, with E[|ST|] = O(√T).\n\nTo see how these effects can lead to poor performance on a control task, consider learning to control a vehicle (such as a car or a helicopter) under disturbances εt due to very strong winds. The influence of the disturbances on the vehicle's position over one time step may be small, but if the disturbances εt are highly correlated, their cumulative effect over time can be substantial. If our model completely ignores these correlations, we may overestimate our ability to control the vehicle (thinking our variance in position is O(T) rather than O(T²)), and try to follow overly narrow/dangerous paths.\n\nOur motivation also has parallels in the debate on using discriminative vs. 
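The variance gap described above is easy to verify empirically. The following minimal simulation (not from the paper; the function name and trial counts are our own) contrasts independent and perfectly correlated increments:

```python
import random

def simulate_var(T, trials, correlated):
    """Empirical Var(S_T) for the random walk S_t = S_{t-1} + eps_t."""
    totals = []
    for _ in range(trials):
        if correlated:
            # perfectly correlated increments: eps_1 = eps_2 = ...
            eps = [random.choice([-1, 1])] * T
        else:
            # independent increments
            eps = [random.choice([-1, 1]) for _ in range(T)]
        totals.append(sum(eps))
    mean = sum(totals) / trials
    return sum((s - mean) ** 2 for s in totals) / trials

random.seed(0)
print(simulate_var(100, 20000, correlated=False))  # grows like T
print(simulate_var(100, 20000, correlated=True))   # grows like T**2
```

A first-order ML fit sees the same one-step statistics in both regimes, so it cannot distinguish these two very different long-range behaviors.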
generative al-\ngorithms for supervised learning. There, the consensus (assuming there is ample training\ndata) seems to be that it is usually better to directly minimize the loss with respect to the\nultimate performance measure, rather than an intermediate loss function such as the like-\nlihood of the training data. (See, e.g., [16, 9].) This is because the model (no matter how\ncomplicated) is almost always not completely \u201ccorrect\u201d for the problem data. By anal-\nogy, when modeling a dynamical system for a control task, we are interested in having\na model that accurately predicts the performance of different control policies\u2014so that it\ncan be used to select a good policy\u2014and not in maximizing the likelihood of the observed\nsequence data.\nIn related work, robust control offers an alternative family of methods for accounting for\nmodel inaccuracies, speci\ufb01cally by \ufb01nding controllers that work well for a large class of\nmodels. (E.g., [13, 17, 3].) Also, in applied control, some practitioners manually adjust\ntheir model\u2019s parameters (particularly the model\u2019s noise variance parameters) to obtain a\nmodel which captures the variability of the system\u2019s dynamics. Our work can be viewed\nas proposing an algorithm that gives a more structured approach to estimating the \u201cright\u201d\nvariance parameters. The issue of time scales has also been addressed in hierarchical re-\ninforcement learning (e.g., [2, 15, 11]), but most of this work has focused on speeding up\nexploration and planning rather than on accurately modeling non-Markovian dynamics.\nThe rest of this paper is organized as follows. We de\ufb01ne our notation in Section 2, then\nformulate the model learning problem ignoring actions in Section 3, and propose a learn-\ning algorithm in Section 4. 
In Section 5, we extend our algorithm to incorporate actions. Section 6 presents experimental results, and Section 7 concludes.\n\n2 Preliminaries\n\nIf x ∈ R^n, then x_i denotes the i-th element of x. Also, let j:k = [j, j+1, j+2, ..., k−1, k]^T. For any k-dimensional vector of indices I ∈ N^k, we denote by x_I the k-dimensional vector with the subset of x's entries whose indices are in I. For example, if x = [0.0 0.1 0.2 0.3 0.4 0.5]^T, then x_{0:2} = [0.0 0.1 0.2]^T.\n\nA finite-state decision process (DP) is a tuple (S, A, T, γ, D, R), where S is a finite set of states; A is a finite set of actions; T = {P(S_{t+1} = s′ | S_{0:t} = s_{0:t}, A_{0:t} = a_{0:t})} is a set of state transition probabilities (here, P(S_{t+1} = s′ | S_{0:t} = s_{0:t}, A_{0:t} = a_{0:t}) is the probability of being in a state s′ ∈ S at time t+1 after having taken actions a_{0:t} ∈ A^{t+1} in states s_{0:t} ∈ S^{t+1} at times 0:t); γ ∈ [0, 1) is a discount factor; D is the initial state distribution, from which the initial state s_0 is drawn; and R : S → R is the reward function. We assume all rewards are bounded in absolute value by Rmax. A DP is not necessarily Markov.\n\nA policy π is a mapping from states to probability distributions over actions. Let V^π(s) = E[Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s] be the usual value function for π. Then the utility of π is\n\nU(π) = E_{s_0∼D}[V^π(s_0)] = E[Σ_{t=0}^∞ γ^t R(s_t) | π] = Σ_{t=0}^∞ γ^t Σ_{s_t} P(S_t = s_t | π) R(s_t).\n\nThe second expectation above is with respect to the random state sequence s_0, s_1, ... drawn by starting from s_0 ∼ D, picking actions according to π and transitioning according to P. Throughout this paper, P_θ̂ will denote some estimate of the transition probabilities. 
We denote by Û(π) the utility of the policy π in an MDP whose first-order transition probabilities are given by P_θ̂ (and similarly V̂^π the value function in the same MDP). Thus, we have²\n\nÛ(π) = Ê_{s_0∼D}[V̂^π(s_0)] = Ê[Σ_{t=0}^∞ γ^t R(s_t) | π] = Σ_{t=0}^∞ γ^t Σ_{s_t} P_θ̂(S_t = s_t | π) R(s_t).\n\nNote that if |U(π) − Û(π)| ≤ ε for all π, then finding the optimal policy in the estimated MDP that uses parameters P_θ̂ (using value iteration or any other algorithm) will give a policy whose utility is within 2ε of the optimal utility. [6]\n\nFor stochastic processes without decisions/actions, we will use the same notation but drop the conditioning on π. Often we will also abbreviate P(S_t = s_t) by P(s_t).\n\n3 Problem Formulation\n\nTo simplify our exposition, we will begin by considering stochastic processes that do not have decisions/actions. Section 5 will discuss how actions can be incorporated into the model.\n\nWe first consider how well V̂(s_0) approximates V(s_0). We have\n\n|V̂(s_0) − V(s_0)| = | Σ_{t=0}^∞ γ^t Σ_{s_t} P_θ̂(s_t | s_0) R(s_t) − Σ_{t=0}^∞ γ^t Σ_{s_t} P(s_t | s_0) R(s_t) |\n≤ Rmax Σ_{t=0}^∞ γ^t Σ_{s_t} | P_θ̂(s_t | s_0) − P(s_t | s_0) |.    (1)\n\nSo, to ensure that V̂(s_0) is an accurate estimate of V(s_0), we would like the parameters θ̂ of the model to minimize the right hand side of (1). The term Σ_{s_t} |P_θ̂(s_t | s_0) − P(s_t | s_0)| is exactly (twice) the variational distance between the two conditional distributions P_θ̂(·|s_0) and P(·|s_0). Unfortunately P is not known when learning from data. 
We only get to observe state sequences sampled according to P. This makes Eqn. (1) a difficult criterion to optimize. However, it is well known that the variational distance is upper bounded by a function of the KL-divergence. (See, e.g., [1].) The KL-divergence between P and P_θ̂ can be estimated (up to a constant) as the log-likelihood of a sample. So, given a training sequence s_{0:T} sampled from P, we propose to estimate the transition probabilities P_θ̂ by\n\nθ̂ = arg max_θ Σ_{t=0}^{T−1} Σ_{k=1}^{T−t} γ^k log P_θ(s_{t+k} | s_t).    (2)\n\nNote the difference between this and the standard maximum likelihood (ML) estimate. Since we are using a model that is parameterized as a first-order Markov model, the probability of the data under the model is given by P_θ(s_0, ..., s_T) = P_θ(s_T | s_{T−1}) P_θ(s_{T−1} | s_{T−2}) ... P_θ(s_1 | s_0) D(s_0) (where D is the initial state distribution). By definition, maximum likelihood (ML) chooses the parameters θ that maximize the probability of the observed data. Taking logs of the probability above (and ignoring D(s_0), which is usually parameterized separately), we find that the ML estimate is given by\n\nθ̂ = arg max_θ Σ_{t=0}^{T−1} log P_θ(s_{t+1} | s_t).    (3)\n\n²Since P_θ̂ is a first-order model, it explicitly parameterizes only P_θ̂(S_{t+1} = s_{t+1} | S_t = s_t, A_t = a_t). 
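To make the contrast between the two criteria concrete, here is a small sketch (our own illustration, not the paper's code) that evaluates both objectives for a given first-order transition matrix. For a first-order model, the k-step transition probabilities needed by the multi-step objective are entries of the matrix power P^k:

```python
import numpy as np

def ml_objective(P, seq):
    """Standard ML objective (Eqn. 3): one-step log-likelihood only."""
    return sum(np.log(P[seq[t], seq[t + 1]]) for t in range(len(seq) - 1))

def multistep_objective(P, seq, gamma):
    """Discounted multi-step objective in the spirit of Eqn. 2:
    sum over t, k of gamma^k * log P^k[s_t, s_{t+k}], where P^k is the
    k-step transition matrix of the first-order model."""
    T = len(seq) - 1
    total = 0.0
    Pk = np.eye(P.shape[0])
    for k in range(1, T + 1):
        Pk = Pk @ P                       # k-step transition matrix P^k
        for t in range(0, T - k + 1):
            total += gamma**k * np.log(Pk[seq[t], seq[t + k]])
    return total

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
seq = [0, 0, 1, 1, 0]
print(ml_objective(P, seq), multistep_objective(P, seq, gamma=0.9))
```

The EM algorithm of Section 4 is needed because, unlike in this evaluation sketch, maximizing Eqn. (2) over θ has no closed form: the k-step probabilities sum over unobserved intermediate states.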
We use P_θ̂(S_t = s_t | π) to denote the probability that S_t = s_t in an MDP with one-step transition probabilities P_θ̂(S_{t+1} = s_{t+1} | S_t = s_t, A_t = a_t) and initial state distribution D when acting according to the policy π.\n\nFigure 1: (a) A length four training sequence. (b) ML estimation for a first-order Markov model optimizes the likelihood of the second node given the first node in each of the length two subsequences. (c) Our objective (Eqn. 2) also includes the likelihood of the last node given the first node in each of these three longer subsequences of the data. (White nodes represent unobserved variables, shaded nodes represent observed variables.)\n\nAll the terms above are of the form P_θ(s_{t+1} | s_t). Thus, the ML estimator explicitly considers, and tries to model well, only the observed one-step transitions. In Figure 1 we use Bayesian network notation to illustrate the difference between the two objectives for a training sequence of length four. Figure 1(a) shows the training sequence, which can have arbitrary dependencies. Maximum likelihood (ML) estimation maximizes f_ML(θ) = log P_θ(s_1|s_0) + log P_θ(s_2|s_1) + log P_θ(s_3|s_2). Figure 1(b) illustrates the interactions modeled by ML. Ignoring γ for now, for this example our objective (Eqn. 2) is f_ML(θ) + log P_θ(s_2|s_0) + log P_θ(s_3|s_1) + log P_θ(s_3|s_0). Thus, it takes into account both the interactions in Figure 1(b) as well as the longer-range ones in Figure 1(c).\n\n4 Algorithm\n\nWe now present an EM algorithm for optimizing the objective in Eqn. (2) for a first-order Markov model.³ Our algorithm is derived using the method of [7]. (See the Appendix for details.) 
The algorithm iterates between the following two steps:\n\n• E-step: Compute expected counts\n  – ∀ i, j ∈ S, set stats(j, i) = 0\n  – ∀ t : 0 ≤ t ≤ T−1, ∀ k : 1 ≤ k ≤ T−t, ∀ l : 0 ≤ l ≤ k−1, ∀ i, j ∈ S:\n    stats(j, i) += γ^k P_θ̂(S_{t+l+1} = j, S_{t+l} = i | S_t = s_t, S_{t+k} = s_{t+k})\n\n• M-step: Re-estimate model parameters\n  Update θ̂ such that ∀ i, j ∈ S, P_θ̂(j|i) = stats(j, i) / Σ_{k∈S} stats(k, i)\n\nPrior to starting EM, the transition probabilities P_θ̂ can be initialized with the first-order transition counts (i.e., the ML estimate of the parameters), possibly with smoothing.⁴\n\n³Using higher order Markov models or more structured models (such as dynamic Bayesian networks [4, 10] or mixed memory Markov models [8, 14]) offers no special difficulties, though the notation becomes more involved and the inference (in the E-step) might become more expensive.\n\n⁴A parameter P_θ̂(j|i) initialized to zero will remain zero throughout successive iterations of EM. If this is undesirable, then smoothing could be used to eliminate zero initial values.\n\nLet us now consider more carefully the computation done in the E-step for one specific pair of values for t and k (corresponding to one term log P_θ(s_{t+k}|s_t) in Eqn. 2). For k ≥ 2, as in the forward-backward algorithm for HMMs (see, e.g., [12, 10]), the pairwise marginals can be computed by a forward propagation (computing the forward messages), a backward propagation (computing the backward messages), and then combining the forward and backward messages.⁵ Forward and backward messages are computed recursively:\n\nfor l = 1 to k−1,       ∀ i ∈ S:  m→_{t+l}(i) = Σ_{j∈S} m→_{t+l−1}(j) P_θ̂(i|j),    (4)\nfor l = k−1 down to 1,  ∀ i ∈ S:  m←_{t+l}(i) = Σ_{j∈S} m←_{t+l+1}(j) P_θ̂(j|i),    (5)\n\nwhere we initialize m→_t(i) = 1{i = s_t} and m←_{t+k}(i) = 1{i = s_{t+k}}. The pairwise marginals can be computed by combining the forward and backward messages:\n\nP_θ̂(S_{t+l+1} = j, S_{t+l} = i | S_t = s_t, S_{t+k} = s_{t+k}) = m→_{t+l}(i) P_θ̂(j|i) m←_{t+l+1}(j).    (6)\n\n⁵Note that the special case k = 1 (and thus l = 0) does not require inference. In this case we simply have P_θ̂(S_{t+1} = j, S_t = i | S_t = s_t, S_{t+1} = s_{t+1}) = 1{i = s_t} 1{j = s_{t+1}}.\n\nFor the term log P_θ(s_{t+k}|s_t), we end up performing 2(k−1) message computations, and combining messages into pairwise marginals k−1 times. Doing this for all terms in the objective results in O(T³) message computations and O(T³) computations of pairwise marginals from these messages. In practice, the objective (2) can be approximated by considering only the terms in the summation with k ≤ H, where H is some time horizon.⁶ In this case, the computational complexity is reduced to O(T H²).\n\n4.1 Computational Savings\n\nThe following observation leads to substantial savings in the number of message computations. The forward messages computed for the term log P_θ(s_{t+k}|s_t) depend only on the value of s_t. So the forward messages computed for the terms {log P_θ(s_{t+k}|s_t)}_{k=1}^H are the same as the forward messages computed just for the term log P_θ(s_{t+H}|s_t). A similar observation holds for the backward messages. As a result, we need to compute only O(T H) messages (as opposed to O(T H²) in the naive algorithm).\n\nThe following observation leads to further, even more substantial, savings. Consider two terms in the objective, log P_θ(s_{t1+k}|s_{t1}) and log P_θ(s_{t2+k}|s_{t2}). If s_{t1} = s_{t2} and s_{t1+k} = s_{t2+k}, then both terms will have exactly the same pairwise marginals and contribution to the expected counts. So expected counts have to be computed only once for every triple i, j, k for which (S_t = i, S_{t+k} = j) occurs in the training data. 
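The message recursions above can be sketched compactly. The following illustrative implementation (our own sketch, not the authors' code) computes the endpoint-conditioned pairwise marginals for one (t, k) pair; since the unnormalized combination in Eqn. (6) is a joint with the observed endpoint, we normalize each pairwise table explicitly:

```python
import numpy as np

def pairwise_marginals(P, s_t, s_tk, k):
    """Pairwise marginals P(S_{t+l}=i, S_{t+l+1}=j | S_t=s_t, S_{t+k}=s_tk)
    for a first-order chain with row-stochastic transition matrix P[i, j],
    via forward/backward messages (cf. Eqns. 4-6)."""
    n = P.shape[0]
    fwd = np.zeros((k, n))            # fwd[l] = forward message m->_{t+l}
    bwd = np.zeros((k + 1, n))        # bwd[l] = backward message m<-_{t+l}
    fwd[0, s_t] = 1.0                 # m->_t(i)   = 1{i = s_t}
    bwd[k, s_tk] = 1.0                # m<-_{t+k}(i) = 1{i = s_tk}
    for l in range(1, k):             # Eqn. (4): forward recursion
        fwd[l] = fwd[l - 1] @ P
    for l in range(k - 1, 0, -1):     # Eqn. (5): backward recursion
        bwd[l] = P @ bwd[l + 1]
    marg = []
    for l in range(k):                # Eqn. (6): combine, then normalize
        m = np.outer(fwd[l], bwd[l + 1]) * P
        marg.append(m / m.sum())
    return marg   # marg[l][i, j] = P(S_{t+l}=i, S_{t+l+1}=j | endpoints)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
marg = pairwise_marginals(P, s_t=0, s_tk=1, k=3)
print(marg[1])   # distribution over the hidden middle transition
```

Accumulating γ^k times these tables into stats(j, i), over all (t, k, l), gives exactly the E-step expected counts above.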
As a consequence, the running time for each iteration (once we have made an initial pass over the data to count the number of occurrences of the triples) is only O(|S|² H²), which is independent of the size of the training data.\n\n5 Incorporating actions\n\nIn decision processes, actions influence the state transition probabilities. To generate training data, suppose we choose an exploration policy and take actions in the DP using this policy. Given the resulting training data, and generalizing Eqn. (2) to incorporate actions, our estimator now becomes\n\nθ̂ = arg max_θ Σ_{t=0}^{T−1} Σ_{k=1}^{T−t} γ^k log P_θ(s_{t+k} | s_t, a_{t:t+k−1}).    (7)\n\nThe EM algorithm is straightforwardly extended to this setting, by conditioning on the actions during the E-step, and updating state-action transition probabilities P_θ(j|i, a) in the M-step.\n\nAs before, forward messages need to be computed only once for each value of t, and backward messages only once for each value of t+k. However, achieving the more substantial savings described in the second paragraph of Section 4.1 is now more difficult. In particular, the contribution of a triple i, j, k (one for which (S_t = i, S_{t+k} = j) occurs in the training data) now depends on the action sequence a_{t:t+k−1}. The number of possible sequences of actions a_{t:t+k−1} grows exponentially with k.\n\nIf, however, we use a deterministic exploration policy to generate the training data (more specifically, one in which the action taken is a deterministic function of the current state), then we can again obtain these computational advantages: counts of the number of occurrences of the triples described previously are now again a sufficient statistic. However, a single deterministic exploration policy, by definition, cannot explore all state-action pairs. 
Thus, we will instead use a combination of several deterministic exploration policies, which jointly can explore all state-action pairs. In this case, the running time for the E-step becomes O(|S|² H² |Π|), where |Π| is the number of different deterministic exploration policies used. (See Section 6.2 for an example.)\n\n⁶Because of the discount term γ^k in the objective (2), one can safely truncate the summation over k after about O(1/(1−γ)) terms without incurring too much error.\n\nFigure 2: (a) Grid-world. (b) Grid-world experimental results, showing the utilities of policies obtained from the MDP estimated using ML (dash-dot line), and utilities of policies obtained from the MDP estimated using our objective (solid line). Results shown are means over 5 independent trials, and the error bars show one standard error for the mean. The horizontal axis (correlation level for noise) corresponds to the parameter q in the experiment description. (c) Queue experiment, showing utilities obtained using ML (dash-dot line), and using our algorithm (solid line). Results shown are means over 5 independent trials, and the error bars show one standard error for the mean. The horizontal axis (correlation level between arrivals) corresponds to the parameter b in the experiment description. 
(Shown in color, where available.)\n6 Experiments\nIn this section, we empirically study the performance of model \ufb01tting using our proposed\nalgorithm, and compare it to the performance of ordinary ML estimation.\n\n6.1 Shortest vs. safest path\nConsider an agent acting for 100 time steps in the grid-world in Figure 2(a). The initial\nstate is marked by S, and the absorbing goal state by G. The reward is -500 for the gray\nsquares, and -1 elsewhere. This DP has four actions that (try to) move in each of the four\ncompass directions, and succeed with probability 1 \u2212 p. If an action is not successful, then\nthe agent\u2019s position transitions to one of the neighboring squares. Similar to our example in\nSection 1, the random transitions (resulting from unsuccessful actions) may be correlated\nover time. In this problem, if there is no noise (p = 0), the optimal policy is to follow\none of the shortest paths to the goal that do not pass through gray squares, such as path A.\nFor higher noise levels, the optimal policy is to stay as far away as possible from the gray\nsquares, and try to follow a longer path such as B to the goal.7 At intermediate noise levels,\nthe optimal policy is strongly dependent on how correlated the noise is between successive\ntime steps. The larger the correlation, the more dangerous path A becomes (for reasons\nsimilar to the random walk example in Section 1). In our experiments, we compare the\nbehavior of our algorithm and ML estimation with different levels of noise correlation.8\nFigure 2(b) shows the utilities obtained by the two different models, under different degrees\nof correlation in the noise. The two algorithms perform comparably when the correlation is\nweak, but our method outperforms ML when there is strong correlation. Empirically, when\nthe noise correlation is high, our algorithm seems to be \ufb01tting a \ufb01rst-order model with a\nlarger \u201ceffective\u201d noise level. 
When the resulting estimated MDP is solved, this gives more\ncautious policies, such as ones more inclined to choose path B over A. In contrast, the\nML estimate performs poorly in this problem because it tends to underestimate how far\nsideways the agent tends to move due to the noise (cf. the example in Section 1).\n\n7For very high noise levels (e.g. p = 0.99) the optimal policy is qualitatively different again.\n8Experimental details: The noise is governed by an (unobserved) Markov chain with four states\ncorresponding to the four compass directions. If an action at time t is not successful, the agent moves\nin the direction corresponding to the state of this Markov chain. On each step, the Markov chain\nstays in the current state with probability q, and transitions with probability 1 \u2212 q uniformly to any\nof the four states. Our experiments are carried out varying q from 0 (low noise correlation) to 0.9\n(strong noise correlation). A 200,000 length state-action sequence for the grid-world, generated using\na random exploration policy, was used for model \ufb01tting, and a constant noise level p = 0.3 was used\nin the experiments. Given a learned MDP model, value iteration was used to \ufb01nd the optimal policy\nfor it. To reduce computation, we only included the terms of the objective (Eqn. 7) for which k = 10.\n\n\f6.2 Queue\nWe consider a service queue in which the average arrival rate is p. Thus, p =\nP (a customer arrives in one time step). Also, for each action i, let qi denote the service\nrate under that action (thus, qi = P (a customer is served in one time step|action = i)). In\nour problem, there are three service rates q0 < q1 < q2 with respective rewards 0,\u22121,\u221210.\nThe maximum queue size is 20, and the reward for any state of the queue is 0, except when\nthe queue becomes full, which results in a reward of -1000. The service rates are q0 = 0,\nq1 = p and q2 = 0.75. 
So the inexpensive service rate q1 is suf\ufb01cient to keep up with\narrivals on average. However, even though the average arrival rate is p, the arrivals come\nin \u201cbursts,\u201d and even the high service rate q2 is insuf\ufb01cient to keep the queue small during\nthe bursts of many consecutive arrivals.9\nExperimental results on the queue are shown in Figure 2(c). We plot the utilities obtained\nusing each of the two algorithms for high arrival correlations. (Both algorithms perform\nessentially identically at lower correlation levels.) We see that the policies obtained with\nour algorithm consistently outperform those obtained using maximum likelihood to \ufb01t the\nmodel parameters. As expected, the difference is more pronounced for higher correlation\nlevels, i.e., when the true model is less well approximated by a \ufb01rst-order model.\nFor learning the model parameters, we used three deterministic exploration policies, each\ncorresponding to always taking one of the three actions. Thus, we could use the more\nef\ufb01cient version of the algorithm described in the second paragraph of Section 4.1 and\nat the end of Section 5. A single EM iteration for the experiments on the queue took 6\nminutes for the original version of the algorithm, but took only 3 seconds for the more\nef\ufb01cient version; this represents more than a 100-fold speedup.\n7 Conclusions\nWe proposed a method for learning a \ufb01rst-order Markov model that captures the system\u2019s\ndynamics on longer time scales than a single time step. In our experiments, this method was\nalso shown to outperform the standard maximum likelihood model. In other experiments,\nwe have also successfully applied these ideas to modeling the dynamics of an autonomous\nRC car. (Details will be presented in a forthcoming paper.)\nReferences\n[1] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.\n[2] T. G. Dietterich. 
Hierarchical reinforcement learning with the MAXQ value function decom-\n\nposition. JAIR, 2000.\n\n[3] P. Gahinet, A. Nemirovski, A. Laub, and M. Chilali. LMI Control Toolbox. Natick, MA, 1995.\n[4] Z. Ghahramani. Learning dynamic Bayesian networks. In Adaptive Processing of Sequences\n\nand Data Structures, pages 168\u2013197. Springer-Verlag, 1998.\n\n9Experimental details: The true process has two different (hidden) modes for arrivals. The \ufb01rst\nmode has a very low arrival rate, and the second mode has a very high arrival rate. We denote\nthe steady state distribution over the two modes by (\u03c61, \u03c62). (I.e., the system spends a fraction \u03c61\nof the time in the low arrival rate mode, and a fraction \u03c62 = 1 \u2212 \u03c61 of the time in high arrival\nrate mode.) Given the steady state distribution, the state transition matrix [a 1 \u2212 a; 1 \u2212 b b] has\nonly one remaining degree of freedom, which (essentially) controls how often the system switches\nbetween the two modes. (Here, a [resp. b] is the probability, if we are in the slow [resp. fast] mode,\nof staying in the same mode the next time step.) More speci\ufb01cally, assuming \u03c61 > \u03c62, we have\nb \u2208 [0, 1], a = 1 \u2212 (1 \u2212 b)\u03c62/\u03c61. The larger b is, the more slowly the system switches between\nmodes. Our experiments used \u03c61 = 0.8, \u03c62 = 0.2, P (arrival|mode 1) = 0.01, P (arrival|mode 2) =\n0.99. This means b = 0.2 gives independent arrival modes for consecutive time steps.\nIn our\nexperiments, q0 = 0, and q1 was equal to the average arrival rate p = \u03c61P (arrival|mode 1) +\n\u03c62P (arrival|mode 2). Note that the highest service rate q2(= 0.75) is lower than the fast mode\u2019s\narrival rate. Training data was generated using 8000 simulations of 25 time steps each, in which the\nqueue length is initialized randomly, and the same (randomly chosen) action is taken on all 25 time\nsteps. 
To reduce computational requirements, we only included the terms of the objective (Eqn. 7)\nfor which k = 20. We used a discount factor \u03b3 = .95 and approximated utilities by truncating at a\n\ufb01nite horizon of 100. Note that although we explain the queuing process by arrival/departure rates,\nthe algorithm learns full transition matrices for each action, and not only the arrival/departure rates.\n\n\f[5] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable\n\nstochastic domains. Arti\ufb01cial Intelligence, 101, 1998.\n\n[6] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable\n\ntrajectories. In NIPS 12, 1999.\n\n[7] R. Neal and G. Hinton. A view of the EM algorithm that justi\ufb01es incremental, sparse, and other\n\nvariants. In Learning in Graphical Models, pages 355\u2013368. MIT Press, 1999.\n\n[8] H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependencies in stochastic lan-\n\nguage modeling. Computer Speech and Language, 8, 1994.\n\n[9] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classi\ufb01ers: A comparison of logistic\n\nregression and naive Bayes. In NIPS 14, 2002.\n\n[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Mor-\n\ngan Kauffman, 1988.\n\n[11] D. Precup, R. S. Sutton, and S. Singh. Theoretical results on reinforcement learning with\n\ntemporally abstract options. In Proc. ECML, 1998.\n\n[12] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recog-\n\nnition. Proceedings of the IEEE, 77, 1989.\n\n[13] J. K. Satia and R. L. Lave. Markov decision processes with uncertain transition probabilities.\n\nOperations Research, 1973.\n\n[14] L. K. Saul and M. I. Jordan. Mixed memory Markov models: decomposing complex stochastic\n\nprocesses as mixtures of simpler ones. Machine Learning, 37, 1999.\n\n[15] R. S. Sutton. 
TD models: Modeling the world at a mixture of time scales. In Proc. ICML, 1995.\n\n[16] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.\n\n[17] C. C. White and H. K. Eldeib. Markov decision processes with imprecise transition probabilities. Operations Research, 1994.\n\nAppendix: Derivation of EM algorithm\n\nThis Appendix derives the EM algorithm that optimizes Eqn. (7). The derivation is based on [7]'s method. Note that because of discounting, the objective is slightly different from the standard setting of learning the parameters of a Markov chain with unobserved variables in the training data.\n\nSince we are using a first-order model, we have\n\nP_θ̂(s_{t+k} | s_t, a_{t:t+k−1}) = Σ_{S_{t+1:t+k−1}} P_θ̂(s_{t+k} | S_{t+k−1}, a_{t+k−1}) P_θ̂(S_{t+k−1} | S_{t+k−2}, a_{t+k−2}) ... P_θ̂(S_{t+1} | s_t, a_t).\n\nHere, the summation is over all possible state sequences S_{t+1:t+k−1}. So we have\n\nΣ_{t=0}^{T−1} Σ_{k=1}^{T−t} γ^k log P_θ̂(s_{t+k} | s_t, a_{t:t+k−1})\n= Σ_{t=0}^{T−1} γ log P_θ̂(s_{t+1} | s_t, a_t) + Σ_{t=0}^{T−1} Σ_{k=2}^{T−t} γ^k log Σ_{S_{t+1:t+k−1}} Q_{t,k}(S_{t+1:t+k−1}) [ P_θ̂(s_{t+k} | S_{t+k−1}, a_{t+k−1}) P_θ̂(S_{t+k−1} | S_{t+k−2}, a_{t+k−2}) ... P_θ̂(S_{t+1} | s_t, a_t) / Q_{t,k}(S_{t+1:t+k−1}) ]\n≥ Σ_{t=0}^{T−1} γ log P_θ̂(s_{t+1} | s_t, a_t) + Σ_{t=0}^{T−1} Σ_{k=2}^{T−t} γ^k Σ_{S_{t+1:t+k−1}} Q_{t,k}(S_{t+1:t+k−1}) log [ P_θ̂(s_{t+k} | S_{t+k−1}, a_{t+k−1}) P_θ̂(S_{t+k−1} | S_{t+k−2}, a_{t+k−2}) ... P_θ̂(S_{t+1} | s_t, a_t) / Q_{t,k}(S_{t+1:t+k−1}) ].    (8)\n\nHere, Q_{t,k} is a probability distribution, and the inequality follows from Jensen's inequality and the concavity of log(·). As in [7], the EM algorithm optimizes Eqn. (8) by alternately optimizing with respect to the distributions Q_{t,k} (E-step), and the transition probabilities P_θ̂(·|·,·) (M-step). 
Optimizing with respect to the Q_{t,k} variables (E-step) is achieved by setting\n\nQ_{t,k}(S_{t+1:t+k−1}) = P_θ̂(S_{t+1}, ..., S_{t+k−1} | S_t = s_t, S_{t+k} = s_{t+k}, A_{t:t+k−1} = a_{t:t+k−1}).    (9)\n\nOptimizing with respect to the transition probabilities P_θ̂(·|·,·) (M-step) for Q_{t,k} fixed as in Eqn. (9) is done by updating θ̂ to θ̂_new such that ∀ i, j ∈ S, ∀ a ∈ A we have P_θ̂new(j|i, a) = stats(j, i, a) / Σ_{k∈S} stats(k, i, a), where stats(j, i, a) = Σ_{t=0}^{T−1} Σ_{k=1}^{T−t} Σ_{l=0}^{k−1} γ^k P_θ̂(S_{t+l+1} = j, S_{t+l} = i | S_t = s_t, S_{t+k} = s_{t+k}, A_{t:t+k−1} = a_{t:t+k−1}) 1{a_{t+l} = a}. Note that only the pairwise marginals P_θ̂(S_{t+l+1}, S_{t+l} | S_t, S_{t+k}, A_{t:t+k−1}) are needed in the M-step, and so it is sufficient to compute only these when optimizing with respect to the Q_{t,k} variables in the E-step.\n\f", "award": [], "sourceid": 2569, "authors": [{"given_name": "Pieter", "family_name": "Abbeel", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}