{"title": "Direct value-approximation for factored MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1579, "page_last": 1586, "abstract": "", "full_text": "Direct value-approximation for factored MDPs

Dale Schuurmans and Relu Patrascu
Department of Computer Science
University of Waterloo
{dale, rpatrasc}@cs.uwaterloo.ca

Abstract

We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker: usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time.

1 Introduction

Markov decision processes (MDPs) form a foundation for control in uncertain and stochastic environments and reinforcement learning. Standard methods such as value iteration, policy iteration and linear programming can be used to produce optimal control policies for MDPs that are expressed in explicit form; that is, the policy, value function and state transition model are all represented in a tabular manner that explicitly enumerates the state space. This renders the approaches impractical for all but toy problems.
The real goal is to achieve solution methods that scale up reasonably in the size of the state description, not the size of the state space itself (which is usually either exponential or infinite).

There are two basic premises on which solution methods can scale up: (1) exploiting structure in the MDP model itself (i.e. structure in the reward function and the state transition model); and (2) exploiting structure in an approximate representation of the optimal value function (or policy). Most credible attempts at scaling up have generally had to exploit both types of structure. Even then, it is surprisingly difficult to formulate an optimization method that can handle large state descriptions and yet simultaneously produce value functions or policies with small approximation errors, or errors that can be bounded tightly. In this paper we investigate a simple approach to determining approximately optimal policies based on direct linear programming. Specifically, the idea is to approximate the optimal value function by formulating a single linear program and exploiting structure in the MDP and the value function approximation to solve this linear program efficiently.

2 Preliminaries

We consider MDPs with finite state and action spaces, and consider the goal of maximizing infinite horizon discounted reward. In this paper, states will be represented by vectors x of length n, where for simplicity we assume the state variables x_1, ..., x_n are in {0, 1}; hence the total number of states is N = 2^n. We also assume there is a small finite set of actions A = {a_1, ..., a_l}. An MDP is defined by: (1) a state transition model P(x'|x, a), which specifies the probability of the next state x' given the current state x and action a; (2) a reward function R(x, a), which specifies the immediate reward obtained by taking action a in state x; and (3) a discount factor γ, 0 ≤ γ < 1.
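To make the scaling problem concrete, the following sketch (our own illustration, not from the paper; the toy size n = 3 is arbitrary) shows why a tabular representation is infeasible: the number of states, and hence the number of table entries, grows as 2^n.

```python
import itertools

# Illustrative sketch: with n binary state variables, a tabular MDP
# representation must enumerate N = 2^n states, and a full transition
# model needs N*N probability entries per action.
n = 3
states = list(itertools.product([0, 1], repeat=n))  # all length-n 0/1 vectors
N = len(states)
print(N)  # 2^3 = 8 states

# Tabular storage costs for a single action:
reward_entries = N          # one reward value per state
transition_entries = N * N  # one probability per (state, successor) pair
print(reward_entries, transition_entries)  # 8 64
```

At n = 40 (the largest problems considered later in the paper) the same enumeration would require about 10^12 states, which is exactly the blowup the factored methods below avoid.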
The problem is to determine an optimal control policy π* : X → A that achieves maximum expected future discounted reward in every state.

To understand the standard solution methods it is useful to define some auxiliary concepts. For any policy π, the value function V^π : X → R denotes the expected future discounted reward achieved by policy π in each state x. It turns out that V^π satisfies a fixed point relationship between the value of current states and the expected values of future states, given by a backup operator V^π = B^π V^π, where B^π operates on arbitrary functions over the state space according to

(B^π f)(x) = R(x, π(x)) + γ Σ_{x'} P(x'|x, π(x)) f(x')

Another important backup operator is defined with respect to a fixed action a:

(B^a f)(x) = R(x, a) + γ Σ_{x'} P(x'|x, a) f(x')

The action-value function Q^π : X × A → R denotes the expected future discounted reward achieved by taking action a in state x and following policy π thereafter, which must satisfy Q^π(x, a) = B^a V^π. Given an arbitrary function f over states, the greedy policy π_gre(f) with respect to f is defined by

π_gre(f)(x) = arg max_a (B^a f)(x)

Finally, if we let π* denote the optimal policy and V* denote its value function, we have the relationship V* = B* V*, where (B* f)(x) = max_a (B^a f)(x). If, in addition, we define Q*(x, a) = B^a V*, then we also have π*(x) = π_gre(V*)(x) = arg max_a Q*(x, a). Given these definitions, the three fundamental methods for calculating π* can be formulated as:

Policy iteration: Start with an arbitrary policy π^(0). Iterate π^(i+1) ← π_gre(V^{π^(i)}) until π^(i+1) = π^(i). Return π* = π^(i+1).

Value iteration: Start with an arbitrary function f^(0). Iterate f^(i+1) ← B* f^(i) until ||f^(i+1) − f^(i)||_∞ < tol. Return π* = π_gre(f^(i+1)).

Linear programming: Calculate V* = arg min_f Σ_x f(x) subject to f(x) ≥ (B^a f)(x) for all a and x.
Return π* = π_gre(V*).

All three methods can be shown to produce optimal policies for the given MDP [1, 10], even though they do so in very different ways. However, all three approaches share the same fundamental limitation: they do not scale up feasibly in n, the size of the state descriptions. Instead, all of these approaches work with explicit representations of the policies and value functions that are exponential in n.

3 Exploiting structure

To scale up to large state spaces it is necessary to exploit substantial structure in the MDP while also adopting some form of approximation for the optimal value function and policy. The two specific structural assumptions we consider in this paper are (1) factored MDPs and (2) linear value function approximations. Neither of these two assumptions alone is sufficient to permit efficient policy optimization for large MDPs. However, combined, the two assumptions allow approximate solutions to be obtained for problems involving trillions of states reasonably quickly.

3.1 Factored MDPs

In the spirit of [7, 8, 6] we define a factored MDP to be one that can be represented compactly by an additive reward function and a factored state transition model. Specifically, we assume the reward function decomposes as R(x, a) = Σ_{r=1}^{m} R_{a,r}(x_{a,r}), where each local reward function R_{a,r} is defined on a small set of variables x_{a,r}. We assume the state transition model P(x'|x, a) can be represented by a set of dynamic Bayesian networks (DBNs) on state variables, one for each action, where each DBN defines a compact transition model on a directed bipartite graph connecting state variables in consecutive time steps. Let x_{a,i} denote the parents of successor variable x'_i in the DBN for action a. To allow efficient optimization we assume the parent set x_{a,i} contains a small number of state variables from the previous time step.
Given this model, the probability of a successor state x' given a predecessor state x and action a is given by the product P(x'|x, a) = Π_{i=1}^{n} P(x'_i|x_{a,i}).

The main benefit of this factored representation is that it allows large MDPs to be encoded concisely: if the functions R_{a,r}(x_{a,r}) and P(x'_i|x_{a,i}) depend on a small number of variables, they can be represented by small tables and efficiently combined to determine R(x, a) and P(x'|x, a). Unfortunately, as pointed out in [7], a factored MDP does not by itself yield a feasible method for determining optimal policies. The main problem is that, even if P and R are factored, the optimal value function generally does not have a compact representation (nor does the optimal policy). Therefore, obtaining an exact solution appears to require a return to explicit representations. However, it turns out that the factored MDP representation interacts very well with linear value function approximations.

3.2 Linear approximation

One of the central tenets to scaling up is to approximate the optimal value function rather than calculate it exactly. Numerous schemes have been investigated for approximating optimal value functions and policies in a compact representational framework, including: hierarchical decompositions [5], decision trees and diagrams [3, 12], generalized linear functions [1, 13, 4, 7, 8, 6], neural networks [2], and products of experts [11]. However, the simplest of these is generalized linear functions, which is the form we investigate below. In this case, we consider functions of the form f(x) = Σ_{j=1}^{k} w_j b_j(x_j), where b_1, ..., b_k are a fixed set of basis functions and x_j denotes the variables on which basis b_j depends.
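As a concrete illustration, a generalized linear value function with small-support basis functions can be evaluated as follows. This is a minimal sketch of our own; the particular basis (one constant plus one indicator per variable) and the weights are hypothetical choices, not the paper's.

```python
# Sketch of f(x) = sum_j w_j * b_j(x_j), where each basis function b_j
# depends only on a small subset x_j of the state variables.
# Basis and weights below are hypothetical illustrations.
n = 4
bases = [lambda x: 1.0] + [(lambda i: lambda x: float(x[i]))(i) for i in range(n)]
w = [0.5, 1.0, 2.0, 1.0, 1.0]  # one weight per basis function

def f(x):
    # Each b_j only reads its own small variable subset, so evaluation
    # is cheap even when the full state space is exponentially large.
    return sum(wj * bj(x) for wj, bj in zip(w, bases))

print(f((1, 0, 1, 1)))  # 0.5 + 1.0 + 0 + 1.0 + 1.0 = 3.5
```

Note that storing f requires only k weights and k small basis tables, regardless of the 2^n states it is defined over.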
Combining linear functions with factored MDPs provides many opportunities for feasible approximation.

The first main benefit of combining linear approximation with factored MDPs is that applying the backup operator B^a to a linear function results in a compact representation for the action-value function. Specifically, if we define g(x, a) = (B^a f)(x) then we can rewrite it as

g(x, a) = Σ_{r=1}^{m} R_{a,r}(x_{a,r}) + Σ_{j=1}^{k} w_j c_{a,j}(x_{a,j})

where

c_{a,j}(x_{a,j}) = γ Σ_{x'_j} P(x'_j|a, x_{a,j}) b_j(x'_j)  and  x_{a,j} = ∪_{x'_i ∈ x'_j} x_{a,i}

That is, x_{a,i} are the parent variables of x'_i, and x_{a,j} is the union of the parent variables of the x'_i ∈ x'_j. Thus, c_{a,j} expresses the fact that in a factored MDP the expected future value of one component of the approximation depends only on the current state variables x_{a,j} that are direct parents of the variables x'_j in b_j. If the MDP is sparsely connected then the variable sets in g will not be much larger than those in f. The ability to represent the state-action value function in a compact linear form immediately provides a feasible implementation of the greedy policy for f, since π_gre(f)(x) = arg max_a g(x, a) by definition of π_gre, and g(x, a) is efficiently determinable for each x and a. However, it turns out that this is not enough to permit feasible forms of approximate policy- and value-iteration to be easily implemented.

The main problem is that even though B^a f has a factored form for fixed a, B* f does not, and therefore neither does π_gre(f). In fact, even if a policy π were concisely represented, B^π f would not necessarily have a compact form, because π usually depends on all the state variables and thus P(x'|x, π(x)) = Π_{i=1}^{n} P(x'_i|x_{π(x),i}) becomes a product of terms that depend on all the state variables.
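For a single-variable indicator basis the coefficient c_{a,j} above reduces to a one-line expectation. The sketch below (our own illustration; the probability value is made up) makes this concrete for one binary successor variable.

```python
# Sketch: c_{a,j}(x_{a,j}) = gamma * sum_{x'_j} P(x'_j | a, x_{a,j}) * b_j(x'_j).
# For an indicator basis b_j(x'_j) = x'_j on a single binary variable, the
# sum over successor values {0, 1} collapses to gamma * P(x'_j = 1 | a, x_{a,j}).
# The probability passed in below is a made-up example, not from the paper.
gamma = 0.95

def c_aj(p_on, b=lambda v: float(v)):
    """Backup coefficient for basis b over one binary successor variable."""
    return gamma * sum(p * b(v) for v, p in [(0, 1.0 - p_on), (1, p_on)])

print(round(c_aj(0.9), 4))  # 0.95 * 0.9 = 0.855
```

Because each c_{a,j} is a small table indexed by the parent assignment x_{a,j}, the backed-up function g(x, a) stays linear in w with small variable scopes, which is what makes the greedy policy cheap to evaluate per state.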
Here [8, 6] introduce an additional assumption that there is a special "default" action a_d for the MDP such that all other actions a have a factored transition model P(·|·, a) that differs from P(·|·, a_d) only on a small number of state variables. This allows the greedy policy π_gre(f) to have a compact form and moreover allows B^{π_gre(f)} f to be concisely represented. With some effort, it then becomes possible to formulate feasible versions of approximate policy- and value-iteration [8, 6].

Approximate policy iteration: Start with the default policy π^(0)(x) = a_d. Iterate f^(i) ← arg min_f max_x |f(x) − (B^{π^(i)} f)(x)|, π^(i+1) ← π_gre(f^(i)) until π^(i+1) = π^(i).

Approximate value iteration: Start with arbitrary f^(0). Iterate π^(i) ← π_gre(f^(i)), f^(i+1) ← arg min_f max_x |f(x) − (B^{π^(i)} f)(x)| until ||f^(i+1) − f^(i)||_∞ < tol.

The most expensive part of these iterative algorithms is determining arg min_f max_x |f(x) − (B^{π^(i)} f)(x)|, which involves solving a linear program: min_{w,ε} ε subject to −ε ≤ f_w(x) − (B^π f_w)(x) ≤ ε for all x. This linear program is problematic because it involves an exponential number of constraints. A central achievement of [6] is to show that this system of constraints can be encoded by an equivalent system of constraints that has a much more compact form. The idea behind this construction is to realize that searching for the max or min of a linear function with a compact basis can be conducted in an organized fashion, and such an organized search can be encoded in an equally concise constraint system.
This construction allows approximate solutions to MDPs with up to n = 40 state variables (1 trillion states) to be generated in under 7.5 hours using approximate policy iteration [6].^1

^1 It turns out that approximate value iteration is less effective because it takes more iterations to converge, and in fact can diverge in theory [6, 13].

Our main observation is that if one has to solve linear programs to conduct the approximate iterations anyway, then it might be much simpler and more efficient to approximate the linear programming approach directly.

4 Approximate linear programming

Our first idea is simply to observe that a factored MDP and linear value approximation immediately allow one to directly solve the linear programming approximation to the optimal value function, which is given by

min_f Σ_x f(x) subject to f(x) − (B^a f)(x) ≥ 0 for all x and a

where f is restricted to a linear form over a fixed basis. In fact, it is well known [1, 2] that this yields a linear program in the basis weights w. However, what had not been previously shown is that, given a factored MDP, an equivalent linear program of feasible size could be formulated. Given the results of [6] outlined above, this is now easy to do. First, one can show that the minimization objective can be encoded compactly:

Σ_x f(x) = Σ_x Σ_{j=1}^{k} w_j b_j(x_j) = Σ_{j=1}^{k} w_j y_j  where  y_j = 2^{n−|x_j|} Σ_{x_j} b_j(x_j)

Here the y_j components can be easily precomputed by enumerating assignments to the small sets of variables in basis functions. Second, as we have seen, the exponentially many constraints have a structured form.
Specifically, f(x) − (B^a f)(x) can be represented as

f(x) − (B^a f)(x) = Σ_{j=1}^{k} w_j (b_j(x_j) − c_{a,j}(x_{a,j})) − Σ_r R_{a,r}(x_{a,r})

which has a simple basis representation that allows the technique of [6] to be used to encode a constraint system that enforces f(x) − (B^a f)(x) ≥ 0 for all x and a without enumerating the state space for each action.

We implemented this approach and tested it on some of the test problems from [6]. In these problems there is a directed network of computer systems x_1, ..., x_n where each system is either up (x_i = 1) or down (x_i = 0). Systems can spontaneously go down with some probability at each step, but this probability is increased if an immediately preceding machine in the network is down. There are n + 1 actions: do nothing (the default) and reboot machine i. The reward in a state is simply the sum of systems that are up, with a bonus reward of 1 if system 1 (the server) is up; i.e., R(x) = 2x_1 + Σ_{i=2}^{n} x_i. We considered the network architectures shown in Figure 1 and used the transition probabilities P(x'_i = 1|x_i, parent(x_i), a = i) = 0.95, and P(x'_i = 1|x_i, parent(x_i), a ≠ i) = 0.9 if x_i = parent(x_i) = 1; 0.67 if x_i = 1 and parent(x_i) = 0; and 0.01 if x_i = 0. The discount factor was γ = 0.95. The first basis functions we considered were just the indicators on each variable x_i plus a constant basis function (as reported in [6]).

The results for two network architectures are shown in Figure 1.
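The per-machine transition model and the reward just described can be transcribed directly into code. This sketch simply restates the probabilities given above (the network topology, i.e. which machine is each machine's parent, is left as a caller-supplied input).

```python
# Per-machine transition model of the network test problem, transcribed
# from the probabilities stated in the text. `rebooted` means a = i.
def p_up_next(x_i, parent_up, rebooted):
    """P(x'_i = 1 | x_i, parent(x_i), a)."""
    if rebooted:                          # action reboots machine i
        return 0.95
    if x_i == 1 and parent_up == 1:       # up, parent up
        return 0.9
    if x_i == 1 and parent_up == 0:       # up, parent down
        return 0.67
    return 0.01                           # currently down, not rebooted

def reward(x):
    # Sum of machines that are up, plus a bonus of 1 if the server x_1 is up.
    return 2 * x[0] + sum(x[1:])

print(p_up_next(1, 0, False))  # 0.67
print(reward((1, 1, 0, 1)))    # 2 + 1 + 0 + 1 = 4
```

Since each machine's successor state depends only on its own state, its parent, and whether it was rebooted, the full transition model factors exactly as in Section 3.1.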
[Figure 1: Experimental results for two network architectures ("server" topologies) with n = 12 to 40 state variables (N = 4e3 to 1e12 states), comparing API [6], APIgen, ALP, ALPgen, and ALPgen2 on run time, number of constraints generated, and the Bellman-error bound UB Bellman / Rmax. Timings on a 750 MHz PIII processor, except ^2.]

Our approximate linear programming method is labeled ALP and is compared to the approximate policy iteration strategy API described in [6]. Since we did not have the specific probabilities used in [6] and could only estimate the numbers for API from graphs presented in the paper, this comparison is only meant to be loosely indicative of the general run times of the two methods on such problems.
(Perturbing the probability values did not significantly affect our results, but we implemented APIgen for comparison.) As in [6], our implementation is based on Matlab, using CPLEX to solve linear programs. Our preliminary results appear to support the hypothesis that direct linear programming can be more efficient than approximate policy iteration on problems of this type. A further advantage of the linear programming approach is that it is simpler to program and involves solving only one LP. More importantly, the direct LP approach does not require the MDP to have a special default action, since the action-value function can be directly extracted using π_gre(f)(x) = arg max_a g(x, a), and g is easily recoverable from f.

Before discussing drawbacks, we note that it is possible to solve the linear program even more efficiently by iteratively generating constraints as needed. This is now possible because factored MDPs and linear value approximations allow an efficient search for the maximally violated constraints in the linear program, which provides an effective way of generating concise linear programs that can be solved much more efficiently than those formulated above. Specifically, the procedure ALPgen exploits the feasible search techniques for minimizing linear functions discussed previously to efficiently generate a small set of critical constraints, which is iteratively grown until the final solution is identified; see Figure 2.

^2 These numbers are estimated from graphs in [6]. The exact probabilities and computer used for the simulations were not reported in that paper, so we cannot assert an exact comparison.
However, perturbed probabilities have little effect on the performance of the methods we tried, and it seems that overall this is a loosely representative comparison of the general performance of the various algorithms on these problems.

ALPgen
    Start with f^(0) = 0 and constraints = ∅
    Loop
        For each a ∈ A, compute x^a ← arg min_x f^(i)(x) − (B^a f^(i))(x)
        constraints ← constraints ∪ {constraint(x^{a_1}), ..., constraint(x^{a_l})}
        Solve f^(i+1) ← min_f Σ_x f(x) subject to constraints
    Until min_x f^(i)(x) − (B^a f^(i))(x) ≥ 0 − tol for all a
    Return g(·, a) = B^a f for each a, to represent the greedy policy

Figure 2: ALPgen procedure

The rationale for this procedure is that the main bottleneck in the previous methods is generating the constraints, not solving the linear programs [6]. Since only a small number of constraints are active at a solution, and these are likely to be the most violated near the solution, adding only the most violated constraints appears to be a useful way to proceed. Indeed, Figure 1 shows that ALPgen produces the same approximate solutions as ALP in a tiny fraction of the time. In the most extreme case ALPgen produces an approximate solution in 7 seconds while other methods take several hours on the same problem. The reason for this speedup is explained by the results, which show the numbers of constraints generated by each method. Further investigation is also required to fully outline the robustness of the constraint generation method. In fact, one cannot guarantee that a greedy constraint generation scheme like the one proposed here will always produce a feasible number of constraints [9]. Nevertheless, the potential benefits of conservatively generating constraints as needed seem to be clear.
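The spirit of this constraint generation scheme can be shown on a generic LP with many constraints but few active ones. The toy problem below is entirely our own illustration (it is not the factored-MDP constraint system, and it uses scipy's LP solver in place of CPLEX): we minimize a linear objective subject to 1000 constraints, but only ever add the single most violated one per round, as in ALPgen.

```python
from scipy.optimize import linprog

# Toy cutting-plane loop: minimize w0 + w1 subject to w0 + j*w1 >= j
# for j = 1..1000, adding only the most violated constraint per round.
c = [1.0, 1.0]
J = range(1, 1001)            # the full (large) constraint family
rows, rhs = [], []            # active constraint set, initially empty
w = [0.0, 0.0]                # minimizer under the bounds alone
while True:
    # Search for the most violated constraint at the current solution.
    worst = max(J, key=lambda j: j - (w[0] + j * w[1]))
    if worst - (w[0] + worst * w[1]) <= 1e-6:
        break                 # every constraint satisfied: done
    rows.append([-1.0, -float(worst)])   # linprog expects A_ub @ w <= b_ub
    rhs.append(-float(worst))
    res = linprog(c, A_ub=rows, b_ub=rhs, bounds=[(0, 10), (0, 10)])
    w = list(res.x)

print(len(rows))  # only 1 of the 1000 constraints is ever generated
```

Here a single cut already satisfies the whole family, mirroring the observation above that only a few constraints are active at the solution; in ALPgen the "most violated constraint" search is itself performed efficiently by exploiting the factored structure rather than by enumeration.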
Of course, the main drawback of the direct linear programming approach over approximate policy iteration is that ALP incurs larger approximation errors than API.

5 Bounding approximation error

It turns out that neither API nor ALP is guaranteed to return the best linear approximation to the true value function. Nevertheless, it is possible to efficiently calculate bounds on the approximation errors of these methods, again by exploiting the structure of the problem. A well known result [14] asserts that max_x V*(x) − V^{π_gre(f)}(x) ≤ (2γ/(1−γ)) max_x f(x) − (B* f)(x) (where in our case f ≥ V*). This upper bound can in turn be bounded by a quantity that is feasible to calculate: max_x f(x) − (B* f)(x) = max_x min_a f(x) − (B^a f)(x) ≤ min_a max_x f(x) − (B^a f)(x). Thus an upper bound on the error from the optimal value function can be calculated by performing an efficient search for max_x f(x) − (B^a f)(x) for each a.

Figure 1 shows that the measurable error quantity max_x f(x) − (B^a f)(x) (reported as UB Bellman) is about a factor of two larger for the linear programming approach than for approximate policy iteration on the same basis. In this respect, API appears to have an inherent advantage (although in the limit of an exhaustive basis both approaches converge to the same optimal value). To get an indication of the computational cost required for ALPgen to achieve a similar bound on approximation error, we repeated the same experiments with a larger basis set that included all four indicators between pairs of connected variables. The results for this model are reported as ALPgen2, and Figure 1 shows that, indeed, the bound on approximation error is reduced substantially, but at the predictable cost of a sizable increase in computation time.
However, the run times are still appreciably smaller than the policy iteration methods.

Paradoxically, linear programming seems to offer computational advantages over policy and value iteration in the context of approximation, even though it is widely held to be an inferior solution strategy for explicitly represented MDPs.

References

[1] D. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 1995.

[2] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[3] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 2000.

[4] J. Boyan. Least-squares temporal difference learning. In Proceedings ICML, 1999.

[5] T. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. JAIR, 13:227-303, 2000.

[6] C. Guestrin, D. Koller, and R. Parr. Max-norm projection for factored MDPs. In Proceedings IJCAI, 2001.

[7] D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In Proceedings IJCAI, 1999.

[8] D. Koller and R. Parr. Policy iteration for factored MDPs. In Proceedings UAI, 2000.

[9] R. Martin. Large Scale Linear and Integer Optimization. Kluwer, 1999.

[10] M. Puterman. Markov Decision Processes: Discrete Dynamic Programming. Wiley, 1994.

[11] B. Sallans and G. Hinton. Using free energies to represent Q-values in a multiagent reinforcement learning task. In Proceedings NIPS, 2000.

[12] R. St-Aubin, J. Hoey, and C. Boutilier. APRICODD: Approximating policy construction using decision diagrams. In Proceedings NIPS, 2000.

[13] B. Van Roy. Learning and value function approximation in complex decision processes. PhD thesis, MIT, EECS, 1998.

[14] R. Williams and L. Baird.
Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, 1993.", "award": [], "sourceid": 1981, "authors": [{"given_name": "Dale", "family_name": "Schuurmans", "institution": null}, {"given_name": "Relu", "family_name": "Patrascu", "institution": null}]}