{"title": "Linear Program Approximations for Factored Continuous-State Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 895, "page_last": 902, "abstract": "", "full_text": "Linear Program Approximations for Factored\nContinuous-State Markov Decision Processes\n\nMilos Hauskrecht and Branislav Kveton\n\nDepartment of Computer Science and Intelligent Systems Program\n\nUniversity of Pittsburgh\n\n milos,bkveton\u0001 @cs.pitt.edu\n\nAbstract\n\nApproximate linear programming (ALP) has emerged recently as one of\nthe most promising methods for solving complex factored MDPs with\n(cid:2)nite state spaces.\nIn this work we show that ALP solutions are not\nlimited only to MDPs with (cid:2)nite state spaces, but that they can also be\napplied successfully to factored continuous-state MDPs (CMDPs). We\nshow how one can build an ALP-based approximation for such a model\nand contrast it to existing solution methods. We argue that this approach\noffers a robust alternative for solving high dimensional continuous-state\nspace problems. The point is supported by experiments on three CMDP\nproblems with 24-25 continuous state factors.\n\n1\n\nIntroduction\n\nMarkov decision processes (MDPs) offer an elegant mathematical framework for represent-\ning and solving decision problems in the presence of uncertainty. While standard solution\ntechniques, such as value and policy iteration, scale-up well in terms of the number of\nstates, the state space of more realistic MDP problems is factorized and thus becomes ex-\nponential in the number of state components. Much of the recent work in the AI community\nhas focused on factored structured representations of (cid:2)nite-state MDPs and their ef(cid:2)cient\nsolutions. Approximate linear programming (ALP) has emerged recently as one of the\nmost promising methods for solving complex factored MDPs with discrete state compo-\nnents. 
The approach uses a linear combination of local feature functions to model the value function. The coefficients of the model are fit using linear programming methods. A number of refinements of the ALP approach have been developed over the past few years. These include the work by Guestrin et al [8], de Farias and Van Roy [6, 5], Schuurmans and Patrascu [15], and others [11]. In this work we show how the same set of linear programming (LP) methods can be extended also to solutions of factored continuous-state MDPs.1

The optimal solution of a continuous-state MDP (CMDP) may not (and typically does not) have a finite support. To address this problem, CMDPs and their solutions are usually approximated and solved either through state-space discretization or by fitting a surrogate (and often much simpler) parametric value-function model. The two methods come with different advantages and limitations.2 The disadvantage of discretizations is their limited accuracy and the fact that higher-accuracy solutions are paid for by an exponential increase in the complexity of the discretization. On the other hand, parametric value-function approximations may become unstable when combined with dynamic programming methods and least-squares error [1]. The ALP solution developed in this work eliminates the disadvantages of the discretization and function-approximation approaches while preserving their good properties. It extends the approach of Trick and Zin [17] to factored multi-dimensional continuous state spaces. Its main benefits are good running-time performance, stability of the solution, and good-quality policies.

1 We assume that action spaces stay finite. Rust [14] calls such models discrete decision processes.
2 The two methods are described in more depth in Section 3.

Factored models offer a more natural and compact way of parameterizing complex decision processes.
However, not all CMDP models and related factorizations are equally suitable for the purpose of optimization. In this work we study factored CMDPs with state spaces restricted to $[0,1]^n$. We show that the solution for such a model can be approximated by an ALP with an infinite number of constraints that decompose locally. In addition, we show that by choosing transition models based on beta densities (or their mixtures) and basis functions defined by products of polynomials, one obtains an ALP in which both the objective function and the constraints are in closed form. To alleviate the problem of an infinite number of constraints, we develop and study an approximation based on constraint sampling [5, 6]. We show that even under a relatively simple random constraint sampling we are able to very quickly calculate solutions of a high quality that are comparable to other existing CMDP solution methods.

The text of the paper is organized as follows. First we review finite-state MDPs and approximate linear programming (ALP) methods developed for their factored refinements. Next we show how to extend the LP approximations to factored continuous-state MDPs and discuss assumptions underlying the model.
Finally, we test the new method on a continuous-state version of the computer network problem [8, 15] and compare its performance to alternative CMDP methods.

2 Finite-state MDPs

A finite-state MDP defines a stochastic control process with components $(S, A, P, R)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P : S \times A \times S \to [0,1]$ defines a probabilistic transition model mapping a state to the next states given an action, and $R : S \times A \to \mathrm{I\!R}$ defines a reward model for choosing an action in a specific state.

Given an MDP, our objective is to find the policy $\pi^* : S \to A$ maximizing the infinite-horizon discounted reward criterion $E[\sum_{t=0}^{\infty} \gamma^t r_t]$, where $\gamma \in [0,1)$ is a discount factor and $r_t$ is a reward obtained in step $t$. The value of the optimal policy satisfies the Bellman fixed-point equation [12]:

    $V^*(s) = \max_{a \in A} \left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^*(s') \right]$,    (1)

where $V^*$ is the value of the optimal policy and $s'$ denotes the next state. For all states, the equation can be written as $V^* = T V^*$, where $T$ is the Bellman operator. Given the value function $V^*$, the optimal policy $\pi^*$ is defined by the action optimizing Eqn 1.

Methods for solving an MDP include value iteration, policy iteration, and linear programming [12, 2].
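Value iteration simply iterates the Bellman backup of Equation 1 to its fixed point. A minimal sketch on a made-up two-state, two-action MDP (all transition and reward numbers are invented for illustration):

```python
import numpy as np

# Value iteration for a tiny 2-state, 2-action MDP, illustrating the
# Bellman fixed point V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')].

gamma = 0.9
# P[a, s, s'] : transition probabilities for action a
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
# R[s, a] : reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def value_iteration(P, R, gamma, tol=1e-8):
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

V_star, policy = value_iteration(P, R, gamma)
```

The contraction property of the discounted backup guarantees convergence here; the LP formulation below reaches the same $V^*$ without iteration.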
In the linear program (LP) formulation we solve the following problem:

    $\min_V \sum_{s \in S} V(s)$
    subject to: $V(s) \ge R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) V(s') \quad \forall s \in S, a \in A$,    (2)

where the values of $V(s)$ for every state $s$ are treated as variables.

Factorizations and LP approximations. In factored MDPs, the state space $S$ is defined in terms of state variables $X_1, X_2, \ldots, X_n$. As a result, the state space becomes exponential in the number of variables. Compact parameterizations of MDPs based on dynamic belief networks [7] and decomposable reward functions are routinely used to represent such MDPs more efficiently. However, the presence of a compact model does not imply the existence of efficient optimal solutions. To address this problem Koller and Parr [9] and Guestrin et al [8] propose to use a linear model [13]:

    $V(x) = \sum_i w_i f_i(x_i)$

to approximate the value function $V^*(x)$, where the $w_i$ are the linear coefficients to be found (fit).
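The exact LP of Equation 2 can be handed directly to an off-the-shelf solver. A minimal sketch for the same style of tiny MDP (the numbers are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

# Exact LP formulation of a small MDP:
#   minimize sum_s V(s)
#   subject to V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s')  for all s, a.

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]
nA, nS, _ = P.shape

# One inequality per (s, a): (gamma * P[a, s, :] - e_s) @ V <= -R(s, a)
A_ub, b_ub = [], []
for s in range(nS):
    for a in range(nA):
        row = gamma * P[a, s, :].copy()
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c=np.ones(nS), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * nS, method="highs")
V_lp = res.x   # optimal value function V*
```

Because every Bellman constraint is enforced, the minimizer of $\sum_s V(s)$ is exactly $V^*$; the factored ALP below keeps this structure but optimizes only a handful of weights.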
Here the $f_i$'s denote feature functions defined over subsets $x_i$ of the state variables. Given a factored binary-state MDP, the coefficients of the linear model can be found by solving the surrogate of the LP in Equation 2 [8]:

    $\min_w \sum_i w_i \sum_{x_i} f_i(x_i)$
    subject to: $\sum_i w_i f_i(x_i) \ge R(x,a) + \gamma \sum_i w_i \sum_{x_i'} P(x_i' \mid x_{\Gamma_i^a}, a) f_i(x_i') \quad \forall x, a$,    (3)

where $x_{\Gamma_i^a}$ are the parents of the state variables in $x_i$ under action $a$, and $R(x,a)$ decomposes to $\sum_j R_j(x_j, a)$, such that $R_j$ is a local reward function defined over a subset of state variables. Note that while the objective function can be computed efficiently, the number of constraints one has to satisfy remains exponential in the number of random variables. However, only a subset of these constraints becomes active and affects the solution. Guestrin et al [8] showed how to find the active constraints by solving a cost-network problem. Unfortunately, the cost-network formulation is NP-hard. An alternative approach for finding active constraints was devised by Schuurmans and Patrascu [15]. The approach implements a constraint-generation method [17] and appears to give very good performance on average. The idea is to greedily search for maximally violated constraints, which can be done efficiently by solving a linear optimization problem. These constraints are included in the linear program and the process is repeated until no violated constraints are found. De Farias and Van Roy [5] analyzed a Monte Carlo approach with randomly sampled constraints.
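The ALP over a linear value model can be sketched end to end on a state space small enough to enumerate every $(x, a)$ constraint; for large $n$, the cost-network or constraint-generation machinery above replaces this enumeration. The factored model below (three binary variables with independent flip probabilities, random rewards, bias-plus-indicator features) is invented for illustration:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

# ALP sketch: fit weights w of a linear value model V(x) = sum_i w_i f_i(x)
# by solving an LP whose constraints are the Bellman inequalities at every
# state-action pair of a small factored MDP.

rng = np.random.default_rng(0)
n_vars, n_actions, gamma = 3, 2, 0.9
states = list(product([0, 1], repeat=n_vars))

R = rng.uniform(0, 1, size=(len(states), n_actions))     # reward table
flip = rng.uniform(0.05, 0.3, size=(n_actions, n_vars))  # per-variable flip probs

def trans_prob(x, a, x2):
    # factored transition: each variable flips independently
    p = 1.0
    for j in range(n_vars):
        p *= flip[a, j] if x[j] != x2[j] else 1.0 - flip[a, j]
    return p

def features(x):
    # bias feature plus one indicator feature per state variable
    return np.array([1.0] + [float(v) for v in x])

n_feat = n_vars + 1
# objective: sum of each feature over all states (uniform state relevance)
c = np.array([sum(features(x)[i] for x in states) for i in range(n_feat)])

A_ub, b_ub = [], []
for s_idx, x in enumerate(states):
    f_x = features(x)
    for a in range(n_actions):
        ef = sum(trans_prob(x, a, x2) * features(x2) for x2 in states)
        A_ub.append(gamma * ef - f_x)    # (gamma*E[f] - f) @ w <= -R(x,a)
        b_ub.append(-R[s_idx, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_feat, method="highs")
w = res.x   # fitted weights of the linear value model
```

Including the constant feature keeps the LP feasible and bounded, a standard requirement in the ALP literature.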
Rewards are rep-\nEP\u0005\nQS\u0017 whereE\n\u009b.\r\u00a0\u009f\n\u0016O\r\u000fEP\u0005\nQS\u0017UT\n\r\u000fE>\u0017\u001bF\u0099H\u009cJ(K\nM\u00a2\u00a1\n\n\u0017b\u00a5\u0006E\nZ\u00a7\u00a6\n3We note that in general any bounded subspace of IRt can be transformed to\u00a8\n\nEP\u0005\nQS\u0017\n\n\u009b.\r\u000fE\n\n\u008cR\u00a9\u00ab\u00aa\n\n=\u00a4\u00a3\n\nXLY\n\n\fE\n\n.\n\n\n\u000b\nm\nV\n9\nm\n9\n9\n\u0017\nV\nV\np\nq\n\u0082\nz\np\np\nV\nw\nY\n}\n\u007f\nz\np\np\n}\nZ\n\u009a\n:\ni\n\u0016\nM\n\u0097\n\u009a\nM\n\u0097\n\u009a\nM\n\u0097\n\u009a\nM\n\u0097\n\u009a\nZ\n[\n\u009d\n\u000b\ni\nZ\n\u009e\n[\nD\nZ\n[\nD\nZ\nk\n\u0095\nt\n\f. One such operator has been studied by Rust [14] and is de(cid:2)ned as:\n\nThe problem with CMDPs is that in most cases the optimal value function does not have a\n(cid:2)nite support and cannot be computed. The solutions attempt to replace the value function\nor the optimal policy with a (cid:2)nite approximation.\nGrid-based MDP (GMDP) discretizations. A typical solution is to discretize the state\nspace to a set of grid points and approximate value functions over such points. Unfortu-\nnately, classic grid algorithms scale up exponentially with the number of state variables\n. Then\nthat is restricted to grid\n\n\u0005lkBkBkR\u0005\nEU\u00ae\u0012\u0001 be a set of grid points over the state space\u0002\u0004\u0003\u0006\u0005\b\u0007\n\t\u00a0\u000b\n\u0005\nE\n[4]. Let\u00ac\u00adF\ncan be approximated with an operator`/\u00af\nthe Bellman operator`\npoints\u00ac\n\u0005\nQ$\u0017\r\u000fE\n\r\u000fE\nis a normalizing constant. Equation 4 applied to grid points\u00ac\nthat\u00b1\n\u00ac*[ states. The solution,D\n\u00afhF\u00b2`\u009c\u00af\nnite state MDP with[\nMDP is to approximate the optimal value functionD\n\r\u000fE>\u0017 with an appropriate parametric\n\nde(cid:2)nes a (cid:2)-\n, approximates the original\ncontinuous-state MDP. 
Convergence properties of the approximation scheme in Equation 4 for random or pseudo-random samples were analyzed by Rust [14].

Parametric function approximations. An alternative way to solve a continuous-state MDP is to approximate the optimal value function $V^*(x)$ with an appropriate parametric function model [3]. The parameters of the model are fitted iteratively by applying one-step Bellman backups to a finite set of state points arranged on a fixed grid or obtained through Monte Carlo sampling. A least-squares criterion is used to fit the parameters of the model. In addition to parallel updates and optimizations, on-line update schemes based on gradient descent [3, 16] are very popular and can be used to optimize the parameters. The disadvantage of these methods is their instability and possible divergence [1].

3.2 LP approximations of CMDPs

Our objective is to develop an alternative to the above solutions that is based on ALP techniques and that takes advantage of model factorizations. It is easy to see that for a general continuous-state model the exact solution cannot be formulated as a linear program as was done in Equation 2, since the number of states is infinite. However, using linear representations of the value function we need to optimize only over a finite number of weights combining feature functions. So, adopting the ALP approach from factored MDPs (Section 2), the CMDP problem can be formulated as:

    $\min_w \sum_i w_i \int f_i(x_i) \, dx_i$
    subject to: $\sum_i w_i f_i(x_i) \ge R(x,a) + \gamma \sum_i w_i \int p(x_i' \mid x_{\Gamma_i^a}, a) f_i(x_i') \, dx_i' \quad \forall x, a$.

The above formulation of the ALP builds upon our observation that linear models in combination with factored transitions are well-behaved when integrated over the $[0,1]^n$ state space (or any bounded space) and nicely decompose along the state-variable subsets defining feature functions, similarly to Equation 3.
This simplification is a consequence of the following variable-elimination transformation: because the transition model factors along the state variables, all next-state variables outside the feature subset $x_i'$ integrate to one, so

    $\int_{[0,1]^n} p(x' \mid x, a) f_i(x_i') \, dx' = \int p(x_i' \mid x_{\Gamma_i^a}, a) f_i(x_i') \, dx_i'$.

Despite the decomposition, the ALP formulation of the factored CMDP comes with two concerns. First, the integrals may be improper and not computable. Second, we need to satisfy an infinite number of constraints (for all values of $x$ and $a$). In the following we give solutions to both issues.

Closed-form solutions. Integrals in the objective function and constraints depend on the choice of transition models and basis functions. We want all these integrals to be proper Riemannian integrals. We prefer integrals with closed-form expressions. To this point, we have identified conjugate classes of transition models and basis functions leading to closed-form expressions.
Beta transitions. To parameterize the transition model over $[0,1]$ we propose to use beta densities or their mixtures. The beta transition is defined as:

    $p(x_j' \mid x, a) = \mathrm{Beta}(x_j' \mid \alpha_j^a(x_{\Gamma_j^a}), \beta_j^a(x_{\Gamma_j^a}))$,

where $x_{\Gamma_j^a}$ is the parent set of a variable $x_j'$ under action $a$, and $\alpha_j^a, \beta_j^a$ define the parameters of the beta model.

Feature functions. A feature function form that is particularly suitable for the ALP and matches beta transitions is a product of power functions:

    $f_i(x_i) = \prod_{j \in x_i} x_j^{m_{ij}}$.

It is easy to show that for such a case the integrals in the objective function simplify to:

    $\int f_i(x_i) \, dx_i = \prod_{j \in x_i} \frac{1}{m_{ij} + 1}$.
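The beta-density/power-feature pairing is conjugate in exactly the sense needed here: the expectation of a power feature under a beta density is a ratio of gamma functions, $E[x^m] = \Gamma(\alpha+\beta)\,\Gamma(\alpha+m) / (\Gamma(\alpha)\,\Gamma(\alpha+\beta+m))$ for integer $m$. A quick numerical check of that identity (the $(\alpha, \beta, m)$ test values are arbitrary):

```python
import math
from scipy.integrate import quad

# Check that the m-th moment of a Beta(alpha, beta) density matches its
# gamma-function closed form, which is what makes the ALP integrals exact.

def beta_pdf(x, a, b):
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_moment(a, b, m):
    # closed form for E[x^m] under Beta(a, b)
    return (math.gamma(a + b) * math.gamma(a + m)
            / (math.gamma(a) * math.gamma(a + b + m)))

for a, b, m in [(2.0, 3.0, 1), (1.5, 4.0, 2), (5.0, 2.0, 3)]:
    numeric, _ = quad(lambda x: x ** m * beta_pdf(x, a, b), 0, 1)
    assert abs(numeric - beta_moment(a, b, m)) < 1e-8
```

For $m = 1$ the ratio collapses to the familiar beta mean $\alpha / (\alpha + \beta)$.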
Similarly, using our conjugate transition and basis models, the integrals in the constraints simplify to:

    $\int \mathrm{Beta}(x_j' \mid \alpha_j^a, \beta_j^a) \, (x_j')^{m_{ij}} \, dx_j' = \frac{\Gamma(\alpha_j^a + \beta_j^a)\,\Gamma(\alpha_j^a + m_{ij})}{\Gamma(\alpha_j^a)\,\Gamma(\alpha_j^a + \beta_j^a + m_{ij})}$,

where $\Gamma(\cdot)$ is the gamma function. For example, assuming features defined by products of state variables, $f_i(x_i) = \prod_{j \in x_i} x_j$, the ALP formulation becomes:

    $\min_w \sum_i w_i \prod_{j \in x_i} \frac{1}{2}$
    subject to: $\sum_i w_i \prod_{j \in x_i} x_j \ge R(x,a) + \gamma \sum_i w_i \prod_{j \in x_i} \frac{\Gamma(\alpha_j^a + \beta_j^a)\,\Gamma(\alpha_j^a + 1)}{\Gamma(\alpha_j^a)\,\Gamma(\alpha_j^a + \beta_j^a + 1)} \quad \forall x, a$.    (5)

ALP solution. Although the ALP uses infinitely many constraints, only a finite subset of constraints, the active constraints, is necessary to define the optimal solution. Existing ALP methods for factored finite-state MDPs search for this subset efficiently by taking advantage of local constraint decompositions and various heuristics. However, in the end these methods always rely on the fact that the decompositions are defined on a finite state subspace. Unfortunately, constraints in our model decompose over smaller but still continuous subspaces, so the existing solutions for finite-state MDPs cannot be applied directly.

Sampling constraints. To avoid the problem of continuous state spaces, we approximate the ALP solution using a finite set of constraints defined by a finite set of state-space points and actions in $A$. These state-space points can be defined by regular grids on state subspaces or via random sampling of states. In this work we focus on and experiment
For example, assuming features with products of state\n\n\u00b9\u00bb\u00ba\nvariables:m\nwhere\u00de\u001b\rbk\u00df\u0017\nminimizeo\n\nsubject to:\n\n\u009e\n[\nE\nM\n\u009e\n[\n\u00cc\ni\nM\nM\nj\nM\nM\nM\ni\nM\nM\nj\nM\nM\nM\n@\nX\n\u00d1\nm\n9\n9\n\u00b7\n\u00d2\nX\nx\n\u009f\n\u0098\nx\n\u009e\nk\n\u00a3\nX\nx\nm\n9\n9\n9\nF\n\u00a3\nX\nx\n\u00b7\n\u00d2\nX\nx\n\u009f\n\u0098\nx\n\u009e\n9\nF\n\u00b7\n\u00d2\nX\nx\n\u00a3\n\u00d2\n\u00b9\n\u009f\n\u0098\nx\n\u009e\n\u009e\nF\n\u00b7\n\u00d2\nX\nx\n\u0007\n\u00d5\n9\nk\n\u00a3\nw\nY\nx\n\u00b5\n\u00b6\n\u00b7\n\u00b8\nY\nw\nY\nx\n\u00bc\n\u0088\n\u00be\n\u0089\n}\n\u00be\n\u008a\n\u00c0\nz\np\n\u0088\np\n\u0088\n\u00b7\n\u00b8\nY\nw\nY\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u008b\n\u00d7\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\np\n\u007f\n\u00d7\n\u00d9\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u008b\n\u00db\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\np\n\u007f\n\u00d7\n\u00d9\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u008b\n\u008c\n9\n9\n\u009d\n\u00d2\nX\nx\n\u009f\nV\n\u00a9\nv\nV\np\nq\n\u0082\n\u00b7\n\u00b8\n\u00b9\n\u00ba\n\u00bd\n\u00be\n\u00b7\n\u00b8\nY\n\u00b9\n\u00ba\nw\nY\nx\n\u00d8\n\u00d9\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u008b\n\u007f\n\u00d8\n\u00d9\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u00db\n\u00be\n\u008a\n\u008b\n\u00be\n\u008a\n\u008e\n\u008f\n}\n\f)\na\n \n,\n}\n\n \n\n1\n\u2212\n\n \nj\n\nj\n\nx\n \n,\nx\n{\n \n|\n \n\u2019j\nx\n(\np\n \nl\ne\nd\no\nm\n \nn\no\ni\nt\ni\ns\nn\na\nr\nT\n\n5\n\n4\n\n3\n\n2\n\n1\n\n0\n\n0\n\na \u201e j, xj = 1\n\nxj \u2212 1 = 0.0\nxj \u2212 1 = 0.5\nxj \u2212 1 = 1.0\n0.5\nx\u2019j\n\n1\n\n8\n\n7\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\n0\n\n0\n\n(b)\n\na = j\n\n0.5\nx\u2019j\n\n1\n\n(a)\n\nFigure 1: a. Topologies of computer networks used in experiments. b. 
Transition densities for the $j$th computer and different previous-state/action combinations.

with the random sampling approach. For finite state spaces such a technique has been devised and analyzed by de Farias and Van Roy [5]. We note that the blind sampling approach can be improved via various heuristics.4 However, despite many possible heuristic improvements, we believe that the crucial benefit comes from the ALP formulation that "fits" the linear model and the subsequent constraint and subspace decompositions.

4 Experiments

To test the ALP method we use a continuous-state modification of the computer network example proposed by Guestrin et al [8]. Figure 1a illustrates three different network structures used in the experiments. Nodes in graphs represent computers. The state of a machine is represented by a number between 0 and 1 reflecting its processing capacity (the ability to process tasks). The network performance can be controlled through activities of a human operator: the operator can attend a machine (one at a time) or do nothing. Thus, there is a total of $n + 1$ actions, where $n$ is the number of computers in the network. The processing capacity of a machine fluctuates randomly and is determined by: (1) a random event (e.g., a software bug), (2) the machines connected to it, and (3) the presence of the operator at the machine console. The transition model represents the dynamics of the computer network. The model is factorized and defined in terms of beta densities $p(x_j' \mid x, a) = \mathrm{Beta}(x_j' \mid \alpha_j^a(x_{\Gamma_j}), \beta_j^a(x_{\Gamma_j}))$, where $x_j'$ is the current state of the $j$th computer, and $x_{\Gamma_j}$ describes the previous-step state of the computers affecting $j$.
We use beta parameters $\alpha_j^a$ and $\beta_j^a$ that depend on the previous states of computer $j$ and its network neighbors; attending machine $j$ (action $a = j$) shifts the density toward higher processing capacities, as illustrated in Figure 1b.
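Putting the pieces together, the following is a minimal end-to-end sketch of the constraint-sampled ALP for a single state variable on $[0,1]$: beta transitions, power-function features $x^m$, the closed-form objective integrals $\int_0^1 x^m dx = 1/(m+1)$, gamma-function constraint expectations, and randomly sampled constraints handed to an LP solver. The transition parameters, reward, and sample counts are invented for illustration and are far simpler than the network problem above:

```python
import math
import numpy as np
from scipy.optimize import linprog

# Constraint-sampled ALP for a toy 1-D CMDP with beta transitions and
# polynomial basis functions f_m(x) = x^m: the infinite constraint set of
# the CMDP ALP is replaced by constraints at randomly sampled states.

rng = np.random.default_rng(3)
gamma, max_degree, n_samples = 0.9, 3, 300
actions = (0, 1)
degrees = range(max_degree + 1)

def beta_params(x, a):
    # illustrative beta transition parameters depending on state and action
    alpha = 1.0 + 4.0 * x + (2.0 if a == 1 else 0.0)
    beta = 1.0 + 4.0 * (1.0 - x)
    return alpha, beta

def beta_moment(a_, b_, m):
    # E[x'^m] under Beta(a_, b_), closed form via gamma functions
    return (math.gamma(a_ + b_) * math.gamma(a_ + m)
            / (math.gamma(a_) * math.gamma(a_ + b_ + m)))

def reward(x, a):
    return x - 0.1 * a

c = np.array([1.0 / (m + 1) for m in degrees])   # int_0^1 x^m dx

A_ub, b_ub = [], []
for x in rng.uniform(0, 1, n_samples):           # sampled constraint states
    f_x = np.array([x ** m for m in degrees])
    for a in actions:
        al, be = beta_params(x, a)
        ef = np.array([beta_moment(al, be, m) for m in degrees])
        A_ub.append(gamma * ef - f_x)            # (gamma*E[f] - f) @ w <= -R
        b_ub.append(-reward(x, a))

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * len(c), method="highs")
w = res.x   # weights of the polynomial value-function approximation
```

Only the sampled constraints are enforced, so the result is an approximation of the full ALP solution; the constant feature $m = 0$ again keeps the LP feasible and bounded.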