{"title": "Policy-Gradient Methods for Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 9, "page_last": 16, "abstract": "", "full_text": "Policy-Gradient Methods for Planning\n\nStatistical Machine Learning, National ICT Australia, Canberra\n\nDouglas Aberdeen\n\ndoug.aberdeen@anu.edu.au\n\nAbstract\n\nProbabilistic temporal planning attempts to \ufb01nd good policies for acting\nin domains with concurrent durative tasks, multiple uncertain outcomes,\nand limited resources. These domains are typically modelled as Markov\ndecision problems and solved using dynamic programming methods.\nThis paper demonstrates the application of reinforcement learning \u2014 in\nthe form of a policy-gradient method \u2014 to these domains. Our emphasis\nis on large domains that are infeasible for dynamic programming. Our ap-\nproach is to construct simple policies, or agents, for each planning task.\nThe result is a general probabilistic temporal planner, named the Factored\nPolicy-Gradient Planner (FPG-Planner), which can handle hundreds of\ntasks, optimising for probability of success, duration, and resource use.\n\n1\n\nIntroduction\n\nTo date, only a few planning tools have attempted to handle general probabilistic temporal\nplanning problems. These tools have only been able to produce good policies for relatively\ntrivial examples. We apply policy-gradient reinforcement learning (RL) to these domains\nwith the goal of creating tools that produce good policies in real-world domains rather than\nperfect policies in toy domains. 
We achieve this by: (1) factoring the policy into simple\nindependent policies for starting each task; (2) presenting each policy with critical observa-\ntions instead of the entire state; (3) using function approximators for each policy; (4) using\nlocal optimisation methods instead of global optimisation; and (5) using algorithms with\nmemory requirements that are independent of the state space size.\n\nPolicy gradient methods do not enumerate states and are applicable to multi-agent settings\nwith function approximation [1, 2], thus they are a natural match for our approach to han-\ndling large planning problems. We use the GPOMDP algorithm [3] to estimate the gradient\nof a long-term average reward of the planner\u2019s performance, with respect to the parameters\nof each task policy. We show that maximising a simple reward function naturally minimises\nplan durations and maximises the probability of reaching the plan goal.\n\nA frequent criticism of policy-gradient methods compared to traditional forward chaining\nplanners \u2014 or even compared to value-based RL methods \u2014 is the lack of a clearly inter-\npretable policy. A minor contribution of this paper is a description of how policy-gradient\nmethods can be used to prune a decision tree over possible policies. After training, the\ndecision tree can be translated into a list of policy rules.\n\nPrevious probabilistic temporal planners include CPTP [4], Prottle [5], Tempastic [6] and a\nmilitary operations planner [7]. Most of these algorithms use some form of dynamic programming (either RTDP [8] or AO*) to associate values with each state/action pair. However,\nthis requires values to be stored for each encountered state. Even though these algorithms\ndo not enumerate the entire state space, their ability to scale is limited by memory size. Even\nproblems with only tens of tasks can produce millions of relevant states. 
CPTP, Prottle, and\nTempastic minimise either plan duration or failure probability, not both. The FPG-Planner\nminimises both of these metrics and can easily optimise over resources too.\n\n2 Probabilistic temporal planning\n\nTasks are the basic planning unit corresponding to grounded1 durative actions. Tasks have\nthe effect of setting condition variables to true or false. Each task has a set of preconditions,\neffects, resource requirements, and a \ufb01xed probability of failure. Durations may be \ufb01xed or\ndependent on how long it takes for other conditions to be established. A task is eligible to\nbegin when its preconditions are satis\ufb01ed and suf\ufb01cient resources are available. A starting\ntask may have some immediate effects. As tasks end, a set of effects appropriate to the\noutcome is applied. Typically, but not necessarily, succeeding tasks set some facts to\ntrue, while failing tasks do nothing or negate facts. Resources are occupied during task\nexecution and consumed when the task ends. Different outcomes can consume varying\nlevels of resources. The planning goal is to set a subset of the conditions to a desired value.\n\nThe closest work to ours is that of Peshkin et al. [1], which describes\nhow a policy-gradient approach can be applied to multi-agent MDPs. This work lays the\nfoundation for this application, but does not consider the planning domain speci\ufb01cally. It\nis also applied to relatively small domains, where the state space could be enumerated.\n\nActions in temporal planning consist of launching multiple tasks concurrently. The number\nof candidate actions available in a given state corresponds to the power set of the tasks that are eligible\nto start. That is, with N eligible tasks there are 2^N possible actions. Current planners\nexplore this action space systematically, pruning actions that lead to low rewards. 
When\ncombined with probabilistic outcomes, the state space explosion cripples existing planners\nfor tens of tasks and actions. A key reason to treat each task as an individual policy agent is to\ndeal with this explosion of the action space. We replace the single agent choosing from the\npower-set of eligible tasks with a simple agent for each task. The policy learnt by\neach agent is whether to start its associated task given its observation, independent of the\ndecisions made by the other agents. This idea alone does not simplify the problem. Indeed,\nif the agents received perfect state information they could learn to predict the decision of\nthe other agents and still act optimally. The signi\ufb01cant reduction in complexity arises from:\n(1) restricting the class of functions that represent agents, (2) providing only partial state\ninformation, (3) optimising locally, using gradient ascent.\n\n3 POMDP formulation of planning\n\nOur intention is to deliberately use simple agents that only consider partial state informa-\ntion. This requires us to explicitly consider partial observability. A \ufb01nite partially observ-\nable Markov decision process consists of: a \ufb01nite set of states s \u2208 S; a \ufb01nite set of actions\na \u2208 A; probabilities Pr[s'|s, a] of making state transition s \u2192 s' under action a; a reward\nfor each state r(s) : S \u2192 R; and a \ufb01nite set of observation vectors o \u2208 O seen by the agent\nin place of the complete state descriptions. For this application, observations are drawn\ndeterministically given the state, but more generally may be stochastic. Goal states are\nstates where all the goal state variables are satis\ufb01ed. From failure states it is impossible\nto reach a goal state, usually because time or resources have run out. 
These two classes\nof state are combined to form the set of reset states that produce an immediate reset to the\ninitial state s0. A single trajectory through the state space consists of many individual trials\nthat automatically reset to s0 each time a goal state or failure state is reached.\n\n1Grounded means that tasks do not have parameters that can be instantiated.\n\nPolicies are stochastic, mapping observation vectors o to a probability over actions. Let N\nbe the number of basic tasks available to the planner. In our setting an action a is a binary\nvector of length N. An entry of 1 at index n means \u2018Yes\u2019 begin task n, and a 0 entry means\n\u2018No\u2019 do not start task n. The probability of actions is Pr[a|o, \u03b8], where conditioning on \u03b8\nre\ufb02ects the fact that the policy is controlled by a set of real valued parameters \u03b8 \u2208 Rp. This\npaper assumes that all stochastic policies (i.e., any values for \u03b8) reach reset states in \ufb01nite\ntime when executed from s0. This is enforced by limiting the maximum duration of a plan.\nThis ensures that the underlying MDP is ergodic, a necessary condition for GPOMDP. The\nGPOMDP algorithm maximises the long-term average reward\n\n\u03b7(\u03b8) = lim_{T\u2192\u221e} (1/T) \u2211_{t=0}^{T\u22121} r(s_t).\n\nIn the context of planning, the instantaneous reward provides the agent with a measure of\nprogress toward the goal. A simple reward scheme is to set r(s) = 1 for all states s that\nrepresent the goal state, and 0 for all other states. To maximise \u03b7(\u03b8), successful planning\noutcomes must be reached as frequently as possible. This has the desired property of\nsimultaneously minimising plan duration, as well as maximising the probability of reaching\nthe goal (failure states achieve no reward). 
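The average-reward objective can be checked by direct simulation: run one long trajectory with automatic resets and average the instantaneous rewards. A minimal Python sketch, assuming a hypothetical two-state reset chain rather than a real planning domain (all names here are illustrative):

```python
import random

def estimate_eta(step, r, s0, T=200_000, seed=0):
    """Monte Carlo estimate of eta(theta) = lim_T (1/T) sum_t r(s_t).

    `step` advances the simulation by one transition; reset states are
    handled inside `step` by jumping straight back to s0.
    """
    rng = random.Random(seed)
    s, total = s0, 0.0
    for _ in range(T):
        total += r(s)
        s = step(s, rng)
    return total / T

# Hypothetical chain: from s0 a trial "succeeds" with probability 0.25,
# entering a goal state that pays reward 1 and immediately resets.
def step(s, rng):
    return "goal" if s == "s0" and rng.random() < 0.25 else "s0"

def r(s):
    return 1.0 if s == "goal" else 0.0

eta = estimate_eta(step, r, "s0")
```

Policies that reach the paying goal state more often per unit time score higher under this average, which is why maximising it also favours shorter plans.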
It is tempting to provide a negative reward for\nfailure states, but this can introduce poor local maxima in the form of policies that avoid\nnegative rewards by avoiding progress altogether. We provide a reward of 1000 each time\nthe goal is achieved, plus an admissible heuristic reward for progress toward the goal. This\nadditional shaping reward provides a reward of 1 for every goal state variable achieved, and\n-1 for every goal variable that becomes unset. Policies that are optimal with the additional\nshaping reward are still optimal under the basic goal state reward [9].\n\n3.1 Planning state space\n\nFor probabilistic temporal planning our state description contains [7]: the state\u2019s absolute\ntime, a queue of impending events, the status of each task, the truth value of each condition,\nand the available resources. In a particular state, only a subset of the tasks will\nsatisfy all preconditions for execution. We call these tasks eligible. When a decision to start\na \ufb01xed duration task is made, an end-task event is added to a time ordered event queue. The\nevent queue holds a list of events that the planner is committed to, although the outcome of\nthose events may be uncertain.\n\nThe generation of successor states is shown in Alg. 1. The algorithm begins by starting the\ntasks given by the current action, implementing any immediate effects. An end-task event\nis added at an appropriate time in the queue. The state update then proceeds to process\nevents until there is at least one task that is eligible to begin. Events have probabilistic\noutcomes. Line 20 of Alg. 1 samples one possible outcome from the distribution imposed\nby probabilities in the problem de\ufb01nition. Future states are only generated at points where\ntasks can be started. 
Thus, if an event outcome is processed and no tasks are enabled, the\nsearch recurses to the next event in the queue.\n\n4 Factored Policy-Gradient\n\nWe assume the presence of policy agents, parameterised with independent sets of param-\neters for each agent \u03b8 = {\u03b81, . . . , \u03b8N}. We seek to adjust the parameters of the policy to\nmaximise the long-term average reward \u03b7(\u03b8). The GPOMDP algorithm [3] estimates the\ngradient \u2207\u03b7(\u03b8) of the long-term average reward with respect to the current set of policy\nparameters.\n\nAlg. 1: \ufb01ndSuccessor(State s, Action a)\n1: for each an = \u2018Yes\u2019 in a do\n2:   s.beginTask(n)\n3:   s.addEvent(n, s.time + taskDuration(n))\n4: end for\n5: repeat\n6:   if s.time > maximum makespan then\n7:     s.failureLeaf = true\n8:     return\n9:   end if\n10:  if s.operationGoalsMet() then\n11:    s.goalLeaf = true\n12:    return\n13:  end if\n14:  if \u00acs.anyEligibleTasks() then\n15:    s.failureLeaf = true\n16:    return\n17:  end if\n18:  event = s.nextEvent()\n19:  s.time = event.time\n20:  sample outcome from event\n21:  s.implementEffects(outcome)\n22: until s.anyEligibleTasks()\n\nAlg. 2: Gradient Estimator\n1: Set s0 to initial state, t = 0, e_t = [0]\n2: while t < T do\n3:   e_t = \u03b2 e_{t\u22121}\n4:   Generate observation o_t of s_t\n5:   for each eligible task n do\n6:     Sample a_tn = Yes or a_tn = No\n7:     e_t = e_t + \u2207 log Pr[a_tn|o_t, \u03b8_n]\n8:   end for\n9:   Try action a_t = {a_t1, a_t2, . . . , a_tN}\n10:  while mutex prohibits a_t do\n11:    randomly disable task in a_t\n12:  end while\n13:  s_{t+1} = \ufb01ndSuccessor(s_t, a_t)\n14:  \u02c6\u2207_t \u03b7(\u03b8) = \u02c6\u2207_{t\u22121} \u03b7(\u03b8) + (1/(t+1)) (r(s_{t+1}) e_t \u2212 \u02c6\u2207_{t\u22121} \u03b7(\u03b8))\n15:  t \u2190 t + 1\n16: end while\n17: Return \u02c6\u2207_T \u03b7(\u03b8)\n\n
Once an estimate \u02c6\u2207\u03b7(\u03b8) is computed over T simulation steps, we maximise\nthe long-term average reward with the gradient ascent update \u03b8 \u2190 \u03b8 + \u03b1 \u02c6\u2207\u03b7(\u03b8), where \u03b1 is a small\nstep size. The experiments in this paper use a line search to determine good values of \u03b1.\nWe do not guarantee that the best representable policy is found, but our experiments have\nproduced policies comparable to global methods like real-time dynamic programming [8].\n\nThe algorithm works by sampling a single long trajectory through the state space (Fig. 4):\n(1) the \ufb01rst state represents time 0 in the plan; (2) the agents all receive the vector observa-\ntion ot of the current state st; (3) each agent representing an eligible task emits a probability\nof starting; (4) each agent samples start or do not start and issues it as a planning action;\n(5) the state transition is sampled with Alg. 1; (6) the agents receive the global reward for\nthe new state and action and update their gradient estimates. Steps 1 to 6 are repeated T times.\nEach vector action at is a combination of independent \u2018Yes\u2019 or \u2018No\u2019 choices made by each\neligible agent. Each agent is parameterised by an independent set of parameters that make\nup \u03b8 \u2208 Rp: \u03b81, \u03b82, . . . , \u03b8N . If atn represents the binary decision made by agent n at time t\nabout whether to start its corresponding task, then the policy factors into\n\nPr[at|ot, \u03b8] = Pr[at1, . . . , atN|ot, \u03b81, . . . , \u03b8N ] = Pr[at1|ot, \u03b81] \u00d7 \u00b7\u00b7\u00b7 \u00d7 Pr[atN|ot, \u03b8N ].\n\nIt is not necessary for all agents to receive the same observation, and it may be advantageous\nto show different agents different parts of the state, leading to a decentralised planning\nalgorithm. Similar approaches are adopted by Peshkin et al. [1] and Tao et al. [2], using policy-\ngradient methods to train multi-agent systems. 
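Because the policy factors into independent per-task probabilities, a joint action can be sampled by querying each eligible agent on its own. A minimal sketch with hypothetical linear-logistic agents (function and variable names are illustrative, not from the paper):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_factored_action(obs, thetas, eligible, rng):
    """Sample a length-N binary action: agent n independently decides
    whether to start task n, so Pr[a|o,theta] is a product of per-agent
    Bernoulli probabilities (hypothetical linear agents)."""
    action, log_prob = [0] * len(thetas), 0.0
    for n, theta in enumerate(thetas):
        if n not in eligible:              # ineligible tasks never start
            continue
        p_yes = sigmoid(sum(o * w for o, w in zip(obs, theta)))
        start = 1 if rng.random() < p_yes else 0
        action[n] = start
        log_prob += math.log(p_yes if start else 1.0 - p_yes)
    return action, log_prob

rng = random.Random(1)
obs = [1.0, 0.0, 1.0]                      # includes a constant bias bit
thetas = [[0.0] * 3, [5.0, 0.0, 5.0], [-5.0] * 3]
a, lp = sample_factored_action(obs, thetas, eligible={0, 1, 2}, rng=rng)
```

The log-probability of the joint action is the sum of the per-agent terms, which is exactly the quantity the gradient estimator differentiates.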
The main requirement for each policy-agent\nis that log Pr[atn|ot, \u03b8n] be differentiable with respect to the parameters for each choice\natn = \u2018Yes\u2019 or \u2018No\u2019. We now describe two such agents.\n\n4.1 Linear approximator agents\n\nOne representation of agents is a linear network mapped into probabilities using a logistic\nregression function:\n\nPr[atn = Yes|ot, \u03b8n] = exp(o_t^T \u03b8n) / (exp(o_t^T \u03b8n) + 1)    (1)\n\nFig. 3: Decision tree agent.\n\nFig. 4: Individual task-policies make independent decisions.\n\nIf the dimension of the observation vector is |o| then each set of parameters \u03b8n can be\nthought of as an |o| vector that represents the approximator weights for task n. The log\nderivatives, necessary for Alg. 2, are given in [10]. Initially, the parameters are set to small\nrandom values: a near uniform random policy. This encourages exploration of the action\nspace. Each gradient step typically moves the parameters closer to a deterministic policy.\nAfter some experimentation we chose an observation vector that is a binary description of\nthe eligible tasks and the state variable truth values plus a constant 1 bit to provide bias to\nthe agents\u2019 linear networks.\n\n4.2 Decision tree agents\n\nOften we have a selection of potential control rules. A decision tree can represent all\nsuch control rules at the leaves. The nodes are additional parameterised or hardwired rules\nthat select between different branches, and therefore different control rules. An action\na is selected by starting at the root node and following a path down the tree, visiting a\nset of decision nodes D. 
At each node we either apply a hard coded branch selection\nrule, or sample a stochastic branch from the probability distribution induced by the\nparameterisation. Assuming the independence of decisions at each node, the probability of\nreaching an action leaf l equals the product of branch probabilities at each decision node\n\nPr[a = l|o, \u03b8] = \u220f_{d\u2208D} Pr[d'|o, \u03b8d],    (2)\n\nwhere d represents the current decision node, and d' represents the next node visited in the\ntree. The \ufb01nal next node d' is the leaf l. The probability of a branch followed as a result\nof a hard-coded rule is 1. The individual Pr(d'|o, \u03b8d) functions can be any differentiable\nfunction of the observation vector o.\nFor multi-agent domains, such as our formulation of planning, we have a decision tree for\neach task agent. We use the same initial tree (with different parameters) for each agent,\nshown in Fig. 3. Nodes A, D, F, H represent hard coded rules that switch with probability\none between the Yes and No branches based on a boolean observation that gives the truth\nof the statement in the node for the current state. Nodes B, C, E, G are parameterised so\nthat they select branches stochastically. For this application, the probability of choosing\nthe Yes or No branches is a single parameter logistic function that is independent of the\nobservations. Parameter adjustments have the simple effect of pruning parts of the tree that\nrepresent poor policies, leaving the hard coded rules to choose the best action given the\nobservation. The policy encoded by the parameter is written in the node label. 
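Eq. (2) is cheap to evaluate: walk the root-to-leaf path and multiply the branch probabilities, with hard-coded rules contributing a factor of one. A minimal sketch (the node layout is hypothetical, not the exact tree of Fig. 3):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probability(branches):
    """Pr[a=l|o,theta] as the product over visited decision nodes of the
    probability of the branch taken; hard-coded (deterministic) branches
    contribute a factor of 1."""
    p = 1.0
    for kind, value in branches:
        if kind == "hard":          # deterministic rule: factor 1
            continue
        p *= value                  # stochastic branch probability
    return p

# Hypothetical path to one leaf: a hard rule, then two single-parameter
# logistic switches (observation-independent, as in the paper's trees).
theta_B, theta_C = 0.0, 2.0
path = [("hard", None),
        ("soft", sigmoid(theta_B)),        # Pr[Yes] = 0.5
        ("soft", 1.0 - sigmoid(theta_C))]  # Pr[No]  = 1 - sigma(2.0)
p_leaf = leaf_probability(path)
```

As parameters move toward deterministic choices, whole subtrees receive probability near zero, which is the pruning effect described above.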
For example\nfor task agent n, and decision node C \u201ctask duration matters?\u201d, we have the probability\n\nPr(Yes|o, \u03b8n,C) = Pr(Yes|\u03b8n,C) = exp(\u03b8n,C) / (exp(\u03b8n,C) + 1).\n\nThe log gradient of this function is given in [10]. If we set the parameters to always select the\ndashed branch in Fig. 3, we would be following the policy:\nif the task IS eligible, and\nthe probability of this task's success does NOT matter, and the duration of this task DOES matter,\nand this task IS fast, then start; otherwise do not start. Apart from making it easy to interpret\nthe optimised decision tree as a set of \u2014 possibly stochastic \u2014 if-then rules, this representation can also\nencode highly expressive policies with only a few parameters.\n\n4.3 GPOMDP for planning\nAlg. 2 describes the algorithm for computing \u02c6\u2207\u03b7(\u03b8), based on GPOMDP [3]. The vector\nquantity et is an eligibility trace.\nIt has dimension p (the total number of parameters),\nand can be thought of as storing the eligibility of each parameter for being reinforced\nafter receiving a reward. The gradient estimate provably converges to a biased estimate of\n\u2207\u03b7(\u03b8) as T \u2192 \u221e. The quantity \u03b2 \u2208 [0, 1) controls the degree of bias in the estimate.\nAs \u03b2 approaches 1, the bias of the estimates drops to 0. However, if \u03b2 = 1, estimates\nexhibit in\ufb01nite variance in the limit as T \u2192 \u221e. Thus the parameter \u03b2 is used to achieve\na bias/variance tradeoff in our stochastic gradient estimates. GPOMDP gradient estimates\nhave been proven to converge, even under partial observability.\n\nLine 7 of Alg. 2 computes the log gradient of the sampled action probability and adds the gradient\nfor the n\u2019th agent\u2019s parameters into the eligibility trace. The gradient for parameters not\nrelating to agent n is 0. We do not compute Pr[atn|ot, \u03b8n] or gradients for tasks with\nunsatis\ufb01ed preconditions. 
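The trace and gradient-estimate updates of Alg. 2 can be sketched using the logistic agent of Eq. (1), for which the log gradient is (1 - p)o for a 'Yes' sample and -p o for a 'No' sample. A single-agent sketch under those assumptions (names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_log_logistic(obs, theta, start):
    """d/dtheta log Pr[a|o,theta] for a logistic agent as in Eq. (1):
    (1 - p)*o if the sampled choice was 'Yes', else -p*o."""
    p = sigmoid(sum(o * w for o, w in zip(obs, theta)))
    c = (1.0 - p) if start else -p
    return [c * o for o in obs]

def gpomdp_update(grad_est, trace, glog, reward, t, beta):
    """One step of the estimator: discount the eligibility trace, add the
    new log gradient, then fold the reward into the running average."""
    trace = [beta * e + g for e, g in zip(trace, glog)]
    grad_est = [d + (reward * e - d) / (t + 1)
                for d, e in zip(grad_est, trace)]
    return grad_est, trace

# Two steps of a hypothetical single-agent run.
theta, beta = [0.0, 0.0], 0.9
trace, grad_est = [0.0, 0.0], [0.0, 0.0]
obs = [1.0, 1.0]
grad_est, trace = gpomdp_update(grad_est, trace,
                                grad_log_logistic(obs, theta, start=True),
                                reward=1.0, t=0, beta=beta)
grad_est, trace = gpomdp_update(grad_est, trace,
                                grad_log_logistic(obs, theta, start=False),
                                reward=0.0, t=1, beta=beta)
```

Repeated over a long trajectory, this running average yields the biased estimate of the gradient discussed above, with beta trading bias against variance.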
If all eligible agents decide not to start their tasks, we issue a\nnull-action. If the state event queue is not empty, we process the next event; otherwise, time\nis incremented by 1 to ensure all possible policies will eventually reach a reset state.\n\n5 Experiments\n\n5.1 Comparison with previous work\n\nWe compare the FPG-Planner with our earlier planner for military\noperations [7], which is based on real-time dynamic programming (RTDP) [8]. The domains\ncome from the Australian Defence Science and Technology Organisation, and represent\nmilitary operations planning scenarios. There are two problems, the \ufb01rst with 18 tasks\nand 12 conditions, and the second with 41 tasks and 51 conditions. The goal is to set the\n\u201cObjective island secured\u201d variable to true. There are multiple interrelated tasks that can\nlead to the goal state. Tasks fail or succeed with a known probability and can only execute\nonce, leading to relatively large probabilities of failure even for optimal plans. See [7] for\ndetails. Unless otherwise stated, FPG-Planner experiments used T = 500,000 gradient estimation\nsteps and \u03b2 = 0.9. Optimisation time was limited to 20 minutes wall clock time on a single-\nuser 3GHz Pentium IV with 1GB RAM. All evaluations are based on 10,000 simulated\nexecutions of \ufb01nalised policies. 
Results quote the average duration, resource consumption,\nand the percentage of plans that terminate in a failure state.\n\nWe repeat the comparison experiments 50 times with different random seeds and report\nmean and best results in Table 1.\n\nTable 1: Two domains compared with a dynamic programming based planner.\n\nProblem      |      RTDP       | Factored Linear |  Factored Tree\n             | Dur  Res  Fail% | Dur  Res  Fail% | Dur  Res  Fail%\nAssault Ave  | 171  8.0  26.1  | 105  8.3  26.6  | 115  8.3  27.1\nAssault Best | 113  6.2  24.0  | 93.1 8.7  23.1  | 112  8.4  25.6\nWebber Ave   | 245  4.4  58.1  | 193  4.1  57.9  | 186  4.1  58.0\nWebber Best  | 217  4.2  57.7  | 190  4.1  57.0  | 181  4.1  57.3\n\nTable 2: Effect of different observations.\n\nObservation      | Dur | Res | Fail%\nEligible & Conds | 105 | 8.3 | 26.6\nConds only       | 112 | 8.1 | 28.1\nEligible only    | 112 | 8.1 | 29.6\n\nTable 3: Results for the Art40/25 domain.\n\nPolicy    | Dur | Res | Fail%\nRandom    | 206 | 394 | 83.4\nNaive     | 231 | 332 | 78.6\nLinear    | 67  | 121 | 7.4\nDumb Tree | 157 | 92  | 19.1\nProb Tree | 62  | 156 | 10.9\nDur Tree  | 72  | 167 | 17.4\nRes Tree  | 136 | 53  | 8.50\n\nThe \u201cBest\u201d plan minimises an arbitrarily chosen com-\nbined metric of 10 \u00d7 fail% + dur. FPG-Planning with a linear approximator signi\ufb01cantly\nshortens the duration of plans, without increasing the failure rate. The very simple deci-\nsion tree performs less well than the linear approximator, but better than the dynamic\nprogramming algorithm. This is somewhat surprising given the simplicity of the tree for\neach task. The shorter duration for the Webber decision tree is probably due to the slightly\nhigher failure rate. Plans that fail early produce shorter durations.\nTable 1 assumes that the observation vector o presented to linear agents is a binary descrip-\ntion of the eligible tasks and the condition truth values plus a constant 1 bit to provide bias\nto the agents\u2019 linear networks. 
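The observation encoding just described (one bit per task for eligibility, one bit per condition truth value, plus a constant bias bit) can be sketched as follows; the function name is illustrative:

```python
def make_observation(eligible, conditions):
    """Binary observation vector: one bit per task (is it eligible?),
    one bit per condition (is it true?), plus a constant 1 that acts
    as a bias input to the agents' linear networks."""
    obs = [1.0 if e else 0.0 for e in eligible]
    obs += [1.0 if c else 0.0 for c in conditions]
    obs.append(1.0)  # constant bias bit
    return obs

obs = make_observation(eligible=[True, False, True],
                       conditions=[False, True])
# obs has one entry per task, one per condition, and a trailing 1
```

Dropping either group of bits shrinks the vector, which is the ablation reported in Table 2.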
Table 2 shows that giving the agents less information in the\nobservation harms performance.\n\n5.2 Large arti\ufb01cial domains\n\nEach scenario consists of N tasks and C state variables. The goal state of the synthetic\nscenarios is to assert 90% of the state variables, chosen during scenario synthesis, to be\ntrue. See [10] for details. All generated problems have scope for choosing tasks instead of\nmerely scheduling them. All synthetic scenarios are guaranteed to have at least one policy\nwhich will reach the operation goal assuming all tasks succeed. Even a few tens of tasks\nand conditions can generate a state space too large for main memory.\n\nWe generated 37 problems, each with 40 tasks and 25 conditions (Art40/25). Although the\nnumber of tasks and conditions is similar to the Webber problem described above, these\nproblems present signi\ufb01cantly more choices to the planner, making planning non-\ntrivial. Unlike the initial experiments, all tasks can be repeated as often as necessary, so the\noverall probability of failure depends on how well the planner chooses and orders tasks to\navoid running out of time and resources. Our RTDP based planner was not able to perform\nany signi\ufb01cant optimisation in 20 minutes due to memory problems. Thus, to demonstrate that FPG-\nPlanning is having some effect, we compared the optimised policies to two simple policies.\nThe random policy starts each eligible task with probability 0.5. The naive policy starts\nall eligible tasks. Both of these policies suffer from excessive resource consumption and\nnegative effects that can cause failure.\nTable 3 shows that the linear approximator produces the best plans, but it requires C + 1\nparameters per task. The results for the decision tree illustrated in Fig. 3 are given in the\n\u201cProb Tree\u201d row. This tree uses a constant 4 parameters per task, and consequently requires\nfewer operations when computing gradients. 
The \u201cDumb\u201d row is a decision stub, with one\nparameter per task that simply learns whether to start when eligible. The remaining \u201cDur\u201d\nand \u201cRes\u201d Tree rows re-order the nodes in Fig. 3 to swap the nodes C and E respectively\nwith node B. This tests the sensitivity of the tree to node ordering. There appears to be\nsigni\ufb01cant variation in the results. For example, when node E is swapped with B, the\nresultant policies use fewer resources.\n\nWe also performed optimisation of a 200-task, 100-condition problem generated using the\nsame rules as the Art40/25 domain. The naive policy had a failure rate of 72.4%. No\ntime limit was applied. Linear network agents (20,200 parameters) optimised for 14 hours\nbefore terminating with small gradients, resulting in a plan with a 20.8% failure rate. The\ndecision tree agent (800 parameters) optimised for 6 hours before terminating with a 1.7%\nfailure rate. The smaller number of parameters and a priori policies embedded in the tree\nallow the decision tree to perform well in very large domains. Inspection of the resulting\nparameters demonstrated that different tasks pruned different regions of the decision tree.\n\n6 Conclusion\n\nWe have demonstrated an algorithm with great potential to produce good policies in real-\nworld domains. Further work will re\ufb01ne our parameterised agents, and validate this ap-\nproach on realistic larger domains. We also wish to characterise possible local minima.\n\nAcknowledgements\n\nThank you to Olivier Buffet and Sylvie Thi\u00e9baux for many helpful comments. National\nICT Australia is funded by the Australian Government\u2019s Backing Australia\u2019s Ability pro-\ngram and the Centre of Excellence program. This project was also funded by the Australian\nDefence Science and Technology Organisation.\n\nReferences\n\n[1] L. Peshkin, K.-E. Kim, N. Meuleau, and L. P. Kaelbling. Learning to cooperate via policy\n\nsearch. 
In UAI, 2000.\n\n[2] Nigel Tao, Jonathan Baxter, and Lex Weaver. A multi-agent, policy-gradient approach to net-\n\nwork routing. In Proc. ICML\u201901. Morgan Kaufmann, 2001.\n\n[3] J. Baxter, P. Bartlett, and L. Weaver. Experiments with in\ufb01nite-horizon, policy-gradient estima-\n\ntion. JAIR, 15:351\u2013381, 2001.\n\n[4] Mausam and Daniel S. Weld. Concurrent probabilistic temporal planning. In Proc. Interna-\ntional Conference on Automated Planning and Scheduling, Monterey, CA, June 2005. AAAI.\n[5] I. Little, D. Aberdeen, and S. Thi\u00e9baux. Prottle: A probabilistic temporal planner. In Proc.\n\nAAAI\u201905, 2005.\n\n[6] H\u00e5kan L. S. Younes and Reid G. Simmons. Policy generation for continuous-time stochastic\n\ndomains with concurrency. In Proc. of ICAPS\u201904, volume 14, 2004.\n\n[7] Douglas Aberdeen, Sylvie Thi\u00e9baux, and Lin Zhang. Decision-theoretic military operations\n\nplanning. In Proc. ICAPS, volume 14, pages 402\u2013411. AAAI, June 2004.\n\n[8] A.G. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming.\n\nArti\ufb01cial Intelligence, 72, 1995.\n\n[9] A.Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory\n\nand application to reward shaping. In Proc. ICML\u201999, 1999.\n\n[10] Douglas Aberdeen. The factored policy-gradient planner. Technical report, NICTA, 2005.\n\n", "award": [], "sourceid": 2910, "authors": [{"given_name": "Douglas", "family_name": "Aberdeen", "institution": null}]}