{"title": "MDPs with Non-Deterministic Policies", "book": "Advances in Neural Information Processing Systems", "page_first": 1065, "page_last": 1072, "abstract": "Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system.", "full_text": "MDPs with Non-Deterministic Policies\n\nMahdi Milani Fard\n\nSchool of Computer Science\n\nMcGill University\nMontreal, Canada\n\nJoelle Pineau\n\nSchool of Computer Science\n\nMcGill University\nMontreal, Canada\n\nmmilan1@cs.mcgill.ca\n\njpineau@cs.mcgill.ca\n\nAbstract\n\nMarkov Decision Processes (MDPs) have been extensively studied and used in the\ncontext of planning and decision-making, and many methods exist to \ufb01nd the opti-\nmal policy for problems modelled as MDPs. Although \ufb01nding the optimal policy\nis suf\ufb01cient in many domains, in certain applications such as decision support sys-\ntems where the policy is executed by a human (rather than a machine), \ufb01nding all\npossible near-optimal policies might be useful as it provides more \ufb02exibility to\nthe person executing the policy. 
In this paper we introduce the new concept of\nnon-deterministic MDP policies, and address the question of \ufb01nding near-optimal\nnon-deterministic policies. We propose two solutions to this problem, one based\non a Mixed Integer Program and the other one based on a search algorithm. We\ninclude experimental results obtained from applying this framework to optimize\ntreatment choices in the context of a medical decision support system.\n\n1\n\nIntroduction\n\nMarkov Decision Processes (MDPs) have been extensively studied in the context of planning and\ndecision-making. In particular, MDPs have emerged as a useful framework for optimizing action\nchoices in the context of medical decision support systems [1, 2, 3, 4]. Given an adequate MDP\nmodel (or data source), many methods can be used to \ufb01nd a good action-selection policy. This pol-\nicy is usually a deterministic or stochastic function [5]. But policies of these types face a substantial\nbarrier in terms of gaining acceptance from the medical community, because they are highly pre-\nscriptive and leave little room for the doctor\u2019s input. In such cases, where the actions are executed\nby a human, it may be preferable to instead provide several (near-)equivalently good action choices,\nso that the agent can pick among those according to his or her own heuristics and preferences. 1\nTo address this problem, this paper introduces the notion of a non-deterministic policy 2, which is\na function mapping each state to a set of actions, from which the acting agent can choose. We aim\nfor this set to be as large as possible, to provide freedom of choice to the agent, while excluding\nany action that is signi\ufb01cantly worse than optimal. Unlike stochastic policies, here we make no\nassumptions regarding which action will be executed. 
This choice can be based on the doctor\u2019s\nqualitative assessment, patient\u2019s preferences, or availability of treatment.\nWhile working with non-deterministic policies, it is important to ensure that by adding some free-\ndom of choice to the policy, the worst-case expected return of the policy is still close enough to the\noptimal value. We address this point by providing guarantees on the expected return of the non-\ndeterministic policy. We de\ufb01ne a set of optimization problems to \ufb01nd such a policy and provide\ntwo algorithms to solve this problem. The \ufb01rst is based on a Mixed Integer Program formulation,\nwhich provides the best solution\u2014in the sense of maximizing the choice of action, while remaining\n1This is especially useful given that human preferences are often dif\ufb01cult to quantify objectively, and thus\n\ndif\ufb01cult to incorporate in the reward function.\n\n2Borrowing the term \u201cnon-deterministic\u201d from the theory of computation, as opposed to deterministic or\n\nstochastic actions.\n\n\fwithin an allowed performance-loss threshold\u2014but with high computational cost. Then we describe\na simple search algorithm that can be much more ef\ufb01cient in some cases.\nThe main contributions of this work are to introduce the concept of non-deterministic policies, pro-\nvide solution methods to compute such policies, and demonstrate the usefulness of this new model\nfor providing acceptable solutions in medical decision support systems. 
From a practical perspective, we aim to improve the acceptability of MDP-based decision-support systems.\n\n2 Non-Deterministic Policies\n\nIn this section, we formulate the concept of non-deterministic policies and provide some definitions that are used throughout the paper.\nAn MDP M = (S, A, T, R) is defined by a set of states S, a function A(s) mapping each state to a set of actions, a transition function T(s, a, s') defined as:\n\nT(s, a, s') = p(s_{t+1} = s' | s_t = s, a_t = a), ∀s, s' ∈ S, a ∈ A(s), (1)\n\nand a reward function R(s, a) : S × A → [Rmin, Rmax]. Throughout the paper we assume finite state, finite action, discounted reward MDPs, with the discount factor denoted by γ.\nA deterministic policy is a function from states to actions. The optimal deterministic policy is the policy that maximizes the expected discounted sum of rewards (Σ_t γ^t r_t) if the agent acts according to that policy. The value of a state-action pair (s, a) according to the optimal deterministic policy on an MDP M = (S, A, T, R) satisfies the Bellman optimality equation [6]:\n\nQ*_M(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a' ∈ A(s')} Q*_M(s', a'). (2)\n\nWe further define the optimal value of state s, denoted by V*_M(s), to be max_{a ∈ A(s)} Q*_M(s, a).\nA non-deterministic policy is a function that maps each state s to a non-empty set of actions denoted by Π(s) ⊆ A(s). The agent can choose to do any action a ∈ Π(s) whenever the MDP is in state s. Here we will provide a worst-case analysis, presuming that the agent may choose the worst action in each state.\nThe value of a state-action pair (s, a) according to a non-deterministic policy Π on an MDP M = (S, A, T, R) is given by the recursive definition:\n\nQ^Π_M(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') min_{a' ∈ Π(s')} Q^Π_M(s', a'), (3)\n\nwhich is the worst-case expected return under the allowed set of actions. 
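As a concrete illustration of Eqn 3, the worst-case value of a non-deterministic policy can be computed by value iteration with a min over the allowed actions. This is only a minimal sketch: the array layout and all function and variable names are our own convention, not from the paper.

```python
import numpy as np

def eval_nondet_policy(T, R, Pi, gamma, iters=2000, tol=1e-10):
    """Worst-case action values Q^Pi of a non-deterministic policy (Eqn 3).

    T: (S, A, S) transition probabilities, R: (S, A) rewards,
    Pi: list of sets, Pi[s] = allowed actions in state s (our encoding).
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        # value of each state under the *worst* allowed action (min, not max)
        V = np.array([min(Q[s, a] for a in Pi[s]) for s in range(S)])
        Q_new = R + gamma * T @ V  # Bellman backup against the worst case
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```

For instance, on a two-state chain where state 0 offers a reward-1 action and a reward-0.9 action, both leading to an absorbing zero-reward state, allowing both actions in state 0 gives a worst-case state value of 0.9.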
We define the value of state s according to a non-deterministic policy Π, denoted by V^Π_M(s), to be min_{a ∈ Π(s)} Q^Π_M(s, a).\nTo calculate the value of a non-deterministic policy, we construct an MDP M' = (S', A', T', R') where S' = S, A' = Π, R' = R and T' = T. It is straightforward to show that:\n\nQ^Π_M(s, a) = Q*_{M'}(s, a). (4)\n\nA non-deterministic policy Π is said to be augmented with state-action pair (s, a), denoted by Π' = Π + (s, a), if it satisfies:\n\nΠ'(s') = Π(s') if s' ≠ s;  Π'(s') = Π(s') ∪ {a} if s' = s. (5)\n\nIf a policy Π can be achieved by a number of augmentations from a policy Π', we say that Π includes Π'. The size of a policy Π, denoted by |Π|, is the sum of the cardinality of the action sets in Π: |Π| = Σ_s |Π(s)|.\nA non-deterministic policy Π is said to be non-augmentable according to a constraint Φ if and only if Π satisfies Φ, and for any state-action pair (s, a), Π + (s, a) does not satisfy Φ. In this paper we will be working with constraints that have this particular property: if a policy Π does not satisfy Φ, any policy that includes Π does not satisfy Φ. We will refer to such constraints as being monotonic.\nA non-deterministic policy Π on an MDP M is said to be ε-optimal (ε ∈ [0, 1]) if we have:3\n\nV^Π_M(s) ≥ (1 − ε)V*_M(s), ∀s ∈ S. (6)\n\nThis can be thought of as a constraint on the space of non-deterministic policies which makes sure that the worst-case expected return is within some range of the optimal value. 
It is straightforward to show that this constraint is monotonic.\nA conservative ε-optimal non-deterministic policy Π on an MDP M is a policy that is non-augmentable according to this constraint:\n\nR(s, a) + γ Σ_{s'} (T(s, a, s')(1 − ε)V*_M(s')) ≥ (1 − ε)V*_M(s), ∀a ∈ Π(s). (7)\n\nThis constraint indicates that we only add those actions to the policy whose reward plus (1 − ε) of the discounted future optimal return is within the sub-optimal margin. This ensures that the non-deterministic policy is ε-optimal by using the inequality:\n\nQ^Π_M(s, a) ≥ R(s, a) + γ Σ_{s'} (T(s, a, s')(1 − ε)V*_M(s')), (8)\n\ninstead of solving Eqn 3 and using the inequality constraint in Eqn 6. Applying Eqn 7 guarantees that the non-deterministic policy is ε-optimal while it may still be augmentable according to Eqn 6, hence the name conservative. It can also be shown that the conservative policy is unique.\nA non-augmentable ε-optimal non-deterministic policy Π on an MDP M is a policy that is not augmentable according to the constraint in Eqn 6. It is easy to show that any non-augmentable ε-optimal policy includes the conservative policy. However, non-augmentable ε-optimal policies are not necessarily unique. In this paper we will focus on a search problem in the space of non-augmentable ε-optimal policies, trying to maximize some criteria. Specifically, we will be trying to find non-deterministic policies that give the acting agent more options while staying within an acceptable sub-optimal margin.\nWe now present an example that clarifies the concepts introduced so far. To simplify drawing graphs of the MDP and policies, we assume deterministic transitions in this example. However the concepts apply to any probabilistic MDP as well. Fig 1 shows a sample MDP. 
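The conservative construction of Eqn 7 needs only V* and one Bellman backup per state-action pair. A minimal sketch, assuming V* has already been computed (e.g. by value iteration); the array layout and names are our own convention:

```python
import numpy as np

def conservative_policy(T, R, V_star, gamma, eps):
    """Conservative eps-optimal policy via the one-step test of Eqn 7:
    keep (s, a) iff R(s,a) + gamma * sum_s' T(s,a,s')(1-eps)V*(s')
                     >= (1-eps) V*(s).
    T: (S, A, S), R: (S, A), V_star: (S,) optimal state values."""
    S, _ = R.shape
    lhs = R + gamma * T @ ((1.0 - eps) * V_star)   # shape (S, A)
    rhs = (1.0 - eps) * V_star[:, None]
    keep = lhs >= rhs - 1e-12                      # tolerance for float ties
    return [set(np.flatnonzero(keep[s])) for s in range(S)]
```

On a toy chain where state 0 has a reward-1 and a reward-0.9 action into an absorbing zero-reward state, ε = 0.05 keeps only the optimal action, while ε = 0.2 admits both.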
The labels on the arcs show action names and the corresponding rewards are shown in the parentheses. We assume γ ≃ 1 and ε = 0.05. Fig 2 shows the optimal policy of this MDP. The conservative ε-optimal non-deterministic policy of this MDP is shown in Fig 3.\n\nFigure 1: Example MDP. Figure 2: Optimal policy. Figure 3: Conservative policy.\n\nFig 4 includes two possible non-augmentable ε-optimal policies. Although both policies in Fig 4 are ε-optimal, the union of these is not ε-optimal. This is due to the fact that adding an option to one of the states removes the possibility of adding options to other states, which illustrates why local changes are not always appropriate when searching in the space of ε-optimal policies.\n\n3In some of the MDP literature, ε-optimality is defined as an additive constraint (Q^Π_M ≥ Q*_M − ε) [7]. The derivations will be analogous in that case.\n\nFigure 4: Two non-augmentable policies\n\n3 Optimization Problem\n\nWe formalize the problem of finding an ε-optimal non-deterministic policy in terms of an optimization problem. There are several optimization criteria that can be formulated, while still complying with the ε-optimal constraint. Notice that the last two problems can be defined both in the space of all ε-optimal policies or only the non-augmentable ones.\n\n• Maximizing the size of the policy: According to this criterion, we seek non-augmentable ε-optimal policies that have the biggest overall size. 
This provides more options to the agent while still keeping the ε-optimal guarantees. The algorithms proposed in this paper use this optimization criterion. Notice that the solution to this optimization problem is non-augmentable according to the ε-optimal constraint, because it maximizes the overall size of the policy.\n\n• Maximizing the margin: We aim to maximize the margin of a non-deterministic policy Π:\n\nM(Π) = min_s ( min_{a ∈ Π(s), a' ∉ Π(s)} (Q(s, a) − Q(s, a')) ). (9)\n\nThis optimization criterion is useful when one wants to find a clear separation between the good and bad actions in each state.\n\n• Minimizing the uncertainty: If we learn the models from data, we will have some uncertainty about the optimal action in each state. We can use some variance estimation on the value function [8] along with a Z-Test to get some confidence level on our comparisons and find the probability of having the wrong order when comparing actions according to their values. Let Q be the value of the true model and Q̂ be our empirical estimate based on some dataset D. We aim to minimize the uncertainty of a non-deterministic policy Π:\n\nM(Π) = max_s ( max_{a ∈ Π(s), a' ∉ Π(s)} p(Q(s, a) < Q(s, a') | D) ). (10)\n\n4 Solving the Optimization Problem\n\nIn the following sections we provide algorithms to solve the first optimization problem mentioned above, which aims to maximize the size of the policy. We focus on this criterion as it seems most appropriate for medical decision support systems, where it is desirable for the acceptability of the system to find policies that provide as much choice as possible for the acting agent. We first present a Mixed Integer Program formulation of the problem, and then present a search algorithm that uses the monotonic property of the ε-optimal constraint. 
While the MIP method is useful as a general formulation of the problem, the search algorithm has potential for further extensions with heuristics.\n\n4.1 Mixed Integer Program\n\nRecall that we can formulate the problem of finding the optimal deterministic policy on an MDP as a simple linear program [5]:\n\nmin_V μᵀV, subject to\nV(s) ≥ R(s, a) + γ Σ_{s'} T(s, a, s')V(s') ∀s, a, (11)\n\nwhere μ can be thought of as the initial distribution over the states. The solution to the above problem is the optimal value function (denoted by V*). Similarly, having computed V* using Eqn 11, the problem of a search for an optimal non-deterministic policy according to the size criterion can be rewritten as a Mixed Integer Program:4\n\nmax_{V,Π} (μᵀV + (Vmax − Vmin) eₛᵀ Π eₐ), subject to\nV(s) ≥ (1 − ε)V*(s) ∀s\nΣ_a Π(s, a) > 0 ∀s\nV(s) ≤ R(s, a) + γ Σ_{s'} T(s, a, s')V(s') + Vmax(1 − Π(s, a)) ∀s, a. (12)\n\nHere we are overloading the notation Π to define a binary matrix representing the policy. Π(s, a) is 1 if a ∈ Π(s), and 0 otherwise. We define Vmax = Rmax/(1 − γ) and Vmin = Rmin/(1 − γ). The e's are column vectors of 1's with the appropriate dimensions. The first set of constraints makes sure that we stay within ε of the optimal return. The second set of constraints ensures that at least one action is selected per state. The third set ensures that for those state-action pairs that are chosen in any policy, the Bellman constraint holds, and otherwise, the constant Vmax makes the constraint trivial. Notice that the solution to the above maximizes |Π| and the result is non-augmentable. As a counter argument, suppose that we could add a state-action pair to the solution Π, while still staying in the ε sub-optimal margin. 
By adding that pair, the objective function is increased by (Vmax − Vmin), which is bigger than any possible decrease in the μᵀV term, and thus the objective is improved, which conflicts with Π being the solution.\nWe can use any MIP solver to solve the above problem. Note however that we do not make use of the monotonic nature of the constraints. A general purpose MIP solver could end up searching in the space of all the possible non-deterministic policies, which would require exponential running time.\n\n4.2 Search Algorithm\n\nWe can make use of the monotonic property of the ε-optimal policies to narrow down the search. We start by computing the conservative policy. We then augment it until we arrive at a non-augmentable policy. We make use of the fact that if a policy is not ε-optimal, neither is any other policy that includes it, and thus we can cut the search tree at this point.\nThe following algorithm is a one-sided recursive depth-first-search-like algorithm that searches in the space of plausible non-deterministic policies to maximize a function g(Π). Here we assume that there is an ordering on the set of state-action pairs {p_i} = {(s_j, a_k)}. This ordering can be chosen according to some heuristic along with a mechanism to cut down some parts of the search space. 
V* is the optimal value function and the function V returns the value of the non-deterministic policy, which can be calculated by minimizing Equation 3.\n\nFunction getOptimal(Π, startIndex, ε)\n  Π_o ← Π\n  for i ← startIndex to |S||A| do\n    (s, a) ← p_i\n    if a ∉ Π(s) and V(Π + (s, a)) ≥ (1 − ε)V* then\n      Π' ← getOptimal(Π + (s, a), i + 1, ε)\n      if g(Π') > g(Π_o) then\n        Π_o ← Π'\n      end\n    end\n  end\n  return Π_o\n\nWe should make a call to the above function passing in the conservative policy Π_m and starting from the first state-action pair: getOptimal(Π_m, 0, ε).\nThe asymptotic running time of the above algorithm is O((|S||A|)^d (t_m + t_g)), where d is the maximum size of an ε-optimal policy minus the size of the conservative policy, t_m is the time to solve the original MDP and t_g is the time to calculate the function g. Although the worst-case running time is still exponential in the number of state-action pairs, the run-time is much less when the search space is sufficiently small. The |A| term is due to the fact that we check all possible augmentations for each state. Note that this algorithm searches in the space of all ε-optimal policies rather than only the non-augmentable ones. If we set the function g(Π) = |Π|, then the algorithm will return the biggest non-augmentable ε-optimal policy.\n\n4Note that in this MIP, unlike the standard LP for MDPs, the choice of μ can affect the solution in cases where there is a tie in the size of Π.\n\nThis search can be further improved by using heuristics to order the state-action pairs and prune the search. One can also start the search from any other policy rather than the conservative policy. This can be potentially useful if we have further constraints on the problem. 
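The pseudocode above translates almost directly into code. The sketch below uses our own names: eval_V(Π) stands for the worst-case policy evaluation step (minimizing Eqn 3), and g is the objective being maximized, e.g. the policy size |Π|.

```python
def get_optimal(Pi, start_index, eps, pairs, V_star, eval_V, g):
    """Depth-first search over eps-optimal augmentations (cf. getOptimal).

    Pi: policy as a list of action sets; pairs: ordered (s, a) pairs;
    eval_V(Pi): per-state worst-case values of Pi (Eqn 3);
    V_star: optimal state values; g: objective to maximize."""
    best = Pi
    for i in range(start_index, len(pairs)):
        s, a = pairs[i]
        if a not in Pi[s]:
            aug = [set(acts) for acts in Pi]   # Pi + (s, a)
            aug[s].add(a)
            vals = eval_V(aug)
            # monotonicity: if aug is not eps-optimal, no superset is either
            if all(vals[j] >= (1 - eps) * V_star[j] for j in range(len(V_star))):
                cand = get_optimal(aug, i + 1, eps, pairs, V_star, eval_V, g)
                if g(cand) > g(best):
                    best = cand
    return best
```

Called with the conservative policy and start_index 0, and with g(Π) = |Π|, this returns a largest non-augmentable ε-optimal policy, pruning every branch whose augmentation already violates the ε-optimality constraint.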
One way to narrow down the search is to only add the action that has the maximum value for any state s:\n\nΠ' = Π + (s, argmax_{a ∉ Π(s)} Q(s, a)). (13)\n\nThis leads to a running time of O(|S|^d (t_m + t_g)). However this does not guarantee that we see all non-augmentable policies. This is due to the fact that after adding an action, the order of values might change. If the transition structure of the MDP contains no loop with non-zero probability (the transition graph is a directed acyclic graph, DAG), then this heuristic will produce the optimal result while cutting down the search time. In other cases, one might do a partial evaluation of the augmented policy to approximate the value after adding the actions, possibly by doing a few backups rather than using the original Q values. This offers the possibility of trading off computation time for better solutions.\n\n5 Empirical Evaluation\n\nTo evaluate our proposed algorithms, we first test both the MIP and search formulations on randomly created MDPs, and then test the search algorithm on a real-world treatment design scenario.\nTo begin, we generated random MDPs with 5 states and 4 actions. The transitions are deterministic (chosen uniformly at random) and the rewards are random values between 0 and 1, except for one of the states, which has reward 10 for one of the actions; γ was set to 0.95. The MIP method was implemented with MATLAB and CPLEX. Fig 5 shows the solution to the MIP defined in Eqn 12 for a particular randomly generated MDP. 
We see that the size of the non-deterministic policy increases as the performance threshold is relaxed.\n\nFigure 5: MIP solution for different values of ε ∈ {0, 0.01, 0.02, 0.03}. The labels on the edges are action indices, followed by the corresponding immediate rewards.\n\nTo compare the running time of the MIP solver and the search algorithm, we constructed random MDPs as described above with more state-action pairs. Fig 6 (Left) shows the running time averaged over 20 different random MDPs, assuming ε = 0.01.\n\nFigure 6: Left: Running time of MIP and search algorithm as a function of the number of state-action pairs. Right: Average percentage of state-action pairs that were different in the noisy policy.\n\nIt can be seen that both algorithms have exponential running time. The running time of the search algorithm has a bigger constant factor, but has a smaller exponent base which results in a faster asymptotic running time.\nTo study how stable non-deterministic policies are to potential noise in the models, we check to see how much the policy changes when Gaussian noise is added to the reward function. Fig 6 (Right) shows the percentage of the total state-action pairs that were either added or removed from the resulting policy by adding noise to the reward model (we assume a constant ε = 0.02). 
We see that\nthe resulting non-deterministic policy changes somewhat, but not drastically, even with noise level\nof similar magnitude as the reward function.\nNext, we implemented the full search algorithm on an MDP constructed for a medical decision-\nmaking task involving real patient data. The data was collected as part of a large (4000+ patients)\nmulti-step randomized clinical trial, designed to investigate the comparative effectiveness of differ-\nent treatments provided sequentially for patients suffering from depression [9]. The goal is to \ufb01nd\na treatment plan that maximizes the chance of remission. The dataset includes a large number of\nmeasured outcomes. For the current experiment, we focus on a numerical score called the Quick\nInventory of Depressive Symptomatology (QIDS), which was used in the study to assess levels of\ndepression (including when patients achieved remission). For the purposes of our experiment, we\ndiscretize the QIDS scores (which range from 5 to 27) uniformly into quartiles, and assume that\nthis, along with the treatment step (up to 4 steps were allowed), completely describe the patient\u2019s\nstate. Note that the underlying transition graph can be treated as a DAG because the study is limited\nto four steps of treatment. There are 19 actions (treatments) in total. A reward of 1 is given if the\npatient achieves remission (at any step) and a reward of 0 is given otherwise. 
The transition and\nreward models were generated empirically from the data using a frequentist approach.\n\nTable 1: Policy and running time of the full search algorithm on the medical problem\n\n\u270f = 0.02\n\n\u270f = 0.015\n\n\u270f = 0.01\n\nTime (seconds)\n\n5 < QIDS < 9\n\n9 \uf8ff QIDS < 12\n\n12 \uf8ff QIDS < 16\n\n16 \uf8ff QIDS \uf8ff 27\n\n118.7\nCT\nSER\n\nBUP, CIT+BUS\n\nCIT+BUP\nCIT+CT\n\nVEN\n\nCIT+BUS\n\nCT\nCT\n\nCIT+CT\n\n12.3\nCT\nSER\n\nCIT+BUP\nCIT+CT\n\nVEN\n\nCIT+BUS\n\n3.5\nCT\n\n\u270f = 0\n1.4\nCT\n\nCIT+BUP CIT+BUP\n\nVEN\n\nVEN\n\nCT\n\nCIT+CT\n\nCT\n\nCIT+CT\n\nCT\n\nTable 1 shows the non-deterministic policy obtained for each state during the second step of the\ntrial (each acronym refers to a speci\ufb01c treatment). This is computed using the search algorithm,\nassuming different values of \u270f. Although this problem is not tractable with the MIP formulation\n(304 state-action pairs), a full search in the space of \u270f-optimal policies is still possible. Table 1 also\nshows the running time of the algorithm, which as expected increases as we relax the threshold \u270f.\nHere we did not use any heuristics. However, as the underlying transition graph is a DAG, we could\nuse the heuristic discussed in the previous section (Eqn 13) to get the same policies even faster. 
An\n\n\uf001\uf002\uf003\uf002\uf004\uf002\uf005\uf006\uf007\uf008\uf009\uf00a\uf00b\uf00c\uf00d\uf00b\uf00e\uf00f\uf010\uf00f\uf009\uf011\uf010\uf012\uf00f\uf013\uf00c\uf014\uf00b\uf015\uf010\uf013\uf00a\uf00e\uf002\uf016\uf002\uf017\uf002\uf016\uf017\uf017\uf017\uf002\uf017\uf002\uf002\uf018\uf013\uf007\uf009\uf00b\uf019\uf00e\uf01a\uf01b\uf01c\uf01d\uf01e\uf009\uf010\uf00a\uf012\uf01f\uf001\uf001\uf002\uf003\uf004\uf004\uf002\uf003\uf005\uf006\uf007\uf008\uf009\uf00a\uf00b\uf009\uf00c\uf00d\uf00e\uf00f\uf00f\uf010\uf00d\uf011\uf009\uf011\uf00a\uf010\uf00f\uf012\uf009\uf010\uf011\uf009\uf013\uf014\uf012\uf009\uf015\uf012\uf016\uf00d\uf015\uf017\uf009\uf018\uf00a\uf017\uf012\uf019\uf001\uf01a\uf005\uf01a\uf01b\uf01a\uf01c\uf01a\uf01d\uf01a\uf004\uf001\uf01a\uf004\uf005\uf01a\uf004\uf01b\uf01a\uf004\uf01c\uf01a\uf01e\uf01f\uf012\uf015\uf00d\uf020\uf012\uf009\uf021\uf012\uf015\uf022\uf012\uf011\uf013\uf00d\uf020\uf012\uf009\uf00a\uf00b\uf009\uf017\uf010\uf023\uf012\uf015\uf012\uf022\uf012\finteresting question is how to set \u270f a priori. In practice, a doctor may use the full table as a guideline,\nusing smaller values of \u270f when s/he wants to rely more on the decision support system, and larger\nvalues when relying more on his/her own assessments.\n\n6 Discussion\n\nThis paper introduces a framework for computing non-deterministic policies for MDPs. We believe\nthis framework can be especially useful in the context of decision support systems to provide more\nchoice and \ufb02exibility to the acting agent. This should improve acceptability of decision support\nsystems in \ufb01elds where the policy is used to guide (or advise) a human expert, notably for the\noptimization of medical treatments.\nThe framework we propose relies on two competing objectives. On the one hand we want to provide\nas much choice as possible in the non-deterministic policy, while at the same time preserving some\nguarantees on the return (compared to the optimal policy). 
We present two algorithms that can solve such an optimization problem: a MIP formulation that can be solved by any general MIP solver, and a search algorithm that uses the monotonic property of the studied constraints to cut down on the running time. The search algorithm is particularly useful when we have good heuristics to further prune the search space. Future work will consider different optimization criteria, such as those outlined in Section 3, which may be more appropriate for some domains with very large action sets.\nA limitation of our current approach is that the algorithms presented so far are limited to relatively small domains, and scale well only for domains with special properties, such as a DAG structure in the transition model or good heuristics for pruning the search. This clearly points to future work in developing better approximation techniques. Nonetheless, it is worth keeping in mind that many domains of application may not be that large (see [1, 2, 3, 4] for examples), and the techniques as presented can already have a substantial impact.\nFinally, it is worth noting that non-deterministic policies can also be useful in cases where the MDP transition and reward models are imperfectly specified or learned from data, though we have not explored this case in detail yet. In such a setting, the difference between the optimal and a near-optimal policy may not be computed accurately. Thus, it is useful to find all actions that are close to optimal so that the real optimal action is not missed. An interesting question here is whether we can find the smallest non-deterministic policy that will include the optimal policy with some probability 1 − δ. This is similar to the framework in [7], and could be useful in cases where there is not enough data to compare policies with good statistical significance.\nAcknowledgements: The authors wish to thank A. John Rush, Susan A. 
Murphy, Doina Precup, and\nStephane Ross for helpful discussions regarding this work. Funding was provided by the National\nInstitutes of Health (grant R21 DA019800) and the NSERC Discovery Grant program.\n\nReferences\n[1] A. Schaefer, M. Bailey, S. Shechter, and M. Roberts. Handbook of Operations Research / Management\nScience Applications in Health Care, chapter Medical decisions using Markov decision processes. Kluwer\nAcademic Publishers, 2004.\n\n[2] M. Hauskrecht and H. Fraser. Planning treatment of ischemic heart disease with partially observable\n\nMarkov decision processes. Arti\ufb01cial Intelligence in Medicine, 18(3):221\u2013244, 2000.\n\n[3] P. Magni, S. Quaglini, M. Marchetti, and G. Barosi. Deciding when to intervene: a Markov decision\n\nprocess approach. International Journal of Medical Informatics, 60(3):237\u2013253, 2000.\n\n[4] D. Ernst, G. B. Stan, J. Concalves, and L. Wehenkel. Clinical data based optimal sti strategies for hiv: a\n\nreinforcement learning approach. In Proceedings of Benelearn, 2006.\n\n[5] D.P. Bertsekas. Dynamic Programming and Optimal Control, Vol 2. Athena Scienti\ufb01c, 1995.\n[6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.\n[7] M. Kearns and S. Singh. Near-optimal reinforcement learning in poly. time. Machine Learning, 49, 2002.\n[8] S. Mannor, D. Simester, P. Sun, and J.N. Tsitsiklis. Bias and variance in value function estimation. In\n\nProceedings of ICML, 2004.\n\n[9] M. Fava, A.J. Rush, and M.H. Trivedi et al. Background and rationale for the sequenced treatment alterna-\n\ntives to relieve depression (STAR*D) study. Psychiatr Clin North Am, 26(2):457\u201394, 2003.\n\n\f", "award": [], "sourceid": 369, "authors": [{"given_name": "M.", "family_name": "Fard", "institution": null}, {"given_name": "Joelle", "family_name": "Pineau", "institution": null}]}