{"title": "Bayesian Control of Large MDPs with Unknown Dynamics in Data-Poor Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 8146, "page_last": 8156, "abstract": "We propose a Bayesian decision making framework for control of Markov Decision Processes (MDPs) with unknown dynamics and large, possibly continuous, state, action, and parameter spaces in data-poor environments. Most of the existing adaptive controllers for MDPs with unknown dynamics are based on the reinforcement learning framework and rely on large data sets acquired by sustained direct interaction with the system or via a simulator. This is not feasible in many applications, due to ethical, economic, and physical constraints. The proposed framework addresses the data poverty issue by decomposing the problem into an offline planning stage that does not rely on sustained direct interaction with the system or simulator and an online execution stage. In the offline process, parallel Gaussian process temporal difference (GPTD) learning techniques are employed for near-optimal Bayesian approximation of the expected discounted reward over a sample drawn from the prior distribution of unknown parameters. In the online stage, the action with the maximum expected return with respect to the posterior distribution of the parameters is selected. This is achieved by an approximation of the posterior distribution using a Markov Chain Monte Carlo (MCMC) algorithm, followed by constructing multiple Gaussian processes over the parameter space for efficient prediction of the means of the expected return at the MCMC sample. 
The effectiveness of the proposed framework is demonstrated using a simple dynamical system model with continuous state and action spaces, as well as a more complex model for a metastatic melanoma gene regulatory network observed through noisy synthetic gene expression data.", "full_text": "Bayesian Control of Large MDPs with\n\nUnknown Dynamics in Data-Poor Environments\n\nMahdi Imani\n\nTexas A&M University\nCollege Station, TX, USA\nm.imani88@tamu.edu\n\nSeyede Fatemeh Ghoreishi\n\nTexas A&M University\nCollege Station, TX, USA\nf.ghoreishi88@tamu.edu\n\nUlisses M. Braga-Neto\nTexas A&M University\nCollege Station, TX, USA\nulisses@ece.tamu.edu\n\nAbstract\n\nWe propose a Bayesian decision making framework for control of Markov Deci-\nsion Processes (MDPs) with unknown dynamics and large, possibly continuous,\nstate, action, and parameter spaces in data-poor environments. Most of the exist-\ning adaptive controllers for MDPs with unknown dynamics are based on the re-\ninforcement learning framework and rely on large data sets acquired by sustained\ndirect interaction with the system or via a simulator. This is not feasible in many\napplications, due to ethical, economic, and physical constraints. The proposed\nframework addresses the data poverty issue by decomposing the problem into an\nof\ufb02ine planning stage that does not rely on sustained direct interaction with the\nsystem or simulator and an online execution stage. In the of\ufb02ine process, parallel\nGaussian process temporal difference (GPTD) learning techniques are employed\nfor near-optimal Bayesian approximation of the expected discounted reward over\na sample drawn from the prior distribution of unknown parameters. In the online\nstage, the action with the maximum expected return with respect to the posterior\ndistribution of the parameters is selected. 
This is achieved by an approximation\nof the posterior distribution using a Markov Chain Monte Carlo (MCMC) algo-\nrithm, followed by constructing multiple Gaussian processes over the parameter\nspace for ef\ufb01cient prediction of the means of the expected return at the MCMC\nsample. The effectiveness of the proposed framework is demonstrated using a\nsimple dynamical system model with continuous state and action spaces, as well\nas a more complex model for a metastatic melanoma gene regulatory network\nobserved through noisy synthetic gene expression data.\n\n1\n\nIntroduction\n\nDynamic programming (DP) solves the optimal control problem for Markov Decision Processes\n(MDPs) with known dynamics and \ufb01nite state and action spaces. However, in complex applications\nthere is often uncertainty about the system dynamics. In addition, many practical problems have\nlarge or continuous state and action spaces. Reinforcement learning is a powerful technique widely\nused for adaptive control of MDPs with unknown dynamics [1]. Existing RL techniques developed\nfor MDPs with unknown dynamics rely on data that is acquired via interaction with the system or\nvia simulation. While this is feasible in areas such as robotics or speech recognition, in other appli-\ncations such as medicine, materials science, and business, there is either a lack of reliable simulators\nor inaccessibility to the real system due to practical limitations, including cost, ethical, and physical\nconsiderations. For instance, recent advances in metagenomics and neuroscience call for the devel-\nopment of ef\ufb01cient intervention strategies for disease treatment. 
However, these systems are often\nmodeled with MDPs with continuous state and action spaces, with limited access to expensive data.\nThus, there is a need for control of systems with unknown dynamics and large or continuous state,\naction, and parameter spaces in data-poor environments.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fRelated Work: Approximate dynamic programming (ADP) techniques have been developed for\nproblems in which the exact DP solution is not achievable. These include parametric and non-\nparametric reinforcement learning (RL) techniques for approximating the expected discounted re-\nward over large or continuous state and action spaces. Parametric RL techniques include neural\n\ufb01tted Q-iteration [2], deep reinforcement learning [3], and kernel-based techniques [4]. A popular\nclass of non-parametric RL techniques is Gaussian process temporal difference (GPTD) learning [5],\nwhich provides a Bayesian representation of the expected discounted return. However, all afore-\nmentioned methods involve approximate of\ufb02ine planning for MDPs with known dynamics or online\nlearning by sustained direct interaction with the system or a simulator. The multiple model-based\nRL (MMRL) [6] is a framework that allows the extension of the aforementioned RL techniques\nto MDPs with unknown dynamics represented over a \ufb01nite parameter space, and therefore cannot\nhandle large or continuous parameter spaces.\nIn addition, there are several Bayesian reinforcement learning techniques in the literature [7]. For\nexample, Bayes-adaptive RL methods assume a parametric family for the MDP transition matrix\nand simultaneously learn the parameters and policy. A closely related method in this class is the\nBeetle algorithm [8], which converts a \ufb01nite-state MDP into a continuous partially-observed MDP\n(POMDP). Then, an approximate of\ufb02ine algorithm is developed to solve the POMDP. 
The Beetle\nalgorithm is however capable of handling \ufb01nite state and action spaces only. Online tree search\napproximations underlie a varied and popular class of Bayesian RL techniques [9\u201316]. In particular,\nthe Bayes-adaptive Monte-Carlo planning (BAMCP) algorithm [16] has been shown empirically\nto outperform the other techniques in this category. This is due to the fact that BAMCP uses a\nrollout policy during the learning process, which effectively biases the search tree towards good\nsolutions. However, this class of methods applies to \ufb01nite-state MDPs with \ufb01nite actions; application\nto continuous state and action spaces requires discretization of these spaces, rendering computation\nintractable in most cases of interest.\nLookahead policies are a well-studied class of techniques that can be used for control of MDPs with\nlarge or continuous state, action, and parameter spaces [17]. However, ignoring the long future hori-\nzon in their decision making process often results in poor performance. Other methods to deal with\nsystems carrying other sources of uncertainty include [18, 19].\nMain Contributions: The goal of this paper is to develop a framework for Bayesian decision mak-\ning for MDPs with unknown dynamics and large or continuous state, action and parameter spaces\nin data-poor environments. The framework consists of of\ufb02ine and online stages. In the of\ufb02ine stage,\nsamples are drawn from a prior distribution over the space of parameters. Then, parallel Gaussian\nprocess temporal difference (GPTD) learning algorithms are applied for Bayesian approximation of\nthe expected discounted reward associated with these parameter samples. During the online process,\na Markov Chain Monte Carlo (MCMC) algorithm is employed for sample-based approximation of\nthe posterior distribution. 
For decision making with respect to the posterior distribution, Gaussian process regression over the parameter space, based on the means and variances of the expected returns obtained in the offline process, is used for prediction of the expected returns at the MCMC sample points. The proposed framework offers several benefits, summarized as follows:
• Risk Consideration: Most existing techniques estimate point values of the expected Q-function and base decisions on them, whereas the proposed method provides a full Bayesian representation of the Q-function. This allows risk to be taken into account during action selection, which is required by many real-world applications, such as cancer drug design.
• Fast Online Decision Making: The proposed method is suitable for problems with tight time-limit constraints, in which the action must be selected relatively fast. Most of the computational effort spent by the proposed method is in the offline process. By contrast, the online process used by Monte-Carlo based techniques is often very slow, especially for large MDPs, in which a large number of trajectories must be simulated for accurate estimation of the Q-functions.
• Continuous State/Action Spaces: Existing Bayesian RL techniques can handle continuous state and action spaces to some extent (e.g., via discretization). However, the difficulty of picking a proper quantization rate, which directly impacts accuracy, and the computational intractability for large MDPs make the existing methods less attractive.
• Generalization: Another feature of the proposed method is the ability to serve as an initialization step for Monte-Carlo based techniques. 
In fact, if the expected error at each time point is large (according to the Bayesian representation of the Q-functions), Monte-Carlo techniques can be employed for efficient online search using the available results of the proposed method.
• Anytime Planning: The Bayesian representation of the Q-function allows the online decision making process to be started at any time to improve the offline planning results. In fact, while the online planning is underway, the accuracy of the Q-functions at the current offline samples can be improved, or the Q-functions at new offline samples from the posterior distribution can be computed.

2 Background

A Markov decision process (MDP) is formally defined by a 5-tuple ⟨S, A, T, R, γ⟩, where S is the state space, A is the action space, T : S × A × S → [0, 1] is the state transition probability function such that T(s, a, s′) = p(s′ | s, a) represents the probability of moving to state s′ after taking action a in state s, R : S × A → ℝ is a bounded reward function such that R(s, a) encodes the reward earned when action a is taken in state s, and 0 < γ < 1 is a discount factor.
A deterministic stationary policy π for an MDP is a mapping π : S → A from states to actions. The expected discounted reward function at state s ∈ S after taking action a ∈ A and following policy π afterward is defined as:

Q^π(s, a) = E[ ∑_{t=0}^{∞} γ^t R(s_t, a_t) | s_0 = s, a_0 = a ] .    (1)

The optimal action-value function, denoted by Q∗, provides the maximum expected return Q∗(s, a) that can be obtained after executing action a in state s. 
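To make these definitions concrete, the optimal Q-function and policy of a tiny finite MDP can be computed by iterating the Bellman optimality update; a minimal sketch (not from the paper — the two-state, two-action transition probabilities and rewards below are invented for illustration):

```python
import numpy as np

# Toy MDP, invented for illustration: 2 states, 2 actions.
# T[s, a, s2] = p(s2 | s, a); R[s, a] = immediate reward; gamma = discount factor.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Q-value iteration: Q(s, a) <- R(s, a) + gamma * sum_{s2} T(s, a, s2) * max_{a2} Q(s2, a2).
Q = np.zeros((2, 2))
for _ in range(1000):
    Q = R + gamma * T @ Q.max(axis=1)

# Optimal stationary policy: pi*(s) = argmax_a Q*(s, a).
pi_star = Q.argmax(axis=1)
```

For MDPs with known dynamics and finite spaces this converges at rate γ per sweep; the continuous-space, unknown-dynamics setting considered in this paper is precisely where such exact sweeps are unavailable.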
An optimal stationary policy π∗, which attains the maximum expected return for all states, is given by π∗(s) = argmax_{a∈A} Q∗(s, a).
An MDP is said to have known dynamics if the 5-tuple ⟨S, A, T, R, γ⟩ is fully specified; otherwise, it is said to have unknown dynamics. For an MDP with known dynamics and finite state and action spaces, planning algorithms such as Value Iteration or Policy Iteration [20] can be used to compute the optimal policy offline. Several approximate dynamic programming (ADP) methods have been developed for approximating the optimal stationary policy over continuous state and action spaces. However, in this paper, we are concerned with large MDPs with unknown dynamics in data-poor environments.

3 Proposed Bayesian Decision Framework

Let the unknown parts of the dynamics be encoded into a finite-dimensional vector θ, where θ takes values in a parameter space Θ ⊂ ℝ^m. Notice that each θ ∈ Θ specifies an MDP with known dynamics. Letting (a_{0:k−1}, s_{0:k}) be the sequence of actions taken and states observed up to time step k during the execution process, the proposed method selects an action according to:

a_k = argmax_{a∈A} E_{θ|s_{0:k},a_{0:k−1}}[ Q∗_θ(s_k, a) ] ,    (2)

where the expectation is taken relative to the posterior distribution p(θ | s_{0:k}, a_{0:k−1}), and Q∗_θ characterizes the optimal expected return for the MDP associated with θ.
Two main issues complicate finding the exact solution in (2). First, computation of the posterior distribution might not have a closed-form solution, and one needs to use techniques such as Markov-Chain Monte-Carlo (MCMC) for sample-based approximation of the posterior. Secondly, the exact computation of Q∗_θ for any given θ is not possible, due to the large or possibly continuous state and action spaces. 
However, for any θ ∈ Θ, the expected return can be approximated with one of the many existing techniques, such as neural fitted Q-iteration [2], deep reinforcement learning [3], and Gaussian process temporal difference (GPTD) learning [5]. On the other hand, all the aforementioned techniques can be extremely slow over an MCMC sample that is sufficiently large to achieve accurate results. In sum, computation of the expected returns associated with samples of the posterior distribution during the execution process is not practical.
In the following paragraphs, we propose efficient offline and online planning processes capable of computing an approximate solution to the optimization problem in (2).

3.1 Offline Planner

The offline process starts by drawing a sample Θ^prior = {θ_i^prior}_{i=1}^{N^prior} ∼ p(θ) of size N^prior from the parameter prior distribution. For each sample point θ ∈ Θ^prior, one needs to approximate the optimal expected return Q∗_θ over the entire state and action spaces. We propose to do this by using Gaussian process temporal difference (GPTD) learning [5]. The detailed reasoning behind this choice will be provided when the online planner is discussed.
GP-SARSA is a GPTD algorithm that provides a Bayesian approximation of Q∗_θ for a given θ ∈ Θ^prior. We describe the GP-SARSA algorithm over the next several paragraphs. 
Given a policy π_θ : S → A for an MDP corresponding to θ, the discounted return at time step t can be written as:

U_θ^{t,π_θ}(s_t, a_t) = E[ ∑_{r=t}^{∞} γ^{r−t} R_θ(s_{r+1}, a_{r+1}) ] ,    (3)

where s_{r+1} ∼ p(s′ | s_r, a_r = π_θ(s_r), θ), and U_θ^{t,π_θ}(s_t, a_t) is the expected accumulated reward for the system corresponding to parameter θ obtained over time if the current state and action are s_t and a_t and policy π_θ is followed afterward.
In the GPTD method, the expected discounted return U_θ^{t,π_θ}(s_t, a_t) is approximated as:

U_θ^{t,π_θ}(s_t = s, a_t = a) ≈ Q_θ^{π_θ}(s, a) + ∆Q_θ^{π_θ} ,    (4)

where Q_θ^{π_θ}(s, a) is a Gaussian process [21] over the space S × A and ∆Q_θ^{π_θ} is a zero-mean Gaussian residual with variance σ_q^2. A zero-mean Gaussian process is usually considered as a prior:

Q_θ^{π_θ}(s, a) = GP(0, k_θ((s, a), (s′, a′))) ,    (5)

where k_θ(·, ·) is a real-valued kernel function, which encodes our prior belief on the correlation between (s, a) and (s′, a′). One possible choice is a decomposable kernel over the state and action spaces: k_θ((s, a), (s′, a′)) = k_{S,θ}(s, s′) × k_{U,θ}(a, a′). A proper choice of the kernel function depends on the nature of the state and action spaces, e.g., whether they are finite or continuous.
Let B_t^θ = [(s_0, a_0), . . . , (s_t, a_t)]^T be the sequence of observed joint state-action pairs simulated by a policy π_θ from an MDP corresponding to θ, with the corresponding immediate rewards r_t^θ = [R_θ(s_0, a_0), . . . , R_θ(s_t, a_t)]^T. The posterior distribution of Q_θ^{π_θ}(s, a) can be written as [5]:

Q_θ^{π_θ}(s, a) | r_t^θ, B_t^θ ∼ N( Q̄_θ(s, a), cov_θ((s, a), (s, a)) ) ,    (6)

where

Q̄_θ(s, a) = K_{(s,a),B_t^θ} H_t^T ( H_t K_{B_t^θ,B_t^θ} H_t^T + σ_q^2 H_t H_t^T )^{−1} r_t^θ ,
cov_θ((s, a), (s, a)) = k_θ((s, a), (s, a)) − K_{(s,a),B_t^θ} H_t^T ( H_t K_{B_t^θ,B_t^θ} H_t^T + σ_q^2 H_t H_t^T )^{−1} H_t K_{(s,a),B_t^θ}^T ,    (7)

with H_t the t × (t + 1) matrix

H_t = [ 1 −γ 0 . . . 0 ; 0 1 −γ . . . 0 ; . . . ; 0 0 . . . 1 −γ ] ,    (8)

and K_{B,B′} the kernel matrix with entries [K_{B,B′}]_{ij} = k_θ((s_i, a_i), (s′_j, a′_j)), for B = [(s_0, a_0), . . . , (s_m, a_m)]^T and B′ = [(s′_0, a′_0), . . . , (s′_n, a′_n)]^T.
The hyper-parameters of the kernel function can be estimated by maximizing the likelihood of the observed rewards [22]:

r_t^θ | B_t^θ ∼ N( 0, H_t K_{B_t^θ,B_t^θ} H_t^T + σ_q^2 I_t ) ,    (9)

where I_t is the identity matrix of size t × t.
The choice of policy for gathering data has significant impact on the proximity of the estimated discounted return to the optimal one. 
A well-known option, which uses the Bayesian representation of the expected return and adaptively balances exploration and exploitation, is given by [22]:

π_θ(s) = argmax_{a∈A} q_a ,   q_a ∼ N( Q̄_θ(s, a), cov_θ((s, a), (s, a)) ) .    (10)

The GP-SARSA algorithm approximates the expected return by simulating several trajectories based on the above policy. Running N^prior parallel GP-SARSA algorithms for each θ ∈ Θ^prior leads to N^prior near-optimal approximations of the expected reward functions.

3.2 Online Planner

Let Q̂_θ(s, a) be the Gaussian process approximating the optimal Q-function for any s ∈ S and a ∈ A computed by a GP-SARSA algorithm associated with parameter θ ∈ Θ^prior. One can approximate (2) as:

a_k ≈ argmax_{a∈A} E_{θ|s_{0:k},a_{0:k−1}}[ E[ Q̂_θ(s_k, a) ] ] = argmax_{a∈A} E_{θ|s_{0:k},a_{0:k−1}}[ Q̄_θ(s_k, a) ] .    (11)

While the value of Q̄_θ(s_k, a) at values θ ∈ Θ^prior drawn from the prior distribution is available, the expectation in (11) is over the posterior distribution.
Rather than restricting ourselves to parametric families, we compute the expectation in (11) by a Markov Chain Monte-Carlo (MCMC) algorithm for generating i.i.d. sample values from the posterior distribution. For simplicity, and without loss of generality, we employ the basic Metropolis-Hastings MCMC algorithm [23]. Let the last accepted MCMC sample in the sequence of samples be θ_j^post, generated at the j-th iteration. A candidate MCMC sample point θ^cand is drawn according to a symmetric proposal distribution q(θ | θ_j^post). 
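The acceptance rule for these candidates is given next; as a minimal, self-contained illustration of such a random-walk Metropolis-Hastings loop (not the paper's code — the one-dimensional Gaussian likelihood and prior below are invented stand-ins for p(s_{0:k}, a_{0:k−1} | θ) and p(θ)):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(theta, data):
    # log p(data | theta) + log p(theta): invented 1-D Gaussian model,
    # standing in for the MDP likelihood p(s_{0:k}, a_{0:k-1} | theta).
    log_lik = -0.5 * np.sum((data - theta) ** 2)
    log_prior = -0.5 * theta ** 2
    return log_lik + log_prior

def metropolis_hastings(data, n_samples=5000, step=0.5, burn_in=500):
    theta = 0.0
    samples = []
    for _ in range(n_samples + burn_in):
        cand = theta + step * rng.standard_normal()   # symmetric proposal q
        # alpha = min(1, target(cand) / target(theta)), computed in log space.
        if np.log(rng.uniform()) < log_target(cand, data) - log_target(theta, data):
            theta = cand                              # accept; otherwise keep theta
        samples.append(theta)
    return np.array(samples[burn_in:])                # drop the burn-in prefix

data = rng.normal(1.0, 1.0, size=50)
post = metropolis_hastings(data)
```

For this conjugate toy problem the posterior is N(∑y_i/(n+1), 1/(n+1)), so the sample mean of the chain can be checked against the analytic value.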
The candidate MCMC sample point θ^cand is accepted with probability α given by:

α = min{ 1, [ p(s_{0:k}, a_{0:k−1} | θ^cand) p(θ^cand) ] / [ p(s_{0:k}, a_{0:k−1} | θ_j^post) p(θ_j^post) ] } ,    (12)

and otherwise rejected, where p(θ) denotes the prior probability of θ. Accordingly, the (j + 1)th MCMC sample point is:

θ_{j+1}^post = θ^cand with probability α, and θ_{j+1}^post = θ_j^post otherwise .    (13)

Repeating this process leads to a sequence of MCMC sample points. The positivity of the proposal distribution (i.e., q(θ | θ_j^post) > 0 for any θ) is a sufficient condition for ensuring an ergodic Markov chain whose steady-state distribution is the posterior distribution p(θ | s_{0:k}, a_{0:k−1}) [24]. Removing a fixed number of initial "burn-in" sample points, the MCMC sample Θ^post = (θ_1^post, . . . , θ_{N^post}^post) is approximately a sample from the posterior distribution.
The last step towards the computation of (11) is the approximation of the mean of the predicted expected return Q̄_θ(·, ·) at the values of the MCMC sample Θ^post. We take advantage of the Bayesian representation of the expected return computed by the offline GP-SARSAs for this, as described next.
Let f_{s_k}^a = [Q̄_{θ_1^prior}(s_k, a), . . . , Q̄_{θ_{N^prior}^prior}(s_k, a)]^T and v_{s_k}^a = [cov_{θ_1^prior}((s_k, a), (s_k, a)), . . . , cov_{θ_{N^prior}^prior}((s_k, a), (s_k, a))]^T be the means and variances of the predicted expected returns computed based on the results of the offline GP-SARSAs at the current state s_k for a given action a ∈ A. This information can be used for constructing a Gaussian process for predicting the expected return over the MCMC sample:

[ Q̄_{θ_1^post}(s_k, a), . . . , Q̄_{θ_{N^post}^post}(s_k, a) ]^T = Σ_{Θ^post,Θ^prior} ( Σ_{Θ^prior,Θ^prior} + Diag(v_{s_k}^a) )^{−1} f_{s_k}^a ,    (14)

where Σ_{Θ_m,Θ_n} is the matrix with entries [Σ_{Θ_m,Θ_n}]_{ij} = k(θ_i, θ′_j), for Θ_m = {θ_1, . . . , θ_m} and Θ_n = {θ′_1, . . . , θ′_n}, and k(θ, θ′) denotes the correlation between sample points in the parameter space. The parameters of the kernel function can be inferred by maximizing the marginal likelihood:

f_{s_k}^a | Θ^prior ∼ N( 0, Σ_{Θ^prior,Θ^prior} + Diag(v_{s_k}^a) ) .    (15)

The process is summarized in Figure 1(a). The red vertical lines represent the expected returns at sample points from the posterior. It can be seen that only a single offline sample point is in the area covered by the MCMC samples, which illustrates the advantage of the constructed Gaussian process for predicting the expected return over the posterior distribution.

Figure 1: (a) Gaussian process for prediction of the expected returns at posterior sample points based on prior sample points. (b) Proposed framework.

The GP is constructed for any given a ∈ A. For a large or continuous action space, one needs to draw a finite set of actions {a_1, . . . , a_M} from the space, and compute Q̄_θ(s_k, a) for a ∈ {a_1, . . . , a_M} and θ ∈ Θ^post. It should be noted that the uncertainty in the expected returns of the offline sample points is efficiently taken into account for predicting the means of the expected returns at the MCMC sample points. 
Thus, equation (11) can be written as:

a_k ≈ argmax_{a∈A} E_{θ|s_{0:k},a_{0:k−1}}[ Q̄_θ(s_k, a) ] ≈ argmax_{a∈{a_1,...,a_M}} (1 / N^post) ∑_{θ∈Θ^post} Q̄_θ(s_k, a) .    (16)

It is shown empirically in the numerical experiments that, as more data are observed during execution, the proposed method becomes more accurate, eventually achieving the performance of a GP-SARSA trained on data from the true system model. The entire proposed methodology is summarized in Algorithm 1 and Figure 1(b), respectively.
Notice that the values of N^prior and N^post should be chosen based on the size of the MDP, the availability of computational resources, and the presence of time constraints. Indeed, a large N^prior means that larger parameter samples must be obtained in the offline process, while a large N^post is associated with larger MCMC samples in the posterior update step.

4 Numerical Experiments

The numerical experiments compare the performance of the proposed framework with two other methods: 1) Multiple Model-based RL (MMRL) [6]: the parameter space in this method is quantized into a finite set Θ^quant according to its prior distribution, and the results of the offline parallel GP-SARSA algorithms associated with this set are used for decision making during the execution process via: a_k^MMRL = argmax_{a∈A} ∑_{θ∈Θ^quant} Q̄_θ(s_k, a_k = a) P(θ | s_{0:k}, a_{0:k−1}). 2) One-step lookahead policy [17]: this method selects the action with the highest expected immediate reward: a_k^seq = argmax_{a∈A} E_{θ|s_{0:k},a_{0:k−1}}[R(s_k, a_k = a)]. As a baseline for performance, the results of the GP-SARSA algorithm tuned to the true model are also displayed.

Algorithm 1 Bayesian Control of Large MDPs with Unknown Dynamics in Data-Poor Environments.

Offline Planning
1: Draw N^prior parameters from the prior distribution: Θ^prior = {θ_1, . . . , θ_{N^prior}} ∼ p(θ).
2: Run N^prior parallel GP-SARSAs: Q̂_θ ← GP-SARSA(θ), θ ∈ Θ^prior.

Online Planning
3: Initial action selection: a_0 = argmax_{a∈A} (1 / N^prior) ∑_{θ∈Θ^prior} Q̄_θ(s_0, a).
4: for k = 1, . . . do
5:   Take action a_{k−1}, record the new state s_k.
6:   Given s_{0:k}, a_{0:k−1}, run MCMC and collect Θ_k^post.
7:   for a ∈ {a_1, . . . , a_M} do
8:     Record the means and variances of the offline GPs at (s_k, a):
         f_{s_k}^a = [Q̄_{θ_1^prior}(s_k, a), . . . , Q̄_{θ_{N^prior}^prior}(s_k, a)]^T ,
         v_{s_k}^a = [cov_{θ_1^prior}((s_k, a), (s_k, a)), . . .
, cov_{θ_{N^prior}^prior}((s_k, a), (s_k, a))]^T .
9:     Construct a GP using f_{s_k}^a, v_{s_k}^a over Θ^prior.
10:    Use the constructed GP to compute Q̄_θ(s_k, a), for θ ∈ Θ^post.
11:  end for
12:  Action selection: a_k = argmax_{a∈{a_1,...,a_M}} (1 / N^post) ∑_{θ∈Θ^post} Q̄_θ(s_k, a).
13: end for

Simple Continuous State and Action Example: The following simple MDP with unknown dynamics is considered in this section:

s_k = bound[ s_{k−1} − θ s_{k−1}(0.5 − s_{k−1}) + 0.2 a_{k−1} + n_k ] ,    (17)

where s_k ∈ S = [0, 1] and a_k ∈ A = [−1, 1] for any k ≥ 0, n_k ∼ N(0, 0.05), θ is the unknown parameter with true value θ∗ = 0.2, and bound maps the argument to the closest point in the state space. The reward function is R(s, a) = −10 δ_{s<0.1} − 10 δ_{s>0.9} − 2|a|, so that the cost is minimum when the system is in the interval [0.1, 0.9]. The prior distribution is p(θ) ∼ N(0, 0.2). The decomposable squared exponential kernel function is used over the state and action spaces. The offline and MCMC sample sizes are 10 and 1000, respectively.
Figures 2(a) and (b) plot the optimal actions in the state and parameter spaces and the Q-function over the state and action spaces for the true model θ∗, obtained by GP-SARSA algorithms. It can be seen that the decision is significantly impacted by the parameter, especially in regions of the state space between 0.5 and 1. The Bayesian approximation of the Q-function is represented by two surfaces that define 95%-confidence intervals for the expected return. The average rewards per step over 100 independent runs starting from different initial states are plotted in Figure 2(c). As expected, the maximum average reward is obtained by the GP-SARSA associated with the true model. 
The proposed framework significantly outperforms both the MMRL and one-step lookahead techniques. One can see that the average reward of the proposed algorithm converges to the true-model results after fewer than 20 actions, while the other methods do not. The very poor performance of the one-step lookahead method is due to the greedy heuristics involved in its decision making process, which do not factor in long-term rewards.

Figure 2: Small example results.

Melanoma Gene Regulatory Network: A key goal in genomics is to find proper intervention strategies for disease treatment and prevention. Melanoma is the most dangerous form of skin cancer, the gene-expression behavior of which can be represented through the Boolean activities of 7 genes displayed in Figure 3. Each gene expression can be 0 or 1, corresponding to gene inactivation or activation, respectively. The gene states are assumed to be updated at each discrete time through the following nonlinear signal model:

x_k = f(x_{k−1}) ⊕ a_{k−1} ⊕ n_k ,    (18)

where x_k = [WNT5A_k, pirin_k, S100P_k, RET1_k, MART1_k, HADHB_k, STC2_k] is the state vector at time step k, action a_{k−1} ∈ A ⊂ {0, 1}^7, such that a_{k−1}(i) = 1 flips the state of the ith gene, f is the Boolean function displayed in Table 1, in which the ith binary string specifies the output value for the given input genes, "⊕" indicates component-wise modulo-2 addition, and n_k ∈ {0, 1}^7 is Boolean transition noise, such that P(n_k(i) = 1) = p, for i = 1, . . . 
, 7.

Figure 3: Melanoma regulatory network

Table 1: Boolean functions for the melanoma GRN.

Genes    Input Gene(s)          Output
WNT5A    HADHB                  10
pirin    pirin, RET1, HADHB     00010111
S100P    S100P, RET1, STC2      10101010
RET1     RET1, HADHB, STC2      00001111
MART1    pirin, MART1, STC2     10101111
HADHB    pirin, S100P, RET1     01110111
STC2     pirin, STC2            1101

In practice, the gene states are observed through gene-expression technologies such as cDNA microarray or image-based assay. A Gaussian observation model is appropriate for modeling the gene expression data produced by these technologies:

y_k(i) ∼ N( 20 x_k(i) + θ, 10 ) ,    (19)

for i = 1, . . . , 7; where parameter θ is the baseline expression in the inactivated state, with true value θ∗ = 30. Such a model is known as a partially-observed Boolean dynamical system in the literature [25, 26].
It can be shown that, for any given θ ∈ ℝ, the partially-observed MDP in (18) and (19) can be transformed into an MDP in a continuous belief space [27, 28]:

s_k = g(s_{k−1}, a_{k−1}, θ) ∝ p(y_k | x_k, θ) P(x_k | x_{k−1}, a_k) s_{k−1} ,    (20)

where "∝" indicates that the right-hand side must be normalized to add up to 1. The belief state is a vector of length 128 in a simplex of size 127.

Figure 4: Melanoma gene regulatory network results.

In [29, 30], the expression of WNT5A was found to be highly discriminatory between cells with properties typically associated with high metastatic competence versus those with low metastatic competence. 
Hence, an intervention that blocked the WNT5A protein from activating its receptor could substantially reduce the ability of WNT5A to induce a metastatic phenotype. Thus, we consider the following immediate reward function in belief space:

R(s, a) = 50 ∑_{i=1}^{128} s(i) δ_{x_i(1)=0} − 10 ‖a‖_1 ,

where δ_{x_i(1)=0} = 1 if WNT5A is inactivated in the ith Boolean state x_i, so that belief mass on WNT5A-inactive states is rewarded and intervention is penalized. Three actions are available for controlling the system: A = {[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0]}.

The decomposable squared-exponential and Kronecker delta kernel functions are used for Gaussian process regression over the belief-state and action spaces, respectively. The offline and MCMC sample sizes are 10 and 3000, respectively. The average reward per step over 100 independent runs for all methods is displayed in Figure 4. Uniform and Gaussian distributions with different variances are used as priors in order to investigate the effect of prior peakedness. As expected, the highest average reward is obtained by GP-SARSA tuned to the true parameter θ*. The proposed method achieves a higher average reward than the MMRL and one-step lookahead algorithms; indeed, its expected return converges faster to that of GP-SARSA tuned to the true parameter when the prior is more peaked. As more actions are taken, the performance of MMRL approaches, but does not quite reach, the baseline performance of GP-SARSA tuned to the true parameter. The one-step lookahead method performs poorly in all cases, as it does not account for long-term rewards in the decision-making process.

5 Conclusion

In this paper, we introduced a Bayesian decision-making framework for control of MDPs with unknown dynamics and large or continuous state, action, and parameter spaces in data-poor environments.
The proposed framework does not require sustained direct interaction with the system or a simulator; instead, it plans offline over a finite sample of parameters drawn from a prior distribution over the parameter space, and transfers this knowledge efficiently to parameters sampled from the posterior during the execution process. The methodology offers several benefits, including the ability to handle large and possibly continuous state, action, and parameter spaces; data-poor environments; anytime planning; and risk in the decision-making process.

Acknowledgment

The authors acknowledge the support of the National Science Foundation through NSF award CCF-1718924.

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[2] A. Antos, C. Szepesvári, and R. Munos, "Fitted Q-iteration in continuous action-space MDPs," in Advances in Neural Information Processing Systems, pp. 9–16, 2008.

[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.

[4] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators, vol. 39. CRC Press, 2010.

[5] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, 2005.

[6] K. Doya, K. Samejima, K.-i. Katagiri, and M. Kawato, "Multiple model-based reinforcement learning," Neural Computation, vol. 14, no. 6, pp. 1347–1369, 2002.

[7] M. Ghavamzadeh, S. Mannor, J. Pineau, A. Tamar, et al., "Bayesian reinforcement learning: A survey," Foundations and Trends in Machine Learning, vol. 8, no. 5–6, pp. 359–483, 2015.

[8] P. Poupart, N. Vlassis, J. Hoey, and K. Regan, "An analytic solution to discrete Bayesian reinforcement learning," in Proceedings of the 23rd International Conference on Machine Learning, pp. 697–704, ACM, 2006.

[9] A. Guez, D. Silver, and P. Dayan, "Efficient Bayes-adaptive reinforcement learning using sample-based search," in Advances in Neural Information Processing Systems, pp. 1025–1033, 2012.

[10] J. Asmuth and M. L. Littman, "Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search," arXiv preprint arXiv:1202.3699, 2012.

[11] Y. Wang, K. S. Won, D. Hsu, and W. S. Lee, "Monte Carlo Bayesian reinforcement learning," arXiv preprint arXiv:1206.6449, 2012.

[12] N. A. Vien, W. Ertel, V.-H. Dang, and T. Chung, "Monte-Carlo tree search for Bayesian reinforcement learning," Applied Intelligence, vol. 39, no. 2, pp. 345–353, 2013.

[13] S. Ross and J. Pineau, "Model-based Bayesian reinforcement learning in large structured domains," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, vol. 2008, p. 476, 2008.

[14] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans, "Bayesian sparse sampling for on-line reward optimization," in Proceedings of the 22nd International Conference on Machine Learning, pp. 956–963, ACM, 2005.

[15] R. Fonteneau, L. Busoniu, and R. Munos, "Optimistic planning for belief-augmented Markov decision processes," in 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 77–84, IEEE, 2013.

[16] A. Guez, D. Silver, and P. Dayan, "Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search," Journal of Artificial Intelligence Research, vol. 48, pp. 841–883, 2013.

[17] W. B. Powell and I. O. Ryzhov, Optimal Learning, vol. 841. John Wiley & Sons, 2012.

[18] N. Drougard, F. Teichteil-Königsbuch, J.-L. Farges, and D. Dubois, "Structured possibilistic planning using decision diagrams," in AAAI, pp. 2257–2263, 2014.

[19] F. W. Trevizan, F. G. Cozman, and L. N. de Barros, "Planning under risk and Knightian uncertainty," in IJCAI, pp. 2023–2028, 2007.

[20] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, Belmont, MA, 1995.

[21] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.

[22] M. Gasic and S. Young, "Gaussian processes for POMDP-based dialogue manager optimization," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 28–40, 2014.

[23] W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, no. 1, pp. 97–109, 1970.

[24] W. R. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. CRC Press, 1995.

[25] U. Braga-Neto, "Optimal state estimation for Boolean dynamical systems," in Conference Record of the Forty-Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1050–1054, IEEE, 2011.

[26] M. Imani and U. Braga-Neto, "Maximum-likelihood adaptive filter for partially-observed Boolean dynamical systems," IEEE Transactions on Signal Processing, vol. 65, no. 2, pp. 359–371, 2017.

[27] M. Imani and U. M. Braga-Neto, "Point-based methodology to monitor and control gene regulatory networks via noisy measurements," IEEE Transactions on Control Systems Technology, 2018.

[28] M. Imani and U. M. Braga-Neto, "Finite-horizon LQR controller for partially-observed Boolean dynamical systems," Automatica, vol. 95, pp. 172–179, 2018.

[29] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, et al., "Molecular classification of cutaneous malignant melanoma by gene expression profiling," Nature, vol. 406, no. 6795, pp. 536–540, 2000.

[30] A. T. Weeraratna, Y. Jiang, G. Hostetter, K. Rosenblatt, P. Duray, M. Bittner, and J. M. Trent, "Wnt5a signaling directly affects cell motility and invasion of metastatic melanoma," Cancer Cell, vol. 1, no. 3, pp. 279–288, 2002.