Part of Advances in Neural Information Processing Systems 14 (NIPS 2001)
It is desirable that a complex decision-making problem in an uncer(cid:173) tain world be adequately modeled by a Markov Decision Process (MDP) whose structural representation is adaptively designed by a parsimonious resources allocation process. Resources include time and cost of exploration, amount of memory and computational time allowed for the policy or value function representation. Concerned about making the best use of the available resources, we address the problem of efficiently estimating where adding extra resources is highly needed in order to improve the expected performance of the resulting policy. Possible application in reinforcement learning (RL) , when real-world exploration is highly costly, concerns the de(cid:173) tection of those areas of the state-space that need primarily to be explored in order to improve the policy. Another application con(cid:173) cerns approximation of continuous state-space stochastic control problems using adaptive discretization techniques for which highly efficient grid points allocation is mandatory to survive high dimen(cid:173) sionality. Maybe surprisingly these two problems can be formu(cid:173) lated under a common framework: for a given resource allocation, which defines a belief state over possible MDPs, find where adding new resources (thus decreasing the uncertainty of some parame(cid:173) ters -transition probabilities or rewards) will most likely increase the expected performance of the new policy. To do so, we use sam(cid:173) pling techniques for estimating the contribution of each parameter's probability distribution function (Pdf) to the expected loss of us(cid:173) ing an approximate policy (such as the optimal policy of the most probable MDP) instead of the true (but unknown) policy.