{"title": "Data Generation as Sequential Decision Making", "book": "Advances in Neural Information Processing Systems", "page_first": 3249, "page_last": 3257, "abstract": "We connect a broad class of generative models through their shared reliance on sequential decision making. Motivated by this view, we develop extensions to an existing model, and then explore the idea further in the context of data imputation -- perhaps the simplest setting in which to investigate the relation between unconditional and conditional generative modelling. We formulate data imputation as an MDP and develop models capable of representing effective policies for it. We construct the models using neural networks and train them using a form of guided policy search. Our models generate predictions through an iterative process of feedback and refinement. We show that this approach can learn effective policies for imputation problems of varying difficulty and across multiple datasets.", "full_text": "Data Generation as Sequential Decision Making\n\nPhilip Bachman\n\nDoina Precup\n\nMcGill University, School of Computer Science\n\nMcGill University, School of Computer Science\n\nphil.bachman@gmail.com\n\ndprecup@cs.mcgill.ca\n\nAbstract\n\nWe connect a broad class of generative models through their shared reliance on\nsequential decision making. Motivated by this view, we develop extensions to an\nexisting model, and then explore the idea further in the context of data imputation\n\u2013 perhaps the simplest setting in which to investigate the relation between uncon-\nditional and conditional generative modelling. We formulate data imputation as\nan MDP and develop models capable of representing effective policies for it. We\nconstruct the models using neural networks and train them using a form of guided\npolicy search [9]. Our models generate predictions through an iterative process of\nfeedback and re\ufb01nement. 
We show that this approach can learn effective policies\nfor imputation problems of varying difficulty and across multiple datasets.\n\n1\n\nIntroduction\n\nDirected generative models are naturally interpreted as specifying sequential procedures for generating data. We traditionally think of this process as sampling, but one could also view it as making\nsequences of decisions for how to set the variables at each node in a model, conditioned on the\nsettings of its parents, thereby generating data from the model. The large body of existing work\non reinforcement learning provides powerful tools for addressing such sequential decision making\nproblems. We encourage the use of these tools to understand and improve the extended processes\ncurrently driving advances in generative modelling. We show how sequential decision making can be\napplied to general prediction tasks by developing models which construct predictions by iteratively\nrefining a working hypothesis under guidance from exogenous input and endogenous feedback.\nWe begin this paper by reinterpreting several recent generative models as sequential decision making\nprocesses, and then show how changes inspired by this point of view can improve the performance\nof the LSTM-based model introduced in [3]. Next, we explore the connections between directed\ngenerative models and reinforcement learning more fully by developing an approach to training\npolicies for sequential data imputation. We base our approach on formulating imputation as a finite-horizon Markov Decision Process which one can also interpret as a deep, directed graphical model.\nWe propose two policy representations for the imputation MDP. One extends the model in [3] by\ninserting an explicit feedback loop into the generative process, and the other addresses the MDP\nmore directly. We train our models/policies using techniques motivated by guided policy search\n[9, 10, 11, 8]. 
We examine their qualitative and quantitative performance across imputation problems\ncovering a range of difficulties (i.e. different amounts of data to impute and different \u201cmissingness\nmechanisms\u201d), and across multiple datasets. Given the relative paucity of existing approaches to the\ngeneral imputation problem, we compare our models to each other and to two simple baselines. We\nalso test how our policies perform when they use fewer/more steps to refine their predictions.\nAs imputation encompasses both classification and standard (i.e. unconditional) generative modelling, our work suggests that further study of models for the general imputation problem is worthwhile. The performance of our models suggests that sequential stochastic construction of predictions, guided by both input and feedback, should prove useful for a wide range of problems. Training\nthese models can be challenging, but lessons from reinforcement learning may bring some relief.\n\n2 Directed Generative Models as Sequential Decision Processes\n\nDirected generative models have grown in popularity relative to their undirected counterparts [6,\n14, 12, 4, 5, 16, 15] (etc.). Reasons include: the development of efficient methods for training them,\nthe ease of sampling from them, and the tractability of bounds on their log-likelihoods. Growth in\navailable computing power compounds these benefits. One can interpret the (ancestral) sampling\nprocess in a directed model as repeatedly setting subsets of the latent variables to particular values,\nin a sequence of decisions conditioned on preceding decisions. Each subsequent decision restricts\nthe set of potential outcomes for the overall sequence. Intuitively, these models encode stochastic\nprocedures for constructing plausible observations. 
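This decision-making view of ancestral sampling can be sketched concretely. The snippet below is a minimal illustration, not the paper's model: the toy Gaussian conditionals and the `sample_trajectory` helper are our own assumptions, standing in for learned neural-network conditionals.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trajectory(T=5, dim=2):
    """Ancestral sampling viewed as T+1 sequential decisions.

    Each step t 'decides' z_t conditioned on a state s_t that summarizes
    all earlier decisions z_0, ..., z_{t-1}. The conditionals here are
    toy Gaussians; a real model would parameterize them with networks.
    """
    z = rng.normal(size=dim)            # z_0 ~ p_0(z_0)
    state = z.copy()                    # s_1 is determined by z_0
    trajectory = [z]
    for t in range(1, T + 1):
        # decision z_t ~ p_t(z_t | s_t): the mean depends on the state
        z = np.tanh(state) + 0.5 * rng.normal(size=dim)
        state = state + z               # s_{t+1} determined by z_0..z_t
        trajectory.append(z)
    # terminal decision: emit the observation x ~ p(x | z_0, ..., z_T)
    x = state + 0.1 * rng.normal(size=dim)
    return trajectory, x

traj, x = sample_trajectory()
```

Each call commits to one setting of the latent variables in sequence, and each decision restricts the outcomes still reachable by the remainder of the trajectory.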
This section formally explores this perspective.\n\n2.1 Deep AutoRegressive Networks\n\nThe deep autoregressive networks investigated in [4] define distributions of the following form:\n\np(x) = \\sum_z p(x|z) p(z), with p(z) = p0(z0) \\prod_{t=1}^{T} pt(zt|z0, ..., zt\u22121) (1)\n\nin which x indicates a generated observation and z0, ..., zT represent latent variables in the model.\nThe distribution p(x|z) may be factored similarly to p(z). The form of p(z) in Eqn. 1 can represent\narbitrary distributions over the latent variables, and the work in [4] mainly concerned approaches\nto parameterizing the conditionals pt(zt|z0, ..., zt\u22121) that restricted representational power\nin exchange for computational tractability. To appreciate the generality of Eqn. 1, consider using zt\nthat are univariate, multivariate, structured, etc. One can interpret any model based on this sequential factorization of p(z) as a non-stationary policy pt(zt|st) for selecting each action zt in a state\nst, with each st determined by all zt\u2032 for t\u2032 < t, and train it using some form of policy search.\n\n2.2 Generalized Guided Policy Search\n\nWe adopt a broader interpretation of guided policy search than one might initially take from, e.g.,\n[9, 10, 11, 8]. We provide a review of guided policy search in the supplementary material. Our\nexpanded definition of guided policy search includes any optimization of the general form:\n\nminimize_{p,q} E_{iq \u223c Iq} E_{ip \u223c Ip(\u00b7|iq)} E_{\u03c4 \u223c q(\u03c4|iq, ip)} [\u2113(\u03c4, iq, ip)] + \u03bb div(q(\u03c4|iq, ip), p(\u03c4|ip)) (2)\n\nin which p indicates the primary policy, q indicates the guide policy, Iq indicates a distribution over\ninformation available only to q, Ip indicates a distribution over information available to both p and\nq, \u2113(\u03c4, iq, ip) computes the cost of trajectory \u03c4 in the context of iq/ip, and div(q(\u03c4|iq, ip), p(\u03c4|ip))\nmeasures dissimilarity between the trajectory distributions generated by p/q. As \u03bb > 0 goes to\ninfinity, Eqn. 2 enforces the constraint p(\u03c4|ip) = q(\u03c4|iq, ip), \u2200\u03c4, ip, iq. Terms for controlling, e.g.,\nthe entropy of p/q can also be added. The power of the objective in Eqn. 2 stems from two main\npoints: the guide policy q can use information iq that is unavailable to the primary policy p, and the\nprimary policy need only be trained to minimize the dissimilarity term div(q(\u03c4|iq, ip), p(\u03c4|ip)).\nFor example, a directed model structured as in Eqn. 1 can be interpreted as specifying a policy for\na finite-horizon MDP whose terminal state distribution encodes p(x). In this MDP, the state at time\n1 \u2264 t \u2264 T + 1 is determined by {z0, ..., zt\u22121}. The policy picks an action zt \u2208 Zt at time 1 \u2264 t \u2264 T ,\nand picks an action x \u2208 X at time t = T + 1. I.e., the policy can be written as pt(zt|z0, ..., zt\u22121)\nfor 1 \u2264 t \u2264 T , and as p(x|z0, ..., zT ) for t = T + 1. The initial state z0 \u2208 Z0 is drawn from p0(z0).\nExecuting the policy for a single trial produces a trajectory \u03c4 \u225c {z0, ..., zT , x}, and the distribution\nover xs from these trajectories is just p(x) in the corresponding directed generative model.\nThe authors of [4] train deep autoregressive networks by maximizing a variational lower bound on\nthe training set log-likelihood. 
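The path-wise KL term produced by this guided-policy-search reading can be estimated by plain Monte Carlo. Below is a minimal sketch under our own assumptions: a scalar chain with fixed toy Gaussian conditionals, where the guide q peeks at x* (the guide-only information) and the primary p does not; the names `gauss_logpdf` and `pathwise_kl_estimate` are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_logpdf(x, mean, std):
    # log-density of a (possibly vector) Gaussian with diagonal std
    return -0.5 * np.sum(((x - mean) / std) ** 2
                         + np.log(2 * np.pi * std ** 2))

def pathwise_kl_estimate(x_star, T=4, n_samples=256):
    """Monte-Carlo estimate of the path-wise KL between a guide chain
    q(tau | x*) and a primary chain p(tau), for toy Gaussian steps.

    The guide nudges each z_t toward x*; the terminal guide step is a
    Dirac at x*, which contributes -log p(x* | z) to the log-ratio.
    """
    total = 0.0
    for _ in range(n_samples):
        log_ratio = 0.0
        z = 0.5 * x_star + rng.normal()           # z_0 ~ q_0(z_0 | x*)
        log_ratio += gauss_logpdf(z, 0.5 * x_star, 1.0) \
                     - gauss_logpdf(z, 0.0, 1.0)  # log q_0 / p_0
        for t in range(1, T + 1):
            mean_q = 0.5 * (z + x_star)           # q_t sees x*
            z_next = mean_q + rng.normal()
            log_ratio += gauss_logpdf(z_next, mean_q, 1.0) \
                         - gauss_logpdf(z_next, z, 1.0)  # log q_t / p_t
            z = z_next
        # terminal step: q is a Dirac at x*, contributing -log p(x* | z)
        log_ratio -= gauss_logpdf(x_star, z, 1.0)
        total += log_ratio
    return total / n_samples

kl = pathwise_kl_estimate(x_star=2.0)
```

Only sampling from q and evaluating the per-step log-densities of both chains is required, which is what makes the path-wise formulation convenient when the conditionals are tractable.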
To do this, they introduce a variational distribution q which provides\nq0(z0|x\u2217) and qt(zt|z0, ..., zt\u22121, x\u2217) for 1 \u2264 t \u2264 T , with the final step q(x|z0, ..., zT , x\u2217) given by\na Dirac-delta at x\u2217. Given these definitions, the training in [4] can be interpreted as guided policy\nsearch for the MDP described in the previous paragraph. Specifically, the variational distribution q\nprovides a guide policy q(\u03c4|x\u2217) over trajectories \u03c4 \u225c {z0, ..., zT , x\u2217}:\n\nq(\u03c4|x\u2217) \u225c q(x|z0, ..., zT , x\u2217) q0(z0|x\u2217) \\prod_{t=1}^{T} qt(zt|z0, ..., zt\u22121, x\u2217) (3)\n\nThe primary policy p generates trajectories distributed according to:\n\np(\u03c4 ) \u225c p(x|z0, ..., zT ) p0(z0) \\prod_{t=1}^{T} pt(zt|z0, ..., zt\u22121) (4)\n\nwhich does not depend on x\u2217. In this case, x\u2217 corresponds to the guide-only information iq \u223c Iq in\nEqn. 2. We now rewrite the variational optimization as:\n\nminimize_{p,q} E_{x\u2217 \u223c DX} E_{\u03c4 \u223c q(\u03c4|x\u2217)} [\u2113(\u03c4, x\u2217)] + KL(q(\u03c4|x\u2217) || p(\u03c4 )) (5)\n\nwhere \u2113(\u03c4, x\u2217) \u225c 0 and DX indicates the target distribution for the terminal state of the primary\npolicy p.1 When expanded, the KL term in Eqn. 5 becomes:\n\nKL(q(\u03c4|x\u2217) || p(\u03c4 )) = E_{\u03c4 \u223c q(\u03c4|x\u2217)} [ log(q0(z0|x\u2217) / p0(z0)) + \\sum_{t=1}^{T} log(qt(zt|z0, ..., zt\u22121, x\u2217) / pt(zt|z0, ..., zt\u22121)) \u2212 log p(x\u2217|z0, ..., zT ) ] (6)\n\nThus, the variational approach used in [4] for training directed generative models can be interpreted\nas a form of generalized guided policy search. As the form in Eqn. 
1 can represent any finite directed\ngenerative model, the preceding derivation extends to all models we discuss in this paper.2\n\n2.3 Time-reversible Stochastic Processes\nOne can simplify Eqn. 1 by assuming suitable forms for X and Z0, ..., ZT . E.g., the authors of [16]\nproposed a model in which Zt \u2261 X for all t and p0(x0) was Gaussian. We can write their model as:\n\np(xT ) = \\sum_{x0, ..., xT\u22121} pT (xT |xT\u22121) p0(x0) \\prod_{t=1}^{T\u22121} pt(xt|xt\u22121) (7)\n\nwhere p(xT ) indicates the terminal state distribution of the non-stationary, finite-horizon Markov\nprocess determined by {p0(x0), p1(x1|x0), ..., pT (xT |xT\u22121)}. Note that, throughout this paper, we\n(ab)use sums over latent variables and trajectories which could/should be written as integrals.\nThe authors of [16] observed that, for any reasonably smooth target distribution DX and sufficiently\nlarge T , one can define a \u201creverse-time\u201d stochastic process qt(xt\u22121|xt) with simple, time-invariant\ndynamics that transforms q(xT ) \u225c DX into the Gaussian distribution p0(x0). This q is given by:\n\nq0(x0) = \\sum_{x1, ..., xT} q1(x0|x1) DX (xT ) \\prod_{t=2}^{T} qt(xt\u22121|xt) \u2248 p0(x0) (8)\n\nNext, we define q(\u03c4 ) as the distribution over trajectories \u03c4 \u225c {x0, ..., xT } generated by the reverse-time process determined by {q1(x0|x1), ..., qT (xT\u22121|xT ), DX (xT )}. We define p(\u03c4 ) as the distribution over trajectories generated by the \u201cforward-time\u201d process in Eqn. 7. The training in [16] is\nequivalent to guided policy search using guide trajectories sampled from q, i.e. 
it uses the objective:\n\nminimize_{p,q} E_{\u03c4 \u223c q(\u03c4 )} [ log(q1(x0|x1) / p0(x0)) + \\sum_{t=1}^{T\u22121} log(qt+1(xt|xt+1) / pt(xt|xt\u22121)) + log(DX (xT ) / pT (xT |xT\u22121)) ] (9)\n\nwhich corresponds to minimizing KL(q || p). If the log-densities in Eqn. 9 are tractable, then this\nminimization can be done using basic Monte-Carlo. If, as in [16], the reverse-time process q is not\ntrained, then Eqn. 9 simplifies to: minimize_p E_{q(\u03c4 )} [\u2212 log p0(x0) \u2212 \\sum_{t=1}^{T} log pt(xt|xt\u22121)].\nThis trick for generating guide trajectories exhibiting a particular distribution over terminal states\nxT \u2013 i.e. running dynamics backwards in time starting from xT \u223c DX \u2013 may prove useful in settings\nother than those considered in [16]. E.g., the LapGAN model in [1] learns to approximately invert\na fixed (and information destroying) reverse-time process. The supplementary material expands on\nthe content of this subsection, including a derivation of Eqn. 9 as a bound on E_{x \u223c DX} [\u2212 log p(x)].\n\n1We could pull the \u2212 log p(x\u2217|z0, ..., zT ) term from the KL and put it in the cost \u2113(\u03c4, x\u2217), but we prefer the\n\u201cpath-wise KL\u201d formulation for its elegance. We abuse notation using KL(\u03b4(x = x\u2217) || p(x)) \u225c \u2212 log p(x\u2217).\n\n2This also includes all generative models implemented and executed on an actual computer.\n\n2.4 Learning Generative Stochastic Processes with LSTMs\n\nThe authors of [3] introduced a model for sequentially-deep generative processes. We interpret their\nmodel as a primary policy p which generates trajectories \u03c4 \u225c {z0, ..., zT , x} with distribution:\n\np(\u03c4 ) \u225c p(x|s\u03b8(\u03c4