{"title": "A neurally plausible model learns successor representations in partially observable environments", "book": "Advances in Neural Information Processing Systems", "page_first": 13714, "page_last": 13724, "abstract": "Animals need to devise strategies to maximize returns while interacting with their environment based on incoming noisy sensory observations. Task-relevant states, such as the agent's location within an environment or the presence of a predator, are often not directly observable but must be inferred using available sensory information. Successor representations (SR) have been proposed as a middle-ground between model-based and model-free reinforcement learning strategies, allowing for fast value computation and rapid adaptation to changes in the reward function or goal locations.  Indeed, recent studies suggest that features of neural responses are consistent with the SR framework.  However, it is not clear how such representations might be learned and computed in partially observed, noisy environments. Here, we introduce a neurally plausible model using \\emph{distributional successor features}, which builds on the distributed distributional code for the representation and computation of uncertainty, and which allows for efficient value function computation in partially observed environments via the successor representation. We show that distributional successor features can support reinforcement learning in noisy environments in which direct learning of successful policies is infeasible.", "full_text": "A neurally plausible model learns successor\n\nrepresentations in partially observable environments\n\nEszter V\u00e9rtes Maneesh Sahani\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\n\nLondon W1T 4JG, UK.\n\n{eszter,maneesh}@gatsby.ucl.ac.uk\n\nAbstract\n\nAnimals need to devise strategies to maximize returns while interacting with their\nenvironment based on incoming noisy sensory observations. 
Task-relevant states,\nsuch as the agent\u2019s location within an environment or the presence of a predator,\nare often not directly observable but must be inferred using available sensory\ninformation. Successor representations (SR) have been proposed as a middle-ground\nbetween model-based and model-free reinforcement learning strategies, allowing\nfor fast value computation and rapid adaptation to changes in the reward function\nor goal locations. Indeed, recent studies suggest that features of neural responses\nare consistent with the SR framework. However, it is not clear how such\nrepresentations might be learned and computed in partially observed, noisy environments.\nHere, we introduce a neurally plausible model using distributional successor\nfeatures, which builds on the distributed distributional code for the representation and\ncomputation of uncertainty, and which allows for efficient value function\ncomputation in partially observed environments via the successor representation. We show\nthat distributional successor features can support reinforcement learning in noisy\nenvironments in which direct learning of successful policies is infeasible.\n\n1 Introduction\n\nHumans and other animals are able to evaluate long-term consequences of their actions and adapt\ntheir behaviour to maximize reward across different environments. This behavioural flexibility is\noften thought to result from the interaction of two adaptive systems implementing model-based and\nmodel-free reinforcement learning (RL).\nModel-based learning allows for flexible goal-directed behaviour, acquiring an internal model of the\nenvironment which is used to evaluate the consequences of actions. As a result, an agent can rapidly\nadjust its policy to localized changes in the environment or in reward function. But this flexibility\ncomes at a high computational cost, as optimal actions and value functions depend on expensive\nsimulations in the model. 
Model-free methods, on the other hand, learn cached values for states\nand actions, enabling rapid action selection. However, this approach is particularly slow to adapt to\nchanges in the task, as adjusting behaviour even to localized changes, e.g. in the placement of the\nreward, requires updating cached values at all states in the environment. It has been suggested that the\nbrain makes use of both of these complementary approaches, and that they may compete for behavioural\ncontrol (Daw et al., 2005); indeed, several behavioural studies suggest that subjects implement a\nhybrid of model-free and model-based strategies (Daw et al., 2011; Gl\u00e4scher et al., 2010).\nSuccessor representations (SR; Dayan, 1993) augment the internal state used by model-free systems with\nthe expected future occupancy of each world state. SRs can be viewed as a precompiled representation\nof the model under a given policy. Thus, learning based on SRs falls between model-free and\nmodel-based approaches and correspondingly can reproduce a range of behaviours (Russek et al., 2017).\nRecent studies have reported evidence consistent with SRs in rodent hippocampal and human\nbehavioural data (Stachenfeld et al., 2017; Momennejad et al., 2017).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nMotivated by both theoretical and experimental work arguing that neural RL systems operate over\nlatent states and need to handle state uncertainty (Dayan and Daw, 2008; Gershman, 2018; Starkweather\net al., 2017), our work takes the successor framework further by considering partially observable\nenvironments. Adopting the framework of distributed distributional coding (V\u00e9rtes and Sahani, 2018),\nwe show how learnt latent dynamical models of the environment can be naturally integrated with\nSRs defined over the latent space. 
We begin with short overviews of reinforcement learning in the\npartially observed setting (section 2); the SR (section 3); and distributed distributional codes (DDCs)\n(section 4). In section 5, we describe how using DDCs in the generative and recognition models leads\nto a particularly simple algorithm for learning latent state dynamics and the associated SR.\n\n2 Partially observable Markov decision processes\n\nMarkov decision processes (MDPs) provide a framework for modelling a wide range of sequential\ndecision-making tasks relevant for reinforcement learning. An MDP is defined by a set of states\n$S$ and a set of actions $A$, a reward function $R : S \\times A \\to \\mathbb{R}$, and a probability distribution $T(s' \\mid s, a)$ that\ndescribes the Markovian dynamics of the states conditioned on actions of the agent. For notational\nconvenience we will take the reward function to be independent of action, depending only on state;\nbut the approach we describe is easily extended to the more general case. A partially observable\nMarkov decision process (POMDP) is a generalization of an MDP where the Markovian states $s \\in S$\nare not directly observable to the agent. Instead, the agent receives observations ($o \\in O$) that depend\non the current latent state via an observation process $Z(o \\mid s)$. Formally, a POMDP is a tuple\n$(S, A, T, R, O, Z, \\gamma)$, comprising the objects defined above and a discount factor $\\gamma$. POMDPs can be\ndefined over either discrete or continuous state spaces. Here, we focus on the more general continuous\ncase, although the model we present is applicable to discrete state spaces as well.\n\n3 The successor representation\n\nAs an agent explores an environment, the states it visits are ordered by the agent\u2019s policy and the\ntransition structure of the world. State representations that respect this dynamic ordering are likely to\nbe more efficient for value estimation and may promote more effective generalization. 
This may not\nbe true of the observed state coordinates. For instance, a barrier in a spatial environment might mean\nthat two states with adjacent physical coordinates are associated with very different values.\nDayan (1993) argued that a natural state space for model-free value estimation is one where distances\nbetween states reflect the similarity of future paths given the agent\u2019s policy. The successor\nrepresentation (SR; Dayan, 1993) for state $s_i$ is defined as the expected discounted sum of future occupancies\nfor each state $s_j$, given the current state $s_i$:\n\n$$M^{\\pi}(s_i, s_j) = \\mathbb{E}_{\\pi}\\Big[ \\sum_{k=0}^{\\infty} \\gamma^{k}\\, \\mathbb{I}[s_{t+k} = s_j] \\;\\Big|\\; s_t = s_i \\Big] . \\qquad (1)$$\n\nThat is, in a discrete state space, the SR is an $N \\times N$ matrix where $N$ is the number of states in the\nenvironment. The SR depends on the current policy $\\pi$ through the expectation on the right-hand side\nof eq. 1, taken with respect to a (possibly stochastic) policy $p^{\\pi}(a_t \\mid s_t)$ and environmental transitions\n$T(s_{t+1} \\mid s_t, a_t)$. The SR makes it possible to express the value function in a particularly simple form.\nFollowing from eq. 1 and the usual definition of the value function:\n\n$$V^{\\pi}(s_i) = \\sum_{j} M^{\\pi}(s_i, s_j) R(s_j) , \\qquad (2)$$\n\nwhere $R(s_j)$ is the immediate reward in state $s_j$.\nThe successor matrix $M^{\\pi}$ can be learned by temporal difference (TD) learning (Sutton, 1988), in\nmuch the same way as TD is used to update value functions. In particular, the SR is updated according\nto a TD error:\n\n$$\\delta_t(s_j) = \\mathbb{I}[s_t = s_j] + \\gamma M^{\\pi}(s_{t+1}, s_j) - M^{\\pi}(s_t, s_j) , \\qquad (3)$$\n\nwhich reflects errors in state predictions rather than rewards, a learning signal typically associated\nwith model-based RL.\nAs shown in eq. 2, the value function can be factorized into the SR (i.e., information about expected\nfuture states under the policy) and the instantaneous reward in each state\u00b9. 
This modularity enables rapid\npolicy evaluation under changing reward conditions: for a fixed policy only the reward function needs\nto be relearned to evaluate $V^{\\pi}(s)$. This contrasts with both model-free and model-based algorithms,\nwhich require extensive experience or rely on computationally expensive evaluation, respectively, to\nrecompute the value function.\n\n3.1 Successor representation using features\n\nThe successor representation can be generalized to continuous states $s \\in S$ by using a set of feature\nfunctions $\\{\\psi_i(s)\\}$ defined over $S$. In this setting, the successor representation (also referred to as the\nsuccessor feature representation or SF) encodes expected feature values instead of occupancies of\nindividual states:\n\n$$M^{\\pi}(s_t, i) = \\mathbb{E}_{\\pi}\\Big[ \\sum_{k=0}^{\\infty} \\gamma^{k} \\psi_i(s_{t+k}) \\;\\Big|\\; s_t \\Big] \\qquad (4)$$\n\nAssuming that the reward function can be written (or approximated) as a linear function of the\nfeatures, $R(s) = w_{\\mathrm{rew}}^{\\top} \\psi(s)$ (where the feature values are collected into a vector $\\psi(s)$), the value\nfunction $V^{\\pi}(s_t)$ has a simple form analogous to the discrete case:\n\n$$V^{\\pi}(s_t) = w_{\\mathrm{rew}}^{\\top} M^{\\pi}(s_t) \\qquad (5)$$\n\nFor consistency, we can use linear function approximation with the same set of features as in eq. 
4 to\nparametrize the successor features $M^{\\pi}(s_t, i)$:\n\n$$M^{\\pi}(s_t, i) \\approx \\sum_{j} U_{ij} \\psi_j(s_t) \\qquad (6)$$\n\nThe form of the SFs, embodied by the weights $U_{ij}$, can be found by temporal difference learning:\n\n$$\\Delta U_{ij} = \\delta_i \\psi_j(s_t), \\qquad \\delta_i = \\psi_i(s_t) + \\gamma M(s_{t+1}, i) - M(s_t, i) \\qquad (7)$$\n\nAs we have seen in the discrete case, the TD error here signals prediction errors about features of\nstate, rather than about reward.\n\n4 Distributed distributional codes\n\nDistributed distributional codes (DDCs) are a candidate for the neural representation of uncertainty\n(Zemel et al., 1998; Sahani and Dayan, 2003) and recently have been shown to support accurate\ninference and learning in hierarchical latent variable models (V\u00e9rtes and Sahani, 2018). In a DDC, a\npopulation of neurons represents distributions in its firing rates implicitly, as a set of expectations:\n\n$$\\mu = \\mathbb{E}_{p(s)}[\\psi(s)] \\qquad (8)$$\n\nwhere $\\mu$ is a vector of firing rates, $p(s)$ is the represented distribution, and $\\psi(s)$ is a vector of encoding\nfunctions specific to each neuron. DDCs can be thought of as representing exponential family\ndistributions with sufficient statistics $\\psi(s)$ using their mean parameters $\\mathbb{E}_{p(s)}[\\psi(s)]$ (Wainwright and\nJordan, 2008).\n\n5 Distributional successor representation\n\nAs discussed above, the successor representation can support efficient value computation by\nincorporating information about the policy and the environment into the state representation. 
However, in\nmore realistic settings, the states themselves are not directly observable and the agent is limited to\nstate-dependent noisy sensory information.\n\n\u00b9 Alternatively, for the more general case of action-dependent reward, the expected instantaneous reward\nunder the policy-dependent action in each state.\n\nAlgorithm 1 Wake-sleep algorithm in the DDC state-space model\n\nInitialise $T$, $W$\nwhile not converged do\n    Sleep phase:\n        sample: $\\{s_t^{\\mathrm{sleep}}, o_t^{\\mathrm{sleep}}\\}_{t=0 \\ldots N} \\sim p(S_N, O_N)$\n        update $W$: $\\Delta W \\propto \\sum_t \\big( \\psi(s_t^{\\mathrm{sleep}}) - f_W(\\mu_{t-1}(O_{t-1}^{\\mathrm{sleep}}), o_t^{\\mathrm{sleep}}) \\big) \\nabla_W f_W$\n    Wake phase:\n        $O_N \\leftarrow$ {collect observations}\n        infer posterior DDC: $\\mu_t(O_t) = f_W(\\mu_{t-1}(O_{t-1}), o_t)$\n        update $T$: $\\Delta T \\propto \\big( \\mu_{t+1}(O_{t+1}) - T \\mu_t(O_t) \\big) \\mu_t(O_t)^{\\top}$\n        update observation model parameters\nend while\n\nIn this section, we lay out how the DDC representation for uncertainty allows for learning and\ncomputing with successor representations defined over latent variables. First, we describe an algorithm\nfor learning and inference in dynamical latent variable models using DDCs. We then establish a link\nbetween the DDC and successor features (eq. 4) and show how they can be combined to learn what we\ncall the distributional successor features. We discuss different algorithmic and implementation-related\nchoices for the proposed scheme and their implications.\n\n5.1 Learning and inference in a state space model using DDCs\n\nHere, we consider POMDPs where the state-space transition model is itself defined by a conditional\nDDC with means that depend linearly on the preceding state features. 
That is, the conditional\ndistribution describing the latent dynamics implied by following the policy $\\pi$ can be written in the\nfollowing form:\n\n$$p^{\\pi}(s_{t+1} \\mid s_t) \\;\\Leftrightarrow\\; \\mathbb{E}_{s_{t+1} \\mid s_t, \\pi}[\\psi(s_{t+1})] = T^{\\pi} \\psi(s_t) \\qquad (9)$$\n\nwhere $T^{\\pi}$ is a matrix parametrizing the functional relationship between $s_t$ and the expectation of\n$\\psi(s_{t+1})$ with respect to $p^{\\pi}(s_{t+1} \\mid s_t)$.\nThe agent has access only to sensory observations $o_t$ at each time step, and in order to be able to make\nuse of the underlying latent structure, it has to learn the parameters of the generative model $p(s_{t+1} \\mid s_t)$,\n$p(o_t \\mid s_t)$ as well as learn to perform inference in that model.\nWe consider online inference (filtering), i.e. at each time step $t$ the recognition model produces an\nestimate $q(s_t \\mid O_t)$ of the posterior distribution $p(s_t \\mid O_t)$ given all observations up to time $t$:\n$O_t = (o_1, o_2, \\ldots, o_t)$. As in the DDC Helmholtz machine (V\u00e9rtes and Sahani, 2018), these distributions are\nrepresented by a set of expectations, i.e., by a DDC:\n\n$$\\mu_t(O_t) = \\mathbb{E}_{q(s_t \\mid O_t)}[\\psi(s_t)] \\qquad (10)$$\n\nThe filtering posterior $\\mu_t(O_t)$ is computed iteratively, using the posterior in the previous time step\n$\\mu_{t-1}(O_{t-1})$ and the new observation $o_t$. The Markovian structure of the state space model (see fig.\n1) ensures that the recognition model can be written as a recursive function:\n\n$$\\mu_t(O_t) = f_W(\\mu_{t-1}(O_{t-1}), o_t) \\qquad (11)$$\n\nwith a set of parameters $W$.\nThe recognition and generative models are updated using an adapted version of the wake-sleep\nalgorithm (Hinton et al., 1995; V\u00e9rtes and Sahani, 2018). In the following, we describe the two\nphases of the algorithm in more detail (see Algorithm 1).\n\nSleep phase\n\nThe aim of the sleep phase is to adjust the parameters of the recognition model given the current\ngenerative model. 
Specifically, the recognition model should approximate the expectation of the DDC\nencoding functions $\\psi(s_t)$ under the filtering posterior $p(s_t \\mid O_t)$. This can be achieved by moment\nmatching, i.e., simulating a sequence of latent and observed states from the current model and\nminimizing the Euclidean distance between the output of the recognition model and the sufficient\nstatistic vector $\\psi(\\cdot)$ evaluated at the latent state from the next time step:\n\n$$W \\leftarrow \\operatorname*{argmin}_{W} \\sum_{t} \\big\\lVert \\psi(s_t^{\\mathrm{sleep}}) - f_W(\\mu_{t-1}(O_{t-1}^{\\mathrm{sleep}}), o_t^{\\mathrm{sleep}}) \\big\\rVert^2 \\qquad (12)$$\n\nwhere $\\{s_t^{\\mathrm{sleep}}, o_t^{\\mathrm{sleep}}\\}_{t=0 \\ldots N} \\sim p(s_0)\\, p(o_0 \\mid s_0) \\prod_{t=0}^{N-1} p(s_{t+1} \\mid s_t, T^{\\pi})\\, p(o_{t+1} \\mid s_{t+1})$.\n\nThis update rule can be implemented online as samples are simulated, and after a sufficiently\nlong simulated sequence (or multiple sequences) $\\{s_t^{\\mathrm{sleep}}, o_t^{\\mathrm{sleep}}\\}_t$ the recognition model will learn to\napproximate expectations of the form $f_W(\\mu_{t-1}(O_{t-1}^{\\mathrm{sleep}}), o_t^{\\mathrm{sleep}}) \\approx \\mathbb{E}_{p(s_t \\mid O_t)}[\\psi(s_t)]$, yielding a DDC\nrepresentation of the posterior.\n\nWake phase\n\nIn the wake phase, the parameters of the generative model are adapted such that it captures the sensory\nobservations better. Here, we focus on learning the policy-dependent latent dynamics $p^{\\pi}(s_{t+1} \\mid s_t)$; the\nobservation model can be learned by the approach of V\u00e9rtes and Sahani (2018). Given a sequence of\ninferred posterior representations $\\{\\mu_t(O_t)\\}$ computed using wake phase observations, the parameters\nof the latent dynamics $T$ can be updated by minimizing a simple predictive cost function:\n\n$$T \\leftarrow \\operatorname*{argmin}_{T} \\sum_{t} \\big\\lVert \\mu_{t+1}(O_{t+1}) - T \\mu_t(O_t) \\big\\rVert^2 \\qquad (13)$$\n\nThe intuition behind eq. 13 is that for the optimal generative model the latent dynamics satisfies\nthe following equality: $T^{*} \\mu_t(O_t) = \\mathbb{E}_{p(o_{t+1} \\mid O_t)}[\\mu_{t+1}(O_{t+1})]$. 
That is, the predictions made by\ncombining the posterior at time $t$ and the prior will agree with the average posterior at the next time\nstep, making $T^{*}$ a stationary point of the optimization in eq. 13. For further details on the nature\nof the approximation implied by the wake phase update and its relationship to variational learning,\nsee the supplementary material. In practice, the update can be done online, using gradient steps\nanalogous to prediction errors:\n\n$$\\Delta T \\propto \\big( \\mu_{t+1}(O_{t+1}) - T \\mu_t(O_t) \\big) \\mu_t(O_t)^{\\top} \\qquad (14)$$\n\n[Figure 1 panels: (a) DDC state-space model (generative and recognition models); (b) Noisy 2D environment (latent dynamics; trajectories)]\n\nFigure 1: Learning and inference in a state-space model parametrized by a DDC. (a) The structure of\nthe generative and recognition models. (b) Visualization of the dynamics $T$ learned by the wake-sleep\nalgorithm (Algorithm 1). Arrows show the conditional mean $\\mathbb{E}_{s_{t+1} \\mid s_t}[s_{t+1}]$ for each location. (c) Posterior\nmean trajectories inferred using the recognition model, plotted on top of true latent and observed\ntrajectories.\n\nFigure 1 shows a state-space model corresponding to a random walk policy in the latent space with\nnoisy observations, learned using DDCs (Algorithm 1). For further details of the experiment, see the\nsupplementary material.\n\n5.2 Learning distributional successor features\n\nNext, we show how using a DDC to parametrize the generative model (eq. 9) makes it possible\nto compute the successor features defined in the latent space in a tractable form, and how this\ncomputation can be combined with inference based on sensory observations.\n\nFollowing the definition of the SFs (eq. 
4), we have:\n\n$$M(s_t) = \\mathbb{E}_{\\pi}\\Big[ \\sum_{k=0}^{\\infty} \\gamma^{k} \\psi(s_{t+k}) \\;\\Big|\\; s_t \\Big] = \\sum_{k=0}^{\\infty} \\gamma^{k}\\, \\mathbb{E}_{\\pi}[\\psi(s_{t+k}) \\mid s_t] \\qquad (15)$$\n\nWe can compute the conditional expectations of the feature vector $\\psi$ in eq. 15 by applying the\ndynamics $k$ times to the features $\\psi(s_t)$: $\\mathbb{E}_{s_{t+k} \\mid s_t}[\\psi(s_{t+k})] = T^{k} \\psi(s_t)$. Thus, we have:\n\n$$M(s_t) = \\sum_{k=0}^{\\infty} \\gamma^{k} T^{k} \\psi(s_t) \\qquad (16)$$\n$$= (I - \\gamma T)^{-1} \\psi(s_t) \\qquad (17)$$\n\nThe step from eq. 16 to eq. 17 is the matrix geometric series, which converges when the spectral radius of $\\gamma T$ is below 1.\nEq. 17 is reminiscent of the result for discrete observed state spaces $M(s_i, s_j) = (I - \\gamma P)^{-1}_{ij}$\n(Dayan, 1993), where $P$ is a matrix containing Markovian transition probabilities between states. In a\ncontinuous state space, however, finding a closed form solution like eq. 17 is non-trivial, as it requires\nevaluating a set of typically intractable integrals. The solution presented here directly exploits the\nDDC parametrization of the generative model and the correspondence between the features used in\nthe DDC and the SFs.\nIn this framework, we can not only compute the successor features in closed form in the latent space,\nbut also evaluate the distributional successor features, the posterior expectation of the SFs given a\nsequence of sensory observations:\n\n$$\\mathbb{E}_{s_t \\mid O_t}[M(s_t)] = (I - \\gamma T)^{-1}\\, \\mathbb{E}_{s_t \\mid O_t}[\\psi(s_t)] \\qquad (18)$$\n$$= (I - \\gamma T)^{-1} \\mu_t(O_t) \\qquad (19)$$\n\nThe results from this section suggest a number of different ways the distributional successor features\n$\\mathbb{E}_{s_t \\mid O_t}[M(s_t)]$ can be learned or computed.\n\n5.2.1 Learning distributional SFs during sleep phase\n\nThe matrix $U = (I - \\gamma T)^{-1}$ needed to compute distributional SFs in eq. 19 can be learned from\ntemporal differences in feature predictions based on sleep phase simulated latent state sequences\n(see eq. 6-7). 
Following a potential change in the dynamics of the environment, sleep phase learning\nallows for updating SFs and therefore cached values offline, without the need for further experience.\n\n5.2.2 Computing distributional SFs by dynamics\n\nAlternatively, eq. 19 can be implemented as a fixed point of a linear dynamical system, with recurrent\nconnections reflecting the model of the latent dynamics:\n\n$$\\tau \\dot{x} = -x + \\gamma T x + \\mu_t(O_t) \\qquad (20)$$\n$$\\Rightarrow\\; x(\\infty) = (I - \\gamma T)^{-1} \\mu_t(O_t) \\qquad (21)$$\n\nIn this case, there is no need to learn $(I - \\gamma T)^{-1}$ explicitly but it is implicitly computed through\ndynamics. For this to work, there is an underlying assumption that the dynamical system in eq. 20\nreaches equilibrium on a timescale ($\\tau$) faster than that on which the observations $O_t$ evolve.\nBoth of these approaches avoid having to compute the matrix inverse directly and allow for evaluation\nof policies given by a corresponding dynamics matrix $T^{\\pi}$ offline.\n\nFigure 2: Value functions under a random walk policy for two different reward locations. Values\nwere computed using SFs based on the latent, inferred DDC posterior or observed state variables.\n\n5.2.3 Learning distributional SFs during wake phase\n\nInstead of fully relying on the learned latent dynamics to compute the distributional SFs, we\ncan use posteriors computed by the recognition model during the wake phase, that is, using\nobserved data. We can define the distributional SFs directly on the DDC posteriors:\n$\\widetilde{M}(O_t) = \\mathbb{E}_{\\pi}[\\sum_{k} \\gamma^{k} \\mu_{t+k}(O_{t+k}) \\mid \\mu_t(O_t)]$, treating the posterior representation $\\mu_t(O_t)$ as a feature space over\nsequences of observations $O_t = (o_1 \\ldots o_t)$. Analogously to section 3.1, $\\widetilde{M}(O_t)$ can be acquired by\nTD learning and assuming linear function approximation: $\\widetilde{M}(O_t) \\approx U \\mu_t(O_t)$. 
The matrix $U$ can be\nupdated online, while executing a given policy and continuously inferring latent state representations\nusing the recognition model:\n\n$$\\Delta U \\propto \\delta_t\\, \\mu_t(O_t)^{\\top} \\qquad (22)$$\n$$\\delta_t = \\mu_t(O_t) + \\gamma \\widetilde{M}(O_{t+1}) - \\widetilde{M}(O_t) \\qquad (23)$$\n\nIt can be shown that $\\widetilde{M}(O_t)$, as defined here, is equivalent to $\\mathbb{E}_{s_t \\mid O_t}[M(s_t)]$ if the learned generative\nmodel is optimal (assuming no model mismatch) and the recognition model correctly infers the\ncorresponding posteriors $\\mu_t(O_t)$ (see supplementary material). In general, however, exchanging the\norder of TD learning and inference leads to different SFs. The advantage of learning the distributional\nsuccessor features in the wake phase is that even when the model does not perfectly capture the data\n(e.g. due to lack of flexibility or early on in learning) the learned SFs will reflect the structure in the\nobservations through the posteriors $\\mu_t(O_t)$.\n\n5.3 Value computation in a noisy 2D environment\n\nWe illustrate the importance of being able to consistently handle uncertainty in the SFs by learning\nvalue functions in a noisy environment. We use a simple 2-dimensional box environment with\ncontinuous state space that includes an internal wall. The agent does not have direct access to its\nspatial coordinates, but receives observations corrupted by Gaussian noise. Figure 2 shows the value\nfunctions computed using the successor features learned in three different settings: assuming direct\naccess to latent states, treating observations as though they were noise-free state measurements, and\nusing latent state estimates inferred from observations. 
The value functions computed in the latent\nspace and computed from DDC posterior representations both reflect the structure of the environment,\nwhile the value function relying on SFs over the observed states fails to learn about the barrier.\nTo demonstrate that this is not simply due to using the suboptimal random walk policy, but persists\nthrough learning, we have learned successor features while adjusting the policy to a given reward\nfunction (see figure 3). The policy was learned by generalized policy iteration (Sutton and Barto,\n1998), alternating between taking actions following a greedy policy and updating the successor\nfeatures to estimate the corresponding value function.\nThe value of each state and action was computed from the value function $V(s)$ by a one-step\nlook-ahead, combining the immediate reward with the expected value function having taken a given\naction:\n\n$$Q(s_t, a_t) = r(s_t) + \\gamma\\, \\mathbb{E}_{s_{t+1} \\mid s_t, a_t}[V(s_{t+1})] \\qquad (24)$$\n\nIn our case, as the value function in the latent space is expressed as a linear function of the features\n$\\psi(s)$: $V(s) = w_{\\mathrm{rew}}^{\\top} U \\psi(s)$ (eq. 5-6), the expectation in eq. 24 can be expressed as:\n\n$$\\mathbb{E}_{s_{t+1} \\mid s_t, a_t}[V(s_{t+1})] = w_{\\mathrm{rew}}^{\\top} U \\cdot \\mathbb{E}_{s_{t+1} \\mid s_t, a_t}[\\psi(s_{t+1})] \\qquad (25)$$\n$$\\approx w_{\\mathrm{rew}}^{\\top} U \\cdot P \\cdot (\\psi(s_t) \\otimes \\phi(a_t)) \\qquad (26)$$\n\nwhere $P$ is a linear mapping, $P : \\Psi \\times \\Phi \\to \\Psi$, that contains information about the distribution\n$p(s_{t+1} \\mid s_t, a_t)$. More specifically, $P$ is trained to predict $\\mathbb{E}_{s_{t+1} \\mid s_t, a_t}[\\psi(s_{t+1})]$ as a bilinear function\nof state and action features ($\\psi(s_t)$, $\\phi(a_t)$). Given the state-action value, we can implement a greedy\npolicy by choosing actions that maximize $Q(s, a)$:\n\n$$a^{*} = \\operatorname*{argmax}_{a_t \\in A} Q(s_t, a_t) \\qquad (27)$$\n$$= \\operatorname*{argmax}_{a_t \\in A} \\big[ r(s_t) + \\gamma\\, w_{\\mathrm{rew}}^{\\top} U \\cdot P \\cdot (\\psi(s_t) \\otimes \\phi(a_t)) \\big] \\qquad (28)$$\n\nThe argmax operation in eq. 
28 (possibly over a continuous space of actions) could be biologically\nimplemented by a ring attractor where the neurons receive state-dependent input through feedforward\nweights reflecting the tuning ($\\phi(a)$) of each neuron in the ring.\nJust as in figure 2, we compute the value function in the fully observed case, using inferred states or\nusing only the noisy observations. For the latter two, we replace $\\psi(s_t)$ in eq. 28 with the inferred\nstate representation $\\mu_t(O_t)$ and the observed features $\\psi(o_t)$, respectively. As the agent follows the\ngreedy policy and receives new observations, the corresponding SFs are adapted accordingly. Figure\n3 shows the learned value functions $V^{\\pi}(s)$, $V^{\\pi}(\\mu)$ and $V^{\\pi}(o)$ for a given reward location and the\ncorresponding dynamics $T^{\\pi}$. The agent having access to the true latent state as well as the one using\ndistributional SFs successfully learn policies leading to the rewarded location. As before, the agent\nlearning SFs purely based on observations remains highly sub-optimal.\n\nFigure 3: Value functions computed by SFs under the learned policy. Top row shows reward and\nvalue functions learned in the three different conditions. Bottom row shows histogram of collected\nrewards from 100 episodes with random initial states, and the learned dynamics $T^{\\pi}$ visualized as in\nfig. 1.\n\n6 Discussion and related work\n\nWe have shown that the DDC representation of uncertainty over latent variables can be naturally\nintegrated with representations of uncertainty about future states, and thus offers a natural generalisation\nof SRs to more realistic environments with partial observability. The proposed algorithm jointly\ntackles the problem of learning the latent variable model and learning to perform online inference\nby filtering. 
Distributional SFs are applicable to POMDPs with continuous or discrete variables\nand leverage a flexible posterior approximation, not restricted to a simple parametric form, that is\nrepresented in a population of neurons in a distributed fashion.\nWhile parametrising the latent dynamics with DDCs is attractive as it makes computing the SFs in the\nlatent space analytically tractable and allows for computing distributional SFs by recurrent dynamics\n(sec. 5.2.2), it is so far unclear how sampling from such a model might be implemented by neural\ncircuits. Alternatively, one can consider a standard exponential family parametrisation which remains\ncompatible with sleep and wake phase TD learning of distributional SFs.\nEarlier work on biological reinforcement learning in POMDPs was restricted to the case of binary\nor categorical latent variables where the posterior beliefs can be computed analytically (Rao, 2010).\nFurthermore, the transition model of the POMDP was assumed to be known, rather than learned as in\nthe present work.\nHere, we have defined distributional SFs over states, using single step look-ahead to compute\nstate-action values (eq. 24). Alternatively, SFs could be defined directly over both states and\nactions (Kulkarni et al., 2016; Barreto et al., 2017) whilst retaining the distributional development\npresented here. Barreto et al. (2017, 2019) have shown that successor representations corresponding\nto previously learned tasks can be used as a basis to construct policies for novel tasks, enabling\ngeneralization. Our framework can be extended in a similar way, eliminating the need to adapt the\nSFs as the policy of the agent changes.\nThe neurotransmitter dopamine has long been hypothesised to signal reward prediction errors (RPE)\nand thus to play a key role in temporal difference learning (Schultz et al., 1997). 
More recently, it\nhas been argued that dopamine activity is consistent with RPEs computed based on belief states\nrather than sensory observations directly (Babayan et al., 2018; Lak et al., 2017; Sarno et al., 2017).\nThus dopamine is well suited to carry the information necessary for learning value functions under\nstate uncertainty. In another line of experimental work, dopamine has been found to signal sensory\nprediction errors (PE) even in the absence of an associated change in value (Takahashi et al., 2017),\nsuggesting a more general role of dopamine in learning (Gershman, 2018; Gardner et al., 2018).\nGardner et al. (2018) have proposed that dopamine, by signalling prediction error over features of state,\nmay provide the neural substrate for the error signals necessary to learn successor representations.\nDistributional SFs unify these two sets of observations and their theoretical implications in a single\nframework. They posit that PEs are computed over the posterior belief about latent states (represented\nas DDCs), and that these PEs are defined over a set of non-linear features of the hidden state rather\nthan reward.\nThe proposed learning scheme for distributional SFs allows for flexible interpolation between\nmodel-based and model-free approaches. 
Wake phase learning of SFs is grounded in observations and only\nrelies on the model through the belief state updates, while sleep phase learning uses simulated latent\nstates from the model to update the SFs, akin to the Dyna algorithm (Sutton, 1990).\nThe framework for learning distributional successor features presented here also provides a link\nbetween various intriguing and seemingly disparate experimental observations in the hippocampus.\nThe relationship between hippocampal place cell activity and (non-distributional) SRs has been\nexplored previously (e.g., Stachenfeld et al., 2014, 2017), providing an interpretation for phenomena\nsuch as \u201csplitter\u201d cells, which show spatial tuning that depends on the whole trajectory (i.e. policy)\ntraversed by the animal, not just on its current position (Grieves et al., 2016). However, as discussed\nearlier, relevant states for a given reinforcement learning problem (in this case states over which the\nSR should be learned) cannot be assumed to be directly available to the agent but must be inferred\nfrom observations. The hypothesis that hippocampal place cell activity encodes inferred location,\nwith its concomitant uncertainty, has also been linked to experimental data (Madl et al., 2014). Thus,\nour approach connects these two separate threads in the literature and thereby encompasses both\ngroups of experimental results.\nLastly, the framework helps to link simulation of an internal model to learning. Acquisition of the\ninference model in our framework requires simulating experience (sleep samples) from the agent\u2019s\ncurrent model of the environment, to provide the basis for an update of the recognition model.\nThe sleep samples reflect the agent\u2019s knowledge of the environmental dynamics but they need not\ncorrespond exactly to a previously experienced trajectory. 
This is reminiscent of hippocampal \u201creplay\u201d\nwhich does not just recapitulate previous experience, but often represents novel trajectories not\npreviously experienced by the animal (Gupta et al., 2010; \u00d3lafsd\u00f3ttir et al., 2015; Stella et al., 2019).\nRelatedly, Liu et al. (2019) recently observed that replay events in humans reflect abstract structural\nknowledge of a learned task. Our model suggests a novel functional interpretation of these replayed\ntrajectories; namely, that they may play an important role in learning to infer relevant latent states\nfrom observations. This accords with the observation that experimental interference with replay\nevents impedes learning in contexts where optimal actions depend on history-based inference (Jadhav\net al., 2012).\nDistributional SFs provide an interpretation for a variety of experimental observations and a step towards\nalgorithmic solutions for flexible decision making in the realistic and challenging problem settings\nthat animals face, i.e. under state uncertainty.\n\nReferences\nB. M. Babayan, N. Uchida, and S. J. Gershman. Belief state representation in the dopamine system.\n\nNat Commun, 9(1):1891, 2018.\n\nA. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver. Successor\nfeatures for transfer in reinforcement learning. In Advances in Neural Information Processing\nSystems, pages 4055\u20134065, 2017.\n\nA. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. \u017d\u00eddek, and\nR. Munos. Transfer in Deep Reinforcement Learning Using Successor Features and Generalised\nPolicy Improvement. arXiv:1901.10964 [cs], 2019.\n\nN. D. Daw, Y. Niv, and P. Dayan. Uncertainty-based competition between prefrontal and dorsolateral\n\nstriatal systems for behavioral control. Nat Neurosci, 8(12):1704, 2005.\n\nN. D. Daw, S. J. Gershman, B. Seymour, P. Dayan, and R. J. Dolan. 
Model-Based Influences on\n\nHumans\u2019 Choices and Striatal Prediction Errors. Neuron, 69(6):1204\u20131215, 2011.\n\nP. Dayan. Improving Generalization for Temporal Difference Learning: The Successor Representation.\n\nNeural Comput, 5(4):613\u2013624, 1993.\n\nP. Dayan and N. D. Daw. Decision theory, reinforcement learning, and the brain. Cogn Affect Behav\n\nNeurosci, 8(4):429\u2013453, 2008.\n\nM. P. H. Gardner, G. Schoenbaum, and S. J. Gershman. Rethinking dopamine as generalized\n\nprediction error. Proc. Biol. Sci., 285(1891), 2018.\n\nS. J. Gershman. The Successor Representation: Its Computational Logic and Neural Substrates. J.\n\nNeurosci., 38(33):7193\u20137200, 2018.\n\nJ. Gl\u00e4scher, N. Daw, P. Dayan, and J. P. O\u2019Doherty. States versus Rewards: Dissociable Neural\nPrediction Error Signals Underlying Model-Based and Model-Free Reinforcement Learning.\nNeuron, 66(4):585\u2013595, 2010.\n\nR. M. Grieves, E. R. Wood, and P. A. Dudchenko. Place cells on a maze encode routes rather than\n\ndestinations. eLife, 5:e15986, 2016.\n\nA. S. Gupta, M. A. A. van der Meer, D. S. Touretzky, and A. D. Redish. Hippocampal replay is not a\n\nsimple function of experience. Neuron, 65(5):695\u2013705, 2010.\n\nG. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The \"wake-sleep\" algorithm for unsupervised\n\nneural networks. Science, 268(5214):1158\u20131161, 1995.\n\nS. P. Jadhav, C. Kemere, P. W. German, and L. M. Frank. Awake Hippocampal Sharp-Wave Ripples\n\nSupport Spatial Memory. Science, 336(6087):1454\u20131458, 2012.\n\nT. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman. Deep Successor Reinforcement Learning.\n\narXiv:1606.02396 [cs, stat], 2016.\n\nA. Lak, K. Nomoto, M. Keramati, M. Sakagami, and A. Kepecs. Midbrain Dopamine Neurons Signal\n\nBelief in Choice Accuracy during a Perceptual Decision. Curr. Biol., 27(6):821\u2013832, 2017.\n\nY. Liu, R. J. Dolan, Z. Kurth-Nelson, and T. E. J. Behrens. 
Human Replay Spontaneously Reorganizes\n\nExperience. Cell, 178(3):640\u2013652.e14, 2019.\n\nT. Madl, S. Franklin, K. Chen, D. Montaldi, and R. Trappl. Bayesian integration of information in\n\nhippocampal place cells. PLoS ONE, 9(3):e89762, 2014.\n\nI. Momennejad, E. M. Russek, J. H. Cheong, M. M. Botvinick, N. D. Daw, and S. J. Gershman. The\n\nsuccessor representation in human reinforcement learning. Nat Hum Behav, 1(9):680, 2017.\n\nH. F. \u00d3lafsd\u00f3ttir, C. Barry, A. B. Saleem, D. Hassabis, and H. J. Spiers. Hippocampal place cells\n\nconstruct reward related sequences through unexplored space. eLife, 4:e06063, 2015.\n\nR. P. N. Rao. Decision Making Under Uncertainty: A Neural Model Based on Partially Observable\n\nMarkov Decision Processes. Front. Comput. Neurosci., 4, 2010.\n\nE. M. Russek, I. Momennejad, M. M. Botvinick, S. J. Gershman, and N. D. Daw. Predictive\nrepresentations can link model-based reinforcement learning to model-free mechanisms. PLOS\nComput Biol, 13(9):e1005768, 2017.\n\nM. Sahani and P. Dayan. Doubly Distributional Population Codes: Simultaneous Representation of\n\nUncertainty and Multiplicity. Neural Comput, 15(10):2255\u20132279, 2003.\n\nS. Sarno, V. de Lafuente, R. Romo, and N. Parga. Dopamine reward prediction error signal codes\nthe temporal evaluation of a perceptual decision report. Proc. Natl. Acad. Sci. U.S.A., 114(48):\nE10494\u2013E10503, 2017.\n\nW. Schultz, P. Dayan, and P. R. Montague. A neural substrate of prediction and reward. Science, 275\n\n(5306):1593\u20131599, 1997.\n\nK. L. Stachenfeld, M. Botvinick, and S. J. Gershman. Design Principles of the Hippocampal Cognitive\nMap. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors,\nAdvances in Neural Information Processing Systems 27, pages 2528\u20132536. Curran Associates, Inc.,\n2014.\n\nK. L. Stachenfeld, M. M. Botvinick, and S. J. Gershman. The hippocampus as a predictive map. 
Nat\n\nNeurosci, 20(11):1643\u20131653, 2017.\n\nC. K. Starkweather, B. M. Babayan, N. Uchida, and S. J. Gershman. Dopamine reward prediction\n\nerrors reflect hidden-state inference across time. Nat Neurosci, 20(4):581\u2013589, 2017.\n\nF. Stella, P. Baracskay, J. O\u2019Neill, and J. Csicsvari. Hippocampal Reactivation of Random Trajectories\n\nResembling Brownian Diffusion. Neuron, 2019.\n\nR. S. Sutton. Learning to predict by the methods of temporal differences. Mach Learn, 3(1):9\u201344,\n\n1988.\n\nR. S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating\nDynamic Programming. In B. Porter and R. Mooney, editors, Machine Learning Proceedings\n1990, pages 216\u2013224. Morgan Kaufmann, San Francisco (CA), 1990.\n\nR. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning, volume 135. MIT Press\n\nCambridge, 1998.\n\nY. K. Takahashi, H. M. Batchelor, B. Liu, A. Khanna, M. Morales, and G. Schoenbaum. Dopamine\nNeurons Respond to Errors in the Prediction of Sensory Features of Expected Rewards. Neuron,\n95(6):1395\u20131405.e3, 2017.\n\nE. V\u00e9rtes and M. Sahani. Flexible and accurate inference and learning for deep generative models.\nIn S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,\nAdvances in Neural Information Processing Systems 31, pages 4166\u20134175. Curran Associates, Inc.,\n2018.\n\nM. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational\n\nInference. Found. Trends Mach. Learn., 1(1-2):1\u2013305, 2008.\n\nR. S. Zemel, P. Dayan, and A. Pouget. Probabilistic interpretation of population codes. Neural\n\nComput, 10(2):403\u2013430, 1998.\n", "award": [], "sourceid": 7610, "authors": [{"given_name": "Eszter", "family_name": "V\u00e9rtes", "institution": "Gatsby Unit, UCL"}, {"given_name": "Maneesh", "family_name": "Sahani", "institution": "Gatsby Unit, UCL"}]}