{"title": "Design Principles of the Hippocampal Cognitive Map", "book": "Advances in Neural Information Processing Systems", "page_first": 2528, "page_last": 2536, "abstract": "Hippocampal place fields have been shown to reflect behaviorally relevant aspects of space. For instance, place fields tend to be skewed along commonly traveled directions, they cluster around rewarded locations, and they are constrained by the geometric structure of the environment. We hypothesize a set of design principles for the hippocampal cognitive map that explain how place fields represent space in a way that facilitates navigation and reinforcement learning. In particular, we suggest that place fields encode not just information about the current location, but also predictions about future locations under the current transition distribution. Under this model, a variety of place field phenomena arise naturally from the structure of rewards, barriers, and directional biases as reflected in the transition policy. Furthermore, we demonstrate that this representation of space can support efficient reinforcement learning. We also propose that grid cells compute the eigendecomposition of place fields in part because is useful for segmenting an enclosure along natural boundaries. When applied recursively, this segmentation can be used to discover a hierarchical decomposition of space. Thus, grid cells might be involved in computing subgoals for hierarchical reinforcement learning.", "full_text": "Design Principles of the Hippocampal Cognitive Map\n\nKimberly L. Stachenfeld1, Matthew M. Botvinick1, and Samuel J. Gershman2\n\n1Princeton Neuroscience Institute and Department of Psychology, Princeton University\n2Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology\n\nkls4@princeton.edu, matthewb@princeton.edu, sjgershm@mit.edu\n\nAbstract\n\nHippocampal place \ufb01elds have been shown to re\ufb02ect behaviorally relevant aspects\nof space. 
For instance, place fields tend to be skewed along commonly traveled directions, they cluster around rewarded locations, and they are constrained by the geometric structure of the environment. We hypothesize a set of design principles for the hippocampal cognitive map that explain how place fields represent space in a way that facilitates navigation and reinforcement learning. In particular, we suggest that place fields encode not just information about the current location, but also predictions about future locations under the current transition distribution. Under this model, a variety of place field phenomena arise naturally from the structure of rewards, barriers, and directional biases as reflected in the transition policy. Furthermore, we demonstrate that this representation of space can support efficient reinforcement learning. We also propose that grid cells compute the eigendecomposition of place fields in part because it is useful for segmenting an enclosure along natural boundaries. When applied recursively, this segmentation can be used to discover a hierarchical decomposition of space. Thus, grid cells might be involved in computing subgoals for hierarchical reinforcement learning.

1 Introduction

A cognitive map, as originally conceived by Tolman [46], is a geometric representation of the environment that can support sophisticated navigational behavior. Tolman was led to this hypothesis by the observation that rats can acquire knowledge about the spatial structure of a maze even in the absence of direct reinforcement (latent learning; [46]). Subsequent work has sought to formalize the representational content of the cognitive map [13], the algorithms that operate on it [33, 35], and its neural implementation [34, 27]. 
Much of this work was galvanized by the discovery of place cells in the hippocampus [34], which selectively respond when an animal is in a particular location, thus supporting the notion that the brain contains an explicit map of space. The later discovery of grid cells in the entorhinal cortex [16], which respond periodically over the entire environment, indicated a possible neural substrate for encoding metric information about space.

Metric information is very useful if one considers the problem of spatial navigation to be computing the shortest path from a starting point to a goal. A mechanism that accumulates a record of displacements can easily compute the shortest path back to the origin, a technique known as path integration. Considerable empirical evidence supports the idea that animals use this technique for navigation [13]. Many authors have proposed theories of how grid cells and place cells can be used to carry out the necessary computations [27].

However, the navigational problems faced by humans and animals are inextricably tied to the more general problem of reward maximization, which cannot be reduced to the problem of finding the shortest path between two points. This raises the question: does the brain employ the same machinery for spatial navigation and reinforcement learning (RL)? A number of authors have suggested how RL mechanisms can support spatial learning, where spatial representations (e.g., place cells or grid cells) serve as the input to the learning system [11, 15]. In contrast to the view that spatial representation is extrinsic to the RL system, we pursue the idea that the brain's spatial representations are designed to support RL. 
In particular, we show how spatial representations resembling place cells and grid cells emerge as the solution to the problem of optimizing spatial representation in the service of RL.

We first review the formal definition of the RL problem, along with several algorithmic solutions. Special attention is paid to the successor representation (SR) [6], which enables a computationally convenient decomposition of value functions. We then show how the successor representation naturally comes to represent place cells when applied to spatial domains. The eigendecomposition of the successor representation reveals properties of an environment's spectral graph structure, which is particularly useful for discovering hierarchical decompositions of space. We demonstrate that the eigenvectors resemble grid cells, and suggest that one function of the entorhinal cortex may be to encode a compressed representation of space that aids hierarchical RL [3].

2 The reinforcement learning problem

Here we consider the problem of RL in a Markov decision process, which consists of the following elements: a set of states S, a set of actions A, a transition distribution P(s′|s, a) specifying the probability of transitioning to state s′ ∈ S from state s ∈ S after taking action a ∈ A, a reward function R(s) specifying the expected reward in state s, and a discount factor γ ∈ [0, 1]. An agent chooses actions according to a policy π(a|s) and collects rewards as it moves through the state space. The standard RL problem is to choose a policy that maximizes the value (expected discounted future return), V(s) = E_π[Σ_{t=0}^∞ γ^t R(s_t) | s_0 = s]. Our focus here is on policy evaluation (computing V). In our simulations we feed the agent the optimal policy; in the Supplementary Materials we discuss algorithms for policy improvement. To simplify notation, we assume implicit dependence on π and define the state transition matrix T, where T(s, s′) = Σ_a π(a|s)P(s′|s, a).

Most work on RL has focused on two classes of algorithms for policy evaluation: "model-free" algorithms that estimate V directly from sample paths, and "model-based" algorithms that estimate T and R from sample paths and then compute V by some form of dynamic programming or tree search [44, 5]. However, there exists a third class that has received less attention. As shown by Dayan [6], the value function can be decomposed into the inner product of the reward function with the SR, denoted by M:

V(s) = Σ_{s′} M(s, s′)R(s′),   M = (I − γT)^{−1},   (1)

where I denotes the identity matrix. The SR encodes the expected discounted future occupancy of state s′ along a trajectory initiated in state s:

M(s, s′) = E[Σ_{t=0}^∞ γ^t I{s_t = s′} | s_0 = s],   (2)

where I{·} = 1 if its argument is true, and 0 otherwise. The SR obeys a recursion analogous to the Bellman equation for value functions:

M(s, j) = I{s = j} + γ Σ_{s′} T(s, s′)M(s′, j).   (3)

This recursion can be harnessed to derive a temporal difference learning algorithm for incrementally updating an estimate M̂ of the SR [6, 14]. After observing a transition s → s′, the estimate is updated according to:

M̂(s, j) ← M̂(s, j) + η[I{s = j} + γM̂(s′, j) − M̂(s, j)],   (4)

where η is a learning rate (unless specified otherwise, η = 0.1 in our simulations). The SR combines some of the advantages of model-free and model-based algorithms: like model-free algorithms, policy evaluation is computationally efficient, but at the same time the SR provides some of the same flexibility as model-based algorithms. As we illustrate later, an agent using the SR will be sensitive to distal changes in reward, whereas a model-free agent will be insensitive to these changes.

Figure 1: SR place fields. Top two rows show place fields without reward, bottom two show retrospective place fields with reward (marked by +). Maximum firing rate (a.u.) indicated for each plot. (a, b) Empty room. (c, d) Single barrier. (e, f) Multiple rooms.

Figure 2: Direction selectivity along a track. Direction selectivity arises in SR place fields when the probability p→ of transitioning in the preferred left-to-right direction along a linear track is greater than the probability p← of transitioning in the non-preferred direction. The legend shows the ratio of p← to p→ for each simulation.

3 The successor representation and place cells

In this section, we explore the neural implications of using the SR for policy evaluation: if the brain encoded the SR, what would the receptive fields of the encoding population look like, and what would the population look like at any point in time? This question is most easily addressed in spatial domains, where states index spatial locations (see Supplementary Materials for simulation details). For an open field with uniformly distributed rewards we assume a random walk policy, and the resulting SR for a particular location is an approximately symmetric, gradually decaying halo around that location (Fig. 1a)—the canonical description of a hippocampal place cell. In order for the population to encode the expected visitations to each state in the domain from the current starting state (i.e. a row of M), each receptive field corresponds to a column of the SR matrix. 
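To make Eqs. (1)-(4) concrete, here is a minimal sketch (not the paper's simulation code; the ring environment, its size, and the parameters are illustrative assumptions) showing that the temporal difference estimate of Eq. (4) approaches the closed form of Eq. (1), and that policy evaluation reduces to a matrix product:

```python
import numpy as np

# A toy sketch of the SR quantities above: a 5-state ring with a
# random-walk policy. Environment and parameters are illustrative.
n, gamma, eta = 5, 0.9, 0.1
T = np.zeros((n, n))
for s in range(n):                       # random walk: step left or right
    T[s, (s - 1) % n] = T[s, (s + 1) % n] = 0.5

# Eq. (1): closed-form SR, M = (I - gamma*T)^{-1}
M = np.linalg.inv(np.eye(n) - gamma * T)

# Eq. (4): TD learning of the SR from sampled transitions
M_hat, rng, s = np.zeros((n, n)), np.random.default_rng(0), 0
for _ in range(50000):
    s_next = rng.choice(n, p=T[s])
    M_hat[s] += eta * (np.eye(n)[s] + gamma * M_hat[s_next] - M_hat[s])
    s = s_next

# Policy evaluation is a single matrix product: V = M R
R = np.zeros(n); R[2] = 1.0
V = M @ R
print(V.argmax())                        # value peaks at the rewarded state
```

Because value computation is a single matrix product, a change in R propagates immediately to V without relearning M, which is the sensitivity to distal reward changes noted above.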
This allows the current state's value to be computed by taking the dot product of its population vector with the reward vector. The receptive field (i.e. column of M) will encode the discounted expected number of times that state was visited for each starting state, and will therefore skew in the direction of the states that likely preceded the current state.

More interesting predictions can be made when we examine the effects of obstacles and direction preference that shape the transition structure. For instance, when barriers are inserted into the environment, the probability of transitioning across these obstacles will go to zero. SR place fields are therefore constrained by environmental geometry, and the receptive field will be discontinuous across barriers (Fig. 1c,e). Consistent with this idea, experiments have shown that place fields become distorted around barriers [32, 40]. When an animal has been trained to travel in a preferred direction along a linear track, we expect the response of place fields to become skewed opposite the direction of travel (Fig. 2), a result that has been observed experimentally [28, 29].

Another way to alter the transition policy is by introducing a goal, which induces a tendency to move in the direction that maximizes reward. Under these conditions, we expect firing fields centered near rewarded locations to expand to include the surrounding locations and to increase their firing rate, as has been observed experimentally [10, 21]. Meanwhile, we expect the majority of place fields that encode non-rewarded states to skew slightly away from the reward. Under certain settings for what firing rate constitutes baseline (see Supplementary Materials), the spread of the rewarded locations' fields compensates for the skew of surrounding fields away from the reward, and we observe "clustering" around rewarded locations (Fig. 3), as has been observed experimentally in the annular water maze task [18]. This parameterization sensitivity may explain why goal-related firing is not observed in all tasks [25, 24, 41].

Figure 3: Reward clustering in annular maze. (a) Histogram of number of cells firing above baseline at each displacement around an annular track. (b) Heat map of number of firing cells at each location on unwrapped annular maze. Reward is centered on track. Baseline firing rate set to 10% maximum.

As another illustration of the model's response to barriers, we simulated place fields in a version of the Tolman detour task [46], as described in [1]. Rats are trained to move from the start to the rewarded location. At some point, an "early" or a "late" transparent barrier is placed in the maze so that the rat must take a detour. For the early barrier, a short detour is available, and for the later barrier, the only detour is a longer one. Place fields near the detour are more strongly affected than places far away from the detour (Fig. 4a,b,c), consistent with experimental findings [1].

Figure 4: Tolman detour task. The starting location is at the bottom of the maze where the three paths meet, and the reward is at the top. Barriers are shown as black horizontal lines. Three conditions are shown: No detour, early detour, and late detour. (a, b, c) SR place fields centered near and far from detours. Maximum firing rate (a.u.) indicated by each plot. (d) Value function.

Fig. 
4d shows the value function in each of these detour conditions.

4 Behavioral predictions: distance estimation and latent learning

In this section, we examine some of the behavioral consequences of using the SR for RL. We first show that the SR anticipates biases in distance estimation induced by semi-permeable boundaries. We then explore the ability of the SR to support latent learning in contextual fear conditioning.

Figure 5: Distance estimates. (a) Increase in the perceived distance between two points on opposite sides of a semipermeable boundary (marked with + and ◦ in 5b) as a function of barrier permeability. (b) Perceived distance between destination (marked with +) and all other locations in the space (barrier permeability = 0.05).

Figure 6: Context preexposure facilitation effect. (a) Simulated conditioned response (CR) to the context following one-trial contextual fear conditioning, shown as a function of preexposure duration. The CR was approximated as the negative value summed over the environment. The "Lesion" corresponds to agents with hippocampal damage, simulated by setting the SR learning rate to 0.01. The "Control" group has a learning rate of 0.1. (b) Value for a single location after preexposure in a control agent. (c) Same as (b) in a lesioned agent.

Stevens and Coupe [43] reported that people overestimated the distance between two locations when they were separated by a boundary (e.g., a state or country line). This bias was hypothesized to arise from a hierarchical organization of space (see also [17]). We show (Fig. 5) how distance estimates (using the Euclidean distance between SR state representations, √((M(s′) − M(s))²), as a proxy for the perceived distance between s and s′) between points in different regions of the environment are altered when an enclosure is divided by a soft (semi-permeable) boundary. We see that as the permeability of the barrier decreases (making the boundary harder), the percent increase in perceived distance between locations increases without bound. This gives rise to a discontinuity in perceived travel time at the soft boundary. Interestingly, the hippocampus is directly involved in distance estimation [31], suggesting the hippocampal cognitive map as a neural substrate for distance biases (although a direct link has yet to be established).

The context preexposure facilitation effect refers to the finding that placing an animal inside a conditioning chamber prior to shocking it facilitates the acquisition of contextual fear [9]. In essence, this is a form of latent learning [46]. The facilitation effect is thought to arise from the development of a conjunctive representation of the context in the hippocampus, though areas outside the hippocampus may also develop a conjunctive representation in the absence of the hippocampus, albeit less efficiently [48]. The SR provides a somewhat different interpretation: over the course of preexposure, the hippocampus develops a predictive representation of the context, such that subsequent learning is rapidly propagated across space. Fig. 6 shows a simulation of this process and how it accounts for the facilitation effect. We simulated hippocampal lesions by reducing the SR learning rate from 0.1 to 0.01, resulting in a more punctate SR following preexposure and a reduced facilitation effect.

5 Eigendecomposition of the successor representation: hierarchical decomposition and grid cells

Reinforcement learning and navigation can often be made more efficient by decomposing the environment hierarchically. 
For example, the options framework [45] utilizes a set of subgoals to divide and conquer a complex learning environment. Recent experimental work suggests that the brain may exploit a similar strategy [3, 36, 8]. A key problem, however, is discovering useful subgoals; while progress has been made on this problem in machine learning, we still know very little about how the brain solves it (but see [37]). In this section, we show how the eigendecomposition of the SR can be used to discover subgoals. The resulting eigenvectors strikingly resemble grid cells observed in entorhinal cortex.

Figure 7: Eigendecomposition of the SR. Each panel shows the same 20 eigenvectors randomly sampled from the top 100 (excluding the constant first eigenvector) for the environmental geometries shown in Fig. 1 (no reward). (a) Empty room. (b) Single barrier. (c) Multiple rooms.

Figure 8: Eigendecomposition of the SR in a hairpin maze. Since the walls of the maze effectively elongate a dimension of travel (the track of the maze), the low frequency eigenvectors resemble one-dimensional sinusoids that have been folded to match the space. Meanwhile, the high frequency eigenvectors exhibit the compartmentalization shown by [7].

A number of authors have used graph partitioning techniques to discover subgoals [30, 39]. These approaches cluster states according to their community membership (a community is defined as a highly interconnected set of nodes with relatively few outgoing edges). Transition points between communities (bottleneck states) are then used as subgoals. One important graph partitioning technique, used by [39] to find subgoals, is the normalized cuts algorithm [38], which recursively thresholds the eigenvector with the second-smallest eigenvalue (the Fiedler vector) of the normalized graph Laplacian to obtain a graph partition. Given an undirected graph with symmetric weight matrix W, the graph Laplacian is given by L = D − W. The normalized graph Laplacian is given by L = I − D^{−1/2}WD^{−1/2}, where D is a diagonal degree matrix with D(s, s) = Σ_{s′} W(s, s′). When states are projected onto the second eigenvector, they are pulled along orthogonal dimensions according to their community membership. Locations in distinct regions but close in Euclidean distance – for instance, nearby points on opposite sides of a boundary – will be represented as distant in the eigenspace.

The normalized graph Laplacian is closely related to the SR [26]. Under a random walk policy, the transition matrix is given by T = D^{−1}W. If φ is an eigenvector of the random walk's graph Laplacian I − T, then D^{1/2}φ is an eigenvector of the normalized graph Laplacian. The corresponding eigenvector for the discounted Laplacian, I − γT, is γφ. Since the matrix inverse preserves the eigenvectors, the normalized graph Laplacian has the same eigenvectors as the SR, M = (I − γT)^{−1}, scaled by γD^{−1/2}. These spectral eigenvectors can be approximated by slow feature analysis [42]. Applying hierarchical slow feature analysis to streams of simulated visual inputs produces feature representations that resemble hippocampal receptive fields [12].

A number of representative SR eigenvectors are shown in Fig. 7, for three different room topologies. The higher frequency eigenvectors display the latticing characteristic of grid cells [16]. The eigendecomposition is often discontinuous at barriers, and in many cases different rooms are represented by independent sinusoids. Fig. 
8 shows the eigendecomposition for a hairpin maze. The eigenvectors resemble folded up one-dimensional sinusoids, and high frequency eigenvectors appear as repeating phase-locked "submaps" with firing selective to a subset of hallways, much like the grid cells observed by Derdikman and Moser [7].

In the multiple rooms environment, visual inspection reveals that the SR eigenvector with the second smallest eigenvalue (the Fiedler vector) divides the enclosure along the vertical barrier: the left half is almost entirely blue and the right half almost entirely red, with a smooth but steep transition at the doorway (Fig. 9a). As discussed above, this second eigenvector can therefore be used to segment the enclosure along the vertical boundary. Applying this segmentation recursively, as in the normalized cuts algorithm, produces a hierarchical decomposition of the environment (Figure 9b,c). By identifying useful subgoals from the environmental topology, this decomposition can be exploited by hierarchical learning algorithms [3, 37].

Figure 9: Segmentation using normalized cuts. (a) The results of segmentation by thresholding the second eigenvector of the multiple rooms environment in Fig. 1. Dotted lines indicate the segment boundaries. (b, c) Eigenvector segmentation applied recursively to fully parse the enclosure into the four rooms.

One might reasonably question why the brain should represent high frequency eigenvectors (like grid cells) if only the low frequency eigenvectors are useful for hierarchical decomposition. Spectral features also serve as generally useful representations [26, 22], and high frequency components are important for representing detail in the value function. 
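The Fiedler-vector segmentation step described above can be sketched in a few lines. This is a toy example under assumed structure (a two-room graph of our own construction, not the paper's room environments): two rings of states joined by a single "doorway" edge, partitioned by thresholding the Fiedler vector of the normalized graph Laplacian.

```python
import numpy as np

# Toy two-room graph (an assumption for illustration): two 8-node rings
# joined by one "doorway" edge between nodes 7 and 8.
n = 16
W = np.zeros((n, n))
for room in (range(0, 8), range(8, 16)):      # each room is an 8-node ring
    ids = list(room)
    for i, j in zip(ids, ids[1:] + ids[:1]):
        W[i, j] = W[j, i] = 1.0
W[7, 8] = W[8, 7] = 1.0                       # the doorway between rooms

# Normalized graph Laplacian: I - D^{-1/2} W D^{-1/2}
deg = W.sum(axis=1)
L_norm = np.eye(n) - W / np.sqrt(np.outer(deg, deg))

evals, evecs = np.linalg.eigh(L_norm)         # eigenvalues in ascending order
fiedler = evecs[:, 1]       # eigenvector with the second-smallest eigenvalue
rooms = fiedler > 0         # threshold to obtain the graph partition
print(rooms[:8], rooms[8:]) # the two rooms get opposite labels
```

Recursing on each discovered segment, as in the normalized cuts algorithm, yields the hierarchical decomposition; the doorway nodes (7 and 8) are the bottleneck states that would serve as subgoals.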
The progressive increase in grid cell spacing along the dorsal-ventral axis of the entorhinal cortex may function as a multi-scale representation that supports both fine and coarse detail [2].

6 Discussion

We have shown how many empirically observed properties of spatial representation in the brain, such as changes in place fields induced by manipulations of environmental geometry and reward, can be explained by a predictive representation of the environment. This predictive representation is intimately tied to the problem of RL: in a certain sense, it is the optimal representation of space for the purpose of computing value functions, since it reduces value computation to a simple matrix multiplication [6]. Moreover, this optimality principle is closely connected to ideas from manifold learning and spectral graph theory [26]. Our work thus sheds new computational light on Tolman's cognitive map [46].

Our work is connected to several lines of previous work. Most relevant is Gustafson and Daw [15], who showed how topologically-sensitive spatial representations recapitulate many aspects of place cells and grid cells that are difficult to reconcile with a purely Euclidean representation of space. They also showed how encoding topological structure greatly aids reinforcement learning in complex spatial environments. Earlier work by Foster and colleagues [11] also used place cells as features for RL, although the spatial representation did not explicitly encode topological structure. While these theoretical precedents highlight the importance of spatial representation, they leave open the deeper question of why particular representations are better than others. We showed that the SR naturally encodes topological structure in a format that enables efficient RL.

Spectral graph theory provides insight into the topological structure encoded by the SR. 
In particular, we showed that eigenvectors of the SR can be used to discover a hierarchical decomposition of the environment for use in hierarchical RL. These eigenvectors may also be useful as a representational basis for RL, encoding multi-scale spatial structure in the value function. Spectral analysis has frequently been invoked as a computational motivation for entorhinal grid cells (e.g., [23]). The fact that any function can be reconstructed by sums of sinusoids suggested that the entorhinal cortex implements a kind of Fourier transform of space, and that place cells are the result of reconstructing spatial signals from their spectral decomposition. Two problems face this interpretation. First, recent evidence suggests that the emergence of place cells does not depend on grid cell input [4, 47]. Second, and more importantly for our purposes, Fourier analysis is not the right mathematical tool when dealing with spatial representation in a topologically structured environment, since we do not expect functions to be smooth over boundaries in the environment. This is precisely the purpose of spectral graph theory: the eigenvectors of the graph Laplacian encode the smoothest approximation of a function that respects the graph topology [26].

Recent work has elucidated connections between models of episodic memory and the SR. Specifically, in [14] it was shown that the SR is closely related to the Temporal Context Model (TCM) of episodic memory [20]. The core idea of TCM is that items are bound to their temporal context (a running average of recently experienced items), and the currently active temporal context is used to cue retrieval of other items, which in turn cause their temporal context to be retrieved. The SR can be seen as encoding a set of item-context associations. 
The connection to episodic memory is especially interesting given the crucial mnemonic role played by the hippocampus and entorhinal cortex in episodic memory. Howard and colleagues [19] have laid out a detailed mapping between TCM and the medial temporal lobe (including entorhinal and hippocampal regions).

An important question for future work concerns how biologically plausible mechanisms can implement the computations posited in our paper. We described a simple error-driven updating rule for learning the SR, and in the Supplementary Materials we derive a stochastic gradient learning rule that also uses a simple error-driven update. Considerable attention has been devoted to the implementation of error-driven learning rules in the brain, so we expect that these learning rules can be implemented in a biologically plausible manner.

References

[1] A. Alvernhe, E. Save, and B. Poucet. Local remapping of place cell firing in the Tolman detour task. European Journal of Neuroscience, 33:1696–1705, 2011.

[2] H. T. Blair, A. C. Welday, and K. Zhang. Scale-invariant memory representations emerge from moiré interference between grid fields that produce theta oscillations: a computational model. The Journal of Neuroscience, 27:3211–3229, 2007.

[3] M. M. Botvinick, Y. Niv, and A. C. Barto. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113:262–280, 2009.

[4] M. P. Brandon, J. Koenig, J. K. Leutgeb, and S. Leutgeb. New and distinct hippocampal place codes are generated in a new environment during septal inactivation. Neuron, 82:789–796, 2014.

[5] N. D. Daw, Y. Niv, and P. Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8:1704–1711, 2005.

[6] P. Dayan. Improving generalization for temporal difference learning: The successor representation. 
Neural Computation, 5:613–624, 1993.

[7] D. Derdikman, J. R. Whitlock, A. Tsao, M. Fyhn, T. Hafting, M.-B. Moser, and E. I. Moser. Fragmentation of grid cell maps in a multicompartment environment. Nature Neuroscience, 12:1325–1332, 2009.

[8] C. Diuk, K. Tsai, J. Wallis, M. Botvinick, and Y. Niv. Hierarchical learning induces two simultaneous, but separable, prediction errors in human basal ganglia. The Journal of Neuroscience, 33:5797–5805, 2013.

[9] M. S. Fanselow. From contextual fear to a dynamic view of memory systems. Trends in Cognitive Sciences, 14:7–15, 2010.

[10] A. Fenton, L. Zinyuk, and J. Bures. Place cell discharge along search and goal-directed trajectories. European Journal of Neuroscience, 12:3450, 2001.

[11] D. Foster, R. Morris, and P. Dayan. A model of hippocampally dependent navigation, using the temporal difference learning rule. Hippocampus, 10:1–16, 2000.

[12] M. Franzius, H. Sprekeler, and L. Wiskott. Slowness and sparseness lead to place, head-direction, and spatial-view cells. PLoS Computational Biology, 3:3287–3302, 2007.

[13] C. R. Gallistel. The Organization of Learning. The MIT Press, 1990.

[14] S. J. Gershman, C. D. Moore, M. T. Todd, K. A. Norman, and P. B. Sederberg. The successor representation and temporal context. Neural Computation, 24:1553–1568, 2012.

[15] N. J. Gustafson and N. D. Daw. Grid cells, place cells, and geodesic generalization for spatial reinforcement learning. PLoS Computational Biology, 7:e1002235, 2011.

[16] T. Hafting, M. Fyhn, S. Molden, M.-B. Moser, and E. I. Moser. Microstructure of a spatial map in the entorhinal cortex. Nature, 436:801–806, 2005.

[17] S. C. Hirtle and J. Jonides. Evidence of hierarchies in cognitive maps. Memory & Cognition, 13:208–217, 1985.

[18] S. A. Hollup, S. Molden, J. G. Donnett, M. B. Moser, and E. I. Moser. 
Accumulation of hippocampal place fields at the goal location in an annular watermaze task. Journal of Neuroscience, 21:1635–1644, 2001.

[19] M. W. Howard, M. S. Fotedar, A. V. Datey, and M. E. Hasselmo. The temporal context model in spatial navigation and relational learning: toward a common explanation of medial temporal lobe function across domains. Psychological Review, 112:75–116, 2005.

[20] M. W. Howard and M. J. Kahana. A distributed representation of temporal context. Journal of Mathematical Psychology, 46:269–299, 2002.

[21] T. Kobayashi, A. Tran, H. Nishijo, T. Ono, and G. Matsumoto. Contribution of hippocampal place cell activity to learning and formation of goal-directed navigation in rats. Neuroscience, 117:1025–1035, 2003.

[22] G. Konidaris, S. Osentoski, and P. S. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In AAAI, 2011.

[23] J. Krupic, N. Burgess, and J. O'Keefe. Neural representations of location composed of spatially periodic bands. Science, 337:853–857, 2012.

[24] P. Lenck-Santini, R. Muller, E. Save, and B. Poucet. Relationships between place cell firing fields and navigational decisions by rats. The Journal of Neuroscience, 22:9035–9047, 2002.

[25] P. Lenck-Santini, E. Save, and B. Poucet. Place-cell firing does not depend on the direction of turn in a Y-maze alternation task. European Journal of Neuroscience, 13:1055–1058, 2001.

[26] S. Mahadevan. Learning representation and control in Markov decision processes: New frontiers. Foundations and Trends in Machine Learning, 1:403–565, 2009.

[27] B. L. McNaughton, F. P. Battaglia, O. Jensen, E. I. Moser, and M.-B. Moser. Path integration and the neural basis of the 'cognitive map'. Nature Reviews Neuroscience, 7:663–678, 2006.

[28] M. R. Mehta, C. A. Barnes, and B. L. McNaughton.
Experience-dependent, asymmetric expansion of hippocampal place fields. Proceedings of the National Academy of Sciences, 94:8918–8921, 1997.

[29] M. R. Mehta, M. C. Quirk, and M. A. Wilson. Experience-dependent asymmetric shape of hippocampal receptive fields. Neuron, 25:707–715, 2000.

[30] I. Menache, S. Mannor, and N. Shimkin. Q-cut—dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pages 295–306. Springer, 2002.

[31] L. K. Morgan, S. P. MacEvoy, G. K. Aguirre, and R. A. Epstein. Distances between real-world locations are represented in the human hippocampus. The Journal of Neuroscience, 31:1238–1245, 2011.

[32] R. U. Muller and J. L. Kubie. The effects of changes in the environment on the spatial firing of hippocampal complex-spike cells. The Journal of Neuroscience, 7:1951–1968, 1987.

[33] R. U. Muller, M. Stead, and J. Pach. The hippocampus as a cognitive graph. The Journal of General Physiology, 107:663–694, 1996.

[34] J. O'Keefe and L. Nadel. The Hippocampus as a Cognitive Map. Clarendon Press, Oxford, 1978.

[35] A. K. Reid and J. R. Staddon. A dynamic route finder for the cognitive map. Psychological Review, 105:585–601, 1998.

[36] J. J. Ribas-Fernandes, A. Solway, C. Diuk, J. T. McGuire, A. G. Barto, Y. Niv, and M. M. Botvinick. A neural signature of hierarchical reinforcement learning. Neuron, 71:370–379, 2011.

[37] A. C. Schapiro, T. T. Rogers, N. I. Cordova, N. B. Turk-Browne, and M. M. Botvinick. Neural representations of events arise from temporal community structure. Nature Neuroscience, 16:486–492, 2013.

[38] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888–905, 2000.

[39] Ö. Şimşek, A. P. Wolfe, and A. G. Barto. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the 22nd International Conference on Machine Learning, pages 816–823. ACM, 2005.

[40] W. E. Skaggs and B. L. McNaughton. Spatial firing properties of hippocampal CA1 populations in an environment containing two visually identical regions. The Journal of Neuroscience, 18:8455–8466, 1998.

[41] A. Speakman and J. O'Keefe. Hippocampal complex spike cells do not change their place fields if the goal is moved within a cue controlled environment. European Journal of Neuroscience, 2:544–545, 1990.

[42] H. Sprekeler. On the relation of slow feature analysis and Laplacian eigenmaps. Neural Computation, 23:3287–3302, 2011.

[43] A. Stevens and P. Coupe. Distortions in judged spatial relations. Cognitive Psychology, 10:422–437, 1978.

[44] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[45] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

[46] E. C. Tolman. Cognitive maps in rats and men. Psychological Review, 55:189–208, 1948.

[47] T. J. Wills, F. Cacucci, N. Burgess, and J. O'Keefe. Development of the hippocampal cognitive map in preweanling rats. Science, 328:1573–1576, 2010.

[48] B. J. Wiltgen, M. J. Sanders, S. G. Anagnostaras, J. R. Sage, and M. S. Fanselow. Context fear learning in the absence of the hippocampus.
The Journal of Neuroscience, 26:5484–5491, 2006.