{"title": "Efficient state-space modularization for planning: theory, behavioral and neural signatures", "book": "Advances in Neural Information Processing Systems", "page_first": 4511, "page_last": 4519, "abstract": "Even in state-spaces of modest size, planning is plagued by the \u201ccurse of dimensionality\u201d. This problem is particularly acute in human and animal cognition given the limited capacity of working memory, and the time pressures under which planning often occurs in the natural environment. Hierarchically organized modular representations have long been suggested to underlie the capacity of biological systems to efficiently and flexibly plan in complex environments. However, the principles underlying efficient modularization remain obscure, making it difficult to identify its behavioral and neural signatures. Here, we develop a normative theory of efficient state-space representations which partitions an environment into distinct modules by minimizing the average (information theoretic) description length of planning within the environment, thereby optimally trading off the complexity of planning across and within modules. We show that such optimal representations provide a unifying account for a diverse range of hitherto unrelated phenomena at multiple levels of behavior and neural representation.", "full_text": "Ef\ufb01cient state-space modularization for planning:\n\ntheory, behavioral and neural signatures\n\nDaniel McNamee, Daniel Wolpert, M\u00e1t\u00e9 Lengyel\n\nComputational and Biological Learning Lab\n\nDepartment of Engineering\nUniversity of Cambridge\n\nCambridge CB2 1PZ, United Kingdom\n\n{d.mcnamee|wolpert|m.lengyel}@eng.cam.ac.uk\n\nAbstract\n\nEven in state-spaces of modest size, planning is plagued by the \u201ccurse of dimen-\nsionality\u201d. 
This problem is particularly acute in human and animal cognition given the limited capacity of working memory, and the time pressures under which planning often occurs in the natural environment. Hierarchically organized modular representations have long been suggested to underlie the capacity of biological systems 1,2 to efficiently and flexibly plan in complex environments. However, the principles underlying efficient modularization remain obscure, making it difficult to identify its behavioral and neural signatures. Here, we develop a normative theory of efficient state-space representations which partitions an environment into distinct modules by minimizing the average (information theoretic) description length of planning within the environment, thereby optimally trading off the complexity of planning across and within modules. We show that such optimal representations provide a unifying account for a diverse range of hitherto unrelated phenomena at multiple levels of behavior and neural representation.

1 Introduction

In a large and complex environment, such as a city, we often need to be able to flexibly plan so that we can reach a wide variety of goal locations from different start locations. How might this problem be solved efficiently? Model-free decision making strategies 3 would either require relearning a policy, determining which actions (e.g. turn right or left) should be chosen in which state (e.g. locations in the city), each time a new start or goal location is given – a very inefficient use of experience resulting in prohibitively slow learning (but see Ref. 4). 
Alternatively, the state-space representation used for determining the policy can be augmented with extra dimensions representing the current goal, such that effectively multiple policies can be maintained 5, or a large “look-up table” of action sequences connecting any pair of start and goal locations can be represented – again leading to inefficient use of experience and potentially excessive representational capacity requirements.

In contrast, model-based decision-making strategies rely on the ability to simulate future trajectories in the state space and use this in order to flexibly plan in a goal-dependent manner. While such strategies are data- and (long-term) memory-efficient, they are computationally expensive, especially in state-spaces for which the corresponding decision tree has a large branching factor and depth 6. Endowing state-space representations with a hierarchical structure is an attractive approach to reducing the computational cost of model-based planning 7–11 and has long been suggested to be a cornerstone of human cognition 1. Indeed, recent experiments in human decision-making have gleaned evidence for the use and flexible combination of “decision fragments” 12, while neuroimaging work has identified hierarchical action-value reinforcement learning in humans 13 and indicated that dorsolateral prefrontal cortex is involved in the passive clustering of sequentially presented stimuli when transition probabilities obey a “community” structure 14.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Despite such a strong theoretical rationale and empirical evidence for the existence of hierarchical state-space representations, the computational principles underpinning their formation and utilization remain obscure. 
In particular, previous approaches proposed algorithms in which the optimal state-space decomposition was computed based on the optimal solution in the original (non-hierarchical) representation 15,16. Thus, the resulting state-space partition was designed for a specific (optimal) environment solution rather than the dynamics of the planning algorithm itself, and also required a priori knowledge of the optimal solution to the planning problem (which may be difficult to obtain in general and renders the resulting hierarchy obsolete). Here, we compute a hierarchical modularization optimized for planning directly from the transition structure of the environment, without assuming any a priori knowledge of optimal behavior. Our approach is based on minimizing the average information theoretic description length of planning trajectories in an environment, thus explicitly optimizing representations for minimal working memory requirements. The resulting representations are hierarchically modular, such that planning can first operate at a global level across modules, acquiring a high-level “rough picture” of the trajectory to the goal, and subsequently locally within each module to “fill in the details”.

The structure of the paper is as follows. We first describe the mathematical framework for optimizing modular state-space representations (Section 2), and also develop an efficient coding-based approach to neural representations of modularized state spaces (Section 2.6). We then test some of the key predictions of the theory in human behavioral and neural data (Section 3), and also describe how this framework can explain several temporal and representational characteristics of “task-bracketing” and motor chunking in rodent electrophysiology (Section 4). 
We end by discussing future extensions and applications of the theory (Section 5).

2 Theory

2.1 Basic definitions

In order to focus on situations which require flexible policy development based on dynamic goal requirements, we primarily consider discrete “multiple-goal” Markov decision processes (MDPs). Such an MDP, M := {S, A, T, G}, is composed of a set of states S, a set of actions A (a subset A_s of which is associated with each state s ∈ S), and a transition function T which determines the probability of transitioning to state s_j upon executing action a in state s_i, p(s_j|s_i, a) := T(s_i, a, s_j). A task (s, g) is defined by a start state s ∈ S and a goal state g ∈ G, and the agent's objective is to identify a trajectory of via states v which gets the agent from s to g. We define a modularization (footnote 1) M of the state-space S to be a set of Boolean matrices M := {M_i}_{i=1...m} indicating the module membership of all states s ∈ S. That is, for all s ∈ S, there exists i ∈ 1, ..., m such that M_i(s) = 1 and M_j(s) = 0 for all j ≠ i. We assume this to form a disjoint cover of the state-space (overlapping modular architectures will be explored in future work). We will abuse notation by using the expression s ∈ M to indicate that a state s is a member of a module M. As our planning algorithm P, we consider random search as a worst-case scenario although, in principle, our approach applies to any algorithm such as dynamic programming or Q-learning 3, and we expect the optimal modularization to depend on the specific algorithm utilized.

We describe and analyze planning as a Markov process. 
For planning, the underlying state-space is the same as that of the MDP, and the transition matrix T is a marginalization over a planning policy π_plan (which, here, we assume is the random policy π_rand(a|s_i) := 1/|A_{s_i}|):

T_ij = Σ_a π_plan(a|s_i) T(s_i, a, s_j)    (1)

Given a modularization M, planning at the global level is a Markov process M_G corresponding to a “low-resolution” representation of planning in the underlying MDP, where each state corresponds to a “local” module M_i and the transition structure T_G is induced from T via marginalization and normalization 22 over the internal states of the local modules M_i.

Footnote 1: This is an example of a “propositional representation” 17,18 and is analogous to state aggregation or “clustering” 19,20 in reinforcement learning, which is typically accomplished via heuristic bottleneck discovery algorithms 21. Our method is novel in that it does not require the optimal policy as an input and is founded on a normative principle.

2.2 Description length of planning

We use an information-theoretic framework 23,24 to define a measure, the (expected) description length (DL) of planning, which can be used to quantify the complexity of planning P in the induced global (L(P|M_G)) and local (L(P|M_i)) modules. We will compute the DL of planning, L(P), in a non-modularized setting and outline the extension to the modularized planning DL L(P|M) (elaborating further in the supplementary material). Given a task (s, g) in an MDP, a solution v(n) to this task is an n-state trajectory such that v(n)_1 = s and v(n)_n = g. The description length (DL) of this trajectory is L(v(n)) := −log p_plan(v(n)). A task may admit many solutions corresponding to different trajectories over the state-space, thus we define the DL of the task (s, g) to be the expectation over all trajectories which solve this task, namely

L(s, g) := E_{v,n}[L(v(n))] = − Σ_{n=1}^∞ Σ_{v(n)} p(v(n)|s, g) log p(v(n)|s, g)    (2)

This is the (s, g)-th entry of the trajectory entropy matrix H of M. Remarkably, this can be expressed in closed form 25:

[H]_sg = Σ_{v≠g} [(I − T_g)^-1]_sv H_v    (3)

where T is the transition matrix of the planning Markov chain (Eq. 1), T_g is the sub-matrix corresponding to the elimination of the g-th column and row, and H_v is the local entropy H_v := H(T_v·) at state v. Finally, we define the description length L(P) of the planning process P itself over all tasks (s, g):

L(P) := E_{s,g}[L(s, g)] = Σ_{(s,g)} P_s P_g L(s, g)    (4)

where P_s and P_g are priors over the start and goal states respectively, which we assume to be factorizable, P_(s,g) = P_s P_g, for clarity of exposition. In matrix notation, this can be expressed as L(P) = P_s H P_g^T, where P_s is a row-vector of start state probabilities and P_g is a row-vector of goal state probabilities.

The planning DL, L(P|M), of a nontrivial modularization of an MDP requires (1) the computation of the DL of the global (L(P|M_G)) and local (L(P|M_i)) planning processes for the global M_G and local M_i modular structures respectively, and (2) the weighting of these quantities by the correct priors. See the supplementary material for further details.

2.3 Minimum modularized description length of planning

Based on a modularization, planning can first be performed at the global level across modules, and then subsequently locally within the subset of modules identified by the global planning process (Fig. 1). Given a task (s, g), where s represents the start state and g represents the goal state, global search would involve finding a trajectory in M_G from the induced initial module (the unique M_s such that M_s(s) = 1) to the goal module (M_g(g) = 1). 
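The closed form in Eq. 3 can be made concrete with a short NumPy sketch (our own expository code and helper names, not the paper's implementation): delete the goal's row and column from the planning chain, invert I − T_g, and weight the expected visit counts by the local entropies.

```python
import numpy as np

def local_entropies(T):
    """Shannon entropy H_v (nats) of each row of the planning chain T."""
    H = np.zeros_like(T)
    mask = T > 0
    H[mask] = -T[mask] * np.log(T[mask])
    return H.sum(axis=1)

def trajectory_entropy(T, g):
    """Expected planning DL L(s, g) for every start state s, via the
    closed form [H]_sg = sum_{v != g} [(I - T_g)^-1]_sv H_v (Eq. 3)."""
    n = T.shape[0]
    keep = [i for i in range(n) if i != g]        # delete goal row/column
    N = np.linalg.inv(np.eye(n - 1) - T[np.ix_(keep, keep)])
    L = np.zeros(n)
    L[keep] = N @ local_entropies(T)[keep]        # visits weighted by H_v
    return L

# Two-state sanity check: with uniform transitions and goal g = 1, state 0
# is visited twice in expectation before absorption and contributes ln 2
# nats per visit, so L(0, 1) = 2 ln 2.
T = np.array([[0.5, 0.5],
              [0.5, 0.5]])
print(trajectory_entropy(T, g=1)[0])   # ~1.386
```

A deterministic chain gives zero DL everywhere, matching the intuition that a plan with no choice points costs nothing to describe.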
The result of this search will be a global directive across modules M_s → ··· → M_g. Subsequently, local planning sub-tasks are solved within each module in order to “fill in the details”. For each module transition M_i → M_j in M_G, a local search in M_i is accomplished by planning from an entrance state from the previous module until an exit state for module M_j is entered. This algorithm is illustrated in Figure 1.

By minimizing the sum of the global L(P|M_G) and local DLs L(P|M_i), we establish the optimal modularization M* of a state-space for planning:

M* := argmin_M [L(P|M) + L(M)], where L(P|M) := L(P|M_G) + Σ_i L(P|M_i)    (5)

Note that this formulation explicitly trades off the complexity (measured as DL) of planning at the global level, L(P|M_G), i.e. across modules, and at the local level, L(P|M_i), i.e. within individual modules (Fig. 1C-D). In principle, the representational cost of the modularization itself, L(M), is also part of the trade-off, but we do not consider it further here for two reasons. First, in the state-spaces considered in this paper, it is dwarfed by the complexities of planning, L(M) ≪ L(P|M) (see the supplementary material for the mathematical characterization of L(M)). Second, it taxes long-term rather than short-term memory, which is at a premium when planning 26,27. 
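To make the trade-off in Eq. 5 tangible, the following self-contained sketch (our own toy construction, not the paper's algorithm) scores contiguous two-module partitions of a random walk on a path of eight states. It simplifies the full objective: priors are uniform, the global DL of a two-module split is taken as negligible, and a module's local DL is computed on the renormalized sub-chain restricted to its states.

```python
import numpy as np

def row_normalize(A):
    s = A.sum(axis=1, keepdims=True)
    return np.divide(A, s, out=np.zeros_like(A), where=s > 0)

def local_entropies(T):
    H = np.zeros_like(T)
    mask = T > 0
    H[mask] = -T[mask] * np.log(T[mask])
    return H.sum(axis=1)

def trajectory_entropy(T, g):
    """Closed-form expected DL L(s, g) for every start state s (Eq. 3)."""
    n = T.shape[0]
    keep = [i for i in range(n) if i != g]
    N = np.linalg.inv(np.eye(n - 1) - T[np.ix_(keep, keep)])
    L = np.zeros(n)
    L[keep] = N @ local_entropies(T)[keep]
    return L

def avg_DL(T):
    """Uniform-prior average of L(s, g) over all tasks (cf. Eq. 4)."""
    return np.mean([trajectory_entropy(T, g) for g in range(T.shape[0])])

def local_DL(T, states):
    """Average DL of planning restricted to one candidate module
    (sub-chain renormalized -- a simplification of the paper's treatment)."""
    states = list(states)
    return avg_DL(row_normalize(T[np.ix_(states, states)]))

# Random walk on a path of 8 states.
n = 8
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
T = row_normalize(A)

flat = avg_DL(T)                                   # no modularization
splits = {k: local_DL(T, range(k)) + local_DL(T, range(k, n))
          for k in range(2, n - 1)}
best = min(splits, key=splits.get)                 # balanced split wins
```

On this toy chain the balanced split (k = 4) minimizes the summed local DL and comes out far below the flat DL, mirroring the global/local trade-off shown for Soho in Fig. 1D.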
Importantly, although computing the DL of a modularization seems to pose significant computational challenges by requiring the enumeration of a large number of potential trajectories in the environment (across or within modules), in the supplementary material we show that it can be computed in a relatively straightforward manner (the only nontrivial operation being a matrix inversion) using the theory of finite Markov chains 22.

2.4 Planning compression

The planning DL L(s, g) for a specific task (s, g) describes the expected difficulty of finding an intervening trajectory v for that task. For example, in a binary coding scheme where we assign binary sequences to each state, the expected length of the random string of 0s and 1s corresponding to a trajectory will be shorter in a modularized compared to a non-modularized representation. Thus, we can examine the relative benefit of an optimal modularization, in the Shannon limit, by computing the ratio of trajectory description lengths in modularized and non-modularized representations of a task or environment 28. In line with spatial cognition terminology 29, we refer to this ratio as the compression factor of the trajectory.

Figure 1. Modularized planning. A. Schematic exhibiting how planning, which could be highly complex using a flat state space representation (left), can be reformulated into a hierarchical planning process via a modularization (center and right). Boxes (circles or squares) show states, lines are transitions (gray: potential transitions, black: transitions considered in current plan). Once the “global directive” has been established by searching in a low-resolution representation of the environment (center), the agent can then proceed to “fill in the details” by solving a series of local planning sub-tasks (right). Formulae along the bottom show the DL of the corresponding planning processes. B. 
Given a modularization, a serial hierarchical planning process unfolds in time, beginning with a global search task followed by local sub-tasks. As each global/local planning task is initiated in series, there is a phasic increase in processing which scales with planning difficulty in the upcoming module as quantified by the local DL, L(P|M_i). C. Map of London's Soho state-space; streets (lines, with colors coding degree centrality) correspond to states (courtesy of Hugo Spiers). D. Minimum expected planning DL of London's Soho as a function of the number of modules (minimizing over all modularizations with the given number of modules). Red: global, blue: local, black: total DL. E. Histogram of compression factors of 200 simulated trajectories from randomly chosen start to goal locations in London's Soho. F. Absolute entropic centrality (EC) differences within and across connected modules in the optimal modularization of the Soho state-space. G. Scatter plot of degree and entropic centralities of all states in the Soho state-space.

2.5 Entropic centrality

The computation of the planning DL (Section 2.2) makes use of the trajectory entropy matrix H of a Markov chain. Since H is composed of weighted sums of local entropies H_v, it suggests that we can express the contribution of a particular state v to the planning DL by summing its terms for all tasks (s, g). Thus, we define the entropic centrality, E_v, of a state v via

E_v = Σ_{s,g} D_svg H_v    (6)

where we have made use of the fundamental tensor of a Markov chain D, with components D_svg = [(I − T_g)^-1]_sv. Note that task priors can easily be incorporated into this definition. The entropic centrality (EC) of a state measures its importance to tasks across the domain, and its gradient can serve as a measure of “subgoalness” for the planning process P. 
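Eq. 6 can be evaluated by accumulating the fundamental tensor goal by goal, so that each state's EC is its local entropy weighted by the expected visits it receives across all tasks. The sketch below is our own illustration (function names and the toy graph are not from the paper):

```python
import numpy as np

def entropic_centrality(T):
    """Entropic centrality E_v = sum_{s,g} D_svg H_v (Eq. 6), where
    D_svg = [(I - T_g)^-1]_sv counts expected visits to v en route
    from s to g, and H_v is the local entropy of the planning chain."""
    n = T.shape[0]
    H = np.zeros_like(T)
    mask = T > 0
    H[mask] = -T[mask] * np.log(T[mask])
    Hv = H.sum(axis=1)
    E = np.zeros(n)
    for g in range(n):
        keep = [i for i in range(n) if i != g]
        D = np.linalg.inv(np.eye(n - 1) - T[np.ix_(keep, keep)])
        E[keep] += D.sum(axis=0) * Hv[keep]    # sum over start states s
    return E

# Random walk on a "star": hub 0 joined to four leaves.  The leaves are
# deterministic (H_v = 0), so the hub -- the only branching state --
# carries all of the entropic centrality, a caricature of a high-EC junction.
T = np.zeros((5, 5))
T[0, 1:] = 0.25          # hub picks a leaf uniformly
T[1:, 0] = 1.0           # each leaf returns to the hub
E = entropic_centrality(T)
```

In this caricature the high-degree hub is also the high-EC state, anticipating the EC/degree-centrality relationship discussed next.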
Indeed, we observed in simulations that one strategy used by an optimal modularization to minimize planning complexity is to “isolate” planning DL within rather than across modules, such that EC changes more across than within modules (Fig. 1F). This suggests that changes in EC serve as a good heuristic for identifying modules. Furthermore, EC is tightly related to the graph-theoretic notion of degree centrality (DC). When transitions are undirected and are deterministically related to action, degree centrality deg(v) corresponds to the number of states which are accessible from a state v. In such circumstances, and assuming a random policy, we have

E_v = Σ_{s,g} D_svg (1/deg(v)) log(deg(v))    (7)

The ECs and DCs of all states in a state-space reflecting the topology of London's Soho are plotted in Fig. 1G and show a strong correlation, in agreement with this analysis. In Section 3.2 we test whether this tight relationship, together with the intuition developed above about changes in EC demarcating approximate module boundaries, provides a normative account of recently observed correlations between DC and human hippocampal activity during spatial navigation 30.

2.6 Efficient coding in modularized state-spaces

In addition to “compressing” the planning process, modularization also enables a neural channel to transmit information (for example, a desired state sequence) in a more efficient pattern of activity using a hierarchical entropy coding strategy 31, whereby contextual codewords signaling the entrance to and exit from a module constrain the set of states that can be transmitted to those within a module, thus allowing them to be encoded with shorter description lengths according to their relative probabilities 28 (i.e. a state that forms part of many trajectories will have a shorter description length than one that does not). 
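The coding argument can be illustrated numerically (a toy example with made-up visitation probabilities, not data from the paper): once an entry codeword has signaled the module, states are coded by their probabilities conditional on that module, which shortens every within-module codeword by −log2 of the module's probability.

```python
import numpy as np

# Hypothetical visitation distribution over 8 states forming two modules
# (invented numbers, for illustration only).
p = np.array([0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
module = [0, 1, 2, 3]                      # states of the first module

flat_len = -np.log2(p)                     # flat Shannon codeword lengths

p_mod = p[module].sum()                    # probability of being in the module
entry_len = -np.log2(p_mod)                # module-entry codeword (Section 2.6)
within_len = -np.log2(p[module] / p_mod)   # shorter conditional codewords

# Transmitting a trajectory confined to the module: one entry codeword, then
# within-module codewords, each shorter than its flat counterpart by
# -log2(p_mod).  (Global state indices coincide with module-local indices
# here because the module is states 0..3.)
traj = [0, 1, 2, 3, 1, 0]
flat_cost = flat_len[traj].sum()
mod_cost = entry_len + within_len[traj].sum()
```

The one-off entry codeword is repaid as soon as more than one state is transmitted, so the hierarchical code is strictly cheaper for any trajectory of length ≥ 2 inside a module.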
Assuming that neurons take advantage of these strategies in an efficient code 32, several predictions can be made with regard to the representational characteristics of neuronal populations encoding components of optimally modularized state-spaces. We suggest that the phasic neural responses (known as “start” and “stop” signals) which have been observed to encase learned behavioral sequences in a wide range of control paradigms across multiple species 33–36 serve this purpose in modularized control architectures. Our theory makes several predictions regarding the temporal dynamics and population characteristics of these start/stop codes. First, it determines a specific temporal pattern of phasic start/stop activity as an animal navigates using an optimally modularized representation of a state-space. Second, neural representations for the start signals should depend on the distribution of modules, while the stop codes should be sensitive to the distribution of components within a module. Considering the minimum average description length of each of these distributions, we can make predictions regarding how many neural resources (for example, the number of neurons) should be assigned to represent each of these start/stop variables. We verify these predictions in published neural data 36,34 in Section 4.

3 Route compression and state-space segmentation in spatial cognition

3.1 Route compression

We compared the compression afforded by optimal modularization to a recent behavioral study examining trajectory compression during mental navigation 29. In this task, students at the University of Toronto were asked to mentally navigate between a variety of start and goal locations on their campus, and the authors computed the (inverse) ratio between the duration of this mental navigation and the typical time it would physically take to walk the same distance. 
Although mental navigation time was substantially smaller than physical time, it was not simply a constant fraction of it; instead, the ratio of the two (the compression factor) became higher with longer route length (Fig. 2A). In fact, while in the original study only a linear relationship between compression factor and physical route length was considered, reanalysing the data yielded a better fit by a logarithmic function (R^2 = 0.69 vs. 0.46).

In order to compare our theory with these data, we computed compression factors between the optimally modularized and the non-modularized version of an environment. This was because students were likely to have developed a good knowledge of the campus' spatial structure, and so we assumed they used an approximately optimal modularization for mental navigation, while the physical walking time could not make use of this modularization and was bound to the original non-modularized topology of the campus. As we did not have access to precise geographical data about the part of the U. Toronto campus that was used in the original experiment, we ran our algorithm on a part of London Soho which had been used in previous studies of human navigation 30. Based on 200 simulated trajectories over route lengths of 1 to 10 states, we found that our compression factor showed a similar dependence on route length (see footnote 2) (Fig. 2B) and again was better fit by a logarithmic versus a linear function (R^2 = 0.82 vs. 0.72, respectively).

Figure 2. Modularized representations for spatial cognition. A. Compression factor as a function of route length for navigating the U. Toronto campus (reproduced from Ref. 29) with linear (grey) and logarithmic fits (blue). B. Compression factors for the optimal modularization in the London Soho environment. C. 
Spearman correlations between changes in local planning DL, L(P|M_i), and changes in different graph-theoretic measures of centrality.

3.2 Local planning entropy and degree centrality

We also modeled a task in which participants, who were trained to be familiar with the environment, navigated between randomly chosen locations in a virtual reality representation of London's Soho by pressing keys to move through the scenes 30. Functional magnetic resonance imaging during this task showed that hippocampal activity during such self-planned (but not guided) navigation correlated most strongly with changes in a topological state “connectedness” measure known as degree centrality (DC; compared to other standard graph-theoretic measures of centrality such as “betweenness” and “closeness”). Although changes in DC are not directly relevant to our theory, we can show that they serve as a good proxy for a fundamental quantity in the theory, planning DL (see Eq. 7), which in turn should be reflected in neural activations.

To relate the optimal modularization, the most direct prediction of our theory, to neural signals, we made the following assumptions (see also Fig. 1B). 1. Planning (and associated neural activity) occurs upon entering a new module (as once a plan is prepared, movement across the module can be automatic without the need for further planning, until transitioning to a new module). 2. 
The magnitude of neural activity is related to the local planning DL, L(P|M_i), of the module (as the higher the entropy, the more trajectories need to be considered, likely activating more neurons with different tunings for state transitions, or state-action combinations 37, resulting in higher overall activity in the population).

Footnote 2: Note that the absolute scale of our compression factor is different from that found in the experiment because we did not account for the trivial compression that comes from the simple fact that it is just generally faster to move mentally than physically.

Figure 3. Neural activities encoding module boundaries. A. T-maze task in which tone determines the location of the reward (reproduced from Ref. 34). Inset: the model's optimal modularization of the discretized T-maze state-space. Note that the critical junction has been extracted to form its own module, which isolates the local planning DL caused by the split in the path. B. Empirical data exhibiting the temporal pattern of task-bracketing in dorsolateral striatal (DLS) neurons. Prior to learning the task, ensemble activity was highly variable both spatially and temporally throughout the behavioral trajectory. Reproduced from Ref. 34. C. Simulated firing rates of “task-responsive” neurons after and before acquiring an optimal modularization. D. The optimal modularization (colored states are in the same module) of a proposed state-space for an operant conditioning task 36. Note that the lever pressing sequences form their own modules and thus require specialized start/stop codes. E. Analyses of striatal neurons suggesting that a larger percentage of neurons encoded lever sequence initiations compared to terminations, and that very few encoded both. Reproduced from Ref. 36. F. Description lengths of start/stop codes in the optimal modularization.

Furthermore, as before, we also assume that participants were sufficiently familiar with Soho that they used the optimal modularization (as they were specifically trained in the experiment). Having established that under the optimal modularization entropic centrality (EC) tends to change more across than within modules (Fig. 1F), and also that EC is closely related to DC (Fig. 1G), the theory predicts that neural activity should be timed to changes in DC. Furthermore, the DLs of successive modules along a trajectory will in general be positively correlated with the differences between their DLs (due to the unavoidable “regression to the mean” effect; see footnote 3). Noting that the planning DL of a module is just the (weighted) average EC of its states (see Section 2.5), the theory thus more specifically predicts a positive correlation between neural activity (representing the DLs of modules) and changes in EC, and therefore changes in DC – just as seen in experiments.

We verified these predictions numerically by quantifying the correlation of changes in each centrality measure used in the experiments with transient changes in local planning complexity as computed in the model (Fig. 2C). 
Across simulated trajectories, we found that changes in DC had a strong correlation with changes in local planning entropy (mean ρ_deg = 0.79) that was significantly higher (p < 10^-5, paired t-tests) than the correlation with the other centrality measures. We predict that even higher correlations with neural activity could be achieved if planning DL according to the optimal modularization, rather than DC, were used directly as a regressor in general linear models of the fMRI data.

Footnote 3: Transitioning to a module with a larger/smaller DL will cause, on average, a more positive/negative DL change compared to the previous module DL.

4 Task-bracketing and start/stop signals in striatal circuits

Several studies have examined sequential action selection paradigms and identified specialized task-bracketing 33,34 and “start” and “stop” neurons that are invariant to a wide range of motivational, kinematic, and environmental variables 36,35. Here, we show that task-bracketing and start/stop signals arise naturally from our model framework in two well-studied tasks, one involving their temporal 34 and the other their representational characteristics 36.

In the first study, as rodents learned to navigate a T-maze (Fig. 
3A), neural activity in dorsolateral striatum and infralimbic cortex became increasingly crystallized into temporal patterns known as "task-brackets" 34. For example, although neural activity was highly variable before learning, after learning the same neurons phasically fired at the start of a behavioral sequence, as the rodent turned into and out of the critical junction, and finally at the goal position where reward was obtained. Based on the optimal modularization for the T-maze state-space (Fig. 3A inset), we examined spike trains from simulated neurons whose firing rates scaled with local planning entropy (see supplementary material). This showed that initially (i.e. without modularization; Fig. 3C, right) the firing rate did not reflect any task-bracketing, but following training (i.e. with the optimal modularization; Fig. 3C, left) the activity exhibited clear task-bracketing driven by the initiation or completion of a local planning process. These results are in good qualitative agreement with the empirical data (Fig. 3B, from Ref. 34), showing that task-bracketing patterns of activity can be explained as the result of module start/stop signaling and planning according to an optimal modular decomposition of the environment.\nIn the second study, rodents engaged in an operant conditioning paradigm in which a sequence of eight presses on a left or right lever led to the delivery of high or low rewards 36. After learning, recordings from nigrostriatal circuits showed that some neurons encoded the initiation, and fewer appeared to encode the termination, of these action sequences. We used our framework to compute the optimal modularization based on an approximation to the task state-space (Fig. 3D) in which the rodent could be in many natural behavioral states (red circles) prior to the start of the task.
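The coding-cost asymmetry that such a state-space induces can be previewed with a toy calculation (the probabilities below are hypothetical placeholders, not values computed from the optimal modularization): under an entropy code, an event with probability p costs −ln p nats, so initiating a module from a high-entropy rest state (one of many possible behaviors) costs more than terminating it from a low-entropy final lever-press state (one of few options).

```python
import math

# Hypothetical transition probabilities (illustrative placeholders only;
# in the paper these quantities derive from the optimal modularization).
p_start = 1 / 8  # starting a lever-press module: one of ~8 behaviors at "rest"
p_stop = 1 / 2   # ending it: magazine entry vs. return to "rest"

# Entropy-coding cost of signaling each event, in nats.
dl_start = -math.log(p_start)
dl_stop = -math.log(p_stop)

print(f"start code: {dl_start:.2f} nats; stop code: {dl_stop:.2f} nats")
```

Under these assumed probabilities the start code costs ln 8 ≈ 2.08 nats versus ln 2 ≈ 0.69 nats for the stop code, so an efficient allocation of neural resources would dedicate more neurons to starts than stops.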
Our model found that the lever action sequences were extracted into two separate modules (blue and green circles). Given a modularization, a hierarchical entropy coding strategy uses distinct neural codewords for the initiation and termination of each module (Section 2.6). Importantly, we found that the description lengths of start codes were longer than those of stop codes (Fig. 3F). Thus, an efficient allocation of neural resources predicts more neurons encoding start than stop signals, as seen in the empirical data (Fig. 3E). Intuitively, more bits are required to encode starts than stops in this state-space due to the relatively high entropic centrality of the "rest" state (where many different behaviors may be initiated, red circles) compared to the final lever press state (which is only accessible from the previous lever press state, and from which the rodent can only choose to enter the magazine or return to "rest"). These results show that start and stop codes and their representational characteristics arise naturally from an efficient representation of the optimally modularized state-space.\n\n5 Discussion\n\nWe have developed the first framework in which it is possible to derive state-space modularizations that are directly optimized for the efficiency of decision making strategies and do not require prior knowledge of the optimal policy before computing the modularization. Furthermore, we have identified experimental hallmarks of the resulting modularizations, thereby unifying a range of seemingly disparate results from behavioral and neurophysiological studies within a common, principled framework. An interesting future direction would be to study how modularized policy production may be realized in neural circuits.
In such circuits, once a representation has been established, neural dynamics at each level of the hierarchy may be used to move along a state-space trajectory via a sequence of attractors with neural adaptation preventing backflow 38, or via fundamentally non-normal dynamics around a single attractor state 39. The description length that lies at the heart of the modularization we derived was based on a specific planning algorithm, random search, which may not lead to the modularization that would be optimal for other, more powerful and realistic, planning algorithms. Nevertheless, our approach is general in principle, in that it can take any planning algorithm as the component that generates description lengths, including the hybrid algorithms combining model-based and model-free techniques that likely underlie animal and human decision making 40.\n\n\fReferences\n1. Lashley K. In: Jeffress LA, editor. Cerebral Mechanisms in Behavior, New York: Wiley, pp 112–147, 1951.\n2. Simon H, Newell A. Human Problem Solving. Longman Higher Education, 1971.\n3. Sutton R, Barto A. Reinforcement Learning: An Introduction. MIT Press, 1998.\n4. Stachenfeld K et al. Advances in Neural Information Processing Systems, 2014.\n5. Moore AW et al. IJCAI International Joint Conference on Artificial Intelligence 2:1318–1321, 1999.\n6. Lengyel M, Dayan P. Advances in Neural Information Processing Systems, 2007.\n7. Dayan P, Hinton G. Advances in Neural Information Processing Systems, 1992.\n8. Parr R, Russell S. Advances in Neural Information Processing Systems, 1997.\n9. Sutton R et al. Artificial Intelligence 112:181–211, 1999.\n10. Hauskrecht M et al. In: Uncertainty in Artificial Intelligence, 1998.\n11. Rothkopf CA, Ballard DH. Frontiers in Psychology 1:1–13, 2010.\n12. Huys QJM et al. Proceedings of the National Academy of Sciences 112:3098–3103, 2015.\n13. Gershman SJ et al.
Journal of Neuroscience 29:13524\u201331, 2009.\n14. Schapiro AC et al. Nature Neuroscience 16:486\u2013492, 2013.\n15. Foster D, Dayan P. Machine Learning pp 325\u2013346, 2002.\n16. Solway A et al. PLoS Computational Biology 10:e1003779, 2014.\n17. Littman ML et al. Journal of Arti\ufb01cial Intelligence Research 9:1\u201336, 1998.\n18. Boutilier C et al. Journal of Arti\ufb01cial Intelligence Research 11:1\u201394, 1999.\n19. Singh SP et al. Advances in Neural Information Processing Systems, 1995.\n20. Kim KE, Dean T. Arti\ufb01cial Intelligence 147:225\u2013251, 2003.\n21. Simsek O, Barto AG. Advances in Neural Information Processing Systems, 2008.\n22. Kemeny JG, Snell JL. Finite Markov Chains. Springer-Verlag, 1983.\n23. Balasubramanian V. Neural Computation 9:349\u2013368, 1996.\n24. Rissanen J. Information and Complexity in Statistical Modeling. Springer, 2007.\n25. Kafsi M et al. IEEE Transactions on Information Theory 59:5577\u20135583, 2013.\n26. Todd M et al. Advances in Neural Information Processing Systems, 2008.\n27. Otto AR et al. Psychological Science 24:751\u201361, 2013.\n28. MacKay D. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.\n29. Bonasia K et al. Hippocampus 26:9\u201312, 2016.\n30. Javadi AH et al. Nature Communications in press, 2016.\n31. Rosvall M, Bergstrom CT. Proceedings of the National Academy of Sciences 105:1118\u20131123, 2008.\n32. Ganguli D, Simoncelli E. Neural Computation 26:2103\u20132134, 2014.\n33. Barnes TD et al. Nature 437:1158\u201361, 2005.\n34. Smith KS, Graybiel AM. Neuron 79:361\u2013374, 2013.\n35. Fujii N, Graybiel AM. Science 301:1246\u20131249, 2003.\n36. Jin X, Costa RM. Nature 466:457\u2013462, 2010.\n37. Stalnaker TA et al. Frontiers in Integrative Neuroscience 4:12, 2010.\n38. Russo E et al. New Journal of Physics 10, 2008.\n39. Hennequin G et al. Neuron 82:1394\u2013406, 2014.\n40. Daw ND et al. 
Nature Neuroscience 8:1704–11, 2005.