{"title": "Interval timing in deep reinforcement learning agents", "book": "Advances in Neural Information Processing Systems", "page_first": 6689, "page_last": 6698, "abstract": "The measurement of time is central to intelligent behavior. We know that both animals and artificial agents can successfully use temporal dependencies to select actions. In artificial agents, little work has directly addressed (1) which architectural components are necessary for successful development of this ability, (2) how this timing ability comes to be represented in the units and actions of the agent, and (3) whether the resulting behavior of the system converges on solutions similar to those of biology. Here we studied interval timing abilities in deep reinforcement learning agents trained end-to-end on an interval reproduction paradigm inspired by experimental literature on mechanisms of timing. We characterize the strategies developed by recurrent and feedforward agents, which both succeed at temporal reproduction using distinct mechanisms, some of which bear specific and intriguing similarities to biological systems. These findings advance our understanding of how agents come to represent time, and they highlight the value of experimentally inspired approaches to characterizing agent abilities.", "full_text": "Interval timing in deep reinforcement learning agents\n\nBen Deverett\nDeepMind\n\nbendeverett@google.com\n\nRyan Faulkner\n\nDeepMind\n\nrfaulk@google.com\n\nMeire Fortunato\n\nDeepMind\n\nmeirefortunato@google.com\n\nGreg Wayne\nDeepMind\n\ngregwayne@google.com\n\nJoel Z. Leibo\nDeepMind\n\njzl@google.com\n\nAbstract\n\nThe measurement of time is central to intelligent behavior. We know that both\nanimals and arti\ufb01cial agents can successfully use temporal dependencies to select\nactions. In arti\ufb01cial agents, little work has directly addressed (1) which architectural\ncomponents are necessary for successful development of this ability, (2) how this\ntiming ability comes to be represented in the units and actions of the agent, and\n(3) whether the resulting behavior of the system converges on solutions similar to\nthose of biology. Here we studied interval timing abilities in deep reinforcement\nlearning agents trained end-to-end on an interval reproduction paradigm inspired\nby experimental literature on mechanisms of timing. We characterize the strategies\ndeveloped by recurrent and feedforward agents, which both succeed at temporal\nreproduction using distinct mechanisms, some of which bear speci\ufb01c and intriguing\nsimilarities to biological systems. These \ufb01ndings advance our understanding of\nhow agents come to represent time, and they highlight the value of experimentally\ninspired approaches to characterizing agent abilities.\n\n1\n\nIntroduction\n\nTo exploit the rewards available in our environment, we capitalize on relationships between envi-\nronmental causes and effects that exhibit precise temporal dependencies. For example, to avoid a\ndangerous threat moving towards you, you may estimate its speed by observing its displacement\nover a \ufb01xed time interval, extrapolate its future position over another time interval, and condition\nyour escape behavior on your estimated time of contact with the threat. 
This ability to measure time and use it to guide behavior is necessary and prevalent in both animals and artificial agents. However, owing to basic differences in their implementation, artificial intelligence (AI) and biology have different relationships to time. Nevertheless, it is likely that consideration of the temporal measurement problem across these domains may yield valuable insights for both.\nIn biological systems, time measurements are necessary at a variety of temporal scales, ranging from milliseconds to years. The mechanisms underlying these timing abilities differ according to time scale, and many are well characterized at the level of neural circuits [8, 24]. On the scale of seconds, interval timing paradigms are used in animals to study the behavioral and neural properties of time measurement. For example, an animal might be taught to measure out the elapsed interval between two events, then to report or reproduce that interval to the best of its ability in order to obtain a reward [3]. Humans, non-human primates, and rodents exhibit a number of characteristic behaviors on these tasks [5, 14, 13] that may reflect biological constraints on mechanisms that remain incompletely understood.\nIn the AI domain, there exist numerous agents that have succeeded in solving tasks with complex temporal dependencies [31, 29, 30, 11, 12]. Many of these are deep reinforcement learning agents, which learn reinforcement learning policies that use deep neural networks as function approximators [21]. While the abilities of these agents have advanced dramatically in recent years, we lack a detailed understanding of the solutions they employ.\nFor instance, consider an agent that must learn to condition its actions on the amount of elapsed time between two environmental stimuli, as we will do in this study. A deep reinforcement learning agent with a recurrent module (e.g. an LSTM [10]) has, by construction, two distinct mechanisms for storing relevant timing information. First, the LSTM is a source of temporal memory, since it is designed to carry past information forward in its recurrent state, with parameters trained by way of backpropagation through time. Second, the reinforcement learning algorithm, regardless of the underlying function approximator, assigns the credit associated with rewards to specific past states and actions. A deep reinforcement learning agent without a recurrent module (i.e. purely feedforward) lacks the former mechanism but retains the latter. When trained end-to-end on a timing task, it is unclear whether and how agents may come to implicitly or explicitly represent time.\nHere we use an experimentally inspired approach to study the solutions that reinforcement learning agents develop for interval timing. We characterize how the strategies developed by the agents differ from one another, and from animals, discovering themes that underlie interval timing.
We suggest that this approach offers benefits both for AI \u2013 by introducing an experimental paradigm that simply and precisely evaluates agent strategies \u2013 and for neuroscience \u2013 by exploring the space of solutions that develop outside of biological constraints, serving as a testbed for interpretation of timing-related findings in animals.\n\n2 Methods\n\n2.1 Interval reproduction task\n\nWe designed a task based on a temporal reproduction behavioral paradigm from the neuroscience literature [14]. The task was implemented in PsychLab [19], a simulated laboratory-like environment inside DeepMind Lab [2] in which agents view a screen and make \u201ceye\u201d movements to obtain rewards. We have open-sourced the task (along with other related timing tasks) for use in future work.\n\nFigure 1: Interval reproduction task. The image sequence shows a single trial of the task. First, the agent fixates on a center cross, at which point the \u201cGo\u201d target appears. After a short delay, the red \u201cReady\u201d cue flashes, followed by a randomly chosen \u201csample interval\u201d delay. After the sample interval passes, the yellow \u201cSet\u201d cue flashes. Then the agent must wait for the duration of the sample interval before gazing onto the \u201cGo\u201d target to end the trial. If the period over which it waited, the \u201cproduction interval\u201d, matches the sample interval within a tolerance, the agent is rewarded. This task is closely based on an existing temporal reproduction task for humans and non-human primates [14].\n\nThe task is shown in Fig. 1. In each trial, the agent fixates on a central start position, at which point a \u201cGo\u201d target appears on the screen, which will serve as the eventual gaze destination to end the trial. After a delay, a \u201cReady\u201d cue flashes, followed by a specific \u201csample\u201d interval of time, then a \u201cSet\u201d cue flashes. Following the flash of the \u201cSet\u201d cue, the agent must wait for the duration of the sample interval before gazing onto the \u201cGo\u201d target to complete the trial. If the duration of the \u201cproduction\u201d interval (i.e. the elapsed time from \u201cSet\u201d cue appearance until gaze arrives on the \u201cGo\u201d target) matches the sample interval within a specified tolerance, the agent is rewarded.\nThe demands of this \u201ctemporal reproduction\u201d task are twofold: the agent must first measure the temporal interval presented between two transient environmental events, and it must then reproduce that interval before ending the trial. Trials are presented in episodes, with each episode lasting 300 seconds or a maximum of 50 trials, whichever comes first. Each trial\u2019s sample interval is selected uniformly at random from 10\u2013100 frames in steps of 10 (corresponding to 167\u20131667 ms at 60 frames per second). The agent is rewarded if the production interval is sufficiently close to the sample interval; specifically, if |tp \u2212 ts| < \u03b3s(\u03b1 + \u03b2ts), where tp is the production interval, ts is the sample interval, \u03b1 is a baseline tolerance, \u03b2 is a scaling factor like that used in [14] to account for scalar variability, and \u03b3s is an overall difficulty scaling factor for each sample interval s. In practice, we usually set \u03b1 to 8 frames and \u03b2 to zero, and \u03b3s evolved within an episode from 2.5 to 1.5 to 0, advancing each time two rewards were obtained at the given sample interval s. We found that the results shown are robust to a wide range of parameters for this curriculum.
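To make the reward rule and curriculum above concrete, the following minimal Python sketch implements the tolerance check |tp \u2212 ts| < \u03b3s(\u03b1 + \u03b2ts) and the per-interval tightening of \u03b3s. The bookkeeping details and names (e.g. RewardCurriculum) are our own assumptions for illustration and are not taken from the released task code.

import random

class RewardCurriculum:
    # Sketch of the interval-reproduction reward rule: the agent is rewarded when
    # |t_p - t_s| < gamma_s * (alpha + beta * t_s). gamma_s is tightened separately
    # for each sample interval (2.5 -> 1.5 -> 0, per the text) each time two
    # rewards have been earned at that interval.
    GAMMA_LEVELS = (2.5, 1.5, 0.0)

    def __init__(self, alpha=8.0, beta=0.0):
        self.alpha = alpha
        self.beta = beta
        self.level = {}      # sample interval -> index into GAMMA_LEVELS
        self.successes = {}  # sample interval -> rewards earned at current level

    def sample_interval(self):
        # Sample intervals are drawn uniformly from 10-100 frames in steps of 10.
        return random.choice(range(10, 101, 10))

    def reward(self, t_s, t_p):
        lvl = self.level.get(t_s, 0)
        tolerance = self.GAMMA_LEVELS[lvl] * (self.alpha + self.beta * t_s)
        if abs(t_p - t_s) >= tolerance:
            return 0.0
        self.successes[t_s] = self.successes.get(t_s, 0) + 1
        if self.successes[t_s] >= 2 and lvl < len(self.GAMMA_LEVELS) - 1:
            self.level[t_s] = lvl + 1  # advance difficulty for this interval
            self.successes[t_s] = 0
        return 1.0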
2.2 Agent architecture\n\nFigure 2: The agent architecture for our interval timing tasks. Frame input is passed into the residual network + MLP. The output from this component is passed to the controller, which enables the integration of past events in the recurrent case. Finally, this output is sent to the policy and value networks to generate an action (and a policy gradient in the backward pass).\n\nWe used an agent based on the A3C architecture [21] (Fig. 2). It uses a deep residual network [9] to generate a latent representation of the visual input from PsychLab, which is subsequently passed to a controller network: either a recurrent network, in this case an LSTM, or a feedforward network. The controller output is then fed forward to the policy and baseline networks, which generate the policy and value estimates that are trained under the Importance Weighted Actor-Learner Architecture [6] (see section A.1). At each time step, the policy generates an action, corresponding to a small instantaneous eye movement in a particular direction from the current gaze position. The LSTM controller provides a way for the agent to integrate past events along with its input in order to drive the policy, while the feedforward agent must rely explicitly on the state of the environment to select actions.\nWe chose a residual embedding network architecture composed of three convolutional blocks with feature map counts of 16, 32, and 32; each block has a convolutional layer with kernel size 3x3 followed by max pooling with kernel size 3x3 and stride 2x2, followed by two residual subblocks. The ResNet was followed by a 256-unit MLP. We used controllers with 128 hidden units for all experiments. The learner was given trajectories of 100 frames, with a batch size of 32, and used 200 actors. Other parameters were a discount factor of 0.99, a baseline cost of 0.5, and an entropy cost of 0.01. The model was optimized using Adam with \u03b21 = 0.9, \u03b22 = 0.999, \u03b5 = 10^\u22124, and a learning rate of 10^\u22125.
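As a rough sketch of the network described above (not the released agent code), the following PyTorch module stacks the three-block residual embedding, the 256-unit MLP, a 128-unit LSTM or feedforward controller, and linear policy and value heads. Details such as padding, the internal layout of the residual subblocks, the input resolution, and the discrete eye-movement action set are our assumptions; the IMPALA learner, V-trace corrections, and optimizer are omitted.

import torch
from torch import nn

class ResidualBlock(nn.Module):
    # One residual subblock (assumed layout: ReLU-conv-ReLU-conv with a skip connection).
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h = self.conv1(torch.relu(x))
        h = self.conv2(torch.relu(h))
        return x + h

class TimingAgentNet(nn.Module):
    # Embedding (ResNet + MLP), controller, and policy/value heads for a single frame.
    def __init__(self, num_actions, hidden_size=128, recurrent=True):
        super().__init__()
        layers, in_channels = [], 3
        for out_channels in (16, 32, 32):  # feature map counts from the text
            layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                       nn.MaxPool2d(kernel_size=3, stride=2),
                       ResidualBlock(out_channels),
                       ResidualBlock(out_channels)]
            in_channels = out_channels
        self.resnet = nn.Sequential(*layers)
        self.mlp = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU())
        self.recurrent = recurrent
        if recurrent:
            self.controller = nn.LSTMCell(256, hidden_size)
        else:
            self.controller = nn.Sequential(nn.Linear(256, hidden_size), nn.ReLU())
        self.policy_head = nn.Linear(hidden_size, num_actions)  # logits over eye movements
        self.value_head = nn.Linear(hidden_size, 1)              # baseline estimate

    def forward(self, frame, lstm_state=None):
        z = self.mlp(self.resnet(frame))
        if self.recurrent:
            h, c = self.controller(z, lstm_state)
            out, lstm_state = h, (h, c)
        else:
            out = self.controller(z)
        return self.policy_head(out), self.value_head(out).squeeze(-1), lstm_state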
3 Results\n\n3.1 Performance of deep reinforcement learning agents\n\nThe recurrent agent learned to perform the task with near-perfect accuracy (Fig. 3a, b; top row); in other words, the production interval was matched to the sample interval across all presented sample intervals. From this analysis, however, it remains unclear whether the agent learned a general timing rule or whether it memorized a specific discrete set of durations. Fig. 3c demonstrates that the agent indeed learned a general rule, successfully interpolating and extrapolating to new sample intervals on which it was not trained (+ signs in Fig. 3c).\n\nFigure 3: Agent performance on the interval reproduction task. (a) Reward rate over training for the recurrent and feedforward agents. Lines show mean \u00b1 s.d. over 15 seeds. (b) Mean production interval in the trained agent for each of the ten unique sample intervals. Underlying gray histograms show the distribution of production intervals. Includes data from one actor in the final 60,000 trials only, thus excluding the initial training phase. (c) Generalization was assessed by presenting sample intervals not used to train the agent (+ signs).\n\nSomewhat surprisingly, the feedforward agent also learned to perform the task, albeit more slowly and to a lesser degree of accuracy (Fig. 3, bottom row). The feedforward agent exhibited some notable behavioral features relative to the recurrent agent: (1) production interval distributions were wider, (2) a mean-directed bias was found in the production intervals at the extremes of the sample interval distribution, and (3) generalization to untrained intervals was poorer. Nevertheless, the agent learned to produce intervals that were remarkably well matched to the sample intervals, especially given the absence of any traditional or explicit memory systems within the agent architecture.\n\n3.2 Psychophysical model of feedforward agent\n\nOne notable feature of the feedforward agent\u2019s solution was its similarity to human and primate data [14, 13]. In particular, the sigmoid-like shape of the performance curve suggests that the strategy might be well explained by established models of perceptual timing in humans. We tested this intuition quantitatively, since an alignment between agent and animal performance could draw useful links between analyses of agent and animal behaviors.\nWe fit the feedforward agent data to a Bayesian psychophysical model previously established in human [13] and non-human primate [14] studies. In brief, the model treats the task as a three-stage process: a noisy measurement tm of the sample interval ts, a Bayesian least squares estimate te of the true ts given the noisy measurement tm, and the generation of a noisy production interval tp from the estimated interval te. The measurement and production steps are modeled as Gaussians with one parameter each, wm and wp respectively, corresponding to the coefficients of variation that control the scalar variability [7] in the noisy measurement and production processes. The conditional probability of a given production interval, p(tp | ts, wm, wp), can then be computed by marginalizing over the intermediate distributions as described in the full model, found in [13]. The model was fit using optimization routines in SciPy [15].\n\nFigure 4: Psychophysical model of interval timing. (a) In the trained agent, the standard deviation of the production interval scales with interval duration. When fit to the power law y = a + bx^c, the best-fit value for c is 0.7. Error bars: s.e.m. over 3 seeds. (b) An established Bayesian psychophysical model was fit to the feedforward agent data.\n\nFig. 4a shows that in the feedforward agent, the standard deviation of the production interval scales with the sample interval. While this relationship is slightly sub-linear, it approximates scalar variability, a feature used to motivate the psychophysical model because of its prevalence in biological systems [7]. Fig. 4b shows the model fit and the agent data; the approximate alignment of these data, and in particular the mean-directed bias at the tails, suggests an alignment between agent and animal behaviors. While such models are known to have poor identifiability [1], the qualitative alignment of these results indicates the relevance of these animal-model tools for characterizing artificial agent behavior as well.
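The three-stage observer model can be written out concretely as a small numerical sketch, shown below. It follows the structure described above (measurement noise with coefficient of variation wm, a Bayesian least squares estimate under the uniform prior over sample intervals, and production noise with coefficient wp), but the discretization, Monte Carlo marginalization, and function names are our own simplifications rather than the fitting code used for the paper or in [13].

import numpy as np
from scipy.optimize import minimize

# Prior over sample intervals: uniform on 10-100 frames, as in the task.
TS_GRID = np.arange(10.0, 101.0)

def bls_estimate(t_m, w_m):
    # Bayesian least squares estimate: posterior mean of t_s given measurement t_m,
    # with likelihood t_m ~ Normal(t_s, w_m * t_s) and a uniform prior over TS_GRID.
    sigma = w_m * TS_GRID
    likelihood = np.exp(-0.5 * ((t_m - TS_GRID) / sigma) ** 2) / sigma
    return np.sum(TS_GRID * likelihood) / np.sum(likelihood)

def prob_tp_given_ts(t_p, t_s, w_m, w_p, n_samples=2000, seed=0):
    # p(t_p | t_s, w_m, w_p): marginalize over the noisy measurement by Monte Carlo,
    # then evaluate the production noise density t_p ~ Normal(t_e, w_p * t_e).
    rng = np.random.default_rng(seed)
    t_m = rng.normal(t_s, w_m * t_s, size=n_samples)
    t_e = np.array([bls_estimate(m, w_m) for m in t_m])
    sigma = w_p * t_e
    density = np.exp(-0.5 * ((t_p - t_e) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.mean(density))

def negative_log_likelihood(params, ts_data, tp_data):
    w_m, w_p = params
    if w_m <= 0 or w_p <= 0:
        return np.inf
    probs = [prob_tp_given_ts(tp, ts, w_m, w_p) for ts, tp in zip(ts_data, tp_data)]
    return -float(np.sum(np.log(np.maximum(probs, 1e-12))))

# Fit w_m and w_p to (sample interval, production interval) pairs from the agent:
# fit = minimize(negative_log_likelihood, x0=[0.1, 0.1],
#                args=(ts_data, tp_data), method='Nelder-Mead')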
3.3 Evaluation of hidden unit activations\n\nTo understand the mechanism used by an agent to solve the task, it is helpful to characterize the activity of the hidden units [31]. In this task, it is reasonable to predict that the agent\u2019s hidden units should encode the timing information that the agent has learned in a trial. In particular, in the recurrent agent, one might predict the development of a \u201ctimer\u201d, in which the activations of one or many neurons implement a counter that accumulates over the presentation of the sample interval and then reads out its value to produce the interval of interest.\n\nFigure 5: Hidden unit representations of time. The mean activations of the 128 hidden units (top row: LSTM cell state in the recurrent agent; bottom row: hidden units in the feedforward agent) are shown for trials of each sample interval duration. The activations of the population of hidden units are summarized by their first principal component. Each color corresponds to trials of one specific sample interval duration (darker colors correspond to longer sample intervals). Left column: activations temporally aligned to the Ready cue; colored dashed lines: onset of each respective Set cue. Right column: same data aligned to the Set cue.\n\nIn Fig. 5 (top row) we show that such counters can indeed be found in the unit activity of the trained recurrent agent. We summarize the activity of the 128 LSTM cell state units using their first principal component. During presentation of the sample interval, unit activity rises uniformly until the Set cue is presented, at which point activity begins to fall, reaching its initial value by the time that duration has passed again. This is a simple solution to encoding the time interval and represents a form of clock. As a consequence, it can be seen that the Set-cue-aligned activity separates trials of different durations (Fig. 5, upper right), such that activity falls from a higher set point when the target interval is longer. These aligned average traces bear close resemblance to the unit activity found empirically in non-human primate parietal cortex during interval timing [14].\nIn feedforward agents, however, it is less clear how the activations of the neurons might represent the timing information. Using the same analysis on the hidden units of the feedforward agent, we found that unit activity also represents intervals, though in a less straightforward way (Fig. 5, bottom). To gain a better understanding of the feedforward solution, we proceeded to analyze the actions of the agent.\n\n3.4 Action trajectories\n\nThe success of the feedforward agent is intriguing because it suggests the agent has learned a strategy that requires no persistent internal information, but rather achieves clock-like functionality using only its trained feedforward weights and external input.
In order to characterize the strategies the agents\ndeveloped and how they differed from one another, we evaluated the trajectories of the agents in\naction space.\nThe use of highly controlled stimuli and actions in our task, inspired by neuroscience literature,\nallows us to perform this analysis in a straightforward way. Because our stimuli were presented on a\n2D screen and the only allowed actions were shifts in gaze direction, we could simply analyze the\nagent\u2019s gaze aligned to moments of interest in trials of each sample interval duration. In Fig. 6 we\nshow this analysis. In the recurrent agent, action trajectories appear similar across the range of sample\nintervals: the agent maintains \ufb01xation in a small region near the initial \ufb01xation point throughout the\nReady-Set interval (which it must measure), then it linearly shifts gaze to align with the target at the\ndesired time.\nOn the other hand, the feedforward agent shows a more interesting pattern: after the Ready cue, it\nbegins to traverse a stereotyped trajectory. When the Set cue arrives, it deviates off the trajectory\nand proceeds along another stereotyped trajectory, which it follows until it reaches the target. By\nexpanding the extent of these trajectories in a consistent way, the agent measures elapsed time. This\nstrategy can be described as one that uses the external environment as a clock, which is rational in the\nabsence of any persistent internal states to use as a clock.\nOne framework that may explain this feedforward agent\u2019s solution has been studied in animal behavior\nresearch and is called stigmergy [28]: coordination with the external environment to indirectly transfer\ninformation across individuals. In this case, one may describe the stereotyped action pattern used\nfor interval timing as \u201cautostigmergy\u201d: the agent\u2019s own interactions with the environment serve as a\nsource of memory external to the agent, but which can nevertheless be used to guide its actions. This\n\u201cautostigmergic\u201d solution is a particularly interesting proposition in light of the existing literature\non mechanisms of timing in animals: many studies have suggested that animals may measure and\nencode time through a process inherently linked with their behavior [18]. In other words, rather\nthan explicitly implementing a clock using neural activity, they indirectly measure time through\ntransitions in behavioral space. Fascinatingly, in a recent study where rats were trained to time out a\nparticular interval of time, the authors demonstrated that the rats solved the task by developing highly\nstereotyped movements that spanned the target interval, suggesting a possible link to behavioral\ntheories of timing [17]. The stigmergic strategy we observed here with our agents is therefore\nsimilar in nature to the strategy rats naturally adopted in that study. This observation suggests that\nanimals and arti\ufb01cial agents may converge on similar solutions for interval timing, and this may be a\nconsequence of shared computational constraints across both systems.\n\n3.5 Exploring architectural variants\n\nGiven the difference between the behaviors of the recurrent and feedforward agents, we explored and\ncompared the performance of some alternative architectural variants (Fig.7). 
The goal of this analysis was to briefly explore the sensitivity of the agent\u2019s performance to its specific setup, and future work should extend these analyses to deeper characterization of a broader range of architectures.\n\nFigure 6: Agent gaze trajectories. (a) The gaze position of the agent was recorded at each time point throughout the trial for rewarded trials of varying sample interval durations. Each panel shows a schematic of the environment, with the black cross representing the central trial-initiation gaze target and the green square representing the Go target. Each subpanel shows the trajectory of gaze position over time (gray line) averaged across trials of one specific sample interval duration. For reference, the red dot corresponds to the moment when the \u201cReady\u201d cue appeared, and the yellow dot to when the \u201cSet\u201d cue appeared. The upper left panel shows the mean over trials with the shortest sample interval, increasing rightward, with the bottom right showing the longest. The large right-side panel shows the trajectories overlaid, colored according to sample interval duration (the darkest red corresponds to the longest sample interval, i.e. the trajectory in the bottom-right small panel). (b) The same as shown in (a), but for the feedforward agent. (c) Three more pairs of examples from different seeds comparing the recurrent and feedforward agents.\n\nFigure 7: Performance with architectural variants. Display conventions are as in Fig. 3b. LSTM/Feedforward size indicates the number of hidden units in the controller (8 for the LSTM variant and 512 for the feedforward variant shown, as compared to 128 in all prior figures). Vanilla RNN, GRU, and RMC were substituted in as replacements for the LSTM/feedforward controllers. Frozen LSTM is an LSTM controller whose parameters are not trainable. 10-step RL refers to an agent that uses chunks of 10 agent-environment steps to compute policy- and baseline-gradient updates (as compared to 100 steps in all prior figures).\n\nWe first varied the number of hidden units (set at 128 throughout the study) to determine whether fewer parameters in the LSTM agent might degrade performance, or whether more parameters in the feedforward agent might augment performance. These alterations had minimal effect on the overall performance of the agents. We next trained agents in which the controller was not an LSTM or feedforward network but rather a vanilla RNN, a GRU [4], or an RMC (relational memory core) [26]. Agents with these controllers all learned the task, though the vanilla RNN and RMC exhibited some biases in performance. We proceeded to ask about the performance of a frozen LSTM: that is, its parameters were non-trainable; they were initially randomized and not changed thereafter throughout training. Interestingly, this agent learned to the same degree as the basic LSTM agent. This finding aligns with other reports that learning can occur in networks with random weights [20]. Given this apparent robustness, we then asked whether the agent would learn when the reinforcement learning algorithm was limited to policy and baseline updates on smaller segments of agent-environment interactions (i.e. fewer steps). In particular, we modified the agent such that the 100 steps used in backpropagation through time were divided into 10 chunks (of 10 steps each) for the purpose of computing the policy gradient and baseline losses for the reinforcement learning algorithm (see A.1.1 for details). In this truncated \u201c10-step RL\u201d, the reinforcement learning algorithm trained on segments far shorter than the temporal intervals to be learned. We found that while the feedforward agent (which lacks backpropagation through time) was severely impaired by this alteration, the recurrent agent was not.
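The 10-step RL manipulation can be illustrated with a simplified sketch of how return targets are computed per chunk. For clarity this uses plain bootstrapped discounted returns rather than the V-trace estimator used by the Importance Weighted Actor-Learner Architecture, and the array names are our own; it is meant only to show why reward credit cannot propagate across chunk boundaries through the RL losses.

import numpy as np

def chunk_local_targets(rewards, values, bootstrap_value, chunk_len=10, gamma=0.99):
    # rewards: length-T array for one unroll (T = 100 frames in the paper).
    # values: length-T array of baseline estimates V(s_t); bootstrap_value is V(s_T).
    # Return targets are computed independently within each chunk, bootstrapping
    # from the value estimate at the chunk boundary, so credit from a reward can
    # reach at most chunk_len steps into the past through these losses.
    T = len(rewards)
    targets = np.zeros(T)
    for start in range(0, T, chunk_len):
        end = min(start + chunk_len, T)
        running = values[end] if end < T else bootstrap_value
        for t in reversed(range(start, end)):
            running = rewards[t] + gamma * running
            targets[t] = running
    return targets

# The policy-gradient and baseline losses are then built from these targets, e.g.:
# advantages = targets - values
# policy_loss = -(log_probs * advantages).sum()
# baseline_loss = 0.5 * ((values - targets) ** 2).sum()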
4 Discussion\n\nHere we adapted an interval timing task from the neuroscience literature and used it to study deep reinforcement learning agents. We found that both recurrent and feedforward agents could solve the task in an end-to-end manner. We furthermore characterized differences in the behaviors of the agents at the levels of timing precision and generalization, hidden unit activations, and trajectories through action space. Recurrent agents implemented timers that could be characterized as counters in the LSTM hidden units, whereas feedforward agents developed stigmergy-like strategies that bear resemblance to psychophysical results from timing experiments in animals.\nThe application of neural network models to questions from experimental neuroscience can aid our understanding of neural coding in the brain [22]. The importance of understanding interval timing in deep reinforcement learning agents has been previously recognized [25], and other work has used neural networks to study time perception. For example, [16] and [27] explored computational models of time perception and its relation to environmental stimuli. In addition to the temporal reproduction task we studied here, there exist other timing tasks that are commonly used in the animal literature, such as temporal production [17] and temporal discrimination [23] tasks. We have therefore also generated tasks like these in PsychLab for future study, and we are open-sourcing all of these tasks as part of this contribution.1\nFuture work should explore the ways in which different environmental and agent architectural constraints alter the solutions of the agent. Furthermore, it will be useful to determine how findings from interval timing tasks like these, performed in controlled psychology-like environments, generalize to more complex domains where interval timing is necessary but is not the primary goal. Finally, characterizing agents\u2019 solution space for fundamental abilities like timing will be useful in designing future challenges and solutions for more complex tasks in AI. Perhaps the stigmergic behavior we uncovered in this study indicates the broader importance of deeply characterizing \u2013 and possibly controlling \u2013 agent behaviors in conjunction with their architectures when designing and studying intelligent abilities.\n\n1Available at https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/psychlab\n\nAcknowledgements We thank Neil Rabinowitz for helpful discussions on data interpretation and experiment design.\n\nReferences\n[1] Luigi Acerbi, Wei Ji Ma, and Sethu Vijayakumar. A framework for testing identifiability of Bayesian models of perception. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. 
Weinberger, editors, Advances in Neural Information Processing Systems 27, pages\n1026\u20131034. Curran Associates, Inc., 2014.\n\n[2] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich\nK\u00fcttler, Andrew Lefrancq, Simon Green, V\u00edctor Vald\u00e9s, Amir Sadik, Julian Schrittwieser, Keith\nAnderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King,\nDemis Hassabis, Shane Legg, and Stig Petersen. Deepmind lab. CoRR, abs/1612.03801, 2016.\n\n[3] C. V. Buhusi and W. H. Meck. What makes us tick? Functional and neural mechanisms of\n\ninterval timing. Nat. Rev. Neurosci., 6(10):755\u2013765, Oct 2005.\n\n[4] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,\nHolger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-\ndecoder for statistical machine translation. arXiv, 2014.\n\n[5] R. M. Church and M. Z. Deluty. Bisection of temporal intervals. J Exp Psychol Anim Behav\n\nProcess, 3(3):216\u2013228, Jul 1977.\n\n[6] Lasse Espeholt, Hubert Soyer, R\u00e9mi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward,\nYotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu.\nIMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures.\nCoRR, abs/1802.01561, 2018.\n\n[7] John Gibbon. Scalar expectancy theory and weber\u2019s law in animal timing. Psychological\n\nReview, 84:279\u2013325, 03 1977.\n\n[8] E. Hazeltine, L. L. Helmuth, and R. B. Ivry. Neural mechanisms of timing. Trends Cogn. Sci.\n\n(Regul. Ed.), 1(5):163\u2013169, Aug 1997.\n\n[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\n\nrecognition. CoRR, abs/1512.03385, 2015.\n\n[10] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735\u2013\n\n1780, November 1997.\n\n[11] Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico\nCarnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by\ntransporting value. arXiv preprint arXiv:1810.06721, 2018.\n\n[12] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia\nCastaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, Nicolas\nSonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray\nKavukcuoglu, and Thore Graepel. Human-level performance in \ufb01rst-person multiplayer games\nwith population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281, 2018.\n\n[13] M. Jazayeri and M. N. Shadlen. Temporal context calibrates interval timing. Nat. Neurosci.,\n\n13(8):1020\u20131026, Aug 2010.\n\n[14] Mehrdad Jazayeri and Michael N Shadlen. A neural mechanism for sensing and reproducing a\n\ntime interval. Current Biology, 25(20):2599\u20132609, 2015.\n\n[15] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scienti\ufb01c tools for\n\nPython, 2001\u2013.\n\n[16] Uma R Karmarkar and Dean V Buonomano. Timing in the absence of clocks: encoding time in\n\nneural network states. Neuron, 53(3):427\u2013438, 2007.\n\n9\n\n\f[17] R. Kawai, T. Markman, R. Poddar, R. Ko, A. L. Fantana, A. K. Dhawale, A. R. Kampff, and\nB. P. Olveczky. Motor cortex is required for learning but not for executing a motor skill. Neuron,\n86(3):800\u2013812, May 2015.\n\n[18] P. R. Killeen and J. G. Fetterman. A behavioral theory of timing. 
Psychol Rev, 95(2):274\u2013295,\n\nApr 1988.\n\n[19] Joel Z Leibo, Cyprien de Masson d\u2019Autume, Daniel Zoran, David Amos, Charles Beattie, Keith\nAnderson, Antonio Garc\u00eda Casta\u00f1eda, Manuel Sanchez, Simon Green, Audrunas Gruslys, et al.\nPsychlab: a psychology laboratory for deep reinforcement learning agents. arXiv preprint\narXiv:1801.08116, 2018.\n\n[20] T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman. Random synaptic feedback\n\nweights support error backpropagation for deep learning. Nat Commun, 7:13276, 11 2016.\n\n[21] Volodymyr Mnih, Adri\u00e0 Puigdom\u00e8nech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lill-\nicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep\nreinforcement learning. CoRR, abs/1602.01783, 2016.\n\n[22] A Emin Orhan and Wei Ji Ma. A diverse range of factors affect the nature of neural representa-\n\ntions underlying short-term memory. Nature neuroscience, 22(2):275, 2019.\n\n[23] S. Pai, J. C. Erlich, C. Kopec, and C. D. Brody. Minimal impairment in a rat model of duration\ndiscrimination following excitotoxic lesions of primary auditory and prefrontal cortices. Front\nSyst Neurosci, 5:74, 2011.\n\n[24] J. J. Paton and D. V. Buonomano. The Neural Basis of Timing: Distributed Mechanisms for\n\nDiverse Functions. Neuron, 98(4):687\u2013705, May 2018.\n\n[25] E. A. Petter, S. J. Gershman, and W. H. Meck. Integrating Models of Interval Timing and\n\nReinforcement Learning. Trends Cogn. Sci. (Regul. Ed.), 22(10):911\u2013922, Oct 2018.\n\n[26] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber,\nDaan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent\nneural networks. arXiv:1806.01822, 2018.\n\n[27] Marta Su\u00e1rez-Pinilla, Kyriacos Nikiforou, Zafeirios Fountas, Anil Seth, and Warrick Roseboom.\nPerceptual content, not physiological signals, determines perceived duration when viewing\ndynamic, natural scenes. PsyArXiv:10.31234/osf.io/zste8, Dec 2018.\n\n[28] G. Theraulaz and E. Bonabeau. A brief history of stigmergy. Artif. Life, 5(2):97\u2013116, 1999.\n\n[29] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wo-\njciech M. Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo\nEwalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin\nDalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor\nCai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen,\nYuhuai Wu, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul,\nTimothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaS-\ntar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/\nalphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.\n\n[30] Greg Wayne, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-\nBarwinska, Jack Rae, Piotr Mirowski, Joel Z Leibo, Adam Santoro, Mevlana Gemici, Malcolm\nReynolds, Tim Harley, Josh Abramson, Shakir Mohamed, Danilo Rezende, David Saxton,\nAdam Cain, Chloe Hillier, David Silver, Koray Kavukcuoglu, Matthew M Botvinick, Demis\nHassabis, and Timothy Lillirap. Unsupervised predictive memory in a goal-directed agent.\narXiv preprint arXiv:1803.10760, 2018.\n\n[31] G. R. Yang, M. R. Joglekar, H. F. Song, W. T. Newsome, and X. J. Wang. Task representations\nin neural networks trained to perform many cognitive tasks. 
Nat. Neurosci., 22(2):297\u2013306, Feb 2019.\n", "award": [], "sourceid": 3627, "authors": [{"given_name": "Ben", "family_name": "Deverett", "institution": "Princeton University"}, {"given_name": "Ryan", "family_name": "Faulkner", "institution": "Deepmind"}, {"given_name": "Meire", "family_name": "Fortunato", "institution": "DeepMind"}, {"given_name": "Gregory", "family_name": "Wayne", "institution": "Google DeepMind"}, {"given_name": "Joel", "family_name": "Leibo", "institution": "DeepMind"}]}