{"title": "Brain Inspired Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1129, "page_last": 1136, "abstract": null, "full_text": "Brain Inspired Reinforcement Learning \n\n \n \n\n \n \n \n \n \n \n \n \n\nFran\u00e7ois Rivest* \n\nYoshua Bengio \n\nD\u00e9partement d\u2019informatique et de recherche op\u00e9rationnelle \n\nUniversit\u00e9 de Montr\u00e9al \n\n \n\nCP 6128 succ. Centre Ville, Montr\u00e9al, QC H3C 3J7, Canada \n\nfrancois.rivest@mail.mcgill.ca \n\nbengioy@iro.umontreal.ca \n\n \n \n\nJohn Kalaska \n\n \n D\u00e9partement de physiologie \n Universit\u00e9 de Montr\u00e9al \nkalaskaj@physio.umontreal.ca \n\nAbstract \n\nSuccessful application of reinforcement learning algorithms often \ninvolves considerable hand-crafting of the necessary non-linear \nfeatures to reduce the complexity of the value functions and hence \nto promote convergence of the algorithm. In contrast, the human \nbrain readily and autonomously finds the complex features when \nprovided with sufficient training. Recent work in machine learning \nand neurophysiology has demonstrated the role of the basal ganglia \nand the frontal cortex in mammalian reinforcement learning. This \npaper develops and explores new \nlearning \nalgorithms \nthat provides \npotential new approaches to the feature construction problem. The \nalgorithms are compared and evaluated on the Acrobot task. \n\ninspired by neurological evidence \n\nreinforcement \n\n1 Introduction \n\nfeatures \n\n[1]. Reinforcement \n\nlearning with non-linear \n\nReinforcement learning algorithms often face the problem of finding useful complex \nnon-linear \nfunction \napproximators like backpropagation networks attempt to address this problem, but \nin many cases have been demonstrated to be non-convergent [2]. 
The major challenge faced by these algorithms is that they must learn a value function instead of learning the policy, motivating an interest in algorithms directly modifying the policy [3].\n\nIn parallel, recent work in neurophysiology shows that the basal ganglia can be modeled by an actor-critic version of temporal difference (TD) learning [4][5][6], a well-known reinforcement learning algorithm. However, the basal ganglia do not, by themselves, solve the problem of finding complex features. But the frontal cortex, which is known to play an important role in planning and decision-making, is tightly linked with the basal ganglia. The nature of their interaction is still poorly understood, and is generating a growing interest in neurophysiology.\n\n* URL: http://www.iro.umontreal.ca/~rivestfr\n\nThis paper presents new algorithms based on current neurophysiological evidence about brain functional organization. It tries to devise biologically plausible algorithms that may help overcome existing difficulties in machine reinforcement learning. The algorithms are tested and compared on the Acrobot task. They are also compared to TD using standard backpropagation as function approximator.\n\n2 Biological Background\n\nThe mammalian brain has multiple learning subsystems. Major learning components include the neocortex, the hippocampal formation (explicit memory storage system), the cerebellum (adaptive control system) and the basal ganglia (reinforcement learning, also known as instrumental conditioning).\n\nThe cortex can be argued to be equipotent, meaning that, given the same input, any region can learn to perform the same computation. Nevertheless, the frontal lobe differs by receiving a particularly prominent innervation of a specific type of neurotransmitter, namely dopamine. The large frontal lobe in primates, and especially in humans, distinguishes them from lower mammals. 
Other regions of the cortex have been modeled using unsupervised learning methods such as ICA [7], but models of learning in the frontal cortex are only beginning to emerge.\n\nThe frontal dopaminergic input arises in parts of the basal ganglia called the ventral tegmental area (VTA) and the substantia nigra (SN). The signal generated by dopaminergic (DA) neurons resembles the effective reinforcement signal of temporal difference (TD) learning algorithms [5][8]. Another important part of the basal ganglia is the striatum. This structure is made of two parts, the matriosome and the striosome. Both receive input from the cortex (mostly frontal) and from the DA neurons, but the striosome projects principally to DA neurons in VTA and SN. The striosome is hypothesized to act as a reward predictor, allowing the DA signal to compute the difference between the expected and received reward. The matriosome projects back to the frontal lobe (for example, to the motor cortex). Its hypothesized role is therefore in action selection [4][5][6].\n\nAlthough there have been several attempts to model the interactions between the frontal cortex and basal ganglia, little work has been done on learning in the frontal cortex. In [9], an adaptive learning system based on the cerebellum and the basal ganglia is proposed. In [10], a reinforcement learning model of the hippocampus is presented. In this paper, we do not attempt to model neurophysiological data per se, but rather to develop, from current neurophysiological knowledge, new and efficient biologically plausible reinforcement learning algorithms.\n\n3 The Model\n\nAll models developed here follow the architecture depicted in Figure 1. The first layer (I) is the input layer, where activation represents the current state. The second layer, the hidden layer (H), is responsible for finding the non-linear features necessary to solve the task. Learning in this layer will vary from model to model. 
\nBoth the input and the hidden layer feed the parallel actor-critic layers (A and V), which are the computational analogs of the striatal matriosome and striosome, respectively. They represent a linear actor-critic implementation of TD.\n\nThe neurological literature reports an uplink from V and the reward to DA neurons, which send back the effective reinforcement signal e (dashed lines) to A, V and H. The A action units usually feed into the motor cortex, which controls muscle activation. Here, A\u2019s are considered to represent the possible actions. The basal ganglia receive input mainly from the frontal cortex and the dopaminergic signal (e). They also receive some input from parietal cortex (which, as opposed to the frontal cortex, does not receive DA input, and hence, may be unsupervised). H will represent frontal cortex when given e and non-frontal cortex when not. The weights W, v and U correspond to the weights into the layers A, V and H respectively (e is not weighted).\n\n[Figure 1: Architecture of the models. Sensory input feeds the input layer I; I feeds the (frontal) cortex layer H through weights U; I and H feed the striatal layers A (through W) and V (through v); V and the reward feed the DA unit, which broadcasts e back to A, V and H.]\n\nLet xt be the vector of input layer activations based on the state of the environment at time t. Let f be the sigmoidal activation function of the hidden units in H. Then yt = [f(u1xt), \u2026, f(unxt)]T is the vector of activations of the hidden layer at time t, where ui is a row of the weight matrix U. Let zt = [xtT ytT]T be the state description formed by the layers I and H at time t.\n\n3.1 Actor-critic\n\nThe actor-critic model of the basal ganglia developed here is derived from [4]. It is very similar to the basal ganglia model in [5], which has been used to simulate neurophysiological data recorded while monkeys were learning a task [6]. All units are linear weighted sums of activity from the previous layers. The actor units behave under a winner-take-all rule. 
The winner\u2019s activity settles to 1, and the others to 0. The initial weights are all equal and non-negative in order to obtain an initially optimistic policy. Beginning with an overestimate of the expected reward leads every action to be negatively corrected, one after the other, until the best one remains. This usually favors exploration.\n\nThen V(zt) = vTzt. Let bt = Wzt be the vector of activations of the actor layer before the winner-take-all processing. Let at = argmaxi(bt,i) be the winning action index at time t, and let the vector ct be the activation of the layer A after the winner-take-all processing, such that ct,a = 1 if a = at, and 0 otherwise.\n\n3.1.1 Formal description\n\nTD learns a function V of the state that should converge to the expected total discounted reward. In order to do so, it updates V such that\n\n    V(zt-1) -> E[rt + \u03b3V(zt)]\n\nwhere rt is the reward at time t and \u03b3 the discount factor. A simple way to achieve that is to transform the problem into an optimization problem where the goal is to minimize:\n\n    E = [V(zt-1) - rt - \u03b3V(zt)]\u00b2\n\nIt is also useful at this point to introduce the TD effective reinforcement signal, equivalent to the dopaminergic signal [5]:\n\n    et = rt + \u03b3V(zt) - V(zt-1)\n\nThus E = et\u00b2.\n\nA learning rule for the weights v of V can then be devised by finding the gradient of E with respect to the weights v. Here, V is the weighted sum of the activity of I and H. 
Thus, the gradient is given by\n\n    \u2202E/\u2202v = 2et[\u03b3zt - zt-1]\n\nAdding a learning rate and negating the gradient for minimization gives the update:\n\n    \u2206v = \u03b1et[zt-1 - \u03b3zt]\n\nDeveloping a learning rule for the actor units and their weights W using a cost function is a bit more complex. One approach is to use the tri-hebbian rule\n\n    \u2206W = \u03b1et ct-1 zt-1T\n\nRemark that only the row vector of weights of the winning action is modified. This rule was first introduced, but not simulated, in [4]. It associates the error e with the last selected action. If the reward is higher than expected (e > 0), then the action units activated by the previous state should be reinforced. Conversely, if it is less than expected (e < 0), then the winning actor unit\u2019s activity should be reduced for that state. This is exactly what this tri-hebbian rule does.\n\n3.1.2 Biological justification\n\n[4] presented the first description of an actor-critic architecture based on data from the basal ganglia that resembles the one here. The major difference is that its V update rule did not use the complete gradient information. A similar version was also developed in [5], but with little mathematical justification for the update rule. The model presented here is simpler and the critic update rule is basically the same, but justified neurologically. Our model also has a more realistic actor update rule consistent with neurological knowledge of plasticity in the corticostriatal synapses [11] (H to V weights). The main purpose of the model presented in [5] was to simulate dopaminergic activity, for which V is the most important factor, and in this respect it was very successful [6]. 
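As a concrete illustration, the linear actor-critic updates above (the critic rule \u2206v = \u03b1et[zt-1 - \u03b3zt] and the tri-hebbian actor rule \u2206W = \u03b1et ct-1 zt-1T) can be sketched in NumPy; this is a hypothetical minimal implementation, and the learning-rate and discount values are placeholders, not the paper's settings:

```python
import numpy as np

def actor_critic_step(W, v, z_prev, z, r, alpha=0.1, gamma=0.9):
    """One linear actor-critic TD update (sketch).

    z_prev, z : state descriptions z_{t-1}, z_t (input + hidden activations)
    r         : reward r_t received at time t
    Returns the winning action index and the effective reinforcement e_t.
    """
    b = W @ z_prev                       # actor activations before winner-take-all
    a = int(np.argmax(b))                # winning action index a_{t-1}
    c = np.zeros(W.shape[0]); c[a] = 1.0 # winner-take-all activation c_{t-1}
    e = r + gamma * (v @ z) - v @ z_prev # e_t = r_t + gamma*V(z_t) - V(z_{t-1})
    v += alpha * e * (z_prev - gamma * z)     # critic update (negated gradient of e_t^2)
    W += alpha * e * np.outer(c, z_prev)      # tri-hebbian rule: only the winner's row moves
    return a, e
```

Note how the outer product with the one-hot vector c modifies only the row of W belonging to the winning action, exactly as the text describes.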
\n\n3.2 Hidden Layer\n\nBecause the reinforcement learning layer is linear, the hidden layer must learn the necessary non-linearity to solve the task. The rules below are attempts at neurologically plausible learning rules for the cortex, assuming it has no clear supervision signal other than the DA signal for the frontal cortex. All hidden unit weight vectors are initialized randomly and scaled to norm 1 after each update.\n\n\u2022 Fixed random\nThis is the baseline model to which the other algorithms will be compared. The hidden layer is composed of randomly generated hidden units that are not trained.\n\n\u2022 ICA\nIn [7], the visual cortex was modeled by an ICA learning rule. If the non-frontal cortex is equipotent, then any region of the cortex could be successfully modeled using such a generic rule. The idea of combining unsupervised learning with reinforcement learning has already proven useful [1], but there the unsupervised features were trained prior to the reinforcement training. On the other hand, [12] has shown that different systems of this sort can learn concurrently. Here, the ICA rule from [13] will be used as the hidden layer. This means that the hidden units are learning to reproduce the independent source signals at the origin of the observed mixed signal.\n\n\u2022 Adaptive ICA (e-ICA)\nIf H represents the frontal cortex, then an interesting variation of ICA is to multiply its update term by the DA signal e. The size of e may act as an adaptive learning rate whose source is the reinforcement learning system\u2019s critic. Also, if the reward is less than expected (e < 0), the features learned by the ICA unit may be more counterproductive than helpful, and e pushes the learning away from those features. 
\n\n\u2022 e-gradient method\nAnother possible approach is to base the update rule on the derivative of the objective function E applied to the hidden layer weights U, but constraining the update rule to use only information available locally. Let f\u2032 be the derivative of f; then the gradient of E with respect to U is approximated by:\n\n    \u2202E/\u2202ui = 2et vi [\u03b3f\u2032(uixt)xt - f\u2032(uixt-1)xt-1]\n\nNegating the gradient for minimization, adding a learning rate and removing the non-local weight information gives the weight update rule:\n\n    \u2206ui = \u03b1et[f\u2032(uixt-1)xt-1 - \u03b3f\u2032(uixt)xt]\n\nUsing the value of the weights v would lead to a rule that uses non-local information. The cortex is unlikely to have this, and might instead consider all the weights in v to be equal to some constant.\n\nTo avoid all neurons moving uniformly in the same direction, we encourage the units of the hidden layer to minimize their covariance. This can be achieved by adding an inhibitory neuron. Let qt be the average activity of the hidden units at time t, i.e., the inhibitory neuron activity. Let q\u0304t be the moving exponential average of qt. 
Since\n\n    Var[qt] = (1/n\u00b2) \u03a3i,j cov(yt,i, yt,j) \u2245 TimeAverage((qt - q\u0304t)\u00b2)\n\nand ignoring f\u2019s non-linearity, the gradient of Var[qt] with respect to the weights U is approximated by:\n\n    \u2202Var[qt]/\u2202ui = 2(qt - q\u0304t)xt\n\nCombined with the previous equation, this results in a new update rule:\n\n    \u2206ui = \u03b1et[f\u2032(uixt-1)xt-1 - \u03b3f\u2032(uixt)xt] + \u03b1[q\u0304t - qt]xt\n\nWhen allowing the discount factor to be different on the hidden layer, we found that \u03b3 = 0 gave much better results (e-gradient(0)).\n\n4 Simulations & Results\n\nAll models of section 3 were run on the Acrobot task [8]. This task consists of a two-link pendulum with torque applied at the middle joint. The goal is to bring the tip of the second pole to a fully upright position.\n\n4.1 The task: Acrobot\n\nThe input was coded using 12 equidistant radial basis functions for each angle and 13 equidistant radial basis functions for each angular velocity, for a total of 50 non-negative inputs. This somewhat simulates the input from joint-angle receptors. A reward of 1 was given only when the final state was reached (in all other cases, the reward of an action was 0). Only 3 actions were available (3 actor units): either -1, 0 or 1 unit of torque. The details can be found in [8].\n\n50 networks with different random initializations were run for all models for 100 episodes (an episode is the sequence of steps the network performs to achieve the goal from the start position). Episodes were limited to 10000 steps. A number of learning rate values were tried for each model (actor-critic layer learning rate, and hidden layer learning rate). 
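The radial-basis input coding described above can be sketched as follows. This is a hypothetical reconstruction: the paper specifies only the counts of equidistant basis functions, so the Gaussian widths and the angle/velocity ranges here (the standard Acrobot bounds from [8]) are assumptions:

```python
import numpy as np

def rbf_code(value, n_centers, lo, hi):
    """Encode a scalar with n_centers equidistant Gaussian radial basis functions."""
    centers = np.linspace(lo, hi, n_centers)
    width = (hi - lo) / (n_centers - 1)   # assumed width: one inter-center spacing
    return np.exp(-(((value - centers) / width) ** 2))

def acrobot_input(theta1, theta2, dtheta1, dtheta2):
    """50 non-negative inputs: 12 RBFs per joint angle, 13 per angular velocity."""
    return np.concatenate([
        rbf_code(theta1, 12, -np.pi, np.pi),
        rbf_code(theta2, 12, -np.pi, np.pi),
        rbf_code(dtheta1, 13, -4 * np.pi, 4 * np.pi),  # velocity bounds as in [8]
        rbf_code(dtheta2, 13, -9 * np.pi, 9 * np.pi),
    ])
```

Each component lies in (0, 1], giving the non-negative input vector xt fed to layers H, A and V.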
The selected parameters were the ones for which the average number of steps per episode plus its standard deviation was the lowest. All hidden layer models got a learning rate of 0.1.\n\n4.2 Results\n\nFigure 2 displays the learning curves of every model evaluated. Three variables were compared: overall learning performance (number of steps to success per episode), final performance (number of steps on the last episode), and early learning performance (number of steps for the first episode).\n\n[Figure 2: Learning curves of the models (average number of steps per episode over the 100 episodes, for Baseline, ICA, e-ICA and e-Gradient(0)).]\n\n[Figure 3: Average number of steps per episode with 95% confidence interval (Baseline, e-Gradient, e-ICA, ICA, e-Gradient(0)).]\n\n4.2.1 Space under the learning curve\n\nFigure 3 shows the average steps per episode for each model, in decreasing order. All models needed fewer steps on average than the baseline (which has no training at the hidden layer). In order to assess the performance of the models, an ANOVA analysis of the average number of steps per episode over the 100 episodes was performed. Scheff\u00e9 post-hoc analysis revealed that the performance of every model was significantly different from every other, except for e-gradient and e-ICA (which are not significantly different from each other).\n\n4.2.2 Final performance\n\nANOVA analysis was also used to determine the final performance of the models, by comparing the number of steps on the last episode. 
Scheff\u00e9 test results showed that all but e-ICA are significantly better than the baseline. Figure 4 shows the results on the last episode, in increasing order. The curved lines on top show the homogeneous subsets.\n\n[Figure 4: Number of steps on the last episode with 95% confidence interval.]\n\n[Figure 5: Number of steps on the first episode with 95% confidence interval.]\n\n4.2.3 Early learning\n\nFigure 2 shows that the models also differed in their initial learning. To assess how different those curves are, an ANOVA was run on the number of steps on the very first episode. Under this measure, e-gradient(0) and e-ICA were significantly faster than the baseline, and ICA was significantly slower (Figure 5).\n\nIt makes sense for ICA to be slower at the beginning, since it first has to stabilize before the RL system can learn from its input. Until the ICA has stabilized, the RL system has moving inputs, and hence cannot learn effectively. Interestingly, e-ICA was protected against this effect, having a start-up significantly faster than the baseline. This implies that the e signal could steer the ICA learning to move synergistically with the reinforcement learning system.\n\n4.3 External comparison\n\nAcrobot was also run using standard backpropagation with TD and an \u03b5-greedy policy. In this setup, a neural network with 50 inputs, 50 hidden sigmoidal units, and 1 linear output was used as function approximator for V. 
The network had cross-connections and its weights were initialized as in section 3, such that both architectures closely matched in terms of power. In this method, the RHS of the TD equation is used as a constant target value for the LHS. A single gradient step was applied to minimize the squared error after the result of each action. Although not different from the baseline on the first episode, it was significantly worse on overall and final performance, unable to improve consistently. This is a common problem when using backprop networks in RL without handcrafting the necessary complex features. We also tried SARSA (using one network per action), but the results were worse than with TD.\n\nThe best results we found in the literature on the exact same task are from [8]. They used SARSA(\u03bb) with a linear combination of tiles. Tile coding discretizes the input space into small hyper-cubes, and a few overlapping tilings were used. From the available reports, their first trial could be slower than e-gradient(0), but they could reach better final performance after more than 100 episodes, with a final average of 75 steps (after 500 episodes). On the other hand, their function had about 75000 weights while all our models used 2900 weights.\n\n5 Discussion\n\nIn this paper we explored a new family of biologically plausible reinforcement learning algorithms inspired by models of the basal ganglia and the cortex. They use a linear actor-critic model of the basal ganglia and were extended with a variety of unsupervised and partially supervised learning algorithms inspired by brain structures. The results showed that pure unsupervised learning slowed down learning and that a simple quasi-local rule at the hidden layer greatly improved performance. Results also demonstrated the advantage of such a simple system over the use of function approximators such as backpropagation. 
Empirical results indicate a strong potential for some of the combinations presented here. It remains to test them on further tasks, and to compare them to more reinforcement learning algorithms. Possible loops from the actor units to the hidden layer are also to be considered.\n\nAcknowledgments\n\nThis research was supported by a New Emerging Team grant to John Kalaska and Yoshua Bengio from the CIHR. We thank Doina Precup for helpful discussions.\n\nReferences\n\n[1] Foster, D. & Dayan, P. (2002) Structure in the space of value functions. Machine Learning 49(2):325-346.\n[2] Tsitsiklis, J.N. & Van Roy, B. (1996) Feature-based methods for large scale dynamic programming. Machine Learning 22:59-94.\n[3] Sutton, R.S., McAllester, D., Singh, S. & Mansour, Y. (2000) Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12, pp. 1057-1063. MIT Press.\n[4] Barto, A.G. (1995) Adaptive critics and the basal ganglia. In Models of Information Processing in the Basal Ganglia, pp. 215-232. Cambridge, MA: MIT Press.\n[5] Suri, R.E. & Schultz, W. (1999) A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience 91(3):871-890.\n[6] Suri, R.E. & Schultz, W. (2001) Temporal difference model reproduces anticipatory neural activity. Neural Computation 13:841-862.\n[7] Doi, E., Inui, T., Lee, T.-W., Wachtler, T. & Sejnowski, T.J. (2003) Spatiochromatic receptive field properties derived from information-theoretic analysis of cone mosaic responses to natural scenes. Neural Computation 15:397-417.\n[8] Sutton, R.S. & Barto, A.G. (1998) Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.\n[9] Doya, K. (1999) What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Networks 12:961-974.\n[10] Foster, D.J., Morris, R.G.M., & Dayan, P. 
(2000) A model of hippocampally dependent navigation, using the temporal difference learning rule. Hippocampus 10:1-16.\n[11] Wickens, J. & K\u00f6tter, R. (1995) Cellular models of reinforcement. In Models of Information Processing in the Basal Ganglia, pp. 187-214. Cambridge, MA: MIT Press.\n[12] Whiteson, S. & Stone, P. (2003) Concurrent layered learning. In Proceedings of the 2nd International Joint Conference on Autonomous Agents & Multi-agent Systems.\n[13] Amari, S.-I. (1999) Natural gradient learning for over- and under-complete bases in ICA. Neural Computation 11:1875-1883.\n", "award": [], "sourceid": 2749, "authors": [{"given_name": "Fran\u00e7ois", "family_name": "Rivest", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "John", "family_name": "Kalaska", "institution": null}]}