R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making

Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)

AuthorFeedback Bibtex MetaReview Paper Review Supplemental

Authors

Sergey Shuvaev, Sarah Starosta, Duda Kvitsiani, Adam Kepecs, Alexei Koulakov

Abstract

In real-world settings, we repeatedly decide whether to pursue better conditions or to keep things unchanged. Examples include time investment, employment, entertainment preferences etc. How do we make such decisions? To address this question, the field of behavioral ecology has developed foraging paradigms – the model settings in which human and non-human subjects decided when to leave depleting food resources. Foraging theory, represented by the marginal value theorem (MVT), provided accurate average-case stay-or-leave rules consistent with behaviors of subjects towards depleting resources. Yet, the algorithms underlying individual choices and ways to learn such algorithms remained unclear. In this work, we build interpretable deep actor-critic models to show that R-learning – a reinforcement learning (RL) approach balancing short-term and long-term rewards – is consistent with the way real-life agents may learn making stay-or-leave decisions. Specifically we show that deep R-learning predicts choice patterns consistent with behavior of mice in foraging tasks; its TD error, the training signal in our model, correlates with dopamine activity of ventral tegmental area (VTA) neurons in the brain. Our theoretical and experimental results show that deep R-learning agents leave depleting reward resources when reward intake rates fall below their exponential averages over past trials. This individual-case decision rule, learned within RL and matching the MVT on average, bridges the gap between these major approaches to sequential decision-making. We further argue that our proposed decision rule, resulting from R-learning and consistent with animals’ behavior, is Bayes optimal in dynamic real-world environments. Overall, our work links available sequential decision-making theories including the MVT, RL, and Bayesian approaches to propose the learning mechanism and an optimal decision rule for sequential stay-or-leave choices in natural environments.