Nathaniel Daw, Aaron C. Courville, David Touretzky
According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.
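For orientation, the tapped delay line scheme attributed above to previous models can be sketched as a plain TD(0) learner with one weight per time step since cue onset. This is a minimal illustration of that baseline only, not the semi-Markov model proposed here. The function names, trial layout, learning rate, and discount factor are all assumptions chosen for the example.

```python
import numpy as np

def tapped_delay_features(t, cue_time, n_taps):
    """One-hot vector marking how many steps have elapsed since cue onset."""
    x = np.zeros(n_taps)
    lag = t - cue_time
    if 0 <= lag < n_taps:
        x[lag] = 1.0
    return x

def run_td(n_trials=500, trial_len=20, cue_time=5, reward_time=15,
           alpha=0.1, gamma=1.0):
    """TD(0) over a fixed cue-reward interval (all values are illustrative).

    deltas[trial, t] is the prediction error on the transition t -> t+1.
    """
    n_taps = trial_len - cue_time
    w = np.zeros(n_taps)                      # one weight per delay-line tap
    deltas = np.zeros((n_trials, trial_len))
    for trial in range(n_trials):
        for t in range(trial_len):
            x_t = tapped_delay_features(t, cue_time, n_taps)
            x_next = tapped_delay_features(t + 1, cue_time, n_taps)
            r = 1.0 if t == reward_time else 0.0
            # TD error: reward plus discounted next prediction minus current one
            delta = r + gamma * (w @ x_next) - (w @ x_t)
            w += alpha * delta * x_t          # only the active tap is updated
            deltas[trial, t] = delta
    return w, deltas

w, deltas = run_td()
```

Under these assumptions the error at the reward step is large on the first trial and shrinks with training, while an error emerges on the transition into the cue. Because each tap encodes a fixed delay, the sketch gives no principled way to handle a cue-reward interval that varies from trial to trial, which is the difficulty the abstract raises.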