Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
This paper addresses the problem of biologically plausible perceptual inference in dynamical environments. In particular, it considers situations in which informative sensory information arrives delayed with respect to the underlying state and thus requires 'postdiction': updating the inference of past states given new sensory observations. The authors extend a previously published method for inference in graphical models (DDC-HM) to temporally extended encoding functions and test their model in three settings where postdiction is relevant. Overall, I find this work valuable and interesting. It could, however, be more clearly presented and should provide some relevant comparisons with alternative models.

Regarding originality, the technical contributions, (1) an extension of the DDC-HM method to temporally extended (recurrent) encoding functions and (2) a way to train the model, are original and nontrivial.

In terms of clarity, the theory is well described, but a clearer exposition of the algorithm would help (including an algorithm box). Moreover, the results could be more clearly presented, in particular the figures. Figure 1 is quite difficult to parse: the relevant comparison (between panels B and C) takes significant time to decode, which could be solved by better or additional labels or annotations, and the inset occupies much of the figure but is, I think, never referred to. In Figure 2, I could not find the gray line in panel A, perhaps because of its poor resolution. I also failed to find the precise definitions of t_0 and \tau in the internal model.

Most importantly, in my opinion, the work lacks a comparison of the proposed model with other techniques for variational inference in dynamical environments (e.g., variational recurrent neural networks). Even without taking biological plausibility into account, it would be very interesting to know whether alternative models can account for the behavioral phenomena or whether this is a particular feature of the proposed method.
Without that comparison, the work seems half done.

Small details: L179, "simple internal model for tone and noise stimuli described by (19) in Appendix C.1." In Fig 3, panels C-F are all labeled C.

===UPDATE===
I'd like to thank the authors for a detailed and thoughtful response to the reviews. After reading the rebuttal, I'm overall pleased with the work. In my view, however, adding a comparison with another variational inference method (even one that is clearly not biologically plausible) would strengthen the paper's results. I have therefore chosen to maintain my score (6. Marginally above acceptance...).
After reading the Author Summary: The authors have clarified some of my doubts and addressed my concerns. I also find the discussion about signatures of internal representation quite interesting, and I recommend including it in the revised paper. Overall, I confirm my original judgment that this is an important contribution to NeurIPS and theoretical neuroscience.

Summary: In this paper, the authors extend the theory of distributed distributional codes (DDC) to address the issue of how the brain might implement online filtering and postdiction. They provide three example applications (an auditory continuity illusion, the flash-lag effect, and tracking under noise and occlusions) that qualitatively reproduce findings from behavioral studies.

Originality: High. There is relatively little work on building dynamical Bayesian inference models that account for *postdiction* effects, and this is a novel extension and application of the recently proposed distributed distributional codes.

Quality: High. This is a thoughtful, well-researched paper which provides an interesting theoretical contribution and well-designed simulations.

Clarity: The paper is well written and well structured, and the images provide useful information.

Significance: This is an interesting contribution, both for the attention it brings to the phenomenon of postdiction and for its application of the DDC.

Major comments: This is a solid, interesting paper and a pleasure to read, and I congratulate the authors for making their code entirely available (although, as good practice, I recommend commenting the code more, even in production). The main point of DDC is to provide a means to represent complex probability distributions, and to learn and compute with them in a simple, biologically plausible way (namely, using only linear operations or the delta rule); this paper shows how the DDC framework can be extended to deal with online filtering (and postdiction of the entire stimulus history).
One question I have concerns the tuning functions \gamma. First, it might help to clarify in Section 2 that the tuning functions \gamma are given (and not learnt) in the DDC setup. Could they be learnt as well? Second, the current work uses tanh nonlinearities after random linear projections (as explained in the Supplementary Material, but I recommend specifying this in the main text as well, perhaps in Section 4). I gather that the results are relatively robust to changes in the set of \gamma (as long as it constitutes a "good" basis set); did the authors check the sensitivity of their results to different randomized basis sets?

Another point about DDCs is that they are not easily amenable to computations beyond the calculation of expectations (the authors need several additional methods to compute the marginals and, in particular, the joint posteriors, as explained in the Supplementary Material), as opposed, e.g., to the popular "sampling" proposal. The fact that DDCs provide "implicit" representations might be an advantage in explaining why humans are good at implicit computations of uncertainty but not necessarily at reporting them explicitly. Can the authors elaborate briefly on consequences/signatures of their model that would allow it to be distinguished from competing proposals for internal representations?

Minor comments: lines 153-154: "Given that there exists a minimum in (8), the minimum of (9) also exists for most well-behaved distributions on x1:t-1, and is attained if W minimizes (8) on all possible x1:t-1." Can the authors expand a bit on this? Clearly if W minimizes (8) on all possible paths (which seems a fairly strong assumption) then it also satisfies (9); but what matters is that a solution satisfying (9) does not necessarily imply (8), unless there is some additional property. (This is partially addressed in Section B of the Supplementary Material.)
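For concreteness, here is a minimal sketch (my own illustration, not the authors' code) of the DDC idea as I understand it: a distribution q over Z is encoded by the vector of expected feature activations r = E_q[\gamma(Z)], with \gamma built from tanh nonlinearities after random linear projections, and the expectation of any function f with f(z) ≈ \alpha·\gamma(z) is then decoded linearly as \alpha·r. All names, dimensions, and parameter scales below are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed (not learnt) tuning functions: tanh of random linear projections,
# gamma_k(z) = tanh(w_k * z + b_k). Scales here are arbitrary choices.
K = 200
W = rng.normal(scale=2.0, size=K)
b = rng.normal(scale=1.0, size=K)

def gamma(z):
    # z: array of shape (N,) -> feature matrix of shape (N, K)
    return np.tanh(np.outer(z, W) + b)

# Fit a linear readout alpha so that f(z) ~= alpha @ gamma(z), using
# least squares over samples from a broad training distribution
# (a batch stand-in for delta-rule learning of the readout).
z_train = rng.normal(scale=3.0, size=5000)
f = lambda z: z ** 2
alpha, *_ = np.linalg.lstsq(gamma(z_train), f(z_train), rcond=None)

# DDC encoding of a specific distribution q = N(1, 0.5^2):
# r = E_q[gamma(Z)], estimated here from samples.
z_q = 1.0 + 0.5 * rng.normal(size=20000)
r = gamma(z_q).mean(axis=0)

# Expectation under q decoded linearly from the code:
# E_q[f(Z)] ~= alpha @ r; the true value is mu^2 + sigma^2 = 1.25.
print(alpha @ r)
```

The point this illustrates is the one in my comment above: expectations of (pre-fitted) functions are a single linear readout of the code r, whereas anything beyond expectations requires extra machinery.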
Figure 2A: I recommend re-drawing panel A based on the data in the original paper, because the current panel is unreadable.

Typos:
line 44: percpets --> percepts
line 57: "it is a challenge for the brain learns to perform inference" --> unclear phrasing (if "for" is meant as "because", the subject is hanging: what is a challenge? otherwise, there must be a typo)
line 61: analyitical --> analytical
line 63: can be approximated the optimal solutions --> can approximate the optimal solutions
line 66: that address --> that addresses
line 83: easy to samples --> easy to sample (from)
line 85: a full stop or semicolon is missing at the end of the line
line 141: "time steps" missing at the end? (and full stop)
line 150: recogntion --> recognition
line 156: h_W^bil --> W should be italicized
line 169: for each experiments --> for each experiment
line 179: the internal model is described by Eq. 15 in the appendix? (not Eq. 19)
Fig 1 caption: "showing decoded marginal distribution for perceived over tone level"? (probably perceived --> posterior)
Fig 1 caption: marked form A to F --> marked from A to F
Fig 2 caption: posdition --> postdiction
line 245: bimodalality --> bimodality
References: please double-check (e.g., reference  is missing the authors).
Fig 3 caption: estiamted --> estimated

Supplementary Material:
line 382: used to be approximated the conditional --> used to approximate the conditional
line 384: one interpret --> one interprets
line 386: Approximation solution --> Approximated solution
line 418: the middle band of reflects... --> ?
line 423: something is wrong at the end of the line
line 433: one can decoding --> one can decode
line 450: new observation --> the new observation
The authors propose the DDC as a framework for representing an inference network that describes a stochastic latent dynamical system. The authors claim that the DDC framework is what makes this method "neurally plausible." They describe their method in detail, describe a learning procedure, and demonstrate its utility by showing that it reproduces some psychophysical observations from human studies.

The paper appears to be well researched and technically well developed but lacks somewhat in clarity. In particular, the development of the DDC framework and learning rules is very detailed, but the authors are a bit loose with language, and it is not clear from paragraph to paragraph where they are going. Since the DDC is not a standard approach to modeling and inference, the onus is on the authors to hold the reader's hand a bit. I outline many of these points below, along with some grammatical and spelling errors.

Equation (2): The authors do not specify what they mean by "neurons". For example, in equation (2), do the authors mean the mean firing rates of biological neurons or the activations of neurons in an artificial neural network? If it is the latter, how does this relate to real neurons, where there is point-process noise?

Line 92: the authors state "encode a random variable", but equation (2) does not seem to encode a random variable but rather the distribution of that variable. Maybe this is a distinction without a difference, but I find it helpful to be clear here. If I understand correctly, r_Z is not a function of z but just a list of numbers that is a (possibly lossy) encoding of the distribution over Z. This list of numbers is then used to compute approximate expectations with respect to that distribution. Is that correct?

Line 118: "as long as the brain is able to draw samples according to the correct internal model". Isn't the point to learn the correct model? This seems like circular reasoning.
How is the brain to learn the model of a distribution by sampling from that distribution? The same goes for the sampling described in Section 3.2.

Line 141: "for at least K_\psi/K_\gamma" - at least K_\psi/K_\gamma what? This sentence seems to have been cut off abruptly.

Eqs. (3) and (12): How do the authors propose that the brain learn the alphas? Should this be obvious? Should it be trivial that the brain knows what function it should approximate?

Lines 41-42: "however, the plausibility or otherwise of these hypotheses remains debated". It is not clear that the authors have improved on this state of affairs in this paper. The authors could comment further on this point.

Lines 190-195: This section seems superfluous. Can the authors describe what relevance this section has to the literature? What is the point they are making here? Maybe this space could have been better used by fleshing out the method more carefully.

Figure 1: Why are there multiple buses displayed at some time points?

Line 46: "The causal process of realistic..." --> "The causal process of a realistic..."
Line 49: "analytical solution are..." --> "analytical solution is..."
Line 57: "for the brain learns..." --> "for the brain to learn..."
Line 83: "easy to samples" --> "easy to sample"