{"title": "Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability", "book": "Advances in Neural Information Processing Systems", "page_first": 2357, "page_last": 2365, "abstract": "It has been argued that perceptual multistability reflects probabilistic inference performed by the brain when sensory input is ambiguous. Alternatively, more traditional explanations of multistability refer to low-level mechanisms such as neuronal adaptation. We employ a Deep Boltzmann Machine (DBM) model of cortical processing to demonstrate that these two different approaches can be combined in the same framework. Based on recent developments in machine learning, we show how neuronal adaptation can be understood as a mechanism that improves probabilistic, sampling-based inference. Using the ambiguous Necker cube image, we analyze the perceptual switching exhibited by the model. We also examine the influence of spatial attention, and explore how binocular rivalry can be modeled with the same approach. Our work joins earlier studies in demonstrating how the principles underlying DBMs relate to cortical processing, and offers novel perspectives on the neural implementation of approximate probabilistic inference in the brain.", "full_text": "Neuronal Adaptation for Sampling-Based\n\nProbabilistic Inference in Perceptual Bistability\n\nDavid P. Reichert, Peggy Seri\u00e8s, and Amos J. Storkey\n\n{d.p.reichert@sms., pseries@inf., a.storkey@} ed.ac.uk\n\nSchool of Informatics, University of Edinburgh\n\n10 Crichton Street, Edinburgh, EH8 9AB\n\nAbstract\n\nIt has been argued that perceptual multistability re\ufb02ects probabilistic inference\nperformed by the brain when sensory input is ambiguous. Alternatively, more\ntraditional explanations of multistability refer to low-level mechanisms such as\nneuronal adaptation. 
We employ a Deep Boltzmann Machine (DBM) model of cortical processing to demonstrate that these two different approaches can be combined in the same framework. Based on recent developments in machine learning, we show how neuronal adaptation can be understood as a mechanism that improves probabilistic, sampling-based inference. Using the ambiguous Necker cube image, we analyze the perceptual switching exhibited by the model. We also examine the influence of spatial attention, and explore how binocular rivalry can be modeled with the same approach. Our work joins earlier studies in demonstrating how the principles underlying DBMs relate to cortical processing, and offers novel perspectives on the neural implementation of approximate probabilistic inference in the brain.\n\n1 Introduction\n\nBayesian accounts of cortical processing posit that the brain implements a probabilistic model to learn and reason about the causes underlying sensory inputs. The nature of the potential cortical model and its means of implementation are hotly debated. Of particular interest in this context is bistable perception, where the percept switches over time between two interpretations of an ambiguous stimulus such as the Necker cube, or between two different images presented to either eye in binocular rivalry [1]. In these cases, ambiguous or conflicting sensory input could result in a bimodal posterior over image interpretations in a probabilistic model, and perceptual bistability could reflect the specific way the brain explores and represents this posterior [2, 3, 4, 5, 6]. Unlike more classic accounts that explain bistability with low-level mechanisms such as neuronal fatigue (e.g. [7, 8]), arguably making it more of an epiphenomenon, the probabilistic approaches see bistability as a fundamental aspect of how the brain implements probabilistic inference.\n\nRecently, it has been suggested that the cortex could employ approximate inference schemes, e.g. by estimating probability distributions with a set of samples, and studies show how electrophysiological [9] and psychophysical [10] data can be interpreted in that light. Gershman et al. [6] focus on binocular rivalry and point out how Markov Chain Monte Carlo (MCMC) algorithms in particular, where correlated samples are drawn over time to approximate distributions, might naturally account for aspects of perceptual bistability, such as its stochasticity and the fact that perception at any point in time reflects only an individual interpretation of the image rather than a full distribution over possibilities. Gershman et al. do not provide a concrete neural model, however.\n\nIn earlier work, we considered Deep Boltzmann Machines (DBMs) as models of cortical perception, and related hierarchical inference in these generative models to hallucinations [11] and attention [12]. With the connection between MCMC and bistability established, it is natural to explore DBMs as models of bistability as well, because Gibbs sampling, an MCMC method, can be performed to do inference. Importantly from a neuroscientific perspective, Gibbs sampling in Boltzmann machines simply corresponds to the 'standard' way of running the DBM as a neural network with stochastic firing of the units. However, it is well known that MCMC methods in general and Gibbs sampling in particular can be problematic in practice for complex, multi-modal distributions, as the sampling algorithm can get stuck in individual modes ('the chain does not mix'). In very recent machine learning work, Breuleux et al. 
[13] introduced a heuristic algorithm called Rates Fast Persistent Contrastive Divergence (rates-FPCD) that aims to improve sampling performance in a Boltzmann machine model by dynamically changing the model parameters, such as the connection strengths. In closely related work, Welling [14] suggested a potential connection to dynamic synapses in the brain. Hence, neuronal adaptation, here meant to be temporary changes to neuronal excitability and synaptic efficacy, could actually be seen as a means of enhancing sampling-based inference [2].\n\nWe thus aim to demonstrate how the low-level and probabilistic accounts of bistable perception can be combined. We present a biological interpretation of rates-FPCD in terms of neuronal adaptation, or neuronal fatigue and synaptic depression specifically. Using a DBM that was trained on the two interpretations of the Necker cube, we show how such adaptation leads to bistable switching of the internal representations when the model is presented with the actual ambiguous Necker cube. Moreover, we model the role of spatial attention in biasing the perceptual switching. Finally, we explore how the same approach can also be applied to binocular rivalry.\n\n2 Neuronal adaptation in a Deep Boltzmann Machine\n\nIn this section we briefly introduce the DBM and the rates-FPCD algorithm as it was motivated from a machine learning perspective, and then explain the latter's relation to biology. A DBM [15] consists of stochastic binary units arranged hierarchically in several layers, with symmetric connections between layers and no connections within a layer. 
The first layer contains the visible units that are clamped to data, such as images, during inference, whereas the higher layers contain hidden units that learn representations from which they can generate the data in the visibles. With the states in layer k denoted by x^{(k)}, connection weights W^{(k)} and biases b^{(k)}, the probability for a unit to switch on is determined by the input it gets from adjacent layers, using a sigmoid activation function:\n\nP(x_i^{(k)} = 1 | x^{(k-1)}, x^{(k+1)}) = \left(1 + \exp\left(-\sum_l w_{li}^{(k-1)} x_l^{(k-1)} - \sum_m w_{im}^{(k)} x_m^{(k+1)} - b_i^{(k)}\right)\right)^{-1}. (1)\n\nRunning the network by switching units on and off in this manner implements Gibbs sampling on a probability distribution determined by an energy function E,\n\nP(x) \propto \exp(-E(x)) with E(x) = \sum_k \left(-x^{(k)T} W^{(k)} x^{(k+1)} - x^{(k)T} b^{(k)}\right). (2)\n\nIntuitively speaking, when run, the model performs a random walk in the energy landscape shaped during learning, where it is attracted to ravines. Jumping between high-probability modes of the distribution corresponds to traversing from one ravine to another.\n\n2.1 Rates-FPCD, neuronal fatigue and synaptic depression\n\nUnfortunately, for many realistically complex inference tasks MCMC methods such as Gibbs sampling are prone to get stuck in individual modes, resulting in an incomplete exploration of the distribution, and there is much work in machine learning on improving sampling methods. One recently introduced algorithm is rates-FPCD (Rates Fast Persistent Contrastive Divergence) [13], which was utilized to sample from Restricted Boltzmann Machines (RBMs), the two-layer building blocks of DBMs. Rates-FPCD is based on FPCD [16], which is used for training. 
Briefly, in FPCD one contribution to the weight training updates requires the model to be run continuously and independently of the data to explore the probability distribution as it is currently learned. Here it is important that the model does not get stuck in individual modes. It was found that introducing a fast-changing component to the weights (and biases) to dynamically and temporarily change the energy landscape can alleviate this problem. These fast weights W_f, which are added to the actual weights W, and the analogous fast biases b_f^{(k)}, are updated according to\n\nW_f \leftarrow \alpha W_f + \epsilon\left(x^{(0)} p(x^{(1)}|x^{(0)})^T - x'^{(0)} x'^{(1)T}\right), (3)\nb_f^{(0)} \leftarrow \alpha b_f^{(0)} + \epsilon\left(x^{(0)} - x'^{(0)}\right), (4)\nb_f^{(1)} \leftarrow \alpha b_f^{(1)} + \epsilon\left(p(x^{(1)}|x^{(0)}) - x'^{(1)}\right). (5)\n\nHere, the visibles x^{(0)} are clamped to the current data item.1 x'^{(0)} and x'^{(1)} are current samples from the freely run model. ε is a parameter determining the rate of adaptation, and α ≤ 1 is a decay parameter that limits the amount of weight change contributed by the fast weights. The second term in each of the parentheses has the effect of changing the weights and biases such that whatever states are currently being sampled by the model are made less likely in the following. Hence, this will eventually 'push' the model out of a mode it is stuck in. The first terms in the parentheses are computed over the data and lead to the model being drawn to states supported by the current input. Computation of the first terms in the parentheses in equations 3-5 requires the training data. 
To turn FPCD into a general sampling algorithm applicable outside of training, when the training data is no longer around, rates-FPCD simply replaces the first terms with the so-called rates, which are the pairwise and unitary statistics averaged over all training data:\n\nW_f \leftarrow \alpha W_f + \epsilon\left(E[x^{(0)} x^{(1)T}] - x'^{(0)} x'^{(1)T}\right), (6)\nb_f^{(0)} \leftarrow \alpha b_f^{(0)} + \epsilon\left(E[x^{(0)}] - x'^{(0)}\right), (7)\nb_f^{(1)} \leftarrow \alpha b_f^{(1)} + \epsilon\left(E[x^{(1)}] - x'^{(1)}\right) (8)\n\n(x^{(1)} is sampled conditioned on the data). The rates are to be computed during training, but can then be used for sampling afterwards. It was found that these terms sufficiently serve to stabilize the sampling scheme, and that rates-FPCD yielded improved performance over Gibbs sampling [13].\n\nLet us consider equations 6-8 from a biological perspective, interpreting the weight parameters as synaptic strengths and the biases as some overall excitability level of a neuron. The equations suggest that the capability of the network to explore the state space is improved by dynamically adjusting the neuron's parameters (cf. e.g. [17]) depending on the current states of the neuron and its connected partners (second terms in parentheses), drawing them towards some set values (first terms, the rate statistics). 
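To make the mechanism concrete, here is a minimal NumPy sketch of one rates-FPCD fast-parameter update for an RBM, following equations 6-8; the function and variable names are our own illustration (they do not appear in [13]), and the stored rate statistics would be collected during training:

```python
import numpy as np

def rates_fpcd_update(Wf, bf0, bf1, rates_W, rates_b0, rates_b1,
                      v_sample, h_sample, alpha=0.95, eps=0.001):
    """One rates-FPCD fast-parameter update (cf. equations 6-8).

    Wf, bf0, bf1       -- fast weights/biases that get added to the model parameters
    rates_W, rates_b*  -- pairwise and unitary statistics E[x(0)x(1)T], E[x(0)], E[x(1)]
                          averaged over the training data
    v_sample, h_sample -- current samples x'(0), x'(1) from the freely running model
    alpha, eps         -- decay and adaptation-rate parameters
    """
    # Decay towards zero, then push away from the currently sampled state
    # and towards the stored average (rate) statistics.
    Wf = alpha * Wf + eps * (rates_W - np.outer(v_sample, h_sample))
    bf0 = alpha * bf0 + eps * (rates_b0 - v_sample)
    bf1 = alpha * bf1 + eps * (rates_b1 - h_sample)
    return Wf, bf0, bf1
```

Note how a unit that is currently on (and whose average training activity is low) has its fast bias and incoming fast weights pushed down, which is exactly the fatigue interpretation developed below.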
All that is needed for the rate statistics is that the neuron stores its average firing activity during learning (for the bias statistics) and that the synapses remember some average firing correlation between connected neurons (for the weight statistics). In particular, if activation patterns in the network are sparse and neurons are off most of the time, then these average terms will be rather low. During inference,2 the neuron will fire strongly for its preferred stimulus (or stimulus interpretation), but then its firing probability will decrease as its excitability and synaptic efficacy drop, allowing the network to discover potential alternative interpretations of the stimulus. Thus, in the case of sparse activity, equations 6-8 implement a form of neuronal fatigue and synaptic depression.\n\nPreceding the introduction of rates-FPCD as a sampling algorithm, we also utilized the same mechanism (but applied only to the biases) in a biological model of hallucinations [11] to model homeostatic [18] regulation of neuronal firing. We showed how it helps to make the system more robust against noise corruption in the input, though it can lead to hallucinations under total sensory deprivation. Hence, the same underlying mechanism could be understood either as short-term neuronal adaptation or as longer-term homeostatic regulation, depending on the time scales involved.\n\n3 Experiments: Necker cube\n\nWe trained a DBM on binary images of cubes at various locations, representing the two unambiguous interpretations of the Necker cube, and then tested the model on the actual, ambiguous Necker cube (Figure 1a). We use a similar setup3 to that described in [11, 12], with localized receptive fields whose size increased from lower to higher hidden layers, and with sparsity encouraged simply by initializing the biases to negative values in training. As in the aforementioned studies, we are interested in what is inferred in the hidden layers as the image is presented in the visibles, and 'decode' the hidden states by computing a reconstructed image for each hidden layer. To this end, starting with the states of the hidden layer of interest, the activations (i.e. firing probabilities) in each subsequent lower layer are computed deterministically in a single top-down pass, doubling the weights to compensate for the lack of bottom-up input, until a reconstructed image is obtained in the visibles. In this way, the reconstructed image is determined by the states in the initial layer alone, independently of the actual current states in the other layers.\n\n1In practice, minibatches are used.\n2Applied in a DBM, not an RBM; see next section.\n\n(a) Training and test set examples. (b) Perceptual bistability.\n\nFigure 1: (a): Examples of the unambiguous training images (left) and the ambiguous test images (right). (b): During inference on an ambiguous image, the decoded hidden states reveal perceptual switching resulting from neuronal adaptation. Four consecutive sampling cycles are shown.\n\nWhen presented with a Necker cube image, the hidden states were found to converge within a few sampling cycles (each consisting of one up and one down pass of sampling all hidden layers) to one of the unambiguous interpretations and remained there, exhibiting no perceptual switching to the respective alternative interpretation.4 We then employed rates-FPCD to model neuronal adaptation.5 It should be noted that unlike in [13], we utilize it in a DBM rather than an RBM, and during inference instead of when generating data samples (i.e. in our case the visibles are always clamped to an image). The rate statistics were computed by measuring unit activities and pairwise correlations when the trained model was run on the training data. With neuronal adaptation, the internal representations as decoded from the hidden layers were found to switch over time between the two image interpretations; thus the model exhibited perceptual bistability.\n\nAn example of the switching of internal representations is displayed in Figure 1b. It can be observed that the perceptual state is most distinct in higher layers. For quantitative analysis, we computed the squared reconstruction error of the image decoded from the topmost layer with regard to either of the two image interpretations. Plotted against time (Figure 2a), this shows how the internal representations evolve during a trial. The representations match one of the two image interpretations in a relatively stable manner over several sampling cycles, with some degradation before and a short transition phase during a perceptual switch.\n\n3Images of 28x28 pixels, three hidden layers with 26x26 units each. Pretraining of the layers with CD-1, no training of the full DBM.\n\n4It should be noted that the behavior of the network will depend heavily on the specifics of the training and the data set used. We employed only the most simple training methods (layer-wise pre-training with CD-1 and no tuning of the full DBM) and do not claim that more advanced methods could not lead to better sampling behavior, especially for this simple toy data. Indeed, using PCD instead we found some spontaneous switching, though reconstructions were noisy. But for the argument at hand it is more important that, in general, bad mixing with these models can be a problem that might be alleviated by methods such as rates-FPCD; hence a setup that exhibits this problem is useful for making the point.\n\n5α = 0.95, ε = 0.001 for the Necker cube; α = 0.9, ε = 0.002 for binocular rivalry (Section 4).\n\n(a) Match of internal state to interpretations. (b) Single unit properties.\n\nFigure 2: (a): Time course of squared reconstruction errors of the decoded topmost hidden states w.r.t. either of the two image interpretations. Apart from the transition periods, the percept at any point matches one interpretation (close to zero error) but not the other (high error). (b): Activation (i.e. firing probability) and mean synaptic strength (arbitrary origin and units) of a top-layer unit that participates in coding for one but not the other interpretation (dashed line marks the currently active interpretation). Depression and recovery of synaptic efficacy during instantiation of the preferred and non-preferred interpretations, respectively, lead to changes in activation that precede the next perceptual switch.\n\nTo examine the effects of adaptation on an individual neuron, we picked a unit in the top layer that showed high variance in both its activity levels and its neuronal parameters as they changed over the trial, indicating that this unit was involved in coding for one but not the other image interpretation. In Figure 2b are plotted the time course of its activity level (i.e. firing probability according to equation 1) and the mean synaptic efficacy, i.e. weight strength, of connections to this unit.6 As expected, the firing probability of this unit is close to one for one of the interpretations and close to zero for the other, especially in the initial time period after a perceptual switch. However, as the neuron's firing rate and synaptic activity deviate from their low average levels, the synaptic efficacy changes as shown in the plot. 
For example, during instantiation of the preferred stimulus interpretation, the drop of neuronal excitability ultimately leads to a waning of activity that precedes and, together with the changes in the overall network, subsequently triggers the next perceptual switch.\n\nFor another trial, where we used an image of the Necker cube in a different position, the same unit showed constant low firing rates, indicating that it was not involved in representing that image. The neuronal parameters were then found to be stable throughout the trial, after a slight initial monotonic change that would allow the neuron to assume its low baseline activity as determined by the rate statistics. Moreover, other units were found to have relatively stable high firing rates for a given image throughout the trial, coding for features of the stimulus that were common to both image interpretations, even though their neuronal parameters equally adapted due to their elevated activity. This is due to the extent of adaptation being limited by the decay parameter α (equations 6-8), and shows that the adaptation can be set to be sufficiently strong to allow for exploration of the posterior without overwhelming the representations of unambiguous image features. Similarly, we note that the internal representations of the model when presented with the unambiguous images from the training set were stable under adaptation with our setting of parameter values.\n\nWe also quantified the statistics of perceptual switching by measuring the length of time the model's state would stay in either of the two interpretations for one of the test images. The resulting histograms of percept durations, i.e. time intervals between switches, are displayed in Figure 3a separately for the two interpretations. They are shaped like gamma or log-normal distributions, qualitatively in agreement with experimental results in human subjects [19]. 
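Percept durations of this kind can be extracted by labeling each sampling cycle with the interpretation whose reconstruction error is currently lower and collecting run lengths; the following sketch illustrates the idea (the labeling rule and names are our illustration, not the paper's exact procedure, which also discards the short fluctuation peaks):

```python
import numpy as np

def percept_durations(err_a, err_b):
    """Given per-cycle reconstruction errors w.r.t. the two interpretations,
    label each cycle with the better-matching percept (0 or 1) and return
    the run lengths, in sampling cycles, separately per interpretation."""
    labels = np.where(np.asarray(err_a) < np.asarray(err_b), 0, 1)
    durations = {0: [], 1: []}
    start = 0
    for t in range(1, len(labels) + 1):
        # Close a run when the label changes or the trial ends.
        if t == len(labels) or labels[t] != labels[start]:
            durations[labels[start]].append(t - start)
            start = t
    return durations
```

Histogramming the two resulting duration lists then gives plots of the kind shown in Figure 3a.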
There is an apparent bias in the model towards one of the interpretations (different for different images). Some biases are observed in humans (as visible in the data in [4]), potentially induced by statistical properties of the environment. However, our data set did not involve any biases, so this seems to be merely an artifact produced by the (basic) training procedure used.\n\n6The changes to weights and biases are equivalent, so we show only the former.\n\n(a) Percept durations for both interpretations (left and right figures), with/without attention. (b)\n\nFigure 3: (a): Histograms over percept durations between perceptual switches, for either interpretation (left and right, respectively) of one of the test images. Ignoring the peaks at small interval lengths, which stem from fluctuations during transitions, the histograms are very well fitted by log-normal distributions (black curves, omitted in the right figure to avoid clutter). Also plotted in both figures are histograms with spatial attention employed (see Section 3.1) to one of the interior corners of the Necker cube (as shown in (b)). The distributions shift or remain unchanged depending on whether the attended corner is salient or not for the image interpretation in question.\n\n3.1 The role of spatial attention\n\nThe statistics of multistable perception can be influenced voluntarily by human subjects [20]. For the Necker cube, overtly directing one's gaze to corners of the cube, especially the interior ones, can have a biasing effect [21]. This could be explained by these features being in some way more salient for either of the two interpretations. 
An explanation matching our (simplified) setup would be that opaque cubes (as used in training) uniquely match one of the interpretations and lack one of the two interior corners. In the following, we model not eye movements but covert attention, involving only the shifting of an internal attentional 'spotlight', which has also been shown to affect perceptual switching in the Necker cube [22].7\n\nThe presented image remained unchanged, and a spatial spotlight that biased the internal representations of the model was employed in the first hidden layer. To implement the spotlight, we made use of the fact that receptive fields were topographically organized, and that sparsity in a DBM breaks the symmetry between units being on and off, making it possible to suppress represented information by suppressing the activity of specific hidden units [12]. We used a Gaussian-shaped spotlight that was centered at one of the salient interior corners of the Necker cube (Figure 3b) and applied it to the hidden units as additional negative biases, attenuating activity further away from the focus.\n\nThe effect of attention on the percept durations for one of the test images is displayed in Figure 3a, together with the data obtained without attention for comparison. For the interpretation that matched the attended corner, we found a shift towards longer percept durations (Figure 3a, left), whereas the distribution for the other interpretation was relatively unchanged (Figure 3a, right). Averaged over all test images, the mean interval spent representing the interpretation favored by spatial attention saw a 25% increase, versus approximately no change for the other interpretation. Hence, in the model spatial attention prolongs the percept whose salient feature is being attended. 
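Such a spotlight can be sketched as a Gaussian over the topographic grid of first-hidden-layer units, turned into extra negative biases; the grid size matches our 26x26 hidden layers, but the center, width, and strength below are illustrative values rather than the exact parameters used:

```python
import numpy as np

def spotlight_biases(grid=26, center=(10, 10), sigma=5.0, strength=4.0):
    """Additional (negative) biases for a topographically organized hidden
    layer: zero at the attended location, increasingly suppressive with
    distance from it, so activity away from the focus is attenuated."""
    ys, xs = np.mgrid[0:grid, 0:grid]
    gauss = np.exp(-((xs - center[1]) ** 2 + (ys - center[0]) ** 2)
                   / (2.0 * sigma ** 2))
    return -strength * (1.0 - gauss)  # 0 at the focus, -strength far away
```

During sampling, these values would simply be added to the first hidden layer's biases b^{(1)} before computing the activation probabilities of equation 1.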
This seems to be qualitatively in line with experimental data, at least in terms of voluntary attention having an effect, although specifics can depend on the nature of the stimulus and the details of the instructions given to experimental subjects [23].\n\n4 Experiments: binocular rivalry\n\nSeveral related studies that considered perceptual multistability in the light of probabilistic inference focused on binocular rivalry [2, 5, 6]. There, human observers are presented with a different image to each eye, and their perception is found to switch between the two images. Depending on specifics such as the size and content of the images, perception can switch completely between the two images, fuse them, or do either to varying degrees over time [24, 25]. We demonstrate with a simple experiment that the phenomenon of binocular rivalry can be addressed in our framework as well.\n\n7We did not find an experimental study examining covert attention on the interior corners in unmodified Necker cubes, which is what we simulate.\n\nFigure 4: Example images for the binocular rivalry experiment. Training images (left) contained either horizontal or vertical bars, and the left and right image halves were identical (corresponding to the left and right 'eyes'). For the test images (right), the left and right halves are drawn independently. They could come from the same category (top and bottom examples) or from conflicting categories (middle example).\n\n(a) Percept vs. eye images for same category. (b) Percept vs. eye images for conflicting categories.\n\nFigure 5: For binocular rivalry, displayed are the squared reconstruction errors for decoded top-layer representations computed against either of the two input images. (a): The input images came from the same category (here, vertical bars), and fusing of the percept was prominent, resulting in modest, similar errors for both images. (b): For input images from conflicting categories, the percept alternated more strongly between the images, although intermediate, fused states were still more prevalent than was the case for the Necker cube. The step-like changes in the error were found to result from individual bars appearing and disappearing in the percept.\n\nTo this end, the same model architecture as before was used, but the number of visible units was doubled and the units were separated into left and right 'eyes'. During training, both sets of visibles simply received the same images. During testing, however, the left and right halves were set to independently drawn training images to simulate the binocular rivalry experiment. The units in the first hidden layer were set to be monocular in the sense that their receptive fields covered visible units only in either the left or the right half, whereas higher layers did not make this distinction. As a data set we used images containing either vertical or horizontal bars (Figure 4).\n\nAs with the Necker cube, perceptual switching was observed with adaptation but not without. Generally, the perceptual state was found to be biased to one of the two images for some periods, while fusing the images to some extent during transition phases (Figure 5). 
Interestingly, whether fusing or alternation was more prominent depended on the nature of the conflict between the two input images: for images from the same category (both vertical or both horizontal bars), fusing occurred more often (Figure 5a), whereas for images from conflicting categories, the percept represented either image more distinctly, and fusing happened primarily in transition periods (Figure 5b). We quantified this by computing the reconstruction errors from the decoded hidden states with regard to the two images, and taking the absolute difference averaged over the trial as a measure of how much the internal states were representing both images individually rather than as fused versions. We found that this measure was more than two times higher for conflicting categories. This result is qualitatively in line with psychophysical experiments that showed fusing for differing but compatible images (e.g. different patches of the same source image) [24, 25].\n\n5 Related work and discussion\n\nOur study contributes to the emerging trend in computational neuroscience to consider approximate probabilistic inference in the brain (e.g. [9, 10]), and complements several recent papers that examine perceptual multistability in this light. Gershman et al. [6] argued for interpreting multistability as inference based on MCMC, focusing on binocular rivalry only. Importantly, they use Markov random fields as a high-level description of the perceptual problem itself (two possible 'causes' generating the image, with a topology matching the stimulus). They argue that the brain might implement MCMC inference over these external variables, but do not make any statement w.r.t. the underlying neural mechanisms. 
In contrast, in our model MCMC is performed over the internal, neurally embodied latent variables that were learned from data. Bistability results from bimodality in the learned high-dimensional hidden representations, rather than directly from the problem formulation. In another study, Sundareswara and Schrater [4] model perceptual switching for the Necker cube, including the influence of image context, which we could explore in future work. Similar to [6], they start from a high-level description of the problem. They design a custom abstract inference process that makes different predictions from our model: in their model, samples are drawn i.i.d. from the two posterior modes representing the two interpretations and are accumulated over time, with older samples being exponentially discounted. A separate decision process selects from the samples and determines which interpretation reaches awareness. In our model, the current conscious percept is simply determined by the current overall state of the network, and the switching dynamics are a direct result of how this state evolves over time (as in [6]).\n\nHohwy et al. [5] explain binocular rivalry descriptively in their predictive coding framework. They identify switching with exploration in an energy landscape, and suggest the contribution of stochasticity or adaptation, but they do not make the connection to sampling and do not provide a computational model. The work by Grossberg and Swaminathan [8] is an example of a non-probabilistic model of, among other things, Necker cube bistability, providing much biological detail and considering the role of spatial attention. Their study is also an instance of an approach that bases the switching on neuronal adaptation, but does not see a functional role for multistability as such, relegating the functional relevance of adaptation to a role it plays during learning only. 
Similarly, in earlier work Dayan [2] utilizes an ad-hoc adaptation process in a deterministic probabilistic model of binocular rivalry. He suggests that sampling could provide stochasticity, and wonders about the relation between sampling and adaptation. This is what we have addressed here. Indeed, our approach is supported by recent psychophysics results [26], which indicate that both noise and neuronal adaptation are necessary to explain binocular rivalry.
We note that our setup is of course a simplification and abstraction in that we do not explicitly model depth. Indeed, in perceiving the Necker cube one does not see the actually opaque cubes we used in training, but rather a 3D wireframe cube. Peculiarly, this is contrary to the depth information available, as a (2D) image of a cube is not actually a 3D cube, but a collection of lines on a flat surface. How is a paradoxically ‘flat 3D cube’ represented in the brain? In a hierarchical architecture consisting of specialized areas, this might be realized by having a high-level area that codes for objects (e.g. area IT in the cortex) represent a 3D cube, whereas another area that is primarily concerned with depth as such represents a flat surface. Our work here and earlier [11, 12] showed that in a DBM, different hidden layers can represent different and partially conflicting information (cf. Figure 1b). Finally, we also note that in preliminary experiments with depth information (using real-valued visibles) perceptual switching did still occur.
In conclusion, we provided a biological interpretation of rates-FPCD, and thus showed how two seemingly distinct explanations for perceptual multistability, probabilistic inference and neuronal adaptation, can be merged in one framework. Unlike other approaches, our account combines sampling-based inference and adaptation in a concrete neural architecture utilizing learned representations of images.
Moreover, our study further demonstrates the relevance of DBMs as cortical models [11, 12]. We believe that further developing hybrid approaches – combining probabilistic models, dynamical systems, and classic connectionist networks – will help identify the neural substrate of the Bayesian brain hypothesis.

Acknowledgments

Supported by the EPSRC, MRC and BBSRC. We thank N. Heess and the reviewers for comments.

References

[1] Leopold, D. A. and Logothetis, N. K. (1999) Multistable phenomena: changing views in perception. Trends in Cognitive Sciences, 3, 254–264, PMID: 10377540.

[2] Dayan, P. (1998) A hierarchical model of binocular rivalry. Neural Computation, 10, 1119–1135.

[3] van Ee, R., Adams, W. J., and Mamassian, P. (2003) Bayesian modeling of cue interaction: bistability in stereoscopic slant perception. Journal of the Optical Society of America A, 20, 1398–1406.

[4] Sundareswara, R. and Schrater, P. R. (2008) Perceptual multistability predicted by search model for Bayesian decisions. Journal of Vision, 8, 1–19.

[5] Hohwy, J., Roepstorff, A., and Friston, K. (2008) Predictive coding explains binocular rivalry: An epistemological review. Cognition, 108, 687–701.

[6] Gershman, S., Vul, E., and Tenenbaum, J. (2009) Perceptual multistability as Markov chain Monte Carlo inference. Advances in Neural Information Processing Systems 22.

[7] Blake, R. (1989) A neural theory of binocular rivalry. Psychological Review, 96, 145–167, PMID: 2648445.

[8] Grossberg, S. and Swaminathan, G. (2004) A laminar cortical model for 3D perception of slanted and curved surfaces and of 2D images: development, attention, and bistability. Vision Research, 44, 1147–1187.

[9] Fiser, J., Berkes, P., Orban, G., and Lengyel, M. (2010) Statistically optimal perception and learning: from behavior to neural representations.
Trends in Cognitive Sciences, 14, 119–130.

[10] Vul, E., Goodman, N. D., Griffiths, T. L., and Tenenbaum, J. B. (2009) One and done? Optimal decisions from very few samples. Proceedings of the 31st Annual Conference of the Cognitive Science Society.

[11] Reichert, D. P., Seriès, P., and Storkey, A. J. (2010) Hallucinations in Charles Bonnet Syndrome induced by homeostasis: a Deep Boltzmann Machine model. Advances in Neural Information Processing Systems 23, 2020–2028.

[12] Reichert, D. P., Seriès, P., and Storkey, A. J. (2011) A hierarchical generative model of recurrent object-based attention in the visual cortex. Proceedings of the International Conference on Artificial Neural Networks (ICANN-11).

[13] Breuleux, O., Bengio, Y., and Vincent, P. (2011) Quickly generating representative samples from an RBM-derived process. Neural Computation, pp. 1–16.

[14] Welling, M. (2009) Herding dynamical weights to learn. Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Quebec, Canada, pp. 1121–1128, ACM.

[15] Salakhutdinov, R. and Hinton, G. (2009) Deep Boltzmann machines. Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 5, pp. 448–455.

[16] Tieleman, T. and Hinton, G. (2009) Using fast weights to improve persistent contrastive divergence. Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Quebec, Canada, pp. 1033–1040, ACM.

[17] Maass, W. and Zador, A. M. (1999) Dynamic stochastic synapses as computational units. Neural Computation, 11, 903–917.

[18] Turrigiano, G. G. (2008) The self-tuning neuron: synaptic scaling of excitatory synapses. Cell, 135, 422–435, PMID: 18984155.

[19] Zhou, Y. H., Gao, J. B., White, K. D., Yao, K., and Merk, I.
(2004) Perceptual dominance time distributions in multistable visual perception. Biological Cybernetics, 90, 256–263.

[20] Meng, M. and Tong, F. (2004) Can attention selectively bias bistable perception? Differences between binocular rivalry and ambiguous figures. Journal of Vision, 4.

[21] Toppino, T. C. (2003) Reversible-figure perception: mechanisms of intentional control. Perception & Psychophysics, 65, 1285–1295, PMID: 14710962.

[22] Peterson, M. A. and Gibson, B. S. (1991) Directing spatial attention within an object: Altering the functional equivalence of shape descriptions. Journal of Experimental Psychology: Human Perception and Performance, 17, 170–182.

[23] van Ee, R., Noest, A. J., Brascamp, J. W., and van den Berg, A. V. (2006) Attentional control over either of the two competing percepts of ambiguous stimuli revealed by a two-parameter analysis: means do not make the difference. Vision Research, 46, 3129–3141, PMID: 16650452.

[24] Tong, F., Meng, M., and Blake, R. (2006) Neural bases of binocular rivalry. Trends in Cognitive Sciences, 10, 502–511.

[25] Knapen, T., Kanai, R., Brascamp, J., van Boxtel, J., and van Ee, R. (2007) Distance in feature space determines exclusivity in visual rivalry. Vision Research, 47, 3269–3275, PMID: 17950397.

[26] Kang, M. and Blake, R. (2010) What causes alternations in dominance during binocular rivalry? Attention, Perception, & Psychophysics, 72, 179–186.