{"title": "Predicting response time and error rates in visual search", "book": "Advances in Neural Information Processing Systems", "page_first": 2699, "page_last": 2707, "abstract": "A model of human visual search is proposed. It predicts both response time (RT) and error rates (ER) as a function of image parameters such as target contrast and clutter. The model is an ideal observer, in that it optimizes the Bayes ratio of target present vs target absent. The ratio is computed on the firing pattern of V1/V2 neurons, modeled by Poisson distributions. The optimal mechanism for integrating information over time is shown to be a \u2018soft max\u2019 of diffusions, computed over the visual field by \u2018hypercolumns\u2019 of neurons that share the same receptive field and have different response properties to image features. An approximation of the optimal Bayesian observer, based on integrating local decisions, rather than diffusions, is also derived; it is shown experimentally to produce very similar predictions. A psychophysics experiment is proposed that may discriminate which mechanism is used in the human brain.", "full_text": "Predicting response time and error rates in visual search\n\nBo Chen\nCaltech\nbchen3@caltech.edu\n\nVidhya Navalpakkam\nYahoo! Research\nnvidhya@yahoo-inc.com\n\nPietro Perona\nCaltech\nperona@caltech.edu\n\nAbstract\n\nA model of human visual search is proposed. It predicts both response time (RT) and error rates (ER) as a function of image parameters such as target contrast and clutter. The model is an ideal observer, in that it optimizes the Bayes ratio of target present vs target absent. The ratio is computed on the firing pattern of V1/V2 neurons, modeled by Poisson distributions. 
The optimal mechanism for integrating information over time is shown to be a \u2018soft max\u2019 of diffusions, computed over the visual field by \u2018hypercolumns\u2019 of neurons that share the same receptive field and have different response properties to image features. An approximation of the optimal Bayesian observer, based on integrating local decisions, rather than diffusions, is also derived; it is shown experimentally to produce very similar predictions to the optimal observer in common psychophysics conditions. A psychophysics experiment is proposed that may discriminate which mechanism is used in the human brain.\n\nFigure 1: Visual search. (A) Clutter and camouflage make visual search difficult. (B,C) Psychologists and neuroscientists build synthetic displays to study visual search. In (B) the target \u2018pops out\u2019 (\u2206\u03b8 = 45\u00b0), while in (C) the target requires more time to be detected (\u2206\u03b8 = 10\u00b0) [1].\n\n1 Introduction\n\nAnimals and humans often use vision to find things: mushrooms in the woods, keys on a desk, a predator hiding in tall grass. Visual search is challenging because the location of the object that one is looking for is not known in advance, and surrounding clutter may generate false alarms. The three ecologically relevant performance parameters of visual search are the two error rates (ER), namely false alarms (FA) and false rejects (FR), and the response time (RT). The design of a visual system is crucial in obtaining low ER and RT. These parameters may be traded off by manipulating suitable thresholds [2, 3, 4].\nPsychologists and physiologists have long been interested in understanding the performance and the mechanisms of visual search. 
In order to approach this difficult problem they present human subjects with synthetic stimuli composed of a variable number of \u2018items\u2019 which may include a \u2018target\u2019 and multiple \u2018distractors\u2019 (see Fig. 1). By varying the number of items one may vary the amount of clutter; by designing different target-distractor pairs one may probe different visual cues (contrast, orientation, color, motion); and by varying the visual distinctiveness of the target vis-a-vis the distractors one may study the effect of the signal-to-noise ratio (SNR). Several studies since the 1980s have investigated how RT and ER are affected by the complexity of the stimulus (number of distractors), and by target-distractor discriminability with different visual cues. One early observation is that when the target and distractor features are widely separated in feature space (e.g., red target among green distractors), the target \u2018pops out\u2019. In these situations the ER is nearly zero, and the slope of RT vs. set size is flat, i.e., the RT to find the target is independent of the number of items in the display [1]. Decreasing the discriminability between the target and distractor increases error rates, and increases the slope of RT vs. set size [5]. Moreover, it was found that the RT for displays with no target is longer than when the target is present (see review in [6]). Recent studies investigated the shape of RT distributions in visual search [7, 8].\nNeurophysiologically plausible models have been recently proposed to predict RTs in visual discrimination tasks [9] and various other 2AFC tasks [10] at a single spatial location in the visual field. They are based on sequential tests of statistical hypotheses (target present vs target absent) [11] computed on the response of stimulus-tuned neurons [2, 3]. 
We do not yet have satisfactory models for explaining RTs in visual search, which is harder as it involves integrating information across several locations in the visual field, as well as over time. Existing models predicting RT in visual search are either qualitative (e.g. [12]) or descriptive (e.g., the drift-diffusion model [13, 14, 15]), and do not attempt to predict experimental results with new set sizes and new target and distractor settings.\nWe propose a Bayesian model of visual search that predicts both ER and RT. Our study makes a number of contributions. First, while visual search has been modeled using signal-detection theory to predict ER [16], our model is based on neuron-like mechanisms and predicts both ER and RT. Second, our model is an optimal observer, given a physiologically plausible front-end of the visual system. Third, our model shows that in visual search the optimal computation is not a diffusion, as one might believe by analogy with single-location discrimination models [17, 18]; rather, it is a \u2018soft-max\u2019 nonlinear combination of locally-computed diffusions. Fourth, we study a physiologically parsimonious approximation to the optimal observer; we show that it is almost optimal when the characteristics of the task are known in advance and held constant, and we explore whether there are psychophysical experiments that could discriminate between the two models.\nOur model is based on a number of simplifying assumptions. First, we assume that stimulus items are centered on cortical hypercolumns [19] and that at locations where there is no item neuronal firing is negligible. Second, retinal and cortical magnification [19] are ignored, since psychophysicists have developed displays that sidestep this issue (by placing items on a constant-eccentricity ring, as shown in Fig. 1). Third, we do not account for overt and covert attentional shifts. 
Overt attentional shifts are manifested by saccades (eye motions), which happen every 200 ms or so. Since the post-decision motor response to a stimulus by pressing a button takes about 250-300 ms, one does not need to worry about eye motions when response times are shorter than 500 ms. For longer RTs, one may enforce eye fixation at the center of the display so as to prevent overt attentional shifts. Furthermore, our model explains serial search without the need to invoke covert attentional shifts [20], which are difficult to prove neurophysiologically.\n\n2 Target discrimination at a single location with Poisson neurons\n\nWe first consider probabilistic reasoning at one location, where two possible stimuli may appear. The stimuli differ in one respect, e.g. they have different orientations \u03b8^(1) and \u03b8^(2). We will call them distractor (D) and target (T), also labeled C = 1 and C = 2 (call c \u2208 {1, 2} the generic value of C). Based on the response of N neurons (a hypercolumn) we will decide whether the stimulus was a target or a distractor. Crucially, a decision should be reached as soon as possible, i.e. as soon as there is sufficient evidence for T or D [11].\nGiven the evidence T (defined further below in terms of the neurons\u2019 activity) we wish to decide whether the stimulus was of type 1 or 2. We may do so when the probability P(C = 1|T) of the stimulus being of type 1 given the observations in T exceeds a given threshold T1 (e.g. T1 = 0.99). We may instead decide in favor of C = 2, e.g. when P(C = 1|T) < T2 (e.g. T2 = 0.01). If\n\nFigure 2: (Left three panels) Model of a hypercolumn in V1/V2 cortex composed of four orientation-tuned neurons (our simulations use 32). The left panel shows the neurons\u2019 tuning curve \u03bb(\u03b8) representing the expected Poisson firing rate when the stimulus has orientation \u03b8. 
The middle plot shows the expected firing rate of the population of neurons for two stimuli whose orientation is indicated with a red (distractor) and green (target) vertical line. The third plot shows the step-change in the value of the diffusion when an action potential is registered from a given neuron. (Right panel) Diagram of the decision models. (A) One-location Bayesian observer. The action potentials of a hypercolumn of neurons (top) are integrated in time to produce a diffusion. When the diffusion reaches either an upper bound T1 or a lower bound T0 the decision is taken that either the target is present (1) or the target is absent (0). (B\u2013D) Multi-location ideal Bayesian observer. (B) While not a diffusion, it may be seen as a \u2018soft maximum\u2019 combination of local diffusions: the local diffusions are first exponentiated, then averaged; the log of the result is compared to two thresholds to reach a decision. (C) The \u2018Max approximation\u2019 is a simplified approximation of the ideal observer, where the maximum of local diffusions replaces the soft-maximum. (D) Equivalently, in the Max approximation decisions are reached locally and combined by logical operators. The white AND in a dark field indicates an inverted AND of multiple inverted inputs.\n\nP(C = 1|T) \u2208 (T2, T1) we will wait for more evidence. Thus, we need to compute P(C = 1|T):\n\nP(C = 1|T) = 1 / (1 + P(C = 2|T)/P(C = 1|T)) = 1 / (1 + R(T) \u00b7 P(C = 2)/P(C = 1)),  where  R(T) = P(T|C = 2)/P(T|C = 1) = [P(C = 2|T)/P(C = 1|T)] \u00b7 [P(C = 1)/P(C = 2)]   (1)\n\nwhere P(C = 1) = 1 \u2212 P(C = 2) is the prior probability of C = 1. Thus, it is equivalent to take decisions by thresholding log R(T) (see footnote 1); we will elaborate on this in Sec. 
3.\nWe will model the firing rate of the neurons with a Poisson pdf: the number n of action potentials that will be observed during one second is distributed as P(n|\u03bb) = \u03bb^n e^(\u2212\u03bb)/n!. The constant \u03bb is the expectation of the number of action potentials per second. Each neuron i \u2208 {1, . . . , N} is tuned to a different orientation \u03b8_i; for the sake of simplicity we will assume that the width of the tuning curve is the same for all neurons; i.e. each neuron i will respond to stimulus c with expectation \u03bb_i^c = f(|\u03b8^(c) \u2212 \u03b8_i|) (in spikes per second), which is determined by the distance between the neuron\u2019s preferred orientation \u03b8_i and the stimulus orientation \u03b8^(c).\nLet T_i = {t_k^i} be the set of action potentials from neuron i produced starting at t = 0 and until the end of the observation period t = T. Indicate with T = {t_k} = \u222a_i T_i the complete set of action potentials from all neurons (where the t_k are sorted). We will indicate with i(k) the index of the neuron that fired the action potential at time t_k. Call I_k = (t_k, t_{k+1}) the intervals of time in between action potentials, where I_0 = (0, t_1). These intervals are open, i.e. they do not contain the boundaries, hence they do not contain the action potentials.\nThe signal coming from the neurons is thus a concatenation of \u2018spikes\u2019 and \u2018intervals\u2019, and the interval (0, T) may be viewed as the union of instants t_k and open intervals (t_k, t_{k+1}), i.e. (0, T) = I_0 \u222a {t_1} \u222a I_1 \u222a {t_2} \u222a \u00b7\u00b7\u00b7\nSince the spike trains T_i and T are Poisson processes, once we condition on the class of the stimulus the spike times are independent. This implies that: P(T|C = c) = \u220f_k P(I_k|C = c) P(t_k|C = c). This may be proven by dividing up (0, T) into smaller and smaller intervals and taking the limit for\n\nFootnote 1: We use base 10 for all our logarithms and exponentials, i.e. 
log(x) \u2261 log10(x) and exp(x) \u2261 10^x.\n\n[Figure 2 plots; recoverable axis information: left panel, neurons\u2019 tuning curves, expected firing rate \u03bb (spikes per second) vs. stimulus orientation \u03b8 (degrees); middle panel, mean Poisson spiking rate per second vs. neuron\u2019s preferred orientation \u03b8 (degrees), for \u03b8_D = 90\u00b0 and \u03b8_T = 105\u00b0; right panel, diffusion jump per spike vs. neuron\u2019s preferred orientation \u03b8 (degrees), with interspike drift per second = 0.01.]
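The mechanisms described so far, the one-location diffusion of Sec. 2 (Fig. 2A) and the multi-location \u2018soft max\u2019 combination vs. the \u2018Max approximation\u2019 (Fig. 2B,C), can be sketched in a few lines of Python. This is a minimal illustrative simulation, not the authors' code: the Gaussian tuning curve `rates` stands in for the unspecified f, and all numeric parameters (tuning width, firing rates, thresholds, set size) are assumptions, not the paper's fitted values. Logarithms are base 10, following the paper's footnote 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hypercolumn of Poisson neurons (illustrative parameters).
N = 32                                                   # neurons per hypercolumn
theta_pref = np.linspace(0.0, 180.0, N, endpoint=False)  # preferred orientations (deg)
sigma = 20.0                                             # assumed tuning width (deg)
lam_base, lam_peak = 1.0, 10.0                           # baseline / peak rate (spikes/s)

def rates(theta):
    """Expected Poisson rates lambda_i for a stimulus of orientation theta."""
    d = np.abs(theta_pref - theta)
    d = np.minimum(d, 180.0 - d)          # orientation is periodic with period 180 deg
    return lam_base + lam_peak * np.exp(-0.5 * (d / sigma) ** 2)

theta_D, theta_T = 90.0, 105.0            # distractor / target orientations (as in Fig. 2)
lam_D, lam_T = rates(theta_D), rates(theta_T)

def local_diffusion(theta_true, dt=0.001, t_max=2.0):
    """Accumulate the local diffusion log10 R over time for one location.

    For Poisson neurons, each spike of neuron i adds log10(lam_T_i / lam_D_i)
    (the per-spike jump of Fig. 2, third panel), and each silent interval dt
    adds the interspike drift -(sum(lam_T) - sum(lam_D)) * dt * log10(e).
    """
    steps = int(round(t_max / dt))
    llr = np.zeros(steps + 1)
    lam_true = rates(theta_true)
    jump = np.log10(lam_T / lam_D)
    drift = -(lam_T - lam_D).sum() * dt / np.log(10.0)
    for k in range(steps):
        spikes = rng.poisson(lam_true * dt)   # spike counts in this time bin
        llr[k + 1] = llr[k] + spikes @ jump + drift
    return llr

# One-location observer (Fig. 2A): threshold the diffusion at log-odds bounds.
llr = local_diffusion(theta_T)            # target actually shown at this location
T1, T0 = 2.0, -2.0                        # illustrative 100:1 log10-odds bounds

# Multi-location ideal observer (Fig. 2B) vs. Max approximation (Fig. 2C):
# target at location 0, distractors at the other M-1 locations.
M = 6
D = np.stack([local_diffusion(theta_T if j == 0 else theta_D) for j in range(M)])
soft_max = np.log10(np.mean(10.0 ** D, axis=0))   # exponentiate, average, take log
hard_max = D.max(axis=0)                          # replace soft max by a hard max

# Simulated response time: first crossing of either bound by the soft max.
crossed = np.flatnonzero((soft_max >= T1) | (soft_max <= T0))
rt = crossed[0] * 0.001 if crossed.size else None  # seconds, or None if no decision
```

Note that the two multi-location statistics are provably close: since the mean of M positive numbers lies between max/M and max, the soft max is always within log10(M) below the hard max, which is one way to see why the Max approximation produces predictions similar to the ideal observer's.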