{"title": "On the connections between saliency and tracking", "book": "Advances in Neural Information Processing Systems", "page_first": 1664, "page_last": 1672, "abstract": "A model connecting visual tracking and saliency has recently been proposed. This model is based on the saliency hypothesis for tracking which postulates that tracking is achieved by the top-down tuning, based on target features, of discriminant center-surround saliency mechanisms over time. In this work, we identify three main predictions that must hold if the hypothesis were true: 1) tracking reliability should be larger for salient than for non-salient targets, 2) tracking reliability should have a dependence on the defining variables of saliency, namely feature contrast and distractor heterogeneity, and must replicate the dependence of saliency on these variables, and 3) saliency and tracking can be implemented with common low level neural mechanisms. We confirm that the first two predictions hold by reporting results from a set of human behavior studies on the connection between saliency and tracking. We also show that the third prediction holds by constructing a common neurophysiologically plausible architecture that can computationally solve both saliency and tracking. This architecture is fully compliant with the standard physiological models of V1 and MT, and with what is known about attentional control in area LIP, while explaining the results of the human behavior experiments.", "full_text": "On the connections between saliency and tracking\n\nVijay Mahadevan\n\nYahoo! Labs\n\nBangalore, India\n\nNuno Vasconcelos\n\nStatistical Visual Computing Laboratory\n\nUC San Diego, La Jolla, CA 92092\n\nvmahadev@yahoo-inc.com\n\nnuno@ece.ucsd.edu\n\nAbstract\n\nA model connecting visual tracking and saliency has recently been proposed. 
This\nmodel is based on the saliency hypothesis for tracking which postulates that track-\ning is achieved by the top-down tuning, based on target features, of discriminant\ncenter-surround saliency mechanisms over time. In this work, we identify three\nmain predictions that must hold if the hypothesis were true: 1) tracking reliabil-\nity should be larger for salient than for non-salient targets, 2) tracking reliabil-\nity should have a dependence on the de\ufb01ning variables of saliency, namely fea-\nture contrast and distractor heterogeneity, and must replicate the dependence of\nsaliency on these variables, and 3) saliency and tracking can be implemented with\ncommon low level neural mechanisms. We con\ufb01rm that the \ufb01rst two predictions\nhold by reporting results from a set of human behavior studies on the connection\nbetween saliency and tracking. We also show that the third prediction holds by\nconstructing a common neurophysiologically plausible architecture that can com-\nputationally solve both saliency and tracking. This architecture is fully compliant\nwith the standard physiological models of V1 and MT, and with what is known\nabout attentional control in area LIP, while explaining the results of the human\nbehavior experiments.\n\n1 Introduction\n\nBiological vision systems have evolved sophisticated tracking mechanisms, capable of tracking\ncomplex objects, undergoing complex motion, in challenging environments.These mechanisms have\nbeen an area of active research in both neurophysiology [10, 34] and psychophysics [28], where re-\nsearch has been devoted to the study of object tracking by humans [29]. This effort has produced\nseveral models of multi-object tracking, that account for the experimental evidence from human\npsychometric data [28]. Prominent among these are the FINST model of Pylyshyn [29], and the\nobject \ufb01le model of Kahnemman et al [18]. 
However, these models are not quantitative, and only\nexplain the psychophysics of tracking simple stimuli, such as dots or bars. They do not specify a set\nof computations for the implementation of a general purpose tracking algorithm, and it is unclear\nhow they could be applied to natural scenes. While some computational models for multiple object\ntracking (MOT) such as the oscillatory neural network model of Kazanovich et al. [19], and the\nparticle \ufb01lter based model of Vul et al. [37], have been proposed, there have been no attempts to\ndemonstrate their applicability to real video scenes.\n\nVisual tracking has also been widely studied in computer vision, where numerous tracking algo-\nrithms [38] have been proposed. Early solutions relied on simple object representations, and em-\nphasized the prediction of object dynamics, typically using a Kalman \ufb01lter. The prediction of these\ndynamics turned out to be dif\ufb01cult, motivating the introduction of more sophisticated methods such\nas particle \ufb01ltering [15]. Nevertheless, because these approaches relied on simple target represen-\ntations, they could not deal with complex scenes. This motivated research in the appearance-based\nmodeling techniques [17, 32, 9] where a model of object appearance is learned from the target loca-\ntion in the initial frame, and used to identify the target in the next. It is, however, dif\ufb01cult to learn\nappearance models from complex scenes, where background detail tends to drift into the region used\nto learn the model, corrupting the learning.\n\n1\n\n\fThe best results among tracking algorithms have recently been demonstrated for a class of meth-\nods that pose object tracking as incremental target/background classi\ufb01cation [22, 8, 2, 13]. These\ndiscriminant trackers train a classi\ufb01er to distinguish target from background at each frame. This\nclassi\ufb01er is then used to determine the location of the target in the next frame. 
Target and back-\nground are extracted at this location, the classi\ufb01er updated, and the process iterated.\n\nRecent work in the computer vision literature [22] has postulated a connection between discriminant\ntracking and one of the core processes of early biological vision - saliency, by suggesting that the\nability to track objects is a side-effect of the saliency mechanisms that are known to guide the de-\nployment of attention. More precisely, [22] has hypothesized that tracking is a simple consequence\nof object-based tuning, over time, of the mechanisms used by the attentional system to implement\nbottom-up saliency. We refer to this as the saliency hypothesis for tracking. Working under this\nhypothesis, [22] proposed a tracker based on the discriminant saliency principle of [12]. This is a\nprinciple for bottom-up center-surround saliency, which poses saliency as discrimination between\na target (center) and a null (surround) hypothesis. Center-surround discriminant saliency has pre-\nviously been shown to predict various psychophysical traits of human saliency and visual search\nperformance [11]. The extension proposed by [22], to the tracking problem, endows discriminant\nsaliency with a top-down feature selection mechanism. This mechanism enhances features that re-\nspond strongly to the target and weakly to the background, transforming the saliency operation from\na search for locations where center is distinct from the surround, to a search for locations where\ntarget is present in the center but not in the surround. 
[22] has shown that this tracker has state-of-\nthe-art performance on a number of tracking benchmarks from the computer vision literature.\n\nIn this work, we evaluate the validity of the saliency hypothesis by identifying three main predic-\ntions that ensue from the saliency hypothesis: 1) tracking reliability should be larger for salient than\nfor non-salient targets, 2) tracking reliability should have a dependence on the de\ufb01ning variables of\nsaliency, namely feature contrast and distractor heterogeneity, and must replicate the dependence of\nsaliency on these variables and, 3) saliency and tracking can be implemented with common low level\nneural mechanisms. We con\ufb01rm that the \ufb01rst two of these predictions hold by performing several\nhuman behavior experiments on the dependence between target saliency and human tracking perfor-\nmance. These experiments build on well understood properties of saliency, such as pop-out effects,\nto show that tracking requires discrimination between target and background using a center-surround\nmechanism. In addition, we characterize the dependence of tracking performance on the extent of\ndiscrimination, by gradually varying feature contrast between target and distractors in the tracking\ntasks. The results show that both tracking performance and saliency have highly similar patterns\nof dependency on feature contrast and distractor heterogeneity. To con\ufb01rm that the third prediction\nholds, we show that both saliency and tracking can be implemented by a network compliant with\nthe widely accepted neurophysiological models of neurons in area V1 [5] and the middle temporal\narea (MT) [36], and with the emerging view of attentional control in the lateral intra-parietal area\n(LIP) [3]. 
This network extends the substantial connections between discriminant saliency and the standard model that have already been shown [12], and is a biologically plausible optimal model for both saliency and tracking.

2 Human Behavior Experiments on Saliency and Tracking
We start by reporting on human behavior experiments¹ investigating the connections between the psychophysics of tracking and saliency. To the best of our knowledge, this is the first report on psychophysics experiments studying the relation between attentional tracking of a single target and its saliency. Video stimuli were designed with the Psychtoolbox [4] on Matlab v7, running on a Windows XP PC. A 27 inch LCD monitor of size 47.5° × 30° visual angle and resolution of 1270 × 1068 pixels was used to present the stimuli. Subjects were at a viewing distance of 57 cm. The same apparatus was used for all experiments.

¹IRB approved study; subjects provided informed consent and were compensated $8 per hour.

2.1 Experiment 1: Saliency affects tracking performance
The experimental setting was inspired by the tracking paradigm of Pylyshyn [29]. Subjects viewed displays containing a green target disk surrounded by 70 red distractor disks and a static fixation square. Example displays are shown in [1]. At the start of each trial, the target disk was cued with a bounding box. Subjects were instructed to track the target covertly, while their eyes fixated on a black fixation square in the center. After a keystroke from the subject, all disks moved independently, with random motion, for around 7 seconds. Then, the disks stopped moving and the colors of 3 disks were switched to 3 new colors - cyan, magenta and blue. Of these, one was the target and the other two the spatially closest distractors. 
Subjects were asked to identify the target among the 3\nhighlighted disks.\n\nMethod 13 subjects (age 22-35, 9 male, 4 female) performed 4 trials each, organized into 2 ver-\nsions of 2 conditions. First version: this version tested subject tracking performance under two\ndifferent stimulus conditions. In the \ufb01rst, denoted salient, the target remained green throughout the\npresentation, changing randomly to one of the three highlight colors at the end of the 7 seconds. In\nthe second, denoted non-salient, the target remained green for the \ufb01rst half of this period, switched\nto red for the remaining time, \ufb01nally turning to a highlight color. While in the \ufb01rst condition the\ntarget is salient throughout the presentation, the second makes the target non-salient throughout the\nlatter half of the trial. To eliminate potential effects of any other variables (e.g. target-distractor dis-\ntances and motion patterns), the non-salient display was created by rotating each frame of a salient\ndisplay by 90\u25e6 (and changing the green disk to red in the second half of the presentation).\n\nUnder the saliency hypothesis for tracking, the rate of successful target tracking should be much\nhigher for salient than for non-salient displays. However, this could be due to the fact that the target\nwas the only green disk in salient displays, and since it continuously popped-out subjects could be\nacquiring the target at any time even after losing track. Second Version: The second version ruled\nout this alternate hypothesis by using a different type of display for the salient condition. In this\ncase, the target was a red disk, and its 7 nearest spatial neighbors were green. All other distractors\nwere randomly assigned to either the red or green class. This eliminated the percept of pop-out.\nAs before, the display for the non-salient condition was created by rotation and color switch of\nthe target on the second half of the presentation. 
The video displays are available in the attached supplement [1].

Results and Discussion Figure 2 (a) and (b) present the rate of successful tracking in the two versions. In both cases, this rate was much higher in the salient than in the non-salient condition. In the latter, tracking performance was almost at the chance level of 1/3, suggesting complete tracking failure. In fact, the similarity of detection rates in the two experiments suggests that target pop-out does not aid human tracking performance at all. It only matters if the target is locally salient, i.e. salient with respect to its immediate neighborhood. This is consistent with the saliency hypothesis, since bottom-up saliency mechanisms are well known to have a center-surround structure [16, 12]. In fact, it suggests two new predictions. The first, motivated by the hypothesis that tracking requires top-down biases of bottom-up saliency, is that center-surround organization also applies to tracking. To address this prediction, we will investigate the spatial organization of tracking mechanisms in greater detail in Experiment 3. The second, which follows from the fact that only target color varied between the two conditions, is that tracking performance depends on the discriminability of the target. We study this prediction in Experiment 2. While the first experiment used color as a discriminant cue, the conclusion that saliency affects tracking performance applies even when other features are salient. For example, studies on multiple object tracking with identical targets and distractors have reported tracking failure when target and distractors are too close to each other [14]. 
This is consistent with the discriminant hypothesis: when target and distractors are identical, the target must be spatio-temporally salient (due to its trajectory or position) in an immediate neighborhood to be tracked accurately.

2.2 Experiment 2: Tracking reliability as a function of feature contrast
Experiment 2 aimed to investigate the connection between the two phenomena in greater detail, namely to quantify how tracking reliability depends on target saliency. Since saliency is not an independent variable, this quantification can only be done indirectly. One possibility is to resort to a variable commonly used to manipulate saliency: the amount of feature contrast between target and distractors. Several features can be used, as it is well known that targets which differ from distractors in terms of color, luminance, orientation or texture can be perceived as salient [27, 25]. In fact, Nothdurft [26] has precisely quantified the dependence of saliency on orientation contrast in static displays. His work has shown that perceived target saliency increases with the orientation contrast between target and neighboring distractors. This increase is quite non-linear, exhibiting the threshold and saturation effects shown in Figure 1 (a), where we present curves of saliency as a function of orientation contrast between target and distractors for three levels of distractor homogeneity [26]. The relationship between tracking reliability and target saliency can thus be characterized by manipulating orientation contrast and measuring the impact on tracking performance. If the saliency hypothesis for tracking holds, saliency and tracking reliability should be equivalent functions of orientation contrast. 
In particular, increasing orientation contrast between target and distractors should\nresult in a non-linear increase of tracking reliability, with threshold and saturation effects similar to\nthose observed by Nothdurft.\n\nMethod 12 subjects (8 male and 4 female) in the age range 21-35 participated in the study. The\nexperimental setting was adapted from the work of Makovski and Jiang [23]. All displays had\nsize 26\u25e6 \u00d7 26\u25e6 (700 \u00d7 700 pixels) and consisted of 23 ellipses, all of color blue, against a black\nbackground. Each ellipse had a major axis of \u223c 0.56\u25e6 (15 pixels) and minor axis of \u223c 0.19\u25e6 (5\npixels). The orientation of the ellipses depended on the condition from which the trial was drawn.\nAt the start of a trial, one of the ellipses was designated as target (cued with a white bounding box).\nSubjects were instructed to track the target covertly, while \ufb01xating on a white square at the center of\nthe screen. On a keystroke, the ellipses started moving and continued to do so for \u223c 8-10 sec. At\nthe end of the trial, all ellipses were completely occluded by larger white disks and subjects asked to\nclick on the disk corresponding to the target. Each subject performed 30 trials under 7 conditions,\nfor a total of 210 trials, and no feedback was given on the accuracy of their selection.\n\nThe 7 conditions corresponded to different levels of orientation contrast between target and distractor\nellipses. Distractor orientation, de\ufb01ned by the major axis of the distractor ellipses, was always 0\u25e6.\nTarget orientation, determined by the major axis of the target ellipse, was selected from 7 values:\n0\u25e6, 10\u25e6, 20\u25e6, 30\u25e6, 40\u25e6, 60\u25e6 or 80\u25e6. This made orientation contrast equal to the target orientation.\nExample displays are shown in the attached supplement [1]. To keep all other variables (e.g. 
distance between items, motion patterns, distance from target to fixation square) identical, a trial was first created for one condition (target orientation 0°). The trials of all other conditions were obtained by applying a transformation to each frame of this video clip. This consisted of an affine transformation of the grid of ellipse centers, followed by the desired change in target orientation.

To study the effect of distractor heterogeneity [26], three versions of the experiment were conducted with different numbers of ellipses in the target orientation. In the first version, only one ellipse (the actual target) was in the target orientation. In this case, there was no distractor heterogeneity. In the second version, 18 of the 23 ellipses were in the distractor orientation, and the remaining 5 in the target orientation. One of the latter was the actual target. Finally, in the third version, 13 ellipses were in the distractor and 10 in the target orientation, for the largest degree of distractor heterogeneity.

Results and Discussion Figure 1 (b) shows the curves of tracking accuracy vs. orientation contrast obtained in all three versions of the experiment. These curves are remarkably similar to Nothdurft's saliency curves, shown in (a). Again, there are 1) distinct threshold and saturation effects for tracking, with tracking accuracy saturating for orientation contrasts beyond 40°, and 2) decreasing tracking accuracy as distractor heterogeneity increases. The co-variation of tracking accuracy and saliency is illustrated in Figure 1 (c), where the two quantities are presented as a scatter plot. The correlation between the two variables is near perfect (r = 0.975). 
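The co-variation just reported is an ordinary Pearson correlation between the saliency values of Figure 1 (a) and the tracking accuracies of Figure 1 (b). As a minimal sketch, with hypothetical paired values standing in for the actual data points (which are in the paper's figures and not reproduced here), it would be computed as:

```python
import numpy as np

# Hypothetical paired values read off curves like Figure 1 (a) and (b);
# the actual measurements are in the paper's figures, not reproduced here.
saliency = np.array([5.0, 20.0, 45.0, 60.0, 70.0, 74.0, 75.0])
accuracy = np.array([0.62, 0.70, 0.85, 0.93, 0.97, 0.98, 0.98])

# Pearson correlation between perceived saliency and tracking accuracy
r = np.corrcoef(saliency, accuracy)[0, 1]
```

With any monotonically co-varying pair of curves such as these, r lands close to 1, consistent with the reported r = 0.975.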
In summary, tracking has a dependence on orientation contrast remarkably similar to that of saliency.

Figure 1: (a) saliency vs. orientation contrast (adapted from [26]); (b) human tracking success rate vs. orientation contrast; (c) scatter plot of saliency values from (a) vs. tracking accuracy from (b), r = 0.975; (d) model prediction: tracking success rate vs. orientation contrast for the network of Figure 3.

2.3 Experiment 3: The spatial structure of tracking
It is well known that bottom-up saliency mechanisms are based on spatially localized center-surround processing [16, 6]. Hence, the saliency hypothesis for tracking predicts that tracking performance depends only on distractors within a spatial neighborhood of the target. The results of Experiment 2 provide some evidence in support of this prediction, by showing that tracking performance depends on distractor heterogeneity. This implies that the visual content of the background affects human tracking performance. The open question is whether the effect of the background 1) is limited to a localized neighborhood of the target, or 2) extends to the entire field of view. This question motivated Experiment 3. 
In this experiment, the distance $d_{csd}$ between the target and the closest distractor of the same orientation, denoted the closest similar distractor (CSD), was controlled so that $d_{csd} = R$, where R is a parameter. This guaranteed that there were no distractors with the target orientation inside a neighborhood of radius R around it. By jointly varying this parameter and the amount of distractor heterogeneity, it is possible to test three hypotheses: (a) no surround region is involved in tracking: in this case, the rate of tracking success does not depend on the distractor heterogeneity at all; (b) the entire visual field affects tracking performance: in this case, for a fixed distractor heterogeneity, the rate of tracking success does not depend on R; (c) the effect of the surround is spatially localized: in this case, there is a critical radius $R_{critical}$ beyond which distractors have no influence on tracking performance. This implies that the rate of tracking success does not depend on distractor heterogeneity for $R > R_{critical}$. Experiment 2 already established that hypothesis (a) does not hold. Experiment 3 was designed to determine which of (b) and (c) holds.

Method 9 subjects (7 male and 2 female) in the age range 21-35 participated in the study. The target orientation was fixed at 40° for all stimuli. Two versions of the experiment were conducted, with two levels of distractor heterogeneity. As in Experiment 2, the first version used 18 (5) of the 23 ellipses in the distractor (target) orientation. In the second version, 13 ellipses were in the distractor and 10 in the target orientation. In both versions, the stimulus was produced with four values of average R (the average, over all frames in the video sequence, of the distance $d_{csd}$). Across the 4 conditions, this quantity was in the range 1.67° to 5.01° (about 45 to 135 pixels).

Results and Discussion Figure 2 (c) presents the rate of tracking success as a function of average R, for the two versions of the experiment. The tracking accuracy for the case where there is no distractor heterogeneity (no distractor with the target orientation) is also shown, as a flat line. Two main observations are worth noting. First, for a fixed (non-zero) amount of distractor heterogeneity, tracking performance always increases with R. This implies that it is easier to track when the CSD is farther from the target. Second, for large R tracking accuracy does not depend on distractor heterogeneity (it is nearly the same under the two heterogeneity conditions), converging to the accuracy observed when there is no distractor heterogeneity (Experiment 2). These observations support the conclusion that hypothesis (c) holds, i.e. tracking ability is influenced by a localized surround region, of radius $R_{critical} \approx 4°$. When similar distractors are kept out of this region, the degree of distractor heterogeneity has no effect on tracking performance.

In summary, the results of the human behavior experiments show that the first two predictions made by the saliency hypothesis for tracking hold. These predictions are that tracking reliability 1) is larger for salient than for non-salient targets (Experiment 1), and 2) depends on the defining variables of saliency, namely feature contrast and distractor heterogeneity, and replicates the dependence of saliency on these variables. This includes the threshold and saturation effects of the dependence of saliency on feature contrast (Experiment 2), and the spatially localized dependence of saliency on distractor heterogeneity (Experiment 3). Overall, these experiments provide strong evidence in support of the saliency hypothesis for tracking. 
We next consider the final prediction, which is that saliency and tracking can be implemented with common neural mechanisms.

3 Joint neural architecture for saliency and tracking

To construct a saliency-based, neurally plausible computational model for tracking, we start with the neural model proposed by [12] to compute saliency, identify the mechanisms required to extend it to perform tracking, and then show how these mechanisms can be implemented in a biologically plausible manner.

In [12], saliency is equated to optimal decision-making between two classes of visual stimuli, with label $C \in \{0, 1\}$: C = 1 for stimuli in a target class, and C = 0 for stimuli in a background class. The classes are defined in a center-surround manner where, at each location l, the target (background) class is that of the stimuli in a center (surround) window. The stimuli are not observed directly, but through projection onto a set of n features, with responses $\mathbf{Y}(l) = (Y_1(l), \ldots, Y_n(l))$. 
Figure 2: (a) and (b) Experiment 1: successful target tracking rate for targets that are (a) globally salient (pop-out), and (b) locally salient (do not pop out). (c) and (d) Experiment 3: the effect of the background on tracking performance - (c) tracking accuracy of human subjects for two versions of distractor heterogeneity, plotted as a function of the average target-similar distractor distance. Also shown, using a blue dashed line, is the tracking accuracy for the version with no similar distractors at a target orientation of 40° from Experiment 2. (d) model prediction for the same data, using the saliency-based model of Figure 3.

The saliency of location l is then equated to the expected accuracy of target/background classification, given the feature responses from the two classes, and can be written as:

$$S(l) = \frac{1}{n}\sum_k S_k(l), \qquad S_k(l) = E_{Y(l)}\{\gamma[P_{C(l)|Y_k(l)}(1|y)]\}, \qquad \gamma(x) = \begin{cases} x - \frac{1}{2}, & x \geq 0.5 \\ 0, & x < 0.5 \end{cases} \quad (1)$$

The saliency measure $S_k(l)$ is the expected confidence with which the feature response $Y_k(l)$ is assigned to the target class. $\gamma(x)$ is a nonlinearity that thresholds the posterior probability $P_{C(l)|Y_k(l)}(1|y)$, to prevent locations assigned to the background class by the Bayes decision rule ($P_{C(l)|Y_k(l)}(1|y) \leq 0.5$) from contributing to the saliency. 
This tunes the saliency measure to respond only to the presence of target stimuli, not to its absence. This definition of saliency was shown, in [12], to be computable using units that conform to the standard neurophysiological model of cells in visual cortex area V1 [5], when the features are bandpass filters (e.g. Gabor filters) extracted from static natural images. However, for the tracking task, the feature set Y for representing the target and background needs to contain spatiotemporal features that are tuned to the velocity of moving patterns. It can be shown that saliency for such velocity-tuned spatiotemporal features can be computed by combining the outputs of a set of V1-like units of [12], akin to the widely used approach for constructing models of MT cells from afferent V1 units [36, 33]. This enhanced network, illustrated in Figure 3, is equivalent to an MT neuron tuned to a particular velocity (see supplement [1]).

3.1 Neurophysiologically plausible feature selection

A key component of the saliency tracker of [22] is a feature selection procedure that continuously adapts the saliency measure of (1) to the target. The basic idea is to select, at each time step, the features in Y(l) that best discriminate between target (center) and background. This changes the saliency from a bottom-up identification of locations where center and surround differ, to a top-down identification of locations containing the target in the center and background in the surround. However, the procedure of [22] (based on feature ranking) is not biologically plausible. To derive a biologically plausible feature selection mechanism, we replace the saliency measure of (1) with a feature-weighted extension

$$S(l) = \sum_k \alpha_k S_k(l), \qquad \sum_k \alpha_k = 1 \quad (2)$$

where $\alpha_k$ is the weight given to the saliency of the kth feature channel. 
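As a concrete illustration, the measures of (1) and (2) can be sketched numerically. This is a minimal interpretation, not the authors' implementation: it assumes the per-feature posteriors $P_{C(l)|Y_k(l)}(1|y)$ are already available as maps (in the paper they come from center-surround feature statistics), and treats the pointwise value of $\gamma$ as the estimate of $S_k(l)$, rather than computing the full expectation of (1).

```python
import numpy as np

def gamma(x):
    # gamma(x) = x - 1/2 for x >= 0.5, and 0 otherwise, as in eq. (1)
    return np.where(x >= 0.5, x - 0.5, 0.0)

def saliency_map(posteriors, weights=None):
    """Combine per-feature saliency maps into S(l).

    posteriors: array of shape (n_features, H, W); entry [k, i, j] is an
        estimate of P(C(l)=1 | y_k) at location l = (i, j) (assumed given).
    weights: optional alpha_k summing to 1 (eq. (2)); uniform if None,
        which recovers the unweighted bottom-up measure of eq. (1).
    """
    n = posteriors.shape[0]
    if weights is None:
        weights = np.full(n, 1.0 / n)
    s_k = gamma(posteriors)  # pointwise estimate of S_k(l)
    # S(l) = sum_k alpha_k S_k(l): contract the feature axis
    return np.tensordot(weights, s_k, axes=1)
```

Locations whose posterior favors the background (below 0.5) contribute nothing, mirroring the thresholding role of $\gamma$.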
We associate a binary variable $F_k$ with each feature $Y_k$, such that $F_k = 1$ if and only if $Y_k$ is the most salient feature of the target. We then assume that, given the knowledge of which feature is most salient, target presence at location l is independent of the remaining feature responses, and so the posterior probability of target presence, given the observation of all features, is written as:

$$P_{C(l)|\mathbf{Y}(l),F_k}(1|\mathbf{y}, 1) = 2\gamma[P_{C(l)|Y_k(l)}(1|y)]. \quad (3)$$

This reflects a conservative strategy, where features cannot be considered salient unless they are individually discriminant for target presence. Given the location $l^*$ where the target has been detected, the posterior probability of feature saliency can then be computed by Bayes rule

$$P_{F_k|C(l^*)}(1|1) = \frac{P_{C(l^*)|F_k}(1|1)\, P_{F_k}(1)}{\sum_j P_{C(l^*)|F_j}(1|1)\, P_{F_j}(1)}, \quad (4)$$

where

$$P_{C(l^*)|F_k}(1|1) = \int P_{C(l^*)|\mathbf{Y}(l^*),F_k}(1|\mathbf{y}, 1)\, P_{\mathbf{Y}(l^*)|F_k}(\mathbf{y}|1)\, d\mathbf{y} \quad (5)$$
$$= \int 2\gamma[P_{C(l^*)|Y_k(l^*)}(1|y)]\, P_{Y_k(l^*)}(y)\, dy \quad \text{(using (3))}$$
$$= 2E_{Y_k(l^*)}\{\gamma[P_{C(l^*)|Y_k(l^*)}(1|y)]\} = 2S_k(l^*), \quad (6)$$

and the last equality follows from (1). Using (6) in (4), we get

$$P_{F_k|C(l^*)}(1|1) = \frac{S_k(l^*)\, P_{F_k}(1)}{\sum_j S_j(l^*)\, P_{F_j}(1)}. \quad (7)$$

Under reasonable assumptions of persistence of the dominant features in the target, this analysis can be extended over time. Denoting the state of $F_k$ and $l^*$ at time t by $F^t_k$ and $l^*_t$, and the sequence of target locations till time t by $\mathbf{l}^*_t = (l^*_t, l^*_{t-\tau}, \ldots, l^*_0)$, we get the recursion (see [1])

$$P_{F^t_k|C(\mathbf{l}^*_t)}(1|1) = \frac{S_k(l^*_t)\, P_{F^{t-\tau}_k|C(\mathbf{l}^*_{t-\tau})}(1|1)}{\sum_j S_j(l^*_t)\, P_{F^{t-\tau}_j|C(\mathbf{l}^*_{t-\tau})}(1|1)}. \quad (8)$$

Hence, the posterior probability of feature k being the most salient at time t, given that the target is at $l^*_t$, is computed by divisively normalizing a weighted version of $S_k(l^*_t)$, the bottom-up saliency of feature k at $l^*_t$, by the total saliency summed over all features. The weight applied to the saliency of each feature (corresponding to $\alpha_k$ in (2)) is the posterior probability of the feature being the most salient at time $t - \tau$. Therefore the posterior at time $t - \tau$ is fed back, with a delay, to become the weight at time t. This enhances the most salient features, suppressing the non-salient ones, and is equivalent to applying a soft thresholding to select only the dominant features.

This feature selection mechanism, involving selective enhancement and suppression of features operating on the outputs of the MT stage, bears a close resemblance to the phenomenon of feature-based attention [24]. In fact, the proposed approach to feature selection is similar to previously proposed biologically plausible models of feature-based attention that rely on a Bayesian formulation and include divisive normalization [30, 31, 20, 7]. Further, neurophysiological studies have found evidence of feature-based attention in the lateral intraparietal (LIP) area [3]. LIP is also known to have cortico-cortical connections to area MT [21], and attentional control is thought to be fed back from LIP to MT [35]. Studies also suggest that LIP might be computing a priority map that combines both bottom-up inputs and top-down signals, and that the peak of this map response is used to guide visual attention [3]. 
These \ufb01ndings are compatible with the feature selection approach of (8), and\ntherefore area LIP is a plausible candidate location for the feature selection stage of our model.\n\n3.2 Neurophysiologically plausible discriminant tracker\n\nA neurophysiologically plausible version of the discriminant tracker of [22] can be constructed with\nthe discriminant saliency measure of (1), and the feature selection mechanism of (8). As in [22],\nin the absence of top-level information regarding the target, initialization and target acquisition can\nbe treated as discrimination between the visual stimulus contained in a pair of center (target) and\nsurround windows, at every location of the visual \ufb01eld. In this case, there is no explicit top-down\nguidance about the object to recognize, and the saliency of location l is measured by the saliency\nof all unmodulated feature responses. This consists of using the bottom-up saliency measure of (2)\nwith \u03b1k = PF 0\n(1) is a uniform prior for feature selection, at time t = 0. The\noutputs of all features or neurons are then summed with equal weights to produce a \ufb01nal saliency\nmap. The peak of this map represents the location which is most distinct from its surround, based\non the responses of the motion sensitive spatio-temporal features, and becomes the target. Spatial\nattention is then shifted to the peak of this map.\n\n(1), where PF 0\n\nk\n\nk\n\nOnce the initial target location is attended, the feature selection mechanism modulates the saliency\nresponse of the individual feature channels, using the weights of (8). The \ufb01nal saliency value at\nthat location also becomes the normalizing constant for the divisive normalization of (8). These\nfeature weights are fed back to MT neurons, where each feature map is enhanced or attenuated\ndepending on the corresponding feature weight. This enhances the features that are salient for\ntarget detection, and suppresses the non-salient ones. 
LIP also feeds back the retinotopic weight map corresponding to spatial attention, causing a suppression of feature responses everywhere other than in a neighborhood of the current locus of attention.

Figure 3: The network for tracking using feature selection. The discriminant saliency network of [12] is used to construct a model for an MT neuron. Feature selection, performed possibly in area LIP with weights fed back to MT, is achieved by the modulation of the response of each feature channel by its saliency value after divisive normalization across features.

After the latency due to feedback, say at time t + \tau, the new feature weights and spatial weights modulate the feature maps, which are again fed forward to LIP, where the updated saliency map is computed by simple summation. The top-down saliency of location l at time t + \tau is then given by

S^{td}(l) = \sum_j S^{td}_j(l) = \sum_j S_j(l)\, P_{F^t_j|C(l^*_t)}(1|1),    (9)

where S_j(l) is the modulated saliency response of the jth feature. Spatial attention suppresses all but a neighborhood of the last known target location l^*_t, and feature-based attention suppresses all features except those present in the target and discriminative with respect to the background. Therefore, the peak of the new saliency map corresponds to the position that best resembles the target at time t + \tau, and attention is shifted to that position,

l^*_{t+\tau} = \arg\max_l S^{td}(l).    (10)

The process is iterated, so as to track the target over time, as in [22]. The entire tracking network is shown in Figure 3. The computation, in V1, of S_{Z_j}(l) is implemented with the bottom-up network of [12]. V1 outputs are then linearly combined with weights w_{jk} (described in supplement [1]) to obtain the MT responses S_k(l).
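One iteration of this loop can be sketched as follows. This is a hypothetical NumPy rendering, not the paper's code: the square gating window and its `radius` parameter are illustrative stand-ins for the spatial attention feedback, and `feature_maps` stands in for the per-channel MT responses S_k.

```python
import numpy as np

def track_step(feature_maps, weights, prev_loc, radius=2):
    """One tracking iteration, sketched from (8)-(10): gate the per-channel
    saliency maps around the last target location (spatial attention), sum
    them weighted by the feature posteriors (9), shift attention to the
    peak (10), and update the posteriors as in (8)."""
    maps = np.asarray(feature_maps)                 # shape (K, H, W)
    K, H, W = maps.shape
    yy, xx = np.mgrid[0:H, 0:W]
    gate = (np.abs(yy - prev_loc[0]) <= radius) & (np.abs(xx - prev_loc[1]) <= radius)
    gated = maps * gate                             # suppress far-away responses
    S_td = np.tensordot(weights, gated, axes=1)     # top-down saliency map of (9)
    new_loc = np.unravel_index(np.argmax(S_td), S_td.shape)   # attention shift (10)
    w = weights * gated[:, new_loc[0], new_loc[1]]  # weight update of (8)
    return new_loc, w / w.sum()

# toy frame: the target channel peaks near the previous location (2, 3);
# a strong distractor at (5, 5) lies outside the attended neighborhood
maps = np.zeros((2, 6, 6))
maps[0, 2, 2], maps[1, 2, 2], maps[1, 5, 5] = 1.0, 0.25, 2.0
loc, w = track_step(maps, np.array([0.5, 0.5]), prev_loc=(2, 3))
```

In this toy run the spatial gate rejects the stronger distractor peak, attention shifts to the nearby target, and the weight on the target-discriminative channel is boosted for the next iteration.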
The remaining operations, possibly in LIP, compute the probabilities of (8) and the top-down saliency map of (9).

4 Validation of joint architecture

We applied the network of Figure 3 to the sequences used in Experiment 1. Representative frames of the tracking results on the displays of the experiment, along with videos, are available from [1]. The model replicates the trend observed in both versions of the experiment, accurately tracking the target in the salient conditions and losing track in the non-salient condition.

The results of applying the network to the stimuli of Experiments 2 and 3 are shown in Figures 1(d) and 2(d), respectively. The model predictions accurately match the trend observed in all three versions of Experiment 2. The model also predicts the effect of background seen in Experiment 3.

5 Conclusion

We provide the first verifiable evidence for a connection between saliency and tracking that was earlier only hypothesized [22]. In particular, we show that three main predictions of the hypothesis hold. First, using psychophysics experiments, we show that tracking requires discrimination between target and background using a center-surround mechanism, and that tracking reliability and saliency have a common dependence on feature contrast and distractor heterogeneity. Next, we construct a tracking model starting from a neurally plausible architecture to compute saliency, and show that it can be implemented with widely accepted models of cortical computation. Specifically, the model is based on a feature selection mechanism akin to the well-known phenomenon of feature-based attention in MT. Finally, we show that the tracking model accurately replicates all our psychophysics results.

References

[1] See attached supplementary material.
[2] S. Avidan. Ensemble tracking. IEEE PAMI, 29(2):261–271, 2007.
[3] J. Bisley & M.
Goldberg, \u201cAttention, intention, & priority in the parietal lobe,\u201d Annu. Rev. Neurosci, 33, p.\n\n1\u201321, 2010.\n\n[4] D. H. Brainard. The psychophysics toolbox. Spatial Vision, 10:433\u2013436, 1997.\n[5] M. Carandini et al., Do we know what the early visual system does? J. Neuroscience, 25, 2005.\n[6] J. Cavanaugh, W. Bair, & J. Movshon. Nature & interaction of signals from the receptive \ufb01eld center and\n\nsurround in macaque V1 neurons. J. Neurophysiol., 88:2530\u20132546, 2002.\n\n[7] S. Chikkerur, et al., What & where: A Bayesian inference theory of attention. Vision Research, 2010.\n[8] R. Collins, Y. Liu, & M. Leordeanu. On-line selection of discriminative tracking features. IEEE PAMI,\n\n27(10):1631 \u2013 1643, October 2005.\n\n[9] D. Comaniciu, V. Ramesh, & P. Meer. Kernel-based object tracking. IEEE PAMI, 25(5):564\u2013577, 2003.\n[10] J. C. Culham, et al., Cortical fmri activation produced by attentive tracking of moving targets. J. Neuro-\n\nphysiol, 80(5):2657\u20132670, 1998.\n\n[11] D. Gao, V. Mahadevan, & N. Vasconcelos. On the plausibility of the discriminant center-surround hy-\n\npothesis for visual saliency. Journal of Vision, 8(7):1\u201318, 6 2008.\n\n[12] D. Gao & N. Vasconcelos. Decision-theoretic saliency: computational principle, biological plausibility,\n\n& implications for neurophysiology & psychophysics. Neural Computation, 21:239\u2013271, Jan 2009.\n\n[13] H. Grabner & H. Bischof. On-line boosting & vision. IEEE CVPR, 1:260\u2013267, 2006.\n[14] J. Intriligator & P. Cavanagh. The spatial resolution of visual attention. Cog. Psych., 43:171\u2013216, 1997.\n[15] M. Isard & A. Blake. Condensation: conditional density propagation for visual tracking. IJCV, 29, 1998.\n[16] L. Itti et al., A model of saliency-based visual attention for rapid scene analysis.\nIEEE PAMI,\n\n20(11):1254\u20131259, 1998.\n\n[17] A. D. Jepson et al., Robust online appearance models for visual tracking. IEEE PAMI, 25(10), 2003.\n[18] D. 
Kahneman, A. Treisman, & B. J. Gibbs. The reviewing of object files: Object-specific integration of information. Cognitive Psychology, 24(2):175–219, 1992.
[19] Y. Kazanovich & R. Borisyuk. An oscillatory neural model of multiple object tracking. Neural Computation, 18(6):1413–1440, 2006.
[20] J. Lee & J. Maunsell. A normalization model of attentional modulation of single unit responses. PLoS One, 4(2), 2009.
[21] J. Lewis & D. Van Essen. Corticocortical connections of visual, sensorimotor, and multimodal processing areas in the parietal lobe of the macaque monkey. J. Comparative Neurol., 428(1):112–137, 2000.
[22] V. Mahadevan & N. Vasconcelos. Saliency-based discriminant tracking. IEEE CVPR, 2009.
[23] T. Makovski & Y. Jiang. Feature binding in attentive tracking of distinct objects. Visual Cognition, 17(1):180–194, 2009.
[24] J. Maunsell & S. Treue. Feature-based attention in visual cortex. Trends in Neurosci., 29(6), 2006.
[25] H. C. Nothdurft. Texture segmentation and pop-out from orientation contrast. Vision Research, 31(6):1073–1078, 1991.
[26] H. C. Nothdurft. The conspicuousness of orientation and motion contrast. Spatial Vision, 7:341–363, 1993.
[27] H. C. Nothdurft. Salience from feature contrast: additivity across dimensions. Vision Research, 40:1183–1201, 2000.
[28] L. Oksama & J. Hyönä. Is multiple object tracking carried out automatically by an early vision mechanism independent of higher-order cognition? Visual Cognition, 11(5):631–671, 2004.
[29] Z. W. Pylyshyn & R. W. Storm. Tracking multiple independent targets: evidence for a parallel tracking mechanism. Spatial Vision, 3(3):179–197, 1988.
[30] R. Rao. Bayesian inference and attentional modulation in the visual cortex. Neuroreport, 16(16), 2005.
[31] J. Reynolds & D. Heeger. The normalization model of attention. Neuron, 61(2):168–185, 2009.
[32] D.
Ross et al. Incremental learning for robust visual tracking. IJCV, 77(1-3):125–141, 2008.
[33] N. Rust et al. How MT cells analyze the motion of visual patterns. Nat. Neurosci., 9(11), 2006.
[34] H. Sakata, H. Shibutani, & K. Kawano. Functional properties of visual tracking neurons in posterior parietal association cortex of the monkey. J. Neurophysiol., 49(6):1364–1380, 1983.
[35] Y. Saalmann, I. Pigarev, & T. Vidyasagar. Neural mechanisms of visual attention: how top-down feedback highlights relevant locations. Science, 316(5831):1612, 2007.
[36] E. Simoncelli & D. Heeger. A model of neuronal responses in visual area MT. Vision Research, 38(5):743–761, 1998.
[37] E. Vul et al. Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model. NIPS, 22:1955–1963, 2009.
[38] A. Yilmaz, O. Javed, & M. Shah. Object tracking: A survey. ACM Computing Surveys, 38(4):13, 2006.