{"title": "A Computational Model of Eye Movements during Object Class Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 1609, "page_last": 1616, "abstract": "", "full_text": "A Computational Model of Eye Movements\n\nduring Object Class Detection\n\nWei Zhang\u2020\n\nHyejin Yang\u2021\u2217\n\nDimitris Samaras\u2020\n\nGregory J. Zelinsky\u2020\u2021\n\nDept. of Computer Science\u2020\n\nDept. of Psychology\u2021\n\nState University of New York at Stony Brook\n\nStony Brook, NY 11794\n\n{wzhang,samaras}@cs.sunysb.edu\u2020\n\nhjyang@ic.sunysb.edu\u2217\n\nGregory.Zelinsky@stonybrook.edu\u2021\n\nAbstract\n\nWe present a computational model of human eye movements in an ob-\nject class detection task. The model combines state-of-the-art computer\nvision object class detection methods (SIFT features trained using Ad-\naBoost) with a biologically plausible model of human eye movement to\nproduce a sequence of simulated \ufb01xations, culminating with the acqui-\nsition of a target. We validated the model by comparing its behavior to\nthe behavior of human observers performing the identical object class\ndetection task (looking for a teddy bear among visually complex non-\ntarget objects). We found considerable agreement between the model\nand human data in multiple eye movement measures, including number\nof \ufb01xations, cumulative probability of \ufb01xating the target, and scanpath\ndistance.\n\n1. Introduction\n\nObject detection is one of our most common visual operations. Whether we are driving [1],\nmaking a cup of tea [2], or looking for a tool on a workbench [3], hundreds of times each\nday our visual system is being asked to detect, localize, or acquire through movements of\ngaze objects and patterns in the world.\n\nIn the human behavioral literature, this topic has been extensively studied in the context of\nvisual search. In a typical search task, observers are asked to indicate, usually by button\npress, whether a speci\ufb01c target is present or absent in a visual display (see [4] for a review).\nA primary manipulation in these studies is the number of non-target objects also appearing\nin the scene. A bedrock \ufb01nding in this literature is that, for targets that cannot be de\ufb01ned\nby a single visual feature, target detection times increase linearly with the number of non-\ntargets, a form of clutter or \u201dset size\u201d effect. Moreover, the slope of the function relating\ndetection speed to set size is steeper (by roughly a factor of two) when the target is absent\nfrom the scene compared to when it is present. Search theorists have interpreted these \ufb01nd-\nings as evidence for visual attention moving serially from one object to the next, with the\nhuman detection operation typically limited to those objects \ufb01xated by this \u201dspotlight\u201d of\nattention [5].\n\nObject class detection has also been extensively studied in the computer vision community,\n\n\fwith faces and cars being the two most well researched object classes [6, 7, 8, 9]. The\nrelated but simpler task of object class recognition (target recognition without localization)\nhas also been the focus of exciting recent work [10, 11, 12]. Both tasks use supervised\nlearning methods to extract visual features. 
Scenes are typically realistic and highly cluttered, with object appearance varying greatly due to illumination, view, and scale changes. The task addressed in this paper falls between the class detection and recognition problems. Like object class detection, we will be detecting and localizing class-defined targets; unlike object class detection, the test images will be composed of at most 20 objects appearing on a simple background.

Both the behavioral and computer vision literatures have strengths and weaknesses when it comes to understanding human object class detection. The behavioral literature has accumulated a great deal of knowledge regarding the conditions affecting object detection [4], but this psychology-based literature has been dominated by the use of simple visual patterns and models that cannot be easily generalized to fully realistic scenes (see [13, 14] for notable exceptions). Moreover, this literature has focused almost entirely on object-specific detection, cases in which the observer knows precisely how the target will appear in the test display (see [15] for a discussion of target-nonspecific search using featurally complex objects). Conversely, the computer vision literature is rich with models and methods allowing for the featural representation of object classes and the detection of these classes in visually cluttered real-world scenes, but none of these methods have been validated as models of human object class detection by comparison to actual behavioral data.

The current study draws upon the strengths of both of these literatures to produce the first joint behavioral-computational study of human object class detection. First, we use an eyetracker to quantify human behavior in terms of the number of fixations made during an object class detection task. Then we introduce a computational model that not only performs the detection task at a level comparable to that of the human observers, but also generates a sequence of simulated eye movements similar in pattern to those made by humans performing the identical detection task.

2. Experimental methods

An effort was made to keep the human and model experiments methodologically similar. Both experiments used training, validation (practice trials in the human experiment), and testing phases, and identical images were presented to the model and human subjects in all three of these phases. The target class consisted of 378 teddy bears scanned from [16]. Nontargets consisted of 2,975 objects selected from the Hemera Photo Objects Collection. Samples of the bear and nontarget objects are shown in Figure 1. All objects were normalized to have a bounding box area of 8,000 pixels, but were highly variable in appearance.

Figure 1: Representative teddy bears (left) and nontarget objects (right).

The training set consisted of 180 bears and 500 nontargets, all randomly selected. In the case of the human experiment, each of these objects was shown centered on a white background and displayed for 1 second. The testing set consisted of 180 new bears and nontargets. No objects were repeated between training and testing, and no objects were repeated within either the training or testing phase. Test images depicted 6, 13, or 20 color objects randomly positioned on a white background. A single bear was present in half (90) of these displays. Human subjects were instructed to indicate, by pressing a button, whether a teddy bear appeared among the displayed objects. Target presence and set size were randomly interleaved over trials. Each test trial in the human experiment began with the subject fixating gaze at the center of the display, and eye position was monitored throughout each trial using an eyetracker. Eight students from Stony Brook University participated in the experiment.
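As a concrete illustration of this design, the following is a minimal sketch of how such test displays might be composed: each object is rescaled to an 8,000-pixel bounding-box area and dropped at a non-overlapping random position, with 90 target-present and 90 target-absent trials interleaved across the three set sizes. The canvas size, object-size range, and all names are illustrative assumptions, not details from the authors' code.

```python
# Hypothetical sketch of the display-composition logic described above.
import random

CANVAS_W, CANVAS_H = 1280, 960     # assumed display resolution
TARGET_AREA = 8000                 # bounding-box area used in the paper

def normalized_size(w, h, area=TARGET_AREA):
    """Scale a w x h bounding box so that its area equals `area`."""
    s = (area / float(w * h)) ** 0.5
    return max(1, round(w * s)), max(1, round(h * s))

def overlaps(box, boxes):
    x, y, w, h = box
    return any(x < bx + bw and bx < x + w and y < by + bh and by < y + h
               for bx, by, bw, bh in boxes)

def compose_trial(set_size, target_present, rng=random):
    """Return a list of (is_target, x, y, w, h) object placements."""
    placements, boxes = [], []
    kinds = [True] * int(target_present) + [False] * (set_size - int(target_present))
    rng.shuffle(kinds)
    for is_target in kinds:
        w, h = normalized_size(rng.randint(60, 160), rng.randint(60, 160))
        for _ in range(1000):                       # rejection sampling
            x, y = rng.randrange(CANVAS_W - w), rng.randrange(CANVAS_H - h)
            if not overlaps((x, y, w, h), boxes):
                break
        boxes.append((x, y, w, h))
        placements.append((is_target, x, y, w, h))
    return placements

# 90 target-present and 90 target-absent trials, set sizes interleaved:
trials = [(n, p) for n in (6, 13, 20) for p in (True, False) for _ in range(30)]
random.shuffle(trials)
```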
3. Model of eye movements during object class detection

Figure 2: The flow of processing through our model.

Building on a framework described in [17, 14, 18], our model can be broadly divided into three stages (Figure 2): (1) creating a target map based on a retinally-transformed version of the input image, (2) recognizing the target using thresholds placed on the target map, and (3) the operations required in the generation of eye movements. The following subsections describe each of the Figure 2 steps in greater detail.

3.1. Retina transform

With each change in gaze position (set initially to the center of the image), our model transforms the input image so as to reflect the acuity limitations imposed by the human retina. We used the method described in [19, 20], which was shown to provide a close approximation to human acuity limitations, to implement this dynamic retina transform.
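The Geisler-Perry transform itself is described in [19, 20]; the sketch below only approximates its effect, blending a stack of progressively blurred copies of the image as a function of eccentricity from the current gaze point. The pyramid depth and the eccentricity-to-blur mapping are illustrative assumptions, not the authors' implementation.

```python
# Approximate foveation: acuity falls off with eccentricity from gaze.
import numpy as np
from scipy.ndimage import gaussian_filter

def retina_transform(img, gaze_xy, levels=5, falloff=0.15):
    """img: 2-D grayscale array; gaze_xy: (x, y) fixation in pixels."""
    h, w = img.shape
    blurred = [img.astype(float)]
    for k in range(1, levels):
        blurred.append(gaussian_filter(img.astype(float), sigma=2.0 ** k))
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(xs - gaze_xy[0], ys - gaze_xy[1])   # eccentricity (px)
    # Map eccentricity to a fractional pyramid level in [0, levels-1].
    level = np.clip(falloff * ecc / 10.0, 0, levels - 1 - 1e-6)
    lo = level.astype(int)
    frac = level - lo
    stack = np.stack(blurred)                          # (levels, h, w)
    out = (1 - frac) * np.take_along_axis(stack, lo[None], 0)[0] \
          + frac * np.take_along_axis(stack, (lo + 1)[None], 0)[0]
    return out
```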
3.2. Create target map

Each point on the target map ranges in value between 0 and 1 and indicates the likelihood that a target is located at that point. To create the target map, we first compute interest points on the retinally-transformed image (see Section 3.2.2), then compare the features surrounding these points to features of the target object class extracted during training. Two types of discriminative features were used in this study: color features and texture features.

3.2.1. Color features

Color has long been used as a feature for object instance recognition [21]. In our study we explore the potential use of color as a discriminative feature for an object class. Specifically, we used a normalized color histogram of pixel hues in HSV space. Because backgrounds in our images were white and therefore uninformative, we set thresholds on the saturation and brightness channels to remove these points. The hue channel was evenly divided into 11 bins and each pixel's hue value was assigned to one of these bins using binary interpolation. Values within each bin were weighted by 1 − d, where d is the normalized unit distance to the center of the bin. The final color histogram was normalized to be a unit vector.

Given a test image I_t and its color feature H_t, we compute the distances between H_t and the color features of the training set {H_i, i = 1, ..., N}. The test image is labeled by its nearest neighbor in the training set, l(I_t) = l(I_{arg min_{1≤i≤N} χ²(H_t, H_i)}), where the distance metric is

    χ²(H_t, H_i) = Σ_{k=1}^{K} [H_t(k) − H_i(k)]² / [H_t(k) + H_i(k)],

with K the number of bins.
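A minimal sketch of this color pathway under the stated definitions (11 hue bins, soft assignment weighted by 1 − d, unit-length histogram, nearest-neighbor labeling under χ²). The saturation/brightness cutoffs are assumptions; the paper thresholds these channels but does not report values.

```python
import numpy as np

N_BINS = 11

def hue_histogram(hsv, s_min=0.1, v_min=0.1):
    """hsv: (h, w, 3) float array with channels in [0, 1]."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    hue = h[(s > s_min) & (v > v_min)].ravel()       # drop white background
    pos = hue * N_BINS                                # fractional bin position
    lo = np.floor(pos - 0.5).astype(int) % N_BINS     # lower of the two bins
    frac = (pos - 0.5) - np.floor(pos - 0.5)
    hist = np.zeros(N_BINS)
    np.add.at(hist, lo, 1.0 - frac)                   # weight by 1 - d
    np.add.at(hist, (lo + 1) % N_BINS, frac)
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist                # unit vector

def chi2(ht, hi, eps=1e-12):
    """Chi-squared distance between two histograms."""
    return np.sum((ht - hi) ** 2 / (ht + hi + eps))

def color_label(h_test, train_hists, train_labels):
    """Nearest-neighbor label under the chi-squared distance."""
    d = [chi2(h_test, hi) for hi in train_hists]
    return train_labels[int(np.argmin(d))], min(d)
```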
3.2.2. Texture features

Local texture features were extracted from the gray-level images during both training and testing. To do this, we first used a Difference-of-Gaussians (DoG) operator to detect interest points in the image, then used a Scale Invariant Feature Transform (SIFT) descriptor to represent features at each of the interest point locations. SIFT features consist of a histogram representation of the gradient orientation and magnitude information within a small image patch surrounding a point [22].

AdaBoost is a feature selection method that produces a very accurate prediction rule by combining relatively inaccurate rules of thumb [23]. Following the method described in [11, 12], we used AdaBoost during training to select a small set of SIFT features from among all the SIFT features computed for each sample in the training set. Specifically, each training image I_i was represented by a set of SIFT features {F_{i,j}, j = 1, ..., n_i}, where n_i is the number of SIFT features in sample I_i. To select features from this set, AdaBoost first initialized the weights of the training samples to w_i = 1/(2N_p) for positive samples and w_i = 1/(2N_n) for negative samples, where N_p and N_n are the number of positive and negative samples, respectively. For each round of AdaBoost, we then selected one feature as a weak classifier and updated the weights of the training samples. Details regarding the algorithm used for each round of boosting can be found in [12]. Eventually, T features were chosen having the best ability to discriminate the target object class from the nontargets. Each of these selected features forms a weak classifier h_k consisting of three components: a feature vector f_k, a distance threshold θ_k, and an output label u_k. Only the features from the positive training samples are used as weak classifiers. For each feature vector F, we compute its distance to training sample i, defined as d_i = min_{1≤j≤n_i} D(F_{i,j}, F), and then apply the classification rule:

    h(f, θ) = 1 if d < θ, and 0 if d ≥ θ.    (1)

After the desired number of weak classifiers has been found, the final strong classifier is defined as

    H = Σ_{t=1}^{T} α_t h_t,    (2)

where α_t = log(1/β_t), β_t = sqrt(ε_t / (1 − ε_t)), and ε_t is the classification error of round t computed over the weighted samples, ε_t = Σ_k |u_k − l_k|.

3.2.3. Validation

A validation set, consisting of the practice trials viewed by the human observers, was used to set parameters in the model. Because our model used two types of features, each having different classifiers with different outputs, some weight for combining these classifiers was needed. The validation set was used to set this weighting.

The output of the color classifier, normalized to unit length, was based on the distance χ²_min = min_{1≤i≤N} χ²(H_t, H_i) and defined as:

    C_color = 0 if l(I_t) = 0, and f(χ²_min) if l(I_t) = 1,    (3)

where f(χ²_min) is a function monotonically decreasing with respect to χ²_min. The strong local texture classifier, C_texture (Equation 2), also had normalized unit output.

The weights of the two classifiers were determined by their classification errors, ε_c and ε_t, on the validation set:

    W_color = ε_t / (ε_c + ε_t),  W_texture = ε_c / (ε_c + ε_t),    (4)

so that each classifier is weighted by the error of the other, and the more accurate classifier receives the larger weight. The final combined output was used to generate the values in the target map and, ultimately, to guide the model's simulated eye movements.
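The following is a simplified sketch of this selection loop in the spirit of Equations (1) and (2): every candidate SIFT feature defines a weak classifier through the min-distance rule, and each round greedily keeps the feature/threshold pair with the lowest weighted error. The exact per-round update in the paper follows [12]; the version here is a generic discrete-AdaBoost stand-in, not the authors' implementation.

```python
import numpy as np

def min_dist(sample_feats, f):
    """d_i = min_j ||F_{i,j} - f|| for one training sample."""
    return min(np.linalg.norm(fj - f) for fj in sample_feats)

def boost(samples, labels, candidates, T=50):
    """samples: list of lists of SIFT vectors; labels: 0/1 per sample."""
    labels = np.asarray(labels)
    n_p, n_n = labels.sum(), (1 - labels).sum()
    w = np.where(labels == 1, 1.0 / (2 * n_p), 1.0 / (2 * n_n))
    # Precompute min-distances from every sample to every candidate feature.
    D = np.array([[min_dist(s, f) for f in candidates] for s in samples])
    strong = []
    for _ in range(T):
        w = w / w.sum()
        best = None
        for k in range(len(candidates)):
            for theta in np.unique(D[:, k]):
                pred = (D[:, k] < theta).astype(int)
                err = np.dot(w, pred != labels)        # weighted error
                if best is None or err < best[0]:
                    best = (err, k, theta)
        err, k, theta = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        beta = np.sqrt(err / (1 - err))
        alpha = np.log(1.0 / beta)
        pred = (D[:, k] < theta).astype(int)
        w = w * np.where(pred == labels, beta, 1.0)    # shrink correct samples
        strong.append((alpha, k, theta))
    return strong  # list of (alpha_t, feature index, theta_t)
```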
3.3. Recognition

We define the highest-valued point on the target map as the hotspot. Recognition is accomplished by comparing the hotspot to two thresholds, also set through validation. If the hotspot value exceeds the high target-present threshold, then the object is recognized as an instance of the target class. If the hotspot value falls below the target-absent threshold, then the object is classified as not belonging to the target class. Through validation, the target-present threshold was set to yield a low false positive rate and the target-absent threshold was set to yield a high true positive rate. Moreover, target-present judgments were permitted only if the hotspot was fixated by the simulated fovea. This constraint was introduced to avoid extremely high false positive rates stemming from the creation of false targets in the blurred periphery of the retina-transformed image.

3.4. Eye movement

If neither the target-present nor the target-absent threshold is satisfied, processing passes to the eye movement stage of our model. If the simulated fovea is not on the hotspot, the model makes an eye movement that moves gaze steadily toward the hotspot location. Fixation in our model is defined as the centroid of activity on the target map, a computation consistent with a neuronal population code. Eye movements are made by thresholding this map over time, pruning off values that offer the least evidence for the target. Eventually, this thresholding operation causes the centroid of the target map to pass an eye movement threshold, resulting in a gaze shift to the new centroid location. See [18] for details regarding the eye movement generation process. If the simulated fovea does acquire the hotspot and the target-present threshold is still not met, the model assumes that a nontarget was fixated and this object is "zapped". Zapping consists of applying a negative Gaussian filter to the hotspot location, thereby preventing attention and gaze from returning to this object (see [24] for a previous computational implementation of a conceptually related operation).
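A minimal sketch of this fixation loop, assuming a 2-D target map: gaze rests at the map's centroid, weak evidence is pruned until the centroid moves past a saccade threshold, and rejected hotspots are suppressed with a negative Gaussian. All constants are illustrative choices; see [18] for the actual generation process.

```python
import numpy as np

def centroid(tmap):
    """Center of mass of the target map (population-code fixation point)."""
    ys, xs = np.mgrid[0:tmap.shape[0], 0:tmap.shape[1]]
    total = tmap.sum() + 1e-12
    return np.array([(xs * tmap).sum() / total, (ys * tmap).sum() / total])

def zap(tmap, xy, sigma=20.0):
    """Suppress the region around a rejected hotspot (inhibition of return)."""
    ys, xs = np.mgrid[0:tmap.shape[0], 0:tmap.shape[1]]
    g = np.exp(-((xs - xy[0]) ** 2 + (ys - xy[1]) ** 2) / (2 * sigma ** 2))
    return np.clip(tmap - g * tmap.max(), 0, None)

def next_fixation(tmap, gaze, saccade_thresh=30.0, prune_step=0.02):
    """Prune weak evidence until the centroid moves past the threshold."""
    tmap = tmap.copy()
    floor = 0.0
    while tmap.sum() > 0:
        c = centroid(tmap)
        if np.linalg.norm(c - gaze) > saccade_thresh:
            return c, tmap                     # saccade to the new centroid
        floor += prune_step
        tmap = np.where(tmap < floor, 0.0, tmap)
    return np.asarray(gaze, float), tmap       # map exhausted; hold fixation
```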
4. Experimental results

Model and human behavior were compared on a variety of measures, including error rates, number of fixations, cumulative probability of fixating the target, and scanpath distance (a measure of how directly gaze moved to the target). For each measure, the model and human data were in reasonable agreement.

Table 1: Error rates for model and human subjects.

          Total trials   Misses (freq. / rate)   False positives (freq. / rate)
  Human   1440           46 / 3.2%               14 / 1.0%
  Model   180            7 / 3.9%                4 / 2.2%

Table 1 shows the error rates for the human subjects and the model, grouped by misses and false positives. Note that the data from all eight human subjects are shown, resulting in the greater number of total trials. There are two key patterns. First, despite the very high level of accuracy exhibited by the human subjects in this task, our model was able to achieve comparable levels of accuracy. Second, and consistent with the behavioral search literature, miss rates were larger than false positive rates for both the humans and the model.

To the extent that our model offers an accurate account of human object detection behavior, it should be able to predict the average number of fixations made by human subjects in the detection task. As indicated in Table 2, this is indeed the case. Data are grouped by target-present (p), target-absent (a), and the number of objects in the scene (6, 13, 20). In all conditions, the model and human subjects made comparable numbers of fixations. Also consistent with the behavioral literature, the average number of fixations made by human subjects in our task increased with the number of objects in the scene, and the rate of this increase was greater in the target-absent data than in the target-present data. Both of these patterns are also present in the model data. The fact that our model is able to capture an interaction between set size and target presence in terms of the number of fixations needed for detection lends support to our method.

Table 2: Average number of fixations by model and human.

          Target-present                  Target-absent
          p6     p13    p20    slope      a6     a13    a20     slope
  Human   3.38   3.74   4.88   0.11       4.89   7.23   9.39    0.32
  Model   2.86   3.69   5.68   0.20       3.97   8.30   10.47   0.46

Figure 3: Cumulative probability of target fixation by model and human.

Figure 3 shows the fixation-count data in more detail. Plotted are the cumulative probabilities of fixating the target as a function of the number of objects fixated during the search task. When the scene contained only 6 or 13 objects, the model and the humans fixated roughly the same number of nontargets before finally shifting gaze to the target. When the scene was more cluttered (20 objects), the model fixated an average of 1 additional nontarget relative to the human subjects, a difference likely indicating a liberal bias in our human subjects under these search conditions. Overall, these analyses suggest that our model was not only making the same number of fixations as humans, but was also fixating the same number of nontargets during search as our human subjects.

Table 3: Comparison of model and human scanpath distance.

  #Objects   6      13     20
  Human      1.62   2.20   2.80
  Model      1.93   3.09   6.10
  MODEL      1.93   2.80   3.43

Human gaze does not jump randomly from one item to another during search, but instead moves in a more orderly way toward the target. The ultimate test of our model would be to reproduce this orderly movement of gaze. As a first approximation, we quantify this behavior in terms of a scanpath distance, defined as the ratio of the total scanpath length (i.e., the summed distance traveled by the eye) to the distance between the target and the center of the image (i.e., the minimum distance that the eye would need to travel to fixate the target). As indicated in Table 3, the model and human data are in close agreement in the 6- and 13-object scenes, but not in the 20-object scenes. Upon closer inspection of the data, we found several cases in which the model made multiple fixations between two nontarget objects, a very unnatural behavior arising from too small a setting for our Gaussian "zap" window. When these 6 trials were removed, the model data (the MODEL row in Table 3) and the human data were in closer agreement.
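The scanpath-distance measure is easy to state in code; below is a small sketch under the definition above (summed saccade length over the straight-line, center-to-target distance). The example gaze points are illustrative.

```python
import math

def scanpath_distance(fixations, target_xy):
    """fixations: list of (x, y) gaze points, the first at screen center."""
    path = sum(math.dist(a, b) for a, b in zip(fixations, fixations[1:]))
    direct = math.dist(fixations[0], target_xy)
    return path / direct if direct > 0 else float("nan")

# Example: center start, one intermediate fixation, then the target.
print(scanpath_distance([(640, 480), (400, 300), (700, 520)], (700, 520)))
```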
Figure 4: Representative scanpaths. Model data are shown in thick red lines; human data are shown in thin green lines.

Figure 4 shows representative scanpaths from the model and one human subject for two search scenes. Although the scanpaths do not align perfectly, there is qualitative agreement between the human and model in the path followed by gaze to the target.

5. Conclusion

Search tasks do not always come with specific targets. Very often, we need to search for dogs, or chairs, or pens, without any clear idea of the visual features comprising these objects. Despite the prevalence of these tasks, the problem of object class detection has attracted surprisingly little research within the behavioral community [15], and has been applied to a relatively narrow range of objects within the computer vision literature [6, 7, 8, 9]. The current work adds to our understanding of this important topic in two key respects. First, we provide a detailed eye movement analysis of human behavior in an object class detection task. Second, we incorporate state-of-the-art computer vision object detection methods into a biologically plausible model of eye movement control, then validate this model by comparing its behavior to the behavior of our human observers. Computational models capable of describing human eye movement behavior are extremely rare [25]; the fact that the current model was able to do so for multiple eye movement measures lends strength to our approach. Moreover, our model was able to detect targets nearly as well as the human observers while maintaining a low false positive rate, a difficult standard to achieve in a generic detection model. Such agreement between human and model suggests that simple color and texture features may be used to guide human attention and eye movement in an object class detection task.

Future computational work will explore the generality of our object class detection method to tasks with visually complex backgrounds, and future human work will attempt to use neuroimaging techniques to localize object class representations in the brain.

Acknowledgments

This work was supported by grants from the NIMH (R01-MH63748) and ARO (DAAD19-03-1-0039) to G.J.Z.

References

[1] M. F. Land and D. N. Lee. Where we look when we steer. Nature, 369(6483):742-744, 1994.
[2] M. F. Land and M. Hayhoe. In what ways do eye movements contribute to everyday activities? Vision Research, 41(25-26):3559-3565, 2001.
[3] G. Zelinsky, R. Rao, M. Hayhoe, and D. Ballard. Eye movements reveal the spatiotemporal dynamics of visual search. Psychological Science, 8:448-453, 1997.
[4] J. Wolfe. Visual search. In H. Pashler (Ed.), Attention, pages 13-71. London: University College London Press, 1997.
[5] E. Weichselgartner and G. Sperling. Dynamics of automatic and controlled visual attention. Science, 238(4828):778-780, 1987.
[6] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In CVPR, volume I, pages 746-751, 2000.
[7] P. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, volume I, pages 511-518, 2001.
[8] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In ECCV, volume IV, page 113, 2002.
[9] W. Kienzle, G. H. Bakır, M. O. Franz, and B. Schölkopf. Face detection - efficient and rank deficient. In NIPS, 2004.
[10] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, volume II, pages 264-271, 2003.
[11] A. Opelt, M. Fussenegger, A. Pinz, and P. Auer. Weak hypotheses and boosting for generic object detection and recognition. In ECCV, volume II, pages 71-84, 2004.
[12] W. Zhang, B. Yu, G. Zelinsky, and D. Samaras. Object class recognition using multiple layer boosting with multiple features. In CVPR, 2005.
[13] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40:1489-1506, 2000.
[14] R. Rao, G. Zelinsky, M. Hayhoe, and D. Ballard. Eye movements in iconic visual search. Vision Research, 42:1447-1463, 2002.
[15] D. T. Levin, Y. Takarae, A. G. Miner, and F. Keil. Efficient visual search by category: Specifying the features that mark the difference between artifacts and animals in preattentive vision. Perception and Psychophysics, 63(4):676-697, 2001.
[16] P. Cockrill. The Teddy Bear Encyclopedia. New York: DK Publishing, Inc., 2001.
[17] R. Rao, G. Zelinsky, M. Hayhoe, and D. Ballard. Modeling saccadic targeting in visual search. In NIPS, 1995.
[18] G. Zelinsky. Specifying the components of attention in a visual search task. In L. Itti, G. Rees, and J. Tsotsos (Eds.), Neurobiology of Attention, pages 395-400. Elsevier, 2005.
[19] W. S. Geisler and J. S. Perry. A real-time foveated multi-resolution system for low-bandwidth video communications. In Human Vision and Electronic Imaging, SPIE Proceedings, volume 3299, pages 294-305, 1998.
[20] J. S. Perry and W. S. Geisler. Gaze-contingent real-time simulation of arbitrary visual fields. In SPIE, 2002.
[21] M. J. Swain and D. H. Ballard. Color indexing. IJCV, 7(1):11-32, November 1991.
[22] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, November 2004.
[23] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[24] K. Yamada and G. Cottrell. A model of scan paths applied to face recognition. In Seventeenth Annual Cognitive Science Conference, pages 55-60, 1995.
[25] C. M. Privitera and L. W. Stark. Algorithms for defining visual regions-of-interest: comparison with eye fixations. PAMI, 22:970-982, 2000.
", "award": [], "sourceid": 2949, "authors": [{"given_name": "Wei", "family_name": "Zhang", "institution": null}, {"given_name": "Hyejin", "family_name": "Yang", "institution": null}, {"given_name": "Dimitris", "family_name": "Samaras", "institution": null}, {"given_name": "Gregory", "family_name": "Zelinsky", "institution": null}]}