{"title": "Learned Region Sparsity and Diversity Also Predicts Visual Attention", "book": "Advances in Neural Information Processing Systems", "page_first": 1894, "page_last": 1902, "abstract": "Learned region sparsity has achieved state-of-the-art performance in classification tasks by exploiting and integrating a sparse set of local information into global decisions. The underlying mechanism resembles how people sample information from an image with their eye movements when making similar decisions. In this paper we incorporate the biologically plausible mechanism of Inhibition of Return into the learned region sparsity model, thereby imposing diversity on the selected regions. We investigate how these mechanisms of sparsity and diversity relate to visual attention by testing our model on three different types of visual search tasks. We report state-of-the-art results in predicting the locations of human gaze fixations, even though our model is trained only on image-level labels without object location annotations. Notably, the classification performance of the extended model  remains the same as the original. This work suggests a new computational perspective on visual attention mechanisms and shows how the inclusion of attention-based mechanisms can improve computer vision techniques.", "full_text": "Learned Region Sparsity and Diversity\n\nAlso Predict Visual Attention\n\nZijun Wei1\u2217, Hossein Adeli2\u2217, Gregory Zelinsky1,2, Minh Hoai1, Dimitris Samaras1\n1. Department of Computer Science 2. Department of Psychology \u2013 Stony Brook University\n\n*. Both authors contributed equally to this work\n\n1.{zijwei, minhhoai, samaras}@cs.stonybrook.edu\n\n2.{hossein.adelijelodar, gregory.zelinsky}@stonybrook.edu\n\nAbstract\n\nLearned region sparsity has achieved state-of-the-art performance in classi\ufb01cation\ntasks by exploiting and integrating a sparse set of local information into global\ndecisions. 
The underlying mechanism resembles how people sample information\nfrom an image with their eye movements when making similar decisions. In this\npaper we incorporate the biologically plausible mechanism of Inhibition of Return\ninto the learned region sparsity model, thereby imposing diversity on the selected\nregions. We investigate how these mechanisms of sparsity and diversity relate to\nvisual attention by testing our model on three different types of visual search tasks.\nWe report state-of-the-art results in predicting the locations of human gaze \ufb01xations,\neven though our model is trained only on image-level labels without object location\nannotations. Notably, the classi\ufb01cation performance of the extended model remains\nthe same as the original. This work suggests a new computational perspective\non visual attention mechanisms, and shows how the inclusion of attention-based\nmechanisms can improve computer vision techniques.\n\n1\n\nIntroduction\n\nVisual spatial attention refers to the narrowing of processing in the brain to particular objects in\nparticular locations so as to mediate everyday tasks. A widely used paradigm for studying visual\nspatial attention is visual search, where a desired object must be located and recognized in a typically\ncluttered environment. Visual search is accompanied by observable estimates\u2014in the form of\ngaze \ufb01xations\u2014of how attention samples information from a scene while searching for a target.\nEf\ufb01cient visual search requires prioritizing the locations of features of the target object class over\nfeatures at locations offering less evidence for the target [31]. Computational models of visual search\ntypically estimate and plot goal directed prioritization of visual space as priority maps for directing\nattention [32]. 
This form of target directed prioritization is different from the saliency modeling\nliterature, where bottom-up feature contrast in an image is used to predict \ufb01xation behavior during\nthe free-viewing of scenes [16].\nThe \ufb01eld of \ufb01xation prediction is highly active and growing [2], although it was not until fairly\nrecently that attention researchers have begun using the sophisticated object detection techniques\ndeveloped in the computer vision literature [8, 18, 31]. The dominant method used in the visual\nsearch literature to generate priority maps for detection has been the exhaustive detection mechanism\n[8, 18]. Using this method, an object detector is applied to an image to provide bounding boxes\nthat are then combined, weighted by their detection scores, to generate a priority map [8]. While\nthese models have had success in predicting behavior, training these detectors requires human labeled\nbounding boxes, which are expensive and laborious to collect, and also prone to individual annotator\ndifferences.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fAn alternative approach to modeling visual attention is to determine how model and behavioral task\nperformance depends on shared core computational principles [24]. To this end, a new class of\nattention-inspired models have been developed and applied to tasks ranging from image captioning\n[30] to hand writing generation [13], where selective spatial attention mechanisms have been shown\nto emerge [1, 25]. By requiring visual inputs to be gated in a manner similar to the human gating\nof visual inputs via \ufb01xations, these models are able to localize or \u201cattend\u201d selectively to the most\ninformative regions of an input image while ignoring irrelevant visual inputs [25, 1]. 
This built-in\nattention mechanism enables the model of [30], trained only on generating captions, to bias the\nvisual input so as to gate only relevant information when generating each word to describe an image.\nPriority maps were then generated to show the mapping of attended image areas to generated words.\nWhile these new models show attention-like behavior, to our knowledge none have been used to\npredict actual human allocations of attention.\nThe current work bridges the behavioral and computer vision literatures by using a classification\nmodel that has biologically plausible constraints to create a priority map for the purpose of predicting\nthe allocation of spatial attention as measured by changes in fixation. The specific image-category\nclassification model that we use is called Region Ranking SVM (RRSVM) [29]. This model was\ndeveloped in our recent work [29], and it achieved state-of-the-art performance on a number of\nclassification tasks by learning categorization with locally-pooled information from input images.\nThis model works by imposing sparsity on selected image areas that contribute to the classification\ndecision, much like how humans prioritize visual space and sample with fixations only a sparse set of\nimage locations while attempting to detect and recognize object categories [4]. We believe that this\nanalogy between sparse sampling and attention makes this model a natural candidate for predicting\nattention behavior in visual search tasks. It is worth noting that this model was originally created for\nobject classification and not localization, hence no object localization data is used to train it, unlike\nstandard fixation prediction algorithms [16, 17].\nThere are two contributions of our work. First, we show that the RRSVM model approaches\nstate-of-the-art in predicting the fixations made by humans searching for the same targets in the same\nimages. 
This means that a model trained solely for the purpose of image classification, without any\nlocalization data, is also able to predict the locations of fixations that people make while searching for\nthe to-be-classified objects. Second, we incorporate the biologically plausible constraint of Inhibition\nof Return [10], which we model by requiring a set of diverse (minimally overlapping) sparse regions\nin RRSVM. Incorporating this constraint, we are able to reduce the error in fixation prediction (up\nto 21%). Importantly, adding the Inhibition of Return constraint does not affect the classification\nperformance. By building this bridge, we hope to show how automated object detection might be\nimproved by the inclusion of an attention mechanism, and how a recent attention-inspired approach\nfrom computer vision might illuminate how the brain prioritizes visual information for the efficient\ndirection of spatial attention.\n\n2 Region Ranking SVM\n\nHere we review Region Ranking SVM (RRSVM) [29]. The main problem addressed by RRSVM is\nimage classification, which aims to recognize the semantic category of an image, such as whether\nthe image contains a certain object (e.g., car, cat) or portrays a certain action (e.g., jumping, typing).\nRRSVM evaluates multiple local regions of an image, and subsequently outputs the classification\ndecision based on a sparse set of regions. This mechanism is noteworthy and different from other\napproaches that aggregate information from multiple regions indistinguishably (e.g., [23, 28, 22, 14]).\nRRSVM assumes training data consisting of images $\\{B_i\\}_{i=1}^n$ and associated binary labels $\\{y_i\\}_{i=1}^n$\nindicating the presence or absence of the visual element (object or action) of interest. 
To account\nfor the uncertainty of each semantic region in an image, RRSVM considers multiple local regions.\nThe number of regions can differ between images, but for brevity, assume each image has the\nsame number of regions. Let m be the number of regions for each image, and d the dimension\nof each region descriptor. RRSVM represents each image as a matrix $B_i \\in \\mathbb{R}^{d \\times m}$, but the order\nof the columns can be arbitrary. RRSVM jointly learns a region evaluation function and a region\nselection function by minimizing $\\lambda \\|w\\|^2 + \\sum_{i=1}^n (w^T \\Gamma(B_i; w) s + b - y_i)^2$ subject to the constraints\n$s_1 \\geq s_2 \\geq \\cdots \\geq s_m \\geq 0$ and $h(\\Gamma(B_i; w) s) \\leq 1$. Here $h(\\cdot)$ is the function that measures the spread\nof the column vectors of a matrix: $h([x_1, \\cdots, x_n]) = \\sum_{i=1}^n \\| x_i - \\frac{1}{n} \\sum_{j=1}^n x_j \\|^2$. w and b are\nthe weight vector and the bias term of an SVM classifier, which are the parameters of the region\nevaluation function. $\\Gamma(B; w)$ denotes a matrix that can be obtained by rearranging the columns of\nthe matrix B so that $w^T \\Gamma(B; w)$ is a sequence of non-increasing values. The vector s is the weight\nvector for combining the SVM region scores for each image [15]; this vector is common to all images\nof a class.\nThe objective of the above formulation consists of the regularization term $\\lambda \\|w\\|^2$ and the sum of\nsquared losses. This objective is based purely on classification performance. However, note that\nthe classification decision is based on both the region evaluation function (i.e., w, b) and the region\nselection function (i.e., s), which are simultaneously learned using the above formulation. What\nis interesting is that the obtained s vector is always sparse. 
An experiment [29] on the ImageNet\ndataset [27] with 1000 classes showed that RRSVM generally uses 20 regions or less (from hundreds\nof local regions considered). This intriguing fact prompted us to consider the connection between\nsparse region selection and visual attention. Would machine-based discriminative localization re\ufb02ect\nthe allocation of human attention in visual search? It turns out that there is compelling evidence for\na relationship, as will be shown in the experiment section. This relationship can be strengthened if\nRRSVM is extended to incorporate Inhibition of Return in the region selection process, which will\nbe explained next.\n\n3\n\nIncorporating Inhibition of Return into Region Ranking SVM\n\nA mechanism critical to the modeling of human visual search behavior is Inhibition of Return:\nthe lower probability of re-\ufb01xating on or near already attended areas, possibly mediated by lateral\ninhibition [16, 20]. This mechanism, however, is not currently enforced in the formulation of\nRRSVM, and indeed the spatial relationship between selected regions is not considered. RRSVM\nusually selects a sparse set of regions, but the selected regions are free to overlap and concentrate on\na single image area.\nInspired by Inhibition of Return, we consider an extension of RRSVM where non-maxima suppression\nis incorporated into the process of selecting regions. This mechanism will select the local maximum\nfor nearby activation areas (a potential \ufb01xation location) and discard the rest (non-maxima nearby\nlocations). 
The biological plausibility of non-maxima suppression has been discussed in previous\nwork, where it was shown to be a plausible method for allowing the stronger activations to stand out\n(see [21, 7] for details).\nTo incorporate non-maxima suppression in the framework of RRSVM, we replaced the region ranking\nprocedure $\\Gamma(B; w)$ of RRSVM by $\\Psi(B_i; w, \\alpha)$, a procedure that ranks and subsequently returns the\nlist of regions that do not significantly overlap with one another. In particular, we use intersection\nover union to measure overlap, where $\\alpha$ is a threshold for tolerable overlap (we set $\\alpha = 0.5$ in our\nexperiments). This leads to the following optimization problem:\n\n$$\\underset{w, s, b}{\\text{minimize}} \\quad \\lambda \\|w\\|^2 + \\sum_{i=1}^n (w^T \\Psi(B_i; w, \\alpha) s + b - y_i)^2 \\qquad (1)$$\n$$\\text{s.t.} \\quad s_1 \\geq s_2 \\geq \\cdots \\geq s_m \\geq 0, \\qquad (2)$$\n$$h(\\Psi(B_i; w, \\alpha) s) \\leq 1. \\qquad (3)$$\n\nThe above formulation can be optimized in the same way as RRSVM in [29]. It will yield a classifier\nthat makes a decision based on a sparse and diverse set of regions. Sparsity is inherited from RRSVM,\nand location diversity is attained using non-maxima suppression. Hereafter, we refer to this method\nas Sparse Diverse Regions (SDR) classifier.\n\n4 Experiments and Analysis\n\nWe present here empirical evidence showing that learned region sparsity and diversity can also predict\nvisual attention. We first describe the implementation details of RRSVM and SDR. We then consider\nattention prediction under three conditions: (1) single-target present, that is to find the one instance of\na target category appearing in a stimulus image; (2) target absent, i.e., searching for a target category\nthat does not appear in the image; and (3) multiple-targets present, i.e., searching for multiple object\ncategories where at least one is present in the image. 
Experiments are performed on three datasets:\nPOET [26], PET [11] and MIT900 [8], which are the only available datasets for object search tasks.\n\n4.1 Implementation details of RRSVM and SDR\n\nOur implementation of RRSVM and SDR is similar to [29], but we consider more local regions.\nThis yields a finer localization map without changing the classification performance. As in [29],\nthe feature extraction pipeline is based on VGG16 [28]. The last fully connected layer of VGG16\nis removed and the remaining fully connected layer is converted to a fully convolutional layer. To\ncompute feature vectors for multiple regions of an image, the image is resized and then fed into\nVGG16 to yield a feature map with 4096 channels. The size of the feature map depends on the size\nof the resized image, and each feature map corresponds to a subwindow of the original image. By\nresizing the original image to multiple sizes, one can compute feature vectors for multiple regions of\nthe original image. In this work, we consider 7 different image sizes instead of the three sizes used\nby [28, 29]. The first three resized images are obtained by scaling the image isotropically so that the\nsmallest dimension is 256, 384, or 512. For brevity, assuming the width is smaller than the height,\nthis yields three images with dimensions 256 \u00d7 a, 384 \u00d7 b, and 512 \u00d7 c. We consider four other\nresized images with dimensions 256 \u00d7 b, 384 \u00d7 c, 384 \u00d7 a, 512 \u00d7 b. These image sizes correspond to\nlocal regions having an aspect ratio of either 2:3 or 3:2, while the isotropically resized images yield\nsquare local regions. Additionally, we also consider horizontal flips of the resized images. 
Overall,\nthis process yields 700 to 1000 feature vectors, each corresponding to a local image region.\nThe RRSVM and SDR classi\ufb01ers used in the following experiments are trained on the trainval set of\nPASCAL VOC 2007 dataset [9] unless otherwise stated. This dataset is distinct from the datasets\nused for evaluation. For SDR, the non-maxima suppression threshold is 0.5, and we only keep the\ntop ranked regions that have non-zero region scores (si \u2265 0.01). To generate a priority map, we \ufb01rst\nassociate each pixel with an integer indicating the total number of selected regions covering that\npixel, then apply a Gaussian blur kernel to the integer valued map, with the kernel width tuned on the\nvalidation set.\nTo test whether learned region sparsity and diversity predicts human attention, we compare the\ngenerated priority maps with the behaviorally-derived \ufb01xation density maps. To make this comparison\nwe use the Area Under the ROC Curve (AUC), a commonly used metric for visual search task\nevaluation [6]. We use the publicly available implementation of the AUC evaluation from the MIT\nsaliency benchmark [5], speci\ufb01cally the AUC-Judd implementation for its better approximation.\n\n4.2 Single-target present condition\n\nWe consider visual attention in the single-target present condition using the POET dataset [26].\nThis dataset is a subset of PASCAL VOC 2012 dataset [9], and it has 6270 images from 10 object\ncategories (aeroplane, boat, bike, motorbike, cat, dog, horse, cow, sofa and dining table). The task was\ntwo-alternative forced choice for object categories, approximating visual search, and eye movement\ndata were collected from 5 subjects as they freely viewed these images. On average, 5.7 \ufb01xations\nwere made per image. The SDR classi\ufb01er is trained on the trainval set of PASCAL VOC 2007 dataset,\nwhich does not overlap with the POET dataset. 
We randomly selected one third of the images for\neach category to compile a validation set for tuning the width of the Gaussian blur kernel for all\ncategories. The rest were used as test images.\nFor each test image, we compare the priority map generated for the selected regions by RRSVM with\nthe human \ufb01xation density map. The overall correlation is high, yielding a mean AUC score of 0.81\n(on all images of 10 object classes). This is intriguing because RRSVM is optimized for classi\ufb01cation\nperformance only; joint classi\ufb01cation is apparently related to discriminative localization by human\nattention in the context of a visual search task. By incorporating Inhibition of Return into RRSVM,\nwe observe even stronger correlation with human behavior, with the mean AUC score obtained by\nSDR now being 0.85.\nThe left part of Table 1 shows AUC scores for individual categories of the POET dataset. We\ncompare the performance of other attention prediction baselines. All recent \ufb01xation prediction\nmodels [8, 19, 31] apply object category detectors on the input image and combine the detection\nresults to create priority maps. Unfortunately, direct comparison to these models is not currently\npossible due to the unavailability of needed code and datasets. However, our RCNN [12] baseline,\nwhich is the state-of-the-art object detector on this dataset, should improve the pipelines of these\nmodels. To account for possible localization errors and multiple object instances, we keep all the\ndetections with a detection score greater than a threshold. 
This threshold is chosen to maximize the detector\u2019s F1 score, which is the harmonic mean of precision and recall.\n\nTable 1: AUC scores on POET and PET test sets. Columns aero through mean are POET categories; the last column is the PET multi-target condition.\n\nModel | aero | bike | boat | cat | cow | table | dog | horse | mbike | sofa | mean | PET multi-target\nSDR | 0.87 | 0.85 | 0.83 | 0.89 | 0.88 | 0.79 | 0.88 | 0.86 | 0.86 | 0.77 | 0.85 | 0.83\nRCNN | 0.84 | 0.83 | 0.79 | 0.84 | 0.81 | 0.76 | 0.83 | 0.80 | 0.87 | 0.76 | 0.82 | 0.77\nCAM [34] | 0.86 | 0.78 | 0.78 | 0.88 | 0.84 | 0.74 | 0.87 | 0.84 | 0.83 | 0.67 | 0.82 | 0.65\nAnnoBoxes | 0.85 | 0.86 | 0.81 | 0.84 | 0.84 | 0.79 | 0.80 | 0.80 | 0.88 | 0.80 | 0.83 | 0.82\n\nFigure 1: Priority maps generated for SDR on the POET dataset. Warm colors represent high\nvalues. Dots represent human fixations. Best viewed on a digital device.\n\nWe also consider a\nvariant method where only the top detection is kept, but the result is not as good. We also consider\nthe recently proposed weakly-supervised object localization approach of [34], which is denoted as\nCAM in Table 1. We use the released model to extract features and train a linear SVM on top of the\nfeatures. For each test image, we take a weighted linear sum of local activations to create an activation map.\nWe normalize the activation map to get the priority map. We even compare SDR with a method that\ndirectly uses the annotated object bounding boxes to predict human attention, which is denoted as\nAnnoBoxes in the table. For this method, the priority map is created by applying a Gaussian filter to\na binary map where the center of the bounding box over the target(s) is set to 1 and everywhere else\n0. Notably, the methods selected for comparison are strong models for predicting human attention.\nRCNN has an unfair advantage over SDR because it has access to localized annotations in its training\ndata, and AnnoBoxes even assumes the availability of object bounding boxes for test data. As can be\nseen from Table 1, SDR significantly outperforms the other methods. 
This provides strong empirical\nevidence suggesting that learned region sparsity and diversity are highly predictive of human attention.\nFig. 1 shows some randomly selected results from SDR on test images.\nNote that the incorporation of Inhibition of Return into RRSVM and the consideration of more local\nregions do not affect the classification performance. When evaluated on the PASCAL VOC 2007\ntest set, the RRSVM method that uses local regions corresponding to 3 image scales (as in [29]), the\nRRSVM method that uses more regions with different aspect ratios (as explained in Sec. 4.1), and\nthe RRSVM method that incorporates the NMS mechanism (i.e., SDR), all achieve a mean AP of\n92.9%. SDR, however, is significantly better than RRSVM in predicting fixations during search tasks,\nincreasing the mean AUC score from 0.81 to 0.85. Also note that the predictive power of SDR is\nnot sensitive to the value of \u03b1: for aeroplane on the POET dataset, the AUC scores remain the same\n(0.87) when \u03b1 is varied from 0.5 to 0.7.\nFigure 2 shows some examples highlighting the difference between the regions selected by RRSVM\nand SDR. As can be seen, incorporating non-maxima suppression encourages greater dispersion of\nthe sparse areas as opposed to a more clustered distribution in RRSVM. This in turn better predicts\nattention when there are multiple instances of the target object in the display.\n\nFigure 2: Comparison between RRSVM and SDR on the POET dataset. (a): priority maps\ncreated by RRSVM, (b): priority maps generated by SDR. SDR better captures fixations when there\nare multiple instances of the target categories. The KL Divergence scores between RRSVM and SDR\nare reported in the bottom row (KLDiv values shown: 0.11, 0.89, 0.29, 0.61).\n\nFigure 3 shows representative cases where the priority maps produced by SDR are significantly\ndifferent from human fixations. The common failure modes are: (1) failure to locate the correct\nregion for correct classification (see Fig. 3a); (2) particularly distracting elements in the scene, such as\ntext (3b) or faces (3c); (3) failure to attend to multiple instances of the target categories. Tuning SDR\nusing human fixation behavioral data [17] and combining SDR with multiple sources of guidance\ninformation [8], including saliency and scene context, could mitigate some of the model limitations.\n\nFigure 3: Failure cases. Representative images where the priority maps produced by SDR are\nsignificantly different from human fixations. The caption under each image indicates the target\ncategory: (a) motorbike, (b) aeroplane, (c) diningtable, (d) cow. The modes of failure are: (a) failure in classification; (b) and (c) existence of a more\nattractive object (text or face); (d) co-occurrence of multiple objects. Best viewed on digital devices.\n\n4.3 Target absent condition\n\nTo test whether SDR is able to predict people\u2019s fixations when the search target is absent, we\nperformed experiments on 456 target-absent images from the MIT900 dataset [8]. Human observers\nwere asked to search for people in real world scenes. Eye movement data were collected from 14\nsearchers who made roughly 6 fixations per image, on average. We picked a random subset of 150\nimages to tune the Gaussian blur parameter and reported the results for the remaining 306 images.\nWe noticed that the sizes and poses of the people in these images were very different from those of\nthe training samples in VOC2007, which could have led to poor SDR classification performance. In\norder to address this issue, we augmented the training set of SDR with 456 images from MIT900 that\ncontain people. 
The added training examples were a disjoint set from the target-absent images for\nevaluation.\nOn these target absent cases, SDR achieves an AUC score of 0.78. As a reference, the method of\nEhinger et al. [8] also achieves an AUC of 0.78. But the two methods are not directly comparable\nbecause Ehinger et al. [8] used a HOG-based person detector that was trained on a much larger\ndataset with location annotation.\n\nFigure 4: Priority map predictions using SDR on some MIT target-absent stimuli. Warm colors\nrepresent high probabilities. Dots indicate human fixations. Best viewed on a digital device.\n\nFigure 5: Visualization of SDR prediction on the PET dataset. Panels: (a) dog and sheep, (b) cows and sheep, (c) dog and cat, (d) cows. Note that the high classification\naccuracy ensures that more reliable regions are detected.\n\nFigure 4 shows some randomly selected results from the test set demonstrating SDR\u2019s success in\npredicting where people attend. Interestingly, SDR looks at regions that either contain person-like\nobjects or are likely to contain persons (e.g., sidewalks), with the latter observation likely the result of\nsidewalks co-occurring with persons in the positive training samples (a form of scene context effect).\n\n4.4 Multiple-target attention\n\nWe considered human visual search behavior when there were multiple targets. The experiments were\nperformed on the PET dataset [11]. This dataset is a subset of PASCAL VOC2012 dataset [9], and it\ncontains 4135 images from 6 animal categories (cat, dog, bird, horse, cow and sheep). Four subjects\nwere instructed to find all of the animals in each image. Eye movements were recorded, where each\nsubject made roughly 6 fixations per image. We excluded the images that contained people to avoid\nambiguity with the animal category. We also removed the images that were shared with the PASCAL\nVOC 2007 dataset to ensure no overlap between training and testing data. 
This yielded a total of\n3309 images from which a random set of 1300 images were selected for tuning the Gaussian kernel\nwidth parameter. The remaining 2309 images were used for testing.\nTo model the search for multiple categories in an image, for all methods except AnnoBoxes we\napplied six animal classi\ufb01ers/detectors simultaneously to the test image. For each classi\ufb01er/detector\nof each category, a threshold was selected to achieve the highest F1 score on the validation data. The\nprediction results are shown in the right part of Tab. 1. SDR signi\ufb01cantly outperforms other methods.\nNotably, CAM performs poorly on this dataset, due perhaps to the low classi\ufb01cation accuracy of that\nmodel (83% mAP on VOC 2007 test set as opposed to 93% of SDR). Some randomly selected results\nare shown in Fig. 5.\n\n4.5 Center Bias\n\nFor the POET dataset, some of the target objects are quite iconic and in the center of the image.\nFor these cases, a simple center bias map might be a good predictor of the \ufb01xations. To test this,\nwe generated priority maps by setting the center of the image to 1 and everywhere else 0, and then\napplying a Gaussian \ufb01lter with sigma tuned on the validation set. This simple Center Bias (CB) map\nachieved an AUC score of 0.84, which is even higher than some of the methods presented in Tab. 1.\nThis prompted us to analyze whether the good performance of SDR is simply due to center bias.\nAn intuitive way to address the CB problem would be to use Shuf\ufb02ed AUC (sAUC) [33]. However,\nsAUC favors true positives over false negatives and gives more credit to off-center information [3],\nwhich may lead to biased results. This is especially true when the datasets are center-biased. The\nsAUC scores for RCNN, AnnoBox, CAM, SDR, and Inter-Observer [3] are 0.61, 0.61, 0.65, 0.64,\nand 0.70, respectively. SDR outperforms AnnoBox and RCNN by 3% and is on par with CAM. 
Also\nnote that sAUC for Inter-Observer is 0.70, which suggests the existence of center bias in POET (the\nsAUC score of Inter-Observer on MIT300 [17] is 0.81) and raises a concern that sAUC might be\nmisleading for model comparison using this dataset.\n\nFigure 6: (a): Red bars: the distribution of AUC scores of SDR for which the AUC scores of Center\nBias are under 0.6. Blue bars: the distribution of AUC scores of Center Bias where AUC scores of SDR\nare under 0.6. (b): The box plot for the distributions of KL divergence between Center Bias and SDR\nscores on each class in POET dataset. The KL divergence distribution revealed that the priority maps\ncreated by Center Bias are significantly different from the ones created by SDR.\n\nTo further address the concern of center bias, we show in Fig. 6 that the priority maps produced by\nSDR and Center Bias are quite different. Fig. 6a plots the distribution of the AUC scores for one\nmethod when the AUC scores of the other method were low (< 0.6). The spread of these distributions\nindicates a low correlation between the errors of the two methods. Fig. 6b shows a box plot of the\ndistribution of KL divergence [6] between the priority maps generated by SDR and Center Bias. For\neach category, the mean KL divergence value is high, indicating a large difference between SDR and\nCenter Bias. For a more qualitative intuition of KL divergence in these distributions, see Figure 2.\nThe center bias effect in PET and MIT900 is not as pronounced as in POET because there are multiple\ntarget objects in the PET images and the target objects in the MIT900 dataset are relatively small. For\nthese datasets, Center Bias achieves AUC scores of 0.78 and 0.72, respectively. 
These numbers are\nsigni\ufb01cantly lower than the results obtained by SDR, which are 0.82 and 0.78, respectively.\n\n5 Conclusions and Future Work\n\nWe introduced a classi\ufb01cation model based on sparse and diverse region ranking and selection, which\nis trained only on image level annotations. We then provided experimental evidence from visual\nsearch tasks under three different conditions to support our hypothesis that these computational\nmechanisms might be analogous to computations underlying visual attention processes in the brain.\nWhile this work is not the \ufb01rst to use computer vision models to predict where humans look in visual\nsearch tasks, it is the \ufb01rst to show that core mechanisms driving high model performance in a search\ntask also predict how humans allocate their attention in the same tasks. By improving upon these core\ncomputational principles, and perhaps by incorporating new ones suggested by attention mechanisms,\nour hope is to shed more light on human visual processing.\nThere are several directions for future work. The \ufb01rst is to create a visual search dataset that mitigates\nthe center bias effect and avoids cases of trivially easy search. The second is to incorporate into\nthe current model known factors affecting search, such as a center bias, bottom-up saliency, scene\ncontext, etc., to better predict shifts in human spatial attention.\nAcknowledgment. This project was partially supported by the National Science Foundation Awards\nIIS-1161876 and IIS-1566248 and the Subsample project from the Digiteo Institute, France.\n\nReferences\n\n[1] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.\n[2] A. Borji and L. Itti. State-of-the-art in visual attention modeling. PAMI, 35(1):185\u2013207, 2013.\n[3] A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti. Analysis of scores, datasets, and models in visual saliency\n\nprediction. In ICCV, 2013.\n\n[4] N. D. 
Bruce and J. K. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3):5\u20135, 2009.\n[5] Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba. MIT saliency benchmark. http://saliency.mit.edu/.\n[6] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? arXiv preprint arXiv:1604.03605, 2016.\n[7] P. Dario, G. Sandini, and P. Aebischer. Robots and biological systems: Towards a new bionics? In NATO Advanced Workshop, 2012.\n[8] K. A. Ehinger, B. Hidalgo-Sotelo, A. Torralba, and A. Oliva. Modelling search for people in 900 scenes: A combined source model of eye guidance. Visual Cognition, 17(6-7):945\u2013978, 2009.\n[9] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 111(1):98\u2013136, 2015.\n[10] J. H. Fecteau and D. P. Munoz. Salience, relevance, and firing: a priority map for target selection. Trends in Cognitive Sciences, 10(8):382\u2013390, 2006.\n[11] S. O. Gilani, R. Subramanian, Y. Yan, D. Melcher, N. Sebe, and S. Winkler. PET: An eye-tracking dataset for animal-centric PASCAL object classes. In ICME, 2015.\n[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.\n[13] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.\n[14] M. Hoai. Regularized max pooling for image categorization. In Proc. BMVC, 2014.\n[15] M. Hoai and A. Zisserman. Improving human action recognition using score distribution and ranking. In Proc. ACCV, 2014.\n[16] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10):1489\u20131506, 2000.\n[17] T. 
Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In Proc. ICCV. IEEE, 2009.\n[18] C. Kanan, M. H. Tong, L. Zhang, and G. W. Cottrell. SUN: Top-down saliency using natural statistics. Visual Cognition, 17(6-7):979\u20131003, 2009.\n[19] A. Kannan, J. Winn, and C. Rother. Clustering appearance and shape by learning jigsaws. In NIPS, 2007.\n[20] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of intelligence, pages 115\u2013141. Springer, 1987.\n[21] I. Kokkinos, R. Deriche, T. Papadopoulo, O. Faugeras, and P. Maragos. Towards bridging the Gap between Biological and Computational Image Segmentation. Research Report RR-6317, INRIA, 2007.\n[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.\n[23] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.\n[24] T. S. Lee and S. X. Yu. An information-theoretic framework for understanding saccadic eye movements. In NIPS, 1999.\n[25] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, 2014.\n[26] D. P. Papadopoulos, A. D. Clarke, F. Keller, and V. Ferrari. Training object class detectors from eye tracking data. In ECCV, 2014.\n[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.\n[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n[29] Z. Wei and M. Hoai. Region ranking SVMs for image classification. In CVPR, 2016.\n[30] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. 
Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.\n[31] G. J. Zelinsky, H. Adeli, Y. Peng, and D. Samaras. Modelling eye movements in a categorical search task. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 368(1628):20130058, 2013.\n[32] G. J. Zelinsky and J. W. Bisley. The what, where, and why of priority maps and their interactions with visual working memory. Annals of the New York Academy of Sciences, 1339(1):154\u2013164, 2015.\n[33] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7):32\u201332, 2008.\n[34] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In CVPR, 2016.\n", "award": [], "sourceid": 1045, "authors": [{"given_name": "Zijun", "family_name": "Wei", "institution": "Stony Brook"}, {"given_name": "Hossein", "family_name": "Adeli", "institution": "Stony Brook University"}, {"given_name": "Minh Hoai", "family_name": "Nguyen", "institution": "Stony Brook University"}, {"given_name": "Greg", "family_name": "Zelinsky", "institution": "Stony Brook University"}, {"given_name": "Dimitris", "family_name": "Samaras", "institution": "Stony Brook University"}]}