{"title": "Saliency Based on Information Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 155, "page_last": 162, "abstract": null, "full_text": "Saliency Based on Information Maximization \n\nNeil D.B. Bruce and John K. Tsotsos \n\nDepartment of Computer Science and Centre for Vision Research \n\nYork University, Toronto, ON, M2N 5X8 \n\n{neil,tsotsos}@cs.yorku.ca \n\nAbstract \n\nA model of bottom-up overt attention is proposed based on the principle of maximizing information sampled from a scene. The proposed operation is based on Shannon's self-information measure and is achieved in a neural circuit, which is demonstrated as having close ties with the circuitry existent in the primate visual cortex. It is further shown that the proposed saliency measure may be extended to address issues that currently elude explanation in the domain of saliency based models. Results on natural images are compared with experimental eye tracking data, revealing the efficacy of the model in predicting the deployment of overt attention as compared with existing efforts. \n\n1 Introduction \n\nThere has long been interest in the nature of eye movements and fixation behavior following early studies by Buswell [1] and Yarbus [2]. However, a complete description of the mechanisms underlying these peculiar fixation patterns remains elusive. This is further complicated by the fact that task demands and contextual knowledge factor heavily in how sampling of visual content proceeds. \n\nCurrent bottom-up models of attention posit that saliency is the impetus for selection of fixation points. Each model differs in its definition of saliency. In perhaps the most popular model of bottom-up attention, saliency is based on centre-surround contrast of units modeled on known properties of primary visual cortical cells [3]. 
In other efforts, saliency is defined by more ad hoc quantities having less connection to biology [4]. In this paper, we explore the notion that information is the driving force behind attentive sampling. \n\nThe application of information theory in this context is not in itself novel. There exist several previous efforts that define saliency based on Shannon entropy of image content defined on a local neighborhood [5, 6, 7, 8]. The model presented in this work is based on the closely related quantity of self-information [9]. In section 2.2 we discuss differences between entropy and self-information in this context, including why self-information may present a more appropriate metric than entropy in this domain. The contributions of this paper are as follows: \n\n1. A bottom-up model of overt attention with selection based on the self-information of local image content. \n\n2. A qualitative and quantitative comparison of predictions of the model with human eye tracking data, contrasted against the model of Itti and Koch [3]. \n\n3. Demonstration that the model is neurally plausible via implementation based on a neural circuit resembling circuitry involved in early visual processing in primates. \n\n4. Discussion of how the proposal generalizes to address issues that deny explanation by existing saliency based attention models. \n\n2 The Proposed Saliency Measure \n\nThere exists much evidence indicating that the primate visual system is built on the principle of establishing a sparse representation of image statistics. In the most prominent of such studies, it was demonstrated that learning a sparse code for natural image statistics results in the emergence of simple-cell receptive fields similar to those appearing in the primary visual cortex of primates [10, 11]. 
The apparent benefit of such a representation comes from the fact that a sparse representation allows certain independence assumptions with regard to neural firing. This issue becomes important in evaluating the likelihood of a set of local image statistics and is elaborated on later in this section. \n\nIn this paper, saliency is determined by quantifying the self-information of each local image patch. Even for a very small image patch, the probability distribution resides in a very high dimensional space. There is insufficient data in a single image to produce a reasonable estimate of the probability distribution. For this reason, a representation based on independent components is employed for the independence assumption it affords. ICA is performed on a large sample of 7x7 RGB patches drawn from natural images to determine a suitable basis. For a given image, an estimate of the distribution of each basis coefficient is learned across the entire image through non-parametric density estimation. The probability of observing the RGB values corresponding to a patch centred at any image location may then be evaluated by independently considering the likelihood of each corresponding basis coefficient. The product of such likelihoods yields the joint likelihood of the entire set of basis coefficients. Given the basis determined by ICA, the preceding computation may be realized entirely in the context of a biologically plausible neural circuit. The overall architecture is depicted in figure 1. Details of each of the aforesaid model components, including the details of the neural circuit, are as follows: \n\nProjection into independent component space provides, for each local neighborhood of the image, a vector w consisting of N variables w_i with values v_i. Each w_i specifies the contribution of a particular basis function to the representation of the local neighborhood. 
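A minimal sketch of learning such a basis and projecting patches onto it might look as follows. Note the assumptions: scikit-learn's FastICA, a 2,000-patch sample, 25 components, and random placeholder arrays stand in for the extended-infomax procedure and the natural-image corpus actually used in the paper.

```python
# Illustrative ICA-basis stage: sample 7x7x3 patches, center them, and
# recover approximately independent coefficients w_i per patch.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

def sample_patches(images, n_patches, size=7):
    """Draw random size x size x 3 patches and flatten them to vectors."""
    out = np.empty((n_patches, size * size * 3))
    for n in range(n_patches):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - size)
        x = rng.integers(img.shape[1] - size)
        out[n] = img[y:y + size, x:x + size].ravel()
    return out

images = [rng.random((64, 64, 3)) for _ in range(20)]   # placeholder corpus
X = sample_patches(images, n_patches=2000)
X -= X.mean(axis=0)                                     # center the patches

ica = FastICA(n_components=25, random_state=0, max_iter=1000)
W = ica.fit_transform(X)   # rows: per-patch independent coefficients w_i
```

On real natural images the rows of `ica.components_` would resemble the localized, oriented filters discussed in the text; on the random placeholders they are meaningless and serve only to exercise the pipeline.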
As mentioned, these basis functions, learned from statistical regularities observed in a large set of natural images, show remarkable similarity to V1 cells [10, 11]. The ICA projection then allows a representation w in which the components w_i are as independent as possible. For further details on the ICA projection of local image statistics see [12]. In this paper, we propose that salience may be defined based on a strategy for maximum information sampling. In particular, Shannon's self-information measure [9], -log(p(x)), applied to the joint likelihood of statistics in a local neighborhood described by w, provides an appropriate transformation between probability and the degree of information inherent in the local statistics. \n\nIt is in computing the observation likelihood that a sparse representation is instrumental: Consider the probability density function p(w_1 = v_1, w_2 = v_2, ..., w_n = v_n), which quantifies the likelihood of observing the local statistics with values v_1, ..., v_n within a particular context. An appropriate context may include a larger area encompassing the local neighbourhood described by w, or the entire scene in question. The presumed independence of the ICA decomposition means that p(w_1 = v_1, w_2 = v_2, ..., w_n = v_n) = prod_{i=1}^{n} p(w_i = v_i). Thus, a sparse representation allows the estimation of the n-dimensional space described by w to be derived from n one-dimensional probability density functions. Evaluating p(w_1 = v_1, w_2 = v_2, ..., w_n = v_n) requires considering the distribution of values taken on by each w_i in a more global context. In practice, this might be derived on the basis of a nonparametric or histogram density estimate. In the section that follows, we demonstrate that an operation equivalent to a non-parametric density estimate may be achieved using a suitable neural circuit. 
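The histogram variant of this computation can be sketched directly: one marginal density per coefficient, independence across coefficients, and saliency as the negative log of the joint likelihood. The array shapes and bin count below are illustrative assumptions, not the paper's values.

```python
# Self-information map from ICA coefficient maps, assuming independence:
# -log of a product of marginals = sum of the per-coefficient -log p terms.
import numpy as np

def self_information_map(coeffs, n_bins=100):
    """coeffs: (H, W, N) array holding N ICA coefficients at each location.
    Returns an (H, W) map of -log p(w_1 = v_1, ..., w_N = v_N)."""
    H, W, N = coeffs.shape
    sal = np.zeros((H, W))
    for i in range(N):
        c = coeffs[:, :, i]
        hist, edges = np.histogram(c, bins=n_bins, density=True)
        idx = np.clip(np.digitize(c, edges) - 1, 0, n_bins - 1)
        p = np.maximum(hist[idx] * np.diff(edges)[idx], 1e-12)  # bin mass
        sal -= np.log(p)   # accumulate -log p(w_i = v_i)
    return sal

coeffs = np.random.default_rng(1).normal(size=(16, 16, 4))
saliency = self_information_map(coeffs)   # rare coefficient values score high
```

Because each bin mass is at most 1, every location's saliency is non-negative, with the largest values at locations whose coefficients are unlikely under the image-wide densities.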
\n\n2.1 Likelihood Estimation in a Neural Circuit \n\nIn the following formulation, we assume an estimate of the likelihood of the components of w based on a Gaussian kernel density estimate. Any other choice of kernel may be substituted, with a Gaussian window chosen only for its common use in density estimation and without loss of generality. \n\nLet w_{i,j,k} denote the set of independent coefficients based on the neighborhood centered at j, k. An estimate of p(w_{i,j,k} = v_{i,j,k}) based on a Gaussian window is given by: \n\np(w_{i,j,k} = v_{i,j,k}) = sum_{s,t in Ψ} ω(s,t) K(v_{i,j,k} - v_{i,s,t})   (1) \n\nwith sum_{s,t} ω(s,t) = 1, where Ψ is the context on which the probability estimate of the coefficients of w is based. ω(s,t) describes the degree to which the coefficient w at coordinates s, t contributes to the probability estimate. On the basis of the form given in equation 1 it is evident that this operation may equivalently be implemented by the neural circuit depicted in figure 2. Figure 2 demonstrates only coefficients derived from a horizontal cross-section. The two dimensional case is analogous with parameters varying in i, j, and k dimensions. K consists of the kernel function employed for density estimation. In our case this is a Gaussian of the form K(x) = (1/(σ√(2π))) e^(-x²/(2σ²)). ω(s,t) is encoded based on the weight of connections to K. As x = v_{i,j,k} - v_{i,s,t}, the output of this operation encodes the impact of the kernel function with mean v_{i,s,t} on the value of p(w_{i,j,k} = v_{i,j,k}). Coefficients at the input layer correspond to coefficients of v. The logarithmic operator at the final stage might also be placed before the product on each incoming connection, with the product then becoming a summation. 
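A one-dimensional transcription of equation (1) makes the circuit concrete. The uniform context weights ω(s) = 1/n and the kernel width sigma below are illustrative choices, mirroring the cross-section of figure 2 rather than any calibrated parameter setting.

```python
# Gaussian kernel density estimate of a coefficient's likelihood, with
# every other position in the 1-D context contributing with equal weight.
import numpy as np

def kde_likelihood(v, sigma=0.1):
    """For each position j of a coefficient cross-section v, estimate
    p(w = v[j]) from all positions s via a Gaussian kernel K(v[j] - v[s])."""
    v = np.asarray(v, dtype=float)
    diffs = v[:, None] - v[None, :]                       # x = v_j - v_s
    K = np.exp(-diffs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return K.mean(axis=1)                                 # sum_s (1/n) K(x)

v = np.array([0.00, 0.01, -0.01, 0.02, 1.00])   # one outlying coefficient
p = kde_likelihood(v)
# The outlier receives the lowest likelihood, hence the highest -log p.
```

This is the sense in which the circuit computes contrast: a coefficient far from the values in its context collects almost no kernel mass and so is deemed informative.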
It is interesting to note that the structure of this circuit at the level of within-feature spatial competition is remarkably similar to the standard feedforward model of lateral inhibition, a ubiquitous operation along the visual pathways thought to play a chief role in attentional processing [14]. The similarity between independent components and V1 cells, in conjunction with the aforementioned consideration, lends credibility to the proposal that information may contribute to driving overt attentional selection. \n\nOne aspect lacking from the preceding description is that the saliency map fails to take into account the dropoff in visual acuity moving peripherally from the fovea. In some instances the maximum information accommodating for visual acuity may correspond to the center of a cluster of salient items, rather than centered on one such item. For this reason, the resulting saliency map is convolved with a Gaussian with parameters chosen to correspond approximately to the drop off in visual acuity observed in the human visual system. \n\n[Figure 1 appears here: a network diagram showing basis functions learned by infomax ICA from 360,000 random patches, with a 1D example based on shifting a window in the horizontal direction across the original image.] \n\nFigure 1: The framework that achieves the desired information measure. Shown is the computation corresponding to three horizontally adjacent neighbourhoods with flow through the network indicated by the orange, purple, and cyan windows and connections. The connections shown facilitate computation of the information measure corresponding to the pixel centered in the purple window. The network architecture produces this measure on the basis of evaluating the probability of these coefficients with consideration to the values of such coefficients in neighbouring regions. \n\n2.2 Self-Information versus Entropy \n\nIt is important to distinguish between self-information and entropy since these terms are often confused. The difference is subtle but important on two fronts. The first consideration lies in the expected behavior in popout paradigms and the second in the neural circuitry involved. \n\nLet X = [x_1, x_2, ..., x_n] denote a vector of RGB values corresponding to image patch X, and D a probability density function describing the distribution of some feature set over X. For example, D might correspond to a histogram estimate of intensity values within X or the relative contribution of different orientations within a local neighborhood situated on the boundary of an object silhouette [6]. Assuming an estimate of D based on N bins, the entropy of D is given by: -sum_{i=1}^{N} D_i log(D_i). In this example, entropy characterizes the extent to which the feature(s) characterized by D are uniformly distributed on X. Self-information in the proposed saliency measure is given by -log(p(X)). That is, self-information characterizes the raw likelihood of the specific n-dimensional vector of RGB values given by X. p(X) in this case is based on observing a number of n-dimensional feature vectors based on patches drawn from the area surrounding X. Thus, p(X) characterizes the raw likelihood of observing X based on its surround, and -log(p(X)) becomes closer to a measure of local contrast, whereas entropy as defined in the usual manner is closer to a measure of local activity. The importance of this distinction is evident in considering figure 3. Figure 3 depicts a variety of candles of varying orientation and color. 
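The popout contrast between entropy and self-information described in this section can be illustrated numerically. The toy orientation lists, bin count, and display sizes below are assumptions made purely for illustration: a vertical element among 49 horizontals versus the same element among 49 randomly oriented ones.

```python
# Entropy of a display's orientation histogram versus the self-information
# of a vertical (90 degree) target's orientation bin within that display.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def orientation_stats(scene, target_deg, bins=18):
    """Return (histogram entropy, -log p of the target's orientation bin)."""
    hist, edges = np.histogram(scene, bins=bins, range=(0, 180))
    p = hist / hist.sum()
    p_target = p[min(np.digitize(target_deg, edges) - 1, bins - 1)]
    return entropy(p), float(-np.log(p_target))

rng = np.random.default_rng(0)
homogeneous = np.array([0] * 49 + [90])                   # popout display
heterogeneous = np.append(rng.integers(0, 180, 49), 90)   # random display

h1, s1 = orientation_stats(homogeneous, 90)
h2, s2 = orientation_stats(heterogeneous, 90)
# The random display has far higher entropy, yet the vertical target is at
# least as self-informative (and typically far more so) among horizontals.
```

This matches the candle example: the region of lowest entropy can be the region of highest self-information once its surround is taken as the context.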
There is a tendency to fixate the empty region on the left, which is the location of lowest entropy in the image. In contrast, this region receives the highest confidence from the algorithm proposed in this paper, as it is highly informative in the context of this image. In classic popout experiments, a vertical line among horizontal lines presents a highly salient target. The same vertical line among many lines of random orientations is not, although the entropy associated with the second scenario is much greater. \n\nWith regard to the neural circuitry involved, we have demonstrated that self-information may be computed using a neural circuit in the absence of a representation of the entire probability distribution. Whether an equivalent operation may be achieved in a biologically plausible manner for the computation of entropy remains to be established. \n\n[Figure 2 appears here: a 1D cross-section of units labelled by their coefficient indices.] \n\nFigure 2: A 1D depiction of the neural architecture that computes the self-information of a set of local statistics. The operation is equivalent to a kernel density estimate. Coefficients correspond to subscripts of v_{i,j,k}. The small black circles indicate an inhibitory relationship and the small white circles an excitatory relationship. \n\nFigure 3: An image that highlights the difference between entropy and self-information. Fixation invariably falls on the empty patch, the locus of minimum entropy in orientation and color but maximum in self-information when the surrounding context is considered. \n\n3 Experimental Validation \n\nThe following section evaluates the output of the proposed algorithm as compared with the bottom-up model of Itti and Koch [3]. 
The model of Itti and Koch is perhaps the most popular model of saliency based attention and currently appears to be the yardstick against which other models are measured. \n\n3.1 Experimental eye tracking data \n\nThe data that forms the basis for performance evaluation is derived from eye tracking experiments performed while subjects observed 120 different color images. Images were presented in random order for 4 seconds each with a mask between each pair of images. Subjects were positioned 0.75m from a 21 inch CRT monitor and given no particular instructions except to observe the images. Images consist of a variety of indoor and outdoor scenes, some with very salient items, others with no particular regions of interest. The eye tracking apparatus consisted of a standard non-head-mounted device. The parameters of the setup are intended to quantify salience in a general sense based on stimuli that one might expect to encounter in a typical urban environment. Data was collected from 20 different subjects for the full set of 120 images. \n\nThe issue of comparing between the output of a particular algorithm and the eye tracking data is non-trivial. Previous efforts have selected a number of fixation points based on the saliency map, and compared these with the experimental fixation points derived from a small number of subjects and images (7 subjects and 15 images in a recent effort [4]). There are a variety of methodological issues associated with such a representation. The most important such consideration is that the representation of perceptual importance is typically based on a saliency map. Observing the output of an algorithm that selects fixation points based on the underlying saliency map obscures observation of the degree to which the saliency maps predict important and unimportant content and, in particular, ignores confidence away from highly salient regions. 
Secondly, it is not clear how many fixation points should be selected. Choosing this value based on the experimental data will bias output based on information pertaining to the content of the image and may produce artificially good results. \n\nThe preceding discussion is intended to motivate the fact that selecting discrete fixation coordinates based on the saliency map for comparison may not present the most appropriate representation to use for performance evaluation. In this effort, we consider two different measures of performance. Qualitative comparison is based on the representation proposed in [16]. In this representation, a fixation density map is produced for each image based on all fixation points and subjects. Given a fixation point, one might consider how the image under consideration is sampled by the human visual system, as photoreceptor density drops steeply moving peripherally from the centre of the fovea. This dropoff may be modeled based on a 2D Gaussian distribution with appropriately chosen parameters, centred on the measured fixation point. A continuous fixation density map may be derived for a particular image based on the sum of all 2D Gaussians corresponding to each fixation point, from each subject. The density map then comprises a measure of the extent to which each pixel of the image is sampled on average by a human observer based on observed fixations. This affords a representation for which similarity to a saliency map may be considered at a glance. Quantitative performance evaluation is achieved based on the measure proposed in [15]. The saliency maps produced by each algorithm are treated as binary classifiers for fixation versus non-fixation points. The choice of several different thresholds and assessment of performance in predicting fixated versus not fixated pixel locations allows an ROC curve to be produced for each algorithm. 
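The ROC construction just described can be sketched as follows; the threshold count and the tiny example maps are illustrative assumptions, not the evaluation protocol's exact settings.

```python
# Treat a saliency map as a binary classifier of fixated pixels: sweep
# thresholds, collect (false alarm rate, hit rate) pairs, and integrate
# the resulting ROC curve with the trapezoidal rule.
import numpy as np

def saliency_roc_auc(saliency, fixated, n_thresholds=50):
    s = saliency.ravel()
    y = fixated.ravel().astype(bool)
    thresholds = np.linspace(s.max(), s.min(), n_thresholds)
    tpr = np.array([(s[y] >= t).mean() for t in thresholds])   # hit rate
    fpr = np.array([(s[~y] >= t).mean() for t in thresholds])  # false alarms
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

sal = np.array([[0.9, 0.8], [0.2, 0.1]])
fix = np.array([[1, 1], [0, 0]])
auc = saliency_roc_auc(sal, fix)   # ranks all fixated pixels first, AUC = 1
```

An uninformative map would trace the diagonal (AUC near 0.5), which is why area under the curve is a reasonable single-number summary for comparing the two algorithms.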
\n\n3.2 Experimental Results \n\nFigure 4 affords a qualitative comparison of the output of the proposed model with the experimental eye tracking data for a variety of images. Also depicted is the output of the Itti and Koch algorithm for comparison. \n\nIn the implementation results shown, the ICA basis set was learned from a set of 360,000 7x7x3 image patches from 3600 natural images using the Lee et al. extended infomax algorithm [17]. Processed images are 340 by 255 pixels. Ψ consists of the entire extent of the image and ω(s,t) = 1/p for all s, t, with p the number of pixels in the image. One might make a variety of selections for these variables based on arguments related to the human visual system, or based on performance. In our case, the values have been chosen on the basis of simplicity and do not appear to dramatically affect the predictive capacity of the model in the simulation results. In particular, we wished to avoid tuning these parameters to the available data set. Future work may include a closer look at some of the parameters involved in order to determine the most appropriate choices. The ROC curves appearing in figure 5 give some sense of the efficacy of the model in predicting which regions of a scene human observers tend to fixate. As may be observed, the predictive capacity of the model is on par with the approach of Itti and Koch. Encouraging is the fact that similar performance is achieved using a method derived from first principles, and with no parameter tuning or ad hoc design choices. \n\nFigure 4: Results for qualitative comparison. Within each boxed region defined by solid lines: (Top Left) Original Image. (Top Right) Saliency map produced by the Itti and Koch algorithm. (Bottom Left) Saliency map based on information maximization. (Bottom Right) Fixation density map based on experimental human eye tracking data. 
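The fixation density map used for the qualitative comparison in section 3.1 can be sketched as below. The Gaussian width, map size, and example fixation list are illustrative choices, not the experiment's calibrated acuity parameters.

```python
# Fixation density map: sum one 2D Gaussian per measured fixation point
# (pooled over subjects), then normalize so the peak equals 1.
import numpy as np

def fixation_density_map(fixations, shape, sigma=20.0):
    """fixations: iterable of (row, col) points; shape: (H, W) in pixels."""
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W]
    density = np.zeros(shape)
    for fy, fx in fixations:
        density += np.exp(-((yy - fy) ** 2 + (xx - fx) ** 2)
                          / (2 * sigma ** 2))
    return density / density.max()

density = fixation_density_map([(10, 10), (30, 25)], shape=(40, 40), sigma=5.0)
```

The resulting map can be displayed next to a saliency map so that agreement between predicted and measured sampling is visible at a glance, as figure 4 does.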
\n\n4 On Biological Plausibility \n\nAlthough the proposed approach, along with the model of Itti and Koch, describes saliency on the basis of a single topographical saliency map, there is mounting evidence that saliency in the primate brain is represented at several levels based on a hierarchical representation [18] of visual content. The proposed approach may accommodate such a configuration with the single necessary condition being a sparse representation at each layer. \n\nAs we have described in section 2, there is evidence that suggests the possibility that the primate visual system may consist of a multi-layer sparse coding architecture [10, 11]. The proposed algorithm quantifies information on the basis of a neural circuit, on units with response properties corresponding to neurons appearing in the primary visual cortex. However, given an analogous representation corresponding to higher visual areas that encode form, depth, convexity etc., the proposed method may be employed without any modification. Since the popout of features can occur on the basis of more complex properties, such as a convex surface among concave surfaces [19], this is perhaps the next stage in a system that encodes saliency in the same manner as primates. Given a multi-layer architecture, the mechanism for selecting the locus of attention becomes less clear. In the model of Itti and Koch, a multi-layer winner-take-all network acts directly on the saliency map and there is no hierarchical representation of image content. There are however attention models that subscribe to a distributed representation of saliency (e.g. [20]) that may implement attentional selection with the proposed neural circuit encoding saliency at each layer. 
\n\n[Figure 5 appears here: two ROC curves, with hit rate plotted against false alarm rate.] \n\nFigure 5: ROC curves for self-information (blue) and Itti and Koch (red) saliency maps. Area under curves is 0.7288 and 0.7277 respectively. \n\n5 Conclusion \n\nWe have described a strategy that predicts human attentional deployment on the principle of maximizing information sampled from a scene. Although no computational machinery is included strictly on the basis of biological plausibility, nevertheless the formulation results in an implementation based on a neurally plausible circuit acting on units that resemble those that facilitate early visual processing in primates. Comparison with an existing attention model reveals the efficacy of the proposed model in predicting salient image content. Finally, we demonstrate that the proposal might be generalized to facilitate selection based on high-level features provided an appropriate sparse representation is available. \n\nReferences \n\n[1] G.T. Buswell, How people look at pictures. Chicago: The University of Chicago Press. \n[2] A. Yarbus, Eye movements and vision. New York: Plenum Press. \n[3] L. Itti, C. Koch, E. Niebur, IEEE T PAMI, 11:1254-1259, 1998. \n[4] C.M. Privitera and L.W. Stark, IEEE T PAMI 22:970-981, 2000. \n[5] F. Fritz, C. Seifert, L. Paletta, H. Bischof, Proc. WAPCV, Graz, Austria, 2004. \n[6] L.W. Renninger, J. Coughlan, P. Verghese, J. Malik, Proceedings NIPS 17, Vancouver, 2004. \n[7] T. Kadir, M. Brady, IJCV 45(2):83-105, 2001. \n[8] T.S. Lee, S. Yu, Advances in NIPS 12:834-840, Ed. S.A. Solla, T.K. Leen, K. Muller, MIT Press. \n[9] C.E. Shannon, The Bell System Technical Journal, 27:93-154, 1948. \n[10] D.J. Field and B.A. Olshausen, Nature 381:607-609, 1996. \n[11] A.J. Bell, T.J. Sejnowski, Vision Research 37:3327-3338, 1997. \n[12] N. Bruce, Neurocomputing, 65-66:125-133, 2005. \n[13] P. Comon, Signal Processing 36(3):287-314, 1994. \n[14] M.W. Cannon and S.C. Fullenkamp, Vision Research 36(8):1115-1125, 1996. \n[15] B.W. Tatler, R.J. Baddeley, I.D. Gilchrist, Vision Research 45(5):643-659, 2005. \n[16] H. Koesling, E. Carbone, H. Ritter, University of Bielefeld, Technical Report, 2002. \n[17] T.W. Lee, M. Girolami, T.J. Sejnowski, Neural Computation 11:417-441, 1999. \n[18] J. Braun, C. Koch, D.K. Lee, L. Itti, In: Visual Attention and Cortical Circuits (J. Braun, C. Koch, J. Davis, Ed.), 215-242, Cambridge, MA: MIT Press, 2001. \n[19] J. Hulleman, W. te Winkel, F. Boselie, Perception and Psychophysics 62:162-174, 2000. \n[20] J.K. Tsotsos, S. Culhane, W. Wai, Y. Lai, N. Davis, F. Nuflo, Artificial Intelligence 78(1-2):507-547, 1995."}, "award": [], "sourceid": 2830, "authors": [{"given_name": "Neil", "family_name": "Bruce", "institution": null}, {"given_name": "John", "family_name": "Tsotsos", "institution": null}]}