{"title": "Saliency, Scale and Information: Towards a Unifying Theory", "book": "Advances in Neural Information Processing Systems", "page_first": 2188, "page_last": 2196, "abstract": "In this paper we present a definition for visual saliency grounded in information theory. This proposal is shown to relate to a variety of classic research contributions in scale-space theory, interest point detection, bilateral filtering, and to existing models of visual saliency. Based on the proposed definition of visual saliency, we demonstrate results competitive with the state-of-the art for both prediction of human fixations, and segmentation of salient objects. We also characterize different properties of this model including robustness to image transformations, and extension to a wide range of other data types with 3D mesh models serving as an example. Finally, we relate this proposal more generally to the role of saliency computation in visual information processing and draw connections to putative mechanisms for saliency computation in human vision.", "full_text": "Saliency, Scale and Information:\n\nTowards a Unifying Theory\n\nSha\ufb01n Rahman\n\nDepartment of Computer Science\n\nUniversity of Manitoba\n\nshafin109@gmail.com\n\nNeil D.B. Bruce\n\nDepartment of Computer Science\n\nUniversity of Manitoba\n\nbruce@cs.umanitoba.ca\n\nAbstract\n\nIn this paper we present a de\ufb01nition for visual saliency grounded in information\ntheory. This proposal is shown to relate to a variety of classic research contribu-\ntions in scale-space theory, interest point detection, bilateral \ufb01ltering, and to exist-\ning models of visual saliency. Based on the proposed de\ufb01nition of visual saliency,\nwe demonstrate results competitive with the state-of-the art for both prediction of\nhuman \ufb01xations, and segmentation of salient objects. 
We also characterize differ-\nent properties of this model including robustness to image transformations, and\nextension to a wide range of other data types with 3D mesh models serving as an\nexample. Finally, we relate this proposal more generally to the role of saliency\ncomputation in visual information processing and draw connections to putative\nmechanisms for saliency computation in human vision.\n\n1\n\nIntroduction\n\nMany models of visual saliency have been proposed in the last decade with differences in de\ufb01ning\nprinciples and also divergent objectives. The motivation for these models is divided among several\ndistinct but related problems including human \ufb01xation prediction, salient object segmentation, and\nmore general measures of objectness. Models also vary in intent and range from hypotheses for\nsaliency computation in human visual cortex to those motivated exclusively by applications in com-\nputer vision. At a high level the notion of saliency seems relatively straightforward and characterized\nby patterns that stand out from their context according to unique colors, striking patterns, disconti-\nnuities in structure, or more generally, \ufb01gure against ground. While this is a seemingly simplistic\nconcept, the relative importance of de\ufb01ning principles of a model, and \ufb01ne grained implementation\ndetails in determining output remains obscured. Given similarities in the motivation for different\nmodels, there is also value in considering how different de\ufb01nitions of saliency relate to each other\nwhile also giving careful consideration to parallels to related concepts in biological and computer\nvision.\nThe characterization sought by models of visual saliency is reminiscent of ideas expressed throughout\nseminal work in computer vision. 
For example, early work in scale-space theory includes emphasis\non the importance of extrema in structure expressed across scale-space as an indicator of poten-\ntially important image content [1, 2]. Related efforts grounded in information theory that venture\ncloser to modern notions of saliency include Kadir and Brady\u2019s [3] and Jagersand\u2019s [4] analyses of\nthe interaction between scale and local entropy in de\ufb01ning relevant image content. These concepts have\nplayed a signi\ufb01cant role in techniques for af\ufb01ne invariant keypoint matching [5], but have received\nless attention in the direct prediction of saliency. Information theoretic models are found in the\nliterature directly addressing saliency prediction for determining gaze points or proto-objects. A\nprominent example of this is the AIM model wherein saliency is based directly on measuring the\nself-information of image patterns [6]. Alternative information theoretic de\ufb01nitions have been pro-\nposed [7, 8] including numerous models based on measures of redundancy or compressibility that\nare strongly related to information theoretic concepts given common roots in communication theory.\nIn this paper, we present a relatively simple information theoretic de\ufb01nition of saliency that is shown\nto have strong ties to a number of classic concepts in the computer vision and visual saliency litera-\nture. Beyond a speci\ufb01c model, this also serves to establish formalism for characterizing relationships\nbetween scale, information and saliency. This analysis also hints at the relative importance of \ufb01ne\ngrained implementation details in differentiating performance across models that employ disparate,\nbut strongly related de\ufb01nitions of visual salience. The balance of the paper is structured as follows:\nIn section 2 we outline the principle for visual saliency computation proposed in this paper de\ufb01ned\nby maxima in information scale-space (MISS). 
In section 3 we demonstrate different characteristics\nof the proposed metric, and performance on standard benchmarks. Finally, section 4 summarizes\nmain points of this paper, and includes discussion of broader implications.\n\n2 Maxima in Information Scale-Space (MISS)\n\nIn the following, we present a general de\ufb01nition of saliency that is strongly related to prior work\ndiscussed in section 1. In short, according to our proposal, saliency corresponds to maxima in\ninformation scale-space (MISS). The description of MISS follows, and is accompanied by more\nspeci\ufb01c discussion of related concepts in computer vision and visual saliency research.\nLet us \ufb01rst assume that the saliency of statistics that de\ufb01ne a local region of an image is a function\nof the rarity (likelihood) of such statistics. We\u2019ll further assume without loss of generality that these\nlocal statistics correspond to pixel intensities.\nThe likelihood of observing a pixel at position p with intensity Ip in an image based on the global\nstatistics is given by the frequency of intensity Ip relative to the total number of pixels (i.e. a nor-\nmalized histogram lookup). This may be expressed as follows: H(Ip) = \u03a3q\u2208S \u03b4(Iq \u2212 Ip)/|S| with\n\u03b4 the Dirac delta function. One may generalize this expression to a non-parametric (kernel) density\nestimate: H(Ip) = \u03a3q\u2208S G\u03c3i(Ip \u2212 Iq) where G\u03c3i corresponds to a kernel function (assumed to be\nGaussian in this case). This may be viewed as either smoothing the intensity histogram, or applying\na density estimate that is more robust to low sample density1. In practice, the proximity of pixels to\none another is also relevant. Filtering operations applied to images are typically local in their extent,\nand the correlation among pixel values is inversely proportional to the spatial distance between them.\nAdding a local spatial weighting to the likelihood estimate such that nearby pixels have a stronger\nin\ufb02uence, the expression is as follows:\n\nH(Ip) = \u03a3q\u2208S G\u03c3b (||p \u2212 q||)G\u03c3i(||Ip \u2212 Iq||)\n\n(1)\n\nThis constitutes a locally weighted likelihood estimate of intensity values based on pixels in the\nsurround.\nHaving established the expression in equation 1, we shift to discussion of scale-space theory. In\ntraditional scale-space theory the scale-space representation L(x, y; t) is de\ufb01ned by convolution of\nan image f (x, y) with a Gaussian kernel g(x, y) such that L(x, y; t) = g(., ., t) \u2217 f (., .) with t the\nvariance of a Gaussian \ufb01lter. Scale-space features are often derived from the family of Gaussian\nderivatives de\ufb01ned by Lx^m y^n(., .; t) = \u03b4x^m y^n g(., ., t) \u2217 f (., .) with differential invariants produced\nby combining Gaussian derivatives of different orders in a weighted combination. An important\nconcept in scale-space theory is the notion that scale selection, or the size and position of relevant\nstructure in the data, is related to the scale at which features (e.g. normalized derivatives) assume\na maximum value. This consideration forms the basis for early de\ufb01nitions of saliency which derive\na measure of saliency corresponding to the scale at which local entropy is maximal. This point is\nrevisited later in this section.\nThe scale-space representation may also be de\ufb01ned as the solution to the heat equation: \u03b4I/\u03b4t = \u2206I =\nIxx + Iyy which may be rewritten as G[I]p \u2212 I \u2248 \u2206I where G[I]p = \u222bS G\u03c3sIqdq and S the local\nspatial support.\n\n1Although this example is based on pixel intensities, the same analysis may be applied to statistics of\narbitrary dimensionality. For higher dimensional feature vectors, appropriate sampling is especially important.\n\nThis expression is the solution to the heat equation when \u03c3s = \u221a(2t). This corre-\nsponds to a diffusion process that is isotropic. There are also a variety of operations in image analysis\nand \ufb01ltering that correspond to a more general process of anisotropic diffusion. One prominent ex-\nample is that proposed by Perona and Malik [9] that implements edge preserving smoothing. A\nsimilar process is captured by the Yaroslavsky \ufb01lter: Y [I]p = (1/C(p)) \u222bB\u03c3S G\u03c3r (||Ip \u2212 Iq||)Iqdq [10]\nwith B\u03c3S re\ufb02ecting the spatial range of the \ufb01lter. The difference between these techniques and an\nisotropic diffusion process is that relative intensity values among local pixels determine the degree\nof diffusion (or weighted local sampling).\nThe Yaroslavsky \ufb01lter may be shown to be a special case of the more general bilateral \ufb01lter\ncorresponding to a step-function for the spatial weight factor [11]:\nB[I]p = (1/Wp) \u03a3q\u2208S G\u03c3b (||p \u2212 q||)G\u03c3i(||Ip \u2212 Iq||)Iq with Wp = \u03a3q\u2208S G\u03c3b (||p \u2212 q||)G\u03c3i(||Ip \u2212 Iq||).\nIn the same manner that selection of scale-space extrema de\ufb01ned by an isotropic diffusion process\ncarries value in characterizing relevant image content and scale, we propose to consider scale-space\nextrema that carry a relationship to an anisotropic diffusion process.\nNote that the normalization term Wp appearing in the equation for the bilateral \ufb01lter is equivalent\nto the expression appearing in equation 1. In contrast to bilateral \ufb01ltering, we are not interested in\nproducing a weighted sample of local intensities but we instead consider the sum of the weights\nthemselves, which corresponds to a robust estimate of the likelihood of Ip. One may further relate\nthis to an information theoretic quantity of self-information in considering \u2212log(p(Ip)), the self-\ninformation associated with the observation of intensity Ip.\nWith the above terms de\ufb01ned, Maxima in Information Scale-Space are de\ufb01ned as:\n\nMISS(Ip) = max\u03c3b (\u2212log(\u03a3q\u2208S G\u03c3b (||p \u2212 q||)G\u03c3i(||Ip \u2212 Iq||)))\n\n(2)\n\nSaliency is therefore equated to the local self-information for the scale at which this quantity has its\nmaximum value (for each pixel location) in a manner akin to scale selection based on normalized\ngradients or differential invariants [12]. This also corresponds to scale (and value) selection based\non maxima in the sum of weights that de\ufb01ne a local anisotropic diffusion process. In what follows,\nwe comment further on conceptual connections to related work:\n1. Scale space extrema: The de\ufb01nition expressed in equation 2 has a strong relationship to the idea\nof selecting extrema corresponding to normalized gradients in scale-space [1] or in curvature-scale\nspace [13]. In this case, rather than a Gaussian blurred intensity pro\ufb01le, scale extrema are evaluated\nwith respect to local information expressed across scale space.\n2. Kadir and Brady: In Kadir and Brady\u2019s proposal, interest points or saliency in general is related\nto the scale at which entropy is maximal [3]. While entropy and self-information are related, max-\nima in local entropy alone are insuf\ufb01cient to de\ufb01ne salient content. Regions are therefore selected on\nthe basis of the product of maximal local entropy and magnitude change of the probability density\nfunction. In contrast, the approach employed by MISS relies only on the expression in equation 2,\nand does not require additional normalization. 
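As a concrete aside, the computation in equation 2 can be sketched in a few lines. This is a minimal illustration only: the function and parameter names are ours, and the scale set, window radius and intensity bandwidth below are arbitrary choices rather than the settings used in the experiments.

```python
import numpy as np

def miss(image, sigmas_b=(1.0, 2.0, 4.0), sigma_i=0.1, radius=4):
    """Sketch of equation 2: for each pixel, take the maximum over
    spatial scales sigma_b of -log of the bilateral weight sum, which
    serves as a local likelihood estimate of the pixel's intensity."""
    h, w = image.shape
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    dist2 = ys ** 2 + xs ** 2
    pad = np.pad(image, radius, mode='reflect')
    saliency = np.full((h, w), -np.inf)
    for sigma_b in sigmas_b:
        spatial = np.exp(-dist2 / (2 * sigma_b ** 2))
        likelihood = np.zeros((h, w))
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                shifted = pad[radius + dy:radius + dy + h,
                              radius + dx:radius + dx + w]
                # G_sigma_b(||p-q||) * G_sigma_i(||Ip-Iq||), summed over q
                likelihood += spatial[dy + radius, dx + radius] * np.exp(
                    -(image - shifted) ** 2 / (2 * sigma_i ** 2))
        saliency = np.maximum(saliency, -np.log(likelihood))
    return saliency
```

On a uniform image with a single odd pixel, the odd pixel's weight sum stays close to its self term alone, so its self-information, and hence its saliency, is highest.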
It is worth noting that success in matching keypoints\nrelies on the distinctness of keypoint descriptors which is a notion closely related to saliency.\n3. Attention based on Information Maximization (AIM): The quantity expressed in equation\n2 is identical to the de\ufb01nition of saliency assumed by the AIM model [6] for a speci\ufb01c choice of\nlocal features, and a \ufb01xed scale. The method proposed in equation 2 considers the maximum self-\ninformation expressed across scale space for each local observation to determine relative saliency.\n4. Bilateral \ufb01ltering: Bilateral \ufb01ltering produces a weighted sample of local intensity values based\non proximity in space and feature space. The sum of weights in the normalization term provides a\ndirect estimate of the likelihood of the intensity (or statistics) at the kernel center, and is directly\nrelated to self-information.\n5. Graph Based Saliency and Random Walks: Proposals for visual saliency also include tech-\nniques de\ufb01ned by graphs and random walks [14]. There is also common ground between this family\nof approaches and those grounded in information theory. Speci\ufb01cally, a random walk or Markov\nprocess de\ufb01ned on a lattice may be seen as a process related to anisotropic diffusion where the tran-\nsition probabilities between nodes de\ufb01ne diffusion on the lattice. For a model such as Graph Based\nVisual Saliency (GBVS) [14], a directed edge from node (i, j) to node (p, q) is given a weight\nw((i, j), (p, q)) = d((i, j)||(p, q))F (i \u2212 p, j \u2212 q) where d is a measure of dissimilarity and F a 2-D\nGaussian pro\ufb01le. 
In the event that the dissimilarity measure is also de\ufb01ned by a Gaussian function\nof intensity values at (i, j) and (p, q), the edge weight de\ufb01ning a transition probability is equivalent\nto Wp and the expression in equation 1.\n\n3 Evaluation\n\nIn this section we present an array of results that demonstrate the utility and generality of the pro-\nposed saliency measure. This includes typical saliency benchmark results for both \ufb01xation predic-\ntion and object segmentation based on MISS. We also consider the relative invariance of this measure\nto image deformations (e.g. viewpoint, lighting) and demonstrate robustness to such deformations.\nThis is accompanied by demonstration of the value of MISS in a more general sense in assessing\nsaliency for a broad range of data types, with a demonstration based on 3D point cloud data. Finally,\nwe also contrast behavior against very recently proposed models of visual saliency that leverage\ndeep learning, revealing distinct and important facets of the overall problem.\nThe results that are included follow the framework established in section 2. However, the intensity\nvalue appearing in equations in section 2 is replaced by a 3D vector of RGB values corresponding to\neach pixel. ||.|| denotes the L2 norm, and is therefore a Euclidean distance in the RGB colorspace. It\nis worth noting that the de\ufb01nition of MISS may be applied to arbitrary features including normalized\ngradients, differential invariants or alternative features. The motivation for choosing pixel color\nvalues is to demonstrate that a high level of performance may be achieved on standard benchmarks\nusing a relatively simple set of features in combination with MISS.\nA variety of post-processing steps are commonplace in evaluating saliency models, including topo-\nlogical spatial bias of output, or local Gaussian blur of the saliency map. 
In some of our results (as\nnoted) bilateral blurring has been applied to the output saliency map in place of standard Gaussian\nblurring. The reasons for this are detailed later in this section, but it is worth stating that this\nhas been shown to be advantageous in comparison to standard Gaussian blur in our benchmark\nresults.\nBenchmark results are provided for both \ufb01xation data and salient object segmentation. For segmen-\ntation based evaluation, we apply the methods described by Li et al. [15]. This involves segmentation\nusing MCG [16], with resulting segments weighted based on the saliency map 2.\n\n3.1 MISS versus Scale\n\nIn considering scale space extrema, plotting entropy or energy among normalized derivatives across\nscale is revealing with respect to characteristic scale and regions of interest [3]. Following this line\nof analysis, in Figure 1 we demonstrate variation in information scale-space values as a function\nof \u03c3b expressed in pixels. In Figure 1(a) three pixels are labeled corresponding to each of these\ncategories as indicated by colored dots. The plot in Figure 1(b) shows the self-information for all\nof the selected pixels considering a wide range of scales. Object pixels, edge pixels and non-object\npixels tend to produce different characteristic curves across scale in considering \u2212log(p(Ip)).\n\n3.2 Center bias via local connectivity\n\nCenter bias has been much discussed in the saliency literature, and as such, we include results in this\nsection that apply a different strategy for considering center bias. In particular, in the following cen-\nter bias appears more directly as a factor that in\ufb02uences the relative weights assigned to a likelihood\nestimate de\ufb01ned by local pixels. This effectively means that pixels closer to the center have more\nin\ufb02uence in determining estimated likelihoods. 
One can imagine such an operation having a more\nprominent role in a foveated vision system wherein centrally located photoreceptors have a much\ngreater density than those in the periphery. The \ufb01rst variant of center bias proposed is as follows:\n\nMISS_CB-1(Ip) = max\u03c3b (\u2212log[\u03a3q\u2208S G\u03c3b (||p \u2212 q||)G\u03c3i(||Ip \u2212 Iq||)G\u03c3cb (||q \u2212 c||)])\n\nwhere c is the spatial center of the image, and G\u03c3cb is a Gaussian function which controls the amount of center\nbias based on \u03c3cb. The second approach includes the center bias control parameters directly within\nthe second Gaussian function:\n\nMISS_CB-2(Ip) = max\u03c3b (\u2212log[\u03a3q\u2208S G\u03c3b (||p \u2212 q||)G\u03c3i(||Ip \u2212 Iq|| \u00d7 (M \u2212 ||q \u2212 c||))])\n\nwhere M is the maximum possible distance from the center pixel c to any other pixel.\n\n2Note that while the authors originally employed CPMC [17] as a segmentation algorithm, more recent\nrecommendations from the authors prescribe the use of MCG [16].\n\nFigure 1: (a) Sample image with select pixel locations highlighted in color. (b) Self-information of\nthe corresponding pixel locations as a function of scale.\n\nFigure 2: Input images in (a) and sample output for (b) raw saliency maps (c) with bilateral blur (d)\nusing CB-1 bias (e) using CB-2 bias (f) object segmentation using MCG+MISS\n\n3.3 Salient objects and \ufb01xations\n\nEvaluation results address two distinct and standard problems in saliency prediction. These are\n\ufb01xation prediction, and salient object prediction respectively. The evaluation largely follows the\nmethodology employed by Li et al. [15]. 
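The CB-1 variant of section 3.2 simply multiplies one more Gaussian factor, on distance from the image center, into each surround pixel's contribution to the likelihood estimate. A minimal sketch of that factor follows (function and parameter names are ours):

```python
import numpy as np

def center_bias(shape, sigma_cb):
    """G_sigma_cb(||q - c||): Gaussian falloff from the image center c,
    multiplied into each surround pixel's weight in the CB-1 variant."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-d2 / (2 * sigma_cb ** 2))
```

The resulting map is 1 at the center and decays symmetrically with distance, so centrally located pixels dominate the likelihood estimate.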
Benchmarking metrics considered are common standards\nin saliency model evaluation, and details are found in the supplementary material.\nWe have compared our results with several saliency and segmentation algorithms (ITTI [18], AIM\n[6], GBVS [14], DVA [19], SUN [20], SIG [21], AWS [22], FT [23], GC [24], SF [25], PCAS [26])\nacross different datasets. Note that for segmentation based tests, comparison among saliency\nalgorithms considers only MCG+GBVS. The reason for this is that this was the highest performing\nof all of the saliency algorithms considered by Li et al. [15].\nIn our results, we exercise a range of parameters to gauge their relative importance. The size of\nthe Gaussian kernel G\u03c3b determines the spatial scale. 25 different kernel sizes are considered in a range\nfrom 3x3 to 125x125 pixels with the standard deviation \u03c3b equal to one third of the kernel width. For\n\ufb01xation prediction, only a subset of smaller scales is suf\ufb01cient to achieve good performance, but the\ncomplete set of scales is necessary for segmentation. The Gaussian kernel that de\ufb01nes color distance\nG\u03c3i is determined by the standard deviation \u03c3i. We tested values for \u03c3i ranging from 0.1 to 10. For\npost processing, standard bilateral \ufb01ltering (BB) with a kernel size of 9 \u00d7 9 is used, and center bias results\nare based on a \ufb01xed \u03c3cb = 5 for the kernel G\u03c3cb for CB-1. For the second alternative method (CB-2)\none Gaussian kernel G\u03c3i is used with \u03c3i = 10. All of these settings have also been evaluated with different\nscaling factors applied to the overall image (0.25, 0.5 and 1), and in most cases results corresponding\nto the resize factor of 0.25 are best. 
Scaling down the image implies a shift in the scales spanned in\nscale space towards lower spatial frequencies.\n\nTable 1: Benchmarking results for \ufb01xation prediction\n\ns-AUC | aws | aim | sig | dva | gbvs | sun | itti | miss Basic | miss BB | miss CB-1 | miss CB-2\nbruce | 0.7171 | 0.6973 | 0.714 | 0.684 | 0.67 | 0.665 | 0.656 | 0.68 | 0.6914 | 0.625 | 0.672\ncerf | 0.7343 | 0.756 | 0.7432 | 0.716 | 0.706 | 0.691 | 0.681 | 0.7431 | 0.72 | 0.621 | 0.7264\njudd | 0.8292 | 0.824 | 0.812 | 0.807 | 0.777 | 0.806 | 0.794 | 0.807 | 0.809 | 0.8321 | 0.8253\nimgsal | 0.8691 | 0.854 | 0.862 | 0.856 | 0.83 | 0.8682 | 0.851 | 0.8653 | 0.8644 | 0.832 | 0.845\npascal | 0.8111 | 0.803 | 0.8072 | 0.795 | 0.758 | 0.8044 | 0.773 | 0.802 | 0.803 | 0.8043 | 0.801\n\nTable 2: Benchmarking results for salient object prediction (saliency algorithms)\n\nF-score | aws | aim | sig | dva | gbvs | sun | itti | miss Basic | miss BB | miss CB-1 | miss CB-2\nft | 0.6932 | 0.6564 | 0.652 | 0.633 | 0.649 | 0.638 | 0.623 | 0.640 | 0.6853 | 0.653 | 0.7131\nimgsal | 0.5951 | 0.536 | 0.5902 | 0.491 | 0.5574 | 0.438 | 0.520 | 0.432 | 0.521 | 0.527 | 0.5823\npascal | 0.569 | 0.5871 | 0.566 | 0.529 | 0.529 | 0.514 | 0.5852 | 0.486 | 0.508 | 0.5833 | 0.5744\n\nTable 3: Benchmarking results for salient object prediction (segmentation algorithms)\n\nF-score | sf | gc | pcas | ft | mcg+gbvs [15] | mcg+miss Basic | mcg+miss BB | mcg+miss CB-1 | mcg+miss CB-2\nft | 0.8533 | 0.804 | 0.833 | 0.709 | 0.8532 | 0.8493 | 0.8454 | 0.8551 | 0.839\nimgsal | 0.494 | 0.5712 | 0.6121 | 0.418 | 0.5423 | 0.5354 | 0.513 | 0.514 | 0.521\npascal | 0.534 | 0.582 | 0.600 | 0.415 | 0.6752 | 0.6674 | 0.666 | 0.6791 | 0.6733\n\nIn Figure 2, we show some qualitative results of output corresponding to MISS with different post-\nprocessing variants of center bias weighting for both saliency prediction and object segmentation.\n\n3.4 Lighting and viewpoint invariance\n\nGiven the relationship between MISS and models that address the problem 
of invariant keypoint\nselection, it is interesting to consider the relative invariance in saliency output subject to changing\nviewpoint, lighting or other imaging conditions. This is especially true given that saliency models\nhave been shown to typically exhibit a high degree of sensitivity to imaging conditions [27]. This\nimplies that this analysis is relevant not only to interest point selection, but also to measuring the\nrelative robustness to small changes in viewpoint, lighting or optics in predicting \ufb01xations or salient\ntargets.\nTo examine af\ufb01ne invariance, we have used image samples from a classic benchmark [5] which\nrepresent changes in zoom+rotation, blur, lighting and viewpoint. In all of these sequences, the \ufb01rst\nimage is the reference image and the imaging conditions change gradually throughout the sequence.\nWe have applied the MISS algorithm (without considering any center bias) to all of the full-size\nimages in those sequences. From the raw saliency output, we have selected keypoints based on non-\nmaxima suppression with radius = 5 pixels, and threshold = 0.1. For every detected keypoint we\nassign a circular region centered at the keypoint. The radius of this circular region is based on the\nwidth of the Gaussian kernel G\u03c3b de\ufb01ning the characteristic scale at which self-information achieves\na maximum response. Keypoint regions are compared across images subject to their repeatability\n[5]. Repeatability measures the similarity among detected regions across different frames and is a\nstandard way of gauging the capability to detect common regions across different types of image\ndeformations. 
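The keypoint selection step described above, non-maxima suppression with a radius of 5 pixels and a threshold of 0.1, can be sketched as follows; implementation details beyond those two parameters (square neighborhood, padding) are our own assumptions:

```python
import numpy as np

def nms_keypoints(saliency, radius=5, threshold=0.1):
    """Keep pixels that exceed `threshold` and are maximal within a
    (2*radius+1)-sized square neighborhood of the saliency map."""
    h, w = saliency.shape
    pad = np.pad(saliency, radius, mode='constant', constant_values=-np.inf)
    keypoints = []
    for y in range(h):
        for x in range(w):
            v = saliency[y, x]
            if v <= threshold:
                continue
            # window is centered on (y, x) in the padded map
            window = pad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            if v >= window.max():
                keypoints.append((y, x))
    return keypoints
```

Each surviving keypoint would then be assigned a circular region whose radius follows the characteristic scale, i.e. the kernel width at which self-information peaked.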
We compare our results with several other region detectors including Harris, Hessian,\nMSER, IBR and EBR [5].\nFigure 3 demonstrates output corresponding to the proposed saliency measure, revealing a con-\nsiderable degree of invariance to af\ufb01ne transformations and changing image characteristics, suggest-\ning robustness for applications in gaze prediction and object selection.\n\nFigure 3: A demonstration of invariance to varying image conditions including viewpoint, lighting\nand blur based on a standard benchmark [5].\n\n3.5 Beyond Images\n\nFigure 4: Saliency for 2 different scales on a mesh\nmodel. Results correspond to a surround based\non 100 nearest neighbors (left) and 4000 nearest\nneighbors (right) respectively.\n\nWhile the discussion in this paper has focused\nalmost exclusively on image input, it is worth\nnoting that the proposed de\ufb01nition of saliency\nis suf\ufb01ciently general that this may be ap-\nplied to alternative forms of data including im-\nages/videos, 3D models, audio signals or any\nform of data with locality in space or time.\nTo demonstrate this, we present saliency output\nbased on scale-space information for a 3D mesh\nmodel. Given that vertices are sparsely repre-\nsented in a 3D coordinate space in contrast to\nthe continuous discretized grid representation\npresent for images, some differences are nec-\nessary in how likelihood estimates are derived.\nIn this case, the spatial support is de\ufb01ned ac-\ncording to the k nearest (spatial) neighbors of\neach vertex. Instead of color values, each vertex belonging to the mesh is characterized by a three\ndimensional vector de\ufb01ning a surface normal in the x, y and z directions. Computation is otherwise\nidentical to the process outlined in equation 2. An example of output associated with two different\nchoices of k is shown in \ufb01gure 4 corresponding to k = 100 and k = 4000 respectively for a 3D\nmodel with 172974 vertices. 
For demonstrative purposes, the output for two individual spatial scales\nis shown rather than the maximum across scales. Red indicates high levels of saliency, and green\nlow. Values on the mesh are histogram equalized to equate any contrast differences. It is interest-\ning to note that this saliency metric (and output) is very similar to proposals in computer graphics\nfor determining local mesh saliency serving mesh simpli\ufb01cation [28]. Note that this method allows\ndetermination of a characteristic scale for vertices on the mesh in addition to de\ufb01ning saliency. This\nmay also be useful for inferring the relationship between different parts (e.g. hands vs. \ufb01ngers).\nThere is considerable generality in that the measure of saliency assumed is agnostic to the features\nconsidered, with a few caveats. Given that our results are based on local color values, this implies\na relatively low dimensional feature space on which likelihoods are estimated. However, one can\nimagine an analogous scenario wherein each image location is characterized by a feature vector\n(e.g. outputs of a bank of log-Gabor \ufb01lters) resulting in much higher dimensionality in the statistics.\nAs dimensionality increases in feature space, the \ufb01nite number of samples within a local spatial or\ntemporal window implies an exponential decline in the sample density for likelihood estimation.\nThis consideration can be addressed by applying an approximation based on marginal statistics (as in\n[29, 20, 30]). Such an approximation relies on assumptions such as independence which may be\nachieved for arbitrary data sets by \ufb01rst encoding raw feature values via stacked (sparse) autoencoders\nor related feature learning strategies. 
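Returning to the mesh example, the computation mirrors equation 2 with k-nearest-neighbor spatial support and surface normals as features. The sketch below is illustrative only: the brute-force neighbor search, the choice of spatial scale per neighborhood, and the sigma values are our own simplifications, not the paper's settings.

```python
import numpy as np

def mesh_self_information(vertices, normals, k=8, sigma_i=0.2):
    """Per-vertex -log of the weight sum, with spatial support given by
    the k nearest neighbors and the feature given by the surface normal."""
    vertices = np.asarray(vertices, dtype=float)
    normals = np.asarray(normals, dtype=float)
    info = np.zeros(len(vertices))
    for i in range(len(vertices)):
        d = np.linalg.norm(vertices - vertices[i], axis=1)
        nbrs = np.argsort(d)[:k + 1]        # the vertex itself plus k neighbors
        sigma_b = d[nbrs].max() + 1e-9      # spatial scale set by the surround
        w_spatial = np.exp(-d[nbrs] ** 2 / (2 * sigma_b ** 2))
        dn = np.linalg.norm(normals[nbrs] - normals[i], axis=1)
        w_feature = np.exp(-dn ** 2 / (2 * sigma_i ** 2))
        info[i] = -np.log(np.sum(w_spatial * w_feature))
    return info
```

On a flat patch with one flipped normal, the flipped vertex's neighbors contribute almost nothing to its weight sum, so its self-information is highest, in line with the behavior shown in figure 4.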
One might also note that saliency values may be assigned to\nunits across different layers of a hierarchical representation based on such a feature representation.\n\n3.6 Saliency, context and human vision\n\nSolutions to quantifying visual saliency based on deep learning have begun to appear in the litera-\nture. This has been made possible in part by efforts to scale up data collection via crowdsourcing in\nde\ufb01ning tasks that serve as an approximation of traditional gaze tracking studies [31]. Recent (yet\nto be published) methods of this variety show a considerable improvement on some standard bench-\nmarks over traditional models. It is therefore interesting to consider what differences exist between\nsuch approaches, and more traditional approaches premised on measures of local feature contrast.\nTo this end, we present some examples in Figure 5 where output differs signi\ufb01cantly between a\nmodel based on deep learning (SALICON [31]) and one based on feature contrast (MISS).\nThe importance of this example is in highlighting different aspects of saliency computation that\ncontribute to the bigger picture. It is evident that models capable of detecting speci\ufb01c objects and\nmodeling context may perform well on saliency benchmarks. However, it is also evident that\nthere is some de\ufb01cit in their capacity to represent saliency de\ufb01ned by strong feature contrast or\naccording to factors of importance in human visual search behavior. 
In the same vein, in human\nvision, hierarchical feature extraction from edges to complex objects, and local measures for gain\ncontrol, normalization and feature contrast play a signi\ufb01cant role, all acting in concert. It is therefore\nnatural to entertain the idea that a comprehensive solution to the problem involves considering both\nhigh-level features of the nature implemented in deep learning models coupled with contrastive\nsaliency akin to MISS. In practice, the role of salience in a distributed representation in modulating\nobject and context speci\ufb01c signals presents one promising avenue for addressing this problem.\nIt has been argued that normalization is a canonical operation in sensory neural information pro-\ncessing. Under the assumption of Generalized Gaussian statistics, it can be shown that divisive\nnormalization implements an operation equivalent to a log likelihood of a neural response in refer-\nence to cells in the surround [30]. The nature of computation assumed by MISS therefore \ufb01nds a\nstrong correlate in basic operations that implement feature contrast in human vision, and that pairs\nnaturally with the structure of computation associated with representing objects and context.\n\nFigure 5: Examples where a deep learning model produces counterintuitive results relative to models\nbased on feature contrast. Top: Original Image. Middle: SALICON output. Bottom: MISS output.\n\n4 Discussion\n\nIn this paper we present a generalized information theoretic characterization of saliency based on\nmaxima in information scale-space. This de\ufb01nition is shown to be related to a variety of classic\nresearch contributions in scale-space theory, interest point detection, bilateral \ufb01ltering, and existing\nmodels of visual saliency. Based on a relatively simplistic de\ufb01nition, the proposal is shown to be\ncompetitive against contemporary saliency models for both \ufb01xation based and object based saliency\nprediction. 
This also includes a demonstration of the relative robustness to image transformations and generalization of the proposal to a broad range of data types. Finally, we motivate an important distinction between contextual and contrast related factors in driving saliency, and draw connections to associated mechanisms for saliency computation in human vision.

Acknowledgments

The authors acknowledge financial support from the NSERC Canada Discovery Grants program, University of Manitoba GETS funding, and ONR grant #N00178-14-Q-4583.

References

[1] J.J. Koenderink. The structure of images. Biological Cybernetics, 50(5):363–370, 1984.
[2] T. Lindeberg. Scale-space theory: A basic tool for analyzing structures at different scales. Journal of Applied Statistics, 21(1-2):225–270, 1994.
[3] T. Kadir and M. Brady. Saliency, scale and image description. IJCV, 45(2):83–105, 2001.
[4] M. Jagersand. Saliency maps and attention selection in scale and spatial coordinates: An information theoretic approach. In ICCV 1995, pages 195–202. IEEE, 1995.
[5] K. Mikolajczyk et al. A comparison of affine region detectors. IJCV, 65(1-2):43–72, 2005.
[6] N.D.B. Bruce and J.K. Tsotsos. Saliency based on information maximization. NIPS 2005, pages 155–162, 2005.
[7] M. Toews and W.M. Wells. A mutual-information scale-space for image feature detection and feature-based classification of volumetric brain images. In CVPR Workshops, pages 111–116. IEEE, 2010.
[8] A. Borji, D. Sihite, and L. Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE TIP, 22(1):55–69, 2013.
[9] P. Perona, T. Shiota, and J. Malik. Anisotropic diffusion. In Geometry-Driven Diffusion in Computer Vision, pages 73–92. Springer, 1994.
[10] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising.
In CVPR 2005, volume 2, pages 60–65. IEEE, 2005.
[11] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. In ECCV 2006, pages 568–580. Springer, 2006.
[12] L.M.J. Florack, B.M. ter Haar Romeny, J.J. Koenderink, and M.A. Viergever. General intensity transformations and differential invariants. Journal of Mathematical Imaging and Vision, 4(2):171–187, 1994.
[13] F. Mokhtarian and R. Suomela. Robust image corner detection through curvature scale space. IEEE TPAMI, 20(12):1376–1381, 1998.
[14] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. NIPS 2006, 19:545, 2007.
[15] Y. Li, X. Hou, C. Koch, J.M. Rehg, and A.L. Yuille. The secrets of salient object segmentation. CVPR 2014, pages 280–287, 2014.
[16] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. CVPR 2014, pages 328–335, 2014.
[17] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. IEEE TPAMI, 34(7):1312–1328, 2012.
[18] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 20(11):1254–1259, 1998.
[19] X. Hou and L. Zhang. Dynamic visual attention: Searching for coding length increments. NIPS 2008, pages 681–688, 2009.
[20] L. Zhang, M.H. Tong, T.K. Marks, H. Shan, and G.W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7), 2008.
[21] X. Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. IEEE TPAMI, 34(1):194–201, 2012.
[22] A. Garcia-Diaz, V. Leboran, X.R. Fdez-Vidal, and X.M. Pardo. On the relationship between optical variability, visual saliency, and eye fixations: A computational approach. Journal of Vision, 12(6), 2012.
[23] R. Achanta, S. Hemami, F. Estrada, and S.
Susstrunk. Frequency-tuned salient region detection. CVPR 2009 Workshops, pages 1597–1604, 2009.
[24] M.-M. Cheng, N.J. Mitra, X. Huang, P.H.S. Torr, and S.-M. Hu. Global contrast based salient region detection. IEEE TPAMI, 37(3):569–582, 2015.
[25] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung. Saliency filters: Contrast based filtering for salient region detection. CVPR 2012, pages 733–740, 2012.
[26] R. Margolin, A. Tal, and L. Zelnik-Manor. What makes a patch distinct? CVPR 2013, pages 1139–1146, 2013.
[27] A. Andreopoulos and J.K. Tsotsos. On sensor bias in experimental methods for comparing interest-point, saliency, and recognition algorithms. IEEE TPAMI, 34(1):110–126, 2012.
[28] C.-H. Lee, A. Varshney, and D.W. Jacobs. Mesh saliency. ACM SIGGRAPH 2005, pages 659–666, 2005.
[29] N.D.B. Bruce and J.K. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3):5, 2009.
[30] D. Gao and N. Vasconcelos. Decision-theoretic saliency: Computational principles, biological plausibility, and implications for neurophysiology and psychophysics. Neural Computation, 21(1):239–271, 2009.
[31] M. Jiang et al. SALICON: Saliency in context. CVPR 2015, pages 1072–1080, 2015.
", "award": [], "sourceid": 1298, "authors": [{"given_name": "Shafin", "family_name": "Rahman", "institution": "University of Manitoba"}, {"given_name": "Neil", "family_name": "Bruce", "institution": "University of Manitoba"}]}