{"title": "Variational Laws of Visual Attention for Dynamic Scenes", "book": "Advances in Neural Information Processing Systems", "page_first": 3823, "page_last": 3832, "abstract": "Computational models of visual attention are at the crossroad of disciplines like cognitive science, computational neuroscience, and computer vision. This paper proposes a model of attentional scanpath that is based on the principle that there are foundational laws that drive the emergence of visual attention. We devise variational laws of the eye-movement that rely on a generalized view of the Least Action Principle in physics. The potential energy captures details as well as peripheral visual features, while the kinetic energy corresponds with the classic interpretation in analytic mechanics. In addition, the Lagrangian contains a brightness invariance term, which characterizes significantly the scanpath trajectories. We obtain differential equations of visual attention as the stationary point of the generalized action, and we propose an algorithm to estimate the model parameters. Finally, we report experimental results to validate the model in tasks of saliency detection.", "full_text": "Variational Laws of\n\nVisual Attention for Dynamic Scenes\n\nDario Zanca\n\nDINFO, University of Florence\n\nDIISM, University of Siena\ndario.zanca@unifi.it\n\nMarco Gori\n\nDIISM, University of Siena\nmarco@diism.unisi.it\n\nAbstract\n\nComputational models of visual attention are at the crossroad of disciplines like\ncognitive science, computational neuroscience, and computer vision. This paper\nproposes a model of attentional scanpath that is based on the principle that there\nare foundational laws that drive the emergence of visual attention. We devise varia-\ntional laws of the eye-movement that rely on a generalized view of the Least Action\nPrinciple in physics. 
The potential energy captures details as well as peripheral visual features, while the kinetic energy corresponds with the classic interpretation in analytic mechanics. In addition, the Lagrangian contains a brightness invariance term, which significantly characterizes the scanpath trajectories. We obtain differential equations of visual attention as the stationary point of the generalized action, and we propose an algorithm to estimate the model parameters. Finally, we report experimental results to validate the model in tasks of saliency detection.\n\n1 Introduction\n\nEye movements in humans constitute an essential mechanism to disentangle the tremendous amount of information that reaches the retina every second. This mechanism in adults is very sophisticated. In fact, it involves both bottom-up processes, which depend on raw input features, and top-down processes, which include task-dependent strategies [2; 3; 4]. It turns out that visual attention is interwound with high-level cognitive processes, so that its deep understanding seems to be trapped in a sort of chicken-and-egg dilemma. Does visual scene interpretation drive visual attention, or the other way around? Which one “was born” first? Interestingly, this dilemma seems to disappear in newborns: despite their lack of knowledge of the world, they exhibit mechanisms of attention to extract relevant information from what they see [5]. Moreover, there is evidence that the very first fixations are highly correlated among adult subjects who are presented with a new input [25]. This shows that they still share a common mechanism that drives early fixations, while scanpaths diverge later under top-down influences.\nMany attempts have been made in the direction of modeling visual attention. 
Based on the feature integration theory of attention [14], Koch and Ullman in [9] assume that human attention operates on the early representation, which is basically a set of feature maps. They assume that these maps are then combined in a central representation, namely the saliency map, which drives the attention mechanisms. The first complete implementation of this scheme was proposed by Itti et al. in [10]. In that paper, feature maps for color, intensity and orientation are extracted at different scales. Then center-surround differences and normalization are computed for each pixel. Finally, all this information is combined linearly in a centralized saliency map. Several other models have been proposed by the computer vision community, in particular to address the problem of refining the estimation of saliency maps. They usually differ in the definition of saliency, while they postulate a centralized control of the attention mechanism through the saliency map. For instance, it has been claimed that the attention is driven according to a principle of information maximization [16] or by an opportune selection of surprising regions [17]. A detailed description of the state of the art is given in [8].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nMachine learning approaches have been used to learn models of saliency. Judd et al. [1] collected 1003 images observed by 15 subjects and trained an SVM classifier with low-, middle-, and high-level features. More recently, automatic feature extraction methods with convolutional neural networks achieved top-level performance on saliency estimation [26; 18].\nMost of the referred papers share the idea that saliency is the product of a global computation. Some authors also provide scanpaths of image exploration, but to simulate them over the image, they all use the procedure defined by [9]. 
The winner-take-all algorithm is used to select the most salient location for the first fixation. Then three rules are introduced to select the next location: inhibition-of-return, similarity preference, and proximity preference. An attempt at introducing biological biases has been made by [6] to achieve more realistic saccades and improve performance.\nIn this paper, we present a novel paradigm in which visual attention emerges from a few unifying functional principles. In particular, we assume that attention is driven by the curiosity for regions with many details, and by the need to achieve brightness invariance, which leads to fixation and motion tracking. These principles are given a mathematical expression by a variational approach based on a generalization of least action, whose stationary point leads to the corresponding Euler-Lagrange differential equations of the focus of attention. The theory herein proposed offers an intriguing model for capturing the mechanisms behind saccadic eye movements, as well as object tracking, within the same framework. In order to compare our results with the state of the art in the literature, we have also computed the saliency map by counting the visits to each pixel over a given time window, both on static and dynamic scenes. It is worth mentioning that, while many papers rely on models that are purposely designed to optimize the approximation of the saliency map, for the proposed approach such a computation is obtained as a byproduct of a model of scanpath.\nThe paper is organized as follows. Section 2 provides a mathematical description of the model and the Euler-Lagrange equations of motion that describe attention dynamics. The technical details, including the formal derivation of the motion equations, are postponed to the Appendix. 
In Section 3 we describe the experimental setup and show the performance of the model in a task of saliency detection on two popular datasets of images [12; 11] and one dataset of videos [27]. Some conclusions and a critical analysis are finally drawn in Section 4.\n\n2 The model\n\nIn this section, we propose a model of visual attention that takes place in the earliest stage of vision, which we assume to be completely data driven. We begin by discussing the driving principles.\n\n2.1 Principles of visual attention\n\nThe brightness signal b(t, x) can be thought of as a real-valued function\n\nb : R⁺ × R² → R    (1)\n\nwhere t is the time and x = (x1, x2) denotes the position. The scanpath over the visual input is defined as\n\nx : R⁺ → R²    (2)\n\nThe scanpath x(t) will also be referred to as trajectory or observation. Three fundamental principles drive the model of attention. They lead to the introduction of the corresponding terms of the Lagrangian of the action.\n\ni) Boundedness of the trajectory\n\nThe trajectory x(t) is bounded within a defined area (the retina). This is modeled by a harmonic oscillator at the borders of the image which constrains the motion within the retina1:\n\nV(x) = k Σ_{i=1,2} ((l_i − x_i)² · [x_i > l_i] + x_i² · [x_i < 0])    (3)\n\nwhere k is the elastic constant and l_i is the i-th dimension of the rectangle which represents the retina2.\n\n1 Here, we use Iverson’s notation, according to which, if p is a proposition, then [p] = 1 if p is true and [p] = 0 otherwise.\n\nii) Curiosity driven principle\n\nVisual attention is attracted by regions with many details, that is, where the magnitude of the gradient of the brightness is high. In addition to this local field, the role of peripheral information is included by processing a blurred version p(t, x) of the brightness b(t, x). 
The modulation of these two terms is given by\n\nC(t, x) = b_x² cos²(ωt) + p_x² sin²(ωt),    (4)\n\nwhere b_x and p_x denote the gradients w.r.t. x. Notice that the alternation of the local and peripheral fields has a fundamental role in avoiding trapping into regions with too many details.\n\niii) Brightness invariance\n\nTrajectories that exhibit brightness invariance are motivated by the need to perform fixation. Formally, we impose the constraint ḃ = b_t + b_x ẋ = 0. This is in fact the classic constraint that is widely used in computer vision for the estimation of the optical flow [20]. Its soft satisfaction can be expressed by the associated term\n\nB(t, x, ẋ) = (b_t + b_x ẋ)².    (5)\n\nNotice that, in the case of static images, b_t = 0, and the term is fully satisfied for a trajectory x(t) whose velocity ẋ is perpendicular to the gradient, i.e. when the focus is on the borders of the objects. This kind of behavior favors coherent fixation of objects. Interestingly, in the case of static images, the model can conveniently be simplified by using the following upper bound of the brightness invariance term:\n\nB(t, x, ẋ) = ḃ²(t, x) = (b_t + b_x ẋ)² ≤ 2b_t² + 2b_x² ẋ² =: B̄(t, x, ẋ)    (6)\n\nThis inequality comes from the parallelogram law of Hilbert spaces. 
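As a concrete illustration, the curiosity field of Eq. (4) can be sketched with NumPy/SciPy on a single static frame. This is only a sketch, not the authors' code: the function name, the value of ω, and the blur scale of the peripheral input p are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def curiosity_term(b, t, omega=0.1, sigma=10.0):
    """Sketch of C(t, x) of Eq. (4): alternation of the local detail field
    (squared gradient magnitude of b) and the peripheral field (squared
    gradient magnitude of a blurred version p of b).
    `omega` and `sigma` are illustrative values, not taken from the paper."""
    p = gaussian_filter(b, sigma=sigma)   # peripheral (blurred) input p(t, x)
    bx1, bx2 = np.gradient(b)             # gradient of the brightness b
    px1, px2 = np.gradient(p)             # gradient of the peripheral input p
    b2 = bx1 ** 2 + bx2 ** 2              # |b_x|^2
    p2 = px1 ** 2 + px2 ** 2              # |p_x|^2
    # Eq. (4): C = |b_x|^2 cos^2(wt) + |p_x|^2 sin^2(wt)
    C = b2 * np.cos(omega * t) ** 2 + p2 * np.sin(omega * t) ** 2
    return C, (bx1, bx2)
```

At t = 0 the field reduces to the purely local term |b_x|², while at ωt = π/2 only the peripheral term survives, which is the alternation mechanism described above.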
As will be seen in the rest of the paper, this approximation significantly simplifies the motion equations.\n\n2.2 Least Action Principle\n\nVisual attention scanpaths are modeled as the motion of a particle of mass m within a potential field. This makes it possible to construct the generalized action\n\nS = ∫₀ᵀ L(t, x, ẋ) dt    (7)\n\nwhere L = K − U, K being the kinetic energy\n\nK(ẋ) = ½ m ẋ²    (8)\n\nand U a generalized potential energy defined as\n\nU(t, x, ẋ) = V(x) − ηC(t, x) + λB(t, x, ẋ).    (9)\n\nHere, we assume that η, λ > 0. Notice, in passing, that while V and B get the usual sign of potentials, C comes with a flipped sign. This is due to the fact that, whenever it is large, it generates an attractive field. In addition, we notice that the brightness invariance term is not truly a potential, since it depends on both the position and the velocity. However, its generalized interpretation as a “potential” comes from considering that it generates a force field. In order to discover the trajectory, we look for a stationary point of the action in Eq. (7), which corresponds to the Euler-Lagrange equations\n\nd/dt ∂L/∂ẋ_i = ∂L/∂x_i,    (10)\n\nwhere i = 1, 2 for the two motion coordinates. The right-hand term in (10) can be written as\n\n∂L/∂x = ηC_x − V_x − λB_x.    (11)\n\nLikewise, we have\n\nd/dt ∂L/∂ẋ = mẍ − λ d/dt B_ẋ,    (12)\n\nso that the general motion equation turns out to be\n\nmẍ − λ d/dt B_ẋ + V_x − ηC_x + λB_x = 0.    (13)\n\nThese are the general equations of visual attention. In the Appendix we give the technical details of the derivations.\n\n2 A straightforward extension can be given for a circular retina.\n\n
Throughout the paper, the proposed model is referred to as the EYe MOvement Laws (EYMOL).\n\n2.3 Parameter estimation with simulated annealing\n\nDifferent choices of parameters lead to different behaviors of the system. In particular, the weights can emphasize the contribution of the curiosity or of the brightness invariance terms. To better control the system, we use two different parameters for the curiosity term, namely η_b and η_p, to weight the b and p contributions respectively. The best values for the three parameters (η_b, η_p, λ) are estimated using the simulated annealing (SA) algorithm. This method performs iterative improvements, starting from a known state i. At each step, the SA considers some neighbouring state j of the current state, and probabilistically moves to the new state j or stays in the current state i. For our specific problem, we limit the search to a parallelepiped-shaped domain D of possible values, due to theoretical bounds and numerical3 issues. The distance between states i and j is proportional to a temperature T, which is initialized to 1 and decreases over time as T_k = α · T_{k−1}, where k identifies the iteration step and 0 ≪ α < 1. The iteration step is repeated until the system reaches a state that is good enough for the application, which in our case is to maximize the NSS similarity between human saliency maps and simulated saliency maps.\nOnly a batch of 100 images from CAT2000-TRAIN is used to perform the SA algorithm4. This batch is created by randomly selecting 5 images from each of the 20 categories of the dataset. To start the SA, parameters are initialized at the middle point of the 3-dimensional parameter domain D. 
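As an illustration, the annealing loop just described (given as pseudo-code in Algorithm 1) can be sketched in Python. The neighbor-generation step size and the exact acceptance rule P() are assumed details, not the authors' choices; score() stands for the mean NSS over the image batch.

```python
import math, random

def simulated_annealing(score, init, domain, alpha=0.95, t_min=0.01):
    """Sketch of Algorithm 1: anneal a parameter vector (eta_b, eta_p, lambda)
    inside the box `domain` = [(lo, hi), ...] so as to maximize `score`."""
    i, T = list(init), 1.0
    while T >= t_min:
        # neighbor of i: random step proportional to the temperature T,
        # clipped to the search domain D
        j = [min(hi, max(lo, v + T * (hi - lo) * random.uniform(-0.5, 0.5)))
             for v, (lo, hi) in zip(i, domain)]
        # Metropolis-style acceptance (assumed form of P): always accept
        # improvements, accept worsenings with probability exp(delta / T)
        delta = score(j) - score(i)
        if delta >= 0 or math.exp(delta / T) >= random.random():
            i = j
        T *= alpha  # geometric cooling schedule T_k = alpha * T_{k-1}
    return i
```

With α = 0.95 and the stopping threshold T ≥ 0.01 of Algorithm 1, the loop performs roughly 90 iterations.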
The process is repeated 5 times, on different sub-samples, to select 5 parameter configurations. Finally, those configurations, together with the average configuration, are tested on the whole dataset to select the best one.\n\nAlgorithm 1 In the pseudo-code, P() is the acceptance probability and score() is computed as the average of NSS scores on the sample batch of 100 images.\n1: procedure SIMULATEDANNEALING\n2:     Select an initial state i ∈ D\n3:     T ← 1\n4:     do\n5:         Generate random state j, neighbor of i\n6:         if P(score(i), score(j)) ≥ Random(0, 1) then\n7:             i ← j\n8:         end if\n9:         T ← α · T\n10:    while T ≥ 0.01\n11: end procedure\n\n3 Too high values of η_b or η_p produce numerically unstable and unrealistic trajectories for the focus of attention.\n4 Each step of the SA algorithm needs an evaluation over all the selected images. Considering the whole dataset would be very expensive in terms of time.\n\nTable 1: Results on MIT1003 [1] and CAT2000-TRAIN [11] of the two different versions of EYMOL. The standard error is indicated between brackets.\n\nModel version | MIT1003 AUC | MIT1003 NSS | CAT2000-TRAIN AUC | CAT2000-TRAIN NSS\nV1 (approx. br. inv.) | 0.7996 (0.0002) | 1.2784 (0.0003) | 0.8393 (0.0001) | 1.8208 (0.0015)\nV2 (exact br. inv.) | 0.7990 (0.0003) | 1.2865 (0.0039) | 0.8376 (0.0013) | 1.8103 (0.0137)\n\n3 Experiments\n\nTo quantitatively evaluate how well our model predicts human fixations, we defined an experimental setup for saliency detection both in images and in video. We used images from MIT1003 [1], MIT300 [12] and CAT2000 [11], and videos from the SFU [27] eye-tracking database. Many of the design choices were common to both experiments; when they differ, it is explicitly specified.\n\n3.1 Input pre-processing\n\nAll input images are converted to gray-scale. Peripheral input p is implemented as a blurred version of the brightness b. 
This blurred version is obtained by convolving the original gray-scale image with a Gaussian kernel. For the images only, an algorithm identifies the rectangular zone of the input image in which the totality of the information is contained, in order to compute l_i in (14). Finally, both b and p are multiplied by a Gaussian blob centered in the middle of the frame, in order to make brightness gradients smaller as we move toward the periphery and to produce a center bias.\n\n3.2 Saliency maps computation\n\nDifferently from many of the most popular methodologies in the state of the art [10; 16; 1; 24; 18], the saliency map is not itself the central component of our model, but it can be naturally calculated from the visual attention laws in (13). The output of the model is a trajectory determined by a system of two second-order differential equations, provided with a set of initial conditions. Since the numerical integration of (13) does not raise big numerical difficulties, we used standard functions of the Python scientific library SciPy [21].\nThe saliency map is then calculated by summing up the most visited locations during a sufficiently large number of virtual observations. For images, we collected data by running the model 199 times; each run was randomly initialized almost at the center of the image and with a small random velocity, and integrated for a running time corresponding to 1 second of visual exploration. For videos, we collected data by running the model 100 times; each run was initialized almost at the center of the first frame of the clip and with a small random velocity.\nModels that have some blur and center bias on the saliency map can improve their score with respect to some metrics. 
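A minimal sketch of this procedure is given below: the motion equation is integrated many times from near-center initial conditions and pixel visits are accumulated into a map. As an assumption for illustration only, the brightness invariance term is dropped (λ = 0), so the force field reduces to the curiosity gradient ηC_x plus the boundary term of Eq. (3); all parameter values are illustrative.

```python
import numpy as np
from scipy.integrate import odeint
from scipy.ndimage import gaussian_filter

def saliency_from_scanpaths(b, n_runs=20, t_max=1.0, m=1.0, eta=1.0, k=5.0):
    """Sketch: integrate a simplified version of Eq. (13) (lambda = 0, an
    assumption) and count pixel visits; blurred counts give the saliency map."""
    h, w = b.shape
    # curiosity field C = |grad b|^2 (static frame), smoothed for stability
    C = gaussian_filter(np.sum(np.square(np.gradient(b)), axis=0), 3)
    Cx1, Cx2 = np.gradient(C)

    def force(x):
        i = int(np.clip(x[0], 0, h - 1)); j = int(np.clip(x[1], 0, w - 1))
        # harmonic restoring force at the borders, as in Eq. (3)
        v1 = -k * (x[0] - (h - 1)) * (x[0] > h - 1) - k * x[0] * (x[0] < 0)
        v2 = -k * (x[1] - (w - 1)) * (x[1] > w - 1) - k * x[1] * (x[1] < 0)
        return np.array([eta * Cx1[i, j] + v1, eta * Cx2[i, j] + v2]) / m

    def rhs(state, t):
        x, v = state[:2], state[2:]
        return np.concatenate([v, force(x)])   # m x'' = eta C_x - V_x

    sal = np.zeros_like(b)
    rng = np.random.default_rng(0)
    for _ in range(n_runs):
        # start almost at the center, with a small random velocity
        x0 = [h / 2 + rng.normal(), w / 2 + rng.normal(),
              rng.normal(), rng.normal()]
        traj = odeint(rhs, x0, np.linspace(0.0, t_max, 100))
        for p in traj:                         # accumulate visited pixels
            i = int(np.clip(p[0], 0, h - 1)); j = int(np.clip(p[1], 0, w - 1))
            sal[i, j] += 1
    return gaussian_filter(sal, 3)             # blurred visit counts
```

The final Gaussian blur plays the role of the blur discussed above, which several metrics reward.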
A grid search over the blur radius and the center parameter σ has been used, in order to maximize the AUC-Judd and NSS scores on the training data of CAT2000 in the case of images, and on SFU in the case of videos.\n\n3.3 Saliency detection on images\n\nTwo versions of the model have been evaluated: the first version, V1, implements brightness invariance in the approximated form (6); the second version, V2, implements brightness invariance in its exact form, as described in the Appendix. Models V1 and V2 have been compared on the MIT1003 and CAT2000-TRAIN datasets, since they provide public data about fixations. Parameter estimation has been conducted independently for the two models, and the best configuration of each one is used in this comparison. Results are statistically equivalent (see Table 1), which proves that, in the case of static images, the approximation is very good and does not cause a loss in the score. For further experiments we decided to use the approximated form V1, due to the simpler form of its equations, which also reduces the computation time.\nModel V1 has been evaluated on two different datasets of eye-tracking data: MIT300 and CAT2000-TEST. In this case, scores were officially provided by the MIT Saliency Benchmark Team [15]. A description of the metrics used is provided in [13]. Table 2 and Table 3 show the scores of our model compared with five other popular methods [10; 16; 1; 24; 18], which have been selected to be representative of different approaches. Despite its simplicity, our model reaches the best score in half of the cases and for different metrics.\n\nTable 2: Results on MIT300 [12] provided by the MIT Saliency Benchmark Team [15]. The models are sorted chronologically. In bold, the best results for each metric and benchmark.\n\nModel | AUC | SIM | EMD | CC | NSS | KL\nItti-Koch [10], implem. by [19] | 0.75 | 0.44 | 4.26 | 0.37 | 0.97 | 1.03\nAIM [16] | 0.77 | 0.40 | 4.73 | 0.31 | 0.79 | 1.18\nJudd Model [1] | 0.81 | 0.42 | 4.45 | 0.47 | 1.18 | 1.12\nAWS [24] | 0.74 | 0.43 | 4.62 | 0.37 | 1.01 | 1.07\neDN [18] | 0.82 | 0.44 | 4.56 | 0.45 | 1.14 | 1.14\nEYMOL | 0.77 | 0.46 | 3.64 | 0.43 | 1.06 | 1.53\n\nTable 3: Results on CAT2000 [11] provided by the MIT Saliency Benchmark Team [15]. The models are sorted chronologically. In bold, the best results for each metric and benchmark.\n\nModel | AUC | SIM | EMD | CC | NSS | KL\nItti-Koch [10], implem. by [19] | 0.77 | 0.48 | 3.44 | 0.42 | 1.06 | 0.92\nAIM [16] | 0.76 | 0.44 | 3.69 | 0.36 | 0.89 | 1.13\nJudd Model [1] | 0.84 | 0.46 | 3.60 | 0.54 | 1.30 | 0.94\nAWS [24] | 0.76 | 0.49 | 3.36 | 0.42 | 1.09 | 0.94\neDN [18] | 0.85 | 0.52 | 2.64 | 0.54 | 1.30 | 0.97\nEYMOL | 0.83 | 0.61 | 1.91 | 0.72 | 1.78 | 1.67\n\n3.4 Saliency detection on dynamic scenes\n\nWe evaluated our model in a task of saliency detection on the SFU dataset [27]. The dataset contains 12 clips and the fixations of 15 observers, each of whom watched every video twice. Table 4 provides a comparison with four other models. Also in this case, despite its simplicity, and even though it was not designed for this specific task, our model competes well with state-of-the-art models. Our model can easily be run in real time to produce an attentive scanpath. In some favorable cases, it shows evidence of tracking moving objects in the scene.\n\nTable 4: Results on the video dataset SFU [27]. Scores are calculated as the mean of the AUC and NSS metrics over all frames of each clip, and then averaged over the 12 clips.\n\nModel | Mean AUC | Mean NSS\nEYMOL | 0.817 | 1.015\nItti-Koch [10] | 0.70 | 0.28\nSurprise [17] | 0.66 | 0.48\nJudd Model [1] | 0.77 | 1.06\nHEVC [28] | 0.83 | 1.41\n\n4 Conclusions\n\nIn this paper we investigated how human attention mechanisms emerge in the early stage of vision, which we assume to be completely data-driven. The proposed model consists of differential equations, which provide a real-time model of the scanpath. 
These equations are derived in a generalized framework of least action, which nicely resembles related derivations of laws in physics. A remarkable novelty concerns the unified interpretation of the curiosity-driven movements and of the brightness invariance term for fixation and tracking, which are regarded as mechanisms that jointly contribute to optimize the acquisition of visual information. Experimental results on both image and video datasets of saliency are very promising, especially if we consider that the proposed theory offers a true model of eye movements, whereas the computation of the saliency maps only arises as a byproduct.\nIn future work, we intend to investigate behavioural data, not only in terms of saliency maps, but also by comparing the actually generated scanpaths with human data, in order to discover temporal correlations. We aim at integrating the presented model with a theory of feature extraction that is itself expressed in terms of variational-based laws of learning [29].\n\nAppendix: Euler-Lagrange equations\n\nIn this section we explicitly compute the differential laws that describe the visual attention scanpath, as the Euler-Lagrange equations of the action functional (7).\nFirst, we compute the partial derivatives of the different contributions w.r.t. x, in order to compute the exact contributions to (11). For the retina boundaries,\n\nV_x = k Σ_{i=1,2} (−2(l_i − x_i) · [x_i > l_i] + 2x_i · [x_i < 0])    (14)\n\nFor the curiosity term (4),\n\nC_x = 2 cos²(ωt) b_x · b_xx + 2 sin²(ωt) p_x · p_xx    (15)\n\nFor the term of brightness invariance,\n\nB_x = ∂/∂x (b_t + b_x ẋ)²    (16)\n= 2 (b_t + b_x ẋ)(b_tx + b_xx ẋ)    (17)\n\nSince we assume b ∈ C²(t, x), by Schwarz’s theorem5 we have that b_tx = b_xt, so that\n\nB_x = 2 (b_t + b_x ẋ)(b_xt + b_xx ẋ)    (18)\n= 2 (ḃ)(ḃ_x)    (19)\n\nWe proceed by computing the contribution in (12). The derivative w.r.t. ẋ of the brightness invariance term is\n\nB_ẋ = ∂/∂ẋ (b_t + b_x ẋ)²    (20)\n= 2 (b_t + b_x ẋ) b_x    (21)\n= 2 (ḃ)(b_x)    (22)\n\nso that its total derivative w.r.t. t can be written as\n\nd/dt B_ẋ = 2 (b̈ b_x + ḃ ḃ_x)    (23)\n\nWe observe that b̈ ≡ b̈(t, x, ẋ, ẍ) is the only term which depends on the second derivatives of x. Since we are interested in expressing the Euler-Lagrange equations in an explicit form for the variable ẍ, we explore its contribution more closely:\n\nb̈(t, x, ẋ, ẍ) = d/dt ḃ    (24)\n= d/dt (b_t + b_x ẋ)    (25)\n= ḃ_t + ḃ_x · ẋ + b_x · ẍ    (26)\n\nSubstituting it in (23), we have\n\nd/dt B_ẋ = 2 ((ḃ_t + ḃ_x · ẋ + b_x · ẍ) b_x + ḃ ḃ_x)    (27)\n= 2 ((ḃ_t + ḃ_x · ẋ) b_x + ḃ ḃ_x) + 2 (b_x · ẍ) b_x    (28)\n\nSo that, from (12), we get\n\nd/dt ∂L/∂ẋ = mẍ − λ d/dt B_ẋ    (29)\n= mẍ − 2λ ((ḃ_t + ḃ_x · ẋ) b_x + ḃ ḃ_x + (b_x · ẍ) b_x)    (30)\n\n5 Schwarz’s theorem states that, if f : Rⁿ → R has continuous second partial derivatives at any given point in Rⁿ, then ∀ i, j ∈ {1, ..., n} it holds that f_{x_i x_j} = f_{x_j x_i}.\n\nEuler-Lagrange equations. 
Combining (11) and (30), we get the Euler-Lagrange equations of attention:\n\nmẍ − 2λ ((ḃ_t + ḃ_x · ẋ) b_x + ḃ ḃ_x + (b_x · ẍ) b_x) = ηC_x − V_x − λB_x    (31)\n\nIn order to obtain an explicit form for the variable ẍ, we rewrite the equation so as to move to the left-hand side all the contributions which do not depend on that variable:\n\nmẍ − 2λ (b_x · ẍ) b_x = ηC_x − V_x − λB_x + 2λ ((ḃ_t + ḃ_x · ẋ) b_x + ḃ ḃ_x) =: A = (A1, A2)    (32)\n\nIn matrix form, the equation is\n\n(mẍ1, mẍ2) − (2λ(b_x1 ẍ1 + b_x2 ẍ2) b_x1, 2λ(b_x1 ẍ1 + b_x2 ẍ2) b_x2) = (A1, A2)    (33)\n\nwhich gives us the system of two differential equations\n\nmẍ1 − 2λ(b_x1 ẍ1 + b_x2 ẍ2) b_x1 = A1\nmẍ2 − 2λ(b_x1 ẍ1 + b_x2 ẍ2) b_x2 = A2    (34)\n\nGrouping by the same variable,\n\n(m − 2λb_x1²) ẍ1 − 2λ(b_x1 b_x2) ẍ2 = A1\n−2λ(b_x1 b_x2) ẍ1 + (m − 2λb_x2²) ẍ2 = A2    (35)\n\nWe define\n\nD1 = det [ A1, −2λ(b_x1 b_x2) ; A2, m − 2λb_x2² ],  D2 = det [ m − 2λb_x1², A1 ; −2λ(b_x1 b_x2), A2 ]    (36)\n\nand\n\nD = det [ m − 2λb_x1², −2λ(b_x1 b_x2) ; −2λ(b_x1 b_x2), m − 2λb_x2² ]    (37)\n\nBy Cramer’s method we get the differential equations of visual attention for the two spatial components, i.e.\n\nẍ1 = D1/D,  ẍ2 = D2/D    (38)\n\nNotice that this raises a further condition on the parameter λ. In particular, in the case in which the values of b(t, x) are normalized in the range [0, 1], it imposes the choice\n\nD ≠ 0 ⟹ λ < m/4    (39)\n\nIn fact,\n\nD = (m − 2λb_x1²)(m − 2λb_x2²) − 4λ²(b_x1 b_x2)² = m (m − 2λ(b_x1² + b_x2²))    (40)\n\nFor values of b_x = 0, we have that\n\nD = m² > 0    (41)\n\nso that, ∀t, we must impose\n\nD > 0.    (42)\n\nIf λ > 0, then\n\nm − 2λ(b_x1² + b_x2²) > 0    (43)\n\nλ < m / (2(b_x1² + b_x2²))    (44)\n\nSince b(t, x) ∈ [0, 1] implies that each gradient component is bounded by 1, so that b_x1² + b_x2² ≤ 2, the quantity on the right reaches its minimum at m/4; hence the condition\n\n0 < λ < m/4    (45)\n\nguarantees the well-posedness of the problem.\n\nReferences\n\n[1] Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to Predict Where Humans Look. IEEE International Conference on Computer Vision (2009)\n\n[2] Itti, L., Koch, C.: Computational modelling of visual attention. Nature Reviews Neuroscience, vol 3, n 3, pp 194–203 (2001)\n\n[3] Connor, C.E., Egeth, H.E., Yantis, S.: Visual Attention: Bottom-Up Versus Top-Down. Current Biology, vol 14, n 19, pp R850–R852 (2004)\n\n[4] McMains, S., Kastner, S.: Interactions of Top-Down and Bottom-Up Mechanisms in Human Visual Cortex. The Journal of Neuroscience, vol 31, n 2, pp 587–597 (2011)\n\n[5] Hainline, L., Turkel, J., Abramov, I., Lemerise, E., Harris, C.M.: Characteristics of saccades in human infants. Vision Research, vol 24, n 12, pp 1771–1780 (1984)\n\n[6] Le Meur, O., Liu, Z.: Saccadic model of eye movements for free-viewing condition. Vision Research, vol 116, pp 152–164 (2015)\n\n[7] Gelfand, I.M., Fomin, S.V.: Calculus of Variations. 
Englewood Cliffs: Prentice Hall (1993)\n\n[8] Borji, A., Itti, L.: State-of-the-Art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 35, n 1 (2013)\n\n[9] Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, vol 4, n 4, pp 219–227 (1985)\n\n[10] Itti, L., Koch, C., Niebur, E.: A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 20, n 11 (1998)\n\n[11] Borji, A., Itti, L.: CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research. arXiv:1505.03581 (2015)\n\n[12] Judd, T., Durand, F., Torralba, A.: A Benchmark of Computational Models of Saliency to Predict Human Fixations. MIT Technical Report (2012)\n\n[13] Bylinskii, Z., Judd, T., Oliva, A., Torralba, A.: What do different evaluation metrics tell us about saliency models? arXiv:1604.03605 (2016)\n\n[14] Treisman, A.M., Gelade, G.: A Feature-Integration Theory of Attention. Cognitive Psychology, vol 12, pp 97–136 (1980)\n\n[15] Bylinskii, Z., Judd, T., Borji, A., Itti, L., Durand, F., Torralba, A.: MIT Saliency Benchmark. http://saliency.mit.edu/\n\n[16] Bruce, N., Tsotsos, J.: Attention based on information maximization. Journal of Vision, vol 7, n 9 (2007)\n\n[17] Itti, L., Baldi, P.: Bayesian Surprise Attracts Human Attention. Vision Research, vol 49, n 10, pp 1295–1306 (2009)\n\n[18] Vig, E., Dorr, M., Cox, D.: Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images. IEEE Conference on Computer Vision and Pattern Recognition (2014)\n\n[19] Harel, J.: A Saliency Implementation in MATLAB. http://www.klab.caltech.edu/~harel/share/gbvs.php\n\n[20] Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence, vol 17, n 1, pp 185–203. 
(1981)\n\n[21] Jones, E., Oliphant, T., Peterson, P.: SciPy: Open source scientific tools for Python. http://www.scipy.org/ (2001)\n\n[22] Zhang, J., Sclaroff, S.: Saliency detection: a Boolean map approach. Proc. of the IEEE International Conference on Computer Vision (2013)\n\n[23] Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model. arXiv:1611.09571 (2016)\n\n[24] Garcia-Diaz, A., Leborán, V., Fdez-Vidal, X.R., Pardo, X.M.: On the relationship between optical variability, visual saliency, and eye fixations. Journal of Vision, vol 12, n 6, p 17 (2012)\n\n[25] Tatler, B.W., Baddeley, R.J., Gilchrist, I.D.: Visual correlates of fixation selection: Effects of scale and time. Vision Research, vol 45, n 5, pp 643–659 (2005)\n\n[26] Kruthiventi, S.S.S., Ayush, K., Venkatesh, R.: DeepFix: A Fully Convolutional Neural Network for Predicting Human Eye Fixations. arXiv:1510.02927 (2015)\n\n[27] Hadizadeh, H., Enriquez, M.J., Bajic, I.V.: Eye-Tracking Database for a Set of Standard Video Sequences. IEEE Transactions on Image Processing (2012)\n\n[28] Xu, M., Jiang, L., Ye, Z., Wang, Z.: Learning to Detect Video Saliency With HEVC Features. IEEE Transactions on Image Processing (2017)\n\n[29] Maggini, M., Rossi, A.: On-line Learning on Temporal Manifolds. AI*IA 2016 Advances in Artificial Intelligence, Springer International Publishing, pp 321–333 (2016)", "award": [], "sourceid": 2096, "authors": [{"given_name": "Dario", "family_name": "Zanca", "institution": "University of Florence, University of Siena"}, {"given_name": "Marco", "family_name": "Gori", "institution": "University of Siena"}]}