{"title": "Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths", "book": "Advances in Neural Information Processing Systems", "page_first": 1923, "page_last": 1931, "abstract": "Human eye movements provide a rich source of information into the human visual processing. The complex interplay between the task and the visual stimulus is believed to determine human eye movements, yet it is not fully understood. This has precluded the development of reliable dynamic eye movement prediction systems. Our work makes three contributions towards addressing this problem. First, we complement one of the largest and most challenging static computer vision datasets, VOC 2012 Actions, with human eye movement annotations collected under the task constraints of action and context recognition. Our dataset is unique among eyetracking datasets for still images in terms of its large scale (over 1 million fixations, 9157 images), task control and action from a single image emphasis. Second, we introduce models to automatically discover areas of interest (AOI) and introduce novel dynamic consistency metrics, based on them. Our method can automatically determine the number and spatial support of the AOIs, in addition to their locations. Based on such encodings, we show that, on unconstrained read-world stimuli, task instructions have significant influence on visual behavior. Finally, we leverage our large scale dataset in conjunction with powerful machine learning techniques and computer vision features, to introduce novel dynamic eye movement prediction methods which learn task-sensitive reward functions from eye movement data and efficiently integrate these rewards to plan future saccades based on inverse optimal control. We show that the propose methodology achieves state of the art scanpath modeling results.", "full_text": "Action from Still Image Dataset and Inverse Optimal\n\nControl to Learn Task Speci\ufb01c Visual Scanpaths\n\nStefan Mathe1,3 and Cristian Sminchisescu2,1\n\n1Institute of Mathematics of the Romanian Academy of Science\n\n2Department of Mathematics, Faculty of Engineering, Lund University\n\n3Department of Computer Science, University of Toronto\n\nstefan.mathe@imar.ro, cristian.sminchisescu@math.lth.se\n\nAbstract\n\nHuman eye movements provide a rich source of information into the human vi-\nsual information processing. The complex interplay between the task and the\nvisual stimulus is believed to determine human eye movements, yet it is not fully\nunderstood, making it dif\ufb01cult to develop reliable eye movement prediction sys-\ntems. Our work makes three contributions towards addressing this problem. First,\nwe complement one of the largest and most challenging static computer vision\ndatasets, VOC 2012 Actions, with human eye movement recordings collected un-\nder the primary task constraint of action recognition, as well as, separately, for\ncontext recognition, in order to analyze the impact of different tasks. Our dataset\nis unique among the eyetracking datasets of still images in terms of large scale\n(over 1 million \ufb01xations recorded in 9157 images) and different task controls. Sec-\nond, we propose Markov models to automatically discover areas of interest (AOI)\nand introduce novel sequential consistency metrics based on them. Our methods\ncan automatically determine the number, the spatial support and the transitions\nbetween AOIs, in addition to their locations. 
Based on such encodings, we quantitatively show that, given unconstrained real-world stimuli, task instructions have a significant influence on human visual search patterns, which are stable across subjects. Finally, we leverage powerful machine learning techniques and computer vision features in order to learn task-sensitive reward functions from eye movement data, within models that allow us to effectively predict human visual search patterns based on inverse optimal control. The methodology achieves state of the art scanpath modeling results.

1 Introduction

Eye movements provide a rich source of knowledge about human visual information processing and result from the complex interplay between the visual stimulus, prior knowledge of the visual world, and the task. This complexity poses a challenge to current models, which often require a complete specification of the cognitive processes involved and of the way visual input is integrated by them [4, 20]. The advent of modern eyetracking systems, powerful machine learning techniques, and visual features opens up the prospect of learning eye movement models directly from large real human eye movement datasets collected under task constraints. This trend is still in its infancy; here we aim to advance it on several fronts:

• We introduce a large scale dataset of human eye movements collected under the task constraints of both action and context recognition from a single image, for the VOC 2012 Actions dataset. The eye movement data is introduced in §3 and is publicly available at http://vision.imar.ro/eyetracking-voc-actions/.

• We present a model to automatically discover areas of interest (AOIs) from eyetracking data, in §4. The model integrates both spatial and sequential eye movement information, in order to better constrain estimates and to automatically identify the spatial support and the transitions between AOIs, in addition to their locations. We use the proposed AOI discovery tools to study inter-subject consistency and show that, on this dataset, task instructions have a significant influence on human visual attention patterns, both spatial and sequential. Our findings are presented in §5.

• We leverage the large amount of collected fixations and saccades in order to develop a novel, fully trainable, eye movement prediction model. The method combines inverse reinforcement learning and advanced computer vision descriptors in order to learn task sensitive reward functions based on human eye movements. The model has the important property of being able to efficiently predict scanpaths of arbitrary length, by integrating information over a long time horizon. This leads to significantly improved estimates. Section §6.2 gives the model and its assessment.

Figure 1: Saliency maps obtained from the gaze patterns of 12 viewers under action recognition (left image in each pair) and context recognition (right image in each pair), from a single image. Note that human gaze significantly depends on the task (see tab. 1b for quantitative results). The visualization also suggests the existence of stable, consistently fixated areas of interest (AOIs). See fig. 2 for an illustration.

2 Related Work

Human gaze pattern annotations have been collected for both static images [11, 13, 14, 12, 26, 18] and for video [19, 23, 15]; see [24] for a recent overview.
Most of the available image datasets have been collected under free viewing, and the few task controlled ones [14, 7] have been designed for small scale studies. In contrast, our dataset is both task controlled and more than one order of magnitude larger than existing image databases. This makes it well suited to machine learning techniques for saliency modeling and eye movement prediction.

The influence of task on eye movements has been investigated in early human vision studies of picture viewing [25, 3], but these groundbreaking studies have been fundamentally qualitative. Statistical properties like saccade amplitude and fixation duration have been shown to be influenced by the task [5]. A quantitative analysis of task influence on visual search, in the context of action recognition from video, appears in our prior work [19].

Human visual saliency prediction has received significant interest in computer vision (see [2] for an overview). Recently, the trend has been to learn saliency models from fixation data in images [13, 22] and video [15, 19]. The prediction of eye movements has been less studied. Predefined visual saliency measures can be used to obtain scanpaths [11] in conjunction with non-maximum suppression. Eye movements have also been modeled explicitly by maximizing the expected future information gain [20, 4] (for one step in [20], or until the goal is reached in [4]). These methods operate on pre-specified reward functions, which limits their applicability. The method we propose shares some resemblance with the latter methods, in that we also aim at maximizing the future expected reward, although our reward function is learned instead of being pre-specified, and we work in an inverse optimal control setting, which allows, in principle, an arbitrary time horizon. We are not aware of any prior eye movement models that are learned from eye movement data.

3 Action from a Single Image - New Human Eye Movement Dataset

One objective of this work is to introduce eye movement recordings for the PASCAL VOC image dataset used for action recognition. Presented in [10], it is one of the largest and most challenging available datasets of real world actions in static images. It contains 9157 images, covering 10 classes (jumping, phoning, playing instrument, reading, riding bike, riding horse, running, taking photo, using computer, walking). Several persons may appear in each image, multiple actions may be performed by the same person, and some instances belong to none of the 10 target classes.

Figure 2: Illustration of areas of interest (AOI) obtained from the scanpaths of subjects on three stimuli, for the action (left) and context (right) recognition tasks. Ellipses depict states, scaled to match the learned spatial support, whereas dotted arrows illustrate high probability saccades. Visual search patterns are highly consistent both spatially and sequentially and are strongly influenced by task. See fig. 3 and tab. 1 for quantitative results on spatial and sequential consistency.

Human subjects: We have collected data from 12 volunteers (5 male and 7 female) aged 22 to 46.

Task: We split the subjects into two groups based on the given task. The first, action group (8 subjects) was asked to recognize the actions in each image and indicate them from the labels provided by the PASCAL VOC dataset.
To assess the effects of task on visual search, we asked the members of the second, context group (4 subjects) to find which of 8 contextual elements occur in the background of each image. Two of these contextual elements (furniture, painting/wallpaper) are typical of indoor scenes, while the remaining 6 (body of water, building, car/truck, mountain/hill, road, tree) occur mostly outdoors.

Recording protocol: The recording setup is identical to the one used in [19]. Before each image was shown, participants were required to fixate a target in the center of a uniform background on the screen. We asked subjects in the action group to solve a multi-target 'detect and classify' task: press a key each time they identified a person performing an action from the given set, and also list the actions they had seen. The exposure time for this task was 3 seconds.¹ Their multiple choice answers were recorded through a set of check-boxes displayed immediately after each image exposure. Participants in the context group underwent a similar protocol, with a slightly lower exposure time of 2.5 seconds. The images were shown to each subject in a different random order.

¹ The protocol may result in multiple key presses per image. Exposure times were set empirically in a pilot study.

Dataset statistics: The dataset contains 1,085,381 fixations. The average scanpath length is 10.0 fixations for the action subjects and 9.5 for the context subjects, including the initial central fixation. The times elapsed from stimulus display until the first three key presses, averaged over the trials in which they occur, are 1, 1.6 and 1.9 seconds, respectively.

4 Automatic Discovery of Areas of Interest and Transitions using HMMs

Human fixations tend to cluster on salient regions that generally correspond to objects and object parts (fig. 1). Such areas of interest (AOI) offer an important tool for human visual pattern analysis, e.g. in evaluating inter-subject consistency [19] or the prediction quality of different saliency models. Manually specifying AOIs is both time consuming and subjective. In this section, we propose a model to automatically discover the AOI locations, their spatial support and the transitions between them, from the human scanpaths recorded for a given image. While this may appear straightforward, we are not aware of a similar model in the literature.

In deriving the model, we aim for four properties. First, we want to exploit not only human fixations, but also constraints from saccades. Consider the case of several human subjects fixating the face of a person and the book she is reading. Based on fixations alone, it can be difficult to separate the book and the person's face into two distinct AOIs, due to proximity. Nevertheless, frequent saccades between the book and the person's face provide valuable hints for hypothesizing two distinct, semantically meaningful AOIs. Second, we wish to adapt to an unknown and varying number of AOIs in different images. Third, we want to estimate not only the center of each AOI, but also its spatial support and location uncertainty. Finally, we wish to find the transition probabilities between AOIs. To meet these criteria in a visual representation, we use a statistical model.

Figure 3: (a) Spatial inter-subject consistency for the tasks of action and context recognition, with standard deviations across subjects:

consistency measure       action recognition   context recognition
agreement                 92.2% ± 1.1%         81.3% ± 1.5%
cross-stimulus control    64.0% ± 0.7%         59.1% ± 0.9%
random baseline           50.0% ± 0.0%         50.0% ± 0.0%

(b) ROC curves (detection rate vs. false alarm rate) for predicting the fixations of one subject from the fixations of the other subjects in the same group, on the same image (blue) or on an image randomly selected from the dataset (green). See tab. 1 for sequential consistency results.

Image Specific Human Gaze Model: We model the human gaze patterns in an image as a Hidden Markov Model (HMM) whose states $\{s_i\}_{i=1}^{n}$ correspond to AOIs fixated by the subjects and whose transitions correspond to saccades. The observations are the fixation coordinates $z = (x, y)$. The emission probability for AOI $i$ is a Gaussian, $p(z|s_i) = \mathcal{N}(z|\mu_i, \Sigma_i)$, where $\mu_i$ and $\Sigma_i$ model the center and the spatial extent of the area of interest $i$. In training, we are given a set of scanpaths $\{\delta_j = (z_1, z_2, \ldots, z_{t_j})\}_{j=1}^{k}$ and we find the parameters $\theta = \{\mu_i, \Sigma_i\}_{i=1}^{n}$ that maximize the joint log likelihood $\sum_{j=1}^{k} \log p(\delta_j|\theta)$, using EM [9]. We obtain AOIs, for each image and task, by training the HMM on the recorded human scanpaths. We compute the number of states $N^*$ that maximizes the leave-one-out cross validation likelihood over the scanpaths within the training set, with $N \in [1, 10]$. We then re-train the model with $N^*$ states over the entire set of scanpaths.
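
For concreteness, a minimal sketch of this model selection and fitting step is given below. It assumes the hmmlearn library as a stand-in for our EM implementation [9]; fixations[j] is assumed to hold the (t_j × 2) array of fixation coordinates of scanpath j for one image, and all names are illustrative.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def fit_aoi_hmm(fixations, max_states=10, seed=0):
        # fixations: list of (t_j, 2) arrays, one per scanpath of this image
        def fit(scanpaths, n):
            X = np.concatenate(scanpaths)                # stack all fixations
            lengths = [len(s) for s in scanpaths]        # per-scanpath lengths
            return GaussianHMM(n_components=n, covariance_type="full",
                               random_state=seed).fit(X, lengths)

        best_n, best_ll = 1, -np.inf
        for n in range(1, max_states + 1):               # N in [1, 10]
            ll = 0.0
            for j in range(len(fixations)):              # leave scanpath j out
                train = fixations[:j] + fixations[j + 1:]
                if sum(len(s) for s in train) <= n:      # too little data to fit
                    continue
                ll += fit(train, n).score(fixations[j])  # held-out log likelihood
            if ll > best_ll:
                best_n, best_ll = n, ll
        return fit(fixations, best_n)                    # refit with N* states

The AOI centers, spatial supports and saccade transition probabilities can then be read off the fitted model's means_, covars_ and transmat_ attributes.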

Results: Fig. 2 shows several HMMs trained from the fixations of subjects performing action recognition. On average, the model discovers 8.0 AOIs for action recognition and 5.6 for context recognition. The recovered AOIs are task dependent and tend to center on objects and object parts with high task relevance, like phones, books, hands or legs. Context recognition AOIs generally appear on the background and have larger spatial support, in agreement with the scale of the corresponding structures. A small subset of AOIs is common to both tasks; most of these fall on faces, an effect that has also been noted in [6]. Interestingly, some AOI transitions suggest the presence of cognitive routines aimed at establishing relevant relationships between object parts, e.g. whether a person is looking at the manipulated object (fig. 2).

The HMM allows us to visualize and analyze sequential inter-subject consistency (§5). It also allows us to evaluate the performance of eye movement prediction models (§6.2).

5 Consistency Analysis

Qualitative studies in human vision [25, 16] have advocated a high degree of agreement between the gaze patterns of humans answering questions about static stimuli and have shown that gaze patterns are highly task dependent, although such findings have not previously been confirmed by large-scale quantitative analysis. In this section, we confirm these effects on our large scale dataset for action and context recognition from a single image. We first study spatial consistency using saliency maps, then analyze sequential consistency in terms of AOI ordering under various metrics.

Spatial Consistency: We first evaluate the spatial inter-subject agreement in images.

Evaluation Protocol: To measure inter-subject agreement, we predict the regions fixated by a particular subject from a saliency map derived from the fixations of the other subjects on the same image. Samples represent image pixels, and each pixel's score is given by the empirical saliency map derived from the training subjects [14]. Labels are 1 at pixels fixated by the test subject and 0 elsewhere. As an unbiased cross-stimulus control, we check how well a subject's fixations on one stimulus can be predicted from those of the other subjects on a different, unrelated stimulus. The average precision for predicting fixations on the same stimulus is expected to be much greater than on different stimuli.

Findings: The area under the curve (AUC) measured for the two subject groups and the corresponding ROC curves are shown in fig. 3. We find good inter-subject agreement for both tasks, consistent with previously reported results for both images and video [14, 19].

Sequential Consistency using AOIs: Next, we evaluate the degree to which scanpaths agree in the order in which interesting locations are fixated. We do this as a three step process. First, we map each fixation to an AOI obtained with the HMM presented in §4, converting scanpaths into sequences of symbols. We then define two metrics for comparing scanpaths and, finally, compute inter-subject agreement under each metric in a leave-one-out fashion.

Matching fixations to AOIs: We assign a subject's fixation to an AOI if it falls within the ellipse corresponding to the AOI's spatial support (fig. 2); if no match is found, we assign the fixation a null symbol. To account for noise, we allow the spatial support to be increased by a scale factor. The dashed blue curve in fig. 4c (left) shows the fraction (AOIP) of fixations of each human subject whose 2D positions fall inside AOIs derived from the scanpaths of the other subjects, as a function of this scale factor. Throughout the rest of this section, we report results for a threshold set to twice the estimated AOI scale, which ensures a 75% fixation match rate across subjects in both task groups.

AOI based inter-subject consistency: Once each scanpath has been converted to a sequence of symbols, we define two metrics for inter-subject agreement. Given two sequences of symbols, the AOI transition (AOIT) metric is defined as the number of consecutive non-null symbol pairs (AOI transitions) that the two sequences have in common. The second metric (AOIS) is obtained by sequence alignment, as in [19], and represents the longest common subsequence of the two scanpaths. Both metrics are normalized by the length of the longer scanpath. To measure inter-subject agreement, we match the scanpath of each subject i to the scanpaths of the other subjects under the two metrics defined above. The value of the metric for the best match defines the leave-one-out agreement for subject i. We then average over all subjects.
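
In code, the two metrics reduce to a set intersection over consecutive symbol pairs and a standard longest-common-subsequence recursion. The sketch below assumes scanpaths already converted to AOI symbol sequences, with None marking unmatched (null) fixations; names are illustrative.

    def aoit(a, b):
        # fraction of consecutive non-null AOI pairs (saccade transitions)
        # shared by the two sequences, normalized by the longer scanpath
        pairs = lambda s: {(p, q) for p, q in zip(s, s[1:])
                           if p is not None and q is not None}
        return len(pairs(a) & pairs(b)) / max(len(a), len(b))

    def aois(a, b):
        # longest common subsequence of non-null AOI symbols (alignment
        # score), normalized by the longer scanpath
        m, n = len(a), len(b)
        L = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                if a[i] == b[j] and a[i] is not None:
                    L[i + 1][j + 1] = L[i][j] + 1
                else:
                    L[i + 1][j + 1] = max(L[i][j + 1], L[i + 1][j])
        return L[m][n] / max(m, n)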

Baselines: In addition to inter-subject agreement, we define three baselines. First, for cross-stimulus control, we evaluate agreement as in the case of spatial consistency, with the test and reference scanpaths corresponding to different, randomly selected images. Second, for the random baseline, we generate for each image a set of 100 random scanpaths whose fixations are uniformly distributed across the image; the average metric assigned to these scanpaths with respect to the subjects represents the baseline for sequential inter-subject agreement in the absence of bias. Third, we randomize the order of each subject's fixations in each image, while keeping their locations fixed, and compute inter-subject agreement with respect to the original scanpaths of the remaining subjects; the initial central fixation is left unchanged during randomization. This third baseline is intended to measure the amount of observed consistency due to fixation order.
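
The two randomized baselines are straightforward to generate; a sketch follows, under the assumption that a scanpath is a list of (x, y) fixations whose first entry is the central fixation.

    import random

    def random_scanpath(length, width, height):
        # random baseline: fixations drawn uniformly over the image
        return [(random.uniform(0, width), random.uniform(0, height))
                for _ in range(length)]

    def randomize_order(scanpath):
        # random-order baseline: keep the fixation locations and the
        # initial central fixation, shuffle the remaining fixations
        rest = scanpath[1:]
        random.shuffle(rest)
        return scanpath[:1] + rest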

Findings: Both metrics reveal considerable inter-subject agreement (table 1), with values significantly higher than for the cross-stimulus control and the random baselines. When each subject's fixations are randomized, the fraction of matched saccades (AOIT) drops sharply, suggesting that sequential effects account for a significant share of the overall inter-subject agreement. The AOIS metric is less sensitive to these effects, as it allows for gaps when matching AOI sequences.²

² Although harder to interpret numerically, the negative log likelihood of scanpaths under the HMMs also defines a valid sequential consistency measure. We observe the following values for the action recognition task: agreement 9.2, agreement (random order) 13.1, cross-stimulus control 25.8, random baseline 46.6.

Table 1: Sequential inter-subject consistency measured using AOIs (fig. 2), for both task groups. A large fraction of each subject's fixations falls onto AOIs derived from the scanpaths of the other subjects (AOIP). Significant inter-subject consistency exists in terms of AOI transitions (AOIT) and scanpath alignment score (AOIS).

                           action recognition                        context recognition
consistency measure        AOIP         AOIT         AOIS            AOIP         AOIT         AOIS
agreement                  79.9%±1.9%   34.0%±1.3%   39.9%±1.0%      76.4%±2.6%   35.6%±0.9%   44.9%±0.4%
agreement (random order)   79.9%±1.9%   21.8%±0.7%   31.0%±0.7%      76.4%±2.6%   23.2%±0.3%   35.5%±0.3%
cross-stimulus control     29.4%±0.8%   4.9%±0.3%    13.9%±0.3%      40.0%±2.1%   7.9%±0.5%    19.6%±0.2%
random scanpaths           15.5%±0.1%   1.5%±0.0%    2.5%±0.0%       31.9%±0.1%   4.2%±0.0%    7.6%±0.0%

Influence of Task: We next study the influence of task on human visual patterns, comparing the visual patterns of the two subject groups using saliency maps and the sequential AOI metrics.

Evaluation Protocol: For each image, we derive a saliency map from the fixations of the subjects doing action recognition and report the average p-statistic at the locations fixated by the subjects performing context recognition. We also compute agreement under the AOI-based metrics between the scanpaths of subjects performing context recognition and those of subjects from the action recognition group.

Findings: Only 44.1% of the fixations made during context recognition fall onto action recognition AOIs, with an average p-value of 0.28 with respect to the action recognition fixation distribution. Only 10% of the context recognition saccades have also been made by action recognition subjects, and the AOIS metric between the scanpaths of the two groups is 23.8%. This indicates significant differences between the subject groups in terms of their visual search patterns.

6 Task-Specific Human Gaze Prediction

In this section, we show that it is possible to effectively predict task-specific human gaze patterns, both spatially and sequentially. To achieve this, we combine the large amount of information available in our dataset with state-of-the-art visual features and machine learning techniques.

6.1 Task-Specific Human Visual Saliency Prediction

We first study the prediction of human visual saliency maps. Human fixations typically fall onto image regions that are meaningful for the visual task (fig. 2). These regions often contain objects and object parts whose identities and configurations are similar across instances of each semantic class involved, e.g. the configuration of the legs while running. We exploit this repeatability and represent each human fixation by HoG descriptors [8]. We then train a sliding window detector on human fixations and compare it with competitive approaches reported in the literature.

Evaluation Protocol: For each subject group, we obtain positive examples from fixated locations across the training portion of the dataset. Negative examples are extracted similarly, at random image locations positioned at least 3° away from all human fixations. We extract 7 HoG descriptors with different grid configurations and concatenate them, then represent the resulting descriptor using an explicit, approximate χ² kernel embedding [17]. We train a linear SVM to obtain a detector, which we run in sliding window fashion over the test set in order to predict saliency maps. We evaluate the detector under the AUC metric and the spatial KL divergence criterion presented in [19]. We use three baselines for comparison. The first two are the uniform saliency map and the central bias map (with intensity inversely proportional to the distance from the image center). As an upper bound on performance, we also compute saliency maps derived from the fixations recorded from subjects: the KL divergence score for this baseline is obtained by splitting the human subjects into two groups and computing the KL divergence between the saliency maps derived from the two groups, while the AUC metric is computed in a leave-one-out fashion, as for spatial consistency. We compare our model with two state of the art predictors. The first is the bottom-up saliency model of Itti & Koch [11]. The second is the learned saliency predictor introduced by Judd et al. [13], which integrates low and mid-level features with several high-level object detectors, such as cars and people, and is capable of optimally weighting these features given a training set of human fixations. Note that many of these objects often occur in the VOC 2012 Actions dataset.
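
A minimal sketch of this training pipeline is shown below. It uses scikit-image HoG features and scikit-learn's additive χ² embedding and linear SVM as stand-ins for our exact descriptor grids and the Chebyshev χ² approximation of [17]; images are assumed grayscale and all names are illustrative.

    import numpy as np
    from skimage.feature import hog
    from sklearn.kernel_approximation import AdditiveChi2Sampler
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    def fixation_descriptor(image, cx, cy, size=64):
        # concatenated HoG descriptors over a window centered on a fixation;
        # the paper concatenates 7 grid configurations, two are shown here
        half = size // 2
        patch = image[cy - half:cy + half, cx - half:cx + half]
        return np.concatenate([hog(patch, pixels_per_cell=(8, 8)),
                               hog(patch, pixels_per_cell=(16, 16))])

    # X: descriptors at fixated points (y = 1) and at random points far
    # from all fixations (y = 0); the chi^2 embedding linearizes the kernel
    detector = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC())
    # detector.fit(X, y); saliency maps follow by evaluating
    # detector.decision_function on descriptors in sliding window fashion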

Findings: Itti & Koch's model is not designed to predict task-specific saliency and cannot handle task influences on visual attention (fig. 4). Judd's model can adapt to some extent by adjusting its feature weights, which were trained on our dataset. Of the evaluated models, we find that the task-specific HoG detector performs best under both metrics, especially under the spatial KL divergence, which is relevant for computer vision applications [19]. Its flexibility stems from its large scale training on human fixations, from the use of general-purpose computer vision features (as opposed, e.g., to the specific object detectors used by Judd et al. [13]), and in part from the use of a powerful nonlinear kernel for which good linear approximations are available [17, 1].

6.2 Scanpath Prediction via Maximum Entropy Inverse Reinforcement Learning

We now consider the problem of eye movement prediction under specific task constraints. Models of human visual saliency can be used to generate scanpaths, e.g. [11]. However, current models are designed to predict saliency for the free-viewing condition and do not capture the focus induced by the cognitive task. Others [20, 4] hypothesize that the reward driving eye movements is the expected future information gain.

Here we take a markedly different approach. Instead of specifying the reward function, we learn it directly from large amounts of human eye movement data, by exploiting policies that operate over long time horizons. We cast the problem as Inverse Reinforcement Learning (IRL), in which we aim to recover the intrinsic reward function that induces, with high probability, the scanpaths recorded from human subjects solving a specific visual recognition task. The learned model can imitate useful saccadic strategies associated with the cognitive processes involved in complex tasks such as action recognition, while avoiding the difficulty of explicitly specifying these processes.

Figure 4: Task-specific human gaze prediction performance on the VOC 2012 Actions dataset. (a) Our trained HOG detector outperforms existing saliency models when evaluated under both the KL divergence and AUC metrics. (b-c) Learning techniques can also be used to predict eye movements under task constraints: our proposed Inverse Reinforcement Learning (IRL) model better matches observed human visual search scanpaths than two existing methods, under each of the AOI based metrics we introduce. Methods marked by '*' have been trained on our dataset.

(a) human visual saliency prediction:

                      action recognition       context recognition
feature               KL        AUC            KL        AUC
uniform baseline      12.00     0.500          11.02     0.500
central bias          9.59      0.780          8.82      0.685
human                 6.14      0.922          5.90      0.813
HOG detector*         8.54      0.736          8.10      0.646
Itti & Koch [11]      16.53     0.533          15.04     0.512
Judd et al. [13]*     11.00     0.715          9.66      0.636

(b) eye movement prediction:

                      action recognition            context recognition
feature               AOIP     AOIT     AOIS        AOIP     AOIT     AOIS
human scanpaths       79.9%    34.0%    39.9%       76.4%    35.6%    44.9%
random scanpaths      15.5%    1.5%     2.5%        31.9%    4.2%     7.6%
IRL*                  35.6%    6.6%     18.4%       44.9%    11.6%    25.7%
Renninger [20]        23.9%    2.0%     14.6%       40.3%    7.0%     24.4%
Itti & Koch [11]      28.6%    2.7%     16.8%       42.9%    7.5%     24.1%

(c) AOIP, AOIT and AOIS scores as a function of the AOI scale factor, for the inter-subject agreement, cross-stimulus, random and cross-task baselines and for the Itti & Koch, Renninger et al. and IRL predictors.

Problem Formulation: We model a scanpath $\delta$ as a sequence of states $s_t = (x_t, y_t)$ and actions $a_t = (\Delta x, \Delta y)$, where states correspond to fixations, represented by their visual angular coordinates with respect to the center of the screen, and actions model saccades, represented as displacement vectors expressed in visual degrees. We rely on a maximum entropy IRL formulation [27] to model the distribution over the set $\Delta^{(s,T)}$ of all possible scanpaths of length $T$ starting from state $s$, for a given image, as:

$$p_\theta^{(s,T)}(\delta) = \frac{1}{Z^{(T)}(s)} \cdot \exp\left[\sum_{t=1}^{T} r_\theta(s_t, a_t)\right], \quad \forall \delta \in \Delta^{(s,T)} \qquad (1)$$

where $r_\theta(s_t, a_t)$ is the reward function associated with taking the saccadic action $a_t$ while fixating at position $s_t$, $\theta$ are the model parameters, and $Z^{(T)}(s)$ is the partition function for paths of length $T$ starting in state $s$, see (3). The reward function $r_\theta(s_t, a_t) = \mathbf{f}^\top(s_t)\,\theta_{a_t}$ is the inner product between a feature vector $\mathbf{f}(s_t)$ extracted at image location $s_t$ and a vector of weights corresponding to action $a_t$. Note that reward functions in our formulation depend on the subject's action. This enables the model to encode saccadic preferences conditioned on the current observation, in addition to planning future actions by maximizing the cumulative reward along the entire scanpath, as implied by (1).

In our formulation, the goal of Maximum Entropy IRL is to find the weights $\theta$ that maximize the likelihood of the demonstrated scanpaths across all images in the dataset. For a single image and a given set of human scanpaths $E$, all starting at the image center $s_c$, the likelihood is:

$$L_\theta = \frac{1}{|E|} \sum_{\delta \in E} \log p_\theta^{(s_c,T)}(\delta) \qquad (2)$$

This maximization problem can be solved using a two step dynamic programming formulation. In the backward step, we compute the state and state-action partition functions for each possible state $s$ and action $a$, and for each scanpath length $i = 1, \ldots, T$:

$$Z_\theta^{(i)}(s) = \sum_{\delta \in \Delta^{(s,i)}} \exp\left[\sum_{t=1}^{i} r_\theta(s_t, a_t)\right], \qquad Z_\theta^{(i)}(s, a) = \sum_{\substack{\delta \in \Delta^{(s,i)} \\ \text{s.t. } a_1 = a}} \exp\left[\sum_{t=1}^{i} r_\theta(s_t, a_t)\right] \qquad (3)$$

Both quantities satisfy the recursions $Z_\theta^{(i)}(s, a) = \exp[r_\theta(s, a)]\, Z_\theta^{(i-1)}(s')$, where $s'$ is the fixation reached from $s$ by saccade $a$, and $Z_\theta^{(i)}(s) = \sum_a Z_\theta^{(i)}(s, a)$, with $Z_\theta^{(0)}(s) = 1$, which makes the backward step efficient.

The optimal policy $\pi_\theta^{(i)}$ at the $i$-th fixation is:

$$\pi_\theta^{(i)}(a|s) = Z_\theta^{(T-i+1)}(s, a)\,/\,Z_\theta^{(T-i+1)}(s) \qquad (4)$$

This policy induces the maximum entropy distribution $p_\theta^{(s_c,T)}$ over scanpaths for the image and is used in the forward step to efficiently compute the expected mean feature count for each action $a$, which is $\hat{\mathbf{f}}_\theta^a = \mathbb{E}_{\delta \sim p_\theta^{(s_c,T)}}\left[\sum_{t=1}^{T} \mathbf{f}(s_t) \cdot I[a_t = a]\right]$, where $I[\cdot]$ is the indicator function.
The\na, which is \u02c6f a\ngradient of the likelihood function (2) with respect to the parameters \u03b8a is:\n\nt=1 f (st) \u00b7 I [at = a]\n\n\u03b8 = E\n\n\u03b4\u223cp(sc ,T )\n\n\u03b8\n\n\u03b8\n\n(cid:104)(cid:80)T\n\n(cid:105)\n\n(cid:80)\n\n(cid:80)\n\n\u2202L\u03b8\n\u2202\u03b8a\n\n= \u02dcf a \u2212 \u02c6f a\n\n\u03b8\n\n(5)\n\n\u03b4\u2208E\n\nt f (st) \u00b7 I [at = a] is the empirical feature count along training scanpaths.\nwhere \u02dcf a = 1|E|\nEqs. (1)\u2013(5) are de\ufb01ned for a given input image. The likelihood and its gradient over the training\nset are obtained by summing up the corresponding quantities. In our formulation policies encode\nthe image speci\ufb01c strategy of the observer, based on a task speci\ufb01c reward function that is learned\nacross all images. We thus learn two different IRL models, for action and context analysis. Note\nthat we restrict ourselves to scanpaths of length T starting from the center of the screen and do not\nprede\ufb01ne goal states. We validate T to the average scanpath length in the dataset.\nExperimental Procedure: We use a \ufb01ne grid with 0.25o stepsize for the state space. The space of all\npossible saccades on this grid is too large to be practical (\u2248 105). We obtain a reduced vocabulary\nof 1, 000 actions by clustering saccades in the training set, using k-means. We then encode all\nscanpaths in this discrete (state,action) space, with an average positional error of 0.47o. We extract\nHoG features at each grid point and augment them with the output of our saliency detector. We\noptimize the weight vector \u03b8 in the IRL framework and use a BFGS solver for fast convergence.\nFindings: A trained MaxEnt IRL eye movement predictor performs better than the bottom up models\nof Itti&Koch[11] and Renninger et al.[20] (\ufb01g. 4bc). The model is particularly powerful for predict-\ning saccades (see the AOIT metric), as it can match more than twice the number of AOI transitions\ngenerated by bottom up models for the action recognition task. It also outperforms the other models\nunder the AOIP and AOIS metrics. Note that the latter only captures the overall ranking among\nAOIs as de\ufb01ned by the order in which these are \ufb01xated. A gap still remains to human performance,\nunderlining the dif\ufb01culty of predicting eye movements in real world images and for complex tasks\nsuch as action recognition. For context recognition, prediction scores are generally closer to the\nhuman baseline. This is, at least in part, facilitated by the often larger size of background structures\nas compared to the humans or the manipulated objects involved in actions (\ufb01g. 2).\n\n7 Conclusions\n\nWe have collected a large set of eye movement recordings for VOC 2012 Actions, one of the most\nchallenging datasets for action recognition in still images. Our data is obtained under the task\nconstraints of action and context recognition and is made publicly available. We have leveraged this\nlarge amount of data (1 million human \ufb01xations) in order to develop Hidden Markov Models that\nallow us to determine \ufb01xated AOI locations, their spatial support and the transitions between them\nautomatically from eyetracking data. This technique has made possible to develop novel evaluation\nmetrics and to perform quantitative analysis regarding inter-subject consistency and the in\ufb02uence of\ntask on eye movements. 

7 Conclusions

We have collected a large set of eye movement recordings for VOC 2012 Actions, one of the most challenging datasets for action recognition in still images. Our data is obtained under the task constraints of action and context recognition and is made publicly available. We have leveraged this large amount of data (over 1 million human fixations) to develop Hidden Markov Models that automatically determine fixated AOI locations, their spatial support and the transitions between them from eyetracking data. This technique has made it possible to develop novel evaluation metrics and to perform quantitative analysis of inter-subject consistency and of the influence of task on eye movements. The results reveal that, given real world unconstrained image stimuli, the task has a significant influence on the observed eye movements, both spatially and sequentially; at the same time, such patterns are stable across subjects.

We have also introduced a novel eye movement prediction model that combines state-of-the-art reinforcement learning techniques with advanced computer vision operators to learn task-specific human visual search patterns. To our knowledge, the method is the first to learn eye movement models from human eyetracking data. When measured under various evaluation metrics, the model shows superior performance to existing bottom-up eye movement predictors. To close the human performance gap, better image features and more complex joint state and action spaces, within reinforcement learning schemes, will be explored in future work.

Acknowledgments: Work supported in part by CNCS-UEFISCDI under CT-ERC-2012-1.

References

[1] E. Bazavan, F. Li, and C. Sminchisescu. Fourier kernel learning. In European Conference on Computer Vision, 2012.
[2] A. Borji and L. Itti. State-of-the-art in visual attention modelling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 2011.
[3] G. T. Buswell. How People Look at Pictures: A Study of the Psychology of Perception in Art. University of Chicago Press, 1935.
[4] N. J. Butko and J. R. Movellan. Infomax control of eye movements. IEEE Transactions on Autonomous Mental Development, 2:91-107, 2010.
[5] M. S. Castelhano, M. L. Mack, and J. M. Henderson. Viewing task influences eye movement control during active scene perception. Journal of Vision, 9, 2008.
[6] M. Cerf, E. P. Frady, and C. Koch. Faces and text attract gaze independent of the task: Experimental data and computer model. Journal of Vision, 9, 2009.
[7] M. Cerf, J. Harel, W. Einhauser, and C. Koch. Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems, 2007.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE International Conference on Computer Vision and Pattern Recognition, 2005.
[9] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1977.
[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[11] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 2000.
[12] T. Judd, F. Durand, and A. Torralba. Fixations on low resolution images. In IEEE International Conference on Computer Vision, 2009.
[13] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In IEEE International Conference on Computer Vision, 2009.
[14] K. A. Ehinger, B. Hidalgo-Sotelo, A. Torralba, and A. Oliva. Modeling search for people in 900 scenes: A combined source model of eye guidance. Visual Cognition, 17, 2009.
[15] W. Kienzle, B. Scholkopf, F. Wichmann, and M. Franz. How to find interesting locations in video: a spatiotemporal interest point detector learned from human eye movements. In DAGM, 2007.
[16] M. F. Land and B. W. Tatler. Looking and Acting. Oxford University Press, 2009.
[17] F. Li, G. Lebanon, and C. Sminchisescu. Chebyshev approximations to the histogram χ² kernel. In IEEE International Conference on Computer Vision and Pattern Recognition, 2012.
[18] E. Marinoiu, D. Papava, and C. Sminchisescu. Pictorial human spaces: How well do humans perceive a 3D articulated pose? In IEEE International Conference on Computer Vision, 2013.
[19] S. Mathe and C. Sminchisescu. Dynamic eye movement datasets and learnt saliency models for visual action recognition. In European Conference on Computer Vision, 2012.
[20] L. W. Renninger, J. Coughlan, P. Verghese, and J. Malik. An information maximization model of eye movements. In Advances in Neural Information Processing Systems, pages 1121-1128, 2004.
[21] R. Subramanian, H. Katti, N. Sebe, M. Kankanhalli, and T.-S. Chua. An eye fixation database for saliency detection in images. In European Conference on Computer Vision, 2010.
[22] A. Torralba, A. Oliva, M. Castelhano, and J. Henderson. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113, 2006.
[23] E. Vig, M. Dorr, and D. D. Cox. Space-variant descriptor sampling for action recognition based on saliency and eye movements. In European Conference on Computer Vision, 2012.
[24] S. Winkler and R. Subramanian. Overview of eye tracking datasets. In International Workshop on Quality of Multimedia Experience, 2013.
[25] A. Yarbus. Eye Movements and Vision. Plenum Press, New York, 1967.
[26] K. Yun, Y. Peng, D. Samaras, G. J. Zelinsky, and T. L. Berg. Studying relationships between human gaze, description and computer vision. In IEEE International Conference on Computer Vision and Pattern Recognition, 2013.
[27] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.
", "award": [], "sourceid": 982, "authors": [{"given_name": "Stefan", "family_name": "Mathe", "institution": "University of Toronto"}, {"given_name": "Cristian", "family_name": "Sminchisescu", "institution": "LTH"}]}