{"title": "Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model", "book": "Advances in Neural Information Processing Systems", "page_first": 1955, "page_last": 1963, "abstract": "Multiple object tracking is a task commonly used to investigate the architecture of human visual attention. Human participants show a distinctive pattern of successes and failures in tracking experiments that is often attributed to limits on an object system, a tracking module, or other specialized cognitive structures. Here we use a computational analysis of the task of object tracking to ask which human failures arise from cognitive limitations and which are consequences of inevitable perceptual uncertainty in the tracking task. We find that many human performance phenomena, measured through novel behavioral experiments, are naturally produced by the operation of our ideal observer model (a Rao-Blackwelized particle filter). The tradeoff between the speed and number of objects being tracked, however, can only arise from the allocation of a flexible cognitive resource, which can be formalized as either memory or attention.", "full_text": "Explaining human multiple object tracking as\nresource-constrained approximate inference in a\n\ndynamic probabilistic model\n\nEdward Vul, Michael C. Frank, and Joshua B. Tenenbaum\n\nDepartment of Brain and Cognitive Sciences\n\nMassachusetts Institute of Technology\n{evul, mcfrank, jbt}@mit.edu\n\nCambridge, MA 02138\n\nGeorge Alvarez\n\nDepartment of Psychology\n\nHarvard University\n\nCambridge, MA 02138\n\nalvarez@wjh.harvard.edu\n\nAbstract\n\nMultiple object tracking is a task commonly used to investigate the architecture\nof human visual attention. Human participants show a distinctive pattern of suc-\ncesses and failures in tracking experiments that is often attributed to limits on an\nobject system, a tracking module, or other specialized cognitive structures. 
Here we use a computational analysis of the task of object tracking to ask which human failures arise from cognitive limitations and which are consequences of inevitable perceptual uncertainty in the tracking task. We find that many human performance phenomena, measured through novel behavioral experiments, are naturally produced by the operation of our ideal observer model (a Rao-Blackwellized particle filter). The tradeoff between the speed and number of objects being tracked, however, can only arise from the allocation of a flexible cognitive resource, which can be formalized as either memory or attention.

1 Introduction

Since William James first described the phenomenology of attention [11], psychologists have been struggling to specify the cognitive architecture of this process, how it is limited, and how it helps information processing. The study of visual attention specifically has benefited from rich, simple paradigms, and of these multiple object tracking (MOT) [16] has recently gained substantial popularity. In a typical MOT task (Figure 1), subjects see a number of objects, typically colorless circles, moving onscreen. Some subset of the objects is marked as targets before the trial begins, but during the trial all objects turn a uniform color and move haphazardly for several seconds. The task is to keep track of which objects were marked as targets at the start of the trial so that they can be identified at the end of the trial when the objects stop moving.

The pattern of results from MOT experiments is complicated. Participants can only track a finite number of objects [16], but more objects can be tracked when they move more slowly [1], suggesting a limit on attentional speed. If objects are spaced far apart in the visual field, however, they can be tracked at high speeds, suggesting that spatial crowding also limits tracking [9].
When tracking, participants seem to maintain information about the velocity of objects [19], and this information is sometimes helpful in tracking [8]. More often, however, velocity is not used to track, suggesting limitations on the kinds of information available to the tracking system [13]. Finally, although participants can track objects using features like color and orientation [3], some features seem to hurt tracking [15], and tracking is primarily considered a spatial phenomenon. These results and others have left researchers puzzled: What limits tracking performance?

Figure 1: Left: A typical multiple object tracking experiment. Right: The generative model underlying our probabilistic tracker (see text for details).

Proposed limitations on MOT performance may be characterized along the dimensions of discreteness and flexibility. A proposal positing a fixed number of slots (each holding one object) describes a discrete limitation, while proposals positing limits on attention speed or resolution are more continuous. Attention and working memory are canonical examples of flexible limitations: based on the task, we decide where to attend and what to remember. Such cognitive limitations may be framed either as a discrete number of slots or as a continuous resource. In contrast, visual acuity and noise in velocity perception are low-level, task-independent limitations: regardless of the task we are doing, the resolution of our retina is limited and our motion-discrimination thresholds are stable. Such perceptual limitations tend only to be continuous.

We aim to determine which MOT effects reflect perceptual, task-independent uncertainty, and which reflect flexible, cognitive limitations. Our approach is to describe the minimal computations that an ideal observer must undertake to track objects and combine available information.
To the extent that an effect is not naturally explained at the computational level given only perceptual sources of uncertainty, it is more likely to reflect flexible cognitive limitations.

We propose that humans track objects in a manner consistent with the Bayesian multi-target tracking framework common in computer vision [10, 18]. We implement a variant of this tracking model using Rao-Blackwellized particle filtering and show how it can be easily adapted to a wide range of MOT experiments. This unifying model allows us to design novel experiments that interpolate between seemingly disparate phenomena. We argue that, since the effects of speed, spacing, and features arise naturally in an ideal observer with no limits on attention, memory, or number of objects that can be tracked, these phenomena can be explained by optimal object tracking given low-level, perceptual sources of uncertainty. We identify a subset of MOT phenomena that must reflect flexible cognitive resources, however: effects of manipulating the number of objects that are tracked. To account for tradeoffs between object speed and number, a task-dependent resource constraint must be added to our model. This constraint can be interpreted as either limited attentional resolution or limited short-term memory.

2 Optimal multiple object tracking

To track objects in a typical MOT experiment (Figure 1), at each point in time the observer must determine which of many observed objects corresponds to which of the objects that were present in the display in the last frame. Here we formalize this procedure using a classical tracking algorithm from computer vision [10, 18].

2.1 Dynamics

Object tracking requires some assumptions about how objects evolve over time.
Since there is no consensus on how to generate object tracking displays in the visual attention literature, we will assume simple linear dynamics, which can approximate prior experimental manipulations. Specifically, we assume that the true state of the world S_t contains information about each object i being tracked: to start we consider objects defined by position (x_t(i)) and velocity (v_t(i)), but we will later consider tracking objects through more complicated feature spaces. Although we refer to position and velocity, the state actually contains two position and two velocity dimensions: one of each for x and y.

S_t evolves according to linear dynamics with noise. Position and velocity for x and y evolve independently according to an Ornstein-Uhlenbeck (mean-reverting) process, which can be thought of as Brownian motion on a spring, and can be most clearly spelled out as a series of equations:

    x_t = x_{t-1} + v_t,
    v_t = λ v_{t-1} - k x_{t-1} + w_t,
    w_t ~ N(0, σ_w)                                  (1)

where x and v are the position and velocity at time t; λ is an inertia parameter constrained to be between 0 and 1; k is a spring constant which produces the mean-reverting properties of the dynamics; and w_t is random acceleration noise added at each time point, distributed as a zero-mean Gaussian with standard deviation σ_w.

In two dimensions, this stochastic process describes a randomly moving cloud of objects; the spring constant ensures that the objects will not drift off to infinity, and the inertia parameter (which acts as friction) ensures that they will not accelerate to infinity. Within the range of parameters we consider, this process converges to a stable distribution of positions and velocities, both of which will be normally distributed around zero.
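These dynamics are easy to simulate. The sketch below (parameter values are illustrative, not taken from the paper) steps Eq. (1) in one dimension and checks the empirical spread of the cloud against stationary standard deviations obtained by requiring that the variances of x and v and their covariance be unchanged by one update step:

```python
import numpy as np

def simulate_ou(lam, k, sigma_w, n_steps=200_000, seed=0):
    """Step the 1-D tracking dynamics of Eq. (1):
    v_t = lam*v_{t-1} - k*x_{t-1} + w_t,  x_t = x_{t-1} + v_t."""
    rng = np.random.default_rng(seed)
    x = v = 0.0
    xs = np.empty(n_steps)
    vs = np.empty(n_steps)
    for t in range(n_steps):
        v = lam * v - k * x + rng.normal(0.0, sigma_w)
        x = x + v  # position is updated with the *new* velocity, as in Eq. (1)
        xs[t] = x
        vs[t] = v
    return xs, vs

# Illustrative parameters: inertia, spring constant, acceleration noise.
lam, k, sigma_w = 0.8, 0.05, 0.1
xs, vs = simulate_ou(lam, k, sigma_w)

# Stationary standard deviations, from solving the stationarity conditions
# of Eq. (1) directly (variances and covariance fixed under one update).
denom = (1 - lam) * (2 + 2 * lam - k)
sigma_x = np.sqrt((1 + lam) * sigma_w**2 / (k * denom))
sigma_v = np.sqrt(2 * sigma_w**2 / denom)
print(sigma_x, np.std(xs))  # empirical spread matches the closed form
print(sigma_v, np.std(vs))
```

The update matrix has determinant λ, so for 0 < λ < 1 (and small k) the process is stable and the simulated cloud settles into the predicted Gaussian.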
We can solve for the standard deviations of position (σ_x) and velocity (σ_v) by requiring that the variances of x and v and their covariance be unchanged by an update step; this yields

    σ_x = sqrt( (1 + λ) σ_w² / ( k (1 - λ) (2 + 2λ - k) ) ),  and
    σ_v = sqrt( 2 σ_w² / ( (1 - λ) (2 + 2λ - k) ) ),          (2)

respectively. Because these terms are familiar in the human multiple object tracking literature, for the rest of this paper we will describe the dynamics in terms of the spatial extent of the cloud of moving dots (σ_x), the standard deviation of the velocity distribution (σ_v), and the inertia parameter (λ), solving for k and σ_w to generate dynamics and track objects.

2.2 Probabilistic model

The goal of an object tracking model is to track the set of n objects in S over a fixed period from t_0 to t_m. For our model, we assume observations (m_t) at each time t are noisy measurements of the true state of the world at that time (S_t). In other words, our tracking model is a stripped-down simplification of tracking models commonly used in computer vision because we do not track from noisy images but, instead, from extracted position and velocity estimates. The observer must estimate S_t based on the current and previous measurements, thus obtaining Ŝ_t. However, this task is complicated by the fact that the observer obtains an unlabeled bag of observations (m_t) and does not know which observations correspond to which objects in the previous state estimate Ŝ_{t-1}.
Thus, the observer must not only estimate S_t, but must also determine the data assignment of observations to objects, which can be described by a permutation vector γ_t.

Since we assume independent linear dynamics for each individual object, conditioned on γ we can track each individual object with a Kalman filter. That is, what is a series of unlabeled bags of observations when data assignments are unknown becomes, once conditioned on the data assignment, a set of individuated time series, one per object, in which each point in time contains only a single observation. The Kalman filter is updated via transition matrix A, according to S_t = A S_{t-1} + W_t, where the state perturbations W are distributed with covariance Q (A and Q can be derived straightforwardly from the dynamics in Eq. 1; see Supplementary Materials).

Inference about both the state estimate and the data assignment can proceed by predicting the current location of each object, which will be a multivariate normal distribution with predicted state mean Ŝ_{t|t-1} = A Ŝ_{t-1} and predicted estimate covariance G_{t|t-1} = A G_{t-1} A' + Q. From these predictions, we can define the probability of a particular data assignment permutation vector as:

    P(γ_t | S_t, G_t, M_t) = ∏_i P(γ_t(i) | Ŝ_{t|t-1}(i), G_{t|t-1}(i), M_t(i)),  where
    P(γ_t(i) | Ŝ_{t|t-1}(i), G_{t|t-1}(i), M_t(i)) = N( m_t(γ_t(i)); Ŝ_{t|t-1}(i), G_{t|t-1}(i) + M_t(γ_t(i)) )    (3)

where the probability of a particular γ value is determined by the Gaussian probability density, and M_t(j) is the covariance of measurement noise for m_t(j). Because an observation can arise from only one object, mutual exclusivity is built into this conditional probability distribution. This complication makes analytical solutions impossible, and the data assignment vector γ must be sampled.
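To make Eq. (3) concrete, here is a minimal sketch with hypothetical numbers: two objects with 1-D positions only. It scores every permutation γ by the product of Gaussian predictive densities, samples one assignment, and then applies the standard Kalman measurement update conditioned on it. The full model uses position-velocity states per object; nothing changes structurally.

```python
import itertools
import math
import random

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Predicted means and covariances for two objects (hypothetical values),
# i.e. S_hat_{t|t-1}(i) and G_{t|t-1}(i), plus measurement noise M_t(j).
pred_mean = [0.0, 3.0]
pred_var = [0.5, 0.5]
meas_var = [0.1, 0.1]
obs = [0.2, 2.7]  # the unlabeled bag of observations m_t

# Score each permutation gamma by Eq. (3): product over objects i of
# N(m_t(gamma(i)); S_hat(i), G(i) + M(gamma(i))). Mutual exclusivity holds
# because gamma ranges over permutations only.
perms = list(itertools.permutations(range(len(obs))))
weights = []
for gamma in perms:
    w = 1.0
    for i, j in enumerate(gamma):
        w *= gauss_pdf(obs[j], pred_mean[i], pred_var[i] + meas_var[j])
    weights.append(w)
total = sum(weights)
probs = [w / total for w in weights]

# Sample one assignment, then run the Kalman measurement update per object.
gamma = random.choices(perms, weights=probs, k=1)[0]
post_mean, post_var = [], []
for i, j in enumerate(gamma):
    K = pred_var[i] / (pred_var[i] + meas_var[j])  # Kalman gain
    post_mean.append(pred_mean[i] + K * (obs[j] - pred_mean[i]))
    post_var.append((1 - K) * pred_var[i])
```

With these numbers the identity assignment dominates, since each observation lies near one predicted mean; when predictions overlap, the permutation posterior flattens and swap errors become likely.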
However, given an estimate of γ, an estimate of the current state of each object is given by the Kalman state update ([12]; see Supplementary Materials).

2.3 Inference

To infer the state of the tracking model described above, we must sample the data-association vector γ; the rest of the tracking may then proceed analytically. Thus, we implement a Rao-Blackwellized particle filter [6]: each particle corresponds to one sampled γ vector and contains the analytically computed state estimates for each of the objects, conditioned on that sampled γ vector. Taken together, the particles used for tracking (in our case 50, but see Section 3.4 for discussion) approximate the joint probability distribution over γ and S.

In practice, we sample γ with the following iterative procedure. First, we sample each component of γ independently of all other components (as in PMHT [18]). Then, if the resulting γ vector contains conflicts that violate the mutual exclusivity of data assignments, a subset of γ is resampled. If this resampling procedure fails to produce an assignment vector that satisfies mutual exclusivity, we compute the combinatoric expansion of the permutations of the conflicted subset of γ and sample assignments within that subset from the combinatoric space. This procedure is very fast when tracking is easy, but can slow down when tracking is hard and the combinatoric expansion is necessary.

2.4 Perceptual uncertainty

In order to determine the limits on optimal tracking in our model, we must know what information human observers have access to. We assume that observers know the summary statistics of the cloud of moving dots (their spatial extent, given by σ_x, and their velocity distribution, σ_v). We also start with the assumption that they know the inertia parameter (λ; this assumption will be questioned in Section 3.2).
Given a perfect measurement of σ_x, σ_v, and λ, observers will thus know the dynamics by which the objects evolve.

We must also specify the low-level, task-independent noise for human observers. We assume that noise in observing the positions of objects (σ_mx) is given by previously published eccentricity-scaling parameters, σ_mx(x) = c(1 + 0.42x) (from [5]), where c is the uncertainty in position. We use c = 0.08 (standard deviation in degrees visual angle) throughout this paper. We also assume that observations of speed are corrupted by Weber-scaled noise with some irreducible uncertainty (a): σ_mv(v) = a + bv, setting a = 0.01 and b = 0.05 (b is the Weber fraction as measured in [17]).

3 Results

3.1 Tracking through space

When objects move faster, tracking them is harder [1], suggesting to researchers that an attentional speed limit may constrain tracking. However, when objects cover a wider area of space (when they move on a whole-field display), they can be tracked more easily at a given speed, suggesting that crowding rather than speed is the limiting factor [9].

Both of these effects are predicted by our model: both the speed and the spatial separation of objects alter the uncertainty inherent in the tracking task. When objects move faster (greater σ_v), predictions about where objects will be on the next time step have greater uncertainty: the covariance of the predicted state (G_{t|t-1}) has greater entropy, and inference about which observation arose from which object (γ) is less certain and more prone to errors. Additionally, even at a given speed and inertia, when the spatial extent (σ_x) is smaller, objects are closer together. Even given a fixed uncertainty about where in space an object will end up, the odds of another object appearing there are greater, again limiting our ability to infer γ.
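The two measurement-noise assumptions of Section 2.4 are one-liners; the sketch below simply encodes them with the stated constants (c = 0.08, a = 0.01, b = 0.05), with eccentricity x in degrees of visual angle:

```python
C = 0.08           # position uncertainty (deg visual angle) at zero eccentricity
A, B = 0.01, 0.05  # irreducible speed noise and Weber fraction

def sigma_mx(x):
    """Eccentricity-scaled position measurement noise: sigma_mx(x) = c(1 + 0.42x)."""
    return C * (1 + 0.42 * x)

def sigma_mv(v):
    """Weber-scaled speed measurement noise: sigma_mv(v) = a + b*v."""
    return A + B * v

# Noise grows with eccentricity and with speed:
print(sigma_mx(0.0), sigma_mx(10.0))  # 0.08 at fixation vs 0.416 at 10 deg
print(sigma_mv(0.0), sigma_mv(2.0))   # floor of 0.01 vs 0.11 at speed 2
```

In the model these values enter Eq. (3) as the measurement-noise covariances M_t, so faster, more eccentric objects yield broader predictive densities and more assignment errors.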
Thus, both increasing velocity variance and decreasing spatial variance make tracking harder, and to achieve a particular level of performance the two must trade off.

Figure 2: Top: Stimuli and data from [9]: when objects are tracked over the whole visual field, they can move at greater speed while achieving a particular level of accuracy. Bottom-Left: Our own experimental data, in which subjects set a "comfortable" spacing for tracking 3 of 6 objects at a particular speed. Bottom-Middle: Model accuracy for tracking 3 of 6 objects as a function of speed and spacing. Bottom-Right: Model "settings": (85% accuracy) threshold spacing for a particular speed. See text for details.

We show the speed-space tradeoff in both people and our ideal tracking model. We asked 10 human observers to track 3 of 6 objects moving according to the dynamics described earlier. Their goal was to adjust the difficulty of the tracking task so that they could track the objects for 5 seconds. We told them that sometimes tracking would be too hard and sometimes too easy, and that they could adjust the difficulty by hitting one button to make the task easier and another to make it harder.¹ Making the task easier or harder amounted to moving the objects farther apart or closer together by adjusting σ_x of the dynamics, while the speed (σ_v) stayed constant. We parametrically varied σ_v between 0.01 and 0.4, and could thus obtain an iso-difficulty curve of people's settings of σ_x as a function of σ_v (Figure 2).

To elicit predictions from our model on this task, we simulated 5-second trials in which the model had to track 3 of 6 objects, and measured accuracy across 15 spacing intervals (σ_x between 0.5 and 4.0 degrees visual angle) crossed with 11 speeds (σ_v between 0.01 and 0.4).
At each point in this speed-space grid, we simulated 250 trials to measure the model's mean tracking accuracy. The resulting accuracy surface is shown in Figure 2; an apparent tradeoff can be seen: when objects move faster, they must be farther apart to achieve the same level of accuracy as when they move more slowly.

To make the model generate thresholds of σ_x for a particular σ_v, as our human subjects did, we fit psychometric functions to each cross-section through the accuracy surface, and used the psychometric function to predict settings that would achieve a particular level of accuracy (one such psychometric function is shown in red on the surface in Figure 2).² The plot in Figure 2 shows the model setting for the 0.85 accuracy mark; the upper and lower error bounds represent the settings that achieve accuracies of 0.8 and 0.9, respectively (in subsequent plots we show only the 0.85 threshold for simplicity). As in the human performance, there is a continuous tradeoff: when objects are faster, spacing must be wider to achieve the same level of difficulty.

¹The correlation of this method with participants' objective tracking performance was validated by [1].
²We used the Weibull cumulative distribution function as our psychometric function, p = 1 - exp(-(x/x_crit)^s), where x is the stimulus dimension that covaries positively with performance (either σ_x or 1/σ_v), x_crit is the location parameter, and s is the scale, or slope, parameter. We obtained the MAP estimate of both parameters of the Weibull function, and predicted the model's 85% threshold (blue plane in Figure 2) from the inverse of the psychometric function: x = x_crit (-ln(1 - p))^(1/s).

Figure 3: Left: Human speed-space tradeoff settings do not vary for different physical inertias.
Middle panels: This is the case for the ideal model with no knowledge of inertia, but not for the ideal model with perfect knowledge of inertia. Right: This may be because it is safer to assume a lower inertia: tracking is worse if inertia is assumed to be higher than it is (red) than vice versa (green).

3.2 Inertia

It is disputed whether human observers use velocity to track [13]. Nonetheless, it is clear that adults, and even babies, know something about object velocity [19]. The model we propose can reconcile these conflicting findings.

In our model, knowing object velocity means having an accurate σ_v term for the object: an estimate of how much distance it might cover in a particular time step. Using velocity trajectories to make predictions about future states also requires that people know the inertia term. Thus, the degree to which trajectories are used to track is a question about the inertia parameter (λ) that best matches human performance. Thus far we have assumed that people know λ perfectly and use it to predict future states, but this need not be the case. Indeed, while the two other parameters of the dynamics, the spatial extent (σ_x) and velocity distribution (σ_v), may be estimated quickly and efficiently from a brief observation of the tracking display, inertia is more difficult to estimate. Thus, observers may be more uncertain about the inertia, and may be more likely to guess it incorrectly. (Under our model, a guess of λ = 0 corresponds to tracking without any velocity information.)

We ran an experiment to assess what inertia parameter best fits human observers. We asked subjects to set iso-difficulty contours as a function of the underlying inertia parameter (λ), using the same difficulty-setting procedure described earlier.
An ideal observer who knows the inertia perfectly will benefit greatly from displays with high inertia, in which uncertainty will be low, and will be able to track with the same level of accuracy at greater speeds given a particular spacing. However, if inertia is incorrectly assumed to be zero, high- and low-inertia iso-difficulty contours will be quite similar (Figure 3). We find that in human observers, iso-difficulty contours for λ = 0.7, λ = 0.8, and λ = 0.9 are remarkably similar, consistent with observers assuming a single, low inertia term.

Although these results corroborate previous findings that human observers do not seem to use trajectories to track, there is evidence that sometimes people do use trajectories. These variations in observers' assumptions about inertia may be attributable to two factors. First, most MOT experiments include rather sudden changes in velocity, from objects bouncing off the walls or simply as a function of their underlying dynamics. Second, under uncertainty about the inertia underlying a particular display, an observer is better off underestimating than overestimating it. Figure 3 shows the decrement in performance as a function of the mismatch between the observer's assumed inertia and that of the tracking display.

3.3 Tracking through feature space

In addition to tracking through space, observers can also track objects through feature domains. For example, experimental participants can track two spatially superimposed gratings based on their slowly varying colors, orientations, or spatial frequencies [3].

We can modify our model to track in feature space by adding new dimensions corresponding to the features being tracked. Linear feature dimensions like the log of spatial frequency can be treated exactly like position and velocity.
Circular features like hue angle and orientation require a slight modification: we pre-process the state estimates and observations via the modulus to preserve their circular relationship under the linear Kalman update. With this modification, the linear Kalman state update can operate on circular variables, and our basic tracking model can track colored objects with a high level of accuracy when they are superimposed (σ_x = σ_v = 0, Figure 4).

Figure 4: Left: When object color drifts more slowly over time (lower σ_c), people can track objects more effectively. Right: Our tracking model does so as well (observation noise for color, σ_mc, was set to 0.02π in the model).

We additionally tested the novel prediction from our model that human observers can combine the information available from space and features for tracking. Nine human observers made iso-difficulty settings as described above; however, this time each object had a color, and we varied the color drift rate (σ_c) on hue angle. Figure 4 shows subjects' settings of σ_x as a function of σ_v and σ_c. When color changes slowly, observers can track objects in a smaller space at a given velocity. Figure 4 also shows that the pattern of thresholds from the model in the same task matches that of the experimental participants. Thus, not only can human observers track objects in feature space, they can combine both spatial location and featural information, and additional information in the feature domain allows people to track successfully with less spatial information, as argued by [7].

3.4 Cognitive limitations

Thus far we have shown that many human failures in multiple object tracking do not reflect cognitive limitations on tracking, but are instead a consequence of the structure of the task and the limits on available perceptual information.
However, a limit on the number of objects that may be tracked [16] cannot be accounted for in this way. Observers can more easily track 4 of 16 objects at a higher speed than 8 of 16 objects (Figure 5), even though the stimulus presentation is identical in both cases [1]. Thus, this limitation must be a consequence of uncertainty that can be modulated by the task: a flexible resource [2].

Within our model, there are two plausible candidates for what such a limited resource may be: visual attention, which improves the fidelity of measurements, or memory, which enables more or less noiseless propagation of state estimates through time.³ In both cases, when more objects are tracked, less of the resource is available for each object, resulting in an increase of noise and uncertainty. At a superficial level, both memory and attention resources amount to a limited amount of gain to be used to reduce noise. Given the linear Kalman filtering computation we have proposed as underlying tracking, equal-magnitude noise in either will have the same effects. Thus, to avoid the complexities inherent in allocating attention to space, we will consider memory limitations, but this resource limitation can be thought of as "attention gain" as well (though some of our work suggests that memory may be the more appropriate interpretation).

We must decide on a linking function between the covariance U of the memory noise and the number of objects tracked. It is natural to propose that the covariance scales positively with the number of objects tracked; that is, U for n objects would be U_n = U_1 n. This expression captures the idea that task-modulated noise should follow a σ ∝ √n rule, as would be the case if the state for a given object were stored or measured with a finite number of samples.
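This linking function is simple enough to state in code. A sketch with hypothetical numbers: with a fixed budget of c samples shared across n tracked objects, each object's state is carried by c/n samples, so the noise covariance scales as U_n = U_1·n and the noise standard deviation as √n.

```python
import math

U1 = 0.2      # memory-noise covariance when tracking a single object (hypothetical)
BUDGET = 64   # fixed total number of samples, c (hypothetical)

def memory_noise(n_objects):
    """Samples per object, covariance, and std of the task-dependent memory
    noise when n_objects are tracked (U_n = U_1 * n)."""
    samples_per_object = BUDGET / n_objects  # c / n
    cov = U1 * n_objects                     # U_n = U_1 * n
    return samples_per_object, cov, math.sqrt(cov)

# Tracking 4 vs 8 of 16 objects: identical display, but half the samples per
# target, double the covariance, and sqrt(2) more noise per state estimate.
for n in (4, 8):
    print(n, memory_noise(n))
```

Feeding this n-dependent covariance into the filter is what produces the speed-number tradeoff described below; without it, model accuracy depends only on the display, not on how many targets are tracked.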
With more samples, precision would increase; however, because the total number of samples available is fixed at c, the number of samples per object is c/n, giving rise to the scaling rule described above.

³One might suppose that limiting the number of particles used for tracking, as in [4] and [14], might be a likely resource capacity; however, in object tracking, having more particles produces a benefit only insofar as future observations might disambiguate previous inferences. In multiple object tracking with uniform dots (as is the case in most human experiments), once objects have been mis-associated, no future observations can provide evidence that a mistake was made in the past; as such, keeping additional particles to track low-probability data associations carries no benefit.

Figure 5: Left: When more objects are tracked (out of 16), they must move at a slower speed to reach a particular level of accuracy [1]. Right: Our model exhibits this effect only if task-dependent uncertainty is introduced (see text).

In Figure 5 we add such a noise term to our model and measure performance (threshold speed σ_v for a given number of targets n_t, with spacing fixed at σ_x = 4 and the total number of objects fixed at n = 16). The characteristic tradeoff between the number of targets and the speed at which they may be tracked is clearly evident. Thus, while many results in MOT arise as consequences of the information available for the computational task, the speed-number tradeoff appears to be the result of a flexibly allocated resource such as memory or attention.

4 Conclusions

We investigated what limitations are responsible for human failures in multiple object tracking tasks. Are such limitations discrete (like a fixed number of objects) or continuous (like memory)?
Are they flexible with task (cognitive resources such as memory and attention), or are they task-independent (like perceptual noise)?

We modified a Bayes-optimal tracking solution for typical MOT experiments and implemented this solution using a Rao-Blackwellized particle filter. Using novel behavioral experiments inspired by the model, we showed that this ideal observer exhibits many of the classic phenomena in multiple object tracking given only perceptual uncertainty (a continuous, task-independent source of limitation). Just as for human observers, tracking in our model is harder when objects move faster or are closer together; inertia information is available, but may not be used; and objects can be tracked in features as well as space. However, effects of the number of objects tracked do not arise from perceptual uncertainty alone. To account for the tradeoff between the number of objects tracked and their speed, a task-dependent resource must be introduced; we introduce this resource as a memory constraint, but it may well be attentional gain.

Although the dichotomy of flexible, cognitive resources and task-independent, low-level uncertainty is a convenient distinction with which to start our analysis, it is misleading. When engaging in any real-world task this distinction is blurred: people will use whatever resources they have to facilitate performance; even perceptual uncertainty as basic as the resolution of the retina becomes a flexible resource when people are allowed to move their eyes (they were not allowed to do so in our experiments). Connecting resource limitations measured in controlled experiments to human performance in the real world requires that we address not only what the structure of the task may be, but also how human agents allocate resources to accomplish this task.
Here we have shown that a computational model of the multiple object tracking task can unify a large set of experimental findings on human object tracking and, most importantly, determine how these experimental findings map onto cognitive limitations. Because our findings implicate a flexible cognitive resource, the next necessary step is to investigate how people allocate such a resource; this question will be pursued in future work.

Acknowledgments: This work was supported by ONR MURI: Complex Learning and Skill Transfer with Video Games N00014-07-1-0937 (PI: Daphne Bavelier); an NDSEG fellowship to EV; and an NSF DRMS Dissertation grant to EV.

References

[1] G. Alvarez and S. Franconeri. How many objects can you attentively track? Evidence for a resource-limited tracking mechanism. Journal of Vision, 7(13):1–10, 2007.

[2] P. Bays and M. Husain. Dynamic shifts of limited working memory resources in human vision. Science, 321(5890):851, 2008.

[3] E. Blaser, Z. Pylyshyn, and A. Holcombe. Tracking an object through feature space. Nature, 408(6809):196–199, 2000.

[4] S. Brown and M. Steyvers. Detecting and predicting changes. Cognitive Psychology, 58:49–67, 2008.

[5] M. Carrasco and K. Frieder. Cortical magnification neutralizes the eccentricity effect in visual search. Vision Research, 37(1):63–82, 1997.

[6] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of Uncertainty in AI, 2000.

[7] J. Feldman and P. Tremoulet. Individuation of visual objects over time. Cognition, 99:131–165, 2006.

[8] D. E. Fencsik, J. Urrea, S. S. Place, J. M. Wolfe, and T. S. Horowitz. Velocity cues improve visual search and multiple object tracking. Visual Cognition, 14:92–95, 2006.

[9] S. Franconeri, J. Lin, Z. Pylyshyn, B. Fisher, and J. Enns. Evidence against a speed limit in multiple object tracking. Psychonomic Bulletin & Review, 15:802–808, 2008.

[10] F. Gustafsson, F. Gunnarsson, N. Bergman, U. Forssell, J. Jansson, R. Karlsson, and P. Nordlund. Particle filters for positioning, navigation, and tracking. IEEE Transactions on Signal Processing, 50, 2002.

[11] W. James. The Principles of Psychology. Harvard University Press, Cambridge, 1890.

[12] R. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82D:35–45, 1960.

[13] B. P. Keane and Z. W. Pylyshyn. Is motion extrapolation employed in multiple object tracking? Tracking as a low-level non-predictive function. Cognitive Psychology, 52:346–368, 2006.

[14] R. Levy, F. Reali, and T. Griffiths. Modeling the effects of memory on human online sentence processing with particle filters. In Advances in Neural Information Processing Systems, volume 21, 2009.

[15] T. Makovski and Y. Jiang. Feature binding in attentive tracking of distinct objects. Visual Cognition, 17:180–194, 2009.

[16] Z. W. Pylyshyn and R. W. Storm. Tracking multiple independent targets: Evidence for a parallel tracking mechanism. Spatial Vision, 3:179–197, 1988.

[17] R. Snowden and O. Braddick. The temporal integration and resolution of velocity signals. Vision Research, 31(5):907–914, 1991.

[18] R. Streit and Luginbuhl. Probabilistic multi-hypothesis tracking. Technical Report 10428, NUWC, Newport, Rhode Island, USA, 1995.

[19] S. P. Tripathy and B. T. Barrett. Severe loss of positional information when detecting deviations in multiple trajectories. Journal of Vision, 4(12):1020–1043, 2004.