Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Jongwoo Lim, David Ross, Ruei-sung Lin, Ming-Hsuan Yang*

Most existing tracking algorithms construct a representation of a target object before the tracking task starts, and utilize invariant features to handle appearance variation of the target caused by lighting, pose, and view angle change. In this paper, we present an efficient and effective online algorithm that incrementally learns and adapts a low dimensional eigenspace representation to reflect appearance changes of the target, thereby facilitating the tracking task. Furthermore, our incremental method correctly updates the sample mean and the eigenbasis, whereas existing incremental subspace update methods ignore the fact that the sample mean varies over time. The tracking problem is formulated as a state inference problem within a Markov Chain Monte Carlo framework, and a particle filter is incorporated for propagating sample distributions over time. Numerous experiments demonstrate the effectiveness of the proposed tracking algorithm in indoor and outdoor environments where the target objects undergo large pose and lighting changes.

1 Introduction

The main challenges of visual tracking can be attributed to the difficulty in handling appearance variability of a target object. Intrinsic appearance variabilities include pose variation and shape deformation of a target object, whereas extrinsic factors such as illumination change, camera motion, camera viewpoint, and occlusion inevitably cause large appearance variation. Due to the nature of the tracking problem, it is imperative for a tracking algorithm to model such appearance variation.

Here we develop a method that, during visual tracking, constantly and efficiently updates a low dimensional eigenspace representation of the appearance of the target object. The advantages of this adaptive subspace representation are several. The eigenspace representation provides a compact notion of the "thing" being tracked rather than treating the target as a set of independent pixels, i.e., "stuff" [1]. The incremental method continually updates the eigenspace to reflect the appearance change caused by intrinsic and extrinsic factors, thereby facilitating the tracking process. To estimate the locations of the target objects in consecutive frames, we use a sampling algorithm with likelihood estimates, which is in direct contrast to other tracking methods that usually solve complex optimization problems using a gradient-descent approach.

The proposed method differs from our prior work [14] in several aspects. First, the proposed algorithm does not require any training images of the target object before the tracking task starts. That is, our tracker learns a low dimensional eigenspace representation on-line and incrementally updates it as time progresses (we assume, like most tracking algorithms, that the target region has been initialized in the first frame). Second, we extend our sampling method to incorporate a particle filter so that the sample distributions are propagated over time. Based on the eigenspace model with updates, an effective likelihood estimation function is developed. Third, we extend the R-SVD algorithm [6] so that both the sample mean and eigenbasis are correctly updated as new data arrive. Though there are numerous subspace update algorithms in the literature, only the method by Hall et al. [8] is also able

to update the sample mean. However, their method is based on the addition of a single column (a single observation) rather than blocks (a number of observations in our case) and thus is less efficient than ours. While our formulation provides an exact solution, their algorithm gives only approximate updates and may thus suffer from numerical instability. Finally, the proposed tracker is extended to use a robust error norm for likelihood estimation in the presence of noisy data or partial occlusions, thereby rendering more accurate and robust tracking results.

2 Previous Work and Motivation

Black et al. [4] proposed a tracking algorithm using a pre-trained view-based eigenbasis representation and a robust error norm. Instead of relying on the popular brightness constancy working principle, they advocated the use of a subspace constancy assumption for visual tracking. Although their algorithm demonstrated excellent empirical results, it requires building a set of view-based eigenbases before the tracking task starts. Furthermore, their method assumes that certain factors, such as illumination conditions, do not change significantly, as the eigenbasis, once constructed, is not updated.

Hager and Belhumeur [7] presented a tracking algorithm to handle the geometry and illumination variations of target objects. Their method extends a gradient-based optical flow algorithm to incorporate the research findings of [2] for object tracking under varying illumination conditions. Before the tracking task starts, a set of illumination bases needs to be constructed at a fixed pose in order to account for appearance variation of the target due to lighting changes. Consequently, it is not clear whether this method is effective if a target object undergoes changes in illumination at arbitrary pose.

In [9] Isard and Blake developed the Condensation algorithm for contour tracking in which multiple plausible interpretations are propagated over time. Though their probabilistic approach has demonstrated success in tracking contours in clutter, the representation scheme is rather primitive, i.e., curves or splines, and is not updated as the appearance of a target varies due to pose or illumination change.

Mixture models have been used to describe appearance change for motion estimation [3] [10]. In Black et al. [3] four possible causes are identified in a mixture model for estimating appearance change in consecutive frames, and thereby more reliable image motion can be obtained. A more elaborate mixture model with an online EM algorithm was recently proposed by Jepson et al. [10] in which they use three components and wavelet filters to account for appearance changes during tracking. Their method is able to handle variations in pose, illumination and expression. However, their WSL appearance model treats pixels within the target region independently, and therefore does not have a notion of the "thing" being tracked. This may result in modeling the background rather than the foreground, and thus in failure to track the target.

In contrast to the eigentracking algorithm [4], our algorithm does not require a training phase but learns the eigenbases on-line during the object tracking process, and constantly updates this representation as the appearance changes due to pose, view angle, and illumination variation. Further, our method uses a particle filter for motion parameter estimation rather than the Gauss-Newton method, which often gets stuck in local minima or is distracted by outliers [4]. Our appearance-based model provides a richer description than the simple curves or splines used in [9], and has a notion of the "thing" being tracked. In addition, the learned representation can be utilized for other tasks such as object recognition. In this work, an eigenspace representation is learned directly from pixel values within a target object in the image space. Experiments show that good tracking results can be obtained with this representation without resorting to wavelets as used in [10], and better performance can potentially be achieved using wavelet filters. Note also that the view-based eigenspace representation has demonstrated its ability to model the appearance of objects at different poses [13] and under different lighting conditions [2].

3 Incremental Learning for Tracking

We present the details of the proposed incremental learning algorithm for object tracking in this section.

3.1 Incremental Update of Eigenbasis and Mean

The appearance of a target object may change drastically due to intrinsic and extrinsic factors as discussed earlier. Therefore it is important to develop an efficient algorithm to update the eigenspace as the tracking task progresses. Numerous algorithms have been developed to update an eigenbasis from a time-varying covariance matrix as more data arrive [6] [8] [11] [5]. However, most methods assume a zero mean when updating the eigenbasis, except the method by Hall et al. [8], which considers the change of the mean as each new datum arrives. Their update algorithm handles only one datum per update and gives approximate results, while our formulation handles multiple data at the same time and renders exact solutions.

We extend the classic R-SVD method [6] to update the eigenbasis while taking the shift of the sample mean into account. To the best of our knowledge, this formulation with mean update is new in the literature.

Given a $d \times n$ data matrix $A = \{I_1, \ldots, I_n\}$ where each column $I_i$ is an observation (a $d$-dimensional image vector in this paper), we can compute the singular value decomposition (SVD) of $A$, i.e., $A = U \Sigma V^\top$. When a $d \times m$ matrix $E$ of new observations is available, the R-SVD algorithm efficiently computes the SVD of the matrix $A' = (A \mid E) = U' \Sigma' V'^\top$ based on the SVD of $A$ as follows:

```
1. Apply a QR decomposition to (I - U U^\top)E to obtain an orthonormal basis
   \tilde{E} of the component of E orthogonal to U, and set U' = (U \mid \tilde{E}).

2. Let V' = \begin{pmatrix} V & 0 \\ 0 & I_m \end{pmatrix}, where I_m is the
   m \times m identity matrix. It follows that

   \Sigma^* = U'^\top A' V'
            = \begin{pmatrix} U^\top \\ \tilde{E}^\top \end{pmatrix} (A \mid E)
              \begin{pmatrix} V & 0 \\ 0 & I_m \end{pmatrix}
            = \begin{pmatrix} U^\top A V & U^\top E \\ \tilde{E}^\top A V & \tilde{E}^\top E \end{pmatrix}
            = \begin{pmatrix} \Sigma & U^\top E \\ 0 & \tilde{E}^\top E \end{pmatrix}.

3. Compute the SVD of \Sigma^* = \tilde{U} \tilde{\Sigma} \tilde{V}^\top; the SVD of A' is then

   A' = U' (\tilde{U} \tilde{\Sigma} \tilde{V}^\top) V'^\top
      = (U' \tilde{U}) \, \tilde{\Sigma} \, (\tilde{V}^\top V'^\top).
```

Exploiting the properties of orthonormal bases and block structures, the R-SVD algorithm computes the new eigenbasis efficiently. The computational complexity analysis and more details are described in [6].
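The three steps above can be sketched in a few lines of NumPy. This is our own illustrative implementation of the block R-SVD update, not the authors' code; the function name `r_svd` and its argument conventions (thin SVD factors, new columns `E`) are our assumptions:

```python
import numpy as np

def r_svd(U, S, Vt, E):
    """Block R-SVD update: given the thin SVD A = U @ diag(S) @ Vt and a
    block E of new columns, return the thin SVD of (A | E)."""
    n = Vt.shape[1]       # number of old observations
    m = E.shape[1]        # number of new observations
    k = S.shape[0]        # number of retained components
    # Step 1: orthonormal basis of the component of E orthogonal to U.
    E_t, _ = np.linalg.qr(E - U @ (U.T @ E))
    U_aug = np.hstack([U, E_t])
    # Step 2: the small matrix Sigma* = U'^T (A|E) V' is block upper-triangular.
    top = np.hstack([np.diag(S), U.T @ E])
    bot = np.hstack([np.zeros((E_t.shape[1], k)), E_t.T @ E])
    Sigma_star = np.vstack([top, bot])
    # Step 3: SVD of the small matrix, then rotate both bases.
    Ut, St, Vtt = np.linalg.svd(Sigma_star, full_matrices=False)
    Vp = np.block([[Vt.T, np.zeros((n, m))],
                   [np.zeros((m, k)), np.eye(m)]])
    return U_aug @ Ut, St, Vtt @ Vp.T
```

All heavy work happens on the small $(k+m) \times (k+m)$ matrix, which is the source of the method's efficiency when $k + m \ll d$.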

One problem with the R-SVD algorithm is that the eigenbasis $U$ is computed from $AA^\top$ under a zero-mean assumption. We modify the R-SVD algorithm to compute the eigenbasis with mean update. The following derivation is based on the scatter matrix, which is the same as the covariance matrix up to a scalar factor.

Proposition 1. Let $I_p = \{I_1, I_2, \ldots, I_n\}$, $I_q = \{I_{n+1}, I_{n+2}, \ldots, I_{n+m}\}$, and $I_r = (I_p \mid I_q)$. Denote the means and scatter matrices of $I_p$, $I_q$, $I_r$ as $\bar{I}_p$, $\bar{I}_q$, $\bar{I}_r$ and $S_p$, $S_q$, $S_r$, respectively. Then

$$S_r = S_p + S_q + \frac{nm}{n+m}\,(\bar{I}_q - \bar{I}_p)(\bar{I}_q - \bar{I}_p)^\top.$$

Proof: By definition,

$$\bar{I}_r = \frac{n}{n+m}\bar{I}_p + \frac{m}{n+m}\bar{I}_q, \qquad
\bar{I}_p - \bar{I}_r = \frac{m}{n+m}(\bar{I}_p - \bar{I}_q), \qquad
\bar{I}_q - \bar{I}_r = \frac{n}{n+m}(\bar{I}_q - \bar{I}_p),$$

and therefore

$$\begin{aligned}
S_r &= \sum_{i=1}^{n}(I_i - \bar{I}_r)(I_i - \bar{I}_r)^\top
     + \sum_{i=n+1}^{n+m}(I_i - \bar{I}_r)(I_i - \bar{I}_r)^\top \\
    &= \sum_{i=1}^{n}(I_i - \bar{I}_p + \bar{I}_p - \bar{I}_r)(I_i - \bar{I}_p + \bar{I}_p - \bar{I}_r)^\top
     + \sum_{i=n+1}^{n+m}(I_i - \bar{I}_q + \bar{I}_q - \bar{I}_r)(I_i - \bar{I}_q + \bar{I}_q - \bar{I}_r)^\top \\
    &= S_p + n(\bar{I}_p - \bar{I}_r)(\bar{I}_p - \bar{I}_r)^\top
     + S_q + m(\bar{I}_q - \bar{I}_r)(\bar{I}_q - \bar{I}_r)^\top \\
    &= S_p + \frac{nm^2}{(n+m)^2}(\bar{I}_p - \bar{I}_q)(\bar{I}_p - \bar{I}_q)^\top
     + S_q + \frac{n^2 m}{(n+m)^2}(\bar{I}_p - \bar{I}_q)(\bar{I}_p - \bar{I}_q)^\top \\
    &= S_p + S_q + \frac{nm}{n+m}(\bar{I}_p - \bar{I}_q)(\bar{I}_p - \bar{I}_q)^\top.
\end{aligned}$$
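Proposition 1 is easy to verify numerically. The sketch below is our own check on random data; the helper `scatter` and all sizes are illustrative:

```python
import numpy as np

def scatter(X):
    """Scatter matrix of a d x N data matrix: sum_i (x_i - mean)(x_i - mean)^T."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc @ Xc.T

rng = np.random.default_rng(1)
n, m, d = 7, 4, 3
Ip = rng.standard_normal((d, n))   # old observations, one per column
Iq = rng.standard_normal((d, m))   # new observations
Ir = np.hstack([Ip, Iq])

mp, mq = Ip.mean(axis=1), Iq.mean(axis=1)
# Rank-one correction term from Proposition 1.
correction = (n * m / (n + m)) * np.outer(mq - mp, mq - mp)
Sr = scatter(Ip) + scatter(Iq) + correction
```

Here `Sr` agrees with `scatter(Ir)` to machine precision.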

Let $\hat{I}_p = \{I_1 - \bar{I}_p, \ldots, I_n - \bar{I}_p\}$, $\hat{I}_q = \{I_{n+1} - \bar{I}_q, \ldots, I_{n+m} - \bar{I}_q\}$, and $\hat{I}_r = \{I_1 - \bar{I}_r, \ldots, I_{n+m} - \bar{I}_r\}$, and let the SVD of $\hat{I}_r$ be $U_r \Sigma_r V_r^\top$. Let

$$\tilde{E} = \Big( \hat{I}_q \;\Big|\; \sqrt{\tfrac{nm}{n+m}}\,(\bar{I}_p - \bar{I}_q) \Big),$$

and, by Proposition 1, $S_r = (\hat{I}_p \mid \tilde{E})(\hat{I}_p \mid \tilde{E})^\top$. Therefore, we can obtain $U_r$ and $\Sigma_r$ by computing the SVD of $(\hat{I}_p \mid \tilde{E})$, which can be done efficiently with the R-SVD algorithm described above.

In summary, given the mean $\bar{I}_p$ and the SVD of the existing data $I_p$, i.e., $U_p \Sigma_p V_p^\top$, and the new data $I_q$, we can compute the mean $\bar{I}_r$ and the SVD of $I_r$, i.e., $U_r \Sigma_r V_r^\top$, easily:

```
1. Compute \bar{I}_r = \frac{n}{n+m}\bar{I}_p + \frac{m}{n+m}\bar{I}_q, and
   \tilde{E} = \Big( I_q - \bar{I}_q \mathbf{1}_{1 \times m} \;\Big|\; \sqrt{\tfrac{nm}{n+m}}\,(\bar{I}_p - \bar{I}_q) \Big).
2. Apply the R-SVD algorithm to (U_p \Sigma_p V_p^\top) and \tilde{E} to obtain (U_r \Sigma_r V_r^\top).
```

In numerous vision problems, we can further exploit the low dimensional approximation of image data and put larger weights on the recent observations, or equivalently downweight the contributions of previous observations. For example, as the appearance of a target object gradually changes, we may want to put more weight on recent observations when updating the eigenbasis, since they are more likely to resemble the current appearance of the target. A forgetting factor $f$ can be used under this premise, as suggested in [11], i.e., $A' = (fA \mid E) = (U (f\Sigma) V^\top \mid E)$, where $A$ and $A'$ are the original and weighted data matrices, respectively.

3.2 Sequential Inference Model

The visual tracking problem is cast as an inference problem with a Markov model and a hidden state variable, where the state variable $X_t$ describes the affine motion parameters (and thereby the location) of the target at time $t$. Given a set of observed images $\mathcal{I}_t = \{I_1, \ldots, I_t\}$, we aim to estimate the value of the hidden state variable $X_t$. Using Bayes' theorem, we have

```
p(X_t \mid \mathcal{I}_t) \;\propto\; p(I_t \mid X_t) \int p(X_t \mid X_{t-1}) \, p(X_{t-1} \mid \mathcal{I}_{t-1}) \, dX_{t-1}
```

The tracking process is governed by the observation model $p(I_t \mid X_t)$, which estimates the likelihood of observing $I_t$ given state $X_t$, and by the dynamical model between two states, $p(X_t \mid X_{t-1})$. The Condensation algorithm [9], based on factored sampling, approximates an arbitrary distribution of observations with a stochastically generated set of weighted samples. We use a variant of the Condensation algorithm to model the distribution over the object's location as it evolves over time.
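A single Condensation-style step, i.e., resample by weight, diffuse through the dynamics, and re-weight by the observation likelihood, can be sketched as below. This is a generic illustration of factored sampling on a toy one-dimensional state, not the tracker itself; the toy likelihood and all parameter values are our own:

```python
import numpy as np

rng = np.random.default_rng(3)

def particle_filter_step(particles, weights, observe_logp, sigma):
    """One factored-sampling step: resample particle indices proportional to
    weight, apply Gaussian dynamics, and re-weight by the likelihood."""
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights)                     # resample
    particles = particles[idx] + sigma * rng.standard_normal(particles.shape)
    logw = np.array([observe_logp(x) for x in particles])      # re-weight
    w = np.exp(logw - logw.max())                              # stable normalization
    return particles, w / w.sum()

# Toy run: infer a scalar state whose true value is 2.0 from noisy likelihoods.
true_x = 2.0
particles = rng.uniform(-5.0, 5.0, size=(300, 1))
weights = np.full(300, 1.0 / 300)
for _ in range(20):
    particles, weights = particle_filter_step(
        particles, weights,
        observe_logp=lambda x: -0.5 * (x[0] - true_x) ** 2 / 0.1,
        sigma=0.05)
estimate = float((weights[:, None] * particles).sum())
```

After a few iterations the weighted particle set concentrates around the true state.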

3.3 Dynamical and Observation Models

The motion of a target object between two consecutive frames can be approximated by an affine image warping. In this work, we use the six parameters of an affine transform to model the state transition from $X_{t-1}$ to $X_t$ of a target object being tracked. Let $X_t = (x_t, y_t, \theta_t, s_t, \alpha_t, \phi_t)$, where $x_t, y_t, \theta_t, s_t, \alpha_t, \phi_t$ denote the $x$, $y$ translation, rotation angle, scale, aspect ratio, and skew direction at time $t$. Each parameter in $X_t$ is modeled independently by a Gaussian distribution around its counterpart in $X_{t-1}$. That is, $p(X_t \mid X_{t-1}) = \mathcal{N}(X_t; X_{t-1}, \Psi)$, where $\Psi$ is a diagonal covariance matrix whose elements are the corresponding variances of the affine parameters, i.e., $\sigma_x^2, \sigma_y^2, \sigma_\theta^2, \sigma_s^2, \sigma_\alpha^2, \sigma_\phi^2$.
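Drawing particles from this Brownian dynamical model amounts to an independent Gaussian perturbation of each affine parameter. In the sketch below, the six standard deviations are illustrative placeholders of our own choosing, not the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(4)

# One std per affine parameter: x, y translation, rotation, scale,
# aspect ratio, skew.  Values are illustrative, not the paper's.
AFFINE_STD = np.array([4.0, 4.0, 0.02, 0.02, 0.005, 0.001])

def sample_motion(X_prev, n_particles):
    """Draw n_particles samples from p(X_t | X_{t-1}) = N(X_t; X_{t-1}, Psi)
    with diagonal Psi = diag(AFFINE_STD**2)."""
    return X_prev + AFFINE_STD * rng.standard_normal((n_particles, 6))

# 200 hypotheses around a window centered at (160, 120), unit scale.
samples = sample_motion(np.array([160.0, 120.0, 0.0, 1.0, 1.0, 0.0]), 200)
```

Each row of `samples` is one candidate state to be scored by the observation model.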

Since our goal is to use a representation to model the "thing" that we are tracking, we model the image observations using a probabilistic interpretation of principal component analysis [16]. Given an image patch predicted by $X_t$, we assume the observed image $I_t$ was generated from a subspace spanned by $U$ and centered at $\mu$. The probability that a sample is generated from the subspace is inversely proportional to the distance $d$ from the sample to the reference point (i.e., center) of the subspace, which can be decomposed into the distance-to-subspace, $d_t$, and the distance-within-subspace, $d_w$, from the projected sample to the subspace center. This distance formulation, based on an orthonormal subspace and its complement space, is similar in spirit to [12].

The probability of a sample generated from the subspace, $p_{d_t}(I_t \mid X_t)$, is governed by a Gaussian distribution:

$$p_{d_t}(I_t \mid X_t) = \mathcal{N}(I_t;\, \mu,\, UU^\top + \varepsilon I)$$

where $I$ is an identity matrix, $\mu$ is the mean, and the $\varepsilon I$ term corresponds to the additive Gaussian noise in the observation process. It can be shown [15] that the negative exponential distance from $I_t$ to the subspace spanned by $U$, i.e., $\exp(-\|(I_t - \mu) - UU^\top(I_t - \mu)\|^2)$, is proportional to $\mathcal{N}(I_t;\, \mu,\, UU^\top + \varepsilon I)$ as $\varepsilon \to 0$.

Within the subspace, the likelihood of the projected sample can be modeled by the Mahalanobis distance from the mean:

$$p_{d_w}(I_t \mid X_t) = \mathcal{N}(I_t;\, \mu,\, U \Sigma^{-2} U^\top)$$

where $\mu$ is the center of the subspace and $\Sigma$ is the matrix of singular values corresponding to the columns of $U$. Put together, the likelihood of a sample being generated from the subspace is governed by

$$p(I_t \mid X_t) = p_{d_t}(I_t \mid X_t)\, p_{d_w}(I_t \mid X_t)
= \mathcal{N}(I_t;\, \mu,\, UU^\top + \varepsilon I)\;
\mathcal{N}(I_t;\, \mu,\, U \Sigma^{-2} U^\top) \quad (1)$$
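Up to additive normalization constants, the log of (1) reduces to the sum of the two (scaled) squared distances, which the sketch below computes. The function name and the `eps_var` parameterization of $\varepsilon$ are our own assumptions, and the Gaussian normalizers are dropped:

```python
import numpy as np

def log_likelihood(I, mu, U, S, eps_var=0.01):
    """Unnormalized log of Eq. (1): distance-to-subspace (residual energy,
    scaled by the noise variance) plus distance-within-subspace
    (Mahalanobis distance under the singular values S)."""
    z = I - mu
    proj = U.T @ z                 # coordinates within the subspace
    residual = z - U @ proj        # component orthogonal to the subspace
    d_t = (residual ** 2).sum() / eps_var   # distance-to-subspace term
    d_w = ((proj / S) ** 2).sum()           # within-subspace Mahalanobis term
    return -0.5 * (d_t + d_w)
```

A sample drawn at a particle's predicted window is scored by this function; the particle weights in the Condensation step are proportional to `exp(log_likelihood(...))`.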

Given a drawn sample $X_t$ and the corresponding image region $I_t$, we aim to compute $p(I_t \mid X_t)$ using (1). To minimize the effects of noisy pixels, we utilize a robust error norm [4], $\rho(x, \sigma) = \frac{x^2}{\sigma^2 + x^2}$, instead of the Euclidean norm $d(x) = \|x\|^2$, to discount "outlier" pixels (i.e., the pixels that are not likely to appear inside the target region given the current eigenspace). We use a method similar to that of [4] to compute $d_t$ and $d_w$. This robust error norm is especially helpful when we use a rectangular region to enclose the target (which inevitably contains some noisy background pixels).
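The robust norm (a form of the Geman-McClure norm) is a one-line function. Unlike the squared error, it is quadratic near zero but saturates at 1 for large residuals, so an occluding background pixel contributes a bounded amount to the total error:

```python
def rho(x, sigma):
    """Robust error norm rho(x, sigma) = x^2 / (sigma^2 + x^2).
    Behaves like (x/sigma)^2 for |x| << sigma and approaches 1 as |x| grows,
    capping the influence of outlier pixels."""
    return x * x / (sigma * sigma + x * x)
```

Summing `rho` over per-pixel residuals, rather than the residuals' squares, is what lets the tracker ignore background pixels inside the rectangular target window.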

4 Experiments

To test the performance of our proposed tracker, we collected a number of videos recorded in indoor and outdoor environments where the targets change pose under different lighting conditions. Each video consists of 320 × 240 gray-scale images recorded at 15 frames per second unless specified otherwise. For the eigenspace representation, each target image region is resized to a 32 × 32 patch, and the number of eigenvectors used in all experiments is set to 16, though fewer eigenvectors may also work well. Implemented in MATLAB with MEX, our algorithm runs at 4 frames per second on a standard computer with 200 particles. We present some tracking results in this section; more tracking results as well as videos can be found at http://vision.ucsd.edu/~jwlim/ilt/.

4.1 Experimental Results

Figure 1 shows the tracking results on a challenging sequence recorded with a moving digital camera in which a person moves from a dark room toward a bright area while changing his pose, moving underneath spotlights, changing facial expressions, and taking off his glasses. All the eigenbases are constructed automatically from scratch and constantly updated to model the appearance of the target object as it undergoes appearance changes. Even with the significant camera motion and low frame rate (which makes the motion between frames more significant, equivalent to tracking fast-moving objects), our tracker stays stably on the target throughout the sequence.

The second sequence contains an animal doll moving in different pose, scale, and lighting conditions as shown in Figure 2. Experimental results demonstrate that our tracker is able to follow the target as it undergoes large pose change, cluttered background, and lighting variation. Notice that the non-convex target object is localized with an enclosing rectangular window, and thus its appearance representation inevitably contains some background pixels. The robust error norm enables the tracker to ignore background pixels and estimate the target location correctly. The results also show that our algorithm faithfully

Figure 1: A person moves from a dark toward a bright area with large lighting and pose changes. The images in the second row show the current sample mean, tracked region, reconstructed image, and the reconstruction error, respectively. The third and fourth rows show the 10 largest eigenbases.

Figure 2: An animal doll moving with large pose and lighting variation in a cluttered background.

models the appearance of the target, as shown in eigenbases and reconstructed images, in the presence of noisy background pixels.

We recorded a sequence to demonstrate that our tracker performs well in outdoor environments where lighting conditions change drastically. The video was acquired while a person walked underneath a trellis covered by vines. As shown in Figure 3, the cast shadows change the appearance of the target face drastically. Furthermore, the combination of pose and lighting variation with a low frame rate makes the tracking task extremely difficult. Nevertheless, the results show that our tracker follows the target accurately and robustly. Due to the heavy shadows and drastic lighting change, other tracking methods based on gradient, contour, or color information are unlikely to perform well in this case.
