{"title": "Adaptive Discriminative Generative Model and Its Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 801, "page_last": 808, "abstract": null, "full_text": " Adaptive Discriminative Generative Model\n and Its Applications\n\n\n Ruei-Sung Lin David Ross Jongwoo Lim Ming-Hsuan Yang\n University of Illinois University of Toronto Honda Research Institute\n rlin1@uiuc.edu dross@cs.toronto.edu jlim1@uiuc.edu myang@honda-ri.com\n\n\n Abstract\n\n This paper presents an adaptive discriminative generative model that gen-\n eralizes the conventional Fisher Linear Discriminant algorithm and ren-\n ders a proper probabilistic interpretation. Within the context of object\n tracking, we aim to find a discriminative generative model that best sep-\n arates the target from the background. We present a computationally\n efficient algorithm to constantly update this discriminative model as time\n progresses. While most tracking algorithms operate on the premise that\n the object appearance or ambient lighting condition does not significantly\n change as time progresses, our method adapts a discriminative genera-\n tive model to reflect appearance variation of the target and background,\n thereby facilitating the tracking task in ever-changing environments. Nu-\n merous experiments show that our method is able to learn a discrimina-\n tive generative model for tracking target objects undergoing large pose\n and lighting changes.\n\n\n1 Introduction\n\nTracking moving objects is an important and essential component of visual perception,\nand has been an active research topic in computer vision community for decades. Object\ntracking can be formulated as a continuous state estimation problem where the unobserv-\nable states encode the locations or motion parameters of the target objects, and the task is\nto infer the unobservable states from the observed images over time. 
At each time step,\na tracker first predicts a few possible locations (i.e., hypotheses) of the target in the next\nframe based on its prior and current knowledge. The prior knowledge includes its previous\nobservations and estimated state transitions. Among these possible locations, the tracker\nthen determines the most likely location of the target object based on the new observation.\nAn attractive and effective prediction mechanism is based on Monte Carlo sampling\nin which the state dynamics (i.e., transition) can be learned with a Kalman filter or simply\nmodeled as a Gaussian distribution. Such a formulation indicates that the performance of\na tracker depends largely on a good observation model for validating all hypotheses. Indeed,\nlearning a robust observation model has been the focus of most recent object tracking\nresearch within this framework, and is also the focus of this paper.\n\nMost of the existing approaches utilize static observation models and construct them before\na tracking task starts. To account for all possible variation in a static observation model,\nit is imperative to collect a large set of training examples with the hope that it covers all\npossible variations of the object's appearance. However, it is well known that the appearance\nof an object varies significantly under different illumination, viewing angle, and shape\ndeformation. It is a daunting, if not impossible, task to collect a training set that enumerates\nall possible cases. An alternative approach is to develop an adaptive method that contains\na number of trackers that track different features or parts of a target object [3]. Therefore,\neven though each tracker may fail under certain circumstances, it is unlikely that all of them fail\nat the same time. The tracking method then adaptively selects the trackers that are robust in the\ncurrent situation to predict object locations. 
Although this approach improves the flexibility\nand robustness of a tracking method, each tracker has a static observation model which has\nto be trained beforehand and consequently restricts its application domains severely. There\nare numerous cases, e.g., robotics applications, where the tracker is expected to track a pre-\nviously unseen target once it is detected. To the best of our knowledge, considerably less\nattention is paid to developing adaptive observation models to account for appearance vari-\nation of a target object (e.g., pose, deformation) or environmental changes (e.g., lighting\nconditions and viewing angles) as tracking task progresses.\n\nOur approach is to learn a model for determining the probability of a predicted image loca-\ntion being generated from the class of the target or the background. That is, we formulate\na binary classification problem and develop a discriminative model to distinguish obser-\nvations from the target class and the background class. While conventional discriminative\nclassifiers simply predict the class of each test sample, a good model within the above-\nmentioned tracking framework needs to select the most likely sample that belongs to target\nobject class from a set of samples (or hypotheses). In other words, an observation model\nneeds a classifier with proper probabilistic interpretation.\n\nIn this paper, we present an adaptive discriminative generative model and apply it to object\ntracking. The proposed model aims to best separate the target and the background in the\never-changing environment. The problem is formulated as a density estimation problem,\nwhere the goal is, given a set of positive (i.e., belonging to the target object class) and neg-\native examples (i.e., belonging to the background class), to learn a distribution that assigns\nhigh probability to the positive examples and low probability to the negative examples. This\nis done in a two-stage process. 
First, in the generative stage, we use probabilistic principal\ncomponent analysis to model the density of the positive examples. The result of this stage is\na Gaussian, which assigns high probability to examples lying in the linear subspace that\ncaptures the most variance of the positive examples. Second, in the discriminative stage,\nwe use negative examples (specifically, negative examples that are assigned high probability\nby our generative model) to produce a new distribution which reduces the probability\nof the negative examples. This is done by learning a linear projection that, when applied\nto the data and the generative model, increases the distance between the negative examples\nand the mean. Toward that end, it is formulated as an optimization problem and we show\nthat this is a direct generalization of the conventional Fisher Linear Discriminant algorithm\nwith a proper probabilistic interpretation. Our experimental results show that our algorithm\ncan reliably track moving objects whose appearance changes under different poses, illumination,\nand self-deformation.\n\n2 Probabilistic Tracking Algorithm\n\nWe formulate the object tracking problem as a state estimation problem in a way similar\nto [5] [9]. Denote by $o_t$ an image region observed at time $t$, and let $O_t = \\{o_1, \\ldots, o_t\\}$ be the set\nof image regions observed from the beginning to time $t$. An object tracking problem is a\nprocess to infer state $s_t$ from observation $O_t$, where state $s_t$ contains a set of parameters\nreferring to the tracked object's 2-D position, orientation, and scale in image $o_t$. 
Assuming\na Markovian state transition, this inference problem is formulated with a recursive equation:\n\n$$p(s_t|O_t) = k p(o_t|s_t) \\int p(s_t|s_{t-1}) p(s_{t-1}|O_{t-1}) ds_{t-1} \\tag{1}$$\n\nwhere $k$ is a constant, and $p(o_t|s_t)$ and $p(s_t|s_{t-1})$ correspond to the observation model and\ndynamic model, respectively.\n\nIn (1), $p(s_{t-1}|O_{t-1})$ is the state estimate given all the prior observations up to time $t-1$,\nand $p(o_t|s_t)$ is the likelihood of observing image $o_t$ at state $s_t$. Put together, the posterior\nestimate $p(s_t|O_t)$ can be computed efficiently. For object tracking, an ideal distribution\nof $p(s_t|O_t)$ should peak at $o_t$, i.e., at the $s_t$ matching the observed object's location in $o_t$. While\nthe integral in (1) predicts the regions where the object is likely to appear given all the prior\nobservations, the observation model $p(o_t|s_t)$ determines the most likely state that matches\nthe observation at time $t$.\n\nIn our formulation, $p(o_t|s_t)$ measures the probability of observing $o_t$ as a sample being\ngenerated by the target object class. Note that $O_t$ is an image sequence and if the images\nare acquired at a high frame rate, it is expected that the difference between $o_t$ and $o_{t-1}$\nis small, though the object's appearance might vary with viewing angle,\nillumination, and possible self-deformation. Instead of adopting a complex static model\nto learn $p(o_t|s_t)$ for all possible $o_t$, a simpler model suffices by adapting this model to\naccount for the appearance changes. In addition, since $o_t$ and $o_{t-1}$ are most likely similar\nand computing $p(o_t|s_t)$ depends on $p(o_{t-1}|s_{t-1})$, the prior information $p(o_{t-1}|s_{t-1})$ can\nbe used to enhance the distinctiveness between the object and its background in $p(o_t|s_t)$.\nThe idea of using an adaptive observation model for object tracking and then applying\ndiscriminative analysis to better predict object location is the focus of the rest of the paper. The\nobservation model we use is based on probabilistic principal component analysis (PPCA)\n[10]. 
Object tracking using PCA models has been well explored in the computer vision\ncommunity [2]. Nevertheless, most existing tracking methods do not update the observation\nmodels as time progresses. In this paper, we follow the work by Tipping and Bishop [10]\nand propose an adaptive observation model based on PCA within a formal probabilistic\nframework. Our result is a generalization of the conventional Fisher Linear Discriminant\nwith a proper probabilistic interpretation.\n\n3 A Discriminative Generative Observation Model\n\nIn this work, we track a target object based on its observations in the videos, i.e., $o_t$. Since\nthe size of image region $o_t$ might change according to different $s_t$, we first convert $o_t$ to\na standard size and use it for tracking. In the following, we denote by $y_t$ the standardized\nappearance vector of $o_t$.\n\nThe dimensionality of the appearance vector $y_t$ is usually high. In our experiments, the\nstandard image size is a $19 \\times 19$ patch and thus $y_t$ is a 361-dimensional vector. We thus\nmodel the appearance vector with a graphical model of low-dimensional latent variables.\n\n3.1 A Generative Model with Latent Variables\n\nA latent model relates an $n$-dimensional appearance vector $y$ to an $m$-dimensional vector of\nlatent variables $x$:\n\n$$y = W x + \\mu + \\epsilon \\tag{2}$$\n\nwhere $W$ is an $n \\times m$ projection matrix associating $y$ and $x$, $\\mu$ is the mean of $y$, and $\\epsilon$ is\nadditive noise. As commonly assumed in factor analysis [1] and other graphical models [6],\nthe latent variables $x$ are independent with unit variance, $x \\sim \\mathcal{N}(0, I_m)$, where $I_m$ is the\n$m$-dimensional identity matrix, and $\\epsilon$ is zero-mean Gaussian noise, $\\epsilon \\sim \\mathcal{N}(0, \\sigma^2 I_n)$. Since\n$x$ and $\\epsilon$ are both Gaussian, it follows that $y$ is also Gaussian, $y \\sim \\mathcal{N}(\\mu, C)$,\nwhere $C = W W^T + \\sigma^2 I_n$ and $I_n$ is the $n$-dimensional identity matrix. 
Together with (2),\nwe have a generative observation model:\n\n$$p(o_t|s_t) = p(y_t|W, \\mu, \\sigma) \\sim \\mathcal{N}(y_t|\\mu, W W^T + \\sigma^2 I_n) \\tag{3}$$\n\nThis latent variable model follows the form of probabilistic principal component analysis,\nand its parameters can be estimated from a set of examples [10]. Given a set of appearance\nsamples $Y = \\{y_1, \\ldots, y_N\\}$, the covariance matrix of $Y$ is denoted as\n$S = \\frac{1}{N} \\sum_{i=1}^{N} (y_i - \\mu)(y_i - \\mu)^T$. Let $\\{\\lambda_i | i = 1, \\ldots, n\\}$ be the eigenvalues of $S$ arranged in descending order,\ni.e., $\\lambda_i \\geq \\lambda_j$ if $i < j$. Also, define the diagonal matrix $\\Sigma_m = \\mathrm{diag}(\\lambda_1, \\ldots, \\lambda_m)$, and let\n$U_m$ be the eigenvectors that correspond to the eigenvalues in $\\Sigma_m$. Tipping and Bishop\n[10] show that the maximum likelihood estimates of $\\mu$, $W$ and $\\sigma$ can be obtained by\n\n$$\\mu = \\frac{1}{N} \\sum_{i=1}^{N} y_i, \\quad W = U_m (\\Sigma_m - \\sigma^2 I_m)^{1/2} R, \\quad \\sigma^2 = \\frac{1}{n - m} \\sum_{i=m+1}^{n} \\lambda_i \\tag{4}$$\n\nwhere $R$ is an arbitrary $m \\times m$ orthogonal rotation matrix.\nTo model all possible appearance variations of a target object (due to pose, illumination\nand view angle change), one could resort to a mixture of PPCA models. However, it not\nonly requires significant computation for estimating the model parameters but also leads\nto other serious questions such as the number of components as well as under-fitting or\nover-fitting. On the other hand, at any given time a linear PPCA model suffices to model\ngradual appearance variation if the model is constantly updated. In this paper, we use a\nsingle PPCA, and dynamically adapt the model parameters $W$, $\\mu$, and $\\sigma^2$ to account for\nappearance change.\n\n3.1.1 Probability computation with Probabilistic PCA\n\nOnce the model parameters are known, we can compute the probability that a vector $y$ is a\nsample of this generative appearance model. From (4), the log-probability is computed by\n\n$$L(W, \\mu, \\sigma^2) = -\\frac{1}{2} \\left( n \\log 2\\pi + \\log |C| + \\bar{y}^T C^{-1} \\bar{y} \\right) \\tag{5}$$\n\nwhere $\\bar{y} = y - \\mu$. Neglecting the constant terms, the log-probability is determined by\n$\\bar{y}^T C^{-1} \\bar{y}$. 
Together with $C = W W^T + \\sigma^2 I_n$ and (4), it follows that\n\n$$\\bar{y}^T C^{-1} \\bar{y} = \\bar{y}^T U_m \\Sigma_m^{-1} U_m^T \\bar{y} + \\frac{1}{\\sigma^2} \\bar{y}^T (I_n - U_m U_m^T) \\bar{y} \\tag{6}$$\n\nHere $\\bar{y}^T U_m \\Sigma_m^{-1} U_m^T \\bar{y}$ is the Mahalanobis distance of $y$ in the subspace spanned by $U_m$, and\n$\\bar{y}^T (I_n - U_m U_m^T) \\bar{y}$ is the shortest distance from $y$ to this subspace spanned by $U_m$. Usually\n$\\sigma$ is set to a small value, and consequently the probability will be determined solely by the\ndistance to the subspace. However, the choice of $\\sigma$ is not trivial. From (6), if $\\sigma$ is set to\na value much smaller than the actual one, the distance to the subspace will be favored and\nthe contribution of the Mahalanobis distance ignored, thereby rendering an inaccurate estimate.\nThe choice of $\\sigma$ is even more critical in situations where the appearance changes dynamically\nand $\\sigma$ has to be adjusted accordingly. This topic will be further examined in the\nfollowing section.\n\n3.1.2 Online Learning of Probabilistic PCA\n\nUnlike the analysis in the previous section where model parameters are estimated based on\na fixed set of training examples, our generative model has to learn and update its parameters\nonline. Starting with a single example (the appearance of the tracked object in the first\nvideo frame), our generative model constantly updates its parameters as new observations\narrive.\n\nThe equations for updating parameters are derived from (4). The update procedure of $U_m$\nand $\\Sigma_m$ is complicated since it involves the computation of eigenvalues and eigenvectors.\nHere we use a forgetting factor $\\gamma$ to put more weight on the most recent data. Denote the\nnewly arrived samples at time $t$ as $Y = \\{y_1, \\ldots, y_M\\}$. If the mean $\\mu$ is assumed fixed,\n$U_{m,t}$ and $\\Sigma_{m,t}$ can be obtained by performing singular value decomposition (SVD) on\n\n$$\\left[ \\sqrt{\\gamma} U_{m,t-1} (\\Sigma_{m,t-1})^{1/2} \\mid \\sqrt{1 - \\gamma} \\bar{Y} \\right] \\tag{7}$$\n\nwhere $\\bar{Y} = [y_1 - \\mu, \\ldots, y_M - \\mu]$. $\\Sigma_{m,t}^{1/2}$ and $U_{m,t}$ will contain the $m$ largest singular values\nand the corresponding singular vectors, respectively, at time $t$. 
This update procedure can\nbe efficiently implemented using the R-SVD algorithm, e.g., [4] [7].\n\nIf the mean constantly changes, the above update procedure cannot be applied. We\nrecently proposed a method [8] to compute the SVD with a correctly updated mean, in which $\\Sigma_{m,t}^{1/2}$\nand $U_{m,t}$ can be obtained by computing the SVD of\n\n$$\\left[ \\sqrt{\\gamma} U_{m,t-1} (\\Sigma_{m,t-1})^{1/2} \\mid \\sqrt{1 - \\gamma} \\bar{Y} \\mid \\sqrt{(1 - \\gamma)\\gamma} (\\mu_{t-1} - \\mu_Y) \\right] \\tag{8}$$\n\nwhere $\\bar{Y} = [y_1 - \\mu_Y, \\ldots, y_M - \\mu_Y]$ and $\\mu_Y = \\frac{1}{M} \\sum_{i=1}^{M} y_i$. This formulation is similar to\nthe SVD computation with the fixed mean case, and the same incremental SVD algorithm\ncan be used to compute $\\Sigma_{m,t}^{1/2}$ and $U_{m,t}$ with an extra term shown in (8).\n\nComputing and updating $\\sigma$ is more difficult than the form in (8). In the previous section,\nwe showed that an inaccurate value of $\\sigma$ will severely affect probability estimates. In order\nto have an accurate estimate of $\\sigma$ using (4), a large set of training examples is usually\nrequired. Our generative model starts with a single example and gradually adapts the model\nparameters. If we update $\\sigma$ based on (4), we will start with a very small value of $\\sigma$ since\nthere are only a few samples at our disposal at the outset, and the algorithm could quickly\nlose track of the target because of an inaccurate probability estimate. Since the training\nexamples are not permanently stored in memory, $\\lambda_i$ in (4) and consequently $\\sigma$ may not be\naccurately updated if the number of drawn samples is insufficient. These constraints lead\nus to develop a method that adaptively adjusts $\\sigma$ according to the newly arrived samples,\nwhich will be explained in the next section.\n\n3.2 Discriminative Generative Model\n\nAs observed in Section 2, the object's appearance does not change much between $o_{t-1}$ and $o_t$.\nTherefore, we can use the observation at $o_{t-1}$ to boost the likelihood measurement in $o_t$.\nThat is, we draw a set of samples (i.e., image patches) parameterized by $\\{s^i_{t-1} | i = 1, \\ldots, k\\}$\nin $o_{t-1}$ that have large $p(o_{t-1}|s^i_{t-1})$ but low posterior $p(s^i_{t-1}|O_{t-1})$. 
These are treated\nas the negative samples (i.e., samples that are not generated from the class of the target\nobject) that the generative model is likely to confuse at $O_t$.\n\nGiven a set of samples $Y' = \\{y^1, \\ldots, y^k\\}$ where $y^i$ is the appearance vector collected in $o_{t-1}$\nbased on state parameter $s^i_{t-1}$, we want to find a linear projection $V$ that projects $Y'$ onto\na subspace such that the likelihood of $Y'$ in the subspace is minimized. Let $V$ be a $p \\times n$\nmatrix. Since $p(y|W, \\mu, \\sigma)$ is a Gaussian, $p(V y|V, W, \\mu, \\sigma) \\sim \\mathcal{N}(V \\mu, V C V^T)$ is also a\nGaussian. The log likelihood is computed by\n\n$$L(V, W, \\mu, \\sigma) = -\\frac{k}{2} \\left( p \\log(2\\pi) + \\log |V C V^T| + \\mathrm{tr}\\left( (V C V^T)^{-1} V S' V^T \\right) \\right) \\tag{9}$$\n\nwhere $S' = \\frac{1}{k} \\sum_{i=1}^{k} (y^i - \\mu)(y^i - \\mu)^T$.\nTo facilitate the following analysis we first assume $V$ projects $Y'$ to a 1-D space, i.e., $p = 1$\nand $V = v^T$, and thus\n\n$$L(V, W, \\mu, \\sigma) = -\\frac{k}{2} \\left( \\log(2\\pi) + \\log |v^T C v| + \\frac{v^T S' v}{v^T C v} \\right) \\tag{10}$$\n\nNote that $v^T C v$ is the variance of the object samples in the projected space, and we need\nto impose a constraint, e.g., $v^T C v = 1$, to ensure that the minimum likelihood solution of\n$v$ does not increase the variance in the projected space. With $v^T C v = 1$, the optimization\nproblem becomes\n\n$$v = \\arg\\max_{\\{v | v^T C v = 1\\}} v^T S' v = \\arg\\max_{v} \\frac{v^T S' v}{v^T C v} \\tag{11}$$\n\nThus, we obtain an equation exactly like the Fisher discriminant analysis for a binary classification\nproblem. In (11), $v$ is a projection that keeps the object's samples in the projected\nspace close to $\\mu$ (with variance $v^T C v = 1$), while keeping negative samples in $Y'$ away\nfrom $\\mu$. The optimal value of $v$ is the generalized eigenvector of $S'$ and $C$ that corresponds\nto the largest eigenvalue. In the general case, it follows that\n\n$$V = \\arg\\max_{\\{V | V C V^T = I\\}} |V S' V^T| = \\arg\\max_{V} \\frac{|V S' V^T|}{|V C V^T|} \\tag{12}$$\n\nwhere $V$ can be obtained by solving a generalized eigenvalue problem of $S'$ and $C$. By\nprojecting observation samples onto a low-dimensional subspace, we enhance the discriminative\npower of the generative model. 
Meanwhile, we reduce the time required to\ncompute probabilities, which is also a critical improvement for real-time applications like\nobject tracking.\n\n3.2.1 Online Update of Discriminative Analysis\n\nThe computation of the projection matrix $V$ depends on matrices $C$ and $S'$. In Section\n3.1.2, we have shown the procedures to update $C$. The same procedures can be used to\nupdate $S'$. Let $\\mu_{Y'} = \\frac{1}{k} \\sum_{i=1}^{k} y^i$ and $S_{Y'} = \\frac{1}{k} \\sum_{i=1}^{k} (y^i - \\mu_{Y'})(y^i - \\mu_{Y'})^T$; then\n\n$$S' = \\frac{1}{k} \\sum_{i=1}^{k} (y^i - \\mu)(y^i - \\mu)^T = S_{Y'} + (\\mu - \\mu_{Y'})(\\mu - \\mu_{Y'})^T \\tag{13}$$\n\nGiven $S'$ and $C$, $V$ is computed by solving a generalized eigenvalue problem. If we decompose\n$S' = A^T A$ and $C = B^T B$, then we can find $V$ more efficiently using generalized\nsingular value decomposition. Let $U_{Y'}$ and $\\Sigma_{Y'}$ denote the SVD of $S_{Y'}$. By\nletting $A = [U_{Y'} \\Sigma_{Y'}^{1/2} \\mid (\\mu - \\mu_{Y'})]^T$ and $B = [U_m (\\Sigma_m - \\sigma^2 I_m)^{1/2} \\mid \\sigma I_n]^T$, we obtain $S' = A^T A$ and\n$C = B^T B$.\n\nAs is detailed in [4], $V$ can be computed by first performing a QR factorization:\n\n$$\\begin{bmatrix} A \\\\\\\\ B \\end{bmatrix} = \\begin{bmatrix} Q_A \\\\\\\\ Q_B \\end{bmatrix} R \\tag{14}$$\n\nand computing the singular value decomposition of $Q_A$:\n\n$$Q_A = U_A D_A V_A^T \\tag{15}$$\n\nWe then obtain $V = R^{-1} V_A$. The rank of $A$ is usually small in vision applications, and $V$\ncan be computed efficiently, thereby facilitating the tracking process.\n\n4 Proposed Tracking Algorithm\n\nIn this section, we summarize the proposed tracking algorithm and demonstrate how the\nabovementioned learning and inference algorithms are incorporated for object tracking.\nOur algorithm localizes the tracked object in each video frame using a rectangular window.\nA state $s$ is a length-5 vector, $s = (x, y, \\theta, w, h)$, that parameterizes the window's position\n$(x, y)$, orientation ($\\theta$) and width and height $(w, h)$. 
The proposed algorithm is based on\nthe maximum likelihood estimate (i.e., the most probable location of the object) given all the\nobservations up to that time instance, $s^*_t = \\arg\\max_{s_t} p(s_t|O_t)$.\nWe assume that the state transition is a Gaussian distribution, i.e.,\n\n$$p(s_t|s_{t-1}) \\sim \\mathcal{N}(s_{t-1}, \\Sigma_s) \\tag{16}$$\n\nwhere $\\Sigma_s$ is a diagonal matrix. According to this distribution, the tracker then draws $N$\nsamples $S_t = \\{c_1, \\ldots, c_N\\}$ which represent the possible locations of the target. Denote by $y^i_t$\nthe appearance vector of $o_t$ at $c_i$, and by $Y_t = \\{y^1_t, \\ldots, y^N_t\\}$ the set of appearance vectors that\ncorresponds to the set of state vectors $S_t$. The posterior probability that the tracked object\nis at $c_i$ in video frame $o_t$ is then defined as\n\n$$p(s_t = c_i|O_t) = \\kappa p(y^i_t|V, W, \\mu, \\sigma) p(s_t = c_i|s^*_{t-1}) \\tag{17}$$\n\nwhere $\\kappa$ is a constant. Therefore, $s^*_t = \\arg\\max_{c_i \\in S_t} p(s_t = c_i|O_t)$.\nOnce $s^*_t$ is determined, the corresponding observation $y_t$ will be a new example to update\n$W$ and $\\mu$. Appearance vectors $y^i_t$ with large $p(y^i_t|V, W, \\mu, \\sigma)$ but whose corresponding state\nparameters $c_i$ are away from $s^*_t$ will be used as new examples to update $V$.\n\nOur tracking assumes $o_1$ and $s^*_1$ are given (through object detection) and thus obtains the\nfirst appearance vector $y_1$, which in turn is used as the initial value of $\\mu$, but $V$ and $W$ are\nunknown at the outset. When $V$ and $W$ are not available, our tracking algorithm is based\non template matching (with $\\mu$ being the template). The matrix $W$ is computed after a small\nnumber of appearance vectors are observed. When $W$ is available, we can then start to\ncompute and update $V$ accordingly.\n\nAs mentioned in Section 3.1.1, it is difficult to obtain an accurate estimate of $\\sigma$. In our\ntracking system, we adaptively adjust $\\sigma$ according to $\\Sigma_m$ in $W$. We set $\\sigma$ to be a fixed\nfraction of the smallest eigenvalue in $\\Sigma_m$. 
This will ensure the distance measurement\nin (6) will not be biased to favor either the Mahalanobis distance in the subspace or the\ndistance to the subspace.\n\n5 Experimental Results\n\nWe tested the proposed algorithm with numerous object tracking experiments. To examine\nwhether our model is able to adapt and track objects in a dynamically changing\nenvironment, we recorded videos containing appearance deformation, large illumination\nchange, and large pose variations. All the image sequences consist of $320 \\times 240$\npixel grayscale videos, recorded at 30 frames/second and 256 gray-levels per pixel. The\nforgetting term $\\gamma$ is empirically selected as 0.85, and the batch size for update is set to 5\nas a trade-off between computational efficiency and effectiveness of modeling appearance\nchange due to fast motion. More experimental results and videos can be found at\nhttp://www.ifp.uiuc.edu/~rlin1/adgm.html.\n\nFigure 1: A target undergoes pose and lighting variation.\n\nFigures 1 and 2 show snapshots of some tracking results enclosed with rectangular windows.\nThere are two rows of images below each video frame. The first row shows the\nsampled images in the current frame that have the largest likelihoods of being the target\nlocations according to our discriminative generative model. The second row shows the sample\nimages in the current video frame that are selected online for updating the discriminative\ngenerative model.\n\nThe results in Figure 1 show that our method is able to track targets undergoing pose and\nlighting change. Figure 2 shows tracking results where the object appearances change\nsignificantly due to variation in pose and lighting as well as cast shadows. These experiments\ndemonstrate that our tracking algorithm is able to follow objects even when there\nis a large appearance change due to pose or lighting variation. 
We have also tested these\ntwo sequences with a conventional view-based eigentracker [2] and a template-based method.\nEmpirical results show that such methods do not perform well as they do not update the\nobject representation to account for appearance change.\n\nFigure 2: A target undergoes large lighting and pose variation with cast shadows.\n\n6 Conclusion\n\nWe have presented a discriminative generative framework that generalizes the conventional\nFisher Linear Discriminant algorithm with a proper probabilistic interpretation. For object\ntracking, we aim to find a discriminative generative model that best separates the target\nclass from the background. With a computationally efficient algorithm that constantly updates\nthis discriminative model as time progresses, our method adapts the discriminative\ngenerative model to account for appearance variation of the target and background, thereby\nfacilitating the tracking task in different situations. Our experiments show that the proposed\nmodel is able to learn a discriminative generative model for tracking target objects\nundergoing large pose and lighting changes. We also plan to apply the proposed method to\nother problems that deal with non-stationary data streams in our future work.\n\nReferences\n\n [1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984.\n [2] M. J. Black and A. D. Jepson. Eigentracking: Robust matching and tracking of articulated objects using\n view-based representation. In B. Buxton and R. Cipolla, editors, Proceedings of the Fourth European\n Conference on Computer Vision, LNCS 1064, pp. 329–342. Springer Verlag, 1996.\n [3] R. T. Collins and Y. Liu. On-line selection of discriminative tracking features. In Proceedings of the Ninth\n IEEE International Conference on Computer Vision, volume 1, pp. 346–352, 2003.\n [4] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.\n [5] M. Isard and A. Blake. 
Contour tracking by stochastic propagation of conditional density. In B. Buxton\n and R. Cipolla, editors, Proceedings of the Fourth European Conference on Computer Vision, LNCS 1064,\n pp. 343–356. Springer Verlag, 1996.\n [6] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, 1999.\n [7] A. Levy and M. Lindenbaum. Sequential Karhunen-Loeve basis extraction and its application to images.\n IEEE Transactions on Image Processing, 9(8):1371–1374, 2000.\n [8] R.-S. Lin, D. Ross, J. Lim, and M.-H. Yang. Incremental subspace update with running mean.\n Technical report, Beckman Institute, University of Illinois at Urbana-Champaign, 2004. Available at\n http://www.ifp.uiuc.edu/~rlin1/isuwrm.pdf.\n [9] D. Ross, J. Lim, and M.-H. Yang. Adaptive probabilistic visual tracking with incremental subspace update.\n In T. Pajdla and J. Matas, editors, Proceedings of the Eighth European Conference on Computer Vision,\n LNCS 3022, pp. 470–482. Springer Verlag, 2004.\n[10] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical\n Society, Series B, 61(3):611–622, 1999.\n", "award": [], "sourceid": 2642, "authors": [{"given_name": "Ruei-sung", "family_name": "Lin", "institution": null}, {"given_name": "David", "family_name": "Ross", "institution": null}, {"given_name": "Jongwoo", "family_name": "Lim", "institution": null}, {"given_name": "Ming-Hsuan", "family_name": "Yang", "institution": null}]}