{"title": "A Three Tiered Approach for Articulated Object Action Modeling and Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 841, "page_last": 848, "abstract": null, "full_text": "A Three Tiered Approach for Articulated Object\n Action Modeling and Recognition\n\n\n\n Le Lu Gregory D. Hager Laurent Younes\n Department of Computer Science Center of Imaging Science\n Johns Hopkins University Johns Hopkins University\n Baltimore, MD 21218 Baltimore, MD 21218\n lelu/hager@cs.jhu.edu younes@cis.jhu.edu\n\n\n\n\n Abstract\n\n Visual action recognition is an important problem in computer vision.\n In this paper, we propose a new method to probabilistically model and\n recognize actions of articulated objects, such as hand or body gestures,\n in image sequences. Our method consists of three levels of representa-\n tion. At the low level, we first extract a feature vector invariant to scale\n and in-plane rotation by using the Fourier transform of a circular spatial\n histogram. Then, spectral partitioning [20] is utilized to obtain an initial\n clustering; this clustering is then refined using a temporal smoothness\n constraint. Gaussian mixture model (GMM) based clustering and density\n estimation in the subspace of linear discriminant analysis (LDA) are then\n applied to thousands of image feature vectors to obtain an intermediate\n level representation. Finally, at the high level we build a temporal multi-\n resolution histogram model for each action by aggregating the clustering\n weights of sampled images belonging to that action. 
We discuss how this high level representation can be extended to achieve temporal scaling invariance and to include bi-gram or multi-gram transition information. Both image clustering and action recognition/segmentation results are given to show the validity of our three tiered representation.\n\n1 Introduction\n\nArticulated object action modeling, tracking and recognition has been an important research issue in the computer vision community for decades. Past approaches [3, 13, 4, 6, 23, 2] have used many different kinds of direct image observations, including color, edges, contour or moments [14], to fit a hand or body's shape model and motion parameters.\n\nIn this paper, we propose to learn a small set of object appearance descriptors, and then to build an aggregated temporal representation of clustered object descriptors over time. There are several obvious reasons to base gesture or motion recognition on a time sequence of observations. First, most hand or body postures are ambiguous. For example, in American Sign Language, 'D' and 'G', and 'H' and 'U', have indistinguishable appearances from some viewpoints. Furthermore, these gestures are difficult to track from frame to frame due to motion blur, lack of features, and complex self-occlusions. By modeling hand/body gestures as a sequential learning problem, appropriate discriminative information can be retrieved and more action categories can be handled.\n\nIn related work, Darrell and Pentland [7] use dynamic time warping (DTW) to align and recognize a space-time gesture against a stored library. To build the library, key views are selected from an incoming video by choosing views that have low correlation with all current views. This approach is empirical and does not guarantee any sort of global consistency of the chosen views. As a result, recognition may be unstable. 
In comparison, our method describes image appearances uniformly and clusters them globally from a training set containing different gestures.\n\nFor static hand posture recognition, Tomasi et al. [24] apply vector quantization methods to cluster images of different postures and different viewpoints. This is a feature-based approach, with thousands of features extracted for each image. However, clustering in a high dimensional space is very difficult and can be unstable. We argue that fewer, more global features are adequate for the purposes of gesture recognition. Furthermore, the circular histogram representation has adjustable spatial resolution to accommodate differing appearance complexities, and it is translation, rotation, and scale invariant.\n\nIn other work, [27, 9] recognize human actions at a distance by computing motion information between images and relying on temporal correlation of motion vectors across sequences. Our work also makes use of motion information, but does not rely exclusively on it. Rather, we combine appearance and motion cues to increase sensitivity beyond what either can provide alone. Since our method is based on the temporal aggregation of image clusters as a histogram to recognize an action, it can also be considered a temporal texton-like method [17, 16]. One advantage of the aggregated histogram model in a time-series is that it is straightforward to accommodate temporal scaling by using a sliding window. In addition, higher order models corresponding to bigrams or trigrams of simpler \"gestemes\" can also be naturally employed to extend the descriptive power of the method.\n\nIn summary, there are four principal contributions in this paper. First, we propose a new scale/rotation-invariant hand image descriptor which is stable, compact and representative. Second, we introduce a method for sequential smoothing of clustering results. 
Third, we show that LDA/GMM with spectral partitioning initialization is an effective way to learn well-formed probability densities for clusters. Finally, we recognize image sequences as actions efficiently based on a flexible histogram model. We also discuss improvements to the method by incorporating motion information.\n\n2 A Three Tiered Approach\n\nWe propose a three tiered approach for dynamic action modeling comprising low level feature extraction, intermediate level feature vector clustering and high level histogram recognition, as shown in Figure 1.\n\n[Figure 1 is a block diagram of the three tiers. Low level: probabilistic foreground map (GMM color segmentation, probabilistic appearance modeling, or dynamic texture segmentation by GPCA); feature extraction via circular/Fourier representation; feature dimension reduction via variance analysis. Intermediate level: framewise clustering via spectral segmentation (initialization); temporally constrained clustering (refinement); GMM/LDA density modeling for clusters. High level: temporally aggregated multiresolution histogram for actions; from unigram to bigram and multigram histogram models; temporal pyramids and scaling.]\n\nFigure 1: Diagram of a three tier approach for dynamic articulated object action modeling.\n\nFigure 2: (a) Image after background subtraction (b) GMM based color segmentation (c) Circular histogram for feature extraction.\n\n2.1 Low Level: Rotation Invariant Feature Extraction\n\nIn the low level image processing, our goals are to locate the region of interest in an image and to extract a scale and in-plane rotation invariant feature vector as its descriptor. In order to accomplish this, a reliable and stable foreground model of the target in question is expected. 
Depending on the circumstances, a Gaussian mixture model (GMM) for segmentation [15], probabilistic appearance modeling [5], or dynamic object segmentation by Generalized Principal Component Analysis (GPCA) [25] are possible solutions. In this paper, we apply a GMM for hand skin color segmentation.\n\nWe fit a GMM by first performing a simple background subtraction to obtain a noisy foreground containing a hand object (shown in Figure 2 (a)). From this, more than 1 million RGB pixels are used to train skin and non-skin color density models with 10 Gaussian kernels for each class. Having done this, for new images a probability density ratio Pskin/Pnonskin of these two classes is computed. If Pskin/Pnonskin is larger than 1, the pixel is classified as skin (foreground); otherwise it is background. A morphological operator is then used to clean up this initial segmentation and create a binary mask for the hand object. We then compute the centroid and second central moments of this 2D mask. A circle is defined about the target by setting its center as the centroid and its radius as 2.8 times the largest eigenvalue of the second central moment matrix (covering over 99% of the skin pixels in Figure 2 (c)). This circle is then divided into 6 concentric annuli which contain 1, 2, 4, 8, 16, 32 bins from inner to outer, respectively. Since the position and size of this circular histogram are determined by the color segmentation, it is translation and scale invariant.\n\nWe then normalize the density values so that Pskin + Pnonskin = 1 for every pixel within the foreground mask (Figure 2) over the hand region. For each bin of the circular histogram, we calculate the mean of Pskin (-log(Pskin) or -log(Pskin/Pnonskin) are also possible choices) over the pixels in that bin as its value. The values of all bins along each circle form a vector, and a 1D Fourier transform is applied to this vector. 
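The annular binning and Fourier step can be sketched in a few lines of numpy. This is a minimal illustration rather than the paper's implementation: the inputs `p_skin` (skin-probability map) and `mask` (binary foreground), and the use of the square root of the largest eigenvalue to convert the moment to a pixel radius, are our assumptions; only the bin counts (1, 2, 4, 8, 16, 32), the 2.8 radius factor, and the per-annulus Fourier power spectrum follow the text.

```python
import numpy as np

def circular_fourier_descriptor(p_skin, mask, bins_per_annulus=(1, 2, 4, 8, 16, 32)):
    """Translation/scale-invariant annular histogram of P_skin, made
    in-plane-rotation invariant by keeping only each annulus's Fourier
    power spectrum (a sketch of the Section 2.1 descriptor)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    centroid = pts.mean(axis=0)
    # Radius from the largest eigenvalue of the second central moments;
    # taking the square root to get pixel units is our assumption.
    cov = np.cov((pts - centroid).T)
    radius = 2.8 * np.sqrt(np.linalg.eigvalsh(cov).max())

    d = np.linalg.norm(pts - centroid, axis=1)
    theta = np.arctan2(pts[:, 1] - centroid[1], pts[:, 0] - centroid[0])
    n_annuli = len(bins_per_annulus)
    ring = np.minimum((d / radius * n_annuli).astype(int), n_annuli - 1)

    feature = []
    vals = p_skin[ys, xs]
    for r, n_bins in enumerate(bins_per_annulus):
        sel = ring == r
        sector = ((theta[sel] + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
        counts = np.bincount(sector, minlength=n_bins)
        sums = np.bincount(sector, weights=vals[sel], minlength=n_bins)
        hist = np.zeros(n_bins)
        hist[counts > 0] = sums[counts > 0] / counts[counts > 0]  # mean P_skin per bin
        # |FFT|^2 of the angular vector is invariant to cyclic shifts,
        # i.e., to in-plane rotations of the hand.
        feature.extend(np.abs(np.fft.fft(hist)) ** 2)
    return np.array(feature)  # 1+2+4+8+16+32 = 63 dimensions
```

A 180-degree image rotation, for instance, cyclically shifts every annulus by a whole number of bins and therefore leaves the descriptor unchanged.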
The power spectra of all annuli are ordered into a linear list producing a feature vector f(t) of 63 dimensions representing the appearance of a hand image.1 Note that the use of the Fourier power spectrum of the annuli makes the representation rotation invariant.\n\n2.2 Intermediate Level: Clustering Presentation for Image Frames\n\nAfter the low level processing, we obtain a scale and rotation invariant feature vector as an appearance representation for each image frame. The temporal evolution of these feature vectors represents actions. However, not all the images are actually unique in appearance.\n\n1 An optional dimension reduction of the feature vectors can be achieved by eliminating dimensions which have low variance; feature values in those dimensions change little across the data and are therefore non-informative.\n\nAt the intermediate level, we cluster images from a set of feature vectors. This frame-wise clustering is critical for dimension reduction and the stability of high level recognition.\n\nInitializing Clusters by Spectral Segmentation There are two critical problems with clustering algorithms: determining the true number of clusters and initializing each cluster. Here we use a spectral clustering method [20, 22, 26, 18] to solve both problems. We first build the affinity matrix of pairwise distances between feature vectors.2 We then perform a singular value decomposition on the affinity matrix with proper normalization [20]. The number of clusters is determined by choosing the n dominant eigenvalues. 
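The spectral initialization can be sketched as follows. This is an illustrative version, not the paper's exact procedure: the Gaussian affinity bandwidth `sigma` and the simple eigenvalue-ratio rule `eig_threshold` for picking n are assumptions; the normalization follows the standard recipe of [20].

```python
import numpy as np

def spectral_embedding(features, sigma=1.0, eig_threshold=0.9):
    """Affinity matrix of pairwise feature distances, Ng-Jordan-Weiss
    style normalization, and selection of the number of clusters n from
    the dominant eigenvalues (sketch)."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    D = A.sum(axis=1)
    L = A / np.sqrt(np.outer(D, D))              # D^{-1/2} A D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)         # ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep the n dominant eigenvalues (here: within a fraction of the largest).
    n = int((eigvals > eig_threshold * eigvals[0]).sum())
    V = eigvecs[:, :n]
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # rows on the unit sphere
    return n, V
```

The rows of `V` are the per-frame points that the center-selection step (Equation (1)) and K-means then operate on.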
The corresponding eigenvectors are taken as an orthogonal subspace for all the data.\n\nTo get n cluster centers, we take the approach of [20] and choose vectors that minimize the absolute value of the cosine between any two cluster centers:\n\nID(k) = \mathrm{rand}(0, N) for k = 1, and ID(k) = \arg\min_{t=1..N} \sum_{c=1}^{k-1} | \cos( f^n(ID(c)), f^n(t) ) | for n \ge k > 1, \qquad (1)\n\nwhere f^n(t) is the feature vector of image frame t after the numerical normalization in [20] and ID(k) is the image frame number chosen for the center of cluster k. N is the number of images used for spectral clustering. For better clustering results, multiple restarts are used for initialization.\n\nUnlike [18], we find this simple clustering procedure is sufficient to obtain a good set of clusters from only a few restarts. After initialization, K-means [8] is used to smooth the centers. Let C_1(t) denote the class label for image t, and let g(c) = f(ID(c)), c = 1...n, denote the cluster centers.\n\nRefinement: Temporally Constrained Clustering Spectral clustering methods are designed for an unordered \"bag\" of feature vectors, but, in our case, the temporal ordering of the images is an important source of information. In particular, the stability of appearance is easily estimated by computing the motion energy3 between two frames. Let M(t) denote the motion energy between frames t and t-1. Define T_{k,j} = \{ t \,|\, C_1(t) = k, C_1(t-1) = j \} and \bar{M}(k, j) = \sum_{t \in T_{k,j}} M(t) / |T_{k,j}|. We now create a regularized clustering cost function as\n\nC_2(t) = \arg\max_{c=1..n} \left[ \frac{e^{-\|f(t)-g(c)\|}}{\sum_{c=1}^{n} e^{-\|f(t)-g(c)\|}} + \frac{\lambda}{M(t)} \cdot \frac{e^{-\|g(c)-g(C_2(t-1))\| / \bar{M}(c, C_2(t-1))}}{\sum_{c=1}^{n} e^{-\|g(c)-g(C_2(t-1))\| / \bar{M}(c, C_2(t-1))}} \right] \qquad (2)\n\nwhere \lambda is the weighting parameter. Here the motion energy M(t) plays a role akin to the temperature T in simulated annealing. 
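A schematic version of the temporally constrained re-labeling might look as follows. The exact weighting in Equation (2) is difficult to recover from the text, so the smoothness term below is a simplified stand-in: a softmax over center distances whose influence scales as lambda/M(t); `lam` and `eps` are illustrative parameters, not values from the paper.

```python
import numpy as np

def refine_labels(F, G, M, init_labels, lam=1.0, eps=1e-6):
    """Schematic temporally constrained re-labeling (cf. Equation (2)).

    F: (T, d) frame features, G: (n, d) cluster centers, M: (T,) motion
    energy between frame t and t-1. A softmax appearance term is combined
    with a smoothness term whose weight shrinks when motion is strong,
    so M acts like a temperature."""
    T = F.shape[0]
    labels = init_labels.copy()
    # Appearance term: softmax over distances to the cluster centers.
    app = np.exp(-np.linalg.norm(F[:, None, :] - G[None, :, :], axis=2))
    app /= app.sum(axis=1, keepdims=True)
    for t in range(1, T):
        prev = labels[t - 1]
        # Smoothness term: prefer centers close to the previous frame's center.
        smooth = np.exp(-np.linalg.norm(G - G[prev], axis=1))
        smooth /= smooth.sum()
        score = app[t] + (lam / (M[t] + eps)) * smooth
        labels[t] = int(np.argmax(score))
    return labels
```

With low motion energy the smoothness term dominates and an isolated mislabeled frame is pulled back to its neighbors' label; with high motion energy the appearance term decides freely.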
When it is high (strong motion between frames), the motion continuity condition is violated and the labels of successive frames can change freely; when it is low, the smoothness term constrains the possible transitions between classes with low \bar{M}(k, j).\n\nWith this in place, we now scan through the sequence searching for the C_2(t) of maximum value, given that C_2(t-1) is already fixed.4 This temporal smoothing is most relevant for images with motion; static frames are already stably clustered and therefore their cluster labels do not change.\n\n2 The exponent of either the Euclidean distance or the cosine distance between two feature vectors can be used in this case.\n\n3 A simple method is to compute the motion energy as the Sum of Squared Differences (SSD) obtained by subtracting the two Pskin density masses of successive images.\n\n4 Note that \bar{M}(k, j) changes after scanning the labels of the image sequence once, so further iterations could be used to achieve more accurate temporal smoothness of C_2(t), t = 1..N. In our experiments, more iterations do not change the result much.\n\nGMM for Density Modeling and Smoothing Given clusters, we build a probability density model for each. A Gaussian Mixture Model [11, 8] is used to gain good local relaxation based on the initial clustering result provided by the above method and good generalization for new data. Due to the curse of dimensionality, it is difficult to obtain a good estimate of a high dimensional density function with limited and highly varied training data. We introduce an iterative method incorporating Linear Discriminant Analysis (LDA) [8] and a GMM in an EM-like fashion to perform dimension reduction. The initial clustering labels are used to build the scatter matrices for LDA. The optimal projection matrix of LDA is then obtained from the decomposition of the clusters' scatter matrices [8]. 
The original feature vectors can then be projected into a low dimensional space, which improves the estimation of the multivariate Gaussian density functions. With the new clustering result from the GMM, LDA's scatter matrices and projection matrix can be re-estimated, and the GMM can in turn be re-fit in the new LDA subspace. In our experiments this loop converges within 3-5 iterations. Intuitively, LDA projects the data into a low dimensional subspace where the image clusters are well separated, which helps obtain good GMM parameter estimates from limited data. A more accurate GMM produces more accurate clustering results, which in turn yield a better LDA estimate. A theoretical proof of convergence is under way. After this process, we have a Gaussian density model for each cluster.\n\n2.3 High Level: Aggregated Histogram Model for Action Recognition\n\nGiven a set of n clusters, define w(t) = [p_{c_1}(f(t)), p_{c_2}(f(t)), ..., p_{c_n}(f(t))]^T, where p_x(y) denotes the density value of the vector y with respect to the GMM for cluster x. An action is then a trajectory [w(t_1), w(t_1+1), ..., w(t_2)]^T in R^n. For recognition purposes, we want to calculate discriminative statistics from each trajectory. One natural way is to use its mean over time, H_{t_1,t_2} = \sum_{t=t_1}^{t_2} w(t) / (t_2 - t_1 + 1), which is a temporally weighted histogram. Note that the bins of the histogram H_{t_1,t_2} correspond precisely to the trained clusters.\n\nFrom the training set, we aggregate the cluster weights of the images within a given hand action to form a histogram model. In this way, a temporal image sequence corresponding to one action is represented by a single vector. Matching two actions is then equivalent to computing the similarity of two histograms, for which several metrics exist. 
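The cluster-weight vector w(t) and its temporal mean can be sketched as follows. For brevity this sketch uses a single Gaussian per cluster, whereas the paper fits a GMM per cluster in the LDA subspace; the function and variable names are ours.

```python
import numpy as np

def cluster_weights(f, means, covs):
    """w(t): density of feature vector f under each cluster's model
    (one Gaussian per cluster here, as a simplification of the GMM)."""
    w = []
    for mu, cov in zip(means, covs):
        d = f - mu
        k = len(mu)
        inv = np.linalg.inv(cov)
        norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
        w.append(norm * np.exp(-0.5 * d @ inv @ d))
    return np.array(w)

def action_histogram(features, means, covs):
    """H_{t1,t2}: temporal mean of the per-frame cluster weight vectors."""
    W = np.array([cluster_weights(f, means, covs) for f in features])
    return W.mean(axis=0)
```

Each action sequence thus collapses to one n-dimensional vector whose bins are the learned clusters.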
Here we use the Bhattacharyya similarity metric [1], which has several useful properties: it is an approximation of the \chi^2 test statistic with fixed bias; it is self-consistent; it does not have a singularity problem when matching empty histogram bins; and its value is properly bounded within [0, 1]. Assume we have a library of action histograms H_1, H_2, ..., H_M; the class label of a new action \hat{H}_{t_1,t_2} is determined by the following equation:\n\nL(\hat{H}_{t_1,t_2}) = \arg\min_{l=1..M} D(H_l, \hat{H}_{t_1,t_2}) = \arg\min_{l=1..M} \left[ 1 - \sum_{c=1}^{n} \sqrt{ H_l(c) \hat{H}_{t_1,t_2}(c) } \right]^{1/2} \qquad (3)\n\nThis method is low cost because only one exemplar per action category is needed.\n\nOne problem with this method is that all sequence information has been compressed; e.g., we cannot distinguish an opening hand gesture from a closing hand using only one histogram. This problem can be easily solved by subdividing the sequence and histogram model into m parts: H^m_{t_1,t_2} = [H_{t_1,(t_1+t_2)/m}, ..., H_{(t_1+t_2)(m-1)/m, t_2}]^T. In the extreme case where each subsequence is a single frame, the histogram model simply becomes the vector form of the representative surface.\n\nWe intend to classify hand actions with speed differences into the same category. To achieve this, the image frames within a hand action can be sub-sampled to build a set of temporal pyramids. In order to segment hand gestures from a long video sequence, we create several sliding windows with different frame sampling rates. The proper time scaling magnitude is found by searching for the best fit over the temporal pyramids.\n\nTaken together, the histogram representation achieves an adjustable multi-resolution measurement to describe actions. A Hidden Markov Model (HMM) with discrete observations could also be employed to train models for different hand actions, but more template samples per gesture class would be required. 
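The Bhattacharyya matching of Equation (3) is straightforward to implement; a small sketch, assuming the histograms are normalized:

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """D(H1, H2) = sqrt(1 - sum_c sqrt(H1(c) * H2(c))); bounded in [0, 1]
    for normalized histograms, with no singularity at empty bins."""
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(h1 * h2))))

def classify_action(h_new, library):
    """Equation (3): index of the library histogram with minimal distance."""
    dists = [bhattacharyya_distance(h, h_new) for h in library]
    return int(np.argmin(dists))
```

Because the distance is a single pass over n bins, matching against the whole library of M exemplars is O(Mn), which is what makes the one-exemplar-per-category scheme cheap.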
The histogram recognition method has the additional advantage that it does not depend on extremely accurate frame-wise clustering. A small proportion of incorrect labels does not affect the matching value much. In comparison, in an HMM with few training samples, outliers seriously impact the accuracy of learning. From the viewpoint of considering hand actions as a language process, our model is an integration of individual observations (by labelling each frame with a set of learned clusters) from different time slots. The label transitions between successive frames are not used to describe the temporal sequence. By subdividing the histogram, we extend the representation to contain bigram, trigram, etc. information.\n\n3 Results\n\nWe have tested our three tiered method on the problem of recognizing sequences of hand spelling gestures.\n\nFramewise clustering. We first evaluate the low level representation of single images and the intermediate level clustering algorithms. A training set of 3015 images is used. The frame-to-frame motion energy is used to label images as static or dynamic. For spectral clustering, 3-4 restarts from both the dynamic and static sets are sufficient to cover all the modes in the training set. Then, temporal smoothing is employed and a Gaussian density is calculated for each cluster in a 10 dimensional subspace of the LDA projection. As a result, 24 clusters are obtained, comprising 16 static and 8 dynamic modes. Figure 3 shows the 5 frames closest to the mean of the probability density of clusters 1, 3, 19, 5, 13, 8, 21, 15, 6 and 12. It can be seen that the clustering results are insensitive to artifacts of the skin segmentation. From Figure 3, it is also clear that the dynamic modes have significantly larger covariance determinants than the static ones. A study of the eigenvalues of the covariance matrices shows that their super-ellipsoid shapes extend over 2-3 dimensions for static clusters and 6-8 dimensions for dynamic clusters. 
Taken together, this means that static clusters are quite tight, while dynamic clusters contain much more in-class variation. From Figure 4 (c), dynamic clusters gain more weight during the smoothing process incorporating the temporal constraint and the subsequent GMM refinement.\n\nFigure 3: Image clustering results after low and intermediate level processing.\n\nAction recognition and segmentation. For testing images, we first project their feature vectors into the LDA subspace. Then, the GMM is used to compute their weights with respect to each cluster. We manually choose 100 sequences for testing purposes, and compute their similarities with respect to a library of 25 gestures. The length of the action sequences was 9-38 frames. The temporal scale of actions in the same category ranged from 1 to 2.4. The results were recognition rates of 90% and 93% without/with temporal smoothing (Equation 2). Including the top three candidates, the recognition rates increase to 94% and 96%, respectively. We also used the learned model and a sliding window with temporal scaling to segment actions from a 6034 frame video sequence containing dynamic gestures and static hand postures. The similarity matrix among the 123 actions found in the video is shown in Figure 4 (d). 106 out of the 123 actions (86.2%) are correctly segmented and recognized.\n\nFigure 4: (a) Affinity matrix of 3015 images. (b) Affinity matrices of cluster centroids (from upper left to lower right) after spectral clustering, temporal smoothing and GMM. (c) Labelling results of 3015 images (red squares are frames whose labels changed in the smoothing process after spectral clustering). (d) The similarity matrix of segmented hand gestures; the letters are labels of gestures.\n\nIntegrating motion information. 
As noted previously, our method cannot distinguish opening/closing hand gestures without temporally subdividing the histograms. An alternative solution is to integrate motion information5 between frames. Motion feature vectors are also clustered, which results in a joint (appearance and motion) histogram model for actions. We assume independence of the data and therefore simply concatenate these two histograms into a single action representation. In our preliminary experiments, motion integration and histogram subdivision are comparably effective at recognizing gestures with opposite directions.\n\n4 Conclusion and Discussion\n\nWe have presented a method for classifying the motion of articulated gestures using LDA/GMM-based clustering methods and a histogram-based model of temporal evolution. Using this model, we have obtained extremely good recognition results using a relatively coarse representation of appearance and motion in images.\n\nThere are three main methods to improve the performance of histogram-based classification, i.e., adaptive binning, adaptive subregion, and adaptive weighting [21]. In our approach, adaptive binning of the histogram is automatically learned by our clustering algorithms; adaptive subregion is realized by subdividing action sequences to enrich the histogram's descriptive capacity in the temporal domain; and adaptive weighting is achieved from the trained weights of the Gaussian kernels in the GMM.\n\nOur future work will focus on building a larger hand action database containing 50-100\n\n5 Motion information can be extracted by first aligning two hand blobs, subtracting the two skin-color density masses, and then using the same circular histogram as in Section 2.1 to extract a feature vector for the positive and negative density residues respectively. 
Another simple way is to subtract two frames' feature vectors directly.\n\ncategories for more extensive testing, and on extending the representation to include other types of image information (e.g. contour information). Also, by finding an effective foreground segmentation module, we intend to apply the same methods to other applications such as recognizing stylized human body motion.\n\nReferences\n\n[1] F. Aherne, N. Thacker, and P. Rockett, The Bhattacharyya Metric as an Absolute Similarity Measure for Frequency Coded Data, Kybernetika, 34:4, pp. 363-368, 1998.\n[2] V. Athitsos and S. Sclaroff, Estimating 3D Hand Pose From a Cluttered Image, CVPR, 2003.\n[3] M. Brand, Shadow Puppetry, ICCV, 1999.\n[4] R. Bowden and M. Sarhadi, A Non-linear Model of Shape and Motion for Tracking Finger Spelt American Sign Language, Image and Vision Computing, 20:597-607, 2002.\n[5] T. Cootes, G. Edwards and C. Taylor, Active Appearance Models, IEEE Trans. PAMI, 23:6, pp. 681-685, 2001.\n[6] D. Cremers, T. Kohlberger and C. Schnörr, Shape Statistics in Kernel Space for Variational Image Segmentation, Pattern Recognition, 36:1929-1943, 2003.\n[7] T. J. Darrell and A. P. Pentland, Recognition of Space-Time Gestures using a Distributed Representation, MIT Media Laboratory Vision and Modeling TR-197.\n[8] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, Wiley Interscience, 2002.\n[9] A. Efros, A. Berg, G. Mori and J. Malik, Recognizing Action at a Distance, ICCV, pp. 726-733, 2003.\n[10] W. T. Freeman and E. H. Adelson, The Design and Use of Steerable Filters, IEEE Trans. PAMI, 13:9, pp. 891-906, 1991.\n[11] T. Hastie and R. Tibshirani, Discriminant Analysis by Gaussian Mixtures, Journal of the Royal Statistical Society, Series B, 58(1):155-176.\n[12] W. Hawkins, P. Leichner and N. Yang, The Circular Harmonic Transform for SPECT Reconstruction and Boundary Conditions on the Fourier Transform of the Sinogram, IEEE Trans. 
on Medical Imaging, 7:2, 1988.\n[13] A. Heap and D. Hogg, Wormholes in Shape Space: Tracking through Discontinuous Changes in Shape, ICCV, 1998.\n[14] M. K. Hu, Visual Pattern Recognition by Moment Invariants, IEEE Trans. Inform. Theory, 8:179-187, 1962.\n[15] M. J. Jones and J. M. Rehg, Statistical Color Models with Application to Skin Detection, Int. J. of Computer Vision, 46:1, pp. 81-96, 2002.\n[16] B. Julesz, Textons, the Elements of Texture Perception, and their Interactions, Nature, 290:91-97, 1981.\n[17] T. Leung and J. Malik, Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons, Int. Journal of Computer Vision, 41:1, pp. 29-44, 2001.\n[18] M. Meila and J. Shi, Learning Segmentation with Random Walk, NIPS, 2001.\n[19] B. Moghaddam and A. Pentland, Probabilistic Visual Learning for Object Representation, IEEE Trans. PAMI, 19:7, 1997.\n[20] A. Ng, M. Jordan and Y. Weiss, On Spectral Clustering: Analysis and an Algorithm, NIPS, 2001.\n[21] S. Satoh, Generalized Histogram: Empirical Optimization of Low Dimensional Features for Image Matching, ECCV, 2004.\n[22] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, IEEE Trans. PAMI, 2000.\n[23] B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla, Filtering Using a Tree-Based Estimator, ICCV, II:1063-1070, 2003.\n[24] C. Tomasi, S. Petrov and A. Sastry, 3D Tracking = Classification + Interpolation, ICCV, 2003.\n[25] R. Vidal and R. Hartley, Motion Segmentation with Missing Data using PowerFactorization and GPCA, CVPR, 2004.\n[26] Y. Weiss, Segmentation using Eigenvectors: A Unifying View, ICCV, 1999.\n[27] L. Zelnik-Manor and M. Irani, Event-Based Video Analysis, CVPR, 2001.\n", "award": [], "sourceid": 2640, "authors": [{"given_name": "Le", "family_name": "Lu", "institution": null}, {"given_name": "Gregory", "family_name": "Hager", "institution": null}, {"given_name": "Laurent", "family_name": "Younes", "institution": null}]}