{"title": "Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 154, "page_last": 162, "abstract": "In this paper, we consider the 'Precis' problem of sampling K representative yet diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.", "full_text": "Manifold Pr\u00b4ecis: An Annealing Technique for Diverse\n\nSampling of Manifolds\n\n\u2020Department of Electrical and Computer Engineering, University of Maryland, College Park\n\n\u2021School of Arts, Media, Engineering and ECEE, Arizona State University\n\nNitesh Shroff \u2020, Pavan Turaga \u2021, Rama Chellappa \u2020\n\n{nshroff,rama}@umiacs.umd.edu, pturaga@asu.edu\n\nAbstract\n\nIn this paper, we consider the Pr\u00b4ecis problem of sampling K representative yet\ndiverse data points from a large dataset. This problem arises frequently in ap-\nplications such as video and document summarization, exploratory data analysis,\nand pre-\ufb01ltering. 
We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to be applied to shapes, human activities, textures and many other image- and video-based datasets. We propose intrinsic manifold measures for assessing the quality of a selection of points with respect to their representative power and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.

1 Introduction

The problem of sampling K representative data points from a large dataset arises frequently in various applications. Consider analyzing large datasets of shapes, objects, documents, or long video sequences. Analysts spend a large amount of time sifting through the acquired data to familiarize themselves with its content before using it for application-specific tasks. This makes the optimal selection of a few representative exemplars from the dataset an important step in exploratory data analysis. Other applications include Internet-based video summarization, where providing a quick overview of a video is important for improving the browsing experience. Similarly, in medical image analysis, picking a subset of K anatomical shapes from a large population helps in identifying the variations within and across shape classes, providing an invaluable tool for analysts.

Depending upon the application, several subset selection criteria have been proposed in the literature. 
However, there is broad consensus on selecting exemplars that are representative of the dataset while minimizing the redundancy between the exemplars. Liu et al. [1] proposed that the summary of a document should satisfy 'coverage' and 'orthogonality' criteria. Shroff et al. [2] extended this idea to selecting exemplars from videos that maximize 'coverage' and 'diversity'. Simon et al. [3] formulated scene summarization as picking interesting and important scenes with minimal redundancy. Similarly, in statistics, stratified sampling techniques sample the population by dividing the dataset into mutually exclusive and exhaustive 'strata' (sub-groups), followed by a random selection of representatives from each stratum [4]. The splitting of the population into strata ensures that a diverse selection is obtained. The need to select diverse subsets has also been emphasized in information retrieval applications [5, 6].

Column Subset Selection (CSS) [7, 8, 9] has been one of the popular techniques for addressing this problem. The goal of CSS is to select the K most 'well-conditioned' columns from the matrix of data points. One of the key assumptions behind this and other techniques is that the objects, or their representations, lie in a Euclidean space. Unfortunately, this assumption is not valid in many cases. In applications like computer vision, images and videos are represented by features/models such as shapes [10], bags-of-words, and linear dynamical systems (LDS) [11]. Many of these features/models have been shown to lie in non-Euclidean spaces, implying that the underlying distance metric of the space is not the usual l2/lp norm. Since these feature/model spaces have a non-trivial manifold structure, the distance metrics are highly non-linear functions. 
Examples of feature/model and manifold pairs include: shapes on the complex spherical manifold [10], linear subspaces on the Grassmann manifold, covariance matrices in the tensor space, and histograms on the simplex in Rn. Even the familiar bag-of-words representation, used commonly in document analysis, is more naturally considered as a statistical manifold than as a vector space [12]. The geometric properties of these non-Euclidean manifolds allow one to develop accurate inference and classification algorithms [13, 14]. In this paper, we focus on the problem of selecting a subset of K exemplars from a dataset of N points when the dataset has an underlying manifold structure. We formulate notions of representational error and a diversity measure for exemplars that utilize the non-Euclidean structure of the data points, and then propose an efficient annealing-based optimization algorithm.

Related Work: The problem of subset selection has been studied by the communities of numerical linear algebra and theoretical computer science. Most work in the former community is related to the Rank Revealing QR factorization (RRQR) [7, 15, 16]. Given a data matrix Y, the goal of RRQR factorization is to find a permutation matrix Π such that the QR factorization of YΠ reveals the numerical rank of the matrix. The resultant matrix YΠ has as its first K columns the most 'well-conditioned' columns of the matrix Y. On the other hand, the latter community has focused on Column Subset Selection (CSS). The goal of CSS is to pick K columns forming a matrix C ∈ R^{m×K} such that the residual ‖Y − P_C Y‖_ζ is minimized over all possible choices for the matrix C. Here P_C = CC† denotes the projection onto the K-dimensional space spanned by the columns of C, ζ can represent the spectral or Frobenius norm, and C† indicates the pseudo-inverse of the matrix C. 
Along these lines, different randomized algorithms have been proposed [17, 18, 9, 8]; the approaches include a two-stage approach [9], subspace sampling methods [8], etc.

Clustering techniques [19] have also been applied for subset selection [20, 21]. In order to select K exemplars, data points are clustered into l clusters (l ≤ K), followed by the selection of one or multiple exemplars from each cluster to obtain the best representation or low-rank approximation of each cluster. Affinity Propagation [21] is a clustering algorithm that takes similarity measures as input and recursively passes messages between nodes until a set of exemplars emerges. As we discuss in this paper, the problems with these approaches are that (a) the objective functions optimized by the clustering routines do not incorporate the diversity of the exemplars, and hence can be biased towards denser clusters and also by outliers, and (b) seeking a low-rank approximation of the data matrix or of clusters individually is not always an appropriate subset selection criterion. Furthermore, these techniques are largely tuned towards addressing the problem in a Euclidean setting and cannot be applied to datasets in non-Euclidean spaces.

Recently, advances have been made in utilizing non-Euclidean structure for statistical inference and pattern recognition [13, 14, 22, 23]. These works have addressed inference, clustering, dimensionality reduction, etc., in non-Euclidean spaces. To the best of our knowledge, the problem of subset selection for analytic manifolds remains largely unaddressed. While one could try to solve the problem by obtaining an embedding of a given manifold into a larger ambient Euclidean space, it is desirable to have a solution that is more intrinsic in nature, because the chosen embedding is often arbitrary and introduces peculiarities that result from such extrinsic approaches. 
Further, manifolds such as the Grassmannian or the manifold of infinite-dimensional diffeomorphisms do not admit a natural embedding into a vector space.

Contributions: 1) We present the first formal treatment of subset selection for the general case of manifolds; 2) we propose a novel annealing-based alternation algorithm to efficiently solve the optimization problem; 3) we present an extension of the algorithm for data manifolds, and demonstrate the favorable properties of the algorithm on real data.

2 Subset Selection on Analytic Manifolds

In this section, we formalize the subset selection problem on manifolds and propose an efficient algorithm. First, we briefly touch upon the necessary basic concepts.

Geometric Computations on Manifolds: Let M be an m-dimensional manifold and, for a point p ∈ M, consider a differentiable curve γ : (−ε, ε) → M such that γ(0) = p. The derivative γ̇(0) is the velocity of γ at p, and is an example of a tangent vector to M at p. The set of all such tangent vectors is called the tangent space to M at p, denoted Tp(M). If M is a Riemannian manifold, then the exponential map exp_p : Tp(M) → M is defined by exp_p(v) = α_v(1), where α_v is the geodesic starting at p with initial velocity v. The inverse exponential map (logarithmic map) log_p : M → Tp(M) takes a point on the manifold and returns a point on the tangent space, which is a Euclidean space.

Representational error on manifolds: Let us assume that we are given a set of points X = {x1, x2, . . . , xn} which belong to a manifold M. The goal is to select a few exemplars E = {e1, . . . , eK} from the set X, such that the exemplars provide a good representation of the given data points and are minimally redundant. 
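For concreteness, both maps have simple closed forms on the unit sphere; a minimal numerical sketch of the exponential and inverse exponential maps for that manifold (function names are ours, not from the paper):

```python
import numpy as np

def sphere_exp(p, v):
    # Exponential map on the unit sphere: follow the geodesic from p
    # with initial tangent velocity v for unit time.
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p.copy()
    return np.cos(nv) * p + np.sin(nv) * (v / nv)

def sphere_log(p, q):
    # Inverse exponential (log) map: the tangent vector at p pointing
    # towards q, whose length is the geodesic distance arccos(<p, q>).
    c = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(p)
    u = q - c * p                       # component of q orthogonal to p
    return theta * u / np.linalg.norm(u)
```

By construction, `sphere_exp(p, sphere_log(p, q))` recovers q, which is the consistency property the algorithms below rely on for whatever manifold is in use.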
For the special case of vector spaces, two common approaches to measuring representational error are in terms of linear spans and nearest-exemplar error. The linear span error is given by min_z ‖X − Ez‖²_F, where X is the matrix form of the data and E is a matrix of chosen exemplars. The nearest-exemplar error is given by Σ_i Σ_{xk ∈ Φi} ‖xk − ei‖², where ei is the i-th exemplar and Φi corresponds to its Voronoi region.

Of these two measures, the notion of linear span, while appropriate for matrix approximation, is not particularly meaningful for general dataset approximation problems, since the 'span' of a dataset item does not carry much perceptually meaningful information. For example, the linear span of a vector x ∈ Rn is the set of points αx, α ∈ R. However, if x were an image, the linear span of x would be the set of images obtained by varying the global contrast level. All elements of this set are perceptually equivalent, and one does not obtain any representational advantage from considering the span of x. Further, points sampled from the linear span of a few images would not be meaningful images. This situation is further complicated for manifold-valued data such as shapes, where the notion of linear span does not exist. One could attempt to define a notion of linear span on the manifold as the set of points lying on the geodesic shot from some fixed pole toward the given dataset item. 
But points sampled from this 'linear span' might not be very meaningful; e.g., samples from the linear span of a few shapes would give physically meaningless shapes.

Hence, it is natural to consider the representational error of a set X with respect to a set of exemplars E as follows:

Jrep(E) = Σ_i Σ_{xj ∈ Φi} d²g(xj, ei)    (1)

Here, dg is the geodesic distance on the manifold and Φi is the Voronoi region of the i-th exemplar. This boils down to the familiar K-means or K-medoids cost function for Euclidean spaces. In order to avoid the combinatorial optimization involved in solving this problem exactly, we use efficient approximations, i.e., we first find the mean and then select ei as the data point closest to the mean. The algorithm for optimizing Jrep is given in Algorithm 1: similar to K-means clustering, a cluster label is assigned to each xj, the mean µi of each cluster is computed, and the representative exemplar ei is selected as the data point closest to µi.

Diversity measures on manifolds: The next question we consider is how to define the diversity of a selection of points on a manifold. We begin by examining equivalent constructions for Rn. One way to measure diversity is simply the sample variance of the points; this is similar to the construction used recently in [2]. For the case of manifolds, the sample variance can be replaced by the sample Karcher variance, given by the function ρ(E) = (1/K) Σ_{i=1}^{K} d²g(µ, ei), where µ is the Karcher mean [24] and the function value is the Karcher variance. 
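The assign-then-recenter loop behind Eq. (1) needs only geodesic distances if the closest-to-mean step is replaced by a medoid update; a small sketch under that simplification (the medoid stand-in and the function name are ours):

```python
import numpy as np

def minimize_jrep(D, K, n_iter=50, seed=0):
    """Approximate minimization of Jrep given an n x n matrix D of
    geodesic distances. Alternates nearest-exemplar (Voronoi) assignment
    with a medoid update; the medoid stands in for the
    compute-mean-then-snap step of Algorithm 1 when only pairwise
    distances are available."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    exemplars = rng.choice(n, size=K, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, exemplars], axis=1)   # Voronoi regions
        new = exemplars.copy()
        for i in range(K):
            members = np.where(labels == i)[0]
            if len(members):
                # medoid: the member minimizing the sum of squared
                # geodesic distances to the rest of its region
                within = (D[np.ix_(members, members)] ** 2).sum(axis=1)
                new[i] = members[np.argmin(within)]
        if np.array_equal(np.sort(new), np.sort(exemplars)):
            break
        exemplars = new
    jrep = float((np.min(D[:, exemplars], axis=1) ** 2).sum())
    return exemplars, jrep
```

Any manifold enters only through D, so the same loop serves the sphere, the Grassmannian, or sampled manifolds with numerically estimated geodesics.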
However, this Karcher-variance construction leads to highly inefficient optimization routines, essentially boiling down to a combinatorial search over all possible K-sized subsets of X.

An alternate formulation for vector spaces that results in highly efficient optimization routines is via Rank-Revealing QR (RRQR) factorizations. For vector spaces, given a set of vectors X = {xi}, written in matrix form X, RRQR [7] aims to find Q, R and a permutation matrix Π ∈ R^{n×n} such that XΠ = QR reveals the numerical rank of the matrix X. This permutation XΠ = (X_K X_{n−K}) gives X_K, the K most linearly independent columns of X. The factorization is achieved by seeking the Π which maximizes Λ(X_K) = ∏_i σi(X_K), the product of the singular values of the matrix X_K.

For the case of manifolds, we adopt an approximate approach and measure diversity in terms of the 'well-conditioned' nature of the set of exemplars projected on the tangent space at the mean. In particular, for the dataset {xi} ⊆ M, with intrinsic mean µ, and a given selection of exemplars

Algorithm 1: Algorithm to minimize Jrep
Input: X ∈ M, k, index vector ω, Γ
Output: Permutation matrix Π
Initialize Π ← I_{n×n}
for γ ← 1 to Γ do
    Initialize Π^(γ) ← I_{n×n}
    e_i ← x_{ωi} for i = {1, 2, . . .
, k}
    for i ← 1 to k do
        Φi ← {xp : arg min_j dg(xp, ej) = i}
        µi ← mean of Φi
        ĵ ← arg min_j dg(xj, µi)
        Update: Π^(γ) ← Π^(γ) Π_{i↔ĵ}
    end
    Update: Π ← Π Π^(γ), ω ← ωΠ^(γ)
    if Π^(γ) = I_{n×n} then
        break
    end
end

Algorithm 2: Algorithm for Diversity Maximization
Input: Matrix V ∈ R^{d×n}, k, tolerance tol
Output: Permutation matrix Π
Initialize Π ← I_{n×n}
repeat
    Compute the QR decomposition of V to obtain R11, R12 and R22 such that V = Q ( R11 R12 ; 0 R22 )
    β_ij ← sqrt( (R11^{-1} R12)_{ij}² + ‖R22 α_j‖₂² ‖α_i^T R11^{-1}‖₂² )
    β_m ← max_{ij} β_ij
    (î, ĵ) ← arg max_{ij} β_ij
    Update: Π ← Π Π_{î↔(ĵ+k)}
    V ← V Π_{î↔(ĵ+k)}
until β_m < tol

{ej}, we measure the diversity of the exemplars as follows: the matrix TE = [log_µ(ej)] is obtained by projecting the exemplars {ej} onto the tangent space at the mean µ. Here, log(·) is the inverse exponential map on the manifold and gives tangent vectors at µ that point towards the ej. Diversity can then be quantified as Jdiv(E) = Λ(TE), where Λ(TE) denotes the product of the singular values of the matrix TE. For vector spaces, this measure is related to the sample variance of the chosen exemplars; for manifolds, it is related to the sample Karcher variance. 
If we denote by TX = [log_µ(xi)] the matrix of tangent vectors corresponding to all data points, and if Π is the permutation matrix that orders the columns such that the first K columns of TX correspond to the most diverse selection, then

Jdiv(E) = Λ(TE) = det(R11), where TX Π = QR = Q ( R11 R12 ; 0 R22 )    (2)

Here, R11 ∈ R^{K×K} is the upper-triangular sub-matrix of R ∈ R^{n×n}, R12 ∈ R^{K×(n−K)} and R22 ∈ R^{(n−K)×(n−K)}. The advantage of viewing the required quantity as the determinant of a sub-matrix on the right-hand side of the above equation is that one can obtain efficient techniques for optimizing this cost function. The algorithm for optimizing Jdiv is adapted from [7] and described in Algorithm 2. The input to the algorithm is a matrix V created by the tangent-space projection of X, and the output is the K most 'well-conditioned' columns of V. This is achieved by first decomposing V into QR and computing β_ij, which indicates the benefit of swapping the i-th and j-th columns [7]. The algorithm then selects the pair (î, ĵ) corresponding to the maximum-benefit swap β_m, and if β_m > tol, the swap is accepted. This is repeated until either β_m < tol or the maximum number of iterations is reached.

Algorithm 3: Annealing-based Alternation Algorithm for Subset Selection on Manifolds
Input: Data points X = {x1, x2, . . . , xn} ∈ M, number of exemplars k, tolerance step δ
Output: E = {e1, . . . , ek} ⊆ X
Initial setup:
Compute the intrinsic mean µ of X
Compute tangent vectors vi ← log_µ(xi)
V ← [v1, v2, . . . , vn]
ω ← [1, 2, . . .
, n] be the 1 × n index vector of X
tol ← 1
Initialize: Π ← randomly permute the columns of I_{n×n}
Update: V ← VΠ, ω ← ωΠ
while Π ≠ I_{n×n} do
    Diversity: Π ← Div(V, k, tol) as in Algorithm 2
    Update: V ← VΠ, ω ← ωΠ
    Representational error: Π ← Rep(X, k, ω, 1) as in Algorithm 1
    Update: V ← VΠ, ω ← ωΠ
    tol ← tol + δ
end
e_i ← x_{ωi} for i = {1, 2, . . . , k}

Representation and Diversity Trade-offs for Subset Selection: From (1) and (2), it can be seen that we seek a solution that represents a trade-off between two conflicting criteria. As an example, figure 1(a) shows two cases in which Jrep and Jdiv are individually optimized; the solutions look quite different in each case. One way to write a global cost function is as a weighted combination of the two. However, such a formulation does not lend itself to efficient optimization routines (cf. [2]), and the choice of weights is often left unjustified. Instead, we propose an annealing-based alternating technique for optimizing the conflicting criteria Jrep and Jdiv. Optimization algorithms for Jrep and Jdiv individually are given in Algorithms 1 and 2, respectively. We first optimize Jdiv to obtain an initial set of exemplars, and use this set as an initialization for optimizing Jrep. The output of this stage is used as the current solution to further optimize Jdiv. However, with each iteration, we increase the tolerance parameter tol in Algorithm 2. This has the effect of accepting only those permutations that increase the diversity by a larger factor as the iterations progress, which ensures that the algorithm is guided towards convergence. If the tol value is not increased at each iteration, then optimizing Jdiv will continue to provide a new solution at each iteration that modifies the cost function only marginally. This is illustrated in figure 1(c), where we show how the cost functions Jrep and Jdiv exhibit oscillatory behavior if annealing is not used. As seen in figure 1(b), convergence of Jdiv and Jrep is obtained very quickly when the proposed annealing alternation technique is used. The complete annealing-based alternation algorithm is described in Algorithm 3. A technical detail to be noted here is that, for Algorithm 2, the input matrix V ∈ R^{d×n} should have d ≥ k; for cases where d < k, Algorithm 2 can be replaced by its extension proposed in [9]. Table 1 shows the notation introduced in Algorithms 1-3.

Table 1: Notation used in Algorithms 1-3.
Γ : maximum number of iterations
I_{n×n} : identity matrix
Φi : Voronoi region of the i-th exemplar
Π_{i↔j} : permutation matrix that swaps columns i and j
Π^(γ) : Π in the γ-th iteration
V : matrix obtained by the tangent-space projection of X
H_ij : (i, j) element of matrix H
α_j : j-th column of the identity matrix
Hα_j, α_j^T H : j-th column and row of matrix H, respectively

Table 2: Complexity of the various computational steps.
Exponential map on M (assumed) : O(ν)
Inverse exponential map on M (assumed) : O(χ)
Intrinsic mean of X : O((nχ + ν)Γ)
Projection of X to the tangent space : O(nχ)
Geodesic distances in Alg. 1 : O(nKχ)
K intrinsic means : O((nχ + Kν)Γ)
Alg. 2 : O(mnK log n)
Exponential map on G_{m,p} : O(p³)
Inverse exponential map on G_{m,p} : O(p³)
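Since Eq. (2) casts diversity as |det R11| under a column permutation, a library routine for column-pivoted QR can serve as a greedy, one-pass surrogate for the swap-based Algorithm 2; a sketch under that substitution (SciPy's pivoted QR; function names are ours):

```python
import numpy as np
from scipy.linalg import qr

def jdiv(T, cols):
    # Jdiv = Lambda(T_E): the product of the singular values of the
    # selected tangent-vector columns, equal to |det R11| in Eq. (2).
    return float(np.prod(np.linalg.svd(T[:, cols], compute_uv=False)))

def diverse_columns(T, K):
    """Pick K 'well-conditioned' columns of the d x n tangent matrix T
    using column-pivoted QR -- a one-pass surrogate for the swap-based
    RRQR procedure of Algorithm 2."""
    _, _, piv = qr(T, pivoting=True)
    return piv[:K]
```

The pivoted factorization greedily selects the column with the largest residual norm at each step, so near-duplicate columns are deferred in favor of directions that enlarge the singular-value product.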
Π_{i↔j} is obtained by permuting the i-th and j-th columns of the identity matrix.

3 Complexity, Special cases and Limitations

In this section, we discuss how the proposed method relates to the special case M = Rn, and to sub-manifolds of Rn specified by a large number of samples. For the case of Rn, the cost functions Jrep and Jdiv boil down to the familiar notions of clustering and low-rank matrix approximation, respectively. In this case, Algorithm 3 reduces to an alternation between clustering and matrix approximation, with the annealing ensuring that the algorithm converges. This results in a new algorithm for subset selection in vector spaces.

For the case of manifolds implicitly specified using samples, one can approach the problem in one of two ways. The first is to obtain an embedding of the space into a Euclidean space and apply the special case of the algorithm for M = Rn. The embedding needs to preserve the geodesic distances between all pairs of points; multi-dimensional scaling can be used for this purpose. However, recent methods have also focused on estimating logarithmic maps numerically from sampled data points [25], which makes the algorithm directly applicable in such cases without the need for a separate embedding. Thus the proposed formalism can accommodate manifolds with known and unknown geometries.

However, the formalism is limited to manifolds of finite dimension. Infinite-dimensional manifolds, such as diffeomorphisms [26] and the space of closed curves [27], pose problems in formulating the diversity cost function. 
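For the M = Rn special case just described, the annealing alternation can be sketched compactly, with a centroid-snap update for Jrep and column-pivoted QR as a stand-in for the RRQR swaps of Jdiv (both the surrogate and all names are ours):

```python
import numpy as np
from scipy.linalg import qr

def precis_rn(X, K, delta=0.05, n_rounds=20, seed=0):
    """Annealing-based alternation in the special case M = R^n.
    X is d x n. A diversity step (column-pivoted QR, our surrogate for
    the RRQR swaps) alternates with a representation step (replace each
    exemplar by the column closest to its Voronoi-region mean). The
    growing tolerance accepts a diversity re-selection only if it
    improves the singular-value product by a factor > tol."""
    n = X.shape[1]
    rng = np.random.default_rng(seed)
    sel = rng.choice(n, size=K, replace=False)
    vol = lambda idx: np.prod(np.linalg.svd(X[:, idx], compute_uv=False))
    tol = 1.0
    for _ in range(n_rounds):
        # Diversity step: candidate selection from pivoted QR.
        _, _, piv = qr(X, pivoting=True)
        cand = piv[:K].copy()
        if vol(cand) > tol * vol(sel):
            sel = cand
        # Representation step: snap each exemplar to its cluster centroid.
        d2 = ((X[:, :, None] - X[:, None, sel]) ** 2).sum(axis=0)
        labels = np.argmin(d2, axis=1)          # Voronoi assignment
        new = sel.copy()
        for i in range(K):
            members = np.where(labels == i)[0]
            if len(members):
                mu = X[:, members].mean(axis=1)
                new[i] = members[np.argmin(
                    ((X[:, members] - mu[:, None]) ** 2).sum(axis=0))]
        if np.array_equal(np.sort(new), np.sort(sel)):
            break                               # converged
        sel = new
        tol += delta  # anneal: demand larger diversity gains next round
    return sel
```

The rising tolerance plays the same role as in Algorithm 3: late-stage diversity swaps must buy a progressively larger improvement, which damps the oscillation between the two criteria.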
While Jdiv could have been framed purely in terms of pairwise geodesics, making it extensible to infinite-dimensional manifolds, doing so would have made the optimization a significant bottleneck, as already discussed in section 2.

Computational Complexity: The computational complexity of computing the exponential map and its inverse is specific to each manifold. Let n be the number of data points and K the number of exemplars to be selected. Table 2 enumerates the complexity of the different computational steps of the algorithm. The last two rows show the complexity of an efficient algorithm proposed in [28] to compute the exponential map and its inverse for the case of the Grassmann manifold G_{m,p}.

4 Experiments

Baselines: We compare the proposed algorithm with two baselines. The first baseline is a clustering-based solution to subset selection, where we cluster the dataset into K clusters and pick as exemplar the data point that is closest to the cluster centroid. Since clustering optimizes only the

(a) Subset Selection  (b) Convergence With Annealing  (c) Without Annealing

Figure 1: Subset selection for a simple dataset consisting of unbalanced classes in R4. (a) Data projected on R2 for visualization using PCA. While trying to minimize the representational error, Jrep picks two exemplars from the dominant class. Jdiv picks diverse exemplars, but from the boundaries. The proposed approach strikes a balance between the two and picks one 'representative' exemplar from each class. 
Convergence Analysis of Algorithm 3: (b) with annealing and (c) without annealing.

representation cost function, we do not expect it to have the diversity of the proposed algorithm. This corresponds to the special case of optimizing only Jrep. The second baseline is to apply a tangent-space approximation to the entire dataset at the mean of the dataset, and then apply a subset-selection algorithm such as RRQR. This corresponds to optimizing only Jdiv, where the input matrix is the matrix of tangent vectors. Since minimization of Jrep is not explicitly enforced, we do not expect the exemplars to be the best representatives, even though the set is diverse.

A Simple Dataset: To gain some intuition, we first perform experiments on a simple synthetic dataset. For easy visualization and understanding, we generated a dataset with 3 unbalanced classes in the Euclidean space R4. The individual cost functions Jrep and Jdiv were first optimized to pick three exemplars using Algorithms 1 and 2, respectively. The selected exemplars are shown in figure 1(a), where the four-dimensional dataset has been projected into two dimensions for visualization using Principal Component Analysis (PCA). Despite the unbalanced class sizes, Jdiv, when optimized individually, seeks exemplars from diverse classes but tends to pick them from the class boundaries, while the unbalanced class sizes cause Jrep to pick 2 exemplars from the dominant cluster. Algorithm 3 iteratively optimizes both cost functions and picks an exemplar from every class; these exemplars are closer to the centroids of the individual classes.

Figure 1(b) shows the convergence of the algorithm for this simple dataset and compares it with the case when no annealing is applied (figure 1(c)). The Jrep and Jdiv plots are shown as the iterations of Algorithm 3 progress. When annealing is applied, the tolerance value (tol) is increased by 0.05 in each iteration. 
It can be noted that in this case the algorithm converges to a steady state in 7 iterations (tol = 1.35); if no annealing is applied, the algorithm does not converge.

Shape sampling/summarization: We conducted a real shape-summarization experiment on the MPEG dataset [29]. This dataset has 70 shape classes with 20 shapes per class. For our experiments, we created a smaller dataset of 10 shape classes with 10 shapes per class. Figure 2(a) shows the shapes used in our experiments. We use an affine-invariant representation of shapes based on landmarks. Shape boundaries are uniformly sampled to obtain m landmark points, which are concatenated to form the landmark matrix L ∈ R^{m×2}. The left singular vectors (U_{m×2}), obtained from the singular value decomposition L = UΣV^T, give the affine-invariant representation of the shape [30]. This affine shape space U of m landmark points is a 2-dimensional subspace of Rm. Such p-dimensional subspaces of Rm constitute the Grassmann manifold G_{m,p}. Details of the algorithms for computing the exponential and inverse exponential maps on G_{m,p} can be found in [28] and are also included in the supplemental material.

In this experiment, the cardinality of the subset was set to 10. As the number of shape classes is also 10, one would ideally seek one exemplar from each class. Algorithms 1 and 2 were first individually optimized to select the optimal subset. Algorithm 1 was applied intrinsically on the manifold with multiple initializations; figure 2(b) shows the output with the least cost among these initializations. For Algorithm 2, data points were projected on the tangent space at the mean using the inverse exponential map, and the selected subset is shown in figure 2(c). 
Individual optimization of Jrep results in 1 exemplar each from 6 classes, 2 each from 2 classes ('apple' and 'flower'), and misses 2 classes ('bell' and 'chopper'), while individual optimization of Jdiv alone picks 1 each from 8 classes, 2 from the class 'car', and none from the class 'bell'. It can be observed that the exemplars chosen by Jdiv for the classes 'glass', 'heart', 'flower' and 'apple' tend to be unusual members of their class; it also picks up the flipped car. Optimizing for both Jdiv and Jrep using Algorithm 3 picks one 'representative' exemplar from each class, as shown in figure 2(d).

(a) (b) (c) (d)

Figure 2: (a) The 10 classes from the MPEG dataset with 10 shapes per class. Comparison of the 10 exemplars selected by (b) Jrep, (c) Jdiv and (d) the proposed approach. Jrep picks 2 exemplars each from 2 classes ('apple' and 'flower') and misses the 'bell' and 'chopper' classes. Jdiv picks 1 from 8 different classes, 2 exemplars from the class 'car' and none from the class 'bell'. The exemplars chosen by Jdiv for the classes 'glass', 'heart', 'flower' and 'apple' tend to be unusual members of the class, and it also picks up the flipped car, while the proposed approach picks one representative exemplar from each class, as desired.

These exemplars picked by the three algorithms can be further used to label data points. Table 3 shows the confusion table thus obtained. For each data point, we find the nearest exemplar and label the data point with the ground-truth label of this exemplar. For example, consider the row labeled 'bell'. 
All the data points of the class 'bell' were labeled as 'pocket' by Jrep, while Jdiv labeled 7 data points from this class as 'chopper' and 3 as 'pocket'. This confusion is largely due to both Jrep and Jdiv failing to pick an exemplar from this class. The proposed approach correctly labels all data points, as it picks exemplars from every class.

Table 3: Confusion table. Entries are tuples (Jrep, Jdiv, Proposed). Row labels correspond to the ground-truth labels of the shapes and column labels correspond to the label of the nearest exemplar; only non-zero entries are shown. The diagonal entries are (10,10,10) for the classes 'glass', 'heart', 'baby', 'flower', 'car', 'pocket' and 'teddy'; (8,7,10) for 'apple'; (0,0,10) for 'bell'; and (0,10,10) for 'chopper'. Off the diagonal, the 'bell' row contains (0,7,0) in the 'chopper' column and (10,3,0) in the 'pocket' column; the 'apple' row contains (0,1,0) in the 'heart' column together with entries (2,0,0) and (0,2,0); and the 'chopper' row contains entries (8,0,0) and (2,0,0), so that each row sums to 10 for every algorithm.

KTH human action dataset: The next experiment was conducted on the KTH human action dataset [31]. This dataset consists of videos of 6 actions performed by 25 persons in 4 different scenarios. For our experiment, we created a smaller dataset of 30 videos with the first 5 human subjects performing the 6 actions in the s4 (indoor) scenario. Figure 3(a) shows sample frames from each video. This dataset mainly consists of videos captured under constrained settings, which makes it difficult to identify the 'usual' or 'unusual' members of a class. To better understand the performance of the three algorithms, we synthetically added occlusion to the last video of each class; these occluded videos serve as the 'unusual' members.

Histogram of Oriented Optical Flow (HOOF) features [32] were extracted from each frame to obtain a normalized time-series for each video.
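A per-frame HOOF descriptor can be sketched as follows. This is a simplified variant for illustration only: it bins flow vectors by raw orientation, omitting the mirror-symmetric binning of [32], and the function name and toy inputs are assumptions.

```python
import numpy as np

def hoof(u, v, bins=8):
    """Simplified Histogram of Oriented Optical Flow for one frame.

    u, v: per-pixel horizontal and vertical flow components.
    Each flow vector votes into an orientation bin, weighted by its
    magnitude; the histogram is normalized to sum to one, so the
    sequence of per-frame descriptors forms a normalized time series.
    (The original HOOF of [32] additionally folds bins symmetrically
    about the vertical axis; that detail is omitted here.)
    """
    u = np.ravel(u)
    v = np.ravel(v)
    mag = np.hypot(u, v)                      # flow magnitude per pixel
    ang = np.arctan2(v, u)                    # orientation in (-pi, pi]
    idx = ((ang + np.pi) / (2 * np.pi) * bins).astype(int) % bins
    hist = np.bincount(idx, weights=mag, minlength=bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# A frame whose flow points uniformly to the right lands in a single bin.
h = hoof(np.ones((4, 4)), np.zeros((4, 4)), bins=8)
```

Because each descriptor is a probability vector, the per-video time series lives on a statistical manifold, and it is this normalized series that feeds the dynamical-system model described next.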
A Linear Dynamical System (LDS) is then estimated from this time-series using the approach in [11]. This model is described by the state-transition equation x(t + 1) = Ax(t) + w(t) and the observation equation z(t) = Cx(t) + v(t), where x ∈ R^d is the hidden state vector, z ∈ R^p is the observation vector, and w(t) ∼ N(0, Θ) and v(t) ∼ N(0, Ξ) are the noise components. Here, A is the state-transition matrix and C is the observation matrix. The expected observation sequence of the model (A, C) lies in the column space of the infinite extended 'observability' matrix, which is commonly approximated by the finite matrix O_m^T = [C^T, (CA)^T, (CA^2)^T, . . . , (CA^{m−1})^T]. The column space of this matrix, O_m^T ∈ R^{mp×d}, is a d-dimensional subspace and hence lies on the Grassmann manifold.

Figure 3: (a) Sample frames from the KTH action dataset [31]. From top to bottom, the action classes are {box, run, walk, hand-clap, hand-wave, jog}. The 5 exemplars selected by (b) Jrep, (c) Jdiv and (d) the proposed approach. The exemplars picked by Jrep correspond to the actions {box, run, run, hand-clap, hand-wave}, while Jdiv selects {box, walk, hand-clap, hand-wave, jog}. The proposed approach picks {box, run, walk, hand-clap, hand-wave}.

In this experiment, we consider the scenario in which the number of classes in the dataset is unknown. We asked the algorithm to pick 5 exemplars when the actual number of classes in the dataset is 6. Figure 3(b) shows one frame from each of the videos selected when Jrep was optimized alone. It picks 1 exemplar each from 3 classes ('box', 'hand-clap' and 'hand-wave'), 2 from the class 'run', and misses 'walk' and 'jog'. On the other hand, Jdiv (when optimized alone) picks 1 each from 5 different classes and misses the class 'run'.
It can be seen that Jdiv picks 2 exemplars that are 'unusual' members (occluded videos) of their respective classes. The proposed approach picks 1 representative exemplar from each of 5 classes and none from the class 'jog': it achieves a diverse selection of exemplars while also avoiding outlying exemplars.

Effect of Parameters and Initialization: In our experiments, the tolerance step δ has very little effect for smaller values (< 0.1). After a few attempts, we fixed this value to 0.05 for all our experiments. In the first iteration, we start with tol = 1. With this value, algorithm 2 accepts any swap that increases Jdiv, which makes the output of algorithm 2 after the first iteration almost insensitive to initialization. In later iterations, swaps are accepted only if they increase the value of Jdiv significantly, and hence the input to algorithm 2 becomes more important as tol increases.

5 Conclusion and Discussion

In this paper, we addressed the problem of selecting K exemplars from a dataset when the dataset has an underlying manifold structure. We utilized the geometric structure of the manifold to formulate the notion of picking exemplars that minimize the representational error while maximizing the diversity of the exemplars. An iterative alternation optimization technique based on annealing has been proposed. We discussed its convergence and complexity and showed its extension to data manifolds and Euclidean spaces. We presented summarization experiments with real shape and human-action datasets. Future work includes formulating subset selection for infinite-dimensional manifolds and efficient approximations for this case.
Also, several special cases of the proposed approach point to new directions of research, such as the cases of vector spaces and data manifolds.

Acknowledgement: This research was funded (in part) by grant N00014-09-1-0044 from the Office of Naval Research. The first author would like to thank Dikpal Reddy and Sima Taheri for helpful discussions and their valuable comments.

References

[1] K. Liu, E. Terzi, and T. Grandison, "ManyAspects: a system for highlighting diverse concepts in documents," in Proceedings of the VLDB Endowment, 2008.
[2] N. Shroff, P. Turaga, and R. Chellappa, "Video Précis: Highlighting diverse aspects of videos," IEEE Transactions on Multimedia, vol. 12, no. 8, pp. 853–868, Dec. 2010.
[3] I. Simon, N. Snavely, and S. Seitz, "Scene summarization for online image collections," in ICCV, 2007.
[4] W. Cochran, Sampling Techniques. Wiley, 1977.
[5] Y. Yue and T. Joachims, "Predicting diverse subsets using structural SVMs," in ICML, 2008.
[6] J. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries," in SIGIR, 1998.
[7] M. Gu and S. Eisenstat, "Efficient algorithms for computing a strong rank-revealing QR factorization," SIAM Journal on Scientific Computing, vol. 17, no. 4, pp. 848–869, 1996.
[8] P. Drineas, M. Mahoney, and S. Muthukrishnan, "Relative-error CUR matrix decompositions," SIAM Journal on Matrix Analysis and Applications, vol. 30, pp. 844–881, 2008.
[9] C. Boutsidis, M. Mahoney, and P. Drineas, "An improved approximation algorithm for the column subset selection problem," in SODA, 2009.
[10] D. Kendall, "Shape manifolds, Procrustean metrics and complex projective spaces," Bulletin of the London Mathematical Society, vol. 16, pp. 81–121, 1984.
[11] S. Soatto, G. Doretto, and Y. N. Wu, "Dynamic textures," in ICCV, 2001.
[12] J. D. Lafferty and G. Lebanon, "Diffusion kernels on statistical manifolds," Journal of Machine Learning Research, vol. 6, pp. 129–163, 2005.
[13] P. T. Fletcher, C. Lu, S. M. Pizer, and S. C. Joshi, "Principal geodesic analysis for the study of nonlinear statistics of shape," IEEE Transactions on Medical Imaging, vol. 23, no. 8, pp. 995–1005, Aug. 2004.
[14] A. Srivastava, S. H. Joshi, W. Mio, and X. Liu, "Statistical shape analysis: Clustering, learning, and testing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 4, 2005.
[15] G. Golub, "Numerical methods for solving linear least squares problems," Numerische Mathematik, vol. 7, no. 3, pp. 206–216, 1965.
[16] T. Chan, "Rank revealing QR factorizations," Linear Algebra and Its Applications, vol. 88, pp. 67–82, 1987.
[17] A. Frieze, R. Kannan, and S. Vempala, "Fast Monte-Carlo algorithms for finding low-rank approximations," Journal of the ACM, vol. 51, no. 6, pp. 1025–1041, 2004.
[18] A. Deshpande and L. Rademacher, "Efficient volume sampling for row/column subset selection," in Foundations of Computer Science (FOCS), 2010.
[19] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics, 2007.
[20] I. Dhillon and D. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1, pp. 143–175, 2001.
[21] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, pp. 972–976, Feb. 2007.
[22] R. Subbarao and P. Meer, "Nonlinear mean shift for clustering over analytic manifolds," in CVPR, 2006.
[23] A. Goh and R. Vidal, "Clustering and dimensionality reduction on Riemannian manifolds," in CVPR, 2008.
[24] H. Karcher, "Riemannian center of mass and mollifier smoothing," Communications on Pure and Applied Mathematics, vol. 30, no. 5, pp. 509–541, 1977.
[25] T. Lin and H. Zha, "Riemannian manifold learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 796–809, 2008.
[26] A. Trouvé, "Diffeomorphisms groups and pattern matching in image analysis," International Journal of Computer Vision, vol. 28, pp. 213–221, July 1998.
[27] W. Mio, A. Srivastava, and S. Joshi, "On shape of plane elastic curves," International Journal of Computer Vision, vol. 73, no. 3, pp. 307–324, 2007.
[28] K. Gallivan, A. Srivastava, X. Liu, and P. Van Dooren, "Efficient algorithms for inferences on Grassmann manifolds," in IEEE Workshop on Statistical Signal Processing, 2003.
[29] L. Latecki, R. Lakamper, and T. Eckhardt, "Shape descriptors for non-rigid shapes with a single closed contour," in CVPR, 2000.
[30] E. Begelfor and M. Werman, "Affine invariance revisited," in CVPR, 2006.
[31] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in ICPR, 2004.
[32] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal, "Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions," in CVPR, 2009.