{"title": "Generalized Unsupervised Manifold Alignment", "book": "Advances in Neural Information Processing Systems", "page_first": 2429, "page_last": 2437, "abstract": "In this paper, we propose a generalized Unsupervised Manifold Alignment (GUMA) method to build the connections between different but correlated datasets without any known correspondences. Based on the assumption that datasets of the same theme usually have similar manifold structures, GUMA is formulated into an explicit integer optimization problem considering the structure matching and preserving criteria, as well as the feature comparability of the corresponding points in the mutual embedding space. The main benefits of this model include: (1) simultaneous discovery and alignment of manifold structures; (2) fully unsupervised matching without any pre-specified correspondences; (3) efficient iterative alignment without computations in all permutation cases. Experimental results on dataset matching and real-world applications demonstrate the effectiveness and the practicability of our manifold alignment method.", "full_text": "Generalized Unsupervised Manifold Alignment\n\nZhen Cui1,2\n\nHong Chang1\n\nShiguang Shan1\n\nXilin Chen1\n\n1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS),\n\nInstitute of Computing Technology, CAS, Beijing, China\n\n2 School of Computer Science and Technology, Huaqiao University, Xiamen, China\n\n{zhen.cui,hong.chang}@vipl.ict.ac.cn; {sgshan,xlchen}@ict.ac.cn\n\nAbstract\n\nIn this paper, we propose a Generalized Unsupervised Manifold Alignment (GU-\nMA) method to build the connections between different but correlated datasets\nwithout any known correspondences. 
Based on the assumption that datasets of the same theme usually have similar manifold structures, GUMA is formulated into an explicit integer optimization problem considering the structure matching and preserving criteria, as well as the feature comparability of the corresponding points in the mutual embedding space. The main benefits of this model include: (1) simultaneous discovery and alignment of manifold structures; (2) fully unsupervised matching without any pre-specified correspondences; (3) efficient iterative alignment without computations in all permutation cases. Experimental results on dataset matching and real-world applications demonstrate the effectiveness and the practicability of our manifold alignment method.\n\n1 Introduction\n\nIn many machine learning applications, different datasets may reside on different but highly correlated manifolds. Representative scenarios include learning across visual domains, across visual views, across languages, across audio and video, and so on. Among them, a key problem in learning with such datasets is to build connections across different datasets, or to align the underlying (manifold) structures. By making full use of some priors, such as local geometry structures or manually annotated counterparts, manifold alignment tries to build or strengthen the relationships of different datasets and ultimately project samples into a mutual embedding space, where the embedded features can be compared directly. Since samples from different (even heterogeneous) datasets are usually located in different high dimensional spaces, direct alignment in the original spaces is very difficult. In contrast, it is easier to align manifolds of lower intrinsic dimensions.\n\nIn recent years, manifold alignment has become increasingly popular in the machine learning and computer vision communities. 
Generally, existing manifold alignment methods fall into two categories, (semi-)supervised methods and unsupervised methods. The former methods [15, 26, 19, 33, 28, 30] usually require some known between-set counterparts as a prerequisite for the transformation learning, e.g., labels or handcrafted correspondences. Thus they are difficult to generalize to new circumstances, where the counterparts are unknown or intractable to construct.\n\nIn contrast, unsupervised manifold alignment learns from manifold structures and naturally avoids the above problem. With manifold structures characterized by local adjacency weight matrices, Wang et al. [29] define the distance between two points, respectively from either manifold, as the minimum matching score of the corresponding weight matrices over all possible structure permutations. Therefore, when K neighbors are considered, the distance computation for any two points needs K! permutations, a really high computational cost even for a small K. To alleviate this problem, Pei et al. [21] use a B-spline curve to fit each sorted adjacency weight matrix and then compute matching scores of the curves across manifolds for the subsequent local alignment. Both methods in [29] and [21] divide manifold alignment into two steps: the computation of matching similarities of data points across manifolds, and the sequential finding of counterparts. However, such two-step approaches might be defective, as they might lead to inaccurate alignment due to the evolution of neighborhood relationships, i.e., the local neighborhood of one point computed in the first step may change if some of its original neighbors are not aligned in the second step. To address this problem, Cui et al. 
[7] propose an affine-invariant sets alignment method by modeling geometry structures with local reconstruction coefficients.\n\nIn this paper, we propose a generalized unsupervised manifold alignment method, which can globally discover and align manifold structures without any pre-specified correspondences, as well as learn the mutual embedding subspace. In order to jointly learn the transforms into the mutual embedding space and the correspondences of two manifolds, we integrate the criteria of geometry structure matching, feature matching and geometry preserving into an explicit quadratic optimization model with 0-1 integer constraints. An efficient alternate optimization on the alignment and transformations is employed to solve the model. In optimizing the alignment, we extend the Frank-Wolfe (FW) algorithm [9] for the NP-hard integer quadratic programming. The algorithm approximately seeks optima along the path of global convergence on a relaxed convex objective function. Extensive experiments demonstrate the effectiveness of our proposed method.\n\nDifferent from previous unsupervised alignment methods such as [29] and [21], our method can (i) simultaneously discover and align manifold structures without predefining the local neighborhood structures; (ii) perform structure matching globally; and (iii) conduct heterogeneous manifold alignment well by finding the embedding feature spaces. Besides, our work is partly related to other methods such as kernelized sorting [22], latent variable models [14], etc. However, they mostly discover counterparts in a latent space without considering geometric structures, although to some extent the constraint terms used in our model are formally similar to theirs.\n\n2 Problem Description\n\nWe first define the notations used in this paper. A lowercase/uppercase letter in bold denotes a vector/matrix, while non-bold letters denote scalars. 
Xi· (X·i) represents the ith row (column) of matrix X. xij or [X]ij denotes the element at the ith row and jth column of matrix X. 1_{m×n}, 0_{m×n} ∈ R^{m×n} are matrices of ones and zeros. I_n ∈ R^{n×n} is an identity matrix. tr(·) represents the trace norm. The superscript ⊤ means the transpose of a vector or matrix. ‖X‖_F^2 = tr(X^⊤X) designates the Frobenius norm. vec(X) denotes the vectorization of matrix X in columns. diag(X) extracts the diagonal of matrix X, and diag(x) returns a diagonal matrix with the elements of x on its diagonal. X ⊗ Z and X ⊙ Z denote the Kronecker and Hadamard products, respectively.\n\nLet X ∈ R^{dx×nx} and Z ∈ R^{dz×nz} denote two datasets, residing on two different manifolds Mx and Mz, where dx (dz) and nx (nz) are respectively the dimensionalities and cardinalities of the datasets. Without loss of generality, we suppose nx ≤ nz. The goal of unsupervised manifold alignment is to build connections between X and Z without any pre-specified correspondences. To this end, we define a 0-1 integer matrix F ∈ {0, 1}^{nx×nz} to mark the correspondences between X and Z. [F]ij = 1 means that the ith point of X and the jth point of Z are a counterpart. If all counterparts are limited to one-to-one, the set of integer matrices F can be defined as\n\nΠ = {F | F ∈ {0, 1}^{nx×nz}, F 1_{nz} = 1_{nx}, 1_{nx}^⊤ F ≤ 1_{nz}^⊤, nx ≤ nz}. (1)\n\nnx ≠ nz means a partial permutation. Meanwhile, we expect to learn the lower dimensional intrinsic representations for both datasets through explicit linear projections, Px ∈ R^{d×dx} and Pz ∈ R^{d×dz}, from the two datasets to a mutual embedding space M. 
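As a quick numerical illustration, the constraint set of Eqn (1) can be checked directly; a minimal NumPy sketch (the function name and the toy matrix are ours, not from the paper):

```python
import numpy as np


def is_partial_permutation(F):
    """Check membership in the constraint set of Eqn (1): F is 0-1,
    each row sums to one (F 1_nz = 1_nx), and each column sums to at
    most one (1_nx^T F <= 1_nz^T), with nx <= nz."""
    F = np.asarray(F)
    nx, nz = F.shape
    return (nx <= nz
            and np.isin(F, (0, 1)).all()
            and (F.sum(axis=1) == 1).all()
            and (F.sum(axis=0) <= 1).all())


# A 2x3 partial correspondence: x0 <-> z2 and x1 <-> z0; z1 stays unmatched.
F = np.array([[0, 0, 1],
              [1, 0, 0]])
```

With nx < nz, some columns of F stay empty, which is exactly the partial-permutation case noted after Eqn (1).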
Therefore, the correspondence matrix F as well as the embedding projections Px and Pz are what we need to learn to achieve generalized unsupervised manifold alignment.\n\n3 The Model\n\nAligning two manifolds without any annotations is not trivial, especially for two heterogeneous datasets. Even so, we can still make use of the similarities between the manifolds in geometry structures and intrinsic representations to build the alignment. Specifically, we have three intuitive observations to explore. First, manifolds under the same theme, e.g., the same action sequences of different persons, usually imply a certain similarity in geometry structures. Second, the embeddings of corresponding points from different manifolds should be as close as possible. Third, the geometry structures of both manifolds should be preserved respectively in the mutual embedding space. Based on these intuitions, we propose an optimization objective for generalized unsupervised manifold alignment.\n\nOverall objective function\n\nFollowing the above analysis, we formulate unsupervised manifold alignment into an optimization problem with integer constraints,\n\nmin_{Px,Pz,F} Es + γf Ef + γp Ep   s.t. F ∈ Π, Px, Pz ∈ Ω, (2)\n\nwhere γf, γp are the balance parameters, Ω is a constraint set to avoid trivial solutions for Px and Pz, and Es, Ef and Ep are three terms respectively measuring the degrees of geometry matching, feature matching and geometry preserving, which will be detailed individually in the following text.\n\nEs: Geometry matching term\n\nTo build correspondences between two manifolds, they should first be geometrically aligned. Therefore, discovering the geometrical structure of either manifold should be the first task. 
For this purpose, a graph with weighted edges can be exploited to characterize the topological structure of a manifold, e.g., via the graph adjacency matrices Kx, Kz of datasets X and Z, which are usually non-negative and symmetric if directions of edges are not considered. In the manifold learning literature, many methods have been proposed to construct these adjacency matrices locally, e.g., via the heat kernel function [2]. However, in the context of manifold alignment, there might be partial alignment cases, in which some points on one manifold might not correspond to any points on the other manifold. Thus these unmatched points should be detected and excluded from the computation of the geometry relationship. To address this problem, we attempt to characterize the global manifold geometry structure by computing the full adjacency matrix, i.e., [K]ij = d(X·i, X·j), where d is the geodesic distance for general cases or the Euclidean distance for flat manifolds. Note that, in order to reduce the effect of data scales, X and Z are respectively normalized to have unit standard deviation. The degree of manifold matching in global geometry structures is then formulated as the following energy term,\n\nEs = ‖Kx − F Kz F^⊤‖_F^2, (3)\n\nwhere F ∈ Π is the (partial) correspondence matrix defined in Eqn.(1).\n\nEf: Feature matching term\n\nGiven two datasets X and Z, the aligned data points should have similar intrinsic feature representations in the mutual embedding space M. Thus we can formulate the feature matching term as\n\nEf = ‖Px^⊤ X − Pz^⊤ Z F^⊤‖_F^2, (4)\n\nwhere Px and Pz are the embedding projections respectively for X and Z. They can also be extended to implicit nonlinear projections through kernel tricks. 
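Both energies can be evaluated directly in NumPy; a hedged sketch under the paper's notation (function names are ours; the full adjacency here uses the plain Euclidean distance, i.e., the flat-manifold case):

```python
import numpy as np


def full_adjacency(X):
    """Full adjacency matrix of Section 3: [K]_ij = d(X_.i, X_.j).

    Columns of X are points; Euclidean distance (the flat-manifold case)."""
    D = X[:, :, None] - X[:, None, :]          # d x n x n pairwise differences
    return np.sqrt((D ** 2).sum(axis=0))


def geometry_matching(Kx, Kz, F):
    """Es of Eqn (3): || Kx - F Kz F^T ||_F^2."""
    return np.linalg.norm(Kx - F @ Kz @ F.T, "fro") ** 2


def feature_matching(Px, Pz, X, Z, F):
    """Ef of Eqn (4): || Px^T X - Pz^T Z F^T ||_F^2."""
    return np.linalg.norm(Px.T @ X - Pz.T @ Z @ F.T, "fro") ** 2
```

When F is the true permutation and Kz is the correspondingly permuted copy of Kx, the geometry matching energy vanishes, which is the intuition behind Eqn (3).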
This term penalizes the divergence of the intrinsic features of aligned points in the embedding space M.\n\nEp: Geometry preserving term\n\nIn unrolling the manifolds into the mutual embedding space, the local neighborhood relationship of either manifold is not expected to be destroyed. In other words, the local geometry of either manifold should be well preserved to avoid information loss. As done in many manifold learning algorithms [23, 2], we construct the local adjacency weight matrices Wx and Wz respectively for the datasets X and Z. Then, the geometry preserving term is defined as\n\nEp = Σ_{i,j} ‖Px^⊤(xi − xj)‖^2 w^x_{ij} + Σ_{i,j} ‖Pz^⊤(zi − zj)‖^2 w^z_{ij} = tr(Px^⊤ X Lx X^⊤ Px + Pz^⊤ Z Lz Z^⊤ Pz), (5)\n\nwhere w^x_{ij} (w^z_{ij}) is the weight between the ith point and the jth point in X (Z), and Lx and Lz are the graph Laplacian matrices, with Lx = diag([Σ_j w^x_{1j}, ..., Σ_j w^x_{nx j}]) − Wx and Lz = diag([Σ_j w^z_{1j}, ..., Σ_j w^z_{nz j}]) − Wz.\n\n4 Efficient Optimization\n\nSolving the objective function (2) is difficult due to multiple indecomposable variables and integer constraints. Here we propose an efficient approximate solution via alternate optimization. Specifically, the objective function (2) is decomposed into two submodels, corresponding to the optimizations of the integer matrix F and the projection matrices Px, Pz, respectively. With Px and Pz fixed, we get a submodel that is a non-convex quadratic integer program, whose approximate solution is computed along the gradient-descent path of a relaxed convex model by extending the Frank-Wolfe algorithm [9]. When fixing F, an analytic solution can be obtained for Px and Pz. 
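The Laplacian identity behind Eqn (5) can be checked numerically; in the sketch below (names are ours) L = diag(W1) − W, and summing the weighted pairwise distances over all ordered pairs (i, j) yields twice the trace form when W is symmetric:

```python
import numpy as np


def graph_laplacian(W):
    """Graph Laplacian used in Eqn (5): L = diag(W 1) - W."""
    return np.diag(W.sum(axis=1)) - W


def geometry_preserving(P, X, W):
    """One dataset's half of Ep: tr(P^T X L X^T P)."""
    L = graph_laplacian(W)
    return np.trace(P.T @ X @ L @ X.T @ P)
```

The factor of two comes from counting each neighbor pair in both orders; papers often absorb it into the weights.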
The two submodels are alternately optimized until convergence to get the final solution.\n\n4.1 Learning Alignment\n\nWhen fixing Px and Pz, the original problem reduces to minimizing the following function,\n\nmin_{F∈Π} Ψ0(F) = Es + γf Ef. (6)\n\nLet X̂ = Px^⊤ X and Ẑ = Pz^⊤ Z denote the transformed features. After a series of derivations, the objective function can be rewritten as\n\nmin_{F∈Π} Ψ0(F) = ‖Kx F − F Kz‖_F^2 + tr(F^⊤ 11^⊤ F Kzz) + tr(F^⊤ B), (7)\n\nwhere Kzz = Kz ⊙ Kz and B = γf (11^⊤(Ẑ ⊙ Ẑ) − 2 X̂^⊤ Ẑ) − 11^⊤ Kzz. This quadratic alignment problem is NP-hard, with n! enumerations under an exhaustive search strategy. To get an effective and efficient solution, we relax this optimization problem under the framework of the Frank-Wolfe (FW) algorithm [9], which is designed for convex models over a compact convex set. Concretely, we make the following two modifications:\n\n(i) Relax Π into a compact convex set. As the set of 0-1 integer matrices Π is discrete and non-convex, we can relax it to a compact convex set by using right stochastic matrices [3] as\n\nΠ′ = {F | F ≥ 0, F 1_{nz} = 1_{nx}, 1_{nx}^⊤ F ≤ 1_{nz}^⊤, nx ≤ nz}. (8)\n\nObviously, Π′ is a compact convex set.\n\n(ii) Relax the objective function Ψ0 into a convex function. As Ψ0 is non-convex, its optimization easily suffers from local optima. 
To avoid local optima in the optimization, we can incorporate an auxiliary function φ(F) = λ tr(F^⊤ F), with λ = nx × max{−min(eig(Kzz)), 0}, into Ψ0 and get the new objective as\n\nΨ(F) = ‖Kx F − F Kz‖^2 + tr(F^⊤ 11^⊤ F Kzz + λ F^⊤ F) + tr(F^⊤ B). (9)\n\nIn Eqn.(9), the first term is a positive definite quadratic form in the variable vec(F), and the Hessian matrix of the second term is 2(Kzz^⊤ ⊗ (11^⊤) + λI), which is also positive definite. Therefore, the new objective function Ψ is convex over the convex set Π′. Moreover, the solutions from minimizing Ψ0 and Ψ over the integer constraint F ∈ Π are consistent because φ(F) is a constant there.\n\nThe extended FW algorithm is summarized in Alg.1, which iteratively projects the first-order approximate solution of Ψ into Π. In step (4), the optimized solution is obtained using the Kuhn-Munkres (KM) algorithm in the 0-1 integer space [20], which makes the solution of the relaxed objective function Ψ equal to that of the original objective Ψ0. Meanwhile, the continuous solution path is gradually descending in steps (5)∼(11) due to the convexity of the function Ψ, thus local optima are avoided, unlike for the original non-convex function over the integer space Π. Furthermore, it can be proved that the objective value Ψ(Fk) is non-increasing at each iteration and {F1, F2, ...} will converge to a fixed point.\n\nAlgorithm 1 Manifold alignment\nInput: Kx, Kz, X̂, Ẑ, F0\n1: Initialize: F⋆ = F0, k = 0.\n2: repeat\n3: Compute the gradient of Ψ w.r.t. Fk: ∇(Fk) = 2(Kx^⊤ Kx Fk + Fk Kz Kz^⊤ − 2 Kx^⊤ Fk Kz + 11^⊤ Fk Kzz + λ Fk) + B;\n4: Find an optimal alignment at the current solution Fk by minimizing the first-order Taylor expansion of the objective function Ψ, i.e., H = arg min_{F∈Π} tr(∇(Fk)^⊤ F), using the KM algorithm;\n5: if Ψ(H) < Ψ(Fk) then\n6: F⋆ = Fk+1 = H;\n7: else\n8: Find the optimal step δ = arg min_{0≤δ≤1} Ψ(Fk + δ(H − Fk));\n9: Fk+1 = Fk + δ(H − Fk);\n10: F⋆ = arg min_{F∈{H,F⋆}} Ψ(F);\n11: end if\n12: k = k + 1;\n13: until ‖Ψ(Fk+1) − Ψ(Fk)‖ < ε.\nOutput: F⋆.\n\n4.2 Learning Transformations\n\nWhen fixing F, the embedding transforms can be obtained by minimizing the following function,\n\nγf Ef + γp Ep = tr(Px^⊤ X(γf I + γp Lx)X^⊤ Px + Pz^⊤ Z(γf F^⊤ F + γp Lz)Z^⊤ Pz − 2γf Px^⊤ X F Z^⊤ Pz). (10)\n\nTo avoid trivial solutions of Px, Pz, we centralize X, Z and reformulate the optimization problem by considering the rotation-invariant constraints:\n\nmax_{Px,Pz} tr(Px^⊤ X F Z^⊤ Pz)   s.t. Px^⊤ X(γf I + γp Lx)X^⊤ Px = I, Pz^⊤ Z(γf F^⊤ F + γp Lz)Z^⊤ Pz = I. (11)\n\nThe above problem can be solved analytically by eigenvalue decomposition, like Canonical Correlation Analysis (CCA) [16].\n\n4.3 Algorithm Analysis\n\nBy looping the above two steps, i.e., alternating optimization of the correspondence matrix F and the embedding maps Px, Pz, we can reach a feasible solution just like many block-coordinate descent methods. The computational cost mainly lies in learning the alignment, i.e., the optimization steps in Alg.1. In Alg.1, the time complexity of the KM algorithm for linear integer optimization is O(nz^3). As the Frank-Wolfe method has a convergence rate of O(1/k) with k iterations, the time cost of Alg.1 is about O((1/ε) nz^3), where ε is the threshold in step (13) of Alg.1. 
If the whole GUMA algorithm (please see the auxiliary file) needs to iterate t times, the cost of the whole algorithm will be O((t/ε) nz^3). In our experiments, only a few iterations t and k are required to achieve a satisfactory solution.\n\n5 Experiment\n\nTo validate the effectiveness of the proposed manifold alignment method, we first conduct two manifold alignment experiments on dataset matching, including the alignment of face image sets across different appearance variations and the structure matching of protein sequences. Further applications are also performed on video face recognition and visual domain adaptation to demonstrate the practicability of the proposed method.\n\nThe main parameters of our method are the balance parameters γf, γp, which are simply set to 1. In the geometry preserving term, we set the nearest neighbor number K = 5 and the heat kernel parameter to 1. The embedding dimension d is set to the minimal rank of the two sets minus 1.\n\n5.1 GUMA for Set Matching\n\nFirst, we perform alignment of face image sets containing different appearance variations in poses, expressions, illuminations and so on. In this experiment, the goal is to connect corresponding face images of different persons but with the same poses/expressions. Here we use the Multi-PIE database [13]. We choose in total 29,400 face images of the first 100 subjects in the dataset, which cover 7 poses with yaw within [−45°, +45°] (15° intervals), different expressions and illuminations across 3 sessions. These faces are cropped and normalized into 60×40 pixels with eyes at fixed locations. 
To accelerate the alignment, their dimensionality is further reduced by PCA with 90% of the energy preserved. The quantitative matching results1 on pose/expression matching are shown in Fig.1, which contains the matching accuracy2 of poses (Fig.1(a)), expressions (Fig.1(b)) and their combination (Fig.1(c)). We compare our method with two state-of-the-art methods, Wang's [29] and Pei's [21]. Moreover, the results of using only feature matching or only structure matching are also reported, which are actually special cases of our method. Here we briefly name them GUMA(F)/GUMA(S), respectively corresponding to feature/structure matching. As shown in Fig.1, we have the following observations:\n\n(1) Manifold alignment benefits from manifold structures as well as sample features. Although features contribute more to the performance on this dataset, manifold structures also play an important role in alignment. Actually, their relative contributions may differ across datasets, as the following experiments on protein sequence alignment indicate that manifold structures alone can achieve a good performance. Anyway, combining both manifold structures and sample features promotes the performance by more than 15%.\n\n(2) Compared with the other two manifold alignment methods, Wang's [29] and Pei's [21], the proposed method achieves better performance, which may be attributed to the synergy of global structure matching and feature matching. It is also clear that Wang's method achieves relatively worse performance, which we conjecture can be ascribed to the use of only the geometric similarity. This might also account for its similar performance to GUMA(S), which also makes use of structure information only.\n\n(3) Pose matching is easier than expression matching in the alignment task of face image sets. 
This also follows our intuition that poses usually vary more dramatically than subtle face expressions. Further, the task combining poses and expressions (as shown in Fig.1(c)) is more difficult than either single task.\n\n(a) Pose matching (b) Exp. matching (c) Pose & exp. matching\n\nFigure 1: Alignment accuracy of face image sets from Multi-PIE [13].\n\nBesides, we also compare with two representative semi-supervised alignment methods [15, 28] to investigate \u201chow much user labeling is needed to reach a performance comparable to our GUMA method?\u201d. In the semi-supervised cases, we randomly choose some counterparts from the two given sets as labeled data, and keep the remaining samples unlabeled. For both methods, 20%∼30% of the data is required to be labeled for pose matching, and 40%∼50% for expression and union matching. The high proportion of labeling for the latter cases may be attributed to the extremely subtle face expressions, for which the first-order feature comparisons in both methods are not effective enough.\n\nNext we illustrate how our method works by aligning the structures of two manifolds. We choose manifold data from the bioinformatics domain [28]. The structure matching of the Glutaredoxin protein PDB-1G7O is used to validate our method, where the protein molecule has 215 amino acids. As shown in Fig.2, we provide the alignment results in the 3D subspace of two sequences, 1G7O-10 and 1G7O-21. More results can be found in the auxiliary file. 
Wang's method [29] reaches a decent matching result by only using local structure matching, but our method can achieve even better performance by resorting to sample features and global manifold structures.\n\n1Some aligned examples can be found in the auxiliary file.\n2Matching accuracy = #(correct matching pairs in testing) / #(ground-truth matching pairs).\n\n(a) Pei's [21] (b) Wang's [29] (c) GUMA\n\nFigure 2: The structure alignment results of two protein sequences, 1G7O-10 and 1G7O-21.\n\n5.2 GUMA for Video-based Face Verification\n\nIn the task of video face verification, we need to judge whether a pair of videos come from the same person. Here we use the recently published YouTube Faces dataset [32], which contains 3,425 clips downloaded from YouTube. It is usually used to validate the performance of video-based face recognition algorithms. Following the settings in [5], we normalize the face region sub-images to 40×24 pixels and then use histogram equalization to remove some lighting effects. For verification, we first align two videos by GUMA and then accumulate the Euclidean distances of the counterparts as their dissimilarity. This method, without use of any label information, is named GUMA(un). After alignment, CCA may be used to learn discriminant features by using training pairs, which is named GUMA(su). Besides, we compare our algorithms with some classic video-based face recognition methods, including MSM [34], MMD [31], AHISD [4], CHISD [4], SANP [17] and DCC [18]. For the implementation of these methods, we use the source codes released by the authors and report the best results with parameter tuning as described in their papers. 
The accuracy comparisons are reported in Table 1. In the \u201cUnaligned\u201d case, we accumulate the similarities of all combinatorial pairs across the two sequences as the distance. We can observe that the alignment process promotes the performance to 65.74% from 61.80%. In the supervised case, GUMA(su) significantly surpasses the most related DCC method, which learns discriminant features by using CCA from the view of subspaces.\n\nTable 1: The comparisons on the YouTube Faces dataset (%).\n\nMethod | MSM[34] | MMD[31] | AHISD[4] | CHISD[4] | SANP[17] | DCC[18] | Unaligned | GUMA(un) | GUMA(su)\nMean Accuracy | 62.54 | 63.74 | 64.96 | 66.50 | 66.24 | 70.84 | 61.80 | 65.74 | 75.00\nStandard Error | ±1.47 | ±1.69 | ±1.00 | ±2.03 | ±1.70 | ±2.29 | ±1.57 | ±1.81 | ±1.56\n\n5.3 GUMA for Visual Domain Adaptation\n\nTo further validate the proposed method, we also apply it to the visual domain adaptation task, which generally needs to discover the relationship between the samples of the source domain and those of the target domain. Here we consider the unsupervised domain adaptation scenario, where the labels of all the target examples are unknown. Given a pair of source and target domains, we attempt to use GUMA to align the two domains and meanwhile find their embedding space. In the embedding space, we classify the unlabeled examples of the target domain.\n\nWe use four public datasets, Amazon, Webcam, and DSLR collected in [24], and Caltech-256 [12]. Following the protocol in [24, 11, 10, 6], we extract SURF features [1] and encode each image with an 800-bin token frequency feature by using a codebook pre-trained on Amazon images. The features are further normalized and z-scored with zero mean and unit standard deviation per dimension. Each dataset is regarded as one domain, so in total 12 settings of domain adaptation are formed. In the source domain, 20 examples (resp. 
8 examples) per class are selected randomly as labeled data from Amazon, Webcam and Caltech (resp. DSLR). All the examples in the target domain are used as unlabeled data, and their labels need to be predicted as in [11, 10]. For all the settings, we conduct 20 rounds of experiments with different randomly selected examples.\n\nWe compare the proposed method with five baselines: OriFea, Sampling Geodesic Flow (SGF) [11], Geodesic Flow Kernel (GFK) [10], Information Theoretical Learning (ITL) [25] and Subspace Alignment (SA) [8]. Among them, the latter four are state-of-the-art unsupervised domain adaptation methods proposed recently. OriFea uses the original features; SGF and its extended version GFK try to learn invariant features by interpolating intermediate domains between the source and target domains; ITL is a recently proposed unsupervised domain adaptation method; and SA tries to align the principal directions of two domains by characterizing each domain as a subspace.\n\n(a) C→A (b) C→W (c) C→D (d) A→C (e) A→W (f) A→D (g) W→C (h) W→A (i) W→D (j) D→C (k) D→A (l) D→W\n\nFigure 3: Performance comparisons in unsupervised domain adaptation. (A: Amazon, C: Caltech, D: DSLR, W: Webcam)\n\nExcept for ITL, we use the source codes released by the original authors. For fair comparison, the best parameters are tuned to report peak performance for all comparative methods. For an intrinsic comparison, we use the NN classifier to predict the sample labels of the target domain. Note that SGF(PLS) and GFK(PLS) use partial least squares (PLS) to learn discriminant mappings according to their papers. 
In our method, to obtain stable sample points in high-dimensional spaces, we perform clustering on the data of each class in the source domain, and then cluster all unlabeled samples of the target domain, to get representative points for the subsequent manifold alignment, where the number of clusters is estimated using the Jump method [27].\n\nAll comparisons are reported in Fig.3. Compared with the other methods, our method achieves more competitive performance, i.e., the best results in 9 out of 12 cases, which indicates that manifold alignment can be properly applied to domain adaptation. It also implies that the difference between domains can be reduced by using manifold structures rather than the subspaces as in SGF, GFK and SA. Generally, the domain adaptation methods are better than OriFea. In average accuracy, our method is better than the second best result, 44.98% for ours vs. 43.68% for GFK(PLS).\n\n6 Conclusion\n\nIn this paper, we propose a generalized unsupervised manifold alignment method, which seeks the correspondences while finding the mutual embedding subspace of two manifolds. We formulate unsupervised manifold alignment as an explicit 0-1 integer optimization problem by considering the matching of global manifold structures as well as sample features. An efficient optimization algorithm is further proposed that alternately solves two submodels: one learns the alignment with integer constraints, and the other learns the transforms to get the mutual embedding subspace. In learning the alignment, we extend the Frank-Wolfe algorithm to approximately seek optima along the descent path of the relaxed objective function. Experiments on set matching, video face recognition and visual domain adaptation demonstrate the effectiveness and practicability of our method. 
In future work, we will further generalize GUMA by relaxing the integer constraint and explore more applications.

Acknowledgments

This work is partially supported by the Natural Science Foundation of China under contract Nos. 61272319, 61222211, 61202297, and 61390510.

References
[1] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, 2006.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[3] R. A. Brualdi. Combinatorial Matrix Classes. Cambridge University Press, 2006.
[4] H. Cevikalp and B. Triggs. Face recognition based on image sets. In CVPR, 2010.
[5] Z. Cui, W. Li, D. Xu, S. Shan, and X. Chen. Fusing robust face region descriptors via multiple metric learning for face recognition in the wild. In CVPR, 2013.
[6] Z. Cui, W. Li, D. Xu, S. Shan, X. Chen, and X. Li. Flowing on Riemannian manifold: Domain adaptation by shifting covariance. IEEE Transactions on Cybernetics, accepted in March 2014.
[7] Z. Cui, S. Shan, H. Zhang, S. Lao, and X. Chen.
Image sets alignment for video-based face recognition. In CVPR, 2012.
[8] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.
[9] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
[10] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[11] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, 2011.
[12] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
[13] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
[14] A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. Learning bilingual lexicons from monolingual corpora. In ACL, pages 771–779, 2008.
[15] J. Ham, D. Lee, and L. Saul. Semisupervised alignment of manifolds. In UAI, 2005.
[16] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
[17] Y. Hu, A. S. Mian, and R. Owens. Sparse approximated nearest points for image set classification. In CVPR, 2011.
[18] T. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. T-PAMI, 2007.
[19] S. Lafon, Y. Keller, and R. R. Coifman. Data fusion and multicue data matching by diffusion maps. T-PAMI, 28(11):1784–1797, 2006.
[20] J. Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.
[21] Y. Pei, F. Huang, F. Shi, and H. Zha.
Unsupervised image matching based on manifold alignment. T-PAMI, 34(8):1658–1664, 2012.
[22] N. Quadrianto, L. Song, and A. J. Smola. Kernelized sorting. In NIPS, 2009.
[23] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[24] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.
[25] Y. Shi and F. Sha. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. In ICML, 2012.
[26] A. Shon, K. Grochow, A. Hertzmann, and R. P. Rao. Learning shared latent structure for image synthesis and robotic imitation. In NIPS, 2005.
[27] C. A. Sugar and G. M. James. Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 2003.
[28] C. Wang and S. Mahadevan. Manifold alignment using Procrustes analysis. In ICML, 2008.
[29] C. Wang and S. Mahadevan. Manifold alignment without correspondence. In IJCAI, 2009.
[30] C. Wang and S. Mahadevan. Heterogeneous domain adaptation using manifold alignment. In IJCAI, 2011.
[31] R. Wang, S. Shan, X. Chen, and W. Gao. Manifold-manifold distance with application to face recognition based on image set. In CVPR, 2008.
[32] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR, 2011.
[33] L. Xiong, F. Wang, and C. Zhang. Semi-definite manifold alignment. In ECML, 2007.
[34] O. Yamaguchi, K. Fukui, and K. Maeda. Face recognition using temporal image sequence.
In FGR, 1998.