{"title": "Kernelized Sorting", "book": "Advances in Neural Information Processing Systems", "page_first": 1289, "page_last": 1296, "abstract": "Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution.", "full_text": "Kernelized Sorting\n\nNovi Quadrianto\nRSISE, ANU & SML, NICTA\nCanberra, ACT, Australia\nnovi.quad@gmail.com\n\nAlex J. Smola\nYahoo! Research\nSanta Clara, CA, USA\nalex@smola.org\n\nLe Song\nSCS, CMU\nPittsburgh, PA, USA\nlesong@cs.cmu.edu\n\nAbstract\n\nObject matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution.\n\n1 Introduction\n\nMatching pairs of objects is a fundamental operation of unsupervised learning. For instance, we might want to match a photo with a textual description of a person, a map with a satellite image, or a music score with a music performance. 
In those cases it is desirable to have a compatibility function which determines how one set may be translated into the other. For many such instances we may be able to design a compatibility score based on prior knowledge or to observe one based on the co-occurrence of such objects.\n\nIn some cases, however, such a match may not exist or it may not be given to us beforehand. That is, while we may have a good understanding of two sources of observations, say X and Y, we may not understand the mapping between the two spaces. For instance, we might have two collections of documents purportedly covering the same content, written in two different languages. Here it should be our goal to determine the correspondence between both sets and to identify a mapping between the two domains. In the following we present a method which is able to perform such matching without the need of a cross-domain similarity measure.\n\nOur method relies on the fact that one may estimate the dependence between sets of random variables even without knowing the cross-domain mapping. Various criteria are available. We choose the Hilbert Schmidt Independence Criterion between two sets and we maximize over the permutation group to find a good match. As a side-effect we obtain an explicit representation of the covariance. We show that our method generalizes sorting. When using a different measure of dependence, namely an approximation of the mutual information, our method is related to an algorithm of [1]. Finally, we give a simple approximation algorithm for kernelized sorting.\n\n1.1 Sorting and Matching\n\nThe basic idea underlying our algorithm is simple. Denote by X = {x1, . . . , xm} ⊆ X and Y = {y1, . . .
, ym} ⊆ Y two sets of observations between which we would like to find a correspondence. That is, we would like to find some permutation π ∈ Π_m on m terms, where\n\nΠ_m := { π | π ∈ {0, 1}^{m×m} and π 1_m = 1_m and π^T 1_m = 1_m },   (1)\n\nsuch that the pairs Z(π) := {(xi, yπ(i)) for 1 ≤ i ≤ m} correspond to dependent random variables. Here 1_m ∈ R^m is the vector of all ones. We seek a permutation π such that the mapping xi → yπ(i) and its converse mapping from y to x are simple. Denote by D(Z(π)) a measure of the dependence between x and y. Then we define nonparametric sorting of X and Y as follows\n\nπ* := argmax_{π ∈ Π_m} D(Z(π)).   (2)\n\nThis paper is concerned with measures of D and approximate algorithms for (2). In particular we will investigate the Hilbert Schmidt Independence Criterion and the Mutual Information.\n\n2 Hilbert Schmidt Independence Criterion\n\nLet sets of observations X and Y be drawn jointly from some probability distribution Prxy. The Hilbert Schmidt Independence Criterion (HSIC) [2] measures the dependence between x and y by computing the norm of the cross-covariance operator over the domain X × Y in Hilbert Space. It can be shown, provided the Hilbert Space is universal, that this norm vanishes if and only if x and y are independent. A large value suggests strong dependence with respect to the choice of kernels.\n\nFormally, let F be the Reproducing Kernel Hilbert Space (RKHS) on X with associated kernel k : X × X → R and feature map φ : X → F. Let G be the RKHS on Y with kernel l and feature map ψ. 
The cross-covariance operator Cxy : G → F is defined by [3] as\n\nCxy = Exy[(φ(x) − μx) ⊗ (ψ(y) − μy)],   (3)\n\nwhere μx = E[φ(x)], μy = E[ψ(y)], and ⊗ is the tensor product. HSIC, denoted as D, is then defined as the square of the Hilbert-Schmidt norm of Cxy [2] via D(F, G, Prxy) := ||Cxy||^2_HS. In terms of kernels HSIC can be expressed as\n\nExx'yy'[k(x, x')l(y, y')] + Exx'[k(x, x')]Eyy'[l(y, y')] − 2Exy[Ex'[k(x, x')]Ey'[l(y, y')]],   (4)\n\nwhere Exx'yy' is the expectation over both (x, y) ∼ Prxy and an additional pair of variables (x', y') ∼ Prxy drawn independently according to the same law. Given a sample Z = {(x1, y1), . . . , (xm, ym)} of size m drawn from Prxy an empirical estimate of HSIC is\n\nD(F, G, Z) = (m − 1)^{−2} tr HKHL = (m − 1)^{−2} tr K̄L̄,   (5)\n\nwhere K, L ∈ R^{m×m} are the kernel matrices for the data and the labels respectively, i.e. Kij = k(xi, xj) and Lij = l(yi, yj). Moreover, Hij = δij − m^{−1} centers the data and the labels in feature space. Finally, K̄ := HKH and L̄ := HLH denote the centered versions of K and L respectively. Note that (5) is a biased estimate where the expectations with respect to x, x', y, y' have all been replaced by empirical averages over the set of observations.\n\n2.1 Kernelized Sorting\n\nPrevious work used HSIC to measure independence between given random variables [2]. Here we use it to construct a mapping between X and Y by permuting Y to maximize dependence. There are several advantages in using HSIC as a dependence criterion. First, HSIC satisfies concentration of measure conditions [2]. That is, for random draws of observation from Prxy, HSIC provides values which are very similar. 
This is desirable, as we want our mapping to be robust to small changes. Second, HSIC is easy to compute, since only the kernel matrices are required and no density estimation is needed. The freedom of choosing a kernel allows us to incorporate prior knowledge into the dependence estimation process. The consequence is that we are able to generate a family of methods by simply choosing appropriate kernels for X and Y.\n\nLemma 1 The nonparametric sorting problem is given by π* = argmax_{π ∈ Π_m} tr K̄ π^T L̄ π.\n\nProof We only need to establish that Hπ = πH since the rest follows from the definition of (5). Note that since H is a centering matrix, it has the eigenvalue 0 for the vector of all ones and the eigenvalue 1 for all vectors orthogonal to that. Next note that the vector of all ones is also an eigenvector of any permutation matrix π with π1 = 1. Hence H and π matrices commute.\n\nNext we show that the objective function is indeed reasonable: for this we need the following inequality due to Polya, Littlewood and Hardy:\n\nLemma 2 Let a, b ∈ R^m where a is sorted ascendingly. Then a^T π b is maximized for π = argsort b.\n\nLemma 3 Let X = Y = R and let k(x, x') = xx' and l(y, y') = yy'. Moreover, assume that x is sorted ascendingly. In this case (5) is maximized by either π = argsort y or by π = argsort −y.\n\nProof Under the assumptions we have that K̄ = Hxx^T H and L̄ = Hyy^T H. Hence we may rewrite the objective as [(Hx)^T π (Hy)]^2. This is maximized by sorting Hy ascendingly. Since the centering matrix H only changes the offset but not the order this is equivalent to sorting y. We have two alternatives, since the objective function is insensitive to sign reversal of y.\n\nThis means that sorting is a special case of kernelized sorting, hence the name. 
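As an illustration of Lemma 3 (a minimal sketch of ours, not the authors' code; the helper name `empirical_hsic` is hypothetical), the biased HSIC estimate of (5) takes a few lines of numpy, and a brute-force search over permutations confirms that with linear kernels it is maximized by sorting:

```python
import itertools
import numpy as np

def empirical_hsic(K, L):
    """Biased empirical HSIC, (m-1)^{-2} tr(HKHL), as in eq. (5)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    return np.trace(H @ K @ H @ L) / (m - 1) ** 2

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=6))           # x sorted ascendingly, as in Lemma 3
y = rng.normal(size=6)
K = np.outer(x, x)                        # linear kernels: k(x,x') = xx', l(y,y') = yy'

# brute force over all 6! permutations of y
best = max(empirical_hsic(K, np.outer(y[list(p)], y[list(p)]))
           for p in itertools.permutations(range(6)))

# Lemma 3: the maximum is attained by sorting y ascendingly or descendingly
y_asc, y_dsc = np.sort(y), np.sort(y)[::-1]
lemma = max(empirical_hsic(K, np.outer(y_asc, y_asc)),
            empirical_hsic(K, np.outer(y_dsc, y_dsc)))
assert np.isclose(best, lemma)
```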
In fact, when solving the general problem, it turns out that a projection onto the principal eigenvectors of K̄ and L̄ is a good initialization of an optimization procedure.\n\n2.2 Diagonal Dominance\n\nIn some cases the biased estimate of HSIC as given in (5) leads to very undesirable results, in particular in the case of document analysis. This is the case since kernel matrices on texts tend to be diagonally dominant: a document tends to be much more similar to itself than to others. In this case the O(1/m) bias of (5) is significant. Unfortunately, the minimum variance unbiased estimator [2] does not have a computationally appealing form. This can be addressed as follows at the expense of a slightly less efficient estimator with a considerably reduced bias: we replace the expectations (4) by sums where no pairwise summation indices are identical. This leads to the objective function\n\n(1/(m(m−1))) Σ_{i≠j} Kij Lij + (1/(m^2(m−1)^2)) Σ_{i≠j, u≠v} Kij Luv − (2/(m(m−1)^2)) Σ_{i, j≠i, v≠i} Kij Liv.   (6)\n\nThis estimator still has a small degree of bias, albeit significantly reduced since it only arises from the product of expectations over (potentially) independent random variables. Using the shorthand K̃ij = Kij(1 − δij) and L̃ij = Lij(1 − δij) for kernel matrices where the main diagonal terms have been removed we arrive at the expression (m − 1)^{−2} tr HL̃HK̃. The advantage of this term is that it can be used as a drop-in replacement in Lemma 1.\n\n2.3 Mutual Information\n\nAn alternative, natural means of studying the dependence between random variables is to compute the mutual information between the random variables xi and yπ(i). In general, this is difficult, since it requires density estimation. 
However, if we assume that x and y are jointly normal in the Reproducing Kernel Hilbert Spaces spanned by the kernels k, l and k · l we can devise an effective approximation of the mutual information. Our reasoning relies on the fact that the differential entropy of a normal distribution with covariance Σ is given by\n\nh(p) = (1/2) log |Σ| + constant.   (7)\n\nSince the mutual information between random variables X and Y is I(X, Y) = h(X) + h(Y) − h(X, Y) we will obtain maximum mutual information by minimizing the joint entropy h(X, Y). Using the Gaussian upper bound on the joint entropy we can maximize a lower bound on the mutual information by minimizing the joint entropy of J(π) := h(X, Y). By defining a joint kernel on X × Y via k((x, y), (x', y')) = k(x, x')l(y, y') we arrive at the optimization problem\n\nargmin_{π ∈ Π_m} log |HJ(π)H| where Jij = Kij Lπ(i),π(j).   (8)\n\nNote that this is related to the optimization criterion proposed by Jebara [1] in the context of sorting via minimum volume PCA. What we have obtained here is an alternative derivation of Jebara's criterion based on information theoretic considerations. The main difference is that [1] uses the setting to align bags of observations by optimizing log |HJ(π)H| with respect to re-ordering within each of the bags. We will discuss multi-variable alignment at a later stage.\n\nIn terms of computation (8) is considerably more expensive to optimize. As we shall see, for the optimization in Lemma 1 a simple iteration over linear assignment problems will lead to desirable solutions, whereas in (8) even computing derivatives is a computational challenge.\n\n3 Optimization\n\nDC Programming To find a local maximum of the matching problem we may take recourse to a well-known algorithm, namely DC Programming [4] which in machine learning is also known as the Concave Convex Procedure [5]. 
It works as follows: for a given function f(x) = g(x) − h(x), where g is convex and h is concave, a lower bound can be found by\n\nf(x) ≥ g(x0) + ⟨x − x0, ∂x g(x0)⟩ − h(x).   (9)\n\nThis lower bound is convex and it can be maximized effectively over a convex domain. Subsequently one finds a new location x0 and the entire procedure is repeated.\n\nLemma 4 The function tr K̄ π^T L̄ π is convex in π.\n\nProof Since K̄, L̄ ⪰ 0 we may factorize them as K̄ = U^T U and L̄ = V^T V, and hence we may rewrite the objective function as ||V π U^T||^2, which is clearly a convex quadratic function in π.\n\nNote that the set of feasible permutations π is constrained in a unimodular fashion, that is, the set\n\nP_m := { M ∈ R^{m×m} where Mij ≥ 0 and Σ_i Mij = 1 and Σ_j Mij = 1 }   (10)\n\nhas only integral vertices, namely admissible permutation matrices. This means that the following procedure will generate a succession of permutation matrices which will yield a local maximum for the assignment problem:\n\nπ_{i+1} = (1 − λ)π_i + λ argmax_{π ∈ P_m} [tr K̄ π^T L̄ π_i]   (11)\n\nHere we may choose λ = 1 in the last step to ensure integrality. This optimization problem is well known as a Linear Assignment Problem and effective solvers exist for it [6].\n\nLemma 5 The algorithm described in (11) for λ = 1 terminates in a finite number of steps.\n\nProof We know that the objective function may only increase for each step of (11). Moreover, the solution set of the linear assignment problem is finite. Hence the algorithm does not cycle.\n\nNonconvex Maximization When using the bias corrected version of the objective function the problem is no longer guaranteed to be convex. 
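For the convex case, the λ = 1 iteration in (11) can be sketched as follows (an illustrative sketch of ours, not the authors' implementation; we use scipy's assignment solver in place of the solver of [6], and an identity initialization rather than the PCA-based one discussed later):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def kernelized_sort(K, L, n_iter=100):
    """Iterate pi <- argmax_pi tr(Kbar pi^T Lbar pi_i) via linear assignment."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kb, Lb = H @ K @ H, H @ L @ H
    P = np.eye(m)                     # identity init; the paper suggests a PCA-based one
    for _ in range(n_iter):
        profit = Lb @ P @ Kb          # linearization of the objective at the current P
        row, col = linear_sum_assignment(-profit)   # negate: maximize total profit
        P_new = np.zeros_like(P)
        P_new[row, col] = 1.0
        if np.allclose(P_new, P):     # fixed point: local maximum reached
            break
        P = P_new
    return P

# toy demo: y is a shuffled copy of x; linear kernels as in Lemma 3
rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=10))
y = x[rng.permutation(10)]
P = kernelized_sort(np.outer(x, x), np.outer(y, y))
```

With λ = 1 each step can only increase the objective, so the loop stops at the first fixed point (cf. Lemma 5).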
In this case we need to add a line-search procedure along λ which maximizes tr HK̃H[(1 − λ)π_i + λπ̂_i]^T HL̃H[(1 − λ)π_i + λπ̂_i]. Since the function is quadratic in λ we only need to check whether the search direction remains convex in λ; otherwise we may maximize the term by solving a simple linear equation.\n\nInitialization Since quadratic assignment problems are in general NP hard we may obviously not hope to achieve an optimal solution. That said, a good initialization is critical for good estimation performance. This can be achieved by using Lemma 3. That is, if K̄ and L̄ only had rank 1, the problem could be solved by sorting X and Y in matching fashion. Instead, we use the projections onto the first principal vectors as initialization in our experiments.\n\nRelaxation to a constrained eigenvalue problem Yet another alternative is to find an approximate solution of the problem in Lemma 1 by solving\n\nmaximize_η η^T M η subject to Aη = b.   (12)\n\nHere the matrix M = K̄ ⊗ L̄ ∈ R^{m²×m²} is given by the outer product of the constituting kernel matrices, η ∈ R^{m²} is a vectorized version of the permutation matrix π, and the constraints imposed by A and b amount to the polytope constraints imposed by Π_m. This is essentially the approach proposed by [7] in the context of balanced graph matching, albeit with a suboptimal optimization procedure. Instead, one may use the exact algorithm proposed by [8].\n\nThe problem with the relaxation (12) is that it does not scale well to large estimation problems (the size of the optimization problem scales O(m^4)) and that the relaxation does not guarantee a feasible solution which means that subsequent projection heuristics need to be found. 
Hence we did not pursue this approach in our experiments.\n\n4 Multivariate Extensions\n\nA natural extension is to align several sets of observations. For this purpose we need to introduce a multivariate version of the Hilbert Schmidt Independence Criterion. One way of achieving this goal is to compute the Hilbert Space norm of the difference between the expectation operator for the joint distribution and the expectation operator for the product of the marginal distributions.\n\nFormally, let there be T random variables xi ∈ Xi which are jointly drawn from some distribution p(x1, . . . , xT). Moreover, denote by ki : Xi × Xi → R the corresponding kernels. In this case we can define a kernel on X1 ⊗ . . . ⊗ XT by k1 · . . . · kT. The expectation operator with respect to the joint distribution and with respect to the product of the marginals is given by [2]\n\nE_{x1,...,xT} [ Π_{i=1}^T ki(xi, ·) ]  and  Π_{i=1}^T E_{xi} [ki(xi, ·)]   (13)\n\nrespectively. Both terms are equal if and only if all random variables are independent. The squared difference between both is given by\n\nE_{x_1^T, x'_1^T} [ Π_{i=1}^T ki(xi, x'i) ] + Π_{i=1}^T E_{xi, x'i} [ki(xi, x'i)] − 2 E_{x_1^T} [ Π_{i=1}^T E_{x'i} [ki(xi, x'i)] ],   (14)\n\nwhich we refer to as multiway HSIC. A biased empirical estimate of the above is obtained by replacing expectations by empirical averages. Denote by Ki the kernel matrix obtained from the kernel ki on the set of observations Xi := {xi1, . . . , xim}. In this case the empirical estimate of (14) is given by\n\nHSIC[X1, . . .
, XT] := 1_m^T (⊙_{i=1}^T Ki) 1_m + Π_{i=1}^T 1_m^T Ki 1_m − 2 · 1_m^T (⊙_{i=1}^T Ki 1_m),   (15)\n\nwhere ⊙_{i=1}^T denotes the elementwise product of its arguments (the '.*' notation of Matlab). To apply this to sorting we only need to define T permutation matrices πi ∈ Π_m and replace the kernel matrices Ki by πi^T Ki πi. Without loss of generality we may set π1 = 1, since we always have the freedom to fix the order of one of the T sets with respect to which the other sets are to be ordered. In terms of optimization the same considerations as presented in Section 3 apply. That is, the objective function is convex in the permutation matrices πi and we may apply DC programming to find a locally optimal solution. The experimental results for multiway HSIC can be found in the appendix.\n\n5 Applications\n\nTo investigate the performance of our algorithm (it is a fairly nonstandard unsupervised method) we applied it to a variety of different problems ranging from visualization to matching and estimation. In all our experiments, the maximum number of iterations used in the updates of π is 100 and we terminate early if progress is less than 0.001% of the objective function.\n\n5.1 Data Visualization\n\nIn many cases we may want to visualize data according to the metric structure inherent in it. In particular, we want to align it according to a given template, such as a grid, a torus, or any other fixed structure. Such problems occur when presenting images or documents to a user. While there is a large number of algorithms for low dimensional object layout (self organizing maps, maximum variance unfolding, local-linear embedding, generative topographic map, . . . ), most of them suffer from the problem that the low dimensional presentation is nonuniform. 
This has the advantage of revealing cluster structure but given limited screen size the presentation is undesirable. Instead, we may use kernelized sorting to align objects. Here the kernel matrix L is given by the similarity measure between the objects xi that are to be aligned. The kernel K, on the other hand, denotes the similarity between the locations where objects are to be aligned to. For the sake of simplicity we used a Gaussian RBF kernel between the objects to be laid out and also between the positions of the grid, i.e. k(x, x') = exp(−γ ||x − x'||^2).\n\nFigure 1: Layout of 284 images into a 'NIPS 2008' letter grid using kernelized sorting.\n\nThe kernel width γ was adjusted to the inverse median of ||x − x'||^2 such that the argument of the exponential is O(1). Our choice of the Gaussian RBF kernel is likely not optimal for the specific set of observations (e.g. SIFT feature extraction followed by a set kernel would be much more appropriate for images). That said we want to emphasize that the gains arise from the algorithm rather than a specific choice of a function class.\n\nWe obtained 284 images from http://www.flickr.com which were resized and downsampled to 40 × 40 pixels. We converted the images from RGB into Lab color space, yielding 40 × 40 × 3 dimensional objects. The grid, corresponding to X, is formed by the letters 'NIPS 2008', on which the images are to be laid out. After sorting we display the images according to their matching coordinates (Figure 1). We can see that images with similar color composition are found at proximal locations. We also lay out the images (we add 36 images to make the number 320) into a 2D grid of a 16 × 20 mesh using kernelized sorting. For comparison we use a Self-Organizing Map (SOM) and a Generative Topographic Mapping (GTM) and the results are shown in the appendix. 
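The median adjustment of the kernel width described above can be sketched as follows (our generic sketch, assuming numpy; the function name `rbf_kernel_median` is hypothetical):

```python
import numpy as np

def rbf_kernel_median(X):
    """Gaussian RBF kernel k(x, x') = exp(-gamma ||x - x'||^2), where gamma is
    set to the inverse median of the pairwise squared distances so that the
    argument of the exponential is O(1)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise ||x - x'||^2
    d2 = np.maximum(d2, 0.0)                          # guard against round-off
    gamma = 1.0 / np.median(d2[np.triu_indices_from(d2, k=1)])
    return np.exp(-gamma * d2)
```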
Although the images are also arranged according to the color grading, the drawback of SOM (and GTM) is that it creates blank spaces in the layout. This is because SOM maps several images into the same neuron. Hence some neurons may not have data associated with them. While SOM is excellent in grouping similar images together, it falls short in exactly arranging the images into a 2D grid.\n\n5.2 Matching\n\nTo obtain more quantifiable results rather than just generally aesthetically pleasing pictures we apply our algorithm to matching problems where the correct match is known.\n\nImage matching: Our first test was to match image halves. For this purpose we used the data from the layout experiment and we cut the images into two 20 × 40 pixel patches. The aim was to find an alignment between both halves such that the dependence between them is maximized. In other words, given xi being the left half of the image and yi being the right half, we want to find a permutation π which lines up xi and yi. This would be a trivial undertaking when being able to compare the two image halves xi and yi. While such comparison is clearly feasible for images where we know the compatibility function, it may not be possible for generic objects. The figure is presented in the appendix. For a total of 320 images we recovered 140 pairs. This is quite respectable given that chance level would be 1 correct pair (a random permutation matrix has on expectation one nonzero diagonal entry).\n\nEstimation In a next experiment we aim to determine how good the overall quality of the matches is. That is, whether the objects matched share similar properties. For this purpose we used binary, multiclass, and regression datasets from the UCI repository http://archive.ics.uci.edu/ml and the LibSVM site http://www.csie.ntu.edu.tw/~cjlin/libsvmtools.\n\nTable 1: Error rate for matching problems\n\nType        Data set       m     Kernelized Sorting   Baseline   Reference\nBinary      australian     690   0.29±0.02            0.49       0.21±0.04\n            breastcancer   683   0.06±0.01            0.46       0.06±0.03\n            derm           358   0.08±0.01            0.43       0.00±0.00\n            optdigits      765   0.01±0.00            0.49       0.01±0.00\n            wdbc           569   0.11±0.04            0.47       0.05±0.02\nMulticlass  satimage       620   0.20±0.01            0.80       0.13±0.04\n            segment        693   0.58±0.02            0.86       0.05±0.02\n            vehicle        423   0.58±0.08            0.75       0.24±0.07\nRegression  abalone        417   13.9±1.70            18.7       6.44±3.14\n            bodyfat        252   4.5±0.37             7.20       3.80±0.76\n\nTable 2: Number of correct matches (out of 300) for English aligned documents.\n\nSource language            De    Es    Pt    Sv    Fr    Da    It    Nl\nKernelized Sorting         95    218   252   150   246   230   237   223\nBaseline (length match)    4     12    9     6     8     6     11    7\nReference (dictionary)     284   298   298   296   298   297   300   298\n\nIn our setup we split the dimensions of the data into two sets and permute the data in the second set. The so-generated two datasets are then matched and we use the estimation error to quantify the quality of the match. That is, assume that yi is associated with the observation xi. In this case we compare yi and yπ(i) using binary classification, multiclass, or regression loss accordingly. To ensure good dependence between the subsets of variables we choose a split which ensures correlation. This is achieved as follows: we pick the dimension with the largest correlation coefficient as a reference. We then choose the coordinates that have at least 0.5 correlation with the reference and split those equally into two sets, set A and set B. 
We also split the remainder coordinates equally into the two existing sets and finally put the reference coordinate into set A. This ensures that the set B of dimensions will have strong correlation with at least one dimension in the set A. The listing of the set members for different datasets can be found in the appendix.\n\nThe results are summarized in Table 1. As before, we use a Gaussian RBF kernel with median adjustment of the kernel width. To obtain statistically meaningful results we subsample 80% of the data 10 times and compute the error of the match on the subset (this is done in lieu of cross-validation since the latter is meaningless for matching). As baseline we compute the expected performance of random permutations which can be done exactly. Finally, as reference we use SVM classification / regression with results obtained by 10-fold cross-validation. Matching is able to retrieve significant information about the labels of the corresponding classes, in some cases performing as well as a full classification approach.\n\nMultilingual Document Matching To illustrate that kernelized sorting is able to recover nontrivial similarity relations we applied our algorithm to the matching of multilingual documents. For this purpose we used the Europarl Parallel Corpus. It is a collection of the proceedings of the European Parliament, dating back to 1996 [9]. We select the 300 longest documents of Danish (Da), Dutch (Nl), English (En), French (Fr), German (De), Italian (It), Portuguese (Pt), Spanish (Es), and Swedish (Sv). The purpose is to match the non-English documents (source languages) to their English translations (target language). 
Note that our algorithm does not require a cross-language dictionary. In fact, one could use kernelized sorting to generate a dictionary after initial matching has occurred. In keeping with the choice of a simple kernel we used standard TF-IDF (term frequency - inverse document frequency) features of a bag of words kernel. As preprocessing we remove stopwords (via NLTK) and perform stemming using http://snowball.tartarus.org. Finally, the feature vectors are normalized to unit length in terms of the ℓ2 norm. Since kernel matrices on documents are notoriously diagonally dominant we use the bias-corrected version of our optimization problem.\n\nAs baseline we used a fairly straightforward means of document matching via its length. That is, longer documents in one language will most probably be translated into longer documents in the other language. This observation has also been used in the widely adopted sentence alignment method [10]. As a dictionary-based alternative we translate the documents using Google's translation engine http://translate.google.com to find counterparts in the source language. Smallest distance matches in combination with a linear assignment solver are used for the matching.\n\nThe experimental results are summarized in Table 2. We describe a line search procedure in Section 3. In practice we find that fixing λ at a given step size and choosing the best solution in terms of the objective function for λ ∈ {0.1, 0.2, . . . , 1.0} works better. Further details can be found in the appendix. Low matching performance for the document length-based method might be due to small variance in the document length after we choose the 300 longest documents. The dictionary-based method gives near-to-perfect matching performance. Further, in forming the dictionary we do not perform stemming on English words and thus the dictionary is highly customized to the problem at hand. 
Our method produces results consistent with the dictionary-based method, with notably low performance for matching German documents to their English translations. We conclude that the difficulty of German-English document matching is inherent to this dataset [9]. Arguably the results are quite encouraging as our method uses only a within class similarity measure while still matching more than 2/3 of what is possible by a dictionary-based method.\n\n6 Summary and Discussion\n\nIn this paper, we generalized sorting by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This way we are able to perform matching without the need of a cross-domain similarity measure. The proposed sorting algorithm is efficient and it can be applied to a variety of different problems ranging from data visualization to image and multilingual document matching and estimation. Further examples of kernelized sorting and of reference algorithms are given in the appendix.\n\nAcknowledgments NICTA is funded through the Australian Government's Backing Australia's Ability initiative, in part through the ARC. This research was supported by the Pascal Network. Parts of this work were done while LS and AJS were working at NICTA.\n\nReferences\n\n[1] T. Jebara. Kernelizing sorting, permutation, and alignment for minimum volume PCA. In Conference on Computational Learning Theory (COLT), volume 3120 of LNAI, pages 609–623. Springer, 2004.\n\n[2] A.J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In E. Takimoto, editor, Algorithmic Learning Theory, Lecture Notes in Computer Science. Springer, 2007.\n\n[3] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73–99, 2004.\n\n[4] T. Pham Dinh and L. Hoai An. A D.C. 
optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8(2):476–505, 1998.\n\n[5] A.L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.\n\n[6] R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325–340, 1987.\n\n[7] T. Cour, P. Srinivasan, and J. Shi. Balanced graph matching. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, pages 313–320. MIT Press, December 2006.\n\n[8] W. Gander, G.H. Golub, and U. von Matt. A constrained eigenvalue problem. Linear Algebra Appl., 114-115:815–839, 1989.\n\n[9] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, pages 79–86, 2005.\n\n[10] W. A. Gale and K. W. Church. A program for aligning sentences in bilingual corpora. In Meeting of the Association for Computational Linguistics, pages 177–184, 1991.\n", "award": [], "sourceid": 552, "authors": [{"given_name": "Novi", "family_name": "Quadrianto", "institution": null}, {"given_name": "Le", "family_name": "Song", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}