{"title": "Classification on Pairwise Proximity Data", "book": "Advances in Neural Information Processing Systems", "page_first": 438, "page_last": 444, "abstract": null, "full_text": "Classification on Pairwise Proximity Data \n\nThore Graepelt , Ralf Herbrichi , \n\nPeter Bollmann-Sdorrat , Klaus Obermayert \n\nTechnical University of Berlin, \n\nt Statistics Research Group, Sekr. FR 6-9, \n\nt Neural Information Processing Group, Sekr. FR 2-1 , \n\nFranklinstr. 28/29, 10587 Berlin, Germany \n\nAbstract \n\nWe investigate the problem of learning a classification task on data \nrepresented in terms of their pairwise proximities. This representa(cid:173)\ntion does not refer to an explicit feature representation of the data \nitems and is thus more general than the standard approach of us(cid:173)\ning Euclidean feature vectors, from which pairwise proximities can \nalways be calculated. Our first approach is based on a combined \nlinear embedding and classification procedure resulting in an ex(cid:173)\ntension of the Optimal Hyperplane algorithm to pseudo-Euclidean \ndata. As an alternative we present another approach based on a \nlinear threshold model in the proximity values themselves, which is \noptimized using Structural Risk Minimization. We show that prior \nknowledge about the problem can be incorporated by the choice of \ndistance measures and examine different metrics W.r.t. their gener(cid:173)\nalization. Finally, the algorithms are successfully applied to protein \nstructure data and to data from the cat's cerebral cortex. They \nshow better performance than K-nearest-neighbor classification. \n\n1 \n\nIntroduction \n\nIn most areas of pattern recognition, machine learning, and neural computation it \nhas become common practice to represent data as feature vectors in a Euclidean \nvector space. 
This kind of representation is very convenient because the Euclidean vector space offers powerful analytical tools for data analysis not available in other representations. However, such a representation incorporates assumptions about the data that may not hold and of which the practitioner may not even be aware. And - an even more severe restriction - no domain-independent procedures for the construction of features are known [3]. \n\nA more general approach to the characterization of a set of data items is to define a proximity or distance measure between data items - not necessarily given as feature vectors - and to provide a learning algorithm with a proximity matrix of a set of training data. Since pairwise proximity measures can be defined on structured objects like graphs, this procedure provides a bridge between the classical and the \"structural\" approaches to pattern recognition [3]. Additionally, pairwise data occur frequently in empirical sciences like psychology, psychophysics, economics, biochemistry etc., and most of the algorithms developed for this kind of data - predominantly clustering [5, 4] and multidimensional scaling [8, 6] - fall into the realm of unsupervised learning. \n\nIn contrast to nearest-neighbor classification schemes [10] we suggest algorithms which operate on the given proximity data via linear models. After a brief discussion of different kinds of proximity data in terms of possible embeddings, we suggest how the Optimal Hyperplane (OHC) algorithm for classification [2, 9] can be applied to distance data from both Euclidean and pseudo-Euclidean spaces. Subsequently, a more general model is introduced which is formulated as a linear threshold model on the proximities, and is optimized using the principle of Structural Risk Minimization [9]. 
We demonstrate how the choice of proximity measure influences the generalization behavior of the algorithm and apply both algorithms to real-world data from biochemistry and neuroanatomy. \n\n2 The Nature of Proximity Data \n\nWhen faced with proximity data in the form of a matrix P = {p_ij} of pairwise proximity values between data items, one idea is to embed the data in a suitable space for visualization and analysis. This is referred to as multidimensional scaling, and Torgerson [8] suggested a procedure for the linear embedding of proximity data. Interpreting the proximities as Euclidean distances in some unknown Euclidean space, one can calculate an inner product matrix H = X^T X w.r.t. the center of mass of the data from the proximities according to [8] \n\n(H)_ij = -1/2 ( p_ij^2 - (1/ℓ) Σ_{m=1}^ℓ p_mj^2 - (1/ℓ) Σ_{n=1}^ℓ p_in^2 + (1/ℓ^2) Σ_{m,n=1}^ℓ p_mn^2 ) .  (1) \n\nLet us perform a spectral decomposition H = U D U^T = X^T X and choose D and U such that their columns are sorted in decreasing order of magnitude of the eigenvalues λ_i of H. The embedding in an n-dimensional space is achieved by calculating the first n rows of X = D^{1/2} U^T. In order to embed a new data item characterized by a vector p consisting of its pairwise proximities p_i w.r.t. the previously known data items, one calculates the corresponding inner product vector h using (1) with (H)_ij, p_ij, and p_mj replaced by h_i, p_i, and p_m respectively, and then obtains the embedding x = D^{-1/2} U^T h. \n\nThe matrix H has negative eigenvalues if the distance data P were not Euclidean. Then the data can be isometrically embedded only in a pseudo-Euclidean or Minkowski space R^(n+,n-), equipped with a bilinear form Φ, which is not positive definite. In this case the distance measure takes the form p(x_i, x_j) = 
\nsqrt(Φ(x_i - x_j)) = sqrt((x_i - x_j)^T M (x_i - x_j)), where M is any n × n symmetric matrix assumed to have full rank, but not necessarily positive definite. However, we can always find a basis such that the matrix M assumes the form M = diag(I_{n+}, -I_{n-}) with n = n+ + n-, where the pair (n+, n-) is called the signature of the pseudo-Euclidean space [3]. Also in this case (1) serves to reconstruct the symmetric bilinear form, and the embedding proceeds as above with D replaced by D̄, whose diagonal contains the moduli of the eigenvalues of H. \n\nFrom the eigenvalue spectrum of H the effective dimensionality of the proximity preserving embedding can be obtained. (i) If there is only a small number of large positive eigenvalues, the data items can be reasonably embedded in a Euclidean space. (ii) If there is a small number of positive and negative eigenvalues of large absolute value, then an embedding in a pseudo-Euclidean space is possible. (iii) If the spectrum is continuous and relatively flat, then no linear embedding is possible in less than ℓ - 1 dimensions. \n\n3 Classification in Euclidean and Pseudo-Euclidean Space \n\nLet the training set S be given by an ℓ × ℓ matrix P of pairwise distances of unknown data vectors x in a Euclidean space, and a target class y_i ∈ {-1, +1} for each data item. Assuming that the data are linearly separable, we follow the OHC algorithm [2] and set up a linear model for the classification in data space, \n\ny(x) = sign(x^T w + b) .  (2) \n\nThen we can always find a weight vector w and threshold b such that \n\ny_i (x_i^T w + b) ≥ 1 ,  i = 1, ..., ℓ .  (3) \n\nNow the optimal hyperplane with maximal margin is found by minimizing ||w||^2 under the constraints (3). This is equivalent to maximizing the Wolfe dual W(α) w.r.t. α, \n\nW(α) = α^T 1 - (1/2) α^T Y X^T X Y α ,  (4) \n\nwith Y = diag(y), and the ℓ-vector 1. 
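The matrix X^T X that enters the dual (4) can be obtained from the distance data alone via the double centering of eq. (1). The following NumPy sketch is an illustrative reconstruction, not the authors' implementation; the function names are our own. It covers both the Gram matrix and the spectral embedding, using the moduli of the eigenvalues so that it also applies in the pseudo-Euclidean case:

```python
import numpy as np

def torgerson_gram(P):
    """Inner product matrix H = X^T X w.r.t. the center of mass,
    obtained from pairwise distances P by double centering, eq. (1)."""
    P2 = np.asarray(P, dtype=float) ** 2
    l = P2.shape[0]
    J = np.eye(l) - np.ones((l, l)) / l      # centering projection
    return -0.5 * J @ P2 @ J                 # equals the four-term sum in (1)

def embed(P, n):
    """Embed the items in n dimensions via X = D^{1/2} U^T, with the
    eigenvalues sorted by decreasing magnitude (moduli are taken, which
    covers axes of negative signature)."""
    H = torgerson_gram(P)
    lam, U = np.linalg.eigh(H)
    order = np.argsort(-np.abs(lam))[:n]     # largest |lambda_i| first
    lam, U = lam[order], U[:, order]
    X = np.sqrt(np.abs(lam))[:, None] * U.T  # n rows, one column per item
    return X, lam
```

For genuinely Euclidean distance data the recovered configuration reproduces P exactly (up to rotation); negative entries in `lam` signal the pseudo-Euclidean case.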
The constraints are α_i ≥ 0, ∀i, and 1^T Y α* = 0. Since the optimal weight vector w* can be expressed as a linear combination of training examples, \n\nw* = X Y α* ,  (5) \n\nand the optimal threshold b* is obtained by evaluating b* = y_i - x_i^T w* for any training example x_i with α_i* ≠ 0, the decision function (2) can be fully evaluated using inner products between data vectors only. This formulation allows us to learn on the distance data directly. \n\nIn the Euclidean case we can apply (1) to the distance matrix P of the training data, obtain the inner product matrix H = X^T X, and introduce it directly - without explicit embedding of the data - into the Wolfe dual (4). The same is true for the test phase, where only the inner products of the test vector with the training examples are needed. \n\nIn the case of pseudo-Euclidean distance data the inner product matrix H obtained from the distance matrix P via (1) has negative eigenvalues. This means that the corresponding data vectors can only be embedded in a pseudo-Euclidean space R^(n+,n-) as explained in the previous section. Also, H cannot serve as the Hessian in the quadratic programming (QP) problem (4). It turns out, however, that the indefiniteness of the bilinear form in pseudo-Euclidean spaces does not forestall linear classification [3]. A decision plane is characterized by the equation x^T M w = 0, as illustrated in Fig. 1. However, Fig. 1 also shows that the same plane can just as well be described by x^T w̄ = 0 - as if the space were Euclidean - where w̄ = M w is simply the mirror image of w w.r.t. the axes of negative signature. For the OHC algorithm this means that if we can reconstruct the Euclidean inner product matrix X^T X from the distance data, we can proceed with the OHC algorithm as usual. Ĥ = X^T X is calculated by \"flipping\" the axes of negative signature, i.e., with D̄ = diag(|λ_1|, ... 
, |λ_ℓ|), we can calculate Ĥ according to \n\nĤ = U D̄ U^T ,  (6) \n\nwhich serves now as the Hessian matrix for normal OHC classification. Note that Ĥ is positive semi-definite, which ensures a unique solution for the QP problem (4). \n\nFigure 1: Plot of a decision line (thick) in a 2D pseudo-Euclidean space with signature (1,1), i.e., M = diag(1, -1). The decision line is described by x^T M w = 0. When interpreted as Euclidean it is at right angles with w̄, which is the mirror image of w w.r.t. the axis x^- of negative signature. In physics this plot is referred to as a Minkowski space-time diagram, where x^+ corresponds to the space axis and x^- to the time axis. The dashed diagonal lines indicate the points x^T M x = 0 of zero length, the light cone. \n\n4 Learning a Linear Decision Function in Proximity Space \n\nIn order to cope with general proximity data (case (iii) of Section 2) let the training set S be given by an ℓ × ℓ proximity matrix P whose elements p_ij = p(x_i, x_j) are the pairwise proximity values between data items x_i, i = 1, ..., ℓ, and a target class y_i ∈ {-1, +1} for each data item. Let us assume that the proximity values satisfy reflexivity, p_ii = 0, ∀i, and symmetry, p_ij = p_ji, ∀i, j. We can make a linear model for the classification of a new data item x represented by a vector of proximities p = (p_1, ..., p_ℓ)^T, where p_i = p(x, x_i) are the proximities of x w.r.t. the items x_i in the training set, \n\ny(x) = sign(p^T w + b) .  (7) \n\nComparing (7) to (2) we note that this is equivalent to using the vector of proximities p as the feature vector x characterizing data item x. 
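The axis-flipping reconstruction (6) amounts to eigendecomposing H and replacing the eigenvalues by their moduli; a minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def flip_axes(H):
    """Eq. (6): H = U D U^T  ->  H_hat = U |D| U^T, i.e. flip the axes of
    negative signature; the result is positive semi-definite."""
    lam, U = np.linalg.eigh(H)        # H is symmetric
    return (U * np.abs(lam)) @ U.T    # U diag(|lambda|) U^T

# Example: an indefinite "inner product" matrix with eigenvalues +2 and -2.
H = np.array([[0.0, 2.0],
              [2.0, 0.0]])
H_hat = flip_axes(H)                  # eigenvalues both become +2
```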
Consequently, the OHC algorithm from the previous section can be used to learn a proximity model when x is replaced by p in (2), X^T X is replaced by P^2 in the Wolfe dual (4), and the columns p_i of P serve as the training data. \n\nNote that the formal correspondence does not imply that the columns of the proximity matrix are Euclidean feature vectors as used in the SV setting. We merely consider a linear threshold model on the proximities of a data item to all the training data items. Since the Hessian of the QP problem (4) is the square of the proximity matrix, it is always at least positive semi-definite, which guarantees a unique solution of the QP problem. Once the optimal coefficients α_i* have been found, a test data item can be classified by determining its proximities p_i from the elements x_i of the training set and by using (2) together with (5) for its classification. \n\n5 Metric Proximities \n\nLet us consider two examples in order to see what learning on pairwise metric data amounts to. The first example is the minimalistic 0-1-metric, which for two objects x_i and x_j is defined as follows: \n\np_0(x_i, x_j) = 0 if x_i = x_j, 1 otherwise .  (8) \n\nFigure 2: Decision functions in a simple two-class classification problem for different Minkowski metrics. The algorithm described in Sect. 
4 was applied with (a) the city-block metric (r = 1), (b) the Euclidean metric (r = 2), and (c) the maximum metric (r → ∞). The three metrics result in considerably different generalization behavior, and use different Support Vectors (circled). \n\nThe corresponding ℓ × ℓ proximity matrix P_0 has full rank, as can be seen from its non-vanishing determinant det(P_0) = (-1)^(ℓ-1) (ℓ - 1). From the definition of the 0-1 metric it is clear that every data item x not contained in the training set is represented by the same proximity vector p = 1, and will be assigned to the same class. For the 0-1 metric the QP problem (4) can be solved analytically by matrix inversion, and using P_0^{-1} = (ℓ - 1)^{-1} 1 1^T - I we obtain for the classification \n\ny(x) = sign( (ℓ - 1)^{-1} Σ_{i=1}^ℓ y_i ) .  (9) \n\nThis result means that each new data item is assigned to the majority class of the training sample, which is - given the available information - the Bayes optimal decision. This example demonstrates how the prior information - in the case of the 0-1 metric the minimal information of identity - is encoded in the chosen distance measure. \n\nAs an easy-to-visualize example of metric distance measures on vectors x ∈ R^n let us consider the Minkowski r-metrics defined for r ≥ 1 as \n\np_r(x_i, x_j) = ( Σ_{k=1}^n |x_ik - x_jk|^r )^{1/r} .  (10) \n\nFor r = 2 the Minkowski metric is equivalent to the Euclidean distance. The case r = 1 corresponds to the so-called city-block metric, in which the distance is given by the sum of absolute differences for each feature. On the other extreme, the maximum norm, r → ∞, takes only the largest absolute difference in feature values as the distance between objects. Note that with increasing r more weight is given to the larger differences in feature values, and that in the literature on multidimensional scaling [1] Minkowski metrics have been used to examine the dominance of features in human perception. 
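The r-metrics of (10) are easy to state in code; a small sketch with a helper of our own naming, not taken from the paper:

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski r-metric, eq. (10): r = 1 is the city-block metric,
    r = 2 the Euclidean distance, r -> infinity the maximum norm."""
    d = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(r):
        return float(d.max())             # limit r -> infinity
    return float((d ** r).sum() ** (1.0 / r))

# The 3-4-5 example: city-block 7, Euclidean 5, maximum 4.
x, y = (0.0, 0.0), (3.0, 4.0)
```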
Using the Minkowski metrics for classification in a toy example, we observed that different values of r lead to very different generalization behavior on the same set of data points, as can be seen in Fig. 2. Since there is no a priori reason to prefer one metric over the other, using a particular metric is equivalent to incorporating prior knowledge into the solution of the problem. \n\nTable 1: Classification results for Cat Cortex and Protein data. Bold numbers indicate best results. Rows: Size of Class; OHC-cut-off; OHC-flip-axis; OHC-proximity; 1-NN; 2-NN; 3-NN; 4-NN; 5-NN. Entries: 3.08 5.82 6.09 5.29 6.45 5.55 | 3.08 4.62 3.08 1.54 4.62 6.00 4.46 2.29 5.14 2.75 | 6.15 4.62 3.08 6.09 7.91 4.18 3.68 2.72 | 3.08 3.08 1.54 6.74 5.09 4.71 5.17 5.29 | 4.01 0.91 0.91 4.01 0.45 3.60 3.66 1.65 | 5.27 2.01 6.34 2.14 2.46 5.13 5.09 1.65 | 0.45 0.45 0.45 0.00 0.00 0.00 0.00 0.00 | 0.00 0.00 0.00 2.01 3.44 2.68 4.87 4.11 \n\n6 Real-World Proximity Data \n\nIn the numerical experiments we focused on two real-world data sets, which are both given in terms of a proximity matrix P and class labels y for each data item. The data set called \"cat cortex\" consists of a matrix of connection strengths between 65 cortical areas of the cat. The data was collected by Scannell [7] from text and figures of the available anatomical literature, and the connections are assigned proximity values p as follows: self-connection (p = 0), strong and dense connection (p = 1), intermediate connection (p = 2), weak connection (p = 3), and absent or unreported connection (p = 4). From functional considerations the areas can be assigned to four different regions: auditory (A), visual (V), somatosensory (SS), and frontolimbic (FL). 
The classification task is to discriminate between these four regions, each time one against the three others. \n\nThe second data set consists of a proximity matrix from the structural comparison of 224 protein sequences based upon the concept of evolutionary distance. The majority of these proteins can be assigned to one of four classes of globins: hemoglobin-α (H-α), hemoglobin-β (H-β), myoglobin (M), and heterogeneous globins (GH). The classification task is to assign proteins to one of these classes, one against the rest. \n\nWe compared three different procedures for the described two-class classification problems, performing leave-one-out cross-validation for the \"cat cortex\" data set and 10-fold cross-validation for the \"protein\" data set to estimate the generalization error. Table 1 shows the results. OHC-cut-off refers to the simple method of making the inner product matrix H positive semi-definite by neglecting projections to those eigenvectors with negative eigenvalues. OHC-flip-axis flips the axes of negative signature as described in (6) and thus preserves the information contained in those directions for classification. OHC-proximity, finally, refers to the model linear in the proximities as introduced in Section 4. It can be seen that OHC-proximity shows a better generalization than OHC-flip-axis, which in turn performs slightly better than OHC-cut-off. This is especially the case on the cat cortex data set, whose inner product matrix H has negative eigenvalues. For comparison, the lower part of Table 1 shows the corresponding cross-validation results for K-nearest-neighbor, which is a natural choice to use, because it only needs the pairwise proximities to determine the training data to participate in the voting. 
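The K-nearest-neighbor baseline operates directly on the proximity matrix. A leave-one-out sketch for the two-class case, under assumptions of our own (labels in {-1, +1}, smaller p meaning more similar, and a tie counting as an error):

```python
import numpy as np

def knn_loo_error(P, y, k):
    """Leave-one-out error rate of K-nearest-neighbor classification
    using only the pairwise proximity matrix P."""
    P, y = np.asarray(P, dtype=float), np.asarray(y)
    errors = 0
    for i in range(len(y)):
        d = P[i].copy()
        d[i] = np.inf                              # leave the item itself out
        nn = np.argsort(d)[:k]                     # k nearest training items
        errors += np.sign(y[nn].sum()) != y[i]     # majority vote; tie = error
    return errors / len(y)
```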
\nThe presented algorithms OHC-flip-axis and OHC-proximity perform consistently better than K-nearest-neighbor, even when the value of K is optimally chosen. \n\n7 Conclusion and Future Work \n\nIn this contribution we investigated the nature of proximity data and suggested ways for performing classification on them. Due to the generality of the proximity approach we expect that many other problems can be fruitfully cast into this framework. Although we focused on classification problems, regression can be considered on proximity data in an analogous way. Noting that Support Vector kernels and covariance functions for Gaussian processes are similarity measures for vector spaces, we see that this approach has recently gained a lot of popularity. However, one problem with pairwise proximities is that their number scales quadratically with the number of objects under consideration. Hence, for large scale practical applications the problems of missing data and active data selection for proximity data will be of increasing importance. \n\nAcknowledgments \n\nWe thank Prof. U. Kockelkorn for fruitful discussions. We also thank S. Gunn for providing his Support Vector implementation. Finally, we are indebted to M. Vingron and T. Hofmann for providing the protein data set. This project was funded by the Technical University of Berlin via the Forschungsinitiativprojekt FIP 13/41. \n\nReferences \n\n[1] I. Borg and J. Lingoes. Multidimensional Similarity Structure Analysis, volume 13 of Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg, 1987. \n\n[2] B. Boser, I. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152, 1992. \n\n[3] L. Goldfarb. 
Progress in Pattern Recognition, volume 2, chapter 9: A New Approach To Pattern Recognition, pages 241-402. Elsevier Science Publishers, 1985. \n\n[4] T. Graepel and K. Obermayer. A stochastic self-organizing map for proximity data. Neural Computation (accepted for publication), 1998. \n\n[5] T. Hofmann and J. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1-14, 1997. \n\n[6] H. Klock and J. M. Buhmann. Multidimensional scaling by deterministic annealing. In M. Pelillo and E. R. Hancock, editors, Energy Minimization Methods in Computer Vision and Pattern Recognition, volume 1223, pages 246-260, Berlin, Heidelberg, 1997. Springer-Verlag. \n\n[7] J. W. Scannell, C. Blakemore, and M. P. Young. Analysis of connectivity in the cat cerebral cortex. The Journal of Neuroscience, 15(2):1463-1483, 1995. \n\n[8] W. S. Torgerson. Theory and Methods of Scaling. Wiley, New York, 1958. \n\n[9] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg, Germany, 1995. \n\n[10] D. Weinshall, D. W. Jacobs, and Y. Gdalyahu. Classification in non-metric space. In Advances in Neural Information Processing Systems, volume 11, 1999. In press. \n", "award": [], "sourceid": 1571, "authors": [{"given_name": "Thore", "family_name": "Graepel", "institution": null}, {"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Peter", "family_name": "Bollmann-Sdorra", "institution": null}, {"given_name": "Klaus", "family_name": "Obermayer", "institution": null}]}