{"title": "Data Clustering by Markovian Relaxation and the Information Bottleneck Method", "book": "Advances in Neural Information Processing Systems", "page_first": 640, "page_last": 646, "abstract": null, "full_text": "Data Clustering by Markovian Relaxation and the Information Bottleneck Method \n\nNoam Slonim and Naftali Tishby \nSchool of Computer Science and Engineering and Center for Neural Computation * \nThe Hebrew University, Jerusalem, 91904 Israel \nemail: {noamm,tishby}@cs.huji.ac.il \n\nAbstract \n\nWe introduce a new, non-parametric and principled, distance-based clustering method. This method combines a pairwise-based approach with a vector-quantization method that provides a meaningful interpretation of the resulting clusters. The idea is to turn the distance matrix into a Markov process and then examine the decay of mutual information during the relaxation of this process. The clusters emerge as quasi-stable structures during this relaxation, and are then extracted using the information bottleneck method. These clusters capture the information about the initial point of the relaxation in the most effective way. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution. \n\n1 Introduction \n\nData clustering is one of the most fundamental pattern recognition problems, with numerous algorithms and applications. Yet the problem itself is ill-defined: the goal is to find a \"reasonable\" partition of data points into classes or clusters. What is meant by \"reasonable\" depends on the application, the representation of the data, and the assumptions about the origins of the data points, among other things. \n\nOne important class of clustering methods is for cases where the data is given as a matrix of pairwise distances or (dis)similarity measures.
Often these distances come from empirical measurements or some complex process, and there is no direct access to, or even a precise definition of, the distance function. In many cases this distance does not form a metric, or it may even be non-symmetric. Such data does not necessarily come as a sample of some meaningful distribution, and even the issues of generalization and sample-to-sample fluctuations are not well defined. Algorithms that use only the pairwise distances, without explicit use of the distance measure itself, employ statistical mechanics analogies [3] or collective graph-theoretical properties [6], etc. The points are then grouped based on some global criterion, such as connected components, small cuts, or minimum alignment energy. Such algorithms are sometimes computationally inefficient, and in most cases it is difficult to interpret the resulting clusters; i.e., it is hard to determine a property common to all the points in one cluster, other than that the clusters \"look reasonable\". \n\n* Work supported in part by the US-Israel Binational Science Foundation (BSF) and by the Human Frontier Science Project (HFSP). NS is supported by the Levi Eshkol grant. \n\nA second class of clustering methods is represented by the generalized vector quantization (VQ) algorithm. Here one fits a model (e.g., Gaussian distributions) to the points in each cluster, such that an average (known) distortion between the data points and their corresponding representatives is minimized. This type of algorithm may rely on theoretical frameworks, such as rate distortion theory, and provides a much better interpretation of the resulting clusters. VQ-type algorithms can also be more computationally efficient, since they require the calculation of distances, or distortions, between the data and the centroid models only, not between every pair of data points.
On the other hand, they require knowledge of the distortion function and thus make specific assumptions about the underlying structure or model of the data. \n\nIn this paper we present a new, information-theoretic combination of pairwise clustering with a meaningful and intuitive interpretation of the resulting clusters. In addition, our algorithm provides a clear and objective figure of merit for the clusters, without making any assumption about the origin or structure of the data points. \n\n2 Pairwise distances and Markovian relaxation \n\nThe first step of our algorithm is to turn the pairwise distance matrix into a Markov process, through the following simple intuition. Assign a state of a Markov chain to each of the data points, and transition probabilities between the states/points as a function of their pairwise distances. The data can thus be considered as a directed graph with the points as nodes and the pairwise distances, which need not be symmetric or form a metric, on the arcs of the graph. Distances are normally considered additive, i.e., the length of a trajectory on the graph is the sum of the arc lengths. Probabilities, on the other hand, are multiplicative for independent events, so if we want the probability of a (random) trajectory on the graph to be naturally related to its length, the transition probabilities between points should be exponential in their distance. Denoting by d(x_i, x_j) the pairwise distance between the points x_i and x_j,^1 the transition probability that our Markov chain moves from the point x_j at time t to the point x_i at time t+1, P_{i,j} \equiv p(x_i(t+1)|x_j(t)), is chosen as \n\np(x_i(t+1)|x_j(t)) \propto \exp(-\lambda d(x_i, x_j)) ,   (1) \n\nwhere \lambda^{-1} is a length scaling factor that equals the mean pairwise distance of the k nearest neighbors of the point x_i.
The details of this rescaling are not so important for the final results, and a similar exponentiation of the distances, without our probabilistic interpretation, was performed in other clustering works (see e.g. [3, 6]). A proper normalization of each row is required to turn this matrix into a stochastic transition matrix. \n\nGiven this transition matrix, one can imagine a random walk starting at every point on the graph. Specifically, the probability distribution of the positions of a random walk starting at x_j, after t time steps, is given by the j-th row of the t-th iteration of the 1-step transition matrix. Denoting by P^t the t-step transition matrix, P^t = (P)^t is indeed the t-th power of the 1-step transition probability matrix. The probability of a random walk starting at x_j at time 0 to be at x_i at time t is thus \n\np(x_i(t)|x_j(0)) = P^t_{i,j} .   (2) \n\n^1 Henceforth we take the number of data points to be n, and the point indices run implicitly from 1 to n unless stated otherwise. \n\nIf we assume that all the given pairwise distances are finite, we obtain in this way an ergodic Markov process with a single stationary distribution, denoted by \pi. This distribution is a right-eigenvector of the t-step transition matrix (for every t), since \pi_i = \sum_j P_{i,j} \pi_j. It is also the limit distribution of p(x_i(t)|x_j(0)) for all j, i.e., \lim_{t \to \infty} p(x_i(t)|x_j(0)) = \pi_i. During the dynamics of the Markov process any initial state distribution relaxes to this final stationary distribution, and the information about the initial point of a random walk is completely lost.
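As an illustration, the construction of Eqs. (1)-(2) can be sketched in a few lines of code. This is a minimal sketch under our own conventions (the function names are ours, row j of P holds p(.|x_j), the diagonal of D is assumed to be zero, and the per-point scale \lambda is taken from the k nearest neighbors as described):

```python
import numpy as np

def markov_from_distances(D, k=3):
    """Turn a pairwise distance matrix D (n x n, zero diagonal) into a
    stochastic transition matrix with P[j, i] ~ exp(-lambda_j * D[j, i]),
    as in Eq. (1). lambda^{-1} is the mean distance to the k nearest
    neighbors, and each row is normalized to sum to 1."""
    # per-point length scale: mean distance to the k nearest neighbors,
    # excluding the point itself (first sorted entry, the zero diagonal)
    knn = np.sort(D, axis=1)[:, 1:k + 1]
    lam = 1.0 / knn.mean(axis=1)
    P = np.exp(-lam[:, None] * D)          # unnormalized transition weights
    P /= P.sum(axis=1, keepdims=True)      # row-stochastic normalization
    return P

def relax(P, t):
    """t-step transition matrix P^t of Eq. (2), computed as a matrix power."""
    return np.linalg.matrix_power(P, t)
```

Since all weights exp(-lambda d) are strictly positive, the resulting chain is ergodic, and the rows of relax(P, t) all approach the same stationary distribution as t grows.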
Figure 1: On the left is shown an example of data, consisting of 150 points in 2D. In the middle we plot the rate of information loss, -dI(t)/dt, during the relaxation. Notice that the algorithm has no prior information about circles or ellipses. The rate of the information loss is slow when the \"random walks\" stabilize on some substructures of the data, our proposed clusters. On the right we plot the rate of information loss for the colon cancer data, and the accuracy of the obtained clusters with respect to the original classes, for different relaxation times. \n\n2.1 Relaxation of the mutual information \n\nThe natural way to quantify the information loss during this relaxation process is by the mutual information between the initial point variable, X(0) = {x_j(0)}, and the point of the random walk at time t, X(t) = {x_i(t)}. The mutual information between the random variables X and Y is the symmetric functional of their joint distribution, \n\nI(X;Y) = \sum_{x \in X, y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = \sum_{x \in X, y \in Y} p(x) p(y|x) \log \frac{p(y|x)}{p(y)} .   (3) \n\nFor the Markovian relaxation this mutual information is given by \n\nI(t) \equiv I(X(0);X(t)) = \sum_j p_j \sum_i P^t_{i,j} \log \frac{P^t_{i,j}}{p^t_i} = \sum_j p_j D_{KL}[P^t_{\cdot,j} \| p^t] ,   (4) \n\nwhere p_j is the prior probability of the states and p^t_i = \sum_j P^t_{i,j} p_j is the unconditioned probability of x_i at time t. D_{KL} is the Kullback-Leibler divergence [4], defined as D_{KL}[p \| q] \equiv \sum_y p(y) \log \frac{p(y)}{q(y)}, which is the information-theoretic measure of similarity between distributions. Since all the rows of P^t relax to \pi, this divergence goes to zero as t \to \infty. While it is clear that the information about the initial point, I(t), decays monotonically (asymptotically exponentially) to zero, the rate of this decay at finite t conveys much information about the structure of the data points.
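Eq. (4) translates directly into code. The following sketch (the function name is ours; a uniform prior p_j = 1/n is assumed by default, P is assumed strictly positive, i.e., ergodic, and row j of P holds p(.|x_j)) evaluates I(t) at t = 1, 2, 4, ... by repeated squaring, so that the slow-decay plateaus of -dI(t)/dt can be read off:

```python
import numpy as np

def information_decay(P, log2_tmax=10, prior=None):
    """I(t) = sum_j p_j D_KL[ P^t(.|x_j) || p^t ]  (Eq. 4), in bits,
    evaluated at t = 1, 2, 4, ..., 2^log2_tmax via repeated squaring."""
    n = P.shape[0]
    p = np.full(n, 1.0 / n) if prior is None else prior
    Pt = P.copy()
    ts, info = [], []
    for s in range(log2_tmax + 1):
        marg = p @ Pt                          # unconditioned p^t_i = sum_j p_j P^t[j, i]
        safe = np.where(Pt > 0, Pt, 1.0)       # avoid log(0); those terms contribute 0
        kl = np.sum(np.where(Pt > 0,
                             Pt * (np.log2(safe) - np.log2(marg)),
                             0.0), axis=1)
        ts.append(2 ** s)
        info.append(float(p @ kl))
        Pt = Pt @ Pt                           # squaring: P^t -> P^(2t)
    return np.array(ts), np.array(info)
```

By the data processing inequality for the Markov chain X(0) -> X(t) -> X(2t), the returned sequence I(t) is nonincreasing in t.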
Consider, as a simple example, the planar data points shown in figure 1, with d(x_i,x_j) = (x_i - x_j)^2 + (y_i - y_j)^2. As can be seen, the rate of information loss about the initial point of the random walk, -dI(t)/dt, while always positive, slows down at specific times during the relaxation. These relaxation locations indicate the formation of quasi-stable structures on the graph. At these relaxation times the transition probability matrix is approximately a projection matrix (satisfying P^{2t} \approx P^t), where the almost invariant subgraphs correspond to the clusters. These approximately stationary transitions correspond to slow information loss, which can be identified through derivatives of the information loss at time t. Another way to see this phenomenon is by observing the rows of P^t, which are the conditional distributions p(x_i(t)|x_j(0)). The rows that are almost indistinguishable, following the partial relaxation, correspond to points x_j with similar conditional distributions over the rest of the graph at time t. Such points should belong to the same structure, or cluster, on the graph. This can be seen directly by observing the matrix P^t during the relaxation, as shown in figure 2. The quasi-stable structures on the graph, during the relaxation process, are precisely the desired meaningful clusters. \n\nFigure 2: The relaxation process as seen directly on the matrix P^t, at times t = 2^0, 2^3, 2^8, 2^{10}, for the example data of figure 1. The darker colors correspond to higher probability density in every row. Since the points are ordered by the 3 ellipses, 50 in each ellipse, it is easy to see the clear emergence of 3 blocks of conditional distributions, the rows of the matrix, during the relaxation process. For very large t there is complete relaxation and all the rows equal the stationary distribution of the process. The best correlation between the resulting clusters and the original ellipses (i.e., the highest \"accuracy\" value) is obtained at intermediate times, where the underlying structure emerges. \n\nThe remaining question pertains to the correct way to group the initial points into clusters that capture the information about the position on the graph after t steps. In other words, can we replace the initial point with an initial cluster that enables prediction of the location on the graph at time t with similar accuracy? The answer to this question is naturally provided by the recently introduced information bottleneck method [12, 11]. \n\n3 Clusters that preserve information \n\nThe problem of self-organization of the members of a set X, based on the similarity of the conditional distributions of the members of another set Y, {p(y|x)}, was first introduced in [9] and was termed \"distributional clustering\". \n\nThis question was recently shown in [12] to be a specific case of a much more fundamental problem: what are the features of the variable X that are relevant for the prediction of another, relevance, variable Y? This general problem was shown to have a natural information-theoretic formulation: find a compressed representation of the variable X, denoted \tilde{X}, such that the mutual information between \tilde{X} and Y, I(\tilde{X};Y), is as high as possible, under a constraint on the mutual information between X and \tilde{X}, I(X;\tilde{X}).
Surprisingly, this variational principle yields an exact formal solution for the conditional distributions p(y|\tilde{x}), p(\tilde{x}|x), and p(\tilde{x}). This constrained information optimization problem was called in [12] The Information Bottleneck Method. \n\nThe original approach to the solution of the resulting equations, used already in [9], was based on an analogy with the \"deterministic annealing\" (DA) approach to clustering (see [10, 8]). This is a top-down hierarchical algorithm that starts from a single cluster and undergoes a cascade of cluster splits, determined stochastically (as phase transitions), into a \"soft\" (fuzzy) tree of clusters. We proposed an alternative approach, based on greedy bottom-up merging, the \"Agglomerative Information Bottleneck\" (AIB, see [11]), which is simpler and works better than the DA approach in many situations. This algorithm was applied also in the examples given here. \n\n3.1 The information bottleneck method \n\nGiven any two non-independent random variables, X and Y, the objective of the information bottleneck method is to extract a compact representation of the variable X, denoted here by \tilde{X}, with minimal loss of mutual information about another, relevance, variable Y. More specifically, we want to find a (possibly stochastic) map, p(\tilde{x}|x), that maximizes the mutual information to the relevance variable, I(\tilde{X};Y), under a constraint on the (lossy) coding length of X via \tilde{X}, I(X;\tilde{X}). In other words, we want to find an efficient representation of the variable X, \tilde{X}, such that the predictions of Y from X through \tilde{X} will be as close as possible to the direct prediction of Y from X.
As shown in [12], by introducing a positive Lagrange multiplier \beta to enforce the mutual information constraint, the problem amounts to maximization of the Lagrangian \n\n\mathcal{L}[p(\tilde{x}|x)] = I(\tilde{X};Y) - \beta^{-1} I(X;\tilde{X}) ,   (5) \n\nwith respect to p(\tilde{x}|x), subject to the Markov condition \tilde{X} \to X \to Y and normalization. This variational problem yields directly the following (self-consistent) equations for the map p(\tilde{x}|x), and for p(y|\tilde{x}) and p(\tilde{x}): \n\np(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(\beta,x)} \exp(-\beta D_{KL}[p(y|x) \| p(y|\tilde{x})]) \np(y|\tilde{x}) = \frac{1}{p(\tilde{x})} \sum_x p(y|x) p(\tilde{x}|x) p(x) \np(\tilde{x}) = \sum_x p(\tilde{x}|x) p(x)   (6) \n\nwhere Z(\beta,x) is a normalization function. The familiar Kullback-Leibler divergence, D_{KL}[p(y|x) \| p(y|\tilde{x})], emerges here from the variational principle. These equations can be solved by iterations that are proved to converge for any finite value of \beta (see [12]). The Lagrange multiplier \beta has the natural interpretation of an inverse temperature, which suggests deterministic annealing to explore the hierarchy of solutions in \tilde{X}. The variational principle, Eq. (5), also determines the shape of the annealing process, since by changing \beta the mutual informations I_X \equiv I(X;\tilde{X}) and I_Y \equiv I(\tilde{X};Y) vary such that \n\n\frac{\delta I_Y}{\delta I_X} = \beta^{-1} .
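The self-consistent equations (6) can be iterated directly. The sketch below is not the deterministic-annealing or agglomerative procedure discussed above; it is a plain fixed-point iteration at a single, fixed \beta with a random initialization and a fixed number of clusters, and all names and numerical guards are ours:

```python
import numpy as np

def ib_iterate(pxy, m, beta, iters=200, seed=0):
    """Fixed-point iteration of the self-consistent equations (6).
    pxy : joint distribution p(x, y), shape (n, k), entries summing to 1
    m   : number of clusters (cardinality of the compressed variable)
    Returns the soft assignment p(xt|x), shape (n, m), where 'xt'
    stands for the compressed variable."""
    rng = np.random.RandomState(seed)
    px = pxy.sum(axis=1)                             # p(x)
    py_x = pxy / px[:, None]                         # p(y|x)
    q = rng.dirichlet(np.ones(m), size=len(px))      # random init of p(xt|x)
    for _ in range(iters):
        qt = np.maximum(q.T @ px, 1e-12)             # p(xt), floored for safety
        py_t = (q * px[:, None]).T @ py_x / qt[:, None]   # p(y|xt)
        # D_KL[ p(y|x) || p(y|xt) ] for every pair (x, xt)
        log_ratio = (np.log(py_x + 1e-300)[:, None, :]
                     - np.log(py_t + 1e-300)[None, :, :])
        dkl = np.sum(py_x[:, None, :] * log_ratio, axis=2)
        q = qt[None, :] * np.exp(-beta * dkl)        # Eq. (6), up to Z(beta, x)
        q /= q.sum(axis=1, keepdims=True)            # normalization Z(beta, x)
    return q
```

For well-separated conditionals p(y|x) and large \beta the assignments become nearly deterministic; rerunning while gradually increasing \beta recovers the annealing picture described above.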