{"title": "Agglomerative Information Bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 617, "page_last": 623, "abstract": null, "full_text": "Agglomerative Information Bottleneck \n\nNoam Slonim \n\nNaftali Tishby* \n\nInstitute of Computer Science and \n\nCenter for Neural Computation \n\nThe Hebrew University \nJerusalem, 91904 Israel \n\nemail: {noamm.tishby}(Qcs.huji.ac.il \n\nAbstract \n\nWe introduce a novel distributional clustering algorithm that max(cid:173)\nimizes the mutual information per cluster between data and giv(cid:173)\nen categories. This algorithm can be considered as a bottom up \nhard version of the recently introduced \"Information Bottleneck \nMethod\". The algorithm is compared with the top-down soft ver(cid:173)\nsion of the information bottleneck method and a relationship be(cid:173)\ntween the hard and soft results is established. We demonstrate the \nalgorithm on the 20 Newsgroups data set. For a subset of two news(cid:173)\ngroups we achieve compression by 3 orders of magnitudes loosing \nonly 10% of the original mutual information. \n\n1 \n\nIntroduction \n\nThe problem of self-organization of the members of a set X based on the similarity \nof the conditional distributions of the members of another set, Y, {p(Ylx)}, was first \nintroduced in [8] and was termed \"distributional clustering\" . \n\nThis question was recently shown in [9] to be a special case of a much more fun(cid:173)\ndamental problem: What are the features of the variable X that are relevant for \nthe prediction of another, relevance, variable Y? This general problem was shown \nto have a natural in~ormation theoretic formulation: Find a compressed represen(cid:173)\ntation of the variable X, denoted X, such that the mutual information between X \nand Y, I (X; Y), is as high as possible, under a constraint on the mutual infor(cid:173)\nmation between X and X, I (X; X). 
Surprisingly, this variational problem yields exact self-consistent equations for the conditional distributions p(y|x̃), p(x̃|x), and p(x̃). This constrained information optimization problem was called in [9] The Information Bottleneck Method. \n\nThe original approach to the solution of the resulting equations, used already in [8], was based on an analogy with the \"deterministic annealing\" approach to clustering (see [7]). This is a top-down hierarchical algorithm that starts from a single cluster and undergoes a cascade of cluster splits, determined stochastically (as phase transitions), into a \"soft\" (fuzzy) tree of clusters. \n\nIn this paper we propose an alternative approach to the information bottleneck problem, based on a greedy bottom-up merging. It has several advantages over the top-down method. It is fully deterministic, yielding (initially) \"hard\" clusters, for any desired number of clusters. It gives higher mutual information per cluster than the deterministic annealing algorithm, and it can be considered as the hard (zero temperature) limit of deterministic annealing, for any prescribed number of clusters. Furthermore, using the bottleneck self-consistent equations one can \"soften\" the resulting hard clusters and recover the deterministic annealing solutions without the need to identify the cluster splits, which is rather tricky. The main disadvantage of this method is computational, since it starts from the limit of one cluster per each member of the set X. \n\n1.1 The information bottleneck method \n\nThe mutual information between the random variables X and Y is the symmetric functional of their joint distribution, \n\nI(X; Y) = \\sum_{x \\in X, y \\in Y} p(x, y) \\log \\frac{p(x, y)}{p(x) p(y)} = \\sum_{x \\in X, y \\in Y} p(x) p(y|x) \\log \\frac{p(y|x)}{p(y)} . 
\n(1) \n\nThe objective of the information bottleneck method is to extract a compact representation of the variable X, denoted here by X̃, with minimal loss of mutual information to another, relevance, variable Y. More specifically, we want to find a (possibly stochastic) map, p(x̃|x), that minimizes the (lossy) coding length of X via X̃, I(X; X̃), under a constraint on the mutual information to the relevance variable, I(X̃; Y). In other words, we want to find an efficient representation of the variable X, X̃, such that the predictions of Y from X through X̃ will be as close as possible to the direct prediction of Y from X. \n\nAs shown in [9], by introducing a positive Lagrange multiplier β to enforce the mutual information constraint, the problem amounts to minimization of the Lagrangian \n\n\\mathcal{L}[p(\\tilde{x}|x)] = I(X; \\tilde{X}) - \\beta I(\\tilde{X}; Y) , \n(2) \n\nwith respect to p(x̃|x), subject to the Markov condition X̃ → X → Y and normalization. \n\nThis minimization yields directly the following self-consistent equations for the map p(x̃|x), as well as for p(y|x̃) and p(x̃): \n\n\\begin{cases} p(\\tilde{x}|x) = \\frac{p(\\tilde{x})}{Z(\\beta, x)} \\exp\\left(-\\beta D_{KL}[p(y|x) \\| p(y|\\tilde{x})]\\right) \\\\ p(y|\\tilde{x}) = \\frac{1}{p(\\tilde{x})} \\sum_x p(y|x) p(\\tilde{x}|x) p(x) \\\\ p(\\tilde{x}) = \\sum_x p(\\tilde{x}|x) p(x) \\end{cases} \n(3) \n\nwhere Z(β, x) is a normalization function. The functional D_{KL}[p \\| q] \\equiv \\sum_y p(y) \\log \\frac{p(y)}{q(y)} is the Kullback-Leibler divergence [3], which emerges here from the variational principle. These equations can be solved by iterations that are proved to converge for any finite value of β (see [9]). The Lagrange multiplier β has the natural interpretation of inverse temperature, which suggests deterministic annealing [7] to explore the hierarchy of solutions in X̃, an approach taken already in [8]. \n\nThe variational principle, Eq.(2), determines also the shape of the annealing process, since by changing β the mutual informations I_X \\equiv I(X; X̃) and I_Y \\equiv I(Y; X̃) vary such that \n\n\\frac{\\delta I_Y}{\\delta I_X} = \\beta^{-1} . 
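The self-consistent equations (3) can be solved by simple alternating iterations for a fixed β. The sketch below is an illustrative NumPy implementation of that iteration, not the authors' code; the function name, the random soft initialization, and the small numerical floors are our own assumptions.

```python
import numpy as np

def ib_iterate(p_xy, n_clusters, beta, n_iter=100, seed=0):
    """Iterate the self-consistent bottleneck equations (Eq. 3) at fixed beta.

    p_xy: empirical joint distribution, shape (|X|, |Y|), summing to 1,
          with p(x) > 0 for every x (an assumption of this sketch).
    Returns the stochastic map q(x_tilde | x), shape (|X|, n_clusters).
    """
    rng = np.random.default_rng(seed)
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)                      # p(x)
    p_y_x = p_xy / p_x[:, None]                 # p(y|x)
    # random soft initialization of q(x_tilde | x)
    q = rng.random((len(p_x), n_clusters))
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = np.maximum(q.T @ p_x, 1e-300)     # p(x_tilde), floored for safety
        # p(y|x_tilde) = sum_x p(y|x) q(x_tilde|x) p(x) / p(x_tilde)
        p_y_t = (q * p_x[:, None]).T @ p_y_x / p_t[:, None]
        # D_KL[p(y|x) || p(y|x_tilde)] for every pair (x, x_tilde)
        log_ratio = (np.log(p_y_x[:, None, :] + 1e-12)
                     - np.log(p_y_t[None, :, :] + 1e-12))
        dkl = (p_y_x[:, None, :] * log_ratio).sum(axis=2)
        # q(x_tilde|x) proportional to p(x_tilde) exp(-beta * D_KL)
        q = np.maximum(p_t[None, :] * np.exp(-beta * dkl), 1e-300)
        q /= q.sum(axis=1, keepdims=True)       # the Z(beta, x) normalization
    return q
```

As β grows the returned map approaches a hard assignment, which is the limit studied in section 1.2.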
\n(4) \n\nThus the optimal curve, which is analogous to the rate distortion function in information theory [3], follows a strictly concave curve in the (I_X, I_Y) plane, called the information plane. Deterministic annealing, at a fixed number of clusters, follows such a concave curve as well, but this curve is suboptimal beyond a certain critical value of β. \n\nAnother interpretation of the bottleneck principle comes from the relation between the mutual information and the Bayes classification error. This error is bounded above and below (see [6]) by an important information theoretic measure of the class conditional distributions p(x|y_i), called the Jensen-Shannon divergence. This measure plays an important role in our context. \n\nThe Jensen-Shannon divergence of M class distributions, p_i(x), each with a prior π_i, 1 ≤ i ≤ M, is defined as [6, 4] \n\nJS_{\\Pi}[p_1, p_2, \\ldots, p_M] \\equiv H\\left[\\sum_{i=1}^{M} \\pi_i p_i(x)\\right] - \\sum_{i=1}^{M} \\pi_i H[p_i(x)] , \n(5) \n\nwhere H[p(x)] is Shannon's entropy, H[p(x)] = -\\sum_x p(x) \\log p(x). The convexity of the entropy and Jensen's inequality guarantee the non-negativity of the JS-divergence. \n\n1.2 The hard clustering limit \n\nFor any finite cardinality of the representation, |X̃| ≡ m, the limit β → ∞ of Eqs.(3) induces a hard partition of X into m disjoint subsets. In this limit each member x ∈ X belongs only to the subset x̃ ∈ X̃ for which p(y|x̃) has the smallest D_{KL}[p(y|x) \\| p(y|x̃)], and the probabilistic map p(x̃|x) obtains the limit values 0 and 1 only. \n\nIn this paper we focus on a bottom-up agglomerative algorithm for generating \"good\" hard partitions of X. We denote an m-partition of X, i.e. X̃ with cardinality m, also by Z_m = {z_1, z_2, ..., z_m}, in which case p(x̃) = p(z_i). We say that Z_m is an optimal m-partition (not necessarily unique) of X if for every other m-partition of X, Z'_m, I(Z_m; Y) ≥ I(Z'_m; Y). 
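The Jensen-Shannon divergence of Eq.(5) translates directly into code. A minimal sketch, with our own helper names:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H[p] = -sum_x p(x) log p(x), with 0 log 0 = 0 (nats)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def js_divergence(dists, priors):
    """JS_pi[p_1, ..., p_M] = H[sum_i pi_i p_i] - sum_i pi_i H[p_i]  (Eq. 5).

    dists: array of M distributions (rows); priors: M prior weights pi_i.
    """
    dists = np.asarray(dists, dtype=float)
    priors = np.asarray(priors, dtype=float)
    mixture = priors @ dists                 # the prior-weighted mixture
    return entropy(mixture) - sum(pi * entropy(p) for pi, p in zip(priors, dists))
```

By the concavity of the entropy, the result is non-negative and vanishes exactly when all M distributions coincide.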
Starting from the trivial N-partition, with N = |X|, we seek a sequence of merges into coarser and coarser partitions that are as close as possible to optimal. \n\nIt is easy to verify that in the β → ∞ limit Eqs.(3) for the m-partition distributions simplify as follows. Let x̃ ≡ z = {x_1, x_2, ..., x_{|z|}}, x_i ∈ X, denote a specific component (i.e. cluster) of the partition Z_m; then \n\n\\begin{cases} p(z|x) = \\begin{cases} 1 & \\text{if } x \\in z \\\\ 0 & \\text{otherwise} \\end{cases} \\quad \\forall x \\in X \\\\ p(y|z) = \\frac{1}{p(z)} \\sum_{i=1}^{|z|} p(x_i, y) \\quad \\forall y \\in Y \\\\ p(z) = \\sum_{i=1}^{|z|} p(x_i) \\end{cases} \n(6) \n\nUsing these distributions one can easily evaluate the mutual information between Z_m and Y, I(Z_m; Y), and between Z_m and X, I(Z_m; X), using Eq.(1). \n\nOnce any hard partition, or hard clustering, is obtained, one can apply \"reverse annealing\" and \"soften\" the clusters by decreasing β in the self-consistent equations, Eqs.(3). Using this procedure we in fact recover the stochastic map, p(x̃|x), from the hard partition without the need to identify the cluster splits. We demonstrate this reverse deterministic annealing procedure in the last section. \n\n1.3 Relation to other work \n\nA similar agglomerative procedure, without the information theoretic framework and analysis, was recently used in [1] for text categorization on the 20 newsgroups corpus. Another approach that stems from the distributional clustering algorithm was given in [5] for clustering dyadic data. An earlier application of mutual information for semantic clustering of words was given in [2]. \n\n2 The agglomerative information bottleneck algorithm \n\nThe algorithm starts with the trivial partition into N = |X| clusters or components, with each component containing exactly one element of X. At each step we merge several components of the current partition into a single new component in a way that locally minimizes the loss of the mutual information I(Z_m; Y). 
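Given a hard partition, Eqs.(6) reduce the cluster statistics to sums of rows of p(x, y), and I(Z_m; Y) then follows from Eq.(1). A sketch under our own naming, with the partition passed as a list of index lists:

```python
import numpy as np

def partition_info(p_xy, partition):
    """I(Z; Y) for a hard partition of X (Eq. 6 substituted into Eq. 1).

    p_xy: joint distribution over (X, Y), shape (|X|, |Y|).
    partition: list of clusters, each a list of x-indices.
    """
    # Eq. 6: p(z, y) is the row-sum of p(x, y) over the members of z
    p_zy = np.array([p_xy[idx].sum(axis=0) for idx in partition])
    p_z = p_zy.sum(axis=1)
    p_y = p_zy.sum(axis=0)
    nz = p_zy > 0                              # 0 log 0 = 0 convention
    ratio = p_zy[nz] / np.outer(p_z, p_y)[nz]
    return float(np.sum(p_zy[nz] * np.log(ratio)))
```

For the trivial N-partition this recovers I(X; Y) exactly, and merging everything into a single cluster gives zero, consistent with the data-processing inequality.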
\nLet Z_m be the current m-partition of X and let Z_m̄ denote the new m̄-partition of X after the merge of several components of Z_m. Obviously, m̄ < m. Let {z_1, z_2, ..., z_k} ⊆ Z_m denote the set of components to be merged, and z̄ ∈ Z_m̄ the new component that is generated by the merge, so m̄ = m − k + 1. \n\nTo evaluate the reduction in the mutual information I(Z_m; Y) due to this merge one needs the distributions that define the new m̄-partition, which are determined as follows. For every z ∈ Z_m̄, z ≠ z̄, its probability distributions (p(z), p(y|z), p(z|x)) remain equal to its distributions in Z_m. For the new component, z̄ ∈ Z_m̄, we define \n\n\\begin{cases} p(\\bar{z}) = \\sum_{i=1}^{k} p(z_i) \\\\ p(y|\\bar{z}) = \\frac{1}{p(\\bar{z})} \\sum_{i=1}^{k} p(z_i, y) \\quad \\forall y \\in Y \\\\ p(\\bar{z}|x) = \\begin{cases} 1 & \\text{if } x \\in z_i \\text{ for some } 1 \\le i \\le k \\\\ 0 & \\text{otherwise} \\end{cases} \\quad \\forall x \\in X \\end{cases} \n(7) \n\nIt is easy to verify that Z_m̄ is indeed a valid m̄-partition with proper probability distributions. \n\nUsing the same notation, for every merge we define the additional quantities: \n\n• The merge prior distribution: defined by Π_k ≡ (π_1, π_2, ..., π_k), where π_i is the prior probability of z_i in the merged subset, i.e. π_i ≡ p(z_i)/p(z̄). \n\n• The Y-information decrease: the decrease in the mutual information I(X̃; Y) due to a single merge, δI_Y(z_1, ..., z_k) ≡ I(Z_m; Y) − I(Z_m̄; Y). \n\n• The X-information decrease: the decrease in the mutual information I(X̃; X) due to a single merge, δI_X(z_1, z_2, ..., z_k) ≡ I(Z_m; X) − I(Z_m̄; X). \n\nOur algorithm is a greedy procedure, where in each step we perform \"the best possible merge\", i.e. merge the components {z_1, ..., z_k} of the current m-partition which minimize δI_Y(z_1, ..., z_k). Since δI_Y(z_1, ..., z_k) can only increase with k (corollary 2), for a greedy procedure it is enough to check only the possible merging of pairs of components of the current m-partition. 
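The paper's proposition 1 states that for a pair merge the Y-information decrease has the closed form δI_Y(z_i, z_j) = (p(z_i) + p(z_j)) · JS_{Π_2}[p(y|z_i), p(y|z_j)]. This identity is easy to check numerically against the definition δI_Y = I(Z_m; Y) − I(Z_m̄; Y); the sketch below (our own function names) does exactly that:

```python
import numpy as np

def mi(p_joint):
    """Mutual information of a joint distribution matrix (Eq. 1)."""
    pr = p_joint.sum(axis=1)
    pc = p_joint.sum(axis=0)
    nz = p_joint > 0
    return float(np.sum(p_joint[nz] * np.log(p_joint[nz] / np.outer(pr, pc)[nz])))

def delta_iy(p_xy, i, j):
    """delta I_Y for merging rows i and j, via proposition 1:
    (p(z_i) + p(z_j)) * JS_{pi_2}[p(y|z_i), p(y|z_j)]."""
    pi_, pj_ = p_xy[i].sum(), p_xy[j].sum()
    py_i, py_j = p_xy[i] / pi_, p_xy[j] / pj_
    w = pi_ / (pi_ + pj_)                      # the merge prior pi_1
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    js = h(w * py_i + (1 - w) * py_j) - w * h(py_i) - (1 - w) * h(py_j)
    return (pi_ + pj_) * js
```

The advantage, as noted in the text, is that this local form costs only O(|Y|) per candidate pair, instead of recomputing full mutual informations.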
Another advantage of merging only pairs is that in this way we go through all the possible cardinalities of Z = X̃, from N to 1. \n\nFor a given m-partition Z_m = {z_1, z_2, ..., z_m} there are m(m−1)/2 possible pairs to merge. To find \"the best possible merge\" one must evaluate the reduction of information δI_Y(z_i, z_j) = I(Z_m; Y) − I(Z_{m−1}; Y) for every pair in Z_m, which is O(m · |Y|) operations for every pair. However, using proposition 1 we know that δI_Y(z_i, z_j) = (p(z_i) + p(z_j)) · JS_{Π_2}(p(y|z_i), p(y|z_j)), so the reduction in the mutual information due to the merge of z_i and z_j can be evaluated directly (looking only at this pair) in O(|Y|) operations, a reduction by a factor of m in time complexity (for every merge). \n\nInput: Empirical probability matrix p(x, y), N = |X|, M = |Y| \nOutput: Z_m: m-partition of X into m clusters, for every 1 ≤ m ≤ N \n\nInitialization: \n• Construct Z ≡ X \n  - For i = 1...N \n    * z_i = {x_i} \n    * p(z_i) = p(x_i) \n    * p(y|z_i) = p(y|x_i) for every y ∈ Y \n    * p(z_i|x_j) = 1 if j = i and 0 otherwise \n  - Z = {z_1, ..., z_N} \n• For every i, j = 1...N, i < j, calculate \n  d_{i,j} = (p(z_i) + p(z_j)) · JS_{Π_2}[p(y|z_i), p(y|z_j)] \n  (every d_{i,j} points to the corresponding couple in Z) \n\nLoop: \n• For t = 1...(N − 1) \n  - Find {α, β} = argmin_{i,j} {d_{i,j}} \n    (if there are several minima choose arbitrarily between them) \n  - Merge {z_α, z_β} ⇒ z̄: \n    * p(z̄) = p(z_α) + p(z_β) \n    * p(y|z̄) = (1/p(z̄)) (p(z_α, y) + p(z_β, y)) for every y ∈ Y \n    * p(z̄|x) = 1 if x ∈ z_α ∪ z_β and 0 otherwise, for every x ∈ X \n  - Update Z = {Z − {z_α, z_β}} ∪ {z̄} \n    (Z is now a new (N − t)-partition of X with N − t clusters) \n  - Update d_{i,j} costs and pointers w.r.t. z̄ \n    (only for couples that contained z_α or z_β). \n• End For \n\nFigure 1: Pseudo-code of the algorithm. 
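The pseudo-code of Figure 1 is short enough to transcribe into a direct NumPy sketch. This version is illustrative rather than the authors' implementation: for brevity it recomputes all pairwise costs at every step (O(N³|Y|) overall) instead of caching d_{i,j} and updating only pairs touching the merged cluster, and all names are ours.

```python
import numpy as np

def agglomerative_ib(p_xy):
    """Greedy agglomerative information bottleneck (after Figure 1).

    p_xy: empirical joint distribution, shape (N, M), summing to 1.
    Returns the list of merges [(a, b), ...] in greedy order, where a and b
    are the member index lists of the two clusters merged at that step.
    """
    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def d(a, b):
        # d_{i,j} = (p(z_i) + p(z_j)) * JS_{pi_2}[p(y|z_i), p(y|z_j)]
        pa, pb = a.sum(), b.sum()
        w = pa / (pa + pb)
        js = (entropy((a + b) / (pa + pb))
              - w * entropy(a / pa) - (1 - w) * entropy(b / pb))
        return (pa + pb) * js

    clusters = [p_xy[i].copy() for i in range(len(p_xy))]   # p(z, y) rows
    members = [[i] for i in range(len(p_xy))]
    merges = []
    while len(clusters) > 1:
        # pick the pair whose merge loses the least Y-information
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: d(clusters[ij[0]], clusters[ij[1]]))
        merges.append((members[i], members[j]))
        clusters[i] = clusters[i] + clusters[j]             # Eq. 7 statistics
        members[i] = members[i] + members[j]
        del clusters[j], members[j]
    return merges
```

Reading the merge list backwards reproduces the full hierarchy of m-partitions for m = N, ..., 1.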
\n3 Discussion \n\nThe algorithm is non-parametric; it is a simple greedy procedure that depends only on the input empirical joint distribution of X and Y. The output of the algorithm is the hierarchy of all m-partitions Z_m of X for m = N, (N − 1), ..., 2, 1. Moreover, unlike most other clustering heuristics, it has a built-in measure of efficiency even for sub-optimal solutions, namely, the mutual information I(Z_m; Y), which bounds the Bayes classification error. The quality measure of the obtained Z_m partition is the fraction of the mutual information between X and Y that Z_m captures. This is given by the curve I(Z_m; Y)/I(X; Y) vs. m = |Z_m|. We found that empirically this curve was concave. If this is always true, the decrease in the mutual information at every step, given by δ(m) ≡ I(Z_m; Y) − I(Z_{m−1}; Y), can only increase with decreasing m. Therefore, if at some point δ(m) becomes relatively high it is an indication that we have reached a value of m with a \"meaningful\" partition or clusters. Further merging results in substantial loss of information and thus significant reduction in the performance of the clusters as features. However, since the computational cost of the final (low m) part of the procedure is very low we can just as well complete the merging to a single cluster. \n\n[Figure 2 appears here. Left panel: normalized I(Z; Y) vs. I(Z; X)/H(X) for NG1000. Right panel: normalized I(Z_m; Y) vs. |Z| for the three datasets, with an inset for NG100.] \n\nFigure 2: On the left figure the results of the agglomerative algorithm are shown in the \"information plane\", normalized I(Z; Y) vs. normalized I(Z; X), for the NG1000 dataset. It is compared to the soft version of the information bottleneck via \"reverse annealing\" for |Z| = 2, 5, 10, 20, 100 (the smooth curves on the left). 
For |Z| = 20, 100 the annealing curve is connected to the starting point by a dotted line. In this plane the hard algorithm is clearly inferior to the soft one. \nOn the right-hand side: I(Z_m; Y) of the agglomerative algorithm is plotted vs. the cardinality of the partition m for three subsets of the newsgroup dataset. To compare the performance over the different data cardinalities we normalize I(Z_m; Y) by the value of I(Z_50; Y), thus forcing all three curves to start (and end) at the same points. The predictive information on the newsgroup for NG1000 and NG100 is very similar, while for the dichotomy dataset, 2ng, a much better prediction is possible at the same |Z|, as can be expected for dichotomies. The inset presents the full curve of the normalized I(Z; Y) vs. |Z| for the NG100 data for comparison. In this plane the hard partitions are superior to the soft ones. \n\n4 Application \n\nTo evaluate the ideas and the algorithm we apply it to several subsets of the 20Newsgroups dataset, collected by Ken Lang, using 20,000 articles evenly distributed among 20 UseNet discussion groups (see [1]). We replaced every digit by a single character and used another character to mark non-alphanumeric characters. Following this pre-processing, the first dataset contained the 530 strings that appeared more than 1000 times in the data. This dataset is referred to as NG1000. Similarly, all the strings that appeared more than 100 times constitute the NG100 dataset, which contains 5148 different strings. To evaluate also a dichotomy dataset we used a corpus consisting of only two discussion groups out of the 20Newsgroups with similar topics: alt.atheism and talk.religion.misc. Using the same pre-processing, and removing strings that occur fewer than 10 times, the resulting \"lexicon\" contained 5765 different strings. We refer to this dataset as 2ng. 
\nWe plot the results of our algorithm on these three data sets in two different planes. First, the normalized information I(Z; Y)/I(X; Y) vs. the size of the partition of X (number of clusters), |Z|. The greedy procedure directly tries to maximize I(Z; Y) for a given |Z|, as can be seen by the strong concavity of these curves (figure 2, right). Indeed the procedure is able to maintain a high percentage of the relevant mutual information of the original data, while reducing the dimensionality of the \"features\", |Z|, by several orders of magnitude. \n\nOn the right hand-side of figure 2 we present a comparison between the efficiency of the procedure for the three datasets. The two-class data, consisting of 5765 different strings, is compressed by two orders of magnitude, into 50 clusters, almost without losing any of the mutual information about the newsgroups (the decrease in I(X; Y) is about 0.1%). Compression by three orders of magnitude, into 6 clusters, maintains about 90% of the original mutual information. \n\nSimilar results, even though less striking, are obtained when Y contains all 20 newsgroups. The NG100 dataset was compressed from 5148 strings to 515 clusters, keeping 86% of the mutual information, and into 50 clusters keeping about 70% of the information. About the same compression efficiency was obtained for the NG1000 dataset. \n\nThe relationship between the soft and hard clustering is demonstrated in the information plane, i.e., the normalized mutual information values I(Z; Y)/I(X; Y) vs. I(Z; X)/H(X). In this plane, the soft procedure is optimal since it is a direct maximization of I(Z; Y) while constraining I(Z; X). While the hard partition is suboptimal in this plane, as confirmed empirically, it provides an excellent starting point for reverse annealing. 
In figure 2 we present the results of the agglomerative procedure for NG1000 in the information plane, together with the reverse annealing for different values of |Z|. As predicted by the theory, the annealing curves merge at various critical values of β into the globally optimal curve, which corresponds to the \"rate distortion function\" for the information bottleneck problem. With the reverse annealing (\"heating\") procedure there is no need to identify the cluster splits as required in the original annealing (\"cooling\") procedure. As can be seen, the \"phase diagram\" is much better recovered by this procedure, suggesting a combination of agglomerative clustering and reverse annealing as the ultimate algorithm for this problem. \n\nReferences \n\n[1] L. D. Baker and A. K. McCallum. Distributional clustering of words for text classification. In ACM SIGIR 98, 1998. \n[2] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. DellaPietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992. \n[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991. \n[4] R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian sequences. In Advances in Neural Information Processing Systems (NIPS'97), 1998. \n[5] T. Hofmann, J. Puzicha, and M. Jordan. Learning from dyadic data. In Advances in Neural Information Processing Systems (NIPS'98), 1999. \n[6] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, 1991. \n[7] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210-2239, 1998. \n[8] F. C. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. 
In 30th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, pages 183-190, 1993. \n[9] N. Tishby, W. Bialek, and F. C. Pereira. The information bottleneck method: Extracting relevant information from concurrent data. Unpublished manuscript, NEC Research Institute TR, 1998. \n", "award": [], "sourceid": 1651, "authors": [{"given_name": "Noam", "family_name": "Slonim", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}