{"title": "Learning from Dyadic Data", "book": "Advances in Neural Information Processing Systems", "page_first": 466, "page_last": 472, "abstract": null, "full_text": "Learning from Dyadic Data \n\nThomas Hofmann\u00b7, Jan Puzicha+, Michael I. Jordan\u00b7 \n\n\u2022 Center for Biological and Computational Learning, M .I.T \n\n+ Institut fi.ir Informatik III , Universitat Bonn, Germany, jan@cs.uni-bonn.de \n\nCambridge, MA , {hofmann , jordan}@ai.mit.edu \n\nAbstract \n\nDyadzc data refers to a domain with two finite sets of objects in \nwhich observations are made for dyads , i.e., pairs with one element \nfrom either set. This type of data arises naturally in many ap(cid:173)\nplication ranging from computational linguistics and information \nretrieval to preference analysis and computer vision. In this paper, \nwe present a systematic, domain-independent framework of learn(cid:173)\ning from dyadic data by statistical mixture models. Our approach \ncovers different models with fiat and hierarchical latent class struc(cid:173)\ntures. We propose an annealed version of the standard EM algo(cid:173)\nrithm for model fitting which is empirically evaluated on a variety \nof data sets from different domains. \n\n1 \n\nIntroduction \n\nOver the past decade learning from data has become a highly active field of re(cid:173)\nsearch distributed over many disciplines like pattern recognition, neural compu(cid:173)\ntation , statistics, machine learning, and data mining. Most domain-independent \nlearning architectures as well as the underlying th eories of learning have been fo(cid:173)\ncusing on a feature-based data representation by vectors in an Euclidean space. For \nthis restricted case substantial progress has been achieved. However, a variety of \nimportant problems does not fit into this setting and far less advances have been \nmade for data types based on different representations. 
In this paper, we will present a general framework for unsupervised learning from dyadic data. The notion dyadic refers to a domain with two (abstract) sets of objects, X = {x_1, ..., x_N} and Y = {y_1, ..., y_M}, in which observations S are made for dyads (x_i, y_k). In the simplest case, on which we focus, an elementary observation consists just of (x_i, y_k) itself, i.e., a co-occurrence of x_i and y_k, while other cases may also provide a scalar value w_ik (strength of preference or association). Some exemplary application areas are: (i) Computational linguistics, with the corpus-based statistical analysis of word co-occurrences and applications in language modeling, word clustering, word sense disambiguation, and thesaurus construction. (ii) Text-based information retrieval, where X may correspond to a document collection, Y to keywords, and (x_i, y_k) would represent the occurrence of a term y_k in a document x_i. (iii) Modeling of preference and consumption behavior, by identifying X with individuals and Y with objects or stimuli, as in collaborative filtering. (iv) Computer vision, in particular in the context of image segmentation, where X corresponds to image locations, Y to discretized or categorical feature values, and a dyad (x_i, y_k) represents a feature y_k observed at a particular location x_i.

2 Mixture Models for Dyadic Data

Across different domains there are at least two tasks which play a fundamental role in unsupervised learning from dyadic data: (i) probabilistic modeling, i.e., learning a joint or conditional probability model over X x Y, and (ii) structure discovery, e.g., identifying clusters and data hierarchies. The key problem in probabilistic modeling is data sparseness: How can probabilities for rarely observed or even unobserved co-occurrences be reliably estimated?
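The sufficient statistics for all of the models below are the empirical dyad counts n(x_i, y_k). As a minimal illustration of this data representation (the object sets and observations here are invented for the example, not taken from the paper), the counts can be tabulated as an N x M matrix:

```python
from collections import Counter

import numpy as np

# Toy dyadic sample S: each observation is one co-occurrence (x_i, y_k).
# X (e.g. documents) and Y (e.g. terms) are hypothetical object labels.
X = ["doc1", "doc2"]
Y = ["fish", "boat", "tree"]
S = [("doc1", "fish"), ("doc1", "fish"), ("doc1", "boat"),
     ("doc2", "tree"), ("doc2", "boat")]

# Empirical counts n(x_i, y_k), stored as an N x M matrix.
counts = Counter(S)
n = np.zeros((len(X), len(Y)))
for (x, y), c in counts.items():
    n[X.index(x), Y.index(y)] = c

print(n)        # rows index X, columns index Y
print(n.sum())  # total number of observations |S|
```

For realistic corpora this matrix is extremely sparse, which is exactly the estimation problem the latent class models below address.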
As an answer we propose a model-based approach and formulate latent class or mixture models. The latter have the further advantage of offering a unifying method for probabilistic modeling and structure discovery. There are at least three (four, if both variants in II. are counted) different ways of defining latent class models:

I. The most direct way is to introduce an (unobserved) mapping c : X x Y -> {c_1, ..., c_K} that partitions X x Y into K classes. This type of model is called aspect-based, and the pre-image c^{-1}(c_\alpha) is referred to as an aspect.

II. Alternatively, a class can be defined as a subset of one of the spaces X (or Y by symmetry, yielding a different model), i.e., c : X -> {c_1, ..., c_K}, which induces a unique partitioning on X x Y by c(x_i, y_k) = c(x_i). This model is referred to as one-sided clustering, and c^{-1}(c_\alpha) ⊆ X is called a cluster.

III. If latent classes are defined for both sets, c : X -> {c^x_1, ..., c^x_K} and c : Y -> {c^y_1, ..., c^y_L}, respectively, this induces a mapping c which is a K·L-partitioning of X x Y. This model is called two-sided clustering.

2.1 Aspect Model for Dyadic Data

In order to specify an aspect model we make the assumption that all co-occurrences in the sample set S are i.i.d. and that x_i and y_k are conditionally independent given the class. With parameters P(x_i|c_\alpha), P(y_k|c_\alpha) for the class-conditional distributions and prior probabilities P(c_\alpha), the complete data probability can be written as

P(S, c) = \prod_{i,k} [P(c_{ik}) P(x_i|c_{ik}) P(y_k|c_{ik})]^{n(x_i, y_k)},   (1)

where n(x_i, y_k) are the empirical counts for dyads in S and c_{ik} = c(x_i, y_k). By summing over the latent variables c, the usual mixture formulation is obtained:

P(S) = \prod_{i,k} P(x_i, y_k)^{n(x_i, y_k)},  where  P(x_i, y_k) = \sum_\alpha P(c_\alpha) P(x_i|c_\alpha) P(y_k|c_\alpha).   (2)
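The aspect model's likelihood translates directly into an EM loop over the count matrix. The following NumPy sketch is illustrative only (function name, initialization scheme, and toy counts are ours, not from the paper); the E-step computes the class posteriors and the M-step re-estimates the parameters from posterior-weighted counts:

```python
import numpy as np

rng = np.random.default_rng(0)

def aspect_model_em(n, K, iters=50):
    """EM for the aspect model P(x, y) = sum_a P(c_a) P(x|c_a) P(y|c_a).

    n : (N, M) matrix of empirical dyad counts n(x_i, y_k).
    Returns the prior P(c_a) and class-conditionals P(x|c_a), P(y|c_a).
    """
    N, M = n.shape
    Pc = np.full(K, 1.0 / K)
    Px = rng.dirichlet(np.ones(N), size=K)  # Px[a, i] = P(x_i | c_a)
    Py = rng.dirichlet(np.ones(M), size=K)  # Py[a, k] = P(y_k | c_a)
    for _ in range(iters):
        # E-step: posterior P(c_a | x_i, y_k), shape (K, N, M).
        post = Pc[:, None, None] * Px[:, :, None] * Py[:, None, :]
        post /= post.sum(axis=0, keepdims=True)
        # M-step: re-estimate from posterior-weighted counts.
        w = n[None, :, :] * post
        Pc = w.sum(axis=(1, 2)); Pc /= Pc.sum()
        Px = w.sum(axis=2); Px /= Px.sum(axis=1, keepdims=True)
        Py = w.sum(axis=1); Py /= Py.sum(axis=1, keepdims=True)
    return Pc, Px, Py

# Toy run on a 2 x 3 count matrix with K = 2 aspects.
Pc, Px, Py = aspect_model_em(np.array([[2., 1., 0.], [0., 1., 1.]]), K=2)
```

Each iteration monotonically increases the likelihood of the observed counts; the annealed variant mentioned in the abstract would additionally temper the posteriors.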
Following the standard Expectation Maximization approach for maximum likelihood estimation [Dempster et al., 1977], the E-step equations for the class posterior probabilities are given by^1

P{c_{ik} = c_\alpha} = \frac{P(c_\alpha) P(x_i|c_\alpha) P(y_k|c_\alpha)}{\sum_\beta P(c_\beta) P(x_i|c_\beta) P(y_k|c_\beta)}.   (3)

^1 In the case of multiple observations of dyads it has been assumed that each observation may have a different latent class. If only one latent class variable is introduced for each dyad, slightly different equations are obtained.

[Figure: a table of aspects, each listing the prior P(c_\alpha) together with the words of maximal P(x_i|c_\alpha) and maximal P(y_k|c_\alpha).]

Figure 1: Some aspects of the Bible (bigrams).

It is straightforward to derive the M-step re-estimation formulae

P(c_\alpha) ∝ \sum_{i,k} n(x_i, y_k) P{c_{ik} = c_\alpha},   P(x_i|c_\alpha) ∝ \sum_k n(x_i, y_k) P{c_{ik} = c_\alpha},   (4)

and an analogous equation for P(y_k|c_\alpha). By re-parameterization the aspect model can also be characterized by a cross-entropy criterion. Moreover, formal equivalence to the aggregate Markov model, independently proposed for language modeling in [Saul, Pereira, 1997], has been established (cf.
[Hofmann, Puzicha, 1998] for details).

2.2 One-Sided Clustering Model

The complete data model proposed for the one-sided clustering model is

P(S, c) = P(c) P(S|c) = ( \prod_i P(c(x_i)) ) ( \prod_{i,k} [P(x_i) P(y_k|c(x_i))]^{n(x_i, y_k)} ),   (5)

where we have made the assumption that observations (x_i, y_k) for a particular x_i are conditionally independent given c(x_i). This effectively defines the mixture

P(S) = \prod_i P(S_i),   P(S_i) = \sum_\alpha P(c_\alpha) \prod_k [P(x_i) P(y_k|c_\alpha)]^{n(x_i, y_k)},   (6)

where S_i are all observations involving x_i. Notice that co-occurrences in S_i are not independent (as they are in the aspect model), but get coupled by the (shared) latent variable c(x_i). As before, it is straightforward to derive an EM algorithm with update equations

P{c(x_i) = c_\alpha} ∝ P(c_\alpha) \prod_k P(y_k|c_\alpha)^{n(x_i, y_k)},   P(y_k|c_\alpha) ∝ \sum_i n(x_i, y_k) P{c(x_i) = c_\alpha},   (7)

and P(c_\alpha) ∝ \sum_i P{c(x_i) = c_\alpha}, P(x_i) ∝ \sum_j n(x_i, y_j). The one-sided clustering model is similar to the distributional clustering model [Pereira et al., 1993]; however, there are two important differences: (i) the number of likelihood contributions in (7) scales with the number of observations, a fact which follows from Bayes' rule, and (ii) mixing proportions are missing in the original distributional clustering model. The one-sided clustering model corresponds to an unsupervised version of the naive Bayes classifier, if we interpret Y as a feature space for objects x_i ∈ X. There are also ways to weaken the conditional independence assumption, e.g., by utilizing a mixture of tree dependency models [Meila, Jordan, 1998].

Figure 2: Exemplary segmentation results on Aerial by one-sided clustering.

2.3 Two-Sided Clustering Model

The latent variable structure of the two-sided clustering model significantly reduces the degrees of freedom in the specification of the class conditional distribution. We
propose the following complete data model

P(S, c) = \prod_{i,k} P(c(x_i)) P(c(y_k)) [P(x_i) P(y_k) \pi_{c(x_i), c(y_k)}]^{n(x_i, y_k)},   (8)

where \pi_{c^x, c^y} are cluster association parameters. In this model the latent variables in the X and Y space are coupled by the \pi-parameters. Therefore, there exists no simple mixture model representation for P(S). Skipping some of the technical details (cf. [Hofmann, Puzicha, 1998]), we obtain P(x_i) ∝ \sum_k n(x_i, y_k), P(y_k) ∝ \sum_i n(x_i, y_k), and the M-step equation

\pi_{c^x_\alpha, c^y_\gamma} = \frac{\sum_{i,k} n(x_i, y_k) P{c(x_i) = c^x_\alpha ∧ c(y_k) = c^y_\gamma}}{[\sum_i P{c(x_i) = c^x_\alpha} \sum_k n(x_i, y_k)] [\sum_k P{c(y_k) = c^y_\gamma} \sum_i n(x_i, y_k)]},   (9)

as well as P(c^x_\alpha) ∝ \sum_i P{c(x_i) = c^x_\alpha} and P(c^y_\gamma) ∝ \sum_k P{c(y_k) = c^y_\gamma}. To preserve tractability for the remaining problem of computing the posterior probabilities in the E-step, we apply a factorial approximation (mean field approximation), i.e., P{c(x_i) = c^x_\alpha ∧ c(y_k) = c^y_\gamma} ≈ P{c(x_i) = c^x_\alpha} P{c(y_k) = c^y_\gamma}. This results in the following coupled approximation equations for the marginal posterior probabilities

P{c(x_i) = c^x_\alpha} ∝ P(c^x_\alpha) exp [ \sum_k n(x_i, y_k) \sum_\gamma P{c(y_k) = c^y_\gamma} log \pi_{c^x_\alpha, c^y_\gamma} ],   (10)

and a similar equation for P{c(y_k) = c^y_\gamma}. The resulting approximate EM algorithm performs updates according to the sequence (c^x-posteriors, \pi, c^y-posteriors, \pi). Intuitively, the (probabilistic) clustering in one set is optimized in alternation for a given clustering in the other space, and vice versa. The two-sided clustering model can also be shown to maximize a mutual information criterion [Hofmann, Puzicha, 1998].
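The mean-field update for the X-side marginal posteriors, eq. (10), can be sketched as a single vectorized function. This is an illustrative implementation under our own naming and shape conventions (a log-domain computation is used for numerical stability; the toy inputs are invented):

```python
import numpy as np

def mean_field_x_update(n, prior_x, post_y, pi):
    """One mean-field update of the marginals P{c(x_i) = c^x_a}, eq. (10).

    n       : (N, M) dyad counts n(x_i, y_k)
    prior_x : (K,)   prior probabilities P(c^x_a)
    post_y  : (M, L) current posteriors P{c(y_k) = c^y_g}
    pi      : (K, L) positive cluster association parameters pi_{a,g}
    """
    # log P(c^x_a) + sum_k n(x_i,y_k) * sum_g P{c(y_k)=c^y_g} * log pi_{a,g}
    log_q = np.log(prior_x)[None, :] + n @ post_y @ np.log(pi).T
    log_q -= log_q.max(axis=1, keepdims=True)  # stabilize before exponentiating
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)    # normalize over classes a

# Toy run: N = 4 objects in X, M = 5 in Y, K = 2 and L = 3 clusters.
rng = np.random.default_rng(1)
n = rng.integers(0, 3, size=(4, 5)).astype(float)
post_y = rng.dirichlet(np.ones(3), size=5)
pi = rng.random((2, 3)) + 0.1  # kept strictly positive for the logarithm
q = mean_field_x_update(n, np.array([0.5, 0.5]), post_y, pi)
```

A symmetric function updates the Y-side posteriors, and the \pi re-estimation of eq. (9) is interleaved between the two, following the update sequence described above.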
2.4 Discussion: Aspects and Clusters

To better understand the differences between the presented models it is elucidating to systematically compare the conditional probabilities P(c_\alpha|x_i) and P(c_\alpha|y_k):

                  Aspect Model                        One-sided X Clustering               One-sided Y Clustering               Two-sided Clustering
P(c_\alpha|x_i)   P(x_i|c_\alpha)P(c_\alpha)/P(x_i)   P{c(x_i) = c_\alpha}                 P(x_i|c_\alpha)P(c_\alpha)/P(x_i)    P{c(x_i) = c^x_\alpha}
P(c_\alpha|y_k)   P(y_k|c_\alpha)P(c_\alpha)/P(y_k)   P(y_k|c_\alpha)P(c_\alpha)/P(y_k)    P{c(y_k) = c_\alpha}                 P{c(y_k) = c^y_\alpha}

As can be seen from the above table, the probabilities P(c_\alpha|x_i) and P(c_\alpha|y_k) correspond to posterior probabilities of latent variables if clusters are defined in the X- and Y-space, respectively. Otherwise, they are computed from model parameters. This is a crucial difference as, for example, the posterior probabilities are approaching