{"title": "Exponential Family Harmoniums with an Application to Information Retrieval", "book": "Advances in Neural Information Processing Systems", "page_first": 1481, "page_last": 1488, "abstract": null, "full_text": "Exponential Family Harmoniums with an Application to Information Retrieval

Max Welling & Michal Rosen-Zvi
Information and Computer Science
University of California
Irvine CA 92697-3425 USA
welling@ics.uci.edu

Geoffrey Hinton
Department of Computer Science
University of Toronto
Toronto, 290G M5S 3G4, Canada
hinton@cs.toronto.edu

Abstract

Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these "exponential family harmoniums" is fast, while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that these models perform well on document retrieval tasks and provide an elegant solution to searching with keywords.

1 Introduction

Graphical models have become the basic framework for generative approaches to probabilistic modelling. In particular, models with latent variables have proven to be a powerful way to capture hidden structure in the data. In this paper we study the important subclass of models with one layer of observed units and one layer of hidden units.

Two-layer models can be subdivided into various categories depending on a number of characteristics. 
An important property in that respect is given by the semantics of the graphical model: either directed (Bayes net) or undirected (Markov random field). Most two-layer models fall in the first category or are approximations derived from it: mixtures of Gaussians (MoG), probabilistic PCA (pPCA), factor analysis (FA), independent components analysis (ICA), sigmoid belief networks (SBN), latent trait models, latent Dirichlet allocation (LDA, otherwise known as multinomial PCA, or mPCA) [1], exponential family PCA (ePCA), probabilistic latent semantic indexing (pLSI) [6], non-negative matrix factorization (NMF), and more recently the multiple multiplicative factor model (MMF) [8].

Directed models enjoy important advantages such as easy (ancestral) sampling and easy handling of unobserved attributes under certain conditions. Moreover, the semantics of directed models dictates marginal independence of the latent variables, which is a suitable modelling assumption for many datasets. However, it should also be noted that directed models come with an important disadvantage: inference of the posterior distribution of the latent variables given the observations (which is, for instance, needed within the context of the EM algorithm) is typically intractable, resulting in approximate or slow iterative procedures. For important applications, such as latent semantic indexing (LSI), this drawback may have serious consequences, since we would like to swiftly search for documents that are similar in the latent topic space.

A type of two-layer model that has not enjoyed much attention is the undirected analogue of the above described family of models. It was first introduced in [10], where it was named "harmonium". 
Later papers have studied the harmonium under various names (the "combination machine" in [4] and the "restricted Boltzmann machine" in [5]) and turned it into a practical method by introducing efficient learning algorithms. Harmoniums have only been considered in the context of discrete binary variables (in both hidden and observed layers), and more recently in the Gaussian case [7]. The first contribution of this paper is to extend harmoniums into the exponential family, which will make them much more widely applicable.

Harmoniums also enjoy a number of important advantages which are rather orthogonal to the properties of directed models. Firstly, their product structure has the ability to produce distributions with very sharp boundaries. Unlike mixture models, adding a new expert may decrease or increase the variance of the distribution, which may be a major advantage in high dimensions. Secondly, unlike directed models, inference in these models is very fast, due to the fact that the latent variables are conditionally independent given the observations. Thirdly, the latent variables of harmoniums produce distributed representations of the input. This is much more efficient than the "grandmother-cell" representation associated with mixture models, where each observation is generated by a single latent variable. Their most important disadvantage is the presence of a global normalization factor, which complicates both the evaluation of probabilities of input vectors (footnote 1) and learning free parameters from examples. 
The second objective of this paper is to show that the introduction of contrastive divergence has greatly improved the efficiency of learning and paved the way for large scale applications.

Whether a directed two-layer model or a harmonium is more appropriate for a particular application is an interesting question that will depend on many factors, such as prior (conditional) independence assumptions and/or computational issues such as efficiency of inference. To expose the fact that harmoniums can be viable alternatives to directed models we introduce an entirely new probabilistic extension of latent semantic analysis (LSI) [3] and show its usefulness in various applications. We do not want to claim superiority of harmoniums over their directed cousins, but rather that harmoniums enjoy rather different advantages that deserve more attention and that may one day be combined with the advantages of directed models.

2 Extending Harmoniums into the Exponential Family

Let x_i, i = 1...M_x be the set of observed random variables and h_j, j = 1...M_h be the set of hidden (latent) variables. Both x and h can take values in either the continuous or the discrete domain. 
In the latter case, each variable has states a = 1...D.

To construct an exponential family harmonium (EFH) we first choose M_x independent distributions p_i(x_i) for the observed variables and M_h independent distributions p_j(h_j) for the hidden variables from the exponential family and combine them multiplicatively,

p(\{x_i\}) = \prod_{i=1}^{M_x} r_i(x_i) \exp[ \sum_a \theta_{ia} f_{ia}(x_i) - A_i(\{\theta_{ia}\}) ]    (1)

p(\{h_j\}) = \prod_{j=1}^{M_h} s_j(h_j) \exp[ \sum_b \lambda_{jb} g_{jb}(h_j) - B_j(\{\lambda_{jb}\}) ]    (2)

where \{f_{ia}(x_i), g_{jb}(h_j)\} are the sufficient statistics for the models (otherwise known as features), \{\theta_{ia}, \lambda_{jb}\} the canonical parameters of the models, and \{A_i, B_j\} the log-partition functions (or log-normalization factors). In the following we will consider log(r_i(x_i)) and log(s_j(h_j)) as additional features multiplied by a constant.

Next, we couple the random variables in the log-domain by the introduction of a quadratic interaction term,

p(\{x_i, h_j\}) \propto \exp[ \sum_{ia} \theta_{ia} f_{ia}(x_i) + \sum_{jb} \lambda_{jb} g_{jb}(h_j) + \sum_{ijab} W_{ia}^{jb} f_{ia}(x_i) g_{jb}(h_j) ]    (3)

Note that we did not write the log-partition function for this joint model, in order to indicate our inability to compute it in general. For some combinations of exponential family distributions it may be necessary to restrict the domain of W_{ia}^{jb} in order to maintain normalizability of the joint probability distribution (e.g. W_{ia}^{jb} \le 0 or W_{ia}^{jb} \ge 0). Although we could also have mutually coupled the observed variables (and/or the hidden variables) using similar interaction terms, we refrain from doing so in order to keep the learning and inference procedures efficient.

Footnote 1: However, it is easy to compute these probabilities up to a constant, so it is possible to compare probabilities of data-points.
Consequently, by this construction the conditional probability distributions are a product of independent distributions in the exponential family with shifted parameters,

p(\{x_i\}|\{h_j\}) = \prod_{i=1}^{M_x} \exp[ \sum_a \hat{\theta}_{ia} f_{ia}(x_i) - A_i(\{\hat{\theta}_{ia}\}) ],   \hat{\theta}_{ia} = \theta_{ia} + \sum_{jb} W_{ia}^{jb} g_{jb}(h_j)    (4)

p(\{h_j\}|\{x_i\}) = \prod_{j=1}^{M_h} \exp[ \sum_b \hat{\lambda}_{jb} g_{jb}(h_j) - B_j(\{\hat{\lambda}_{jb}\}) ],   \hat{\lambda}_{jb} = \lambda_{jb} + \sum_{ia} W_{ia}^{jb} f_{ia}(x_i)    (5)

Finally, using the identity \sum_y \exp[ \sum_a \theta_a f_a(y) ] = \exp A(\{\theta_a\}), we can also compute the marginal distributions of the observed and latent variables,

p(\{x_i\}) \propto \exp[ \sum_{ia} \theta_{ia} f_{ia}(x_i) + \sum_j B_j(\{\lambda_{jb} + \sum_{ia} W_{ia}^{jb} f_{ia}(x_i)\}) ]    (6)

p(\{h_j\}) \propto \exp[ \sum_{jb} \lambda_{jb} g_{jb}(h_j) + \sum_i A_i(\{\theta_{ia} + \sum_{jb} W_{ia}^{jb} g_{jb}(h_j)\}) ]    (7)

Note that 1) we can only compute the marginal distributions up to the normalization constant, and 2) in accordance with the semantics of undirected models, there is no marginal independence between the variables (but rather conditional independence).

2.1 Training EF-Harmoniums using Contrastive Divergence

Let \tilde{p}(\{x_i\}) denote the data distribution (or the empirical distribution in case we observe a finite dataset), and p the model distribution. 
Under the maximum likelihood objective the learning rules for the EFH are conceptually simple (footnote 2),

\delta\theta_{ia} \propto \langle f_{ia}(x_i) \rangle_{\tilde{p}} - \langle f_{ia}(x_i) \rangle_p,    \delta\lambda_{jb} \propto \langle B'_{jb}(\hat{\lambda}_{jb}) \rangle_{\tilde{p}} - \langle B'_{jb}(\hat{\lambda}_{jb}) \rangle_p    (8)

\delta W_{ia}^{jb} \propto \langle f_{ia}(x_i) B'_{jb}(\hat{\lambda}_{jb}) \rangle_{\tilde{p}} - \langle f_{ia}(x_i) B'_{jb}(\hat{\lambda}_{jb}) \rangle_p    (9)

where we have defined B'_{jb} = \partial B_j(\hat{\lambda}_{jb}) / \partial \hat{\lambda}_{jb} with \hat{\lambda}_{jb} defined in Eqn.5. One should note that these learning rules are changing the parameters in an attempt to match the expected sufficient statistics of the data distribution and the model distribution (while maximizing entropy). Their simplicity is somewhat deceptive, however, since the averages \langle \cdot \rangle_p are intractable to compute analytically, and Markov chain sampling or mean field calculations are typically wheeled out to approximate them. Both have difficulties: mean field can only represent one mode of the distribution, and MCMC schemes are slow and suffer from high variance in their estimates.

Footnote 2: These learning rules are derived by taking derivatives of the log-likelihood objective using Eqn.6.

In the case of binary harmoniums (restricted BMs) it was shown in [5] that contrastive divergence has the potential to greatly improve on the efficiency and reduce the variance of the estimates needed in the learning rules. The idea is that instead of running the Gibbs sampler to its equilibrium distribution we initialize Gibbs samplers on each data-vector and run them for only one (or a few) steps in parallel.
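For the special case of a binary harmonium (a restricted Boltzmann machine), where f_{ia}(x_i) = x_i and B' is the logistic sigmoid, one such truncated-Gibbs parameter update (CD-1) might look like the following sketch. This is our own NumPy illustration, not the authors' code, and all variable names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(X, W, theta, lam, lr=0.1, rng=None):
    """One CD-1 update for a binary harmonium (RBM) - illustrative sketch.

    X: (N, Mx) batch of binary data vectors.
    W: (Mx, Mh) couplings; theta: (Mx,) visible biases; lam: (Mh,) hidden biases.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Positive phase: hidden conditionals given the data (cf. Eqn. 5).
    ph_data = sigmoid(X @ W + lam)
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    # One Gibbs step back to the visibles (cf. Eqn. 4) and up again.
    px = sigmoid(h @ W.T + theta)
    x = (rng.random(px.shape) < px).astype(float)
    ph_model = sigmoid(x @ W + lam)
    # Learning rules (cf. Eqns. 8-9) with <.>_p replaced by the CD samples.
    N = X.shape[0]
    dW = (X.T @ ph_data - x.T @ ph_model) / N
    dtheta = (X - x).mean(axis=0)
    dlam = (ph_data - ph_model).mean(axis=0)
    return W + lr * dW, theta + lr * dtheta, lam + lr * dlam
```

In practice one would run many such updates over mini-batches; the single step shown here is only meant to make the structure of the positive and negative phases explicit.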
Averages \langle \cdot \rangle_p in the learning rules Eqns.8,9 are now replaced by averages \langle \cdot \rangle_{p_{CD}}, where p_{CD} is the distribution of samples that resulted from the truncated Gibbs chains. This idea is readily generalized to EFHs. Due to space limitations we refer to [5] for more details on contrastive divergence learning (footnote 3). Deterministic learning rules can also be derived straightforwardly by generalizing the results described in [12] to the exponential family.

3 A Harmonium Model for Latent Semantic Indexing

To illustrate the new possibilities that have opened up by extending harmoniums to the exponential family, we will next describe a novel model for latent semantic indexing (LSI). This will represent the undirected counterpart of pLSI [6] and LDA [1].

One of the major drawbacks of LSI is that inherently discrete data (word counts) are being modelled with variables in the continuous domain. The power of LSI on the other hand is that it provides an efficient mapping of the input data into a lower dimensional (continuous) latent space that has the effect of de-noising the input data and inferring semantic relationships among words. To stay faithful to this idea and to construct a probabilistic model on the correct (discrete) domain, we propose the following EFH with continuous latent topic variables, h_j, and discrete word-count variables, x_{ia},

p(\{h_j\}|\{x_{ia}\}) = \prod_{j=1}^{M_h} N_{h_j}[ \sum_{ia} W_{ia}^j x_{ia}, 1 ]    (10)

p(\{x_{ia}\}|\{h_j\}) = \prod_{i=1}^{M_x} S_{\{x_{ia}\}}[ \alpha_{ia} + \sum_j h_j W_{ia}^j ]    (11)

Note that \{x_{ia}\} represent indicator variables satisfying \sum_a x_{ia} = 1 \forall i, where x_{ia} = 1 means that word "i" in the vocabulary was observed "a" times. N_h[\mu, \sigma] denotes a normal distribution with mean \mu and standard deviation \sigma, and S_{\{x_a\}}[\gamma_a] \propto \exp( \sum_{a=1}^D \gamma_a x_a ) is the softmax function defining a probability distribution over x. Using Eqn.6 we can easily deduce the
Using Eqn.6 we can easily deduce the\nmarginal distribution of the input variables,\n\n(cid:88)\n\n(cid:88)\n\nhjW j\nia]\n\n(11)\n\ni=1\n\nj\n\np({xia}) \u221d exp [\n\n\u03b1iaxia +\n\n(\n\nW j\n\niaxia)2]\n\nia\n\nj\n\nia\n\n1\n2\n\n(12)\n\n3Non-believers in contrastive divergence are invited to simply run the the Gibbs sampler to equi-\nlibrium before they do an update of the parameters. They will \ufb01nd that due to the special bipartite\nstructure of EFHs learning is still more ef\ufb01cient than for general Boltzmann machines.\n\n\f(cid:80)\n\nWe observe that the role of the components W j\nia is that of templates or prototypes: input\niaxia \u2200j will have high probability under this\nvectors xia with large inner products\nmodel. Just like pLSI and LDA can be considered as natural generalizations of factor\nanalysis (which underlies LSI) into the class of directed models on the discrete domain, the\nabove model can be considered as the natural generalization of factor analysis into class\nof undirected models on the discrete domain. This idea is supported by the result that the\nsame model with Gaussian units in both hidden and observed layers is in fact equivalent to\nfactor analysis [7].\n\nia W j\n\n3.1 Identi\ufb01ability\n\nFrom the form of the marginal distribution Eqn.12 we can derive a number of transforma-\ntions of the parameters that will leave the distribution invariant. First we note that the com-\nponents W j\nk U jkW k\nia\nwith U T U = I. Secondly, we note that observed variables xia satisfy a constraint,\nia and\n\n(cid:80)\na xia = 1 \u2200i. This results in a combined shift invariance for the components W j\n\nia can be rotated and mirrored arbitrarily in latent space4: W j\n\nthe offsets \u03b1ia. 
Taken together, this results in the following set of transformations,

W_{ia}^j \to \sum_k U^{jk} ( W_{ia}^k + V_i^k ),    \alpha_{ia} \to ( \alpha_{ia} + \beta_i ) - \sum_j ( \sum_l V_l^j ) W_{ia}^j    (13)

where U^T U = I. Although these transformations leave the marginal distribution over the observable variables invariant, they do change the latent representation, and as such may have an impact on retrieval performance (if we use a fixed similarity measure between topic representations of documents). To fix the spurious degrees of freedom we have chosen to impose conditions on the representations in latent space: h_j^n = \sum_{ia} W_{ia}^j x_{ia}^n. First, we center the latent representations, which has the effect of minimizing the "activity" of the latent variables and moving as much log-probability as possible to the constant component \alpha_{ia}. Next we align the axes in latent space with the eigen-directions of the latent covariance matrix. This has the effect of approximately decorrelating the marginal latent activities. This follows because the marginal distribution in latent space can be approximated by: p(\{h_j\}) \approx \sum_n \prod_j N_{h_j}[ \sum_{ia} W_{ia}^j x_{ia}^n, 1 ] / N, where we have used Eqn.10 and replaced p(\{x_{ia}\}) by its empirical distribution. 
Denoting by \mu and \Sigma = U^T \Lambda U the sample mean and sample covariance of \{h_j^n\}, it is not hard to show that the following transformation will have the desired effect (footnote 5):

W_{ia}^j \to \sum_k U^{jk} ( W_{ia}^k - \frac{1}{M_x} \mu_k ),    \alpha_{ia} \to \alpha_{ia} + \sum_j \mu_j W_{ia}^j    (14)

One could go one step further than the de-correlation process described above by introducing covariances \Sigma in the conditional Gaussian distribution of the latent variables, Eqn.10. This would not result in a more general model, because the effect of this on the marginal distribution over the observed variables is given by: W_{ia}^j \to \sum_k K^{jk} W_{ia}^k with K K^T = \Sigma. However, the extra freedom can be used to define axes in latent space for which the projected data become approximately independent and have the same scale in all directions.

Footnote 4: Technically we call this the Euclidean group of transformations.

Footnote 5: Some spurious degrees of freedom remain, since shifts \beta_i and shifts V_i^j that satisfy \sum_i V_i^j = 0 will not affect the projection into latent space. One could decide to fix the remaining degrees of freedom by, for example, requiring that components are as small as possible in L2 norm (subject to the constraint \sum_i V_i^j = 0), leading to the further shifts W_{ia}^j \to W_{ia}^j - \frac{1}{D} \sum_a W_{ia}^j + \frac{1}{D M_x} \sum_{ia} W_{ia}^j and \alpha_{ia} \to \alpha_{ia} - \frac{1}{D} \sum_a \alpha_{ia}.

Figure 1: Precision-recall curves when the query was (a) entire documents, (b) 1 keyword, (c) 2 keywords, for the EFH with and without 10 MF iterations, LSI, TF-IDF weighted words, and random guessing. PR curves with more keywords looked very similar to (c). 
A marker at position k (counted from the left along a curve) indicates that 2^{k-1} documents were retrieved.

4 Experiments

Newsgroups:
We have used the reduced version of the "20newsgroups" dataset prepared for MATLAB by Roweis (footnote 6). Documents are presented as 100-dimensional binary occurrence vectors and tagged as a member of 1 out of 4 domains. Documents contain approximately 4% of the words, averaged across the 16242 postings.

An EFH model with 10 latent variables was trained on 12000 training cases using stochastic gradient descent on mini-batches of 1000 randomly chosen documents (training time approximately 1 hour on a 2GHz PC). A momentum term was added to speed up convergence. To test the quality of the trained model we mapped the remaining 4242 query documents into latent space using h_j = \sum_{ia} W_{ia}^j x_{ia}, where \{W_{ia}^j, \alpha_{ia}\} were "gauged" as in Eqns.14. Precision-recall curves were computed by comparing training and query documents using the usual "cosine coefficient" (cosine of the angle between documents) and reporting success when the retrieved document was in the same domain as the query (results averaged over all queries). In figure 1a we compare the results with LSI (also 10 dimensions) [3], where we preprocessed the data in the standard way (x \to \log(1 + x) and entropy weighting of the words), and to similarity in word space using TF-IDF weighting of the words. In figure 1b,c we show PR curves when only 1 or 2 keywords were provided, corresponding to randomly observed words in the query document. The EFH model allows a principled way to deal with unobserved entries by inferring them using the model (in all other methods we insert 0 for the unobserved entries, which corresponds to ignoring them). 
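The retrieval pipeline just described (project each document with h_j = \sum_{ia} W_{ia}^j x_{ia}, then rank stored documents by the cosine coefficient) can be sketched as follows. This is a minimal illustration under our own naming conventions, not the authors' code:

```python
import numpy as np

def latent_rep(X, W):
    """Project documents into topic space: h_j = sum_ia W_ia^j x_ia.

    X: (num_docs, V) word-feature vectors; W: (V, Mh) gauged components.
    """
    return X @ W

def cosine_retrieve(query_h, docs_h, k=5):
    """Return indices of the k stored documents with the largest cosine
    coefficient (cosine of the angle) relative to the query."""
    qn = query_h / np.linalg.norm(query_h)
    dn = docs_h / np.linalg.norm(docs_h, axis=1, keepdims=True)
    return np.argsort(-(dn @ qn))[:k]
```

Because the comparison happens in the low-dimensional latent space rather than in word space, a query costs one small matrix-vector product plus one pass over the stored topic vectors.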
We have used a few iterations of mean field to infer the unobserved entries: \hat{x}_{ia} \to \exp[ \alpha_{ia} + \sum_{jb} ( \sum_k W_{ia}^k W_{jb}^k ) \hat{x}_{jb} ] / \gamma_i, where \gamma_i is a normalization constant and where \hat{x}_{ia} represent probabilities: \hat{x}_{ia} \in [0, 1], \sum_{a=1}^D \hat{x}_{ia} = 1 \forall i. We note that this is still highly efficient and achieves a significant improvement in performance. In all cases we find that, without any preprocessing or weighting, EFH still outperforms the other methods except when large numbers of documents were retrieved.

In the next experiment we compared performance of EFH, LSI and LDA by training models on a random subset of 15430 documents with 5 and 10 latent dimensions (this was found to be close to optimal for LDA). The EFH and LSI models were trained as in the previous experiment, while the training and testing details (footnote 7) for LDA can be found in [9].

Footnote 6: http://www.cs.toronto.edu/~roweis/data.html
Footnote 7: The approximate inference procedure was implemented using Gibbs sampling.

For the remaining test documents we clamped a varying number of observed words and
asked the models to predict the remaining observed words in the documents by computing the probabilities for all words in the vocabulary to be present and ranking them (see previous paragraph for details). By comparing the list of the R remaining observed words in the document with the top-R ranked inferred words, we computed the fraction of correctly predicted words. The results are shown in figure 2a as a function of the number of clamped words. To provide anecdotal evidence that EFH can infer semantic relationships, we clamped the words 'drive', 'driver' and 'car', which resulted in: 'car' 'drive' 'engine' 'dealer' 'honda' 'bmw' 'driver' 'oil' as the most probable words in the documents. Also, clamping 'pc', 'driver' and 'program' resulted in: 'windows' 'card' 'dos' 'graphics' 'software' 'pc' 'program' 'files'.

Figure 2: (a) Fraction of observed words that was correctly predicted by EFH, LSI and LDA using 5 and 10 latent variables when we vary the number of keywords (observed words that were "clamped"), (b) latent 3-D representations of newsgroups data, (c) fraction of documents retrieved by EFH on the NIPS dataset which was also retrieved by the TF-IDF method.

NIPS Conference Papers:
Next we trained a model with 5 latent dimensions on the NIPS dataset (footnote 8), which has a large vocabulary size (13649 words) and contains 1740 documents, of which 1557 were used for training and 183 for testing. Count values were redistributed into 12 bins. The array W therefore contains 5 x 13649 x 12 = 818940 parameters. Training was completed in the order of a few days. 
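Returning to the keyword-completion procedure used in the experiments above: the mean-field fill-in can be sketched as below. This is our own illustration of the update given in the text (W is indexed here as W[k, i, a] for latent dimension k, word i and count bin a; clamped words are held fixed); it is a sketch under those assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def mean_field_fill(xhat, alpha, W, clamped, n_iters=10):
    """Mean-field imputation of unobserved word bins (illustrative sketch).

    xhat:    (Mx, D) current word-bin probabilities, each row sums to one;
             rows of clamped words hold the observed indicator vectors.
    alpha:   (Mx, D) offsets; W: (Mh, Mx, D) components.
    clamped: boolean mask (Mx,) marking words whose bins are observed.
    """
    xhat = xhat.copy()
    for _ in range(n_iters):
        # Mean latent activity: h_k = sum_ia W[k, i, a] * xhat[i, a].
        h = np.einsum('kia,ia->k', W, xhat)
        # field[i, a] = alpha[i, a] + sum_k W[k, i, a] * h_k.
        field = alpha + np.einsum('kia,k->ia', W, h)
        for i in np.flatnonzero(~clamped):
            xhat[i] = softmax(field[i])  # renormalize over the D bins
    return xhat
```

Ranking all vocabulary words by their inferred probabilities then yields the top-R predictions compared against the held-out words in figure 2a.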
Due to the lack of document labels it is hard to assess the quality of the trained model. We chose to compare performance on document retrieval with the "golden standard": cosine similarity in TF-IDF weighted word space. In figure 2c we depict the fraction of documents retrieved by EFH that was also retrieved by TF-IDF as we vary the number of retrieved documents. This correlation is indeed very high, but note that EFH computes similarity in a 5-D space while TF-IDF computes similarity in a 13649-D space.

5 Discussion

The main point of this paper was to show that there is a flexible family of two-layer probabilistic models that represents a viable alternative to two-layer causal (directed) models. These models enjoy very different properties and can be trained efficiently using contrastive divergence. As an example we have studied an EFH alternative for latent semantic indexing, where we have found that the EFH has a number of favorable properties: fast inference allowing fast document retrieval, and a principled approach to retrieval with keywords. 
These were preliminary investigations, and it is likely that domain specific adjustments, such as a more intelligent choice of features or parameterization, could further improve performance.

Previous examples of EFH include the original harmonium [10], Gaussian variants thereof [7], and the PoT model [13], which couples a gamma distribution with the covariance of a normal distribution. Some exponential family extensions of general Boltzmann machines were proposed in [2], [14], but they do not have the bipartite structure that we study here. While the components of the Gaussian-multinomial EFH act as prototypes or templates for highly probable input vectors, the components of the PoT act as constraints (i.e. input vectors with large inner product have low probability). This can be traced back to the shape of the non-linearity B in Eqn.6. Although by construction B must be convex (it is the log-partition function), for large input values it can be either positive (prototypes, e.g. B(x) = x^2) or negative (constraints, e.g. B(x) = -\log(1 + x)). It has proven difficult to jointly model both prototypes and constraints in this formalism, except for the fully Gaussian case [11]. A future challenge is therefore to start the modelling process with the desired non-linearity and to subsequently introduce auxiliary variables to facilitate inference and learning.

Footnote 8: Obtained from http://www.cs.toronto.edu/~roweis/data.html.

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. 
Journal of Machine Learning Research, 3:993-1022, 2003.

[2] C. K. I. Williams. Continuous valued Boltzmann machines. Technical report, 1993.

[3] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[4] Y. Freund and D. Haussler. Unsupervised learning of distributions of binary vectors using 2-layer networks. In Advances in Neural Information Processing Systems, volume 4, pages 912-919, 1992.

[5] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771-1800, 2002.

[6] T. Hofmann. Probabilistic latent semantic analysis. In Proceedings of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999.

[7] T. K. Marks and J. R. Movellan. Diffusion networks, products of experts, and factor analysis. Technical Report UCSD MPLab TR 2001.02, University of California San Diego, 2001.

[8] B. Marlin and R. Zemel. The multiple multiplicative factor model for collaborative filtering. In Proceedings of the 21st International Conference on Machine Learning, volume 21, 2004.

[9] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 20, 2004.

[10] P. Smolensky. Information processing in dynamical systems: foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. McGraw-Hill, New York, 1986.

[11] M. Welling, F. Agakov, and C. K. I. Williams. Extreme components analysis. In Advances in Neural Information Processing Systems, volume 16, Vancouver, Canada, 2003.

[12] M. Welling and G. E. Hinton. 
A new learning algorithm for mean field Boltzmann machines. In Proceedings of the International Conference on Artificial Neural Networks, Madrid, Spain, 2001.

[13] M. Welling, G. E. Hinton, and S. Osindero. Learning sparse topographic representations with products of student-t distributions. In Advances in Neural Information Processing Systems, volume 15, Vancouver, Canada, 2002.

[14] R. Zemel, C. Williams, and M. Mozer. Lending direction to neural networks. Neural Networks, 8(4):503-512, 1995.
", "award": [], "sourceid": 2672, "authors": [{"given_name": "Max", "family_name": "Welling", "institution": null}, {"given_name": "Michal", "family_name": "Rosen-zvi", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}