{"title": "Unsupervised learning of distributions on binary vectors using two layer networks", "book": "Advances in Neural Information Processing Systems", "page_first": 912, "page_last": 919, "abstract": "", "full_text": "Unsupervised learning of distributions on binary vectors using two layer networks\n\nYoav Freund*\nComputer and Information Sciences\nUniversity of California Santa Cruz\nSanta Cruz, CA 95064\n\nDavid Haussler\nComputer and Information Sciences\nUniversity of California Santa Cruz\nSanta Cruz, CA 95064\n\nAbstract\n\nWe study a particular type of Boltzmann machine with a bipartite graph structure called a harmonium. Our interest is in using such a machine to model a probability distribution on binary input vectors. We analyze the class of probability distributions that can be modeled by such machines, showing that for each $n \\geq 1$ this class includes arbitrarily good approximations to any distribution on the set of all n-vectors of binary inputs. We then present two learning algorithms for these machines. The first learning algorithm is the standard gradient ascent heuristic for computing maximum likelihood estimates for the parameters (i.e. weights and thresholds) of the model. Here we give a closed form for this gradient that is significantly easier to compute than the corresponding gradient for the general Boltzmann machine. The second learning algorithm is a greedy method that creates the hidden units and computes their weights one at a time. This method is a variant of the standard method for projection pursuit density estimation. We give experimental results for these learning methods on synthetic data and natural data from the domain of handwritten digits.\n\n1 Introduction\n\nLet us suppose that each example in our input data is a binary vector $\\vec{x} = (x_1, \\ldots, x_n) \\in \\{\\pm 1\\}^n$, 
and that each such example is generated independently at random according to some unknown distribution on $\\{\\pm 1\\}^n$. This situation arises, for instance, when each example consists of (possibly noisy) measurements of n different binary attributes of a randomly selected object. In such a situation, unsupervised learning can be usefully defined as using the input data to find a good model of the unknown distribution on $\\{\\pm 1\\}^n$ and thereby learning the structure in the data.\n\nThe process of learning an unknown distribution from examples is usually called density estimation or parameter estimation in statistics, depending on the nature of the class of distributions used as models. Connectionist models of this type include Bayes networks [14], mixture models [3,13], and Markov random fields [14,8]. Network models based on the notion of energy minimization such as Hopfield nets [9] and Boltzmann machines [1] can also be used as models of probability distributions.\n\n* yoav@cis.ucsc.edu\n\nThe models defined by Hopfield networks are a special case of the more general Markov random field models in which the local interactions are restricted to symmetric pairwise interactions between components of the input. Boltzmann machines also use only pairwise interactions, but in addition they include hidden units, which correspond to unobserved variables. These unobserved variables interact with the observed variables represented by components of the input vector. The overall distribution on the set of possible input vectors is defined as the marginal distribution induced on the components of the input vector by the Markov random field over all variables, both observed and hidden. While the Hopfield network is relatively well understood, it is limited in the types of distributions that it can model. 
On the other hand, Boltzmann machines are universal in the sense that they are powerful enough to model any distribution (to any degree of approximation), but the mathematical analysis of their capabilities is often intractable. Moreover, the standard learning algorithm for the Boltzmann machine, a gradient ascent heuristic to compute the maximum likelihood estimates for the weights and thresholds, requires repeated stochastic approximation, which results in unacceptably slow learning.\u00b9 In this work we attempt to narrow the gap between Hopfield networks and Boltzmann machines by finding a model that will be powerful enough to be universal,\u00b2 yet simple enough to be analyzable and computationally efficient.\u00b3 We have found such a model in a minor variant of the special type of Boltzmann machine defined by Smolensky in his harmony theory [16, Ch. 6]. This special type of Boltzmann machine is defined by a network with a simple bipartite graph structure, which he called a harmonium.\n\nThe harmonium consists of two types of units: input units, each of which holds one component of the input vector, and hidden units, representing hidden variables. There is a weighted connection between each input unit and each hidden unit, and no connections between input units or between hidden units (see Figure 1). The presence of the hidden units induces dependencies, or correlations, between the variables modeled by input units. To illustrate the kind of model that results, consider the distribution of people that visit a specific coffee shop on Sunday. Let each of the n input variables represent the presence (+1) or absence (-1) of a particular person that Sunday. These random variables are clearly not independent, e.g. 
if Fred's wife and daughter are there, it is more likely that Fred is there; if you see three members of the golf club, you expect to see other members of the golf club; if Bill is there you are unlikely to see Brenda there, etc. This situation can be modeled by a harmonium model in which each hidden variable represents the presence or absence of a social group. The weights connecting a hidden unit and an input unit measure the tendency of the corresponding person to be associated with the corresponding group. In this coffee shop situation, several social groups may be present at one time, exerting a combined influence on the distribution of customers. This can be modeled easily with the harmonium, but is difficult to model using Bayes networks or mixture models.\u2074\n\n2 The Model\n\nLet us begin by formalizing the harmonium model. To model a distribution on $\\{\\pm 1\\}^n$ we will use n input units and some number $m \\geq 0$ of hidden units. These units are connected in a bipartite graph as illustrated in Figure 1.\n\nThe random variables represented by the input units each take values in $\\{+1, -1\\}$, while the hidden variables, represented by the hidden units, take values in $\\{0, 1\\}$. The state of the machine is defined by the values of these random variables. Define $\\vec{x} = (x_1, \\ldots, x_n) \\in \\{\\pm 1\\}^n$ to be the state of the input units, and $\\vec{h} = (h_1, \\ldots, h_m) \\in \\{0, 1\\}^m$ to be the state of the hidden units.\n\nThe connection weights between the input units and the ith hidden unit are denoted\u2075 by $\\vec{w}^{(i)} \\in R^n$ and the bias of the ith hidden unit is denoted by $\\theta^{(i)} \\in R$. The parameter vector $\\phi = \\{(\\vec{w}^{(1)}, \\theta^{(1)}), \\ldots, (\\vec{w}^{(m)}, \\theta^{(m)})\\}$\n\n\u00b9 One possible solution to this is the mean-field approximation [15], discussed further in section 2 below.\n\u00b2 In [4] we show that any distribution over $\\{\\pm 1\\}^n$ can be approximated to within any desired accuracy by a harmonium model using $2^n$ hidden units. 
\n\n\u00b3 See also other work relating Bayes nets and Boltzmann machines [12,1].\n\u2074 Noisy-OR gates have been introduced in the framework of Bayes Networks to allow for such combinations. However, using this in networks with hidden units has not been studied, to the best of our knowledge.\n\u2075 In [16, Ch. 6], binary connection weights are used. Here we use real-valued weights.\n\nFigure 1: The bipartite graph of the harmonium (hidden units, $m = 3$; input units $x_1, \\ldots, x_5$)\n\ndefines the entire network, and thus also the probability model induced by the network. For a given $\\phi$, the energy of a state configuration of hidden and input units is defined to be\n\n$$E(\\vec{x}, \\vec{h} \\mid \\phi) = -\\sum_{i=1}^{m} (\\vec{w}^{(i)} \\cdot \\vec{x} + \\theta^{(i)}) h_i \\qquad (1)$$\n\nand the probability of a configuration is\n\n$$\\Pr(\\vec{x}, \\vec{h} \\mid \\phi) = \\frac{1}{Z} e^{-E(\\vec{x}, \\vec{h} \\mid \\phi)} \\quad \\text{where} \\quad Z = \\sum_{\\vec{x}, \\vec{h}} e^{-E(\\vec{x}, \\vec{h} \\mid \\phi)}.$$\n\nSumming over $\\vec{h}$, it is easy to show that in the general case the probability distribution over possible state vectors on the input units is given by\n\n$$\\Pr(\\vec{x} \\mid \\phi) = \\sum_{\\vec{h} \\in \\{0,1\\}^m} \\Pr(\\vec{x}, \\vec{h} \\mid \\phi) = \\frac{1}{Z} \\sum_{\\vec{h} \\in \\{0,1\\}^m} \\exp\\left( \\sum_{i=1}^{m} (\\vec{w}^{(i)} \\cdot \\vec{x} + \\theta^{(i)}) h_i \\right) = \\frac{1}{Z} \\prod_{i=1}^{m} \\left( 1 + e^{\\vec{w}^{(i)} \\cdot \\vec{x} + \\theta^{(i)}} \\right) \\qquad (2)$$\n\nThis product form is particular to the harmonium structure, and does not hold for general Boltzmann machines. Product form distribution models have been used for density estimation in Projection Pursuit [10,6,5]. We shall look further into this relationship in section 5.\n\n3 Discussion of the model\n\nThe right hand side of Equation (2) has a simple intuitive interpretation. The ith factor in the product corresponds to the hidden variable $h_i$ and is an increasing function of the dot product between $\\vec{x}$ and the weight vector of the ith hidden unit. Hence an input vector $\\vec{x}$ will tend to have large probability when it is in the direction of one of the weight vectors $\\vec{w}^{(i)}$ (i.e. when $\\vec{w}^{(i)} \\cdot \\vec{x}$ is large), and small probability otherwise. This is the way that the hidden variables can be seen to exert their \"influence\"; each corresponds to a 
\npreferred or \"prototypical\" direction in space.\nThe next to the last formula in Equation (2) shows that the harmonium model can be written as a mixture of $2^m$ distributions of the form\n\n$$\\frac{1}{Z(\\vec{h})} \\exp\\left( \\sum_{i=1}^{m} (\\vec{w}^{(i)} \\cdot \\vec{x} + \\theta^{(i)}) h_i \\right),$$\n\nwhere $\\vec{h} \\in \\{0, 1\\}^m$ and $Z(\\vec{h})$ is the appropriate normalization factor. It is easily verified that each of these distributions is in fact a product of n Bernoulli distributions on $\\{+1, -1\\}$, one for each input variable $x_j$. Hence the harmonium model can be interpreted as a kind of mixture model. However, the number of components in the mixture represented by a harmonium is exponential in the number of hidden units.\nIt is interesting to compare the class of harmonium models to the standard class of models defined by a mixture of products of Bernoulli distributions. The same bipartite graph described in Figure 1 can be used to define a standard mixture model. Assign each of the m hidden units a weight vector $\\vec{w}^{(i)}$ and a probability $p_i$ such that $\\sum_{i=1}^{m} p_i = 1$. To generate an example, choose one of the hidden units according to the distribution defined by the $p_i$'s, and then choose the vector $\\vec{x}$ according to $P_i(\\vec{x}) = \\frac{1}{Z_i} e^{\\vec{w}^{(i)} \\cdot \\vec{x}}$, where $Z_i$ is the appropriate normalization factor so that $\\sum_{\\vec{x} \\in \\{\\pm 1\\}^n} P_i(\\vec{x}) = 1$. We thus get the distribution\n\n$$P(\\vec{x}) = \\sum_{i=1}^{m} \\frac{p_i}{Z_i} e^{\\vec{w}^{(i)} \\cdot \\vec{x}} \\qquad (3)$$\n\nThis form for presenting the standard mixture model emphasizes the similarity between this model and the harmonium model. A vector $\\vec{x}$ will have large probability if the dot product $\\vec{w}^{(i)} \\cdot \\vec{x}$ is large for some $1 \\leq i \\leq m$ (so long as $p_i$ is not too small). However, unlike the standard mixture model, the harmonium model allows more than one hidden variable to be +1 for any generated example. 
This means that several hidden influences can combine in the generation of a single example, because several hidden variables can be +1 at the same time. To see why this is useful, consider the coffee shop example given in the introduction. At any moment of time it is reasonable to find several social groups of people sitting in the shop. The harmonium model will have a natural representation for this situation, while in order for the standard mixture model to describe it accurately, a hidden variable has to be assigned to each combination of social groups that is likely to be found in the shop at the same time. In such cases the harmonium model is exponentially more succinct than the standard mixture model.\n\n4 Learning by gradient ascent on the log-likelihood\n\nWe now suppose that we are given a sample consisting of a set $S$ of vectors in $\\{\\pm 1\\}^n$ drawn independently at random from some unknown distribution. Our goal is to use the sample $S$ to find a good model for this unknown distribution using a harmonium with m hidden units, if possible. The method we investigate here is the method of maximum likelihood estimation using gradient ascent. The goal of learning is thus reduced to finding the set of parameters for the harmonium that maximizes the (log of the) probability of the set of examples $S$. In fact, this gives the standard learning algorithm for general Boltzmann machines. For a general Boltzmann machine this would require stochastic estimation of the parameters. As stochastic estimation is very time-consuming, the result is that learning is very slow. In this section we show that stochastic estimation need not be used for the harmonium model.\nFrom (2), the log likelihood of a sample of input vectors $S = \\{\\vec{x}^{(1)}, \\vec{x}^{(2)}, \\ldots, \\vec{x}^{(N)}\\}$, given a particular setting $\\phi = \\{(\\vec{w}^{(1)}, \\theta^{(1)}), \\ldots, (\\vec{w}^{(m)}, \\theta^{(m)})\\}$ of the parameters of the model is: 
\n\n$$\\text{log-likelihood}(\\phi) = \\sum_{\\vec{x} \\in S} \\ln \\Pr(\\vec{x} \\mid \\phi) = \\sum_{\\vec{x} \\in S} \\sum_{i=1}^{m} \\ln\\left( 1 + e^{\\vec{w}^{(i)} \\cdot \\vec{x} + \\theta^{(i)}} \\right) - N \\ln Z. \\qquad (4)$$\n\nTaking the gradient of the log-likelihood results in the following formula for the jth component of $\\vec{w}^{(i)}$:\n\n$$\\frac{\\partial}{\\partial w_j^{(i)}} \\text{log-likelihood}(\\phi) = \\sum_{\\vec{x} \\in S} x_j \\frac{1}{1 + e^{-(\\vec{w}^{(i)} \\cdot \\vec{x} + \\theta^{(i)})}} - N \\sum_{\\vec{x} \\in \\{\\pm 1\\}^n} \\Pr(\\vec{x} \\mid \\phi) \\, x_j \\frac{1}{1 + e^{-(\\vec{w}^{(i)} \\cdot \\vec{x} + \\theta^{(i)})}} \\qquad (5)$$\n\nA similar formula holds for the derivative of the bias term.\n\nThe purpose of the clamped and unclamped phases in the Boltzmann machine learning algorithm is to approximate these two terms. In general, this requires stochastic methods. However, here the clamped term is easy to calculate: it requires summing a logistic type function over all training examples. The same term is obtained by making the mean field approximation for the clamped phase in the general algorithm [15], which is exact in this case. It is more difficult to compute the sleep phase (unclamped) term, as it is an explicit sum over the entire input space, and within each term of this sum there is an implicit sum over the entire space of configurations of hidden units in the factor $\\Pr(\\vec{x} \\mid \\phi)$. However, again taking advantage of the special structure of the harmonium, we can reduce this sleep phase gradient term to a sum only over the configurations of the hidden units, yielding for each component of $\\vec{w}^{(i)}$\n\n$$\\frac{\\partial}{\\partial w_j^{(i)}} \\text{log-likelihood}(\\phi) = \\sum_{\\vec{x} \\in S} x_j \\frac{1}{1 + e^{-(\\vec{w}^{(i)} \\cdot \\vec{x} + \\theta^{(i)})}} - N \\sum_{\\vec{h} \\in \\{0,1\\}^m} \\Pr(\\vec{h} \\mid \\phi) \\, h_i \\tanh\\left( \\sum_{k=1}^{m} h_k w_j^{(k)} \\right) \\qquad (6)$$\n\nwhere\n\n$$\\Pr(\\vec{h} \\mid \\phi) = \\frac{\\exp\\left( \\sum_{i=1}^{m} h_i \\theta^{(i)} \\right) \\prod_{j=1}^{n} \\cosh\\left( \\sum_{i=1}^{m} h_i w_j^{(i)} \\right)}{\\sum_{\\vec{h}' \\in \\{0,1\\}^m} \\left[ \\exp\\left( \\sum_{i=1}^{m} h'_i \\theta^{(i)} \\right) \\prod_{j=1}^{n} \\cosh\\left( \\sum_{i=1}^{m} h'_i w_j^{(i)} \\right) \\right]}.$$\n\nDirect computation of (6) is fast for small m in contrast to the case for general Boltzmann machines (we have performed experiments with $m \\leq 10$). However, for large m it is not possible to compute all $2^m$ terms. 
There is a way to avoid this exponential explosion if we can assume that a small number of terms dominate the sums. If, for instance, we assume that the probability that more than k hidden units are active (+1) at the same time is negligibly small, we can get a good approximation by computing only $O(m^k)$ terms. Alternately, if we are not sure which states of the hidden units have non-negligible probability, we can dynamically search, as part of the learning process, for the significant terms in the sum. This way we get an algorithm that is always accurate, and is efficient when the number of significant terms is small. In the extreme case where we assume that only one hidden unit is active at a time (i.e. $k = 1$), the harmonium model essentially reduces to the standard mixture model as discussed in section 3. For larger k, this type of assumption provides a middle ground between the generality of the harmonium model and the simplicity of the mixture model.\n\n5 Projection Pursuit methods\n\nA statistical method that has a close relationship with the harmonium model is the Projection Pursuit (PP) technique [10,6,5]. The use of projection pursuit in the context of neural networks has been studied by several researchers (e.g. [11]). Most of the work is in exploratory projection pursuit and projection pursuit regression. In this paper we are interested in projection pursuit density estimation. Here PP avoids the exponential blowup of the standard gradient ascent technique, and also has the advantage that the number m of hidden units is estimated from the sample as well, rather than being specified in advance.\nProjection pursuit density estimation [6] is based on several types of analysis, using the central limit theorem, that lead to the following general conclusion. 
If $\\vec{x} \\in R^n$ is a random vector for which the different coordinates are independent, and $\\vec{w} \\in R^n$ is a vector from the n dimensional unit sphere, then the distribution of the projection $\\vec{w} \\cdot \\vec{x}$ is close to gaussian for most $\\vec{w}$. Thus searching for those directions $\\vec{w}$ for which the projection of a sample is most non-gaussian is a way for detecting dependencies between the coordinates in high dimensional distributions. Several \"projection-indices\" have been studied in the literature for measuring the \"non-gaussianity\" of projection, each enhancing different properties of the projected distribution. In order to find more than one projection direction, several methods of \"structure elimination\" have been devised. These methods transform the sample in such a way that the direction in which non-gaussianity has been detected appears to be gaussian, thus enabling the algorithm to detect non-gaussian projections that would otherwise be obscured. The search for a description of the distribution of a sample in terms of its projections can be formalized in the context of maximum likelihood density estimation [6]. In order to create a formal relation between the harmonium model and projection pursuit, we define a variant of the model that defines a density over $R^n$ instead of a distribution over $\\{\\pm 1\\}^n$. Based on this form we devise a projection index and a structure removal method that are the basis of the following learning algorithm (described fully in [4])\n\n\u2022 Initialization\nSet $S_0$ to be the input sample.\nSet $P_0$ to be the initial distribution (Gaussian).\n\n\u2022 Iteration\nRepeat the following steps for $i = 1, 2, \\ldots$ until no single-variable harmonium model has a significantly higher likelihood than the Gaussian distribution with respect to $S_i$.\n\n1. 
Perform an estimate-maximize (EM) [2] search on the log-likelihood of a single hidden variable model on the sample $S_{i-1}$. Denote by $\\theta_i$ and $\\vec{w}^{(i)}$ the parameters found by the search, and create a new hidden unit with associated binary random variable $h_i$ with these weights and bias.\n\n2. Transform $S_{i-1}$ into $S_i$ using the following structure removal procedure.\n\nFor each example $\\vec{x} \\in S_{i-1}$ compute the probability that the hidden variable $h_i$ found in the last step is 1 on this input:\n\nP(h; = 1) = (1 + e-