{"title": "Continuous Sigmoidal Belief Networks Trained using Slice Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 452, "page_last": 458, "abstract": null, "full_text": "Continuous sigmoidal belief networks \n\ntrained using slice sampling \n\nBrendan J. Frey \n\nDepartment of Computer Science, University of Toronto \n\n6 King's College Road, Toronto, Canada M5S 1A4 \n\nAbstract \n\nReal-valued random hidden variables can be useful for modelling \nlatent structure that explains correlations among observed vari(cid:173)\nables. I propose a simple unit that adds zero-mean Gaussian noise \nto its input before passing it through a sigmoidal squashing func(cid:173)\ntion. Such units can produce a variety of useful behaviors, ranging \nfrom deterministic to binary stochastic to continuous stochastic. I \nshow how \"slice sampling\" can be used for inference and learning \nin top-down networks of these units and demonstrate learning on \ntwo simple problems. \n\n1 \n\nIntroduction \n\nA variety of unsupervised connectionist models containing discrete-valued hidden \nunits have been developed. These include Boltzmann machines (Hinton and Se(cid:173)\njnowski 1986), binary sigmoidal belief networks (Neal 1992) and Helmholtz ma(cid:173)\nchines (Hinton et al. 1995; Dayan et al. 1995). However, some hidden variables, \nsuch as translation or scaling in images of shapes, are best represented using continu(cid:173)\nous values. Continuous-valued Boltzmann machines have been developed (Movellan \nand McClelland 1993), but these suffer from long simulation settling times and the \nrequirement of a \"negative phase\" during learning. Tibshirani (1992) and Bishop et \nal. (1996) consider learning mappings from a continuous latent variable space to a \nhigher-dimensional input space. 
MacKay (1995) has developed "density networks" that can model both continuous and categorical latent spaces using stochasticity at the top-most network layer. In this paper I consider a new hierarchical top-down connectionist model that has stochastic hidden variables at all layers; moreover, these variables can adapt to be continuous or categorical.

[Figure 1: (a) shows the inner workings of the proposed unit: zero-mean Gaussian noise with variance \sigma_i^2 is added to the input before the squashing y_i = \Phi(x_i) is applied. (b) to (e) illustrate four quite different modes of behavior: (b) deterministic mode; (c) stochastic linear mode; (d) stochastic nonlinear mode; and (e) stochastic binary mode (note the different horizontal scale). For the sake of graphical clarity, the density functions are normalized to have equal maxima and the subscripts are left off the variables.]

The proposed top-down model can be viewed as a continuous-valued belief network, which can be simulated by performing a quick top-down pass (Pearl 1988). Work done on continuous-valued belief networks has focussed mainly on Gaussian random variables that are linked linearly such that the joint distribution over all variables is also Gaussian (Pearl 1988; Heckerman and Geiger 1995). Lauritzen et al. (1990) have included discrete random variables within the linear Gaussian framework. These approaches infer the distribution over unobserved unit activities given observed ones by "probability propagation" (Pearl 1988). However, this procedure is highly suboptimal for the richly connected networks that I am interested in.
Also, these approaches tend to assume that all the conditional Gaussian distributions represented by the belief network can be easily derived using information elicited from experts. Hofmann and Tresp (1996) consider the case of inference and learning in continuous belief networks that may be richly connected. They use mixture models and Parzen windows to implement conditional densities.

My main contribution is a simple, but versatile, continuous random unit that can operate in several different modes ranging from deterministic to binary stochastic to continuous stochastic. This spectrum of behaviors is controlled by only two parameters. Whereas the above approaches assume a particular mode for each unit (Gaussian or discrete), the proposed units are capable of adapting in order to operate in whatever mode is most appropriate.

2 Description of the unit

The proposed unit is shown in Figure 1a. It is similar to the deterministic sigmoidal unit used in multilayer perceptrons, except that Gaussian noise is added to the total input, \mu_i, before the sigmoidal squashing function is applied.^1 The probability density over presigmoid activity x_i for unit i is

p(x_i | \mu_i, \sigma_i^2) = \exp[-(x_i - \mu_i)^2 / 2\sigma_i^2] / \sqrt{2\pi\sigma_i^2},    (1)

where \mu_i and \sigma_i^2 are the mean and variance for unit i. A postsigmoid activity, y_i, is obtained by passing the presigmoid activity through a sigmoidal squashing function:

y_i = \Phi(x_i).    (2)

Including the transformation Jacobian, the postsigmoid distribution for unit i is

p(y_i | \mu_i, \sigma_i^2) = \exp[-(\Phi^{-1}(y_i) - \mu_i)^2 / 2\sigma_i^2] / [\Phi'(\Phi^{-1}(y_i)) \sqrt{2\pi\sigma_i^2}].    (3)

^1 Geoffrey Hinton suggested this unit as a way to make factor analysis nonlinear.

I use the cumulative Gaussian squashing function:

\Phi(x) = \int_{-\infty}^{x} e^{-z^2/2} / \sqrt{2\pi} \, dz,    \Phi'(x) = \phi(x) = e^{-x^2/2} / \sqrt{2\pi}.    (4)
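A minimal sketch of the unit in Python may help fix ideas (the function name is mine, not the paper's): equations (1) and (2) amount to one Gaussian draw followed by the cumulative Gaussian squashing function.

```python
import random
import statistics

_STD_NORMAL = statistics.NormalDist()  # standard Gaussian, for the squashing CDF

def sample_unit(mu, sigma, rng=random):
    """Sample one stochastic sigmoidal unit: add zero-mean Gaussian noise
    to the total input mu (Eq. 1), then squash with the cumulative
    Gaussian to get the postsigmoid activity (Eq. 2)."""
    x = rng.gauss(mu, sigma)       # presigmoid activity x_i
    y = _STD_NORMAL.cdf(x)         # postsigmoid activity y_i = Phi(x_i)
    return x, y
```

With a tiny `sigma` the unit is essentially the familiar deterministic sigmoidal unit; the variance parameter is what unlocks the other modes described below.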
\n\n(4) \nBoth <)0 and <)-10 are nonanalytic, so I use the C-library erfO function to imple(cid:173)\nment <)0 and table lookup with quadratic interpolation to implement <)-10. \nNetworks of these units can represent a broad range of structures, including deter(cid:173)\nministic multilayer perceptrons, binary sigmoidal belief networks (aka. stochastic \nmultilayer perceptrons), mixture models, mixture of expert models, hierarchical \nmixture of expert models, and factor analysis models. This versatility is brought \nabout by a range of significantly different modes of behavior available to each unit. \nFigures 1b to Ie illustrate these modes. \nDeterministic mode: If the noise variance of a unit is very small, the postsigmoid \nactivity will be a practically deterministic sigmoidal function of the mean. This \nmode is useful for representing deterministic nonlinear mappings such as those found \nin deterministic multilayer perceptrons and mixture of expert models. \nStochastic linear mode: For a given mean, if the squashing function is approx(cid:173)\nimately linear over the span of the added noise, the postsigmoid distribution will \nbe approximately Gaussian with the mean and standard deviation linearly trans(cid:173)\nformed. This mode is useful for representing Gaussian noise effects such as those \nfound in mixture models, the outputs of mixture of expert models, and factor anal(cid:173)\nysis models. \n\nStochastic nonlinear mode: If the variance of a unit in the stochastic linear \nmode is increased so that the squashing function is used in its nonlinear region, a \nvariety of distributions are producible that range from skewed Gaussian to uniform \nto bimodal. \nStochastic binary mode: This is an extreme case of the stochastic nonlinear \nmode. If the variance of a unit is very large, then nearly all of the probability mass \nwill lie near the ends of the interval (0,1) (see figure Ie). 
Using the cumulative Gaussian squashing function and a standard deviation of 150, less than 1% of the mass lies in (0.1, 0.9). In this mode, the postsigmoid activity of unit i appears to be binary with probability of being "on" (i.e., y_i > 0.5 or, equivalently, x_i > 0):

P(i on | \mu_i, \sigma_i^2) = \int_0^{\infty} \exp[-(x - \mu_i)^2 / 2\sigma_i^2] / \sqrt{2\pi\sigma_i^2} \, dx = \int_{-\infty}^{\mu_i} \exp[-x^2 / 2\sigma_i^2] / \sqrt{2\pi\sigma_i^2} \, dx = \Phi(\mu_i / \sigma_i).    (5)

This sort of stochastic activation is found in binary sigmoidal belief networks (Jaakkola et al. 1996) and in the decision-making components of mixture of expert models and hierarchical mixture of expert models.

3 Continuous sigmoidal belief networks

If the mean of each unit depends on the activities of other units and there are feedback connections, it is difficult to relate the density in equation 3 to a joint distribution over all unit activities, and simulating the model would require a great deal of computational effort. However, when a top-down topology is imposed on the network (making it a directed acyclic graph), the densities given in equations 1 and 3 can be interpreted as conditional distributions and the joint distribution over all units can be expressed as

p(\{x_i\}) = \prod_{i=1}^{n} p(x_i | \{x_j\}_{j < i}).

I use the following function for slice sampling:

f(z_i) = \exp\big[ -\sum_{k=i+1}^{n} \{x_k - \mu_k^{\setminus i} - w_{ki} \Phi(\sigma_i \Phi^{-1}(z_i) + \mu_i)\}^2 / 2\sigma_k^2 \big],    (9)

where \mu_k^{\setminus i} = \sum_{j \neq
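The generic univariate slice sampling scheme the method relies on (draw a height under the density, step out an interval containing the slice, then shrink toward the current point) can be sketched as follows. This is a sampler for an arbitrary unnormalized density f, in the style of Neal's stepping-out and shrinkage procedure, not the paper's specific f(z_i):

```python
import random

def slice_sample(f, x0, w=1.0, n=100, seed=0):
    """Draw n samples from an unnormalized 1-D density f using slice
    sampling with stepping-out and shrinkage."""
    rng = random.Random(seed)
    x = x0
    out = []
    for _ in range(n):
        # 1. Draw an auxiliary height uniformly under the density at x.
        u = rng.random() * f(x)
        # 2. Step out an interval [lo, hi] that contains the slice {x: f(x) > u}.
        lo = x - w * rng.random()
        hi = lo + w
        while f(lo) > u:
            lo -= w
        while f(hi) > u:
            hi += w
        # 3. Shrink: sample uniformly from [lo, hi]; reject points off the
        #    slice, shrinking the interval toward the current point x.
        while True:
            cand = rng.uniform(lo, hi)
            if f(cand) > u:
                x = cand
                break
            if cand < x:
                lo = cand
            else:
                hi = cand
        out.append(x)
    return out
```

Applied to equation 9, f would be evaluated at candidate postsigmoid values z_i; a key attraction of slice sampling here is that f need only be known up to a normalizing constant.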