{"title": "Directional-Unit Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 172, "page_last": 179, "abstract": null, "full_text": "Directional-Unit Boltzmann Machines \n\nRichard S. Zemel \n\nComputer Science Dept. \n\nUniversity of Toronto \n\nToronto, ONT M5S lA4 \n\nChristopher K. I. Williams \n\nComputer Science Dept. \nUniversity of Toronto \n\nToronto, ONT M5S lA4 \n\nMichael C. Mozer \n\nComputer Science Dept. \nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nAbstract \n\nWe present a general formulation for a network of stochastic di(cid:173)\nrectional units. This formulation is an extension of the Boltzmann \nmachine in which the units are not binary, but take on values in a \ncyclic range, between 0 and 271' radians. The state of each unit in \na Directional-Unit Boltzmann Machine (DUBM) is described by a \ncomplex variable, where the phase component specifies a direction; \nthe weights are also complex variables. We associate a quadratic \nenergy function, and corresponding probability, with each DUBM \nconfiguration. The conditional distribution of a unit's stochastic \nstate is a circular version of the Gaussian probability distribution, \nknown as the von Mises distribution. In a mean-field approxima(cid:173)\ntion to a stochastic DUBM, the phase component of a unit's state \nrepresents its mean direction, and the magnitude component spec(cid:173)\nifies the degree of certainty associated with this direction. This \ncombination of a value and a certainty provides additional repre(cid:173)\nsentational power in a unit. We describe a learning algorithm and \nsimulations that demonstrate a mean-field DUBM'S ability to learn \ninteresting mappings. \n\nMany kinds of information can naturally be represented in terms of angular, or \ndirectional, variables. 
A circular range forms a suitable representation for explicitly directional information, such as wind direction, as well as for information where the underlying range is periodic, such as days of the week or months of the year. In computer vision, tangent fields and optic flow fields are represented as fields of oriented line segments, each of which can be described by a magnitude and direction. Directions can also be used to represent a set of symbolic labels, e.g., object label A at 0, and object label B at π/2 radians. We discuss below some advantages of representing symbolic labels with directional units. \n\nThese and many other phenomena can be usefully encoded using a directional representation: a polar coordinate representation of complex values in which the phase parameter indicates a direction between 0 and 2π radians. We have devised a general formulation of networks of stochastic directional units. This paper describes a directional-unit Boltzmann machine (DUBM), which is a novel generalization of a Boltzmann machine (Ackley, Hinton and Sejnowski, 1985) in which the units are not binary, but instead take on directional values between 0 and 2π. \n\n1 STOCHASTIC DUBM \n\nA stochastic directional unit takes on values on the unit circle. We associate with unit j a random variable Z_j; a particular state of j is described by a complex number with magnitude one and direction, or phase, τ_j: z_j = e^{iτ_j}. \n\nThe weights of a DUBM also take on complex values. The weight from unit k to unit j is: w_jk = b_jk e^{iθ_jk}. We constrain the weight matrix W to be Hermitian: W^T = W*, where the diagonal elements of the matrix are zero, and the asterisk indicates the complex conjugate operation. Note that if the components are real, then W^T = W, which is a real symmetric matrix. Thus, the Hermitian form is a natural generalization of weight symmetry to the complex domain. 
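The Hermitian constraint is what guarantees that the quadratic energy of the next section is real-valued: the (j,k) and (k,j) terms are complex conjugates of one another. A minimal pure-Python sketch of this (the list-of-lists matrix layout and helper names are ours, for illustration only):

```python
import cmath
import random

def random_hermitian(n, seed=0):
    """Random complex weight matrix with W^T = W* and zero diagonal,
    stored as a list of lists (an illustrative layout, not the paper's)."""
    rng = random.Random(seed)
    W = [[0j] * n for _ in range(n)]
    for j in range(n):
        for k in range(j + 1, n):
            w = complex(rng.gauss(0, 1), rng.gauss(0, 1))
            W[j][k] = w
            W[k][j] = w.conjugate()   # Hermitian constraint: w_kj = w_jk*
    return W

def energy(z, W):
    """E(z) = -1/2 * sum_{j,k} z_j z_k* w_jk; real (up to rounding) when W is Hermitian."""
    n = len(z)
    return -0.5 * sum(z[j] * z[k].conjugate() * W[j][k]
                      for j in range(n) for k in range(n))

# Unit states on the unit circle: z_j = e^{i tau_j}
z = [cmath.exp(1j * t) for t in (0.3, 1.2, 2.5, 4.0)]
E = energy(z, random_hermitian(4))
# E.imag is numerically zero for any phases, because conjugate terms pair up.
```

With a non-Hermitian W the same sum would in general have a nonzero imaginary part, so the Boltzmann factor would be undefined; this is the sense in which Hermitian weights generalize symmetric weights.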
\n\nThis definition of W leads to a Hermitian quadratic form that generalizes the real quadratic form of the Hopfield energy function: \n\nE(z) = -1/2 z*^T W z = -1/2 Σ_{j,k} z_j z_k* w_jk    (1) \n\nwhere z is the vector of the units' complex states in a particular global configuration. Noest (1988) independently proposed this energy function. It is similar to that used in Fradkin, Huberman, and Shenker's (1978) generalization of the XY model of statistical mechanics to allow arbitrary weight phases θ_jk, and to coupled oscillator models, e.g., Baldi and Meir (1990). \n\nWe can define a probability distribution over the possible states of a stochastic network using the Boltzmann factor. In a DUBM, we can describe the energy as a function of the state of a particular unit j: \n\nE(Z_j = z_j) = -1/2 [Σ_k z_j z_k* w_jk + Σ_k z_k z_j* w_kj] \n\nWe define \n\nx_j = Σ_k z_k w_jk* \n\nto be the net input to unit j, where a_j and α_j denote the magnitude and phase of x_j, respectively. Applying the Boltzmann factor, we find that the probability that unit j is in a particular state is proportional to: \n\np(Z_j = z_j) ∝ e^{-βE(Z_j = z_j)} = e^{βa_j cos(τ_j - α_j)}    (2) \n\nwhere β is the reciprocal of the system temperature. \n\nFigure 1: A circular normal density function laid over a unit circle. The dots along the circle represent samples of the circular normal random variable Z_j. The expected direction of Z_j, γ_j, is π/4; r_j is its resultant length. \n\nThis probability distribution for a unit's state corresponds to a distribution known as the von Mises, or circular normal, distribution (Mardia, 1972). Two parameters completely characterize this distribution: a mean direction τ̄ ∈ (0, 2π] and a concentration parameter m > 0 that behaves like the reciprocal of the variance of a Gaussian distribution on a linear random variable. 
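Because the conditional distribution in Equation 2 is a von Mises distribution with mean α_j and concentration βa_j, Gibbs-style sampling of one unit given the rest can be sketched with the standard library's von Mises sampler. (The function names and list-of-lists weight layout below are our assumptions, not the authors' code.)

```python
import cmath
import random

def net_input(j, z, W):
    """x_j = sum_k z_k w_jk*; its magnitude a_j and phase alpha_j parameterize
    the conditional distribution of unit j."""
    return sum(z[k] * W[j][k].conjugate() for k in range(len(z)) if k != j)

def sample_unit(j, z, W, beta=1.0, rng=random):
    """Draw a new state for unit j from p(tau) ∝ exp(beta * a_j * cos(tau - alpha_j))."""
    x = net_input(j, z, W)
    a, alpha = abs(x), cmath.phase(x)
    # von Mises with mean direction alpha_j and concentration m_j = beta * a_j
    tau = rng.vonmisesvariate(alpha, beta * a)
    return cmath.exp(1j * tau)
```

At high β (low temperature) the sampled phases cluster tightly around α_j; as β falls toward zero the draws approach a uniform distribution on the circle, matching the temperature behavior described below.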
The probability density function of a circular normal random variable Z is¹: \n\np(τ; τ̄, m) = 1/(2π I_0(m)) e^{m cos(τ - τ̄)}    (3) \n\nFrom Equations 2 and 3, we see that if a unit adopts states according to its contribution to the system energy, it will be a circular normal variable with mean direction α_j and concentration parameter m_j = βa_j. These parameters are directly determined by the net input to the unit. \n\nFigure 1 shows a circular normal density function for Z_j, the state of unit j. This figure also shows the expected value of its stochastic state, which we define as: \n\ny_j = <Z_j> = r_j e^{iγ_j}    (4) \n\nwhere γ_j, the phase of y_j, is the mean direction and r_j, the magnitude of y_j, is the resultant length. For a circular normal random variable, γ_j = α_j, and r_j = I_1(m_j)/I_0(m_j).² When samples of Z_j are concentrated on a small arc about the mean (see Figure 1), r_j will approach length one. This corresponds to a large concentration parameter (m_j = βa_j). Conversely, for small m_j, the distribution approaches the uniform distribution on the circle, and the resultant length falls toward zero. For a uniform distribution, r_j = 0. Note that the concentration parameter for a unit's circular normal density function is proportional to β, the reciprocal of the system temperature. Higher temperatures will thus have the effect of making this distribution more uniform, just as they do in a binary-unit Boltzmann machine. \n\n¹The normalization factor I_0(m) is the modified Bessel function of the first kind and order zero. An integral representation of this function is I_0(m) = (1/2π) ∫_0^{2π} e^{m cos θ} dθ. It can be computed by numerical routines. \n²An integral representation of the modified Bessel function of the first kind and order k is I_k(m) = (1/π) ∫_0^π e^{m cos θ} cos(kθ) dθ. Note that I_1(m) = dI_0(m)/dm. 
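The relationship r = I_1(m)/I_0(m) between concentration and resultant length is easy to check numerically. A minimal sketch using the power series of the modified Bessel functions (function names are ours; 40 series terms is an arbitrary but ample truncation for moderate m):

```python
import math

def bessel_i(k, m, terms=40):
    """Modified Bessel function of the first kind, order k, via its power series:
    I_k(m) = sum_{t>=0} (m/2)^{2t+k} / (t! (t+k)!)."""
    return sum((m / 2.0) ** (2 * t + k) / (math.factorial(t) * math.factorial(t + k))
               for t in range(terms))

def resultant_length(m):
    """r = I_1(m)/I_0(m): the certainty of a unit with concentration m = beta * a."""
    return bessel_i(1, m) / bessel_i(0, m)

# r rises monotonically from 0 (uniform distribution on the circle)
# toward 1 (samples concentrated on a small arc about the mean).
```

For example, `resultant_length(0.0)` is exactly 0, and the value climbs above 0.9 by m = 10, illustrating the certainty interpretation described in the text.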
\n\n2 EMERGENT PROPERTIES OF A DUBM \n\nA network of directional units as defined above has two important emergent properties. The first property is that the magnitude of the net input to unit j describes the extent to which its various inputs \"agree\". Intuitively, one can think of each component z_k w_jk* of the sum that comprises x_j as predicting a phase for unit j. When the phases of these components are equal, the magnitude of x_j, a_j, is maximized. If these phase predictions are far apart, then they will act to cancel each other out, and produce a small a_j. Given x_j, we can compute the expected value of the output of unit j. The expected direction of the unit roughly represents the weighted average of the phase predictions, while the resultant length is a monotonic function of a_j and hence describes the agreement between the various predictions. \n\nThe key idea here is that the resultant length directly describes the degree of certainty in the expected direction of unit j. Thus, a DUBM naturally incorporates a representation of the system's confidence in a value. This ability to combine several sources of evidence, and not only represent a value but also describe the certainty of that value, is an important property that may be useful in a variety of domains. \n\nThe second emergent property is that the DUBM energy is globally rotation-invariant: E is unaffected when the same rotation is applied to all units' states in the network. For each DUBM configuration, there is an equivalence class of configurations which have the same energy. In a similar way, we find that the magnitude of x_j is rotation-invariant. That is, when we translate the phases of all units but one by some phase, the magnitude of that unit's net input is unaffected. 
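The agreement property can be seen directly by summing unit phasors: the magnitude of the sum is large when the phase predictions coincide and near zero when they cancel. A small sketch (unit weight magnitudes assumed for simplicity; the helper name is ours):

```python
import cmath

def prediction_agreement(phases):
    """Magnitude of a sum of unit phasors: the net-input magnitude a unit
    would see if its inputs predicted the given phases with unit weights."""
    return abs(sum(cmath.exp(1j * p) for p in phases))

agree = prediction_agreement([0.80, 0.80, 0.80])     # identical predictions
disagree = prediction_agreement([0.0, 2.094, 4.189]) # roughly 120 degrees apart
# agree is essentially 3.0; disagree is near 0: conflicting predictions cancel.
```

Three agreeing inputs yield a net-input magnitude of 3, giving a resultant length (hence certainty) near one; three maximally conflicting inputs yield a magnitude near zero, leaving the unit's direction essentially unconstrained.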
This property underlies one of the key advantages of the representation: both the magnitude of a unit's state and the system energy depend on the relative rather than absolute phases of the units. \n\n3 DETERMINISTIC DUBM \n\nJust as in deterministic binary-unit Boltzmann machines (Peterson and Anderson, 1987; Hinton, 1989), we can greatly reduce the computational time required to run a large stochastic system if we invoke the mean-field approximation, which states that once the system has reached equilibrium, the stochastic variables can be approximated by their mean values. In this approximation, the variables are treated as independent, and the system probability distribution is simply the product of the probability distributions for the individual units. \n\nGislén, Peterson, and Söderberg (1992) originally proposed a mean-field theory for networks of directional (or \"rotor\") units, but only considered the case of real-valued weights. They derived the mean-field consistency equations by using the saddle-point method. Our approach provides an alternative, perhaps more intuitive derivation, due to the use of the circular normal distribution. \n\nWe can directly describe these mean values based on the circular normal interpretation. We still denote the net input to a unit j as x_j: \n\nx_j = Σ_k y_k w_jk* = a_j e^{iα_j}    (5) \n\nOnce equilibrium has been reached, the state of unit j is y_j, the expected value of Z_j given the mean-field approximation: \n\ny_j = (I_1(βa_j)/I_0(βa_j)) e^{iα_j}    (6) \n\nIn the stochastic as well as the deterministic system, units evolve to minimize the free energy, F = <E> - TH. The calculation of H, the entropy of the system, follows directly from the circular normal distribution and the mean-field approximation. 
We can derive mean-field consistency equations for x_j and y_j by minimizing the mean-field free energy, F_MF, with respect to each variable independently. The resulting equations match the mean-field equations (Equations 5 and 6) derived directly from the circular normal probability density function. They also match the special case derived by Gislén et al. for real-valued weights. \n\nWe have implemented a DUBM using the mean-field approximation. We solve for a consistent set of x and y values by performing synchronous updates of a discrete-time approximation of a set of differential equations based on the net input to each unit j. We update the x_j variables using the following differential equation: \n\ndx_j/dt = -x_j + Σ_k y_k w_jk*    (7) \n\nwhich has Equation 5 as its steady-state solution. In the simulations, we use simulated annealing to help find good minima of F_MF. \n\nJust as for the Hopfield binary-state network, it can be shown that the free energy always decreases during the dynamical evolution described in Equation 7 (Zemel, Williams and Mozer, 1992). The equilibrium solutions are free energy minima. \n\n4 DUBM LEARNING \n\nThe units in a DUBM can be arranged in a variety of architectures. The appropriate method for determining weight values for the network depends on the particular class of network architecture. In an autoassociative network containing a single set of interconnected units, the weights can be set directly from the training patterns. If hidden units are required to perform a task, then an algorithm for learning the weights is required. We use an algorithm that generalizes the Boltzmann machine training algorithm (Ackley, Hinton and Sejnowski, 1985; Peterson and Anderson, 1987) to these networks. 
\n\nAs in the standard Boltzmann machine learning algorithm, the partial derivative of the objective function with respect to a weight depends on the difference between the partials of two mean-field free energies: one when both input and output units are clamped, and the other when only the input units are clamped. On a given training case, for each of these stages we let the network settle to equilibrium and then calculate the following derivatives: \n\n∂F_MF/∂b_jk = -r_j r_k cos(γ_j - γ_k + θ_jk) \n∂F_MF/∂θ_jk = r_j r_k b_jk sin(γ_j - γ_k + θ_jk) \n\nThe learning algorithm uses these gradients to find weight values that will minimize the objective over a training set. \n\n5 EXPERIMENTAL RESULTS \n\nWe present below some illustrative examples to show that an adaptive network of directional units can be used in a range of paradigms, including associative memory, input/output mappings, and pattern completion. \n\n5.1 SIMPLE AUTOASSOCIATIVE DUBM \n\nThe first set of experiments considers a simple autoassociative DUBM, which contains no hidden units and in which the units are fully connected. As in a standard Hopfield network, the weights are set directly from the training patterns; they equal the superposition of the outer products of the patterns. \n\nWe have run several experiments with simple autoassociative DUBMs. The empirical results parallel those for binary-unit autoassociative networks. We find, for example, that a network containing 30 fully interconnected units is capable of reliably settling from a corrupted version of one of 4 stored patterns to a state near the pattern. These patterns thus form stable attractors, as the network can perform pattern completion and clean-up from noisy inputs. The rotation-invariance property of the energy function allows any rotated version of a training pattern to also act as an attractor. 
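The two free-energy derivatives can be transcribed directly into code. A minimal sketch (the dictionary-based parameter layout and function name are our own; y_j = r_j e^{iγ_j} are the settled mean-field states and w_jk = b_jk e^{iθ_jk}):

```python
import math

def weight_gradients(j, k, r, gamma, b, theta):
    """Partials of the mean-field free energy with respect to the magnitude
    b_jk and phase theta_jk of weight w_jk = b_jk e^{i theta_jk}:
      dF/db_jk     = -r_j r_k cos(gamma_j - gamma_k + theta_jk)
      dF/dtheta_jk =  r_j r_k b_jk sin(gamma_j - gamma_k + theta_jk)"""
    phase = gamma[j] - gamma[k] + theta[(j, k)]
    dF_db = -r[j] * r[k] * math.cos(phase)
    dF_dtheta = r[j] * r[k] * b[(j, k)] * math.sin(phase)
    return dF_db, dF_dtheta
```

In the full algorithm these quantities would be evaluated twice per training case (clamped and free settlings) and subtracted, as in standard mean-field Boltzmann learning; note that dF/db_jk is most negative, and dF/dtheta_jk vanishes, when the two units' relative phase matches the weight phase.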
The network's performance rapidly degrades for more than 4 orthogonal patterns; the patterns themselves no longer act as fixed points, and many random initial states end in states far from any stored pattern. In addition, more orthogonal patterns can be stored than random patterns. See Noest (1988) for an analysis of the capacity of an autoassociative DUBM with sparse and asymmetric connections. \n\n5.2 LEARNING INPUT/OUTPUT MAPPINGS \n\nWe have also used the mean-field DUBM learning algorithm to learn the weights in networks containing hidden units. We have experimented with a task that is well-suited to a directional representation. There is a single-jointed robot arm, anchored at a point, as shown in Figure 2. The input consists of two angles: the angle between the first arm segment and the positive x-axis (λ), and the angle between the two arm segments (ρ). The two segments each have a fixed length, A and B; these are not explicitly given to the network. The output is the angle between the line connecting the two ends of the arm and the x-axis (μ). This target angle is related in a complex, non-linear way to the input angles; the network must learn to approximate the following trigonometric relationship: \n\nμ = arctan( (A sin λ - B sin(λ + ρ)) / (A cos λ - B cos(λ + ρ)) ) \n\nFigure 2: A sample training case for the robot arm problem. The arm consists of two fixed-length segments, A and B, and is anchored on the x-axis. The two angles, λ and ρ, are given as input for each case, and the target output is the angle μ. \n\nWith 500 training cases, a DUBM with 2 input units and 8 hidden units is able to learn the task so that it can accurately estimate μ for novel patterns. The learning requires 200 iterations of a conjugate gradient training algorithm. 
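The target relationship above can be used to generate training cases for the arm task. A sketch following the printed equation, with atan2 standing in for arctan to resolve the quadrant ambiguity (the segment lengths and case count below are arbitrary illustrative choices, not the paper's):

```python
import math
import random

def target_angle(lam, rho, A=2.0, B=1.0):
    """Angle mu of the line connecting the arm's two ends, per the paper's
    trigonometric relationship. Lengths A and B are fixed but, as in the
    text, would be hidden from the network."""
    dy = A * math.sin(lam) - B * math.sin(lam + rho)
    dx = A * math.cos(lam) - B * math.cos(lam + rho)
    return math.atan2(dy, dx)   # atan2 resolves the quadrant of arctan(dy/dx)

def make_training_set(n, seed=0):
    """n cases of ((lambda, rho), mu) with angles drawn uniformly on [0, 2*pi)."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        lam = rng.uniform(0, 2 * math.pi)
        rho = rng.uniform(0, 2 * math.pi)
        cases.append(((lam, rho), target_angle(lam, rho)))
    return cases
```

Both inputs and the target are angles, which is what makes the task well-suited to directional units: a network with sigmoid units would instead have to cope with the wrap-around discontinuity at 2π.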
On each of 100 testing patterns, the resultant length of the output unit exceeds .85, and the mean error on the angle is less than .05 radians. The network can also learn the task with as few as 5 hidden units, with a concomitant decrease in learning speed. The compact nature of this network shows that the directional units form a natural, efficient representation for this problem. \n\n5.3 COMPLEX PATTERN COMPLETION \n\nOur earlier work described a large-scale DUBM that attacks a difficult problem in computer vision: image segmentation. In MAGIC (Mozer et al., 1992), directional values are used to represent alternative labels that can be assigned to image features. The goal of MAGIC is to learn to assign appropriate object labels to a set of image features (e.g., edge segments) based on a set of examples. The idea is that the features of a given object should have consistent phases, with each object taking on its own phase. The units in the network are arranged into two layers, feature and hidden, and the computation proceeds by randomly initializing the phases of the units in the feature layer, and settling on a labeling through a relaxation procedure. The units in the hidden layer learn to detect spatially local configurations of the image features that are labeled in a consistent manner across the training examples. \n\nMAGIC successfully learns to segment novel scenes consisting of overlapping geometric objects. The emergent DUBM properties described above are essential to MAGIC's ability to perform this task. The complex weights are necessary in MAGIC, as the weights encode statistical regularities in the relationships between image features, e.g., that two features typically belong to the same object (i.e., have similar phase values) or to different objects (i.e., are out of phase). 
The fact that a unit's resultant length reflects the certainty in a phase label allows the system to decide which phase labels to use when updating labels of neighboring features: the initially random phases are ignored, while confident labels are propagated. Finally, the rotation-invariance property allows the system to assign labels to features in a manner consistent with the relationships described in the weights, where it is the relative rather than absolute phases of the units that are important. \n\n6 CURRENT DIRECTIONS \n\nWe are currently extending this work in a number of directions. We are extending the definition of a DUBM to combine binary and directional units (Radford Neal, personal communication). This expanded representation may be useful in domains with directional data that is not present everywhere. For example, it can be directly applied to the object labeling problem explored in MAGIC. The binary aspect of the unit can describe whether a particular image feature is present or absent. This may enable the system to handle various complications, particularly labeling across gaps along the contour of an object. Finally, we are applying a DUBM network to the interesting and challenging problem of time-series prediction of wind directions. \n\nAcknowledgements \n\nThe authors thank Geoffrey Hinton for his generous support and guidance. We thank Radford Neal, Peter Dayan, Conrad Galland, Sue Becker, Steve Nowlan, and other members of the Connectionist Research Group at the University of Toronto for helpful comments regarding this work. This research was supported by a grant from the Information Technology Research Centre of Ontario to Geoffrey Hinton, and by NSF Presidential Young Investigator award IRI-9058450 and grant 90-21 from the James S. McDonnell Foundation to MM. \n\nReferences \n\nAckley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). 
A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169. \n\nBaldi, P. and Meir, R. (1990). Computing with arrays of coupled oscillators: An application to preattentive texture discrimination. Neural Computation, 2(4):458-471. \n\nFradkin, E., Huberman, B. A., and Shenker, S. H. (1978). Gauge symmetries in random magnetic systems. Physical Review B, 18(9):4789-4814. \n\nGislén, L., Peterson, C., and Söderberg, B. (1992). Rotor neurons: Basic formalism and dynamics. Neural Computation, 4(5):737-745. \n\nHinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1(2):143-150. \n\nMardia, K. V. (1972). Statistics of Directional Data. Academic Press, London. \n\nMozer, M. C., Zemel, R. S., Behrmann, M., and Williams, C. K. I. (1992). Learning to segment images using dynamic feature binding. Neural Computation, 4(5):650-665. \n\nNoest, A. J. (1988). Phasor neural networks. In Neural Information Processing Systems, pages 584-591, New York. AIP. \n\nPeterson, C. and Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019. \n\nZemel, R. S., Williams, C. K. I., and Mozer, M. C. (1992). Adaptive networks of directional units. Technical Report CRG-TR-92-2, University of Toronto. \n", "award": [], "sourceid": 674, "authors": [{"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Christopher", "family_name": "Williams", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}]}