{"title": "Expressive Power and Approximation Errors of Restricted Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 415, "page_last": 423, "abstract": "We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n-1)-log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance.", "full_text": "Expressive Power and Approximation Errors of\n\nRestricted Boltzmann Machines\n\nGuido F. Mont\u00b4ufar1, Johannes Rauh1, and Nihat Ay1,2\n\n1Max Planck Institute for Mathematics in the Sciences, Inselstra\u00dfe 22 04103 Leipzig, Germany\n\n2Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA\n\n{montufar,jrauh,nay}@mis.mpg.de\n\nAbstract\n\nWe present explicit classes of probability distributions that can be learned by Re-\nstricted Boltzmann Machines (RBMs) depending on the number of units that they\ncontain, and which are representative for the expressive power of the model. We\nuse this to show that the maximal Kullback-Leibler divergence to the RBM model\nwith n visible and m hidden units is bounded from above by (n\u22121)\u2212log(m+1).\nIn this way we can specify the number of hidden units that guarantees a suf\ufb01ciently\nrich model containing different classes of distributions and respecting a given er-\nror tolerance.\n\n1\n\nIntroduction\n\nA Restricted Boltzmann Machine (RBM) [24, 10] is a learning system consisting of two layers\nof binary stochastic units, a hidden layer and a visible layer, with a complete bipartite interaction\ngraph. 
RBMs are used as generative models to simulate input distributions of binary data. They can be trained in an unsupervised way and more efficiently than general Boltzmann Machines, which are not restricted to have a bipartite interaction graph [11, 6]. Furthermore, they can be used as building blocks to progressively train and study deep learning systems [13, 4, 16, 21]. Hence, RBMs have received increasing attention in the past years.\n\nAn RBM with n visible and m hidden units generates a stationary distribution on the states of the visible units which has the following form:\n\np_{W,C,B}(v) = (1/Z_{W,C,B}) Σ_{h ∈ {0,1}^m} exp(h⊤Wv + C⊤h + B⊤v)   for all v ∈ {0,1}^n ,\n\nwhere h ∈ {0,1}^m denotes a state vector of the hidden units, W ∈ R^{m×n}, C ∈ R^m and B ∈ R^n constitute the model parameters, and Z_{W,C,B} is the corresponding normalization constant. In the sequel we denote by RBM_{n,m} the set of all probability distributions on {0,1}^n which can be approximated arbitrarily well by a visible distribution generated by the RBM with m hidden and n visible units for an appropriate choice of the parameter values.\n\nAs shown in [21] (generalizing results from [15]), RBM_{n,m} contains any probability distribution if m ≥ 2^{n−1} − 1. On the other hand, if RBM_{n,m} equals the set P of all probability distributions on {0,1}^n, then it must have at least dim(P) = 2^n − 1 parameters, and thus at least ⌈2^n/(n+1)⌉ − 1 hidden units [21]. In fact, in [8] it was shown that for most combinations of m and n the dimension of RBM_{n,m} (as a manifold, possibly with singularities) equals either the number of parameters or 2^n − 1, whichever is smaller. 
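As a sanity check of the stationary distribution above, the visible distribution of a small RBM can be computed by brute-force enumeration. The following sketch is an illustration by the editors, not code from the paper, and is tractable only for small n and m:

```python
import itertools
import numpy as np

def rbm_visible_distribution(W, C, B):
    """Visible distribution p_{W,C,B}(v) of an RBM, computed by brute-force
    enumeration of all 2^n visible and 2^m hidden states (small n, m only)."""
    m, n = W.shape
    visibles = [np.array(v) for v in itertools.product([0, 1], repeat=n)]
    hiddens = [np.array(h) for h in itertools.product([0, 1], repeat=m)]
    # unnormalized probabilities: sum over hidden states of exp(h'Wv + C'h + B'v)
    p = np.array([sum(np.exp(h @ W @ v + C @ h + B @ v) for h in hiddens)
                  for v in visibles])
    return p / p.sum()  # divide by the normalization constant Z_{W,C,B}

# with all parameters zero, every term equals exp(0), so the visible
# distribution is uniform on {0,1}^n
p = rbm_visible_distribution(np.zeros((2, 3)), np.zeros(2), np.zeros(3))
```

The returned vector is indexed by the 2^n visible states in lexicographic order and always sums to one.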
However, the geometry of RBM_{n,m} is intricate, and even an RBM of dimension 2^n − 1 is not guaranteed to contain all visible distributions, see [20] for counterexamples. In summary, an RBM that can approximate any distribution arbitrarily well must have a very large number of parameters and hidden units. In practice, training such a large system is not desirable or even possible. However, there are at least two reasons why in many cases this is not necessary:\n\n• An appropriate approximation of distributions is sufficient for most purposes.\n\n• The interesting distributions the system shall simulate belong to a small class of distributions. Therefore, the model does not need to approximate all distributions.\n\nFor example, the set of optimal policies in reinforcement learning [25], the set of dynamics kernels that maximize predictive information in robotics [26] or the information flow in neural networks [3] are contained in very low dimensional manifolds; see [2]. On the other hand, usually it is very hard to mathematically describe a set containing the optimal solutions to general problems, or a set of interesting probability distributions (for example the class of distributions generating natural images). Furthermore, although RBMs are parametric models and for any choice of the parameters we have a resulting probability distribution, in general it is difficult to explicitly specify this resulting probability distribution (or even to estimate it [18]). Due to these difficulties the number of hidden units m is often chosen on the basis of experience [12], or m is considered as a hyperparameter which is optimized by extensive search, depending on the distributions to be simulated by the RBM.\n\nIn this paper we give an explicit description of classes of distributions that are contained in RBM_{n,m}, and which are representative for the expressive power of this model. 
Using this description, we estimate the maximal Kullback-Leibler divergence between an arbitrary probability distribution and the best approximation within RBM_{n,m}.\n\nThis paper is organized as follows: Section 2 discusses the different kinds of errors that appear when an RBM learns. Section 3 introduces the statistical models studied in this paper. Section 4 studies submodels of RBM_{n,m}. An upper bound of the approximation error for RBMs is found in Section 5.\n\n2 Approximation Error\n\nWhen training an RBM to represent a distribution p, there are mainly three contributions to the discrepancy between p and the state of the RBM after training:\n\n1. Usually the underlying distribution p is unknown and only a set of samples generated by p is observed. These samples can be represented as an empirical distribution p^Data, which usually is not identical with p.\n\n2. The set RBM_{n,m} does not contain every probability distribution, unless the number of hidden units is very large, as we outlined in the introduction. Therefore, we have an approximation error given by the distance of p^Data to the best approximation p^Data_RBM contained in the RBM model.\n\n3. The learning process may yield a solution p̃^Data_RBM in RBM which is not the optimum p^Data_RBM. This occurs, for example, if the learning algorithm gets trapped in a local optimum, or if it optimizes an objective different from Maximum Likelihood, e.g. contrastive divergence (CD), see [6].\n\nIn this paper we study the expressive power of the RBM model and the Kullback-Leibler divergence from an arbitrary distribution to its best representation within the RBM model. Estimating the approximation error is difficult, because the geometry of the RBM model is not sufficiently understood. Our strategy is to find subsets M ⊆ RBM_{n,m} that are easy to describe. 
Then the maximal error when approximating probability distributions with an RBM is upper bounded by the maximal error when approximating with M.\n\nConsider a finite set X. A real valued function on X can be seen as a real vector with |X| entries. The set P = P(X) of all probability distributions on X is a (|X| − 1)-dimensional simplex in R^{|X|}. There are several notions of distance between probability distributions, and in turn for the error in the representation (approximation) of a probability distribution. One possibility is to use the induced distance of the Euclidean space R^{|X|}. From the point of view of information theory, a more meaningful distance notion for probability distributions is the Kullback-Leibler divergence:\n\nD(p‖q) := Σ_x p(x) log( p(x)/q(x) ) .\n\nIn this paper we use the base-2 logarithm. The Kullback-Leibler (KL) divergence is non-negative and vanishes if and only if p = q. If the support of q does not contain the support of p, it is defined as ∞. The summands with p(x) = 0 are set to 0.\n\nFigure 1: This figure gives an intuition on what the size of an error means for probability distributions on images with 16 × 16 pixels. Every column shows four samples drawn from the best approximation q of the distribution p = (1/2)(δ_{(1...1)} + δ_{(0...0)}) within a partition model with 2 randomly chosen cubical blocks, containing (0 . . . 0) and (1 . . . 1), of cardinality from 1 (first column) to |X|/2 (last column). As a measure of error ranging from 0 to 1 we take D(p‖q)/D(p ‖ 1/|X|). The last column shows samples from the uniform distribution, which is, in particular, the best approximation of p within RBM_{n,0}. Note that an RBM with 1 hidden unit can approximate p with arbitrary accuracy, see Theorem 4.1. 
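The divergence just defined can be implemented directly. The following sketch (an editorial illustration, not code from the paper) uses the base-2 logarithm and the conventions stated above:

```python
import numpy as np

def kl_divergence(p, q):
    """Base-2 Kullback-Leibler divergence D(p||q) between two probability
    distributions given as vectors over a finite set X."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # summands with p(x) = 0 are set to 0
    if np.any(q[mask] == 0):          # supp(q) does not contain supp(p)
        return float("inf")
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# a Dirac delta against the uniform distribution on |X| = 8 states
delta = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])
uniform = np.full(8, 1 / 8)
```

For instance, kl_divergence(delta, uniform) evaluates to 3 bits, i.e. log |X| for |X| = 8, while the reversed call kl_divergence(uniform, delta) is infinite, illustrating the asymmetry of the divergence.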
The KL-divergence is not symmetric, but it has nice information theoretic properties [14, 7]. If E ⊆ P is a statistical model and if p ∈ P, then any probability distribution p_E ∈ Ē satisfying\n\nD(p‖p_E) = D(p‖E) := min{D(p‖q) : q ∈ Ē}\n\nis called a (generalized) reversed information projection, or rI-projection. Here, Ē denotes the closure of E. If p is an empirical distribution, then one can show that any rI-projection is a maximum likelihood estimate.\n\nIn order to assess an RBM or some other model M we use the maximal approximation error with respect to the KL-divergence when approximating arbitrary probability distributions using M:\n\nD_M := max {D(p‖M) : p ∈ P} .\n\nFor example, the maximal KL-divergence to the uniform distribution 1/|X| is attained by any Dirac delta distribution δ_x, x ∈ X, and amounts to:\n\nD_{1/|X|} = D(δ_x ‖ 1/|X|) = log |X| .   (1)\n\n3 Model Classes\n\n3.1 Exponential families and product measures\n\nIn this work we only need a restricted class of exponential families, namely exponential families on a finite set with uniform reference measure. See [5] for more on exponential families. The boundary of discrete exponential families is discussed in [23], which uses a similar notation.\n\nLet A ∈ R^{d×|X|} be a matrix. The columns A_x of A will be indexed by x ∈ X. The rows of A can be interpreted as functions on X. The exponential family E_A with sufficient statistics A consists of all probability distributions of the form p_λ, λ ∈ R^d, where\n\np_λ(x) = exp(λ⊤A_x) / Σ_{x′} exp(λ⊤A_{x′})   for all x ∈ X .\n\nNote that any probability distribution in E_A has full support. Furthermore, E_A is in general not a closed set. The closure Ē_A (with respect to the usual topology on R^X) will be important in the following. Exponential families behave nicely with respect to rI-projection: any p ∈ P has a unique rI-projection p_E to E_A.\n\nThe most important exponential families in this work are the independence models. The independence model of n binary random variables consists of all probability distributions on {0,1}^n that factorize:\n\nĒ_n = { p ∈ P(X) : p(x_1, . . . , x_n) = Π_{i=1}^{n} p_i(x_i) for some p_i ∈ P({0,1}) } .\n\nIt is the closure of an n-dimensional exponential family E_n. This model corresponds to the RBM model with no hidden units. An element of the independence model is called a product distribution.\n\nLemma 3.1 (Corollary 4.1 of [1]) Let E_n be the independence model on {0,1}^n. If n > 0, then D_{E_n} = (n − 1). The global maximizers are the distributions of the form (1/2)(δ_x + δ_y), where x, y ∈ {0,1}^n satisfy x_i + y_i = 1 for all i.\n\nThis result should be compared with (1). Although the independence model is much larger than the set {1/|X|}, the maximal divergence decreases only by 1. As shown in [22], if E is any exponential family of dimension k, then D_E ≥ log(|X|/(k + 1)). Thus, this notion of distance is rather strong. The exponential families satisfying D_E = log(|X|/(k + 1)) are partition models; they will be defined in the following section.\n\n3.2 Partition models and mixtures of products with disjoint supports\n\nThe mixture of m models M_1, . . . , M_m ⊆ P is the set of all convex combinations\n\np = Σ_i α_i p_i , where p_i ∈ M_i, α_i ≥ 0, Σ_i α_i = 1 .   (2)\n\nIn general, mixture models are complicated objects. Even if all models M_1 = · · · = M_m are equal, it is difficult to describe the mixture [17, 19]. The situation simplifies considerably if the models have disjoint supports. Note that given any partition ξ = {X_1, . . . 
, X_m} of X, any p ∈ P can be written as p(x) = p_{X_i}(x) p(X_i) for all x ∈ X_i and i ∈ {1, . . . , m}, where p_{X_i} is a probability measure in P(X_i) for all i.\n\nLemma 3.2 Let ξ = {X_1, . . . , X_m} be a partition of X and let M_1, . . . , M_m be statistical models such that M_i ⊆ P(X_i). Consider any p ∈ P and corresponding p_{X_i} such that p(x) = p_{X_i}(x) p(X_i) for x ∈ X_i. Let p_i be an rI-projection of p_{X_i} to M_i. Then the rI-projection p_M of p to the mixture M of M_1, . . . , M_m satisfies\n\np_M(x) = p(X_i) p_i(x),   whenever x ∈ X_i .\n\nTherefore, D(p‖M) = Σ_i p(X_i) D(p_{X_i} ‖ M_i), and so D_M = max_{i=1,...,m} D_{M_i}.\n\nProof Let p ∈ M be as in (2). Then D(q‖p) = Σ_{i=1}^{m} q(X_i) D(q_{X_i} ‖ p_i) for all q ∈ P. For fixed q this sum is minimal if and only if each term is minimal. □\n\nIf each M_i is an exponential family, then the mixture is also an exponential family (this is not true if the supports of the models M_i are not disjoint). In the rest of this section we discuss two examples. If each M_i equals the set containing just the uniform distribution on X_i, then M is called the partition model of ξ, denoted by P_ξ. The partition model P_ξ consists of all distributions with constant value on each block X_i, i.e. those that satisfy p(x) = p(y) for all x, y ∈ X_i. This is the closure of the exponential family with sufficient statistics\n\nA_x = (χ_1(x), χ_2(x), . . . , χ_d(x))⊤ ,\n\nwhere χ_i := χ_{X_i} is 1 for x ∈ X_i, and 0 everywhere else. See [22] for interesting properties of partition models.\n\nThe partition models include the set of finite exchangeable distributions (see e.g. [9]), where the blocks of the partition are the sets of binary vectors which have the same number of entries equal to one. The probability of a vector v depends only on the number of ones, but not on their position.\n\nCorollary 3.3 Let ξ = {X_1, . . . , X_m} be a partition of X . 
Then D_{P_ξ} = max_{i=1,...,m} log |X_i| .\n\nFigure 2: Models in P({0,1}^2). Left: The blue line represents the partition model P_ξ with partition ξ = {(11), (01)} ∪ {(00), (10)}. The dashed lines represent the set of KL-divergence maximizers for P_ξ. Right: The mixture of the product distributions E_1 and E_2 with disjoint supports on {(11), (01)} and {(00), (10)} corresponding to the same partition ξ equals the whole simplex P.\n\nNow assume that X = {0,1}^n is the set of binary vectors of length n. As a subset of R^n it consists of the vertices (extreme points) of the n-dimensional hypercube. The vertices of a k-dimensional face of the n-cube are given by fixing the values of x in n − k positions:\n\n{x ∈ {0,1}^n : x_i = x̃_i , ∀i ∈ I} , for some I ⊆ {1, . . . , n}, |I| = n − k .\n\nWe call such a subset Y ⊆ X cubical, or a face of the n-cube. A cubical subset of cardinality 2^k can be naturally identified with {0,1}^k. This identification allows us to define independence models and product measures on P(Y) ⊆ P(X). Note that product measures on Y are also product measures on X, and the independence model on Y is a subset of the independence model on X.\n\nCorollary 3.4 Let ξ = {X_1, . . . , X_m} be a partition of X = {0,1}^n into cubical sets. For any i let E_i be the independence model on X_i, and let M be the mixture of E_1, . . . , E_m. 
Then\n\nD_M = max_{i=1,...,m} log(|X_i|) − 1 .\n\nSee Figure 1 for an intuition on the approximation error of partition models, and see Figure 2 for small examples of a partition model and of a mixture of products with disjoint support.\n\n4 Classes of distributions that RBMs can learn\n\nConsider a set ξ = {X_i}_{i=1}^{m} of m disjoint cubical sets X_i in X. Such a ξ is a partition of some subset ∪ξ = ∪_i X_i of X into m disjoint cubical sets. We write G_m for the collection of all such partitions. We have the following result:\n\nTheorem 4.1 RBM_{n,m} contains the following distributions:\n\n• Any mixture of one arbitrary product distribution, m − k product distributions with support on arbitrary but disjoint faces of the n-cube, and k arbitrary distributions with support on any edges of the n-cube, for any 0 ≤ k ≤ m. In particular:\n\n• Any mixture of m + 1 product distributions with disjoint cubical supports. In consequence, RBM_{n,m} contains the partition model of any partition in G_{m+1}.\n\nRestricting the cubical sets of the second item to edges, i.e. pairs of vectors differing in one entry, we see that the above theorem implies the following previously known result, which was shown in [21].\n\nCorollary 4.2 RBM_{n,m} contains the following distributions:\n\n• Any distribution with a support set that can be covered by m + 1 pairs of vectors differing in one entry. In particular, this includes:\n\n• Any distribution in P with a support of cardinality smaller than or equal to m + 1.\n\nCorollary 4.2 implies that an RBM with m ≥ 2^{n−1} − 1 hidden units is a universal approximator of distributions on {0,1}^n, i.e. it can approximate any distribution to arbitrarily good accuracy.\n\nAssume m + 1 = 2^k and let ξ be a partition of X into m + 1 disjoint cubical sets of equal size. 
Let us denote by P_{ξ,1} the set of all distributions which can be written as a mixture of m + 1 product distributions with support on the elements of ξ. The dimension of P_{ξ,1} is given by\n\ndim P_{ξ,1} = (m + 1) log( 2^n/(m + 1) ) + m + 1 + n = (m + 1)·n + (m + 1) + n − (m + 1) log(m + 1) .\n\nThe dimension of the set of visible distributions represented by an RBM is at most equal to the number of parameters, see [21], which is m·n + m + n. This means that the class given above has roughly the same dimension as the set of distributions that can be represented. In fact,\n\ndim P_{ξ,1} − dim RBM_{n,m} = n + 1 − (m + 1) log(m + 1) .\n\nThis means that the class of distributions P_{ξ,1}, which by Theorem 4.1 can be represented by RBM_{n,m}, is not contained in RBM_{n,m−1} when (m + 1)^{m+1} ≤ 2^{n+1}.\n\nProof of Theorem 4.1 The proof draws on ideas from [15] and [21]. An RBM with no hidden units can represent precisely the independence model, i.e. all product distributions, and in particular any uniform distribution on a face of the n-cube.\n\nConsider an RBM with m − 1 hidden units. For any choice of the parameters W ∈ R^{(m−1)×n}, B ∈ R^n, C ∈ R^{m−1} we can write the resulting distribution on the visible units as:\n\np(v) = Σ_h z(v, h) / Σ_{v′,h′} z(v′, h′) ,   (3)\n\nwhere z(v, h) = exp(h⊤Wv + B⊤v + C⊤h). Appending one additional hidden unit, with connection weights w to the visible units and bias c, produces a new distribution which can be written as follows:\n\np_{w,c}(v) = (1 + exp(w⊤v + c)) Σ_h z(v, h) / Σ_{v′,h′} (1 + exp(w⊤v′ + c)) z(v′, h′) .   (4)\n\nConsider now any set I ⊆ [n] := {1, . . . , n} and an arbitrary visible vector u ∈ X. The values of u in the positions [n]\\I define a face F := {v ∈ X : v_i = u_i , ∀i ∉ I} of the n-cube of dimension |I|. Let 1 := (1, . . . 
, 1) ∈ R^n and denote by u^{I,0} the vector with entries u^{I,0}_i = u_i , ∀i ∉ I and u^{I,0}_i = 0 , ∀i ∈ I. Let λ^I ∈ R^n with λ^I_i = 0 , ∀i ∉ I, and let λ_c, a ∈ R. Define the connection weights w and c as follows:\n\nw = a(u^{I,0} − (1/2)·1^{I,0}) + λ^I ,\nc = −a(u^{I,0} − (1/2)·1^{I,0})⊤ u + λ_c .\n\nFor this choice and a → ∞ equation (4) yields:\n\np_{w,c}(v) = p(v) / (1 + Σ_{v′∈F} exp(λ^I·v′ + λ_c) p(v′)) ,   ∀v ∉ F ,\np_{w,c}(v) = (1 + exp(λ^I·v + λ_c)) p(v) / (1 + Σ_{v′∈F} exp(λ^I·v′ + λ_c) p(v′)) ,   ∀v ∈ F .\n\nIf the initial p from equation (3) is such that its restriction to F is a product distribution, then p(v) = K exp(η^I·v) , ∀v ∈ F , where K is a constant and η^I is a vector with η^I_i = 0 , ∀i ∉ I. We can choose λ^I = β^I − η^I and exp(λ_c) = (α/(1 − α)) · 1/(K Σ_{v∈F} exp(β^I·v)). For this choice, equation (4) yields:\n\np_{w,c} = (1 − α)p + α p̂ ,\n\nwhere p̂ is a product distribution with support in F and arbitrary natural parameters β^I, and α is an arbitrary mixture weight in [0, 1]. 
Finally, the product distributions on edges of the cube are arbitrary, see [19] or [21] for details, and hence the restriction of any p to any edge is a product distribution. □\n\nFigure 3: This figure demonstrates our results for n = 3 (left) and n = 4 (right) visible units. The red curves represent the bounds (n − 1) − log(m + 1) from Theorem 5.1. We fixed p_parity as target distribution, the uniform distribution on binary length-n vectors with an even number of ones. The distribution p_parity is not the KL-maximizer from RBM_{n,m}, but it is in general difficult to represent. Qualitatively, samples from p_parity look uniformly distributed, and representing p_parity requires the maximal number of product mixture components [20, 19]. For both values of n and each m = 0, . . . , 2^n/2 we initialized 500 resp. 1000 RBMs at parameter values chosen uniformly at random in the range [−10, 10]. The inset of the left figure shows the resulting KL-divergence D(p_parity ‖ p^rand_RBM) (for n = 4 the resulting KL-divergence was larger). Randomly chosen distributions in RBM_{n,m} are likely to be very far from the target distribution. We trained these randomly initialized RBMs using CD for 500 training epochs, learning rate 1 and a list of even parity vectors as training data. The result after training is given by the blue circles. After training the RBMs the result is often not better than the uniform distribution, for which D(p_parity ‖ 1/|{0,1}^n|) = 1. 
For each m, the best set of parameters after training was used to initialize a further CD training with a smaller learning rate (green squares, mostly covered) followed by a short maximum likelihood gradient ascent (red filled squares).\n\n5 Maximal Approximation Errors of RBMs\n\nLet m < 2^{n−1} − 1. By Theorem 4.1 all partition models for partitions of {0,1}^n into m + 1 cubical sets are contained in RBM_{n,m}. Applying Corollary 3.3 to such a partition where the cardinality of all blocks is at most 2^{n−⌊log(m+1)⌋} yields the bound D_{RBM_{n,m}} ≤ n − ⌊log(m + 1)⌋. Similarly, using mixtures of product distributions, Theorem 4.1 and Corollary 3.4 imply the smaller bound D_{RBM_{n,m}} ≤ n − 1 − ⌊log(m + 1)⌋. In this section we derive an improved bound which strictly decreases, as m increases, until 0 is reached.\n\nTheorem 5.1 Let m ≤ 2^{n−1} − 1. Then the maximal Kullback-Leibler divergence from any distribution on {0,1}^n to RBM_{n,m} is upper bounded by\n\nmax_{p∈P} D(p ‖ RBM_{n,m}) ≤ (n − 1) − log(m + 1) .\n\nConversely, given an error tolerance 0 ≤ ε ≤ 1, the choice m ≥ 2^{(n−1)(1−ε)} − 1 ensures a sufficiently rich RBM model that satisfies D_{RBM_{n,m}} ≤ ε·D_{RBM_{n,0}}.\n\nFor m = 2^{n−1} − 1 the error vanishes, corresponding to the fact that an RBM with that many hidden units is a universal approximator. In Figure 3 we use computer experiments to illustrate Theorem 5.1. The proof makes use of the following lemma:\n\nLemma 5.2 Let n_1, . . . , n_m ≥ 0 be such that 2^{n_1} + · · · + 2^{n_m} = 2^n. Let M be the union of all mixtures of independence models corresponding to all cubical partitions of X into blocks of cardinalities 2^{n_1}, . . . , 2^{n_m}. Then D_M ≤ Σ_{i : n_i > 1} (n_i − 1)/2^{n−n_i} .\n\nProof of Lemma 5.2 The proof is by induction on n. 
If n = 1, then m = 1 or m = 2, and in both cases it is easy to see that the inequality holds (both sides vanish). If n > 1, then order the n_i such that n_1 ≥ n_2 ≥ · · · ≥ n_m ≥ 0. Without loss of generality assume m > 1.\n\nLet p ∈ P(X), and let Y be a cubical subset of X of cardinality 2^{n−1} such that p(Y) ≤ 1/2. Since the numbers 2^{n_1} + · · · + 2^{n_i} for i = 1, . . . , m contain all multiples of 2^{n_1} up to 2^n and 2^n/2^{n_1} is even, there exists k such that 2^{n_1} + · · · + 2^{n_k} = 2^{n−1} = 2^{n_{k+1}} + · · · + 2^{n_m}.\n\nLet M′ be the union of all mixtures of independence models corresponding to all cubical partitions ξ = {X_1, . . . , X_m} of X into m blocks of cardinalities 2^{n_1}, . . . , 2^{n_m} such that X_1 ∪ · · · ∪ X_k = Y. In the following, the symbol Σ′_i shall denote summation over all indices i such that n_i > 1. By induction,\n\nD(p‖M) ≤ D(p‖M′) ≤ p(Y) Σ′_{i=1}^{k} (n_i − 1)/2^{n−1−n_i} + p(X \\ Y) Σ′_{j=k+1}^{m} (n_j − 1)/2^{n−1−n_j} .   (5)\n\nThere exist j_1 = k + 1 < j_2 < · · · < j_k < j_{k+1} = m + 1 such that 2^{n_i} = 2^{n_{j_i}} + · · · + 2^{n_{j_{i+1}−1}} for all i ≤ k. Note that\n\nΣ′_{j=j_i}^{j_{i+1}−1} (n_j − 1)/2^{n−1−n_j} ≤ ((n_i − 1)/2^{n−1}) (2^{n_{j_i}} + · · · + 2^{n_{j_{i+1}−1}}) = (n_i − 1)/2^{n−1−n_i} ,\n\nand therefore\n\n(1/2 − p(Y)) (n_i − 1)/2^{n−1−n_i} + (1/2 − p(X \\ Y)) Σ′_{j=j_i}^{j_{i+1}−1} (n_j − 1)/2^{n−1−n_j} ≥ 0 .\n\nAdding these terms for i = 1, . . . 
, k to the right hand side of equation (5) yields\n\nD(p‖M) ≤ (1/2) Σ′_{i=1}^{k} (n_i − 1)/2^{n−1−n_i} + (1/2) Σ′_{j=k+1}^{m} (n_j − 1)/2^{n−1−n_j} ,\n\nfrom which the assertions follow. □\n\nProof of Theorem 5.1 From Theorem 4.1 we know that RBM_{n,m} contains the union M of all mixtures of independence models corresponding to all partitions with up to m + 1 cubical blocks. Hence, D_{RBM_{n,m}} ≤ D_M. Let k = n − ⌊log(m + 1)⌋ and l = 2m + 2 − 2^{n−k+1} ≥ 0; then l·2^{k−1} + (m + 1 − l)·2^k = 2^n. Lemma 5.2 with n_1 = · · · = n_l = k − 1 and n_{l+1} = · · · = n_{m+1} = k implies\n\nD_M ≤ l(k − 2)/2^{n−k+1} + (m + 1 − l)(k − 1)/2^{n−k} = k − (m + 1)/2^{n−k} .\n\nThe assertion follows from log(m + 1) ≤ (n − k) + (m + 1)/2^{n−k} − 1, where log(1 + x) ≤ x for all x > 0 was used. □\n\n6 Conclusion\n\nWe studied the expressive power of the Restricted Boltzmann Machine model with n visible and m hidden units. We presented a hierarchy of explicit classes of probability distributions that an RBM can represent. These classes include large collections of mixtures of m + 1 product distributions, in particular any mixture of an arbitrary product distribution and m further product distributions with disjoint supports. The geometry of these submodels is easier to study than that of the RBM models, while these subsets still capture many of the distributions contained in the RBM models. Using these results we derived bounds for the approximation errors of RBMs. We showed that it is always possible to reduce the error to at most (n − 1) − log(m + 1). That is, given any target distribution, there is a distribution within the RBM model for which the Kullback-Leibler divergence between both is not larger than that number. 
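The size selection implied by Theorem 5.1 can be made concrete numerically. The helper below is an editorial sketch (not code from the paper) of the rule m ≥ 2^((n−1)(1−ε)) − 1 and the bound (n − 1) − log(m + 1):

```python
import math

def kl_bound(n, m):
    """Upper bound (n - 1) - log2(m + 1) from Theorem 5.1 on the maximal
    KL divergence from any distribution on {0,1}^n to RBM_{n,m}."""
    return (n - 1) - math.log2(m + 1)

def hidden_units_for_tolerance(n, eps):
    """Smallest integer m with m >= 2^((n-1)(1-eps)) - 1, which guarantees
    D_{RBM_{n,m}} <= eps * (n - 1) = eps * D_{RBM_{n,0}}."""
    return max(0, math.ceil(2 ** ((n - 1) * (1 - eps)) - 1))

# e.g. n = 10 visible units with tolerance eps = 0.5: m = 22 hidden units
# suffice, and the resulting bound 9 - log2(23) is below 0.5 * 9 = 4.5
```

For eps = 0 the rule recovers the universal-approximation count 2^(n−1) − 1 hidden units, for which the bound vanishes.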
Our results give a theoretical basis for selecting the size of an\nRBM which accounts for a desired error tolerance.\n\nComputer experiments showed that the bound captures the order of magnitude of the true approxi-\nmation error, at least for small examples. However, learning may not always \ufb01nd the best approxi-\nmation, resulting in an error that may well exceed our bound.\n\nAcknowledgments\n\nNihat Ay acknowledges support by the Santa Fe Institute.\n\n8\n\n\fReferences\n\n[1] N. Ay and A. Knauf. Maximizing multi-information. Kybernetika, 42:517\u2013538, 2006.\n[2] N. Ay, G. Mont\u00b4ufar, and J. Rauh. Selection criteria for neuromanifolds of stochastic dynamics.\n\nInternational Conference on Cognitive Neurodynamics, 2011.\n\n[3] N. Ay and T. Wennekers. Dynamical properties of strongly interacting Markov chains. Neural\n\nNetworks, 16:1483\u20131497, 2003.\n\n[4] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep\n\nnetworks. NIPS, 2007.\n\n[5] L. Brown. Fundamentals of Statistical Exponential Families: With Applications in Statistical\n\nDecision Theory. Inst. Math. Statist., Hayworth, CA, USA, 1986.\n\n[6] M. A. Carreira-Perpi\u02dcnan and G. E. Hinton. On contrastive divergence learning. In Proceedings\n\nof the 10-th International Workshop on Arti\ufb01cial Intelligence and Statistics, 2005.\n\n[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2006.\n[8] M. A. Cueto, J. Morton, and B. Sturmfels. Geometry of the Restricted Boltzmann Machine.\nIn M. A. G. Viana and H. P. Wynn, editors, Algebraic methods in statistics and probability II,\nAMS Special Session. AMS, 2010.\n\n[9] P. Diaconis and D. Freedman. Finite exchangeable sequences. Ann. Probab., 8:745\u2013764, 1980.\n[10] Y. Freund and D. Haussler. Unsupervised learning of distributions on binary vectors using\n\n2-layer networks. NIPS, pages 912\u2013919, 1992.\n\n[11] G. E. Hinton. 
Training products of experts by minimizing contrastive divergence. Neural\n\nComput., 14:1771\u20131800, 2002.\n\n[12] G. E. Hinton. A practical guide to training Restricted Boltzmann Machines, version 1. Tech-\n\nnical report, UTML2010-003, University of Toronto, 2010.\n\n[13] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for Deep Belief Nets. Neural\n\nComput., 18:1527\u20131554, 2006.\n\n[14] S. Kullback and R. Leibler. On information and suf\ufb01ciency. Ann. Math. Stat., 22:79\u201386, 1951.\n[15] N. Le Roux and Y. Bengio. Representational power of Restricted Boltzmann Machines and\n\nDeep Belief Networks. Neural Comput., 20(6):1631\u20131649, 2008.\n\n[16] N. Le Roux and Y. Bengio. Deep Belief Networks are compact universal approximators. Neu-\n\nral Comput., 22:2192\u20132207, 2010.\n\n[17] B. Lindsay. Mixture models: theory, geometry, and applications. Inst. Math. Statist., 1995.\n[18] P. M. Long and R. A. Servedio. Restricted Boltzmann Machines are hard to approximately\n\nevaluate or simulate. In Proceedings of the 27-th ICML, pages 703\u2013710, 2010.\n\n[19] G. Mont\u00b4ufar. Mixture decompositions using a decomposition of the sample space. ArXiv\n\n1008.0204, 2010.\n\n[20] G. Mont\u00b4ufar. Mixture models and representational power of RBMs, DBNs and DBMs. NIPS\n\nDeep Learning and Unsupervised Feature Learning Workshop, 2010.\n\n[21] G. Mont\u00b4ufar and N. Ay. Re\ufb01nements of universal approximation results for Deep Belief Net-\n\nworks and Restricted Boltzmann Machines. Neural Comput., 23(5):1306\u20131319, 2011.\n\n[22] J. Rauh. Finding the maximizers of the information divergence from an exponential family.\n\nPhD thesis, Universit\u00a8at Leipzig, 2011.\n\n[23] J. Rauh, T. Kahle, and N. Ay. Support sets of exponential families and oriented matroids. Int.\n\nJ. Approx. Reason., 52(5):613\u2013626, 2011.\n\n[24] P. Smolensky. 
Information processing in dynamical systems: foundations of harmony theory. In Symposium on Parallel and Distributed Processing, 1986.\n\n[25] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). MIT Press, March 1998.\n\n[26] K. G. Zahedi, N. Ay, and R. Der. Higher coordination with less control – a result of information maximization in the sensori-motor loop. Adaptive Behavior, 18(3-4):338–355, 2010.\n", "award": [], "sourceid": 307, "authors": [{"given_name": "Guido", "family_name": "Montufar", "institution": null}, {"given_name": "Johannes", "family_name": "Rauh", "institution": null}, {"given_name": "Nihat", "family_name": "Ay", "institution": null}]}