{"title": "Efficient Deep Approximation of GMMs", "book": "Advances in Neural Information Processing Systems", "page_first": 4552, "page_last": 4560, "abstract": "The universal approximation theorem states that any regular function can be approximated closely using a single hidden layer neural network. Some recent work has shown that, for some special functions, the number of nodes in such an approximation could be exponentially reduced with multi-layer neural networks. In this work, we extend this idea to a rich class of functions, namely the discriminant functions that arise in optimal Bayesian classification of Gaussian mixture models (GMMs) in $\\mathds{R}^n$. We show that such functions can be approximated with arbitrary precision using $O(n)$ nodes in a neural network with two hidden layers (deep neural network), while in contrast, a neural network with a single hidden layer (shallow neural network) would require at least $O(\\exp(n))$ nodes or exponentially large coefficients. Given the universality of the Gaussian distribution in the feature spaces of data, e.g., in speech, image and text, our results shed light on the observed efficiency of deep neural networks in practical classification problems.", "full_text": "Ef\ufb01cient Deep Approximation of GMMs\n\nShirin Jalali, Carl Nuzman, Iraj Saniee\n\nBell Labs, Nokia\n\n{shirin.jalali,carl.nuzman,iraj.saniee}@nokia-bell-labs.com\n\n600-700 Mountain Avenue\n\nMurray Hill, NJ 07974\n\nAbstract\n\nThe universal approximation theorem states that any regular function can be ap-\nproximated closely using a single hidden layer neural network. Some recent work\nhas shown that, for some special functions, the number of nodes in such an ap-\nproximation could be exponentially reduced with multi-layer neural networks. In\nthis work, we extend this idea to a rich class of functions, namely the discriminant\nfunctions that arise in optimal Bayesian classi\ufb01cation of Gaussian mixture models\n(GMMs) in Rn. 
We show that such functions can be approximated with arbitrary precision using O(n) nodes in a neural network with two hidden layers (deep neural network), while in contrast, a neural network with a single hidden layer (shallow neural network) would require at least O(exp(n)) nodes or exponentially large coefficients. Given the universality of the Gaussian distribution in the feature spaces of data, e.g., in speech, image and text, our results shed light on the observed efficiency of deep neural networks in practical classification problems.\n\n1 Introduction\n\nThere is a rapidly growing literature which demonstrates the effectiveness of deep neural networks in classification problems that arise in practice; e.g., in audio, image or text classification. The universal approximation theorem, UAT, see [1, 2, 3, 4], states that any regular function, which for example separates in the (high dimensional) feature space a collection of points corresponding to images of dogs from those of cats, can be approximated by a neural network. But the UAT is proven for shallow, i.e., single hidden-layer, neural networks, and in fact the number of nodes needed may be exponentially or super-exponentially large in the ambient dimension of the feature space. Yet, practical deep neural networks are able to solve such classification problems effectively and efficiently, i.e., using what amounts to a small number of nodes in terms of the size of the feature space of the data. There is no theory yet as to why deep neural networks (DNNs from here on) are as effective and efficient in practice as they evidently are. 
There are essentially two possibilities for this observed outcome: 1) DNNs are always significantly more efficient than shallow networks, in terms of the number of nodes used, for approximation of any relevant function, or 2) DNNs are particularly suited to discriminant functions that arise in practice, e.g., those that separate, in the feature space of images, points representing dogs from points representing cats. If the latter proposition is true, then the observed efficiency of DNNs is essentially due to the special form of the discriminant functions encountered in practice and not to the universal efficiency of DNNs, which the former proposition would imply.\n\nThe first alternative proposed above is a general question about function approximation given neural networks as the collection of basis functions. As of today, there are no general results that show DNNs (those with two or more hidden layers) require fundamentally fewer nodes for approximation of general functions than shallow neural networks (or SNNs from here on, i.e., those with a single hidden layer). In this paper, we focus on the second alternative and provide an answer in the affirmative; that indeed many discriminant functions that arise in practice are such that DNNs require significantly, i.e., logarithmically, fewer nodes for their approximation than SNNs. To formalize what may constitute discriminant functions that arise in practice, we focus on a versatile class of distributions often used to model real-life distributions, namely the Gaussian mixture model (GMM for short).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n
GMMs have been shown to be good models for audio, speech, image and text processing in the past decades, e.g., see [5, 6, 7, 8, 9].\n\n1.1 Background\n\nThe universal approximation theorem [1, 2, 3, 4] states that shallow neural networks (SNNs) can approximate regular functions to any required accuracy, albeit potentially with an exponentially large number of nodes. Can this number be reduced significantly, e.g., logarithmically, by deep neural networks? As indicated above, there is no such result as of yet, and there is scant literature that even discusses this question. Some evidence exists that DNNs may in fact not be efficient in theory, see [10]. On the other hand, some special functions have been constructed for which DNNs achieve significant and even logarithmic reduction in the number of nodes compared to SNNs; e.g., see [11, 12, 13] for a special radial function, functions of the form f(x, x') = g(<x, x'>) (f : S^{n-1} x S^{n-1} -> R and g : [-1, 1] -> R), and polynomials, respectively. However, the functions considered in these references are typically very special and have little demonstrated basis in practice. Perhaps the most illustrative cases are the high-degree polynomials discussed in [13], but the logarithmic reduction in the number of nodes due to the depth of the DNNs demonstrated in that work occurs only when the polynomial degree is very high relative to the feature dimension.\n\nIn this work we are motivated by model universality considerations. What models of data are typical, and what resulting discriminant functions do we typically need to approximate in practice? With a plausible model, we can determine if the resulting discriminant function(s) can be approximated efficiently by deep networks. 
To this end, we focus on data with Gaussian feature distributions, which provide a practical model for many types of data, especially when the feature space is sufficiently concentrated, e.g., after a number of projections onto lower-dimensional spaces; see [14].\n\nOur overall framework is based on the following set of definitions and demonstrations that we describe in detail in the following sections. Section 1.2 defines an L-layer neural network. Section 1.3 reviews the problem of optimal classification of a collection of high-dimensional GMMs. The classifier function for GMMs is readily seen to be the maximum of multiple discriminant functions consisting of sums of exponentials of quadratic functions in dimension n. Section 2 establishes the connection between approximating the defined discriminant functions and the described classification problem. Section 3 demonstrates that DNNs can approximate general n-dimensional GMM discriminant functions using O(n) nodes. In Section 4, we show that, in contrast to DNNs, SNNs need either an exponential (in n) number of nodes, or exponentially large coefficients, to approximate discriminant functions of GMMs. In Section 5, we show sufficiency of an exponential number of nodes by studying an SNN where the weights in the first layer are drawn from a random distribution.\n\nNotation. Throughout the paper, bold letters, such as x and y, refer to vectors. Sets are denoted by calligraphic letters, such as X and Y. For a discrete set X, |X| denotes its cardinality. 0n denotes the all-zero vector in Rn. In denotes the n-dimensional identity matrix. For x in Rn, ||x||^2 = sum_{i=1}^n x_i^2.\n\n1.2 L-layer neural networks and the activation function sigma\n\nL-layer Neural Network. Consider a fully-connected neural network with L hidden layers. We refer to a network with L = 1 hidden layer as an SNN and to a network with L > 1 hidden layers as a DNN. 
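The fully-connected architecture just introduced can be made concrete with a short sketch (illustrative Python/numpy only; the sigmoid non-linearity and the layer sizes are our own choices, not fixed by the paper):

```python
import numpy as np

def sigma(z):
    """Element-wise sigmoid non-linearity applied at every layer."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass of an L-hidden-layer network: repeated affine map
    T(z) = W z + b followed by the element-wise non-linearity sigma."""
    z = x
    for W, b in zip(weights, biases):
        z = sigma(W @ z + b)
    return z

# Toy instantiation: n = 4 inputs, one hidden layer of n1 = 8 nodes, c = 3 classes.
rng = np.random.default_rng(0)
dims = [4, 8, 3]
Ws = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
bs = [np.zeros(dims[l + 1]) for l in range(len(dims) - 1)]
out = forward(rng.standard_normal(4), Ws, bs)
cls = int(np.argmax(out))  # index of the largest output is the predicted class
```

In a classification task, as described above, the arg-max over the c outputs selects the class.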
Let x in Rn denote the input vector. The function generated by an L-layer neural network, f : Rn -> Rc, can be represented as a composition of affine functions and the non-linear function sigma, as follows: f(x) = sigma o T[L+1] o sigma o T[L] o ... o sigma o T[1](x). Here, T[l] : R^{n_{l-1}} -> R^{n_l} denotes the affine mapping applied at layer l, represented by linear transformation W[l] in R^{n_l x n_{l-1}} and translation b[l]. Moreover, sigma : R -> R denotes the non-linear function that is applied element-wise. In this definition, n_l, l = 1, ..., L, denotes the number of hidden nodes in layer l. To make the notation consistent, for l = 0, let n_0 = n, the dimension of the input data, and for l = L + 1, let n_{L+1} = c, the number of classes. We will occasionally use the notations dnn and snn to signify the cases of L > 1 and L = 1, respectively. In a classification task, the index of the highest value in the output tuple determines the optimal class for x.\n\nNon-linear activation function sigma. As discussed later, for the two-layer construction in Section 3, we require some regularity assumptions on the activation function sigma, which are met by typical smooth DNN activation functions such as the sigmoid function. In Section 7, we indicate how to refine the proofs to accommodate the popular and simple ReLU activation function. The proof of the inefficiency of SNNs in Section 4 applies to a very general class of activation functions.\n\n1.3 GMMs and their optimal classification functions\n\nConsider the problem of classifying points generated by a mixture of Gaussian distributions. Assume that there are c classes and samples of each class are drawn from a mixture of Gaussian distributions. Assume that there are overall k different Gaussian distributions to draw from. 
For j in {1, ..., k}, let mu_j in Rn and Sigma_j in R^{n x n} denote the mean and the covariance matrix of Gaussian distribution j. Each Gaussian distribution is assigned uniquely to one of the c classes. Assume that the assignment of the Gaussian clouds to the classes is represented by sets T1, ..., Tc, which form a partition of {1, ..., k}. (That is, Ti intersect Tj = empty, for i != j, and the union of T1, ..., Tc is {1, ..., k}.) Set Ti represents the indices of the Gaussian distributions corresponding to Class i. Let phi_i, i = 1, ..., c, denote the probability that a data point belongs to Class i. Finally, for Class i, let w_j, j in Ti, denote the probability that, within Class i, the data comes from Gaussian distribution j. Hence, for i = 1, ..., c, sum_{j in Ti} w_j = 1. Under the described model, and with a slight abuse of notation, the data is distributed as sum_{i=1}^c phi_i sum_{j in Ti} w_j N(mu_j, Sigma_j). Conditioned on being in Class i, the data points are drawn from a mixture of |Ti| Gaussian distributions as sum_{j in Ti} w_j N(mu_j, Sigma_j). For j = 1, ..., k, let pi_j : Rn -> R denote the probability density function (pdf) of the Gaussian distribution with mean mu_j and covariance matrix Sigma_j.\n\nAn optimal Bayesian classifier^1 C* : Rn -> {1, ..., c} for these c GMMs maximizes the probability of membership across all classes. For Class i, define the i-th discriminant function di : Rn -> R, as\n\ndi(x) := phi_i sum_{j in Ti} w_j gamma_j exp(-g_j(x)),   (1)\n\nwhere g_j(x) := (1/2)(x - mu_j)^T Sigma_j^{-1} (x - mu_j) and gamma_j := (2 pi)^{-n/2} |Sigma_j|^{-1/2}. Using this definition, the optimal classifier C* can be characterized as\n\nC*(x) = arg max_{i in {1,...,c}} di(x).   (2)\n\n2 Connection between classification and approximation\n\nThe main result of this paper is that the discriminant functions described in (1), required for computing the optimal classification function C*(x), can be approximated accurately by a relatively small neural network with two hidden layers, but that accurate approximation with a single hidden layer network is only possible if either the number of the nodes or the magnitudes of the coefficients are exponentially large in n. Before stating our main results, in this section we establish a connection between the accuracy in approximating the discriminant functions of a classifier and the error performance of a classifier that employs these approximations.\n\nGiven a non-negative function d(x), d : Rn -> R, and threshold t > 0, let S_{d,t} denote the superlevel set of d(x) defined as\n\nS_{d,t} := {x in Rn : d(x) >= t}.\n\nDefinition 1 A function d-hat : Rn -> R is a (delta, q)-approximation of a non-negative function d : Rn -> R under a pdf p, if there is a threshold t, such that Pp[S_{d,t}] >= 1 - q, and\n\n|d-hat(x) - d(x)| <= delta d(x),  x in S_{d,t},   (3)\n0 <= d-hat(x) <= (1 + delta) t,  x not in S_{d,t}.   (4)\n\nLet t_{d-hat,delta,q} denote the corresponding threshold. If there are multiple such thresholds, let t_{d-hat,delta,q} denote the infimum of all such thresholds.\n\n^1 Throughout the paper, a Bayesian classifier refers to a classifier that has access to the distribution of the data.\n\nIn this definition, d-hat closely approximates d in a relative sense, wherever d(x) exceeds threshold t. The function d-hat is small (in an absolute sense), when d(x) is small, an event that occurs with low probability under p. 
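As an illustration, the discriminant functions (1) and the classifier (2) can be evaluated directly for given mixture parameters (a numpy sketch with hypothetical parameters; this is a direct evaluation, not the neural-network construction studied later in the paper):

```python
import numpy as np

def discriminants(x, classes):
    """Evaluate d_i(x) = phi_i * sum_{j in T_i} w_j * gamma_j * exp(-g_j(x)), as in (1).

    `classes` is a list of (phi_i, components); components is a list of
    (w_j, mu_j, Sigma_j) triples for the Gaussians assigned to class i."""
    n = x.shape[0]
    d = []
    for phi, comps in classes:
        total = 0.0
        for w, mu, Sigma in comps:
            diff = x - mu
            g = 0.5 * diff @ np.linalg.solve(Sigma, diff)              # g_j(x)
            gamma = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5)
            total += w * gamma * np.exp(-g)
        d.append(phi * total)
    return np.array(d)

def bayes_classifier(x, classes):
    """C*(x) = argmax_i d_i(x), as in (2) (0-indexed here)."""
    return int(np.argmax(discriminants(x, classes)))

# Two equiprobable classes in R^2, one Gaussian each, identity covariances.
I2 = np.eye(2)
classes = [(0.5, [(1.0, np.array([-2.0, 0.0]), I2)]),
           (0.5, [(1.0, np.array([+2.0, 0.0]), I2)])]
```

By symmetry of this toy example, the two discriminants agree on the axis between the means, and each sample is assigned to the nearer Gaussian.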
Although p and d need not be related in this definition, we will typically use it in cases where d is just a scaled version of p.\n\nGiven two equiprobable classes with pdf functions p1 and p2, the optimal Bayesian classifier chooses class 1 if p1(x) > p2(x), and class 2 otherwise. Let e21,opt = P1[p2(x) > p1(x)] denote the probability of incorrectly deciding class 2, when the true distribution is class 1. If we classify using approximate pdfs with relative errors bounded by alpha >= 1, then the probability of error increases to e21,opt[alpha] := P1[p2(x) > p1(x)/alpha]. Under appropriate regularity conditions, e21,opt[alpha] approaches e21,opt, as alpha converges to 1. Lemma 1 below shows that (delta, q)-approximations of p1 and p2 enable us to approach e21,opt, by taking delta and q sufficiently small.\n\nLemma 1 Given pdfs p1 and p2, let d1-hat and d2-hat denote (delta, q)-approximations of discriminant functions d1 = p1 and d2 = p2 under distributions p1 and p2, respectively. Define ti, i = 1, 2, as ti := t_{di-hat,delta,q}. Consider a classifier that declares class 1 when d1-hat(X) > d2-hat(X) and class 2 otherwise. Then, the probability of error of this classifier, under distribution p1, is bounded by\n\ne21 <= e21,opt[(1 + delta)/(1 - delta)] + q + P1(S^c_{d1, (1+delta) t2 / (1-delta)}),\n\nwhere P1(E) measures the probability of event E under p1.\n\nThe proof of Lemma 1 is presented in Section 1 of the supplementary material (SM).\n\nNote that as q converges to zero, both t1 and t2 converge to zero as well. Therefore, letting q converge to zero ensures that P1((1 + delta) t2 >= (1 - delta) d1(x)) also converges to zero. 
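For two unit-variance Gaussians in one dimension, the quantity e21,opt[alpha] has a closed form, which makes the convergence e21,opt[alpha] -> e21,opt as alpha -> 1 easy to check numerically (a hedged one-dimensional illustration; the means and the closed form below are our own example, not taken from the paper):

```python
import math

def Q(t):
    """Gaussian tail probability P[N(0,1) > t]."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def e21_opt(alpha, m=2.0):
    """P1[p2(x) > p1(x)/alpha] for p1 = N(0,1), p2 = N(m,1).
    Here log(p2/p1) = m*x - m^2/2, so the condition is x > m/2 - log(alpha)/m."""
    return Q(m / 2.0 - math.log(alpha) / m)

base = e21_opt(1.0)  # alpha = 1 recovers the optimal error Q(m/2)
```

As alpha decreases toward 1, the decision threshold moves back to m/2 and the error falls to the optimal value, matching the limiting behavior stated above.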
One way to construct a nearly optimal classifier for two distributions is thus to independently build a (delta, q)-approximation for each distribution, and then define the classifier based on the maximum of the two functions.\n\nWith this motivation, in the rest of the paper, we focus on approximating the discriminant functions defined earlier for classifying GMMs. In the next section, we show that using a two hidden-layer neural network, we can construct a (delta, q)-approximation d-hat of the discriminant function of a GMM, see (1), with input dimension n, with O(n) nodes, for any delta > 0 and any q > 0.\n\nIn the subsequent section, we show by contrast that even for the simplest GMM, consisting of a single Gaussian distribution, even a weaker approximation that bounds the expected l2 error cannot be achieved by a single hidden-layer network, unless the network has either exponentially many nodes or exponentially large coefficients. The weaker definition of approximation that we will use in the converse result is the following.\n\nDefinition 2 A function d-hat : Rn -> R is an epsilon-relative l2 approximation for a function d : Rn -> R under pdf p, if\n\nEp[(d-hat(x) - d(x))^2] <= epsilon Ep[(d(x))^2].\n\nThe following lemma shows that if approximation under this weaker notion is impossible, then it is also impossible under the stronger (delta, q) notion.\n\nLemma 2 If d-hat is a (delta, q)-approximation of a distribution d under distribution p, then it is also an epsilon-relative l2 approximation of d, with parameter epsilon = delta^2 + (1 + delta)^2 q / (1 - q).\n\nProof of Lemma 2 is presented in Section 2 of the SM.\n\n3 Sufficiency of two hidden-layer NN with O(n) nodes\n\nWe are interested in approximating the discriminant functions corresponding to optimal classification of GMMs, as defined in Section 1.3. 
In this section, we consider a generic form of the discriminant function (1):\n\nd(x) = sum_{j=1}^J beta_j exp(-g_j(x)),   (5)\n\nwhere g_j(x) = (1/2)(x - mu_j)^T Sigma_j^{-1} (x - mu_j) and beta_j = phi w_j gamma_j, with fixed prior phi, conditional probabilities w_j, and gamma_j = (2 pi)^{-n/2} |Sigma_j|^{-1/2}.\n\nWe first observe that the function g_j(x) is a general quadratic form in Rn and thus consists of the sum of O(n^2) product terms of the form x_i x_j. Since each such product term can be approximated via four sigma functions (see [15]), g_j(x) can be approximated arbitrarily well using O(n^2) nodes. We can however reduce the number of nodes further by applying the affine transformation y_j = Sigma_j^{-1/2} (x - mu_j) in the first layer, so that g_j(x) is simply ||y_j||^2 = sum_{i=1}^n y_{j,i}^2, i.e., it consists of n quadratic terms, and can in turn be approximated via O(n) nodes. This O(n^2) to O(n) reduction in the number of nodes is specific to quadratic polynomials, which are the generic exponents of GMMs and their discriminant functions, and we will take advantage of this reduction in our proofs.\n\nIn order to prove the main result of this section, we rely on the following regularity assumptions on the activation function sigma(x). 
All of the assumptions are satisfied by the sigmoid function sigma(x) = 1/(1 + e^{-x}), for example.\n\nAssumption 1 (Curvature) There is a point tau in R, and parameters r and M, such that sigma''(tau) > 0, and such that sigma'''(x) exists and is bounded, |sigma'''(x)| <= M, in the neighborhood tau - r <= x <= tau + r.\n\nAssumption 2 (Monotonicity) The symmetric function sigma(x + tau) + sigma(-x + tau) is monotonically increasing for x >= 0, with tau as defined in Assumption 1.\n\nAssumption 3 (Exponential Decay) There is eta > 0 such that |sigma(x)| <= exp(eta x) and |1 - sigma(x)| <= exp(-eta x).\n\nAssumptions 1 and 2 can be used to construct an approximation of x^2 using O(1) nodes for each such term. They are satisfied, for example, by common activation functions such as the sigmoid and tanh functions. Assumption 3 is relied upon to construct an approximation of exp(x) in the second hidden layer, with O(n) nodes in each of the J subnetworks. This assumption is met by the indicator function u(x) = 1{x > 0}, and any number of activation functions that are smoother versions of u(x), including piecewise linear approximations of u(x) constructed with ReLU, and the sigmoid function.\n\nThe following is our main positive result about the ability to efficiently approximate GMM discriminant functions with two-layer neural networks.\n\nTheorem 1 Consider a GMM with discriminant function d : Rn -> R+ of the form (5), consisting of Gaussian pdfs with bounded covariance matrices. Let the activation function sigma : R -> R satisfy Assumptions 1, 2, and 3. 
Then for any given delta > 0 and any q in (0, 1), there exists a two-hidden-layer neural network, consisting of M = O(n) instances of the activation function sigma and weights growing as O(n^5), such that its output function d-hat is a (delta, q)-approximation of d, under the distribution of the GMM.\n\nThe detailed proof of Theorem 1 is presented in Section 5 of the SM. The proof relies on several lemmas that are stated and proved in Section 4 of the SM.\n\nRemark 1 Applying Theorem 1 to a collection of c GMMs gives rise to a DNN with O(n) nodes that approximates the optimal classifier of these GMMs via (2).\n\nRemark 2 The construction of an O(n)-node approximation of the GMM discriminant function assumes that the eigenvalues of the covariance matrices are bounded from above and also bounded away from zero, by constants independent of n.\n\nTo prove Theorem 1, given d(x) = sum_{j=1}^J beta_j exp(-g_j(x)), we build a neural net consisting of J sub-networks, with sub-network j approximating beta_j exp(-g_j(x)). For convenience, denote by c_j(x) = beta_j exp(-g_j(x)) the desired output of the j-th subnetwork. The J subnetwork function approximations c_j-hat(x), j = 1, ..., J, are summed up to get the final output d-hat(x) = sum_j c_j-hat(x).\n\nLemma 3 Given delta > 0, q > 0, and the GMM discriminant function d(x) = sum_{j=1}^J c_j(x), let t* be such that P[S_{d,t*}] >= 1 - q under pdf p(x) = d(x)/phi. Define lambda = (t* delta)/(2J(1 + delta)), and for each j, suppose we have an approximation function c_j-hat of c_j such that\n\n|c_j-hat(x) - c_j(x)| <= (delta/2) c_j(x),  if c_j(x) >= lambda,\n0 <= c_j-hat(x) <= lambda (1 + delta),  otherwise.\n\nThen d-hat(x) = sum_j c_j-hat(x) is a (delta, q)-approximation of d(x) under p(.).\n\nThe proof is presented in Section 3 of the SM. 
This lemma establishes a sufficient standard of accuracy that we will need for the subnetwork associated with each Gaussian component. In particular, there is a level lambda such that we need to have relative error better than delta/2 when the component function is greater than lambda. Where the component function is smaller than lambda, we require only an upper bound on the approximation function. The critical level lambda is proportional to t*, which is a level achieved with high probability by the overall discriminant function d. The scaling of the level t* with n is an important part of the proof, analyzed later in Lemma 5 of the SM.\n\n4 Exponential size of SNNs for approximating the GMM discriminant functions\n\nIn the previous section, we showed that a DNN with two hidden layers and O(n) hidden nodes is able to approximate the discriminant functions corresponding to an optimal Bayesian classifier for a collection of GMMs. In this section, we prove a converse result for SNNs. More precisely, we prove that for an SNN to approximate the discriminant function of even a single Gaussian distribution, the number of nodes needs to grow exponentially with n.\n\nConsider a neural network with a single hidden layer consisting of n1 nodes. As before, let sigma : R -> R denote the non-linear function applied by each hidden node. For i = 1, ..., n1, let w_i in Rn and b_i in R denote the weight vector and the bias corresponding to node i, respectively. The function generated by this network can be written as\n\nf(x) = sum_{i=1}^{n1} a_i sigma(<w_i, x> + b_i) + a_0.   (6)\n\nSuppose that x in Rn is distributed as N(0n, sx In), with pdf mu : Rn -> R. 
Suppose that the function to be approximated is\n\nmu_c(x) := ((sf + 2 sx)/sf)^{n/4} e^{-||x||^2/(2 sf)},   (7)\n\nwhich has the form of a symmetric zero-mean Gaussian distribution with variance sf in each direction, and has been normalized so that E[mu_c^2(x)] = 1. Our goal is to show that unless the number of nodes n1 is exponentially large in the input dimension n, the network cannot approximate the function mu_c defined in (7) in the sense of Definition 2.\n\nOur result applies to very general activation functions, and allows a different activation function in every node; essentially all we require is that i) the response of each activation function depends on its input x through a scalar product <w_i, x>, and ii) the output of each hidden node is square-integrable with respect to the Gaussian distribution mu(x). Incorporating the constant term into one of the activation functions, we consider a more general model\n\nf(x) = sum_{i=1}^{n1} a_i h_i(<w_i, x>)   (8)\n\nfor a set of functions h_i : R -> R. To avoid scale ambiguities in the definition of the coefficients a_i and the functions h_i, we scale h_i as necessary so that, for i = 1, ..., n1, ||w_i|| = 1 and E[(h_i(<w_i, x>))^2] = 1.\n\nOur main result in this section shows the connection between the number of nodes (n1) and the achievable approximation error E[|mu_c(x) - f(x)|^2]. We focus on approximation under Definition 2, which, as shown in Lemma 2, is weaker than the notion used in Theorem 1. Therefore, proving the lower bound under this notion automatically proves the same bound under the stronger notion too.\n\nTheorem 2 Consider mu_c : Rn -> R and f : Rn -> R defined in (7) and (8), respectively, for some sf > 0. 
Suppose that the random vector x ~ N(0n, sx In), where sx > 0. For i = 1, ..., n1, assume that ||w_i|| = 1, and E[(h_i(<w_i, x>))^2] = 1, for activation function h_i : R -> R. Then,\n\nE[|mu_c(x) - f(x)|^2] >= 1 - 2 sqrt(n1) ||a|| (1 + sx/sf)^{1/4} rho^{-n/4},   (9)\n\nwhere\n\nrho := 1 + sx^2 / (sf^2 + 2 sx sf) > 1.   (10)\n\nThe proof of Theorem 2 is presented in Section 6 of the SM. This result shows that if we want to form an epsilon-relative l2 approximation of mu_c, in the sense of Definition 2, with an SNN, then n1 must satisfy n1 >= ((1 - epsilon) / (2 A (1 + sx/sf)^{1/4})) rho^{n/4}, where A = (1/sqrt(n1)) ||a|| denotes the root mean-squared value of a. That is, the number of nodes needs to grow exponentially with n, unless the norm of the final-layer coefficient vector ||a|| grows exponentially in n as well. Note that in the natural case sf = sx, where the discriminant function to be approximated matches the distribution of the input data, the required exponential rate of growth is rho^{n/4} = (4/3)^{n/4}.\n\nRemark 3 The generalized model (8) covers a large class of activation functions. It is straightforward to confirm that the required conditions are satisfied by bounded activation functions, such as the sigmoid function or the tanh function, with arbitrary bias values. For the popular ReLU function, h_i(<w_i, x>) = max(<w_i, x> + b_i, 0). Therefore, E[|h_i(<w_i, x>)|^2] <= E[(<w_i, x> + b_i)^2] = sx + b_i^2, which again confirms the desired square-integrability property.\n\nRemark 4 From the point of view of numerical stability, it is natural to require the norm of the final-layer coefficients, ||a||, to be bounded, as the following simple argument shows. 
Suppose that the network implementation can compute each activation function h_i(x) exactly, but that the implementation represents each coefficient a_i in a floating point format with a finite precision. To gain intuition on the effect of this quantization noise, consider the following model. The implementation replaces a_i with a_i + z_i, where E[z_i] = 0 and E[z_i^2] = nu |a_i|^2. Further assume that z_1, ..., z_{n1} are independent of each other and of x. In this model, nu reflects the level of precision in the representation. Then, the error due to quantization can be written as\n\nE[sum_i z_i^2 (h_i(<w_i, x>))^2] = sum_i nu |a_i|^2 E[(h_i(<w_i, x>))^2] = nu ||a||^2.   (11)\n\nIn such an implementation, in order to keep the quantization error significantly below the targeted overall error epsilon, we need to have ||a|| << sqrt(epsilon/nu). Unless the magnitudes of the weights used in the output layer are bounded in this way, accurate computation is not achievable in practice.\n\n5 Sufficiency of exponentially many nodes\n\nIn Section 4, we studied the ability of an SNN to approximate the function mu_c defined in (7) and showed that such a network, if the weights are not allowed to grow exponentially with n, requires exponentially many nodes to make the error small. Clearly, Theorem 2 is a converse result, which implies that the number of nodes n1 should grow with n, at least as rho^{n/4} (rho > 1). The next natural question is the following: Would exponentially many nodes actually suffice in order to approximate the function mu_c? In this section, we answer this question affirmatively and show a simple construction with random weights that, given enough nodes, approximates the function mu_c defined in (7) within the desired accuracy. 
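The flavor of such a random-weight construction can be illustrated with random cosine features: for w ~ N(0, I/sf), the Gaussian characteristic function gives E_w[cos(<w, x>)] = exp(-||x||^2/(2 sf)), so an empirical average of many cosine features recovers the Gaussian shape of mu_c in (7). Below is a hedged Monte-Carlo sketch of this idea; note that Theorem 3 itself uses a specific deterministic set of unit-norm weights with sigma(x) = cos(x/sqrt(sf)), while the i.i.d.-weight variant here is only illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sf, sx = 4, 1.0, 1.0
n1 = 200_000                                     # number of hidden nodes
alpha = (sf + 2 * sx) / sf

W = rng.standard_normal((n1, n)) / np.sqrt(sf)   # rows w_i ~ N(0, I/sf)

def f(x):
    """Random-feature SNN: average of cosine activations, scaled by alpha^{n/4}."""
    return alpha ** (n / 4) * np.mean(np.cos(W @ x))

def mu_c(x):
    """Target function (7): alpha^{n/4} * exp(-||x||^2 / (2 sf))."""
    return alpha ** (n / 4) * np.exp(-(x @ x) / (2 * sf))

x = rng.standard_normal(n) * np.sqrt(sx)
# f(x) concentrates around mu_c(x) as n1 grows, with fluctuation O(1/sqrt(n1))
```

The empirical average at a fixed point deviates from mu_c by roughly alpha^{n/4}/sqrt(n1), consistent with the n1 > (1/epsilon) alpha^{n/2} requirement of Theorem 3.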
Recall that mu_c(x) = alpha^{n/4} exp(-(1/(2 sf)) ||x||^2), where alpha := (sf + 2 sx)/sf. Consider the output function of a single-hidden-layer neural network with all biases set to zero. The function generated by such a network can be written as\n\nf(x) = sum_{i=1}^{n1} a_i sigma(<w_i, x>).   (12)\n\nAs before, here sigma : R -> R denotes the non-linear function, and w_i in Rn, ||w_i|| = 1, denotes the weights used by hidden node i. To show sufficiency of exponentially many nodes, we consider the special non-linear function sigma(x) = cos(x/sqrt(sf)).\n\nTheorem 3 Consider the function mu_c : Rn -> R, defined in (7), and an n-dimensional random vector x ~ N(0n, sx In). Consider the function f : Rn -> R defined in (12). Let sigma(x) = cos(x/sqrt(sf)), and, for i = 1, ..., n1, a_i = alpha^{n/4}/n1, where alpha = 1 + 2 sx/sf. Given epsilon > 0, assume that n1 > (1/epsilon) alpha^{n/2}. Then, there exist weights w_1, ..., w_{n1} such that E_x[(f(x) - mu_c(x))^2] <= epsilon.\n\nThe proof of Theorem 3 is presented in Section 7 of the SM. To better understand the implications of Theorem 3 and how it compares against Theorem 2, define m1 = rho = 1 + (sx/(2 sf))(1 - sf/(sf + 2 sx)) and m2 = alpha^2 = 1 + (4 sx/sf)(1 + sx/sf), where rho is defined in (10). It is straightforward to see that 1 < m1 < m2, for all positive values of (sx, sf). Theorems 2 and 3 show that there exist constants c1 and c2, such that if the number of hidden nodes in a single-hidden-layer network (n1) is smaller than c1 m1^{n/4}, the expected error in approximating the function mu_c(x) must get arbitrarily close to one. On the other hand, if n1 is larger than c2 m2^{n/4}, then there exists a set of weights such that the error can be made arbitrarily close to zero. 
In other words, it seems that there is a phase transition in the exponential growth rate of the number of nodes, below which the function cannot be approximated with a single hidden layer. Characterizing that phase transition is an interesting open question, which we leave to future work.

6 Related work

There is a rich and well-developed literature on the complexity of Boolean circuits and the important role that depth plays in them. However, since it is not clear to what extent such results on Boolean circuits have consequences for DNNs, we do not summarize this literature; the interested reader may wish to start with [16]. A key notion for us is that of depth, that is to say, the number of (hidden) layers of nodes in a neural network, as defined in Section 1.2. We are interested in knowing to what extent, if any, depth reduces the complexity of the neural network needed to express or approximate functions of interest in classification. It is not the complexity of the function that we want to approximate that matters: the UAT already tells us that regular functions, which include the discriminant functions we discuss in Section 1.3, can be approximated by SNNs (shallow neural networks). Rather, it is the complexity of the NNs, as measured by the number of nodes needed for the approximation, that is of interest to us.
In this respect, the works [17, 18, 19] contain approximation results for neural structures for certain polynomials and tensor functions, in the spirit of what we are looking for; but, as with Boolean circuits, these models deviate substantially from the standard DNN models we consider here, namely those that represent the neural networks that have worked well in practice and whose behavior we wish to understand at a fundamental level.

Remarkably, there is a small collection of recent results which, as in this paper, show that adding a single layer to an SNN exponentially reduces the number of nodes needed to approximate some special functions: see [13, 11, 12, 20] for approximation of high-degree polynomials, a certain radial function, special functions of the inner products of high-dimensional vectors, and saw-tooth functions, respectively. Our work is in the same spirit as these, showing the power of two hidden layers in reducing the complexity of DNNs; it is a continuation and generalization of this set of results and is especially informed by [11]. For a specialized radial function in R^n, [11] shows that while any SNN would require at least exponentially many nodes to approximate the function, there exists a DNN with two hidden layers and O(n^{19/4}) nodes that well approximates the same function. In the present work, for a general class of widely used functions, viz. GMM discriminant functions, we show that while SNNs require at least exponentially many nodes, for any GMM discriminant function there exists a DNN with two hidden layers and only O(n) nodes that approximates it.

7 Remarks and conclusion

It is worth noting that even though we used a variety of sufficient regularity assumptions on the non-linear function σ to prove Theorem 1, these assumptions are not necessary to construct an efficient two-layer network.
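One such relaxed construction, discussed next, assembles each squaring unit out of ReLU nodes. As a minimal numerical sketch (with a uniform breakpoint grid of our own choosing, standing in for the construction analyzed in Lemma 1 of the SM), a piecewise-linear interpolant of x² on [−R, R] with K pieces needs one ReLU per interior breakpoint plus an affine part, and achieves maximum error (2R/K)²/4, i.e., on the order of R/√ν units for accuracy ν:

```python
import numpy as np

R, K = 4.0, 80                      # range [-R, R] and number of linear pieces
t = np.linspace(-R, R, K + 1)       # breakpoints of the (assumed) uniform grid
slopes = t[:-1] + t[1:]             # exact chord slopes of x^2 on each piece

def relu(u):
    return np.maximum(u, 0.0)

def square_supernode(x):
    """Piecewise-linear interpolant of x^2 on [-R, R], written as an affine
    part plus one ReLU unit per interior breakpoint (a 'super-node')."""
    y = t[0] ** 2 + slopes[0] * (x - t[0])                      # first segment
    y += np.sum((slopes[1:] - slopes[:-1])[:, None]
                * relu(x - t[1:-1][:, None]), axis=0)           # slope changes
    return y

xs = np.linspace(-R, R, 10_001)
max_err = np.max(np.abs(square_supernode(xs) - xs ** 2))
h = 2 * R / K
print(max_err, h ** 2 / 4)          # worst-case error meets the h^2/4 interpolation bound
```

Since x² has constant second derivative 2, linear interpolation on a grid of width h errs by at most h²/4 (attained at segment midpoints), which is what the check above reports; solving h²/4 = ν gives the O(R/√ν) node count quoted below.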
For example, to construct a network using the commonly used Rectified Linear Unit (ReLU) activation, in the first layer we can form n super-nodes, each of which has a piecewise-linear response h_i(x) that approximates x² with the accuracy specified in Lemma 1 of the SM. The number of basic nodes needed in each super-node in this construction is 2R/√ν, where R and ν denote the range and the accuracy for approximating x² in layer one, respectively. The analysis of R and ν in Lemmas 3 and 5 of the SM shows that R is O(√n) and ν is O(1/n), so that the number of nodes needed per super-node in the first layer is now O(n), compared to O(1) in the construction presented in Section 3. Since there are n such nodes, the total number of basic nodes in the network becomes O(n²), still an exponential reduction compared with a single-layer network.

References

[1] G. Cybenko. Approximations by superpositions of a sigmoidal function. Math. of Cont., Sig. and Sys., 2:183-192, 1989.

[2] K. I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neu. Net., 2(3):183-192, 1989.

[3] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neu. Net., 2(5):359-366, 1989.

[4] A. R. Barron. Approximation and estimation bounds for artificial neural networks. Mach. Lear., 14(1):115-133, 1994.

[5] J. L. Gauvain and C. H. Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. on Speech and Aud. Proc., 2(2):291-298, 1994.

[6] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Dig. Sig. Proc., 10(1-3):19-41, 2000.

[7] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain.
IEEE Trans. on Ima. Proc., 12(11):1338-1351, 2003.

[8] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In Proc. of the 17th Int. Conf. on Pat. Rec., volume 2, pages 28-31. IEEE, 2004.

[9] N. Indurkhya and F. J. Damerau. Handbook of Natural Language Processing, volume 2. CRC Press, 2010.

[10] E. Abbe and C. Sandon. Provable limitations of deep learning. arXiv preprint arXiv:1812.06369, 2018.

[11] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conf. on Lear. Theory, pages 907-940, 2016.

[12] A. Daniely. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.

[13] D. Rolnick and M. Tegmark. The power of deeper networks for expressing natural functions. In Int. Conf. on Lear. Rep., 2018.

[14] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proc. of ACM SIGKDD Int. Conf. on Know. Dis. and Data Min., pages 245-250. ACM, 2001.

[15] H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so well? J. of Stat. Phy., 168(6):1223-1247, 2017.

[16] A. Shpilka and A. Yehudayoff. Arithmetic Circuits: A Survey of Recent Results and Open Questions. Now, 2010.

[17] O. Delalleau and Y. Bengio. Shallow vs. deep sum-product networks. In Adv. in Neu. Inf. Proc. Sys., pages 666-674, 2011.

[18] J. Martens and V. Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.

[19] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. arXiv preprint arXiv:1509.05009, 2015.

[20] M. Telgarsky. Representation benefits of deep feedforward networks.
arXiv preprint arXiv:1509.08101, 2015.