{"title": "Learning Distributions Generated by One-Layer ReLU Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8107, "page_last": 8117, "abstract": "We consider the problem of estimating the parameters of a $d$-dimensional rectified Gaussian distribution from i.i.d. samples. A rectified Gaussian distribution is defined by passing a standard Gaussian distribution through a one-layer ReLU neural network. We give a simple algorithm to estimate the parameters (i.e., the weight matrix and bias vector of the ReLU neural network) up to an error $\\epsilon\\|W\\|_F$ using $\\widetilde{O}(1/\\epsilon^2)$ samples and $\\widetilde{O}(d^2/\\epsilon^2)$ time (log factors are ignored for simplicity). This implies that we can estimate the distribution up to $\\epsilon$ in total variation distance using $\\widetilde{O}(\\kappa^2 d^2/\\epsilon^2)$ samples, where $\\kappa$ is the condition number of the covariance matrix. Our only assumption is that the bias vector is non-negative. Without this non-negativity assumption, we show that estimating the bias vector within any error requires a number of samples at least exponential in the infinity norm of the bias vector. Our algorithm is based on the key observation that vector norms and pairwise angles can be estimated separately. We use a recent result on learning from truncated samples. We also prove two sample complexity lower bounds: $\\Omega(1/\\epsilon^2)$ samples are required to estimate the parameters up to error $\\epsilon$, while $\\Omega(d/\\epsilon^2)$ samples are necessary to estimate the distribution up to $\\epsilon$ in total variation distance. The first lower bound implies that our algorithm is optimal for parameter estimation. Finally, we show an interesting connection between learning a two-layer generative model and non-negative matrix factorization.
Experimental results are provided to support our analysis.", "full_text": "Learning Distributions Generated by One-Layer ReLU Networks

Shanshan Wu, Alexandros G. Dimakis, Sujay Sanghavi
Department of Electrical and Computer Engineering
University of Texas at Austin
shanshan@utexas.edu, dimakis@austin.utexas.edu, sanghavi@mail.utexas.edu

Abstract

We consider the problem of estimating the parameters of a d-dimensional rectified Gaussian distribution from i.i.d. samples. A rectified Gaussian distribution is defined by passing a standard Gaussian distribution through a one-layer ReLU neural network. We give a simple algorithm to estimate the parameters (i.e., the weight matrix and bias vector of the ReLU neural network) up to an error ε‖W‖_F using Õ(1/ε²) samples and Õ(d²/ε²) time (log factors are ignored for simplicity). This implies that we can estimate the distribution up to ε in total variation distance using Õ(κ²d²/ε²) samples, where κ is the condition number of the covariance matrix. Our only assumption is that the bias vector is non-negative. Without this non-negativity assumption, we show that estimating the bias vector within any error requires a number of samples at least exponential in the infinity norm of the bias vector. Our algorithm is based on the key observation that vector norms and pairwise angles can be estimated separately. We use a recent result on learning from truncated samples. We also prove two sample complexity lower bounds: Ω(1/ε²) samples are required to estimate the parameters up to error ε, while Ω(d/ε²) samples are necessary to estimate the distribution up to ε in total variation distance. The first lower bound implies that our algorithm is optimal for parameter estimation.
Finally, we show an interesting connection between learning a two-layer generative model and non-negative matrix factorization. Experimental results are provided to support our analysis.

1 Introduction

Estimating a high-dimensional distribution from observed samples is a fundamental problem in machine learning and statistics. A popular recent generative approach is to model complex distributions by passing a simple distribution (typically a standard Gaussian) through a neural network. Parameters of the neural network are then learned from data. Generative Adversarial Networks (GANs) [GPAM+14] and Variational Auto-Encoders (VAEs) [KW13] are built on this method of modeling high-dimensional distributions.

Current methods for learning such deep generative models do not have provable guarantees or sample complexity bounds. In this paper we obtain the first such results for a single-layer ReLU generative model. Specifically, we study the following problem: Assume that the latent variable z is drawn from a standard Gaussian, which then drives the generation of samples through a one-layer ReLU-activated neural network with weights W and bias b. We observe the output samples (but not the latent variable realizations z) and we would like to provably learn the parameters W and b. More formally:

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Definition 1. Let W ∈ R^{d×k} be the weight matrix, and b ∈ R^d be the bias vector. We define D(W, b) as the distribution^1 of the random variable x ∈ R^d generated as follows:

    x = ReLU(W z + b), where z ∼ N(0, I_k).    (1)

Here z is a standard Gaussian random variable in R^k, and I_k is the k-by-k identity matrix. Given n samples x_1, x_2, ..., x_n from some D(W, b) with unknown parameters W and b, the goal is to estimate W and b from the given samples.
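The generative process in Definition 1 is straightforward to simulate; a minimal numpy sketch (the function name is ours, not from the paper's code release):

```python
import numpy as np

def sample_rectified_gaussian(W, b, n, rng=None):
    """Draw n i.i.d. samples of x = ReLU(W z + b), z ~ N(0, I_k)."""
    rng = np.random.default_rng(rng)
    d, k = W.shape
    Z = rng.standard_normal((n, k))      # latent Gaussians, one row per sample
    X = np.maximum(Z @ W.T + b, 0.0)     # ReLU applied coordinate-wise
    return X
```

The estimation problem studied in the paper is the inverse of this map: given only rows of X, recover (W W^T, b).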
Since the ReLU operation is not invertible^2, estimating W and b via maximum likelihood is often intractable. In this paper, we make the following contributions:

• We provide a simple and novel algorithm to estimate the parameters of D(W, b) from i.i.d. samples, under the assumption that b is non-negative. Our algorithm (Algorithm 1) takes two steps. In Step 1, we estimate b and the row norms of W using a recent result on estimation from truncated samples (Algorithm 2). In Step 2, we estimate the angles between any two row vectors of W using a simple geometric result (Fact 1).

• We prove that the proposed algorithm needs Õ(1/ε²) samples and Õ(d²/ε²) time in order to estimate the parameter W W^T (resp. b) within an error ε‖W‖_F² (resp. ε‖W‖_F) (Theorem 1). This implies that (for the non-degenerate case) the total variation distance between the learned distribution and the ground truth is within ε given Õ(κ²d²/ε²) samples, where κ is the condition number of W W^T (Corollary 1).

• Without the non-negativity assumption on b, we show that estimating the parameters of D(W, b) within any error requires Ω(exp(‖b‖_∞²)) samples (Claim 2). Even when the bias vector b has negative components, our algorithm can still be used to recover part of the parameters with a small number of samples (Section H.1).

• We prove two lower bounds on the sample complexity. The first lower bound (Theorem 2) says that Ω(1/ε²) samples are required in order to estimate b up to error ε‖W‖_F, which implies that our algorithm is optimal in estimating the parameters.
The second lower bound (Theorem 3) says that Ω(d/ε²) samples are required to estimate the distribution up to total variation distance ε.

• We empirically evaluate our algorithm in terms of its dependence on the number of samples, the dimension, and the condition number (Figure 1). The empirical results are consistent with our analysis.

• We provide a new algorithm to estimate the parameters of a two-layer generative model (Algorithm 4 in Appendix H). Our algorithm uses ideas from non-negative matrix factorization (Claim 3).

Notation. We use capital letters to denote matrices and lower-case letters to denote vectors. We use [n] to denote the set {1, 2, ..., n}. For a vector x ∈ R^d, we use x(i) to denote its i-th coordinate. The ℓ_p norm of a vector is defined as ‖x‖_p = (Σ_i |x(i)|^p)^{1/p}. For a matrix W ∈ R^{d×k}, we use W(i, j) to denote its (i, j)-th entry. We use W(i, :) ∈ R^k and W(:, j) ∈ R^d to denote the i-th row and the j-th column. The dot product between two vectors is ⟨x, y⟩ = Σ_i x(i)y(i). For any a ∈ R, we use R_{>a} to denote the set R_{>a} := {x ∈ R : x > a}. We use I_k ∈ R^{k×k} to denote an identity matrix.

2 Related Work

We briefly review the relevant work and highlight the differences compared to our paper.

Estimation from truncated samples. Given a d-dimensional distribution D and a subset S ⊆ R^d, truncation means that we can only observe a sample from D if it falls in S. Samples falling outside S (and their counts in proportion) are not revealed. Estimating the parameters of a multivariate normal distribution from truncated samples is a fundamental problem in statistics, and a breakthrough on this problem was achieved recently [DGTZ18]. This is different from our problem because our samples are formed by projecting the samples of a multivariate normal distribution onto the positive orthant instead of truncating to the positive orthant. Nevertheless, a single coordinate of D(W, b) can be viewed as a truncated univariate normal distribution (Definition 2). We use this observation and leverage the recent results of [DGTZ18] to estimate b and the row norms of W (Section 4.2).

Learning ReLU neural networks. A recent series of work, e.g., [GMOV19, GKLW19, GKM18, LY17, ZSJ+17, Sol17], considers the problem of estimating the parameters of a ReLU neural network given samples of the form {(x_i, y_i)}_{i=1}^n. Here (x_i, y_i) represents the input features and the output target, e.g., y_i = ReLU(W x_i + b). This is a supervised learning problem, and hence is different from our unsupervised density estimation problem.

Learning neural network-based generative models. Many approaches have been proposed to train a neural network to model complex distributions. Examples include GAN [GPAM+14] and its variants (e.g., WGAN [ACB17], DCGAN [RMC15], etc.), VAE [KW13], autoregressive models [OKK16], and reversible generative models [GCB+18]. All of these methods lack theoretical guarantees and explicit sample complexity bounds. A recent work [NWH18] proves that training an autoencoder via gradient descent can recover a linear generative model. This is different from our setting, where we focus on non-linear generative models. Mazumdar and Rawat [MR19] also consider the problem of learning from one-layer ReLU generative models. Their modeling assumption is different from ours.

^1 It is also called a rectified Gaussian distribution, and can be used in non-negative factor analysis [HK07].
^2 If the activation function σ (e.g., sigmoid, leaky ReLU, etc.) is invertible, then σ^{-1}(X) ∼ N(b, W W^T). In that case the problem becomes learning a Gaussian from samples.
They assume that the bias vector b is a random variable whose distribution satisfies certain conditions. Besides, there is no distributional assumption on the hidden variable z. By contrast, in our model, both W and b are deterministic and unknown parameters. The randomness comes only from z, which is assumed to follow a standard Gaussian distribution.

3 Identifiability

Our first question is whether W is identifiable from the distribution D(W, b). Claim 1 below implies that only W W^T can possibly be identified from D(W, b).

Claim 1. For any matrices W_1, W_2 satisfying W_1 W_1^T = W_2 W_2^T, and any vector b, D(W_1, b) = D(W_2, b).

Proof. Since W_1 W_1^T = W_2 W_2^T, there exists a unitary matrix Q ∈ R^{k×k} that satisfies W_2 = W_1 Q. Since z ∼ N(0, I_k), we have Qz ∼ N(0, I_k). The claim then follows.

Identifying the bias vector b from D(W, b) can be impossible in some cases. For example, if W is a zero matrix, then any negative coordinate of b cannot be identified, since it will be reset to zero by the ReLU operation. For the cases when b is identifiable, our next claim provides a lower bound on the sample complexity required to estimate the bias vector to within an additive error ε.

Claim 2. For any value δ > 0, there exist one-dimensional distributions D(1, b_1) and D(1, b_2) such that: (a) |b_1 − b_2| = δ; (b) at least Ω(exp(b_1²/2)) samples are required to distinguish them.

Proof. Let b_1 < 0 and b_2 = b_1 − δ. It is easy to check that (a) holds. To show (b), note that the probability of observing a positive (i.e., nonzero) sample from D(1, b_1) is upper bounded by P[ReLU(z − |b_1|) > 0] = P[z > |b_1|] ≤ exp(−b_1²/2), where the last step follows from the standard Gaussian tail bound [Wai19]. The same bound holds for D(1, b_2).
To distinguish D(1, b_1) and D(1, b_2), we need to observe at least one nonzero sample, which requires Ω(exp(b_1²/2)) samples.

Claim 2 indicates that in order to estimate the parameters within any error, the sample complexity must scale at least exponentially in ‖b‖_∞². This is true if b is allowed to take negative values. Intuitively, if b has large negative values, then most of the samples will be zeros. To avoid this exponential dependence, we now assume that the bias vector is non-negative. In Section 4, we give an algorithm that provably learns the parameters of D(W, b) with a sample complexity that is polynomial in 1/ε and does not depend on the values of b. In Section H.1, we show that even when the bias vector has negative coordinates, our algorithm can still recover part of the parameters with a small number of samples.

4 Algorithm

In this section, we describe a novel algorithm to estimate W W^T ∈ R^{d×d} and b ∈ R^d from i.i.d. samples of D(W, b). Our goal is to estimate W W^T instead of W, since W is not identifiable (Claim 1). Our only assumption is that the true b is non-negative. As discussed in Claim 2, this assumption avoids an exponential dependence on the values of b. Note that our algorithm does not need to know the dimension k of the latent variable z. Omitted proofs can be found in the appendix.

4.1 Intuition

Let W(i, :) ∈ R^k be the i-th row (i ∈ [d]) of W. For any i < j ∈ [d], the (i, j)-th entry of W W^T is

    ⟨W(i, :), W(j, :)⟩ = ‖W(i, :)‖_2 ‖W(j, :)‖_2 cos(θ_ij),    (2)

where θ_ij is the angle between vectors W(i, :) and W(j, :).
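Per Eq. (2), each off-diagonal entry of W W^T factors into two row norms and a cosine. The angle can be read off from joint threshold-exceedance frequencies (anticipating Fact 1 in Section 4.3); a minimal numpy sketch, with function and argument names of our own choosing:

```python
import numpy as np

def estimate_angle(x_i, x_j, b_i, b_j):
    """Estimate the angle between rows W(i,:) and W(j,:) from samples of
    coordinates i and j, using P[x(i)>b(i), x(j)>b(j)] = (pi - theta)/(2*pi)."""
    p = np.mean((x_i > b_i) & (x_j > b_j))   # empirical joint exceedance frequency
    return np.pi - 2.0 * np.pi * p
```

For example, with W = I_2 and b = 0 the two coordinates are independent, the joint exceedance probability is 1/4, and the estimate concentrates around π/2, the true angle between the rows.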
Our key idea is to estimate the norms ‖W(i, :)‖_2, ‖W(j, :)‖_2 and the angles θ_ij separately, as shown in Algorithm 1.

Estimating the row norms^3 ‖W(i, :)‖_2 as well as the i-th coordinate b(i) ∈ R of the bias vector can be done by looking only at the i-th coordinate of the given samples. The idea is to view the problem as estimating the parameters of a univariate normal distribution from truncated samples^4. This part of the algorithm is described in Section 4.2. To estimate θ_ij ∈ [0, π) for every i < j ∈ [d], we use a simple fact that the angle between any two vectors can be estimated from their inner products with a random Gaussian vector. Details of this part can be found in Section 4.3.

Algorithm 1: Learning a single-layer ReLU generative model
Input: n i.i.d. samples x_1, ..., x_n ∈ R^d from D(W*, b*), where b* is non-negative.
Output: Σ̂ ∈ R^{d×d}, b̂ ∈ R^d.
1  for i ← 1 to d do
2      S ← {x_m(i), m ∈ [n] : x_m(i) > 0};
3      b̂(i), Σ̂(i, i) ← NormBiasEst(S);
4      b̂(i) ← max(0, b̂(i));
5  end
6  for i < j ∈ [d] do
7      θ̂_ij ← π − (2π/n) Σ_{m=1}^{n} 1(x_m(i) > b̂(i)) 1(x_m(j) > b̂(j));
8      Σ̂(i, j) ← sqrt(Σ̂(i, i) Σ̂(j, j)) · cos(θ̂_ij);
9      Σ̂(j, i) ← Σ̂(i, j);
10 end

4.2 Estimate ‖W(i, :)‖_2 and b(i)

Without loss of generality, we fix i = 1 and describe how to estimate ‖W(1, :)‖_2 ∈ R and b(1) ∈ R by looking at the first coordinate of the given samples. The starting point of our algorithm is the following observation.
Suppose x ∼ D(W, b). Its first coordinate can be written as

    x(1) = ReLU(W(1, :)^T z + b(1)) = ReLU(y), where y ∼ N(b(1), ‖W(1, :)‖_2²).    (3)

Because of the ReLU operation, we can only observe the samples of y when it is positive. Given samples of x(1) ∈ R, let us keep the samples that have positive values (i.e., ignore the zero samples). Now the problem of estimating b(1) and ‖W(1, :)‖_2 is equivalent to estimating the parameters of a one-dimensional normal distribution using samples falling in the set R_{>0} := {x ∈ R : x > 0}.

Recently, Daskalakis et al. [DGTZ18] gave an efficient algorithm for estimating the mean and covariance matrix of a multivariate Gaussian distribution from truncated samples. We adapt their algorithm to the specific problem described above. Before describing the details, we start with a formal definition of the truncated (univariate) normal distribution.

^3 Without loss of generality, we can assume that ‖W(i, :)‖_2 ≠ 0 for all i ∈ [d]. If W(i, :) is a zero vector, one can easily detect that and figure out the corresponding non-negative bias term.
^4 Another idea is to use the median of the samples to estimate the i-th coordinate of the bias vector. This approach gives the same sample complexity bound as that of our proposed algorithm.

Definition 2.
The univariate normal distribution N(µ, σ²) has probability density function

    N(µ, σ²; x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),  for x ∈ R.    (4)

Given a measurable set S ⊆ R, the S-truncated normal distribution N(µ, σ², S) is defined as

    N(µ, σ², S; x) = N(µ, σ²; x) / ∫_S N(µ, σ²; y) dy   if x ∈ S,   and   0   if x ∉ S.    (5)

We are now ready to describe the algorithm of [DGTZ18] applied to our problem. The pseudocode is given in Algorithm 2. The algorithm is essentially maximum likelihood via projected stochastic gradient descent (SGD). Given a sample x ∼ N(µ*, σ*², S), let ℓ(µ, σ; x) be the negative log-likelihood that x is from N(µ, σ², S); then ℓ(µ, σ; x) is a convex function with respect to the reparameterization v = [1/σ², µ/σ²] ∈ R². We use ℓ(v; x) to denote the negative log-likelihood after this reparameterization. Let ℓ̄(v) = E_x[ℓ(v; x)] be the expected negative log-likelihood. Although it is intractable to compute ℓ̄(v), its gradient ∇ℓ̄(v) with respect to v has a simple unbiased estimator. Specifically, define a random vector g ∈ R² as

    g = −[−x²/2, x]^T + [−z²/2, z]^T, where x ∼ N(µ*, σ*², S), z ∼ N(µ, σ², S).    (6)

We have that ∇ℓ̄(v) = E_{x,z}[g], i.e., g is an unbiased estimator of ∇ℓ̄(v). Eq.
(6) indicates that one can maximize the log-likelihood via SGD; however, in order to perform this optimization efficiently, we need three extra steps.

First, the convergence rate of SGD depends on the expected gradient norm E[‖g‖_2²] (Theorem 14.11 of [SSBD14]). In order to maintain a small gradient norm, we transform the given samples to a new space (so that the empirical mean and variance are well-controlled) and perform the optimization in that space. After the optimization is done, the solution is transformed back to the original space. Specifically, given samples x_1, ..., x_n ∼ N(µ*, σ*², R_{>0}), we transform them as

    x_i → (x_i − µ̂_0)/σ̂_0, where µ̂_0 = (1/n) Σ_{i=1}^n x_i,  σ̂_0² = (1/n) Σ_{i=1}^n (x_i − µ̂_0)².    (7)

In the transformed space, the problem becomes estimating the parameters of a normal distribution with samples truncated to the set R_{>−µ̂_0/σ̂_0} = {x ∈ R : x > −µ̂_0/σ̂_0}.

Second, we need to control the strong convexity of the objective function. This is done by projecting the parameters onto a domain where the strong convexity is bounded. The domain D_r is parameterized by r > 0 and is defined as

    D_r = {v ∈ R² : 1/r ≤ v(1) ≤ r, |v(2)| ≤ r}.    (8)

According to [DGTZ18, Section 3.4], r = O(ln(1/α)/α²) is a hyper-parameter that depends only on α = ∫_S N(µ*, σ*²; y) dy (i.e., the probability mass of the original truncation set S). In our setting, we have α ≥ 1/2. This is because the original truncation set is R_{>0} and µ* = b(1) ≥ 0. A large value of r would lead to a small strong-convexity parameter.
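The first two steps above can be sketched in a few lines of numpy; this is only an illustration of Eqs. (7) and (8) under our own naming, not the paper's released implementation:

```python
import numpy as np

def standardize(samples):
    """Step 1 (Eq. (7)): shift and rescale the truncated samples; returns the
    transformed samples plus (mu0, sigma0) so a solution can be mapped back."""
    mu0 = samples.mean()
    sigma0 = samples.std()             # (1/n)-normalized, matching Eq. (7)
    return (samples - mu0) / sigma0, mu0, sigma0

def project_to_domain(v, r):
    """Step 2 (Eq. (8)): Euclidean projection of v = [1/sigma^2, mu/sigma^2]
    onto the box D_r, which reduces to coordinate-wise clipping."""
    v = v.copy()
    v[0] = np.clip(v[0], 1.0 / r, r)   # 1/r <= v(1) <= r
    v[1] = np.clip(v[1], -r, r)        # |v(2)| <= r
    return v
```

Since D_r is an axis-aligned box, the Euclidean projection decouples across coordinates, which is why clipping suffices.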
In our experiments, we set r = 3.

Third, a single run of the projected SGD algorithm only guarantees a constant probability of success. To amplify the probability of success to 1 − δ/d, a standard procedure is to repeat the algorithm O(ln(d/δ)) times. This procedure is illustrated in Steps 2-5 of Algorithm 2.

Algorithm 2: NormBiasEst
Input: Samples from N(µ, σ², R_{>0}).
Output: µ̂ ∈ R, σ̂² ∈ R.
1 Shift and rescale the samples using (7);
2 Split the samples into B = O(ln(d/δ)) batches;
3 For each batch i ∈ [B], run ProjSGD to get v_i ∈ R²;
4 S ← {v_1, ..., v_B};
5 v̂ ← argmin_{v_i ∈ S} Σ_{j ∈ [B]} ‖v_i − v_j‖_2;
6 Transform v̂ back to the original space;
7 µ̂ ← v̂(2)/v̂(1), σ̂² ← 1/v̂(1);

Algorithm 3: ProjSGD
Input: T = Õ(ln(d/δ)/ε²), λ > 0.
Output: v ∈ R².
1 Initialize v^(0) = [1, 0] ∈ R²;
2 for t ← 1 to T do
3     g^(t) ← Estimate the gradient using (6);
4     v^(t) ← v^(t−1) − g^(t)/(λ · t);
5     v^(t) ← Project v^(t) onto the domain in (8);
6 end
7 v ← Σ_{t=1}^{T} v^(t)/T;

Lemma 1.
For any ε ∈ (0, 1) and δ ∈ (0, 1), Algorithm 1 takes n = Õ((1/ε²) ln(d/δ)) samples from D(W*, b*) (for some non-negative b*) and outputs b̂(i) and Σ̂(i, i) for all i ∈ [d] that satisfy

    |b̂(i) − b*(i)| ≤ ε‖W*(i, :)‖_2,   (1 − ε)‖W*(i, :)‖_2² ≤ Σ̂(i, i) ≤ (1 + ε)‖W*(i, :)‖_2²,    (9)

with probability at least 1 − δ.

4.3 Estimate θ_ij

To estimate the angle between any two vectors W*(i, :) and W*(j, :) (where i ≠ j ∈ [d]), we will use the following result.

Fact 1 (Lemma 6.7 in [WS11]). Let z ∼ N(0, I_k) be a standard Gaussian random variable in R^k. For any two non-zero vectors u, v ∈ R^k, the following holds:

    P_{z∼N(0,I_k)}[u^T z > 0 and v^T z > 0] = (π − θ)/(2π), where θ = arccos(⟨u, v⟩/(‖u‖_2‖v‖_2)).    (10)

Fact 1 says that the angle between any two vectors can be estimated from the signs of their inner products with a Gaussian random vector. Let x ∼ D(W*, b*); since b* is assumed to be non-negative, Fact 1 gives an unbiased estimator for the pairwise angles.

Lemma 2. Suppose that x ∼ D(W*, b*) and that b* ∈ R^d is non-negative. For all i ≠ j ∈ [d],

    P_{x∼D(W*,b*)}[x(i) > b*(i) and x(j) > b*(j)] = (π − θ*_ij)/(2π),    (11)

where θ*_ij is the angle between vectors W*(i, :) and W*(j, :).

Proof.
Since x(i) = ReLU(W*(i, :)^T z + b*(i)) and b* is non-negative, we have

    LHS = P_{z∼N(0,I_k)}[W*(i, :)^T z > 0 and W*(j, :)^T z > 0] = (π − θ*_ij)/(2π) = RHS,    (12)

where the second equality follows from Fact 1.

Lemma 2 gives an unbiased estimator of θ*_ij; however, it requires knowing the true bias vector b*. In the previous section, we gave an algorithm that can estimate b*(i) within an additive error of ε‖W*(i, :)‖_2 for all i ∈ [d]. Fortunately, this is good enough for estimating θ*_ij within an additive error of ε, as indicated by the following lemma.

Lemma 3. Let x ∼ D(W*, b*), where b* is non-negative. Suppose that b̂ ∈ R^d is non-negative and satisfies |b̂(i) − b*(i)| ≤ ε‖W*(i, :)‖_2 for all i ∈ [d] and some ε > 0. Then for all i ≠ j ∈ [d],

    |P_x[x(i) > b̂(i) and x(j) > b̂(j)] − P_x[x(i) > b*(i) and x(j) > b*(j)]| ≤ ε.    (13)

Let 1(·) be the indicator function, e.g., 1(x > 0) = 1 if x > 0 and 0 otherwise. Given samples {x_m}_{m=1}^n of D(W*, b*) and an estimated bias vector b̂, Lemmas 2 and 3 imply that θ*_ij can be estimated as

    θ̂_ij = π − (2π/n) Σ_{m=1}^{n} 1(x_m(i) > b̂(i) and x_m(j) > b̂(j)).    (14)

The following lemma shows that the estimated θ̂_ij is close to the true θ*_ij.

Lemma 4.
For a fixed pair of i ≠ j ∈ [d] and any ε, δ ∈ (0, 1), suppose b̂ satisfies the condition in Lemma 3. Given 80 ln(2/δ)/ε² samples, with probability at least 1 − δ, |cos(θ̂_ij) − cos(θ*_ij)| ≤ ε.

4.4 Estimate W W^T and b

Our overall algorithm is given in Algorithm 1. In the first for-loop, we estimate the row norms of W* and the bias vector b*. In the second for-loop, we estimate the angles between any two row vectors of W*.

Theorem 1. For any ε ∈ (0, 1) and δ ∈ (0, 1), Algorithm 1 takes n = Õ((1/ε²) ln(d/δ)) samples from D(W*, b*) (for some non-negative b*) and outputs Σ̂ ∈ R^{d×d} and b̂ ∈ R^d that satisfy

    ‖Σ̂ − W*W*^T‖_F ≤ ε‖W*‖_F²,   ‖b̂ − b*‖_2 ≤ ε‖W*‖_F,    (15)

with probability at least 1 − δ. Algorithm 1 runs in time Õ((d²/ε²) ln(d/δ)) and uses space Õ((d/ε²) ln(d/δ) + d²).

Theorem 1 characterizes the sample complexity needed to achieve a small parameter estimation error. We are also interested in the distance between the estimated distribution and the true distribution. Let TV(A, B) be the total variation (TV) distance between two distributions A and B. Note that in order for the TV distance to be meaningful^5, we restrict ourselves to the non-degenerate case, i.e., when W is a full-rank square matrix. The following corollary characterizes the number of samples used by our algorithm in order to achieve a small TV distance.

Corollary 1. Suppose that W* ∈ R^{d×d} is full-rank. Let κ be the condition number of W*W*^T.
For any ε ∈ (0, 1/2] and δ ∈ (0, 1), Algorithm 1 takes n = Õ((κ²d²/ε²) ln(d/δ)) samples from D(W*, b*) (for some non-negative b*) and outputs a distribution D(Σ̂^{1/2}, b̂) that satisfies

    TV(D(Σ̂^{1/2}, b̂), D(W*, b*)) ≤ ε,    (16)

with probability at least 1 − δ. Algorithm 1 runs in time Õ((κ²d⁴/ε²) ln(d/δ)) and uses space Õ((κ²d³/ε²) ln(d/δ)).

5 Lower Bounds

In the previous section, we gave an algorithm to estimate W*W*^T and b* using i.i.d. samples from D(W*, b*), and analyzed its sample complexity. In this section, we provide lower bounds for this density estimation problem. More precisely, we want to know: how many samples are necessary if we want to learn D(W*, b*) up to some error measure ε?

Before stating our lower bounds, we first formally define a framework for distribution learning^6. Let S be a class of distributions, and let d(·, ·) be some distance function between two distributions (or between the parameters of two distributions). We say that a distribution learning algorithm learns S with sample complexity m(ε) if, for any distribution p ∈ S, given m(ε) i.i.d. samples from p, it constructs a distribution q such that d(p, q) ≤ ε with success probability at least 2/3^7.

^5 The TV distance between two different degenerate distributions can be a constant. As an example, let N(0, Σ_1) and N(0, Σ_2) be two Gaussian distributions in R^d.
If both Σ_1, Σ_2 have rank smaller than d, then TV(N(0, Σ_1), N(0, Σ_2)) = 1 as long as Σ_1 ≠ Σ_2.
^6 This can be viewed as the standard PAC-learning framework [Val84].
^7 We focus on constant success probability here, as standard techniques can be used to boost the success probability to 1 − δ with an extra multiplicative factor of ln(1/δ) in the sample complexity.

We have analyzed the performance of Algorithm 1 in terms of two distance metrics: the distance in the parameter space (Theorem 1) and the TV distance between two distributions (Corollary 1). Accordingly, we provide two sample complexity lower bounds.

Theorem 2 (Lower bound for parameter estimation). Let σ > 0 be a fixed and known scalar. Let I_d be the identity matrix in R^d. Let S := {D(W, b) : W = σI_d, b ∈ R^d non-negative} be a class of distributions in R^d. Any algorithm that learns S to satisfy ‖b̂ − b*‖_2 ≤ ε‖W*‖_F with success probability at least 2/3 requires Ω(1/ε²) samples.

Theorem 3 (Lower bound for distribution estimation). Let S := {D(W, 0) : W ∈ R^{d×d} full rank} be a set of distributions in R^d. Any algorithm that learns S within total variation distance ε with success probability at least 2/3 requires Ω(d/ε²) samples.

Comparing the sample complexity achieved by our algorithm (Theorem 1 and Corollary 1) with the above lower bounds, we can see that 1) our algorithm matches the lower bound (up to log factors) for parameter estimation; 2) there is a gap between our sample complexity and the lower bound for TV distance. There are two possible reasons why this gap shows up.

• The lower bound given in Theorem 3 may be loose.
In fact, since learning a d-dimensional Gaussian distribution up to TV distance ε requires Θ̃(d²/ε²) samples (this is both sufficient and necessary [ABDH+18]), it is reasonable to guess that learning rectified Gaussian distributions also requires at least Ω(d²/ε²) samples. It is thus interesting to see whether one can show a better lower bound than Ω(d/ε²).

• Our sample complexity for learning D(W, b) up to TV distance ε also depends on the condition number κ of WW^T. Intuitively, this κ dependence shows up because our algorithm estimates WW^T entry by entry instead of estimating the matrix as a whole. Besides, our algorithm is a proper learning algorithm, meaning that the output distribution belongs to the family D(W, b). By contrast, the lower bound proved in Theorem 3 holds for any non-proper learning algorithm, i.e., with no constraint on the output distribution. One interesting direction for future research is to see whether one can remove this κ dependence.

6 Experiments

In this section, we provide empirical results to verify the correctness of our algorithm as well as the analysis. Code to reproduce our results[8] can be found at https://github.com/wushanshan/densityEstimation.

We evaluate three performance metrics, as shown in Figure 1. The first two metrics measure the error between the estimated parameters and the ground truth; the third is the TV distance analyzed in Corollary 1, TV(D(Σ̂^{1/2}, b̂), D(W*, b*)). It is difficult to compute the TV distance exactly, so we instead compute an upper bound of it.
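Concretely, the upper bound we report is computed from the closed-form KL divergence between two full-rank Gaussians via Pinsker's inequality, TV ≤ √(KL/2). The sketch below is a minimal illustrative implementation of that computation (it is not the paper's released code; the function names and NumPy usage are our own):

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """Closed-form KL(N(mu0, S0) || N(mu1, S1)) for full-rank covariances:
    0.5 * [tr(S1^{-1} S0) + (mu1-mu0)^T S1^{-1} (mu1-mu0) - d + ln(det S1 / det S0)].
    """
    d = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    # slogdet avoids overflow/underflow of the determinants in higher dimensions.
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + logdet1 - logdet0)

def tv_upper_bound(mu0, S0, mu1, S1):
    """Pinsker's inequality: TV(P, Q) <= sqrt(KL(P || Q) / 2)."""
    return np.sqrt(gaussian_kl(mu0, S0, mu1, S1) / 2.0)
```

For identical parameters the KL divergence (and hence the bound) is zero, and it grows as the estimated mean and covariance move away from the ground truth.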
The parameter estimation errors are the quantities analyzed in Theorem 1, namely ‖Σ̂ − W*W*^T‖_F / ‖W*‖²_F and ‖b̂ − b*‖₂ / ‖W*‖_F. For the upper bound on the TV distance, let KL(A‖B) denote the KL divergence between two distributions, and let Σ* = W*W*^T. Assuming that both Σ* and Σ̂ are full rank, we have

    TV(D(Σ̂^{1/2}, b̂), D(W*, b*)) ≤ TV(N(b̂, Σ̂), N(b*, Σ*)) ≤ √( KL(N(b̂, Σ̂) ‖ N(b*, Σ*)) / 2 ).

The first inequality follows from the data-processing inequality given in Lemma 7 of Appendix F (see also [ABDH+18, Fact A.5]): for any function f and random variables X, Y over the same space, TV(f(X), f(Y)) ≤ TV(X, Y). The second inequality follows from Pinsker's inequality [Tsy09, Lemma 2.5].

Sample Efficiency. The left plot of Figure 1 shows that both the parameter estimation errors and the KL divergence decrease when we have more samples. Our experimental setting is simple: we set the dimension to d = k = 5 and the condition number to 1; we generate W* as a random orthonormal matrix; and we generate b* as a random normal vector followed by a ReLU operation (to ensure non-negativity). This plot indicates that our algorithm is able to accurately estimate the true parameters and obtain a distribution that is close to the true distribution in TV distance.

[8] The hyper-parameters are B = 1 (in Algorithm 2), and r = 3 and λ = 0.1 (in Algorithm 3).

Figure 1: Best viewed in color. Empirical performance of our algorithm with respect to three parameters: the number of samples n, the dimension d, and the condition number κ. Left: Fix d = 5 and κ = 1. Middle: Fix n = 5 × 10⁵ and κ = 1.
Right: Fix n = 5 × 10⁵ and d = 5. Every point shows the mean and standard deviation across 10 runs. Each run corresponds to a different W* and b*.

Dependence on Dimension. In the middle plot of Figure 1, we use 5 × 10⁵ samples and keep the condition number at 1. We then increase the dimension (d = k) from 5 to 25. Both W* and b* are generated in the same manner as in the previous plot. As shown in the middle plot, the parameter estimation errors maintain the same value while the KL divergence increases as the dimension increases. This is consistent with our analysis, because the sample complexity in Theorem 1 is dimension-free (ignoring log factors) while the sample complexity in Corollary 1 depends on d².

Dependence on Condition Number. In the right plot of Figure 1, we keep the dimension d = k = 5 and the number of samples 5 × 10⁵ fixed. We then increase the condition number κ of W*W*^T. This plot shows the same trend as the middle plot, i.e., the parameter estimation errors remain the same while the KL divergence increases as κ increases, which is again consistent with our analysis: the number of samples required to achieve an additive estimation error (Theorem 1) does not depend on κ, while the sample complexity needed to guarantee a small TV distance (Corollary 1) depends on κ².

7 Conclusion

A popular generative model nowadays is defined by passing a standard Gaussian random variable through a neural network. In this paper we are interested in the following fundamental question: given samples from this distribution, is it possible to recover the parameters of the underlying neural network? We designed a new algorithm to provably recover the parameters of a single-layer ReLU generative model from i.i.d. samples, under the assumption that the bias vector is non-negative.
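The generative model in question can be written in a few lines. The sketch below mirrors the experimental setup of Section 6 (random orthonormal W*, rectified Gaussian b*, x = ReLU(W*z + b*)); the seed and sizes are arbitrary, and this is an illustrative sketch rather than the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
d = k = 5                        # dimensions used in the experiments

# W*: a random orthonormal matrix, obtained via QR of a Gaussian matrix.
W, _ = np.linalg.qr(rng.standard_normal((d, k)))
# b*: a random normal vector passed through a ReLU, so it is non-negative.
b = np.maximum(rng.standard_normal(d), 0.0)

def sample(n):
    """n i.i.d. samples from D(W, b): x = ReLU(W z + b), with z ~ N(0, I_k)."""
    z = rng.standard_normal((n, k))
    return np.maximum(z @ W.T + b, 0.0)

x = sample(1000)   # every coordinate is non-negative by construction
```

Since W is orthonormal, WW^T has condition number 1, matching the left plot's setting; scaling the singular values of W would vary κ as in the right plot.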
We analyzed the sample complexity of the proposed algorithm in terms of two error metrics: parameter estimation error and total variation distance. Sample complexity lower bounds and experimental results are provided to support our analysis.

There are many questions that one could ask here. For example, what happens if the bias vector has negative values? What if the generative model has two layers? What if the samples are noisy? We summarized our thoughts on some of these problems in Appendix H. In particular, we showed an interesting connection between learning a two-layer generative model and non-negative matrix factorization. While our focus here is parameter recovery, one interesting direction for future work is to see whether one can directly estimate the distribution in some distance without first estimating the parameters. Another interesting direction is to develop provable learning algorithms for the agnostic setting instead of the realizable setting. Besides designing new algorithms, analyzing existing algorithms, e.g., GANs, VAEs, and reversible generative models, is also an important research direction.

8 Acknowledgements

This research has been supported by NSF Grants 1302435, 1564000, and 1618689, DMS 1723052, CCF 1763702, and AF 1901292, and by research gifts from Google, Western Digital, and NVIDIA.

References

[ABDH+18] Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes. In Advances in Neural Information Processing Systems, pages 3412–3421, 2018.

[ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[AGKM12] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra.
Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 145–162. ACM, 2012.

[CS09] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.

[DGTZ18] Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Efficient statistics, in high dimensions, from truncated samples. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 639–649. IEEE, 2018.

[DMR18] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional Gaussians. arXiv preprint arXiv:1810.08693, 2018.

[DS04] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems, pages 1141–1148, 2004.

[Duc19] John Duchi. Lecture notes for Statistics 311 / Electrical Engineering 377. https://stanford.edu/class/stats311/lecture-notes.pdf, March 13, 2019.

[GCB+18] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

[GKLW19] Rong Ge, Rohith Kuditipudi, Zhize Li, and Xiang Wang. Learning two-layer neural networks with symmetric inputs. In International Conference on Learning Representations, 2019.

[GKM18] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. In International Conference on Machine Learning, 2018.

[GMOV19] Weihao Gao, Ashok Makkuva, Sewoong Oh, and Pramod Viswanath. Learning one-hidden-layer neural networks under general input distributions.
In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1950–1959, 2019.

[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[HK07] Markus Harva and Ata Kabán. Variational learning for rectified factor analysis. Signal Processing, 87(3):509–527, 2007.

[KW13] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[LY17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.

[MR19] Arya Mazumdar and Ankit Singh Rawat. Learning and recovery in the ReLU model. In Proceedings of the 57th Annual Allerton Conference on Communication, Control, and Computing, 2019.

[NWH18] Thanh V. Nguyen, Raymond K. W. Wong, and Chinmay Hegde. Autoencoders learn generative linear models. arXiv preprint arXiv:1806.00572, 2018.

[OKK16] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[RMC15] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[Sol17] Mahdi Soltanolkotabi. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pages 2007–2017, 2017.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[Tsy09] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

[Val84] Leslie G. Valiant. A theory of the learnable.
In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 436–445. ACM, 1984.

[Vav09] Stephen A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2009.

[Wai19] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.

[WS11] David P. Williamson and David B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, 2011.

[ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning, pages 4140–4149, 2017.