NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4425 Learning Distributions Generated by One-Layer ReLU Networks

### Reviewer 1

A popular generative model these days is as follows: pass a standard Gaussian noise through a neural network. But a major unanswered question is what is the structure of the resulting distribution? Given samples from such a distribution, can we learn the distribution parameters? This question is the topic of this paper. Specifically, consider a 1-layer ReLU neural network, which is specified by a matrix W and a real bias b. Given a standard Gaussian as its input, the output of this networks is some complicated distribution. Given samples from this distribution, this paper gives an algorithm that learns the parameters W and b. Namely, they show if the number of samples is large enough, then they can recover W W^T and b approximately (Theorem 1), and they can also learn the target distribution within a small error in total variation distance (Corollary 1). The main assumption is that each component of b must be non-negative. Indeed, in Claim 2 a simple example is given demonstrating that if b has negative components, learning this negative component requires exponential sample complexity. This is the first paper that gives an algorithm with a rigorous guarantee for this important problem, and the paper is well-written. The technique of the paper is also interesting: they first learn the norms of the rows of the matrix W, and then they learn the angles between the rows. The main limitation of the paper is the assumption that each component of b is non-negative. It's really a limiting assumption, and there is no reason to believe that this will happen in practice. The authors have argued that if some component of b is negative, then to estimate this component we need exponential samples. But then maybe estimating the components of b is the wrong objective. In many applications, you just need to output some distribution that's close to the target in the total variation distance. Can we do this even if b has negative components? Other comments for the authors: - line 24: it's not clear what do you mean by "latent code z" - In equation (10) and elsewhere: write P(A) instead of E[1(A)] - Line 243: matrixl -> matrix - There are lower and upper bounds for the total variation distance between Gaussians, look at the arXiv preprint "The total variation distance between high-dimensional Gaussians" == After reading the response: thanks, I have increased my score.

### Reviewer 2

This paper considers learning the distribution generated by one-layer ReLU functions, with input from Gaussian distribution. The high level idea is simple. First step is to estimate the bias term and the norm of each row of the weight vector. Second step is based on a well known fact on the covariance of two ReLU functions. The first step is shown via a maximum likelihood approach from an earlier work (DGTZ18). Overall, the paper is well presented and the proof looks correct to me. One limitation is that the proposed estimation method is somewhat restricted to the specific setting studied here. This makes the connection between this paper to the motivating examples (GANs, VAEs) somewhat weak. Other questions: *) Do similar results extend to more general input distributions? What are the boundaries beyond which estimation becomes hard? It may be worth discussing a little bit. *) In Sec 4.2: estimating the norm of W(i,:) and b(i) boils down to a single dimensional problem. I wonder if there is a simpler way to estimate these quantities.

### Reviewer 3

The paper studies the problem of learning the parameters of a one-layer ReLU network by observing its outputs on a latent standard Gaussian input. This problem is different from the supervised learning problem typically studied for deep networks where we see both the input z to the network and the output y and wish to learn the parameters of the network theta. Here we know the input z is coming from a standard Gaussian and we only see the output y and we wish to learn the parameters of the network (which would translate to learning the rectified Gaussian distribution) The paper presents an exponential sample lower bound for learning if the bias is allowed to be arbitrary. This is because when the bias is allowed to be negative most of the realizations of the latent variable z can map to 0 in the output space due to rectification and hence we will see very few effective samples. Then the authors proceed to restrict the bias to be non-negative and using recent results on learning a Gaussian from truncated samples give an algorithm for learning the network when the weight matrix is well-conditioned. The main algorithmic tool used is projected gradient descent. The estimation proceeds in three steps. First we realize that we can have identifiable recovery only for WW^T. To recover WW^T, it suffices to have the two norms of each row of W and the angles between all pairs of rows. To recover the two norms of each row of W and the bias vector, the algorithm for learning from truncated samples is applied. Then to estimate the angle between rows, the paper uses a simple observation wherein the sign of inner products of two vectors u,v with a random Gaussian vector z is used to infer the angle between u,v. This estimation however carries the noise in the estimation from the previous step and one needs to show that the noise doesnâ€™t accumulate by too much. Overall, the paper is well written and has clarity. I am not sure how novel the algorithm is given that is it heavily based on prior work but the idea of applying it to this setting carries merit. ========== POST AUTHOR-FEEDBACK: Thanks to the authors for their response where they explained the difficulties of extending the results to neural networks with larger number of layers. I will be staying with my current score.