Paper ID: 1053
Title: Predicting Parameters in Deep Learning
Reviews

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:

"Predicting Parameters in Deep Learning" explores the hypothesis that there is
significant redundancy in the naive (and universally used) representation of
the parameters of neural networks.

Pro:
* Interesting observation and interpretation
* Experiments show promising evidence

Con:
* Empirical results do not support central claims
* Missing comparisons with common weight matrix factoring like PCA preprocessing


Quality:

At a few points in the paper the authors remark on how difficult it is to
train with a parameter matrix W factored as a product UV. Could the authors
offer some intuition for why? One reason might be that if U = V = 0 at the outset,
absolutely naive backprop will not move them. On the other hand if W is
trained normally but simply constrained to be a product of some U and V, some
computational overhead during updates can be traded for lower communication
cost.
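
To make the factored-product idea concrete, here is a minimal numpy sketch
(sizes, initialization scale, and the minibatch are illustrative, not taken
from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out, rank = 784, 500, 50

    # Factored parameterization: W = U @ V, with rank*(n_in + n_out) entries
    # instead of n_in*n_out. Note that initializing U = V = 0 gives zero
    # gradients for both factors, so naive backprop never moves them.
    U = 0.01 * rng.standard_normal((n_in, rank))
    V = 0.01 * rng.standard_normal((rank, n_out))

    x = rng.standard_normal((32, n_in))          # a minibatch of inputs
    h = np.tanh(x @ U @ V)                       # forward pass through the factored layer

    print(n_in * n_out, rank * (n_in + n_out))   # 392000 full entries vs 64200 factored

Only U and V would need to be stored or transmitted, which is where the
communication savings would come from.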

A more fundamental concern is that the paper's central observation that
weights are redundant, together with the use of a spatial exponential kernel,
suggests that similar gains could be had by either doing a PCA of the input or
by downsampling the images and the kernels for convolution. These are both
common practice. PCA pre-processing also represents a low-rank factorization
of the first layer's weights.
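
To spell out that last point, a small numpy sketch (toy data and illustrative
sizes) of why learning on PCA-projected inputs amounts to a rank-limited
first-layer weight matrix:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 784))      # toy data matrix (samples x features)
    k = 100                                   # retained principal components

    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                              # 784 x k PCA projection

    W_small = rng.standard_normal((k, 500))   # weights learned on the projected inputs
    W_equiv = P @ W_small                     # implied weights on the raw inputs

    # By associativity, (X @ P) @ W_small == X @ W_equiv, and W_equiv has rank
    # at most k: PCA preprocessing implicitly fixes one factor of a low-rank layer.
    print(np.linalg.matrix_rank(W_equiv))     # 100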

The technique of maintaining a PCA of intermediate layers is not common practice, but
it might be helpful, and it certainly represents a conceptual starting point for your
work on "data driven expander matrices".

The discussion of columnar architectures and convolutional nets seems
reasonable, but largely hypothetical. Pointing out various ways to
refactor matrices in various architectures without more extensive empirical
follow-through seems like a departure from the core thesis of the paper.

The empirical aspect of this paper is important, and I think it needs more
work. The claim that the authors test is that "there exist models whose
parameters can be predicted", but that is a weak claim. The claim the authors
are aiming for, judging from the tone of their introduction, and which I agree
they *should* be aiming for, is that:
"there exist large models that work better than naively down-sized versions
and using our techniques we do less damage to the large models than simply
down-sizing them". This second point is the central claim of the text of the
paper, but it is not supported by empirical results.

The claim that parallel training can actually be accelerated by the use of
these techniques also requires more justification. The proposed method
introduces a few new sorts of computational overhead. Particular
encoder/decoder algorithms for transmitting and rebuilding weight matrices
would need to be implemented and tested to complete this argument.


Clarity:

"collimation" - mis-used word; it is actually a word, but not about making columns

"the the"

"redundancy" is probably the wrong term to describe the symmetry / equivalence
set around line 115.

Section 2 "Low rank weight matrices" is a short bit of background material, it
should probably be merged into the introduction.

Figure 1 is only referenced in the introduction, but it actually illustrates
the method of Section 3.2 in action, right? It would be clearer if Figure 1
were moved into Section 3.2.

Please number your equations, if only for the sake of your reviewers.


Originality:

The idea of looking at weight matrices in neural networks as a continuous
function has precedent, as does the observation that weights have redundancy,
but the idea that training could be accelerated by communicating a small fraction of
randomly chosen matrix values is new.


Significance:

As the authors point out in their introduction, this work could be very
interesting to researchers trying to parallelize neural network training across
very low-bandwidth channels.


Edit: After reading the authors' rebuttal I have raised my quality score. I apparently did not understand the discussion of columns, and how it related to the experimental work. I hope the authors' defense of their experimental work is incorporated into future revisions.
Q2: Please summarize your review in 1-2 sentences
The paper has some good ideas and gets you thinking, but the empirical results
do not really support the most important and interesting claims. The algorithm
for actually accelerating parallel training is only sketched.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents an approach to reducing the size of deep models in order to improve the training efficiency of deep learning. The main contributions are: (1) the exploitation of prior knowledge about the redundancy of deep models, such as spatial smoothness of the parameters for images; (2) the use of kernel ridge regression to interpolate the parameters from a learned subset.

The paper is clearly written. The work seems to be original.

The authors claim the approach should be very general, i.e., applicable even to non-image tasks. They describe some extensions of the method to handle non-image data, such as autoencoder pretraining. But the experiments include nothing for non-image datasets, so this point is very weak.

All the image experiments are on fairly small datasets with simple patterns. It is hard to believe that any method that works well on data like CIFAR will necessarily work well on more realistic datasets such as ImageNet. Therefore the value of this work is not entirely convincing.

I have read the authors' rebuttal. I don't think I would change my recommendation.
Q2: Please summarize your review in 1-2 sentences
This work is novel, with some quite interesting ideas to reduce the size of deep networks. The value of the work has not been convincingly demonstrated in the experiments.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Motivated by recent attempts to learn very large networks, this work proposes an approach for reducing the number of free parameters in neural-network type architectures. The method is based on the intuition that there is typically strong redundancy in the learned parameters (for instance, the first-layer filters of NNs applied to images are smooth): the authors suggest learning only a subset of the parameter values and then predicting the remaining ones through some form of interpolation. The proposed approach is evaluated for several architectures (MLP, convolutional NN, reconstruction-ICA) and different vision datasets (MNIST, CIFAR, STL-10). The results suggest that in general it is sufficient to learn fewer than 50% of the parameters without any loss in performance (significantly fewer parameters seem sufficient for MNIST).

The paper is clear and easy to follow. The method is relatively simple: The authors assume a low-rank decomposition of the weight matrix and then further fix one of the two matrices using prior knowledge about the data (e.g., in the vision case, exploiting the fact that nearby pixels - and weights - tend to be correlated). This can be interpreted as predicting the "unobserved" parameters from the subset of learned filter weights via kernel ridge regression, where the kernel captures prior knowledge about the topology / "smoothness" of the weights. For the situation when such prior knowledge is not available the authors describe a way to learn a suitable kernel from data.
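
As a rough illustration of that kernel-ridge-regression reading, here is a toy numpy sketch (made-up filter size, learned subset, lengthscale, and ridge parameter; not the paper's actual code):

    import numpy as np

    def se_kernel(A, B, ell=2.0):
        # Squared-exponential kernel between two sets of 2-D pixel coordinates.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * ell ** 2))

    rng = np.random.default_rng(0)
    side = 8
    coords = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)

    # Pretend only 25% of an 8x8 filter's weights are actually learned ...
    idx = rng.choice(len(coords), size=len(coords) // 4, replace=False)
    w_learned = rng.standard_normal(len(idx))    # stand-in for the trained subset

    # ... and predict the remaining ones by kernel ridge regression over pixel positions.
    K_aa = se_kernel(coords[idx], coords[idx])
    K_xa = se_kernel(coords, coords[idx])
    coeffs = np.linalg.solve(K_aa + 1e-3 * np.eye(len(idx)), w_learned)
    w_full = (K_xa @ coeffs).reshape(side, side)  # a smooth 8x8 filter from 16 learned values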

The idea of reducing the number of parameters in NN-like architectures through connectivity constraints in itself is of course not novel, and the authors provide a pretty good discussion of related work in section 5. Their method is very closely related to the idea of factorizing weight matrices as is, for instance, commonly done for 3-way RBMs (e.g. ref [22] in the paper), but also occasionally for standard RBMs (e.g. [R1], missing in the paper). The present paper differs from these in that the authors propose to exploit prior knowledge to constrain one of the matrices. As also discussed by the authors, the approach can further be interpreted as a particular type of pooling -- a strategy commonly employed in convolutional neural networks. Another view of the proposed approach is that the filters are represented as a linear combination of basis functions (in the paper, the particular form of the basis functions is determined by the choice of kernel). Such representations have been explored in various forms and to various ends in the computer vision and signal processing literature (see e.g. [R2,R3,R4,R5]). [R4,R5], for instance, represent filters in terms of a linear combination of basis functions that reduce the computational complexity of the filtering process.

In the experimental section the authors make an effort to demonstrate that the proposed method is practically useful and widely applicable, considering multiple datasets and several different architectures. I think, however, that this section could have been stronger in several ways:

Firstly, it would have been nice if the paper had not focused exclusively on vision applications, and had put more emphasis on scenarios where there is a less obvious topology in the data that can be exploited when predicting weights (which will pose more of a challenge to the method). The data-dependent kernel is not very well evaluated.

Secondly, especially for the vision case, I am wondering why the authors are limiting themselves to the particular set of basis functions derived from the kernel regression view. It would seem natural to explore other linear basis representations of the filters (a simple choice would be a PCA basis, for instance). These might be more efficient in terms of reducing the number of free parameters, and might have other desirable properties (e.g. [R4,R5]).
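
A toy version of that suggestion (purely illustrative: the basis is estimated from random stand-in patches, and only the coefficients would be treated as free parameters):

    import numpy as np

    rng = np.random.default_rng(0)
    patches = rng.standard_normal((5000, 64))   # stand-in for 8x8 image patches

    # Fixed PCA basis of the patches; keep the top-k components as basis filters.
    patches -= patches.mean(axis=0)
    _, _, Vt = np.linalg.svd(patches, full_matrices=False)
    B = Vt[:16].T                               # 64 x 16 fixed basis

    C = rng.standard_normal((16, 100))          # only these coefficients would be learned
    filter_bank = B @ C                         # 100 filters, 16 free parameters each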

Finally, I would have hoped for a truly compelling experimental use case demonstrating the impact of the approach in practice. Since the work is motivated as a way of reducing computational and memory complexity, I think it would have been useful to include a more detailed discussion and empirical demonstration of how the reduction in the number of learned parameters translates into such savings and consequently allows the training of larger networks that achieve better performance than otherwise possible. At the moment, the evaluation appears to be focused on moderately large networks, and the authors show that a moderate reduction in parameters can be achieved that way without losing performance (for MNIST the potential reduction in parameters seems to be very large, but for the more interesting CIFAR / STL-10 datasets it seems that at least 40% of the parameters are required to avoid a loss in accuracy). Why are computational complexity and speed of convergence not considered at all in the evaluation? I think that Fig. 6-right (which plots performance against # of free parameters for a standard network and a reduced parameterization) goes in the right direction, but it seems that there is really only a clear advantage of the reduced parameterization for CIFAR (much less so for STL), and I'm wondering whether the performance difference on CIFAR would disappear even for currently realistic network sizes.

All in all, I think the paper takes an interesting perspective although related approaches have appeared in the literature in various guises (see discussion above). I think that the ideas can have practical impact, and to my knowledge, they are currently not widely used in the NN/deep learning community. A stronger case could probably be made, however, by further exploring alternative implementations of the idea and by having a more compelling demonstration of the effectiveness and versatility of the approach.

Further remarks:

** It would have been really helpful to include state of the art results from the literature for the datasets and specific evaluation regimes considered. This would put the reported performance into perspective and make it easier to judge whether savings in learned parameters can be achieved for competitive architectures.

** I find the evaluation of the data-dependent kernel somewhat limited: it is only explored for a single dataset / architecture combination (the MNIST, MLP experiments in section 4.1). As mentioned above, it would have been nice to include some other type of data that requires the use of the empirical kernel.

For the experiments in section 4.1 it would have further been useful to also consider the case SE-rand. At the moment it's not clear that the learned kernel really outperforms the random approaches. The difference might simply be due to the fact that for the empirical kernel you're using the SE kernel in the first layer. Another interesting control for the effectiveness of the empirical kernel would be to have a setting where you're using the empirical kernel in both layers.

** Do you have any insights as to why the naive low-rank scheme, where both matrices are learned, is working so poorly? Have you made any attempt to remove the redundancy inherent in the parameterization? Would it be possible (and potentially advantageous) to start off with a fixed U (e.g. the empirical kernel), but allow U to change later at some point, possibly in an optimization scheme where U and V are updated in an alternating manner?

** In your conv-net experiments I suppose you are not using any pooling layers as described in [18]? This would also reduce the dimensionality of the representation in the higher layers. Furthermore, based on Fig. 5 I am wondering whether you can really argue that predicting 75% of the parameters has negligible impact on performance: when you're predicting 50% of the parameters the loss seems to be about 5-6 percentage points, which does not seem so negligible to me (I guess it would help to have some error bars here).

** How did you optimize the hyperparameters (learning rate etc.) for your comparisons? Shouldn't the kernel width vary with the size of the subset of learned parameters alpha?

** In section 4.1, for the MLP experiments, how do the columns differ? Do they use the same empirical kernel and differ only in the set of sampled pixels? Is the great advantage of using multiple columns here due to the fact that otherwise you don't sample the space well?

** Fig. 6 right: I find it curious that the reduced parameterization is advantageous for CIFAR but there seems to be very little difference for STL - given the higher resolution of STL wouldn't one expect the opposite?

** For conv-nets where, in the higher-layers, there are many input channels relative to the spatial extent of each filter, would it be possible to use a data-dependent kernel to subsample across channels (if I understand correctly you are currently performing the subsampling only spatially, i.e. the elements of alphas are tied across channels)?

** Is there a way to extend your scheme so as to start off with a small fraction of learned parameters, and then increase the fraction of learned parameters slowly until a desired performance is reached / no further improvement is observed?



References:


[R1] Dahl et al., 2012: Training Restricted Boltzmann Machines on Word Observations

[R2] Roth & Black, IJCV, 2009: Fields of Experts (Section 5.3)

[R3] Freeman & Adelson, 1991: The design and use of steerable filters

[R4] Rubinstein et al., 2010: Double Sparsity: Learning Sparse Dictionaries for
Sparse Signal Approximation

[R5] Rigamonti et al., 2013: Learning Separable Filters
Q2: Please summarize your review in 1-2 sentences
The paper proposes a way of reducing free parameters in NN and I think this can be a useful practical contribution. There is, however, some similarity to existing work, and a stronger case for the approach could probably be made by considering alternative implementations and providing a more compelling evaluation.

Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
This paper proposes a method to reduce the number of parameters of existing deep learning approaches. The proposed idea is independent of the specific choice of model and optimization algorithm. Our experiments are aimed at supporting this claim by showing that the idea works for many architectures (RICA, convnets, MLPs) and for the various optimization algorithms associated with each. The strengths of this idea are two-fold: generality and simplicity.

Regarding generality, previous approaches (those we cite and R1-R5 from reviewer 7) have explored factorizations of the weight matrix. Crucially, the factorization techniques are very much tied to the specific optimization and model being used. R5 uses two stages of optimization where ALL weights are first computed and then approximated in a second step. R2 uses PCA bases and instead of learning the second factor, it uses a random matrix. R4 requires alternating sparse coding. It is not obvious how to scale these approaches to arbitrary architectures. More importantly, these ideas were proposed for shallow models, whereas our idea is directly applicable to any deep model and corresponding optimizer. We have focused on images, but it is clear that for speech the same structure in the filters can be exploited (in fact, in speech the signal is typically transformed to an image as a pre-processing step). It is also clear that our idea will work for other datasets such as ImageNet because the filters reported in those works have similar structure. It is not clear how well our idea will work for text, but we think success with images is important enough to warrant publication. R1 proposes a factorization specifically tailored to RBMs and to address repeating tokens within a window of text. This idea could prove useful. In conclusion, the provided references are for shallow models and don't exhibit the generality of our approach.

A strength of our paper is that the idea is simple, but it is only obvious in retrospect. If it were obvious it would be widely used, as having fewer parameters reduces storage and communication costs as also pointed out in the excellent paper [R4]. We have researched this idea extensively on the web, presented it at an ICML workshop and have run it by the key leaders in the field of deep learning. Beyond the references they provided us with, which we cite, everyone thus far has agreed that this is a new idea.

Regarding the amount of experimentation, we tested with a broad set of models, optimizers and datasets. The papers we cite, e.g. for RICA, only used a subset of our datasets when published at NIPS. Our results were the product of several months of work by 4 experts dedicated to this problem. We believe the amount of experimentation is very reasonable for a conference publication.

We arrived at this idea using nonparametrics (the kernel ridge regressor is the mean of a Gaussian process over filters). However, we agree that the perspective as a linear, possibly sparse, combination of bases (PCA, Fourier, wavelets, etc.) as in e.g. R2-R5 is also appealing. We thank the reviewers for suggesting this modification to our work, and we will provide comparisons between this approach and the kernel-based approach in the final version of the paper. Another important insight we gained from the reviews is that we can make our technique even more efficient by exploiting separability, as in some of the references provided by reviewer 7.

The central idea, which we hope the reviewers focus on, is one of reducing the number of parameters of most existing deep learning models and associated algorithms. This idea is new and useful.

Specific feedback

Reviewer 2

Factoring W and optimizing both factors is difficult because of the ambiguity we mention at the end of section 2. It may be possible to make this work well, for instance using alternating optimization as suggested by Reviewer 7. However, this would reduce the simplicity of our technique, since it involves modifying the training algorithm rather than merely the parameterization of the model.
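
As a small illustration of that ambiguity (toy numpy, not the notation of the paper): any invertible matrix can be slid between the two factors without changing W, so the factors themselves are not identifiable.

    import numpy as np

    rng = np.random.default_rng(0)
    U = rng.standard_normal((6, 3))
    V = rng.standard_normal((3, 4))
    Q = rng.standard_normal((3, 3))        # almost surely invertible

    # (U Q)(Q^{-1} V) yields exactly the same product as U V.
    print(np.allclose(U @ V, (U @ Q) @ (np.linalg.inv(Q) @ V)))   # True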

The discussion of columns is not hypothetical; in fact it is used throughout our experiments. In Figure 4 we show that 5 columns outperform 1 column in MLPs, and although we did not include it, the pattern is similar for other models. The improvement in performance from multiple columns comes from the fact that each column has its dynamic parameters in different locations, giving better coverage of the space.

It has been observed in the past that there is redundancy in the weights, but this redundancy has not been previously exploited in deep learning.

Reviewer 7

Including state of the art results from the literature would detract from the main message of our paper. One of the key strengths of our idea is its generality, which we demonstrate through the use of several different models. It seems clear that models tailored to specific benchmarks will outperform more general approaches on those benchmarks. That said, the performances we report are generally competitive with other modern techniques. For example, our experiments with RICA achieve better results on STL-10 than the original RICA paper [16].

We did experiment with SE-rand in Section 4.1, but found that SE-emp and SE-emp2 perform better. We did not include this permutation due to space constraints.

Network architectures were chosen in various ways: For MLPs we used one of the architectures from [13]. For convnets we tried architectures by hand until we found a good one. For other hyperparameters we used the defaults in pylearn2 for both models. For RICA we used the same architecture as [16] and used cross validation to optimize one parameter at a time.

There are many possibilities for choosing different sampling patterns for the locations of the alphas; we mention some in section 5. Exploring all the permutations here is beyond the scope of this paper.

See also our response to reviewer 2.