Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Summary of paper: This is a technical paper on
supervised sparse coding. It defines a generalized form of the Lasso problem
that relates some input variables x (e.g. the input image) to some hidden
variables y (e.g. the underlying, analyzed or denoised form of the image)
in least squares via two arbitrary weight matrices -- ||M1*x - M2*y||^2 --
adding an L1 penalty on a linear mapping Omega*y and an optional L2
penalty on y itself. This formulation is general enough to cover both
analysis and synthesis (generative) problems. The paper solves this system
for y(x) via a proximal iteration on auxiliary variables z = Omega*y --
an ADMM-style method -- and it also offers a fast approximation of y(x) via
a network that contains an unrolled, truncated ADMM with re-learned
parameters along the lines of Gregor & LeCun [11]. Finally, it
proposes supervised parameter (mapping matrix) learning using stochastic
gradient descent over an arbitrary problem-specific loss function on y(x),
and to this end it derives explicit formulae for cost gradients in
terms of sign(Omega*y) and the matrices and auxiliary vectors involved.
The method is illustrated on an image super-resolution problem and a
polyphonic music transcription one. It seems to give quite good results on
both, although I found this aspect difficult to judge and would bow to
more expert opinions on this.
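For concreteness, my own reconstruction in symbols of the formulation summarized above (constants and weightings are only indicative):
\[
y(x) \;=\; \arg\min_{y}\; \tfrac{1}{2}\|M_1 x - M_2 y\|_2^2 \;+\; \lambda\|\Omega y\|_1 \;+\; \tfrac{\mu}{2}\|y\|_2^2 ,
\]
which ADMM handles by splitting on an auxiliary variable \(z = \Omega y\),
\[
\min_{y,z}\; \tfrac{1}{2}\|M_1 x - M_2 y\|_2^2 + \tfrac{\mu}{2}\|y\|_2^2 + \lambda\|z\|_1 \quad \text{s.t.}\quad z = \Omega y ,
\]
alternating a linear solve in y, a soft-thresholding (proximal) step in z, and a dual update.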
Review: This is a reasonable but
somewhat pedestrian technical paper. AFAIK the precise model studied is
new, and I agree that it is useful to have very general models of this
kind as references and tutorial examples, especially as there are now so
many variant models in L1 learning that it is difficult to keep track of
all of them. However the techniques used -- proximal projection, ADMM,
Gregor-LeCun approximation network, SGD, L2 smoothing or limiting
arguments for the local gradient/subgradient estimates -- are all
straightforward and fairly standard for recent methods, and the
derivations themselves contain no surprises. Also, for many of the most
interesting special cases, more specialized algorithms are likely to be
preferred: the iterations for a model this general are necessarily rather
expensive (linear systems whose size equals the number of image pixels must be
solved, at least in principle) and, depending on the exact M1, M2, Omega,
potentially also very ill-conditioned. Still, on balance it will be useful
to have this study recorded, so I am favourable to acceptance if space
permits.
Further points:
- In the first equation of
Proposition 1, "A" should presumably be "P_{Lambda^c}". The derivation of
the gradient could perhaps be relegated to an appendix. Fig. 1 defines
different A, U, B matrices -- consider changing the notation there to avoid
confusion.
- More information on the effect of K in SI-NN, on the
(especially ambiguous?) cases where SI-NN fails but full SI-ADNN doesn't,
and on the degree to which the specific loss influences the learned
encoders would be useful.
- Just writing "matrix inverse" for Q, B,
etc. is not very satisfying. These very large systems of equations are
presumably solved by conjugate gradients or something similar, which will
fail if the matrices become too ill-conditioned. Comment on the methods
actually used (see the sketch below).
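For example, the ADMM y-update could be handled matrix-free along the following lines (my own sketch, under assumptions about the splitting and the weighting of the quadratic terms; not the authors' code):

from scipy.sparse.linalg import LinearOperator, cg

def admm_y_update(M1, M2, Omega, x, z, u, mu=1e-3, rho=1.0, maxiter=200):
    # Solve (M2^T M2 + mu*I + rho*Omega^T Omega) y = M2^T M1 x + rho*Omega^T (z - u)
    # matrix-free, without ever forming the normal-equations matrix explicitly.
    n = Omega.shape[1]
    def apply_normal_matrix(y):
        return M2.T @ (M2 @ y) + mu * y + rho * (Omega.T @ (Omega @ y))
    A_op = LinearOperator((n, n), matvec=apply_normal_matrix, dtype=float)
    rhs = M2.T @ (M1 @ x) + rho * (Omega.T @ (z - u))
    y, info = cg(A_op, rhs, maxiter=maxiter)   # info != 0 signals non-convergence
    return y, info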
Q2: Please summarize your review in 1-2
sentences
A competent and useful but ultimately fairly
straightforward (given the current state of technology) generalization of the
Lasso to cover both analysis and synthesis problems. There are no major
surprises or simplifications, but the paper is still worthwhile and
probably worth accepting if space permits.
Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes to learn a general
analysis/synthesis sparse model for inverse problems.
In Section 2, the proposed formulation (1) is very general, even if it
cannot be considered a new model. However, the purpose of the paper is to
learn the parameters of this general model. A short discussion shows how
more specific models can be recovered from (1). Some parts of this
discussion are, from my point of view, out of the scope of the paper:
they do not add anything on the main subject (learning the parameters)
and are well known to most potentially interested readers. I am mainly
thinking of the discussion on shift-invariant analysis operators.
Section
3 is the heart of the paper. It is well written, especially the derivation
of the gradient.
Section 4 discusses how to speed up the practical
implementation. A lot is written only to finally conclude that the ADMM
algorithm is stopped after K iterations, while the paragraph on the
neural networks is hard to follow and deserves more explanation.
Section 5 presents the experimental results. This section is a bit
disappointing for the following reasons:
- In the first experiment, the authors compare their general model to an
interpolation model and two analysis models. I expected at least one
synthesis model with, for example, an undecimated wavelet dictionary.
- As there exist algorithms to learn analysis and synthesis priors, I
expected some comparison with this state of the art. Indeed, this is the
only way to know whether learning such a general model is worth the effort!
- The conclusions of the second experiment do not reflect Fig. 2. Indeed,
Fig. 2 tells us that the nonnegative synthesis model behaves very well.
Hence, I am not sure that one must spend time learning model (1).
Q2: Please
summarize your review in 1-2 sentences
A well-written paper studying a supervised method for
learning a general synthesis/analysis model. The experiments and the
discussion of the practical results are very disappointing compared to the
rest of the paper.
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Paper summary. The authors propose a supervised
learning procedure for a general sparse representation model. In their
model, when given an observation, a prediction is made by solving a
strictly convex optimization problem parametrized by three linear
operators. Their learning procedure performs empirical risk
minimization (ERM); to this end the gradients with respect to the
linear operators are derived. An approximate model is given in which a
particular solution method (ADMM) is unrolled for only a few
iterations and gradient-based ERM is again possible. The two models
are evaluated on two tasks, image super-resolution and music
transcription, and shown to perform well.
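Schematically (my own notation, not the paper's, following the summary above), the learning problem has the bilevel form
\[
\min_{M_1,\,M_2,\,\Omega}\;\; \frac{1}{N}\sum_{i=1}^{N} \ell\big(y(x_i),\, t_i\big),
\qquad
y(x) \;=\; \arg\min_{y}\; \|M_1 x - M_2 y\|_2^2 + \lambda\|\Omega y\|_1 + \mu\|y\|_2^2 ,
\]
with \(\ell\) a differentiable, task-specific loss and \(t_i\) the supervision targets.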
Review. This
is a very well written paper with an interesting if somewhat incremental
contribution.
The model (1) is interesting in the sense that
it is very general; the authors propose a gradient-based learning
procedure to perform empirical risk minimization for differentiable
loss functions. This indeed seems novel and is technically sound. The
derivation in Section 3 is very clear, but somewhat tedious and
technically straightforward.
The approximation in Section 4 is
again interesting but a straightforward extension of [11]. An
interesting generalization of the model shown in Figure 1 would be to
use different parameters (A,B) for each layer. Learning would still be
feasible but the model would have larger capacity for the same
computational expense.
The experiments are carefully done and
demonstrate the usefulness of the authors' contributions on two
relevant tasks.
Overall a nice paper that is relevant to the NIPS
community.
Q2: Please summarize your review in 1-2
sentences
A very well-written paper with clear contribution and
good experiments.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers and the chair for the time and
effort dedicated to reviewing our work and for such constructive feedback.
We are happy with their support of this work.
All their
suggestions will be carefully addressed in the camera-ready/revision of
the paper. Below we address the most important comments, answering each
reviewer separately.
Comments for Assigned_Reviewer_6:
We
want to clarify that the proposed algorithm for learning the model
parameters is not restricted to work with a particular pursuit algorithm.
One could choose problem-specific solvers for different variants of the
model (1). We used ADMM to motivate the architecture of the fast encoders
presented in Section 4. These encoders are significantly faster than the
pursuit algorithms (for solving problems restricted to data following the
same distribution as the training samples). We will clarify this in the
text.
We thank the reviewer for pointing out these typos. We will
correct the text accordingly.
We will include a discussion on the
effect of the number of layers, K. In terms of the quality of the
approximation, around 10 layers (a number similar to that used in deep
learning works) has proven to be enough to obtain a good representation,
as depicted in Figure 2. Also, the improvement obtained by adding layers
starts to saturate around that value. It is worth mentioning that in
real-time applications, the selection of K would probably also be
constrained by the available computational budget or the required
input-output latency. Contrary to standard optimization, this possibility
of adapting K to the computational resources is a key advantage of this
type of approach.
Comments for Assigned_Reviewer_7:
In the fast implementation
discussed in Section 4, we truncate the iterations of the ADMM algorithm
only to motivate the architecture of the neural network. After learning
the parameters, the neural network has a computational load identical to
the truncated algorithm, but its performance is significantly superior
(similar to the one obtained by running the ADMM algorithm for an order
of magnitude more iterations).
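For illustration only (a minimal sketch with hypothetical parameter names; the actual layers in the paper follow the learned ADMM updates and not necessarily this exact form), the unrolled encoder has the following structure, where each layer costs the same as one truncated iteration:

import numpy as np

def soft_threshold(v, theta):
    # element-wise shrinkage used as the network's non-linearity
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def unrolled_encoder(x, Qx, Qz, W, theta, K=10):
    # Qx, Qz, W, theta are learned parameters (hypothetical names).
    p = W.shape[0]
    z = np.zeros(p)
    u = np.zeros(p)
    y = Qx @ x                              # initial estimate
    for _ in range(K):
        y = Qx @ x + Qz @ (z - u)           # learned linear map replacing the exact y-update
        Wy = W @ y
        z = soft_threshold(Wy + u, theta)   # proximal / shrinkage step
        u = u + Wy - z                      # dual-variable update
    return y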
We dedicated considerable space to the
convolutional setting, since we believe that presenting a principled
method for designing and training convolutional networks is an important
contribution of our work. This work provides a new angle and perspective
on convolutional networks.
If space permits, we will include
comparisons against synthesis models, with both analytical and learned
operators, in the super-resolution example.
We want to point out that the
non-negative synthesis model is a particular instance of the model
presented in (1).
Comments for Assigned_Reviewer_8:
We believe that extending the work of [11] to the analysis (and in
particular convolutional) setting is an important contribution of our
work. The idea of extending the model to allow for different parameters
per layer is indeed very interesting and worth exploring.